Metadata-Version: 2.1
Name: grobid-client-python-test
Version: 0.0.10
Summary: Simple python client for GROBID REST services
Author-email: Patrice Lopez <patrice.lopez@science-miner.com>
Maintainer-email: Patrice Lopez <patrice.lopez@science-miner.com>, Luca Foppiano <lucanoro@duck.com>
License: 
                                        Apache License
                                  Version 2.0, January 2004
                               http://www.apache.org/licenses/
        
          TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
          1. Definitions.
        
             "License" shall mean the terms and conditions for use, reproduction,
             and distribution as defined by Sections 1 through 9 of this document.
        
             "Licensor" shall mean the copyright owner or entity authorized by
             the copyright owner that is granting the License.
        
             "Legal Entity" shall mean the union of the acting entity and all
             other entities that control, are controlled by, or are under common
             control with that entity. For the purposes of this definition,
             "control" means (i) the power, direct or indirect, to cause the
             direction or management of such entity, whether by contract or
             otherwise, or (ii) ownership of fifty percent (50%) or more of the
             outstanding shares, or (iii) beneficial ownership of such entity.
        
             "You" (or "Your") shall mean an individual or Legal Entity
             exercising permissions granted by this License.
        
             "Source" form shall mean the preferred form for making modifications,
             including but not limited to software source code, documentation
             source, and configuration files.
        
             "Object" form shall mean any form resulting from mechanical
             transformation or translation of a Source form, including but
             not limited to compiled object code, generated documentation,
             and conversions to other media types.
        
             "Work" shall mean the work of authorship, whether in Source or
             Object form, made available under the License, as indicated by a
             copyright notice that is included in or attached to the work
             (an example is provided in the Appendix below).
        
             "Derivative Works" shall mean any work, whether in Source or Object
             form, that is based on (or derived from) the Work and for which the
             editorial revisions, annotations, elaborations, or other modifications
             represent, as a whole, an original work of authorship. For the purposes
             of this License, Derivative Works shall not include works that remain
             separable from, or merely link (or bind by name) to the interfaces of,
             the Work and Derivative Works thereof.
        
             "Contribution" shall mean any work of authorship, including
             the original version of the Work and any modifications or additions
             to that Work or Derivative Works thereof, that is intentionally
             submitted to Licensor for inclusion in the Work by the copyright owner
             or by an individual or Legal Entity authorized to submit on behalf of
             the copyright owner. For the purposes of this definition, "submitted"
             means any form of electronic, verbal, or written communication sent
             to the Licensor or its representatives, including but not limited to
             communication on electronic mailing lists, source code control systems,
             and issue tracking systems that are managed by, or on behalf of, the
             Licensor for the purpose of discussing and improving the Work, but
             excluding communication that is conspicuously marked or otherwise
             designated in writing by the copyright owner as "Not a Contribution."
        
             "Contributor" shall mean Licensor and any individual or Legal Entity
             on behalf of whom a Contribution has been received by Licensor and
             subsequently incorporated within the Work.
        
          2. Grant of Copyright License. Subject to the terms and conditions of
             this License, each Contributor hereby grants to You a perpetual,
             worldwide, non-exclusive, no-charge, royalty-free, irrevocable
             copyright license to reproduce, prepare Derivative Works of,
             publicly display, publicly perform, sublicense, and distribute the
             Work and such Derivative Works in Source or Object form.
        
          3. Grant of Patent License. Subject to the terms and conditions of
             this License, each Contributor hereby grants to You a perpetual,
             worldwide, non-exclusive, no-charge, royalty-free, irrevocable
             (except as stated in this section) patent license to make, have made,
             use, offer to sell, sell, import, and otherwise transfer the Work,
             where such license applies only to those patent claims licensable
             by such Contributor that are necessarily infringed by their
             Contribution(s) alone or by combination of their Contribution(s)
             with the Work to which such Contribution(s) was submitted. If You
             institute patent litigation against any entity (including a
             cross-claim or counterclaim in a lawsuit) alleging that the Work
             or a Contribution incorporated within the Work constitutes direct
             or contributory patent infringement, then any patent licenses
             granted to You under this License for that Work shall terminate
             as of the date such litigation is filed.
        
          4. Redistribution. You may reproduce and distribute copies of the
             Work or Derivative Works thereof in any medium, with or without
             modifications, and in Source or Object form, provided that You
             meet the following conditions:
        
             (a) You must give any other recipients of the Work or
                 Derivative Works a copy of this License; and
        
             (b) You must cause any modified files to carry prominent notices
                 stating that You changed the files; and
        
             (c) You must retain, in the Source form of any Derivative Works
                 that You distribute, all copyright, patent, trademark, and
                 attribution notices from the Source form of the Work,
                 excluding those notices that do not pertain to any part of
                 the Derivative Works; and
        
             (d) If the Work includes a "NOTICE" text file as part of its
                 distribution, then any Derivative Works that You distribute must
                 include a readable copy of the attribution notices contained
                 within such NOTICE file, excluding those notices that do not
                 pertain to any part of the Derivative Works, in at least one
                 of the following places: within a NOTICE text file distributed
                 as part of the Derivative Works; within the Source form or
                 documentation, if provided along with the Derivative Works; or,
                 within a display generated by the Derivative Works, if and
                 wherever such third-party notices normally appear. The contents
                 of the NOTICE file are for informational purposes only and
                 do not modify the License. You may add Your own attribution
                 notices within Derivative Works that You distribute, alongside
                 or as an addendum to the NOTICE text from the Work, provided
                 that such additional attribution notices cannot be construed
                 as modifying the License.
        
             You may add Your own copyright statement to Your modifications and
             may provide additional or different license terms and conditions
             for use, reproduction, or distribution of Your modifications, or
             for any such Derivative Works as a whole, provided Your use,
             reproduction, and distribution of the Work otherwise complies with
             the conditions stated in this License.
        
          5. Submission of Contributions. Unless You explicitly state otherwise,
             any Contribution intentionally submitted for inclusion in the Work
             by You to the Licensor shall be under the terms and conditions of
             this License, without any additional terms or conditions.
             Notwithstanding the above, nothing herein shall supersede or modify
             the terms of any separate license agreement you may have executed
             with Licensor regarding such Contributions.
        
          6. Trademarks. This License does not grant permission to use the trade
             names, trademarks, service marks, or product names of the Licensor,
             except as required for reasonable and customary use in describing the
             origin of the Work and reproducing the content of the NOTICE file.
        
          7. Disclaimer of Warranty. Unless required by applicable law or
             agreed to in writing, Licensor provides the Work (and each
             Contributor provides its Contributions) on an "AS IS" BASIS,
             WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
             implied, including, without limitation, any warranties or conditions
             of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
             PARTICULAR PURPOSE. You are solely responsible for determining the
             appropriateness of using or redistributing the Work and assume any
             risks associated with Your exercise of permissions under this License.
        
          8. Limitation of Liability. In no event and under no legal theory,
             whether in tort (including negligence), contract, or otherwise,
             unless required by applicable law (such as deliberate and grossly
             negligent acts) or agreed to in writing, shall any Contributor be
             liable to You for damages, including any direct, indirect, special,
             incidental, or consequential damages of any character arising as a
             result of this License or out of the use or inability to use the
             Work (including but not limited to damages for loss of goodwill,
             work stoppage, computer failure or malfunction, or any and all
             other commercial damages or losses), even if such Contributor
             has been advised of the possibility of such damages.
        
          9. Accepting Warranty or Additional Liability. While redistributing
             the Work or Derivative Works thereof, You may choose to offer,
             and charge a fee for, acceptance of support, warranty, indemnity,
             or other liability obligations and/or rights consistent with this
             License. However, in accepting such obligations, You may act only
             on Your own behalf and on Your sole responsibility, not on behalf
             of any other Contributor, and only if You agree to indemnify,
             defend, and hold each Contributor harmless for any liability
             incurred by, or claims asserted against, such Contributor by reason
             of your accepting any such warranty or additional liability.
        
          END OF TERMS AND CONDITIONS
        
          APPENDIX: How to apply the Apache License to your work.
        
             To apply the Apache License to your work, attach the following
             boilerplate notice, with the fields enclosed by brackets "[]"
             replaced with your own identifying information. (Don't include
             the brackets!)  The text should be enclosed in the appropriate
             comment syntax for the file format. We also recommend that a
             file or class name and description of purpose be included on the
             same "printed page" as the copyright notice for easier
             identification within third-party archives.
        
          Copyright 2018-2024 The Contributors
        
          Licensed under the Apache License, Version 2.0 (the "License");
          you may not use this file except in compliance with the License.
          You may obtain a copy of the License at
        
              http://www.apache.org/licenses/LICENSE-2.0
        
          Unless required by applicable law or agreed to in writing, software
          distributed under the License is distributed on an "AS IS" BASIS,
          WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          See the License for the specific language governing permissions and
          limitations under the License.
        
        
Project-URL: Homepage, https://github.com/kermitt2/grobid_client_python
Project-URL: Repository, https://github.com/kermitt2/grobid_client_python
Project-URL: Changelog, https://github.com/kermitt2/grobid_client_python
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests

[![PyPI version](https://badge.fury.io/py/grobid_client_python.svg)](https://badge.fury.io/py/grobid_client_python)
[![SWH](https://archive.softwareheritage.org/badge/origin/https://github.com/kermitt2/grobid_client_python/)](https://archive.softwareheritage.org/browse/origin/https://github.com/kermitt2/grobid_client_python/)
[![License](http://img.shields.io/:license-apache-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0.html)

# Simple python client for GROBID REST services

This Python client processes a set of PDF files in a given directory concurrently and efficiently with the [GROBID](https://github.com/kermitt2/grobid) service. It includes a command line tool, which processes PDF files on the file system and writes the results to a given output directory, and a library that can be imported in other Python scripts. The client can similarly process files containing reference strings (one per line) and patents in XML ST36 format.

## Before you start

Please be aware that, at the moment, [grobid does not support Windows](https://grobid.readthedocs.io/en/latest/Troubleshooting/#windows-related-issues).
If you are a Windows user, don't worry. You can still [run grobid via Docker](https://grobid.readthedocs.io/en/latest/Grobid-docker/).

## Build and run

You first need a running *grobid* service (latest stable version); see the [documentation](http://grobid.readthedocs.io/) for installation. 
By default, the server is assumed to run at `http://localhost:8070`. 
You can change the server address by editing the file `config.json`, see below.

## Requirements

This client was developed and tested with Python `3.5`-`3.9` and should work with any higher `3.*` version. Beyond the Python standard library, its only dependency is `requests`.

## Install

The client can be installed in any of the following ways:

* Install *latest stable release* from PyPI:

```console
python3 -m pip install grobid-client-python
```

* Install *current master development version* from GitHub:

```console
python3 -m pip install git+https://github.com/kermitt2/grobid_client_python.git
```

* Install and build from a clone of the repo (*current master development version*): 

```console
git clone https://github.com/kermitt2/grobid_client_python
cd grobid_client_python
python3 setup.py install
```

Nothing more is needed to start using the Python command line, see the next section.

## Usage and options

The script can normally be called interchangeably with `python3 -m grobid_client.grobid_client` or simply `grobid_client`.

```
usage: grobid_client [-h] [--input INPUT] [--output OUTPUT] [--config CONFIG]
                     [--n N] [--generateIDs] [--consolidate_header]
                     [--consolidate_citations] [--include_raw_citations]
                     [--include_raw_affiliations] [--force] [--teiCoordinates]
                     [--segmentSentences] [--verbose]
                     service

Client for GROBID services

positional arguments:
  service               one of ['processFulltextDocument',
                        'processHeaderDocument', 'processReferences',
                        'processCitationList','processCitationPatentST36',
                        'processCitationPatentPDF']

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         path to the directory containing PDF files or .txt
                        (for processCitationList only, one reference per line)
                        to process
  --output OUTPUT       path to the directory where to put the results
                        (optional)
  --config CONFIG       path to the config file, default is ./config.json
  --n N                 concurrency for service usage
  --generateIDs         generate random xml:id to textual XML elements of the
                        result files
  --consolidate_header  call GROBID with consolidation of the metadata
                        extracted from the header
  --consolidate_citations
                        call GROBID with consolidation of the extracted
                        bibliographical references
  --include_raw_citations
                        call GROBID requesting the extraction of raw citations
  --include_raw_affiliations
                        call GROBID requesting the extraction of raw
                        affiliations
  --force               force re-processing pdf input files when tei output
                        files already exist
  --teiCoordinates      add the original PDF coordinates (bounding boxes) to
                        the extracted elements
  --segmentSentences    segment sentences in the text content of the document
                        with additional <s> elements
  --verbose             print information about processed files in the console


```

Examples:

```console
> grobid_client --input ~/tmp/in2 --output ~/tmp/out processFulltextDocument
```

This command will process all the PDF files present under the input directory recursively (files with extension `.pdf` only) with the `processFulltextDocument` service of GROBID, and write the resulting XML TEI files under the output directory, reusing the file name with a different file extension (`.grobid.tei.xml`), using the default `10` concurrent workers. 

If `--output` is omitted, the resulting XML TEI documents will be produced alongside the PDF in the `--input` directory.

```console
> grobid_client --input ~/tmp/in2 --output ~/tmp/out --n 20 processHeaderDocument
```

This command will process all the PDF files present in the input directory (files with extension `.pdf` only) with the `processHeaderDocument` service of GROBID, and write the resulting XML TEI files under the output directory, reusing the file name with a different file extension (`.grobid.tei.xml`), using `20` concurrent workers. 

By default, if a `.grobid.tei.xml` file corresponding to a PDF in the input directory already exists in the output directory, this PDF is skipped to avoid processing the same PDF several times. To force processing and overwrite existing TEI files, use the parameter `--force`.
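The file selection described above can be sketched in a few lines of standard-library Python. This is only an illustration, not the client's actual implementation: the helper names `tei_path_for` and `plan_work` are hypothetical, and mirroring input subdirectories under the output directory is an assumption.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def tei_path_for(pdf: Path, input_dir: Path, output_dir: Optional[Path]) -> Path:
    """Return where the TEI result for `pdf` would be written (hypothetical helper)."""
    # Without an output directory, the result sits next to the PDF.
    # Assumption: with one, the input subdirectory layout is mirrored.
    base = pdf if output_dir is None else output_dir / pdf.relative_to(input_dir)
    # Reuse the file name, replacing ".pdf" with ".grobid.tei.xml".
    return base.with_name(base.stem + ".grobid.tei.xml")


def plan_work(input_dir: Path, output_dir: Optional[Path] = None,
              force: bool = False) -> List[Tuple[Path, Path]]:
    """List the (pdf, tei) pairs that still need processing."""
    todo = []
    for pdf in sorted(input_dir.rglob("*.pdf")):  # recursive, ".pdf" files only
        tei = tei_path_for(pdf, input_dir, output_dir)
        if tei.exists() and not force:
            continue  # result already present: skip unless forced
        todo.append((pdf, tei))
    return todo
```

Calling `plan_work(Path("in"), Path("out"))` would list only the PDF files that have no TEI result yet, while `force=True` lists all of them.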

`processCitationList` does not take a directory of PDF files as input, but a directory of `.txt` files, with one raw reference string per line, for example:

```console
> grobid_client --input resources/test_txt/ --output resources/test_out/ --n 20 processCitationList
```
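If the references come from another program, an input file in this one-reference-per-line layout could be prepared as follows. This is a minimal sketch; `write_reference_file` is a hypothetical helper and not part of the client.

```python
from pathlib import Path
from typing import Iterable


def write_reference_file(path: Path, references: Iterable[str]) -> int:
    """Write one raw reference per line and return the number of lines written."""
    lines = []
    for ref in references:
        # Collapse line breaks and repeated spaces so each reference stays on one line.
        flat = " ".join(ref.split())
        if flat:  # drop empty entries
            lines.append(flat)
    path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

The resulting `.txt` file can then be placed in the directory passed to `--input`.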

The following command example will process all the PDF files present in the input directory and add bounding box coordinates (`--teiCoordinates`) relative to the original PDFs for the elements listed in the config file. It will also segment the sentences (`--segmentSentences`, this is a "layout aware" sentence segmentation) in the identified paragraphs with bounding box coordinates for the sentences. 

```console
> grobid_client --input ~/tmp/in2 --output ~/tmp/out --teiCoordinates --segmentSentences processFulltextDocument
```

The file `example.py` gives an example of usage as a library, from another Python script.

## Using the client in your python

Import and call the client as follows:

```python
from grobid_client.grobid_client import GrobidClient

client = GrobidClient(config_path="./config.json")
client.process("processFulltextDocument", "/mnt/data/covid/pdfs", n=20)
```

See also `example.py`.

## Configuration of the client

There are a few parameters that can be set with the `config.json` file. 

- `grobid_server` indicates the URL of the GROBID server to be used by the client. 

- `batch_size` is the size of the pool of threads used by ThreadPoolExecutor; you normally don't want to change this. It should be a high number (default 1000), but not so high that it exhausts the memory of the machine running the client. It should not be confused with the concurrency parameter `n`, which indicates how many parallel requests can be sent to GROBID.

- `sleep_time` is the time in seconds to wait before sending a new request to GROBID when the server indicates that all its threads are currently in use. The client needs to re-send the query after a wait time that allows the server to free some threads. This wait time usually depends on the service and the capacity of the server; we suggest 5-10 seconds for the `processFulltextDocument` service and 2 seconds for the `processHeaderDocument` service.

- `timeout` is a client-side timeout: the process on the server side keeps running until the server finishes the task or the server timeout is reached.

- `coordinates` lists the structural XML elements that should carry PDF coordinates when the parameter `--teiCoordinates` is used; see [here](https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/) for more details.
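The role of `sleep_time` can be illustrated with a small wait-and-retry loop. This is a sketch under assumptions, not the client's code: `call_with_retry`, `send`, and `max_attempts` are hypothetical names, and the use of HTTP status 503 as the "all threads busy" signal is assumed here.

```python
import time
from typing import Callable, Tuple

BUSY = 503  # assumption: status code signalling that all server threads are in use


def call_with_retry(send: Callable[[], Tuple[int, str]],
                    sleep_time: float = 5.0,
                    max_attempts: int = 10,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call `send` until it returns a status other than BUSY (hypothetical helper)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != BUSY:
            return status, body
        if attempt < max_attempts - 1:
            sleep(sleep_time)  # give the server time to free a thread
    return status, body  # still busy after max_attempts tries
```

Passing `sleep` as a parameter keeps the loop easy to test without real waiting; in normal use the default `time.sleep` applies.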

Here is the default `config.json` file for the client:

```
{
    "grobid_server": "http://localhost:8070",
    "batch_size": 1000,
    "sleep_time": 5,
    "timeout": 60,
    "coordinates": [ "persName", "figure", "ref", "biblStruct", "formula", "s" ]
}
```
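A script that wants to read the same file could merge it with these defaults as sketched below. The function `load_config` is a hypothetical helper for illustration; falling back to defaults for missing keys or a missing file, and trimming a trailing slash from the server URL, are assumptions rather than documented behavior of the client.

```python
import copy
import json
from pathlib import Path

# Default values copied from the config.json shown above.
DEFAULTS = {
    "grobid_server": "http://localhost:8070",
    "batch_size": 1000,
    "sleep_time": 5,
    "timeout": 60,
    "coordinates": ["persName", "figure", "ref", "biblStruct", "formula", "s"],
}


def load_config(path: str = "./config.json") -> dict:
    """Read a config file and fill in defaults for missing keys (hypothetical helper)."""
    config = copy.deepcopy(DEFAULTS)
    config_file = Path(path)
    if config_file.is_file():
        config.update(json.loads(config_file.read_text(encoding="utf-8")))
    # Assumption: normalize the URL so request paths can be appended with "/".
    config["grobid_server"] = config["grobid_server"].rstrip("/")
    return config
```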

## Benchmarking

Full text processing of __136 PDF__ (total 3443 pages, on average 25 pages per PDF) on Intel Core i7-4790K CPU 4.00GHz, 4 cores (8 threads), 16GB memory, `n` being the concurrency parameter:

| n  | runtime (s)| s/PDF | PDF/s |
|----|------------|-------|-------|
| 1  | 209.0      | 1.54  | 0.65  |
| 2  | 112.0      | 0.82  | 1.21  |
| 3  | 80.4       | 0.59  | 1.69  |
| 5  | 62.9       | 0.46  | 2.16  |
| 8  | 55.7       | 0.41  | 2.44  |
| 10 | 55.3       | 0.40  | 2.45  |

![Runtime Plot](resources/20180928112135.png)

As complementary information, GROBID header processing of the 136 PDF with `n=10` takes 3.74 s (36 PDF/s), 15 times faster than complete full text processing because only the first two pages of each PDF are considered.

In similar conditions, extraction and structuring of bibliographical references takes 26.9 s (5.1 PDF/s).

Processing 3500 raw bibliographical references takes 4.3 s with `n=10` (814 references parsed per second).
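The derived columns of the full text table follow directly from the runtimes, and the same arithmetic gives the speedup relative to `n=1`. The short script below recomputes them from the figures reported above; the `speedup` column and the function name `summarize` are additions for illustration.

```python
N_PDF = 136  # number of PDF files in the benchmark above

# Full text runtimes in seconds, keyed by the concurrency parameter n.
RUNTIMES = {1: 209.0, 2: 112.0, 3: 80.4, 5: 62.9, 8: 55.7, 10: 55.3}


def summarize(runtimes: dict, n_pdf: int = N_PDF) -> list:
    """Compute s/PDF, PDF/s and speedup over the lowest concurrency level."""
    baseline = runtimes[min(runtimes)]
    rows = []
    for n in sorted(runtimes):
        runtime = runtimes[n]
        rows.append({
            "n": n,
            "s_per_pdf": runtime / n_pdf,
            "pdf_per_s": n_pdf / runtime,
            "speedup": baseline / runtime,
        })
    return rows


if __name__ == "__main__":
    for row in summarize(RUNTIMES):
        print("n={n:<3} s/PDF={s_per_pdf:.2f} PDF/s={pdf_per_s:.2f} "
              "speedup={speedup:.2f}".format(**row))
```

With these figures the speedup at `n=10` is about 3.8 over `n=1`, and runtime barely changes between `n=8` and `n=10`, which suggests that the benchmark machine was saturated at that point.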


## Developer notes 

### New release 

New releases can be published by using `bump-my-version`:

```shell
pip install bump-my-version
bump-my-version bump patch 
```

Using `major`, `minor`, or `patch` will increment the first, second, or third digit of the version, respectively.  
The release will be published automatically on PyPI. 

## License and contact

Distributed under [Apache 2.0 license](http://www.apache.org/licenses/LICENSE-2.0). 

Main author and contact: Patrice Lopez (<patrice.lopez@science-miner.com>)
