SciPDF Parser

A Python parser for scientific PDFs, based on GROBID.

Installation

Use pip to install directly from this GitHub repository:

pip install git+https://github.com/titipata/scipdf_parser

Note

The parser also needs the en_core_web_sm model for spaCy. Download it by running python -m spacy download en_core_web_sm.
To test the parser against a different GROBID version, change the version in serve_grobid.sh.

Usage

Start GROBID with the provided bash script before parsing any PDF.

NOTE: The recommended way to run GROBID is via Docker, so make sure Docker is running on your machine. Update the script to use the latest GROBID version, since each release generally brings substantial improvements.

bash serve_grobid.sh
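Once the server has started, a quick reachability check can save confusing parse errors later. The following is a minimal sketch, not part of scipdf: the helper name grobid_is_alive is hypothetical, and it assumes GROBID is listening on localhost port 8070 and answers on its /api/isalive health endpoint.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0):
    """Return True if a GROBID server answers its health check at base_url.

    Hypothetical helper; the host and port are assumptions and should match
    whatever serve_grobid.sh exposes on your machine.
    """
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

Calling grobid_is_alive() before parsing lets a script fail early with a clear message instead of a connection error from deep inside the parser.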

This script runs GROBID on the default port 8070. To parse a PDF from the example_data folder or from a direct URL, use the following function:

import scipdf

# Parse a local PDF; returns a dictionary
article_dict = scipdf.parse_pdf_to_dict('example_data/futoma2017improved.pdf')

# A direct URL to a PDF also works. If as_list is set to True, the 'text' of each
# parsed section is a list of paragraphs instead.
article_dict = scipdf.parse_pdf_to_dict('https://www.biorxiv.org/content/biorxiv/early/2018/11/20/463760.full.pdf', as_list=False)

# output example
>> {
    'title': 'Proceedings of Machine Learning for Healthcare',
    'abstract': '...',
    'sections': [
        {'heading': '...', 'text': '...'},
        {'heading': '...', 'text': '...'},
        ...
    ],
    'references': [
        {'title': '...', 'year': '...', 'journal': '...', 'author': '...'},
        ...
    ],
    'figures': [
        {'figure_label': '...', 'figure_type': '...', 'figure_id': '...', 'figure_caption': '...', 'figure_data': '...'},
        ...
    ],
    'doi': '...'
}

xml = scipdf.parse_pdf('example_data/futoma2017improved.pdf', soup=True) # option to parse full XML from GROBID
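The returned dictionary can be processed with ordinary Python. As a rough sketch, the hypothetical helper below summarizes each section and handles both shapes of 'text' (a single string, or a list of paragraphs when as_list=True). The sample dictionary is made up for illustration and only mirrors the output structure shown above; it is not real parser output.

```python
def section_summaries(article_dict):
    """Return (heading, paragraph_count, word_count) for each parsed section."""
    summaries = []
    for section in article_dict.get("sections", []):
        text = section.get("text", "")
        # With as_list=True the text is a list of paragraphs; otherwise one string.
        paragraphs = text if isinstance(text, list) else [text]
        paragraphs = [p for p in paragraphs if p.strip()]
        word_count = sum(len(p.split()) for p in paragraphs)
        summaries.append((section.get("heading", ""), len(paragraphs), word_count))
    return summaries


# Made-up stand-in for the output of scipdf.parse_pdf_to_dict. It mixes both
# 'text' shapes only to exercise both branches of the helper.
sample = {
    "title": "Example paper",
    "sections": [
        {"heading": "Introduction", "text": "First paragraph here. Still the first."},
        {"heading": "Methods", "text": ["Paragraph one.", "Paragraph two has five words."]},
    ],
}

for heading, n_paragraphs, n_words in section_summaries(sample):
    print(f"{heading}: {n_paragraphs} paragraph(s), {n_words} words")
```

Normalizing 'text' to a list up front keeps the rest of the code independent of the as_list setting.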

To parse figures from PDFs using pdffigures2, run:

scipdf.parse_figures('example_data', output_folder='figures') # folder should contain only PDF files

You can see example output figures in the figures folder.
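Because the input folder for parse_figures should contain only PDF files, it may be worth checking it beforehand. This is a small sketch using only the standard library; the helper name non_pdf_files is hypothetical and not part of scipdf.

```python
from pathlib import Path


def non_pdf_files(folder):
    """Return the sorted names of entries in folder that are not .pdf files."""
    return sorted(
        entry.name
        for entry in Path(folder).iterdir()
        # Subfolders and files with any other extension count as stray entries.
        if not (entry.is_file() and entry.suffix.lower() == ".pdf")
    )
```

For example, non_pdf_files('example_data') returning an empty list suggests the folder is safe to pass to scipdf.parse_figures; otherwise the returned names show what to move out first.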
