🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

pd3f Last update: Mar 18, 2024

`pd3f`

Experimental, use with care.

pd3f is a PDF text extraction pipeline that is self-hosted, local-first and Docker-based. It reconstructs the original continuous text with the help of machine learning.

pd3f can OCR scanned PDFs with OCRmyPDF (Tesseract) and extracts tables with Camelot and Tabula. It's built upon the output of Parsr. Parsr detects hierarchies of text and splits the text into words, lines and paragraphs.

Even though Parsr brings some structure to the PDF, the text is still scrambled, e.g., due to hyphens. The underlying Python package pd3f-core tries to reconstruct the original continuous text by removing hyphens, new lines and / or spaces. It uses language models to guess what the original text looked like.
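
To illustrate the kind of decision involved, here is a minimal sketch of joining a word that was split across a line break. It is not the pd3f-core implementation: `VOCAB`, `score` and `join_lines` are hypothetical names made up for this example, and the tiny vocabulary lookup only stands in for a real language model.

```python
# Illustrative sketch only; not the pd3f-core implementation.
# VOCAB and score() are hypothetical stand-ins for a language model.
VOCAB = {"extraction", "self-hosted", "text", "pipeline"}


def score(candidate: str) -> float:
    """Return a higher score for candidates that look like real words."""
    return 1.0 if candidate.lower() in VOCAB else 0.0


def join_lines(first: str, second: str) -> str:
    """Join two consecutive lines, deciding what to do with a trailing hyphen."""
    if not first.endswith("-"):
        # No hyphen at the line end: the line break simply becomes a space.
        return f"{first} {second}"

    # Split off the last word of the first line and the first word of the second.
    head, _, last_word = first.rpartition(" ")
    next_word, _, tail = second.partition(" ")

    # Option 1: the hyphen was only inserted because of the line break.
    dropped = last_word[:-1] + next_word
    # Option 2: the hyphen belongs to the word (e.g. a compound).
    kept = last_word + next_word

    # Pick the candidate the (stand-in) model prefers; ties drop the hyphen.
    best = dropped if score(dropped) >= score(kept) else kept
    return " ".join(part for part in (head, best, tail) if part)
```

With this toy vocabulary, `join_lines("PDF text extrac-", "tion pipeline")` drops the hyphen, while `join_lines("a self-", "hosted pipeline")` keeps it. A real language model would replace the lookup with a probability for each candidate in its context.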

pd3f is especially useful for languages with long words such as German. It was mainly developed to parse German letters and official documents. Besides German, pd3f supports English, Spanish, French and Italian. More languages will be added at a later stage.

pd3f includes a Web-based GUI and a Flask-based microservice (API). You can find a demo at demo.pd3f.com.

Documentation

Check out the full Documentation at: https://pd3f.com/docs/

Future Work / TODO

PDFs are hard to process and it's hard to extract information. So the results of this tool may not satisfy you. There will be more work to improve this software but altogether, it's unlikely that it will successfully extract all the information anytime soon.

Here are some things that will be improved.

statistics about how long processing (per page) took in the past

calculate runtime based on job.started_at and job.ended_at
Get average runtime of jobs and store data in redis list

more information about PDF

NER
entity linking
extract keywords
use textacy

add more languages

check if flair has model
what to do if there is no fast model?

Python client

simple client based on requests
send whole folders

Markdown / HTML export

go beyond text

use pdf-scripts / allow more processing

reduce size
repair PDF
detect if scanned
force to OCR again

improve logs / get better feedback

show uncertainty of ML model
allow different log levels
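
As a rough sketch of the runtime statistics item in the list above: the per-page runtime could be derived from a job's `started_at` and `ended_at` timestamps and averaged over recent jobs. Everything here is an assumption made for illustration: the `Job` dataclass is a hypothetical stand-in for a finished queue job, `num_pages` is an invented field, and a plain Python list takes the place of the Redis list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class Job:
    """Hypothetical stand-in for a finished queue job."""
    started_at: datetime
    ended_at: datetime
    num_pages: int  # assumed field, not taken from pd3f


def seconds_per_page(job: Job) -> float:
    """Runtime of a job in seconds, divided by its page count."""
    runtime = (job.ended_at - job.started_at).total_seconds()
    return runtime / max(job.num_pages, 1)


def record_runtime(store: List[float], job: Job, keep_last: int = 100) -> None:
    """Append the per-page runtime and keep only the most recent entries.

    With Redis, `store` would be a list key instead of a Python list.
    """
    store.append(seconds_per_page(job))
    del store[:-keep_last]


def average_runtime(store: List[float]) -> float:
    """Average seconds per page over the stored jobs (0.0 if empty)."""
    return mean(store) if store else 0.0


if __name__ == "__main__":
    start = datetime(2024, 3, 18, 12, 0, 0)
    runtimes: List[float] = []
    record_runtime(runtimes, Job(start, start + timedelta(seconds=10), 5))
    record_runtime(runtimes, Job(start, start + timedelta(seconds=12), 3))
    print(average_runtime(runtimes))
```

Capping the list at the most recent entries keeps the estimate responsive to changes in hardware or models; the cap of 100 is an arbitrary choice for this sketch.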

Related Work

https://github.com/axa-group/Parsr
https://github.com/jzillmann/pdf-to-markdown
some PDF processing tools in my blog post

Development

Install and use poetry.

Initially run:

./dev.sh --build

Omit --build if the Docker images do not need to be rebuilt. Right now Docker + poetry cannot cache the installs, so rebuilding the image every time is slow.

Contributing

If you have a question, found a bug or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

License

Affero General Public License 3.0
