Python implementation of the TextRank algorithm for automatic keyword extraction and summarization, using Levenshtein distance as the relation between text units. This project is based on the paper "TextRank: Bringing Order into Text" by Rada Mihalcea and Paul Tarau: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf

davidadamojr Last update: Mar 18, 2024

TextRank

This is a Python implementation of TextRank for automatic keyword and sentence extraction (summarization), as described in https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf. However, this implementation uses Levenshtein distance as the relation between text units.
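To make the idea concrete, here is a minimal, standard-library-only sketch of ranking text units on a graph whose edge weights are Levenshtein distances. It is an illustration under stated assumptions, not this library's code: the function names (levenshtein, build_graph, rank), the fully connected graph, the damping factor of 0.85, and the fixed iteration count are all hypothetical choices made for the example.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def build_graph(units):
    """Assumed layout: a complete undirected graph over unique text units,
    where weights[u][v] is the Levenshtein distance between u and v."""
    units = list(dict.fromkeys(units))  # drop duplicates, keep order
    weights = {u: {} for u in units}
    for u, v in combinations(units, 2):
        distance = levenshtein(u, v)
        weights[u][v] = distance
        weights[v][u] = distance
    return weights


def rank(weights, damping=0.85, iterations=50):
    """Weighted PageRank computed by plain power iteration."""
    nodes = list(weights)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    totals = {node: sum(weights[node].values()) for node in nodes}
    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # Each neighbour passes on a share of its score proportional
            # to the weight of the edge it shares with this node.
            incoming = sum(scores[other] * w / totals[other]
                           for other, w in weights[node].items()
                           if totals[other] > 0)
            new_scores[node] = (1 - damping) / n + damping * incoming
        scores = new_scores
    return scores


if __name__ == "__main__":
    graph = build_graph(["code", "coder", "graph"])
    for unit, score in sorted(rank(graph).items(), key=lambda kv: -kv[1]):
        print(unit, round(score, 3))
```

Because the raw distance is used as the edge weight in this sketch, units that differ more from the rest are linked more strongly. The same functions work on sentences if each sentence is passed in as a string.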

This implementation carries out automatic keyword and sentence extraction on 10 articles taken from http://theonion.com, with the following behavior:

Each summary is 100 words long
The number of keywords extracted is relative to the size of the text (a third of the number of nodes in the graph)
Adjacent keywords in the text are concatenated into keyphrases
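The post-processing rules listed above can be sketched as follows. This is a hedged illustration rather than the library's actual code: select_keywords, merge_adjacent, and truncate_summary are hypothetical names, rounding the "third of the nodes" down with integer division is an assumption, and joining whole runs of adjacent keywords (rather than only pairs) is a simplification chosen for the example.

```python
def select_keywords(scores):
    """Keep the top third of the ranked words.

    `scores` maps each word (graph node) to its rank score.
    Rounding down with // is an assumption of this sketch.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:len(ranked) // 3])


def merge_adjacent(tokens, keywords):
    """Join runs of consecutive keyword tokens into keyphrases."""
    phrases = []
    run = []
    for token in tokens:
        if token in keywords:
            run.append(token)
        elif run:
            phrases.append(" ".join(run))
            run = []
    if run:
        phrases.append(" ".join(run))
    # Drop duplicate phrases while keeping first-seen order.
    return list(dict.fromkeys(phrases))


def truncate_summary(ranked_sentences, max_words=100):
    """Concatenate sentences (best first) and cut at `max_words` words."""
    words = " ".join(ranked_sentences).split()
    return " ".join(words[:max_words])


if __name__ == "__main__":
    tokens = "compatibility of systems of linear constraints".split()
    print(merge_adjacent(tokens, {"systems", "linear", "constraints"}))
```

Cutting the summary at a fixed word count can end it mid-sentence; a variant that stops at the last full sentence under the limit would be a reasonable alternative.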

Usage

To install the library, run the setup.py script located in the repository's root directory. Alternatively, if you have access to pip, you may install the library directly from GitHub:

pip install git+https://github.com/davidadamojr/TextRank.git

Using the library requires downloading NLTK resources. Run the textrank initialize command to fetch the required data. Once the data has finished downloading, you may run the following commands:

textrank extract_summary <filename>
textrank extract_phrases <filename>

Contributing

To work on the library, install it in "editable" mode within a virtual environment:

pip install -e .

Dependencies

Dependencies are installed automatically with pip but can also be installed separately.

Networkx - https://pypi.python.org/pypi/networkx/
NLTK 3.0 - https://pypi.python.org/pypi/nltk/3.2.2
Numpy - https://pypi.python.org/pypi/numpy
Click - https://pypi.python.org/pypi/click

