1. scraping documents
Tools for extracting semantic content from documents (e.g. PDFs, …)
1.1. programming-frameworks/libraries etc.
1.1.1. R
- LexisNexisTools (website-cran repository )
-
This package provides functions to read files manually downloaded from ‘LexisNexis’ and comes with a few other features I think come in handy while working with data from the popular newspaper archive. < | GPL-3.0 | library | R | >
- readtext (website website-cran repository )
-
readtext is a one-function package that does exactly what it says on the tin: It reads files containing text, along with any associated document-level metadata, which we call “docvars”, for document variables. Plain text files do not have docvars, but other forms such as .csv, .tab, .xml, and .json files usually do. < | GPL-3.0 | library | R | >
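The "docvars" idea can be illustrated outside R as well. The following is a minimal Python sketch, not readtext's API: it assumes a CSV export in which one column holds the document text and every other column is document-level metadata. The function name `read_with_docvars` and the column name `text` are hypothetical choices made for this illustration.

```python
import csv
import io


def read_with_docvars(csv_text, text_field="text"):
    """Split each CSV row into the document text and its docvars.

    Hypothetical helper for illustration only; readtext itself is an R
    package and is used differently.
    """
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The text column becomes the document body ...
        text = row.pop(text_field)
        # ... and all remaining columns are kept as document variables.
        documents.append({"text": text, "docvars": dict(row)})
    return documents


# Made-up sample data: two documents with two docvars each.
sample = (
    "doc_id,year,text\n"
    "a1,2019,First article.\n"
    "a2,2020,Second article.\n"
)

docs = read_with_docvars(sample)
for doc in docs:
    print(doc["docvars"], "->", doc["text"])
```

A plain .txt file would yield a document with an empty set of docvars under this scheme, which matches the distinction drawn in the description above.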
1.1.2. Python
- news_extract (repository )
-
news_extract allows the output of the NexisUni and Factiva databases to be imported into Python. Note, you must export your documents manually first! This module does not scrape the databases directly; rather, it extracts articles and associated metadata from pre-exported output files. < | BSD | library | Python | >
1.1.3. Others
- Grobid (website-demo website-documentation repository )
-
GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications. < | Apache-2.0 | library | Java | >
- pdf2xml (repository )
-
Converts PDF files to XML. This script relies heavily on Apache Tika and pdftotext for extracting text and converting it to XML, and it tries to combine information from both tools and from different conversion modes. < | GPL-3.0 | library | Perl, Java | >
- PdfAct (repository )
-
A basic tool that extracts the structure from the PDF files of scientific articles. < | Apache-2.0 | library | Java | >
- pdfalto (repository )
-
pdfalto is a command-line executable for parsing PDF files and producing structured XML representations of the PDF content in ALTO format. < | GPL-2.0 | library | C | >
- PDFBox (website repository )
-
The Apache PDFBox® library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. < | Apache-2.0 | library | Java | >
- PDFMiner.six (repository )
-
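PDFMiner.six is a community-maintained fork of PDFMiner, a Python tool for extracting information, mainly text, from PDF documents. < | None | library | Python | >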
- poppler (website repository )
-
A library for rendering PDF files, and examining or modifying their structure. < | GPL-3.0 | library | C | >
- xpdf (repository )
-
Xpdf is a free PDF viewer and toolkit, including a text extractor, image converter, HTML converter, and more. Most of the tools are available as open source. < | GPL-3.0 | library | C | >
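Several of the tools above (xpdf, poppler, and pdf2xml through its dependency) rely on the `pdftotext` command-line extractor. As a rough sketch of how it might be driven from a script, the Python snippet below builds a `pdftotext` invocation and, if the binary is installed, runs it. The helper names `pdftotext_command` and `extract_text` and the file name `paper.pdf` are assumptions for illustration. The flags used (`-layout`, `-enc`, `-f`, `-l`, and `-` for standard output) are available in both the xpdf and poppler builds, but other options differ between the two.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None, layout=True):
    """Build the argument list for a pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        # Try to preserve the physical layout of the page.
        cmd.append("-layout")
    # Request UTF-8 output explicitly, since defaults differ between builds.
    cmd += ["-enc", "UTF-8"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # "-" as the output file sends the extracted text to standard output.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, **options):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, **options),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


# Show the command that would be run for the first two pages of a PDF.
print(" ".join(pdftotext_command("paper.pdf", first_page=1, last_page=2)))
```

Calling `extract_text("paper.pdf", first_page=1, last_page=2)` would then return the text of those pages, provided the file exists and `pdftotext` is installed.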