1. scraping websites
Tools for extracting the main text body from HTML web pages.
1.1. programming-frameworks/libraries etc.
1.1.1. R
- boilerpipeR (website-cran)
Generic extraction of the main text content from HTML files; removes ads, sidebars, and headers using the boilerpipe (http://code.google.com/p/boilerpipe/) Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates. < | GPL-3.0 | library | R | >
1.1.2. Python
- goose3 (repository)
'Goose was originally an article extractor written in Java that has most recently (Aug2011) been converted to a scala project. This is a complete rewrite in Python. The aim of the software is to take any news article or article-type web page and not only extract what is the main body of the article but also all meta data and most probable image candidate.' < | Apache-2.0 | library | Python | >
- jusText (website)
'jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages. It is designed to preserve mainly text containing full sentences and it is therefore well suited for creating linguistic resources such as Web corpora.' < | BSD | library | Python | >
- newspaper3k (website, repository)
Newspaper delivers Instapaper-style article extraction. < | MIT | library | Python | >
- python-boilerpipe (repository)
A Python wrapper for Boilerpipe, an excellent Java library for boilerplate removal and full-text extraction from HTML pages. < | Apache-2.0 | library | Python | >
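
To make the idea of boilerplate removal behind the tools above more concrete, here is a minimal sketch using only the Python standard library. It is not the algorithm of jusText, boilerpipe, or any other listed tool; it only illustrates the general heuristic of keeping blocks that look like full sentences and dropping blocks that are short or consist mostly of link text. The names `BlockCollector`, `extract_main_text`, `SKIP_TAGS`, `BLOCK_TAGS`, and `SAMPLE`, and the thresholds `min_words` and `max_link_density`, are assumptions made for this illustration.

```python
from html.parser import HTMLParser

# Assumption of this sketch: text inside these tags is always boilerplate.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Assumption of this sketch: these tags delimit separate text blocks.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "td", "article", "section"}


class BlockCollector(HTMLParser):
    """Collect text blocks plus the number of characters found inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars) tuples
        self._text = []         # raw text pieces of the current block
        self._link_chars = 0    # characters of the current block inside <a>
        self._skip_depth = 0    # > 0 while inside a SKIP_TAGS element
        self._in_link = 0       # > 0 while inside an <a> element

    def _flush(self):
        # Normalise whitespace and store the block if it is not empty.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._flush()
            self._skip_depth += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(" ".join(data.split()))

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=8, max_link_density=0.3):
    """Keep blocks that are long enough, end like a sentence, and are not mostly links."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.blocks:
        enough_words = len(text.split()) >= min_words
        ends_like_sentence = text.endswith((".", "!", "?"))
        low_link_density = link_chars / len(text) <= max_link_density
        if enough_words and ends_like_sentence and low_link_density:
            kept.append(text)
    return "\n\n".join(kept)


# Hypothetical sample page, made up for this illustration.
SAMPLE = """
<html><body>
<nav><a href="/">Home</a> <a href="/news">News</a></nav>
<div class="ad"><a href="/buy">Buy now and save a lot of money today!</a></div>
<article>
<h1>Local library reopens</h1>
<p>The city library reopened on Monday after a renovation that lasted two years.</p>
<p>Visitors can now borrow books, use the new reading room, and attend weekly events.</p>
</article>
<footer>Copyright notice. All rights reserved by the example publisher.</footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

On the sample page, the navigation, the linked advertisement, the short headline, and the footer are dropped, and only the two article paragraphs remain. Real pages are far messier than this, which is why the libraries listed above rely on more elaborate heuristics and are the better choice in practice.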