link:Tool_jusText[jusText] (link:http://corpus.tools/wiki/Justext[website] ):: 'jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages. It is designed to preserve mainly text containing full sentences and it is therefore well suited for creating linguistic resources such as Web corpora.' < | BSD | library | Python | >
link:Tool_goose3[goose3] (link:https://github.com/goose3/goose3[repository] ):: 'Goose was originally an article extractor written in Java that has most recently (Aug2011) been converted to a scala project. This is a complete rewrite in Python. The aim of the software is to take any news article or article-type web page and not only extract what is the main body of the article but also all meta data and most probable image candidate.' < | Apache-2.0 | library | Python | >
link:Tool_python-boilerpipe[python-boilerpipe] (link:https://github.com/misja/python-boilerpipe[repository] ):: A Python wrapper for Boilerpipe, an excellent Java library for boilerplate removal and full-text extraction from HTML pages. < | Apache-2.0 | library | Python | >