1. general crawling/scraping
Tools for web crawling and scraping.
1.1. programming frameworks/libraries etc.
1.1.1. R
- Rcrawler (website, repository)
  - 'RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, store pages, extract contents, and produce data that can be directly employed for web content mining applications.' Retrieved 22.03.2019
  - < | GPL-3.0 | library | R | >
- rvest (website-cran, repository)
  - < | GPL-3.0 | library | R | >
1.1.2. Python
- Scrapy (website, repository)
  - < | BSD | framework | Python | >
- Beautiful Soup (website, repository)
  - 'Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful: Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn’t take much code to write an application; Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings, unless the document doesn’t specify an encoding and Beautiful Soup can’t detect one. Then you just have to specify the original encoding; Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.' Retrieved 22.03.2019 (see the short usage sketch after this list)
  - < | MIT | library | Python | >
- Robobrowser (website, repository)
  - < | MIT | library | Python | >
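
For orientation, here is a minimal sketch of the kind of screen-scraping the Beautiful Soup quote above describes. It assumes the third-party requests package for fetching and uses https://example.com purely as a placeholder URL.

```python
# Minimal Beautiful Soup (bs4) sketch: fetch a page, parse it, and walk the
# parse tree. Assumes 'requests' and 'beautifulsoup4' are installed; the URL
# is only a placeholder.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text

# Parse with Python's built-in parser; pass "lxml" or "html5lib" instead to
# trade speed for leniency, as the quoted description mentions.
soup = BeautifulSoup(html, "html.parser")

# Navigate/search the tree: print the page title and every link target.
print(soup.title.string if soup.title else "no <title> found")
for link in soup.find_all("a", href=True):
    print(link["href"])
```

Swapping the second argument to BeautifulSoup is all it takes to try a different parser, which is the "trade speed for flexibility" point made in the quoted description.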