... | ... | @@ -144,10 +144,17 @@ _Tools in the area of web-scraping_ |
|
|
|
|
|
link:Tool_YouTubeComments[YouTubeComments] (link:https://osf.io/hqsxe/[website] link:https://github.com/JuKo007/YouTubeComments[repository] ):: 'This repository contains an R script as well as an interactive Jupyter notebook to demonstrate how to automatically collect, format, and explore YouTube comments, including the emojis they contain. The script and notebook showcase the following steps: Getting access to the YouTube API; Extracting comments for a video; Formatting the comments & extracting emojis; Basic sentiment analysis for text & emojis.' link:https://github.com/JuKo007/YouTubeComments[Retrieved 07.03.2019] < | Unknown | library | R | >
|
|
|
|
|
|
link:Tool_Rcrawler[Rcrawler] (link:https://www.sciencedirect.com/science/article/pii/S2352711017300110[website] link:https://github.com/salimk/Rcrawler[repository] ):: 'RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, store pages, extract contents, and produce data that can be directly employed for web content mining applications.' link:https://www.sciencedirect.com/science/article/pii/S2352711017300110[Retrieved 22.03.2019] < | GPL-3.0 | library | R | >
|
|
|
|
|
|
link:Tool_RSelenium[RSelenium] (link:https://github.com/ropensci/RSelenium[repository] ):: < | AGPL-3.0 | library | R | >
|
|
|
|
|
|
link:Tool_rvest[rvest] (link:https://cran.r-project.org/web/packages/rvest/index.html[website-cran] link:https://github.com/tidyverse/rvest[repository] ):: < | GPL-3.0 | library | R | >
|
|
|
|
|
|
==== data wrangling
|
|
|
_Tools for data wrangling_
|
|
|
|
|
|
link:Tool_boilerpipeR[boilerpipeR] (link:https://cran.r-project.org/web/packages/boilerpipeR/index.html[website-cran] ):: Generic extraction of the main text content from HTML files; removes ads, sidebars, and headers using the boilerpipe (http://code.google.com/p/boilerpipe/) Java library. The extraction heuristics from boilerpipe perform robustly across a wide range of website templates. < | GPL-3.0 | library | R | >
|
|
|
|
|
|
==== tools for corpus linguistics/text mining/(semi-)automated text analysis
|
|
|
_Integrated platforms for corpus analysis and processing._
|
|
|
|
... | ... | @@ -205,6 +212,15 @@ link:Tool_BeautifulSoup[Beautiful Soup] (link:https://www.crummy.com/software/Be |
|
|
|
|
|
link:Tool_Robobrowser[Robobrowser] (link:https://robobrowser.readthedocs.io/en/latest/readme.html[website] link:https://github.com/jmcarp/robobrowser[repository] ):: < | MIT | library | Python | >
|
|
|
|
|
|
==== data wrangling
|
|
|
_Tools for data wrangling_
|
|
|
|
|
|
link:Tool_jusText[jusText] (link:http://corpus.tools/wiki/Justext[website] ):: 'jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages. It is designed to preserve mainly text containing full sentences and it is therefore well suited for creating linguistic resources such as Web corpora.' < | BSD | library | Python | >
|
|
|
|
|
|
link:Tool_python-boilerpipe[python-boilerpipe] (link:https://github.com/misja/python-boilerpipe[repository] ):: A Python wrapper for Boilerpipe, an excellent Java library for boilerplate removal and full-text extraction from HTML pages. < | Apache-2.0 | library | Python | >
|
|
|
|
|
|
link:Tool_Newspaper3k[Newspaper3k] (link:https://newspaper.readthedocs.io/en/latest/[website] link:https://github.com/codelucas/newspaper[repository] ):: Newspaper delivers Instapaper-style article extraction. < | MIT | library | Python | >
|
|
|
|
|
|
==== natural language processing (NLP)
|
|
|
__
|
|
|
|
... | ... | |