- 1. user-consented tracking
- 2. scraping
- 3. tools for corpus linguistics
- 4. computer assisted/aided qualitative data analysis software (CAQDAS)
- 5. text mining/natural language processing (NLP)
- 6. topic-models
- 7. sentiment analysis
- 8. visualization
- 9. collaborative annotation
- 10. collaborative writing
- 11. research data archiving
- 12. statistical software
- 13. nowcasting
- 14. network analysis
- 15. search
- 16. ESM/EMA surveys
- 17. audio-transcriptions
- 18. optical character recognition (OCR)
- 19. online experiments
- 20. miscellaneous
1. user-consented tracking
Collection of sensor data on (mobile) devices in accordance with data protection laws.
- AWARE (website repository-android repository-iOS repository-OSX repository-server )
-
< Apache-2.0 | framework | Java >
- MEILI (repository-group )
-
< GPL-3.0 | framework | Java >
- Passive Data Kit (website repository-djangoserver repository-android repository-iOS )
-
< Apache-2.0 | framework | Python Java >
2. scraping
Tools for web scraping.
- facepager (wiki repository )
-
< MIT | package | Python >
- Scrapy (website repository )
-
< BSD | package | Python >
- RSelenium (repository )
-
< AGPL-3.0 | package | R >
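The packages above handle crawling, throttling, and session management; at their core they all parse HTML and extract data from it. A minimal sketch of that core step using only the Python standard library (the markup and URLs are made-up example data):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = '<html><body><a href="https://example.org/a">A</a> <a href="/b">B</a></body></html>'
parser = LinkExtractor()
parser.feed(html)
# parser.links is now ["https://example.org/a", "/b"]
```

Dedicated scrapers add what this sketch lacks: fetching pages, following the extracted links, respecting robots.txt, and exporting structured records.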
3. tools for corpus linguistics
Integrated platforms for corpus analysis and processing.
- AmCAT (website-developer repository wiki )
-
The Amsterdam Content Analysis Toolkit (AmCAT) is an open source infrastructure that makes it easy to do large-scale automatic and manual content analysis (text analysis) for the social sciences and humanities. < AGPL-3.0 | SaaS | Python >
- COSMOS (website )
-
COSMOS Open Data Analytics software < Proprietary | standalone | >
- CWB (website repository-cwb repository-cqpweb )
-
a fast, powerful and extremely flexible corpus querying system < GPL-3.0 | framework | C >
- LCM (website )
-
The Leipzig Corpus Miner (LCM), a decentralized SaaS application for the analysis of very large amounts of news texts < LGPL | framework | Java R >
- iLCM (website )
-
The iLCM (LCM = Leipzig Corpus Miner) project pursues the development of an integrated research environment for the analysis of structured and unstructured data in a 'Software as a Service' architecture (SaaS). The research environment addresses requirements for the quantitative evaluation of large amounts of qualitative data using text mining methods and requirements for the reproducibility of data-driven research designs in the social sciences. < LGPL | SaaS | Java Python R >
4. computer assisted/aided qualitative data analysis software (CAQDAS)
Software that assists with qualitative research tasks such as transcription analysis, coding and text interpretation, recursive abstraction, content analysis, discourse analysis, grounded theory methodology, etc.
- ATLAS.ti (website )
-
< Proprietary | standalone | >
- Leximancer (website )
-
Leximancer automatically analyses your text documents to identify the high level concepts in your text documents, delivering the key ideas and actionable insights you need with powerful interactive visualisations and data exports. < Proprietary | standalone | >
- MAXQDA (website-uhh )
-
< Proprietary | standalone | >
- NVivo (website )
-
< Proprietary | standalone | >
- QDAMiner (website )
-
< Proprietary | standalone | >
- ORA Pro (website )
-
< Proprietary | standalone | >
- Quirkos (website repository )
-
< Proprietary | standalone | >
- RQDA (website repository )
-
It includes a number of standard Computer-Aided Qualitative Data Analysis features. In addition it seamlessly integrates with R, which means that a) statistical analysis on the coding is possible, and b) functions for data manipulation and analysis can be easily extended by writing R functions. To some extent, RQDA and R make an integrated platform for both quantitative and qualitative data analysis. < BSD | package | R >
- TAMS (website )
-
Text Analysis Markup System (TAMS) is both a system of marking documents for qualitative analysis and a series of tools for mining information based on that syntax. < GPL-2.0 | standalone | >
5. text mining/natural language processing (NLP)
Libraries and frameworks for text mining and natural language processing.
- Apache OpenNLP (website repository )
-
OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution. < Apache-2.0 | package | Java >
- GATE (website repository )
-
GATE - General Architecture for Text Engineering < LGPL | package | Java >
- Gensim (website repository )
-
Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. < LGPL | package | Python >
- NLTK (website repository )
-
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. < Apache-2.0 | package | Python >
- Pandas (website repository )
-
< BSD | package | Python >
- polmineR (website-cran repository )
-
< GPL-3.0 | package | R >
- quanteda (website repository )
-
The package is designed for R users needing to apply natural language processing to texts, from documents to final analysis. Its capabilities match or exceed those provided in many end-user software applications, many of which are expensive and not open source. The package is therefore of great benefit to researchers, students, and other analysts with fewer financial resources. While using quanteda requires R programming knowledge, its API is designed to enable powerful, efficient analysis with a minimum of steps. By emphasizing consistent design, furthermore, quanteda lowers the barriers to learning and using NLP and quantitative text analysis even for proficient R programmers. < GPL-3.0 | package | R >
- RapidMiner (website repository )
-
< AGPL-3.0 | framework | Java >
- spaCy (website repository )
-
spaCy excels at large-scale information extraction tasks. It’s written from the ground up in carefully memory-managed Cython. Independent research has confirmed that spaCy is the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using. < MIT | package | Cython >
- Stanford CoreNLP (website repository )
-
< GPL-3.0 | framework | Java >
- tm (website website-cran repository )
-
< GPL-3.0 | package | R >
- xtas (website repository )
-
The eXtensible Text Analysis Suite (xtas) is a collection of natural language processing and text mining tools, brought together in a single software package with built-in distributed computing and support for the Elasticsearch document store. < Apache-2.0 | framework | Python >
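Most of the toolkits in this section start from the same preprocessing steps: tokenization and frequency counting. A toy sketch of those steps in plain Python (the regex tokenizer is deliberately simplistic; real toolkits such as NLTK or spaCy handle Unicode, contractions, and sentence segmentation properly):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

doc = "The quick brown fox jumps over the lazy dog. The dog sleeps."
tokens = tokenize(doc)
freq = Counter(tokens)
# freq.most_common(2) -> [('the', 3), ('dog', 2)]
```

Term frequencies like these are the input to nearly everything downstream, from document-term matrices to topic models.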
6. topic-models
Tools for estimating topic models, which identify latent themes in document collections.
- MALLET (website repository )
-
< Apache-2.0 | package | Java >
- TOME (website repository )
-
TOME is a tool to support the interactive exploration and visualization of text-based archives, supported by a Digital Humanities Startup Grant from the National Endowment for the Humanities (Lauren Klein and Jacob Eisenstein, co-PIs). Drawing upon the technique of topic modeling—a machine learning method for identifying the set of topics, or themes, in a document set—our tool allows humanities scholars to trace the evolution and circulation of these themes across networks and over time. < Unknown | package | Python Jupyter Notebook >
- Stm (website repository )
-
The Structural Topic Model (STM) allows researchers to estimate topic models with document-level covariates. The package also includes tools for model selection, visualization, and estimation of topic-covariate regressions. Methods developed in Roberts et al (2014) <doi:10.1111/ajps.12103> and Roberts et al (2016) <doi:10.1080/01621459.2016.1141684>. < MIT | package | R >
7. sentiment analysis
Tools for identifying sentiment and subjectivity in text.
- lexicoder (website )
-
Lexicoder performs simple deductive content analyses of any kind of text, in almost any language. All that is required is the text itself, and a dictionary. Our own work initially focused on the analysis of newspaper stories during election campaigns, and both television and newspaper stories about public policy issues. The software can deal with almost any text, however, and lots of it. Our own databases typically include up to 100,000 news stories. Lexicoder processes these data, even with a relatively complicated coding dictionary, in about fifteen minutes. The software has, we hope, a wide range of applications in the social sciences. It is not the only software that conducts content analysis, of course - there are many packages out there, some of which are much more sophisticated than this one. The advantage to Lexicoder, however, is that it can run on any computer with a recent version of Java (PC or Mac), it is very simple to use, it can deal with huge bodies of data, it can be called from R as well as from the Command Line, and it's free. < Proprietary | package | Java >
- OpinionFinder (website repository )
-
OpinionFinder is a system that processes documents and automatically identifies subjective sentences as well as various aspects of subjectivity within sentences, including agents who are sources of opinion, direct subjective expressions and speech events, and sentiment expressions. OpinionFinder was developed by researchers at the University of Pittsburgh, Cornell University, and the University of Utah. In addition to OpinionFinder, we are also releasing the automatic annotations produced by running OpinionFinder on a subset of the Penn Treebank. < Unknown | package | Java >
- Readme (website )
-
The ReadMe software package for R takes as input a set of text documents (such as speeches, blog posts, newspaper articles, judicial opinions, movie reviews, etc.), a categorization scheme chosen by the user (e.g., ordered positive to negative sentiment ratings, unordered policy topics, or any other mutually exclusive and exhaustive set of categories), and a small subset of text documents hand classified into the given categories. < CC BY-NC-ND-3.0 | package | R >
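Dictionary-based tools like Lexicoder reduce sentiment analysis to counting dictionary hits per document. A toy sketch of that idea in plain Python (the word lists below are made up for illustration; real resources such as the Lexicoder Sentiment Dictionary contain thousands of entries and handle negation):

```python
# Toy sentiment dictionaries -- illustrative only, not a real lexicon.
POSITIVE = {"good", "great", "excellent", "free"}
NEGATIVE = {"bad", "poor", "terrible", "expensive"}

def dictionary_score(text):
    """Return (positive hits, negative hits, net score) for one document."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos, neg, pos - neg

score = dictionary_score("The software is great and free, but the manual is poor.")
# score -> (2, 1, 1)
```

The appeal of this approach is transparency: every score can be traced back to specific dictionary matches in the text.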
8. visualization
Tools for visualizing data, networks, and images.
- Gephi (website repository )
-
Gephi is an award-winning open-source platform for visualizing and manipulating large graphs. < GPL-3.0 | package | Java >
- scikit-image (website repository )
-
scikit-image is a collection of algorithms for image processing. It is available free of charge and free of restriction. We pride ourselves on high-quality, peer-reviewed code, written by an active community of volunteers. < BSD | package | Python >
9. collaborative annotation
Tools for annotating texts collaboratively.
- CATMA (website website-uhh repository )
-
CATMA (Computer Assisted Text Markup and Analysis) is a practical and intuitive tool for text researchers. In CATMA users can combine the hermeneutic, ‘undogmatic’ and the digital, taxonomy based approach to text and corpora—as a single researcher, or in real-time collaboration with other team members. < Apache-2.0 | package | Python >
- WebAnno (website repository )
-
WebAnno is a multi-user tool supporting different roles such as annotator, curator, and project manager. The progress and quality of annotation projects can be monitored and measured in terms of inter-annotator agreement. Multiple annotation projects can be conducted in parallel. < Apache-2.0 | package | Python >
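Inter-annotator agreement, which tools like WebAnno report, is often measured with Cohen's kappa: observed agreement corrected for the agreement expected by chance. A self-contained sketch (the annotator labels are made-up example data):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items both annotators labeled alike.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    ca, cb = Counter(a), Counter(b)
    expected = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (observed - expected) / (1 - expected)

ann1 = ["pos", "pos", "neg", "neg", "pos"]
ann2 = ["pos", "neg", "neg", "neg", "pos"]
kappa = cohens_kappa(ann1, ann2)  # about 0.615
```

Values near 1 indicate strong agreement; values near 0 mean the annotators agree no more than chance would predict.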
10. collaborative writing
Tools for writing documents collaboratively.
- FidusWriter (website repository )
-
< AGPL-3.0 | package | Python Javascript >
11. research data archiving
Software for archiving and publishing research data.
- dataverse (website repository )
-
< Apache-2.0 | framework | Java >
12. statistical software
Software for estimating specific statistical models.
- gretl (website repository )
-
gretl is a cross-platform software package for econometric analysis < GPL-3.0 | package | C >
- MLwiN (website repository )
-
< Proprietary | package | >
- SPSS (website-uhh )
-
< Proprietary | package | >
- STATA (website-uhh )
-
< Proprietary | package | >
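The packages above estimate far richer models, but the simplest case they all cover is ordinary least squares with one predictor, which fits in a few lines of plain Python (the data points are made-up example values chosen to lie exactly on y = 1 + 2x):

```python
def ols_simple(xs, ys):
    """Ordinary least squares fit for y = a + b*x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # exactly y = 1 + 2x
intercept, slope = ols_simple(xs, ys)
# intercept -> 1.0, slope -> 2.0
```

What the dedicated packages add on top of this are standard errors, diagnostics, and support for multiple predictors and more complex model families.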
13. nowcasting
Tools for nowcasting, i.e. estimating the current state of an indicator before official figures are released.
- Nowcasting (website-cran repository )
-
< GPL-3.0 | package | R >
14. network analysis
Tools for social network analysis.
- AutoMap (website )
-
AutoMap enables the extraction of information from texts using Network Text Analysis methods. AutoMap supports the extraction of several types of data from unstructured documents. The type of information that can be extracted includes: content analytic data (words and frequencies), semantic network data (the network of concepts), meta-network data (the cross classification of concepts into their ontological category such as people, places and things and the connections among these classified concepts), and sentiment data (attitudes, beliefs). Extraction of each type of data assumes the previously listed type of data has been extracted. < Proprietary | package | Java >
- NodeXL (website )
-
< Proprietary | package | >
- ORA Pro (website repository )
-
< Proprietary | package | >
- Pajek (website )
-
< Proprietary | package | >
- NetworkX (website repository )
-
Data structures for graphs, digraphs, and multigraphs; many standard graph algorithms; network structure and analysis measures; generators for classic graphs, random graphs, and synthetic networks. Nodes can be 'anything' (e.g., text, images, XML records) and edges can hold arbitrary data (e.g., weights, time-series). Open source under the 3-clause BSD license and well tested with over 90% code coverage. Additional benefits from Python include fast prototyping, ease of teaching, and multi-platform support. < BSD | package | Python >
- UCINET (website )
-
UCINET 6 for Windows is a software package for the analysis of social network data. It was developed by Lin Freeman, Martin Everett and Steve Borgatti. It comes with the NetDraw network visualization tool. < Proprietary | package | >
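A measure that every package in this section computes is degree centrality: a node's number of ties divided by the maximum possible, n - 1. A minimal sketch in plain Python (the friendship network is made-up example data):

```python
def degree_centrality(edges):
    """Degree centrality of each node: degree divided by (n - 1)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    n = len(adjacency)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}

# A small star-shaped friendship network: Ana is tied to everyone.
edges = [("Ana", "Ben"), ("Ana", "Cem"), ("Ana", "Dee")]
centrality = degree_centrality(edges)
# centrality["Ana"] -> 1.0; the others -> 1/3
```

The dedicated packages go well beyond this, offering betweenness, closeness, and eigenvector centrality, community detection, and visualization.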
15. search
Information retrieval in large datasets.
- LuceneSolr (website repository )
-
< Apache-2.0 | package | >
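The core data structure behind search engines like Lucene/Solr is an inverted index, mapping each term to the documents that contain it. A toy sketch in plain Python (the document collection is made-up example data; real engines add tokenization, ranking, and on-disk storage):

```python
def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, *terms):
    """AND query: ids of documents containing every query term."""
    hit_sets = [index.get(t, set()) for t in terms]
    return set.intersection(*hit_sets) if hit_sets else set()

docs = {1: "corpus linguistics tools", 2: "network analysis tools", 3: "corpus search"}
index = build_index(docs)
result = search(index, "corpus", "tools")
# result -> {1}
```

Because lookups touch only the posting lists of the query terms, queries stay fast even when the collection grows to millions of documents.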
16. ESM/EMA surveys
Data collection in a 'natural' environment.
- paco (website repository )
-
< Apache-2.0 | framework | Objective-C Java >
17. audio-transcriptions
Software that converts speech into an electronic text document.
- f4analyse (website )
-
< Proprietary | standalone | >
- EXMARaLDA (website repository )
-
EXMARaLDA is a system for computer-assisted work with (primarily) spoken-language corpora. It consists of a transcription and annotation editor (Partitur-Editor), a tool for managing corpora (Corpus-Manager), and a search and analysis tool (EXAKT). < Unknown | framework | Java >
18. optical character recognition (OCR)
OCR is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text.
- tesseract (repository )
-
Tesseract is an open source text recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly, or (for programmers) via an API to extract printed text from images. It supports a wide variety of languages. < Apache-2.0 | package | Python >
19. online experiments
Tools for conducting experiments online.
- nodeGame (website repository )
-
NodeGame is a free, open source JavaScript/HTML5 framework for conducting synchronous experiments online and in the lab directly in the browser window. It is specifically designed to support behavioral research along three dimensions: larger group sizes, real-time (but also discrete time) experiments, batches of simultaneous experiments. < MIT | package | Javascript >
20. miscellaneous
Tools that do not fit the other categories.
- scikit-learn (website repository )
-
Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. < BSD | package | Python >