digital methods

Table of contents (work in progress: this list is preliminary and will be updated continuously)

1. data mining
2. digital research practice in social or economic sciences
References

[Rogers2013] distinguishes between digitized/virtual methods and digital methods. The former import standard methods from the social sciences and humanities into the emerging medium. The latter are completely new methods that emerge from the new structures and their properties.
In this project, a more inclusive conception of digital methods is assumed: any use of digital technology or techniques during the research.

1. data mining

Refers to the complete process of 'knowledge mining from data' [Han_etal2012]. It can be applied to various data types and consists of different steps and paradigms. For an application to text mining in the social sciences, see the concept of "blended reading" ([Stulpe_etal2016]).

1.1. data wrangling

Translate data into formats suited for automatic analysis, for example PDFs ⇒ text. For a practical framework, see also [Wickham_etal2017].

1.1.1. regular expressions

Complex string manipulations by searching and replacing specific patterns.
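
A minimal sketch of such search-and-replace patterns, using Python's standard `re` module. The sample string and the three clean-up rules are assumptions chosen for illustration (typical residue of a PDF ⇒ text conversion), not rules prescribed by the sources cited here.

```python
import re

# Hypothetical raw string, e.g. text extracted from a PDF.
raw = "Data   mining,\nas de-\nscribed in 2012 ,  is  a process."

def clean(text):
    # Re-join words hyphenated across a line break ("de-\nscribed" -> "described").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse any run of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    # Remove the space that extraction left before punctuation.
    text = re.sub(r"\s+([,.;:])", r"\1", text)
    return text.strip()

print(clean(raw))  # Data mining, as described in 2012, is a process.
```

The order of the substitutions matters: the hyphenation rule has to run before whitespace is collapsed, because it relies on the line break still being present.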

1.1.2. data-format conversions

Convert between different formats in order to unify the data and handle missing values.

1.2. text preprocessing

Common text preprocessing tasks in natural language processing.

1.2.1. tokenization

Identify words (tokens) in a character input sequence.
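
As a rough illustration, a regex-based tokenizer can be sketched as follows. The pattern (lowercased runs of word characters, with inner apostrophes or hyphens kept) is an assumption; real tokenizers handle many more cases such as URLs, numbers or language-specific rules.

```python
import re

def tokenize(text):
    # Lowercase, then keep runs of word characters,
    # allowing apostrophes or hyphens inside a token.
    return re.findall(r"\w+(?:['-]\w+)*", text.lower())

print(tokenize("Text-mining isn't magic."))  # ['text-mining', "isn't", 'magic']
```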

1.2.2. stop-word removal

Remove high-frequency words such as pronouns, determiners or prepositions.
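
A minimal sketch of the filtering step. The stop-word list below is a tiny hypothetical sample; in practice a curated, language-specific list would be used.

```python
# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "it", "she", "he", "they", "and", "is"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep only tokens whose lowercase form is not a stop word.
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["The", "analysis", "of", "texts", "is", "fun"]))
# ['analysis', 'texts', 'fun']
```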

1.2.3. stemming

Reduce inflected word forms to a common stem, usually by heuristic suffix stripping.
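
The idea can be sketched with a deliberately crude suffix stripper. This is a toy, not the Porter algorithm or any other established stemmer; the suffix list and the minimum stem length of three characters are assumptions made for the example.

```python
# Suffixes checked in this order; purely illustrative.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["mining", "mined", "mines", "texts"]])
# ['min', 'min', 'min', 'text']
```

Such rules over-stem easily (for example, "class" becomes "clas"), which is one reason established stemmers use more elaborate rule sets.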

1.2.4. word/sentence segmentation

Separate a chunk of continuous text into separate words/sentences.
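
For sentence segmentation, a simple rule-based sketch is shown below. The rule (split after `.`, `!` or `?` when whitespace and an uppercase letter follow) is an assumption and fails on abbreviations such as "e.g. Berlin"; trained segmenters are normally preferred.

```python
import re

def split_sentences(text):
    # Split at whitespace that follows ., ! or ? and precedes an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

print(split_sentences("Topics are latent. They are inferred from word patterns! Is that all?"))
# ['Topics are latent.', 'They are inferred from word patterns!', 'Is that all?']
```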

1.2.5. part-of-speech(POS)-tagging

Identify the part of speech for words.

1.2.6. dependency parsing

Create corresponding syntactic, semantic or morphologic trees from input text.

1.2.6.1. syntactic parsing

Create syntactic trees from input text, mostly using supervised learning on manually annotated treebanks ([Ignatow_etal2017],61).

1.2.7. word-sense disambiguation

Recognize the context-sensitive meaning of words.

1.3. information extraction

Extract factual information (e.g. people, places or situations) from free text.

1.3.1. (named-)entity-recognition/resolution/extraction/tagging

Identify instances of specific (pre-)defined types (e.g. place, name or color) in text.
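
One very simple way to approximate this is a dictionary (gazetteer) lookup, sketched below. The gazetteer entries and type labels are hypothetical, and the approach is an assumption chosen for brevity; current systems usually rely on trained sequence models instead.

```python
import re

# Hypothetical gazetteer mapping known names to predefined types.
GAZETTEER = {"Berlin": "PLACE", "Leipzig": "PLACE", "Ada Lovelace": "PERSON"}

def tag_entities(text, gazetteer=GAZETTEER):
    found = []
    for name, label in gazetteer.items():
        # Match the name only as a whole word or phrase.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), name, label))
    # Return entities in the order in which they appear in the text.
    return [(name, label) for _, name, label in sorted(found)]

print(tag_entities("Ada Lovelace never visited Leipzig or Berlin."))
# [('Ada Lovelace', 'PERSON'), ('Leipzig', 'PLACE'), ('Berlin', 'PLACE')]
```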

1.3.1.1. relation extraction

Extract relationships between entities.

1.4. information retrieval

Retrieve relevant information in response to information requests.

1.4.1. indexing

'organize data in such a way that it can be easily retrieved later on'([Ignatow_etal2017],137)
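
A common data structure for this purpose is the inverted index, which maps each term to the documents containing it. The sketch below assumes whitespace tokenization and two hypothetical documents.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)

# Hypothetical mini-collection.
DOCS = {"d1": "text mining methods", "d2": "network analysis methods"}
INDEX = build_inverted_index(DOCS)

print(sorted(INDEX["methods"]))  # ['d1', 'd2']
```

Looking up a term is then a single dictionary access instead of a scan over all documents.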

1.4.2. searching/querying

'take information requests in the form of queries and return relevant documents'([Ignatow_etal2017],137). There are different models for estimating the similarity between records and search queries (e.g. Boolean, vector space or probabilistic models) (ibid.).
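
To illustrate the vector space model, the sketch below represents the query and each document as raw term-count vectors and ranks documents by cosine similarity. Raw counts without tf-idf weighting, whitespace tokenization and the sample documents are all simplifying assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    # Dot product over shared terms, divided by the product of vector lengths.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, docs):
    q = Counter(query.lower().split())
    scored = [(doc_id, cosine(q, Counter(text.lower().split()))) for doc_id, text in docs.items()]
    # Highest similarity first; drop documents that share no term with the query.
    return sorted([s for s in scored if s[1] > 0], key=lambda s: s[1], reverse=True)

# Hypothetical mini-collection.
DOCS = {"d1": "text mining methods", "d2": "network analysis methods", "d3": "survey design"}

print([doc_id for doc_id, _ in search("mining methods", DOCS)])  # ['d1', 'd2']
```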

1.5. statistical analysis

1.5.1. frequency analysis

Descriptive statistical analysis based on the frequencies of specific text features.

1.5.1.1. word frequencies/dictionary analysis

Analyse statistically significant occurrences of words or word groups. Can also be combined with metadata (e.g. the creation time of a document).
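
The counting step underlying such an analysis can be sketched as follows. Only absolute and relative frequencies are computed; testing for statistical significance (for example against a reference corpus) is not shown. The token list is a made-up example.

```python
from collections import Counter

def relative_frequencies(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    # Share of each word among all tokens.
    return {word: count / total for word, count in counts.items()}

tokens = ["data", "mining", "data", "text"]
print(Counter(tokens).most_common(1))        # [('data', 2)]
print(relative_frequencies(tokens)["data"])  # 0.5
```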

1.5.1.2. co-occurrence analysis

Analyse statistically significant co-occurrences of words in different contextual units.
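
A minimal sketch of the counting step, assuming that each contextual unit (e.g. a sentence or a document) is already given as a list of tokens. Significance measures such as log-likelihood or mutual information would be computed on top of these counts and are not shown.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(units):
    pairs = Counter()
    for tokens in units:
        # Count each unordered word pair once per contextual unit.
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical contextual units.
units = [["data", "mining", "text"], ["text", "mining"], ["data", "survey"]]
print(cooccurrences(units)[("mining", "text")])  # 2
```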

1.5.1.3. context volatility

'Analyse contextual change for certain words over a period of time.'([Niekler_etal2018],1316)

1.5.2. classification/machine learning

Various techniques to (semi-)automatically identify specific classes.

1.5.2.1. supervised classification

Use labeled training examples to learn a model that assigns classes to new entities.
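
As one possible instance, the sketch below implements a small multinomial naive Bayes text classifier with add-one smoothing. The choice of naive Bayes, the labels and the training examples are assumptions for illustration; the page does not prescribe a particular classifier.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(model, tokens):
    label_counts, word_counts, vocab = model
    n = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        total = sum(word_counts[label].values())
        # Log prior plus add-one smoothed log likelihood of each token.
        score = math.log(count / n)
        for token in tokens:
            score += math.log((word_counts[label][token] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data.
EXAMPLES = [
    (["great", "helpful", "clear"], "pos"),
    (["clear", "useful"], "pos"),
    (["confusing", "slow"], "neg"),
    (["slow", "broken", "confusing"], "neg"),
]
MODEL = train(EXAMPLES)
print(predict(MODEL, ["clear", "helpful"]))  # pos
```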

1.5.2.2. latent semantic analysis

'The basic idea of latent semantic analysis (LSA) is, that text do have a higher order (=latent semantic) structure which, however, is obscured by word usage (e.g. through the use of synonyms or polysemy). By using conceptual indices that are derived statistically via a truncated singular value decomposition (a two-mode factor analysis) over a given document-term matrix, this variability problem can be overcome.'(CRAN-R)

1.5.2.3. topic modeling

Probabilistic models to infer semantic clusters. See especially [Papilloud_etal2018].

1.5.2.3.1. latent dirichlet allocation

'The application of LDA is based on three nested concepts: the text collection to be modelled is referred to as the corpus; one item within the corpus is a document, with words within a document called terms.(…)
The aim of the LDA algorithm is to model a comprehensive representation of the corpus by inferring latent content variables, called topics. Regarding the level of analysis, topics are heuristically located on an intermediate level between the corpus and the documents and can be imagined as content-related categories, or clusters. (…) Since topics are hidden in the first place, no information about them is directly observable in the data. The LDA algorithm solves this problem by inferring topics from recurring patterns of word occurrence in documents.'([Maier_etal2018],94)

1.5.2.3.2. non-negative-matrix-factorization

Factorizes the document-term matrix under a non-negativity constraint on the factor matrices.

1.5.2.3.3. structural topic modeling

Includes document-level metadata in the topic model. See especially [Roberts2013].

1.5.2.4. sentiment analysis

'Subjectivity and sentiment analysis focuses on the automatic identification of private states, such as opinions, emotions, sentiments, evaluations, beliefs, and speculations in natural language. While subjectivity classification labels text as either subjective or objective, sentiment classification adds an additional level of granularity, by further classifying subjective text as either positive, negative, or neutral.' ([Ignatow_etal2017],148)

1.5.2.5. automated narrative, argumentative structures, irony, metaphor detection/extraction

For automated narrative and metaphor analysis, see ([Ignatow_etal2017],89-106). For argumentative structures (task: retrieving sentential arguments for any given controversial topic), see [Stab_etal2018]. For a current overview, refer to [Cabrio2018].

1.5.3. network analysis/modeling

Generate networks from texts or from relationships between texts.

1.5.3.1. knowledge graph construction

Modelling entities and their relationships.

1.6. data visualization

Visualize the mined information.

1.6.1. word relationships

1.6.2. networks

1.6.3. geo-referenced

1.6.4. dynamic visualizations

Visualizations with user interaction or animations.

2. digital research practice in social or economic sciences

2.1. automated data collection

In principle, a data mining process can draw on multiple data sources. With regard to automated data collection, a basic distinction can be drawn between connected devices (internet, intranets) and unconnected devices (sensors, etc.).
Furthermore, the client-server model is the established communication paradigm for connected devices. To obtain data from either server or client, there are three different interfaces, which constitute the available procedures: log files, APIs and user interfaces [Jünger2018].

2.1.1. collect log-data

Collect log data that accrue while the (web) service is provided or information is processed.
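
For example, web servers often write access logs in the Common Log Format. The sketch below parses one such line with a regular expression; the log line itself is invented, and the pattern covers only the standard fields.

```python
import re

# Pattern for the Common Log Format: host, identity, user, time, request, status, size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    match = LOG_PATTERN.match(line)
    # Return the named fields as a dict, or None if the line does not fit the format.
    return match.groupdict() if match else None

# Invented example line.
LINE = '192.0.2.1 - - [10/Oct/2018:13:55:36 +0200] "GET /wiki/methods HTTP/1.1" 200 2326'
print(parse_log_line(LINE)["path"])  # /wiki/methods
```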

2.1.2. parsing from API

Parse structured data obtained via a documented REST API.
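
The parsing step can be sketched without any network access. The JSON body below is a hypothetical response; its field names (`items`, `title`, `next_page`) are assumptions, since every API defines its own schema in its documentation.

```python
import json

# Hypothetical JSON body as a REST endpoint might return it; no real API is queried.
RESPONSE_BODY = (
    '{"items": [{"id": 1, "title": "Digital methods", "tags": ["web"]},'
    ' {"id": 2, "title": "Bit by bit", "tags": []}], "next_page": null}'
)

def extract_titles(body):
    payload = json.loads(body)
    # Use .get so that a response without "items" yields an empty list.
    return [item["title"] for item in payload.get("items", [])]

print(extract_titles(RESPONSE_BODY))  # ['Digital methods', 'Bit by bit']
```

In a real collection run, the body would come from an HTTP request, and a field such as the hypothetical `next_page` would be followed until all pages are retrieved.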

2.1.3. scraping

Automatically parse unstructured or semi-structured data from a normal website (⇒ web-scraping) or service.

2.1.3.1. scraping (static content)

Automatically parse data from static HTML websites.
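
A minimal sketch using only Python's standard `html.parser` module: it extracts all link targets from an HTML string. The page content is invented, and fetching the page over HTTP is left out.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

# Invented static page.
PAGE = (
    '<html><body><h1>Methods</h1><a href="/data-mining">Data mining</a> '
    '<a href="/scraping">Scraping</a><a>no target</a></body></html>'
)
print(extract_links(PAGE))  # ['/data-mining', '/scraping']
```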

2.1.3.2. scraping (dynamic content)

Automatically parse dynamic content (HTML5/JavaScript) ⇒ sometimes requires mimicking user interaction.

2.1.4. crawling

Collect websites by starting from an initial set of webpages and following the links they contain [Ignatow_etal2017].
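
The link-following logic can be sketched as a breadth-first traversal. To keep the example runnable offline, a hypothetical in-memory link graph (`FAKE_WEB`) stands in for the web, and `get_links` is an assumed callback that would fetch a page and extract its links in a real crawler.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> links found on it.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def crawl(seeds, get_links, max_pages=100):
    seen = set(seeds)      # pages already queued, to avoid visiting them twice
    queue = deque(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

print(crawl(["a"], lambda page: FAKE_WEB.get(page, [])))  # ['a', 'b', 'c', 'd']
```

The `max_pages` limit is a simple safeguard; a real crawler would additionally respect robots.txt and rate limits.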

2.1.5. tracking

User-informed passive data collection.

2.1.5.1. ecological momentary assessments (EMA)/Experience Sampling Method (ESM)

EMA and ESM are largely equivalent. EMA focuses on medical questions or measurements in a natural environment; ESM focuses more on subjective questions in real life. Four characteristics: 1) data collection in natural environments; 2) a focus on recent events, impressions or actions; 3) questions triggered randomly or event-based; 4) multiple questions over a certain period of time (after Stone and Shiffman 1994, cited in [Salganik2018],109).

2.2. blended reading/text mining

Application of data mining methods to texts in social research. See [Stulpe_etal2016] for a detailed explanation.

2.3. new forms of digital research design

New possibilities in surveys or data acquisition techniques.

2.3.1. wiki surveys

Guide open-answer questions with user feedback. See also ([Salganik2018],111).

2.3.2. online experiments

Synchronous or asynchronous online experiments.

2.3.3. survey data linked to big data sources

2.3.3.1. enriched asking

'In enriched asking, survey data build context around a big data source that contains some important measurements but lacks others.'([Salganik2018],118)

2.3.3.2. amplified asking

'Amplified asking using a predictive model to combine survey data from few people with a big data source from many people.'([Salganik2018],122)

2.4. collaborative work

2.4.1. open call projects

(e.g. annotation).

2.4.2. distributed data collection

2.5. digital communication

2.6. digital data/phenomena as research object

2.7. statistical modeling

2.7.1. regression analysis

2.7.2. time-series analysis

2.7.2.1. forecasting/nowcasting

Using forecasting methods to estimate current values rather than to predict the future. (Example: estimating influenza prevalence by combining CDC data and Google Trends ([Salganik2018],46–50)).

2.8.1. agent-based modeling

References

[Cabrio2018] Cabrio, E., & Villata, S. (2018). Five years of argument mining: a data-driven analysis. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (pp. 5427–5433).
[Han_etal2012] Han, J., Kamber, M., & Pei, J. (2012). Data Mining: Concepts and Techniques. Saint Louis, United States: Elsevier Science & Technology.
[Ignatow_etal2017] Ignatow, G., & Mihalcea, R. F. (2017). Text mining: A guidebook for the social sciences. Los Angeles, London, New Delhi, Singapore, Washington DC, Melbourne: Sage.
[Jünger2018] Jünger, J. (2018). Mapping the Field of Automated Data Collection on the Web. Data Types, Collection Approaches and their Research Logic. In C. Stützer, M. Welker, & M. Egger (Eds.), Computational Social Science in the Age of Big Data. Concepts, Methodologies, Tools, and Applications (pp. 104–130). Neue Schriften zur Online-Forschung der Deutschen Gesellschaft für Online-Forschung (DGOF). Köln: Halem-Verlag.
[Lemke_etal2016] Lemke, M., & Wiedemann, G. (Eds.). (2016). Text Mining in den Sozialwissenschaften: Grundlagen und Anwendungen zwischen qualitativer und quantitativer Diskursanalyse. Wiesbaden: Springer VS.
[Maier_etal2018] Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., . . . Adam, S. (2018). Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology. Communication Methods and Measures, 12(2-3), 93–118. https://doi.org/10.1080/19312458.2018.1430754
[Niekler_etal2018] Niekler, A., Bleier, A., Kahmann, C., Posch, L., Wiedemann, G., Erdogan, K., . . . Strohmaier, M. (2018). ILCM - A Virtual Research Infrastructure for Large-Scale Qualitative Data. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018). European Language Resource Association. Retrieved from http://aclweb.org/anthology/L18-1209
[Papilloud_etal2018] Papilloud, C., & Hinneburg, A. (Eds.). (2018). Studienskripten zur Soziologie. Qualitative Textanalyse mit Topic-Modellen: Eine Einführung für Sozialwissenschaftler. Wiesbaden: Springer VS.
[Roberts2013] Roberts, M. E., Stewart, B. M., Tingley, D., Airoldi, E. M., & others (2013). The structural topic model and applied social science. In Advances in neural information processing systems workshop on topic models: computation, application, and evaluation (pp. 1–20).
[Rogers2013] Rogers, R. (2013). Digital methods. Cambridge, Massachusetts, London, England: The MIT Press.
[Salganik2018] Salganik, M. J. (2018). Bit by bit: Social research in the digital age.
[Stab_etal2018] Stab, C., Daxenberger, J., Stahlhut, C., Miller, T., Schiller, B., Tauchmann, C., . . . Gurevych, I. (2018). ArgumenText: Searching for Arguments in Heterogeneous Sources. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations (pp. 21–25).
[Stulpe_etal2016] Stulpe, A., & Lemke, M. (2016). Blended Reading. In Text Mining in den Sozialwissenschaften (pp. 17–61). Springer.
[Wickham_etal2017] Wickham, H., & Grolemund, G. (2017). R for Data Science: Import, tidy, transform, visualize, and model data. Beijing, Boston, Farnham, Sebastopol, Tokyo: O’Reilly UK Ltd.
