Data Protection

Written consent

An example written consent form, used for the Latvian learner corpus "LaVA", consists of the following (taken from [Kaija, Auziņa 2019]):1



Which data to anonymize or pseudonymize

Article 4 of the General Data Protection Regulation (GDPR) defines personal data as any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.2

Note that a person who has signed a consent form with you might share personal information about third parties during the session; this data also has to be de-identified.

In practice, potentially sensitive data include:

Two common approaches to de-identifying the data are anonymization and pseudonymization. Some researchers suggest keeping the raw data layer intact and disclosing to the public only a separate layer that has undergone de-identification.



Anonymization

Anonymization is the process of replacing sensitive data with random strings or with standardized category names (the latter is also known as categorization), e.g., "Michael" is replaced with "PERSON_NAME", "Berlin" with "LOCATION_NAME", "mail@example.com" with "EMAIL", etc.
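
For text data, the categorization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the term lists, the e-mail pattern and the function name anonymize are hypothetical. A real project would rely on a manually reviewed list of sensitive items or on a named-entity recognizer rather than on a hard-coded dictionary.

```python
import re

# Hypothetical e-mail pattern; deliberately simple, not a full address validator.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Hypothetical, manually curated terms grouped by category label.
CATEGORY_TERMS = {
    "PERSON_NAME": ["Michael"],
    "LOCATION_NAME": ["Berlin"],
}


def anonymize(text):
    """Replace e-mail addresses and listed terms with category labels."""
    # Handle e-mail addresses first so that names inside them are not
    # replaced separately.
    text = EMAIL_RE.sub("EMAIL", text)
    for label, terms in CATEGORY_TERMS.items():
        for term in terms:
            # \b keeps the match to whole words only.
            text = re.sub(rf"\b{re.escape(term)}\b", label, text)
    return text


print(anonymize("Michael moved to Berlin; write to mail@example.com."))
```

Because every name collapses into the same label, the output no longer shows whether two mentions refer to the same person. That is the price of this approach compared with pseudonymization.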

Since voice and appearance might be interpreted as personal data, an ideal anonymization technique for audio and video data would be to hire an actor to recite or re-enact the original recording. However, this is often not feasible due to time or budget constraints, so you might consider the following measures instead:


Pseudonymization

Pseudonymization is the process of replacing sensitive data with semantically similar expressions in such a manner that the data can no longer be attributed to a specific person. For example, "Michelle" becomes "Sandra", "Berlin" becomes "Munich", etc. Pseudonymization takes more time to carry out than anonymization. However, the resulting data is more human-readable and has more potential to be re-used by third-party researchers (e.g., in a study focusing on certain linguistic properties of named entities).
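
A minimal sketch of consistent pseudonymization in Python follows. The class name Pseudonymizer, the replacement pools and the entity annotations are all hypothetical. The sketch assumes that sensitive items have already been identified and labelled with a category. Each original is always mapped to the same substitute, so co-reference within a text is preserved.

```python
import random
import re


class Pseudonymizer:
    """Map each sensitive item to a stable substitute from the same category."""

    def __init__(self, pools, seed=0):
        # pools: category -> candidate substitutes (hypothetical, user-supplied).
        self._pools = {cat: list(names) for cat, names in pools.items()}
        self._rng = random.Random(seed)  # seeded for reproducible output
        self._mapping = {}  # original -> substitute

    def pseudonym_for(self, original, category):
        if original not in self._mapping:
            used = set(self._mapping.values())
            candidates = [
                name for name in self._pools[category]
                if name != original and name not in used
            ]
            if not candidates:
                raise ValueError(f"no unused substitute left for {category}")
            self._mapping[original] = self._rng.choice(candidates)
        return self._mapping[original]

    def apply(self, text, entities):
        # entities: original -> category, e.g. from manual annotation.
        if not entities:
            return text
        # One pass over the text, so a substitute is never replaced again.
        alternatives = "|".join(
            re.escape(item) for item in sorted(entities, key=len, reverse=True)
        )
        pattern = re.compile(rf"\b({alternatives})\b")
        return pattern.sub(
            lambda m: self.pseudonym_for(m.group(1), entities[m.group(1)]), text
        )


pools = {"PERSON": ["Sandra", "Laura"], "LOCATION": ["Munich", "Hamburg"]}
pseudonymizer = Pseudonymizer(pools, seed=1)
entities = {"Michelle": "PERSON", "Berlin": "LOCATION"}
print(pseudonymizer.apply("Michelle lives in Berlin. Michelle likes Berlin.", entities))
```

The internal mapping from originals to substitutes would allow re-identification, so in this sketch it is assumed to stay with the raw data layer and never to be published with the de-identified layer.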


  1. https://doi.org/10.3384/ecp2020172006

  2. https://gdpr-info.eu/art-4-gdpr/