An example written consent form, used for the Latvian learner corpus "LaVA", consists of the following (taken from [Kaija, Auziņa 2019]):
Information letter:
- basic information about the project, the institutions that are carrying it out, and contact information;
- brief instructions for the participant;
- information about privacy and about the security of the data on the server used for the corpus;
- an explanation of how to express one's will regarding participation in the project (i.e. what to do if the author decides they no longer want their texts to be used in the corpus).
Permission with the following statements:
- The author agrees that the corpus is available for free and is made for scientific and teaching purposes. The authors do not receive any financial reward for having their texts included in the corpus.
- The author confirms that none of the data in this text can lead to identification of any existing people.
- The author agrees that the text is anonymous and their name is not mentioned anywhere on the corpus website or its public documentation. Each author receives an anonymous code which makes it possible to recognize several texts written by the same author but does not reveal the identity of the author.
- The data included in the corpus can be cited in educational materials, research papers, and other work in various forms.
- The corpus and all materials included in it can be publicly accessible for an unlimited period and can be viewed and researched an unlimited number of times.
- All texts included in the corpus can have linguistic information added to them (e.g. error corrections, part-of-speech annotation, etc.).
- The author will have the right to withdraw their consent at any time. The withdrawal of consent shall not affect the lawfulness of processing based on consent before its withdrawal. The author is aware of this opportunity as a data provider.
Metadata collection questionnaire asking about:
- age;
- gender;
- other corpus-specific metadata
Article 4 of the General Data Protection Regulation (GDPR) defines personal data as any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.
Note that a person who has signed a consent form might share personal information about third parties during the session; this data has to be de-identified as well.
In practice, potentially sensitive data include:
Two common approaches to de-identifying data are anonymization and pseudonymization. Some researchers suggest keeping the raw data layer intact and disclosing to the public only a separate layer that has undergone de-identification.
Anonymization is the process of replacing sensitive data with random strings or with standardized category names (the latter is also known as categorization), e.g., "Michael" is replaced with "PERSON_NAME", "Berlin" with "LOCATION_NAME", "mail@example.com" with "EMAIL", etc.
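The categorization variant described above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure used by any particular corpus: the lookup sets `KNOWN_NAMES` and `KNOWN_LOCATIONS`, the function name `anonymize`, and the simplified e-mail pattern are all assumptions made for the example. In a real project the list of sensitive items would typically come from manual annotation.

```python
import re

# Hypothetical lookup tables; in practice these would be compiled
# from manual annotation of the corpus texts.
KNOWN_NAMES = {"Michael"}
KNOWN_LOCATIONS = {"Berlin"}

# Deliberately simplified e-mail pattern (an assumption, not a full validator).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymize(text: str) -> str:
    """Replace known sensitive items with standardized category names."""
    # Replace e-mail addresses first so that names inside them are not touched.
    text = EMAIL_RE.sub("EMAIL", text)
    for name in KNOWN_NAMES:
        text = re.sub(rf"\b{re.escape(name)}\b", "PERSON_NAME", text)
    for location in KNOWN_LOCATIONS:
        text = re.sub(rf"\b{re.escape(location)}\b", "LOCATION_NAME", text)
    return text


print(anonymize("Michael lives in Berlin, write to mail@example.com."))
# prints: PERSON_NAME lives in LOCATION_NAME, write to EMAIL.
```

A lookup-based approach like this only catches items that are already listed, so the output should still be checked by a human before publication.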
Since voice and appearance might be interpreted as personal data, an ideal anonymization technique for audio and video data would be to hire an actor to recite or re-enact the original recording. This, however, is often not feasible due to time and/or budget constraints, so you might consider the following measures instead:
Pseudonymization is the process of replacing sensitive data with semantically similar expressions in such a manner that the data can no longer be attributed to a specific person. For example, "Michelle" becomes "Sandra", "Berlin" becomes "Munich", etc. Pseudonymization takes more time to carry out than anonymization; however, the resulting data is more human-readable and has more potential to be re-used by third-party researchers (e.g., in a study focusing on certain linguistic properties of named entities).
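The replacement step of pseudonymization can be sketched as follows. The mapping `PSEUDONYMS` and the function name `pseudonymize` are hypothetical and chosen only for this illustration. The sketch assumes the replacement table has already been prepared by hand, which is the time-consuming part mentioned above. All substitutions happen in a single pass, so a replacement is never replaced again, and the same original always maps to the same pseudonym.

```python
import re

# Hypothetical, manually prepared replacement table (original -> pseudonym).
PSEUDONYMS = {"Michelle": "Sandra", "Berlin": "Munich"}


def pseudonymize(text: str, mapping: dict[str, str] = PSEUDONYMS) -> str:
    """Replace each listed expression with its semantically similar pseudonym."""
    if not mapping:
        return text
    # Longer keys first, so that a longer expression wins over its own prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    # A single pass over the text avoids chained replacements (A -> B -> C).
    return pattern.sub(lambda match: mapping[match.group(1)], text)


print(pseudonymize("Michelle moved to Berlin. Michelle likes Berlin."))
# prints: Sandra moved to Munich. Sandra likes Munich.
```

If the raw layer is kept, the mapping table itself is sensitive and would need to be stored with the same care as the raw data.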