Commit bfbd7d93 authored by Lange, Dr. Herbert

the original english files from the web knowledge base

Best Practices
===========================================================
(an excerpt from *Guidelines for Building Language Corpora Under German Law*, licensed under a CC-BY 4.0 International license) [1]_
Recommendations for building corpora
####################################
- **In case of doubt, you should try to obtain licenses and consent**. Rights holders are usually cooperative when it comes to non-commercial, scientific purposes, as long as no economic or other interests are harmed, e.g. by an unrestricted distribution of copies.
- The attempt to obtain licenses should begin **as early as possible in the planning phase** of a project: negotiations may drag on over a long period of time, and starting early is the only way to ensure that the necessary rights are obtained before the project starts and that any **license fees** or other remuneration are included in the **calculation of the project costs**.
- Also as early as possible in the planning phase, **a center should be approached** that is experienced in licensing the relevant type of resource. It may provide assistance or, in some circumstances, take care of obtaining the licenses, and at the same time ensure that the licensing terms are drafted so that the data and the results of the project may be included in the center's own archives/projects after the project ends and made available for the long term.
- Recommendations for the draft of license agreements can be found on the CLARIN-D Legal Information Platform. [#]_
- License agreements typically have a **limited term**, especially if they are associated with fees. Particularly in these cases, it is recommended to develop a strategy, in cooperation with a center, for making the content sustainably available. It should also be noted that **unintended interpretations by the licensor** can prevent license renewals and additional licenses, regardless of their legality.
- In cases where it is not possible to obtain sufficient rights to make a text corpus permanently available to the scientific community, but the reasons for building the corpus are nevertheless strong enough, those reasons should be documented and **compromise strategies** should be developed for achieving at least a rudimentary form of sustainable availability. One possible model is, for example, to document comprehensibly how subsequent users may obtain the necessary rights themselves.
- Data protection issues should be addressed as early as the planning phase of a project. If personal data is to be collected to a greater extent, an explicit document on the subject should be created and maintained (a data protection concept). It must record which data is collected for which purposes. If necessary, appropriate consent declaration forms need to be developed and signed by the people affected by the data processing.
----
.. [#] http://clarin-d.de/legalissues
----
Recommendations for making written corpora available
####################################################
- It is usually necessary and common practice to **limit the number of users** of corpora to people who have identified themselves and agreed to an End User License Agreement (see below) and, if necessary, additional data protection regulations. In practice this can be achieved e.g. by regulating data access via passwords that are issued only on application and only in person, or via DFN-AAI authentication and web forms to request consent.
- As a general rule, rights and obligations resulting from licensing agreements between rights holders and the corpus provider need to be passed on to end users via **end user license agreements** and **data privacy policies** (for example, if a corpus provider undertakes an obligation to the licensor to document access to the corpus).
- With regard to personal data, anonymization and pseudonymization should be considered when making corpora available.
----
Recommendations for creating and making available one's own works: derivative works and databases
###########################################################################################
- Works that are **created by scientists themselves should always be released under license terms**, so that subsequent users know whether they can use the work for their own purposes. At the same time, contents that are (or become) free of copyright and on which the scientist has not acquired any other rights should not be portrayed as legally protected, and should as far as possible be explicitly marked as unprotected, e.g. with the help of the "Public Domain Mark" (PDM).
- When selecting license terms, **existing, widely-used standard licenses that are as liberal as possible should be used** (e.g. one of the two Creative Commons licenses recognized in terms of the Open Definition [#]_ , namely the Creative Commons license variants BY and BY-SA, or, for software, a GNU license or the BSD or Apache licenses, which refrain from copyleft). This comes closest to the Open Access approach. The increasing trend is to publish scientific works with no more restrictions than the Creative Commons license type "CC BY - Attribution," while pure data should be licensed entirely free of restrictions under "CC0". Even scientific publishers are increasingly open to such licenses.
- Particular attention should be paid to **indicating the license as accurately as possible and making it easy to find**.
- **Problems with derivative works** may be avoided in some cases, for example when annotations are published as an independent work from which the original work cannot be reconstructed. If the license advised for a derivative work is roughly equivalent to that of the underlying work, the same license should be used to facilitate reuse. In any case, provisions of the license of the underlying work that sometimes allow only certain licenses for later processing (see e.g. the “Share-Alike” clauses in Creative Commons licenses [#]_ ) should be observed.
----
.. [#] http://opendefinition.org/od/
.. [#] See the variety of content which is combined under different Creative Commons licenses, https://wiki.creativecommons.org/FAQ#Can_I_combine_material_under_different_Creative_Commons_licenses_in_my_work.3F
----
Recommendations for the use of software when creating derivative works
######################################################################
- If no license terms are known, one should attempt to determine whether and which restrictions apply to the use of the software.
- Particularly with commercial annotation tools, it may be reasonable to clarify, and set out in a supplementary agreement, the extent to which the outputs of the software may be distributed, because software license provisions often prohibit this altogether. Generally, however, such provisions are only intended to prevent reverse engineering.
- Before using or licensing software, it should be clarified to what extent the outputs of the software may still be used after the license term expires.
----
.. [1] https://www.dfg.de/download/pdf/foerderung/antragstellung/forschungsdaten/guidelines_review_board_linguistics_corpora.pdf
Data Protection
===============
Written consent
###############
An example written consent form, used for the Latvian learner corpus "LaVA", consists of the following (taken from [Kaija, Auziņa 2019]): [#]_
- **Information letter**:
* basic information about the project, the institutions that are carrying it out, and contact information;
* brief instructions for the participant;
* information about the security of the data on the server used for corpus and privacy;
  * explanation on expressing one's will regarding participation in the project (i.e. what to do if the author decides they no longer want their texts to be used in the corpus).
- **Permission** with the following statements:
* The author agrees that the corpus is available for free and is made for scientific and teaching purposes. The authors do not receive any financial reward for having their texts included in the corpus.
* The author confirms that none of the data in this text can lead to identification of any existing people.
* The author agrees that the text is anonymous and their name is not mentioned anywhere on the corpus website or its public documentation. Each author receives an anonymous code which makes it possible to recognize several texts written by the same author but does not reveal the identity of the author.
* The data included in the corpus can be cited in the educational materials, research papers, and other work in various forms.
* The corpus and all materials included in it can be publicly accessible for an unlimited period and can be viewed and researched an unlimited amount of times.
* All texts included in the corpus can have linguistic information added to them (e.g. error corrections, part-of-speech annotation, etc.).
* The author will have the right to withdraw their consent at any time. The withdrawal of consent shall not affect the lawfulness of processing based on consent before its withdrawal. The author is aware of this opportunity as a data provider.
- **Metadata collection questionnaire** asking about:
* age;
* gender;
* other corpus-specific metadata
----
.. [#] https://doi.org/10.3384/ecp2020172006
----
Which data to anonymize or pseudonymize
#######################################
Article 4 of the General Data Protection Regulation (GDPR) defines **personal data** as any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person. [#]_
Note that a person with whom you have signed a consent form might share personal information about third parties during a session; this data also has to be de-identified.
In practice, potentially sensitive data include:
- names (personal names, nicknames, organization names)
- locations (addresses, city names, district names, etc.)
- age
- date expressions
- numbers (such as house number, phone number, Social Security number, etc.)
- email addresses
- URIs
- implicit references (e.g. someone's job)
Two common approaches to de-identifying data are anonymization and pseudonymization. Some researchers suggest keeping the raw data layer intact and disclosing to the public only a separate layer that has undergone de-identification.
----
.. [#] https://gdpr-info.eu/art-4-gdpr/
----
Anonymization
#############
Anonymization is a process of **replacing sensitive data with random strings or standardized category names** (also known as categorization), e.g., "Michael" is replaced with "PERSON_NAME", "Berlin" with "LOCATION_NAME", "mail@example.com" with "EMAIL", etc.
Since voice and appearance might be interpreted as personal data, an ideal anonymization technique for audio and video data would be hiring an actor to recite or re-enact the original recording. This, however, is often not feasible due to time and/or budget constraints, so you might consider the following measures instead:
- for audio recordings, **bleeping out** the parts containing personal data
- for video recordings, **blackening or pixelating** parts of the speaker's body (this is relevant e.g. to the processing of sign language data)
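For textual data, the categorization approach can be sketched in a few lines. The following is a minimal illustration, not a recommended tool: the names, the lexicon, and the category labels are hypothetical, and a real project would typically rely on named-entity recognition rather than a hand-maintained word list.

```python
import re

# Hypothetical lexicon mapping known sensitive tokens to category labels.
CATEGORY_LEXICON = {
    "Michael": "PERSON_NAME",
    "Berlin": "LOCATION_NAME",
}
# Simple pattern for e-mail addresses (illustrative, not RFC-complete).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def anonymize(text: str) -> str:
    """Replace known sensitive tokens with standardized category names."""
    text = EMAIL_RE.sub("EMAIL", text)
    for token, label in CATEGORY_LEXICON.items():
        # \b ensures whole-word replacement only.
        text = re.sub(rf"\b{re.escape(token)}\b", label, text)
    return text

print(anonymize("Michael lives in Berlin, write to mail@example.com"))
# → "PERSON_NAME lives in LOCATION_NAME, write to EMAIL"
```

Because every occurrence of a name maps to the same generic label, the anonymized layer loses the ability to distinguish between different persons; pseudonymization (below) avoids that loss.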
----
Pseudonymization
################
Pseudonymization is a process of **replacing sensitive data with semantically similar expressions in such a manner that the data can no longer be attributed to a specific person**. For example, "Michelle" becomes "Sandra", "Berlin" becomes "Munich", etc. Pseudonymization takes more time to carry out than anonymization, but the resulting data is more human-readable and has more potential to be re-used by third-party researchers (e.g., in a study focusing on certain linguistic properties of named entities).
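One practical requirement of pseudonymization is consistency: the same real name should always receive the same pseudonym throughout a corpus, so that repeated mentions of a person remain linkable. A minimal sketch, assuming a prepared pool of replacement names (all names here are invented examples):

```python
class Pseudonymizer:
    """Assign each real name a stable pseudonym drawn from a pool."""

    def __init__(self, pseudonym_pool):
        self._pool = iter(pseudonym_pool)
        self._mapping = {}  # real name -> assigned pseudonym

    def replace(self, name: str) -> str:
        # First mention: draw the next unused pseudonym from the pool;
        # later mentions: reuse the stored assignment.
        if name not in self._mapping:
            self._mapping[name] = next(self._pool)
        return self._mapping[name]

p = Pseudonymizer(["Sandra", "Jonas", "Mara"])
print(p.replace("Michelle"))  # → "Sandra"
print(p.replace("Thomas"))    # → "Jonas"
print(p.replace("Michelle"))  # → "Sandra" (consistent re-use)
```

Note that the mapping table itself is sensitive data: if it is kept (e.g. to honour later consent withdrawals), it must be stored separately and securely, since it allows re-identification.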
QUEST Knowledge Base
====================
Types of corpora
----------------
- Introduction
- Multilingual corpora
- Multimodal corpora
Media formats
-------------
- Sound recordings
- Video recordings
Annotation formats
------------------
- Introduction
- ELAN
- EXMARaLDA
- FOLKER
- FLEX
Quality control
---------------
- `HZSK Corpus services <corpus_sevices>`__
Certification
-------------
- `Resource types <resource_types>`__
Legal aspects
-------------
- `Best practices <best_practices>`__
- `Data protection <data_protection>`__
- `Copyright <copyright>`__
Granularity and Internal Structure of Audiovisual Resources
===========================================================
Deposit
#######
Any data deposited with a data centre to be ingested into a repository or otherwise distributed can be referred to as a Deposit. For a Deposit, the **legal situation** must be clear and **basic provenance information** must exist, but there are no specific requirements regarding size, content, data quality or consistency. This underspecification is intentional for a set of files of various kinds, making it possible to handle e.g. valuable legacy data before it can be curated, or to describe “work in progress” data. Depending on its characteristics, after curation a Deposit can equal a `Collection/Corpus`, be a part of a `Collection/Corpus`, or comprise several `Collections/Corpora`.
----
Collection
##########
For a set of files to be called a Collection, further requirements must be met. When talking about audiovisual resources, a Collection is a **structured set** of files, i.e. at least a set of documented recordings, based on a specific design, even if only “with a shared origin and/or topic”. The content itself, however, might not be structured: transcripts, for example, must be browsable but need not be searchable, so even images of handwritten transcripts are possible. Accordingly, there are no requirements regarding the existence of data models for any parts of the content of the resource. Only basic legal and administrative metadata on the resource (for all included files) and basic source metadata for the recording situations, including the participants, is required.
----
Corpus
######
While unstructured annotation data, as described above, makes a resource a `collection`, structured annotation data alone does not make the resource a corpus. For a resource to qualify as a corpus, further basic requirements must be met regarding the design and processability. The **corpus design** must be thoroughly documented to allow for a manual assessment of the plausibility and suitability regarding requirements on completeness/representativeness for the chosen purpose. Furthermore, the **quality of the content** must have been manually assessed to ensure reliability and validity of the chosen conventions/schemas and their application, and this quality must be documented. Regarding general processability and in particular the complexity and reliability of queries, the following criteria must be met:
- The **main structure** defining the corpus data, i.e. the various files, their relationships and their metadata, must be **machine readable** and all paths to files must be resolvable.
- It must be possible to reliably **select specific parts of the data**, i.e. in effect annotation files, (to query or within a query result) on the basis of metadata on recording situations.
- All **participants** must be defined and recognizable in all parts of the data, i.e. unique speaker IDs are required, and a relationship between participants and annotation data is required when participants are not (redundantly) modelled as metadata of recording situations. It must be possible to reliably select specific parts of the data, i.e. in effect annotation files, (to query or within a query result) on the basis of participant metadata. The concept “contribution” must be modelled by the transcription/annotation data to allow reference to tokens/annotations produced by a specific participant.
- If the recordings have not been completely transcribed/annotated, it must be documented **which parts have been transcribed/annotated** and why. Recordings that have not been transcribed/annotated at all must be documented accordingly. An alternative to more fine-grained time-aligned annotation/analysis are longer events with thematic or structural information, e.g. conversation phases or topics.
- It must be explicit (and machine-readable) **which tiers** (or similar components of the data), if any, contain the most basic annotation, usually an orthographic transcription (“token layer”), and which tiers (or similar components of the data) contain higher-level annotations referring to this base layer. The content and conventions/schemas must be documented for all tiers (or similar components of the data). **Transcription conventions** must be syntactically validated on an appropriate level and the result documented. If an annotation schema exists, only tags from this schema may occur in the tier (syntactic consistency).
- Within a transcription, **different information types** must be explicitly marked up and separable, e.g. descriptions of non-verbal behaviour and comments must be identifiable as non-transcription data.
- If a **tokenization** is not explicitly included, tokenization must be possible according to the documented conventions, i.e. textual content must be automatically parsable. The result, however, need not be tokens in the sense of standardized/normalized written words, since this is not a relevant unit in all systems for describing non-written language.
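The tokenization requirement above can be illustrated with a small sketch: the documented conventions are encoded as a pattern, so that the textual content is automatically parsable. The convention shown here (pause markers like "(0.5)", non-verbal descriptions like "((laughs))") is an invented example, not a reference to any particular transcription system.

```python
import re

# Hypothetical convention: non-verbal descriptions in double parentheses
# and timed pauses in single parentheses are each one unit; everything
# else splits on whitespace. Order of alternatives matters: the more
# specific patterns must come before the catch-all \S+.
TOKEN_RE = re.compile(r"\(\(.*?\)\)|\(\d+(?:\.\d+)?\)|\S+")

def tokenize(transcription: str):
    """Split a transcription line into units per the documented convention."""
    return TOKEN_RE.findall(transcription)

print(tokenize("yes ((laughs)) (0.5) i think so"))
# → ['yes', '((laughs))', '(0.5)', 'i', 'think', 'so']
```

As the criterion notes, the resulting units need not be normalized orthographic words; the point is only that a parser can recover them deterministically from the documented conventions.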
Sessions, (Speech) Events and Bundles
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- `Session` = A complete recording session
- `Sub-Session(?)` = A part of a recording session, e.g. corresponding to a task
- `Bundle` = No semantics, just files that belong together, without multiple unsynchronized media files