Deposit
Any data deposited with a data centre to be ingested into a repository or otherwise distributed can be referred to as a Deposit. For a Deposit, the legal situation must be clear and basic provenance information must exist, but there are no specific requirements regarding size, content, data quality or consistency. This intentional underspecification covers a set of files of various kinds and makes it possible to handle, e.g., valuable legacy data before it can be curated, or to describe “work in progress” data. Depending on its characteristics, after curation a Deposit can equal a Collection/Corpus, be part of a Collection/Corpus, or comprise several Collections/Corpora.
Collection
For a set of files to be called a Collection, further requirements must be met. For audiovisual resources, a Collection is a structured set of files, i.e. at least a set of documented recordings, based on a specific design, even if only “with a shared origin and/or topic”. The content itself, however, need not be structured: transcripts, for example, must be browsable but need not be searchable, so that even images of handwritten transcripts are possible. Accordingly, there are no requirements regarding the existence of data models for any part of the content of the resource. Only basic legal and administrative metadata on the resource (for all included files) and basic source metadata for the recording situations, including the participants, are required.
Corpus
While unstructured annotation data, as described above, makes a resource a Collection, structured annotation data alone does not make the resource a Corpus. For a resource to qualify as a Corpus, further basic requirements regarding design and processability must be met. The corpus design must be thoroughly documented to allow a manual assessment of its plausibility and suitability with regard to the completeness/representativeness required for the chosen purpose. Furthermore, the quality of the content must have been manually assessed to ensure the reliability and validity of the chosen conventions/schemas and their application, and this quality must be documented. Regarding general processability, and in particular the complexity and reliability of queries, the following criteria must be met:
- The main structure defining the corpus data, i.e. the various files, their relationships and their metadata, must be machine readable and all paths to files must be resolvable.
- It must be possible to reliably select specific parts of the data, in effect annotation files, on the basis of metadata on recording situations, either as the data to be queried or within a query result.
- All participants must be defined and recognizable in all parts of the data, i.e. unique speaker IDs are required, and a relationship between participants and annotation data is required when participants are not (redundantly) modelled as metadata of recording situations. It must be possible to reliably select specific parts of the data, in effect annotation files, on the basis of participant metadata, either as the data to be queried or within a query result. The concept “contribution” must be modelled in the transcription/annotation data to allow reference to tokens/annotations produced by a specific participant.
- If the recordings have not been completely transcribed/annotated, it must be documented which parts have been transcribed/annotated and why. Recordings that have not been transcribed/annotated at all must be documented accordingly. An alternative to more fine-grained time-aligned annotation/analysis is the use of longer events carrying thematic or structural information, e.g. conversation phases or topics.
- It must be explicit (and machine-readable) which tiers (or similar components of the data), if any, contain the most basic annotation, usually an orthographic transcription (“token layer”), and which tiers (or similar components of the data) contain higher-level annotations referring to this base layer. The content and conventions/schemas must be documented for all tiers (or similar components of the data). Transcriptions must be syntactically validated against the transcription conventions on an appropriate level, and the result documented. If an annotation schema exists, only tags from this schema may occur in the tier (syntactic consistency).
- Within a transcription, different information types must be explicitly marked up and separable, e.g. descriptions of non-verbal behaviour and comments must be identifiable as non-transcription data.
- If a tokenization is not explicitly included, tokenization must be possible according to the documented conventions, i.e. textual content must be automatically parsable. The result need not, however, be tokens in the sense of standardized/normalized written words, since this is not a relevant unit in all systems for the description of non-written language.
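Several of the criteria above lend themselves to automated checks. The following is a minimal sketch only, assuming a hypothetical in-memory manifest (`MANIFEST`) with invented keys (`speakers`, `annotation_files`, `age_group`) and invented helper names; no particular corpus format or tool is implied. It illustrates resolvable file paths, defined speaker IDs, selection of annotation files by participant metadata, and syntactic consistency of tags with an annotation schema.

```python
from pathlib import Path

# Hypothetical manifest for illustration; a real corpus would load this
# from its own machine-readable metadata format.
MANIFEST = {
    "speakers": {
        "SPK1": {"age_group": "adult"},
        "SPK2": {"age_group": "child"},
    },
    "annotation_files": [
        {"path": "session1/a.xml", "speakers": ["SPK1", "SPK2"]},
        {"path": "session2/b.xml", "speakers": ["SPK1"]},
    ],
}


def unresolvable_paths(manifest, root):
    """Return manifest paths that do not point to an existing file under root."""
    return [
        entry["path"]
        for entry in manifest["annotation_files"]
        if not (Path(root) / entry["path"]).is_file()
    ]


def undefined_speakers(manifest):
    """Return speaker IDs used by annotation files but not defined as speakers."""
    known = set(manifest["speakers"])
    used = {sid for entry in manifest["annotation_files"] for sid in entry["speakers"]}
    return sorted(used - known)


def select_files(manifest, key, value):
    """Select annotation files with at least one speaker whose metadata matches."""
    matching = {
        sid for sid, meta in manifest["speakers"].items() if meta.get(key) == value
    }
    return [
        entry["path"]
        for entry in manifest["annotation_files"]
        if matching & set(entry["speakers"])
    ]


def invalid_tags(tier_tags, schema):
    """Return tags occurring in a tier that are not part of the annotation schema."""
    return sorted(set(tier_tags) - set(schema))
```

A check of this kind would typically be run once per release of the resource, with the results recorded as part of the quality documentation; an empty list from each of the three validation helpers indicates that the respective criterion is met for the data described by the manifest.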
Sessions, (Speech) Events and Bundles
- Session = A complete recording session
- Sub-Session(?) = A part of a recording session, e.g. corresponding to a task
- Bundle = No semantics, just files that belong together, without multiple unsynchronized media files