Granularity and Internal Structure of Audiovisual Resources

Deposit

Any data deposited with a data centre to be ingested into a repository or otherwise distributed can be referred to as a Deposit. For a Deposit, the legal situation must be clear and basic provenance information must exist, but there are no specific requirements regarding size, content, data quality or consistency. This underspecification is intentional: a Deposit is simply a set of files of various kinds, which makes it possible to handle, e.g., valuable legacy data before it can be curated, or to describe “work in progress” data. Depending on its characteristics, a Deposit can, after curation, equal a Collection/Corpus, be part of a Collection/Corpus, or comprise several Collections/Corpora.


Collection

For a set of files to be called a Collection, further requirements must be met. In the context of audiovisual resources, a Collection is a structured set of files, i.e. at least a set of documented recordings, based on a specific design, even if that design is only “with a shared origin and/or topic”. The content itself, however, need not be structured: transcripts, for example, must be browsable but need not be searchable, so that even images of handwritten transcripts are possible. Accordingly, there are no requirements regarding the existence of data models for any part of the content of the resource. Only basic legal and administrative metadata on the resource (for all included files) and basic source metadata on the recording situations, including the participants, are required.
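
The requirements for a Deposit and a Collection described above can be read as a simple checklist. The following is a minimal sketch in Python, not a normative schema: the class `Resource`, all of its field names, and the functions `is_deposit` and `is_collection` are hypothetical and introduced here for illustration only. It also assumes that “further requirements” means a Collection must satisfy the Deposit requirements as well.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical checklist for a set of files; all field names are illustrative."""
    legal_situation_clear: bool
    has_basic_provenance: bool
    has_documented_recordings: bool = False
    has_documented_design: bool = False     # may be as weak as "shared origin and/or topic"
    has_legal_admin_metadata: bool = False  # basic metadata covering all included files
    has_source_metadata: bool = False       # recording situations, including participants
    transcripts_browsable: bool = True      # scanned handwritten transcripts count as browsable


def is_deposit(r: Resource) -> bool:
    # A Deposit only needs a clear legal situation and basic provenance;
    # size, content, data quality and consistency are deliberately not checked.
    return r.legal_situation_clear and r.has_basic_provenance


def is_collection(r: Resource) -> bool:
    # Assumption: a Collection meets the Deposit requirements plus the
    # structural and metadata requirements. Searchability and data models
    # are intentionally not required.
    return (
        is_deposit(r)
        and r.has_documented_recordings
        and r.has_documented_design
        and r.has_legal_admin_metadata
        and r.has_source_metadata
        and r.transcripts_browsable
    )


# Legacy data with cleared rights but no documentation yet: a Deposit only.
legacy = Resource(legal_situation_clear=True, has_basic_provenance=True)

# Documented recordings with images of handwritten transcripts: a Collection.
documented = Resource(
    legal_situation_clear=True,
    has_basic_provenance=True,
    has_documented_recordings=True,
    has_documented_design=True,
    has_legal_admin_metadata=True,
    has_source_metadata=True,
)
```

In this sketch, `legacy` passes `is_deposit` but not `is_collection`, while `documented` passes both, even though its transcripts are not searchable.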


Corpus

While a resource with unstructured annotation data, as described above, qualifies as a Collection, structured annotation data alone does not make a resource a Corpus. For a resource to qualify as a Corpus, further basic requirements must be met regarding its design and processability. The corpus design must be thoroughly documented to allow a manual assessment of its plausibility and suitability with regard to the completeness/representativeness required for the chosen purpose. Furthermore, the quality of the content must have been manually assessed to ensure the reliability and validity of the chosen conventions/schemas and of their application, and this quality must be documented. Regarding general processability, and in particular the complexity and reliability of queries, the following criteria must be met:

Sessions, (Speech) Events and Bundles