(an excerpt from Guidelines for Building Language Corpora Under German Law, licensed under a CC-BY 4.0 International license)
Recommendations for building corpora
- In case of doubt, you should try to obtain licenses and consent. Rights holders are usually cooperative when the purpose is non-commercial and scientific and no economic or other interests are harmed, e.g. by unrestricted distribution of copies.
- Attempts to obtain licenses should begin as early as possible in the planning phase of a project, since negotiations may drag on for a long time. Starting early is the only way to ensure that the necessary rights can be obtained before the project starts and that any license fees or other remuneration can be included in the calculation of the project costs.
- Also as early as possible in the planning phase, a center experienced in licensing the relevant type of resource should be approached. It may provide assistance or, in some circumstances, take care of obtaining the licenses. At the same time, it can ensure that the licensing terms are drafted so that the data and the results of the project may be included in its own archives/projects after the project ends and made available for the long term.
- Recommendations for drafting license agreements can be found on the CLARIN-D Legal Information Platform.
- License agreements typically have a limited term, especially if they are associated with fees. Particularly in these cases, it is recommended to develop a strategy, in cooperation with a center, for making the content sustainably available. It should also be noted that unintended interpretations by the licensor can prevent license renewals and additional licenses, regardless of their legal validity.
- In cases where it is not possible to obtain sufficient rights to make a text corpus permanently available to the scientific community, but the reasons for building the corpus were nevertheless strong enough, those reasons should be documented and compromise strategies should be found for achieving sustainable availability at least in rudimentary form. One possible model is, e.g., to document clearly for subsequent users how they may obtain the necessary rights themselves.
- Data protection issues should be addressed as early as the planning phase of a project. If personal data is to be collected on a larger scale, an explicit document on the subject should be created and maintained (data protection concept). It must record which data is collected and for which purposes. If necessary, appropriate consent declaration forms must be developed and signed by the people affected by the data processing.
Recommendations for making written corpora available
- It is usually necessary, and common practice, to limit the users of corpora to people who have identified themselves and agreed to an End User License Agreement (see below) and, if necessary, additional data protection regulations. In practice, this can be achieved e.g. by controlling data access via passwords which are allocated only on application and only in person, or via DFN-AAI authentication and web forms to request consent.
- As a general rule, rights and obligations which result from licensing agreements between rights holders and the corpus provider need to be passed on to end users via end user license agreements and data privacy policies (for example, if a corpus provider undertakes an obligation to the licensor to document access to the corpus).
- With regard to personal data, anonymization and pseudonymization should be considered when making corpora available.
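As an illustration of the pseudonymization mentioned above, the following is a minimal sketch in Python, not a legally vetted procedure. It assumes that the personal names occurring in a text are already known (for example from manual annotation) and replaces each of them with a stable keyed pseudonym. The function names `pseudonym` and `pseudonymize`, the `PERSON` prefix and the 8-character truncation are hypothetical choices made only for this example. Whether such a replacement is sufficient in a given case remains a legal question; as long as the key exists, the data is pseudonymized rather than anonymized.

```python
import hashlib
import hmac
import re


def pseudonym(name: str, secret_key: bytes, prefix: str = "PERSON") -> str:
    """Derive a stable pseudonym for a name from a secret key (illustrative only)."""
    digest = hmac.new(secret_key, name.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated to 8 hex characters for readability; this is an assumption of
    # the sketch and makes collisions possible in very large name lists.
    return f"{prefix}_{digest[:8]}"


def pseudonymize(text: str, names: list[str], secret_key: bytes) -> str:
    """Replace every known name in the text with its pseudonym in a single pass."""
    if not names:
        return text
    # Longer names come first so that "Anna Schmidt" is matched before "Anna".
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in ordered))
    return pattern.sub(lambda match: pseudonym(match.group(0), secret_key), text)


if __name__ == "__main__":
    key = b"keep-this-key-outside-the-corpus"  # hypothetical key for the demo
    print(pseudonymize("Anna Schmidt met Anna.", ["Anna", "Anna Schmidt"], key))
```

Because the same name always maps to the same pseudonym under one key, references to a person stay linkable within the corpus. The key should therefore be stored separately from the distributed data, or destroyed if re-identification must be ruled out.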
Recommendations for creating and making own works available: derivative works and databases
- Works that are created by scientists themselves should always be released under license terms, so that subsequent users know whether they can use the work for their own purposes. At the same time, contents that are (or become) free of copyright and on which the scientist did not acquire any other rights should not be portrayed as protected by law, and should as far as possible be explicitly marked as unprotected, e.g. with the help of the "Public Domain Mark" (PDM).
- When selecting license terms, existing, widely used standard licenses that are as liberal as possible should be used (e.g. one of the two Creative Commons licenses recognized in terms of the Open Definition, namely Creative Commons license types BY and BY-SA, or for software, a GNU license or BSD or Apache licenses which refrain from copyleft). The result then comes closest to the Open Access approach. There is a growing trend to publish scientific works with no more restrictions than the Creative Commons license type "CC BY - Attribution," while pure data should be licensed entirely free of restrictions under "CC0". Even scientific publishers are increasingly open to such licenses.
- Particular attention should be paid to stating the license as accurately as possible and in a place where it is easy to find.
- Problems with derivative works may be avoided in some cases, for example when annotations are published as an independent work from which the original work cannot be reconstructed. If the license that is advisable for a derivative work is roughly equivalent to that of the underlying work, the same license should be used to facilitate reusability. In any case, provisions in the license of the underlying work that sometimes allow only certain licenses for later processing (see e.g. the “Share-Alike” clauses in Creative Commons licenses) should be noted.
Recommendations for the use of software when creating derivative works
- If no license terms are known, one should attempt to determine whether restrictions apply to the use of the software and, if so, which ones.
- Particularly with commercial annotation tools, it may be reasonable to clarify and set out in a supplementary agreement the extent to which the outputs of the software may be distributed, because software license provisions often prohibit this altogether. Generally, however, the intention is only to prevent reverse engineering.
- Before using or licensing software, it should be clarified to what extent the outputs of the software may still be used after the license term expires.