Data Model
==========

Conceptualization
-----------------

From both the user and task requirements, one can derive that four basic
data-processing functions need to be carried out: data have to be read, persistently
saved, searched, and deleted. Furthermore, some kind of user management
and multi-user processing is necessary. In addition, the framework should
support web technologies, be well documented, and be easy to extend. Ideally, it
implements the MVC pattern.

\subsection{Data Model}\label{subsec:datamodel}
At the word level, the guidelines of the
\emph{TEI} standard\footnote{http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf}
are in line with the structure defined above in section \ref{subsec:morphologicalSystems}.
Listing \ref{lst:teiExamp} gives an
example of a possible markup at the word level for
\emph{comfortable}.\footnote{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-m.html}

\begin{lstlisting}[language=XML,
caption={TEI-example for 'comfortable'},label=lst:teiExamp] 
<w type="adjective">
 <m type="base">
  <m type="prefix" baseForm="con">com</m>
  <m type="root">fort</m>
 </m>
 <m type="suffix">able</m>
</w>
\end{lstlisting}

This data model reflects just one theoretical conception of word structure.
Crucially, the model rests on the assumption
that the suffix node is on a par with the word base. On the one hand, this
implies that the word stem directly dominates the suffix, but not the prefix. The prefix, on the
other hand, is enclosed in the base, which basically means a stronger lexical,
and less abstract, attachment to the root of a word. Modeling prefixes and suffixes on different
hierarchical levels has important consequences for the branching direction at
the subword level (here right-branching). Theoretical interest aside, the
choice of the TEI standard is reasonable with a view to a sustainable architecture that allows for
exchanging data with little to no additional adjustment.
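
To make the hierarchical claim concrete, the markup in listing \ref{lst:teiExamp}
can be contrasted with a flat analysis in which prefix, root, and suffix are
sisters. The following sketch is purely illustrative and is not the structure
adopted here; it only reuses the elements and attribute values of the example
above.

\begin{lstlisting}[language=XML,caption={Flat alternative for 'comfortable' (illustration only)}]
<w type="adjective">
 <!-- no base node: all three morphemes sit on the same level -->
 <m type="prefix" baseForm="con">com</m>
 <m type="root">fort</m>
 <m type="suffix">able</m>
</w>
\end{lstlisting}

In the flat variant, no branching direction can be read off the markup, whereas
the nested \emph{base} element in listing \ref{lst:teiExamp} groups prefix and
root before the suffix attaches.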

The drawback is that the model is not suitable for all languages.
It reflects a theoretical construct based on Indo-European
languages. As long as attention is paid to the language for which the software is used, this
is not problematic. The model fits most languages of the Indo-European
family, which correspond to the overwhelming majority of all research carried out
(unfortunately).

Implementation
--------------

As laid out in the task analysis in section \ref{subsec:datamodel}, it is
advantageous to use established standards. It was also shown that it makes sense
to keep the metadata of each corpus separate from the data model used for the
words to be analyzed.

For the present case, the TEI standard was identified as an
appropriate markup for words. In terms of the implementation, this means that
the TEI guidelines have to be implemented as an object type compatible with the chosen
repository framework. However, the TEI standard is not complete regarding the
diachronic dimension, i.e. information on the development of the word. To
remain compatible with the elements of the TEI standard on the one hand
and to best meet the requirements of the application on the other, some attributes
are added. This solution allows the XML files to be processed according to
the TEI standard by ignoring the additional attributes, while the additional
markup can still be extracted if needed. The additional attributes
comprise a link to the corpus metadata as well as the \emph{position} and
\emph{occurrence} of the affixes.
Information on the position and some quantification thereof is potentially relevant for a
wealth of research questions, such as predictions on the productivity of
derivatives and their interaction with the phonological or syntactic modules. They were
therefore included with a view to future use.

For reasons of efficiency in subsequent processing,
the historic dates \emph{begin} and \emph{end} were included in both the word
data model and the corpus data model. The resulting word data model is given
in listing \ref{lst:worddatamodel}.
Whereas the attributes of the \emph{objecttype} element are specific to the repository framework, the TEI structure can be
recognized in the hierarchy of the metadata element starting with the element named
\emph{w} (line \ref{src:wordbegin}).

\begin{lstlisting}[language=XML,caption={Word Data Model},label=lst:worddatamodel,escapechar=|]
<?xml version="1.0" encoding="UTF-8"?>
<objecttype
 name="morphilo"
 isChild="true"
 isParent="true"
 hasDerivates="true"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="datamodel.xsd">
 <metadata>
  <element name="morphiloContainer" type="xml" style="dontknow"
 notinherit="true" heritable="false"> 
   <xs:sequence>
    <xs:element name="morphilo">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="w" minOccurs="0" maxOccurs="unbounded">|\label{src:wordbegin}|
        <xs:complexType mixed="true">
         <xs:sequence>
          <!-- stem -->
          <xs:element name="m1" minOccurs="0" maxOccurs="unbounded">
           <xs:complexType mixed="true">
            <xs:sequence>
             <!-- base -->
             <xs:element name="m2" minOccurs="0" maxOccurs="unbounded">
              <xs:complexType mixed="true">
               <xs:sequence>
                <!-- root -->
                <xs:element name="m3" minOccurs="0" maxOccurs="unbounded">
                 <xs:complexType mixed="true">
                  <xs:attribute name="type" type="xs:string"/>
                 </xs:complexType>
                </xs:element>
                <!-- prefix -->
                <xs:element name="m4" minOccurs="0" maxOccurs="unbounded">
                 <xs:complexType mixed="true">
                  <xs:attribute name="type" type="xs:string"/>
                  <xs:attribute name="PrefixbaseForm" type="xs:string"/>
                  <xs:attribute name="position" type="xs:string"/>
                 </xs:complexType>
                </xs:element>
               </xs:sequence>
               <xs:attribute name="type" type="xs:string"/>
              </xs:complexType>  
             </xs:element>
             <!-- suffix -->
             <xs:element name="m5" minOccurs="0" maxOccurs="unbounded">
              <xs:complexType mixed="true">
               <xs:attribute name="type" type="xs:string"/>
               <xs:attribute name="SuffixbaseForm" type="xs:string"/>
               <xs:attribute name="position" type="xs:string"/>
               <xs:attribute name="inflection" type="xs:string"/>
              </xs:complexType>
             </xs:element>
            </xs:sequence>
            <!-- stem attributes -->
            <xs:attribute name="type" type="xs:string"/>
            <xs:attribute name="pos" type="xs:string"/>
            <xs:attribute name="occurrence" type="xs:string"/>
           </xs:complexType>
          </xs:element>
         </xs:sequence>
         <!-- w attributes at word level -->
         <xs:attribute name="lemma" type="xs:string"/>
         <xs:attribute name="complexType" type="xs:string"/>
         <xs:attribute name="wordtype" type="xs:string"/>
         <xs:attribute name="occurrence" type="xs:string"/>
         <xs:attribute name="corpus" type="xs:string"/>
         <xs:attribute name="begin" type="xs:string"/>
         <xs:attribute name="end" type="xs:string"/>
        </xs:complexType>
       </xs:element>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
   </xs:sequence>
  </element>
  <element name="wordtype" type="classification" minOccurs="0" maxOccurs="1">
   <classification id="wordtype"/>
  </element>
  <element name="complexType" type="classification" minOccurs="0" maxOccurs="1">
   <classification id="complexType"/>
  </element>
  <element name="corpus" type="classification" minOccurs="0" maxOccurs="1">
   <classification id="corpus"/>
  </element>
  <element name="pos" type="classification" minOccurs="0" maxOccurs="1">
   <classification id="pos"/>
  </element>
  <element name="PrefixbaseForm" type="classification" minOccurs="0"
  maxOccurs="1"> 
   <classification id="PrefixbaseForm"/> 
  </element>
  <element name="SuffixbaseForm" type="classification" minOccurs="0"
  maxOccurs="1"> 
   <classification id="SuffixbaseForm"/> 
  </element>
  <element name="inflection" type="classification" minOccurs="0" maxOccurs="1">
   <classification id="inflection"/>
  </element>
  <element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded" >
   <target type="corpmeta"/>
  </element>
 </metadata>
</objecttype>
\end{lstlisting}
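
To make the schema more concrete, the following fragment sketches how the entry
for \emph{comfortable} from listing \ref{lst:teiExamp} might be encoded under
this data model. It is a hypothetical instance, not output taken from the
repository: the corpus name, the counts, the dates, the \emph{position} values,
and the \emph{type} value of \emph{m1} are invented for illustration, and the
wrapping that the repository framework adds around the \emph{morphilo} element
is omitted.

\begin{lstlisting}[language=XML,caption={Hypothetical instance for 'comfortable'}]
<morphilo>
 <!-- corpus name, counts, and dates are invented for illustration -->
 <w lemma="comfortable" occurrence="3" corpus="exampleCorpus"
    begin="1400" end="1500">
  <!-- stem; JJ is the Penn Treebank tag for adjectives -->
  <m1 type="stem" pos="JJ" occurrence="3">
   <!-- base: the schema's xs:sequence lists m3 (root) before m4 (prefix) -->
   <m2 type="base">
    <m3 type="root">fort</m3>
    <m4 type="prefix" PrefixbaseForm="con" position="1">com</m4>
   </m2>
   <m5 type="suffix" SuffixbaseForm="able" position="1">able</m5>
  </m1>
 </w>
</morphilo>
\end{lstlisting}

Note that the element order inside \emph{m2} follows the declaration order of
the schema rather than the linear order of the morphemes in the word.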

Additionally, it is worth mentioning that some attributes are modeled as a
\emph{classification}. All of these have to be listed
as separate elements in the data model. This has been done for all attributes
whose values are subject to little or no change. In fact, the suffix
and prefix morphemes of the language investigated should all be known and are
therefore defined as a classification.
The same is true for the parts of speech, named \emph{pos} in the morphilo data
model above.
Here the Penn Treebank tagset was used. Lastly, the different morphemic layers,
which are all named \emph{m} in the standard model, are renamed $m1$ through $m5$. This is the
only change to the standard that could be problematic if the data are to be
processed elsewhere and the change is not documented explicitly. Yet this
change was necessary because the MyCoRe repository throws errors caused by ambiguity
issues on the different $m$-layers.
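
One way to undo the renaming for processing elsewhere is a small stylesheet.
The following XSLT 1.0 sketch maps $m1$ through $m5$ back to the TEI element
\emph{m} and drops some of the added attributes. It rests on several
assumptions that are not confirmed by the implementation: that the elements
carry no namespace, that \emph{PrefixbaseForm} and \emph{SuffixbaseForm}
correspond to the TEI attribute \emph{baseForm}, and that the attributes listed
in the last template are the ones to be removed. It has not been run against
actual repository output.

\begin{lstlisting}[language=XML,caption={Sketch of a stylesheet restoring the TEI element names}]
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- identity template: copy everything not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- rename the numbered layers m1..m5 back to the TEI element m -->
  <xsl:template match="m1|m2|m3|m4|m5">
    <m>
      <xsl:apply-templates select="@*|node()"/>
    </m>
  </xsl:template>

  <!-- assumed mapping of the base-form attributes onto baseForm -->
  <xsl:template match="@PrefixbaseForm|@SuffixbaseForm">
    <xsl:attribute name="baseForm">
      <xsl:value-of select="."/>
    </xsl:attribute>
  </xsl:template>

  <!-- assumed list of added attributes to drop; adjust as needed -->
  <xsl:template match="@position|@occurrence|@corpus|@begin|@end"/>
</xsl:stylesheet>
\end{lstlisting}

The remaining attributes (for example \emph{wordtype} or \emph{inflection})
pass through unchanged in this sketch and would have to be added to the last
template if they are to be removed as well.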

The second data model describes only very few properties of the text corpora
from which the words are extracted. Listing \ref{lst:corpusdatamodel} depicts
only the metadata element. For the sake of simplicity of the prototype, this
data model is kept as simple as possible. The only obligatory field is the name of
the corpus. Specific dates of the corpus are classified as optional because in
some cases a text cannot be dated reliably.


\begin{lstlisting}[language=XML,caption={Corpus Data
Model},label=lst:corpusdatamodel] 
<metadata> 
 <!-- mandatory fields -->
 <element name="korpusname" type="text" minOccurs="1" maxOccurs="1"/>
 <!-- optional fields -->
 <element name="sprache" type="text" minOccurs="0" maxOccurs="1"/>
 <element name="size" type="number" minOccurs="0" maxOccurs="1"/>
 <element name="datefrom" type="text" minOccurs="0" maxOccurs="1"/>
 <element name="dateuntil" type="text" minOccurs="0" maxOccurs="1"/>
 <!-- number of words -->
 <element name="NoW" type="text" minOccurs="0" maxOccurs="1"/>
 <element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded">
  <target type="morphilo"/>
 </element>
</metadata>
\end{lstlisting}

As a final remark, one might have noticed that all attributes are modeled as
strings, although other data types are available and the fields encoding the dates or
the number of words suggest otherwise. The MyCoRe framework even provides a
data type \emph{historydate}. There is no fully satisfying explanation for not
using it.
All that can be said is that using data types other than strings
later leads to problems in the interplay between the search engine and the
repository framework. These issues seem to be well known and can be followed on
GitHub.
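
For illustration only, a typed variant of the date and count fields might look
like the following sketch. It is not the declaration used in the prototype: the
choice of \texttt{xs:integer} and \texttt{xs:nonNegativeInteger} is an
assumption, and the two \emph{element} lines merely assume that
\emph{historydate} can be given as a \emph{type} value in the same way as
\emph{text} or \emph{number} in listing \ref{lst:corpusdatamodel}.

\begin{lstlisting}[language=XML,caption={Hypothetical typed variant (not used in the prototype)}]
<!-- word data model: years and counts as numeric XSD types (assumed) -->
<xs:attribute name="begin" type="xs:integer"/>
<xs:attribute name="end" type="xs:integer"/>
<xs:attribute name="occurrence" type="xs:nonNegativeInteger"/>

<!-- corpus data model: assumed syntax for the MyCoRe historydate type -->
<element name="datefrom" type="historydate" minOccurs="0" maxOccurs="1"/>
<element name="dateuntil" type="historydate" minOccurs="0" maxOccurs="1"/>
\end{lstlisting}

Declarations of this kind are what led to the problems with the search engine
mentioned above, which is why the prototype stores these values as strings.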