Software Design
===============
\begin{figure}
\centering
\includegraphics[scale=0.8]{architecture.pdf}
\caption{Morphilo Architecture}
\label{fig:architect}
\end{figure}
The architecture of a possible \emph{take-and-share} approach for language
resources is visualized in figure \ref{fig:architect}. Since the gist
of the approach becomes clearer with a concrete example, the case of
annotating lexical derivatives of Middle English and a corresponding database is
used as an illustration.
However, any other tool that helps with manual annotations and manages metadata of a corpus could be
substituted here instead.
After inputting an untagged corpus or plain text, it is determined whether the
input material was annotated previously by a different user. This information is
usually provided by the metadata administered by the annotation tool; in the case at
hand it is called \emph{Morphilizer} in figure \ref{fig:architect}. An
alternative is a simple table look-up for all occurring words in the datasets Corpus 1 through Corpus n. If all words are
contained, the \emph{yes}-branch is followed -- otherwise the \emph{no}-branch
is taken. The difference between the two branches is subtle, yet crucial. On
both branches, the annotation tool (here \emph{Morphilizer}) is called, which, first,
sorts out all words that are not contained in the master database (here \emph{Morphilo-DB})
and, second, makes reasonable suggestions on an optimal annotation of
the items. In both cases the
annotations are linked to the respective items (e.g. words) in the
text, but they are also persistently saved in an extra dataset, i.e. Corpus 1
through n, together with all available metadata.
The difference between both information streams is that
in the \emph{yes}-branch a comparison between the newly created dataset and
all of the previous datasets of this text is carried out. Within this
unit, all deviations and congruencies are marked and counted. The underlying
assumption is that, with a growing number of comparable texts, the
correct annotations approach a theoretical true value
while errors level out, provided that the sample size is large enough. What the
distribution of errors and correct annotations looks like exactly, and whether a
normal distribution can be assumed, is still the object of ongoing research. Independent
of the concrete results, however, the component (called \emph{compare
manual annotations} in figure \ref{fig:architect}) allows for specifying the
exact form of the sample population.
In fact, it is necessary at that point to define the form of the distribution,
the sample size, and the rejection region. The standard settings are a normal
distribution, a rejection region of $\alpha = 0.05$, and a sample size of $30$, so
that a simple Gau\ss{} test can be calculated.
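To make this default setting concrete, the following sketch shows how such a Gau\ss{} test could be computed. The class and method names are hypothetical and not part of the Morphilo code base; the sketch merely illustrates the test statistic $z = (\bar{x} - \mu_0)/(\sigma/\sqrt{n})$ with the two-sided critical value for $\alpha = 0.05$.
\begin{lstlisting}[language=java,caption={Sketch of the default Gau\ss{} test},label=src:gaussSketch]
//Hypothetical sketch: a one-sample Gauss (z-)test over agreement rates,
//using the default setting of alpha = 0.05 and a sample size of about 30.
public class GaussTestSketch
{
    /**
     * @param agreementRates observed agreement per comparable text (the sample)
     * @param mu0            assumed true agreement rate under the null hypothesis
     * @param sigma          assumed standard deviation of the population
     * @return true if the null hypothesis is rejected at alpha = 0.05 (two-sided)
     */
    public static boolean rejectNullHypothesis(double[] agreementRates, double mu0, double sigma)
    {
        int n = agreementRates.length;
        double sum = 0.0;
        for (double rate : agreementRates)
        {
            sum += rate;
        }
        double mean = sum / n;
        //test statistic of the Gauss test
        double z = (mean - mu0) / (sigma / Math.sqrt(n));
        //critical value of the standard normal distribution for alpha = 0.05
        return Math.abs(z) > 1.96;
    }
}
\end{lstlisting}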
Continuing the information flow further, these statistical calculations are
delivered to the quality-control-component. Based on the statistics, the
respective items together with the metadata, frequencies, and, of course,
annotations are written to the master database. All information in the master
database is directly used for automated annotations. Thus it is directly matched
to the input texts or corpora through the \emph{Morphilizer} tool.
Based on the entries looked up in the master database, the annotation tool decides which items
are to be annotated manually.
The processes just described are all hidden from the user, who cannot influence
the set quality standards other than through errors in the annotation process. The
user will only see the number of items of the input text that he or she has to process manually. The
annotator will thus see an estimate of the workload beforehand and, based on this
number, can decide whether to start the annotation at all. It will be
possible to interrupt the annotation work and save progress on the server, and
the user will have access to the annotations made in the respective dataset in order to
correct them or save them and resume later. It is important to note that the user will receive
the tagged document only after all items are fully annotated; no partially
tagged text can be output.
Controller Adjustments
======================
General Principle of Operation
------------------------------
Figure \ref{fig:classDiag} illustrates the dependencies of the five Java classes that were integrated to add the Morphilo
functionality, defined in the package \emph{custom.mycore.addons.morphilo}. The general principle of operation
is the following. The handling of data search, upload, saving, and user
authentication is fully left to the MyCoRe functionality, which is completely
implemented. The class \emph{ProcessCorpusServlet.java} receives a request from the web interface to process an uploaded file,
i.e. a simple text corpus, and checks whether any of the words are available in the master database. All words that are not
listed in the master database are written to an extra file; these are the words that have to be annotated manually. At the end, the
servlet sends a response back to the user interface. If all words are contained in the master, an xml file is generated from the
master database that includes all annotated words of the original corpus. Usually this will not be the case for larger text files,
so if some words are not in the master, the user gets the response to initiate the manual annotation process.
The manual annotation process is handled by the class
\emph{{Tag\-Corpus\-Serv\-let\-.ja\-va}}, which builds a JDOM object for the first word in the extra file.
This is done by creating an object of the \emph{JDOMorphilo.java} class. This class, in turn, uses the methods of
\emph{AffixStripper.java}, which make simple but reasonable suggestions on the word structure. The JDOM object is then
given as a response back to the user. It is presented as a form in which the user can make changes; this is necessary
because the word structure algorithm of \emph{AffixStripper.java} errs in some cases. Once the user agrees on the
suggestions or on his or her corrections, the JDOM object is saved as an xml document that is only searchable, visible, and
changeable by the authenticated user (and the administrator). Another file containing all processed words is created or
updated respectively, and the \emph{TagCorpusServlet.java} servlet restarts until the last word in the extra list is
processed. This enables the user to stop and resume his or her annotation work at a later point in time. The
\emph{TagCorpusServlet} calls methods from \emph{ProcessCorpusServlet.java} to adjust the content of the extra
file harboring the untagged words. If this file is empty, and only then, it is replaced by a file comprising all words
from the original text file -- both the ones from the master database and the ones annotated by the user --
in an annotated xml representation.
Each time \emph{ProcessCorpusServlet.java} is instantiated, it also instantiates \emph{QualityControl.java}. This class checks whether a
new word can be transferred to the master database. The algorithm can be freely adapted to higher or lower quality standards.
In its present configuration, a method tests against a threshold of 20 different
registered users agreeing on the annotation of the same word. More specifically,
if 20 JDOM objects are identical except in the attribute field \emph{occurrences} in the metadata node, the JDOM object becomes
part of the master. The latter is easily done by changing the attribute \emph{creator} from the user name
to \emph{``administrator''} in the service node, which makes the dataset part of the master database. Moreover, the \emph{occurrences}
attribute is updated by adding up all occurrences of the word that stem from
different text corpora of the same time range.
\begin{landscape}
\begin{figure}
\centering
\includegraphics[scale=0.55]{morphilo_uml.png}
\caption{Class Diagram Morphilo}
\label{fig:classDiag}
\end{figure}
\end{landscape}
Conceptualization
-----------------
The controller component is largely
specified and ready to use in some hundred or so Java classes handling the
logic of the search, such as indexing, but also dealing with directories and
files, i.e. saving, creating, deleting, and updating files.
Moreover, a rudimentary user management comprising different roles and
rights is offered. The basic technology behind the controller's logic is the
servlet. As such, all new code has to be registered as a servlet in the
web-fragment.xml of the servlet container (here Apache Tomcat), as listing \ref{lst:webfragment} shows.
\begin{lstlisting}[language=XML,caption={Servlet Registering in the
web-fragment.xml (excerpt)},label=lst:webfragment,escapechar=|]
<servlet>
<servlet-name>ProcessCorpusServlet</servlet-name>
<servlet-class>custom.mycore.addons.morphilo.ProcessCorpusServlet</servlet-class>
</servlet>
<servlet-mapping>
<servlet-name>ProcessCorpusServlet</servlet-name>
<url-pattern>/servlets/object/process</url-pattern>|\label{ln:process}|
</servlet-mapping>
<servlet>
<servlet-name>TagCorpusServlet</servlet-name>
<servlet-class>custom.mycore.addons.morphilo.TagCorpusServlet</servlet-class>
</servlet>
<servlet-mapping>
<servlet-name>TagCorpusServlet</servlet-name>
<url-pattern>/servlets/object/tag</url-pattern>|\label{ln:tag}|
</servlet-mapping>
\end{lstlisting}
Now, the logic has to be extended by the specifications analyzed in chapter
\ref{chap:concept} on conceptualization. More specifically, some
classes have to be added that take care of analyzing words
(\emph{AffixStripper.java, InflectionEnum.java, SuffixEnum.java,
PrefixEnum.java}), extracting the relevant words from the text and checking the
uniqueness of the text (\emph{ProcessCorpusServlet.java}), making reasonable
suggestions on the annotation (\emph{TagCorpusServlet.java}), building the object
of each annotated word (\emph{JDOMorphilo.java}), and checking the quality by applying
statistical models (\emph{QualityControl.java}).
Implementation
--------------
Having taken a bird's eye perspective in the previous chapter, it is now time to take a look at the specific implementation at the level
of methods. Starting with the main servlet, \emph{ProcessCorpusServlet.java}, the class defines five getter methods:
\renewcommand{\labelenumi}{(\theenumi)}
\begin{enumerate}
\item\label{itm:geturl} public String getURLParameter(MCRServletJob, String)
\item\label{itm:getcorp} public String getCorpusMetadata(MCRServletJob, String)
\item\label{itm:getcont} public ArrayList<String> getContentFromFile(MCRServletJob, String)
\item\label{itm:getderiv} public Path getDerivateFilePath(MCRServletJob, String)
\item\label{itm:now} public int getNumberOfWords(MCRServletJob job, String)
\end{enumerate}
Since each servlet in MyCoRe extends the class MCRServlet, it has access to MCRServletJob, from which the http requests and responses
can be used; this is the first argument in the above methods. The second argument of method (\ref{itm:geturl}) specifies the name of a url parameter, i.e.
the object id or the id of the derivate. The method returns the value of the given parameter. Typically, MyCoRe uses the url to exchange
these ids. The second method provides us with the value of a data field in the xml document; here the string defines the name of an attribute.
Method (\ref{itm:getcont}), \emph{getContentFromFile(MCRServletJob, String)}, returns the words of a file as a list when given the file name as a string.
The getter listed in (\ref{itm:getderiv}) returns the path from the MyCoRe repository when the name of
the file is specified. And finally, method (\ref{itm:now}) returns the number of words by simply returning
\emph{getContentFromFile(job, fileName).size()}.
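As an illustration of how little logic these getters contain, a sketch of method (\ref{itm:now}) is given below; apart from the call to \emph{getContentFromFile}, which follows from the description just given, its exact form is an assumption.
\begin{lstlisting}[language=java,caption={Sketch of getter (5)},label=src:nowSketch]
//Sketch of getter (5); the body follows from the description above and is
//otherwise an assumption about the actual implementation.
public int getNumberOfWords(MCRServletJob job, String fileName) throws Exception
{
    //delegate to getter (3) and count the returned word list
    return getContentFromFile(job, fileName).size();
}
\end{lstlisting}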
There are two methods in every MyCoRe servlet that have to be overwritten:
\emph{protected void render(MCRServletJob, Exception)}, which redirects the requests and renders the \emph{POST} or \emph{GET} responses, and
\emph{protected void think(MCRServletJob)}, in which the logic is implemented. Since the latter is important for understanding the
core idea of the Morphilo algorithm, it is displayed in full length in source code \ref{src:think}.
\begin{lstlisting}[language=java,caption={The overwritten think method},label=src:think,escapechar=|]
protected void think(MCRServletJob job) throws Exception
{
this.job = job;
String dateFromCorp = getCorpusMetadata(job, "def.datefrom");
String dateUntilCorp = getCorpusMetadata(job, "def.dateuntil");
String corpID = getURLParameter(job, "objID");
String derivID = getURLParameter(job, "id");
//if NoW is 0, fill with anzWords
MCRObject helpObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(corpID));|\label{ln:bugfixstart}|
Document jdomDocHelp = helpObj.createXML();
XPathFactory xpfacty = XPathFactory.instance();
XPathExpression<Element> xpExp = xpfacty.compile("//NoW", Filters.element());
Element elem = xpExp.evaluateFirst(jdomDocHelp);
//fixes transferred morphilo data from previous stand alone project
int corpussize = getNumberOfWords(job, "");
if (Integer.parseInt(elem.getText()) != corpussize)
{
elem.setText(Integer.toString(corpussize));
helpObj = new MCRObject(jdomDocHelp);
MCRMetadataManager.update(helpObj);
}|\label{ln:bugfixend}|
//Check if the uploaded corpus was processed before
SolrClient slr = MCRSolrClientFactory.getSolrClient();|\label{ln:solrstart}|
SolrQuery qry = new SolrQuery();
qry.setFields("korpusname", "datefrom", "dateuntil", "NoW", "id");
qry.setQuery("datefrom:" + dateFromCorp + " AND dateuntil:" + dateUntilCorp + " AND NoW:" + corpussize);
SolrDocumentList rslt = slr.query(qry).getResults();|\label{ln:solrresult}|
Boolean incrOcc = true;
// if resultset contains only one, then it must be the newly created corpus
if (slr.query(qry).getResults().getNumFound() > 1)
{
incrOcc = false;
}|\label{ln:solrend}|
//match all words in corpus with morphilo (creator=administrator) and save all words that are not in morphilo DB in leftovers
ArrayList<String> leftovers = new ArrayList<String>();
ArrayList<String> processed = new ArrayList<String>();
leftovers = getUnknownWords(getContentFromFile(job, ""), dateFromCorp, dateUntilCorp, "", incrOcc, incrOcc, false);|\label{ln:callkeymeth}|
//write all words of leftover in file as derivative to respective corpmeta dataset
MCRPath root = MCRPath.getPath(derivID, "/");|\label{ln:filesavestart}|
Path fn = getDerivateFilePath(job, "").getFileName();
Path p = root.resolve("untagged-" + fn);
Files.write(p, leftovers);|\label{ln:filesaveend}|
//create a file for all words that were processed
Path procWds = root.resolve("processed-" + fn);
Files.write(procWds, processed);
}
\end{lstlisting}
Using the above mentioned getter methods, the \emph{think} method assigns values to the object ID (needed to get the xml document
that contains the corpus metadata), the file ID, and the beginning and ending dates of the corpus to be analyzed. Lines \ref{ln:bugfixstart}
through \ref{ln:bugfixend} show how to access a MyCoRe object as an xml document, a procedure that will be used in different variants
throughout this implementation.
By means of the object ID, the respective corpus is identified and a JDOM document is constructed, which can then be accessed
by XPath. The compiled XPath expression yields the collection of matching xml nodes. In the present case, it is safe to assume that only one element
of \emph{NoW} is available (see the corpus data model in listing \ref{lst:corpusdatamodel} with $maxOccurs='1'$). So we do not have to loop through
the collection, but can use the first node named \emph{NoW}. The if-test checks whether the number of words of the uploaded file is the
same as the number written in the document. When the document is initially created by the MyCoRe logic, this number is set to zero.
If unequal, the setText(String) method is used to write the number of words of the corpus to the document.
Lines \ref{ln:solrstart}--\ref{ln:solrend} reveal the second important ingredient, i.e. controlling the search engine. First, a Solr
client and a query are initialized. Then, the output of the result set is defined by giving the fields of interest of the document.
In the case at hand, these are the id, the name of the corpus, the number of words, and the beginning and ending dates. With \emph{setQuery}
it is possible to assign values to some or all of these fields. Finally, \emph{getResults()} carries out the search and writes
all hits to a \emph{SolrDocumentList} (line \ref{ln:solrresult}). The test that follows only sets a Boolean
encoding whether the number of occurrences of a word in the master should be updated. To avoid multiple counts,
incrementing the word frequency is only done if the corpus is new.
In line \ref{ln:callkeymeth}, \emph{getUnknownWords(ArrayList, String, String, String, Boolean, Boolean, Boolean)} is called and
returns a list of words. This method is key and will be discussed in depth below. Finally, lines
\ref{ln:filesavestart}--\ref{ln:filesaveend} show how to handle file objects in MyCoRe. Using the file ID, the root path and the name
of the first file in that path are identified. Then, a second file starting with ``untagged'' is created and all words returned from
\emph{getUnknownWords} are written to that file. By the same token, an empty file is created (in the last two lines of the \emph{think} method),
in which all words that are manually annotated will be saved.
In a refactoring phase, the method \emph{getUnknownWords(ArrayList, String, String, String, Boolean, Boolean, Boolean)} could be subdivided into
three methods, one for each Boolean parameter. In fact, this method handles more than one task, mainly in order to avoid code duplication.
%this is just wrong because no resultset will substantially be more than 10-20
%In addition, for large text files this method would run into efficiency problems if the master database also reaches the intended size of about
%$100,000$ entries and beyond because
In essence, an outer loop runs through all words of the corpus and an inner loop runs through all hits in the Solr result set. Because the result
set is supposed to be small, approximately between $10$ and $20$ items, efficiency
problems are unlikely to arise, although there are some more loops running through collections of about the same size.
%As the hits naturally grow larger with an increasing size of the data base, processing time will rise exponentially.
Since each word is identified on the basis of its projected word type, the word form, and the time range it falls into, it is these variables that
have to be checked for existence in the documents. If a variable is not present in the xml documents,
\emph{null} is returned and has to be handled. Moreover, user authentication must be considered. There are three different XPaths that are relevant:
\begin{itemize}
\item[-] \emph{//service/servflags/servflag[@type='createdby']} to test for the correct user
\item[-] \emph{//morphiloContainer/morphilo} to create the annotated document
\item[-] \emph{//morphiloContainer/morphilo/w} to set occurrences or add a link
\end{itemize}
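In the listing that follows, the corresponding checks are only hinted at by comments. As a sketch of how, for instance, the user check on the first of these paths could be realized with the JDOM API, consider the following hypothetical helper; it is an assumption about one possible implementation, not the actual Morphilo code.
\begin{lstlisting}[language=java,caption={Sketch of a possible isAuthorized check},label=src:authSketch]
//Hypothetical helper: derives the isAuthorized flag from the 'createdby'
//service flag of a JDOM document; a missing node simply fails the check.
private boolean isAuthorized(Document jdomDoc, String currentUser)
{
    XPathExpression<Element> xp = XPathFactory.instance().compile(
        "//service/servflags/servflag[@type='createdby']", Filters.element());
    Element creator = xp.evaluateFirst(jdomDoc);
    if (creator == null)
    {
        return false;
    }
    String user = creator.getTextTrim();
    //the master data (creator 'administrator') is readable for everyone
    return user.equals(currentUser) || user.equals("administrator");
}
\end{lstlisting}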
As an illustration of the core functioning of this method, listing \ref{src:getUnknowWords} is given.
\begin{lstlisting}[language=java,caption={Mode of Operation of getUnknownWords Method},label=src:getUnknowWords,escapechar=|]
public ArrayList<String> getUnknownWords(
ArrayList<String> corpus,
String timeCorpusBegin,
String timeCorpusEnd,
String wdtpe,
Boolean setOcc,
Boolean setXlink,
Boolean writeAllData) throws Exception
{
String currentUser = MCRSessionMgr.getCurrentSession().getUserInformation().getUserID();
ArrayList lo = new ArrayList();
for (int i = 0; i < corpus.size(); i++)
{
SolrClient solrClient = MCRSolrClientFactory.getSolrClient();
SolrQuery query = new SolrQuery();
query.setFields("w","occurrence","begin","end", "id", "wordtype");
query.setQuery(corpus.get(i));
query.setRows(50); //more than 50 items are extremely unlikely
SolrDocumentList results = solrClient.query(query).getResults();
Boolean available = false;
for (int entryNum = 0; entryNum < results.size(); entryNum++)
{
...
// update in MCRMetaDataManager
String mcrIDString = results.get(entryNum).getFieldValue("id").toString();
//read the MCRObject and create a JDOM document:
MCRObject mcrObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(mcrIDString));
Document jdomDoc = mcrObj.createXML();
...
//check and correction for word type
...
//checkand correction time: timeCorrect
...
//check if user correct: isAuthorized
...
XPathExpression<Element> xp = xpfac.compile("//morphiloContainer/morphilo/w", Filters.element());
//Iterates w-elements and increments occurrence attribute if setOcc is true
for (Element e : xp.evaluate(jdomDoc))
{
//if the user is authorized and the word type is either given nowhere or identical
if (isAuthorized && timeCorrect
&& ((e.getAttributeValue("wordtype") == null && wdtpe.equals(""))
|| e.getAttributeValue("wordtype").equals(wordtype))) // only for the sake of consistency
{
int oc = -1;
available = true;|\label{ln:available}|
try
{
//adjust the occurrence attribute
if (setOcc)
{
oc = Integer.parseInt(e.getAttributeValue("occurrence"));
e.setAttribute("occurrence", Integer.toString(oc + 1));
}
//write morphilo-ObjectID in xml of corpmeta
if (setXlink)
{
Namespace xlinkNamespace = Namespace.getNamespace("xlink", "http://www.w3.org/1999/xlink");|\label{ln:namespace}|
MCRObject corpObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(getURLParameter(job, "objID")));
Document corpDoc = corpObj.createXML();
XPathExpression<Element> xpathEx = xpfac.compile("//corpuslink", Filters.element());
Element elm = xpathEx.evaluateFirst(corpDoc);
elm.setAttribute("href" , mcrIDString, xlinkNamespace);
}
mcrObj = new MCRObject(jdomDoc);|\label{ln:updatestart}|
MCRMetadataManager.update(mcrObj);
QualityControl qc = new QualityControl(mcrObj);|\label{ln:updateend}|
}
catch(NumberFormatException except)
{
// ignore
}
}
}
if (!available) // if not available in datasets under the given conditions |\label{ln:notavailable}|
{
lo.add(corpus.get(i));
}
}
return lo;
}
\end{lstlisting}
As can be seen from listing \ref{src:getUnknowWords}, getting the unknown words of a corpus is rather a side effect of the equally named method.
More precisely, a Boolean (line \ref{ln:available}) is set whenever the document is manipulated, because in that case the word clearly exists.
If the Boolean remains false (line \ref{ln:notavailable}), the word is put on the list of words that have to be annotated manually. As already explained above, the
outer loop runs through all words of the corpus, and in the following lines a Solr result set is created. This set is also looped through, and it is checked whether the time range
and the word type match and whether the user is authorized. In the remainder, the occurrence attribute of the morphilo document can be incremented (if setOcc is true) and/or the word is linked to the
corpus metadata (if setXlink is true). Since the code lines are largely equivalent to
what was explained for listing \ref{src:think}, it suffices to point out that an
additional namespace, i.e.
``xlink'', has to be defined (line \ref{ln:namespace}). Once the linking of word
and corpus is set, the entire MyCoRe object has to be updated. This is done by the functionality of the framework (lines \ref{ln:updatestart}--\ref{ln:updateend}).
At the end, an instance of \emph{QualityControl} is created.
%QualityControl
The class \emph{QualityControl} is instantiated with a constructor
depicted in listing \ref{src:constructQC}.
\begin{lstlisting}[language=java,caption={Constructor of QualityControl.java},label=src:constructQC,escapechar=|]
private MCRObject mycoreObject;
/* Constructor calls method to carry out quality control, i.e. if at least 20
* different users agree 100% on the segments of the word under investigation
*/
public QualityControl(MCRObject mycoreObject) throws Exception
{
this.mycoreObject = mycoreObject;
if (getEqualObjectNumber() > 20)
{
addToMorphiloDB();
}
}
\end{lstlisting}
The constructor takes a MyCoRe object, a potential word candidate for the
master database, which is assigned to a private class variable because the
object is used, though not changed, by some other methods of the class.
More importantly, there are two further methods: \emph{getEqualObjectNumber()} and
\emph{addToMorphiloDB()}. While the former initiates a process of counting and
comparing objects, the latter is concerned with calculating the correct number
of occurrences from different, but not identical, texts and with generating a MyCoRe object with the same content but with two different flags in the \emph{//service/servflags/servflag} node, i.e. \emph{createdby='administrator'} and \emph{state='published'}.
And of course, the \emph{occurrence} attribute is set to the newly calculated value. The logic corresponds exactly to what was explained for
listing \ref{src:think} and will not be repeated here. The only difference lies in the paths compiled by the XPathFactory. They are
\begin{itemize}
\item[-] \emph{//service/servflags/servflag[@type='createdby']} and
\item[-] \emph{//service/servstates/servstate[@classid='state']}.
\end{itemize}
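A sketch of how the flag handling in \emph{addToMorphiloDB()} could look is given below. It is reconstructed from the description above; in particular, the exact encoding of the state flag and the method name are assumptions rather than the actual Morphilo code.
\begin{lstlisting}[language=java,caption={Sketch of the flag handling in addToMorphiloDB()},label=src:promoteSketch]
//Hypothetical sketch: the creator flag and the state are rewritten so that
//the dataset becomes part of the master database, and the occurrence
//attribute is set to the newly calculated value.
private void promoteToMaster(int newOccurrences) throws Exception
{
    Document doc = mycoreObject.createXML();
    XPathFactory xpfac = XPathFactory.instance();
    Element creator = xpfac.compile("//service/servflags/servflag[@type='createdby']",
        Filters.element()).evaluateFirst(doc);
    creator.setText("administrator");
    Element state = xpfac.compile("//service/servstates/servstate[@classid='state']",
        Filters.element()).evaluateFirst(doc);
    //assumption: the state category is stored in the categid attribute
    state.setAttribute("categid", "published");
    Element word = xpfac.compile("//morphiloContainer/morphilo/w",
        Filters.element()).evaluateFirst(doc);
    word.setAttribute("occurrence", Integer.toString(newOccurrences));
    MCRMetadataManager.update(new MCRObject(doc));
}
\end{lstlisting}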
It is more instructive to document how the number of occurrences is calculated. There are two steps involved. First, a list with all MyCoRe objects that are
equal to the object with which the class was instantiated (``mycoreObject'' in listing \ref{src:constructQC}) is created. This list is looped over and all occurrence
attributes are summed up. Second, all occurrences from equal texts are subtracted. Equal texts are identified on the basis of their metadata and their derivates.
There are some obvious shortcomings of this approach, which will be discussed in chapter \ref{chap:results}, section \ref{sec:improv}. Here it suffices to
understand the mode of operation. Listing \ref{src:equalOcc} shows a possible solution.
\begin{lstlisting}[language=java,caption={Occurrence Extraction from Equal Texts (1)},label=src:equalOcc,escapechar=|]
/* returns number of Occurrences if Objects are equal, zero otherwise
*/
private int getOccurrencesFromEqualTexts(MCRObject mcrobj1, MCRObject mcrobj2) throws SAXException, IOException
{
int occurrences = 1;
//extract corpmeta ObjectIDs from morphilo-Objects
String crpID1 = getAttributeValue("//corpuslink", "href", mcrobj1);
String crpID2 = getAttributeValue("//corpuslink", "href", mcrobj2);
//get these two corpmeta Objects
MCRObject corpo1 = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(crpID1));
MCRObject corpo2 = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(crpID2));
//are the texts equal? get list of 'processed-words' derivate
String corp1DerivID = getAttributeValue("//structure/derobjects/derobject", "href", corpo1);
String corp2DerivID = getAttributeValue("//structure/derobjects/derobject", "href", corpo2);
ArrayList result = new ArrayList(getContentFromFile(corp1DerivID, ""));|\label{ln:writeContent}|
result.removeAll(getContentFromFile(corp2DerivID, ""));|\label{ln:removeContent}|
if (result.size() == 0) // the texts are equal
{
// extract occurrences of one the objects
occurrences = Integer.parseInt(getAttributeValue("//morphiloContainer/morphilo/w", "occurrence", mcrobj1));
}
else
{
occurrences = 0; //project metadata happened to be the same, but texts are different
}
return occurrences;
}
\end{lstlisting}
In this implementation, the ids from the \emph{corpmeta} data model are accessed via the xlink attribute in the morphilo documents.
The method \emph{getAttributeValue(String, String, MCRObject)} does exactly the same as demonstrated earlier (see line \ref{ln:namespace}
onwards in listing \ref{src:getUnknowWords}). The underlying logic is that the texts are considered equal if they consist of exactly the same words.
So all words from one file are written to a list (line \ref{ln:writeContent}) and the words from the other file are removed from the
very same list (line \ref{ln:removeContent}). If this list is empty, both files must have contained the same words and the occurrences
are adjusted accordingly. Since this method is called from another private method that merely loops through all equal objects, one gets
the occurrences from all equal texts. For the sake of traceability, the looping method is also given:
\begin{lstlisting}[language=java,caption={Occurrence Extraction from Equal Texts (2)},label=src:equalOcc2,escapechar=|]
private int getOccurrencesFromEqualTexts() throws Exception
{
ArrayList<MCRObject> equalObjects = new ArrayList<MCRObject>();
equalObjects = getAllEqualMCRObjects();
int occurrences = 0;
for (MCRObject obj : equalObjects)
{
occurrences = occurrences + getOccurrencesFromEqualTexts(mycoreObject, obj);
}
return occurrences;
}
\end{lstlisting}
Now, the constructor in listing \ref{src:constructQC} reveals another method that triggers an equally complex chain of procedures.
As implied above, \emph{getEqualObjectNumber()} returns the number of equally annotated words. It does this by falling back to another
method, from whose return value the size of the list is calculated (\emph{getAllEqualMCRObjects().size()}). Hence, we should take a closer look at
\emph{getAllEqualMCRObjects()}. This method has essentially the same design as \emph{int getOccurrencesFromEqualTexts()} in listing \ref{src:equalOcc2}.
The difference is that another method (\emph{Boolean compareMCRObjects(MCRObject, MCRObject, String)}) is used within the loop and
that all equal objects are put into the list of MyCoRe objects that is returned. If this list comprises more than 20
entries,\footnote{This number is somewhat arbitrary. It is inspired by the sample size n in t-distributed data.} the respective document
will be integrated into the master database by the process described above.
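Under the assumption that the candidate objects are retrieved via a Solr query on the word form, as in \emph{getUnknownWords}, a sketch of \emph{getAllEqualMCRObjects()} could look as follows; the query and the compared xpath are assumptions here.
\begin{lstlisting}[language=java,caption={Sketch of getAllEqualMCRObjects()},label=src:equalObjSketch]
//Hypothetical sketch: candidate objects are assumed to come from a Solr query
//on the word form and are compared pairwise with compareMCRObjects.
private ArrayList<MCRObject> getAllEqualMCRObjectsSketch(String wordForm) throws Exception
{
    ArrayList<MCRObject> equalObjects = new ArrayList<MCRObject>();
    SolrClient solrClient = MCRSolrClientFactory.getSolrClient();
    SolrQuery query = new SolrQuery();
    query.setQuery(wordForm);
    query.setRows(50);
    SolrDocumentList results = solrClient.query(query).getResults();
    for (int i = 0; i < results.size(); i++)
    {
        String id = results.get(i).getFieldValue("id").toString();
        MCRObject candidate = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(id));
        //assumed xpath: compare only the morphilo metadata container
        if (compareMCRObjects(mycoreObject, candidate, "//morphiloContainer/morphilo"))
        {
            equalObjects.add(candidate);
        }
    }
    return equalObjects;
}
\end{lstlisting}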
The comparator logic is shown in listing \ref{src:compareMCR}.
\begin{lstlisting}[language=java,caption={Comparison of MyCoRe objects},label=src:compareMCR,escapechar=|]
private Boolean compareMCRObjects(MCRObject mcrobj1, MCRObject mcrobj2, String xpath) throws SAXException, IOException
{
Boolean isEqual = false;
Boolean beginTime = false;
Boolean endTime = false;
Boolean occDiff = false;
Boolean corpusDiff = false;
String source = getXMLFromObject(mcrobj1, xpath);
String target = getXMLFromObject(mcrobj2, xpath);
XMLUnit.setIgnoreAttributeOrder(true);
XMLUnit.setIgnoreComments(true);
XMLUnit.setIgnoreDiffBetweenTextAndCDATA(true);
XMLUnit.setIgnoreWhitespace(true);
XMLUnit.setNormalizeWhitespace(true);
//differences in occurrences, end, begin should be ignored
try
{
Diff xmlDiff = new Diff(source, target);
DetailedDiff dd = new DetailedDiff(xmlDiff);
//counters for differences
int i = 0;
int j = 0;
int k = 0;
int l = 0;
// list containing all differences
List differences = dd.getAllDifferences();|\label{ln:difflist}|
for (Object object : differences)
{
Difference difference = (Difference) object;
//@begin,@end,... node is not in the difference list if the count is 0
if (difference.getControlNodeDetail().getXpathLocation().endsWith("@begin")) i++;|\label{ln:diffbegin}|
if (difference.getControlNodeDetail().getXpathLocation().endsWith("@end")) j++;
if (difference.getControlNodeDetail().getXpathLocation().endsWith("@occurrence")) k++;
if (difference.getControlNodeDetail().getXpathLocation().endsWith("@corpus")) l++;|\label{ln:diffend}|
//@begin and @end have different values: they must be checked if they fall right in the allowed time range
if ( difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
&& difference.getControlNodeDetail().getXpathLocation().endsWith("@begin")
&& (Integer.parseInt(difference.getControlNodeDetail().getValue()) < Integer.parseInt(difference.getTestNodeDetail().getValue())) )
{
beginTime = true;
}
if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
&& difference.getControlNodeDetail().getXpathLocation().endsWith("@end")
&& (Integer.parseInt(difference.getControlNodeDetail().getValue()) > Integer.parseInt(difference.getTestNodeDetail().getValue())) )
{
endTime = true;
}
//attribute values of @occurrence and @corpus are ignored if they are different
if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
&& difference.getControlNodeDetail().getXpathLocation().endsWith("@occurrence"))
{
occDiff = true;
}
if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
&& difference.getControlNodeDetail().getXpathLocation().endsWith("@corpus"))
{
corpusDiff = true;
}
}
//if any of @begin, @end ... is identical set Boolean to true
if (i == 0) beginTime = true;|\label{ln:zerobegin}|
if (j == 0) endTime = true;
if (k == 0) occDiff = true;
if (l == 0) corpusDiff = true;|\label{ln:zeroend}|
//if the size of differences is greater than the number of changes admitted in @begin, @end ... something else must be different
if (beginTime && endTime && occDiff && corpusDiff && (i + j + k + l) == dd.getAllDifferences().size()) isEqual = true;|\label{ln:diffsum}|
}
catch (SAXException e)
{
e.printStackTrace();
}
catch (IOException e)
{
e.printStackTrace();
}
return isEqual;
}
\end{lstlisting}
In this method, XMLUnit is heavily used to make all necessary node comparisons. The matter becomes more complicated, however, when some attributes
are not simply ignored but evaluated according to a given definition, as is the case for the time range. If the evaluator and builder classes are
not to be overwritten entirely -- because they are needed for evaluating other nodes of the
xml document -- the above solution appears a bit awkward, so there is potential for improvement before the production version is programmed.
XMLUnit provides us with a
list of the differences between the two documents (see line \ref{ln:difflist}). Four kinds of differences are allowed, namely in the attributes \emph{occurrence},
\emph{corpus}, \emph{begin}, and \emph{end}. For each of them a Boolean variable is set. Because any of these attributes could also be equal to those of the master
document, and the difference list only contains the actual differences, one has to find a way to handle both cases, equal and different, for these attributes.
This could be done by ignoring the nodes altogether. Yet, this would not include testing whether the beginning and ending dates fall into the range of the master
document. Therefore the attributes are counted, as lines \ref{ln:diffbegin} through \ref{ln:diffend} reveal. If any two documents
differ in anything other than the four attributes just specified, the number of differences collected by XMLUnit exceeds the sum of the counters (line \ref{ln:diffsum}) and the equality test fails.
The remaining if-tests assign truth values to the respective
Booleans. It is probably worth mentioning that if a counter is zero (lines
\ref{ln:zerobegin}--\ref{ln:zeroend}), the corresponding attribute values are identical and hence the Boolean has to be set to true explicitly; otherwise the test in line \ref{ln:diffsum} would fail.
%TagCorpusServlet
Once quality control (explained in detail above) has been passed, it is
the user's turn to interact further. By clicking on the option \emph{Manual tagging}, the \emph{TagCorpusServlet} is called. This servlet instantiates
\emph{ProcessCorpusServlet} to get access to the \emph{getUnknownWords} method, which delivers the words still to be
processed and overwrites the content of the file starting with \emph{untagged}. For the next word in \emph{leftovers}, a new MyCoRe object is created
using the JDOM API and added to the file beginning with \emph{processed}. In line \ref{ln:tagmanu} of listing \ref{src:tagservlet}, the previously defined
entry mask is called, in which the proposed word structure can be confirmed or changed. How the word structure is determined will be shown later in
the text.
\begin{lstlisting}[language=java,caption={Manual Tagging Procedure},label=src:tagservlet,escapechar=|]
...
if (!leftovers.isEmpty())
{
ArrayList<String> processed = new ArrayList<String>();
//processed.add(leftovers.get(0));
JDOMorphilo jdm = new JDOMorphilo();
MCRObject obj = jdm.createMorphiloObject(job, leftovers.get(0));|\label{ln:jdomobject}|
//write word to be annotated in process list and save it
Path filePathProc = pcs.getDerivateFilePath(job, "processed").getFileName();
Path proc = root.resolve(filePathProc);
processed = pcs.getContentFromFile(job, "processed");
processed.add(leftovers.get(0));
Files.write(proc, processed);
//call entry mask for next word
tagUrl = prop.getBaseURL() + "content/publish/morphilo.xed?id=" + obj.getId();|\label{ln:tagmanu}|
}
else
{
//initiate process to give a complete tagged file of the original corpus
//if untagged-file is empty, match original file with morphilo
//creator=administrator OR creator=username and write matches in a new file
ArrayList<String> complete = new ArrayList<String>();
ProcessCorpusServlet pcs2 = new ProcessCorpusServlet();
complete = pcs2.getUnknownWords(
pcs2.getContentFromFile(job, ""), //main corpus file
pcs2.getCorpusMetadata(job, "def.datefrom"),
pcs2.getCorpusMetadata(job, "def.dateuntil"),
"", //wordtype
false,
false,
true);
Files.delete(p);
MCRXMLFunctions mdm = new MCRXMLFunctions();
String mainFile = mdm.getMainDocName(derivID);
Path newRoot = root.resolve("tagged-" + mainFile);
Files.write(newRoot, complete);
//return to Menu page
tagUrl = prop.getBaseURL() + "receive/" + corpID;
}
\end{lstlisting}
At the point where no more items are left in \emph{leftovers}, the \emph{getUnknownWords} method is called with the last Boolean parameter
set to true. This indicates that the array list containing all data available and relevant to the respective user is returned, as seen in
the code snippet in listing \ref{src:writeAll}.
\begin{lstlisting}[language=java,caption={Code snippet to deliver all data to the user},label=src:writeAll,escapechar=|]
...
// all data is written to lo in TEI
if (writeAllData && isAuthorized && timeCorrect)
{
XPathExpression<Element> xpath = xpfac.compile("//morphiloContainer/morphilo", Filters.element());
for (Element e : xpath.evaluate(jdomDoc))
{
XMLOutputter outputter = new XMLOutputter();
outputter.setFormat(Format.getPrettyFormat());
lo.add(outputter.outputString(e.getContent()));
}
}
...
\end{lstlisting}
The complete list (\emph{lo}) is written to yet a third file starting with \emph{tagged} and finally returned to the main project webpage.
%JDOMorphilo
The interesting question now is where the word structure that is filled into the entry mask, as asserted above, comes from.
In listing \ref{src:tagservlet}, line \ref{ln:jdomobject}, one can see that a JDOM object is created and the method
\emph{createMorphiloObject(MCRServletJob, String)} is called. The string parameter is the word that needs to be analyzed.
Most of the method is a mere application of the JDOM API given the data model in chapter \ref{chap:concept}, section
\ref{subsec:datamodel}, and listing \ref{lst:worddatamodel}. That means namespaces, elements, and their attributes are defined in the correct
order and hierarchy.
To fill the elements and attributes with text, i.e. prefixes, suffixes, stems, etc., hash maps -- containing the morpheme as
key and its position as value -- are created and filled with the results of an AffixStripper instantiation. Depending on how many prefixes
or suffixes are put in the hash maps, the same number of xml elements is created. As a final step, a valid MyCoRe id is generated using
the existing MyCoRe functionality, and the object is created and returned to the TagCorpusServlet.
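To make this more tangible, the following sketch shows how the element hierarchy for a single word could be assembled with the JDOM API. The getter names on \emph{AffixStripper} and the constructor used here are assumptions; the actual method additionally wraps the element in the \emph{morphiloContainer/morphilo} hierarchy and assigns a valid MyCoRe id, which is omitted in this sketch.
\begin{lstlisting}[language=java,caption={Sketch of the element construction in createMorphiloObject()},label=src:jdomSketch]
//Hypothetical sketch: the analysis results of AffixStripper are turned into
//the w/m1..m5 hierarchy of the word data model (listing lst:worddatamodel).
private Element buildWordElement(String word)
{
    AffixStripper stripper = new AffixStripper(word); //assumed constructor
    Element w = new Element("w");
    w.setAttribute("occurrence", "1");
    Element stem = new Element("m1"); //stem
    Element base = new Element("m2"); //base, enclosing prefixes and root(s)
    for (Map.Entry<String, Integer> pref : stripper.getPrefixMorpheme().entrySet())
    {
        Element m4 = new Element("m4"); //one element per prefix
        m4.setAttribute("position", pref.getValue().toString());
        m4.setText(pref.getKey());
        base.addContent(m4);
    }
    for (String root : stripper.getWordRoot()) //compounds may yield several roots
    {
        Element m3 = new Element("m3");
        m3.setText(root);
        base.addContent(m3);
    }
    stem.addContent(base);
    for (Map.Entry<String, Integer> suf : stripper.getSuffixMorpheme().entrySet())
    {
        Element m5 = new Element("m5"); //one element per suffix
        m5.setAttribute("position", suf.getValue().toString());
        m5.setText(suf.getKey());
        stem.addContent(m5);
    }
    w.addContent(stem);
    return w;
}
\end{lstlisting}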
%AffixStripper explanation
Last, the analysis of the word structure will be considered. It is implemented
in the \emph{AffixStripper.java} file.
All lexical affix morphemes and their allomorphs as well as the inflections were extracted from the
OED\footnote{Oxford English Dictionary http://www.oed.com/} and saved as enumerated lists (see the example in listing \ref{src:enumPref}).
The allomorphic items of these lists are matched successively against the beginning of a word in the case of prefixes
(see listing \ref{src:analyzePref}, line \ref{ln:prefLoop}) or against its end in the case of suffixes
(see listing \ref{src:analyzeSuf}). Since each
morphemic variant maps to its morpheme right away, it makes sense to store the morpheme and so
implicitly keep the relation to its allomorph.
\begin{lstlisting}[language=java,caption={Enumeration Example for the Prefix "over"},label=src:enumPref,escapechar=|]
package custom.mycore.addons.morphilo;
public enum PrefixEnum {
...
over("over"), ufer("over"), ufor("over"), uferr("over"), uvver("over"), obaer("over"), ober("over)"), ofaer("over"),
ofere("over"), ofir("over"), ofor("over"), ofer("over"), ouer("over"),oferr("over"), offerr("over"), offr("over"), aure("over"),
war("over"), euer("over"), oferre("over"), oouer("over"), oger("over"), ouere("over"), ouir("over"), ouire("over"),
ouur("over"), ouver("over"), ouyr("over"), ovar("over"), overe("over"), ovre("over"),ovur("over"), owuere("over"), owver("over"),
houyr("over"), ouyre("over"), ovir("over"), ovyr("over"), hover("over"), auver("over"), awver("over"), ovver("over"),
hauver("over"), ova("over"), ove("over"), obuh("over"), ovah("over"), ovuh("over"), ofowr("over"), ouuer("over"), oure("over"),
owere("over"), owr("over"), owre("over"), owur("over"), owyr("over"), our("over"), ower("over"), oher("over"),
ooer("over"), oor("over"), owwer("over"), ovr("over"), owir("over"), oar("over"), aur("over"), oer("over"), ufara("over"),
ufera("over"), ufere("over"), uferra("over"), ufora("over"), ufore("over"), ufra("over"), ufre("over"), ufyrra("over"),
yfera("over"), yfere("over"), yferra("over"), uuera("over"), ufe("over"), uferre("over"), uuer("over"), uuere("over"),
vfere("over"), vuer("over"), vuere("over"), vver("over"), uvvor("over") ...
...
private String morpheme;
//constructor
PrefixEnum(String morpheme)
{
this.morpheme = morpheme;
}
//getter Method
public String getMorpheme()
{
return this.morpheme;
}
}
\end{lstlisting}
As can be seen in line \ref{ln:prefPutMorph} of listing \ref{src:analyzePref}, the morpheme is saved to a hash map together with its position, i.e. the current size of the
map plus one. In line \ref{ln:prefCutoff}, the \emph{analyzePrefix} method is called recursively until no more matches can be made.
\begin{lstlisting}[language=java,caption={Method to recognize prefixes},label=src:analyzePref,escapechar=|]
private Map<String, Integer> prefixMorpheme = new HashMap<String,Integer>();
...
private void analyzePrefix(String restword)
{
if (!restword.isEmpty()) //termination condition for the recursion
{
for (PrefixEnum prefEnum : PrefixEnum.values())|\label{ln:prefLoop}|
{
String s = prefEnum.toString();
if (restword.startsWith(s))
{
prefixMorpheme.put(s, prefixMorpheme.size() + 1);|\label{ln:prefPutMorph}|
//cut off the prefix that is added to the list
analyzePrefix(restword.substring(s.length()));|\label{ln:prefCutoff}|
}
else
{
analyzePrefix("");
}
}
}
}
\end{lstlisting}
The recognition of suffixes differs only in the cut-off direction, since suffixes occur at the end of a word.
Hence, in the case of suffixes, line \ref{ln:prefCutoff} of listing \ref{src:analyzePref} reads as follows.
\begin{lstlisting}[language=java,caption={Cut-off mechanism for suffixes},label=src:analyzeSuf,escapechar=|]
analyzeSuffix(restword.substring(0, restword.length() - s.length()));
\end{lstlisting}
It is important to note that inflections are suffixes (in the given model case of Middle English morphology) that usually occur only once, at the very
end of a word, i.e. after all lexical suffixes. It follows that inflections
have to be recognized first and without any repetition. So the procedure for inflections can be simplified
to a substantial degree, as listing \ref{src:analyzeInfl} shows.
\begin{lstlisting}[language=java,caption={Method to recognize inflections},label=src:analyzeInfl,escapechar=|]
private String analyzeInflection(String wrd)
{
String infl = "";
for (InflectionEnum inflEnum : InflectionEnum.values())
{
if (wrd.endsWith(inflEnum.toString()))
{
infl = inflEnum.toString();
}
}
return infl;
}
\end{lstlisting}
Unfortunately, the embeddedness problem prevents a very simple algorithm. Embeddedness occurs when a lexical item
is a substring of another lexical item. To illustrate, the suffix \emph{ion} is also contained in the suffix \emph{ation}, as is
\emph{ent} in \emph{ment}, and so on. The embeddedness problem cannot be solved completely on the basis of linear modelling, but
for a large part of embedded items one can work around it by implicitly using Zipf's law, i.e. the correlation between frequency
and length of lexical items: the longer a word becomes, the less frequently it occurs. The simplest conclusion is to assume
that longer suffix strings (measured in letters) are preferred over shorter ones, because the longer the suffix string,
the more likely it is that it represents one (as opposed to several) suffix unit(s). This is done in listing \ref{src:embedAffix}, where
the map \emph{sortedByLengthMap}, built with an anonymous comparator, sorts the affixes by length and the loop from line \ref{ln:deleteAffix} onwards deletes
the respective substrings.
\begin{lstlisting}[language=java,caption={Method to workaround embeddedness},label=src:embedAffix,escapechar=|]
private Map<String, Integer> sortOutAffixes(Map<String, Integer> affix)
{
Map<String,Integer> sortedByLengthMap = new TreeMap<String, Integer>(new Comparator<String>()
{
@Override
public int compare(String s1, String s2)
{
int cmp = Integer.compare(s1.length(), s2.length());
return cmp != 0 ? cmp : s1.compareTo(s2);
}
}
);
sortedByLengthMap.putAll(affix);
ArrayList<String> al1 = new ArrayList<String>(sortedByLengthMap.keySet());
ArrayList<String> al2 = new ArrayList<String>(al1); //copy, so that al1 keeps its original order
Collections.reverse(al2);
for (String s2 : al1)|\label{ln:deleteAffix}|
{
for (String s1 : al2)
if (s1.contains(s2) && s1.length() > s2.length())
{
affix.remove(s2);
}
}
return affix;
}
\end{lstlisting}
Finally, the position of the affix has to be calculated, because the hash map in line \ref{ln:prefPutMorph} of
listing \ref{src:analyzePref} does not keep the original order after the changes made to address affix embeddedness
(listing \ref{src:embedAffix}). Listing \ref{src:affixPos} depicts the preferred solution.
The recursive construction of the method is similar to \emph{private void analyzePrefix(String)} (listing \ref{src:analyzePref}),
only that the two affix types are handled in one method. For that, an additional parameter taking either the value \emph{suffix}
or \emph{prefix} is included.
\begin{lstlisting}[language=java,caption={Method to determine position of the affix},label=src:affixPos,escapechar=|]
private void getAffixPosition(Map<String, Integer> affix, String restword, int pos, String affixtype)
{
if (!restword.isEmpty()) //termination condition for the recursion
{
for (String s : affix.keySet())
{
if (restword.startsWith(s) && affixtype.equals("prefix"))
{
pos++;
prefixMorpheme.put(s, pos);
//prefixAllomorph.add(pos-1, restword.substring(s.length()));
getAffixPosition(affix, restword.substring(s.length()), pos, affixtype);
}
else if (restword.endsWith(s) && affixtype.equals("suffix"))
{
pos++;
suffixMorpheme.put(s, pos);
//suffixAllomorph.add(pos-1, restword.substring(s.length()));
getAffixPosition(affix, restword.substring(0, restword.length() - s.length()), pos, affixtype);
}
else
{
getAffixPosition(affix, "", pos, affixtype);
}
}
}
}
\end{lstlisting}
To give the complete word structure, the root of a word should also be provided. In listing \ref{src:rootAnalyze} a simple solution is offered that, however,
also considers compounds, i.e. words consisting of more than one root.
\begin{lstlisting}[language=java,caption={Method to determine roots},label=src:rootAnalyze,escapechar=|]
private ArrayList<String> analyzeRoot(Map<String, Integer> pref, Map<String, Integer> suf, int stemNumber)
{
ArrayList<String> root = new ArrayList<String>();
int j = 1; //one root always exists
// if word is a compound several roots exist
while (j <= stemNumber)
{
j++;
String rest = lemma;|\label{ln:lemma}|
for (int i=0;i<pref.size();i++)
{
for (String s : pref.keySet())
{
//if (i == pref.get(s))
if (rest.length() > s.length() && s.equals(rest.substring(0, s.length())))
{
rest = rest.substring(s.length(),rest.length());
}
}
}
for (int i=0;i<suf.size();i++)
{
for (String s : suf.keySet())
{
//if (i == suf.get(s))
if (s.length() < rest.length() && (s.equals(rest.substring(rest.length() - s.length(), rest.length()))))
{
rest = rest.substring(0, rest.length() - s.length());
}
}
}
root.add(rest);
}
return root;
}
\end{lstlisting}
The logic behind this method is that the root is the remainder of a word when all prefixes and suffixes are subtracted.
So the loops run through the number of prefixes and suffixes at each position and subtract the affix. Admittedly, there is
some code duplication with the previously described methods, which could be eliminated by making the code more modular in a possible
refactoring phase; again, this is not the concern of a prototype. Line \ref{ln:lemma} defines the initial state of a root,
which is already the final state for monomorphemic words. The \emph{lemma} is defined as the word token without the inflection. Listing
\ref{src:lemmaAnalyze} reveals how this class variable is calculated.
\begin{lstlisting}[language=java,caption={Method to determine lemma},label=src:lemmaAnalyze,escapechar=|]
/*
* Simplification: lemma = wordtoken - inflection
*/
private String analyzeLemma(String wrd, String infl)
{
return wrd.substring(0, wrd.length() - infl.length());
}
\end{lstlisting}
The constructor of \emph{AffixStripper} calls the method \emph{analyzeWord()},
whose only job is to calculate each structure element in the correct order
(listing \ref{src:analyzeWord}). All structure elements are also provided by getters.
\begin{lstlisting}[language=java,caption={Method to determine the complete word structure},label=src:analyzeWord,escapechar=|]
private void analyzeWord()
{
//analyze inflection first because it always occurs at the end of a word
inflection = analyzeInflection(wordtoken);
lemma = analyzeLemma(wordtoken, inflection);
analyzePrefix(lemma);
analyzeSuffix(lemma);
getAffixPosition(sortOutAffixes(prefixMorpheme), lemma, 0, "prefix");
getAffixPosition(sortOutAffixes(suffixMorpheme), lemma, 0, "suffix");
prefixNumber = prefixMorpheme.size();
suffixNumber = suffixMorpheme.size();
wordroot = analyzeRoot(prefixMorpheme, suffixMorpheme, getStemNumber());
}
\end{lstlisting}
To conclude, the Morphilo implementation as presented here aims at fulfilling the task of a working prototype. It is important to note
that it claims neither to be particularly efficient nor to be a software program ready for production use. However, it marks a crucial milestone
on the way to a production system. For some listings, potential improvements were made explicit; for others, no suggestions were made. In the latter
case this does not imply that there is no potential for improvement. Once acceptability tests are carried out, it will be the task of a follow-up project
to identify these potentials and implement them accordingly.
Data Model
==========
Conceptualization
-----------------
From both the user and the task requirements one can derive that four basic
functions of data processing need to be carried out. Data have to be read, persistently
saved, searched, and deleted. Furthermore, some kind of user management
and multi-user processing is necessary. In addition, the framework should
support web technologies, be well documented, and be easy to extend. Ideally, the
MVC pattern is realized.
\subsection{Data Model}\label{subsec:datamodel}
The guidelines of the
\emph{TEI} standard\footnote{http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf} on the
word level are in line with the structure described above in section \ref{subsec:morphologicalSystems}.
In listing \ref{lst:teiExamp} an
example is given for a possible markup at the word level for
\emph{comfortable}.\footnote{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-m.html}
\begin{lstlisting}[language=XML,
caption={TEI-example for 'comfortable'},label=lst:teiExamp]
<w type="adjective">
<m type="base">
<m type="prefix" baseForm="con">com</m>
<m type="root">fort</m>
</m>
<m type="suffix">able</m>
</w>
\end{lstlisting}
This data model reflects just one theoretical conception of a word structure model.
Crucially, the model emanates from the assumption
that the suffix node is on par with the word base. On the one hand, this
implies that the word stem directly dominates the suffix, but not the prefix. The prefix, on the
other hand, is enclosed in the base, which basically means a stronger lexical,
and less abstract, attachment to the root of a word. Modeling prefixes and suffixes on different
hierarchical levels has important consequences for the branching direction at
subword level (here right-branching). Leaving the theoretical interest aside, the
choice of the TEI standard is reasonable with a view to a sustainable architecture that allows for
exchanging data with little to no additional adjustment.
A drawback is that the model is not suitable for all languages.
It reflects a theoretical construction based on Indo-European
languages. As long as attention is paid to the languages for which this software is used, this will
not be problematic. The model fits most languages of the Indo-European
family, which correspond to the overwhelming majority of all research carried out
(unfortunately).
Implementation
--------------
As laid out in the task analysis in section \ref{subsec:datamodel}, it is
advantageous to use established standards. It was also shown that it makes sense
to keep the meta data of each corpus separate from the data model used for the
words to be analyzed.
For the present case, the TEI-standard was identified as an
appropriate markup for words. In terms of the implementation this means that
Whereas attributes of the objecttype are specific to the repository framework, the TEI-based word structure can be
recognized in the hierarchy of the metadata element starting with the name
\emph{w} (line \ref{src:wordbegin}).
\begin{lstlisting}[language=XML,caption={Word Data Model},label=lst:worddatamodel,escapechar=|]
<?xml version="1.0" encoding="UTF-8"?>
<objecttype
name="morphilo"
isChild="true"
isParent="true"
hasDerivates="true"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="datamodel.xsd">
<metadata>
<element name="morphiloContainer" type="xml" style="dontknow"
notinherit="true" heritable="false">
<xs:sequence>
<xs:element name="morphilo">
<xs:complexType>
<xs:sequence>
<xs:element name="w" minOccurs="0" maxOccurs="unbounded">|label{src:wordbegin}|
<xs:complexType mixed="true">
<xs:sequence>
<!-- stem -->
<xs:element name="m1" minOccurs="0" maxOccurs="unbounded">
<xs:complexType mixed="true">
<xs:sequence>
<!-- base -->
<xs:element name="m2" minOccurs="0" maxOccurs="unbounded">
<xs:complexType mixed="true">
<xs:sequence>
<!-- root -->
<xs:element name="m3" minOccurs="0" maxOccurs="unbounded">
<xs:complexType mixed="true">
<xs:attribute name="type" type="xs:string"/>
</xs:complexType>
</xs:element>
<!-- prefix -->
<xs:element name="m4" minOccurs="0" maxOccurs="unbounded">
<xs:complexType mixed="true">
<xs:attribute name="type" type="xs:string"/>
<xs:attribute name="PrefixbaseForm" type="xs:string"/>
<xs:attribute name="position" type="xs:string"/>
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="type" type="xs:string"/>
</xs:complexType>
</xs:element>
<!-- suffix -->
<xs:element name="m5" minOccurs="0" maxOccurs="unbounded">
<xs:complexType mixed="true">
<xs:attribute name="type" type="xs:string"/>
<xs:attribute name="SuffixbaseForm" type="xs:string"/>
<xs:attribute name="position" type="xs:string"/>
<xs:attribute name="inflection" type="xs:string"/>
</xs:complexType>
</xs:element>
</xs:sequence>
<!-- stem-Attribute -->
<xs:attribute name="type" type="xs:string"/>
<xs:attribute name="pos" type="xs:string"/>
<xs:attribute name="occurrence" type="xs:string"/>
</xs:complexType>
</xs:element>
</xs:sequence>
<!-- w -Attribute auf Wortebene -->
<xs:attribute name="lemma" type="xs:string"/>
<xs:attribute name="complexType" type="xs:string"/>
<xs:attribute name="wordtype" type="xs:string"/>
<xs:attribute name="occurrence" type="xs:string"/>
<xs:attribute name="corpus" type="xs:string"/>
<xs:attribute name="begin" type="xs:string"/>
<xs:attribute name="end" type="xs:string"/>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</element>
<element name="wordtype" type="classification" minOccurs="0" maxOccurs="1">
<classification id="wordtype"/>
</element>
<element name="complexType" type="classification" minOccurs="0" maxOccurs="1">
<classification id="complexType"/>
</element>
<element name="corpus" type="classification" minOccurs="0" maxOccurs="1">
<classification id="corpus"/>
</element>
<element name="pos" type="classification" minOccurs="0" maxOccurs="1">
<classification id="pos"/>
</element>
<element name="PrefixbaseForm" type="classification" minOccurs="0"
maxOccurs="1">
<classification id="PrefixbaseForm"/>
</element>
<element name="SuffixbaseForm" type="classification" minOccurs="0"
maxOccurs="1">
<classification id="SuffixbaseForm"/>
</element>
<element name="inflection" type="classification" minOccurs="0" maxOccurs="1">
<classification id="inflection"/>
</element>
<element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded" >
<target type="corpmeta"/>
</element>
</metadata>
</objecttype>
\end{lstlisting}
Additionally, it is worth mentioning that some attributes are modeled as a
\emph{classification}. All of these have to be listed
as separate elements in the data model. This has been done for all attributes
that are subject to little or no change. In fact, all suffix
and prefix morphemes of the language under investigation should be known in advance and are
therefore defined as classifications; a sketch of such a classification file is given below.
The same is true for the parts of speech, named \emph{pos} in the morphilo data
model above, for which the PENN-Treebank tagset was used.
Last, the different morphemic layers, named \emph{m} in the
standard, are changed to $m1$ through $m5$. This is the
only deviation from the standard that could be problematic if the data is to be
processed elsewhere and the change is not documented explicitly. Yet, it
was necessary because the MyCoRe repository throws errors caused by ambiguity
issues on the different $m$-layers.
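As an illustration, the following is a minimal sketch of how one of these classifications,
here \emph{wordtype}, might be defined as a MyCoRe classification file. The element names
follow the usual MyCoRe classification schema, but the file name, the ID, and the category
values are assumptions made for illustration only.
\begin{lstlisting}[language=XML,caption={Sketch of a classification file
(assumed values)},label=lst:classificationsketch]
<?xml version="1.0" encoding="UTF-8"?>
<!-- assumed example; the actual categories are defined by the project -->
<mycoreclass ID="wordtype">
  <label xml:lang="en" text="word type" description="morphological word type"/>
  <categories>
    <category ID="simplex">
      <label xml:lang="en" text="simplex"/>
    </category>
    <category ID="derivative">
      <label xml:lang="en" text="derivative"/>
    </category>
  </categories>
</mycoreclass>
\end{lstlisting}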
The second data model describes only very few properties of the text corpora
from which the words are extracted. Listing \ref{lst:corpusdatamodel} depicts
only the metadata element. For the prototype, this
data model is kept as simple as possible. The only obligatory field is the name of
the corpus. The dates of the corpus are optional because in
some cases a text cannot be dated reliably.
\begin{lstlisting}[language=XML,caption={Corpus Data
Model},label=lst:corpusdatamodel]
<metadata>
<!-- obligatory fields -->
<element name="korpusname" type="text" minOccurs="1" maxOccurs="1"/>
<!-- optional fields -->
<element name="sprache" type="text" minOccurs="0" maxOccurs="1"/>
<element name="size" type="number" minOccurs="0" maxOccurs="1"/>
<element name="datefrom" type="text" minOccurs="0" maxOccurs="1"/>
<element name="dateuntil" type="text" minOccurs="0" maxOccurs="1"/>
<!-- number of words -->
<element name="NoW" type="text" minOccurs="0" maxOccurs="1"/>
<element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded">
<target type="morphilo"/>
</element>
</metadata>
\end{lstlisting}
As a final remark, one might have noticed that all attributes are modelled as
strings although other data types are available and the fields encoding dates or
the number of words suggest otherwise. The MyCoRe framework even provides a
data type \emph{historydate}. There is no fully satisfying answer for not using it.
All that can be said is that data types other than string
later lead to problems in the interplay between the search engine and the
repository framework. These issues seem to be well known and can be followed up on
GitHub.
Framework
=========
\begin{figure}
\centering
\includegraphics[scale=0.33]{mycore_architecture-2.png}
\caption[MyCoRe-Architecture and Components]{MyCoRe-Architecture and Components\protect\footnotemark}
\label{fig:abbMyCoReStruktur}
\end{figure}
\footnotetext{source: https://www.mycore.de}
To adapt the MyCoRe framework, the morphilo application logic has to be implemented,
the TEI data model specified, and the input, search, and output masks programmed.
There are three directories which are
important for adjusting the MyCoRe framework to the needs of one's own application. These three directories
correspond essentially to the three components of the MVC model as explicated in
section \ref{subsec:mvc}. Roughly, they are visualized in figure \ref{fig:abbMyCoReStruktur} in the upper
right-hand corner. More precisely, the view (\emph{Layout} in figure \ref{fig:abbMyCoReStruktur}) and the model layer
(\emph{Datenmodell} in figure \ref{fig:abbMyCoReStruktur}) can be configured
completely via the ``interface'', which is a directory with a predefined
structure and some standard files. For the configuration of the logic, an extra directory is provided
(\emph{/src/main/java/custom/mycore/addons/}). Here, all Java classes
extending the controller layer should be added.
Practically, all three MVC layers are placed in the
\emph{src/main/} directory of the application. In one of its subdirectories,
\emph{datamodel/def}, the data model specifications are defined as XML files; this parallels the model
layer in the MVC pattern. How the data model was defined will be explained in
section \ref{subsec:datamodelimpl}. A sketch of the resulting directory layout is given below.
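The following sketch summarizes how these directories relate to the MVC layers. It only
lists the paths mentioned in this documentation and is not a complete picture of a MyCoRe
application directory tree.
\begin{lstlisting}[caption={Directory layout of the morphilo
application (simplified)},label=lst:dirlayout]
src/main/
  java/custom/mycore/addons/   controller: Java classes with the application logic
  datamodel/def/               model: data model definitions as XML files
  resources/                   view: XSL stylesheets, images, styles, javascripts
\end{lstlisting}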
View
====
Conceptualization
-----------------
Lastly, the third directory (\emph{src/main/resources}) contains all code needed
for rendering the data to be displayed on the screen, so it corresponds to
the view in the MVC approach. The rendering is done by XSL files that (unfortunately)
contain some logic that really belongs to the controller. Thus, the division is
not as clear as the theory implies. I will discuss this issue more specifically in the
relevant subsection below. Among the resources are also all images, styles, and
javascripts.
Implementation
--------------
As explained in section \ref{subsec:mvc}, the view component handles the visual
representation in the form of an interface that allows interaction between
the user and the task to be carried out by the machine. Since the present case is a
web service, all interaction happens via a browser, i.e. webpages are
visualized and responses are recognized by registering mouse or keyboard
events. More specifically, a webpage is rendered by transforming XML documents
into HTML pages. The MyCoRe repository framework uses Xalan, an open source XSLT
processor from Apache.\footnote{http://xalan.apache.org} This engine
transforms document nodes addressed by XPath expressions into hypertext, making
use of a special form of template matching. All templates are collected in so-called
XML-encoded stylesheets. Since there are two data models with two
different structures, it is good practice to define two stylesheet files, one for
each data model.
As a demonstration, listing \ref{lst:morphilostylesheet} below gives a short
extract for rendering the word data.
\begin{lstlisting}[language=XML,caption={stylesheet
morphilo.xsl},label=lst:morphilostylesheet]
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xalan="http://xml.apache.org/xalan"
xmlns:i18n="xalan://org.mycore.services.i18n.MCRTranslation"
xmlns:acl="xalan://org.mycore.access.MCRAccessManager"
xmlns:mcr="http://www.mycore.org/" xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:mods="http://www.loc.gov/mods/v3"
xmlns:encoder="xalan://java.net.URLEncoder"
xmlns:mcrxsl="xalan://org.mycore.common.xml.MCRXMLFunctions"
xmlns:mcrurn="xalan://org.mycore.urn.MCRXMLFunctions"
exclude-result-prefixes="xalan xlink mcr i18n acl mods mcrxsl mcrurn encoder"
version="1.0">
<xsl:param name="MCR.Users.Superuser.UserName"/>
<xsl:template match="/mycoreobject[contains(@ID,'_morphilo_')]">
<head>
<link href="{$WebApplicationBaseURL}css/file.css" rel="stylesheet"/>
</head>
<div class="row">
<xsl:call-template name="objectAction">
<xsl:with-param name="id" select="@ID"/>
<xsl:with-param name="deriv" select="structure/derobjects/derobject/@xlink:href"/>
</xsl:call-template>
<xsl:variable name="objID" select="@ID"/>
<!-- set the heading here -->
<h1 style="text-indent: 4em;">
<xsl:if test="metadata/def.morphiloContainer/morphiloContainer/morphilo/w">
<xsl:value-of select="metadata/def.morphiloContainer/morphiloContainer/morphilo/w/text()[string-length(normalize-space(.))>0]"/>
</xsl:if>
</h1>
<dl class="dl-horizontal">
<!-- (1) Display word -->
<xsl:if test="metadata/def.morphiloContainer/morphiloContainer/morphilo/w">
<dt>
<xsl:value-of select="i18n:translate('response.page.label.word')"/>
</dt>
<dd>
<xsl:value-of select="metadata/def.morphiloContainer/morphiloContainer/morphilo/w/text()[string-length(normalize-space(.))>0]"/>
</dd>
</xsl:if>
<!-- (2) Display lemma -->
...
</xsl:template>
...
<xsl:template name="objectAction">
...
</xsl:template>
...
</xsl:stylesheet>
\end{lstlisting}
This template matches the root node of each \emph{MyCoRe object}, ensuring that a valid MyCoRe model is
used and checking that the document to be processed contains a unique
identifier, here a \emph{MyCoRe-ID}, and the name of the correct data model,
here \emph{morphilo}.
Then another template, \emph{objectAction}, is called with two parameters, the ids
of the document object and of the attached files. In the remainder, all relevant
information from the document, such as the word and the lemma, is accessed by XPath
and, enriched with hypertext annotations, rendered as a hypertext document.
The template \emph{objectAction} is key to understanding the coupling process in the software
framework. It is therefore listed separately in listing \ref{lst:objActionTempl}.
\begin{lstlisting}[language=XML,caption={template
objectAction},label=lst:objActionTempl,escapechar=|]
<xsl:template name="objectAction">
<xsl:param name="id" select="./@ID"/>
<xsl:param name="accessedit" select="acl:checkPermission($id,'writedb')"/>
<xsl:param name="accessdelete" select="acl:checkPermission($id,'deletedb')"/>
<xsl:variable name="derivCorp" select="./@label"/>
<xsl:variable name="corpID" select="metadata/def.corpuslink[@class='MCRMetaLinkID']/corpuslink/@xlink:href"/>
<xsl:if test="$accessedit or $accessdelete">|\label{ln:ng}|
<div class="dropdown pull-right">
<xsl:if test="string-length($corpID) &gt; 0 or $CurrentUser='administrator'">
<button class="btn btn-default dropdown-toggle" style="margin:10px" type="button" id="dropdownMenu1" data-toggle="dropdown" aria-expanded="true">
<span class="glyphicon glyphicon-cog" aria-hidden="true"></span> Annotieren
<span class="caret"></span>
</button>
</xsl:if>
<xsl:if test="string-length($corpID) &gt; 0">|\label{ln:ru}|
<xsl:variable name="ifsDirectory" select="document(concat('ifs:/',$derivCorp))"/>
<ul class="dropdown-menu" role="menu" aria-labelledby="dropdownMenu1">
<li role="presentation">
|\label{ln:nw1}|<a href="{$ServletsBaseURL}object/tag{$HttpSession}?id={$derivCorp}&amp;objID={$corpID}" role="menuitem" tabindex="-1">|\label{ln:nw2}|
<xsl:value-of select="i18n:translate('object.nextObject')"/>
</a>
</li>
<li role="presentation">
<a href="{$WebApplicationBaseURL}receive/{$corpID}" role="menuitem" tabindex="-1">
<xsl:value-of select="i18n:translate('object.backToProject')"/>
</a>
</li>
</ul>
</xsl:if>
<xsl:if test="$CurrentUser='administrator'">
<ul class="dropdown-menu" role="menu" aria-labelledby="dropdownMenu1">
<li role="presentation">
<a role="menuitem" tabindex="-1" href="{$WebApplicationBaseURL}content/publish/morphilo.xed?id={$id}">
<xsl:value-of select="i18n:translate('object.editWord')"/>
</a>
</li>
<li role="presentation">
<a href="{$ServletsBaseURL}object/delete{$HttpSession}?id={$id}" role="menuitem" tabindex="-1" class="confirm_deletion option" data-text="Wirklich loeschen">
<xsl:value-of select="i18n:translate('object.delWord')"/>
</a>
</li>
</ul>
</xsl:if>
</div>
<div class="row" style="margin-left:0px; margin-right:10px">
<xsl:apply-templates select="structure/derobjects/derobject[acl:checkPermission(@xlink:href,'read')]">
<xsl:with-param name="objID" select="@ID"/>
</xsl:apply-templates>
</div>
</xsl:if>
</xsl:template>
\end{lstlisting}
The \emph{objectAction} template defines the selection menu that appears -- once manual tagging has
started -- on the upper right-hand side of the webpage, entitled
\emph{Annotieren} and displaying the two options \emph{next word} and \emph{back
to project}.
The first thing to note here is that in line \ref{ln:ng} a simple test
excludes all guest users from accessing the procedure. After ensuring that only
the user who owns the corpus project has access (line \ref{ln:ru}), s/he is
able to open the drop-down menu, whose entries are really urls, e.g. line
\ref{ln:nw1}. The attentive reader might have noticed that
the url exactly matches the definition in web-fragment.xml as shown in
listing \ref{lst:webfragment}, line \ref{ln:tag}, which resolves to the
respective Java class there. In fact, this mechanism is the data interface within the
MVC pattern. The url also contains two variables, named \emph{derivCorp} and
\emph{corpID}, that the Java classes need to identify the corpus and file object
(see section \ref{sec:javacode}).
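Since listing \ref{lst:webfragment} is discussed elsewhere, a minimal sketch of what such a
servlet mapping in a web-fragment.xml might look like is given here. The servlet class name
\emph{TagCorpusServlet} and the exact url pattern are assumptions made for illustration and may
differ from the actual definitions in listing \ref{lst:webfragment}.
\begin{lstlisting}[language=XML,caption={Sketch of a servlet mapping in
web-fragment.xml (assumed names)},label=lst:webfragmentsketch]
<web-fragment xmlns="http://java.sun.com/xml/ns/javaee" version="3.0">
  <!-- assumed servlet class implementing the manual tagging logic -->
  <servlet>
    <servlet-name>TagCorpusServlet</servlet-name>
    <servlet-class>custom.mycore.addons.TagCorpusServlet</servlet-class>
  </servlet>
  <!-- maps the url built in the stylesheet (object/tag) to that class -->
  <servlet-mapping>
    <servlet-name>TagCorpusServlet</servlet-name>
    <url-pattern>/servlets/object/tag</url-pattern>
  </servlet-mapping>
</web-fragment>
\end{lstlisting}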
The morphilo.xsl stylesheet contains yet another modification that deserves mention.
In listing \ref{lst:derobjectTempl}, line \ref{ln:morphMenu}, two menu options --
\emph{Tag automatically} and \emph{Tag manually} -- are defined. The former option
initiates ProcessCorpusServlet.java, as can be seen again in listing \ref{lst:webfragment},
line \ref{ln:process}, which determines the words that are not yet in the master database.
It is important to note that this menu option is only displayed if two restrictions
are met: first, a file has to be uploaded (line \ref{ln:1test}) and, second, there must be
only one file (line \ref{ln:2test}). This is necessary because the annotation process generates further files,
namely one that stores the words not yet processed and one that includes the final result. The
generated files follow a certain naming pattern: the file containing the final, fully TEI-annotated
corpus is prefixed with \emph{tagged}, the other file with \emph{untagged}. This circumstance
is exploited for controlling the second option (line \ref{ln:loop}). A loop runs through all
files in the respective directory and, if a file name starts with \emph{untagged},
the option to tag manually is displayed.
\begin{lstlisting}[language=XML,caption={template
matching derobject},label=lst:derobjectTempl,escapechar=|]
<xsl:template match="derobject" mode="derivateActions">
<xsl:param name="deriv" />
<xsl:param name="parentObjID" />
<xsl:param name="suffix" select="''" />
<xsl:param name="id" select="../../../@ID" />
<xsl:if test="acl:checkPermission($deriv,'writedb')">
<xsl:variable name="ifsDirectory" select="document(concat('ifs:',$deriv,'/'))" />
<xsl:variable name="path" select="$ifsDirectory/mcr_directory/path" />
...
<div class="options pull-right">
<div class="btn-group" style="margin:10px">
<a href="#" class="btn btn-default dropdown-toggle" data-toggle="dropdown">
<i class="fa fa-cog"></i>
<xsl:value-of select="' Korpus'"/>
<span class="caret"></span>
</a>
<ul class="dropdown-menu dropdown-menu-right">
<!-- Morphilo adjustments -->|\label{ln:morphMenu}|
<xsl:if test="string-length($deriv) &gt; 0">|\label{ln:1test}|
<xsl:if test="count($ifsDirectory/mcr_directory/children/child) = 1">|\label{ln:2test}|
<li role="presentation">
<a href="{$ServletsBaseURL}object/process{$HttpSession}?id={$deriv}&amp;objID={$id}" role="menuitem" tabindex="-1">
<xsl:value-of select="i18n:translate('derivate.process')"/>
</a>
</li>
</xsl:if>
<xsl:for-each select="$ifsDirectory/mcr_directory/children/child">|\label{ln:loop}|
<xsl:variable name="untagged" select="concat($path, 'untagged')"/>
<xsl:variable name="filename" select="concat($path,./name)"/>
<xsl:if test="starts-with($filename, $untagged)">
<li role="presentation">
<a href="{$ServletsBaseURL}object/tag{$HttpSession}?id={$deriv}&amp;objID={$id}" role="menuitem" tabindex="-1">
<xsl:value-of select="i18n:translate('derivate.taggen')"/>
</a>
</li>
</xsl:if>
</xsl:for-each>
</xsl:if>
...
</ul>
</div>
</div>
</xsl:if>
</xsl:template>
\end{lstlisting}
Besides the two stylesheets morphilo.xsl and corpmeta.xsl, other stylesheets have
to be adjusted. They will not be discussed in detail here, for they are self-explanatory for the most part.
Essentially, they render the overall layout (\emph{common-layout.xsl}, \emph{skeleton\_layout\_template.xsl}),
the presentation
of the search results (\emph{response-page.xsl}), and the definitions of the Solr search fields (\emph{searchfields-solr.xsl}).
The former and the latter also inherit templates from \emph{response-general.xsl} and \emph{response-browse.xsl}, in which the
navigation bar of the search results can be changed. For multilingual support, a separate configuration directory
has to be created containing one \emph{.properties} file per
language to be displayed. In the current case these are restricted to German and English (\emph{messages\_de.properties} and \emph{messages\_en.properties}).
The property files include all \emph{i18n} definitions; an illustrative extract is sketched below. All these files are located in the \emph{resources} directory.
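The following is a minimal sketch of what a few entries in \emph{messages\_en.properties} might
look like. The keys are those referenced via \emph{i18n:translate} in the stylesheet extracts above,
whereas the values are assumptions made for illustration.
\begin{lstlisting}[caption={Sketch of i18n definitions in
messages\_en.properties (values assumed)},label=lst:i18nsketch]
# keys referenced in morphilo.xsl; values are illustrative assumptions
response.page.label.word = Word
object.nextObject = next word
object.backToProject = back to project
derivate.process = Tag automatically
derivate.taggen = Tag manually
\end{lstlisting}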
Furthermore, a search mask and a page for manually entering the annotations had
to be designed.
For these files, the repository framework recommends using a specially designed XML
standard (\emph{xed}).