diff --git a/.project b/.project
new file mode 100644
index 0000000000000000000000000000000000000000..60c6a08b90f94e32c21e718266f3acfbbe54cecc
--- /dev/null
+++ b/.project
@@ -0,0 +1,11 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<projectDescription>
+ <name>morphilo_doc</name>
+ <comment></comment>
+ <projects>
+ </projects>
+ <buildSpec>
+ </buildSpec>
+ <natures>
+ </natures>
+</projectDescription>
diff --git a/Morphilo_doc/source/architecture.rst b/Morphilo_doc/source/architecture.rst
new file mode 100644
index 0000000000000000000000000000000000000000..5aa450318ac8d14f2947c8546ef17cbe5b94a73f
--- /dev/null
+++ b/Morphilo_doc/source/architecture.rst
@@ -0,0 +1,69 @@
+Software Design
+===============
+
+\begin{figure}
+ \centering
+ \includegraphics[scale=0.8]{architecture.pdf}
+ \caption{Morphilo Architecture}
+ \label{fig:architect}
+\end{figure}
+
+The architecture of a possible \emph{take-and-share} approach for language
+resources is visualized in figure \ref{fig:architect}. Because the very gist
+of the approach becomes clearer when described by means of a concrete example,
+the case of annotating lexical derivatives of Middle English and a respective
+database is given as an illustration.
+However, any other tool that helps with manual annotations and manages the metadata of a corpus could be
+substituted here instead.
+
+After inputting an untagged corpus or plain text, it is determined whether the
+input material was annotated previously by a different user. This information is
+usually provided by the metadata administered by the annotation tool; in the case at
+hand it is called \emph{Morphilizer} in figure \ref{fig:architect}. An
+alternative is a simple table look-up for all occurring words in the datasets Corpus 1 through Corpus n. If all words
+are contained, the \emph{yes}-branch is followed up further -- otherwise the \emph{no}-branch
+succeeds. The difference between the two branches is subtle, yet crucial. On
+both branches, the annotation tool (here \emph{Morphilizer}) is called, which, first,
+sorts out all words that are not contained in the master database (here \emph{Morphilo-DB})
+and, second, makes reasonable suggestions for an optimal annotation of
+the items. In both cases the
+annotations are linked to the respective items (e.g. words) in the
+text, but they are also persistently saved in an extra dataset, i.e. Corpus 1
+through n, together with all available metadata.
+
+The difference between the two information streams is that
+in the \emph{yes}-branch a comparison between the newly created dataset and
+all of the previous datasets of this text is carried out. Within this
+unit, all deviations and congruencies are marked and counted. The underlying
+assumption is that with a growing number of comparable texts the
+correct annotations approach the theoretical true value of a correct annotation
+while errors level out, provided that the sample size is large enough. What the
+distribution of errors and correct annotations looks like exactly, and whether a
+normal distribution can be assumed, is still an object of ongoing research; but
+independently of the concrete results, the component (called \emph{compare
+manual annotations} in figure \ref{fig:architect}) allows for specifying the
+exact form of the sample population.
+In fact, it is necessary at that point to define the form of the distribution,
+the sample size, and the rejection region. The standard settings are a normal
+distribution, a rejection region of $\alpha = 0.05$, and a sample size of $30$, so
+that a simple Gau\ss-Test can be calculated.
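+
+As a minimal illustration of such a test, consider the following sketch. It is
+not part of the Morphilo code base: the class and method names are invented,
+and a two-sided test with the critical value $1.96$ for $\alpha = 0.05$ is
+assumed.
+
+\begin{lstlisting}[language=java,caption={Hypothetical sketch of a simple Gau\ss-Test},label=src:gaussSketch]
+public class GaussTestSketch
+{
+    /* One-sample z-test: compares the mean agreement rate of n >= 30
+     * annotations against an assumed true value mu0, given a known
+     * standard deviation sigma. Returns true if the null hypothesis
+     * (mean equals mu0) is rejected.
+     */
+    public static boolean rejectsNull(double sampleMean, double mu0, double sigma, int n)
+    {
+        double z = (sampleMean - mu0) / (sigma / Math.sqrt(n));
+        return Math.abs(z) > 1.96; //critical value for alpha = 0.05 (two-sided)
+    }
+}
+\end{lstlisting}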
+
+Continuing the information flow further, these statistical calculations are
+delivered to the quality-control component. Based on the statistics, the
+respective items together with the metadata, frequencies, and, of course,
+annotations are written to the master database. All information in the master
+database is directly used for automated annotations. Thus it is directly matched
+to the input texts or corpora respectively through the \emph{Morphilizer} tool.
+Based on the entries looked up in the master, the annotation tool decides which
+items are to be manually annotated.
+
+The processes just described are all hidden from the user, who has no way to
+influence the set quality standards other than through errors in the annotation process. The
+user will only see the number of items of the input text that he or she has to process manually. The
+annotator will also see an estimation of the workload beforehand. Based on this
+number, a decision can be made whether to start the annotation at all. It will be
+possible to interrupt the annotation work and save progress on the server. And
+the user will have access to the annotations made in the respective dataset,
+correct them, or save them and resume later. It is important to note that the user will receive
+the tagged document only after all items are fully annotated. No partially
+tagged text can be output.
\ No newline at end of file
diff --git a/Morphilo_doc/source/controller.rst b/Morphilo_doc/source/controller.rst
index 9d4b5110e710c4b2b09f540962a9485c6ef131cc..6f6b896272e6cade94e54ff34fd76af6bfc28326 100644
--- a/Morphilo_doc/source/controller.rst
+++ b/Morphilo_doc/source/controller.rst
@@ -2,4 +2,848 @@ Controller Adjustments
 ======================
 
 General Principle of Operation
-------------------------------
\ No newline at end of file
+------------------------------
+
+Figure \ref{fig:classDiag} illustrates the dependencies of the five Java classes that were integrated to add the Morphilo
+functionality defined in the default package \emph{custom.mycore.addons.morphilo}. The general principle of operation
+is the following. The handling of data search, upload, saving, and user
+authentication is fully left to the MyCoRe functionality, which is completely
+implemented. The class \emph{ProcessCorpusServlet.java} receives a request from the web interface to process an uploaded file,
+i.e. a simple text corpus, and it checks if any of the words are available in the master database. All words that are not
+listed in the master database are written to an extra file. These are the words that have to be manually annotated. At the end, the
+servlet sends a response back to the user interface. In case all words are contained in the master, an XML file is generated from the
+master database that includes all annotated words of the original corpus. Usually this will not be the case for larger text files.
+So if some words are not in the master, the user will get the response to initiate the manual annotation process.
+
+The manual annotation process is handled by the class
+\emph{{Tag\-Corpus\-Serv\-let\-.ja\-va}}, which will build a JDOM object for the first word in the extra file.
+This is done by creating an object of the \emph{JDOMorphilo.java} class. This class, in turn, will use the methods of
+\emph{AffixStripper.java} that make simple, but reasonable, suggestions on the word structure. This JDOM object is then
+given as a response back to the user. It is presented as a form, in which the user can make changes.
+This is necessary because the word structure algorithm of \emph{AffixStripper.java} errs in some cases. Once the user
+agrees on the suggestions or on his or her corrections, the JDOM object is saved as an XML file that is only searchable,
+visible, and changeable by the authenticated user (and the administrator); another file containing all processed words
+is created or updated respectively, and the \emph{TagCorpusServlet.java} servlet will restart until the last word in the
+extra list is processed. This enables the user to stop and resume her or his annotation work at a later point in time. The
+\emph{TagCorpusServlet} will call methods from \emph{ProcessCorpusServlet.java} to adjust the content of the extra
+files harboring the untagged words. If this file is empty, and only then, it is replaced by the file comprising all words
+from the original text file, both the ones from the master database and the ones that were annotated by the user,
+in an annotated XML representation.
+
+Each time \emph{ProcessCorpusServlet.java} is instantiated, it also instantiates \emph{QualityControl.java}. This class checks if a
+new word can be transferred to the master database. The algorithm can be freely adapted to higher or lower quality standards.
+In its present configuration, a method tests against a limit of 20 different
+registered users agreeing on the annotation of the same word. More specifically,
+if 20 JDOM objects are identical except in the attribute field \emph{occurrences} in the metadata node, the JDOM object becomes
+part of the master. The latter is easily done by changing the attribute \emph{creator} from the user name
+to \emph{``administrator''} in the service node. This makes the dataset part of the master database. Moreover, the \emph{occurrences}
+attribute is updated by adding up all occurrences of the word that stem from
+different text corpora of the same time range.
+\begin{landscape}
+ \begin{figure}
+ \centering
+ \includegraphics[scale=0.55]{morphilo_uml.png}
+ \caption{Class Diagram Morphilo}
+ \label{fig:classDiag}
+ \end{figure}
+\end{landscape}
+
+
+
+Conceptualization
+-----------------
+
+The controller component is largely
+specified and ready to use in some hundred or so Java classes handling the
+logic of the search, such as indexing, but also dealing with directories and
+files, i.e. saving, creating, deleting, and updating files.
+Moreover, a rudimentary user management comprising different roles and
+rights is offered. The basic technology behind the controller's logic is the
+servlet. As such, all new code has to be registered as a servlet in the
+web-fragment.xml (here for the Apache Tomcat container), as listing \ref{lst:webfragment} shows.
+
+\begin{lstlisting}[language=XML,caption={Servlet Registering in the
+web-fragment.xml (excerpt)},label=lst:webfragment,escapechar=|]
+<servlet>
+    <servlet-name>ProcessCorpusServlet</servlet-name>
+    <servlet-class>custom.mycore.addons.morphilo.ProcessCorpusServlet</servlet-class>
+</servlet>
+<servlet-mapping>
+    <servlet-name>ProcessCorpusServlet</servlet-name>
+    <url-pattern>/servlets/object/process</url-pattern>|\label{ln:process}|
+</servlet-mapping>
+<servlet>
+    <servlet-name>TagCorpusServlet</servlet-name>
+    <servlet-class>custom.mycore.addons.morphilo.TagCorpusServlet</servlet-class>
+</servlet>
+<servlet-mapping>
+    <servlet-name>TagCorpusServlet</servlet-name>
+    <url-pattern>/servlets/object/tag</url-pattern>|\label{ln:tag}|
+</servlet-mapping>
+\end{lstlisting}
+
+Now, the logic has to be extended by the specifications analyzed in chapter
+\ref{chap:concept} on conceptualization. More specifically, some
+classes have to be added that take care of analyzing words
+(\emph{AffixStripper.java, InflectionEnum.java, SuffixEnum.java,
+PrefixEnum.java}), extracting the relevant words from the text and checking the
+uniqueness of the text (\emph{ProcessCorpusServlet.java}), making reasonable
+suggestions on the annotation (\emph{TagCorpusServlet.java}), building the object
+of each annotated word (\emph{JDOMorphilo.java}), and checking the quality by applying
+statistical models (\emph{QualityControl.java}).
+
+Implementation
+--------------
+
+Having taken a bird's eye perspective in the previous chapter, it is now time to take a look at the specific implementation at the level
+of methods. Starting with the main servlet, \emph{ProcessCorpusServlet.java}, the class defines five getter methods:
+\renewcommand{\labelenumi}{(\theenumi)}
+\begin{enumerate}
+    \item\label{itm:geturl} public String getURLParameter(MCRServletJob, String)
+    \item\label{itm:getcorp} public String getCorpusMetadata(MCRServletJob, String)
+    \item\label{itm:getcont} public ArrayList<String> getContentFromFile(MCRServletJob, String)
+    \item\label{itm:getderiv} public Path getDerivateFilePath(MCRServletJob, String)
+    \item\label{itm:now} public int getNumberOfWords(MCRServletJob job, String)
+\end{enumerate}
+Since each servlet in MyCoRe extends the class MCRServlet, it has access to MCRServletJob, from which the http requests and responses
+can be used. This is the first argument in the above methods. The second argument of method (\ref{itm:geturl}) specifies the name of a URL parameter, i.e.
+the object id or the id of the derivate. The method returns the value of the given parameter. Typically, MyCoRe uses the URL to exchange
+these ids. The second method provides us with the value of a data field in the xml document, so the string defines the name of an attribute.
+\emph{getContentFromFile(MCRServletJob, String)} returns the words as a list from a file when given the filename as a string.
+The getter listed in (\ref{itm:getderiv}) returns the path from the MyCoRe repository when the name of
+the file is specified. And finally, method (\ref{itm:now}) returns the number of words by simply returning
+\emph{getContentFromFile(job, fileName).size()}.
+
+There are two methods in every MyCoRe servlet that have to be overwritten:
+\emph{protected void render(MCRServletJob, Exception)}, which redirects the requests as \emph{POST} or \emph{GET} responses, and
+\emph{protected void think(MCRServletJob)}, in which the logic is implemented.
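+A skeletal servlet built on these two methods could look as follows. This is a
+hedged sketch, not part of the Morphilo code: the class name and the redirect
+target are invented for illustration, and only the two overridden methods named
+above are assumed.
+
+\begin{lstlisting}[language=java,caption={Hypothetical skeleton of a MyCoRe servlet},label=src:servletSkeleton]
+package custom.mycore.addons.morphilo;
+
+import org.mycore.frontend.servlets.MCRServlet;
+import org.mycore.frontend.servlets.MCRServletJob;
+
+public class SkeletonServlet extends MCRServlet
+{
+    @Override
+    protected void think(MCRServletJob job) throws Exception
+    {
+        //the actual logic goes here, e.g. reading URL parameters and files
+    }
+
+    @Override
+    protected void render(MCRServletJob job, Exception ex) throws Exception
+    {
+        //redirect the POST or GET response, e.g. back to an overview page
+        job.getResponse().sendRedirect("/content/main/index.xed"); //illustrative target
+    }
+}
+\end{lstlisting}
+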
+Since the \emph{think} method is important to understand the
+core idea of the Morphilo algorithm, it is displayed in full length in source code \ref{src:think}.
+
+\begin{lstlisting}[language=java,caption={The overwritten think method},label=src:think,escapechar=|]
+protected void think(MCRServletJob job) throws Exception
+{
+    this.job = job;
+    String dateFromCorp = getCorpusMetadata(job, "def.datefrom");
+    String dateUntilCorp = getCorpusMetadata(job, "def.dateuntil");
+    String corpID = getURLParameter(job, "objID");
+    String derivID = getURLParameter(job, "id");
+
+    //if NoW is 0, fill with anzWords
+    MCRObject helpObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(corpID));|\label{ln:bugfixstart}|
+    Document jdomDocHelp = helpObj.createXML();
+    XPathFactory xpfacty = XPathFactory.instance();
+    XPathExpression<Element> xpExp = xpfacty.compile("//NoW", Filters.element());
+    Element elem = xpExp.evaluateFirst(jdomDocHelp);
+    //fixes transferred morphilo data from previous stand-alone project
+    int corpussize = getNumberOfWords(job, "");
+    if (Integer.parseInt(elem.getText()) != corpussize)
+    {
+        elem.setText(Integer.toString(corpussize));
+        helpObj = new MCRObject(jdomDocHelp);
+        MCRMetadataManager.update(helpObj);
+    }|\label{ln:bugfixend}|
+
+    //Check if the uploaded corpus was processed before
+    SolrClient slr = MCRSolrClientFactory.getSolrClient();|\label{ln:solrstart}|
+    SolrQuery qry = new SolrQuery();
+    qry.setFields("korpusname", "datefrom", "dateuntil", "NoW", "id");
+    qry.setQuery("datefrom:" + dateFromCorp + " AND dateuntil:" + dateUntilCorp + " AND NoW:" + corpussize);
+    SolrDocumentList rslt = slr.query(qry).getResults();|\label{ln:solrresult}|
+
+    Boolean incrOcc = true;
+    //if the result set contains only one entry, it must be the newly created corpus
+    if (slr.query(qry).getResults().getNumFound() > 1)
+    {
+        incrOcc = false;
+    }|\label{ln:solrend}|
+
+    //match all words in the corpus with morphilo (creator=administrator) and save all words that are not in the morphilo DB in leftovers
+    ArrayList<String> leftovers = new ArrayList<String>();
+    ArrayList<String> processed = new ArrayList<String>();
+
+    leftovers = getUnknownWords(getContentFromFile(job, ""), dateFromCorp, dateUntilCorp, "", incrOcc, incrOcc, false);|\label{ln:callkeymeth}|
+
+    //write all words of leftovers in a file as derivative to the respective corpmeta dataset
+    MCRPath root = MCRPath.getPath(derivID, "/");|\label{ln:filesavestart}|
+    Path fn = getDerivateFilePath(job, "").getFileName();
+    Path p = root.resolve("untagged-" + fn);
+    Files.write(p, leftovers);|\label{ln:filesaveend}|
+
+    //create a file for all words that were processed
+    Path procWds = root.resolve("processed-" + fn);
+    Files.write(procWds, processed);
+}
+\end{lstlisting}
+Using the above-mentioned getter methods, the \emph{think} method assigns values to the object ID, which is needed to get the xml document
+that contains the corpus metadata, to the file ID, and to the beginning and ending dates of the corpus to be analyzed. Lines \ref{ln:bugfixstart}
+through \ref{ln:bugfixend} show how to access a MyCoRe object as an xml document, a procedure that will be used in different variants
+throughout this implementation.
+By means of the object ID, the respective corpus is identified and a JDOM document is constructed, which can then be accessed
+by XPath. The compiled XPath expressions evaluate to collections of xml nodes.
+In the present case, it is safe to assume that only one element
+of \emph{NoW} is available (see the corpus data model in listing \ref{lst:corpusdatamodel} with $maxOccurs='1'$). So we do not have to loop through
+the collection, but can use the first node named \emph{NoW}. The if-test checks whether the number of words of the uploaded file is the
+same as the number written in the document. When the document is initially created by the MyCoRe logic, this number is configured to be zero.
+If unequal, the setText(String) method is used to write the number of words of the corpus to the document.
+
+Lines \ref{ln:solrstart}--\ref{ln:solrend} reveal the second important ingredient, i.e. controlling the search engine. First, a solr
+client and a query are initialized. Then, the output of the result set is defined by giving the fields of interest of the document.
+In the case at hand, these are the id, the name of the corpus, the number of words, and the beginning and ending dates. With \emph{setQuery}
+it is possible to assign values to some or all of these fields. Finally, \emph{getResults()} carries out the search and writes
+all hits to a \emph{SolrDocumentList} (line \ref{ln:solrresult}). The test that follows really only sets a Boolean
+encoding whether the number of occurrences of that word in the master should be updated. To avoid multiple counts,
+the word frequency is only incremented if it is a new corpus.
+
+In line \ref{ln:callkeymeth}, \emph{getUnknownWords(ArrayList, String, String, String, Boolean, Boolean, Boolean)} is called and
+a list of words is returned. This method is key and will be discussed in depth below. Finally, lines
+\ref{ln:filesavestart}--\ref{ln:filesaveend} show how to handle file objects in MyCoRe. Using the file ID, the root path and the name
+of the first file in that path are identified. Then, a second file starting with ``untagged'' is created and all words returned from
+\emph{getUnknownWords} are written to that file. By the same token an empty file is created (in the last two lines of the \emph{think} method),
+in which all words that are manually annotated will be saved.
+
+In a refactoring phase, the method \emph{getUnknownWords(ArrayList, String, String, String, Boolean, Boolean, Boolean)} could be subdivided into
+three methods: one for each Boolean parameter. In fact, this method handles more than one task. This is mainly to avoid duplicated code.
+%this is just wrong because no resultset will substantially be more than 10-20
+%In addition, for large text files this method would run into efficiency problems if the master database also reaches the intended size of about
+%$100,000$ entries and beyond because
+In essence, an outer loop runs through all words of the corpus and an inner loop runs through all hits in the solr result set. Because the result
+set is supposed to be small, approximately between $10$ and $20$ items, efficiency
+issues are unlikely to cause a problem, although there are some more loops running through collections of about the same size.
+%As the hits naturally grow larger with an increasing size of the data base, processing time will rise exponentially.
+Since each word is identified on the basis of its projected word type, the word form, and the time range it falls into, it is these variables that
+have to be checked for existence in the documents. If they are not in the xml documents,
+\emph{null} is returned and has to be handled. Moreover, user authentication must be considered. There are three different XPaths that are relevant.
+\begin{itemize}
+    \item[-] \emph{//service/servflags/servflag[@type='createdby']} to test for the correct user
+    \item[-] \emph{//morphiloContainer/morphilo} to create the annotated document
+    \item[-] \emph{//morphiloContainer/morphilo/w} to set occurrences or add a link
+\end{itemize}
+
+As an illustration of the core functioning of this method, listing \ref{src:getUnknowWords} is given.
+\begin{lstlisting}[language=java,caption={Mode of Operation of getUnknownWords Method},label=src:getUnknowWords,escapechar=|]
+public ArrayList<String> getUnknownWords(
+        ArrayList<String> corpus,
+        String timeCorpusBegin,
+        String timeCorpusEnd,
+        String wdtpe,
+        Boolean setOcc,
+        Boolean setXlink,
+        Boolean writeAllData) throws Exception
+{
+    String currentUser = MCRSessionMgr.getCurrentSession().getUserInformation().getUserID();
+    ArrayList lo = new ArrayList();
+
+    for (int i = 0; i < corpus.size(); i++)
+    {
+        SolrClient solrClient = MCRSolrClientFactory.getSolrClient();
+        SolrQuery query = new SolrQuery();
+        query.setFields("w","occurrence","begin","end", "id", "wordtype");
+        query.setQuery(corpus.get(i));
+        query.setRows(50); //more than 50 items are extremely unlikely
+        SolrDocumentList results = solrClient.query(query).getResults();
+        Boolean available = false;
+        for (int entryNum = 0; entryNum < results.size(); entryNum++)
+        {
+            ...
+            //update in MCRMetaDataManager
+            String mcrIDString = results.get(entryNum).getFieldValue("id").toString();
+            //read the MCRObject and create a JDOM document:
+            MCRObject mcrObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(mcrIDString));
+            Document jdomDoc = mcrObj.createXML();
+            ...
+            //check and correction for word type
+            ...
+            //check and correction of time: timeCorrect
+            ...
+            //check if the user is correct: isAuthorized
+            ...
+            XPathExpression<Element> xp = xpfac.compile("//morphiloContainer/morphilo/w", Filters.element());
+            //iterates w-elements and increments the occurrence attribute if setOcc is true
+            for (Element e : xp.evaluate(jdomDoc))
+            {
+                //if rights are granted and the word type is either not given anywhere or identical
+                if (isAuthorized && timeCorrect
+                    && ((e.getAttributeValue("wordtype") == null && wdtpe.equals(""))
+                    || e.getAttributeValue("wordtype").equals(wordtype))) //only for standardization
+                {
+                    int oc = -1;
+                    available = true;|\label{ln:available}|
+                    try
+                    {
+                        //adjust the occurrence attribute
+                        if (setOcc)
+                        {
+                            oc = Integer.parseInt(e.getAttributeValue("occurrence"));
+                            e.setAttribute("occurrence", Integer.toString(oc + 1));
+                        }
+
+                        //write the morphilo ObjectID into the xml of corpmeta
+                        if (setXlink)
+                        {
+                            Namespace xlinkNamespace = Namespace.getNamespace("xlink", "http://www.w3.org/1999/xlink");|\label{ln:namespace}|
+                            MCRObject corpObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(getURLParameter(job, "objID")));
+                            Document corpDoc = corpObj.createXML();
+                            XPathExpression<Element> xpathEx = xpfac.compile("//corpuslink", Filters.element());
+                            Element elm = xpathEx.evaluateFirst(corpDoc);
+                            elm.setAttribute("href" , mcrIDString, xlinkNamespace);
+                        }
+                        mcrObj = new MCRObject(jdomDoc);|\label{ln:updatestart}|
+                        MCRMetadataManager.update(mcrObj);
+                        QualityControl qc = new QualityControl(mcrObj);|\label{ln:updateend}|
+                    }
+                    catch(NumberFormatException except)
+                    {
+                        //ignore
+                    }
+                }
+            }
+        if (!available) //if not available in the datasets under the given conditions |\label{ln:notavailable}|
+        {
+            lo.add(corpus.get(i));
+        }
+    }
+    return lo;
+}
+\end{lstlisting}
+As can be seen from the functionality of listing \ref{src:getUnknowWords}, getting the unknown words of a corpus is rather a side effect of the identically named method.
+More precisely, a Boolean (line \ref{ln:available}) is set when the document is manipulated otherwise, because it is clear that the word must exist then.
+If the Boolean remains false (line \ref{ln:notavailable}), the word is put on the list of words that have to be annotated manually. As already explained above, the
+first loop runs through all words of the corpus, and in the following lines a solr result set is created. This set is also looped through, and it is checked
+whether the time range and the word type match and whether the user is authorized. In the remainder, the occurrence attribute of the morphilo document can be incremented (setOcc is true) and/or the word is linked to the
+corpus metadata (setXlink is true). Since most code lines are equivalent to
+what was explained in listing \ref{src:think}, it suffices to focus on an
+additional namespace, i.e.
+``xlink'', that has to be defined (line \ref{ln:namespace}). Once the linking of word
+and corpus is set, the entire MyCoRe object has to be updated. This is done by the functionality of the framework (lines \ref{ln:updatestart}--\ref{ln:updateend}).
+At the end, an instance of \emph{QualityControl} is created.
+
+%QualityControl
+The class \emph{QualityControl} is instantiated with a constructor
+depicted in listing \ref{src:constructQC}.
+\begin{lstlisting}[language=java,caption={Constructor of QualityControl.java},label=src:constructQC,escapechar=|]
+private MCRObject mycoreObject;
+/* Constructor calls method to carry out quality control, i.e.
+ * if at least 20 different users agree 100% on the segments of the word under investigation
+ */
+public QualityControl(MCRObject mycoreObject) throws Exception
+{
+    this.mycoreObject = mycoreObject;
+    if (getEqualObjectNumber() > 20)
+    {
+        addToMorphiloDB();
+    }
+}
+\end{lstlisting}
+The constructor takes a MyCoRe object, a potential word candidate for the
+master database, which is assigned to a private class variable because the
+object is used, though not changed, by some other Java methods.
+More importantly, there are two more methods: \emph{getEqualObjectNumber()} and
+\emph{addToMorphiloDB()}. While the former initiates a process of counting and
+comparing objects, the latter is concerned with calculating the correct number
+of occurrences from different, but not the same, texts, and with generating a MyCoRe object with the same content but with two different flags in the \emph{//service/servflags/servflag}-node, i.e. \emph{createdby='administrator'} and \emph{state='published'}.
+And of course, the \emph{occurrence} attribute is set to the newly calculated value. The logic corresponds exactly to what was explained in
+listing \ref{src:think} and will not be repeated here. The only difference is the paths compiled by the XPathFactory. They are
+\begin{itemize}
+    \item[-] \emph{//service/servflags/servflag[@type='createdby']} and
+    \item[-] \emph{//service/servstates/servstate[@classid='state']}.
+\end{itemize}
+It is more instructive to document how the number of occurrences is calculated. There are two steps involved. First, a list with all MyCoRe objects that are
+equal to the object with which the class was instantiated (``mycoreObject'' in listing \ref{src:constructQC}) is created. This list is looped through and all occurrence
+attributes are summed up. Second, all occurrences from equal texts are subtracted. Equal texts are identified on the basis of their metadata and their derivates.
+There are some obvious shortcomings of this approach, which will be discussed in chapter \ref{chap:results}, section \ref{sec:improv}. Here, it suffices to
+understand the mode of operation. Listing \ref{src:equalOcc} shows a possible solution.
+\begin{lstlisting}[language=java,caption={Occurrence Extraction from Equal Texts (1)},label=src:equalOcc,escapechar=|]
+/* returns the number of occurrences if the objects are equal, zero otherwise
+ */
+private int getOccurrencesFromEqualTexts(MCRObject mcrobj1, MCRObject mcrobj2) throws SAXException, IOException
+{
+    int occurrences = 1;
+    //extract the corpmeta ObjectIDs from the morphilo objects
+    String crpID1 = getAttributeValue("//corpuslink", "href", mcrobj1);
+    String crpID2 = getAttributeValue("//corpuslink", "href", mcrobj2);
+    //get these two corpmeta objects
+    MCRObject corpo1 = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(crpID1));
+    MCRObject corpo2 = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(crpID2));
+    //are the texts equal?
+    //get the list of 'processed-words' derivates
+    String corp1DerivID = getAttributeValue("//structure/derobjects/derobject", "href", corpo1);
+    String corp2DerivID = getAttributeValue("//structure/derobjects/derobject", "href", corpo2);
+
+    ArrayList result = new ArrayList(getContentFromFile(corp1DerivID, ""));|\label{ln:writeContent}|
+    result.removeAll(getContentFromFile(corp2DerivID, ""));|\label{ln:removeContent}|
+    if (result.size() == 0) //the texts are equal
+    {
+        //extract the occurrences of one of the objects
+        occurrences = Integer.parseInt(getAttributeValue("//morphiloContainer/morphilo/w", "occurrence", mcrobj1));
+    }
+    else
+    {
+        occurrences = 0; //project metadata happened to be the same, but the texts are different
+    }
+    return occurrences;
+}
+\end{lstlisting}
+In this implementation, the ids from the \emph{corpmeta} data model are accessed via the xlink attribute in the morphilo documents.
+The method \emph{getAttributeValue(String, String, MCRObject)} does exactly the same as demonstrated earlier (see from line \ref{ln:namespace}
+onwards in listing \ref{src:getUnknowWords}). The underlying logic is that the texts are considered equal if they contain exactly the same words.
+So all words from one file are written to a list (line \ref{ln:writeContent}) and all words occurring in the other file are removed from the
+very same list (line \ref{ln:removeContent}). If this list is empty, both files must have contained the same words and the occurrences
+are adjusted accordingly. Since this method is called from another private method that only contains a loop through all equal objects, one gets
+the occurrences from all equal texts. For the sake of verifiability, the looping method is also given:
+\begin{lstlisting}[language=java,caption={Occurrence Extraction from Equal Texts (2)},label=src:equalOcc2,escapechar=|]
+private int getOccurrencesFromEqualTexts() throws Exception
+{
+    ArrayList<MCRObject> equalObjects = new ArrayList<MCRObject>();
+    equalObjects = getAllEqualMCRObjects();
+    int occurrences = 0;
+    for (MCRObject obj : equalObjects)
+    {
+        occurrences = occurrences + getOccurrencesFromEqualTexts(mycoreObject, obj);
+    }
+    return occurrences;
+}
+\end{lstlisting}
+
+Now, the constructor in listing \ref{src:constructQC} reveals another method that rolls out an equally complex concatenation of procedures.
+As implied above, \emph{getEqualObjectNumber()} returns the number of equally annotated words. It does this by falling back to another
+method from which the size of the returned list is calculated (\emph{getAllEqualMCRObjects().size()}). Hence, we should care about
+\emph{getAllEqualMCRObjects()}. This method has essentially the same design as \emph{int getOccurrencesFromEqualTexts()} in listing \ref{src:equalOcc2}.
+The difference is that another method (\emph{Boolean compareMCRObjects(MCRObject, MCRObject, String)}) is used within the loop and
+that all equal objects are put into the list of MyCoRe objects that is returned. If this list comprises more than 20
+entries,\footnote{This number is somewhat arbitrary. It is inspired by the sample size n in t-distributed data.} the respective document
+will be integrated in the master database by the process described above.
+The comparator logic is shown in listing \ref{src:compareMCR}.
+\begin{lstlisting}[language=java,caption={Comparison of MyCoRe objects},label=src:compareMCR,escapechar=|]
+private Boolean compareMCRObjects(MCRObject mcrobj1, MCRObject mcrobj2, String xpath) throws SAXException, IOException
+{
+    Boolean isEqual = false;
+    Boolean beginTime = false;
+    Boolean endTime = false;
+    Boolean occDiff = false;
+    Boolean corpusDiff = false;
+
+    String source = getXMLFromObject(mcrobj1, xpath);
+    String target = getXMLFromObject(mcrobj2, xpath);
+
+    XMLUnit.setIgnoreAttributeOrder(true);
+    XMLUnit.setIgnoreComments(true);
+    XMLUnit.setIgnoreDiffBetweenTextAndCDATA(true);
+    XMLUnit.setIgnoreWhitespace(true);
+    XMLUnit.setNormalizeWhitespace(true);
+
+    //differences in occurrences, end, begin should be ignored
+    try
+    {
+        Diff xmlDiff = new Diff(source, target);
+        DetailedDiff dd = new DetailedDiff(xmlDiff);
+        //counters for differences
+        int i = 0;
+        int j = 0;
+        int k = 0;
+        int l = 0;
+        //list containing all differences
+        List differences = dd.getAllDifferences();|\label{ln:difflist}|
+        for (Object object : differences)
+        {
+            Difference difference = (Difference) object;
+            //the @begin, @end, ... node is not in the difference list if the count is 0
+            if (difference.getControlNodeDetail().getXpathLocation().endsWith("@begin")) i++;|\label{ln:diffbegin}|
+            if (difference.getControlNodeDetail().getXpathLocation().endsWith("@end")) j++;
+            if (difference.getControlNodeDetail().getXpathLocation().endsWith("@occurrence")) k++;
+            if (difference.getControlNodeDetail().getXpathLocation().endsWith("@corpus")) l++;|\label{ln:diffend}|
+            //@begin and @end have different values: it must be checked whether they fall right into the allowed time range
+            if ( difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
+                && difference.getControlNodeDetail().getXpathLocation().endsWith("@begin")
+                && (Integer.parseInt(difference.getControlNodeDetail().getValue()) < Integer.parseInt(difference.getTestNodeDetail().getValue())) )
+            {
+                beginTime = true;
+            }
+            if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
+                && difference.getControlNodeDetail().getXpathLocation().endsWith("@end")
+                && (Integer.parseInt(difference.getControlNodeDetail().getValue()) > Integer.parseInt(difference.getTestNodeDetail().getValue())) )
+            {
+                endTime = true;
+            }
+            //attribute values of @occurrence and @corpus are ignored if they are different
+            if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
+                && difference.getControlNodeDetail().getXpathLocation().endsWith("@occurrence"))
+            {
+                occDiff = true;
+            }
+            if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
+                && difference.getControlNodeDetail().getXpathLocation().endsWith("@corpus"))
+            {
+                corpusDiff = true;
+            }
+        }
+        //if any of @begin, @end, ... is identical, set the Boolean to true
+        if (i == 0) beginTime = true;|\label{ln:zerobegin}|
+        if (j == 0) endTime = true;
+        if (k == 0) occDiff = true;
+        if (l == 0) corpusDiff = true;|\label{ln:zeroend}|
+        //if the size of differences is greater than the number of changes admitted in @begin, @end, ... something else must be different
+        if (beginTime && endTime && occDiff && corpusDiff && (i + j + k + l) == dd.getAllDifferences().size()) isEqual = true;|\label{ln:diffsum}|
+    }
+    catch (SAXException e)
+    {
+        e.printStackTrace();
+    }
+    catch (IOException e)
+    {
+        e.printStackTrace();
+    }
+    return isEqual;
+}
+\end{lstlisting}
+In this method, XMLUnit is heavily used to make all necessary node comparisons. The matter becomes more complicated, however, if some attributes
+are not only ignored, but evaluated according to a given definition, as is the case for the time range. If the evaluator and builder classes are
+not to be overwritten entirely, because they are needed for evaluating other nodes of the
+xml document, the above solution appears a bit awkward. So there is potential for improvement before the production version is to be programmed.
+
+XMLUnit provides us with a
+list of the differences between the two documents (see line \ref{ln:difflist}). There are four differences allowed, that is, the attributes \emph{occurrence},
+\emph{corpus}, \emph{begin}, and \emph{end}. For each of them a Boolean variable is set. Because any of the attributes could also be equal to the master
+document, and the difference list only contains the actual differences, one has to find a way to define both cases, equal and different, for the attributes.
+This could be done by ignoring these nodes. Yet, this would not include testing whether the beginning and ending dates fall into the range of the master
+document. Therefore the attributes are counted, as lines \ref{ln:diffbegin} through \ref{ln:diffend} reveal. If any two documents
+differ in some of the four attributes just specified, then the sum of the counters (line \ref{ln:diffsum}) should not be greater than the differences
+collected by XMLUnit. The rest of the if-tests assign truth values to the respective
+Booleans. It is probably worth mentioning that if all counters are zero (lines
+\ref{ln:zerobegin}--\ref{ln:zeroend}), the attributes and values are identical and hence the Booleans have to be set explicitly. Otherwise the test in line \ref{ln:diffsum} would fail.
+
+%TagCorpusServlet
+Once quality control (explained in detail above) has been passed, it is
+the user's turn to interact further. By clicking on the option \emph{Manual tagging}, the \emph{TagCorpusServlet} will be called. This servlet instantiates
+\emph{ProcessCorpusServlet} to get access to the \emph{getUnknownWords} method, which delivers the words still to be
+processed and which overwrites the content of the file starting with \emph{untagged}. For the next word in \emph{leftovers} a new MyCoRe object is created
+using the JDOM API and added to the file beginning with \emph{processed}. In line \ref{ln:tagmanu} of listing \ref{src:tagservlet}, the previously defined
+entry mask is called, with which the proposed word structure can be confirmed or changed. How the word structure is determined will be shown later in
+the text.
+\begin{lstlisting}[language=java,caption={Manual Tagging Procedure},label=src:tagservlet,escapechar=|]
+...
+if (!leftovers.isEmpty())
+{
+    ArrayList<String> processed = new ArrayList<String>();
+    //processed.add(leftovers.get(0));
+    JDOMorphilo jdm = new JDOMorphilo();
+    MCRObject obj = jdm.createMorphiloObject(job, leftovers.get(0));|\label{ln:jdomobject}|
+    //write the word to be annotated into the process list and save it
+    Path filePathProc = pcs.getDerivateFilePath(job, "processed").getFileName();
+    Path proc = root.resolve(filePathProc);
+    processed = pcs.getContentFromFile(job, "processed");
+    processed.add(leftovers.get(0));
+    Files.write(proc, processed);
+
+    //call the entry mask for the next word
+    tagUrl = prop.getBaseURL() + "content/publish/morphilo.xed?id=" + obj.getId();|\label{ln:tagmanu}|
+}
+else
+{
+    //initiate the process to deliver a completely tagged file of the original corpus:
+    //if the untagged file is empty, match the original file with morphilo
+    //(creator=administrator OR creator=username) and write the matches to a new file
+    ArrayList<String> complete = new ArrayList<String>();
+    ProcessCorpusServlet pcs2 = new ProcessCorpusServlet();
+    complete = pcs2.getUnknownWords(
+        pcs2.getContentFromFile(job, ""), //main corpus file
+        pcs2.getCorpusMetadata(job, "def.datefrom"),
+        pcs2.getCorpusMetadata(job, "def.dateuntil"),
+        "", //wordtype
+        false,
+        false,
+        true);
+
+    Files.delete(p);
+    MCRXMLFunctions mdm = new MCRXMLFunctions();
+    String mainFile = mdm.getMainDocName(derivID);
+    Path newRoot = root.resolve("tagged-" + mainFile);
+    Files.write(newRoot, complete);
+
+    //return to the menu page
+    tagUrl = prop.getBaseURL() + "receive/" + corpID;
+}
+\end{lstlisting}
+At the point where no more items are in \emph{leftovers}, the \emph{getUnknownWords} method is called with the last Boolean parameter
+set to true. This indicates that the array list containing all data available and relevant to the respective user is returned, as seen in
+the code snippet in listing \ref{src:writeAll}.
+\begin{lstlisting}[language=java,caption={Code snippet to deliver all data to the user},label=src:writeAll,escapechar=|]
+...
+//all data is written to lo in TEI
+if (writeAllData && isAuthorized && timeCorrect)
+{
+    XPathExpression<Element> xpath = xpfac.compile("//morphiloContainer/morphilo", Filters.element());
+    for (Element e : xpath.evaluate(jdomDoc))
+    {
+        XMLOutputter outputter = new XMLOutputter();
+        outputter.setFormat(Format.getPrettyFormat());
+        lo.add(outputter.outputString(e.getContent()));
+    }
+}
+...
+\end{lstlisting}
+The complete list (\emph{lo}) is written to yet a third file starting with \emph{tagged} and finally returned to the main project webpage.
+
+%JDOMorphilo
+The interesting question now is where the word structure that is filled into the entry mask, as asserted above, comes from.
+In listing \ref{src:tagservlet}, line \ref{ln:jdomobject}, one can see that a JDOM object is created and the method
+\emph{createMorphiloObject(MCRServletJob, String)} is called. The string parameter is the word that needs to be analyzed.
+Most of the method is a mere application of the JDOM API given the data model in chapter \ref{chap:concept}, section
+\ref{subsec:datamodel}, and listing \ref{lst:worddatamodel}. That means namespaces, elements, and their attributes are defined in the correct
+order and hierarchy.
+
+To fill the elements and attributes with text, i.e. prefixes, suffixes, stems, etc., a hashmap -- containing the morpheme as
+key and its position as value -- is created and filled with the results from an AffixStripper instantiation. Depending on how many prefixes
+or suffixes respectively are put into the hashmap, the same number of xml elements is created (see the sketch below). As a final step, a valid MyCoRe id is generated using
+the existing MyCoRe functionality, the object is created and returned to the TagCorpusServlet.
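+
+This mapping from the morpheme hashmap to xml elements can be sketched as
+follows. The sketch is hypothetical and not part of the Morphilo code base: the
+class and method names are invented, and only the element and attribute names
+of the word data model ($m4$ for prefixes) are taken from listing
+\ref{lst:worddatamodel}.
+
+\begin{lstlisting}[language=java,caption={Hypothetical sketch: from morpheme map to JDOM elements},label=src:morphemeMapSketch]
+import java.util.Map;
+import org.jdom2.Element;
+
+public class MorphemeElementSketch
+{
+    /* creates one m4 (prefix) element per entry of the morpheme map,
+     * e.g. {"over"=1} for the word "overdo"
+     */
+    public static Element buildWordElement(String lemma, Map<String, Integer> prefixMorpheme)
+    {
+        Element w = new Element("w");
+        w.setAttribute("lemma", lemma);
+        for (Map.Entry<String, Integer> entry : prefixMorpheme.entrySet())
+        {
+            Element m4 = new Element("m4"); //prefix layer in the Morphilo model
+            m4.setAttribute("type", "prefix");
+            m4.setAttribute("position", entry.getValue().toString());
+            m4.setText(entry.getKey());
+            w.addContent(m4);
+        }
+        return w;
+    }
+}
+\end{lstlisting}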
+
+%AffixStripper explanation
+Last, the analysis of the word structure will be considered. It is implemented
+in the \emph{AffixStripper.java} file.
+All lexical affix morphemes and their allomorphs as well as the inflections were extracted from the
+OED\footnote{Oxford English Dictionary http://www.oed.com/} and saved as enumerated lists (see the example in listing \ref{src:enumPref}).
+The allomorphic items of these lists are mapped successively to the beginning of the word in the case of prefixes
+(see listing \ref{src:analyzePref}, line \ref{ln:prefLoop}) or to the end of the word in the case of suffixes
+(see listing \ref{src:analyzeSuf}). Since each
+morphemic variant maps to its morpheme right away, it makes sense to use the morpheme and so
+implicitly keep the relation to its allomorph.
+
+\begin{lstlisting}[language=java,caption={Enumeration Example for the Prefix "over"},label=src:enumPref,escapechar=|]
+package custom.mycore.addons.morphilo;
+
+public enum PrefixEnum {
+...
+    over("over"), ufer("over"), ufor("over"), uferr("over"), uvver("over"), obaer("over"), ober("over"), ofaer("over"),
+    ofere("over"), ofir("over"), ofor("over"), ofer("over"), ouer("over"), oferr("over"), offerr("over"), offr("over"), aure("over"),
+    war("over"), euer("over"), oferre("over"), oouer("over"), oger("over"), ouere("over"), ouir("over"), ouire("over"),
+    ouur("over"), ouver("over"), ouyr("over"), ovar("over"), overe("over"), ovre("over"), ovur("over"), owuere("over"), owver("over"),
+    houyr("over"), ouyre("over"), ovir("over"), ovyr("over"), hover("over"), auver("over"), awver("over"), ovver("over"),
+    hauver("over"), ova("over"), ove("over"), obuh("over"), ovah("over"), ovuh("over"), ofowr("over"), ouuer("over"), oure("over"),
+    owere("over"), owr("over"), owre("over"), owur("over"), owyr("over"), our("over"), ower("over"), oher("over"),
+    ooer("over"), oor("over"), owwer("over"), ovr("over"), owir("over"), oar("over"), aur("over"), oer("over"), ufara("over"),
+    ufera("over"), ufere("over"), uferra("over"), ufora("over"), ufore("over"), ufra("over"), ufre("over"), ufyrra("over"),
+    yfera("over"), yfere("over"), yferra("over"), uuera("over"), ufe("over"), uferre("over"), uuer("over"), uuere("over"),
+    vfere("over"), vuer("over"), vuere("over"), vver("over"), uvvor("over") ...
+...
+    private String morpheme;
+    //constructor
+    PrefixEnum(String morpheme)
+    {
+        this.morpheme = morpheme;
+    }
+    //getter method
+    public String getMorpheme()
+    {
+        return this.morpheme;
+    }
+}
+\end{lstlisting}
+As can be seen in line \ref{ln:prefPutMorph} in listing \ref{src:analyzePref}, the morpheme is saved to a hash map together with its position, i.e. the size of the
+map plus one at the time being. In line \ref{ln:prefCutoff} the \emph{analyzePrefix} method is recursively called until no more matches can be made.
+
+\begin{lstlisting}[language=java,caption={Method to recognize prefixes},label=src:analyzePref,escapechar=|]
+private Map<String, Integer> prefixMorpheme = new HashMap<String,Integer>();
+...
+private void analyzePrefix(String restword)
+{
+    if (!restword.isEmpty()) //termination condition of the recursion
+    {
+        for (PrefixEnum prefEnum : PrefixEnum.values())|\label{ln:prefLoop}|
+        {
+            String s = prefEnum.toString();
+            if (restword.startsWith(s))
+            {
+                prefixMorpheme.put(s, prefixMorpheme.size() + 1);|\label{ln:prefPutMorph}|
+                //cut off the prefix that was added to the list
+                analyzePrefix(restword.substring(s.length()));|\label{ln:prefCutoff}|
+            }
+            else
+            {
+                analyzePrefix("");
+            }
+        }
+    }
+}
+\end{lstlisting}
+
+The recognition of suffixes differs only in the cut-off direction, since suffixes occur at the end of a word.
+Hence, in the case of suffixes, line \ref{ln:prefCutoff} of listing \ref{src:analyzePref} reads as follows.
+
+\begin{lstlisting}[language=java,caption={Cut-off mechanism for suffixes},label=src:analyzeSuf,escapechar=|]
+analyzeSuffix(restword.substring(0, restword.length() - s.length()));
+\end{lstlisting}
+
+It is important to note that inflections are suffixes (in the given model case of Middle English morphology) that usually occur only once, at the very
+end of a word, i.e. after all lexical suffixes. It follows that inflections
+have to be recognized first and without any repetition. So the procedure for inflections can be simplified
+to a substantial degree, as listing \ref{src:analyzeInfl} shows.
+
+\begin{lstlisting}[language=java,caption={Method to recognize inflections},label=src:analyzeInfl,escapechar=|]
+private String analyzeInflection(String wrd)
+{
+    String infl = "";
+    for (InflectionEnum inflEnum : InflectionEnum.values())
+    {
+        if (wrd.endsWith(inflEnum.toString()))
+        {
+            infl = inflEnum.toString();
+        }
+    }
+    return infl;
+}
+\end{lstlisting}
+
+Unfortunately, the embeddedness problem prevents a very simple algorithm. Embeddedness occurs when a lexical item
+is a substring of another lexical item. To illustrate, the suffix \emph{ion} is also contained in the suffix \emph{ation}, as is
+\emph{ent} in \emph{ment}, and so on. The embeddedness problem cannot be solved completely on the basis of linear modelling, but
+for a large part of the embedded items one can work around it by implicitly using Zipf's law, i.e. the correlation between frequency
+and length of lexical items: the longer a word becomes, the less frequently it will occur. The simplest logic derived from this is to prefer
+longer suffixes (measured in letters) over shorter ones, because the longer the suffix string becomes, the more likely it is that it constitutes one
+(as opposed to several) suffix unit(s). For example, \emph{ion} is removed from the candidates if the longer \emph{ation} also matches. This is done in listing \ref{src:embedAffix}, where
+\emph{sortedByLengthMap} keeps the affixes sorted by length and the loop from line \ref{ln:deleteAffix} onwards deletes
+the respective substrings.
+
+\begin{lstlisting}[language=java,caption={Method to work around embeddedness},label=src:embedAffix,escapechar=|]
+private Map<String, Integer> sortOutAffixes(Map<String, Integer> affix)
+{
+    Map<String,Integer> sortedByLengthMap = new TreeMap<String, Integer>(new Comparator<String>()
+        {
+            @Override
+            public int compare(String s1, String s2)
+            {
+                int cmp = Integer.compare(s1.length(), s2.length());
+                return cmp != 0 ? cmp : s1.compareTo(s2);
+            }
+        }
+    );
+    sortedByLengthMap.putAll(affix);
+    ArrayList<String> al1 = new ArrayList<String>(sortedByLengthMap.keySet());
+    //copy the key list so that reversing it does not affect the original order
+    ArrayList<String> al2 = new ArrayList<String>(al1);
+    Collections.reverse(al2);
+    for (String s2 : al1)|\label{ln:deleteAffix}|
+    {
+        for (String s1 : al2)
+            if (s1.contains(s2) && s1.length() > s2.length())
+            {
+                affix.remove(s2);
+            }
+    }
+    return affix;
+}
+\end{lstlisting}
+
+Finally, the position of the affix has to be calculated, because the hashmap in line \ref{ln:prefPutMorph} in
+listing \ref{src:analyzePref} does not keep the original order after the changes made while addressing the affix embeddedness
+(listing \ref{src:embedAffix}). Listing \ref{src:affixPos} depicts the preferred solution.
+The recursive construction of the method is similar to \emph{private void analyzePrefix(String)} (listing \ref{src:analyzePref}),
+only that the two affix types are handled in one method. For that, an additional parameter taking the form of either \emph{suffix}
+or \emph{prefix} is included.
+
+\begin{lstlisting}[language=java,caption={Method to determine the position of the affix},label=src:affixPos,escapechar=|]
+private void getAffixPosition(Map<String, Integer> affix, String restword, int pos, String affixtype)
+{
+    if (!restword.isEmpty()) //termination condition of the recursion
+    {
+        for (String s : affix.keySet())
+        {
+            if (restword.startsWith(s) && affixtype.equals("prefix"))
+            {
+                pos++;
+                prefixMorpheme.put(s, pos);
+                //prefixAllomorph.add(pos-1, restword.substring(s.length()));
+                getAffixPosition(affix, restword.substring(s.length()), pos, affixtype);
+            }
+            else if (restword.endsWith(s) && affixtype.equals("suffix"))
+            {
+                pos++;
+                suffixMorpheme.put(s, pos);
+                //suffixAllomorph.add(pos-1, restword.substring(s.length()));
+                getAffixPosition(affix, restword.substring(0, restword.length() - s.length()), pos, affixtype);
+            }
+            else
+            {
+                getAffixPosition(affix, "", pos, affixtype);
+            }
+        }
+    }
+}
+\end{lstlisting}
+
+To give the complete word structure, the root of a word should also be provided. Listing \ref{src:rootAnalyze} offers a simple solution that also
+considers compounds, i.e. words consisting of more than one root.
+\begin{lstlisting}[language=java,caption={Method to determine roots},label=src:rootAnalyze,escapechar=|]
+private ArrayList<String> analyzeRoot(Map<String, Integer> pref, Map<String, Integer> suf, int stemNumber)
+{
+    ArrayList<String> root = new ArrayList<String>();
+    int j = 1; //one root always exists
+    //if the word is a compound, several roots exist
+    while (j <= stemNumber)
+    {
+        j++;
+        String rest = lemma;|\label{ln:lemma}|
+
+        for (int i = 0; i < pref.size(); i++)
+        {
+            for (String s : pref.keySet())
+            {
+                //if (i == pref.get(s))
+                if (rest.length() > s.length() && s.equals(rest.substring(0, s.length())))
+                {
+                    rest = rest.substring(s.length(), rest.length());
+                }
+            }
+        }
+
+        for (int i = 0; i < suf.size(); i++)
+        {
+            for (String s : suf.keySet())
+            {
+                //if (i == suf.get(s))
+                if (s.length() < rest.length() && (s.equals(rest.substring(rest.length() - s.length(), rest.length()))))
+                {
+                    rest = rest.substring(0, rest.length() - s.length());
+                }
+            }
+        }
+        root.add(rest);
+    }
+    return root;
+}
+\end{lstlisting}
+The logic behind this method is that the root is the remainder of a word when all prefixes and suffixes are subtracted.
+So the loops run through the number of prefixes and suffixes at each position and subtract the affix. For instance, for
+\emph{comfortable} with the prefix \emph{com} and the suffix \emph{able}, the remainder \emph{fort} is added to the root list.
+Admittedly, there is some code doubling with the previously described methods, which could be eliminated by making the code more modular in a possible
+refactoring phase. Again, this is not the concern of a prototype. Line \ref{ln:lemma} defines the initial state of a root,
+which is the case for monomorphemic words. The \emph{lemma} is defined as the word token without the inflection. Listing
+\ref{src:lemmaAnalyze} reveals how this class variable is calculated.
+\begin{lstlisting}[language=java,caption={Method to determine the lemma},label=src:lemmaAnalyze,escapechar=|]
+/*
+ * Simplification: lemma = wordtoken - inflection
+ */
+private String analyzeLemma(String wrd, String infl)
+{
+    return wrd.substring(0, wrd.length() - infl.length());
+}
+\end{lstlisting}
+The constructor of \emph{AffixStripper} calls the method \emph{analyzeWord()},
+whose only job is to calculate each structure element in the correct order
+(listing \ref{src:analyzeWord}). All structure elements are also provided by getters.
+\begin{lstlisting}[language=java,caption={Method to determine the complete word structure},label=src:analyzeWord,escapechar=|]
+private void analyzeWord()
+{
+    //analyze the inflection first because it always occurs at the end of a word
+    inflection = analyzeInflection(wordtoken);
+    lemma = analyzeLemma(wordtoken, inflection);
+    analyzePrefix(lemma);
+    analyzeSuffix(lemma);
+    getAffixPosition(sortOutAffixes(prefixMorpheme), lemma, 0, "prefix");
+    getAffixPosition(sortOutAffixes(suffixMorpheme), lemma, 0, "suffix");
+    prefixNumber = prefixMorpheme.size();
+    suffixNumber = suffixMorpheme.size();
+    wordroot = analyzeRoot(prefixMorpheme, suffixMorpheme, getStemNumber());
+}
+\end{lstlisting}
+
+To conclude, the Morphilo implementation as presented here aims at fulfilling the task of a working prototype. It is important to note
+that it claims to be neither a very efficient nor a finished software program to be used in production. However, it marks a crucial milestone
+on the way to a production system. For some listings, sources of improvement were made explicit; for others, no suggestions were made. In the latter
+case this does not imply that there is no potential for improvement. Once acceptability tests are carried out, it will be the task of a follow-up project
+to identify these potentials and implement them accordingly.
\ No newline at end of file
diff --git a/Morphilo_doc/source/datamodel.rst b/Morphilo_doc/source/datamodel.rst
index 2d0aef4570bc6a16acf9acb185019c0b63dadaa2..f206ef3ffb8967f600e499b5b76eee996ac7de31 100644
--- a/Morphilo_doc/source/datamodel.rst
+++ b/Morphilo_doc/source/datamodel.rst
@@ -1,5 +1,60 @@
-Data Model Implementation
-=========================
+Data Model
+==========
+
+Conceptualization
+-----------------
+
+From both the user and the task requirements one can derive that four basic
+functions of data processing need to be carried out. Data have to be read, persistently
+saved, searched, and deleted. Furthermore, some kind of user management
+and multi-user processing is necessary. In addition, the framework should
+support web technologies, be well documented, and be easy to extend. Ideally, the
+MVC pattern is realized.
+
+\subsection{Data Model}\label{subsec:datamodel}
+The guidelines of the
+\emph{TEI} standard\footnote{http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf} on the
+word level are defined in line with the structure defined above in section \ref{subsec:morphologicalSystems}.
+In listing \ref{lst:teiExamp} an
+example is given of a possible markup at the word level for
+\emph{comfortable}.\footnote{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-m.html}
+
+\begin{lstlisting}[language=XML,
+caption={TEI-example for 'comfortable'},label=lst:teiExamp]
+<w type="adjective">
+    <m type="base">
+        <m type="prefix" baseForm="con">com</m>
+        <m type="root">fort</m>
+    </m>
+    <m type="suffix">able</m>
+</w>
+\end{lstlisting}
+
+This data model reflects just one theoretical conception of a word structure model.
+Crucially, the model emanates from the assumption
+that the suffix node is on a par with the word base. On the one hand, this
+implies that the word stem directly dominates the suffix, but not the prefix. The prefix, on the
+other hand, is enclosed in the base, which basically means a stronger lexical,
+and less abstract, attachment to the root of a word. Modeling prefixes and suffixes on different
+hierarchical levels has important consequences for the branching direction at
+the subword level (here right-branching). Leaving the theoretical interest aside, the
+choice of the TEI standard is reasonable with a view to a sustainable architecture that allows for
+exchanging data with little to no additional adjustments.
+
+The downside is that the model is not suitable for all languages.
+It reflects a theoretical construction based on Indo-European
+languages. If attention is paid to the languages for which this software is used, this will
+not be problematic. The model fits most languages of the Indo-European
+stem, which corresponds to the overwhelming majority of all research carried out
+(unfortunately).
+
+Implementation
+--------------
+
+As laid out in the task analysis in section \ref{subsec:datamodel}, it is
+advantageous to use established standards. It was also shown that it makes sense
+to keep the metadata of each corpus separate from the data model used for the
+words to be analyzed.
 For the present case, the TEI-standard was identified as an
 appropriate markup for words. In terms of the implementation this means that
@@ -26,3 +81,161 @@
 Whereas attributes of the objecttype are specific to the repository framework, the
 general structure of the model is recognized in the hierarchy of the meta data
 element starting with the name \emph{w} (line \ref{src:wordbegin}).
+\begin{lstlisting}[language=XML,caption={Word Data
+model},label=lst:worddatamodel,escapechar=|]
+<?xml version="1.0" encoding="UTF-8"?>
+<objecttype
+    name="morphilo"
+    isChild="true"
+    isParent="true"
+    hasDerivates="true"
+    xmlns:xs="http://www.w3.org/2001/XMLSchema"
+    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+    xsi:noNamespaceSchemaLocation="datamodel.xsd">
+  <metadata>
+    <element name="morphiloContainer" type="xml" style="dontknow"
+             notinherit="true" heritable="false">
+      <xs:sequence>
+        <xs:element name="morphilo">
+          <xs:complexType>
+            <xs:sequence>
+              <xs:element name="w" minOccurs="0" maxOccurs="unbounded">|\label{src:wordbegin}|
+                <xs:complexType mixed="true">
+                  <xs:sequence>
+                    <!-- stem -->
+                    <xs:element name="m1" minOccurs="0" maxOccurs="unbounded">
+                      <xs:complexType mixed="true">
+                        <xs:sequence>
+                          <!-- base -->
+                          <xs:element name="m2" minOccurs="0" maxOccurs="unbounded">
+                            <xs:complexType mixed="true">
+                              <xs:sequence>
+                                <!-- root -->
+                                <xs:element name="m3" minOccurs="0" maxOccurs="unbounded">
+                                  <xs:complexType mixed="true">
+                                    <xs:attribute name="type" type="xs:string"/>
+                                  </xs:complexType>
+                                </xs:element>
+                                <!-- prefix -->
+                                <xs:element name="m4" minOccurs="0" maxOccurs="unbounded">
+                                  <xs:complexType mixed="true">
+                                    <xs:attribute name="type" type="xs:string"/>
+                                    <xs:attribute name="PrefixbaseForm" type="xs:string"/>
+                                    <xs:attribute name="position" type="xs:string"/>
+                                  </xs:complexType>
+                                </xs:element>
+                              </xs:sequence>
+                              <xs:attribute name="type" type="xs:string"/>
+                            </xs:complexType>
+                          </xs:element>
+                          <!-- suffix -->
+                          <xs:element name="m5" minOccurs="0" maxOccurs="unbounded">
+                            <xs:complexType mixed="true">
+                              <xs:attribute name="type" type="xs:string"/>
+                              <xs:attribute name="SuffixbaseForm" type="xs:string"/>
+                              <xs:attribute name="position" type="xs:string"/>
+                              <xs:attribute name="inflection" type="xs:string"/>
+                            </xs:complexType>
+                          </xs:element>
+                        </xs:sequence>
+                        <!-- stem attributes -->
+                        <xs:attribute name="type" type="xs:string"/>
+                        <xs:attribute name="pos" type="xs:string"/>
+                        <xs:attribute name="occurrence" type="xs:string"/>
+                      </xs:complexType>
+                    </xs:element>
+                  </xs:sequence>
+                  <!-- w attributes at word level -->
+                  <xs:attribute name="lemma" type="xs:string"/>
+                  <xs:attribute name="complexType" type="xs:string"/>
+                  <xs:attribute name="wordtype" type="xs:string"/>
+                  <xs:attribute name="occurrence" type="xs:string"/>
+                  <xs:attribute name="corpus" type="xs:string"/>
+                  <xs:attribute name="begin" type="xs:string"/>
+                  <xs:attribute name="end" type="xs:string"/>
+                </xs:complexType>
+              </xs:element>
+            </xs:sequence>
+          </xs:complexType>
+        </xs:element>
+      </xs:sequence>
+    </element>
+    <element name="wordtype" type="classification" minOccurs="0" maxOccurs="1">
+      <classification id="wordtype"/>
+    </element>
+    <element name="complexType" type="classification" minOccurs="0" maxOccurs="1">
+      <classification id="complexType"/>
+    </element>
+    <element name="corpus" type="classification" minOccurs="0" maxOccurs="1">
+      <classification id="corpus"/>
+    </element>
+    <element name="pos" type="classification" minOccurs="0" maxOccurs="1">
+      <classification id="pos"/>
+    </element>
+    <element name="PrefixbaseForm" type="classification" minOccurs="0"
+             maxOccurs="1">
+      <classification id="PrefixbaseForm"/>
+    </element>
+    <element name="SuffixbaseForm" type="classification" minOccurs="0"
+             maxOccurs="1">
+      <classification id="SuffixbaseForm"/>
+    </element>
+    <element name="inflection" type="classification" minOccurs="0" maxOccurs="1">
+      <classification id="inflection"/>
+    </element>
+    <element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded">
+      <target type="corpmeta"/>
+    </element>
+  </metadata>
+</objecttype>
+\end{lstlisting}
+
+Additionally, it is worth mentioning that some attributes are modeled as a
+\emph{classification}. All of these have to be listed
+as separate elements in the data model. This has been done for all attributes
+that are subject to little or no change. In fact, all suffix
+and prefix morphemes of the language under investigation should be known in advance and are
+therefore defined as classifications.
+The same is true for the parts of speech, named \emph{pos} in the morphilo data
+model above; here the Penn Treebank tagset was used. Last, the different morphemic layers, named \emph{m} in
+the standard model, are renamed to $m1$ through $m5$. This is the
+only change to the standard that could be problematic if the data is to be
+processed elsewhere and the change is not documented explicitly. Yet, this
+change was necessary because the MyCoRe repository throws errors caused by ambiguity
+issues on the different $m$-layers.
+
+The second data model describes only very few properties of the text corpora
+from which the words are extracted. Listing \ref{lst:corpusdatamodel} depicts
+only the metadata element. For the sake of simplicity of the prototype, this
+data model is kept as simple as possible. The only obligatory field is the name of
+the corpus. The dates of the corpus are classified as optional because in
+some cases a text cannot be dated reliably.
+
+\begin{lstlisting}[language=XML,caption={Corpus Data
+Model},label=lst:corpusdatamodel]
+<metadata>
+  <!-- mandatory fields -->
+  <element name="korpusname" type="text" minOccurs="1" maxOccurs="1"/>
+  <!-- optional fields -->
+  <element name="sprache" type="text" minOccurs="0" maxOccurs="1"/>
+  <element name="size" type="number" minOccurs="0" maxOccurs="1"/>
+  <element name="datefrom" type="text" minOccurs="0" maxOccurs="1"/>
+  <element name="dateuntil" type="text" minOccurs="0" maxOccurs="1"/>
+  <!-- number of words -->
+  <element name="NoW" type="text" minOccurs="0" maxOccurs="1"/>
+  <element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded">
+    <target type="morphilo"/>
+  </element>
+</metadata>
+\end{lstlisting}
+
+As a final remark, one might have noticed that all attributes are modelled as
+strings although other data types are available and the fields encoding the dates or
+the number of words suggest otherwise. The MyCoRe framework even provides a
+data type \emph{historydate}. There is no very satisfying answer for its
+disuse.
+All that can be said is that the use of data types other than string
+leads to problems later on in the interplay between the search engine and the
+repository framework. These issues seem to be well known and can be followed on
+GitHub.
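+
+To make the word data model more tangible, the following listing sketches how
+the word \emph{comfortable} from listing \ref{lst:teiExamp} might be
+instantiated in the morphilo model with the renamed $m1$ through $m5$ layers.
+The attribute values are purely hypothetical and serve only as an
+illustration; they are not taken from an actual dataset.
+
+\begin{lstlisting}[language=XML,caption={Hypothetical instance of 'comfortable'
+in the morphilo data model},label=lst:morphiloExamp]
+<w lemma="comfortable" wordtype="adjective" occurrence="1" corpus="corpus1">comfortable
+  <!-- m1: stem -->
+  <m1 type="stem" pos="JJ">
+    <!-- m2: base -->
+    <m2 type="base">
+      <!-- m3: root -->
+      <m3 type="root">fort</m3>
+      <!-- m4: prefix, assimilated from the base form 'con' -->
+      <m4 type="prefix" PrefixbaseForm="con" position="1">com</m4>
+    </m2>
+    <!-- m5: suffix -->
+    <m5 type="suffix" position="1">able</m5>
+  </m1>
+</w>
+\end{lstlisting}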
\ No newline at end of file
diff --git a/Morphilo_doc/source/framework.rst b/Morphilo_doc/source/framework.rst
new file mode 100644
index 0000000000000000000000000000000000000000..1b9925de0b23715d8a34406bc9994664224637f3
--- /dev/null
+++ b/Morphilo_doc/source/framework.rst
@@ -0,0 +1,27 @@
+Framework
+=========
+
+\begin{figure}
+  \centering
+  \includegraphics[scale=0.33]{mycore_architecture-2.png}
+  \caption[MyCoRe-Architecture and Components]{MyCoRe-Architecture and Components\protect\footnotemark}
+  \label{fig:abbMyCoReStruktur}
+\end{figure}
+\footnotetext{source: https://www.mycore.de}
+To adapt the MyCoRe framework, the morphilo application logic has to be implemented,
+the TEI data model specified, and the input, search, and output masks programmed.
+
+There are three directories which are
+important for adjusting the MyCoRe framework to the needs of one's own application. These three directories
+correspond essentially to the three components in the MVC model as explicated in
+section \ref{subsec:mvc}. Roughly, they are visualized in the upper
+right-hand corner of figure \ref{fig:abbMyCoReStruktur}. More precisely, the view (\emph{Layout} in figure \ref{fig:abbMyCoReStruktur}) and the model layer
+(\emph{Datenmodell} in figure \ref{fig:abbMyCoReStruktur}) can be configured
+completely via the ``interface'', which is a directory with a predefined
+structure and some standard files. For the configuration of the logic, an extra directory is provided (\emph{/src/main/java/custom/mycore/addons/}). Here, all Java classes
+extending the controller layer should be added.
+Practically, all three MVC layers are placed in the
+\emph{src/main/}-directory of the application. In one of the subdirectories,
+\emph{datamodel/def}, the data model specifications are defined as xml files. This parallels the model
+layer in the MVC pattern. How the data model was defined is explained in
+section \ref{subsec:datamodelimpl}.
\ No newline at end of file
diff --git a/Morphilo_doc/source/view.rst b/Morphilo_doc/source/view.rst
new file mode 100644
index 0000000000000000000000000000000000000000..5f09e06bd9d7a0d9c1edd889d8ac44d3cb36757f
--- /dev/null
+++ b/Morphilo_doc/source/view.rst
@@ -0,0 +1,247 @@
+View
+====
+
+Conceptualization
+-----------------
+
+Lastly, the third directory (\emph{src/main/resources}) contains all code needed
+for rendering the data to be displayed on the screen. This corresponds to
+the view in an MVC approach. It is realized by xsl files that (unfortunately)
+contain some logic that really belongs to the controller. Thus, the division is
+not as clear as implied in theory. I will discuss this issue more specifically in the
+relevant subsection below. Among the resources are also all images, styles, and
+javascripts.
+
+Implementation
+--------------
+
+As explained in section \ref{subsec:mvc}, the view component handles the visual
+representation in the form of an interface that allows interaction between
+the user and the task to be carried out by the machine. Since the present
+application is a web service, all interaction happens via a browser, i.e. webpages are
+rendered and responses are recognized by registering mouse and keyboard
+events. More specifically, a webpage is rendered by transforming xml documents
+into html pages. The MyCoRe repository framework uses an open source XSLT
+processor from Apache, Xalan.\footnote{http://xalan.apache.org} This engine
+transforms document nodes, addressed by the XPath syntax, into hypertext by making
+use of a special form of template matching.
+All templates are collected in
+so-called stylesheets, which are themselves xml-encoded. Since there are two data models with two
+different structures, it is good practice to define two stylesheet files, one for
+each data model.
+
+As a demonstration, listing \ref{lst:morphilostylesheet} below gives a short
+extract of the stylesheet rendering the word data.
+
+\begin{lstlisting}[language=XML,caption={stylesheet
+morphilo.xsl},label=lst:morphilostylesheet]
+<?xml version="1.0" encoding="UTF-8"?>
+<xsl:stylesheet
+    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+    xmlns:xalan="http://xml.apache.org/xalan"
+    xmlns:i18n="xalan://org.mycore.services.i18n.MCRTranslation"
+    xmlns:acl="xalan://org.mycore.access.MCRAccessManager"
+    xmlns:mcr="http://www.mycore.org/" xmlns:xlink="http://www.w3.org/1999/xlink"
+    xmlns:mods="http://www.loc.gov/mods/v3"
+    xmlns:encoder="xalan://java.net.URLEncoder"
+    xmlns:mcrxsl="xalan://org.mycore.common.xml.MCRXMLFunctions"
+    xmlns:mcrurn="xalan://org.mycore.urn.MCRXMLFunctions"
+    exclude-result-prefixes="xalan xlink mcr i18n acl mods mcrxsl mcrurn encoder"
+    version="1.0">
+  <xsl:param name="MCR.Users.Superuser.UserName"/>
+
+  <xsl:template match="/mycoreobject[contains(@ID,'_morphilo_')]">
+    <head>
+      <link href="{$WebApplicationBaseURL}css/file.css" rel="stylesheet"/>
+    </head>
+    <div class="row">
+      <xsl:call-template name="objectAction">
+        <xsl:with-param name="id" select="@ID"/>
+        <xsl:with-param name="deriv" select="structure/derobjects/derobject/@xlink:href"/>
+      </xsl:call-template>
+      <xsl:variable name="objID" select="@ID"/>
+      <!-- set the heading here -->
+      <h1 style="text-indent: 4em;">
+        <xsl:if test="metadata/def.morphiloContainer/morphiloContainer/morphilo/w">
+          <xsl:value-of select="metadata/def.morphiloContainer/morphiloContainer/morphilo/w/text()[string-length(normalize-space(.))>0]"/>
+        </xsl:if>
+      </h1>
+      <dl class="dl-horizontal">
+        <!-- (1) Display word -->
+        <xsl:if test="metadata/def.morphiloContainer/morphiloContainer/morphilo/w">
+          <dt>
+            <xsl:value-of select="i18n:translate('response.page.label.word')"/>
+          </dt>
+          <dd>
+            <xsl:value-of select="metadata/def.morphiloContainer/morphiloContainer/morphilo/w/text()[string-length(normalize-space(.))>0]"/>
+          </dd>
+        </xsl:if>
+        <!-- (2) Display lemma -->
+        ...
+  </xsl:template>
+  ...
+  <xsl:template name="objectAction">
+    ...
+  </xsl:template>
+...
+</xsl:stylesheet>
+\end{lstlisting}
+This template matches the root node of each \emph{MyCoRe object}, ensuring that a valid MyCoRe model is
+used, and checks that the document to be processed contains a unique
+identifier, here a \emph{MyCoRe-ID}, and the name of the correct data model,
+here \emph{morphilo}.
+Then, another template, \emph{objectAction}, is called together with two parameters, the ids
+of the document object and of the attached files. In the remainder, all relevant
+information from the document, such as the word and the lemma, is accessed via XPath
+and, enriched with hypertext annotations, rendered as an html document.
+The template \emph{objectAction} is key to understanding the coupling process in the software
+framework. It is therefore listed separately in listing \ref{lst:objActionTempl}.
+
+\begin{lstlisting}[language=XML,caption={template
+objectAction},label=lst:objActionTempl,escapechar=|]
+<xsl:template name="objectAction">
+  <xsl:param name="id" select="./@ID"/>
+  <xsl:param name="accessedit" select="acl:checkPermission($id,'writedb')"/>
+  <xsl:param name="accessdelete" select="acl:checkPermission($id,'deletedb')"/>
+  <xsl:variable name="derivCorp" select="./@label"/>
+  <xsl:variable name="corpID" select="metadata/def.corpuslink[@class='MCRMetaLinkID']/corpuslink/@xlink:href"/>
+  <xsl:if test="$accessedit or $accessdelete">|\label{ln:ng}|
+    <div class="dropdown pull-right">
+      <xsl:if test="string-length($corpID) > 0 or $CurrentUser='administrator'">
+        <button class="btn btn-default dropdown-toggle" style="margin:10px" type="button" id="dropdownMenu1" data-toggle="dropdown" aria-expanded="true">
+          <span class="glyphicon glyphicon-cog" aria-hidden="true"></span> Annotieren
+          <span class="caret"></span>
+        </button>
+      </xsl:if>
+      <xsl:if test="string-length($corpID) > 0">|\label{ln:ru}|
+        <xsl:variable name="ifsDirectory" select="document(concat('ifs:/',$derivCorp))"/>
+        <ul class="dropdown-menu" role="menu" aria-labelledby="dropdownMenu1">
+          <li role="presentation">
+            |\label{ln:nw1}|<a href="{$ServletsBaseURL}object/tag{$HttpSession}?id={$derivCorp}&objID={$corpID}" role="menuitem" tabindex="-1">|\label{ln:nw2}|
+              <xsl:value-of select="i18n:translate('object.nextObject')"/>
+            </a>
+          </li>
+          <li role="presentation">
+            <a href="{$WebApplicationBaseURL}receive/{$corpID}" role="menuitem" tabindex="-1">
+              <xsl:value-of select="i18n:translate('object.backToProject')"/>
+            </a>
+          </li>
+        </ul>
+      </xsl:if>
+      <xsl:if test="$CurrentUser='administrator'">
+        <ul class="dropdown-menu" role="menu" aria-labelledby="dropdownMenu1">
+          <li role="presentation">
+            <a role="menuitem" tabindex="-1" href="{$WebApplicationBaseURL}content/publish/morphilo.xed?id={$id}">
+              <xsl:value-of select="i18n:translate('object.editWord')"/>
+            </a>
+          </li>
+          <li role="presentation">
+            <a href="{$ServletsBaseURL}object/delete{$HttpSession}?id={$id}" role="menuitem" tabindex="-1" class="confirm_deletion option" data-text="Wirklich loeschen">
+              <xsl:value-of select="i18n:translate('object.delWord')"/>
+            </a>
+          </li>
+        </ul>
+      </xsl:if>
+    </div>
+    <div class="row" style="margin-left:0px; margin-right:10px">
+      <xsl:apply-templates select="structure/derobjects/derobject[acl:checkPermission(@xlink:href,'read')]">
+        <xsl:with-param name="objID" select="@ID"/>
+      </xsl:apply-templates>
+    </div>
+  </xsl:if>
+</xsl:template>
+\end{lstlisting}
+The \emph{objectAction} template defines the selection menu that appears -- once manual tagging has
+started -- on the upper right-hand side of the webpage. It is entitled
+\emph{Annotieren} (i.e. `annotate') and displays the two options \emph{next word} and \emph{back
+to project}.
+The first thing to note here is that in line \ref{ln:ng} a simple test
+excludes all guest users from accessing the procedure. After ensuring that only
+the user who owns the corpus project has access (line \ref{ln:ru}), s/he can use
+the drop-down menu, whose entries are really urls, e.g. lines
+\ref{ln:nw1}--\ref{ln:nw2}. The attentive reader might have noticed that
+this url exactly matches the definition in the web-fragment.xml as shown in
+listing \ref{lst:webfragment}, line \ref{ln:tag}, where it resolves to the
+respective Java class. In fact, this mechanism is the data interface within the
+MVC pattern.
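+For convenience, the following fragment recalls the general form of such a
+mapping in the web-fragment.xml. It is only a sketch: the url pattern given
+here is an assumption made for illustration, and the authoritative definition
+remains the one in listing \ref{lst:webfragment}.
+
+\begin{lstlisting}[language=XML,caption={Sketch of a servlet mapping
+(illustrative only)},label=lst:webfragmentSketch]
+<!-- illustrative sketch; see the actual web-fragment.xml for the real mapping -->
+<servlet>
+  <servlet-name>TagCorpusServlet</servlet-name>
+  <servlet-class>custom.mycore.addons.morphilo.TagCorpusServlet</servlet-class>
+</servlet>
+<servlet-mapping>
+  <servlet-name>TagCorpusServlet</servlet-name>
+  <!-- assumed pattern matching urls of the form {$ServletsBaseURL}object/tag... -->
+  <url-pattern>/servlets/object/tag</url-pattern>
+</servlet-mapping>
+\end{lstlisting}
+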
+The url also contains two variables, named \emph{derivCorp} and
+\emph{corpID}, that are needed by the Java classes to identify the corpus and the
+file object (see section \ref{sec:javacode}).
+
+The morphilo.xsl stylesheet contains yet another modification that deserves mention.
+In listing \ref{lst:derobjectTempl}, line \ref{ln:morphMenu}, two menu options --
+\emph{Tag automatically} and \emph{Tag manually} -- are defined. The former option
+initiates \emph{ProcessCorpusServlet.java}, as can be seen again in listing \ref{lst:webfragment},
+line \ref{ln:process}, which determines the words that are not yet in the master database.
+Still, it is important to note that this menu option is only displayed if two conditions
+are met. First, a file has to be uploaded (line \ref{ln:1test}) and, second, there must be
+only one file (line \ref{ln:2test}). This is necessary because the annotation process generates further files
+that store the words not yet processed, or a file that includes the final result. The
+generated files follow a certain naming pattern: the file harboring the entire TEI-annotated
+corpus is prefixed with \emph{tagged}, the other file with \emph{untagged}. This circumstance
+is exploited for controlling the second option (line \ref{ln:loop}). A loop runs through all
+files in the respective directory and, if a file name starts with \emph{untagged},
+the option to tag manually is displayed.
+
+\begin{lstlisting}[language=XML,caption={template
+matching derobject},label=lst:derobjectTempl,escapechar=|]
+<xsl:template match="derobject" mode="derivateActions">
+  <xsl:param name="deriv" />
+  <xsl:param name="parentObjID" />
+  <xsl:param name="suffix" select="''" />
+  <xsl:param name="id" select="../../../@ID" />
+  <xsl:if test="acl:checkPermission($deriv,'writedb')">
+    <xsl:variable name="ifsDirectory" select="document(concat('ifs:',$deriv,'/'))" />
+    <xsl:variable name="path" select="$ifsDirectory/mcr_directory/path" />
+    ...
+    <div class="options pull-right">
+      <div class="btn-group" style="margin:10px">
+        <a href="#" class="btn btn-default dropdown-toggle" data-toggle="dropdown">
+          <i class="fa fa-cog"></i>
+          <xsl:value-of select="' Korpus'"/>
+          <span class="caret"></span>
+        </a>
+        <ul class="dropdown-menu dropdown-menu-right">
+          <!-- Morphilo adjustments -->|\label{ln:morphMenu}|
+          <xsl:if test="string-length($deriv) > 0">|\label{ln:1test}|
+            <xsl:if test="count($ifsDirectory/mcr_directory/children/child) = 1">|\label{ln:2test}|
+              <li role="presentation">
+                <a href="{$ServletsBaseURL}object/process{$HttpSession}?id={$deriv}&objID={$id}" role="menuitem" tabindex="-1">
+                  <xsl:value-of select="i18n:translate('derivate.process')"/>
+                </a>
+              </li>
+            </xsl:if>
+            <xsl:for-each select="$ifsDirectory/mcr_directory/children/child">|\label{ln:loop}|
+              <xsl:variable name="untagged" select="concat($path, 'untagged')"/>
+              <xsl:variable name="filename" select="concat($path,./name)"/>
+              <xsl:if test="starts-with($filename, $untagged)">
+                <li role="presentation">
+                  <a href="{$ServletsBaseURL}object/tag{$HttpSession}?id={$deriv}&objID={$id}" role="menuitem" tabindex="-1">
+                    <xsl:value-of select="i18n:translate('derivate.taggen')"/>
+                  </a>
+                </li>
+              </xsl:if>
+            </xsl:for-each>
+          </xsl:if>
+          ...
+        </ul>
+      </div>
+    </div>
+  </xsl:if>
+</xsl:template>
+\end{lstlisting}
+
+Besides the two stylesheets morphilo.xsl and corpmeta.xsl, other stylesheets have
+to be adjusted. They will not be discussed in detail here, as they are for the most part self-explanatory.
+Essentially, they render the overall layout (\emph{common-layout.xsl},
+\emph{skeleton\_layout\_template.xsl}), the presentation
+of the search results (\emph{response-page.xsl}), and the definitions of the Solr search fields (\emph{searchfields-solr.xsl}).
+The two search-related stylesheets also inherit templates from \emph{response-general.xsl} and \emph{response-browse.xsl}, in which the
+navigation bar of the search results can be changed. For multilingual support, a separate configuration directory
+has to be created that contains one \emph{.properties} file for each language
+to be displayed. In the current case these are restricted to German and English (\emph{messages\_de.properties} and \emph{messages\_en.properties}).
+The property files include all \emph{i18n} definitions. All these files are located in the \emph{resources} directory.
+
+Furthermore, a search mask and a page for manually entering the annotations had
+to be designed.
+For these files, the use of a specially designed xml standard (\emph{xed}) is recommended within the
+repository framework.
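+To give an impression, the following fragment sketches a minimal xed form for
+editing the lemma attribute of a word object. It is a purely illustrative sketch:
+the binding xpath follows the morphilo data model above, the servlet name
+\emph{UpdateWordServlet} is a hypothetical placeholder, and the exact XEditor
+element and attribute names should be checked against the MyCoRe documentation.
+
+\begin{lstlisting}[language=XML,caption={Minimal xed sketch
+(illustrative only)},label=lst:xedSketch]
+<xed:form xmlns:xed="http://www.mycore.de/xeditor" method="post" role="form">
+  <!-- bind the form fields to the word element of the morphilo data model -->
+  <xed:bind xpath="/mycoreobject/metadata/def.morphiloContainer/morphiloContainer/morphilo/w">
+    <xed:bind xpath="@lemma">
+      <input type="text" class="form-control" />
+    </xed:bind>
+  </xed:bind>
+  <!-- UpdateWordServlet is a hypothetical target servlet -->
+  <button type="submit" xed:target="servlet" xed:href="UpdateWordServlet">Save</button>
+</xed:form>
+\end{lstlisting}
\ No newline at end of file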