diff --git a/Morphilo_doc/_build/doctrees/environment.pickle b/Morphilo_doc/_build/doctrees/environment.pickle index f1fe7ff4f12bb06e800accb161c3a099e8c7447e..44cad1ab2dc882083c923484739e0b01d6305342 100644 Binary files a/Morphilo_doc/_build/doctrees/environment.pickle and b/Morphilo_doc/_build/doctrees/environment.pickle differ diff --git a/Morphilo_doc/_build/doctrees/index.doctree b/Morphilo_doc/_build/doctrees/index.doctree index 78f4aa8256d372d508c143c8fe0ef0613d2fc1a1..b1ce5341c9ec85930100b86af5caa390ac58d26e 100644 Binary files a/Morphilo_doc/_build/doctrees/index.doctree and b/Morphilo_doc/_build/doctrees/index.doctree differ diff --git a/Morphilo_doc/_build/doctrees/source/architecture.doctree b/Morphilo_doc/_build/doctrees/source/architecture.doctree new file mode 100644 index 0000000000000000000000000000000000000000..c7a37e241c3d401cf38326992cd31161cf5e8614 Binary files /dev/null and b/Morphilo_doc/_build/doctrees/source/architecture.doctree differ diff --git a/Morphilo_doc/_build/doctrees/source/controller.doctree b/Morphilo_doc/_build/doctrees/source/controller.doctree index a0523de44000e258728112ae85166c7ff87ddb8b..af065b89378efaf7a1237ecbada92a78dddf1af2 100644 Binary files a/Morphilo_doc/_build/doctrees/source/controller.doctree and b/Morphilo_doc/_build/doctrees/source/controller.doctree differ diff --git a/Morphilo_doc/_build/doctrees/source/datamodel.doctree b/Morphilo_doc/_build/doctrees/source/datamodel.doctree index ff65d2a7cdf4e021a3501a55726cfac05164f696..d9306dc94fb646762de7f705c1140572b16d39c5 100644 Binary files a/Morphilo_doc/_build/doctrees/source/datamodel.doctree and b/Morphilo_doc/_build/doctrees/source/datamodel.doctree differ diff --git a/Morphilo_doc/_build/doctrees/source/framework.doctree b/Morphilo_doc/_build/doctrees/source/framework.doctree new file mode 100644 index 0000000000000000000000000000000000000000..398e1c277effc4fa832fe35372ae81e56970296b Binary files /dev/null and 
b/Morphilo_doc/_build/doctrees/source/framework.doctree differ diff --git a/Morphilo_doc/_build/doctrees/source/view.doctree b/Morphilo_doc/_build/doctrees/source/view.doctree new file mode 100644 index 0000000000000000000000000000000000000000..a2fb1ec70077bbaf14a994eace6594fb7c7a11cc Binary files /dev/null and b/Morphilo_doc/_build/doctrees/source/view.doctree differ diff --git a/Morphilo_doc/_build/html/_sources/index.rst.txt b/Morphilo_doc/_build/html/_sources/index.rst.txt index 3c70750c979e546f0f879e02114d79f00c8ba52b..c7043067b4c65dfe25c731ac0e2deb7221e09da4 100644 --- a/Morphilo_doc/_build/html/_sources/index.rst.txt +++ b/Morphilo_doc/_build/html/_sources/index.rst.txt @@ -3,7 +3,7 @@ You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. -Welcome to Morphilo's documentation! +Documentation Morphilo Project ==================================== .. toctree:: @@ -12,8 +12,9 @@ Welcome to Morphilo's documentation! source/datamodel.rst source/controller.rst - - + source/view.rst + source/architecture.rst + source/framework.rst Indices and tables ================== diff --git a/Morphilo_doc/_build/html/_sources/source/architecture.rst.txt b/Morphilo_doc/_build/html/_sources/source/architecture.rst.txt new file mode 100644 index 0000000000000000000000000000000000000000..5b114bda79b5dc2cca9d10df9d300507bf510317 --- /dev/null +++ b/Morphilo_doc/_build/html/_sources/source/architecture.rst.txt @@ -0,0 +1,66 @@ +Software Design +=============== + + +.. image:: architecture.* + + +The architecture of a possible **take-and-share**-approach for language +resources is visualized in figure \ref{fig:architect}. Because the very gist +of the approach becomes clearer when a concrete example is described, the case of +annotating lexical derivatives of Middle English and a respective database is +given as an illustration.
+However, any other tool that helps with manual annotations and manages metadata of a corpus could be
+substituted here. 
+
+After inputting an untagged corpus or plain text, it is determined whether the
+input material was annotated previously by a different user. This information is
+usually provided by the metadata administered by the annotation tool; in the case at
+hand it is called \emph{Morphilizer} in figure \ref{fig:architect}. An
+alternative is a simple table look-up for all occurring words in the datasets Corpus 1 through Corpus n. If all words are contained,
+the \emph{yes}-branch is followed -- otherwise the \emph{no}-branch
+succeeds. The difference between the two branches is subtle, yet crucial. On
+both branches, the annotation tool (here \emph{Morphilizer}) is called, which, first,
+sorts out all words that are not contained in the master database (here \emph{Morphilo-DB})
+and, second, makes reasonable suggestions on an optimal annotation of
+the items. In both cases the
+annotations are linked to the respective items (e.g. words) in the
+text, but they are also persistently saved in an extra dataset, i.e. Corpus 1
+through n, together with all available metadata.
+
+The difference between both information streams is that
+in the \emph{yes}-branch a comparison between the newly created dataset and
+all of the previous datasets of this text is carried out. Within this
+unit, all deviations and congruencies are marked and counted. The underlying
+assumption is that with a growing number of comparable texts the
+correct annotations approach the theoretical true value of a correct annotation
+while errors level out, provided that the sample size is large enough. 
What the
+distribution of errors and correct annotations looks like exactly, and whether a
+normal distribution can be assumed, is still the object of ongoing research, but
+independent of the concrete results, the component (called \emph{compare
+manual annotations} in figure \ref{fig:architect}) allows for specifying the
+exact form of the sample population.
+In fact, it is necessary at that point to define the form of the distribution,
+the sample size, and the rejection region. The standard settings are a normal
+distribution, a rejection region of $\alpha = 0.05$, and a sample size of $30$ so
+that a simple Gau\ss-Test can be calculated.
+
+Continuing the information flow further, these statistical calculations are
+delivered to the quality-control component. Based on the statistics, the
+respective items together with the metadata, frequencies, and, of course,
+annotations are written to the master database. All information in the master
+database is directly used for automated annotations. Thus it is directly matched
+to the input texts or corpora respectively through the \emph{Morphilizer}-tool.
+The annotation tool decides, based on the entries looked up in the master, which items
+are to be manually annotated.
+
+The processes just described are all hidden from the user, who cannot influence
+the set quality standards other than by errors in the annotation process. The
+user will only see the number of items of the input text he or she will process manually. The
+annotator will also see an estimation of the workload beforehand. Based on this
+number, a decision can be made whether to start the annotation at all. It will be
+possible to interrupt the annotation work and save progress on the server. The
+user will also have access to the annotations made in the respective dataset and can
+correct them or save them and resume later. It is important to note that the user will receive
+the tagged document only after all items are fully annotated. No partially
+tagged text can be output. 
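The statistical check described above (normal distribution, rejection region of $\alpha = 0.05$, sample size of $30$) can be sketched in a few lines of Java. This is a minimal illustration, assuming a one-sample z-test (Gauss test) on the proportion of agreeing annotations; the class and method names are hypothetical and not part of the Morphilo code base.

```java
// Minimal sketch of the simple Gauss test (one-sample z-test) described above.
// All names are illustrative only and do not occur in the Morphilo sources.
public class GaussTestSketch {

    // z = (pHat - p0) / sqrt(p0 * (1 - p0) / n)
    public static double zStatistic(int agreeing, int sampleSize, double p0) {
        double pHat = (double) agreeing / sampleSize;
        return (pHat - p0) / Math.sqrt(p0 * (1.0 - p0) / sampleSize);
    }

    // two-sided rejection at alpha = 0.05 corresponds to |z| > 1.96
    public static boolean rejectNull(double z) {
        return Math.abs(z) > 1.96;
    }

    public static void main(String[] args) {
        // e.g. 24 of 30 annotators agree; null hypothesis: random agreement (p0 = 0.5)
        double z = zStatistic(24, 30, 0.5);
        System.out.println("z = " + z + ", reject: " + rejectNull(z));
    }
}
```

For $n = 30$ annotations of which 24 agree and a null hypothesis of random agreement ($p_0 = 0.5$), the statistic evaluates to roughly $z \approx 3.29$, which falls into the rejection region.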
\ No newline at end of file diff --git a/Morphilo_doc/_build/html/_sources/source/controller.rst.txt b/Morphilo_doc/_build/html/_sources/source/controller.rst.txt index 9d4b5110e710c4b2b09f540962a9485c6ef131cc..6f6b896272e6cade94e54ff34fd76af6bfc28326 100644 --- a/Morphilo_doc/_build/html/_sources/source/controller.rst.txt +++ b/Morphilo_doc/_build/html/_sources/source/controller.rst.txt @@ -2,4 +2,848 @@ Controller Adjustments ====================== General Principle of Operation ------------------------------- \ No newline at end of file +------------------------------
+
+Figure \ref{fig:classDiag} illustrates the dependencies of the five Java classes that were integrated to add the morphilo
+functionality defined in the default package \emph{custom.mycore.addons.morphilo}. The general principle of operation
+is the following. The handling of data search, upload, saving, and user
+authentication is fully left to the MyCoRe functionality, which is already completely
+implemented. The class \emph{ProcessCorpusServlet.java} receives a request from the webinterface to process an uploaded file,
+i.e. a simple text corpus, and it checks if any of the words are available in the master database. All words that are not
+listed in the master database are written to an extra file. These are the words that have to be manually annotated. At the end, the
+servlet sends a response back to the user interface. In case all words are contained in the master, an xml file is generated from the
+master database that includes all annotated words of the original corpus. Usually this will not be the case for larger text files.
+So if some words are not in the master, the user will be prompted to initiate the manual annotation process.
+
+The manual annotation process is handled by the class
+\emph{{Tag\-Corpus\-Serv\-let\-.ja\-va}}, which will build a JDOM object for the first word in the extra file.
+This is done by creating an object of the \emph{JDOMorphilo.java} class. 
This class, in turn, will use the methods of
+\emph{AffixStripper.java} that make simple, but reasonable, suggestions on the word structure. This JDOM object is then
+given as a response back to the user. It is presented as a form, in which the user can make changes. This is necessary
+because the word structure algorithm of \emph{AffixStripper.java} errs in some cases. Once the user agrees on the
+suggestions or on his or her corrections, the JDOM object is saved as an xml that is searchable, visible, and
+changeable only by the authenticated user (and the administrator), another file containing all processed words is created or
+updated respectively, and the \emph{TagCorpusServlet.java} servlet will restart until the last word in the extra list is
+processed. This enables the user to stop and resume her or his annotation work at a later point in time. The
+\emph{TagCorpusServlet} will call methods from \emph{ProcessCorpusServlet.java} to adjust the content of the extra
+files harboring the untagged words. If this file is empty, and only then, it is replaced by the file comprising all words
+from the original text file, both the ones from the master database and the ones that are annotated by the user,
+in an annotated xml representation.
+
+Each time \emph{ProcessCorpusServlet.java} is instantiated, it also instantiates \emph{QualityControl.java}. This class checks if a
+new word can be transferred to the master database. The algorithm can be freely adapted to higher or lower quality standards.
+In its present configuration, a method tests whether 20 different
+registered users agree on the annotation of the same word. More specifically,
+if 20 JDOM objects are identical except in the attribute field \emph{occurrences} in the metadata node, the JDOM object becomes
+part of the master. The latter is easily done by changing the attribute \emph{creator} from the user name
+to \emph{``administrator''} in the service node. 
This makes the dataset part of the master database. Moreover, the \emph{occurrences}
+attribute is updated by adding up all occurrences of the word that stem from
+different text corpora of the same time range.
+\begin{landscape}
+	\begin{figure}
+	\centering
+	\includegraphics[scale=0.55]{morphilo_uml.png}
+	\caption{Class Diagram Morphilo}
+	\label{fig:classDiag}
+	\end{figure}
+\end{landscape}
+
+
+
+Conceptualization
+-----------------
+
+The controller component is largely
+specified and ready to use in some hundred or so Java classes that handle the
+logic of the search, such as indexing, but also deal with directories and
+files, i.e. saving, creating, deleting, and updating them.
+Moreover, a rudimentary user management comprising different roles and
+rights is offered. The basic technology behind the controller's logic is the
+servlet. As such, all new code has to be registered as a servlet in the
+web-fragment.xml of the servlet container (here Apache Tomcat), as listing \ref{lst:webfragment} shows.
+
+\begin{lstlisting}[language=XML,caption={Servlet Registering in the
+web-fragment.xml (excerpt)},label=lst:webfragment,escapechar=|]
+<servlet>
+  <servlet-name>ProcessCorpusServlet</servlet-name>
+  <servlet-class>custom.mycore.addons.morphilo.ProcessCorpusServlet</servlet-class>
+</servlet>
+<servlet-mapping>
+  <servlet-name>ProcessCorpusServlet</servlet-name>
+  <url-pattern>/servlets/object/process</url-pattern>|\label{ln:process}|
+</servlet-mapping>
+<servlet>
+  <servlet-name>TagCorpusServlet</servlet-name>
+  <servlet-class>custom.mycore.addons.morphilo.TagCorpusServlet</servlet-class>
+</servlet>
+<servlet-mapping>
+  <servlet-name>TagCorpusServlet</servlet-name>
+  <url-pattern>/servlets/object/tag</url-pattern>|\label{ln:tag}|
+</servlet-mapping>
+\end{lstlisting}
+
+Now, the logic has to be extended by the specifications analyzed in chapter
+\ref{chap:concept} on conceptualization. 
More specifically, some
+classes have to be added that take care of analyzing words
+(\emph{AffixStripper.java, InflectionEnum.java, SuffixEnum.java,
+PrefixEnum.java}), extracting the relevant words from the text and checking the
+uniqueness of the text (\emph{ProcessCorpusServlet.java}), making reasonable
+suggestions on the annotation (\emph{TagCorpusServlet.java}), building the object
+of each annotated word (\emph{JDOMorphilo.java}), and checking the quality by applying
+statistical models (\emph{QualityControl.java}).
+
+Implementation
+--------------
+
+Having taken a bird's eye perspective in the previous chapter, it is now time to take a look at the specific implementation at the level
+of methods. Starting with the main servlet, \emph{ProcessCorpusServlet.java}, the class defines five getter methods:
+\renewcommand{\labelenumi}{(\theenumi)}
+\begin{enumerate}
+	\item\label{itm:geturl} public String getURLParameter(MCRServletJob, String)
+	\item\label{itm:getcorp} public String getCorpusMetadata(MCRServletJob, String)
+	\item\label{itm:getcont} public ArrayList<String> getContentFromFile(MCRServletJob, String)
+	\item\label{itm:getderiv} public Path getDerivateFilePath(MCRServletJob, String)
+	\item\label{itm:now} public int getNumberOfWords(MCRServletJob job, String)
+\end{enumerate}
+Since each servlet in MyCoRe extends the class MCRServlet, it has access to MCRServletJob, from which the http requests and responses
+can be used. This is the first argument in the above methods. The second argument of method (\ref{itm:geturl}) specifies the name of an url parameter, i.e.
+the object id or the id of the derivate. The method returns the value of the given parameter. Typically MyCoRe uses the url to exchange
+these ids. The second method provides us with the value of a data field in the xml document; here the string argument defines the name of an attribute. 
+
\emph{getContentFromFile(MCRServletJob, String)} returns the words as a list from a file when given the filename as a string.
+The getter listed in (\ref{itm:getderiv}) returns the path from the MyCoRe repository when the name of
+the file is specified. And finally, method (\ref{itm:now}) returns the number of words by simply returning
+\emph{getContentFromFile(job, fileName).size()}.
+
+There are two methods in every MyCoRe-Servlet that have to be overwritten,
+\emph{protected void render(MCRServletJob, Exception)}, which redirects the requests as \emph{POST} or \emph{GET} responses, and
+\emph{protected void think(MCRServletJob)}, in which the logic is implemented. Since the latter is important to understand the
+core idea of the Morphilo algorithm, it is displayed in full length in source code \ref{src:think}.
+
+\begin{lstlisting}[language=java,caption={The overwritten think method},label=src:think,escapechar=|]
+protected void think(MCRServletJob job) throws Exception
+{
+	this.job = job;
+	String dateFromCorp = getCorpusMetadata(job, "def.datefrom");
+	String dateUntilCorp = getCorpusMetadata(job, "def.dateuntil");
+	String corpID = getURLParameter(job, "objID");
+	String derivID = getURLParameter(job, "id");
+
+	//if NoW is 0, fill with anzWords
+	MCRObject helpObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(corpID));|\label{ln:bugfixstart}|
+	Document jdomDocHelp = helpObj.createXML();
+	XPathFactory xpfacty = XPathFactory.instance();
+	XPathExpression<Element> xpExp = xpfacty.compile("//NoW", Filters.element());
+	Element elem = xpExp.evaluateFirst(jdomDocHelp);
+	//fixes transferred morphilo data from previous stand alone project
+	int corpussize = getNumberOfWords(job, "");
+	if (Integer.parseInt(elem.getText()) != corpussize)
+	{
+		elem.setText(Integer.toString(corpussize));
+		helpObj = new MCRObject(jdomDocHelp);
+		MCRMetadataManager.update(helpObj);
+	}|\label{ln:bugfixend}|
+
+	//Check if the uploaded corpus was processed before
+	
SolrClient slr = MCRSolrClientFactory.getSolrClient();|\label{ln:solrstart}|
+	SolrQuery qry = new SolrQuery();
+	qry.setFields("korpusname", "datefrom", "dateuntil", "NoW", "id");
+	qry.setQuery("datefrom:" + dateFromCorp + " AND dateuntil:" + dateUntilCorp + " AND NoW:" + corpussize);
+	SolrDocumentList rslt = slr.query(qry).getResults();|\label{ln:solrresult}|
+
+	Boolean incrOcc = true;
+	// if resultset contains only one, then it must be the newly created corpus
+	if (slr.query(qry).getResults().getNumFound() > 1)
+	{
+		incrOcc = false;
+	}|\label{ln:solrend}|
+
+	//match all words in corpus with morphilo (creator=administrator) and save all words that are not in morphilo DB in leftovers
+	ArrayList<String> leftovers = new ArrayList<String>();
+	ArrayList<String> processed = new ArrayList<String>();
+
+	leftovers = getUnknownWords(getContentFromFile(job, ""), dateFromCorp, dateUntilCorp, "", incrOcc, incrOcc, false);|\label{ln:callkeymeth}|
+
+	//write all words of leftover in file as derivative to respective corpmeta dataset
+	MCRPath root = MCRPath.getPath(derivID, "/");|\label{ln:filesavestart}|
+	Path fn = getDerivateFilePath(job, "").getFileName();
+	Path p = root.resolve("untagged-" + fn);
+	Files.write(p, leftovers);|\label{ln:filesaveend}|
+
+	//create a file for all words that were processed
+	Path procWds = root.resolve("processed-" + fn);
+	Files.write(procWds, processed);
+}
+\end{lstlisting}
+Using the above mentioned getter methods, the \emph{think} method assigns values to the object ID, needed to get the xml document
+that contains the corpus metadata, the file ID, and the beginning and ending dates from the corpus to be analyzed. Lines \ref{ln:bugfixstart}
+through \ref{ln:bugfixend} show how to access a MyCoRe object as an xml document, a procedure that will be used in different variants
+throughout this implementation. 
+
By means of the object ID, the respective corpus is identified and a JDOM document is constructed, which can then be accessed
+by XPath. The XPath factory instances are collections of the xml nodes. In the present case, it is safe to assume that only one element
+of \emph{NoW} is available (see corpus datamodel listing \ref{lst:corpusdatamodel} with $maxOccurs='1'$). So we do not have to loop through
+the collection, but use the first node named \emph{NoW}. The if-test checks if the number of words of the uploaded file is the
+same as the number written in the document. When the document is initially created by the MyCoRe logic, this number is set to zero.
+If unequal, the setText(String) method is used to write the number of words of the corpus to the document.
+
+Lines \ref{ln:solrstart}--\ref{ln:solrend} reveal the second important ingredient, i.e. controlling the search engine. First, a solr
+client and a query are initialized. Then, the output of the result set is defined by giving the fields of interest of the document.
+In the case at hand, it is the id, the name of the corpus, the number of words, and the beginning and ending dates. With \emph{setQuery}
+it is possible to assign values to some or all of these fields. Finally, \emph{getResults()} carries out the search and writes
+all hits to a \emph{SolrDocumentList} (line \ref{ln:solrresult}). The test that follows is really only to set a Boolean
+encoding whether the number of occurrences of that word in the master should be updated. To avoid multiple counts,
+incrementing the word frequency is only done if it is a new corpus.
+
+In line \ref{ln:callkeymeth} \emph{getUnknownWords(ArrayList, String, String, String, Boolean, Boolean, Boolean)} is called, returning
+a list of words. This method is key and will be discussed in depth below. Finally, lines
+\ref{ln:filesavestart}--\ref{ln:filesaveend} show how to handle file objects in MyCoRe. 
Using the file ID, the root path and the name
+of the first file in that path are identified. Then, a second file starting with ``untagged'' is created and all words returned from
+\emph{getUnknownWords} are written to that file. By the same token an empty file is created (in the last two lines of the \emph{think}-method),
+in which all words that are manually annotated will be saved.
+
+In a refactoring phase, the method \emph{getUnknownWords(ArrayList, String, String, String, Boolean, Boolean, Boolean)} could be subdivided into
+three methods: one for each Boolean parameter. In fact, this method handles more than one task. This is mainly in order to avoid duplicated code.
+%this is just wrong because no resultset will substantially be more than 10-20
+%In addition, for large text files this method would run into efficiency problems if the master database also reaches the intended size of about
+%$100,000$ entries and beyond because
+In essence, an outer loop runs through all words of the corpus and an inner loop runs through all hits in the solr result set. Because the result
+set is supposed to be small, approximately between $10-20$ items, efficiency
+issues are unlikely, although there are some more loops running through collections of about the same size.
+%As the hits naturally grow larger with an increasing size of the data base, processing time will rise exponentially.
+Since each word is identified on the basis of its projected word type, the word form, and the time range it falls into, it is these variables that
+have to be checked for existence in the documents. If they are not in the xml documents,
+\emph{null} is returned and needs to be handled. Moreover, user authentication must be considered. There are three different XPaths that are relevant. 
+
\begin{itemize}
+	\item[-] \emph{//service/servflags/servflag[@type='createdby']} to test for the correct user
+	\item[-] \emph{//morphiloContainer/morphilo} to create the annotated document
+	\item[-] \emph{//morphiloContainer/morphilo/w} to set occurrences or add a link
+\end{itemize}
+
+As an illustration of the core functioning of this method, listing \ref{src:getUnknowWords} is given.
+\begin{lstlisting}[language=java,caption={Mode of Operation of getUnknownWords Method},label=src:getUnknowWords,escapechar=|]
+public ArrayList<String> getUnknownWords(
+		ArrayList<String> corpus,
+		String timeCorpusBegin,
+		String timeCorpusEnd,
+		String wdtpe,
+		Boolean setOcc,
+		Boolean setXlink,
+		Boolean writeAllData) throws Exception
+	{
+		String currentUser = MCRSessionMgr.getCurrentSession().getUserInformation().getUserID();
+		ArrayList lo = new ArrayList();
+
+		for (int i = 0; i < corpus.size(); i++)
+		{
+			SolrClient solrClient = MCRSolrClientFactory.getSolrClient();
+			SolrQuery query = new SolrQuery();
+			query.setFields("w","occurrence","begin","end", "id", "wordtype");
+			query.setQuery(corpus.get(i));
+			query.setRows(50); //more than 50 items are extremely unlikely
+			SolrDocumentList results = solrClient.query(query).getResults();
+			Boolean available = false;
+			for (int entryNum = 0; entryNum < results.size(); entryNum++)
+			{
+				...
+				// update in MCRMetaDataManager
+				String mcrIDString = results.get(entryNum).getFieldValue("id").toString();
+				//read out the MCRObject and create a JDOM document:
+				MCRObject mcrObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(mcrIDString));
+				Document jdomDoc = mcrObj.createXML();
+				...
+				//check and correction for word type
+				...
+				//check and correction for time: timeCorrect
+				...
+				//check if user correct: isAuthorized
+				... 
+
				XPathExpression<Element> xp = xpfac.compile("//morphiloContainer/morphilo/w", Filters.element());
+				//Iterates w-elements and increments occurrence attribute if setOcc is true
+				for (Element e : xp.evaluate(jdomDoc))
+				{
+					//if the user is authorized and the word type is either not given anywhere or equal
+					if (isAuthorized && timeCorrect
+						&& ((e.getAttributeValue("wordtype") == null && wdtpe.equals(""))
+						|| e.getAttributeValue("wordtype").equals(wdtpe))) // only for unification
+					{
+						int oc = -1;
+						available = true;|\label{ln:available}|
+						try
+						{
+							//adjust occurrence attribute
+							if (setOcc)
+							{
+								oc = Integer.parseInt(e.getAttributeValue("occurrence"));
+								e.setAttribute("occurrence", Integer.toString(oc + 1));
+							}
+
+							//write morphilo-ObjectID in xml of corpmeta
+							if (setXlink)
+							{
+								Namespace xlinkNamespace = Namespace.getNamespace("xlink", "http://www.w3.org/1999/xlink");|\label{ln:namespace}|
+								MCRObject corpObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(getURLParameter(job, "objID")));
+								Document corpDoc = corpObj.createXML();
+								XPathExpression<Element> xpathEx = xpfac.compile("//corpuslink", Filters.element());
+								Element elm = xpathEx.evaluateFirst(corpDoc);
+								elm.setAttribute("href" , mcrIDString, xlinkNamespace);
+							}
+							mcrObj = new MCRObject(jdomDoc);|\label{ln:updatestart}|
+							MCRMetadataManager.update(mcrObj);
+							QualityControl qc = new QualityControl(mcrObj);|\label{ln:updateend}|
+						}
+						catch(NumberFormatException except)
+						{
+							// ignore
+						}
+					}
+				}
+				if (!available) // if not available in datasets under the given conditions |\label{ln:notavailable}|
+				{
+					lo.add(corpus.get(i));
+				}
+			}
+			return lo;
+	}
+\end{lstlisting}
+As can be seen from the functionality of listing \ref{src:getUnknowWords}, getting the unknown words of a corpus is rather a side effect of the equally named method.
+More precisely, a Boolean (line \ref{ln:available}) is set when the document is manipulated in any other way, because it is then clear that the word must exist. 
+
If the Boolean remains false (line \ref{ln:notavailable}), the word is put on the list of words that have to be annotated manually. As already explained above, the
+first loop runs through all words of the corpus, and in the following lines a solr result set is created. This set is also looped through, and it is checked
+whether the time range and the word type are correct and the user is authorized. In the remainder, the occurrence attribute of the morphilo document can be incremented (setOcc is true) and/or the word is linked to the
+corpus meta data (setXlink is true). While all code lines are equivalent with
+what was explained in listing \ref{src:think}, it suffices to focus on the
+additional namespace, i.e.
+``xlink'', which has to be defined (line \ref{ln:namespace}). Once the linking of word
+and corpus is set, the entire MyCoRe object has to be updated. This is done by the functionality of the framework (lines \ref{ln:updatestart}--\ref{ln:updateend}).
+At the end, an instance of \emph{QualityControl} is created.
+
+%QualityControl
+The class \emph{QualityControl} is instantiated with a constructor
+depicted in listing \ref{src:constructQC}.
+\begin{lstlisting}[language=java,caption={Constructor of QualityControl.java},label=src:constructQC,escapechar=|]
+private MCRObject mycoreObject;
+/* Constructor calls method to carry out quality control, i.e. if at least 20
+ * different users agree 100% on the segments of the word under investigation
+ */
+public QualityControl(MCRObject mycoreObject) throws Exception
+{
+	this.mycoreObject = mycoreObject;
+	if (getEqualObjectNumber() > 20)
+	{
+		addToMorphiloDB();
+	}
+}
+\end{lstlisting}
+The constructor takes a MyCoRe object, a potential word candidate for the
+master database, which is assigned to a private class variable because the
+object is used, though not changed, by some other methods.
+More importantly, there are two more methods: \emph{getEqualObjectNumber()} and
+\emph{addToMorphiloDB()}. 
While the former initiates a process of counting and
+comparing objects, the latter is concerned with calculating the correct number
+of occurrences from different, but not identical, texts, and generating a MyCoRe object with the same content but with two different flags in the \emph{//service/servflags/servflag}-node, i.e. \emph{createdby='administrator'} and \emph{state='published'}.
+And of course, the \emph{occurrence} attribute is set to the newly calculated value. The logic corresponds exactly to what was explained in
+listing \ref{src:think} and will not be repeated here. The only difference is the paths compiled by the XPathFactory. They are
+\begin{itemize}
+	\item[-] \emph{//service/servflags/servflag[@type='createdby']} and
+	\item[-] \emph{//service/servstates/servstate[@classid='state']}.
+\end{itemize}
+It is more instructive to document how the number of occurrences is calculated. There are two steps involved. First, a list with all mycore objects that are
+equal to the object which the class is instantiated with (``mycoreObject'' in listing \ref{src:constructQC}) is created. This list is looped through and all occurrence
+attributes are summed up. Second, all occurrences from equal texts are subtracted. Equal texts are identified on the basis of their metadata and derivates.
+There are some obvious shortcomings of this approach, which will be discussed in chapter \ref{chap:results}, section \ref{sec:improv}. Here, it suffices to
+understand the mode of operation. Listing \ref{src:equalOcc} shows a possible solution. 
+
\begin{lstlisting}[language=java,caption={Occurrence Extraction from Equal Texts (1)},label=src:equalOcc,escapechar=|]
+/* returns number of Occurrences if Objects are equal, zero otherwise
+ */
+private int getOccurrencesFromEqualTexts(MCRObject mcrobj1, MCRObject mcrobj2) throws SAXException, IOException
+{
+	int occurrences = 1;
+	//extract corpmeta ObjectIDs from morphilo-Objects
+	String crpID1 = getAttributeValue("//corpuslink", "href", mcrobj1);
+	String crpID2 = getAttributeValue("//corpuslink", "href", mcrobj2);
+	//get these two corpmeta Objects
+	MCRObject corpo1 = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(crpID1));
+	MCRObject corpo2 = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(crpID2));
+	//are the texts equal? get list of 'processed-words' derivate
+	String corp1DerivID = getAttributeValue("//structure/derobjects/derobject", "href", corpo1);
+	String corp2DerivID = getAttributeValue("//structure/derobjects/derobject", "href", corpo2);
+
+	ArrayList result = new ArrayList(getContentFromFile(corp1DerivID, ""));|\label{ln:writeContent}|
+	result.removeAll(getContentFromFile(corp2DerivID, ""));|\label{ln:removeContent}|
+	if (result.size() == 0) // the texts are equal
+	{
+		// extract occurrences of one the objects
+		occurrences = Integer.parseInt(getAttributeValue("//morphiloContainer/morphilo/w", "occurrence", mcrobj1));
+	}
+	else
+	{
+		occurrences = 0; //project metadata happened to be the same, but texts are different
+	}
+	return occurrences;
+}
+\end{lstlisting}
+In this implementation, the ids from the \emph{corpmeta} data model are accessed via the xlink attribute in the morphilo documents.
+The method \emph{getAttributeValue(String, String, MCRObject)} does exactly the same as demonstrated earlier (see from line \ref{ln:namespace}
+on in listing \ref{src:getUnknowWords}). The underlying logic is that the texts are equal if exactly the same words were uploaded. 
+So all words from one file are written to a list (line \ref{ln:writeContent}) and the words from the other file are removed from the
+very same list (line \ref{ln:removeContent}). If this list is empty, then both files must have contained the same words and the occurrences
+are adjusted accordingly. Since this method is called from another private method that only contains a loop through all equal objects, one gets
+the occurrences from all equal texts. For the sake of traceability, the looping method is also given:
+\begin{lstlisting}[language=java,caption={Occurrence Extraction from Equal Texts (2)},label=src:equalOcc2,escapechar=|]
+private int getOccurrencesFromEqualTexts() throws Exception
+{
+	ArrayList<MCRObject> equalObjects = new ArrayList<MCRObject>();
+	equalObjects = getAllEqualMCRObjects();
+	int occurrences = 0;
+	for (MCRObject obj : equalObjects)
+	{
+		occurrences = occurrences + getOccurrencesFromEqualTexts(mycoreObject, obj);
+	}
+	return occurrences;
+}
+\end{lstlisting}
+
+Now, the constructor in listing \ref{src:constructQC} reveals another method that rolls out an equally complex chain of procedures.
+As implied above, \emph{getEqualObjectNumber()} returns the number of equally annotated words. It does this by falling back to another
+method from which the size of the returned list is calculated (\emph{getAllEqualMCRObjects().size()}). Hence, we should take a closer look at
+\emph{getAllEqualMCRObjects()}. This method has essentially the same design as \emph{int getOccurrencesFromEqualTexts()} in listing \ref{src:equalOcc2}.
+The difference is that another method (\emph{Boolean compareMCRObjects(MCRObject, MCRObject, String)}) is used within the loop and
+that all equal objects are put into the list of MyCoRe objects that is returned. If this list comprises more than 20
+entries,\footnote{This number is somewhat arbitrary.
It is inspired by the sample size n in t-distributed data.} the respective document +will be integrated in the master data base by the process described above. +The comparator logic is shown in listing \ref{src:compareMCR}. +\begin{lstlisting}[language=java,caption={Comparison of MyCoRe objects},label=src:compareMCR,escapechar=|] +private Boolean compareMCRObjects(MCRObject mcrobj1, MCRObject mcrobj2, String xpath) throws SAXException, IOException +{ + Boolean isEqual = false; + Boolean beginTime = false; + Boolean endTime = false; + Boolean occDiff = false; + Boolean corpusDiff = false; + + String source = getXMLFromObject(mcrobj1, xpath); + String target = getXMLFromObject(mcrobj2, xpath); + + XMLUnit.setIgnoreAttributeOrder(true); + XMLUnit.setIgnoreComments(true); + XMLUnit.setIgnoreDiffBetweenTextAndCDATA(true); + XMLUnit.setIgnoreWhitespace(true); + XMLUnit.setNormalizeWhitespace(true); + + //differences in occurrences, end, begin should be ignored + try + { + Diff xmlDiff = new Diff(source, target); + DetailedDiff dd = new DetailedDiff(xmlDiff); + //counters for differences + int i = 0; + int j = 0; + int k = 0; + int l = 0; + // list containing all differences + List differences = dd.getAllDifferences();|\label{ln:difflist}| + for (Object object : differences) + { + Difference difference = (Difference) object; + //@begin,@end,... 
node is not in the difference list if the count is 0 + if (difference.getControlNodeDetail().getXpathLocation().endsWith("@begin")) i++;|\label{ln:diffbegin}| + if (difference.getControlNodeDetail().getXpathLocation().endsWith("@end")) j++; + if (difference.getControlNodeDetail().getXpathLocation().endsWith("@occurrence")) k++; + if (difference.getControlNodeDetail().getXpathLocation().endsWith("@corpus")) l++;|\label{ln:diffend}| + //@begin and @end have different values: they must be checked if they fall right in the allowed time range + if ( difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation()) + && difference.getControlNodeDetail().getXpathLocation().endsWith("@begin") + && (Integer.parseInt(difference.getControlNodeDetail().getValue()) < Integer.parseInt(difference.getTestNodeDetail().getValue())) ) + { + beginTime = true; + } + if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation()) + && difference.getControlNodeDetail().getXpathLocation().endsWith("@end") + && (Integer.parseInt(difference.getControlNodeDetail().getValue()) > Integer.parseInt(difference.getTestNodeDetail().getValue())) ) + { + endTime = true; + } + //attribute values of @occurrence and @corpus are ignored if they are different + if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation()) + && difference.getControlNodeDetail().getXpathLocation().endsWith("@occurrence")) + { + occDiff = true; + } + if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation()) + && difference.getControlNodeDetail().getXpathLocation().endsWith("@corpus")) + { + corpusDiff = true; + } + } + //if any of @begin, @end ... 
is identical set Boolean to true
+		if (i == 0) beginTime = true;|\label{ln:zerobegin}|
+		if (j == 0) endTime = true;
+		if (k == 0) occDiff = true;
+		if (l == 0) corpusDiff = true;|\label{ln:zeroend}|
+		//if the size of differences is greater than the number of changes admitted in @begin, @end ... something else must be different
+		if (beginTime && endTime && occDiff && corpusDiff && (i + j + k + l) == dd.getAllDifferences().size()) isEqual = true;|\label{ln:diffsum}|
+	}
+	catch (SAXException e)
+	{
+		e.printStackTrace();
+	}
+	catch (IOException e)
+	{
+		e.printStackTrace();
+	}
+	return isEqual;
+}
+\end{lstlisting}
+In this method, XMLUnit is heavily used to make all necessary node comparisons. The matter becomes more complicated, however, if some attributes
+are not only ignored but evaluated according to a given definition, as is the case for the time range. If the evaluator and builder classes are
+not to be overridden entirely, because they are needed for evaluating other nodes of the
+XML document, the above solution appears a bit awkward. So there is potential for improvement before the production version is programmed.
+
+XMLUnit provides us with a
+list of the differences between the two documents (see line \ref{ln:difflist}). Four differences are allowed, namely in the attributes \emph{occurrence},
+\emph{corpus}, \emph{begin}, and \emph{end}. For each of them a Boolean variable is set. Because any of these attributes could also be equal to the master
+document, and the difference list only contains the actual differences, one has to find a way to define both cases, equal and different, for these attributes.
+This could be done by ignoring these nodes. Yet, this would not include testing whether the beginning and ending dates fall into the range of the master
+document. Therefore the attributes are counted, as lines \ref{ln:diffbegin} through \ref{ln:diffend} reveal.
If any two documents
+differ in nothing but (some of) the four attributes just specified, then the sum of the counters (line \ref{ln:diffsum}) equals the total number of differences
+collected by XMLUnit; if anything else differs as well, the sum is smaller. The rest of the if-tests assign truth values to the respective
+Booleans. It is probably worth mentioning that if a counter is zero (lines
+\ref{ln:zerobegin}-\ref{ln:zeroend}), the respective attribute values are identical and hence the Boolean has to be set to true explicitly. Otherwise the test in line \ref{ln:diffsum} would fail.
+
+%TagCorpusServlet
+Once quality control (explained in detail further down) has been passed, it is
+the user's turn to interact further. By clicking on the option \emph{Manual tagging}, the user calls the \emph{TagCorpusServlet}. This servlet instantiates
+\emph{ProcessCorpusServlet} to get access to the \emph{getUnknownWords}-method, which delivers the words still to be
+processed and which overwrites the content of the file starting with \emph{untagged}. For the next word in \emph{leftovers} a new MyCoRe object is created
+using the JDOM API and added to the file beginning with \emph{processed}. In line \ref{ln:tagmanu} of listing \ref{src:tagservlet}, the previously defined
+entry mask is called, with which the proposed word structure can be confirmed or changed. How the word structure is determined will be shown later in
+the text.
+\begin{lstlisting}[language=java,caption={Manual Tagging Procedure},label=src:tagservlet,escapechar=|]
+...
+if (!leftovers.isEmpty())
+{
+	ArrayList<String> processed = new ArrayList<String>();
+	//processed.add(leftovers.get(0));
+	JDOMorphilo jdm = new JDOMorphilo();
+	MCRObject obj = jdm.createMorphiloObject(job, leftovers.get(0));|\label{ln:jdomobject}|
+	//write word to be annotated in process list and save it
+	Path filePathProc = pcs.getDerivateFilePath(job, "processed").getFileName();
+	Path proc = root.resolve(filePathProc);
+	processed = pcs.getContentFromFile(job, "processed");
+	processed.add(leftovers.get(0));
+	Files.write(proc, processed);
+
+	//call entry mask for next word
+	tagUrl = prop.getBaseURL() + "content/publish/morphilo.xed?id=" + obj.getId();|\label{ln:tagmanu}|
+}
+else
+{
+	//initiate process to give a complete tagged file of the original corpus
+	//if untagged-file is empty, match original file with morphilo
+	//creator=administrator OR creator=username and write matches in a new file
+	ArrayList<String> complete = new ArrayList<String>();
+	ProcessCorpusServlet pcs2 = new ProcessCorpusServlet();
+	complete = pcs2.getUnknownWords(
+		pcs2.getContentFromFile(job, ""), //main corpus file
+		pcs2.getCorpusMetadata(job, "def.datefrom"),
+		pcs2.getCorpusMetadata(job, "def.dateuntil"),
+		"", //wordtype
+		false,
+		false,
+		true);
+
+	Files.delete(p);
+	MCRXMLFunctions mdm = new MCRXMLFunctions();
+	String mainFile = mdm.getMainDocName(derivID);
+	Path newRoot = root.resolve("tagged-" + mainFile);
+	Files.write(newRoot, complete);
+
+	//return to Menu page
+	tagUrl = prop.getBaseURL() + "receive/" + corpID;
+}
+\end{lstlisting}
+At the point where no more items are left in \emph{leftovers}, the \emph{getUnknownWords}-method is called with the last Boolean parameter
+set to true. This indicates that the array list containing all available and relevant data for the respective user is returned, as seen in
+the code snippet in listing \ref{src:writeAll}.
+\begin{lstlisting}[language=java,caption={Code snippet to deliver all data to the user},label=src:writeAll,escapechar=|]
+...
+// all data is written to lo in TEI
+if (writeAllData && isAuthorized && timeCorrect)
+{
+	XPathExpression<Element> xpath = xpfac.compile("//morphiloContainer/morphilo", Filters.element());
+	for (Element e : xpath.evaluate(jdomDoc))
+	{
+		XMLOutputter outputter = new XMLOutputter();
+		outputter.setFormat(Format.getPrettyFormat());
+		lo.add(outputter.outputString(e.getContent()));
+	}
+}
+...
+\end{lstlisting}
+The complete list (\emph{lo}) is written to yet a third file starting with \emph{tagged} and finally returned to the main project webpage.
+
+%JDOMorphilo
+The interesting question now is where the word structure that is filled into the entry mask, as asserted above, comes from.
+In listing \ref{src:tagservlet} line \ref{ln:jdomobject}, one can see that a JDOM object is created and the method
+\emph{createMorphiloObject(MCRServletJob, String)} is called. The string parameter is the word that needs to be analyzed.
+Most of the method is a mere application of the JDOM API given the data model in chapter \ref{chap:concept} section
+\ref{subsec:datamodel} and listing \ref{lst:worddatamodel}. That means namespaces, elements, and their attributes are defined in the correct
+order and hierarchy.
+
+To fill the elements and attributes with text, i.e. prefixes, suffixes, stems, etc., a hash map -- containing the morpheme as
+key and its position as value -- is created and filled with the results of an AffixStripper instantiation. Depending on how many prefixes
+or suffixes are put into the hash map, the same number of XML elements is created. As a final step, a valid MyCoRe id is generated using
+the existing MyCoRe functionality, the object is created and returned to the TagCorpusServlet.
+
+%AffixStripper explanation
+Last, the analysis of the word structure will be considered.
It is implemented
+in the \emph{AffixStripper.java} file.
+All lexical affix morphemes and their allomorphs as well as the inflections were extracted from the
+OED\footnote{Oxford English Dictionary http://www.oed.com/} and saved as enumerated lists (see the example in listing \ref{src:enumPref}).
+The allomorphic items of these lists are successively matched against the beginning of the word in the case of prefixes
+(see listing \ref{src:analyzePref}, line \ref{ln:prefLoop}) or against the end of the word in the case of suffixes
+(see listing \ref{src:analyzeSuf}). Since each
+morphemic variant maps directly to its morpheme, it makes sense to use the morpheme and thus
+implicitly keep the relation to its allomorph.
+
+\begin{lstlisting}[language=java,caption={Enumeration Example for the Prefix "over"},label=src:enumPref,escapechar=|]
+package custom.mycore.addons.morphilo;
+
+public enum PrefixEnum {
+...
+	over("over"), ufer("over"), ufor("over"), uferr("over"), uvver("over"), obaer("over"), ober("over"), ofaer("over"),
+	ofere("over"), ofir("over"), ofor("over"), ofer("over"), ouer("over"), oferr("over"), offerr("over"), offr("over"), aure("over"),
+	war("over"), euer("over"), oferre("over"), oouer("over"), oger("over"), ouere("over"), ouir("over"), ouire("over"),
+	ouur("over"), ouver("over"), ouyr("over"), ovar("over"), overe("over"), ovre("over"), ovur("over"), owuere("over"), owver("over"),
+	houyr("over"), ouyre("over"), ovir("over"), ovyr("over"), hover("over"), auver("over"), awver("over"), ovver("over"),
+	hauver("over"), ova("over"), ove("over"), obuh("over"), ovah("over"), ovuh("over"), ofowr("over"), ouuer("over"), oure("over"),
+	owere("over"), owr("over"), owre("over"), owur("over"), owyr("over"), our("over"), ower("over"), oher("over"),
+	ooer("over"), oor("over"), owwer("over"), ovr("over"), owir("over"), oar("over"), aur("over"), oer("over"), ufara("over"),
+	ufera("over"), ufere("over"), uferra("over"), ufora("over"), ufore("over"), ufra("over"), ufre("over"),
ufyrra("over"),
+	yfera("over"), yfere("over"), yferra("over"), uuera("over"), ufe("over"), uferre("over"), uuer("over"), uuere("over"),
+	vfere("over"), vuer("over"), vuere("over"), vver("over"), uvvor("over") ...
+...
+	private String morpheme;
+	//constructor
+	PrefixEnum(String morpheme)
+	{
+		this.morpheme = morpheme;
+	}
+	//getter method
+	public String getMorpheme()
+	{
+		return this.morpheme;
+	}
+}
+\end{lstlisting}
+As can be seen in line \ref{ln:prefPutMorph} in listing \ref{src:analyzePref}, the morpheme is saved to a hash map together with its position, i.e. the
+current size of the map plus one. In line \ref{ln:prefCutoff} the \emph{analyzePrefix} method is recursively called until no more matches can be made.
+
+\begin{lstlisting}[language=java,caption={Method to recognize prefixes},label=src:analyzePref,escapechar=|]
+private Map<String, Integer> prefixMorpheme = new HashMap<String, Integer>();
+...
+private void analyzePrefix(String restword)
+{
+	if (!restword.isEmpty()) //termination condition for the recursion
+	{
+		for (PrefixEnum prefEnum : PrefixEnum.values())|\label{ln:prefLoop}|
+		{
+			String s = prefEnum.toString();
+			if (restword.startsWith(s))
+			{
+				prefixMorpheme.put(s, prefixMorpheme.size() + 1);|\label{ln:prefPutMorph}|
+				//cut off the prefix that was added to the list
+				analyzePrefix(restword.substring(s.length()));|\label{ln:prefCutoff}|
+			}
+			else
+			{
+				analyzePrefix("");
+			}
+		}
+	}
+}
+\end{lstlisting}
+
+The recognition of suffixes differs only in the cut-off direction, since suffixes occur at the end of a word.
+Hence, in the case of suffixes, line \ref{ln:prefCutoff} in listing \ref{src:analyzePref} reads as follows.
+
+\begin{lstlisting}[language=java,caption={Cut-off mechanism for suffixes},label=src:analyzeSuf,escapechar=|]
+analyzeSuffix(restword.substring(0, restword.length() - s.length()));
+\end{lstlisting}
+
+It is important to note that inflections are suffixes (in the given model case of Middle English morphology) that occur only once, at the very
+end of a word, i.e. after all lexical suffixes. It follows that inflections
+have to be recognized first and without any repetition. So the procedure for inflections can be simplified
+to a substantial degree, as listing \ref{src:analyzeInfl} shows.
+
+\begin{lstlisting}[language=java,caption={Method to recognize inflections},label=src:analyzeInfl,escapechar=|]
+private String analyzeInflection(String wrd)
+{
+	String infl = "";
+	for (InflectionEnum inflEnum : InflectionEnum.values())
+	{
+		if (wrd.endsWith(inflEnum.toString()))
+		{
+			infl = inflEnum.toString();
+		}
+	}
+	return infl;
+}
+\end{lstlisting}
+
+Unfortunately, the embeddedness problem prevents a very simple algorithm. Embeddedness occurs when one lexical item
+is a substring of another lexical item. To illustrate, the suffix \emph{ion} is also contained in the suffix \emph{ation}, as is
+\emph{ent} in \emph{ment}, and so on. The embeddedness problem cannot be solved completely on the basis of linear modelling, but
+for a large part of the embedded items one can work around it by implicitly using Zipf's law, i.e. the correlation between the frequency
+and the length of lexical items: the longer a word becomes, the less frequently it occurs. The simplest heuristic derived from this is to prefer
+longer suffixes (measured in letters) over shorter ones, because the longer the matched suffix string becomes,
+the more likely it is to represent one (as opposed to several) suffix unit(s).
This is done in listing \ref{src:embedAffix}, where the map
+\emph{sortedByLengthMap} keeps its keys sorted by length and the loop from line \ref{ln:deleteAffix} onwards deletes
+the respective substrings.
+
+\begin{lstlisting}[language=java,caption={Method to work around embeddedness},label=src:embedAffix,escapechar=|]
+private Map<String, Integer> sortOutAffixes(Map<String, Integer> affix)
+{
+	Map<String, Integer> sortedByLengthMap = new TreeMap<String, Integer>(new Comparator<String>()
+	{
+		@Override
+		public int compare(String s1, String s2)
+		{
+			int cmp = Integer.compare(s1.length(), s2.length());
+			return cmp != 0 ? cmp : s1.compareTo(s2);
+		}
+	});
+	sortedByLengthMap.putAll(affix);
+	ArrayList<String> al1 = new ArrayList<String>(sortedByLengthMap.keySet());
+	ArrayList<String> al2 = new ArrayList<String>(al1); //copy, so that reversing does not affect al1
+	Collections.reverse(al2);
+	for (String s2 : al1)|\label{ln:deleteAffix}|
+	{
+		for (String s1 : al2)
+		{
+			if (s1.contains(s2) && s1.length() > s2.length())
+			{
+				affix.remove(s2);
+			}
+		}
+	}
+	return affix;
+}
+\end{lstlisting}
+
+Finally, the position of the affix has to be calculated because the hash map in line \ref{ln:prefPutMorph} in
+listing \ref{src:analyzePref} does not preserve the original order after the changes made to address affix embeddedness
+(listing \ref{src:embedAffix}). Listing \ref{src:affixPos} depicts the preferred solution.
+The recursive construction of the method is similar to \emph{private void analyzePrefix(String)} (listing \ref{src:analyzePref}),
+only that the two affix types are handled in one method. For that, an additional parameter taking either the value \emph{suffix}
+or \emph{prefix} is included.
+
+\begin{lstlisting}[language=java,caption={Method to determine position of the affix},label=src:affixPos,escapechar=|]
+private void getAffixPosition(Map<String, Integer> affix, String restword, int pos, String affixtype)
+{
+	if (!restword.isEmpty()) //termination condition for the recursion
+	{
+		for (String s : affix.keySet())
+		{
+			if (restword.startsWith(s) && affixtype.equals("prefix"))
+			{
+				pos++;
+				prefixMorpheme.put(s, pos);
+				//prefixAllomorph.add(pos-1, restword.substring(s.length()));
+				getAffixPosition(affix, restword.substring(s.length()), pos, affixtype);
+			}
+			else if (restword.endsWith(s) && affixtype.equals("suffix"))
+			{
+				pos++;
+				suffixMorpheme.put(s, pos);
+				//suffixAllomorph.add(pos-1, restword.substring(s.length()));
+				getAffixPosition(affix, restword.substring(0, restword.length() - s.length()), pos, affixtype);
+			}
+			else
+			{
+				getAffixPosition(affix, "", pos, affixtype);
+			}
+		}
+	}
+}
+\end{lstlisting}
+
+To give the complete word structure, the root of a word should also be provided. In listing \ref{src:rootAnalyze} a simple solution is offered that also
+considers compounds, i.e. words consisting of more than one root.
+\begin{lstlisting}[language=java,caption={Method to determine roots},label=src:rootAnalyze,escapechar=|]
+private ArrayList<String> analyzeRoot(Map<String, Integer> pref, Map<String, Integer> suf, int stemNumber)
+{
+	ArrayList<String> root = new ArrayList<String>();
+	int j = 1; //one root always exists
+	// if the word is a compound, several roots exist
+	while (j <= stemNumber)
+	{
+		j++;
+		String rest = lemma;|\label{ln:lemma}|
+
+		for (int i = 0; i < pref.size(); i++)
+		{
+			for (String s : pref.keySet())
+			{
+				//if (i == pref.get(s))
+				if (rest.length() > s.length() && s.equals(rest.substring(0, s.length())))
+				{
+					rest = rest.substring(s.length(), rest.length());
+				}
+			}
+		}
+
+		for (int i = 0; i < suf.size(); i++)
+		{
+			for (String s : suf.keySet())
+			{
+				//if (i == suf.get(s))
+				if (s.length() < rest.length() && (s.equals(rest.substring(rest.length() - s.length(), rest.length()))))
+				{
+					rest = rest.substring(0, rest.length() - s.length());
+				}
+			}
+		}
+		root.add(rest);
+	}
+	return root;
+}
+\end{lstlisting}
+The logic behind this method is that the root is the remainder of a word when all prefixes and suffixes are subtracted.
+So the loops run through the number of prefixes and suffixes at each position and subtract the affix. Admittedly, there is
+some code duplication with the previously described methods, which could be eliminated by making the code more modular in a possible
+refactoring phase. Again, this is not the concern of a prototype. Line \ref{ln:lemma} defines the initial state of a root,
+which is already the final result in the case of monomorphemic words. The \emph{lemma} is defined as the wordtoken without the inflection.
Thus listing
+\ref{src:lemmaAnalyze} reveals how this class variable is calculated:
+\begin{lstlisting}[language=java,caption={Method to determine lemma},label=src:lemmaAnalyze,escapechar=|]
+/*
+ * Simplification: lemma = wordtoken - inflection
+ */
+private String analyzeLemma(String wrd, String infl)
+{
+	return wrd.substring(0, wrd.length() - infl.length());
+}
+\end{lstlisting}
+The constructor of \emph{AffixStripper} calls the method \emph{analyzeWord()}
+whose only job is to calculate each structure element in the correct order
+(listing \ref{src:analyzeWord}). All structure elements are also provided by getters.
+\begin{lstlisting}[language=java,caption={Method to determine the complete word structure},label=src:analyzeWord,escapechar=|]
+private void analyzeWord()
+{
+	//analyze the inflection first because it always occurs at the end of a word
+	inflection = analyzeInflection(wordtoken);
+	lemma = analyzeLemma(wordtoken, inflection);
+	analyzePrefix(lemma);
+	analyzeSuffix(lemma);
+	getAffixPosition(sortOutAffixes(prefixMorpheme), lemma, 0, "prefix");
+	getAffixPosition(sortOutAffixes(suffixMorpheme), lemma, 0, "suffix");
+	prefixNumber = prefixMorpheme.size();
+	suffixNumber = suffixMorpheme.size();
+	wordroot = analyzeRoot(prefixMorpheme, suffixMorpheme, getStemNumber());
+}
+\end{lstlisting}
+
+To conclude, the Morphilo implementation as presented here aims at fulfilling the task of a working prototype. It is important to note
+that it claims to be neither particularly efficient nor a finished software product ready for production. However, it marks a crucial milestone
+on the way to a production system. For some listings, potential improvements were made explicit; for others, no suggestions were made. In the latter
+case this does not imply that there is no potential for improvement. Once acceptability tests are carried out, it will be the task of a follow-up project
+to identify these potentials and implement them accordingly.
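To make the affix-stripping strategy described above easier to follow in isolation, a minimal, self-contained sketch is given below. All names here (the class \emph{AffixSketch}, the toy prefix inventory) are invented for illustration and are not part of the Morphilo sources; the real inventories are the OED-derived enumerations shown earlier. The longest-match-first loop mirrors the workaround for the embeddedness problem.

```java
// Minimal sketch of recursive prefix stripping with a longest-first
// preference (hypothetical names; NOT the actual Morphilo code).
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class AffixSketch {

    // toy inventory; the real lists are extracted from the OED
    private static final List<String> PREFIXES = List.of("over", "un");

    // Repeatedly strip prefixes from the front, trying longer
    // candidates first (the embeddedness workaround).
    public static List<String> stripPrefixes(String word, StringBuilder rest) {
        List<String> byLength = new ArrayList<>(PREFIXES);
        byLength.sort(Comparator.comparingInt(String::length).reversed());
        List<String> found = new ArrayList<>();
        String w = word;
        boolean matched = true;
        while (matched) {
            matched = false;
            for (String p : byLength) {
                // keep at least one letter as a candidate root
                if (w.length() > p.length() && w.startsWith(p)) {
                    found.add(p);
                    w = w.substring(p.length());
                    matched = true;
                    break;
                }
            }
        }
        rest.append(w); // remainder = root plus any suffixes
        return found;
    }

    public static void main(String[] args) {
        StringBuilder rest = new StringBuilder();
        System.out.println(stripPrefixes("overunfortable", rest) + " | " + rest);
    }
}
```

For the invented word "overunfortable" the sketch peels off "over" and then "un", leaving "fortable" as the remainder; suffix stripping would proceed analogously from the other end of the word.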
\ No newline at end of file diff --git a/Morphilo_doc/_build/html/_sources/source/datamodel.rst.txt b/Morphilo_doc/_build/html/_sources/source/datamodel.rst.txt index 2d0aef4570bc6a16acf9acb185019c0b63dadaa2..f206ef3ffb8967f600e499b5b76eee996ac7de31 100644 --- a/Morphilo_doc/_build/html/_sources/source/datamodel.rst.txt +++ b/Morphilo_doc/_build/html/_sources/source/datamodel.rst.txt @@ -1,5 +1,60 @@ -Data Model Implementation -========================= +Data Model +==========
+
+Conceptualization
+-----------------
+
+From both the user and the task requirements one can derive that four basic
+functions of data processing need to be carried out. Data have to be read, persistently
+saved, searched, and deleted. Furthermore, some kind of user management
+and multi-user processing is necessary. In addition, the framework should
+support web technologies, be well documented, and be easy to extend. Ideally, the
+MVC pattern is realized.
+
+\subsection{Data Model}\label{subsec:datamodel}
+The guidelines of the
+\emph{TEI}-standard\footnote{http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf} on the
+word level are defined in line with the structure described above in section \ref{subsec:morphologicalSystems}.
+In listing \ref{lst:teiExamp} an
+example is given for a possible markup at the word level for
+\emph{comfortable}.\footnote{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-m.html}
+
+\begin{lstlisting}[language=XML,
+caption={TEI-example for 'comfortable'},label=lst:teiExamp]
+<w type="adjective">
+ <m type="base">
+  <m type="prefix" baseForm="con">com</m>
+  <m type="root">fort</m>
+ </m>
+ <m type="suffix">able</m>
+</w>
+\end{lstlisting}
+
+This data model reflects just one theoretical conception of word structure.
+Crucially, the model emanates from the assumption
+that the suffix node is on a par with the word base. On the one hand, this
+implies that the word stem directly dominates the suffix, but not the prefix.
The prefix, on the
+other hand, is enclosed in the base, which basically means a stronger lexical,
+and less abstract, attachment to the root of a word. Modeling prefixes and suffixes on different
+hierarchical levels has important consequences for the branching direction at
+the subword level (here right-branching). Leaving the theoretical interest aside, the
+choice of the TEI standard is reasonable with a view to a sustainable architecture that allows for
+exchanging data with little to no additional adjustment.
+
+A drawback is that the model is not suitable for all languages.
+It reflects a theoretical construction based on Indo-European
+languages. As long as attention is paid to the languages for which this software is used, this will
+not be problematic. That is the case for most languages of the Indo-European
+family, which corresponds to the overwhelming majority of all research carried out
+(unfortunately).
+
+Implementation
+--------------
+
+As laid out in the task analysis in section \ref{subsec:datamodel}, it is
+advantageous to use established standards. It was also shown that it makes sense
+to keep the meta data of each corpus separate from the data model used for the
+words to be analyzed. For the present case, the TEI-standard was identified as an appropriate markup for words. In terms of the implementation this means that @@ -26,3 +81,161 @@ Whereas attributes of the objecttype are specific to the repository framework, t recognized in the hierarchy of the meta data element starting with the name \emph{w} (line \ref{src:wordbegin}).
+\begin{lstlisting}[language=XML,caption={Word Data model},label=lst:worddatamodel,escapechar=|]
+<?xml version="1.0" encoding="UTF-8"?>
+<objecttype
+ name="morphilo"
+ isChild="true"
+ isParent="true"
+ hasDerivates="true"
+ xmlns:xs="http://www.w3.org/2001/XMLSchema"
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xsi:noNamespaceSchemaLocation="datamodel.xsd">
+ <metadata>
+  <element name="morphiloContainer" type="xml" style="dontknow"
+   notinherit="true" heritable="false">
+   <xs:sequence>
+    <xs:element name="morphilo">
+     <xs:complexType>
+      <xs:sequence>
+       <xs:element name="w" minOccurs="0" maxOccurs="unbounded">|\label{src:wordbegin}|
+        <xs:complexType mixed="true">
+         <xs:sequence>
+          <!-- stem -->
+          <xs:element name="m1" minOccurs="0" maxOccurs="unbounded">
+           <xs:complexType mixed="true">
+            <xs:sequence>
+             <!-- base -->
+             <xs:element name="m2" minOccurs="0" maxOccurs="unbounded">
+              <xs:complexType mixed="true">
+               <xs:sequence>
+                <!-- root -->
+                <xs:element name="m3" minOccurs="0" maxOccurs="unbounded">
+                 <xs:complexType mixed="true">
+                  <xs:attribute name="type" type="xs:string"/>
+                 </xs:complexType>
+                </xs:element>
+                <!-- prefix -->
+                <xs:element name="m4" minOccurs="0" maxOccurs="unbounded">
+                 <xs:complexType mixed="true">
+                  <xs:attribute name="type" type="xs:string"/>
+                  <xs:attribute name="PrefixbaseForm" type="xs:string"/>
+                  <xs:attribute name="position" type="xs:string"/>
+                 </xs:complexType>
+                </xs:element>
+               </xs:sequence>
+               <xs:attribute name="type" type="xs:string"/>
+              </xs:complexType>
+             </xs:element>
+             <!-- suffix -->
+             <xs:element name="m5" minOccurs="0" maxOccurs="unbounded">
+              <xs:complexType mixed="true">
+               <xs:attribute name="type" type="xs:string"/>
+               <xs:attribute name="SuffixbaseForm" type="xs:string"/>
+               <xs:attribute name="position" type="xs:string"/>
+               <xs:attribute name="inflection" type="xs:string"/>
+              </xs:complexType>
+             </xs:element>
+            </xs:sequence>
+            <!-- stem attributes -->
+            <xs:attribute name="type" type="xs:string"/>
<xs:attribute name="pos" type="xs:string"/>
+            <xs:attribute name="occurrence" type="xs:string"/>
+           </xs:complexType>
+          </xs:element>
+         </xs:sequence>
+         <!-- w attributes at word level -->
+         <xs:attribute name="lemma" type="xs:string"/>
+         <xs:attribute name="complexType" type="xs:string"/>
+         <xs:attribute name="wordtype" type="xs:string"/>
+         <xs:attribute name="occurrence" type="xs:string"/>
+         <xs:attribute name="corpus" type="xs:string"/>
+         <xs:attribute name="begin" type="xs:string"/>
+         <xs:attribute name="end" type="xs:string"/>
+        </xs:complexType>
+       </xs:element>
+      </xs:sequence>
+     </xs:complexType>
+    </xs:element>
+   </xs:sequence>
+  </element>
+  <element name="wordtype" type="classification" minOccurs="0" maxOccurs="1">
+   <classification id="wordtype"/>
+  </element>
+  <element name="complexType" type="classification" minOccurs="0" maxOccurs="1">
+   <classification id="complexType"/>
+  </element>
+  <element name="corpus" type="classification" minOccurs="0" maxOccurs="1">
+   <classification id="corpus"/>
+  </element>
+  <element name="pos" type="classification" minOccurs="0" maxOccurs="1">
+   <classification id="pos"/>
+  </element>
+  <element name="PrefixbaseForm" type="classification" minOccurs="0"
+   maxOccurs="1">
+   <classification id="PrefixbaseForm"/>
+  </element>
+  <element name="SuffixbaseForm" type="classification" minOccurs="0"
+   maxOccurs="1">
+   <classification id="SuffixbaseForm"/>
+  </element>
+  <element name="inflection" type="classification" minOccurs="0" maxOccurs="1">
+   <classification id="inflection"/>
+  </element>
+  <element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded" >
+   <target type="corpmeta"/>
+  </element>
+ </metadata>
+</objecttype>
+\end{lstlisting}
+
+Additionally, it is worth mentioning that some attributes are modeled as a
+\emph{classification}. All these have to be listed
+as separate elements in the data model. This has been done for all attributes
+that are subject to little or no change.
In fact, all suffix
+and prefix morphemes of the language investigated should be known and are
+therefore defined as a classification.
+The same is true for the parts of speech, named \emph{pos} in the morphilo data
+model above.
+Here the Penn Treebank tagset was used. Last, the different morphemic layers in
+the standard model, named \emph{m}, are changed to $m1$ through $m5$. This is the
+only change to the standard that could be problematic if the data is to be
+processed elsewhere and the change is not documented more explicitly. Yet, this
+change was necessary because the MyCoRe repository throws errors caused by ambiguity
+issues on the different $m$-layers.
+
+The second data model describes only very few properties of the text corpora
+from which the words are extracted. Listing \ref{lst:corpusdatamodel} depicts
+only the meta data element. For the sake of simplicity of the prototype, this
+data model is kept as simple as possible. The only obligatory field is the name of
+the corpus. Specific dates of the corpus are classified as optional because in
+some cases a text cannot be dated reliably.
+ + +\begin{lstlisting}[language=XML,caption={Corpus Data +Model},label=lst:corpusdatamodel] +<metadata> + <!-- Pflichtfelder --> + <element name="korpusname" type="text" minOccurs="1" maxOccurs="1"/> + <!-- Optionale Felder --> + <element name="sprache" type="text" minOccurs="0" maxOccurs="1"/> + <element name="size" type="number" minOccurs="0" maxOccurs="1"/> + <element name="datefrom" type="text" minOccurs="0" maxOccurs="1"/> + <element name="dateuntil" type="text" minOccurs="0" maxOccurs="1"/> + <!-- number of words --> + <element name="NoW" type="text" minOccurs="0" maxOccurs="1"/> + <element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded"> + <target type="morphilo"/> + </element> +</metadata> +\end{lstlisting} + +As a final remark, one might have noticed that all attributes are modelled as +strings, although other data types are available and the fields encoding dates or +the number of words suggest otherwise. The MyCoRe framework even provides a +data type \emph{historydate}. There is no entirely satisfying answer for its +disuse. All that can be said is that using data types other than string +later leads to problems in the interplay between the search engine and the +repository framework. These issues seem to be well known and can be followed on +GitHub.
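To make the pitfall concrete, here is a minimal Python sketch (not MyCoRe code; the year values are invented for illustration) of why string-typed fields such as \emph{datefrom} and \emph{dateuntil} are awkward for sorting and range queries:

```python
# Dates kept as plain strings, as in the 'datefrom'/'dateuntil' fields
# of the corpus data model, compare lexicographically, not numerically.
dates = ["950", "1200", "1066"]

# Lexicographic order puts "950" last, because '9' sorts after '1':
assert sorted(dates) == ["1066", "1200", "950"]

# Chronological order needs an explicit conversion step:
assert sorted(dates, key=int) == ["950", "1066", "1200"]
```

A search engine indexing such fields as strings faces the same ordering problem, which is presumably related to the type-mismatch issues between index and repository mentioned above.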
\ No newline at end of file diff --git a/Morphilo_doc/_build/html/_sources/source/framework.rst.txt b/Morphilo_doc/_build/html/_sources/source/framework.rst.txt new file mode 100644 index 0000000000000000000000000000000000000000..1b9925de0b23715d8a34406bc9994664224637f3 --- /dev/null +++ b/Morphilo_doc/_build/html/_sources/source/framework.rst.txt @@ -0,0 +1,27 @@ +Framework +========= + +\begin{figure} + \centering + \includegraphics[scale=0.33]{mycore_architecture-2.png} + \caption[MyCoRe-Architecture and Components]{MyCoRe-Architecture and Components\protect\footnotemark} + \label{fig:abbMyCoReStruktur} +\end{figure} +\footnotetext{source: https://www.mycore.de} +To adapt the MyCoRe framework, the Morphilo application logic has to be implemented, +the TEI data model specified, and the input, search, and output masks programmed. + +Three directories are +important for adjusting the MyCoRe framework to the needs of one's own application. They +correspond essentially to the three components of the MVC model as explicated in +section \ref{subsec:mvc}. Roughly, they are visualized in figure \ref{fig:abbMyCoReStruktur} in the upper +right-hand corner. More precisely, the view (\emph{Layout} in figure \ref{fig:abbMyCoReStruktur}) and the model layer +(\emph{Datenmodell} in figure \ref{fig:abbMyCoReStruktur}) can be configured +completely via the ``interface'', which is a directory with a predefined +structure and some standard files. For the configuration of the logic, an extra directory is provided (/src/main/java/custom/mycore/addons/). Here, all Java classes +extending the controller layer should be added. +Practically, all three MVC layers are placed in the +\emph{src/main/}-directory of the application. In one of its subdirectories, +\emph{datamodel/def}, the data model specifications are defined as XML files; this parallels the model +layer in the MVC pattern.
How the data model was defined will be explained in +section \ref{subsec:datamodelimpl}. \ No newline at end of file diff --git a/Morphilo_doc/_build/html/_sources/source/view.rst.txt b/Morphilo_doc/_build/html/_sources/source/view.rst.txt new file mode 100644 index 0000000000000000000000000000000000000000..5f09e06bd9d7a0d9c1edd889d8ac44d3cb36757f --- /dev/null +++ b/Morphilo_doc/_build/html/_sources/source/view.rst.txt @@ -0,0 +1,247 @@ +View +==== + +Conceptualization +----------------- + +Lastly, the third directory (\emph{src/main/resources}) contains all code needed +for rendering the data on the screen; this corresponds to +the view in the MVC approach. Rendering is done by XSL files that (unfortunately) +also contain some logic that really belongs to the controller. Thus, the division is +not as clear-cut as the theory implies. I will discuss this issue more specifically in the +relevant subsection below. Among the resources are also all images, styles, and +javascripts. + +Implementation +-------------- + +As explained in section \ref{subsec:mvc}, the view component handles the visual +representation in the form of an interface that allows interaction between +the user and the task to be carried out by the machine. Since the present case is a +web service, all interaction happens via a browser, i.e. webpages are +rendered and responses are recognized by registering mouse and keyboard +events. More specifically, a webpage is produced by transforming XML documents +into HTML pages. The MyCoRe repository framework uses Xalan, an open-source XSLT +processor from Apache.\footnote{http://xalan.apache.org} This engine +transforms document nodes, addressed via XPath syntax, into hypertext by means of +template matching. All templates are collected in so-called stylesheets, which are +themselves XML-encoded. Since there are two data models with two +different structures, it is good practice to define two stylesheet files, one for +each data model.
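The matching-and-extraction principle can also be illustrated outside XSLT. The following Python sketch mimics it with the standard library's limited XPath support; the document structure is a simplified stand-in loosely modelled on the morphilo data, not the actual data model:

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for a word document (invented for illustration).
doc = """<mycoreobject ID="morphilo_00001">
  <metadata>
    <morphilo><w lemma="over">ouer</w></morphilo>
  </metadata>
</mycoreobject>"""

root = ET.fromstring(doc)

def render(obj):
    # Crude analogue of the stylesheet's template matching: act only on
    # documents whose ID marks them as morphilo objects ...
    if "morphilo" not in obj.get("ID", ""):
        return ""
    # ... then select the word node by path (XSLT would use XPath here)
    # and emit hypertext.
    w = obj.find("./metadata/morphilo/w")
    return "<h1>{0}</h1><dl><dt>Word</dt><dd>{0}</dd></dl>".format(w.text)

assert render(root) == "<h1>ouer</h1><dl><dt>Word</dt><dd>ouer</dd></dl>"
```

An XSLT template achieves the same effect declaratively: the match pattern replaces the explicit ID check, and the select expression replaces the find call.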
+ +As a demonstration, in listing \ref{lst:morphilostylesheet} below a short +extract is given for rendering the word data. + +\begin{lstlisting}[language=XML,caption={stylesheet +morphilo.xsl},label=lst:morphilostylesheet] +<?xml version="1.0" encoding="UTF-8"?> +<xsl:stylesheet + xmlns:xsl="http://www.w3.org/1999/XSL/Transform" + xmlns:xalan="http://xml.apache.org/xalan" + xmlns:i18n="xalan://org.mycore.services.i18n.MCRTranslation" + xmlns:acl="xalan://org.mycore.access.MCRAccessManager" + xmlns:mcr="http://www.mycore.org/" xmlns:xlink="http://www.w3.org/1999/xlink" + xmlns:mods="http://www.loc.gov/mods/v3" + xmlns:encoder="xalan://java.net.URLEncoder" + xmlns:mcrxsl="xalan://org.mycore.common.xml.MCRXMLFunctions" + xmlns:mcrurn="xalan://org.mycore.urn.MCRXMLFunctions" + exclude-result-prefixes="xalan xlink mcr i18n acl mods mcrxsl mcrurn encoder" + version="1.0"> + <xsl:param name="MCR.Users.Superuser.UserName"/> + + <xsl:template match="/mycoreobject[contains(@ID,'_morphilo_')]"> + <head> + <link href="{$WebApplicationBaseURL}css/file.css" rel="stylesheet"/> + </head> + <div class="row"> + <xsl:call-template name="objectAction"> + <xsl:with-param name="id" select="@ID"/> + <xsl:with-param name="deriv" select="structure/derobjects/derobject/@xlink:href"/> + </xsl:call-template> + <xsl:variable name="objID" select="@ID"/> + <!-- Hier Ueberschrift setzen --> + <h1 style="text-indent: 4em;"> + <xsl:if test="metadata/def.morphiloContainer/morphiloContainer/morphilo/w"> + <xsl:value-of select="metadata/def.morphiloContainer/morphiloContainer/morphilo/w/text()[string-length(normalize-space(.))>0]"/> + </xsl:if> + </h1> + <dl class="dl-horizontal"> + <!-- (1) Display word --> + <xsl:if test="metadata/def.morphiloContainer/morphiloContainer/morphilo/w"> + <dt> + <xsl:value-of select="i18n:translate('response.page.label.word')"/> + </dt> + <dd> + <xsl:value-of 
select="metadata/def.morphiloContainer/morphiloContainer/morphilo/w/text()[string-length(normalize-space(.))>0]"/> + </dd> + </xsl:if> + <!-- (2) Display lemma --> + ... + </xsl:template> + ... + <xsl:template name="objectAction"> + ... + </xsl:template> +... +</xsl:stylesheet> +\end{lstlisting} +This template matches +the root node of each \emph{MyCoRe object}, ensuring that a valid MyCoRe model is +used and checking that the document to be processed contains a unique +identifier, here a \emph{MyCoRe-ID}, and the name of the correct data model, +here \emph{morphilo}. +Then another template, \emph{objectAction}, is called with two parameters: the ids +of the document object and of the attached files. In the remainder, all relevant +information from the document, such as the word and the lemma, is accessed via XPath, +enriched with hypertext annotations, and rendered as a hypertext document. +The template \emph{objectAction} is key to understanding the coupling process in the software +framework. It is therefore listed separately in \ref{lst:objActionTempl}.
+ +\begin{lstlisting}[language=XML,caption={template +objectAction},label=lst:objActionTempl,escapechar=|] +<xsl:template name="objectAction"> + <xsl:param name="id" select="./@ID"/> + <xsl:param name="accessedit" select="acl:checkPermission($id,'writedb')"/> + <xsl:param name="accessdelete" select="acl:checkPermission($id,'deletedb')"/> + <xsl:variable name="derivCorp" select="./@label"/> + <xsl:variable name="corpID" select="metadata/def.corpuslink[@class='MCRMetaLinkID']/corpuslink/@xlink:href"/> + <xsl:if test="$accessedit or $accessdelete">|\label{ln:ng}| + <div class="dropdown pull-right"> + <xsl:if test="string-length($corpID) > 0 or $CurrentUser='administrator'"> + <button class="btn btn-default dropdown-toggle" style="margin:10px" type="button" id="dropdownMenu1" data-toggle="dropdown" aria-expanded="true"> + <span class="glyphicon glyphicon-cog" aria-hidden="true"></span> Annotieren + <span class="caret"></span> + </button> + </xsl:if> + <xsl:if test="string-length($corpID) > 0">|\label{ln:ru}| + <xsl:variable name="ifsDirectory" select="document(concat('ifs:/',$derivCorp))"/> + <ul class="dropdown-menu" role="menu" aria-labelledby="dropdownMenu1"> + <li role="presentation"> + |\label{ln:nw1}|<a href="{$ServletsBaseURL}object/tag{$HttpSession}?id={$derivCorp}&objID={$corpID}" role="menuitem" tabindex="-1">|\label{ln:nw2}| + <xsl:value-of select="i18n:translate('object.nextObject')"/> + </a> + </li> + <li role="presentation"> + <a href="{$WebApplicationBaseURL}receive/{$corpID}" role="menuitem" tabindex="-1"> + <xsl:value-of select="i18n:translate('object.backToProject')"/> + </a> + </li> + </ul> + </xsl:if> + <xsl:if test="$CurrentUser='administrator'"> + <ul class="dropdown-menu" role="menu" aria-labelledby="dropdownMenu1"> + <li role="presentation"> + <a role="menuitem" tabindex="-1" href="{$WebApplicationBaseURL}content/publish/morphilo.xed?id={$id}"> + <xsl:value-of select="i18n:translate('object.editWord')"/> + </a> + </li> + <li role="presentation"> 
+ <a href="{$ServletsBaseURL}object/delete{$HttpSession}?id={$id}" role="menuitem" tabindex="-1" class="confirm_deletion option" data-text="Wirklich loeschen"> + <xsl:value-of select="i18n:translate('object.delWord')"/> + </a> + </li> + </ul> + </xsl:if> + </div> + <div class="row" style="margin-left:0px; margin-right:10px"> + <xsl:apply-templates select="structure/derobjects/derobject[acl:checkPermission(@xlink:href,'read')]"> + <xsl:with-param name="objID" select="@ID"/> + </xsl:apply-templates> + </div> + </xsl:if> +</xsl:template> +\end{lstlisting} +The \emph{objectAction} template defines the selection menu that appears -- once manual tagging has +started -- on the upper right-hand side of the webpage, entitled +\emph{Annotieren} ('annotate') and displaying the two options \emph{next word} and \emph{back +to project}. +The first thing to note here is that in line \ref{ln:ng} a simple test +excludes all guest users from the procedure. After ensuring that only +the user who owns the corpus project has access (line \ref{ln:ru}), that user is presented +the drop-down menu, whose entries are really URLs, e.g. line +\ref{ln:nw1}. The attentive reader might have noticed that +this URL exactly matches the definition in web-fragment.xml as shown in +listing \ref{lst:webfragment}, line \ref{ln:tag}, where it resolves to the +respective Java class. In effect, this mechanism is the data interface within the +MVC pattern. The URL also carries two variables, named \emph{derivCorp} and +\emph{corpID}, which the Java classes need to identify the corpus and file +objects (see section \ref{sec:javacode}). + +The morphilo.xsl stylesheet contains yet another modification that deserves mention. +In listing \ref{lst:derobjectTempl}, line \ref{ln:morphMenu}, two menu options -- +\emph{Tag automatically} and \emph{Tag manually} -- are defined.
The former option +initiates ProcessCorpusServlet.java, as can be seen again in listing \ref{lst:webfragment}, +line \ref{ln:process}, which determines the words that are not yet in the master database. +Still, it is important to note that this menu option is only displayed if two conditions +are met. First, a file has to be uploaded (line \ref{ln:1test}) and, second, there must be +only one file. This is necessary because the annotation process generates further files: +one storing the words that have not yet been processed, and one containing the final result. The +generated files follow a fixed naming pattern. The file harboring the final, fully TEI-annotated +corpus is prefixed with \emph{tagged}; the other file is prefixed with \emph{untagged}. This convention +is exploited to control the second option (line \ref{ln:loop}). A loop runs through all +files in the respective directory and, if a file name starts with \emph{untagged}, +the option to tag manually is displayed. + +\begin{lstlisting}[language=XML,caption={template +matching derobject},label=lst:derobjectTempl,escapechar=|] +<xsl:template match="derobject" mode="derivateActions"> + <xsl:param name="deriv" /> + <xsl:param name="parentObjID" /> + <xsl:param name="suffix" select="''" /> + <xsl:param name="id" select="../../../@ID" /> + <xsl:if test="acl:checkPermission($deriv,'writedb')"> + <xsl:variable name="ifsDirectory" select="document(concat('ifs:',$deriv,'/'))" /> + <xsl:variable name="path" select="$ifsDirectory/mcr_directory/path" /> + ...
+ <div class="options pull-right"> + <div class="btn-group" style="margin:10px"> + <a href="#" class="btn btn-default dropdown-toggle" data-toggle="dropdown"> + <i class="fa fa-cog"></i> + <xsl:value-of select="' Korpus'"/> + <span class="caret"></span> + </a> + <ul class="dropdown-menu dropdown-menu-right"> + <!-- Anpasssungen Morphilo -->|\label{ln:morphMenu}| + <xsl:if test="string-length($deriv) > 0">|\label{ln:1test}| + <xsl:if test="count($ifsDirectory/mcr_directory/children/child) = 1">|\label{ln:2test}| + <li role="presentation"> + <a href="{$ServletsBaseURL}object/process{$HttpSession}?id={$deriv}&objID={$id}" role="menuitem" tabindex="-1"> + <xsl:value-of select="i18n:translate('derivate.process')"/> + </a> + </li> + </xsl:if> + <xsl:for-each select="$ifsDirectory/mcr_directory/children/child">|\label{ln:loop}| + <xsl:variable name="untagged" select="concat($path, 'untagged')"/> + <xsl:variable name="filename" select="concat($path,./name)"/> + <xsl:if test="starts-with($filename, $untagged)"> + <li role="presentation"> + <a href="{$ServletsBaseURL}object/tag{$HttpSession}?id={$deriv}&objID={$id}" role="menuitem" tabindex="-1"> + <xsl:value-of select="i18n:translate('derivate.taggen')"/> + </a> + </li> + </xsl:if> + </xsl:for-each> + </xsl:if> + ... + </ul> + </div> + </div> + </xsl:if> +</xsl:template> +\end{lstlisting} + +Besides the two stylesheets morphilo.xsl and corpmeta.xsl, other stylesheets have +to be adjusted. They will not be discussed in detail here since they are for the most part self-explanatory. +Essentially, they render the overall layout (\emph{common-layout.xsl}, \emph{skeleton\_layout\_template.xsl}), the presentation +of the search results (\emph{response-page.xsl}), and the definitions of the Solr search fields (\emph{searchfields-solr.xsl}). +The latter two also inherit templates from \emph{response-general.xsl} and \emph{response-browse.xsl}, in which the +navigation bar of the search results can be changed.
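The file-name convention that drives this menu logic can be summarized in a small Python sketch (the file names are hypothetical, for illustration only):

```python
# Convention from the text: 'tagged' marks the final TEI-annotated corpus,
# 'untagged' marks the words still awaiting manual annotation.
files = ["tagged_corpus1.xml", "untagged_corpus1.xml", "corpus1.xml"]

# The manual-tagging option appears only if an 'untagged' file exists,
# mirroring the starts-with() test inside the for-each loop:
needs_manual_tagging = [f for f in files if f.startswith("untagged")]
assert needs_manual_tagging == ["untagged_corpus1.xml"]

# The automatic-tagging option is offered only while the derivate holds
# exactly one file, i.e. the freshly uploaded corpus:
offer_auto_tagging = len(files) == 1
assert offer_auto_tagging is False
```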
For multilingual support, a separate configuration directory +has to be created, containing one \emph{.properties} file per +language to be displayed. In the current case these are restricted to German and English (\emph{messages\_de.properties} and \emph{messages\_en.properties}). +The property files include all \emph{i18n} definitions. All these files are located in the \emph{resources} directory. + +Furthermore, a search mask and a page for manually entering the annotations had +to be designed. +For these files, the use of \emph{xed}, an XML format specifically designed for the +repository framework, is recommended. \ No newline at end of file diff --git a/Morphilo_doc/_build/html/index.html b/Morphilo_doc/_build/html/index.html index 0310738f8a97e335f1a78fbcbeab33a52ecaca9a..91c11e71b709e6623eea1130842d039777641d21 100644 --- a/Morphilo_doc/_build/html/index.html +++ b/Morphilo_doc/_build/html/index.html @@ -6,7 +6,7 @@ <head> <meta http-equiv="X-UA-Compatible" content="IE=Edge" /> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> - <title>Morphilo documentation</title> + <title>Documentation Morphilo Project — Morphilo documentation</title> <link rel="stylesheet" href="_static/alabaster.css" type="text/css" /> <link rel="stylesheet" href="_static/pygments.css" type="text/css" /> <script type="text/javascript" src="_static/documentation_options.js"></script> @@ -15,7 +15,7 @@ <script type="text/javascript" src="_static/doctools.js"></script> <link rel="index" title="Index" href="genindex.html" /> <link rel="search" title="Search" href="search.html" /> - <link rel="next" title="Data Model Implementation" href="source/datamodel.html" /> + <link rel="next" title="Data Model" href="source/datamodel.html" /> <link rel="stylesheet" href="_static/custom.css" type="text/css" /> @@ -30,16 +30,32 @@ <div class="bodywrapper"> <div class="body" role="main"> - <div class="section" id="welcome-to-morphilo-s-documentation">
-<h1>Documentation Morphilo Project<a class="headerlink" href="#welcome-to-morphilo-s-documentation" title="Permalink to this headline">¶</a></h1> + <div class="section" id="documentation-morphilo-project"> +<h1>Documentation Morphilo Project<a class="headerlink" href="#documentation-morphilo-project" title="Permalink to this headline">¶</a></h1> <div class="toctree-wrapper compound"> <p class="caption"><span class="caption-text">Contents:</span></p> <ul> -<li class="toctree-l1"><a class="reference internal" href="source/datamodel.html">Data Model Implementation</a></li> +<li class="toctree-l1"><a class="reference internal" href="source/datamodel.html">Data Model</a><ul> +<li class="toctree-l2"><a class="reference internal" href="source/datamodel.html#conceptualization">Conceptualization</a></li> +<li class="toctree-l2"><a class="reference internal" href="source/datamodel.html#implementation">Implementation</a></li> +</ul> +</li> <li class="toctree-l1"><a class="reference internal" href="source/controller.html">Controller Adjustments</a><ul> <li class="toctree-l2"><a class="reference internal" href="source/controller.html#general-principle-of-operation">General Principle of Operation</a></li> +<li class="toctree-l2"><a class="reference internal" href="source/controller.html#conceptualization">Conceptualization</a></li> +<li class="toctree-l2"><a class="reference internal" href="source/controller.html#implementation">Implementation</a><ul> +<li class="toctree-l3"><a class="reference internal" href="source/controller.html#id13">}</a></li> +</ul> +</li> +</ul> +</li> +<li class="toctree-l1"><a class="reference internal" href="source/view.html">View</a><ul> +<li class="toctree-l2"><a class="reference internal" href="source/view.html#conceptualization">Conceptualization</a></li> +<li class="toctree-l2"><a class="reference internal" href="source/view.html#implementation">Implementation</a></li> </ul> </li> +<li class="toctree-l1"><a class="reference internal" 
href="source/architecture.html">Software Design</a></li> +<li class="toctree-l1"><a class="reference internal" href="source/framework.html">Framework</a></li> </ul> </div> </div> @@ -67,7 +83,7 @@ <h3>Related Topics</h3> <ul> <li><a href="#">Documentation overview</a><ul> - <li>Next: <a href="source/datamodel.html" title="next chapter">Data Model Implementation</a></li> + <li>Next: <a href="source/datamodel.html" title="next chapter">Data Model</a></li> </ul></li> </ul> </div> diff --git a/Morphilo_doc/_build/html/objects.inv b/Morphilo_doc/_build/html/objects.inv index a3bd984f43ed29ed035122249fe1f8fa8cc43c7d..ac6220ca7ddb74e3021a0040cda45c6da6630b09 100644 Binary files a/Morphilo_doc/_build/html/objects.inv and b/Morphilo_doc/_build/html/objects.inv differ diff --git a/Morphilo_doc/_build/html/searchindex.js b/Morphilo_doc/_build/html/searchindex.js index d50f9f39f2083c382a9ad2d95c00bd2da6af84e6..c36d4c6073f0027ee03fec4a2a93fdb62ce9ab1d 100644 --- a/Morphilo_doc/_build/html/searchindex.js +++ b/Morphilo_doc/_build/html/searchindex.js @@ -1 +1 @@ 
-Search.setIndex({docnames:["index","source/controller","source/datamodel"],envversion:52,filenames:["index.rst","source/controller.rst","source/datamodel.rst"],objects:{},objnames:{},objtypes:{},terms:{"case":2,For:2,The:2,accord:2,added:2,addit:2,adjust:0,affix:2,allow:2,also:2,applic:2,appropri:2,attribut:2,begin:2,best:2,both:2,can:2,chosen:2,compat:2,complet:2,compris:2,content:0,control:0,corpu:2,data:0,date:2,deriv:2,develop:2,diachron:2,dimens:2,effici:2,element:2,emph:2,end:2,extract:2,file:2,framework:2,futur:2,gener:0,given:2,guidelin:2,hand:2,have:2,hierarchi:2,histor:2,howev:2,identifi:2,ignor:2,implement:0,includ:2,index:0,inform:2,interact:2,line:2,link:2,list:2,lst:2,markup:2,mean:2,meet:2,meta:2,model:0,modul:[0,2],name:2,need:2,object:2,objecttyp:2,occurr:2,one:2,oper:0,other:2,page:0,phonolog:2,posit:2,potenti:2,predict:2,present:2,principl:0,process:2,product:2,quantif:2,question:2,reason:2,recogn:2,ref:2,regard:2,relev:2,repositori:2,requir:2,research:2,respect:2,result:2,same:2,search:0,solut:2,some:2,specif:2,src:2,standard:2,start:2,structur:2,subsequ:2,syntact:2,tei:2,term:2,thei:2,thereof:2,thi:2,time:2,type:2,use:2,wealth:2,were:2,wherea:2,word:2,wordbegin:2,worddatamodel:2,xml:2},titles:["Welcome to Morphilo\u2019s documentation!","Controller Adjustments","Data Model Implementation"],titleterms:{adjust:1,control:1,data:2,document:0,gener:1,implement:2,indic:0,model:2,morphilo:0,oper:1,principl:1,tabl:0,welcom:0}}) \ No newline at end of file 
+Search.setIndex({docnames:["index","source/architecture","source/controller","source/datamodel","source/framework","source/view"],envversion:52,filenames:["index.rst","source/architecture.rst","source/controller.rst","source/datamodel.rst","source/framework.rst","source/view.rst"],objects:{},objnames:{},objtypes:{},terms:{"0px":5,"10px":5,"1test":5,"2test":5,"4em":5,"abstract":3,"boolean":2,"case":[1,2,3,5],"catch":2,"class":[2,4,5],"default":[2,5],"enum":2,"final":[2,3,5],"function":[2,3],"import":[1,2,3,4,5],"int":2,"new":2,"null":2,"public":2,"return":2,"short":5,"throw":[2,3],"true":[1,2,3,5],"try":2,"void":2,"while":[1,2],AND:2,And:[1,2],For:[2,3,4,5],That:2,The:[1,2,3,5],Then:[2,5],There:[2,3,4],These:[2,3,4],Using:2,With:2,_morphilo_:5,abbmycorestruktur:4,abbruchbedingung:2,abl:[3,5],about:2,abov:[2,3],accept:2,access:[1,2,5],accessdelet:5,accessedit:5,accord:[2,3],accordingli:2,account:3,acl:5,actual:2,add:2,added:[2,3,4],adding:2,addit:[2,3],addition:3,addon:[2,4],address:2,addtomorphilodb:2,adject:3,adjust:[0,3,4,5],administ:1,administr:[2,5],admit:2,adopt:2,advantag:3,affix:[2,3],affixpo:2,affixstripp:2,affixtyp:2,after:[1,2,5],again:[2,5],agre:2,aim:2,al1:2,al2:2,algorithm:2,all:[1,2,3,4,5],allomorph:2,allow:[1,2,3,5],alon:2,alpha:1,alreadi:2,also:[1,2,3,5],altern:1,although:[2,3],alwai:2,ambigu:3,among:5,amp:5,analys:2,analysi:3,analyz:[2,3],analyzeinfl:2,analyzeinflect:2,analyzelemma:2,analyzepref:2,analyzeprefix:2,analyzeroot:2,analyzesuf:2,analyzesuffix:2,analyzeword:2,ani:[1,2],annot:[1,2,5],annotieren:5,anoth:[2,5],anpasssungen:5,answer:3,anzword:2,apach:[2,5],api:2,appear:[2,5],appli:[2,5],applic:[2,3,4],approach:[1,2,5],appropri:3,approxim:2,arbitrari:2,architect:1,architectur:[1,3,4],argument:2,aria:5,around:2,arrai:2,arraylist:2,asid:3,assert:2,assign:2,assum:[1,2],assumpt:[1,3],ation:2,attach:[3,5],attent:[3,5],attribut:[2,3],auf:3,aur:2,auslesen:2,authen:2,authentif:2,author:2,autom:1,automat:5,auver:2,avail:[1,2,3],avoid:2,awai:2,awkward:2,
awver:2,back:[2,5],backtoproject:5,bar:5,base:[1,2,3,5],baseform:3,basi:2,basic:[2,3],becaus:[1,2,3,5],becom:[1,2],been:[2,3],befor:2,beforehand:1,begin:[2,3,4,5],beginnig:2,begintim:2,behind:2,being:2,belong:5,below:[2,5],besid:5,best:3,between:[1,2,3,5],beyond:2,bird:2,bit:2,both:[1,2,3],branch:[1,3],brows:5,browser:5,btn:5,bugfixend:2,bugfixstart:2,build:2,builder:2,button:5,calcul:[1,2],call:[1,2,5],callkeymeth:2,calll:2,can:[1,2,3,4,5],candid:2,cannot:[2,3],caption:[2,3,4,5],care:2,caret:5,carri:[1,2,3,5],caus:[2,3],center:[2,4],certain:5,chang:[2,3,5],changeabl:2,chap:2,chapter:2,check:[2,5],checkand:2,checkpermiss:5,child:5,children:5,choic:3,chosen:3,circumst:5,claim:2,classdiag:2,classid:2,classif:3,classifi:3,clear:[2,5],clearer:1,click:2,client:2,cmp:2,code:[2,5],cog:5,collect:[2,5],com:[2,3],come:2,comfort:3,common:5,compar:[1,2],comparemcr:2,comparemcrobject:2,compareto:2,comparison:[1,2],compat:3,compil:2,complet:[1,2,3,4],complex:2,complextyp:3,complic:2,compon:[1,2,4,5],componentsprotectfootnotemark:4,compound:2,compris:[2,3],con:3,concat:5,concaten:2,concept:[2,3],conceptu:0,concern:2,conclud:2,concret:1,condit:2,configur:[2,4,5],confirm:2,confirm_delet:5,congruenc:1,consequ:3,consid:2,consist:2,construct:[2,3],constructor:2,constructqc:2,contain:[1,2,5],content:[0,2,5],continu:1,control:[0,1,4,5],converg:3,core:2,corner:4,corp1derivid:2,corp2derivid:2,corpdoc:2,corpid:[2,5],corpmeta:[2,3,5],corpo1:2,corpo2:2,corpobj:2,corpora:[1,2,3],corpu:[1,2,3,5],corpusdatamodel:[2,3],corpusdiff:2,corpuslink:[2,3,5],corpuss:2,correct:[1,2,5],correl:2,correspond:[2,3,4,5],could:[1,2,3],count:[1,2,5],counter:2,coupl:5,cours:[1,2],creat:[1,2,5],createdbi:2,createmorphiloobject:2,createxml:2,creator:2,crpid1:2,crpid2:2,crucial:[1,2,3],css:5,current:5,currentus:[2,5],custom:[2,4],cut:2,data:[0,2,4,5],databas:[1,2],datamodel:[2,3,4],datamodelimpl:4,dataset:[1,2],date:[2,3],datefrom:[2,3],datefromcorp:2,datenmodel:4,dateuntil:[2,3],dateuntilcorp:2,deal:2,decid:1,decis:
1,def:[2,4,5],defin:[1,2,3,4,5],definit:[2,5],degre:2,delet:[2,3,5],deleteaffix:2,deletedb:5,deliv:[1,2],delword:5,demonstr:[2,5],depend:2,depict:[2,3],depth:2,deriv:[1,2,3,5],derivateact:5,derivcorp:5,derivid:2,derobject:[2,5],derobjecttempl:5,describ:[1,2,3,5],deserv:5,design:[0,2,5],detail:[2,5],detaileddiff:2,determin:[1,2,5],develop:3,deviat:1,diachron:3,diagram:2,dictionari:2,diff:2,diffbegin:2,diffend:2,differ:[1,2,3,5],difflist:2,diffsum:2,dimens:3,direct:[2,3],directli:[1,3],directori:[2,4,5],discuss:[2,5],displai:[2,5],distribut:[1,2],disus:3,div:5,divis:5,doc:3,document:[1,2,3,5],doe:2,domin:3,done:[2,3,4,5],dontknow:3,doubl:2,down:[2,5],drop:5,dropdown:5,dropdownmenu1:5,due:2,each:[2,3,5],earlier:2,easi:3,easili:2,editword:5,effect:2,effici:[2,3],either:2,elem:2,element:[2,3],elig:3,elimin:2,elm:2,els:2,elsewher:3,eman:3,embed:2,embedaffix:2,embedded:2,emph:[1,2,3,4,5],empti:2,enabl:2,enclos:3,encod:[2,3,5],end:[2,3,4,5],endswith:2,endtim:2,engin:[2,3,5],english:[1,2,5],enough:1,enrich:5,ensur:5,ent:2,enter:5,entir:[2,5],entitl:5,entri:[1,2],entrynum:2,enumer:2,enumpref:2,envisu:4,equal:2,equalobject:2,equalocc2:2,equalocc:2,equival:2,err:2,error:[1,3],erzeugen:2,escapechar:[2,3,5],essenc:2,essenti:[4,5],establish:3,estim:1,etc:2,euer:2,european:3,evalu:2,evaluatefirst:2,even:3,event:5,everi:2,exact:[1,2],exactli:[1,2,5],exampl:[1,2,3],except:2,excerpt:2,exchang:[2,3],exclud:5,exist:2,expand:5,explain:[2,4,5],explan:2,explanatori:5,explic:4,explicit:2,explicitli:[2,3],exploit:5,exponenti:2,extend:[2,4],extent:3,extra:[1,2,4],extract:[2,3,5],extrem:2,eye:2,fact:[1,2,3],factori:2,fail:2,fall:2,fals:[2,3],felder:3,few:3,field:[2,3,5],fig:[1,2,4],figur:[1,2,4],file:[2,3,4,5],filenam:[2,5],filepathproc:2,filesaveend:2,filesavestart:2,fill:2,filter:2,find:2,first:[1,2,5],five:2,fix:2,flag:2,flow:1,focu:2,follow:[1,2,3,5],footnot:[2,3,5],footnotetext:4,form:[1,2,5],format:2,former:[2,5],fort:3,four:[2,3],fragment:[2,5],framework:[0,2,3,5],freeli:2,frequenc:[1,2
],frequent:2,from:[1,2,3,5],fuer:2,fulfil:2,full:2,fulli:[1,2],further:[1,2],furthermor:[3,5],futur:3,gauss:1,gegeben:2,gener:[0,5],german:5,get:2,getaffixposit:2,getalldiffer:2,getallequalmcrobject:2,getattributevalu:2,getbaseurl:2,getcont:2,getcontentfromfil:2,getcontrolnodedetail:2,getcorp:2,getcorpusmetadata:2,getcurrentsess:2,getderiv:2,getderivatefilepath:2,getequalnumb:2,getequalobjectnumb:2,getfieldvalu:2,getfilenam:2,getid:2,getinst:2,getmaindocnam:2,getmorphem:2,getnamespac:2,getnumberofword:2,getnumfound:2,getoccurrencesfromequaltext:2,getpath:2,getprettyformat:2,getresult:2,getsolrcli:2,getstemnumb:2,getter:2,gettestnodedetail:2,gettext:2,getunknownword:2,getunknowword:2,geturl:2,geturlparamet:2,getuserid:2,getuserinform:2,getvalu:2,getxmlfromobject:2,getxpathloc:2,gist:1,github:3,give:2,given:[1,2,3,5],gleich:2,glyphicon:5,good:5,gov:5,greater:2,group:5,grow:[1,2],guest:5,guidelin:3,had:5,hand:[1,2,3,4,5],handl:[2,5],happen:[2,5],harbor:[2,5],has:[1,2,3,5],hasderiv:3,hash:2,hashmap:2,hauver:2,have:[1,2,3,4,5],head:5,heavili:2,help:1,helpobj:2,henc:2,her:2,here:[1,2,3,4,5],herit:3,hidden:[1,5],hier:5,hierarch:3,hierarchi:[2,3],higher:2,his:2,histor:3,historyd:3,hit:2,horizont:5,houyr:2,hover:2,how:[1,2,4],howev:[1,2,3],href:[2,5],html:[3,5],http:[2,3,4,5],httpsession:5,hundr:2,hypertext:5,i18n:5,idea:2,ideal:3,ident:2,identifi:[2,3,5],ids:[2,5],ifs:5,ifsdirectori:5,ignor:[2,3],illustr:[1,2],imag:5,impact:1,implement:[0,4],impli:[2,3,5],implicitli:2,importantli:2,improv:2,includ:[2,3,5],includegraph:[2,4],increas:2,increment:2,incrocc:2,indent:5,independ:1,index:[0,2],indic:2,indo:3,infl:2,inflect:[2,3],inflectionenum:2,inflenum:2,inform:[1,3,5],ingredi:2,inherit:5,initi:[2,5],inner:2,input:[1,4],inspir:2,instanc:[2,3],instanti:2,instead:1,instruct:2,integ:2,integr:2,intend:2,interact:[2,3,5],interest:[2,3],interfac:[2,4,5],interrupt:1,investig:[2,3],involv:2,ioexcept:2,ion:2,isauthor:2,ischild:3,isempti:2,isequ:2,ispar:3,issu:[3,5],ist:2,item:[1,2],iteml
abel:2,iter:2,itm:2,its:[2,3],java:[2,4,5],javacod:5,javascript:5,jdm:2,jdom:2,jdomdoc:2,jdomdochelp:2,jdomobject:2,jdomorphilo:2,job:2,just:[1,2,3],keep:[2,3],kei:[2,5],kept:3,keyboard:5,keyset:2,kind:3,known:3,korpu:5,korpusnam:[2,3],label:[2,3,4,5],labelenumi:2,labelledbi:5,laid:3,landscap:2,languag:[1,2,3,5],larg:[1,2],larger:2,last:[2,3],lastli:5,later:[1,2,3],latter:[2,5],law:2,layer:[3,4],layout:[4,5],lead:3,least:2,left:[2,3,5],leftov:2,leftsov:2,lemma:[2,3,5],lemmaanalyz:2,length:[2,5],less:[2,3],let:2,letter:2,level:[1,2,3],lexic:[1,2,3],like:[1,2],limit:2,line:[2,3,5],linear:2,link:[1,2,3,5],list:[2,3,5],littl:3,loc:5,locat:5,loeschen:5,logic:[2,4,5],longer:2,look:[1,2],loop:[2,5],lower:2,lst:[2,3,5],lstlist:[2,3,5],machin:5,made:[1,2],main:[2,4,5],mainfil:2,mainli:2,major:3,make:[1,2,3,5],manag:[1,2,3],mani:[2,5],manipul:[2,5],manual:[1,2,5],map:2,margin:5,mark:[1,2],markup:3,mask:[2,4,5],master:[1,2,5],match:[1,2,5],materi:1,matter:2,maxoccur:[2,3],mcr:5,mcr_directori:5,mcraccessmanag:5,mcridstr:2,mcrmetadatamanag:2,mcrmetalinkid:5,mcrobj1:2,mcrobj2:2,mcrobj:2,mcrobject:2,mcrobjectid:2,mcrobjekt:2,mcrpath:2,mcrservlet:2,mcrservletjob:2,mcrsessionmgr:2,mcrsolrclientfactori:2,mcrtranslat:5,mcrurn:5,mcrxmlfunction:[2,5],mcrxsl:5,mdm:2,mean:[2,3],measur:2,mechan:[2,5],meet:3,ment:2,mention:[2,3,5],menu:[2,5],menuitem:5,mere:2,messages_d:5,messages_en:5,met:5,meta:[2,3],metadata:[1,2,3,5],method:2,middl:[1,2],might:[3,5],mileston:2,minoccur:3,mix:3,mod:5,mode:[2,5],model:[0,2,4,5],modif:5,modul:[0,3],modular:2,monomorphem:2,more:[2,3,4,5],moreov:2,morphem:[2,3],morphil:1,morphilo:[1,2,3,4,5],morphilo_uml:2,morphilocontain:[2,3,5],morphilostylesheet:5,morphmenu:5,morpholog:2,morphologicalsystem:3,most:[2,3,5],mous:5,multi:3,multilingu:5,multipl:2,must:[2,5],mvc:[3,4,5],mycor:[2,3,4,5],mycore_architectur:4,mycoreobject:[2,5],name:[2,3,5],namespac:2,natur:2,navig:5,necessari:[1,2,3,5],need:[2,3,4,5],neg:3,neither:2,net:5,newli:[1,2],newroot:2,next:[2,5],nextobj
ect:5,nirgend:2,node:[2,3,5],nonamespaceschemaloc:3,nor:2,normal:[1,5],notavail:2,note:[1,2,5],notic:[3,5],notinherit:3,now:[2,3],number:[1,2,3],numberformatexcept:2,nur:2,nw1:5,nw2:5,oar:2,obaer:2,ober:2,obj:2,objactiontempl:5,object:[1,2,3,5],objectact:5,objectid:2,objecttyp:3,objid:[2,5],obligatori:3,obuh:2,obviou:2,occdiff:2,occur:[1,2],occurr:[2,3],oder:2,oed:2,oedfootnot:2,oer:2,ofaer:2,ofer:2,oferr:2,off:2,offer:[2,4],offerr:2,offr:2,ofir:2,ofor:2,ofowr:2,oger:2,oher:2,onc:[2,5],one:[2,3,4,5],ones:2,ongo:1,onli:[1,2,3,5],onward:2,ooer:2,oor:2,oouer:2,open:5,oper:0,oppos:2,optim:1,option:[2,3,5],optional:3,order:2,org:[2,3,5],origin:2,other:[1,2,3,5],otherwis:[1,2,3],ouer:2,ouir:2,our:2,out:[1,2,3,5],outer:2,output:[1,2,4],outputstr:2,outputt:2,ouuer:2,ouur:2,ouver:2,ouyr:2,ova:2,ovah:2,ovar:2,ove:2,over:2,overal:5,overrid:2,overwhelm:3,overwrit:2,overwritten:2,ovir:2,ovr:2,ovuh:2,ovur:2,ovver:2,ovyr:2,ower:2,owir:2,own:[4,5],owr:2,owuer:2,owur:2,owver:2,owwer:2,owyr:2,oxford:2,packag:2,page:[0,2,5],paid:3,par:3,parallel:4,param:5,paramet:[2,5],parentobjid:5,parseint:2,part:[2,3,5],partial:1,pass:2,path:[2,5],pattern:[2,3,4,5],pcs2:2,pcs:2,pdf:3,penn:3,persist:[1,3],perspect:2,pflichtfeld:3,phase:2,phonolog:3,place:[2,4],plain:1,plu:2,png:[2,4],point:[1,2],popul:1,pos:[2,3],posit:[2,3],possibl:[1,2,3],post:2,potenti:[2,3],practic:[4,5],precis:[2,4],predefin:4,predict:3,pref:2,prefcutoff:2,prefenum:2,prefer:2,prefix:[2,3,5],prefixallomorph:2,prefixbaseform:3,prefixenum:2,prefixmorphem:2,prefixnumb:2,prefloop:2,prefputmorph:2,present:[2,3,5],prevent:2,previou:[1,2],previous:[1,2],principl:0,printstacktrac:2,privat:2,probabl:2,problem:[2,3],problemat:3,proc:2,procedur:[2,5],process:[1,2,3,5],processcorpusservlet:[2,5],processor:5,procwd:2,product:[2,3],program:[2,4],progress:1,project:[2,5],prop:2,properti:[3,5],propos:2,protect:2,prototyp:[2,3],provid:[1,2,3],publish:[2,5],pull:5,put:2,putal:2,qry:2,qualiti:[1,2],qualitycontrol:2,quantif:3,queri:2,question:[2,3]
,rang:2,rather:2,reach:2,read:[2,3,5],reader:5,readi:2,realiz:3,realli:[2,5],reason:[1,2,3],receiv:[1,2,5],recht:2,recogn:[2,3,5],recognit:2,recommend:5,recurs:2,redirect:2,ref:[1,2,3,4,5],refactor:2,reflect:3,regard:3,region:1,regist:[2,5],reject:1,rekurs:2,rel:5,relat:2,releas:3,relev:[2,3,5],reliabl:3,remain:2,remaind:[2,5],remark:3,remov:2,removecont:2,render:[2,5],renewcommand:2,repeat:2,repetit:2,replac:2,repositori:[2,3,5],represent:[2,5],request:2,requir:3,research:[1,3],resolv:[2,5],resourc:[1,5],respect:[1,2,3,5],respond:2,respons:[2,5],rest:2,restart:2,restrict:5,restword:2,result:[1,2,3,5],resultset:2,resum:[1,2],retrievemcrobject:2,reveal:2,revers:2,right:[2,3,4,5],rise:2,role:[2,5],roll:2,root:[2,3,5],rootanalyz:2,roughli:4,row:5,rslt:2,rudimentari:2,run:[2,5],said:3,sake:3,same:[2,3],sampl:[1,2],satisfi:3,save:[1,2,3],saxexcept:2,scale:[2,4],screen:5,search:[0,2,3,4,5],searchabl:2,searchfield:5,sec:[2,5],second:[1,2,3,5],section:[2,3,4,5],see:[1,2,5],seem:3,seen:[2,5],segment:2,select:5,self:5,send:2,sens:[2,3],separ:[3,5],sequenc:3,serv:2,server:1,servflag:2,servic:[2,5],servlet:2,servletsbaseurl:5,servstat:2,set:[1,2],setattribut:2,setfield:2,setformat:2,setignoreattributeord:2,setignorecom:2,setignorediffbetweentextandcdata:2,setignorewhitespac:2,setnormalizewhitespac:2,setocc:2,setqueri:2,setrow:2,settext:2,setxlink:2,setzen:5,sever:2,share:1,she:1,shortcom:2,shorter:2,should:[2,3,4],show:2,shown:[2,3,5],side:[2,5],similar:2,simpl:[1,2,3,5],simplest:2,simpli:2,simplic:3,simplif:2,simplifi:2,sinc:[2,5],sind:2,size:[1,2,3],skeleton_layout_templ:5,slr:2,small:2,snippet:2,softwar:[0,2,3,5],solr:[2,5],solrclient:2,solrdocumentlist:2,solrend:2,solrqueri:2,solrresult:2,solrstart:2,solut:[2,3],solv:2,some:[2,3,4,5],someth:2,somewhat:2,sort:[1,2],sortedbylengthmap:2,sortoutaffix:2,sourc:[2,4,5],space:[2,5],span:5,special:5,specif:[2,3,4,5],specifi:[1,2,4],speech:3,sprach:3,src:[2,3,4,5],stand:2,standard:[1,2,3,4,5],standardfootnot:3,start:[1,2,3,5],startsw
ith:2,state:2,statist:[1,2],stem:[2,3],stemnumb:2,step:2,still:[1,2,5],stop:2,store:5,stream:1,string:[2,3,5],stronger:3,structur:[2,3,4,5],style:[3,5],stylesheet:5,subdirectori:4,subdivid:2,subject:3,subsec:[2,3,4,5],subsect:[3,5],subsequ:3,substanti:2,substitut:1,substr:2,substract:2,subtl:1,subword:3,succe:1,success:2,suf:2,suffic:2,suffix:[2,3,5],suffixallomorph:2,suffixbaseform:3,suffixenum:2,suffixmorphem:2,suffixnumb:2,suggest:[1,2,3],sum:2,superus:5,support:3,suppos:2,sustain:3,syntact:3,syntax:5,system:2,tabindex:5,tabl:1,tag:[1,2,5],tagcorpusservlet:2,taggen:5,tagmanu:2,tagservlet:2,tagset:3,tagurl:2,take:[1,2],taken:2,target:[2,3],task:[2,3,5],technolog:[2,3],tei:[2,3,4,5],teiexamp:3,templat:5,term:3,test:[1,2,5],text:[1,2,3,5],textfil:2,tha:2,than:[2,3],theenumi:2,thei:[1,2,3,4,5],them:[1,2],theoret:[1,3],theori:5,therefor:[2,3,5],thereof:3,thi:[1,2,3,5],thing:5,think:2,third:[2,5],though:2,three:[2,4],through:[1,2,3,5],throughout:2,thu:[1,2,5],time:[2,3],timecorpusbegin:2,timecorpusend:2,timecorrect:2,togeth:[1,2,5],toggl:5,token:2,tomcat:2,tool:1,tostr:2,transfer:2,transform:5,translat:5,treebank:3,treemap:2,truth:2,turn:2,two:[1,2,5],type:[2,3,5],typic:2,ueberschrift:5,ufara:2,ufe:2,ufer:2,ufera:2,uferr:2,uferra:2,ufor:2,ufora:2,ufr:2,ufra:2,ufyrra:2,unbound:3,und:2,under:2,underli:[1,2],understand:[2,5],unequ:2,unfortun:[2,3,5],uniqu:[2,5],unit:[1,2],unknown:2,unlik:2,untag:[1,2,5],until:2,updat:2,updateend:2,updatestart:2,upload:[2,5],upper:[4,5],url:[2,5],urlencod:5,urn:5,use:[2,3,5],used:[1,2,3,5],user:[1,2,3,5],usernam:[2,5],uses:[2,5],using:2,usual:[1,2],utf:[3,5],uuer:2,uuera:2,uvver:2,uvvor:2,valid:[2,5],valu:[1,2,5],variabl:[2,5],variant:2,vereinheitlichung:2,veri:[1,2,3],version:[2,3,5],vfere:2,via:[2,4,5],view:[0,3,4],visibl:2,visual:[1,5],vuer:2,vver:2,wai:2,want:5,war:2,wdtpe:2,wealth:3,web:[2,3,5],webapplicationbaseurl:5,webfrag:[2,5],webinterfac:2,webpag:[2,5],webservic:5,well:[2,3],wenn:2,were:[2,3,5],what:2,when:2,where:2,wherea:[2,3]
,whether:1,which:[1,2,3,4,5],who:[1,5],whose:2,wirklich:5,within:[1,2,5],without:2,word:[1,2,3,5],wordbegin:3,worddatamodel:[2,3],wordroot:2,wordtoken:2,wordtyp:[2,3],work:[1,2],workaround:2,workload:1,worteben:3,worth:[2,3],worttyp:2,would:2,wrd:2,write:2,writeal:2,writealldata:2,writecont:2,writedb:5,written:[1,2],wrong:2,www:[2,3,4,5],xalan:5,xed:[2,5],xlink:[2,5],xlinknamespac:2,xml:[2,3,4,5],xmldiff:2,xmln:[3,5],xmloutputt:2,xmlschema:3,xmlunit:2,xpath:[2,5],xpathex:2,xpathexpress:2,xpathfactori:2,xpexp:2,xpfac:2,xpfacti:2,xsd:3,xsi:3,xsl:5,xslt:5,yes:1,yet:[1,2,3,5],yfera:2,yfere:2,yferra:2,zero:2,zerobegin:2,zeroend:2,zipf:2,zur:2},titles:["Documentation Morphilo Project","Software Design","Controller Adjustments","Data Model","Framework","View"],titleterms:{adjust:2,conceptu:[2,3,5],control:2,data:3,design:1,document:0,framework:4,gener:2,implement:[2,3,5],indic:0,model:3,morphilo:0,oper:2,principl:2,project:0,softwar:1,tabl:0,view:5,welcom:[]}}) \ No newline at end of file diff --git a/Morphilo_doc/_build/html/source/architecture.html b/Morphilo_doc/_build/html/source/architecture.html new file mode 100644 index 0000000000000000000000000000000000000000..bb49762dfb34c0975d8a375948adbcaa059fe235 --- /dev/null +++ b/Morphilo_doc/_build/html/source/architecture.html @@ -0,0 +1,147 @@ + +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> + +<html xmlns="http://www.w3.org/1999/xhtml"> + <head> + <meta http-equiv="X-UA-Compatible" content="IE=Edge" /> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> + <title>Software Design — Morphilo documentation</title> + <link rel="stylesheet" href="../_static/alabaster.css" type="text/css" /> + <link rel="stylesheet" href="../_static/pygments.css" type="text/css" /> + <script type="text/javascript" src="../_static/documentation_options.js"></script> + <script type="text/javascript" src="../_static/jquery.js"></script> + <script 
type="text/javascript" src="../_static/underscore.js"></script>
+    <script type="text/javascript" src="../_static/doctools.js"></script>
+    <link rel="index" title="Index" href="../genindex.html" />
+    <link rel="search" title="Search" href="../search.html" />
+    <link rel="next" title="Framework" href="framework.html" />
+    <link rel="prev" title="View" href="view.html" />
+
+    <link rel="stylesheet" href="../_static/custom.css" type="text/css" />
+
+
+    <meta name="viewport" content="width=device-width, initial-scale=0.9, maximum-scale=0.9" />
+
+  </head><body>
+
+
+    <div class="document">
+      <div class="documentwrapper">
+        <div class="bodywrapper">
+          <div class="body" role="main">
+
+  <div class="section" id="software-design">
+<h1>Software Design<a class="headerlink" href="#software-design" title="Permalink to this headline">¶</a></h1>
+<img alt="source/architecture.*" src="source/architecture.*" />
+<p>The architecture of a possible <strong>take-and-share</strong> approach for language
+resources is visualized in the figure above. Because the gist of the approach
+becomes clearer when a concrete example is described, the case of annotating
+lexical derivatives of Middle English, together with a respective database, is
+given as an illustration.
+However, any other tool that helps with manual annotation and manages the metadata of a corpus could be
+substituted here instead.</p>
+<p>After an untagged corpus or plain text is input, it is determined whether the
+input material was previously annotated by a different user. This information is
+usually provided by the metadata administered by the annotation tool; in the case at
+hand, the tool is called <em>Morphilizer</em> in the figure above. An
+alternative is a simple table look-up for all occurring words in the datasets Corpus 1 through Corpus n. If the material is contained
+completely, the <em>yes</em> branch is followed up further – otherwise the <em>no</em> branch
+succeeds. The difference between the two branches is subtle, yet crucial. On
+both branches, the annotation tool (here <em>Morphilizer</em>) is called, which, first,
+sorts out all words that are not contained in the master database (here <em>Morphilo-DB</em>)
+and, second, makes reasonable suggestions for an optimal annotation of
+the items. In both cases the
+annotations are linked to the respective items (e.g. words) in the
+text, but they are also persistently saved in an extra dataset, i.e. Corpus 1
+through n, together with all available metadata.</p>
+<p>The difference between the two information streams is that
+in the <em>yes</em> branch a comparison between the newly created dataset and
+all previous datasets of this text is carried out. Within this
+unit, all deviations and congruencies are marked and counted. The underlying
+assumption is that, with a growing number of comparable texts, the
+annotations approach the theoretic true value of a correct annotation
+while errors level out, provided that the sample size is large enough. What the
+distribution of errors and correct annotations looks like exactly, and whether a
+normal distribution can be assumed, is still subject of ongoing research; but
+independent of the concrete results, the component (called <em>compare
+manual annotations</em> in the figure above) allows for specifying the
+exact form of the sample population.
+In fact, it is necessary at that point to define the form of the distribution,
+the sample size, and the rejection region. The standard settings are a normal
+distribution, a rejection region of α = 0.05, and a sample size of 30, so
+that a simple Gauss test can be calculated.</p>
+<p>Continuing the information flow, these statistical calculations are
+delivered to the quality-control component. Based on the statistics, the
+respective items, together with the metadata, frequencies, and, of course,
+annotations, are written to the master database. All information in the master
+database is directly used for automated annotations. Thus it is directly matched
+to the input texts or corpora through the <em>Morphilizer</em> tool.
+Based on the entries looked up in the master, the annotation tool decides which items
+are to be manually annotated.</p>
+<p>The processes just described are all hidden from the user, who has no way
+to impact the set quality standards except by errors in the annotation process. The
+user will only see the number of items of the input text that he or she has to process manually; the
+annotator thus receives an estimate of the workload beforehand. On this
+basis, a decision can be made whether to start the annotation at all. It is
+possible to interrupt the annotation work and save progress on the server, and
+the user has access to the annotations made in the respective dataset and can
+correct them, or save them and resume later. It is important to note that the user will receive
+the tagged document only after all items are fully annotated; no partially
+tagged text can be output.</p>
+</div>
+
+
+          </div>
+        </div>
+      </div>
+      <div class="sphinxsidebar" role="navigation" aria-label="main navigation">
+        <div class="sphinxsidebarwrapper"><div class="relations">
+<h3>Related Topics</h3>
+<ul>
+  <li><a href="../index.html">Documentation overview</a><ul>
+      <li>Previous: <a href="view.html" title="previous chapter">View</a></li>
+      <li>Next: <a href="framework.html" title="next chapter">Framework</a></li>
+  </ul></li>
+</ul>
+</div>
+  <div role="note" aria-label="source link">
+    <h3>This Page</h3>
+    <ul class="this-page-menu">
+      <li><a href="../_sources/source/architecture.rst.txt"
+            rel="nofollow">Show Source</a></li>
+    </ul>
+   </div>
+<div id="searchbox" style="display: none" role="search">
+  <h3>Quick search</h3>
+    <div class="searchformwrapper">
+    <form class="search" action="../search.html" method="get">
+      <input type="text" name="q" />
+      <input type="submit" value="Go" />
+      <input type="hidden" name="check_keywords" value="yes" />
+      <input type="hidden" name="area"
value="default" />
+    </form>
+    </div>
+</div>
+<script type="text/javascript">$('#searchbox').show(0);</script>
+        </div>
+      </div>
+      <div class="clearer"></div>
+    </div>
+    <div class="footer">
+      ©2018, Hagen Peukert.
+
+      |
+      Powered by <a href="http://sphinx-doc.org/">Sphinx 1.7.2</a>
+      & <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.10</a>
+
+      |
+      <a href="../_sources/source/architecture.rst.txt"
+          rel="nofollow">Page source</a>
+    </div>
+
+
+
+
+  </body>
+</html>
\ No newline at end of file
diff --git a/Morphilo_doc/_build/html/source/controller.html b/Morphilo_doc/_build/html/source/controller.html
index a1b0ca7d522d0289259da48b18aa64f129848380..92abe3a963110abf67f18cc5dcfc31a372e2b587 100644
--- a/Morphilo_doc/_build/html/source/controller.html
+++ b/Morphilo_doc/_build/html/source/controller.html
@@ -15,7 +15,8 @@
     <script type="text/javascript" src="../_static/doctools.js"></script>
     <link rel="index" title="Index" href="../genindex.html" />
     <link rel="search" title="Search" href="../search.html" />
-    <link rel="prev" title="Data Model Implementation" href="datamodel.html" />
+    <link rel="next" title="View" href="view.html" />
+    <link rel="prev" title="Data Model" href="datamodel.html" />
     <link rel="stylesheet" href="../_static/custom.css" type="text/css" />
@@ -34,6 +35,969 @@
 <h1>Controller Adjustments<a class="headerlink" href="#controller-adjustments" title="Permalink to this headline">¶</a></h1>
 <div class="section" id="general-principle-of-operation">
 <h2>General Principle of Operation<a class="headerlink" href="#general-principle-of-operation" title="Permalink to this headline">¶</a></h2>
+<p>The class diagram below (fig:classDiag) illustrates the dependencies of the five Java classes that were integrated to add the Morphilo
+functionality, defined in the package <em>custom.mycore.addons.morphilo</em>. The general principle of operation
+is as follows.
The handling of data search, upload, saving, and user
+authentication is left entirely to the MyCoRe functionality, which is completely
+implemented. The class <em>ProcessCorpusServlet.java</em> receives a request from the web interface to process an uploaded file,
+i.e. a simple text corpus, and checks whether any of its words are available in the master database. All words that are not
+listed in the master database are written to an extra file; these are the words that have to be annotated manually. At the end, the
+servlet sends a response back to the user interface. If all words are contained in the master, an xml file that includes all
+annotated words of the original corpus is generated from the master database. For larger text files this will usually not be the case,
+so if some words are not in the master, the user is prompted to initiate the manual annotation process.</p>
+<p>The manual annotation process is handled by the class
+<em>TagCorpusServlet.java</em>, which builds a JDOM object for the first word in the extra file.
+This is done by creating an object of the <em>JDOMorphilo.java</em> class. This class, in turn, uses the methods of
+<em>AffixStripper.java</em>, which make simple but reasonable suggestions on the word structure. The JDOM object is then
+returned as a response to the user, presented as a form in which the user can make changes. This is necessary
+because the word structure algorithm of <em>AffixStripper.java</em> errs in some cases. Once the user agrees on the
+suggestions or on his or her corrections, the JDOM object is saved as an xml file that is searchable, visible, and
+changeable only by the authenticated user (and the administrator). Another file containing all processed words is created or
+updated, respectively, and the <em>TagCorpusServlet.java</em> servlet restarts until the last word in the extra list is
+processed. This enables the user to stop and resume his or her annotation work at a later point in time. The
+<em>TagCorpusServlet</em> calls methods from <em>ProcessCorpusServlet.java</em> to adjust the content of the extra
+file harboring the untagged words. If, and only if, this file is empty, it is replaced by a file comprising all words
+from the original text file, both the ones from the master database and the ones annotated by the user,
+in an annotated xml representation.</p>
+<p>Each time <em>ProcessCorpusServlet.java</em> is instantiated, it also instantiates <em>QualityControl.java</em>. This class checks whether a
+new word can be transferred to the master database. The algorithm can be freely adapted to higher or lower quality standards.
+In its present configuration, a method tests whether 20 different
+registered users agree on the annotation of the same word. More specifically,
+if 20 JDOM objects are identical except in the attribute field <em>occurrences</em> in the metadata node, the JDOM object becomes
+part of the master. The latter is easily done by changing the attribute <em>creator</em> from the user name
+to <em>administrator</em> in the service node. This makes the dataset part of the master database. Moreover, the <em>occurrences</em>
+attribute is updated by adding up all occurrences of the word that stem from
+different text corpora of the same time range.
+</p>
+<p><em>Figure (fig:classDiag): Class Diagram Morphilo</em> (morphilo_uml.png)</p>
+</div>
+<div class="section" id="conceptualization">
+<h2>Conceptualization<a class="headerlink" href="#conceptualization" title="Permalink to this headline">¶</a></h2>
+<p>The controller component is largely
+specified and ready to use in some hundred or so Java classes that handle the
+logic of the search, such as indexing, but also deal with directories and
+files: saving, creating, deleting, and updating files.
+Moreover, rudimentary user management comprising different roles and
+rights is offered. The basic technology behind the controller’s logic is the
+servlet. As such, all new code has to be registered as a servlet in the
+web-fragment.xml (here for the Apache Tomcat container), as the following listing shows.</p>
+<p>Listing (lst:webfragment): Servlet registration in the web-fragment.xml (excerpt)</p>
+<pre>
+&lt;servlet&gt;
+    &lt;servlet-name&gt;ProcessCorpusServlet&lt;/servlet-name&gt;
+    &lt;servlet-class&gt;custom.mycore.addons.morphilo.ProcessCorpusServlet&lt;/servlet-class&gt;
+&lt;/servlet&gt;
+&lt;servlet-mapping&gt;
+    &lt;servlet-name&gt;ProcessCorpusServlet&lt;/servlet-name&gt;
+    &lt;url-pattern&gt;/servlets/object/process&lt;/url-pattern&gt;  |label{ln:process}|
+&lt;/servlet-mapping&gt;
+&lt;servlet&gt;
+    &lt;servlet-name&gt;TagCorpusServlet&lt;/servlet-name&gt;
+    &lt;servlet-class&gt;custom.mycore.addons.morphilo.TagCorpusServlet&lt;/servlet-class&gt;
+&lt;/servlet&gt;
+&lt;servlet-mapping&gt;
+    &lt;servlet-name&gt;TagCorpusServlet&lt;/servlet-name&gt;
+    &lt;url-pattern&gt;/servlets/object/tag&lt;/url-pattern&gt;  |label{ln:tag}|
+&lt;/servlet-mapping&gt;
+</pre>
+<p>Now, the logic has
to be extended by the specifications analyzed in the chapter
+on conceptualization. More specifically, some
+classes have to be added that take care of analyzing words
+(<em>AffixStripper.java, InflectionEnum.java, SuffixEnum.java,
+PrefixEnum.java</em>), extracting the relevant words from the text and checking the
+uniqueness of the text (<em>ProcessCorpusServlet.java</em>), making reasonable
+suggestions on the annotation (<em>TagCorpusServlet.java</em>), building the object
+for each annotated word (<em>JDOMorphilo.java</em>), and checking the quality by applying
+statistical models (<em>QualityControl.java</em>).</p>
+</div>
+<div class="section" id="implementation">
+<h2>Implementation<a class="headerlink" href="#implementation" title="Permalink to this headline">¶</a></h2>
+<p>Having taken a bird’s eye perspective in the previous chapter, it is now time to look at the specific implementation at the level
+of methods. Starting with the main servlet, <em>ProcessCorpusServlet.java</em>, the class defines five getter methods:</p>
+<ol>
+<li>public String getURLParameter(MCRServletJob, String)</li>
+<li>public String getCorpusMetadata(MCRServletJob, String)</li>
+<li>public ArrayList&lt;String&gt; getContentFromFile(MCRServletJob, String)</li>
+<li>public Path getDerivateFilePath(MCRServletJob, String)</li>
+<li>public int getNumberOfWords(MCRServletJob job, String)</li>
+</ol>
+<p>Since each servlet in MyCoRe extends the class MCRServlet, it has access to MCRServletJob, from which the http requests and responses
+can be used. This is the first argument in the above methods. The second argument of method (1) specifies the name of a url parameter, i.e.
+the object id or the id of the derivate; the method returns the value of the given parameter. Typically, MyCoRe uses the url to exchange
+these ids. The second method provides the value of a data field in the xml document, so the string defines the name of an attribute.
+Method (3), <em>getContentFromFile(MCRServletJob, String)</em>, returns the words of a file as a list when given the file name as a string.
+The getter listed in (4) returns the path from the MyCoRe repository when the name of
+the file is specified. And finally, method (5) returns the number of words by simply returning
+<em>getContentFromFile(job, fileName).size()</em>.</p>
+<p>There are two methods in every MyCoRe servlet that have to be overwritten:
+<em>protected void render(MCRServletJob, Exception)</em>, which redirects the requests as <em>POST</em> or <em>GET</em> responses, and
+<em>protected void think(MCRServletJob)</em>, in which the logic is implemented. Since the latter is important for understanding the
+core idea of the Morphilo algorithm, it is displayed in full length below.</p>
+<p>Listing (src:think): The overwritten think method</p>
+<pre>
+protected void think(MCRServletJob job) throws Exception
+{
+    this.job = job;
+    String dateFromCorp = getCorpusMetadata(job, "def.datefrom");
+    String dateUntilCorp = getCorpusMetadata(job, "def.dateuntil");
+    String corpID = getURLParameter(job, "objID");
+    String derivID = getURLParameter(job, "id");
+
+    //if NoW is 0, fill with the actual number of words
+    MCRObject helpObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(corpID)); |label{ln:bugfixstart}|
+    Document jdomDocHelp = helpObj.createXML();
+    XPathFactory xpfacty = XPathFactory.instance();
+    XPathExpression&lt;Element&gt; xpExp = xpfacty.compile("//NoW", Filters.element());
+    Element elem = xpExp.evaluateFirst(jdomDocHelp);
+    //fixes transferred morphilo data from the previous stand-alone project
+    int corpussize = getNumberOfWords(job, "");
+    if (Integer.parseInt(elem.getText()) != corpussize)
+    {
+        elem.setText(Integer.toString(corpussize));
+        helpObj = new MCRObject(jdomDocHelp);
+        MCRMetadataManager.update(helpObj);
+    } |label{ln:bugfixend}|
+
+    //check if the uploaded corpus was processed before
+    SolrClient slr = MCRSolrClientFactory.getSolrClient(); |label{ln:solrstart}|
+    SolrQuery qry = new SolrQuery();
+    qry.setFields("korpusname", "datefrom", "dateuntil", "NoW", "id");
+    qry.setQuery("datefrom:" + dateFromCorp + " AND dateuntil:" + dateUntilCorp + " AND NoW:" + corpussize);
+    SolrDocumentList rslt = slr.query(qry).getResults(); |label{ln:solrresult}|
+
+    Boolean incrOcc = true;
+    //if the result set contains only one entry, it must be the newly created corpus
+    if (slr.query(qry).getResults().getNumFound() &gt; 1)
+    {
+        incrOcc = false;
+    } |label{ln:solrend}|
+
+    //match all words in the corpus against morphilo (creator=administrator) and save
+    //all words that are not in the morphilo DB in leftovers
+    ArrayList&lt;String&gt; leftovers = new ArrayList&lt;String&gt;();
+    ArrayList&lt;String&gt; processed = new ArrayList&lt;String&gt;();
+
+    leftovers = getUnknownWords(getContentFromFile(job, ""), dateFromCorp, dateUntilCorp, "", incrOcc, incrOcc, false); |label{ln:callkeymeth}|
+
+    //write all leftover words to a file as a derivative of the respective corpmeta dataset
+    MCRPath root = MCRPath.getPath(derivID, "/"); |label{ln:filesavestart}|
+    Path fn = getDerivateFilePath(job, "").getFileName();
+    Path p = root.resolve("untagged-" + fn);
+    Files.write(p, leftovers); |label{ln:filesaveend}|
+
+    //create a file for all words that were processed
+    Path procWds = root.resolve("processed-" + fn);
+    Files.write(procWds, processed);
+}
+</pre>
+<p>Using the above mentioned getter methods, the <em>think</em> method assigns values to the object ID, needed to get the xml document
+containing the corpus metadata, the file ID, and the beginning and ending dates of the corpus to be analyzed.
The lines marked <em>ln:bugfixstart</em>
+through <em>ln:bugfixend</em> show how to access a MyCoRe object as an xml document, a procedure that will be used in different variants
+throughout this implementation.
+By means of the object ID, the respective corpus is identified and a JDOM document is constructed, which can then be accessed
+by XPath. The XPath factory instances are collections of the xml nodes. In the present case, it is safe to assume that only one element
+named <em>NoW</em> is available (see the corpus data model listing with maxOccurs='1'), so we do not have to loop through
+the collection but can use the first node named <em>NoW</em>. The if-test checks whether the number of words of the uploaded file is the
+same as the number written in the document. When the document is initially created by the MyCoRe logic, this number is configured to be zero.
+If unequal, the setText(String) method is used to write the number of words of the corpus to the document.</p>
+<p>The lines marked <em>ln:solrstart</em> through <em>ln:solrend</em> reveal the second important ingredient, i.e. controlling the search engine. First, a solr
+client and a query are initialized. Then, the output of the result set is defined by giving the fields of interest of the document;
+in the case at hand, these are the id, the name of the corpus, the number of words, and the beginning and ending dates. With <em>setQuery</em>
+it is possible to assign values to some or all of these fields. Finally, <em>getResults()</em> carries out the search and writes
+all hits to a <em>SolrDocumentList</em> (line marked <em>ln:solrresult</em>). The test that follows only sets a Boolean
+encoding whether the number of occurrences of a word in the master should be updated. To avoid multiple counts,
+the word frequency is incremented only if the corpus is new.</p>
+<p>In the line marked <em>ln:callkeymeth</em>, <em>getUnknownWords(ArrayList, String, String, String, Boolean, Boolean, Boolean)</em> is called and
+its result is returned as a list of words. This method is key and will be discussed in depth below. Finally, the lines marked
+<em>ln:filesavestart</em> through <em>ln:filesaveend</em> show how to handle file objects in MyCoRe. Using the file ID, the root path and the name
+of the first file in that path are identified. Then, a second file whose name starts with “untagged” is created, and all words returned from
+<em>getUnknownWords</em> are written to that file. By the same token, an empty file is created (in the last two lines of the <em>think</em> method),
+in which all words that are annotated manually will be saved.</p>
+<p>In a refactoring phase, the method <em>getUnknownWords(ArrayList, String, String, String, Boolean, Boolean, Boolean)</em> could be subdivided into
+three methods, one for each Boolean parameter. In fact, this method handles more than one task, mainly to avoid duplicated code.
+In essence, an outer loop runs through all words of the corpus and an inner loop runs through all hits in the solr result set. Because the result
+set is supposed to be small, approximately 10–20 items, efficiency
+problems are unlikely, although some more loops run through collections of about the same size.
+Since each word is identified on the basis of its projected word type, the word form, and the time range it falls into, it is these variables that
+have to be checked for existence in the documents. If they are not in the xml documents,
+<em>null</em> is returned and needs to be corrected. Moreover, user authentication must be considered.
There are three different XPaths that are relevant.
\begin{itemize}
\item[-] \emph{//service/servflags/servflag[@type='createdby']} to test for the correct user
\item[-] \emph{//morphiloContainer/morphilo} to create the annotated document
\item[-] \emph{//morphiloContainer/morphilo/w} to set occurrences or add a link
\end{itemize}

As an illustration of the core functioning of this method, listing \ref{src:getUnknowWords} is given.
\begin{lstlisting}[language=java,caption={Mode of Operation of getUnknownWords Method},label=src:getUnknowWords,escapechar=|]
public ArrayList<String> getUnknownWords(
        ArrayList<String> corpus,
        String timeCorpusBegin,
        String timeCorpusEnd,
        String wordtype,
        Boolean setOcc,
        Boolean setXlink,
        Boolean writeAllData) throws Exception
{
    String currentUser = MCRSessionMgr.getCurrentSession().getUserInformation().getUserID();
    ArrayList<String> lo = new ArrayList<String>();

    for (int i = 0; i < corpus.size(); i++)
    {
        SolrClient solrClient = MCRSolrClientFactory.getSolrClient();
        SolrQuery query = new SolrQuery();
        query.setFields("w", "occurrence", "begin", "end", "id", "wordtype");
        query.setQuery(corpus.get(i));
        query.setRows(50); //more than 50 items are extremely unlikely
        SolrDocumentList results = solrClient.query(query).getResults();
        Boolean available = false;
        for (int entryNum = 0; entryNum < results.size(); entryNum++)
        {
            ...
            // update in MCRMetadataManager
            String mcrIDString = results.get(entryNum).getFieldValue("id").toString();
            //read the MCRObject and create a JDOM document
            MCRObject mcrObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(mcrIDString));
            Document jdomDoc = mcrObj.createXML();
            ...
            //check and correction for word type
            ...
            //check and correction for time: timeCorrect
            ...
            //check if the user is authorized: isAuthorized
            ...
            XPathExpression<Element> xp = xpfac.compile("//morphiloContainer/morphilo/w", Filters.element());
            //iterate the w-elements and increment the occurrence attribute if setOcc is true
            for (Element e : xp.evaluate(jdomDoc))
            {
                //if the user is authorized and the word type is absent on both sides or identical
                if (isAuthorized && timeCorrect
                    && ((e.getAttributeValue("wordtype") == null && wordtype.equals(""))
                    || e.getAttributeValue("wordtype").equals(wordtype)))
                {
                    int oc = -1;
                    available = true;|\label{ln:available}|
                    try
                    {
                        //adjust the occurrence attribute
                        if (setOcc)
                        {
                            oc = Integer.parseInt(e.getAttributeValue("occurrence"));
                            e.setAttribute("occurrence", Integer.toString(oc + 1));
                        }
                        //write the morphilo ObjectID into the xml of corpmeta
                        if (setXlink)
                        {
                            Namespace xlinkNamespace = Namespace.getNamespace("xlink", "http://www.w3.org/1999/xlink");|\label{ln:namespace}|
                            MCRObject corpObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(getURLParameter(job, "objID")));
                            Document corpDoc = corpObj.createXML();
                            XPathExpression<Element> xpathEx = xpfac.compile("//corpuslink", Filters.element());
                            Element elm = xpathEx.evaluateFirst(corpDoc);
                            elm.setAttribute("href", mcrIDString, xlinkNamespace);
                        }
                        mcrObj = new MCRObject(jdomDoc);|\label{ln:updatestart}|
                        MCRMetadataManager.update(mcrObj);
                        QualityControl qc = new QualityControl(mcrObj);|\label{ln:updateend}|
                    }
                    catch (NumberFormatException except)
                    {
                        // ignore
                    }
                }
            }
        }
        if (!available) //not available in the datasets under the given conditions|\label{ln:notavailable}|
        {
            lo.add(corpus.get(i));
        }
    }
    return lo;
}
\end{lstlisting}
As the functionality of listing \ref{src:getUnknowWords} shows, getting the unknown words of a corpus is rather a side effect of the equally named method.
More precisely, a Boolean (line \ref{ln:available}) is set whenever the document is manipulated in some other way, because it is then clear that the word must exist.
If the Boolean remains false (line \ref{ln:notavailable}), the word is put on the list of words that have to be annotated manually. As already explained above, the
first loop runs through all words of the corpus, and in the following lines a Solr result set is created. This set is also looped through, and it is checked whether the time range
and the word type are correct and whether the user is authorized. In the remainder, the occurrence attribute of the morphilo document can be incremented (if setOcc is true) and/or the word is linked to the
corpus metadata (if setXlink is true). Since all code lines are equivalent to
what was explained in listing \ref{src:think}, it suffices to note that an
additional namespace, i.e.
``xlink'', has to be defined (line \ref{ln:namespace}).
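The occurrence-increment step of the setOcc branch can be mimicked without MyCoRe, Solr, or JDOM. The following sketch uses only the JDK's own DOM and XPath APIs; the class and method names are mine, not Morphilo's, and the XML snippet is a toy stand-in for a real morphilo document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class OccurrenceSketch {
    // increments the occurrence attribute of every w-element and
    // returns how many elements were touched
    public static int incrementOccurrences(Document doc) throws Exception {
        NodeList ws = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//morphiloContainer/morphilo/w", doc, XPathConstants.NODESET);
        int touched = 0;
        for (int i = 0; i < ws.getLength(); i++) {
            Element w = (Element) ws.item(i);
            int oc = Integer.parseInt(w.getAttribute("occurrence"));
            w.setAttribute("occurrence", Integer.toString(oc + 1));
            touched++;
        }
        return touched;
    }

    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<morphiloContainer><morphilo>"
                + "<w occurrence=\"3\">swich</w></morphilo></morphiloContainer>");
        System.out.println(incrementOccurrences(doc)); // prints 1: one w-element updated
    }
}
```

The real implementation additionally guards the increment with the authorization, time-range, and word-type checks shown above.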
Once the linking of word
and corpus is set, the entire MyCoRe object has to be updated. This is done by the functionality of the framework (lines \ref{ln:updatestart}--\ref{ln:updateend}).
At the end, an instance of \emph{QualityControl} is created.

%QualityControl
The class \emph{QualityControl} is instantiated with a constructor
depicted in listing \ref{src:constructQC}.
\begin{lstlisting}[language=java,caption={Constructor of QualityControl.java},label=src:constructQC,escapechar=|]
private MCRObject mycoreObject;
/* The constructor calls a method that carries out quality control, i.e. it checks
 * whether at least 20 different users agree 100% on the segments of the word
 * under investigation
 */
public QualityControl(MCRObject mycoreObject) throws Exception
{
    this.mycoreObject = mycoreObject;
    if (getEqualObjectNumber() > 20)
    {
        addToMorphiloDB();
    }
}
\end{lstlisting}
The constructor takes a MyCoRe object, a potential word candidate for the
master database, which is assigned to a private class variable because the
object is used, though not changed, by some other Java methods.
More importantly, there are two further methods: \emph{getEqualObjectNumber()} and
\emph{addToMorphiloDB()}. While the former initiates a process of counting and
comparing objects, the latter is concerned with calculating the correct number
of occurrences from different, but not identical, texts and with generating a MyCoRe object with the same content but with two different flags in the \emph{//service/servflags/servflag}-node, i.e. \emph{createdby='administrator'} and \emph{state='published'}.
And of course, the \emph{occurrence} attribute is set to the newly calculated value.
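The promotion criterion itself is a simple threshold test. A minimal, framework-free sketch (class and method names are mine; the real check runs over MyCoRe objects, not strings):

```java
import java.util.Collections;
import java.util.List;

public class QualityGate {
    // a candidate enters the master database once more than 20
    // independent, fully agreeing annotations of the same word exist
    static final int THRESHOLD = 20;

    public static boolean promote(List<String> equalAnnotations) {
        return equalAnnotations.size() > THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(promote(Collections.nCopies(21, "ouer|coman"))); // prints true
        System.out.println(promote(Collections.nCopies(20, "ouer|coman"))); // prints false
    }
}
```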
The logic corresponds exactly to what was explained for
listing \ref{src:think} and will not be repeated here. The only difference lies in the paths compiled by the XPathFactory. They are
\begin{itemize}
\item[-] \emph{//service/servflags/servflag[@type='createdby']} and
\item[-] \emph{//service/servstates/servstate[@classid='state']}.
\end{itemize}
It is more instructive to document how the number of occurrences is calculated. Two steps are involved. First, a list of all MyCoRe objects that are
equal to the object the class is instantiated with (``mycoreObject'' in listing \ref{src:constructQC}) is created. This list is looped through and all occurrence
attributes are summed up. Second, all occurrences from equal texts are subtracted. Equal texts are identified on the basis of their metadata and their derivate.
There are some obvious shortcomings of this approach, which will be discussed in chapter \ref{chap:results}, section \ref{sec:improv}. Here, it suffices to
understand the mode of operation. Listing \ref{src:equalOcc} shows a possible solution.
\begin{lstlisting}[language=java,caption={Occurrence Extraction from Equal Texts (1)},label=src:equalOcc,escapechar=|]
/* returns the number of occurrences if the objects are equal, zero otherwise
 */
private int getOccurrencesFromEqualTexts(MCRObject mcrobj1, MCRObject mcrobj2) throws SAXException, IOException
{
    int occurrences = 1;
    //extract corpmeta ObjectIDs from the morphilo objects
    String crpID1 = getAttributeValue("//corpuslink", "href", mcrobj1);
    String crpID2 = getAttributeValue("//corpuslink", "href", mcrobj2);
    //get the two corpmeta objects
    MCRObject corpo1 = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(crpID1));
    MCRObject corpo2 = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(crpID2));
    //are the texts equal? get the 'processed-words' derivate
    String corp1DerivID = getAttributeValue("//structure/derobjects/derobject", "href", corpo1);
    String corp2DerivID = getAttributeValue("//structure/derobjects/derobject", "href", corpo2);

    ArrayList<String> result = new ArrayList<String>(getContentFromFile(corp1DerivID, ""));|\label{ln:writeContent}|
    result.removeAll(getContentFromFile(corp2DerivID, ""));|\label{ln:removeContent}|
    if (result.size() == 0) //the texts are equal
    {
        //extract the occurrences of one of the objects
        occurrences = Integer.parseInt(getAttributeValue("//morphiloContainer/morphilo/w", "occurrence", mcrobj1));
    }
    else
    {
        occurrences = 0; //the project metadata happened to be the same, but the texts are different
    }
    return occurrences;
}
\end{lstlisting}
In this implementation, the ids from the \emph{corpmeta} data model are accessed via the xlink attribute in the morphilo documents.
The method \emph{getAttributeValue(String, String, MCRObject)} does exactly the same as demonstrated earlier (from line \ref{ln:namespace}
onwards in listing \ref{src:getUnknowWords}). The underlying logic is that the texts are equal if exactly the same words were uploaded.
So all words from one file are written to a list (line \ref{ln:writeContent}) and the words from the other file are removed from the
very same list (line \ref{ln:removeContent}). If this list is empty, both files must have contained the same words, and the occurrences
are adjusted accordingly. Since this method is called from another private method that merely loops through all equal objects, one obtains
the occurrences from all equal texts. For reasons of confirmability, the looping method is also given:
\begin{lstlisting}[language=java,caption={Occurrence Extraction from Equal Texts (2)},label=src:equalOcc2,escapechar=|]
private int getOccurrencesFromEqualTexts() throws Exception
{
    ArrayList<MCRObject> equalObjects = new ArrayList<MCRObject>();
    equalObjects = getAllEqualMCRObjects();
    int occurrences = 0;
    for (MCRObject obj : equalObjects)
    {
        occurrences = occurrences + getOccurrencesFromEqualTexts(mycoreObject, obj);
    }
    return occurrences;
}
\end{lstlisting}

Now, the constructor in listing \ref{src:constructQC} reveals another method that rolls out an equally complex concatenation of procedures.
As implied above, \emph{getEqualObjectNumber()} returns the number of equally annotated words. It does this by falling back on another
method from whose returned list the size is calculated (\emph{getAllEqualMCRObjects().size()}). Hence, we should look at
\emph{getAllEqualMCRObjects()}. This method has the same design as \emph{int getOccurrencesFromEqualTexts()} in listing \ref{src:equalOcc2}.
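The list-difference test for text equality can be isolated from the MyCoRe machinery. A hedged sketch (names are mine), which also makes the approach's known weakness visible, namely that duplicate word counts are not distinguished:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TextEquality {
    // two texts count as equal if removing the words of one file from the
    // word list of the other leaves nothing behind
    public static boolean sameText(List<String> words1, List<String> words2) {
        List<String> result = new ArrayList<>(words1);
        result.removeAll(words2);
        return result.isEmpty();
    }

    public static void main(String[] args) {
        // order does not matter
        System.out.println(sameText(Arrays.asList("ofer", "come"), Arrays.asList("come", "ofer"))); // prints true
        // a missing word is detected
        System.out.println(sameText(Arrays.asList("ofer", "come"), Arrays.asList("come")));         // prints false
        // but duplicates are collapsed by removeAll, a shortcoming of the approach
        System.out.println(sameText(Arrays.asList("ofer", "ofer"), Arrays.asList("ofer")));         // prints true
    }
}
```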
The difference is that another method (\emph{Boolean compareMCRObjects(MCRObject, MCRObject, String)}) is used within the loop and
that all equal objects are put into the list of MyCoRe objects that is returned. If this list comprises more than 20
entries,\footnote{This number is somewhat arbitrary. It is inspired by the sample size n in t-distributed data.} the respective document
will be integrated into the master database by the process described above.
The comparator logic is shown in listing \ref{src:compareMCR}.
\begin{lstlisting}[language=java,caption={Comparison of MyCoRe objects},label=src:compareMCR,escapechar=|]
private Boolean compareMCRObjects(MCRObject mcrobj1, MCRObject mcrobj2, String xpath) throws SAXException, IOException
{
    Boolean isEqual = false;
    Boolean beginTime = false;
    Boolean endTime = false;
    Boolean occDiff = false;
    Boolean corpusDiff = false;

    String source = getXMLFromObject(mcrobj1, xpath);
    String target = getXMLFromObject(mcrobj2, xpath);

    XMLUnit.setIgnoreAttributeOrder(true);
    XMLUnit.setIgnoreComments(true);
    XMLUnit.setIgnoreDiffBetweenTextAndCDATA(true);
    XMLUnit.setIgnoreWhitespace(true);
    XMLUnit.setNormalizeWhitespace(true);

    //differences in occurrence, end, and begin should be ignored
    try
    {
        Diff xmlDiff = new Diff(source, target);
        DetailedDiff dd = new DetailedDiff(xmlDiff);
        //counters for differences
        int i = 0;
        int j = 0;
        int k = 0;
        int l = 0;
        //list containing all differences
        List differences = dd.getAllDifferences();|\label{ln:difflist}|
        for (Object object : differences)
        {
            Difference difference = (Difference) object;
            //@begin, @end, ... do not appear in the difference list if their count is 0
            if (difference.getControlNodeDetail().getXpathLocation().endsWith("@begin")) i++;|\label{ln:diffbegin}|
            if (difference.getControlNodeDetail().getXpathLocation().endsWith("@end")) j++;
            if (difference.getControlNodeDetail().getXpathLocation().endsWith("@occurrence")) k++;
            if (difference.getControlNodeDetail().getXpathLocation().endsWith("@corpus")) l++;|\label{ln:diffend}|
            //@begin and @end have different values: check whether they fall within the allowed time range
            if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
                && difference.getControlNodeDetail().getXpathLocation().endsWith("@begin")
                && (Integer.parseInt(difference.getControlNodeDetail().getValue()) < Integer.parseInt(difference.getTestNodeDetail().getValue())))
            {
                beginTime = true;
            }
            if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
                && difference.getControlNodeDetail().getXpathLocation().endsWith("@end")
                && (Integer.parseInt(difference.getControlNodeDetail().getValue()) > Integer.parseInt(difference.getTestNodeDetail().getValue())))
            {
                endTime = true;
            }
            //differing attribute values of @occurrence and @corpus are ignored
            if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
                && difference.getControlNodeDetail().getXpathLocation().endsWith("@occurrence"))
            {
                occDiff = true;
            }
            if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
                && difference.getControlNodeDetail().getXpathLocation().endsWith("@corpus"))
            {
                corpusDiff = true;
            }
        }
        //if any of @begin, @end, ... is identical, set the respective Boolean to true
        if (i == 0) beginTime = true;|\label{ln:zerobegin}|
        if (j == 0) endTime = true;
        if (k == 0) occDiff = true;
        if (l == 0) corpusDiff = true;|\label{ln:zeroend}|
        //if the number of differences exceeds the changes admitted for @begin, @end, ... something else must differ
        if (beginTime && endTime && occDiff && corpusDiff && (i + j + k + l) == dd.getAllDifferences().size()) isEqual = true;|\label{ln:diffsum}|
    }
    catch (SAXException e)
    {
        e.printStackTrace();
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return isEqual;
}
\end{lstlisting}
In this method, XMLUnit is used heavily to make all necessary node comparisons. The matter becomes more complicated, however, when some attributes
are not simply ignored but evaluated according to a given definition, as is the case for the time range. If the evaluator and builder classes are
not to be overwritten entirely, because they are needed for evaluating other nodes of the
xml document, the above solution appears a bit awkward. So there is potential for improvement before the production version is programmed.

XMLUnit provides a
list of the differences between the two documents (see line \ref{ln:difflist}). Four kinds of differences are allowed, namely in the attributes \emph{occurrence},
\emph{corpus}, \emph{begin}, and \emph{end}. For each of them a Boolean variable is set. Because any of these attributes could also be equal to the master
document, and the difference list only contains the actual differences, one has to find a way to cover both cases, equal and different, for these attributes.
This could be done by ignoring these nodes. Yet that would preclude testing whether the beginning and ending dates fall into the range of the master
document.
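Stripped of the XMLUnit plumbing, the resulting decision rule reduces to counting. A hypothetical helper (all names are mine) that mirrors only the final gate:

```java
public class DiffGate {
    // decide equality from the per-attribute difference counters and the total
    // number of differences reported by the XML comparison
    public static boolean isEqual(int begin, int end, int occ, int corpus,
                                  boolean beginOk, boolean endOk, boolean occOk, boolean corpusOk,
                                  int totalDifferences) {
        // a counter of zero means the attribute did not differ at all, so it is fine
        if (begin == 0) beginOk = true;
        if (end == 0) endOk = true;
        if (occ == 0) occOk = true;
        if (corpus == 0) corpusOk = true;
        // every tolerated attribute must pass, and no difference may lie
        // outside the four tolerated attributes
        return beginOk && endOk && occOk && corpusOk
                && (begin + end + occ + corpus) == totalDifferences;
    }

    public static void main(String[] args) {
        // only @occurrence differs: 1 counted difference, 1 reported difference
        System.out.println(isEqual(0, 0, 1, 0, false, false, true, false, 1)); // prints true
        // one extra difference outside the four attributes blocks equality
        System.out.println(isEqual(0, 0, 1, 0, false, false, true, false, 2)); // prints false
    }
}
```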
The attributes are therefore counted, as lines \ref{ln:diffbegin} through \ref{ln:diffend} reveal. If two documents
differ in anything beyond the four attributes just specified, the sum of the counters (line \ref{ln:diffsum}) falls short of the number of differences collected
by XMLUnit and the test fails. The remaining if-tests assign truth values to the respective
Booleans. It is worth mentioning that if all counters are zero (lines
\ref{ln:zerobegin}--\ref{ln:zeroend}), the attributes and their values are identical, and hence the Booleans have to be set explicitly; otherwise the test in line \ref{ln:diffsum} would fail.

%TagCorpusServlet
Once quality control (explained in detail above) has been passed, it is
the user's turn to interact further. By clicking on the option \emph{Manual tagging}, the \emph{TagCorpusServlet} is called. This servlet instantiates
\emph{ProcessCorpusServlet} to get access to the \emph{getUnknownWords} method, which delivers the words still to be
processed and which overwrites the content of the file starting with \emph{untagged}. For the next word in \emph{leftovers} a new MyCoRe object is created
with the JDOM API and added to the file beginning with \emph{processed}. In line \ref{ln:tagmanu} of listing \ref{src:tagservlet}, the previously defined
entry mask is called, with which the proposed word structure can be confirmed or changed. How the word structure is determined will be shown later in
the text.
\begin{lstlisting}[language=java,caption={Manual Tagging Procedure},label=src:tagservlet,escapechar=|]
...
if (!leftovers.isEmpty())
{
    ArrayList<String> processed = new ArrayList<String>();
    JDOMorphilo jdm = new JDOMorphilo();
    MCRObject obj = jdm.createMorphiloObject(job, leftovers.get(0));|\label{ln:jdomobject}|
    //write the word to be annotated into the process list and save it
    Path filePathProc = pcs.getDerivateFilePath(job, "processed").getFileName();
    Path proc = root.resolve(filePathProc);
    processed = pcs.getContentFromFile(job, "processed");
    processed.add(leftovers.get(0));
    Files.write(proc, processed);

    //call the entry mask for the next word
    tagUrl = prop.getBaseURL() + "content/publish/morphilo.xed?id=" + obj.getId();|\label{ln:tagmanu}|
}
else
{
    //initiate the process that delivers a completely tagged file of the original corpus:
    //if the untagged file is empty, match the original file with morphilo documents where
    //creator=administrator OR creator=username and write the matches to a new file
    ArrayList<String> complete = new ArrayList<String>();
    ProcessCorpusServlet pcs2 = new ProcessCorpusServlet();
    complete = pcs2.getUnknownWords(
            pcs2.getContentFromFile(job, ""), //main corpus file
            pcs2.getCorpusMetadata(job, "def.datefrom"),
            pcs2.getCorpusMetadata(job, "def.dateuntil"),
            "", //wordtype
            false,
            false,
            true);
    Files.delete(p);
    MCRXMLFunctions mdm = new MCRXMLFunctions();
    String mainFile = mdm.getMainDocName(derivID);
    Path newRoot = root.resolve("tagged-" + mainFile);
    Files.write(newRoot, complete);

    //return to the menu page
    tagUrl = prop.getBaseURL() + "receive/" + corpID;
}
\end{lstlisting}
When no more items are left in \emph{leftovers}, the \emph{getUnknownWords} method is called with the last Boolean parameter
set to true.
This indicates that the array list containing all data available and relevant to the respective user is returned, as seen in
the code snippet in listing \ref{src:writeAll}.
\begin{lstlisting}[language=java,caption={Code snippet to deliver all data to the user},label=src:writeAll,escapechar=|]
...
//all data is written to lo in TEI
if (writeAllData && isAuthorized && timeCorrect)
{
    XPathExpression<Element> xpath = xpfac.compile("//morphiloContainer/morphilo", Filters.element());
    for (Element e : xpath.evaluate(jdomDoc))
    {
        XMLOutputter outputter = new XMLOutputter();
        outputter.setFormat(Format.getPrettyFormat());
        lo.add(outputter.outputString(e.getContent()));
    }
}
\end{lstlisting}
The complete list (\emph{lo}) is written to yet a third file starting with \emph{tagged} and finally returned to the main project webpage.

%JDOMorphilo
The interesting question is now where the word structure that is filled into the entry mask, as asserted above, comes from.
In listing \ref{src:tagservlet}, line \ref{ln:jdomobject}, one can see that a JDOM object is created and the method
\emph{createMorphiloObject(MCRServletJob, String)} is called. The string parameter is the word that needs to be analyzed.
Most of the method is a mere application of the JDOM API given the data model in chapter \ref{chap:concept}, section
\ref{subsec:datamodel}, and listing \ref{lst:worddatamodel}. That means namespaces, elements, and their attributes are defined in the correct
order and hierarchy.

To fill the elements and attributes with text, i.e. prefixes, suffixes, stems, etc., a hash map (containing the morpheme as
key and its position as value) is created and filled with the results of an AffixStripper instantiation.
Depending on how many prefixes
or suffixes are put into the hash map, the same number of xml elements is created. As a final step, a valid MyCoRe id is generated using
the existing MyCoRe functionality, and the object is created and returned to the TagCorpusServlet.

%AffixStripper explanation
Last, the analysis of the word structure will be considered. It is implemented
in the \emph{AffixStripper.java} file.
All lexical affix morphemes and their allomorphs as well as the inflections were extracted from the
OED\footnote{Oxford English Dictionary \url{http://www.oed.com/}} and saved as enumerated lists (see the example in listing \ref{src:enumPref}).
The allomorphic items of these lists are matched successively against the beginning of the word in the case of prefixes
(see listing \ref{src:analyzePref}, line \ref{ln:prefLoop}) or against its end in the case of suffixes
(see listing \ref{src:analyzeSuf}). Since each
morphemic variant maps to its morpheme right away, it makes sense to use the morpheme and thus
implicitly keep the relation to its allomorph.

\begin{lstlisting}[language=java,caption={Enumeration Example for the Prefix ``over''},label=src:enumPref,escapechar=|]
package custom.mycore.addons.morphilo;

public enum PrefixEnum {
    ...
    over("over"), ufer("over"), ufor("over"), uferr("over"), uvver("over"), obaer("over"), ober("over"), ofaer("over"),
    ofere("over"), ofir("over"), ofor("over"), ofer("over"), ouer("over"), oferr("over"), offerr("over"), offr("over"), aure("over"),
    war("over"), euer("over"), oferre("over"), oouer("over"), oger("over"), ouere("over"), ouir("over"), ouire("over"),
    ouur("over"), ouver("over"), ouyr("over"), ovar("over"), overe("over"), ovre("over"), ovur("over"), owuere("over"), owver("over"),
    houyr("over"), ouyre("over"), ovir("over"), ovyr("over"), hover("over"), auver("over"), awver("over"), ovver("over"),
    hauver("over"), ova("over"), ove("over"), obuh("over"), ovah("over"), ovuh("over"), ofowr("over"), ouuer("over"), oure("over"),
    owere("over"), owr("over"), owre("over"), owur("over"), owyr("over"), our("over"), ower("over"), oher("over"),
    ooer("over"), oor("over"), owwer("over"), ovr("over"), owir("over"), oar("over"), aur("over"), oer("over"), ufara("over"),
    ufera("over"), ufere("over"), uferra("over"), ufora("over"), ufore("over"), ufra("over"), ufre("over"), ufyrra("over"),
    yfera("over"), yfere("over"), yferra("over"), uuera("over"), ufe("over"), uferre("over"), uuer("over"), uuere("over"),
    vfere("over"), vuer("over"), vuere("over"), vver("over"), uvvor("over") ...

    private String morpheme;
    //constructor
    PrefixEnum(String morpheme)
    {
        this.morpheme = morpheme;
    }
    //getter method
    public String getMorpheme()
    {
        return this.morpheme;
    }
}
\end{lstlisting}
As can be seen in line \ref{ln:prefPutMorph} in listing \ref{src:analyzePref}, the morpheme is saved to a hash map together with its position, i.e. the size of the
map plus one at that point.
In line \ref{ln:prefCutoff} the \emph{analyzePrefix} method is called recursively until no more matches can be made.

\begin{lstlisting}[language=java,caption={Method to recognize prefixes},label=src:analyzePref,escapechar=|]
private Map<String, Integer> prefixMorpheme = new HashMap<String, Integer>();
...
private void analyzePrefix(String restword)
{
    if (!restword.isEmpty()) //termination condition for the recursion
    {
        for (PrefixEnum prefEnum : PrefixEnum.values())|\label{ln:prefLoop}|
        {
            String s = prefEnum.toString();
            if (restword.startsWith(s))
            {
                prefixMorpheme.put(s, prefixMorpheme.size() + 1);|\label{ln:prefPutMorph}|
                //cut off the prefix that was added to the map
                analyzePrefix(restword.substring(s.length()));|\label{ln:prefCutoff}|
            }
            else
            {
                analyzePrefix("");
            }
        }
    }
}
\end{lstlisting}

The recognition of suffixes differs only in the cut-off direction, since suffixes occur at the end of a word.
Hence, in the case of suffixes, line \ref{ln:prefCutoff} of listing \ref{src:analyzePref} reads as follows.

\begin{lstlisting}[language=java,caption={Cut-off mechanism for suffixes},label=src:analyzeSuf,escapechar=|]
analyzeSuffix(restword.substring(0, restword.length() - s.length()));
\end{lstlisting}

It is important to note that inflections are suffixes (in the given model case of Middle English morphology) that usually occur only once, at the very
end of a word, i.e. after all lexical suffixes. It follows that inflections
have to be recognized first and without any repetition.
The procedure for inflections can therefore be simplified
to a substantial degree, as listing \ref{src:analyzeInfl} shows.

\begin{lstlisting}[language=java,caption={Method to recognize inflections},label=src:analyzeInfl,escapechar=|]
private String analyzeInflection(String wrd)
{
    String infl = "";
    for (InflectionEnum inflEnum : InflectionEnum.values())
    {
        if (wrd.endsWith(inflEnum.toString()))
        {
            infl = inflEnum.toString();
        }
    }
    return infl;
}
\end{lstlisting}

Unfortunately, the embeddedness problem prevents a very simple algorithm. Embeddedness occurs when one lexical item
is a substring of another lexical item. To illustrate, the suffix \emph{ion} is also contained in the suffix \emph{ation}, as is
\emph{ent} in \emph{ment}, and so on. The embeddedness problem cannot be solved completely on the basis of linear modelling, but
for a large part of the embedded items one can work around it by implicitly using Zipf's law, i.e. the correlation between the frequency
and the length of lexical items: the longer a word becomes, the less frequently it occurs. The simplest conclusion to draw from this is
to prefer longer suffix strings (measured in letters) over shorter ones, because the longer the matched string,
the more likely it is to be one suffix unit (as opposed to several).
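This longest-match preference can be demonstrated in isolation. The following is a minimal, self-contained sketch of the filtering idea (class name and the example affixes are mine): every affix that is embedded in a longer affix of the same map is dropped.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class EmbeddednessSketch {
    // drop every affix that is a substring of a longer affix in the same map,
    // so that e.g. "ion" gives way to "ation"
    public static Map<String, Integer> sortOutAffixes(Map<String, Integer> affix) {
        // keep the candidates ordered by length (ties broken alphabetically)
        TreeMap<String, Integer> byLength = new TreeMap<>(
                Comparator.comparingInt(String::length)
                          .thenComparing(Comparator.naturalOrder()));
        byLength.putAll(affix);
        for (String shorter : byLength.keySet()) {
            for (String longer : byLength.descendingKeySet()) {
                if (longer.contains(shorter) && longer.length() > shorter.length()) {
                    affix.remove(shorter); // the embedded, shorter match loses
                }
            }
        }
        return affix;
    }

    public static void main(String[] args) {
        Map<String, Integer> affix = new HashMap<>();
        affix.put("ion", 1);
        affix.put("ation", 2);
        affix.put("ment", 3);
        // "ion" is embedded in "ation" and is removed; "ment" survives
        System.out.println(sortOutAffixes(affix).keySet());
    }
}
```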
The Morphilo code does this in listing \ref{src:embedAffix}, where
the tree map \emph{sortedByLengthMap} keeps the affixes sorted by length and the loop from line \ref{ln:deleteAffix} onwards deletes
the respective substrings.

\begin{lstlisting}[language=java,caption={Method to work around embeddedness},label=src:embedAffix,escapechar=|]
private Map<String, Integer> sortOutAffixes(Map<String, Integer> affix)
{
    Map<String, Integer> sortedByLengthMap = new TreeMap<String, Integer>(new Comparator<String>()
    {
        @Override
        public int compare(String s1, String s2)
        {
            int cmp = Integer.compare(s1.length(), s2.length());
            return cmp != 0 ? cmp : s1.compareTo(s2);
        }
    });
    sortedByLengthMap.putAll(affix);
    ArrayList<String> al1 = new ArrayList<String>(sortedByLengthMap.keySet());
    ArrayList<String> al2 = new ArrayList<String>(al1);
    Collections.reverse(al2);
    for (String s2 : al1)|\label{ln:deleteAffix}|
    {
        for (String s1 : al2)
        {
            if (s1.contains(s2) && s1.length() > s2.length())
            {
                affix.remove(s2);
            }
        }
    }
    return affix;
}
\end{lstlisting}

Finally, the position of the affix has to be recalculated, because the hash map in line \ref{ln:prefPutMorph} in
listing \ref{src:analyzePref} does not keep the original order once changes have been made to address affix embeddedness
(listing \ref{src:embedAffix}). Listing \ref{src:affixPos} depicts the preferred solution.
The recursive construction of the method is similar to \emph{private void analyzePrefix(String)} (listing \ref{src:analyzePref}),
except that both affix types are handled in one method. For that purpose, an additional parameter taking either the value \emph{prefix}
or \emph{suffix} is included.

\begin{lstlisting}[language=java,caption={Method to determine the position of the affix},label=src:affixPos,escapechar=|]
private void getAffixPosition(Map<String, Integer> affix, String restword, int pos, String affixtype)
{
    if (!restword.isEmpty()) //termination condition for the recursion
    {
        for (String s : affix.keySet())
        {
            if (restword.startsWith(s) && affixtype.equals("prefix"))
            {
                pos++;
                prefixMorpheme.put(s, pos);
                getAffixPosition(affix, restword.substring(s.length()), pos, affixtype);
            }
            else if (restword.endsWith(s) && affixtype.equals("suffix"))
            {
                pos++;
                suffixMorpheme.put(s, pos);
                getAffixPosition(affix, restword.substring(0, restword.length() - s.length()), pos, affixtype);
            }
            else
            {
                getAffixPosition(affix, "", pos, affixtype);
            }
        }
    }
}
\end{lstlisting}

To give the complete word structure, the root of a word should also be provided. Listing \ref{src:rootAnalyze} offers a simple solution
that treats compounds as words consisting of more than one root.
</p>
<pre>
begin{lstlisting}[language=java,caption={Method to determine roots},label=src:rootAnalyze,escapechar=|]
private ArrayList<String> analyzeRoot(Map<String, Integer> pref, Map<String, Integer> suf, int stemNumber)
{
    ArrayList<String> root = new ArrayList<String>();
    int j = 1; //one root always exists
    // if the word is a compound, several roots exist
    while (j <= stemNumber)
    {
        j++;
        String rest = lemma;|label{ln:lemma}|
        for (int i = 0; i < pref.size(); i++)
        {
            for (String s : pref.keySet())
            {
                //if (i == pref.get(s))
                if (rest.length() > s.length() && s.equals(rest.substring(0, s.length())))
                {
                    rest = rest.substring(s.length(), rest.length());
                }
            }
        }
        for (int i = 0; i < suf.size(); i++)
        {
            for (String s : suf.keySet())
            {
                //if (i == suf.get(s))
                if (s.length() < rest.length() && s.equals(rest.substring(rest.length() - s.length(), rest.length())))
                {
                    rest = rest.substring(0, rest.length() - s.length());
                }
            }
        }
        root.add(rest);
    }
    return root;
}
end{lstlisting}
</pre>
<p>The logic behind this method is that the root is the remainder of a word once all prefixes and suffixes are subtracted. The loops therefore iterate over the known prefixes and suffixes and subtract each affix found at the current position. Admittedly, there is some code duplication with the previously described methods, which could be eliminated by making the code more modular in a possible refactoring phase. Again, this is not the concern of a prototype.
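<p>The root-as-remainder idea can be illustrated with a minimal, self-contained sketch. The names here are hypothetical and the nested counting loops of emph{analyzeRoot} are compressed into a single strip-until-stable loop; it is an illustration of the principle, not the Morphilo implementation:</p>

```java
import java.util.*;

public class RootSketch {
    // The root is what remains of the lemma after all known prefixes and
    // suffixes have been stripped from its edges.
    static String stripToRoot(String lemma, Set<String> prefixes, Set<String> suffixes) {
        String rest = lemma;
        boolean changed = true;
        while (changed) {               // keep stripping until nothing matches
            changed = false;
            for (String p : prefixes) {
                if (rest.length() > p.length() && rest.startsWith(p)) {
                    rest = rest.substring(p.length());
                    changed = true;
                }
            }
            for (String s : suffixes) {
                if (rest.length() > s.length() && rest.endsWith(s)) {
                    rest = rest.substring(0, rest.length() - s.length());
                    changed = true;
                }
            }
        }
        return rest;
    }

    public static void main(String[] args) {
        Set<String> pref = new HashSet<>(Arrays.asList("un"));
        Set<String> suf = new HashSet<>(Arrays.asList("able", "ness"));
        System.out.println(stripToRoot("uncomfortable", pref, suf)); // prints comfort
    }
}
```

<p>The length guards mirror those in emph{analyzeRoot}: an affix is only subtracted if something non-empty would remain, so a monomorphemic word passes through unchanged.</p>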
Line ref{ln:lemma} defines the initial state of a root, which is also its final state for monomorphemic words. The emph{lemma} is defined as the wordtoken without its inflection; listing ref{src:lemmaAnalyze} shows how this class variable is calculated.</p>
<pre>
begin{lstlisting}[language=java,caption={Method to determine lemma},label=src:lemmaAnalyze,escapechar=|]
/*
 * Simplification: lemma = wordtoken - inflection
 */
private String analyzeLemma(String wrd, String infl)
{
    return wrd.substring(0, wrd.length() - infl.length());
}
end{lstlisting}
</pre>
<p>The constructor of emph{AffixStripper} calls the method emph{analyzeWord()}, whose only job is to calculate each structure element in the correct order (listing ref{src:analyzeWord}). All structure elements are also exposed through getters.</p>
<pre>
begin{lstlisting}[language=java,caption={Method to determine the complete word structure},label=src:analyzeWord,escapechar=|]
private void analyzeWord()
{
    //analyze inflection first because it always occurs at the end of a word
    inflection = analyzeInflection(wordtoken);
    lemma = analyzeLemma(wordtoken, inflection);
    analyzePrefix(lemma);
    analyzeSuffix(lemma);
    getAffixPosition(sortOutAffixes(prefixMorpheme), lemma, 0, "prefix");
    getAffixPosition(sortOutAffixes(suffixMorpheme), lemma, 0, "suffix");
    prefixNumber = prefixMorpheme.size();
    suffixNumber = suffixMorpheme.size();
    wordroot = analyzeRoot(prefixMorpheme, suffixMorpheme, getStemNumber());
}
end{lstlisting}
</pre>
<p>To conclude, the Morphilo implementation as presented here aims at fulfilling the task of a working prototype. It is important to note that it claims to be neither a particularly efficient nor a production-ready software program.
However, it marks a crucial milestone on the way to a production system. For some listings, sources of improvement were made explicit; for others, no suggestions were made. In the latter case this does not imply that there is no potential for improvement. Once acceptability tests are carried out, it will be the task of a follow-up project to identify these potentials and implement them accordingly.</p>
+</div>
</div> </div> @@ -47,6 +1011,11 @@ <ul> <li><a class="reference internal" href="#">Controller Adjustments</a><ul> <li><a class="reference internal" href="#general-principle-of-operation">General Principle of Operation</a></li> +<li><a class="reference internal" href="#conceptualization">Conceptualization</a></li> +<li><a class="reference internal" href="#implementation">Implementation</a><ul> +<li><a class="reference internal" href="#id13">}</a></li> +</ul> +</li> </ul> </li> </ul> @@ -54,7 +1023,8 @@ <h3>Related Topics</h3> <ul> <li><a href="../index.html">Documentation overview</a><ul> - <li>Previous: <a href="datamodel.html" title="previous chapter">Data Model Implementation</a></li> + <li>Previous: <a href="datamodel.html" title="previous chapter">Data Model</a></li> + <li>Next: <a href="view.html" title="next chapter">View</a></li> </ul></li> </ul> </div> diff --git a/Morphilo_doc/_build/html/source/datamodel.html b/Morphilo_doc/_build/html/source/datamodel.html index b46e97ea7f4d4fc721b00634b6bce722c94b59e7..ed8ca884744fbcaca1943ccd4ae64c3683dd30ea 100644 --- a/Morphilo_doc/_build/html/source/datamodel.html +++ b/Morphilo_doc/_build/html/source/datamodel.html @@ -6,7 +6,7 @@ <head> <meta http-equiv="X-UA-Compatible" content="IE=Edge" /> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> - <title>Data Model Implementation — Morphilo documentation</title> + <title>Data Model — Morphilo documentation</title> <link rel="stylesheet" href="../_static/alabaster.css" type="text/css" /> <link rel="stylesheet" href="../_static/pygments.css" 
type="text/css" /> <script type="text/javascript" src="../_static/documentation_options.js"></script> @@ -16,7 +16,7 @@ <link rel="index" title="Index" href="../genindex.html" /> <link rel="search" title="Search" href="../search.html" /> <link rel="next" title="Controller Adjustments" href="controller.html" /> - <link rel="prev" title="Welcome to Morphilo’s documentation!" href="../index.html" /> + <link rel="prev" title="Documentation Morphilo Project" href="../index.html" /> <link rel="stylesheet" href="../_static/custom.css" type="text/css" /> @@ -31,8 +31,60 @@ <div class="bodywrapper"> <div class="body" role="main"> - <div class="section" id="data-model-implementation"> -<h1>Data Model Implementation<a class="headerlink" href="#data-model-implementation" title="Permalink to this headline">¶</a></h1> + <div class="section" id="data-model">
<h1>Data Model<a class="headerlink" href="#data-model" title="Permalink to this headline">¶</a></h1>
<div class="section" id="conceptualization">
<h2>Conceptualization<a class="headerlink" href="#conceptualization" title="Permalink to this headline">¶</a></h2>
<p>From both the user and the task requirements one can derive that four basic functions of data processing need to be carried out. Data have to be read, persistently saved, searched, and deleted. Furthermore, some kind of user management and multi-user processing is necessary. In addition, the framework should support web technologies, be well documented, and be easy to extend. Ideally, the MVC pattern is realized.</p>
<p>subsection{Data Model}label{subsec:datamodel}
The guidelines of the emph{TEI} standardfootnote{http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf} at the word level are defined in line with the structure described above in section ref{subsec:morphologicalSystems}. 
In listing ref{lst:teiExamp} an example is given for a possible markup at the word level for emph{comfortable}.footnote{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-m.html}</p>
<pre>
begin{lstlisting}[language=XML,caption={TEI example for 'comfortable'},label=lst:teiExamp]
<w type="adjective">
    <m type="base">
        <m type="prefix" baseForm="con">com</m>
        <m type="root">fort</m>
    </m>
    <m type="suffix">able</m>
</w>
end{lstlisting}
</pre>
<p>This data model reflects just one theoretical conception of word structure. Crucially, the model emanates from the assumption that the suffix node is on a par with the word base. On the one hand, this implies that the word stem directly dominates the suffix, but not the prefix. The prefix, on the other hand, is enclosed in the base, which entails a stronger lexical, and less abstract, attachment to the root of a word. Modeling prefixes and suffixes on different hierarchical levels has important consequences for the branching direction at the subword level (here right-branching). Theoretical interest aside, the choice of the TEI standard is reasonable with a view to a sustainable architecture that allows for exchanging data with little to no additional adjustment.</p>
<p>The drawback is that the model is not suitable for all languages. It reflects a theoretical construction based on Indo-European languages. As long as attention is paid to the languages for which this software is used, this is not problematic.
This is the case for most languages of the Indo-European family and corresponds to the overwhelming majority of all research carried out (unfortunately).</p>
</div>
<div class="section" id="implementation">
<h2>Implementation<a class="headerlink" href="#implementation" title="Permalink to this headline">¶</a></h2>
<p>As laid out in the task analysis in section ref{subsec:datamodel}, it is advantageous to use established standards. It was also shown that it makes sense to keep the meta data of each corpus separate from the data model used for the words to be analyzed.</p>
<p>For the present case, the TEI standard was identified as an appropriate markup for words. In terms of the implementation this means that the TEI guidelines have to be implemented as an object type compatible with the chosen @@ -56,6 +108,228 @@ in listing ref{lst:worddatamodel}. Whereas the attributes of the objecttype are specific to the repository framework, the TEI structure can be recognized in the hierarchy of the meta data element starting with the name emph{w} (line ref{src:wordbegin}).</p>
<pre>
begin{lstlisting}[language=XML,caption={Word Data Model},label=lst:worddatamodel,escapechar=|]
<?xml version="1.0" encoding="UTF-8"?>
<objecttype
    name="morphilo"
    isChild="true"
    isParent="true"
    hasDerivates="true"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="datamodel.xsd">
  <metadata>
    <element name="morphiloContainer" type="xml" style="dontknow"
        notinherit="true" heritable="false">
      <xs:sequence>
        <xs:element name="morphilo">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="w" minOccurs="0" maxOccurs="unbounded">|label{src:wordbegin}|
                <xs:complexType mixed="true">
                  <xs:sequence>
                    <!-- stem -->
                    <xs:element name="m1" minOccurs="0" maxOccurs="unbounded">
                      <xs:complexType mixed="true">
                        <xs:sequence>
                          <!-- base -->
                          <xs:element name="m2" minOccurs="0" maxOccurs="unbounded">
                            <xs:complexType mixed="true">
                              <xs:sequence>
                                <!-- root -->
                                <xs:element name="m3" minOccurs="0" maxOccurs="unbounded">
                                  <xs:complexType mixed="true">
                                    <xs:attribute name="type" type="xs:string"/>
                                  </xs:complexType>
                                </xs:element>
                                <!-- prefix -->
                                <xs:element name="m4" minOccurs="0" maxOccurs="unbounded">
                                  <xs:complexType mixed="true">
                                    <xs:attribute name="type" type="xs:string"/>
                                    <xs:attribute name="PrefixbaseForm" type="xs:string"/>
                                    <xs:attribute name="position" type="xs:string"/>
                                  </xs:complexType>
                                </xs:element>
                              </xs:sequence>
                              <xs:attribute name="type" type="xs:string"/>
                            </xs:complexType>
                          </xs:element>
                          <!-- suffix -->
                          <xs:element name="m5" minOccurs="0" maxOccurs="unbounded">
                            <xs:complexType mixed="true">
                              <xs:attribute name="type" type="xs:string"/>
                              <xs:attribute name="SuffixbaseForm" type="xs:string"/>
                              <xs:attribute name="position" type="xs:string"/>
                              <xs:attribute name="inflection" type="xs:string"/>
                            </xs:complexType>
                          </xs:element>
                        </xs:sequence>
                        <!-- stem attributes -->
                        <xs:attribute name="type" type="xs:string"/>
                        <xs:attribute name="pos" type="xs:string"/>
                        <xs:attribute name="occurrence" type="xs:string"/>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                  <!-- w attributes at word level -->
                  <xs:attribute name="lemma" type="xs:string"/>
                  <xs:attribute name="complexType" type="xs:string"/>
                  <xs:attribute name="wordtype" type="xs:string"/>
                  <xs:attribute name="occurrence" type="xs:string"/>
                  <xs:attribute name="corpus" type="xs:string"/>
                  <xs:attribute name="begin" type="xs:string"/>
                  <xs:attribute name="end" type="xs:string"/>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </element>
    <element name="wordtype" type="classification" minOccurs="0" maxOccurs="1">
      <classification id="wordtype"/>
    </element>
    <element name="complexType" type="classification" minOccurs="0" maxOccurs="1">
      <classification id="complexType"/>
    </element>
    <element name="corpus" type="classification" minOccurs="0" maxOccurs="1">
      <classification id="corpus"/>
    </element>
    <element name="pos" type="classification" minOccurs="0" maxOccurs="1">
      <classification id="pos"/>
    </element>
    <element name="PrefixbaseForm" type="classification" minOccurs="0" maxOccurs="1">
      <classification id="PrefixbaseForm"/>
    </element>
    <element name="SuffixbaseForm" type="classification" minOccurs="0" maxOccurs="1">
      <classification id="SuffixbaseForm"/>
    </element>
    <element name="inflection" type="classification" minOccurs="0" maxOccurs="1">
      <classification id="inflection"/>
    </element>
    <element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded">
      <target type="corpmeta"/>
    </element>
  </metadata>
</objecttype>
end{lstlisting}
</pre>
<p>Additionally, it is worth mentioning that some attributes are modeled as a emph{classification}. All of these have to be listed as separate elements in the data model. This has been done for all attributes that are subject to little or no change: all known suffix and prefix morphemes of the language under investigation should be known in advance and are therefore defined as a classification. The same is true for the parts of speech, named emph{pos} in the morphilo data model above; here the PENN Treebank tagset was used. Last, the different morphemic layers, named emph{m} in the standard model, are changed to $m1$ through $m5$. This is the only change to the standard that could be problematic if the data is processed elsewhere and the change is not documented explicitly. Yet, this change was necessary because the MyCoRe repository throws errors caused by ambiguity issues on the different $m$-layers.</p>
<p>The second data model describes only very few properties of the text corpora from which the words are extracted. Listing ref{lst:corpusdatamodel} depicts only the meta data element. For the sake of simplicity of the prototype, this data model is kept as simple as possible. The only obligatory field is the name of the corpus. 
Specific dates of the corpus are classified as optional because in some cases a text cannot be dated reliably.</p>
<pre>
begin{lstlisting}[language=XML,caption={Corpus Data Model},label=lst:corpusdatamodel]
<metadata>
    <!-- Pflichtfelder (mandatory fields) -->
    <element name="korpusname" type="text" minOccurs="1" maxOccurs="1"/>
    <!-- Optionale Felder (optional fields) -->
    <element name="sprache" type="text" minOccurs="0" maxOccurs="1"/>
    <element name="size" type="number" minOccurs="0" maxOccurs="1"/>
    <element name="datefrom" type="text" minOccurs="0" maxOccurs="1"/>
    <element name="dateuntil" type="text" minOccurs="0" maxOccurs="1"/>
    <!-- number of words -->
    <element name="NoW" type="text" minOccurs="0" maxOccurs="1"/>
    <element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded">
        <target type="morphilo"/>
    </element>
</metadata>
end{lstlisting}
</pre>
<p>As a final remark, one might have noticed that all attributes are modeled as strings, although other data types are available and the fields encoding the dates or the number of words suggest otherwise. The MyCoRe framework even provides a data type emph{historydate}. There is no very satisfying answer for its disuse. All that can be said is that using data types other than the string later leads to problems in the interplay between the search engine and the repository framework. These issues seem to be well known and can be followed on 
These issues seem to be well known and can be followed on +github.</p> +</div> </div> @@ -63,11 +337,20 @@ emph{w} (line ref{src:wordbegin}).</p> </div> </div> <div class="sphinxsidebar" role="navigation" aria-label="main navigation"> - <div class="sphinxsidebarwrapper"><div class="relations"> + <div class="sphinxsidebarwrapper"> + <h3><a href="../index.html">Table Of Contents</a></h3> + <ul> +<li><a class="reference internal" href="#">Data Model</a><ul> +<li><a class="reference internal" href="#conceptualization">Conceptualization</a></li> +<li><a class="reference internal" href="#implementation">Implementation</a></li> +</ul> +</li> +</ul> +<div class="relations"> <h3>Related Topics</h3> <ul> <li><a href="../index.html">Documentation overview</a><ul> - <li>Previous: <a href="../index.html" title="previous chapter">Welcome to Morphilo’s documentation!</a></li> + <li>Previous: <a href="../index.html" title="previous chapter">Documentation Morphilo Project</a></li> <li>Next: <a href="controller.html" title="next chapter">Controller Adjustments</a></li> </ul></li> </ul> diff --git a/Morphilo_doc/_build/html/source/framework.html b/Morphilo_doc/_build/html/source/framework.html new file mode 100644 index 0000000000000000000000000000000000000000..d10a21930aa3b32961f54778cc3aa2309e4d4da2 --- /dev/null +++ b/Morphilo_doc/_build/html/source/framework.html @@ -0,0 +1,114 @@ + +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> + +<html xmlns="http://www.w3.org/1999/xhtml"> + <head> + <meta http-equiv="X-UA-Compatible" content="IE=Edge" /> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> + <title>Framework — Morphilo documentation</title> + <link rel="stylesheet" href="../_static/alabaster.css" type="text/css" /> + <link rel="stylesheet" href="../_static/pygments.css" type="text/css" /> + <script type="text/javascript" src="../_static/documentation_options.js"></script> + <script 
type="text/javascript" src="../_static/jquery.js"></script> + <script type="text/javascript" src="../_static/underscore.js"></script> + <script type="text/javascript" src="../_static/doctools.js"></script> + <link rel="index" title="Index" href="../genindex.html" /> + <link rel="search" title="Search" href="../search.html" /> + <link rel="prev" title="Software Design" href="architecture.html" /> + + <link rel="stylesheet" href="../_static/custom.css" type="text/css" /> + + + <meta name="viewport" content="width=device-width, initial-scale=0.9, maximum-scale=0.9" /> + + </head><body> + + + <div class="document"> + <div class="documentwrapper"> + <div class="bodywrapper"> + <div class="body" role="main"> + + <div class="section" id="framework">
<h1>Framework<a class="headerlink" href="#framework" title="Permalink to this headline">¶</a></h1>
<pre>
begin{figure}
    centering
    includegraphics[scale=0.33]{mycore_architecture-2.png}
    caption[MyCoRe-Architecture and Components]{MyCoRe-Architecture and Componentsprotectfootnotemark}
    label{fig:abbMyCoReStruktur}
end{figure}
</pre>
<p>footnotetext{source: <a class="reference external" href="https://www.mycore.de">https://www.mycore.de</a>}
To adapt the MyCoRe framework, the Morphilo application logic has to be implemented, the TEI data model specified, and the input, search, and output masks programmed.</p>
<p>There are three directories which are important for adjusting the MyCoRe framework to the needs of one's own application. These three directories correspond essentially to the three components of the MVC model as explicated in section ref{subsec:mvc}. Roughly, they are visualized in figure ref{fig:abbMyCoReStruktur} in the upper right-hand corner. 
More precisely, the view (emph{Layout} in figure ref{fig:abbMyCoReStruktur}) and the model layer (emph{Datenmodell} in figure ref{fig:abbMyCoReStruktur}) can be handled completely via the emph{interface} directory, which has a predefined structure and some standard files. For the configuration of the logic, an extra directory is provided (/src/main/java/custom/mycore/addons/); all Java classes extending the controller layer should be added here. Practically, all three MVC layers are placed in the emph{src/main/} directory of the application. In one of its subdirectories, emph{datamodel/def}, the data model specifications are defined as xml files; this parallels the model layer in the MVC pattern. How the data model was defined is explained in section ref{subsec:datamodelimpl}.</p>
</div> + + + </div> + </div> + </div> + <div class="sphinxsidebar" role="navigation" aria-label="main navigation"> + <div class="sphinxsidebarwrapper"><div class="relations"> +<h3>Related Topics</h3> +<ul> + <li><a href="../index.html">Documentation overview</a><ul> + <li>Previous: <a href="architecture.html" title="previous chapter">Software Design</a></li> + </ul></li> +</ul> +</div> + <div role="note" aria-label="source link"> + <h3>This Page</h3> + <ul class="this-page-menu"> + <li><a href="../_sources/source/framework.rst.txt" + rel="nofollow">Show Source</a></li> + </ul> + </div> +<div id="searchbox" style="display: none" role="search"> + <h3>Quick search</h3> + <div class="searchformwrapper"> + <form class="search" action="../search.html" method="get"> + <input type="text" name="q" /> + <input type="submit" value="Go" /> + <input type="hidden" name="check_keywords" value="yes" /> + <input type="hidden" name="area" value="default" /> + </form> + </div> +</div> +<script type="text/javascript">$('#searchbox').show(0);</script> + </div> + </div> + <div class="clearer"></div> + </div> + <div class="footer"> + 
©2018, Hagen Peukert. + + | + Powered by <a href="http://sphinx-doc.org/">Sphinx 1.7.2</a> + & <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.10</a> + + | + <a href="../_sources/source/framework.rst.txt" + rel="nofollow">Page source</a> + </div> + + + + + </body> +</html> \ No newline at end of file diff --git a/Morphilo_doc/_build/html/source/view.html b/Morphilo_doc/_build/html/source/view.html new file mode 100644 index 0000000000000000000000000000000000000000..d9765a28c4f2be4ef9f21f8205467ef76c899ed4 --- /dev/null +++ b/Morphilo_doc/_build/html/source/view.html @@ -0,0 +1,438 @@ + +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> + +<html xmlns="http://www.w3.org/1999/xhtml"> + <head> + <meta http-equiv="X-UA-Compatible" content="IE=Edge" /> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> + <title>View — Morphilo documentation</title> + <link rel="stylesheet" href="../_static/alabaster.css" type="text/css" /> + <link rel="stylesheet" href="../_static/pygments.css" type="text/css" /> + <script type="text/javascript" src="../_static/documentation_options.js"></script> + <script type="text/javascript" src="../_static/jquery.js"></script> + <script type="text/javascript" src="../_static/underscore.js"></script> + <script type="text/javascript" src="../_static/doctools.js"></script> + <link rel="index" title="Index" href="../genindex.html" /> + <link rel="search" title="Search" href="../search.html" /> + <link rel="next" title="Software Design" href="architecture.html" /> + <link rel="prev" title="Controller Adjustments" href="controller.html" /> + + <link rel="stylesheet" href="../_static/custom.css" type="text/css" /> + + + <meta name="viewport" content="width=device-width, initial-scale=0.9, maximum-scale=0.9" /> + + </head><body> + + + <div class="document"> + <div class="documentwrapper"> + <div class="bodywrapper"> + <div class="body" 
role="main"> + + <div class="section" id="view">
<h1>View<a class="headerlink" href="#view" title="Permalink to this headline">¶</a></h1>
<div class="section" id="conceptualization">
<h2>Conceptualization<a class="headerlink" href="#conceptualization" title="Permalink to this headline">¶</a></h2>
<p>Lastly, the third directory (emph{src/main/resources}) contains all code needed for rendering the data to be displayed on the screen, so it corresponds to the view in an MVC approach. Rendering is done by xsl files that (unfortunately) contain some logic that really belongs to the controller; the division is thus not as clear as implied by the theory. I will discuss this issue more specifically in the relevant subsection below. Among the resources are also all images, styles, and JavaScript files.</p>
</div>
<div class="section" id="implementation">
<h2>Implementation<a class="headerlink" href="#implementation" title="Permalink to this headline">¶</a></h2>
<p>As explained in section ref{subsec:mvc}, the view component handles the visual representation in the form of an interface that allows interaction between the user and the task to be carried out by the machine. Since the present case is a webservice, all interaction happens via a browser: webpages are visualized, and responses are recognized by registering mouse or keyboard events. More specifically, a webpage is rendered by transforming xml documents into html pages. The MyCoRe repository framework uses Xalan, an open-source XSLT processor from Apache.footnote{http://xalan.apache.org} This engine transforms document nodes described by XPath syntax into hypertext, making use of a special form of template matching. All templates are collected in so-called xml-encoded stylesheets. 
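<p>As a generic illustration of this template-matching mechanism (a toy stylesheet for demonstration only, not one of the Morphilo files), a template that renders every w element of a TEI-style document as an HTML list item could look like this:</p>

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Match every <w> node reachable from the document root
       and emit its text content as an HTML list item. -->
  <xsl:template match="w">
    <li>
      <xsl:value-of select="text()"/>
    </li>
  </xsl:template>
</xsl:stylesheet>
```

<p>The processor walks the source tree, and for each node whose XPath pattern matches a template, the template body is instantiated in the output; this is the same mechanism the Morphilo stylesheets below rely on, only on a larger scale.</p>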
Since there are two data models with two different structures, it is good practice to define two stylesheet files, one for each data model.</p>
<p>As a demonstration, listing ref{lst:morphilostylesheet} below gives a short extract for rendering the word data.</p>
<pre>
begin{lstlisting}[language=XML,caption={stylesheet morphilo.xsl},label=lst:morphilostylesheet]
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xalan="http://xml.apache.org/xalan"
    xmlns:i18n="xalan://org.mycore.services.i18n.MCRTranslation"
    xmlns:acl="xalan://org.mycore.access.MCRAccessManager"
    xmlns:mcr="http://www.mycore.org/" xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:mods="http://www.loc.gov/mods/v3"
    xmlns:encoder="xalan://java.net.URLEncoder"
    xmlns:mcrxsl="xalan://org.mycore.common.xml.MCRXMLFunctions"
    xmlns:mcrurn="xalan://org.mycore.urn.MCRXMLFunctions"
    exclude-result-prefixes="xalan xlink mcr i18n acl mods mcrxsl mcrurn encoder"
    version="1.0">
  <xsl:param name="MCR.Users.Superuser.UserName"/>
  <xsl:template match="/mycoreobject[contains(@ID,'_morphilo_')]">
    <head>
      <link href="{$WebApplicationBaseURL}css/file.css" rel="stylesheet"/>
    </head>
    <div class="row">
      <xsl:call-template name="objectAction">
        <xsl:with-param name="id" select="@ID"/>
        <xsl:with-param name="deriv" select="structure/derobjects/derobject/@xlink:href"/>
      </xsl:call-template>
      <xsl:variable name="objID" select="@ID"/>
      <!-- Hier Ueberschrift setzen (set the heading here) -->
      <h1 style="text-indent: 4em;">
        <xsl:if test="metadata/def.morphiloContainer/morphiloContainer/morphilo/w">
          <xsl:value-of select="metadata/def.morphiloContainer/morphiloContainer/morphilo/w/text()[string-length(normalize-space(.))>0]"/>
        </xsl:if>
      </h1>
      <dl class="dl-horizontal">
      <!-- (1) Display word -->
        <xsl:if test="metadata/def.morphiloContainer/morphiloContainer/morphilo/w">
          <dt>
            <xsl:value-of select="i18n:translate('response.page.label.word')"/>
          </dt>
          <dd>
            <xsl:value-of select="metadata/def.morphiloContainer/morphiloContainer/morphilo/w/text()[string-length(normalize-space(.))>0]"/>
          </dd>
        </xsl:if>
        <!-- (2) Display lemma -->
        ...
    </div>
  </xsl:template>
  ...
  <xsl:template name="objectAction">
  ...
  </xsl:template>
...
</xsl:stylesheet>
end{lstlisting}
</pre>
<p>This template matches the root node of each emph{MyCoRe object}, ensuring that a valid MyCoRe model is used, and checks that the document to be processed contains a unique identifier, here a emph{MyCoRe-ID}, and the name of the correct data model, here emph{morphilo}. Then another template, emph{objectAction}, is called together with two parameters, the ids of the document object and of the attached files. In the remainder, all relevant information from the document, such as the word and the lemma, is accessed via XPath, enriched with hypertext annotations, and rendered as a hypertext document. The template emph{objectAction} is key to understanding the coupling process in the software framework. 
It is therefore separately listed in ref{lst:objActionTempl}.</p> +<p>begin{lstlisting}[language=XML,caption={template +objectAction},label=lst:objActionTempl,escapechar=|] +<xsl:template name=”objectAction”></p> +<blockquote> +<div><p><xsl:param name=”id” select=”./@ID”/> +<xsl:param name=”accessedit” select=”acl:checkPermission($id,’writedb’)”/> +<xsl:param name=”accessdelete” select=”acl:checkPermission($id,’deletedb’)”/> +<xsl:variable name=”derivCorp” select=”./@label”/> +<xsl:variable name=”corpID” select=”metadata/def.corpuslink[@class=’MCRMetaLinkID’]/corpuslink/@xlink:href”/> +<xsl:if test=”$accessedit or $accessdelete”>|label{ln:ng}| +<div class=”dropdown pull-right”></p> +<blockquote> +<div><dl class="docutils"> +<dt><xsl:if test=”string-length($corpID) &gt; 0 or $CurrentUser=’administrator’”></dt> +<dd><dl class="first docutils"> +<dt><button class=”btn btn-default dropdown-toggle” style=”margin:10px” type=”button” id=”dropdownMenu1” data-toggle=”dropdown” aria-expanded=”true”></dt> +<dd><span class=”glyphicon glyphicon-cog” aria-hidden=”true”></span> Annotieren +<span class=”caret”></span></dd> +</dl> +<p class="last"></button></p> +</dd> +</dl> +<p></xsl:if> +<xsl:if test=”string-length($corpID) &gt; 0”>|label{ln:ru}|</p> +<blockquote> +<div><p><xsl:variable name=”ifsDirectory” select=”document(concat(‘ifs:/’,$derivCorp))”/> +<ul class=”dropdown-menu” role=”menu” aria-labelledby=”dropdownMenu1”></p> +<blockquote> +<div><dl class="docutils"> +<dt><li role=”presentation”></dt> +<dd><dl class="first docutils"> +<dt><a href="#id1"><span class="problematic" id="id2">|\label{ln:nw1}|<a href="{$ServletsBaseURL}object/tag{$HttpSession}?id={$derivCorp}&amp;objID={$corpID}" role="menuitem" tabindex="-1">|</span></a>label{ln:nw2}|</dt> +<dd><xsl:value-of select=”i18n:translate(‘object.nextObject’)”/></dd> +</dl> +<p class="last"></a></p> +</dd> +</dl> +<p></li> +<li role=”presentation”></p> +<blockquote> +<div><dl class="docutils"> +<dt><a 
href=”{$WebApplicationBaseURL}receive/{$corpID}” role=”menuitem” tabindex=”-1”></dt> +<dd><xsl:value-of select=”i18n:translate(‘object.backToProject’)”/></dd> +</dl> +<p></a></p> +</div></blockquote> +<p></li></p> +</div></blockquote> +<p></ul></p> +</div></blockquote> +<p></xsl:if> +<xsl:if test=”$CurrentUser=’administrator’”></p> +<blockquote> +<div><dl class="docutils"> +<dt><ul class=”dropdown-menu” role=”menu” aria-labelledby=”dropdownMenu1”></dt> +<dd><blockquote class="first"> +<div><dl class="docutils"> +<dt><li role=”presentation”></dt> +<dd><dl class="first docutils"> +<dt><a role=”menuitem” tabindex=”-1” href=”{$WebApplicationBaseURL}content/publish/morphilo.xed?id={$id}”></dt> +<dd><xsl:value-of select=”i18n:translate(‘object.editWord’)”/></dd> +</dl> +<p class="last"></a></p> +</dd> +</dl> +<p></li> +<li role=”presentation”></p> +<blockquote> +<div><dl class="docutils"> +<dt><a href=”{$ServletsBaseURL}object/delete{$HttpSession}?id={$id}” role=”menuitem” tabindex=”-1” class=”confirm_deletion option” data-text=”Wirklich loeschen”></dt> +<dd><xsl:value-of select=”i18n:translate(‘object.delWord’)”/></dd> +</dl> +<p></a></p> +</div></blockquote> +</div></blockquote> +<p class="last"></li></p> +</dd> +</dl> +<p></ul></p> +</div></blockquote> +<p></xsl:if> +</div> +<div class=”row” style=”margin-left:0px; margin-right:10px”></p> +<blockquote> +<div><dl class="docutils"> +<dt><xsl:apply-templates select=”structure/derobjects/derobject[acl:checkPermission(@xlink:href,’read’)]”></dt> +<dd><xsl:with-param name=”objID” select=”@ID”/></dd> +</dl> +<p></xsl:apply-templates></p> +</div></blockquote> +<p></div></p> +</div></blockquote> +<p></xsl:if></p> +</div></blockquote> +<p></xsl:template> +end{lstlisting} +The emph{objectAction} template defines the selection menu appearing – once manual tagging has +started – on the upper right hand side of the webpage entitled +emph{Annotieren} and displaying the two options emph{next word} or emph{back +to project}. 
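+<p>The menu entries assembled here are plain URLs built from the MyCoRe base-URL properties
(e.g. the emph{object/tag} and emph{object/process} paths), which the servlet container resolves
to Java classes through the mappings declared in web-fragment.xml. As a reading aid, a minimal
sketch of such a mapping is given below; note that the package name is an assumption for
illustration — only ProcessCorpusServlet is actually named in this documentation.</p>

```xml
<!-- Sketch of a Servlet 3.x web-fragment.xml mapping. The package
     de.uni_hamburg.morphilo is an assumption for illustration. -->
<web-fragment xmlns="http://xmlns.jcp.org/xml/ns/javaee" version="3.1">
  <servlet>
    <servlet-name>ProcessCorpusServlet</servlet-name>
    <servlet-class>de.uni_hamburg.morphilo.ProcessCorpusServlet</servlet-class>
  </servlet>
  <servlet-mapping>
    <!-- resolves URLs of the form {$ServletsBaseURL}object/process... -->
    <servlet-name>ProcessCorpusServlet</servlet-name>
    <url-pattern>/servlets/object/process</url-pattern>
  </servlet-mapping>
</web-fragment>
```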
+The first thing to note here is that in line ref{ln:ng} a simple test
+excludes all guest users from accessing the procedure. After ensuring that only
+the user who owns the corpus project has access (line ref{ln:ru}), s/he will be
+able to access the drop-down menu, which is simply a URL, e.g. line
+ref{ln:nw1}. The attentive reader might have noticed that
+the URL exactly matches the definition in the web-fragment.xml as shown in
+listing ref{lst:webfragment}, line ref{ln:tag}, which resolves to the
+respective Java class there. In fact, this mechanism is the data interface within the
+MVC pattern. The URL also contains two variables, named emph{derivCorp} and
+emph{corpID}, that are needed to identify the corpus and file object by the
+Java classes (see section ref{sec:javacode}).</p>
+<p>The morphilo.xsl stylesheet contains yet another modification that deserves mention.
+In listing ref{lst:derobjectTempl}, line ref{ln:morphMenu}, two menu options –
+emph{Tag automatically} and emph{Tag manually} – are defined. The former option
+initiates ProcessCorpusServlet.java, as can be seen again in listing ref{lst:webfragment},
+line ref{ln:process}, which determines the words that are not yet in the master database.
+Still, it is important to note that the menu option is only displayed if two restrictions
+are met. First, a file has to be uploaded (line ref{ln:1test}) and, second, there must be
+only one file. This is necessary because the annotation process generates further files
+that store the words not yet processed, as well as a file that includes the final result. The
+generated files follow a fixed naming pattern. The file harboring the final, entire TEI-annotated
+corpus is prefixed with emph{tagged}; the other file is prefixed with emph{untagged}. This circumstance
+is exploited for manipulating the second option (line ref{ln:loop}).
A loop runs through all +files in the respective directory and if a file name starts with emph{untagged}, +the option to manually tag is displayed.</p> +<p>begin{lstlisting}[language=XML,caption={template +matching derobject},label=lst:derobjectTempl,escapechar=|] +<xsl:template match=”derobject” mode=”derivateActions”></p> +<blockquote> +<div><p><xsl:param name=”deriv” /> +<xsl:param name=”parentObjID” /> +<xsl:param name=”suffix” select=”’‘” /> +<xsl:param name=”id” select=”../../../@ID” /> +<xsl:if test=”acl:checkPermission($deriv,’writedb’)”></p> +<blockquote> +<div><xsl:variable name=”ifsDirectory” select=”document(concat(‘ifs:’,$deriv,’/’))” /> +<xsl:variable name=”path” select=”$ifsDirectory/mcr_directory/path” /></div></blockquote> +<dl class="docutils"> +<dt>…</dt> +<dd><blockquote class="first"> +<div><dl class="docutils"> +<dt><div class=”options pull-right”></dt> +<dd><dl class="first docutils"> +<dt><div class=”btn-group” style=”margin:10px”></dt> +<dd><dl class="first docutils"> +<dt><a href=”#” class=”btn btn-default dropdown-toggle” data-toggle=”dropdown”></dt> +<dd><i class=”fa fa-cog”></i> +<xsl:value-of select=”’ Korpus’”/> +<span class=”caret”></span></dd> +</dl> +<p class="last"></a></p> +</dd> +<dt><ul class=”dropdown-menu dropdown-menu-right”></dt> +<dd><p class="first"><!– Anpasssungen Morphilo –>|label{ln:morphMenu}| +<xsl:if test=”string-length($deriv) &gt; 0”>|label{ln:1test}|</p> +<blockquote> +<div><dl class="docutils"> +<dt><xsl:if test=”count($ifsDirectory/mcr_directory/children/child) = 1”>|label{ln:2test}|</dt> +<dd><dl class="first docutils"> +<dt><li role=”presentation”></dt> +<dd><dl class="first docutils"> +<dt><a href=”{$ServletsBaseURL}object/process{$HttpSession}?id={$deriv}&amp;objID={$id}” role=”menuitem” tabindex=”-1”></dt> +<dd><xsl:value-of select=”i18n:translate(‘derivate.process’)”/></dd> +</dl> +<p class="last"></a></p> +</dd> +</dl> +<p class="last"></li></p> +</dd> +</dl> +<p></xsl:if> +<xsl:for-each 
select=”$ifsDirectory/mcr_directory/children/child”>|label{ln:loop}|</p> +<blockquote> +<div><p><xsl:variable name=”untagged” select=”concat($path, ‘untagged’)”/> +<xsl:variable name=”filename” select=”concat($path,./name)”/> +<xsl:if test=”starts-with($filename, $untagged)”></p> +<blockquote> +<div><dl class="docutils"> +<dt><li role=”presentation”></dt> +<dd><dl class="first docutils"> +<dt><a href=”{$ServletsBaseURL}object/tag{$HttpSession}?id={$deriv}&amp;objID={$id}” role=”menuitem” tabindex=”-1”></dt> +<dd><xsl:value-of select=”i18n:translate(‘derivate.taggen’)”/></dd> +</dl> +<p class="last"></a></p> +</dd> +</dl> +<p></li></p> +</div></blockquote> +<p></xsl:if></p> +</div></blockquote> +<p></xsl:for-each></p> +</div></blockquote> +<p class="last"></xsl:if></p> +</dd> +</dl> +<p class="last">… +</ul></p> +</dd> +</dl> +<p></div></p> +</div></blockquote> +<p class="last"></div></p> +</dd> +</dl> +<p></xsl:if></p> +</div></blockquote> +<p></xsl:template> +end{lstlisting}</p> +<p>Besides the two stylesheets morphilo.xsl and corpmeta.xsl, other stylesheets have +to be adjusted. They will not be discussed in detail here for they are self-explanatory for the most part. +Essentially, they render the overall layout (emph{common-layout.xsl}, emph{skeleton_layout_template.xsl}) +or the presentation +of the search results (emph{response-page.xsl}) and definitions of the solr search fields (emph{searchfields-solr.xsl}). +The former and latter also inherit templates from emph{response-general.xsl} and emph{response-browse.xsl}, in which the +navigation bar of search results can be changed. For the use of multilinguality a separate configuration directory +has to be created containing as many emph{.property}-files as different +languages want to be displayed. In the current case these are restricted to German and English (emph{messages_de.properties} and emph{messages_en.properties}). +The property files include all emph{i18n} definitions. 
All these files are located in the emph{resources} directory.</p> +<p>Furthermore, a search mask and a page for manually entering the annotations had +to be designed. +For these files a specially designed xml standard (emph{xed}) is recommended to be used within the +repository framework.</p> +</div> +</div> + + + </div> + </div> + </div> + <div class="sphinxsidebar" role="navigation" aria-label="main navigation"> + <div class="sphinxsidebarwrapper"> + <h3><a href="../index.html">Table Of Contents</a></h3> + <ul> +<li><a class="reference internal" href="#">View</a><ul> +<li><a class="reference internal" href="#conceptualization">Conceptualization</a></li> +<li><a class="reference internal" href="#implementation">Implementation</a></li> +</ul> +</li> +</ul> +<div class="relations"> +<h3>Related Topics</h3> +<ul> + <li><a href="../index.html">Documentation overview</a><ul> + <li>Previous: <a href="controller.html" title="previous chapter">Controller Adjustments</a></li> + <li>Next: <a href="architecture.html" title="next chapter">Software Design</a></li> + </ul></li> +</ul> +</div> + <div role="note" aria-label="source link"> + <h3>This Page</h3> + <ul class="this-page-menu"> + <li><a href="../_sources/source/view.rst.txt" + rel="nofollow">Show Source</a></li> + </ul> + </div> +<div id="searchbox" style="display: none" role="search"> + <h3>Quick search</h3> + <div class="searchformwrapper"> + <form class="search" action="../search.html" method="get"> + <input type="text" name="q" /> + <input type="submit" value="Go" /> + <input type="hidden" name="check_keywords" value="yes" /> + <input type="hidden" name="area" value="default" /> + </form> + </div> +</div> +<script type="text/javascript">$('#searchbox').show(0);</script> + </div> + </div> + <div class="clearer"></div> + </div> + <div class="footer"> + ©2018, Hagen Peukert. 
+ + | + Powered by <a href="http://sphinx-doc.org/">Sphinx 1.7.2</a> + & <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.10</a> + + | + <a href="../_sources/source/view.rst.txt" + rel="nofollow">Page source</a> + </div> + + + + + </body> +</html> \ No newline at end of file diff --git a/Morphilo_doc/_build/latex/Morphilo.aux b/Morphilo_doc/_build/latex/Morphilo.aux index d61af2f34b83ad5c0861d44998f62558926a3f47..e36a19b4547e322743294a8593d21990df9f46ba 100644 --- a/Morphilo_doc/_build/latex/Morphilo.aux +++ b/Morphilo_doc/_build/latex/Morphilo.aux @@ -18,20 +18,13 @@ \providecommand\HyField@AuxAddToCoFields[2]{} \babel@aux{english}{} \newlabel{index::doc}{{}{1}{}{section*.2}{}} -\@writefile{toc}{\contentsline {chapter}{\numberline {1}Data Model Implementation}{1}{chapter.1}} +\@writefile{toc}{\contentsline {chapter}{\numberline {1}Data Model}{1}{chapter.1}} \@writefile{lof}{\addvspace {10\p@ }} \@writefile{lot}{\addvspace {10\p@ }} -\newlabel{source/datamodel:data-model-implementation}{{1}{1}{Data Model Implementation}{chapter.1}{}} -\newlabel{source/datamodel::doc}{{1}{1}{Data Model Implementation}{chapter.1}{}} -\newlabel{source/datamodel:welcome-to-morphilo-s-documentation}{{1}{1}{Data Model Implementation}{chapter.1}{}} -\@writefile{toc}{\contentsline {chapter}{\numberline {2}Controller Adjustments}{3}{chapter.2}} -\@writefile{lof}{\addvspace {10\p@ }} -\@writefile{lot}{\addvspace {10\p@ }} -\newlabel{source/controller:controller-adjustments}{{2}{3}{Controller Adjustments}{chapter.2}{}} -\newlabel{source/controller::doc}{{2}{3}{Controller Adjustments}{chapter.2}{}} -\@writefile{toc}{\contentsline {section}{\numberline {2.1}General Principle of Operation}{3}{section.2.1}} -\newlabel{source/controller:general-principle-of-operation}{{2.1}{3}{General Principle of Operation}{section.2.1}{}} -\@writefile{toc}{\contentsline {chapter}{\numberline {3}Indices and tables}{5}{chapter.3}} -\@writefile{lof}{\addvspace {10\p@ }} -\@writefile{lot}{\addvspace 
{10\p@ }} -\newlabel{index:indices-and-tables}{{3}{5}{Indices and tables}{chapter.3}{}} +\newlabel{source/datamodel:documentation-morphilo-project}{{1}{1}{Data Model}{chapter.1}{}} +\newlabel{source/datamodel::doc}{{1}{1}{Data Model}{chapter.1}{}} +\newlabel{source/datamodel:data-model}{{1}{1}{Data Model}{chapter.1}{}} +\@writefile{toc}{\contentsline {section}{\numberline {1.1}Conceptualization}{1}{section.1.1}} +\newlabel{source/datamodel:conceptualization}{{1.1}{1}{Conceptualization}{section.1.1}{}} +\@writefile{toc}{\contentsline {section}{\numberline {1.2}Implementation}{1}{section.1.2}} +\newlabel{source/datamodel:implementation}{{1.2}{1}{Implementation}{section.1.2}{}} diff --git a/Morphilo_doc/_build/latex/Morphilo.fdb_latexmk b/Morphilo_doc/_build/latex/Morphilo.fdb_latexmk index 2c47a323543c890ff84ebe3c4f15e5485438260e..0ef110905f1c45c9de7d3ad66d77d33b2ab56173 100644 --- a/Morphilo_doc/_build/latex/Morphilo.fdb_latexmk +++ b/Morphilo_doc/_build/latex/Morphilo.fdb_latexmk @@ -1,11 +1,11 @@ # Fdb version 3 -["makeindex Morphilo.idx"] 1539349060 "Morphilo.idx" "Morphilo.ind" "Morphilo" 1539349061 - "Morphilo.idx" 1539349060 0 d41d8cd98f00b204e9800998ecf8427e "" +["makeindex Morphilo.idx"] 1539349060 "Morphilo.idx" "Morphilo.ind" "Morphilo" 1539787485 + "Morphilo.idx" 1539787352 0 d41d8cd98f00b204e9800998ecf8427e "" (generated) + "Morphilo.ilg" "Morphilo.ind" -["pdflatex"] 1539349060 "Morphilo.tex" "Morphilo.pdf" "Morphilo" 1539349061 +["pdflatex"] 1539787485 "Morphilo.tex" "Morphilo.pdf" "Morphilo" 1539787485 "/etc/texmf/web2c/texmf.cnf" 1534936626 1101 af7716885e081ab43982cab7b4672c1a "" - "/usr/share/texlive/texmf-dist/fonts/enc/dvips/base/8r.enc" 1480098666 4850 80dc9bab7f31fb78a000ccfed0e27cab "" "/usr/share/texlive/texmf-dist/fonts/map/fontname/texfonts.map" 1511824771 3332 103109f5612ad95229751940c61aada0 "" "/usr/share/texlive/texmf-dist/fonts/tfm/adobe/helvetic/phvb8r.tfm" 1480098688 4484 b828043cbd581d289d955903c1339981 "" 
"/usr/share/texlive/texmf-dist/fonts/tfm/adobe/helvetic/phvb8t.tfm" 1480098688 6628 34c39492c0adc454c1c199922bba8363 "" @@ -13,28 +13,18 @@ "/usr/share/texlive/texmf-dist/fonts/tfm/adobe/helvetic/phvr8t.tfm" 1480098688 7040 b2bd27e2bfe6f6948cbc3239cae7444f "" "/usr/share/texlive/texmf-dist/fonts/tfm/adobe/times/ptmb8r.tfm" 1480098689 4524 6bce29db5bc272ba5f332261583fee9c "" "/usr/share/texlive/texmf-dist/fonts/tfm/adobe/times/ptmb8t.tfm" 1480098689 6880 f19b8995b61c334d78fc734065f6b4d4 "" - "/usr/share/texlive/texmf-dist/fonts/tfm/adobe/times/ptmr8c.tfm" 1480098689 1352 fa28a7e6d323c65ce7d13d5342ff6be2 "" "/usr/share/texlive/texmf-dist/fonts/tfm/adobe/times/ptmr8r.tfm" 1480098689 4408 25b74d011a4c66b7f212c0cc3c90061b "" "/usr/share/texlive/texmf-dist/fonts/tfm/adobe/times/ptmr8t.tfm" 1480098689 6672 e3ab9e37e925f3045c9005e6d1473d56 "" "/usr/share/texlive/texmf-dist/fonts/tfm/jknappen/ec/ecrm1000.tfm" 1480098696 3584 adb004a0c8e7c46ee66cad73671f37b4 "" - "/usr/share/texlive/texmf-dist/fonts/tfm/public/amsfonts/cmextra/cmex7.tfm" 1480098698 1004 54797486969f23fa377b128694d548df "" "/usr/share/texlive/texmf-dist/fonts/tfm/public/amsfonts/symbols/msam10.tfm" 1480098698 916 f87d7c45f9c908e672703b83b72241a3 "" - "/usr/share/texlive/texmf-dist/fonts/tfm/public/amsfonts/symbols/msam5.tfm" 1480098698 924 9904cf1d39e9767e7a3622f2a125a565 "" - "/usr/share/texlive/texmf-dist/fonts/tfm/public/amsfonts/symbols/msam7.tfm" 1480098698 928 2dc8d444221b7a635bb58038579b861a "" "/usr/share/texlive/texmf-dist/fonts/tfm/public/amsfonts/symbols/msbm10.tfm" 1480098698 908 2921f8a10601f252058503cc6570e581 "" - "/usr/share/texlive/texmf-dist/fonts/tfm/public/amsfonts/symbols/msbm5.tfm" 1480098698 940 75ac932a52f80982a9f8ea75d03a34cf "" - "/usr/share/texlive/texmf-dist/fonts/tfm/public/amsfonts/symbols/msbm7.tfm" 1480098698 940 228d6584342e91276bf566bcf9716b83 "" "/usr/share/texlive/texmf-dist/fonts/tfm/public/cm/cmex10.tfm" 1480098701 992 662f679a0b3d2d53c1b94050fdaa3f50 "" 
"/usr/share/texlive/texmf-dist/fonts/tfm/public/cm/cmmi12.tfm" 1480098701 1524 4414a8315f39513458b80dfc63bff03a "" "/usr/share/texlive/texmf-dist/fonts/tfm/public/cm/cmr12.tfm" 1480098701 1288 655e228510b4c2a1abe905c368440826 "" "/usr/share/texlive/texmf-dist/fonts/tfm/public/cm/cmr17.tfm" 1480098701 1292 296a67155bdbfc32aa9c636f21e91433 "" "/usr/share/texlive/texmf-dist/fonts/tfm/public/cm/cmsy10.tfm" 1480098701 1124 6c73e740cf17375f03eec0ee63599741 "" - "/usr/share/texlive/texmf-dist/fonts/type1/urw/helvetic/uhvb8a.pfb" 1480098746 35941 f27169cc74234d5bd5e4cca5abafaabb "" - "/usr/share/texlive/texmf-dist/fonts/type1/urw/times/utmb8a.pfb" 1480098746 44729 811d6c62865936705a31c797a1d5dada "" - "/usr/share/texlive/texmf-dist/fonts/type1/urw/times/utmr8a.pfb" 1480098746 46026 6dab18b61c907687b520c72847215a68 "" "/usr/share/texlive/texmf-dist/fonts/vf/adobe/helvetic/phvb8t.vf" 1480098757 2340 0efed6a948c3c37d870e4e7ddb85c7c3 "" "/usr/share/texlive/texmf-dist/fonts/vf/adobe/times/ptmb8t.vf" 1480098758 2340 df9c920cc5688ebbf16a93f45ce7bdd3 "" - "/usr/share/texlive/texmf-dist/fonts/vf/adobe/times/ptmr8c.vf" 1480098758 3556 8a9a6dcbcd146ef985683f677f4758a6 "" "/usr/share/texlive/texmf-dist/fonts/vf/adobe/times/ptmr8t.vf" 1480098758 2348 91706c542228501c410c266421fbe30c "" "/usr/share/texlive/texmf-dist/tex/context/base/mkii/supp-pdf.mkii" 1480098806 71627 94eb9990bed73c364d7f53f960cc8c5b "" "/usr/share/texlive/texmf-dist/tex/generic/babel-english/english.ldf" 1496785618 7008 9ff5fdcc865b01beca2b0fe4a46231d4 "" @@ -116,7 +106,6 @@ "/usr/share/texlive/texmf-dist/tex/latex/psnfss/t1phv.fd" 1480098837 1488 9a55ac1cde6b4798a7f56844bb75a553 "" "/usr/share/texlive/texmf-dist/tex/latex/psnfss/t1ptm.fd" 1480098837 774 61d7da1e9f9e74989b196d147e623736 "" "/usr/share/texlive/texmf-dist/tex/latex/psnfss/times.sty" 1480098837 857 6c716f26c5eadfb81029fcd6ce2d45e6 "" - "/usr/share/texlive/texmf-dist/tex/latex/psnfss/ts1ptm.fd" 1480098837 619 96f56dc5d1ef1fe1121f1cfeec70ee0c "" 
"/usr/share/texlive/texmf-dist/tex/latex/tabulary/tabulary.sty" 1480098840 13791 8c83287d79183c3bf58fd70871e8a70b "" "/usr/share/texlive/texmf-dist/tex/latex/titlesec/titlesec.sty" 1480098841 37387 afa86533e532701faf233f3f592c61e0 "" "/usr/share/texlive/texmf-dist/tex/latex/tools/array.sty" 1485129666 12396 d41f82b039f900e95f351e54ae740f31 "" @@ -130,20 +119,20 @@ "/usr/share/texmf/web2c/texmf.cnf" 1520210507 32485 c64754543d8ac501bea6e75e209ea521 "" "/var/lib/texmf/fonts/map/pdftex/updmap/pdftex.map" 1534936964 2700761 ac0584cc9514ab21918550a6948c4ee2 "" "/var/lib/texmf/web2c/pdftex/pdflatex.fmt" 1534936984 4127050 03a6fcb3ed24b2a3ea3480b0b9907a5c "" - "Morphilo.aux" 1539349061 2013 11aa85ec61008c3d3e26ffd886907985 "" + "Morphilo.aux" 1539787473 1477 793e7b87aef4b9bac134adf8f9125ab3 "" "Morphilo.ind" 1539349060 0 d41d8cd98f00b204e9800998ecf8427e "makeindex Morphilo.idx" - "Morphilo.out" 1539349061 682 6029c0089617742ffad26495c33fcf03 "" - "Morphilo.tex" 1539349059 4616 848d76cc8b069c19a8fb2a2ce4b09737 "" - "Morphilo.toc" 1539349061 342 ac6875778cea5ee102b9b5dfd776b04e "" + "Morphilo.out" 1539787473 359 8f0f08e0cc33542e46a154cf270f9233 "" + "Morphilo.tex" 1539787485 97592 966a4c392c3d9d75dc858d7c1efdd907 "" + "Morphilo.toc" 1539787352 0 d41d8cd98f00b204e9800998ecf8427e "" "footnotehyper-sphinx.sty" 1523881736 8886 0562fcad2b7e25f93331edc6fc422c87 "" "sphinx.sty" 1523881736 67712 9b578972569f0169bf44cfae88da82f2 "" - "sphinxhighlight.sty" 1539349059 8137 b8d4ef963833564f6e4eadc09cd757c4 "" + "sphinxhighlight.sty" 1539787485 8137 b8d4ef963833564f6e4eadc09cd757c4 "" "sphinxmanual.cls" 1523881736 3589 0b0aac49c6f36925cf5f9d524a75a978 "" "sphinxmulticell.sty" 1523881736 14618 0defbdc8536ad2e67f1eac6a1431bc55 "" (generated) - "Morphilo.log" - "Morphilo.aux" + "Morphilo.out" "Morphilo.idx" - "Morphilo.toc" + "Morphilo.aux" "Morphilo.pdf" - "Morphilo.out" + "Morphilo.log" + "Morphilo.toc" diff --git a/Morphilo_doc/_build/latex/Morphilo.fls 
b/Morphilo_doc/_build/latex/Morphilo.fls index 8656387be1115dff4ad5ddf170419b37e4ba3ac2..f9629b7b620fafcf0c2021b4a83dff8c38c0be6e 100644 --- a/Morphilo_doc/_build/latex/Morphilo.fls +++ b/Morphilo_doc/_build/latex/Morphilo.fls @@ -241,32 +241,13 @@ INPUT /usr/share/texlive/texmf-dist/fonts/tfm/adobe/helvetic/phvr8t.tfm INPUT /usr/share/texlive/texmf-dist/fonts/tfm/adobe/helvetic/phvb8t.tfm INPUT Morphilo.toc INPUT Morphilo.toc -INPUT /usr/share/texlive/texmf-dist/fonts/tfm/adobe/times/ptmb8t.tfm -INPUT /usr/share/texlive/texmf-dist/fonts/tfm/public/amsfonts/cmextra/cmex7.tfm -INPUT /usr/share/texlive/texmf-dist/fonts/tfm/public/amsfonts/cmextra/cmex7.tfm -INPUT /usr/share/texlive/texmf-dist/fonts/tfm/public/amsfonts/symbols/msam7.tfm -INPUT /usr/share/texlive/texmf-dist/fonts/tfm/public/amsfonts/symbols/msam5.tfm -INPUT /usr/share/texlive/texmf-dist/fonts/tfm/public/amsfonts/symbols/msbm7.tfm -INPUT /usr/share/texlive/texmf-dist/fonts/tfm/public/amsfonts/symbols/msbm5.tfm OUTPUT Morphilo.toc INPUT /usr/share/texlive/texmf-dist/fonts/vf/adobe/helvetic/phvb8t.vf INPUT /usr/share/texlive/texmf-dist/fonts/tfm/adobe/helvetic/phvb8r.tfm -INPUT /usr/share/texlive/texmf-dist/fonts/vf/adobe/times/ptmb8t.vf -INPUT /usr/share/texlive/texmf-dist/fonts/tfm/adobe/times/ptmb8r.tfm -INPUT /usr/share/texlive/texmf-dist/fonts/vf/adobe/times/ptmr8t.vf -INPUT /usr/share/texlive/texmf-dist/fonts/tfm/adobe/times/ptmr8r.tfm INPUT /usr/share/texlive/texmf-dist/fonts/vf/adobe/helvetic/phvb8t.vf INPUT /usr/share/texlive/texmf-dist/fonts/tfm/adobe/helvetic/phvb8r.tfm -INPUT /usr/share/texlive/texmf-dist/tex/latex/psnfss/ts1ptm.fd -INPUT /usr/share/texlive/texmf-dist/tex/latex/psnfss/ts1ptm.fd -INPUT /usr/share/texlive/texmf-dist/fonts/tfm/adobe/times/ptmr8c.tfm -INPUT Morphilo.ind -INPUT Morphilo.ind -INPUT /usr/share/texlive/texmf-dist/fonts/vf/adobe/times/ptmr8c.vf -INPUT Morphilo.aux -INPUT ./Morphilo.out -INPUT ./Morphilo.out -INPUT 
/usr/share/texlive/texmf-dist/fonts/enc/dvips/base/8r.enc -INPUT /usr/share/texlive/texmf-dist/fonts/type1/urw/helvetic/uhvb8a.pfb -INPUT /usr/share/texlive/texmf-dist/fonts/type1/urw/times/utmb8a.pfb -INPUT /usr/share/texlive/texmf-dist/fonts/type1/urw/times/utmr8a.pfb +INPUT /usr/share/texlive/texmf-dist/fonts/tfm/adobe/times/ptmb8t.tfm +INPUT /usr/share/texlive/texmf-dist/fonts/vf/adobe/times/ptmr8t.vf +INPUT /usr/share/texlive/texmf-dist/fonts/tfm/adobe/times/ptmr8r.tfm +INPUT /usr/share/texlive/texmf-dist/fonts/vf/adobe/times/ptmb8t.vf +INPUT /usr/share/texlive/texmf-dist/fonts/tfm/adobe/times/ptmb8r.tfm diff --git a/Morphilo_doc/_build/latex/Morphilo.log b/Morphilo_doc/_build/latex/Morphilo.log index 46c0b95845a6c49ef721b80100db50e149ad94fc..aaf690dfb680ffd711c6d77c12eecdf6077fcd43 100644 --- a/Morphilo_doc/_build/latex/Morphilo.log +++ b/Morphilo_doc/_build/latex/Morphilo.log @@ -1,4 +1,4 @@ -This is pdfTeX, Version 3.14159265-2.6-1.40.18 (TeX Live 2017/Debian) (preloaded format=pdflatex 2018.8.22) 12 OCT 2018 14:57 +This is pdfTeX, Version 3.14159265-2.6-1.40.18 (TeX Live 2017/Debian) (preloaded format=pdflatex 2018.8.22) 17 OCT 2018 16:44 entering extended mode restricted \write18 enabled. %&-line parsing enabled. @@ -1137,10 +1137,7 @@ LaTeX Font Info: Font shape `T1/phv/bx/n' in size <12> not available ] LaTeX Font Info: Font shape `T1/phv/bx/n' in size <14.4> not available (Font) Font shape `T1/phv/b/n' tried instead on input line 69. - (./Morphilo.toc -LaTeX Font Info: Font shape `T1/ptm/bx/n' in size <10> not available -(Font) Font shape `T1/ptm/b/n' tried instead on input line 2. -) + (./Morphilo.toc) \tf@toc=\write6 \openout6 = `Morphilo.toc'. @@ -1150,48 +1147,57 @@ LaTeX Font Info: Font shape `T1/ptm/bx/n' in size <10> not available ] Chapter 1. 
-[1 -] [2 +Underfull \hbox (badness 10000) in paragraph at lines 94--97 +[]\T1/ptm/m/n/10 begin{lstlisting}[language=XML, caption={TEI-example for `com- +fort-able'},label=lst:teiExamp] <w + [] -] -Chapter 2. -[3] [4 +LaTeX Font Info: Font shape `T1/ptm/bx/n' in size <10> not available +(Font) Font shape `T1/ptm/b/n' tried instead on input line 99. +[1 ] -Chapter 3. -LaTeX Font Info: Try loading font information for TS1+ptm on input line 110. +Underfull \hbox (badness 10000) in paragraph at lines 168--176 +[]\T1/ptm/m/n/10 name=^^Qmorphilo^^Q is-Child=^^Qtrue^^Q is-Par-ent=^^Qtrue^^Q +has-Derivates=^^Qtrue^^Q + [] + -(/usr/share/texlive/texmf-dist/tex/latex/psnfss/ts1ptm.fd -File: ts1ptm.fd 2001/06/04 font definitions for TS1/ptm. -) (./Morphilo.ind) -Package atveryend Info: Empty hook `BeforeClearDocument' on input line 125. +! LaTeX Error: Too deeply nested. -[5] -Package atveryend Info: Empty hook `AfterLastShipout' on input line 125. - (./Morphilo.aux) -Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 125. -Package atveryend Info: Executing hook `AtEndAfterFileList' on input line 125. -Package rerunfilecheck Info: File `Morphilo.out' has not changed. -(rerunfilecheck) Checksum: 6029C0089617742FFAD26495C33FCF03;682. -Package atveryend Info: Empty hook `AtVeryVeryEnd' on input line 125. - ) +See the LaTeX manual or LaTeX Companion for explanation. +Type H <return> for immediate help. + ... + +l.185 ...reater{}}] \leavevmode\begin{description} + +? +! Interruption. +\GenericError ... + \endgroup +l.185 ...reater{}}] \leavevmode\begin{description} + +? ^^[[A +Type <return> to proceed, S to scroll future error messages, +R to run without stopping, Q to run quietly, +I to insert something, E to edit your file, +H for help, X to quit. +? +! Emergency stop. +\GenericError ... + \endgroup +l.185 ...reater{}}] \leavevmode\begin{description} + +End of file on the terminal! 
+ + Here is how much of TeX's memory you used: - 13509 strings out of 492982 - 186890 string characters out of 6134896 - 274532 words of memory out of 5000000 - 16777 multiletter control sequences out of 15000+600000 - 37190 words of font info for 55 fonts, out of 8000000 for 9000 + 13449 strings out of 492982 + 186097 string characters out of 6134896 + 277384 words of memory out of 5000000 + 16746 multiletter control sequences out of 15000+600000 + 35585 words of font info for 48 fonts, out of 8000000 for 9000 1142 hyphenation exceptions out of 8191 37i,11n,45p,280b,360s stack positions out of 5000i,500n,10000p,200000b,80000s -{/usr/share/texlive/texmf-dist/fonts/enc/dvips/base/8r.enc}</usr/share/texliv -e/texmf-dist/fonts/type1/urw/helvetic/uhvb8a.pfb></usr/share/texlive/texmf-dist -/fonts/type1/urw/times/utmb8a.pfb></usr/share/texlive/texmf-dist/fonts/type1/ur -w/times/utmr8a.pfb> -Output written on Morphilo.pdf (9 pages, 47006 bytes). -PDF statistics: - 89 PDF objects out of 1000 (max. 8388607) - 69 compressed objects within 1 object stream - 14 named destinations out of 1000 (max. 500000) - 53 words of extra memory for PDF output out of 10000 (max. 10000000) - +! ==> Fatal error occurred, no output PDF file produced! 
diff --git a/Morphilo_doc/_build/latex/Morphilo.out b/Morphilo_doc/_build/latex/Morphilo.out index 5525dfa16691e92fa37ebad08f8b33c7a0778ed0..11fd6921b1623656015f6c8dae2ffa3c8773e8e2 100644 --- a/Morphilo_doc/_build/latex/Morphilo.out +++ b/Morphilo_doc/_build/latex/Morphilo.out @@ -1,4 +1,3 @@ -\BOOKMARK [0][-]{chapter.1}{\376\377\000D\000a\000t\000a\000\040\000M\000o\000d\000e\000l\000\040\000I\000m\000p\000l\000e\000m\000e\000n\000t\000a\000t\000i\000o\000n}{}% 1 -\BOOKMARK [0][-]{chapter.2}{\376\377\000C\000o\000n\000t\000r\000o\000l\000l\000e\000r\000\040\000A\000d\000j\000u\000s\000t\000m\000e\000n\000t\000s}{}% 2 -\BOOKMARK [1][-]{section.2.1}{\376\377\000G\000e\000n\000e\000r\000a\000l\000\040\000P\000r\000i\000n\000c\000i\000p\000l\000e\000\040\000o\000f\000\040\000O\000p\000e\000r\000a\000t\000i\000o\000n}{chapter.2}% 3 -\BOOKMARK [0][-]{chapter.3}{\376\377\000I\000n\000d\000i\000c\000e\000s\000\040\000a\000n\000d\000\040\000t\000a\000b\000l\000e\000s}{}% 4 +\BOOKMARK [0][-]{chapter.1}{\376\377\000D\000a\000t\000a\000\040\000M\000o\000d\000e\000l}{}% 1 +\BOOKMARK [1][-]{section.1.1}{\376\377\000C\000o\000n\000c\000e\000p\000t\000u\000a\000l\000i\000z\000a\000t\000i\000o\000n}{chapter.1}% 2 +\BOOKMARK [1][-]{section.1.2}{\376\377\000I\000m\000p\000l\000e\000m\000e\000n\000t\000a\000t\000i\000o\000n}{chapter.1}% 3 diff --git a/Morphilo_doc/_build/latex/Morphilo.pdf b/Morphilo_doc/_build/latex/Morphilo.pdf deleted file mode 100644 index 5aae41856019d74b9c736803c3a49a5f541e0dd7..0000000000000000000000000000000000000000 Binary files a/Morphilo_doc/_build/latex/Morphilo.pdf and /dev/null differ diff --git a/Morphilo_doc/_build/latex/Morphilo.tex b/Morphilo_doc/_build/latex/Morphilo.tex index 4512096e65bbb645e0e4be9fe071935a699745c8..4e090d906b5cff45a39dea629d9ee6d518170cb8 100644 --- a/Morphilo_doc/_build/latex/Morphilo.tex +++ b/Morphilo_doc/_build/latex/Morphilo.tex @@ -56,7 +56,7 @@ \title{Morphilo Documentation} -\date{Oct 12, 2018} +\date{Oct 17, 2018} 
\release{}
\author{Hagen Peukert}
\newcommand{\sphinxlogo}{\vbox{}}
@@ -71,8 +71,70 @@
-\chapter{Data Model Implementation}
-\label{\detokenize{source/datamodel:data-model-implementation}}\label{\detokenize{source/datamodel::doc}}\label{\detokenize{source/datamodel:welcome-to-morphilo-s-documentation}}
+\chapter{Data Model}
+\label{\detokenize{source/datamodel:documentation-morphilo-project}}\label{\detokenize{source/datamodel::doc}}\label{\detokenize{source/datamodel:data-model}}
+
+\section{Conceptualization}
+\label{\detokenize{source/datamodel:conceptualization}}
+From both the user and task requirements one can derive that four basic
+functions of data processing need to be carried out. Data have to be read, persistently
+saved, searched, and deleted. Furthermore, some kind of user management
+and multi-user processing is necessary. In addition, the framework should
+support web technologies, be well documented, and easy to extend. Ideally, the
+MVC pattern is realized.
+
+subsection\{Data Model\}label\{subsec:datamodel\}
+The guidelines of the
+emph\{TEI\}-standardfootnote\{http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf\} on the
+word level are defined in line with the structure defined above in section ref\{subsec:morphologicalSystems\}.
+In listing ref\{lst:teiExamp\} an
+example is given for a possible markup at the word level for
+emph\{comfortable\}.footnote\{http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-m.html\}
+
+begin\{lstlisting\}{[}language=XML,
+caption=\{TEI-example for ‘comfortable’\},label=lst:teiExamp{]}
+\textless{}w type=”adjective”\textgreater{}
+\begin{quote}
+\begin{description}
+\item[{\textless{}m type=”base”\textgreater{}}] \leavevmode
+\textless{}m type=”prefix” baseForm=”con”\textgreater{}com\textless{}/m\textgreater{}
+\textless{}m type=”root”\textgreater{}fort\textless{}/m\textgreater{}
+
+\end{description}
+
+\textless{}/m\textgreater{}
+\textless{}m type=”suffix”\textgreater{}able\textless{}/m\textgreater{}
+\end{quote}
+
+\textless{}/w\textgreater{}
+end\{lstlisting\}
+
+This data model reflects just one theoretical conception of word structure.
+Crucially, the model emanates from the assumption
+that the suffix node is on par with the word base. On the one hand, this
+implies that the word stem directly dominates the suffix, but not the prefix. The prefix, on the
+other hand, is enclosed in the base, which basically means a stronger lexical,
+and less abstract, attachment to the root of a word. Modeling prefixes and suffixes on different
+hierarchical levels has important consequences for the branching direction at
+subword level (here right-branching). Leaving theoretical interest aside, the
+choice of the TEI standard is reasonable with a view to a sustainable architecture that allows for
+exchanging data with little to no additional adjustments.
+
+A drawback is that the model is not suitable for all languages.
+It reflects a theoretical construction based on Indo-European
+languages. As long as attention is paid to the languages for which this software is used, this will
+not be problematic. This is the case for most languages of the Indo-European
+family and corresponds to the overwhelming majority of all research carried out
+(unfortunately).
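+Read without the LaTeX escaping of the build output, the TEI example for
+emph\{comfortable\} in listing ref\{lst:teiExamp\} amounts to the following markup
+(quotation marks normalized to plain ASCII):

```xml
<w type="adjective">
  <m type="base">
    <m type="prefix" baseForm="con">com</m>
    <m type="root">fort</m>
  </m>
  <m type="suffix">able</m>
</w>
```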
\section{Implementation}
\label{\detokenize{source/datamodel:implementation}}
As laid out in the task analysis in section \ref{subsec:datamodel}, it is
advantageous to use established standards. It was also shown that it makes
sense to keep the meta data of each corpus separate from the data model used
for the words to be analyzed. For the present case, the TEI standard was
identified as an appropriate markup for words. In terms of the implementation
this means that the TEI guidelines have to be implemented as an object type
compatible with the chosen repository framework. Whereas the attributes of the
objecttype are specific to the repository framework, the TEI structure is
recognized in the hierarchy of the meta data element starting with the name
\emph{w} (line \ref{src:wordbegin}).

\begin{lstlisting}[language=XML,caption={Word Data Model},label=lst:worddatamodel,escapechar=|]
<?xml version="1.0" encoding="UTF-8"?>
<objecttype
    name="morphilo"
    isChild="true"
    isParent="true"
    hasDerivates="true"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="datamodel.xsd">
  <metadata>
    <element name="morphiloContainer" type="xml" style="dontknow"
        notinherit="true" heritable="false">
      <xs:sequence>
        <xs:element name="morphilo">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="w" minOccurs="0" maxOccurs="unbounded">|\label{src:wordbegin}|
                <xs:complexType mixed="true">
                  <xs:sequence>
                    <!-- stem -->
                    <xs:element name="m1" minOccurs="0" maxOccurs="unbounded">
                      <xs:complexType mixed="true">
                        <xs:sequence>
                          <!-- base -->
                          <xs:element name="m2" minOccurs="0" maxOccurs="unbounded">
                            <xs:complexType mixed="true">
                              <xs:sequence>
                                <!-- root -->
                                <xs:element name="m3" minOccurs="0" maxOccurs="unbounded">
                                  <xs:complexType mixed="true">
                                    <xs:attribute name="type" type="xs:string"/>
                                  </xs:complexType>
                                </xs:element>
                                <!-- prefix -->
                                <xs:element name="m4" minOccurs="0" maxOccurs="unbounded">
                                  <xs:complexType mixed="true">
                                    <xs:attribute name="type" type="xs:string"/>
                                    <xs:attribute name="PrefixbaseForm" type="xs:string"/>
                                    <xs:attribute name="position" type="xs:string"/>
                                  </xs:complexType>
                                </xs:element>
                              </xs:sequence>
                              <xs:attribute name="type" type="xs:string"/>
                            </xs:complexType>
                          </xs:element>
                          <!-- suffix -->
                          <xs:element name="m5" minOccurs="0" maxOccurs="unbounded">
                            <xs:complexType mixed="true">
                              <xs:attribute name="type" type="xs:string"/>
                              <xs:attribute name="SuffixbaseForm" type="xs:string"/>
                              <xs:attribute name="position" type="xs:string"/>
                              <xs:attribute name="inflection" type="xs:string"/>
                            </xs:complexType>
                          </xs:element>
                        </xs:sequence>
                        <!-- stem attributes -->
                        <xs:attribute name="type" type="xs:string"/>
                        <xs:attribute name="pos" type="xs:string"/>
                        <xs:attribute name="occurrence" type="xs:string"/>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                  <!-- w attributes at word level -->
                  <xs:attribute name="lemma" type="xs:string"/>
                  <xs:attribute name="complexType" type="xs:string"/>
                  <xs:attribute name="wordtype" type="xs:string"/>
                  <xs:attribute name="occurrence" type="xs:string"/>
                  <xs:attribute name="corpus" type="xs:string"/>
                  <xs:attribute name="begin" type="xs:string"/>
                  <xs:attribute name="end" type="xs:string"/>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </element>
    <element name="wordtype" type="classification" minOccurs="0" maxOccurs="1">
      <classification id="wordtype"/>
    </element>
    <element name="complexType" type="classification" minOccurs="0" maxOccurs="1">
      <classification id="complexType"/>
    </element>
    <element name="corpus" type="classification" minOccurs="0" maxOccurs="1">
      <classification id="corpus"/>
    </element>
    <element name="pos" type="classification" minOccurs="0" maxOccurs="1">
      <classification id="pos"/>
    </element>
    <element name="PrefixbaseForm" type="classification" minOccurs="0" maxOccurs="1">
      <classification id="PrefixbaseForm"/>
    </element>
    <element name="SuffixbaseForm" type="classification" minOccurs="0" maxOccurs="1">
      <classification id="SuffixbaseForm"/>
    </element>
    <element name="inflection" type="classification" minOccurs="0" maxOccurs="1">
      <classification id="inflection"/>
    </element>
    <element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded">
      <target type="corpmeta"/>
    </element>
  </metadata>
</objecttype>
\end{lstlisting}

Additionally, it is worth mentioning that some attributes are modeled as a
\emph{classification}. All of these have to be listed as separate elements in
the data model. This has been done for all attributes that are subject to
little or no change. In fact, all known suffix and prefix morphemes of the
language investigated should be known and are therefore defined as a
classification. The same is true for the parts of speech, named \emph{pos} in
the morphilo data model above, for which the PENN treebank tagset was used.
Last, the different morphemic layers, named \emph{m} in the standard model,
are changed to \emph{m1} through \emph{m5}. This is the only change to the
standard that could be problematic if the data is to be processed elsewhere
and the change is not documented explicitly. Yet, this change was necessary
because the MyCoRe repository throws errors caused by ambiguity issues on the
different \emph{m}-layers.

The second data model describes only very few properties of the text corpora
from which the words are extracted. Listing \ref{lst:corpusdatamodel} depicts
only the meta data element. For the sake of simplicity of the prototype, this
data model is kept as simple as possible. The only obligatory field is the
name of the corpus. The dates of the corpus are classified as optional because
in some cases a text cannot be dated reliably.
\begin{lstlisting}[language=XML,caption={Corpus Data Model},label=lst:corpusdatamodel]
<metadata>
  <!-- mandatory fields -->
  <element name="korpusname" type="text" minOccurs="1" maxOccurs="1"/>
  <!-- optional fields -->
  <element name="sprache" type="text" minOccurs="0" maxOccurs="1"/>
  <element name="size" type="number" minOccurs="0" maxOccurs="1"/>
  <element name="datefrom" type="text" minOccurs="0" maxOccurs="1"/>
  <element name="dateuntil" type="text" minOccurs="0" maxOccurs="1"/>
  <!-- number of words -->
  <element name="NoW" type="text" minOccurs="0" maxOccurs="1"/>
  <element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded">
    <target type="morphilo"/>
  </element>
</metadata>
\end{lstlisting}

As a final remark, one might have noticed that all attributes are modelled as
strings, although other data types are available and the fields encoding the
dates or the number of words suggest otherwise. The MyCoRe framework even
provides a data type \emph{historydate}. There is no very satisfying answer
for its disuse. All that can be said is that the use of data types other than
string later led to problems in the convergence between the search engine and
the repository framework. These issues seem to be well known and can be
followed up on GitHub.
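One practical consequence of the string-only typing is that consuming code has to parse and validate the date and word-count fields itself. The helper below is a minimal, purely illustrative sketch of such defensive parsing; it is not part of Morphilo and only assumes the string-typed fields (\emph{datefrom}, \emph{dateuntil}, \emph{NoW}) of the corpus data model above.

```java
// Minimal sketch: defensive parsing of the string-typed corpus fields
// (datefrom, dateuntil, NoW). Illustrative only, not part of Morphilo.
public class CorpusFieldSketch {
    // returns the numeric value of a string field, or a fallback if the
    // field is empty or not a clean number (e.g. an undatable text)
    public static int parseIntField(String raw, int fallback) {
        if (raw == null || raw.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseIntField("1350", -1));      // a parsable date
        System.out.println(parseIntField("", -1));          // undated corpus
        System.out.println(parseIntField("ca. 1400", -1));  // unparsable date
    }
}
```

A sentinel such as \texttt{-1} keeps undatable corpora representable without resorting to the problematic typed fields mentioned above.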
\chapter{Controller Adjustments}
\label{\detokenize{source/controller:controller-adjustments}}\label{\detokenize{source/controller::doc}}

\section{General Principle of Operation}
\label{\detokenize{source/controller:general-principle-of-operation}}
Figure \ref{fig:classDiag} illustrates the dependencies of the five java
classes that were integrated to add the morphilo functionality, defined in the
package \emph{custom.mycore.addons.morphilo}. The general principle of
operation is the following. The handling of data search, upload, saving, and
user authentication is fully left to the functionality that MyCoRe already
implements. The class \emph{ProcessCorpusServlet.java} receives a request from
the web interface to process an uploaded file, i.e. a simple text corpus, and
checks whether any of the words are available in the master database. All
words that are not listed in the master database are written to an extra file.
These are the words that have to be annotated manually. At the end, the
servlet sends a response back to the user interface. If all words are
contained in the master, an xml file is generated from the master database
that includes all annotated words of the original corpus. Usually this will
not be the case for larger text files. So if some words are not in the master,
the user will get the response to initiate the manual annotation process.

The manual annotation process is handled by the class
\emph{TagCorpusServlet.java}, which builds a JDOM object for the first word in
the extra file. This is done by creating an object of the
\emph{JDOMorphilo.java} class. This class, in turn, uses the methods of
\emph{AffixStripper.java}, which makes simple, but reasonable, suggestions on
the word structure. The JDOM object is then given as a response back to the
user. It is presented as a form, in which the user can make changes.
This is necessary because the word structure algorithm of
\emph{AffixStripper.java} errs in some cases. Once the user agrees on the
suggestions or on his or her corrections, the JDOM object is saved as an xml
file that is only searchable, visible, and changeable by the authenticated
user (and the administrator); another file containing all processed words is
created or updated respectively, and the \emph{TagCorpusServlet.java} servlet
restarts until the last word in the extra list is processed. This enables the
user to stop and resume her or his annotation work at a later point in time.
The \emph{TagCorpusServlet} calls methods from
\emph{ProcessCorpusServlet.java} to adjust the content of the extra file
harboring the untagged words. If this file is empty, and only then, it is
replaced by the file comprising all words from the original text file, both
the ones from the master database and the ones annotated by the user, in an
annotated xml representation.

Each time \emph{ProcessCorpusServlet.java} is instantiated, it also
instantiates \emph{QualityControl.java}. This class checks whether a new word
can be transferred to the master database. The algorithm can be freely adapted
to higher or lower quality standards. In its present configuration, a method
tests whether more than 20 different registered users agree on the annotation
of the same word. More specifically, if more than 20 JDOM objects are
identical except in the attribute field \emph{occurrences} in the metadata
node, the JDOM object becomes part of the master. The latter is easily done by
changing the attribute \emph{creator} from the user name to
\emph{``administrator''} in the service node. This makes the dataset part of
the master database. Moreover, the \emph{occurrences} attribute is updated by
adding up all occurrences of the word that stem from different text corpora of
the same time range.
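The promotion rule just described can be condensed into a few lines. The sketch below is illustrative and not the actual \emph{QualityControl} code; in particular, the canonical annotation key (a segmentation string that omits the \emph{occurrences} field) is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the promotion rule: a word joins the master database once more
// than 20 annotations are identical, where occurrence counts are ignored
// (modeled here by comparing canonical keys that omit that field).
// Illustrative only; not the actual QualityControl implementation.
public class PromotionRuleSketch {
    static final int THRESHOLD = 20;

    public static boolean promote(List<String> annotationKeys, String candidateKey) {
        int agreeing = 0;
        for (String key : annotationKeys) {
            if (key.equals(candidateKey)) {
                agreeing++;
            }
        }
        return agreeing > THRESHOLD;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int user = 0; user < 21; user++) {
            keys.add("com|fort|able");          // 21 users, same segmentation
        }
        System.out.println(promote(keys, "com|fort|able")); // true
        System.out.println(promote(keys, "comfort|able"));  // false
    }
}
```

Raising or lowering \texttt{THRESHOLD} is the single knob that corresponds to the "higher or lower quality standards" mentioned above.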
\begin{landscape}
\begin{figure}
\centering
\includegraphics[scale=0.55]{morphilo_uml.png}
\caption{Class Diagram Morphilo}
\label{fig:classDiag}
\end{figure}
\end{landscape}


\section{Conceptualization}
\label{\detokenize{source/controller:conceptualization}}
The controller component is largely specified and ready to use in some hundred
or so java classes handling the logic of the search, such as indexing, but
also dealing with directories and files, i.e. saving, creating, deleting, and
updating files. Moreover, a rudimentary user management comprising different
roles and rights is offered. The basic technology behind the controller's
logic is the servlet. As such, all new code has to be registered as a servlet
in the web-fragment.xml (here of the Apache Tomcat container), as listing
\ref{lst:webfragment} shows.

\begin{lstlisting}[language=XML,caption={Servlet Registering in the web-fragment.xml (excerpt)},label=lst:webfragment,escapechar=|]
<servlet>
  <servlet-name>ProcessCorpusServlet</servlet-name>
  <servlet-class>custom.mycore.addons.morphilo.ProcessCorpusServlet</servlet-class>
</servlet>
<servlet-mapping>
  <servlet-name>ProcessCorpusServlet</servlet-name>
  <url-pattern>/servlets/object/process</url-pattern>|\label{ln:process}|
</servlet-mapping>
<servlet>
  <servlet-name>TagCorpusServlet</servlet-name>
  <servlet-class>custom.mycore.addons.morphilo.TagCorpusServlet</servlet-class>
</servlet>
<servlet-mapping>
  <servlet-name>TagCorpusServlet</servlet-name>
  <url-pattern>/servlets/object/tag</url-pattern>|\label{ln:tag}|
</servlet-mapping>
\end{lstlisting}

Now, the logic has to be extended by the specifications analyzed in chapter
\ref{chap:concept} on conceptualization. More specifically, some classes have
to be added that take care of analyzing words (\emph{AffixStripper.java,
InflectionEnum.java, SuffixEnum.java, PrefixEnum.java}), extracting the
relevant words from the text and checking the uniqueness of the text
(\emph{ProcessCorpusServlet.java}), making reasonable suggestions on the
annotation (\emph{TagCorpusServlet.java}), building the object of each
annotated word (\emph{JDOMorphilo.java}), and checking the quality by applying
statistical models (\emph{QualityControl.java}).


\section{Implementation}
\label{\detokenize{source/controller:implementation}}
Having taken a bird's eye perspective in the previous section, it is now time
to look at the specific implementation at the level of methods.
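Before descending to the level of methods, the kind of suggestion \emph{AffixStripper.java} is supposed to make can be illustrated with a toy version. The affix inventories and the greedy matching strategy below are assumptions of this sketch, not the actual Morphilo implementation.

```java
import java.util.Arrays;
import java.util.List;

// Toy affix stripper: suggests a [prefix, root, suffix] split by greedily
// matching one known prefix and one known suffix. The affix lists and the
// strategy are illustrative assumptions, not the Morphilo AffixStripper.
public class AffixStripperSketch {
    static final List<String> PREFIXES = Arrays.asList("un", "con", "com", "dis");
    static final List<String> SUFFIXES = Arrays.asList("able", "ness", "ing", "ly");

    public static String[] suggest(String word) {
        String prefix = "", suffix = "", root = word;
        for (String p : PREFIXES) {
            if (root.startsWith(p)) {
                prefix = p;
                root = root.substring(p.length());
                break;
            }
        }
        for (String s : SUFFIXES) {
            if (root.endsWith(s)) {
                suffix = s;
                root = root.substring(0, root.length() - s.length());
                break;
            }
        }
        return new String[] { prefix, root, suffix };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(suggest("comfortable"))); // [com, fort, able]
    }
}
```

Precisely because such a greedy heuristic errs (an apparent affix may be part of the root), the result is only ever presented as a suggestion that the user can correct in the annotation form.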
Starting with the main servlet, \emph{ProcessCorpusServlet.java}, the class
defines five getter methods:
\renewcommand{\labelenumi}{(\theenumi)}
\begin{enumerate}
\item\label{itm:geturl} public String getURLParameter(MCRServletJob, String)
\item\label{itm:getcorp} public String getCorpusMetadata(MCRServletJob, String)
\item\label{itm:getcont} public ArrayList\textless{}String\textgreater{} getContentFromFile(MCRServletJob, String)
\item\label{itm:getderiv} public Path getDerivateFilePath(MCRServletJob, String)
\item\label{itm:now} public int getNumberOfWords(MCRServletJob, String)
\end{enumerate}
Since each servlet in MyCoRe extends the class MCRServlet, it has access to
MCRServletJob, from which the http requests and responses can be used. This is
the first argument in the above methods. The second argument of method
(\ref{itm:geturl}) specifies the name of a url parameter, i.e. the object id
or the id of the derivate. The method returns the value of the given
parameter. Typically, MyCoRe uses the url to exchange these ids. The second
method provides us with the value of a data field in the xml document, so the
string defines the name of an attribute. Method (\ref{itm:getcont}),
\emph{getContentFromFile(MCRServletJob, String)}, returns the words of a file
as a list when given the file name as a string. The getter listed in
(\ref{itm:getderiv}) returns the path from the MyCoRe repository when the name
of the file is specified. And finally, method (\ref{itm:now}) returns the
number of words by simply returning
\emph{getContentFromFile(job, fileName).size()}.

There are two methods in every MyCoRe servlet that have to be overwritten:
\emph{protected void render(MCRServletJob, Exception)}, which redirects the
requests as \emph{POST} or \emph{GET} responses, and
\emph{protected void think(MCRServletJob)}, in which the logic is implemented.
Since the latter is important to understand the core idea of the Morphilo
algorithm, it is displayed in full length in source code \ref{src:think}.

\begin{lstlisting}[language=java,caption={The overwritten think method},label=src:think,escapechar=|]
protected void think(MCRServletJob job) throws Exception
{
    this.job = job;
    String dateFromCorp = getCorpusMetadata(job, "def.datefrom");
    String dateUntilCorp = getCorpusMetadata(job, "def.dateuntil");
    String corpID = getURLParameter(job, "objID");
    String derivID = getURLParameter(job, "id");

    //if NoW is 0, fill with anzWords
    MCRObject helpObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(corpID));|\label{ln:bugfixstart}|
    Document jdomDocHelp = helpObj.createXML();
    XPathFactory xpfacty = XPathFactory.instance();
    XPathExpression<Element> xpExp = xpfacty.compile("//NoW", Filters.element());
    Element elem = xpExp.evaluateFirst(jdomDocHelp);
    //fixes transferred morphilo data from a previous stand-alone project
    int corpussize = getNumberOfWords(job, "");
    if (Integer.parseInt(elem.getText()) != corpussize)
    {
        elem.setText(Integer.toString(corpussize));
        helpObj = new MCRObject(jdomDocHelp);
        MCRMetadataManager.update(helpObj);
    }|\label{ln:bugfixend}|

    //check if the uploaded corpus was processed before
    SolrClient slr = MCRSolrClientFactory.getSolrClient();|\label{ln:solrstart}|
    SolrQuery qry = new SolrQuery();
    qry.setFields("korpusname", "datefrom", "dateuntil", "NoW", "id");
    qry.setQuery("datefrom:" + dateFromCorp + " AND dateuntil:" + dateUntilCorp + " AND NoW:" + corpussize);
    SolrDocumentList rslt = slr.query(qry).getResults();|\label{ln:solrresult}|

    Boolean incrOcc = true;
    // if the result set contains only one hit, it must be the newly created corpus
    if (slr.query(qry).getResults().getNumFound() > 1)
    {
        incrOcc = false;
    }|\label{ln:solrend}|

    //match all words in the corpus with morphilo (creator=administrator) and
    //save all words that are not in the morphilo DB in leftovers
    ArrayList<String> leftovers = new ArrayList<String>();
    ArrayList<String> processed = new ArrayList<String>();

    leftovers = getUnknownWords(getContentFromFile(job, ""), dateFromCorp, dateUntilCorp, "", incrOcc, incrOcc, false);|\label{ln:callkeymeth}|

    //write all leftover words to a file as a derivate of the respective corpmeta dataset
    MCRPath root = MCRPath.getPath(derivID, "/");|\label{ln:filesavestart}|
    Path fn = getDerivateFilePath(job, "").getFileName();
    Path p = root.resolve("untagged-" + fn);
    Files.write(p, leftovers);|\label{ln:filesaveend}|

    //create a file for all words that were processed
    Path procWds = root.resolve("processed-" + fn);
    Files.write(procWds, processed);
}
\end{lstlisting}
Using the above mentioned getter methods, the \emph{think} method assigns
values to the object ID, needed to get the xml document that contains the
corpus meta data, to the file ID, and to the beginning and ending dates of the
corpus to be analyzed. Lines \ref{ln:bugfixstart} through \ref{ln:bugfixend}
show how to access a MyCoRe object as an xml document, a procedure that will
be used in different variants throughout this implementation. By means of the
object ID, the respective corpus is identified and a JDOM document is
constructed, which can then be accessed by XPath. The XPath factory instances
are collections of the xml nodes. In the present case, it is safe to assume
that only one element \emph{NoW} is available (see the corpus data model in
listing \ref{lst:corpusdatamodel} with \emph{maxOccurs='1'}).
So we do not have to loop through the collection, but can use the first node
named \emph{NoW}. The if-test checks whether the number of words of the
uploaded file is the same as the number written in the document. When the
document is initially created by the MyCoRe logic, this number is configured
to be zero. If the numbers are unequal, the setText(String) method is used to
write the number of words of the corpus to the document.

Lines \ref{ln:solrstart}--\ref{ln:solrend} reveal the second important
ingredient, i.e. controlling the search engine. First, a solr client and a
query are initialized. Then, the output of the result set is defined by giving
the fields of interest of the document. In the case at hand, these are the id,
the name of the corpus, the number of words, and the beginning and ending
dates. With \emph{setQuery} it is possible to assign values to some or all of
these fields. Finally, \emph{getResults()} carries out the search and writes
all hits to a \emph{SolrDocumentList} (line \ref{ln:solrresult}). The test
that follows merely sets a Boolean encoding whether the number of occurrences
of a word in the master should be updated. To avoid multiple counts,
incrementing the word frequency is only done if the corpus is new.

In line \ref{ln:callkeymeth},
\emph{getUnknownWords(ArrayList, String, String, String, Boolean, Boolean,
Boolean)} is called and a list of words is returned. This method is key and
will be discussed in depth below. Finally, lines
\ref{ln:filesavestart}--\ref{ln:filesaveend} show how to handle file objects
in MyCoRe. Using the file ID, the root path and the name of the first file in
that path are identified. Then, a second file starting with ``untagged'' is
created and all words returned from \emph{getUnknownWords} are written to that
file.
By the same token, an empty file is created (in the last two lines of the
\emph{think} method), in which all words that are manually annotated will be
saved.

In a refactoring phase, the method
\emph{getUnknownWords(ArrayList, String, String, String, Boolean, Boolean,
Boolean)} could be subdivided into three methods, one for each Boolean
parameter. In fact, this method handles more than one task, mainly to avoid
duplicated code. In essence, an outer loop runs through all words of the
corpus and an inner loop runs through all hits in the solr result set. Because
the result set is supposed to be small, approximately between 10 and 20 items,
efficiency problems are unlikely, although there are some more loops running
through collections of about the same size. Since each word is identified on
the basis of its projected word type, the word form, and the time range it
falls into, it is these variables that have to be checked for existence in the
documents. If they are not in the xml documents, \emph{null} is returned and
needs to be corrected. Moreover, user authentication must be considered. There
are three different XPaths that are relevant.
\begin{itemize}
\item[-] \emph{//service/servflags/servflag[@type='createdby']} to test for the correct user
\item[-] \emph{//morphiloContainer/morphilo} to create the annotated document
\item[-] \emph{//morphiloContainer/morphilo/w} to set occurrences or add a link
\end{itemize}

As an illustration of the core functioning of this method, listing
\ref{src:getUnknowWords} is given.
\begin{lstlisting}[language=java,caption={Mode of Operation of getUnknownWords Method},label=src:getUnknowWords,escapechar=|]
public ArrayList<String> getUnknownWords(
        ArrayList<String> corpus,
        String timeCorpusBegin,
        String timeCorpusEnd,
        String wdtpe,
        Boolean setOcc,
        Boolean setXlink,
        Boolean writeAllData) throws Exception
{
    String currentUser = MCRSessionMgr.getCurrentSession().getUserInformation().getUserID();
    ArrayList lo = new ArrayList();

    for (int i = 0; i < corpus.size(); i++)
    {
        SolrClient solrClient = MCRSolrClientFactory.getSolrClient();
        SolrQuery query = new SolrQuery();
        query.setFields("w", "occurrence", "begin", "end", "id", "wordtype");
        query.setQuery(corpus.get(i));
        query.setRows(50); //more than 50 items are extremely unlikely
        SolrDocumentList results = solrClient.query(query).getResults();
        Boolean available = false;
        for (int entryNum = 0; entryNum < results.size(); entryNum++)
        {
            ...
            // update in MCRMetaDataManager
            String mcrIDString = results.get(entryNum).getFieldValue("id").toString();
            //read the MCRObject and create a JDOM document
            MCRObject mcrObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(mcrIDString));
            Document jdomDoc = mcrObj.createXML();
            ...
            //check and correction for word type
            ...
            //check and correction for time: timeCorrect
            ...
            //check if the user is correct: isAuthorized
            ...
            XPathExpression<Element> xp = xpfac.compile("//morphiloContainer/morphilo/w", Filters.element());
            //iterate the w elements and increment the occurrence attribute if setOcc is true
            for (Element e : xp.evaluate(jdomDoc))
            {
                //if rights are granted and the word type is given nowhere or is equal
                if (isAuthorized && timeCorrect
                    && ((e.getAttributeValue("wordtype") == null && wdtpe.equals(""))
                        || e.getAttributeValue("wordtype").equals(wordtype)))
                {
                    int oc = -1;
                    available = true;|\label{ln:available}|
                    try
                    {
                        //adjust the occurrence attribute
                        if (setOcc)
                        {
                            oc = Integer.parseInt(e.getAttributeValue("occurrence"));
                            e.setAttribute("occurrence", Integer.toString(oc + 1));
                        }
                        //write the morphilo ObjectID into the xml of corpmeta
                        if (setXlink)
                        {
                            Namespace xlinkNamespace = Namespace.getNamespace("xlink", "http://www.w3.org/1999/xlink");|\label{ln:namespace}|
                            MCRObject corpObj = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(getURLParameter(job, "objID")));
                            Document corpDoc = corpObj.createXML();
                            XPathExpression<Element> xpathEx = xpfac.compile("//corpuslink", Filters.element());
                            Element elm = xpathEx.evaluateFirst(corpDoc);
                            elm.setAttribute("href", mcrIDString, xlinkNamespace);
                        }
                        mcrObj = new MCRObject(jdomDoc);|\label{ln:updatestart}|
                        MCRMetadataManager.update(mcrObj);
                        QualityControl qc = new QualityControl(mcrObj);|\label{ln:updateend}|
                    }
                    catch (NumberFormatException except)
                    {
                        // ignore
                    }
                }
            }
        }
        if (!available) // if not available in the datasets under the given conditions|\label{ln:notavailable}|
        {
            lo.add(corpus.get(i));
        }
    }
    return lo;
}
\end{lstlisting}
As can be seen from listing \ref{src:getUnknowWords}, getting the unknown
words of a corpus is rather a side effect of the equally named method. More
precisely, a Boolean (line \ref{ln:available}) is set when the document is
manipulated otherwise, because it is clear that the word must exist in that
case. If the Boolean remains false (line \ref{ln:notavailable}), the word is
put on the list of words that have to be annotated manually. As already
explained above, the first loop runs through all words of the corpus, and in
the following lines a solr result set is created. This set is also looped
through, and it is checked whether the time range and the word type match and
whether the user is authorized. In the remainder, the occurrence attribute of
the morphilo document can be incremented (setOcc is true) and/or the word is
linked to the corpus meta data (setXlink is true). While all code lines are
equivalent to what was explained for listing \ref{src:think}, one additional
name space, ``xlink'', has to be defined (line \ref{ln:namespace}). Once the
linking of word and corpus is set, the entire MyCoRe object has to be updated.
This is done by the functionality of the framework (lines
\ref{ln:updatestart}--\ref{ln:updateend}). At the end, an instance of
\emph{QualityControl} is created.

The class \emph{QualityControl} is instantiated with the constructor depicted
in listing \ref{src:constructQC}.
\begin{lstlisting}[language=java,caption={Constructor of QualityControl.java},label=src:constructQC,escapechar=|]
private MCRObject mycoreObject;
/* Constructor calls method to carry out quality control, i.e. checks if at
 * least 20 different users agree 100% on the segments of the word under
 * investigation
 */
public QualityControl(MCRObject mycoreObject) throws Exception
{
    this.mycoreObject = mycoreObject;
    if (getEqualObjectNumber() > 20)
    {
        addToMorphiloDB();
    }
}
\end{lstlisting}
The constructor takes a MyCoRe object, a potential word candidate for the
master database, which is assigned to a private class variable because the
object is used, though not changed, by some other java methods. More
importantly, there are two further methods: \emph{getEqualObjectNumber()} and
\emph{addToMorphiloDB()}. While the former initiates a process of counting and
comparing objects, the latter is concerned with calculating the correct number
of occurrences from different, but not identical, texts, and with generating a
MyCoRe object with the same content but with two different flags in the
\emph{//service/servflags/servflag} node, i.e.
\emph{createdby='administrator'} and \emph{state='published'}. And of course,
the \emph{occurrence} attribute is set to the newly calculated value. The
logic corresponds exactly to what was explained for listing \ref{src:think}
and will not be repeated here. The only difference are the paths compiled by
the XPathFactory.
They are
+\begin{itemize}
+\item[-] \emph{//service/servflags/servflag[@type='createdby']} and
+\item[-] \emph{//service/servstates/servstate[@classid='state']}.
+\end{itemize}
+It is more instructive to document how the number of occurrences is calculated. Two steps are involved. First, a list of all MyCoRe objects that are
+equal to the object the class was instantiated with (``mycoreObject'' in listing \ref{src:constructQC}) is created. This list is looped, and all occurrence
+attributes are summed up. Second, all occurrences from equal texts are subtracted. Equal texts are identified on the basis of their metadata and their derivates.
+There are some obvious shortcomings of this approach, which will be discussed in chapter \ref{chap:results}, section \ref{sec:improv}. Here it suffices to
+understand the mode of operation. Listing \ref{src:equalOcc} shows a possible solution.
+\begin{lstlisting}[language=java,caption={Occurrence Extraction from Equal Texts (1)},label=src:equalOcc,escapechar=|]
+/* returns the number of occurrences if the objects are equal, zero otherwise */
+private int getOccurrencesFromEqualTexts(MCRObject mcrobj1, MCRObject mcrobj2) throws SAXException, IOException
+{
+    int occurrences = 1;
+    //extract corpmeta ObjectIDs from the morphilo objects
+    String crpID1 = getAttributeValue("//corpuslink", "href", mcrobj1);
+    String crpID2 = getAttributeValue("//corpuslink", "href", mcrobj2);
+    //get these two corpmeta objects
+    MCRObject corpo1 = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(crpID1));
+    MCRObject corpo2 = MCRMetadataManager.retrieveMCRObject(MCRObjectID.getInstance(crpID2));
+    //are the texts equal? get the list of 'processed-words' derivates
+    String corp1DerivID = getAttributeValue("//structure/derobjects/derobject", "href", corpo1);
+    String corp2DerivID = getAttributeValue("//structure/derobjects/derobject", "href", corpo2);
+
+    ArrayList<String> result = new ArrayList<String>(getContentFromFile(corp1DerivID, ""));|\label{ln:writeContent}|
+    result.removeAll(getContentFromFile(corp2DerivID, ""));|\label{ln:removeContent}|
+    if (result.size() == 0) // the texts are equal
+    {
+        // extract the occurrences of one of the objects
+        occurrences = Integer.parseInt(getAttributeValue("//morphiloContainer/morphilo/w", "occurrence", mcrobj1));
+    }
+    else
+    {
+        occurrences = 0; //project metadata happened to be the same, but the texts are different
+    }
+    return occurrences;
+}
+\end{lstlisting}
+In this implementation, the ids of the \emph{corpmeta} data model are accessed via the xlink attribute in the morphilo documents.
+The method \emph{getAttributeValue(String, String, MCRObject)} does exactly the same as demonstrated earlier (see from line \ref{ln:namespace}
+onwards in listing \ref{src:getUnknowWords}). The underlying logic is that two texts are equal if exactly the same words were uploaded.
+So all words from one file are written to a list (line \ref{ln:writeContent}), and all words contained in the other file are removed from the
+very same list (line \ref{ln:removeContent}). If the list is empty afterwards, both files must have contained the same words, and the occurrences
+are adjusted accordingly. Since this method is called from another private method that merely loops through all equal objects, one obtains
+the occurrences from all equal texts.
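The equality check just described (copy all words of one derivate into a list, remove those of the other, test for emptiness) can be illustrated in isolation. The following stand-alone sketch substitutes plain word lists for the derivate files; the class name EqualTextCheck and the sample words are made up for illustration and are not part of the thesis code. Note that ArrayList.removeAll performs the element-wise removal the text describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EqualTextCheck {
    /* Two word lists count as coming from the same text if removing all
     * words of the second list from a copy of the first leaves nothing
     * behind (duplicate counts are ignored, as in the method above). */
    static boolean sameText(List<String> words1, List<String> words2) {
        List<String> result = new ArrayList<>(words1);
        result.removeAll(words2); // element-wise removal
        return result.isEmpty();
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("ofer", "stan", "faran");
        List<String> b = Arrays.asList("faran", "ofer", "stan");
        System.out.println(sameText(a, b)); // order does not matter
        System.out.println(sameText(a, Arrays.asList("ofer", "faran")));
    }
}
```

As the second call shows, a missing word makes the remainder list non-empty, which corresponds to the `occurrences = 0` branch above.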
For the sake of verifiability, the looping method is also given:
+\begin{lstlisting}[language=java,caption={Occurrence Extraction from Equal Texts (2)},label=src:equalOcc2,escapechar=|]
+private int getOccurrencesFromEqualTexts() throws Exception
+{
+    ArrayList<MCRObject> equalObjects = new ArrayList<MCRObject>();
+    equalObjects = getAllEqualMCRObjects();
+    int occurrences = 0;
+    for (MCRObject obj : equalObjects)
+    {
+        occurrences = occurrences + getOccurrencesFromEqualTexts(mycoreObject, obj);
+    }
+    return occurrences;
+}
+\end{lstlisting}
+
+Now, the constructor in listing \ref{src:constructQC} reveals another method that rolls out an equally complex concatenation of procedures.
+As implied above, \emph{getEqualObjectNumber()} returns the number of equally annotated words. It does so by falling back on another
+method, from whose returned list the size is calculated (\emph{getAllEqualMCRObjects().size()}). Hence, we should take a look at
+\emph{getAllEqualMCRObjects()}. This method has essentially the same design as \emph{int getOccurrencesFromEqualTexts()} in listing \ref{src:equalOcc2}.
+The difference is that another method (\emph{Boolean compareMCRObjects(MCRObject, MCRObject, String)}) is used within the loop and
+that all equal objects are put into the list of MyCoRe objects that is returned. If this list comprises more than 20
+entries,\footnote{This number is somewhat arbitrary. It is inspired by the sample size $n$ in t-distributed data.} the respective document
+is integrated into the master database by the process described above.
+The comparator logic is shown in listing \ref{src:compareMCR}.
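The method getAllEqualMCRObjects() itself is not reproduced in this chapter. Following the description above, its loop-and-filter design can be sketched generically; the type parameter T stands in for MCRObject and the predicate for compareMCRObjects, so the sketch below is an illustrative assumption, not the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

public class EqualObjectCollector {
    /* Generic sketch of the design described above: loop over all
     * candidates and keep those that the comparator accepts. */
    static <T> List<T> allEqualObjects(T reference, List<T> candidates,
                                       BiPredicate<T, T> comparator) {
        List<T> equal = new ArrayList<>();
        for (T candidate : candidates) {
            if (comparator.test(reference, candidate)) {
                equal.add(candidate);
            }
        }
        return equal;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("stan", "stan", "ston");
        int n = allEqualObjects("stan", candidates, String::equals).size();
        // quality-control gate: only documents with more than 20 equal
        // annotations are promoted to the master database
        System.out.println(n > 20 ? "publish" : "keep in user space");
    }
}
```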
+\begin{lstlisting}[language=java,caption={Comparison of MyCoRe objects},label=src:compareMCR,escapechar=|]
+private Boolean compareMCRObjects(MCRObject mcrobj1, MCRObject mcrobj2, String xpath) throws SAXException, IOException
+{
+    Boolean isEqual = false;
+    Boolean beginTime = false;
+    Boolean endTime = false;
+    Boolean occDiff = false;
+    Boolean corpusDiff = false;
+
+    String source = getXMLFromObject(mcrobj1, xpath);
+    String target = getXMLFromObject(mcrobj2, xpath);
+
+    XMLUnit.setIgnoreAttributeOrder(true);
+    XMLUnit.setIgnoreComments(true);
+    XMLUnit.setIgnoreDiffBetweenTextAndCDATA(true);
+    XMLUnit.setIgnoreWhitespace(true);
+    XMLUnit.setNormalizeWhitespace(true);
+
+    //differences in occurrence, end, begin should be ignored
+    try
+    {
+        Diff xmlDiff = new Diff(source, target);
+        DetailedDiff dd = new DetailedDiff(xmlDiff);
+        //counters for differences
+        int i = 0;
+        int j = 0;
+        int k = 0;
+        int l = 0;
+        // list containing all differences
+        List differences = dd.getAllDifferences();|\label{ln:difflist}|
+        for (Object object : differences)
+        {
+            Difference difference = (Difference) object;
+            //@begin, @end, ... nodes are not in the difference list if the count is 0
+            if (difference.getControlNodeDetail().getXpathLocation().endsWith("@begin")) i++;|\label{ln:diffbegin}|
+            if (difference.getControlNodeDetail().getXpathLocation().endsWith("@end")) j++;
+            if (difference.getControlNodeDetail().getXpathLocation().endsWith("@occurrence")) k++;
+            if (difference.getControlNodeDetail().getXpathLocation().endsWith("@corpus")) l++;|\label{ln:diffend}|
+            //@begin and @end have different values: check whether they fall within the allowed time range
+            if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
+                && difference.getControlNodeDetail().getXpathLocation().endsWith("@begin")
+                && (Integer.parseInt(difference.getControlNodeDetail().getValue()) < Integer.parseInt(difference.getTestNodeDetail().getValue())))
+            {
+                beginTime = true;
+            }
+            if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
+                && difference.getControlNodeDetail().getXpathLocation().endsWith("@end")
+                && (Integer.parseInt(difference.getControlNodeDetail().getValue()) > Integer.parseInt(difference.getTestNodeDetail().getValue())))
+            {
+                endTime = true;
+            }
+            //attribute values of @occurrence and @corpus are ignored if they are different
+            if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
+                && difference.getControlNodeDetail().getXpathLocation().endsWith("@occurrence"))
+            {
+                occDiff = true;
+            }
+            if (difference.getControlNodeDetail().getXpathLocation().equals(difference.getTestNodeDetail().getXpathLocation())
+                && difference.getControlNodeDetail().getXpathLocation().endsWith("@corpus"))
+            {
+                corpusDiff = true;
+            }
+        }
+        //if any of @begin, @end, ... is identical, set the respective Boolean to true
+        if (i == 0) beginTime = true;|\label{ln:zerobegin}|
+        if (j == 0) endTime = true;
+        if (k == 0) occDiff = true;
+        if (l == 0) corpusDiff = true;|\label{ln:zeroend}|
+        //if the number of differences exceeds the changes admitted in @begin, @end, ... something else must differ
+        if (beginTime && endTime && occDiff && corpusDiff && (i + j + k + l) == dd.getAllDifferences().size()) isEqual = true;|\label{ln:diffsum}|
+    }
+    catch (SAXException e)
+    {
+        e.printStackTrace();
+    }
+    catch (IOException e)
+    {
+        e.printStackTrace();
+    }
+    return isEqual;
+}
+\end{lstlisting}
+In this method, XMLUnit is heavily used to carry out all necessary node comparisons. The matter becomes more complicated, however, when some attributes
+are not simply ignored but evaluated according to a given definition, as is the case for the time range. If the evaluator and builder classes are
+not to be overwritten entirely, because they are needed for evaluating other nodes of the
+xml document, the above solution appears a bit awkward. So there is potential for improvement before the production version is programmed.
+
+XMLUnit provides us with a
+list of the differences between the two documents (see line \ref{ln:difflist}). Four differences are tolerated, namely in the attributes \emph{occurrence},
+\emph{corpus}, \emph{begin}, and \emph{end}. For each of them a Boolean variable is set. Because any of these attributes could also be equal to the master
+document, and the difference list only contains the actual differences, one has to find a way to cover both cases, equal and different, for these attributes.
+This could be done by ignoring the respective nodes. Yet this would not include testing whether the beginning and ending dates fall into the range of the master
+document. Therefore the attributes are counted, as lines \ref{ln:diffbegin} through \ref{ln:diffend} reveal. If two documents
+differ in nothing but the four attributes just specified, the sum of the counters (line \ref{ln:diffsum}) equals the number of differences
+collected by XMLUnit; any further difference makes the sum fall short of that number. The remaining if-tests assign truth values to the respective
+Booleans.
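Reduced to its core, the counting scheme of the comparator can be replayed without XMLUnit. The sketch below operates on plain XPath location strings, as they would be reported in the difference list, and deliberately omits the additional time-range validation of @begin and @end; the class name DifferenceFilter is made up for illustration.

```java
import java.util.List;

public class DifferenceFilter {
    /* Core of the comparator's counting scheme: two documents count as
     * equal if every reported difference concerns one of the four
     * tolerated attributes (the time-range check is omitted here). */
    static boolean isEqual(List<String> differenceXPaths) {
        int begin = 0, end = 0, occ = 0, corpus = 0;
        for (String xpath : differenceXPaths) {
            if (xpath.endsWith("@begin")) begin++;
            if (xpath.endsWith("@end")) end++;
            if (xpath.endsWith("@occurrence")) occ++;
            if (xpath.endsWith("@corpus")) corpus++;
        }
        // any difference outside the four attributes makes the sum fall short
        return begin + end + occ + corpus == differenceXPaths.size();
    }

    public static void main(String[] args) {
        System.out.println(isEqual(List.of("/w[1]/@begin", "/w[1]/@occurrence")));
        System.out.println(isEqual(List.of("/w[1]/@begin", "/w[1]/text()[1]")));
    }
}
```

The second call reports a difference in the word's text node, so the documents are no longer considered equal.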
It is probably worth mentioning that if all counters are zero (lines
+\ref{ln:zerobegin}--\ref{ln:zeroend}), the attributes and their values are identical, and hence the Booleans have to be set explicitly. Otherwise the test in line \ref{ln:diffsum} would fail.
+
+%TagCorpusServlet
+Once quality control (explained in detail further down) has been passed, it is
+the user's turn to interact further. By clicking on the option \emph{Manual tagging}, the \emph{TagCorpusServlet} is called. This servlet instantiates
+\emph{ProcessCorpusServlet} to get access to the \emph{getUnknownWords} method, which delivers the words still to be
+processed and overwrites the content of the file starting with \emph{untagged}. For the next word in \emph{leftovers}, a new MyCoRe object is created
+using the JDOM API and added to the file beginning with \emph{processed}. In line \ref{ln:tagmanu} of listing \ref{src:tagservlet}, the previously defined
+entry mask is called, with which the proposed word structure can be confirmed or changed. How the word structure is determined will be shown later in
+the text.
+\begin{lstlisting}[language=java,caption={Manual Tagging Procedure},label=src:tagservlet,escapechar=|]
+...
+if (!leftovers.isEmpty())
+{
+    ArrayList<String> processed = new ArrayList<String>();
+    //processed.add(leftovers.get(0));
+    JDOMorphilo jdm = new JDOMorphilo();
+    MCRObject obj = jdm.createMorphiloObject(job, leftovers.get(0));|\label{ln:jdomobject}|
+    //write the word to be annotated into the process list and save it
+    Path filePathProc = pcs.getDerivateFilePath(job, "processed").getFileName();
+    Path proc = root.resolve(filePathProc);
+    processed = pcs.getContentFromFile(job, "processed");
+    processed.add(leftovers.get(0));
+    Files.write(proc, processed);
+
+    //call the entry mask for the next word
+    tagUrl = prop.getBaseURL() + "content/publish/morphilo.xed?id=" + obj.getId();|\label{ln:tagmanu}|
+}
+else
+{
+    //initiate the process to deliver a completely tagged file of the original corpus:
+    //if the untagged file is empty, match the original file with the morphilo objects
+    //(creator=administrator OR creator=username) and write the matches to a new file
+    ArrayList<String> complete = new ArrayList<String>();
+    ProcessCorpusServlet pcs2 = new ProcessCorpusServlet();
+    complete = pcs2.getUnknownWords(
+        pcs2.getContentFromFile(job, ""), //main corpus file
+        pcs2.getCorpusMetadata(job, "def.datefrom"),
+        pcs2.getCorpusMetadata(job, "def.dateuntil"),
+        "", //wordtype
+        false,
+        false,
+        true);
+
+    Files.delete(p);
+    MCRXMLFunctions mdm = new MCRXMLFunctions();
+    String mainFile = mdm.getMainDocName(derivID);
+    Path newRoot = root.resolve("tagged-" + mainFile);
+    Files.write(newRoot, complete);
+
+    //return to the menu page
+    tagUrl = prop.getBaseURL() + "receive/" + corpID;
+}
+\end{lstlisting}
+At the point where no more items are in \emph{leftovers}, the \emph{getUnknownWords} method is called
with the last Boolean parameter set to true. This indicates that the array list containing all available data relevant to the respective user is returned, as seen in
+the code snippet in listing \ref{src:writeAll}.
+\begin{lstlisting}[language=java,caption={Code snippet to deliver all data to the user},label=src:writeAll,escapechar=|]
+...
+// all data is written to lo in TEI
+if (writeAllData && isAuthorized && timeCorrect)
+{
+    XPathExpression<Element> xpath = xpfac.compile("//morphiloContainer/morphilo", Filters.element());
+    for (Element e : xpath.evaluate(jdomDoc))
+    {
+        XMLOutputter outputter = new XMLOutputter();
+        outputter.setFormat(Format.getPrettyFormat());
+        lo.add(outputter.outputString(e.getContent()));
+    }
+}
+\end{lstlisting}
+The complete list (\emph{lo}) is written to yet a third file starting with \emph{tagged} and finally returned to the main project webpage.
+
+%JDOMorphilo
+The interesting question is now where the word structure that is filled into the entry mask, as asserted above, comes from.
+In listing \ref{src:tagservlet}, line \ref{ln:jdomobject}, one can see that a JDOM object is created and the method
+\emph{createMorphiloObject(MCRServletJob, String)} is called. The string parameter is the word that needs to be analyzed.
+Most of the method is a mere application of the JDOM API given the data model in chapter \ref{chap:concept}, section
+\ref{subsec:datamodel}, and listing \ref{lst:worddatamodel}. That means namespaces, elements, and their attributes are defined in the correct
+order and hierarchy.
+
+To fill the elements and attributes with text, i.e. prefixes, suffixes, stems, etc., a HashMap (containing each morpheme as
+key and its position as value) is created and filled with the results of an AffixStripper instantiation.
Depending on how many prefixes
+or suffixes are put into the hashmap, the same number of xml elements is created. As a final step, a valid MyCoRe id is generated using
+the existing MyCoRe functionality, the object is created and returned to the TagCorpusServlet.
+
+%AffixStripper explanation
+Finally, the analysis of the word structure will be considered. It is implemented
+in \emph{AffixStripper.java}.
+All lexical affix morphemes and their allomorphs as well as the inflections were extracted from the
+OED\footnote{Oxford English Dictionary \sphinxurl{http://www.oed.com/}} and saved as enumerated lists (see the example in listing \ref{src:enumPref}).
+The allomorphic items of these lists are matched successively against the beginning of a word in the case of prefixes
+(see listing \ref{src:analyzePref}, line \ref{ln:prefLoop}) or against its end in the case of suffixes
+(see listing \ref{src:analyzeSuf}). Since each
+morphemic variant maps to its morpheme right away, it makes sense to use the morpheme and thus
+implicitly keep the relation to its allomorph.
+
+\begin{lstlisting}[language=java,caption={Enumeration Example for the Prefix ``over''},label=src:enumPref,escapechar=|]
+package custom.mycore.addons.morphilo;
+
+public enum PrefixEnum {
+    ...
+    over("over"), ufer("over"), ufor("over"), uferr("over"), uvver("over"), obaer("over"), ober("over"), ofaer("over"),
+    ofere("over"), ofir("over"), ofor("over"), ofer("over"), ouer("over"), oferr("over"), offerr("over"), offr("over"), aure("over"),
+    war("over"), euer("over"), oferre("over"), oouer("over"), oger("over"), ouere("over"), ouir("over"), ouire("over"),
+    ouur("over"), ouver("over"), ouyr("over"), ovar("over"), overe("over"), ovre("over"), ovur("over"), owuere("over"), owver("over"),
+    houyr("over"), ouyre("over"), ovir("over"), ovyr("over"), hover("over"), auver("over"), awver("over"), ovver("over"),
+    hauver("over"), ova("over"), ove("over"), obuh("over"), ovah("over"), ovuh("over"), ofowr("over"), ouuer("over"), oure("over"),
+    owere("over"), owr("over"), owre("over"), owur("over"), owyr("over"), our("over"), ower("over"), oher("over"),
+    ooer("over"), oor("over"), owwer("over"), ovr("over"), owir("over"), oar("over"), aur("over"), oer("over"), ufara("over"),
+    ufera("over"), ufere("over"), uferra("over"), ufora("over"), ufore("over"), ufra("over"), ufre("over"), ufyrra("over"),
+    yfera("over"), yfere("over"), yferra("over"), uuera("over"), ufe("over"), uferre("over"), uuer("over"), uuere("over"),
+    vfere("over"), vuer("over"), vuere("over"), vver("over"), uvvor("over"), ...
+
+    private String morpheme;
+    //constructor
+    PrefixEnum(String morpheme)
+    {
+        this.morpheme = morpheme;
+    }
+    //getter method
+    public String getMorpheme()
+    {
+        return this.morpheme;
+    }
+}
+\end{lstlisting}
+As can be seen in line \ref{ln:prefPutMorph} in listing \ref{src:analyzePref}, the morpheme is saved to a hash
map together with its position, i.e. the current size of the map plus one.
+In line \ref{ln:prefCutoff}, the \emph{analyzePrefix} method is called recursively until no more matches can be made.
+
+\begin{lstlisting}[language=java,caption={Method to recognize prefixes},label=src:analyzePref,escapechar=|]
+private Map<String, Integer> prefixMorpheme = new HashMap<String, Integer>();
+...
+private void analyzePrefix(String restword)
+{
+    if (!restword.isEmpty()) // termination condition for the recursion
+    {
+        for (PrefixEnum prefEnum : PrefixEnum.values())|\label{ln:prefLoop}|
+        {
+            String s = prefEnum.toString();
+            if (restword.startsWith(s))
+            {
+                prefixMorpheme.put(s, prefixMorpheme.size() + 1);|\label{ln:prefPutMorph}|
+                //cut off the prefix that was added to the map
+                analyzePrefix(restword.substring(s.length()));|\label{ln:prefCutoff}|
+            }
+            else
+            {
+                analyzePrefix("");
+            }
+        }
+    }
+}
+\end{lstlisting}
+
+The recognition of suffixes differs only in the cut-off direction, since suffixes occur at the end of a word.
+Hence, in the case of suffixes, line \ref{ln:prefCutoff} of listing \ref{src:analyzePref} reads as follows.
+
+\begin{lstlisting}[language=java,caption={Cut-off mechanism for suffixes},label=src:analyzeSuf,escapechar=|]
+analyzeSuffix(restword.substring(0, restword.length() - s.length()));
+\end{lstlisting}
+
+It is important to note that inflections are suffixes (in the given model case of Middle English morphology) that occur only once and at the very
+end of a word, i.e. after all lexical suffixes. It follows that inflections
+have to be recognized first and without any repetition.
So the procedure for inflections can be simplified
+to a substantial degree, as listing \ref{src:analyzeInfl} shows.
+
+\begin{lstlisting}[language=java,caption={Method to recognize inflections},label=src:analyzeInfl,escapechar=|]
+private String analyzeInflection(String wrd)
+{
+    String infl = "";
+    for (InflectionEnum inflEnum : InflectionEnum.values())
+    {
+        if (wrd.endsWith(inflEnum.toString()))
+        {
+            infl = inflEnum.toString();
+        }
+    }
+    return infl;
+}
+\end{lstlisting}
+
+Unfortunately, the embeddedness problem prevents a very simple algorithm. Embeddedness occurs when one lexical item
+is a substring of another lexical item. To illustrate, the suffix \emph{ion} is also contained in the suffix \emph{ation}, as is
+\emph{ent} in \emph{ment}, and so on. The embeddedness problem cannot be solved completely on the basis of linear modelling, but
+for a large part of the embedded items one can work around it by implicitly using Zipf's law, i.e. the correlation between the frequency
+and the length of lexical items: the longer a word becomes, the less frequently it occurs. The simplest logic derived from this is to prefer
+longer suffixes (measured in letters) over shorter ones, because the longer the suffix string becomes, the more likely it is to constitute one
+(as opposed to several) suffix unit(s). This is done in listing \ref{src:embedAffix}, where
+\emph{sortedByLengthMap}, a map ordered by an anonymous length comparator, returns the affixes sorted by length, and the loop from line \ref{ln:deleteAffix} onwards deletes
+the respective substrings.
+
+\begin{lstlisting}[language=java,caption={Method to work around embeddedness},label=src:embedAffix,escapechar=|]
+private Map<String, Integer> sortOutAffixes(Map<String, Integer> affix)
+{
+    Map<String, Integer> sortedByLengthMap = new TreeMap<String, Integer>(new Comparator<String>()
+    {
+        @Override
+        public int compare(String s1, String s2)
+        {
+            int cmp = Integer.compare(s1.length(), s2.length());
+            return cmp != 0 ? cmp : s1.compareTo(s2);
+        }
+    });
+    sortedByLengthMap.putAll(affix);
+    ArrayList<String> al1 = new ArrayList<String>(sortedByLengthMap.keySet());
+    ArrayList<String> al2 = al1;
+    Collections.reverse(al2);
+    for (String s2 : al1)|\label{ln:deleteAffix}|
+    {
+        for (String s1 : al2)
+        {
+            if (s1.contains(s2) && s1.length() > s2.length())
+            {
+                affix.remove(s2);
+            }
+        }
+    }
+    return affix;
+}
+\end{lstlisting}
+
+Finally, the position of the affix has to be calculated, because the hashmap in line \ref{ln:prefPutMorph} of
+listing \ref{src:analyzePref} does not keep the original order after the changes made to address affix embeddedness
+(listing \ref{src:embedAffix}). Listing \ref{src:affixPos} depicts the preferred solution.
+The recursive construction of the method is similar to \emph{private void analyzePrefix(String)} (listing \ref{src:analyzePref}),
+except that both affix types are handled in one method. For that, an additional parameter taking either the value \emph{suffix}
+or \emph{prefix} is included.
+
+\begin{lstlisting}[language=java,caption={Method to determine the position of the affix},label=src:affixPos,escapechar=|]
+private void getAffixPosition(Map<String, Integer> affix, String restword, int pos, String affixtype)
+{
+    if (!restword.isEmpty()) // termination condition for the recursion
+    {
+        for (String s : affix.keySet())
+        {
+            if (restword.startsWith(s) && affixtype.equals("prefix"))
+            {
+                pos++;
+                prefixMorpheme.put(s, pos);
+                //prefixAllomorph.add(pos-1, restword.substring(s.length()));
+                getAffixPosition(affix, restword.substring(s.length()), pos, affixtype);
+            }
+            else if (restword.endsWith(s) && affixtype.equals("suffix"))
+            {
+                pos++;
+                suffixMorpheme.put(s, pos);
+                //suffixAllomorph.add(pos-1, restword.substring(s.length()));
+                getAffixPosition(affix, restword.substring(0, restword.length() - s.length()), pos, affixtype);
+            }
+            else
+            {
+                getAffixPosition(affix, "", pos, affixtype);
+            }
+        }
+    }
+}
+\end{lstlisting}
+
+To give the complete word structure, the root of a word should also be provided. In listing \ref{src:rootAnalyze} a simple solution is offered, one that nevertheless accommodates compounds, i.e. words consisting of more than one root.
+\begin{lstlisting}[language=java,caption={Method to determine roots},label=src:rootAnalyze,escapechar=|]
+private ArrayList<String> analyzeRoot(Map<String, Integer> pref, Map<String, Integer> suf, int stemNumber)
+{
+    ArrayList<String> root = new ArrayList<String>();
+    int j = 1; //one root always exists
+    // if the word is a compound, several roots exist
+    while (j <= stemNumber)
+    {
+        j++;
+        String rest = lemma;|\label{ln:lemma}|
+
+        for (int i = 0; i < pref.size(); i++)
+        {
+            for (String s : pref.keySet())
+            {
+                //if (i == pref.get(s))
+                if (rest.length() > s.length() && s.equals(rest.substring(0, s.length())))
+                {
+                    rest = rest.substring(s.length(), rest.length());
+                }
+            }
+        }
+
+        for (int i = 0; i < suf.size(); i++)
+        {
+            for (String s : suf.keySet())
+            {
+                //if (i == suf.get(s))
+                if (s.length() < rest.length() && s.equals(rest.substring(rest.length() - s.length(), rest.length())))
+                {
+                    rest = rest.substring(0, rest.length() - s.length());
+                }
+            }
+        }
+        root.add(rest);
+    }
+    return root;
+}
+\end{lstlisting}
+The logic behind this method is that the root is the remainder of a word once all prefixes and suffixes are subtracted.
+So the loops run through the number of prefixes and suffixes at each position and subtract the respective affix. Admittedly, there is
+some code duplication with the previously described methods, which could be eliminated by making the code more modular in a possible
+refactoring phase. Again, this is not the concern of a prototype.
Line \ref{ln:lemma} defines the initial state of a root,
+which is the whole lemma in the case of monomorphemic words. The \emph{lemma} is defined as the word token without its inflection. Listing
+\ref{src:lemmaAnalyze} reveals how this class variable is calculated.
+\begin{lstlisting}[language=java,caption={Method to determine the lemma},label=src:lemmaAnalyze,escapechar=|]
+/*
+ * Simplification: lemma = wordtoken - inflection
+ */
+private String analyzeLemma(String wrd, String infl)
+{
+    return wrd.substring(0, wrd.length() - infl.length());
+}
+\end{lstlisting}
+The constructor of \emph{AffixStripper} calls the method \emph{analyzeWord()},
+whose only job is to calculate each structure element in the correct order
+(listing \ref{src:analyzeWord}). All structure elements are also provided by getters.
+\begin{lstlisting}[language=java,caption={Method to determine the complete word structure},label=src:analyzeWord,escapechar=|]
+private void analyzeWord()
+{
+    //analyze the inflection first because it always occurs at the end of a word
+    inflection = analyzeInflection(wordtoken);
+    lemma = analyzeLemma(wordtoken, inflection);
+    analyzePrefix(lemma);
+    analyzeSuffix(lemma);
+    getAffixPosition(sortOutAffixes(prefixMorpheme), lemma, 0, "prefix");
+    getAffixPosition(sortOutAffixes(suffixMorpheme), lemma, 0, "suffix");
+    prefixNumber = prefixMorpheme.size();
+    suffixNumber = suffixMorpheme.size();
+    wordroot = analyzeRoot(prefixMorpheme, suffixMorpheme, getStemNumber());
+}
+\end{lstlisting}
+
+To conclude, the Morphilo implementation as presented here aims at fulfilling the task of a working prototype. It is important to note
+that it claims to be neither particularly efficient nor a software program ready for production use. However, it marks a crucial milestone
+on the way to a production system.
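The order of steps carried out by analyzeWord() can be replayed in a small self-contained sketch. The three-item affix inventories and the sample word below are invented for illustration; the real inventories are the OED-based enumerations discussed above, and the real implementation additionally tracks affix positions and allomorphs.

```java
import java.util.List;

public class MiniAffixStripper {
    /* Toy replay of the analyzeWord() order of steps with a hypothetical
     * mini-inventory instead of the OED-based enumerations. */
    static final List<String> INFLECTIONS = List.of("est", "es", "e");
    static final List<String> PREFIXES = List.of("over", "un");
    static final List<String> SUFFIXES = List.of("nesse", "ing");

    static String analyze(String wordtoken) {
        // 1. inflection first: it always occurs at the very end of a word
        String inflection = INFLECTIONS.stream()
                .filter(wordtoken::endsWith).findFirst().orElse("");
        // 2. lemma = wordtoken minus inflection
        String lemma = wordtoken.substring(0, wordtoken.length() - inflection.length());
        // 3. strip prefixes from the front and suffixes from the back of the lemma
        String root = lemma;
        for (String p : PREFIXES)
            while (root.startsWith(p)) root = root.substring(p.length());
        for (String s : SUFFIXES)
            while (root.endsWith(s)) root = root.substring(0, root.length() - s.length());
        return inflection + " | " + lemma + " | " + root;
    }

    public static void main(String[] args) {
        System.out.println(analyze("overworkinge")); // inflection | lemma | root
    }
}
```

For the sample word, the sketch strips the inflection, derives the lemma, and peels off one prefix and one suffix, leaving the root.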
For some listings, potential improvements were made explicit; for others no suggestions were made. In the latter
+case this does not imply that there is no potential for improvement. Once acceptability tests have been carried out, it will be the task of a follow-up project
+to identify these potentials and implement them accordingly.
+
+
+\chapter{View}
+\label{\detokenize{source/view::doc}}\label{\detokenize{source/view:view}}
+
+\section{Conceptualization}
+\label{\detokenize{source/view:conceptualization}}
+Lastly, the third directory (\emph{src/main/resources}) contains all code needed
+for rendering the data to be displayed on the screen. It thus corresponds to
+the view in an MVC approach. The rendering is done by xsl files that (unfortunately)
+contain some logic that really belongs to the controller. Thus, the division is
+not as clear as implied in theory. I will discuss this issue more specifically in the
+relevant subsection below. Among the resources are also all images, styles, and
+javascripts.
+
+
+\section{Implementation}
+\label{\detokenize{source/view:implementation}}
+As explained in section \ref{subsec:mvc}, the view component handles the visual
+representation in the form of an interface that allows interaction between
+the user and the task to be carried out by the machine. Since the present case is a
+webservice, all interaction happens via a browser, i.e. webpages are
+visualized and responses are recognized by registering mouse or keyboard
+events. More specifically, a webpage is rendered by transforming xml documents
+into html pages. The MyCoRe repository framework uses Xalan,\footnote{\sphinxurl{http://xalan.apache.org}} an open source
+XSLT processor from Apache. This engine
+transforms document nodes described by the XPath syntax into hypertext, making
+use of a special form of template matching. All templates are collected in
+so-called xml-encoded stylesheets.
Since there are two data models with two different structures, it is good practice to define two stylesheet files, one for each data model.

As a demonstration, listing \ref{lst:morphilostylesheet} below gives a short extract for rendering the word data.

\begin{lstlisting}[language=XML,caption={stylesheet morphilo.xsl},label=lst:morphilostylesheet]
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:xalan="http://xml.apache.org/xalan"
	xmlns:i18n="xalan://org.mycore.services.i18n.MCRTranslation"
	xmlns:acl="xalan://org.mycore.access.MCRAccessManager"
	xmlns:mcr="http://www.mycore.org/" xmlns:xlink="http://www.w3.org/1999/xlink"
	xmlns:mods="http://www.loc.gov/mods/v3"
	xmlns:encoder="xalan://java.net.URLEncoder"
	xmlns:mcrxsl="xalan://org.mycore.common.xml.MCRXMLFunctions"
	xmlns:mcrurn="xalan://org.mycore.urn.MCRXMLFunctions"
	exclude-result-prefixes="xalan xlink mcr i18n acl mods mcrxsl mcrurn encoder"
	version="1.0">
<xsl:param name="MCR.Users.Superuser.UserName"/>
<xsl:template match="/mycoreobject[contains(@ID,'_morphilo_')]">
	<head>
		<link href="{$WebApplicationBaseURL}css/file.css" rel="stylesheet"/>
	</head>
	<div class="row">
		<xsl:call-template name="objectAction">
			<xsl:with-param name="id" select="@ID"/>
			<xsl:with-param name="deriv" select="structure/derobjects/derobject/@xlink:href"/>
		</xsl:call-template>
		<xsl:variable name="objID" select="@ID"/>
		<!-- set heading here -->
		<h1 style="text-indent: 4em;">
			<xsl:if test="metadata/def.morphiloContainer/morphiloContainer/morphilo/w">
				<xsl:value-of select="metadata/def.morphiloContainer/morphiloContainer/morphilo/w/text()[string-length(normalize-space(.))&gt;0]"/>
			</xsl:if>
		</h1>
		<dl class="dl-horizontal">
			<!-- (1) Display word -->
			<xsl:if test="metadata/def.morphiloContainer/morphiloContainer/morphilo/w">
				<dt>
					<xsl:value-of select="i18n:translate('response.page.label.word')"/>
				</dt>
				<dd>
					<xsl:value-of select="metadata/def.morphiloContainer/morphiloContainer/morphilo/w/text()[string-length(normalize-space(.))&gt;0]"/>
				</dd>
			</xsl:if>
			<!-- (2) Display lemma -->
			...
		</dl>
	</div>
</xsl:template>
...
<xsl:template name="objectAction">
...
</xsl:template>
...
</xsl:stylesheet>
\end{lstlisting}
This template matches the root node of each \emph{MyCoRe object}, ensuring that a valid MyCoRe model is used and checking that the document to be processed contains a unique identifier, here a \emph{MyCoRe-ID}, and the name of the correct data model, here \emph{morphilo}. Then, another template, \emph{objectAction}, is called together with two parameters, the ids of the document object and the attached files. In the remainder, all relevant information from the document, such as the word and the lemma, is accessed via XPath, enriched with hypertext annotations, and rendered as a hypertext document. The template \emph{objectAction} is key to understanding the coupling process in the software framework. It is therefore listed separately in listing \ref{lst:objActionTempl}.

\begin{lstlisting}[language=XML,caption={template objectAction},label=lst:objActionTempl,escapechar=|]
<xsl:template name="objectAction">
	<xsl:param name="id" select="./@ID"/>
	<xsl:param name="accessedit" select="acl:checkPermission($id,'writedb')"/>
	<xsl:param name="accessdelete" select="acl:checkPermission($id,'deletedb')"/>
	<xsl:variable name="derivCorp" select="./@label"/>
	<xsl:variable name="corpID" select="metadata/def.corpuslink[@class='MCRMetaLinkID']/corpuslink/@xlink:href"/>
	<xsl:if test="$accessedit or $accessdelete">|\label{ln:ng}|
	<div class="dropdown pull-right">
		<xsl:if test="string-length($corpID) &gt; 0 or $CurrentUser='administrator'">
			<button class="btn btn-default dropdown-toggle" style="margin:10px" type="button" id="dropdownMenu1" data-toggle="dropdown" aria-expanded="true">
				<span class="glyphicon glyphicon-cog" aria-hidden="true"></span> Annotieren
				<span class="caret"></span>
			</button>
		</xsl:if>
		<xsl:if test="string-length($corpID) &gt; 0">|\label{ln:ru}|
			<xsl:variable name="ifsDirectory" select="document(concat('ifs:/',$derivCorp))"/>
			<ul class="dropdown-menu" role="menu" aria-labelledby="dropdownMenu1">
				<li role="presentation">
					|\label{ln:nw1}|<a href="{$ServletsBaseURL}object/tag{$HttpSession}?id={$derivCorp}&amp;objID={$corpID}" role="menuitem" tabindex="-1">|\label{ln:nw2}|
						<xsl:value-of select="i18n:translate('object.nextObject')"/>
					</a>
				</li>
				<li role="presentation">
					<a href="{$WebApplicationBaseURL}receive/{$corpID}" role="menuitem" tabindex="-1">
						<xsl:value-of select="i18n:translate('object.backToProject')"/>
					</a>
				</li>
			</ul>
		</xsl:if>
		<xsl:if test="$CurrentUser='administrator'">
			<ul class="dropdown-menu" role="menu" aria-labelledby="dropdownMenu1">
				<li role="presentation">
					<a role="menuitem" tabindex="-1" href="{$WebApplicationBaseURL}content/publish/morphilo.xed?id={$id}">
						<xsl:value-of select="i18n:translate('object.editWord')"/>
					</a>
				</li>
				<li role="presentation">
					<a href="{$ServletsBaseURL}object/delete{$HttpSession}?id={$id}" role="menuitem" tabindex="-1" class="confirm_deletion option" data-text="Wirklich loeschen">
						<xsl:value-of select="i18n:translate('object.delWord')"/>
					</a>
				</li>
			</ul>
		</xsl:if>
	</div>
	<div class="row" style="margin-left:0px; margin-right:10px">
		<xsl:apply-templates select="structure/derobjects/derobject[acl:checkPermission(@xlink:href,'read')]">
			<xsl:with-param name="objID" select="@ID"/>
		</xsl:apply-templates>
	</div>
	</xsl:if>
</xsl:template>
\end{lstlisting}
The \emph{objectAction} template defines the selection menu that appears, once manual tagging has started, on the upper right-hand side of the webpage. It is entitled \emph{Annotieren} and displays the two options \emph{next word} and \emph{back to project}. The first thing to note here is that in line \ref{ln:ng} a simple test excludes all guest users from accessing the procedure.
After ensuring that only the user who owns the corpus project has access (line \ref{ln:ru}), he or she is able to access the drop-down menu, which is really a URL, e.g. in line \ref{ln:nw1}. The attentive reader may have noticed that this URL exactly matches the definition in the web-fragment.xml shown in listing \ref{lst:webfragment}, line \ref{ln:tag}, which resolves to the respective Java class there. In fact, this mechanism is the data interface within the MVC pattern. The URL also contains two variables, named \emph{derivCorp} and \emph{corpID}, that are needed by the Java classes to identify the corpus and file object (see section \ref{sec:javacode}).

The morphilo.xsl stylesheet contains yet another modification that deserves mention. In listing \ref{lst:derobjectTempl}, line \ref{ln:morphMenu}, two menu options, \emph{Tag automatically} and \emph{Tag manually}, are defined. The former initiates ProcessCorpusServlet.java, as can be seen again in listing \ref{lst:webfragment}, line \ref{ln:process}, which determines the words that are not yet in the master database. Still, it is important to note that this menu option is only displayed if two restrictions are met: first, a file has to be uploaded (line \ref{ln:1test}) and, second, there must be only one file (line \ref{ln:2test}). This is necessary because the annotation process generates further files that store the words not yet processed, as well as a file that includes the final result. The generated files follow a certain naming pattern: the file harboring the final, fully TEI-annotated corpus is prefixed by \emph{tagged}, the other file is prefixed by \emph{untagged}. This circumstance is exploited for manipulating the second option (line \ref{ln:loop}): a loop runs through all files in the respective directory, and if a file name starts with \emph{untagged}, the option to tag manually is displayed.
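The prefix convention just described can be sketched in plain Java (an illustration, not part of the Morphilo sources; the method name and example paths are invented):

```java
public class PrefixDemo {
    // Mirrors the stylesheet test
    //   starts-with(concat($path, ./name), concat($path, 'untagged')):
    // only derivates still containing an "untagged"-prefixed file
    // should offer the manual-tagging option.
    static boolean offerManualTagging(String path, String fileName) {
        return (path + fileName).startsWith(path + "untagged");
    }

    public static void main(String[] args) {
        System.out.println(offerManualTagging("/corpus1/", "untagged-corpus.xml")); // true
        System.out.println(offerManualTagging("/corpus1/", "tagged-corpus.xml"));   // false
    }
}
```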

\begin{lstlisting}[language=XML,caption={template matching derobject},label=lst:derobjectTempl,escapechar=|]
<xsl:template match="derobject" mode="derivateActions">
	<xsl:param name="deriv" />
	<xsl:param name="parentObjID" />
	<xsl:param name="suffix" select="''" />
	<xsl:param name="id" select="../../../@ID" />
	<xsl:if test="acl:checkPermission($deriv,'writedb')">
		<xsl:variable name="ifsDirectory" select="document(concat('ifs:',$deriv,'/'))" />
		<xsl:variable name="path" select="$ifsDirectory/mcr_directory/path" />
		...
		<div class="options pull-right">
			<div class="btn-group" style="margin:10px">
				<a href="#" class="btn btn-default dropdown-toggle" data-toggle="dropdown">
					<i class="fa fa-cog"></i>
					<xsl:value-of select="' Korpus'"/>
					<span class="caret"></span>
				</a>
				<ul class="dropdown-menu dropdown-menu-right">
					<!-- Morphilo adjustments -->|\label{ln:morphMenu}|
					<xsl:if test="string-length($deriv) &gt; 0">|\label{ln:1test}|
						<xsl:if test="count($ifsDirectory/mcr_directory/children/child) = 1">|\label{ln:2test}|
							<li role="presentation">
								<a href="{$ServletsBaseURL}object/process{$HttpSession}?id={$deriv}&amp;objID={$id}" role="menuitem" tabindex="-1">
									<xsl:value-of select="i18n:translate('derivate.process')"/>
								</a>
							</li>
						</xsl:if>
						<xsl:for-each select="$ifsDirectory/mcr_directory/children/child">|\label{ln:loop}|
							<xsl:variable name="untagged" select="concat($path, 'untagged')"/>
							<xsl:variable name="filename" select="concat($path,./name)"/>
							<xsl:if test="starts-with($filename, $untagged)">
								<li role="presentation">
									<a href="{$ServletsBaseURL}object/tag{$HttpSession}?id={$deriv}&amp;objID={$id}" role="menuitem" tabindex="-1">
										<xsl:value-of select="i18n:translate('derivate.taggen')"/>
									</a>
								</li>
							</xsl:if>
						</xsl:for-each>
					</xsl:if>
					...
				</ul>
			</div>
		</div>
	</xsl:if>
</xsl:template>
\end{lstlisting}

Besides the two stylesheets morphilo.xsl and corpmeta.xsl, other stylesheets have to be adjusted. They will not be discussed in detail here, for they are largely self-explanatory.
Essentially, they render the overall layout (\emph{common-layout.xsl}, \emph{skeleton\_layout\_template.xsl}) or the presentation of the search results (\emph{response-page.xsl}) and the definitions of the Solr search fields (\emph{searchfields-solr.xsl}). The former and the latter also inherit templates from \emph{response-general.xsl} and \emph{response-browse.xsl}, in which the navigation bar of the search results can be changed. For multilingual support, a separate configuration directory has to be created containing one \emph{.property} file per language to be displayed. In the current case these are restricted to German and English (\emph{messages\_de.properties} and \emph{messages\_en.properties}). The property files include all \emph{i18n} definitions. All these files are located in the \emph{resources} directory.

Furthermore, a search mask and a page for manually entering the annotations had to be designed. For these files, a specially designed XML standard (\emph{xed}) is recommended within the repository framework.


\chapter{Software Design}
\label{\detokenize{source/architecture::doc}}\label{\detokenize{source/architecture:software-design}}
\begin{figure}
\centering
\includegraphics{architecture.pdf}
\caption{Morphilo Architecture}
\label{fig:architect}
\end{figure}

The architecture of a possible \emph{take-and-share} approach for language resources is visualized in figure \ref{fig:architect}. Because the very gist of the approach becomes clearer when describing a concrete example, the case of annotating lexical derivatives of Middle English and a respective database is given as an illustration. However, any other tool that helps with manual annotations and manages the metadata of a corpus could be substituted here instead.

After inputting an untagged corpus or plain text, it is determined whether the input material was previously annotated by a different user.
This information is usually provided by the metadata administered by the annotation tool; in the case at hand it is called \emph{Morphilizer} in figure \ref{fig:architect}. An alternative is a simple table look-up for all occurring words in the datasets Corpus 1 through Corpus n. If contained completely, the \emph{yes} branch is followed up further; otherwise \emph{no} succeeds. The difference between the two branches is subtle, yet crucial. On both branches, the annotation tool (here \emph{Morphilizer}) is called, which, first, sorts out all words that are not contained in the master database (here \emph{Morphilo-DB}) and, second, makes reasonable suggestions for an optimal annotation of the items. In both cases the annotations are linked to the respective items (e.g. words) in the text, but they are also persistently saved in an extra dataset, i.e. Corpus 1 through n, together with all available metadata.

The difference between the two information streams is that in the \emph{yes} branch a comparison between the newly created dataset and all previous datasets of this text is carried out. Within this unit, all deviations and congruencies are marked and counted. The underlying assumption is that with a growing number of comparable texts the correct annotations approach the true value of a correct annotation while errors level out, provided that the sample size is large enough. What the distribution of errors and correct annotations exactly looks like, and whether a normal distribution can be assumed, is still subject of ongoing research; but independent of the concrete results, the component (called \emph{compare manual annotations} in figure \ref{fig:architect}) allows for specifying the exact form of the sample population. In fact, it is necessary at that point to define the form of the distribution, the sample size, and the rejection region.
The standard settings are a normal distribution, a rejection region of $\alpha = 0.05$, and a sample size of $30$, so that a simple Gauss test can be calculated.

Continuing the information flow further, these statistical calculations are delivered to the quality-control component. Based on the statistics, the respective items together with the metadata, frequencies, and, of course, annotations are written to the master database. All information in the master database is directly used for automated annotations. Thus it is directly matched to the input texts or corpora through the \emph{Morphilizer} tool. Based on the entries looked up in the master database, the annotation tool decides which items are to be manually annotated.

The processes just described are all hidden from the user, who has no way to influence the set quality standards other than through errors in the annotation process. The user will only see the number of items of the input text he or she has to process manually. The annotator will also see an estimation of the workload beforehand. Based on this number, a decision can be made whether to start the annotation at all. It will be possible to interrupt the annotation work and save progress on the server. The user will also have access to the annotations made in the respective dataset, and can correct them or save them and resume later. It is important to note that the user will receive the tagged document only after all items are fully annotated. No partially tagged text can be output.
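The simple Gauss test mentioned above can be sketched as follows (a hedged illustration, not from the Morphilo sources; the class, method names, and sample values are invented, while the defaults match the text: normal distribution, $\alpha = 0.05$, $n = 30$):

```java
public class GaussTestDemo {
    // z = (sample mean - hypothesised mean) / (sigma / sqrt(n))
    static double zStatistic(double mean, double mu0, double sigma, int n) {
        return (mean - mu0) / (sigma / Math.sqrt(n));
    }

    // Reject H0 if |z| exceeds the two-sided critical value for alpha = 0.05.
    static boolean reject(double z) {
        return Math.abs(z) > 1.96;
    }

    public static void main(String[] args) {
        // Invented sample: observed accuracy 0.93, assumed true value 0.90,
        // standard deviation 0.10, sample size 30.
        double z = zStatistic(0.93, 0.90, 0.10, 30);
        System.out.printf("z = %.3f, reject H0: %b%n", z, reject(z));
    }
}
```

With these invented numbers the deviation is not significant, so the annotations would be accepted as consistent with the assumed true value.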


\chapter{Framework}
\label{\detokenize{source/framework:framework}}\label{\detokenize{source/framework::doc}}
\begin{figure}
\centering
\includegraphics[scale=0.33]{mycore_architecture-2.png}
\caption[MyCoRe-Architecture and Components]{MyCoRe-Architecture and Components\protect\footnotemark}
\label{fig:abbMyCoReStruktur}
\end{figure}
\footnotetext{source: \sphinxurl{https://www.mycore.de}}
To tailor the MyCoRe framework, the Morphilo application logic has to be implemented, the TEI data model specified, and the input, search, and output masks programmed.

Three directories are important for adjusting the MyCoRe framework to the needs of one's own application. They correspond essentially to the three components of the MVC model as explicated in section \ref{subsec:mvc}. Roughly, they are visualized in figure \ref{fig:abbMyCoReStruktur} in the upper right-hand corner. More precisely, the view (\emph{Layout} in figure \ref{fig:abbMyCoReStruktur}) and the model layer (\emph{Datenmodell} in figure \ref{fig:abbMyCoReStruktur}) can be configured completely via the ``interface'', which is a directory with a predefined structure and some standard files. For the configuration of the logic, an extra directory is provided (\emph{/src/main/java/custom/mycore/addons/}). Here, all Java classes extending the controller layer should be added. Practically, all three MVC layers are placed in the \emph{src/main/} directory of the application. In one of the subdirectories, \emph{datamodel/def}, the data model specifications are defined as XML files. This parallels the model layer in the MVC pattern. How the data model was defined will be explained in section \ref{subsec:datamodelimpl}.
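A compact summary of this layout (a sketch containing only the paths named in the text, annotated with their MVC roles):

```latex
\begin{lstlisting}[caption={Sketch of the src/main directory layout}]
src/main/
    java/custom/mycore/addons/   -> controller: extending Java classes
    datamodel/def/               -> model: XML data model definitions
    resources/                   -> view: XSL stylesheets, images, styles,
                                     JavaScript, i18n property files
\end{lstlisting}
```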
+ \chapter{Indices and tables} \label{\detokenize{index:indices-and-tables}}\begin{itemize} diff --git a/Morphilo_doc/_build/latex/Morphilo.toc b/Morphilo_doc/_build/latex/Morphilo.toc index 7e4a884667c825a6b4f4bbd4c940955877814136..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 100644 --- a/Morphilo_doc/_build/latex/Morphilo.toc +++ b/Morphilo_doc/_build/latex/Morphilo.toc @@ -1,5 +0,0 @@ -\babel@toc {english}{} -\contentsline {chapter}{\numberline {1}Data Model Implementation}{1}{chapter.1} -\contentsline {chapter}{\numberline {2}Controller Adjustments}{3}{chapter.2} -\contentsline {section}{\numberline {2.1}General Principle of Operation}{3}{section.2.1} -\contentsline {chapter}{\numberline {3}Indices and tables}{5}{chapter.3} diff --git a/Morphilo_doc/source/architecture.pdf b/Morphilo_doc/_static/architecture.pdf similarity index 100% rename from Morphilo_doc/source/architecture.pdf rename to Morphilo_doc/_static/architecture.pdf diff --git a/Morphilo_doc/index.rst b/Morphilo_doc/index.rst index 7af0585f25722f415b298186f3f7d887a80fb685..c7043067b4c65dfe25c731ac0e2deb7221e09da4 100644 --- a/Morphilo_doc/index.rst +++ b/Morphilo_doc/index.rst @@ -12,8 +12,9 @@ Documentation Morphilo Project source/datamodel.rst source/controller.rst - - + source/view.rst + source/architecture.rst + source/framework.rst Indices and tables ================== diff --git a/Morphilo_doc/source/architecture.rst b/Morphilo_doc/source/architecture.rst index 8bc644984b882934c250bc7c4af40bf3167a956b..5b114bda79b5dc2cca9d10df9d300507bf510317 100644 --- a/Morphilo_doc/source/architecture.rst +++ b/Morphilo_doc/source/architecture.rst @@ -1,9 +1,11 @@ Software Design =============== - -The architecture of a possible <em>take-and-share</em>-approach for language +.. image:: architecture.* + + +The architecture of a possible **take-and-share**-approach for language resources is visualized in figure \ref{fig:architect}. 
Because the very gist of the approach becomes clearer if describing a concrete example, the case of annotating lexical derivatives of Middle English and a respective database is