From 225264d59d69b51381fff6df9c0723e354725000 Mon Sep 17 00:00:00 2001 From: Timofey Arkhangelskiy <timarkh@gmail.com> Date: Thu, 2 Feb 2023 16:51:43 +0100 Subject: [PATCH] Added docs. --- .gitignore | 3 +- README.md | 22 ++++++++++--- docs/Makefile | 20 +++++++++++ docs/conf.py | 55 +++++++++++++++++++++++++++++++ docs/configuration.rst | 75 ++++++++++++++++++++++++++++++++++++++++++ docs/faq.rst | 30 +++++++++++++++++ docs/index.rst | 60 +++++++++++++++++++++++++++++++++ docs/make.bat | 35 ++++++++++++++++++++ docs/overview.rst | 20 +++++++++++ requirements.txt | 4 ++- 10 files changed, 318 insertions(+), 6 deletions(-) create mode 100644 docs/Makefile create mode 100644 docs/conf.py create mode 100644 docs/configuration.rst create mode 100644 docs/faq.rst create mode 100644 docs/index.rst create mode 100644 docs/make.bat create mode 100644 docs/overview.rst diff --git a/.gitignore b/.gitignore index 8e7ac9e..3f91f0a 100644 --- a/.gitignore +++ b/.gitignore @@ -37,4 +37,5 @@ docs/_build *.pdf *.min.css.map *.min.js.map -query_log.txt \ No newline at end of file +query_log.txt +docs/_build \ No newline at end of file diff --git a/README.md b/README.md index 94a1e1a..b5c9f22 100644 --- a/README.md +++ b/README.md @@ -2,22 +2,36 @@ ## Overview +This is an **endpoint for Federated Content Search** (FCS). + +There are many linguistic corpora online. They are available under different platforms and use a variety of query languages. [FCS](https://www.clarin.eu/content/federated-content-search-clarin-fcs-technical-details) is a mechanism that allows you to search in multiple corpora at once, using simple text queries or a CQL-like language. This way, you can discover or compare corpora that can be useful for your research, after which you can proceed to them. This is done through the [Aggregator](https://contentsearch.clarin.eu/). + +An *endpoint* is a piece of software that serves as an intermediary between FCS and individual corpora. 
It translates the FCS requests into corpus-specific query languages, waits for the results, and then renders them in an XML format required by the FCS. + +Different corpus platforms or online databases require different endpoints. This endpoint works with the following platforms or resources: + +* [ANNIS](https://corpus-tools.org/annis/) +* [Tsakorpus](https://tsakorpus.readthedocs.io/en/latest/) +* Database of the [Formulae-Litterae-Chartae project](https://werkstatt.formulae.uni-hamburg.de/collections/formulae_collection) ## Documentation -All documentation is available [here](https://fcs-clarin.readthedocs.io/en/latest/). +All documentation is available [here](https://fcs-clarin-endpoint-hamburg.readthedocs.io/en/latest/). CLARIN FCS specifications this endpoint implements are available [here](https://office.clarin.eu/v/CE-2017-1046-FCS-Specification-v89.pdf). - ## Requirements This software was tested on Ubuntu and Windows. Its dependencies are the following: -* python >= 3.?? -* python modules: ``fastapi``, ``uvicorn`` (you can use ``requirements.txt``) +* python >= 3.8 +* python modules: ``fastapi``, ``uvicorn``, ``lxml``, ``Jinja2`` (you can use ``requirements.txt``) * it is recommended to deploy the endpoint through apache2 with wsgi or nginx ## License The software is distributed under CC BY license (see LICENSE). + +## Funding + +The development of this software was funded by the [Akademie der Wissenschaften in Hamburg](https://www.awhamburg.de/). diff --git a/docs/Makefile b/docs/Makefile new file mode 100644 index 0000000..d4bb2cb --- /dev/null +++ b/docs/Makefile @@ -0,0 +1,20 @@ +# Minimal makefile for Sphinx documentation +# + +# You can set these variables from the command line, and also +# from the environment for the first two. +SPHINXOPTS ?= +SPHINXBUILD ?= sphinx-build +SOURCEDIR = . +BUILDDIR = _build + +# Put it first so that "make" without argument is like "make help". 
+help: + @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) + +.PHONY: help Makefile + +# Catch-all target: route all unknown targets to Sphinx using the new +# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). +%: Makefile + @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/docs/conf.py b/docs/conf.py new file mode 100644 index 0000000..3b3d02a --- /dev/null +++ b/docs/conf.py @@ -0,0 +1,55 @@ +# Configuration file for the Sphinx documentation builder. +# +# This file only contains a selection of the most common options. For a full +# list see the documentation: +# https://www.sphinx-doc.org/en/master/usage/configuration.html + +# -- Path setup -------------------------------------------------------------- + +# If extensions (or modules to document with autodoc) are in another directory, +# add these directories to sys.path here. If the directory is relative to the +# documentation root, use os.path.abspath to make it absolute, like shown here. +# +# import os +# import sys +# sys.path.insert(0, os.path.abspath('.')) + + +# -- Project information ----------------------------------------------------- + +project = 'fcs-clarin-endpoint-hamburg' +copyright = '2022-2023, Timofey Arkhangelskiy' +author = 'Timofey Arkhangelskiy' + +# The full version, including alpha/beta/rc tags +release = '1.0' + + +# -- General configuration --------------------------------------------------- + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. +extensions = [ +] + +# Add any paths that contain templates here, relative to this directory. +templates_path = ['_templates'] + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This pattern also affects html_static_path and html_extra_path. 
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'alabaster'
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
\ No newline at end of file
diff --git a/docs/configuration.rst b/docs/configuration.rst
new file mode 100644
index 0000000..a025556
--- /dev/null
+++ b/docs/configuration.rst
@@ -0,0 +1,75 @@
+Endpoint configuration
+======================
+
+Before you deploy an endpoint, you must tell it in which corpora it is going to search and what kind of corpora they are. This is done in the configuration files located in the ``/config`` directory, one per corpus. Do not forget to remove the test configuration files before publishing an endpoint.
+
+Configuration files
+-------------------
+
+You need one configuration file per corpus. All files are in JSON and must have the ``.json`` extension. Before you publish or reload the server, make sure all the files in ``/config`` are valid JSON, e.g. using JSONLint_.
+
+.. _JSONLint: https://jsonlint.com/
+
+The name of each JSON file, excluding the extension, serves as the ID of the corpus it describes. This ID is used in the request URLs. A request to a corpus whose ID is ``CORPUS_ID`` must be sent to ``http[s]://BASE_URL_OF_THE_ENDPOINT/fcs-endpoint/CORPUS_ID/``.
+
+A configuration file for a corpus is basically a dictionary with parameters. All possible parameters are listed below.
+
+List of parameters
+------------------
+
+Basic info about the corpus
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- ``platform`` (string; required) -- name of the corpus platform.
Possible values are ``annis``, ``tsakorpus`` and ``litterae``.
+
+- ``resource_base_url`` (string; required) -- base URL of the online corpus the endpoint is going to communicate with. For Tsakorpus corpora, do not include the final ``/search`` part.
+
+Basic info about the endpoint
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These values are sometimes sent back to the FCS Aggregator together with the search results.
+
+- ``url_path`` (string) -- URL of this endpoint.
+
+- ``transport_protocol`` (string) -- transport protocol (``http`` or ``https``) used by this endpoint.
+
+- ``host`` (string) -- host name of this endpoint (without the protocol).
+
+- ``port`` (string) -- port number of this endpoint.
+
+Capabilities and settings of the endpoint
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- ``basic_search_capability`` (Boolean; defaults to ``True``) -- whether basic search is possible.
+
+- ``advanced_search_capability`` (Boolean; defaults to ``False``) -- whether advanced search (with a CQL-like query language) is possible.
+
+- ``hits_supported`` (Boolean; defaults to ``True``) -- whether the simple dataview (``hits``; only includes the text) is available.
+
+- ``adv_supported`` (Boolean; defaults to ``False``) -- whether the advanced dataview (``adv``; includes some annotation) is available.
+
+- ``max_hits`` (integer; defaults to 10) -- maximum number of hits the endpoint will send to the Aggregator.
+
+- ``search_lang_id`` (string) -- ID of the language / layer to search in (for platforms and corpora that support multiple languages or multiple text layers, e.g. parallel translations).
+
+Corpus metadata
+~~~~~~~~~~~~~~~
+
+The following values may be sent to the Aggregator when it sends an ``explain`` request, i.e. asks the endpoint to tell it more about the corpora it covers.
+
+- ``titles`` (list of dictionaries) -- title(s) of the resource (use multiple dictionaries if they are in multiple languages).
Each title is described by a dictionary with three keys: ``content`` (the title itself), ``lang`` (an ISO 639 code of the language of the title) and ``primary`` (Boolean; optional; marks this version of the title as primary).
+
+- ``descriptions`` (list of dictionaries) -- description(s) of the resource; work the same as ``titles``.
+
+- ``authors`` (list of dictionaries) -- author(s) of the resource; work the same as ``titles``.
+
+Query translation
+~~~~~~~~~~~~~~~~~
+
+POS tags are required to be in the UD_ standard, per FCS specifications. If a corpus only has non-UD morphological annotation, you can use this workaround.
+
+- ``pos_convert`` (list of rules; each rule is a list with exactly two items) -- rules that convert a corpus-specific morphological annotation string into a UD tag. Each rule contains two strings. The first string is a regex that is applied to the tag sequence from the corpus. The second is a UD tag that has to be sent to the Aggregator instead, if there is a match. Rules are applied in the order of their appearance.
+
+- ``pos_convert_reverse`` (dictionary) -- rules that convert UD tags from a query to corpus-specific tags or expressions. Keys are UD tags, values are expressions they have to be replaced with.
+
+.. _UD: https://universaldependencies.org/u/pos/
diff --git a/docs/faq.rst b/docs/faq.rst
new file mode 100644
index 0000000..801c3b0
--- /dev/null
+++ b/docs/faq.rst
@@ -0,0 +1,30 @@
+Short FAQ
+=========
+
+Here are some short answers to common questions.
+
+| **Q**: *Is the platform open source and free to use for any purpose?*
+| **A**: Yes.
+
+| **Q**: *How do I set up an endpoint for a corpus?*
+| **A**: In a nutshell: You fork or copy the repository, clone it on a server, :doc:`configure </configuration>` it, and run it as a web application. Then you configure a CLARIN aggregator so that it knows where to find your endpoint.
+
+| **Q**: *What kind of requests does an endpoint understand?*
+| **A**: GET requests with several parameters. Two operations are available: ``explain`` provides basic info about a corpus, while ``searchRetrieve`` returns search results from a corpus. A CQL-like language is used for queries. You will find more details in the specifications_.
+
+| **Q**: *What does an endpoint return?*
+| **A**: It returns XML, which contains search results, corpus info and/or error messages. You will find more details in the specifications_.
+
+| **Q**: *How fast is this endpoint?*
+| **A**: The endpoint itself only performs the translation, which is very fast. But it sends requests to corpora and must wait for a reply, which can take a long time.
+
+| **Q**: *Can I use one endpoint for multiple corpora?*
+| **A**: Yes. You create one configuration file per corpus. Each corpus gets an ID. Queries to different corpora must be sent to different URLs, which include that ID.
+
+| **Q**: *Does the endpoint understand all of the FCS query language?*
+| **A**: It can parse any valid query. However, certain corpus platforms it works with cannot process certain subsets of the query language. If a query is too complex for a corpus, you will get an appropriate diagnostic (i.e. error message) in XML.
+
+| **Q**: *What about data protection?*
+| **A**: The endpoint does not store any information on the server and does not place any cookies on the client's machine.
+
+.. _specifications: https://office.clarin.eu/v/CE-2017-1046-FCS-Specification-v89.pdf
diff --git a/docs/index.rst b/docs/index.rst
new file mode 100644
index 0000000..a567f4f
--- /dev/null
+++ b/docs/index.rst
@@ -0,0 +1,60 @@
+.. fcs-clarin-endpoint-hamburg documentation master file.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+FCS Clarin endpoint
+===================
+
+Introduction
+------------
+
+This is an **endpoint for Federated Content Search** (FCS).
+ +There are many linguistic corpora online. They are available under different platforms and use a variety of query languages. FCS_ is a mechanism that allows you to search in multiple corpora at once, using simple text queries or a CQL-like language. This way, you can discover or compare corpora that can be useful for your research, after which you can proceed to them. This is done through the Aggregator_. + +An *endpoint* is a piece of software that serves as an intermediary between FCS and individual corpora. It translates the FCS requests into corpus-specific query languages, waits for the results, and then renders them in an XML format required by the FCS. + +Different corpus platforms or online databases require different endpoints. This endpoint works with the following platforms or resources: + +* ANNIS_ +* Tsakorpus_ +* Database of the `Formulae-Litterae-Chartae project`_ + +.. _FCS: https://www.clarin.eu/content/federated-content-search-clarin-fcs-technical-details +.. _Aggregator: https://contentsearch.clarin.eu/ +.. _ANNIS: https://corpus-tools.org/annis/ +.. _Tsakorpus: https://github.com/timarkh/tsakorpus +.. _Formulae-Litterae-Chartae project: https://werkstatt.formulae.uni-hamburg.de/collections/formulae_collection + +See :doc:`FAQ </faq>` for a short list of commonly asked questions. If you want to learn how to set up an endpoint, please go to :doc:`overview`. + +Requirements +------------ + +This software was tested on Ubuntu and Windows. Its dependencies are the following: + +* python >= 3.8 +* python modules: ``fastapi``, ``uvicorn``, ``lxml``, ``Jinja2`` (you can use ``requirements.txt``) + + +License +------- + +The software is distributed under CC BY license. + + +.. 
toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+   faq
+   overview
+   configuration
+
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
diff --git a/docs/make.bat b/docs/make.bat
new file mode 100644
index 0000000..922152e
--- /dev/null
+++ b/docs/make.bat
@@ -0,0 +1,35 @@
+@ECHO OFF
+
+pushd %~dp0
+
+REM Command file for Sphinx documentation
+
+if "%SPHINXBUILD%" == "" (
+	set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=.
+set BUILDDIR=_build
+
+if "%1" == "" goto help
+
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+	echo.
+	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+	echo.installed, then set the SPHINXBUILD environment variable to point
+	echo.to the full path of the 'sphinx-build' executable. Alternatively you
+	echo.may add the Sphinx directory to PATH.
+	echo.
+	echo.If you don't have Sphinx installed, grab it from
+	echo.http://sphinx-doc.org/
+	exit /b 1
+)
+
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+goto end
+
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+
+:end
+popd
diff --git a/docs/overview.rst b/docs/overview.rst
new file mode 100644
index 0000000..5e6c0d8
--- /dev/null
+++ b/docs/overview.rst
@@ -0,0 +1,20 @@
+Getting started
+===============
+
+Installation
+------------
+
+This endpoint does not require installation. All you have to do is make sure all the dependencies are installed and copy the entire contents of the repository to a directory on your computer. This applies to both Linux and Windows. After that, you will need to :doc:`configure </configuration>` the endpoint.
+
+Installing dependencies
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Python modules can be installed with the help of the ``requirements.txt`` file in the root folder::
+
+    pip3 install -r requirements.txt
+
+Running the endpoint
+--------------------
+
+You can use the endpoint either locally (for testing purposes) or as a web service available from outside.
In the first case, it is sufficient to run ``main.py`` with Python. This will start a FastAPI web server. After this, the endpoint for a corpus with the ID ``CORPUS_ID`` will be accessible locally at ``http://127.0.0.1:5000/fcs-endpoint/CORPUS_ID/``.
+
diff --git a/requirements.txt b/requirements.txt
index 3899ce7..418bc53 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,2 +1,4 @@
 fastapi>=0.88.0
-uvicorn>=0.20.0
\ No newline at end of file
+uvicorn>=0.20.0
+lxml
+Jinja2>=3.0.3
\ No newline at end of file
-- 
GitLab
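Note (not part of the patch): the configuration parameters documented in ``docs/configuration.rst`` above can be illustrated with a minimal per-corpus config file. This is only a sketch: every value below is a hypothetical placeholder; only the keys come from the documented parameter list, and Booleans are written in JSON form (``true``/``false``)::

    {
        "platform": "tsakorpus",
        "resource_base_url": "https://example.org/my_corpus",
        "transport_protocol": "https",
        "host": "fcs.example.org",
        "port": "443",
        "basic_search_capability": true,
        "advanced_search_capability": false,
        "hits_supported": true,
        "adv_supported": false,
        "max_hits": 10,
        "titles": [
            {"content": "My corpus", "lang": "eng", "primary": true}
        ],
        "pos_convert": [
            ["^V.*", "VERB"],
            ["^N.*", "NOUN"]
        ]
    }

Saved as, e.g., ``config/my_corpus.json``, such a file would make the corpus reachable at ``http[s]://BASE_URL_OF_THE_ENDPOINT/fcs-endpoint/my_corpus/``, per the URL scheme described in ``docs/configuration.rst``.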