Commit 225264d5 authored by Timofey Arkhangelskiy's avatar Timofey Arkhangelskiy
Added docs.

*.pdf
*.min.css.map
*.min.js.map
query_log.txt
docs/_build
## Overview
This is an **endpoint for Federated Content Search** (FCS).
There are many linguistic corpora online. They are available under different platforms and use a variety of query languages. [FCS](https://www.clarin.eu/content/federated-content-search-clarin-fcs-technical-details) is a mechanism that allows you to search in multiple corpora at once, using simple text queries or a CQL-like language. This way, you can discover or compare corpora that can be useful for your research, after which you can proceed to them. This is done through the [Aggregator](https://contentsearch.clarin.eu/).
An *endpoint* is a piece of software that serves as an intermediary between FCS and individual corpora. It translates the FCS requests into corpus-specific query languages, waits for the results, and then renders them in an XML format required by the FCS.
Different corpus platforms or online databases require different endpoints. This endpoint works with the following platforms or resources:
* [ANNIS](https://corpus-tools.org/annis/)
* [Tsakorpus](https://tsakorpus.readthedocs.io/en/latest/)
* Database of the [Formulae-Litterae-Chartae project](https://werkstatt.formulae.uni-hamburg.de/collections/formulae_collection)
## Documentation
All documentation is available [here](https://fcs-clarin-endpoint-hamburg.readthedocs.io/en/latest/).
CLARIN FCS specifications this endpoint implements are available [here](https://office.clarin.eu/v/CE-2017-1046-FCS-Specification-v89.pdf).
## Requirements
This software was tested on Ubuntu and Windows. Its dependencies are the following:
* python >= 3.8
* python modules: ``fastapi``, ``uvicorn``, ``lxml``, ``Jinja2`` (you can use ``requirements.txt``)
* it is recommended to deploy the endpoint behind apache2 or nginx (e.g. as a reverse proxy in front of uvicorn)
## License
The software is distributed under the CC BY license (see LICENSE).
## Funding
The development of this software was funded by the [Akademie der Wissenschaften in Hamburg](https://www.awhamburg.de/).
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
# -- Project information -----------------------------------------------------
project = 'fcs-clarin-endpoint-hamburg'
copyright = '2022-2023, Timofey Arkhangelskiy'
author = 'Timofey Arkhangelskiy'
# The full version, including alpha/beta/rc tags
release = '1.0'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'alabaster'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
Endpoint configuration
======================
Before you deploy an endpoint, you must tell it which corpora it will search and what kind of corpora they are. This is done in the configuration files located in the ``/config`` directory, one per corpus. Do not forget to remove the test configuration files before publishing an endpoint.
Configuration files
-------------------
You need one configuration file per corpus. All files are in JSON and must have the ``.json`` extension. Before you publish or reload the server, make sure all files in ``/config`` are valid JSON, e.g. using JSONLint_.
.. _JSONLint: https://jsonlint.com/
The name of each JSON file, excluding the extension, serves as the ID of the corpus it describes. This ID is used in the request URLs. A request to a corpus whose ID is ``CORPUS_ID`` must be sent to ``http[s]://BASE_URL_OF_THE_ENDPOINT/fcs-endpoint/CORPUS_ID/``.
A configuration file for a corpus is basically a dictionary with parameters. All possible parameters are listed below.
List of parameters
------------------
Basic info about the corpus
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- ``platform`` (string; required) -- name of the corpus platform. Possible values are ``annis``, ``tsakorpus`` and ``litterae``.
- ``resource_base_url`` (string; required) -- base URL of the online corpus the endpoint is going to communicate with. For Tsakorpus corpora, do not include the final ``/search`` part.
Basic info about the endpoint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These values are somteimes sent back to the FCS Aggregator together with the search results.
- ``url_path`` (string) -- URL of this endpoint.
- ``transport_protocol`` (string) -- transport protocol (``http`` or ``https``) used by this endpoint.
- ``host`` (string) -- host name of this endpoint (without the protocol).
- ``port`` (string) -- port number of this endpoint.
Capabilities and settings of the endpoint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- ``basic_search_capability`` (Boolean; defaults to ``True``) -- whether basic search is possible.
- ``advanced_search_capability`` (Boolean; defaults to ``False``) -- whether advanced search (with a CQL-like query language) is possible.
- ``hits_supported`` (Boolean; defaults to ``True``) -- whether the simple dataview (``hits``; only includes the text) is available.
- ``adv_supported`` (Boolean; defaults to ``False``) -- whether the advanced dataview (``adv``; includes some annotation) is available.
- ``max_hits`` (integer; defaults to 10) -- maximum number of hits the endpoint will send to the Aggregator.
- ``search_lang_id`` (string) -- ID of the language / layer to search in (for platforms and corpora that support multiple languages or multiple text layers, e.g. parallel translations).
Corpus metadata
~~~~~~~~~~~~~~~
The following values may be sent to the Aggregator when it sends an ``explain`` request, i.e. asks the endpoint to tell it more about the corpora it covers.
- ``titles`` (list of dictionaries) -- title(s) of the resource (use multiple dictionaries if they are in multiple languages). Each title is described by a dictionary with three keys: ``content`` (the title itself), ``lang`` (an ISO 639 code of the language of the title) and ``primary`` (Boolean; optional; marks this version of the title as primary).
- ``descriptions`` (list of dictionaries) -- description(s) of the resource; work the same as ``titles``.
- ``authors`` (list of dictionaries) -- author(s) of the resource; work the same as ``titles``.
Query translation
~~~~~~~~~~~~~~~~~
POS tags are required to be in the UD_ standard, per FCS specifications. If a corpus only has a non-UD morphological annotation, you can use this workaround.
- ``pos_convert`` (list of rules; each rule is a list with exactly two items) -- rules that convert a corpus-specific morphological annotation string into a UD tag. Each rule contains two strings. The first string is a regex that is applied to the tag sequence from the corpus. The second is a UD tag that is sent to the Aggregator instead if there is a match. Rules are applied in the order of their appearance.
- ``pos_convert_reverse`` (dictionary) -- rules that convert UD tags from a query to corpus-specific tags or expressions. Keys are UD tags, values are expressions they have to be replaced with.
.. _UD: https://universaldependencies.org/u/pos/
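The parameters above can be combined into a single configuration file. Here is a minimal sketch (all values are invented for illustration; it is written as a Python dictionary so that the ``pos_convert`` rule order can be demonstrated, but the printed JSON is what a ``/config/CORPUS_ID.json`` file would contain; whether the regexes are anchored or searched is an assumption, ``re.search`` is used here):

```python
import json
import re

# Hypothetical configuration for a Tsakorpus corpus (all values invented).
config = {
    "platform": "tsakorpus",
    "resource_base_url": "https://example.org/my_corpus",
    "transport_protocol": "https",
    "host": "example.org",
    "port": "443",
    "basic_search_capability": True,
    "hits_supported": True,
    "max_hits": 10,
    "titles": [
        {"content": "My Corpus", "lang": "eng", "primary": True}
    ],
    # Each rule: [regex over the corpus tag string, UD tag to report on a match].
    "pos_convert": [
        ["N,.*", "NOUN"],
        ["V,.*", "VERB"]
    ]
}

# Serialized, this is the contents of a /config/CORPUS_ID.json file:
print(json.dumps(config, indent=2))

# pos_convert rules are applied in order of appearance; first match wins.
def to_ud(corpus_tags, rules):
    for regex, ud_tag in rules:
        if re.search(regex, corpus_tags):
            return ud_tag
    return None

print(to_ud("V,ipfv,3sg", config["pos_convert"]))  # VERB
```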
Short FAQ
=========
Here are some short answers to common questions.
| **Q**: *Is the platform open source and free to use for any purpose?*
| **A**: Yes.
| **Q**: *How do I set up an endpoint for a corpus?*
| **A**: In a nutshell: You fork or copy the repository, clone it on a server, :doc:`configure </configuration>` it, and run it as a web application. Then you configure a CLARIN aggregator, so that it knows where to find your endpoint.
| **Q**: *What kind of requests does an endpoint understand?*
| **A**: GET requests with several parameters. Two operations are available: ``explain`` provides basic info about a corpus, ``searchRetrieve`` returns search results in a corpus. A CQL-like language is used for queries. You will find more details in the specifications_.
| **Q**: *What does an endpoint return?*
| **A**: It returns XML, which contains search results, corpus info and/or error messages. You will find more details in the specifications_.
| **Q**: *How fast is this endpoint?*
| **A**: The endpoint itself only performs the translation, which is very fast. But it sends requests to corpora and must wait for their replies, which can take a long time.
| **Q**: *Can I use one endpoint for multiple corpora?*
| **A**: Yes. You create one configuration file per corpus. Each corpus gets an ID. Queries to different corpora must be sent to different URLs, which include that ID.
| **Q**: *Does the endpoint understand all of the FCS query language?*
| **A**: It can parse any valid query. However, certain corpus platforms it works with cannot process certain subsets of the query language. If a query is too complex for a corpus, you will get an appropriate diagnostic (i.e. error message) in XML.
| **Q**: *What about data protection?*
| **A**: The endpoint does not store any information on the server and does not place any cookies on the client's machine.
.. _specifications: https://office.clarin.eu/v/CE-2017-1046-FCS-Specification-v89.pdf
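The two operations mentioned above can be sketched as plain GET requests (the endpoint host is a placeholder; the exact parameter names and any required ``version`` parameter come from the SRU/FCS specifications linked above):

```shell
# ask the endpoint to describe the corpus
curl "https://ENDPOINT_HOST/fcs-endpoint/CORPUS_ID/?operation=explain"

# run a simple full-text search for "water"
curl "https://ENDPOINT_HOST/fcs-endpoint/CORPUS_ID/?operation=searchRetrieve&query=water"
```

Both requests return XML, as described above.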
.. fcs-clarin-endpoint-hamburg documentation master file.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
FCS Clarin endpoint
===================
Introduction
------------
This is an **endpoint for Federated Content Search** (FCS).
There are many linguistic corpora online. They are available under different platforms and use a variety of query languages. FCS_ is a mechanism that allows you to search in multiple corpora at once, using simple text queries or a CQL-like language. This way, you can discover or compare corpora that can be useful for your research, after which you can proceed to them. This is done through the Aggregator_.
An *endpoint* is a piece of software that serves as an intermediary between FCS and individual corpora. It translates the FCS requests into corpus-specific query languages, waits for the results, and then renders them in an XML format required by the FCS.
Different corpus platforms or online databases require different endpoints. This endpoint works with the following platforms or resources:
* ANNIS_
* Tsakorpus_
* Database of the `Formulae-Litterae-Chartae project`_
.. _FCS: https://www.clarin.eu/content/federated-content-search-clarin-fcs-technical-details
.. _Aggregator: https://contentsearch.clarin.eu/
.. _ANNIS: https://corpus-tools.org/annis/
.. _Tsakorpus: https://github.com/timarkh/tsakorpus
.. _Formulae-Litterae-Chartae project: https://werkstatt.formulae.uni-hamburg.de/collections/formulae_collection
See :doc:`FAQ </faq>` for a short list of commonly asked questions. If you want to learn how to set up an endpoint, please go to :doc:`overview`.
Requirements
------------
This software was tested on Ubuntu and Windows. Its dependencies are the following:
* python >= 3.8
* python modules: ``fastapi``, ``uvicorn``, ``lxml``, ``Jinja2`` (you can use ``requirements.txt``)
License
-------
The software is distributed under the CC BY license.
.. toctree::
:maxdepth: 2
:caption: Contents:
faq
overview
configuration
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd
Getting started
===============
Installation
------------
This endpoint does not require installation. All you have to do is make sure all the dependencies are installed and copy the entire contents of the repository to a directory on your computer. This concerns both Linux and Windows. After that, you will need to :doc:`configure </configuration>` the endpoint.
Installing dependencies
~~~~~~~~~~~~~~~~~~~~~~~
Python modules can be installed with the help of the ``requirements.txt`` file in the root folder::
pip3 install -r requirements.txt
Running the endpoint
--------------------
You can use the endpoint either locally (for testing purposes) or as a web service available from outside. In the first case, it is sufficient to run ``main.py`` as a Python file. This will start a FastAPI web server. After this, the endpoint for a corpus with the ID ``CORPUS_ID`` will be accessible locally at ``http://127.0.0.1:5000/fcs-endpoint/CORPUS_ID/``.
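In practice, a local test run amounts to the following commands (``main.py`` and port 5000 are taken from the text above; ``CORPUS_ID`` is a placeholder for the name of one of your configuration files):

```shell
# install the dependencies once
pip3 install -r requirements.txt

# start the FastAPI server locally
python3 main.py

# the endpoint then answers at:
#   http://127.0.0.1:5000/fcs-endpoint/CORPUS_ID/
```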
fastapi>=0.88.0
uvicorn>=0.20.0
lxml
Jinja2>=3.0.3