From 7a8461a8fd31abe930f961c1ac56f6930a5da9f6 Mon Sep 17 00:00:00 2001
From: "Welter, Felix" <felix.welter@studium.uni-hamburg.de>
Date: Thu, 22 Oct 2020 11:35:37 +0200
Subject: [PATCH] Add folder READMEs and more info to main README

---
 README.md               | 24 ++++++++++++++++++++++++
 search_index/README.md  | 20 ++++++++++++++++++++
 slide_indexer/README.md |  5 +++++
 3 files changed, 49 insertions(+)
 create mode 100644 search_index/README.md
 create mode 100644 slide_indexer/README.md

diff --git a/README.md b/README.md
index aef2ced..37a4f92 100644
--- a/README.md
+++ b/README.md
@@ -22,3 +22,27 @@ fwelter/wilps_slide_index:<VERSION_NUMBER>

 The volume mounts are optional. If left out the container will start without content on every restart.

+# Usage
+After starting the slide index and browsing to the page,
+you will see the following four sections.
+
+## Upload new PDF files
+Select one or more PDF files that should be added to the index.
+Also specify the line number which contains the title.
+This process can take some time, since each PDF page is also
+converted to an image.
+
+## Indexed files
+Gives an overview of already indexed files.
+You can also view an excerpt of the extracted titles.
+If these are not the correct titles, re-upload the slides
+and specify the correct line number during the upload.
+
+## Query
+This section enables you to test the indexed files
+and see which PDF pages are returned for a given query.
+
+## Upload benchmark data
+Using an evaluation file (see `benchmark_data/bidm/terms.csv` for an example),
+you can check the search performance of the index.
+
diff --git a/search_index/README.md b/search_index/README.md
new file mode 100644
index 0000000..ded524a
--- /dev/null
+++ b/search_index/README.md
@@ -0,0 +1,20 @@
+# Search index
+This folder contains different implementations of a search index.
+`BasicSearchIndex` uses the [Whoosh](https://whoosh.readthedocs.io/en/latest/intro.html)
+library under the hood. All other classes inherit from `BasicSearchIndex`.
+However, different implementations are possible, as long as they adhere to
+the interface implicitly defined by `BasicSearchIndex`.
+
+Currently `TitleFocusSearchIndex` is in use for the application.
+This is configurable in [app.py](https://gitlab.rrz.uni-hamburg.de/bay1620/slide-index/-/blob/master/app.py#L18) around line 18.
+
+## Weighting
+Finding the right slide for a given query is a typical information retrieval task.
+While in the web context a vast number of ranking signals can be used, with PDF
+files only text and possibly a title are available. However, it also means that
+we do not need to watch out for black hat SEO measures. Therefore we can
+fully trust the title and use it as the primary indicator for matching slides.
+
+If the slide index is deployed in a larger production setup, functionality for
+user feedback could be employed to find more accurate matches for recurring
+queries.
diff --git a/slide_indexer/README.md b/slide_indexer/README.md
new file mode 100644
index 0000000..769f2c9
--- /dev/null
+++ b/slide_indexer/README.md
@@ -0,0 +1,5 @@
+# Slide indexer
+
+This folder contains classes responsible for processing a PDF file and adding it
+to an index. The current `BasicIndexer` needs to know the index and the image
+dir. Furthermore, the line number which contains the title can be configured (default: 0).
-- 
GitLab
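
The title-first weighting described in `search_index/README.md` can be illustrated with a minimal sketch that uses only the Python standard library. This is not the project's `TitleFocusSearchIndex` and it does not use Whoosh; the names `Slide`, `tokenize`, `score`, `search` and the `title_boost` value of 5.0 are assumptions made up for this illustration. It only shows the idea that a query term found in the title should count for more than the same term found in the body text.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Slide:
    """One PDF page: its extracted title and the remaining body text."""
    title: str
    body: str


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, slide: Slide, title_boost: float = 5.0) -> float:
    """Sum query-term hits; a title hit counts `title_boost` times a body hit.

    `title_boost` is a hypothetical tuning value, not taken from the project.
    """
    title_counts = Counter(tokenize(slide.title))
    body_counts = Counter(tokenize(slide.body))
    return sum(
        title_boost * title_counts[term] + body_counts[term]
        for term in tokenize(query)
    )


def search(query: str, slides: List[Slide]) -> List[Slide]:
    """Return only slides with a non-zero score, best match first."""
    scored = [(score(query, slide), slide) for slide in slides]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [slide for _, slide in hits]


if __name__ == "__main__":
    deck = [
        Slide(title="Introduction",
              body="this lecture covers the data warehouse and data mining"),
        Slide(title="Data Warehouse", body="star schema and ETL"),
    ]
    for hit in search("data warehouse", deck):
        print(hit.title)
```

With this scoring, a slide whose title matches the query ranks above a slide that only mentions the query terms in its body, even if the body mentions them more often. A real index would typically add term weighting such as BM25 on top of the field boost.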