Add folder READMEs and more info to main README

Commit 7a8461a8 authored 4 years ago by Welter, Felix
--- a/README.md
+++ b/README.md
@@ -22,3 +22,27 @@ fwelter/wilps_slide_index:<VERSION_NUMBER>

 The volume mounts are optional. If left out the container will start without content on every restart. 

+# Usage
+After starting the slide index and browsing to the page,
+you will see the following four sections.
+
+## Upload new PDF files
+Select one or more PDF files that should be added to the index.  
+Also specify the line number which contains the title.  
+This process can take some time, since each PDF page is also
+converted to an image.
+
+## Indexed files
+Gives an overview of the files that are already indexed. 
+You can also view an excerpt of the extracted titles. 
+If these are not the correct titles, re-upload the slides
+and specify the correct line number during the upload. 
+
+## Query
+This section enables you to test the indexed files
+and see which PDF pages are returned for a given query.
+
+## Upload benchmark data
+Using an evaluation file (see `benchmark_data/bidm/terms.csv` for an example)
+one can check the search performance of the index.
+
--- a/search_index/README.md
+++ b/search_index/README.md
+# Search index
+This folder contains different implementations of a search index. 
+`BasicSearchIndex` uses the [Whoosh](https://whoosh.readthedocs.io/en/latest/intro.html)
+library under the hood. All other classes inherit from `BasicSearchIndex`. 
+However, different implementations are possible, as long as they adhere to 
+the interface implicitly defined by `BasicSearchIndex`.
+
+Currently `TitleFocusSearchIndex` is in use for the application.
+This is configurable in [app.py](https://gitlab.rrz.uni-hamburg.de/bay1620/slide-index/-/blob/master/app.py#L18) around line 18.
+
+## Weighting 
+Finding the right slide for a given query is a typical information retrieval task.
+While in the web context a vast number of ranking signals can be used, with PDF
+files only the text and possibly a title are available. However, this also means that
+we do not need to watch out for black hat SEO measures. Therefore we can
+fully trust the title and use it as the primary indicator for matching slides.
+
+If the slide index is deployed in a larger production setup, functionality for
+user feedback could be employed to find more accurate matches for recurring 
+queries. 
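
To illustrate the title-first weighting described in the Weighting section above, here is a minimal sketch in plain Python. It does not use Whoosh and is not the code behind `TitleFocusSearchIndex`. The functions `score_slide` and `rank_slides`, the `title_weight` parameter, and its default of 3.0 are assumptions made purely for illustration.

```python
def score_slide(query, title, body, title_weight=3.0):
    """Count query-term matches, weighting title hits above body hits.

    Hypothetical helper: the names and the default weight are assumptions,
    not values taken from the slide index.
    """
    terms = query.lower().split()
    title_words = title.lower().split()
    body_words = body.lower().split()
    score = 0.0
    for term in terms:
        # A match in the title counts title_weight times as much as one in the body.
        score += title_weight * title_words.count(term)
        score += body_words.count(term)
    return score


def rank_slides(query, slides):
    """Return (title, body) tuples sorted by descending score."""
    return sorted(
        slides,
        key=lambda slide: score_slide(query, slide[0], slide[1]),
        reverse=True,
    )
```

With this weighting, a slide titled "Data Mining" ranks above a slide that only mentions both words in its body text, which is the behaviour the README argues for.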
--- a/slide_indexer/README.md
+++ b/slide_indexer/README.md
+# Slide indexer
+
+This folder contains classes responsible for processing a PDF file and adding it
+to an index. The current `BasicIndexer` needs to know the index and the image
+directory. Furthermore, the line number which contains the title can be configured (default: 0).
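
As a rough sketch of what a configurable title line could mean in practice, the snippet below picks one line of a page's extracted text as the title. It is not taken from `BasicIndexer`. The function name `extract_title`, the parameter `title_row`, the 0-based counting (suggested by the default of 0), and the empty-string fallback are all assumptions.

```python
def extract_title(page_text, title_row=0):
    """Return the line at index title_row of the page text, stripped.

    Hypothetical helper: assumes 0-based line numbers and returns an empty
    string when the page has fewer lines than title_row + 1.
    """
    lines = page_text.splitlines()
    if 0 <= title_row < len(lines):
        return lines[title_row].strip()
    return ""
```

If slides carry a recurring header such as a lecture name on the first line, passing `title_row=1` would select the second line instead.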