From 7a8461a8fd31abe930f961c1ac56f6930a5da9f6 Mon Sep 17 00:00:00 2001
From: "Welter, Felix" <felix.welter@studium.uni-hamburg.de>
Date: Thu, 22 Oct 2020 11:35:37 +0200
Subject: [PATCH] Add folder READMEs and more info to main README

---
 README.md               | 24 ++++++++++++++++++++++++
 search_index/README.md  | 20 ++++++++++++++++++++
 slide_indexer/README.md |  5 +++++
 3 files changed, 49 insertions(+)
 create mode 100644 search_index/README.md
 create mode 100644 slide_indexer/README.md

diff --git a/README.md b/README.md
index aef2ced..37a4f92 100644
--- a/README.md
+++ b/README.md
@@ -22,3 +22,27 @@ fwelter/wilps_slide_index:<VERSION_NUMBER>
 
 The volume mounts are optional. If left out the container will start without content on every restart. 
 
+# Usage
+After starting the slide index and browsing to the page,
+you will see the following four sections.
+
+## Upload new PDF files
+Select one or more PDF files that should be added to the index.  
+Also specify the line number which contains the title.  
+This process can take some time, since each PDF page is also
+converted to an image.
+
+## Indexed files
+This section gives an overview of the files that are already indexed.
+You can also view an excerpt of the extracted titles.
+If these are not the correct titles, re-upload the slides
+and specify the correct line number during the upload. 
+
+## Query
+This section lets you test the indexed files
+and see which PDF pages are returned for a given query.
+
+## Upload benchmark data
+Using an evaluation file (see `benchmark_data/bidm/terms.csv` for an example),
+one can check the search performance of the index.
+
diff --git a/search_index/README.md b/search_index/README.md
new file mode 100644
index 0000000..ded524a
--- /dev/null
+++ b/search_index/README.md
@@ -0,0 +1,20 @@
+# Search index
+This folder contains different implementations of a search index. 
+`BasicSearchIndex` uses the [Whoosh](https://whoosh.readthedocs.io/en/latest/intro.html)
+library under the hood. All other classes inherit from `BasicSearchIndex`. 
+However, different implementations are possible, as long as they adhere to
+the interface implicitly defined by `BasicSearchIndex`.
+
+Currently, the application uses `TitleFocusSearchIndex`.
+This is configurable in [app.py](https://gitlab.rrz.uni-hamburg.de/bay1620/slide-index/-/blob/master/app.py#L18) around line 18.
+
+## Weighting 
+Finding the right slide for a given query is a typical information retrieval task.
+While in the web context a vast number of ranking signals can be used, with PDF
+files only text and possibly a title are available. However, this also means that
+we do not need to watch out for black hat SEO measures. Therefore, we can
+fully trust the title and use it as the primary indicator for matching slides.
+
+If the slide index is deployed in a larger production setup, functionality for
+user feedback could be employed to find more accurate matches for recurring
+queries. 
diff --git a/slide_indexer/README.md b/slide_indexer/README.md
new file mode 100644
index 0000000..769f2c9
--- /dev/null
+++ b/slide_indexer/README.md
@@ -0,0 +1,5 @@
+# Slide indexer
+
+This folder contains classes responsible for processing a PDF file and adding it
+to an index. The current `BasicIndexer` needs to know the index and the image
+directory. Furthermore, the line number that contains the title can be configured (default: 0).
-- 
GitLab