Commit 7a8461a8 authored by Welter, Felix

Add folder READMEs and more info to main README

parent 1594b323
......@@ -22,3 +22,27 @@ fwelter/wilps_slide_index:<VERSION_NUMBER>
The volume mounts are optional. If they are left out, the container will start without content on every restart.
# Usage
After starting the slide index and browsing to the page
you will see the following four sections.
## Upload new PDF files
Select one or more PDF files that should be added to the index.
Also specify the line number that contains the title.
This process can take some time, since each PDF page is also
converted to an image.
## Indexed files
Gives an overview of the files that are already indexed.
You can also view an excerpt of the extracted titles.
If these are not the correct titles, re-upload the slides
and specify the correct line number during the upload.
## Query
This section enables you to test the indexed files
and see which PDF pages are returned for a given query.
## Upload benchmark data
Using an evaluation file (see `benchmark_data/bidm/terms.csv` for an example)
one can check the search performance of the index.
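As a rough illustration of what such a check could compute, the sketch below measures how often an expected page shows up among the top results. The CSV layout (columns `term` and `expected`), the `hit_rate` function, and the `search` callable are all hypothetical; the real `terms.csv` and the application's evaluation code may look different.

```python
import csv
import io
from typing import Callable, List


def hit_rate(csv_text: str, search: Callable[[str], List[str]], k: int = 3) -> float:
    """Fraction of rows whose expected page is among the top-k search results.

    Assumes a hypothetical CSV layout with the columns "term" and "expected".
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if row["expected"] in search(row["term"])[:k])
    return hits / len(rows)
```

Here `search` stands in for whatever function maps a query string to a ranked list of page identifiers.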
# Search index
This folder contains different implementations of a search index.
`BasicSearchIndex` uses the [Whoosh](https://whoosh.readthedocs.io/en/latest/intro.html)
library under the hood. All other classes inherit from `BasicSearchIndex`.
However, different implementations are possible, as long as they adhere to
the interface implicitly defined by `BasicSearchIndex`.
Currently `TitleFocusSearchIndex` is in use for the application.
This is configurable in [app.py](https://gitlab.rrz.uni-hamburg.de/bay1620/slide-index/-/blob/master/app.py#L18) around line 18.
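To illustrate what "adhering to an implicit interface" can mean in Python, here is a minimal sketch using `typing.Protocol`. The method names `add` and `search`, their signatures, and the `InMemorySearchIndex` class are assumptions made for this example; the actual interface is whatever `BasicSearchIndex` exposes.

```python
from typing import Dict, List, Protocol


class SearchIndex(Protocol):
    """Hypothetical interface; the real method names live in BasicSearchIndex."""

    def add(self, doc_id: str, title: str, content: str) -> None: ...

    def search(self, query: str, limit: int = 5) -> List[str]: ...


class InMemorySearchIndex:
    """Toy implementation that satisfies the protocol without inheriting from it."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def add(self, doc_id: str, title: str, content: str) -> None:
        # Store title and content as one lowercased string per document.
        self._docs[doc_id] = f"{title} {content}".lower()

    def search(self, query: str, limit: int = 5) -> List[str]:
        # Return documents that contain every query term as a substring.
        terms = query.lower().split()
        hits = [doc_id for doc_id, text in self._docs.items()
                if all(term in text for term in terms)]
        return hits[:limit]


def build_index(index: SearchIndex) -> SearchIndex:
    """Works with any object that offers the two methods above."""
    index.add("lecture1.pdf#3", "Decision Trees", "entropy and information gain")
    index.add("lecture1.pdf#4", "Random Forests", "bagging of decision trees")
    return index
```

Because the check is structural, a replacement index does not need to subclass anything, provided it offers the same methods.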
## Weighting
Finding the right slide for a given query is a typical information retrieval task.
While web search can draw on a vast number of ranking signals, PDF
files offer only text and possibly a title. However, this also means that
we do not need to watch out for black hat SEO measures. Therefore we can
fully trust the title and use it as the primary indicator for matching slides.
If the slide index is deployed in a larger production setup, functionality for
user feedback could be employed to find more accurate matches for recurring
queries.
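The idea of treating the title as the primary indicator can be sketched without any search library. The example below is plain Python rather than the Whoosh-based code of this project, and the weights `TITLE_WEIGHT` and `BODY_WEIGHT` are assumed values chosen for illustration only.

```python
from collections import Counter
from typing import List, Tuple

# Assumed weights for illustration; the project's actual values are not stated here.
TITLE_WEIGHT = 3.0
BODY_WEIGHT = 1.0


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def score(query: str, title: str, body: str) -> float:
    """Sum weighted term counts, so that title matches count more than body matches."""
    title_counts = Counter(tokenize(title))
    body_counts = Counter(tokenize(body))
    return sum(
        TITLE_WEIGHT * title_counts[term] + BODY_WEIGHT * body_counts[term]
        for term in tokenize(query)
    )


def rank(query: str, slides: List[Tuple[str, str, str]]) -> List[str]:
    """Take (slide_id, title, body) tuples and return matching ids, best first."""
    scored = [(score(query, title, body), slide_id) for slide_id, title, body in slides]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [slide_id for _, slide_id in scored]
```

With these weights, a slide titled "Clustering" outranks a slide that only mentions the term twice in its body text.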
# Slide indexer
This folder contains classes responsible for processing a PDF file and adding it
to an index. The current `BasicIndexer` needs to know the index and the image
directory. Furthermore, the line number that contains the title can be configured (default: 0).
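How a configurable title line might be applied to the text of one page is sketched below. The function `extract_title` and its parameter `title_row` are hypothetical names, and skipping blank lines before counting is an assumption; `BasicIndexer` may handle this differently.

```python
def extract_title(page_text: str, title_row: int = 0) -> str:
    """Return the non-empty line at index title_row, or "" if the page is too short.

    Assumption: blank lines are dropped before counting.
    """
    lines = [line.strip() for line in page_text.splitlines() if line.strip()]
    if 0 <= title_row < len(lines):
        return lines[title_row]
    return ""
```

For slides whose first line is a recurring header, such as the course name, passing `title_row=1` would pick the line below it instead.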