Commit 7a8461a8, authored 4 years ago by Welter, Felix
Add folder READMEs and more info to main README
Parent: 1594b323
Showing 3 changed files with 49 additions and 0 deletions:
- README.md (+24, -0)
- search_index/README.md (+20, -0)
- slide_indexer/README.md (+5, -0)
README.md (+24, -0)
@@ -22,3 +22,27 @@ fwelter/wilps_slide_index:<VERSION_NUMBER>
The volume mounts are optional. If they are left out, the container starts without any indexed content after every restart.
# Usage
After starting the slide index and opening the page in a browser,
you will see the following four sections.
## Upload new PDF files
Select one or more PDF files that should be added to the index.
Also specify the number of the text line that contains the slide title.
This process can take some time, since each PDF page is also
converted to an image.
## Indexed files
Gives an overview of the files that are already indexed.
You can also view an excerpt of the extracted titles.
If these are not the correct titles, re-upload the slides
and specify the correct line number during the upload.
## Query
This section lets you test the indexed files
and see which PDF pages are returned for a given query.
## Upload benchmark data
Using an evaluation file (see `benchmark_data/bidm/terms.csv` for an example),
you can check the search performance of the index.
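
As an illustration of what such a check could compute, here is a minimal sketch of a top-k hit rate. The column names `term`, `file` and `page` are assumptions made for this example and may not match the layout of `benchmark_data/bidm/terms.csv`. The `search` argument is a hypothetical callable that returns `(file, page)` tuples for a query.

```python
import csv
import io


def hit_rate(csv_text, search, k=3):
    """Fraction of benchmark rows whose expected page is among the top-k results.

    Assumed CSV columns (hypothetical): term, file, page.
    `search` is any callable mapping a query string to a ranked
    list of (file, page) tuples.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    hits = 0
    for row in rows:
        expected = (row["file"], int(row["page"]))
        if expected in search(row["term"])[:k]:
            hits += 1
    return hits / len(rows)
```

A hit rate near 1.0 would mean that almost every benchmark term leads to the expected slide within the first `k` results.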
search_index/README.md (new file, mode 100644; +20, -0)
# Search index
This folder contains different implementations of a search index.
`BasicSearchIndex` uses the [Whoosh](https://whoosh.readthedocs.io/en/latest/intro.html)
library under the hood. All other classes inherit from `BasicSearchIndex`.
However, different implementations are possible, as long as they adhere to
the interface implicitly defined by `BasicSearchIndex`.
Currently, `TitleFocusSearchIndex` is in use for the application.
This is configurable in [app.py](https://gitlab.rrz.uni-hamburg.de/bay1620/slide-index/-/blob/master/app.py#L18)
around line 18.
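
To illustrate what adhering to an implicitly defined interface can look like, the sketch below spells out a hypothetical interface as a `typing.Protocol`. The method names `add` and `search` and their parameters are assumptions for this example. The actual methods of `BasicSearchIndex` are not listed in this README and may differ.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class SearchIndex(Protocol):
    """Hypothetical interface; the real BasicSearchIndex methods may differ."""

    def add(self, path: str, page: int, title: str, content: str) -> None: ...

    def search(self, query: str, limit: int = 5) -> list[tuple[str, int]]: ...


class InMemorySearchIndex:
    """Toy backend that satisfies the protocol without inheriting from anything."""

    def __init__(self) -> None:
        self._pages: list[tuple[str, int, str, str]] = []

    def add(self, path, page, title, content):
        self._pages.append((path, page, title, content))

    def search(self, query, limit=5):
        # A page matches if every query term occurs in its title or content.
        terms = query.lower().split()
        hits = [
            (path, page)
            for path, page, title, content in self._pages
            if all(t in f"{title} {content}".lower() for t in terms)
        ]
        return hits[:limit]
```

Any class with matching methods could then be swapped in wherever the application expects a search index, without inheriting from a common base class.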
## Weighting
Finding the right slide for a given query is a typical information retrieval task.
While in the web context a vast number of ranking signals can be used, with PDF
files only the text and possibly a title are available. However, this also means that
we do not need to watch out for black hat SEO measures. Therefore, we can
fully trust the title and use it as the primary indicator for matching slides.
If the slide index is deployed in a larger production setup, functionality for
user feedback could be employed to find more accurate matches for recurring
queries.
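
The idea of treating the title as the primary indicator can be sketched in plain Python. This is not how `TitleFocusSearchIndex` is implemented. The function names and the boost value of 3.0 are assumptions chosen only to show the effect of weighting title matches more heavily than body matches.

```python
def score_page(query, title, body, title_boost=3.0):
    """Count query-term occurrences; title hits are multiplied by title_boost.

    title_boost=3.0 is an arbitrary illustrative value, not taken from the project.
    """
    terms = query.lower().split()
    title_tokens = title.lower().split()
    body_tokens = body.lower().split()
    title_hits = sum(title_tokens.count(t) for t in terms)
    body_hits = sum(body_tokens.count(t) for t in terms)
    return title_boost * title_hits + body_hits


def rank(query, pages, title_boost=3.0):
    """Return ids of pages with a non-zero score, best match first.

    `pages` is a list of (page_id, title, body) tuples.
    """
    scored = [
        (score_page(query, title, body, title_boost), page_id)
        for page_id, title, body in pages
    ]
    scored.sort(key=lambda item: -item[0])
    return [page_id for score, page_id in scored if score > 0]
```

With a boost above 1, a slide titled "Clustering" ranks above a slide that only mentions clustering once in its body text.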
slide_indexer/README.md (new file, mode 100644; +5, -0)
# Slide indexer
This folder contains classes responsible for processing a PDF file and adding it
to an index. The current `BasicIndexer` needs to know the index and the image
directory. Furthermore, the line number that contains the title can be configured (default: 0).
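
As a rough illustration of the title line setting, the following sketch picks the title from the extracted text of one page. The function `extract_title` is hypothetical and not part of `BasicIndexer`. It assumes zero-based line numbers (suggested by the default of 0) and assumes that blank lines are skipped, which the real indexer may handle differently.

```python
def extract_title(page_text, title_line=0):
    """Return the line at index `title_line` of a page's text.

    Assumptions for this sketch: line numbers are zero-based and
    blank lines are ignored. Returns "" if the page has too few lines.
    """
    lines = [line.strip() for line in page_text.splitlines() if line.strip()]
    if 0 <= title_line < len(lines):
        return lines[title_line]
    return ""
```

For slides whose first text line is a recurring lecture header, setting the title line to 1 would pick the line below it instead.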