Aleksandr Riaposov authored

INEL Corpus Services

How to run it through the CLI

An example:

java -Xmx3g -jar /path/to/corpus-services.jar -i /path/to/corpus -o path/to/corpus/curation/report-output.html -c INELChecks -f -p "fsm=/path/to/corpus/corpus-utilities/segmentation.fsm"

Available options

-i, --input

Required. The path to source file(s) you want to perform an action on.

-i /path/to/corpus

If it's a path to a directory, then the action will be applied to all the eligible files within the directory and all subdirectories.

-i /path/to/corpus/selkup.coma

If it's a path to a file, then the action will be applied to that file only.

-o, --output

Optional. The path to a report file (HTML or JSON) containing warnings and errors found in the source data. If this option is not provided, no report will be made.

-o path/to/corpus/curation/report-output.html

Produces an HTML version of the report that can be viewed in a browser.

-o path/to/corpus/curation/report.json

Produces a JSON version of the report.

-o path/to/corpus/curation/report-output.html -o path/to/corpus/curation/report.json

You may provide the option twice to produce both versions of the report.
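As a minimal sketch of the double -o form, the function below assembles a full command line that requests both report formats and prints it for inspection instead of running it. The function name report_cmd, the jar location and the choice of ComaApostropheChecker are illustrative assumptions; only the -i, -o and -c options themselves come from this readme.

```shell
#!/bin/sh
# Hypothetical helper: print a Corpus Services call that writes both an
# HTML and a JSON report. Nothing is executed; the command is only echoed.
report_cmd() {
    input=$1       # file or directory passed to -i
    report_dir=$2  # directory that should receive the reports (assumed to exist)
    echo "java -jar /path/to/corpus-services.jar -i $input" \
         "-o $report_dir/report-output.html" \
         "-o $report_dir/report.json" \
         "-c ComaApostropheChecker"
}

report_cmd selkup.coma curation
```

Copy the printed line into a terminal once the paths look right. The sketch does not quote paths, so it is only suitable for paths without spaces.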

-c, --corpusfunction / -u, --utilityfunction

Optional. The name of the function you want to run, i.e. the case-sensitive name of the corresponding Java class from the .validation or .utilities package. If neither -c nor -u is provided, Corpus Services will do nothing. Currently -c and -u perform the same actions and are thus interchangeable, though that may change in the future.

-i selkup.coma -c ComaApostropheChecker

Will call ComaApostropheChecker on the comafile.

-i selkup.coma -u PrettyPrintData

Will call PrettyPrintData on the comafile.

-i selkup.coma -c ComaApostropheChecker -c ComaAttachedFilepathsChecker -c ComaFileCoverageChecker

You may chain the option to run several checking classes during the same run.

-c INELChecks

A useful shortcut to run all the functions from the .validation package.
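When the list of checks grows, typing -c repeatedly becomes error-prone. The sketch below, assuming a hypothetical helper named chain_checks and a placeholder jar path, turns a list of class names into repeated -c options and prints the resulting command without running it. It refuses an empty list, since a call without -c or -u does nothing.

```shell
#!/bin/sh
# Hypothetical helper: build one -c option per checker name and print
# the resulting command. Nothing is executed here.
chain_checks() {
    input=$1
    shift
    if [ $# -eq 0 ]; then
        echo "no checker names given; Corpus Services would do nothing" >&2
        return 1
    fi
    args=""
    for name in "$@"; do
        args="$args -c $name"   # class names are case-sensitive
    done
    echo "java -jar /path/to/corpus-services.jar -i $input$args"
}

chain_checks selkup.coma ComaApostropheChecker ComaAttachedFilepathsChecker ComaFileCoverageChecker
```

The class names are passed through unchanged, so a misspelled or wrongly capitalised name is not detected by the helper.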

-f, --fix

Optional, boolean. If selected, Corpus Services will automatically fix some errors where possible and rewrite the source files. If not, Corpus Services will collect issues to be written in a report, and no changes to the source files will be made.
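Because -f rewrites the source files, one cautious workflow is to produce a report first, review it, and only then repeat the same call with -f. This ordering is a suggestion, not something the tool enforces. The sketch below assumes a hypothetical helper named check_cmd and the directory layout from the example at the top of this readme; it only prints the two commands.

```shell
#!/bin/sh
# Hypothetical helper: print the same INELChecks call either in
# report-only mode or with -f appended. Nothing is executed here.
check_cmd() {
    mode=$1    # "report" or "fix"
    corpus=$2  # corpus directory, assumed to contain curation/ and corpus-utilities/
    cmd="java -Xmx3g -jar /path/to/corpus-services.jar -i $corpus"
    cmd="$cmd -o $corpus/curation/report-output.html -c INELChecks"
    cmd="$cmd -p \"fsm=$corpus/corpus-utilities/segmentation.fsm\""
    if [ "$mode" = "fix" ]; then
        cmd="$cmd -f"   # source files will be rewritten
    fi
    echo "$cmd"
}

check_cmd report /path/to/corpus   # step 1: report only, sources untouched
check_cmd fix /path/to/corpus      # step 2: same call with -f
```

Committing or backing up the corpus between the two steps makes it easy to inspect what the automatic fixes changed.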

-p, --property

Optional. Some checks require user-supplied properties and will not run correctly without them. The general syntax is -p "property_name=property_value".

-c ExbSegmentationChecker -p "fsm=/path/to/corpus/corpus-utilities/segmentation.fsm"

ExbSegmentationChecker looks for an external FSM to perform segmentation. The property_name in this example is fsm, the property_value is /path/to/corpus/corpus-utilities/segmentation.fsm.

-u ComaMassLinkFiles -p "exb=true" -p "eaf=true" -f

You may chain several properties in one call, same as with -c/-u. In the example above, ComaMassLinkFiles is being used to automatically link EXB and EAF files to their respective communications in the comafile.
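The property syntax can be sketched as a small helper that wraps each name=value pair in its own quoted -p option and rejects arguments without an equals sign. The helper name property_args is an assumption made for illustration; it only formats text and does not know which properties a given class actually accepts.

```shell
#!/bin/sh
# Hypothetical helper: format name=value pairs as repeated -p options.
property_args() {
    out=""
    for kv in "$@"; do
        case $kv in
            *=*) out="$out -p \"$kv\"" ;;   # keep the pair quoted, as in the readme
            *)   echo "not a name=value pair: $kv" >&2
                 return 1 ;;
        esac
    done
    echo "${out# }"   # drop the leading space
}

# Prints the ComaMassLinkFiles call from the example above.
echo "java -jar /path/to/corpus-services.jar -i selkup.coma -u ComaMassLinkFiles $(property_args exb=true eaf=true) -f"
```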

-x, --xquery

Optional, specific to the class XQueryWrapper. Contains the query name.

-u XQueryWrapper -x pos

Here Corpus Services will run the query named pos that counts parts of speech across several corpora.