INEL Corpus Services
How to run it from the command line
An example:
java -Xmx3g -jar /path/to/corpus-services.jar -i /path/to/corpus -o path/to/corpus/curation/report-output.html -c INELChecks -f -p "fsm=/path/to/corpus/corpus-utilities/segmentation.fsm"
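The example above can be hard to adapt when it sits on one long line. Below is a minimal sketch of the same kind of call, built from shell variables. The names JAR, CORPUS and inel_check_cmd are hypothetical, the report path is derived from the corpus path, and the command is only printed, not executed. The -Xmx3g part is a standard JVM option that caps the Java heap at 3 GB.

```sh
#!/bin/sh
# Hypothetical locations; adjust them to your own setup.
JAR=/path/to/corpus-services.jar
CORPUS=/path/to/corpus

# Print the full call, written with one option per line for readability.
# Paths are assumed to contain no spaces, since the result is a plain string.
inel_check_cmd() {
  echo java -Xmx3g -jar "$JAR" \
    -i "$CORPUS" \
    -o "$CORPUS/curation/report-output.html" \
    -c INELChecks \
    -f \
    -p "fsm=$CORPUS/corpus-utilities/segmentation.fsm"
}

# Review the printed command first; drop the `echo` above to execute it.
inel_check_cmd
```

Keeping the jar and corpus locations in variables means only two lines change when the call is reused for another corpus.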
Available options
-i, --input
Required. The path to source file(s) you want to perform an action on.
-i /path/to/corpus
If it's a path to a directory, then the action will be applied to all the eligible files within the directory and all subdirectories.
-i /path/to/corpus/selkup.coma
If it's a path to a file, then the action will be applied to that file only.
-o, --output
Optional. The path to a report file (HTML or JSON) containing warnings and errors found in the source data. If this option is not provided, no report will be made.
-o path/to/corpus/curation/report-output.html
Produces an HTML version of the report that can be viewed in a browser.
-o path/to/corpus/curation/report.json
Produces a JSON version of the report.
-o path/to/corpus/curation/report-output.html -o path/to/corpus/curation/report.json
You may provide the option twice to produce both versions of the report.
-c, --corpusfunction / -u, --utilityfunction
Optional. The name of the function you want to run, i.e. the case-sensitive name of the corresponding Java class from the .validation or .utilities package. If neither -c nor -u is provided, Corpus Services will do nothing. Currently -c and -u perform the same actions and are thus interchangeable, though that may change in the future.
-i selkup.coma -c ComaApostropheChecker
Will call ComaApostropheChecker on the comafile.
-i selkup.coma -u PrettyPrintData
Will call PrettyPrintData on the comafile.
-i selkup.coma -c ComaApostropheChecker -c ComaAttachedFilepathsChecker -c ComaFileCoverageChecker
You may chain the option to run several checking classes during the same run.
-c INELChecks
A useful shortcut to run all the functions from the .validation package.
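When only a subset of checks is needed, the chained -c options can be generated from a single list of class names. This is a sketch only: check_args and CHECKS are hypothetical names, the class names are the ones from the examples above, and the command is printed rather than run.

```sh
#!/bin/sh
# check_args is a hypothetical helper: it prints one "-c NAME" pair per
# checker name passed to it, so the list can be kept in one place.
check_args() {
  for name in "$@"; do
    printf '%s %s ' -c "$name"
  done
}

# Class names are case-sensitive and taken from the examples above.
CHECKS="ComaApostropheChecker ComaAttachedFilepathsChecker ComaFileCoverageChecker"

# $CHECKS is deliberately left unquoted so that it splits into separate names.
echo java -jar /path/to/corpus-services.jar -i selkup.coma $(check_args $CHECKS)
```

Adding or removing a check then means editing the CHECKS list rather than the command itself.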
-f, --fix
Optional, boolean. If set, Corpus Services will automatically fix errors where possible and rewrite the source files. If not set, Corpus Services will only collect issues for the report and leave the source files unchanged.
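Because -f rewrites the source files, one cautious workflow is to produce a report first and add -f only in a second run. The sketch below prints both calls. The helper corpus_cmd and the variables JAR and CORPUS are hypothetical, and whether ComaApostropheChecker actually offers an automatic fix is an assumption made for illustration.

```sh
#!/bin/sh
# Hypothetical locations; adjust them to your own setup.
JAR=/path/to/corpus-services.jar
CORPUS=/path/to/corpus

# corpus_cmd is a hypothetical helper: "report" prints a call without -f,
# "fix" prints the same call with -f appended.
corpus_cmd() {
  base="java -jar $JAR -i $CORPUS -o $CORPUS/curation/report-output.html -c ComaApostropheChecker"
  case "$1" in
    report) echo "$base" ;;
    fix)    echo "$base -f" ;;
    *)      echo "usage: corpus_cmd report|fix" >&2; return 1 ;;
  esac
}

corpus_cmd report   # step 1: no -f, so the source files stay untouched
corpus_cmd fix      # step 2: same call with -f, once the report has been reviewed
```

Committing or backing up the corpus before the second step makes it easy to review or revert what was rewritten.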
-p, --property
Optional. Some checks require user-provided properties and will not run correctly without them. The general syntax is -p "property_name=property_value".
-c ExbSegmentationChecker -p "fsm=/path/to/corpus/corpus-utilities/segmentation.fsm"
ExbSegmentationChecker looks for an external FSM to perform segmentation. In this example, the property_name is fsm and the property_value is /path/to/corpus/corpus-utilities/segmentation.fsm.
-u ComaMassLinkFiles -p "exb=true" -p "eaf=true" -f
You may chain several properties in one call, same as with -c/-u. In the example above, ComaMassLinkFiles is being used to automatically link EXB and EAF files to their respective communications in the comafile.
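A mistyped property is easy to miss in a long call, so it may help to check the name=value shape before the properties reach Corpus Services. The following is a sketch under stated assumptions: prop_args is a hypothetical helper, names and values are assumed to be non-empty and free of spaces, and the command is printed rather than run.

```sh
#!/bin/sh
# prop_args is a hypothetical helper: it checks that every argument has the
# form name=value and prints one "-p name=value" pair per property.
prop_args() {
  for prop in "$@"; do
    case "$prop" in
      ?*=?*) printf '%s %s ' -p "$prop" ;;
      *)     echo "not a name=value pair: $prop" >&2; return 1 ;;
    esac
  done
}

# Property names are taken from the ComaMassLinkFiles example above.
echo java -jar /path/to/corpus-services.jar -i selkup.coma -u ComaMassLinkFiles $(prop_args exb=true eaf=true) -f
```

The helper stops at the first malformed argument and returns a non-zero status, which a calling script can use to abort before anything is run.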
-x, --xquery
Optional; only used by the XQueryWrapper class. The name of the query to run.
-u XQueryWrapper -x pos
Here Corpus Services will run the query named pos, which counts parts of speech across several corpora.