Create an index
IndexTool
IndexTool is a simple command-line application that creates a corpus and adds documents to it.
Get the BlackLab JAR and the required libraries (see Getting started). The libraries should be in a directory called lib that's in the same directory as the BlackLab JAR (or elsewhere on the classpath).
Start IndexTool without parameters for help information:
java -cp "blacklab.jar:lib" nl.inl.blacklab.tools.IndexTool
(This assumes blacklab.jar and the lib subdirectory containing the required libraries are located in the current directory. On Windows, replace the classpath separator colon (:) with a semicolon (;).)
To create a new index:
java -cp "blacklab.jar:lib" nl.inl.blacklab.tools.IndexTool create INDEX_DIR INPUT_FILES FORMAT
To add documents to an existing index:
java -cp "blacklab.jar:lib" nl.inl.blacklab.tools.IndexTool add INDEX_DIR INPUT_FILES FORMAT
If you specify a directory as INPUT_FILES, it will be scanned recursively. You can also specify a file glob (such as *.xml; single-quote it on Linux so the shell doesn't expand it) or a single file. If you specify a .zip or .tar.gz file, BlackLab will automatically index its contents.
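The reason for single-quoting a glob can be seen without BlackLab at all. The following sketch uses a temporary directory and a hypothetical helper, count_args, to show how many arguments a program receives in each case:

```shell
# Create two sample files in a temporary directory.
workdir=$(mktemp -d)
touch "$workdir/a.xml" "$workdir/b.xml"

# Hypothetical helper: prints the number of arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell expands the glob before the program sees it.
unquoted=$(cd "$workdir" && count_args *.xml)
# Single-quoted: the pattern is passed through literally as one argument.
quoted=$(cd "$workdir" && count_args '*.xml')

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
```

With the unquoted form, a program such as IndexTool would receive a list of file names instead of the pattern itself.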
For example, if you have TEI data in /data/input/my-tei-files and want to index your corpus to /data/blacklab-corpora/my-corpus, run the following command:
java -cp "blacklab.jar:lib" nl.inl.blacklab.tools.IndexTool create /data/blacklab-corpora/my-corpus /data/input/my-tei-files tei
Your data is indexed and placed in a new BlackLab index in the /data/blacklab-corpora/my-corpus directory.
If you don't specify a glob, IndexTool will index *.xml by default. You can specify a glob (like *.txt, or * for all files) to change this.
TIP: Give Java enough memory
If you're indexing very large files, give java more than the default heap memory using the -Xmx option. For really large files, and if you have the memory, you could use -Xmx6G, for example (the size is attached directly to the option, without a space).
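As a sketch, the heap option goes before -cp on the java command line. The snippet below only prints the resulting command (a dry run); the 6G value and the paths are example values taken from this page, not recommendations:

```shell
# Dry run: print an indexing command with a larger Java heap.
# The heap size and the paths are example values.
heap="6G"

# The size follows -Xmx directly; "-Xmx 6G" (with a space) is rejected by the JVM.
cmd="java -Xmx${heap} -cp \"blacklab.jar:lib\" nl.inl.blacklab.tools.IndexTool create /data/blacklab-corpora/my-corpus /data/input/my-tei-files tei"

echo "$cmd"
```

Review the printed command, then run it yourself once the paths match your setup.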
To delete documents from an index:
java -cp "blacklab.jar:lib" nl.inl.blacklab.tools.IndexTool delete INDEX_DIR FILTER_QUERY
Here, FILTER_QUERY is a metadata filter query in Lucene query language that matches the documents to delete. To update documents, delete them and then add them again.
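A delete-then-add update could look like the sketch below. It is a dry run that only prints the two commands. The metadata field title, its value, and the input path are hypothetical; substitute a field that actually exists in your corpus metadata:

```shell
# Dry run: print the two commands of a delete-then-re-add update.
# 'title' is a hypothetical metadata field; the paths are placeholders.
INDEX_DIR="/data/blacklab-corpora/my-corpus"
FILTER_QUERY='title:"My Document"'
TOOL='java -cp "blacklab.jar:lib" nl.inl.blacklab.tools.IndexTool'

# Single quotes around the query keep the shell from touching the inner double quotes.
delete_cmd="$TOOL delete $INDEX_DIR '$FILTER_QUERY'"
add_cmd="$TOOL add $INDEX_DIR /data/input/updated/my-document.xml tei"

printf '%s\n' "$delete_cmd" "$add_cmd"
```

Because the filter query usually contains double quotes or spaces, quoting it as a single shell argument matters here as much as it does for globs.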
Supported formats
BlackLab supports a number of input formats that are common in corpus linguistics:
- tei (Text Encoding Initiative, a popular XML format for linguistic resources, including corpora. Will index content inside the 'body' element; assumes part of speech is found in an attribute called 'type')
- sketch-wpl (the TSV/XML hybrid "word per line" input format the Sketch Engine/CWB use)
- chat (Codes for the Human Analysis of Transcripts, the format used by the CHILDES project)
- folia (a corpus XML format popular in the Netherlands)
- tsv-frog (tab-separated file as produced by the Frog annotation tool)
BlackLab also supports these generic file formats:
- csv (Comma-Separated Values file that should have column names "word", "lemma" and "pos")
- tsv (Tab-Separated Values file that should have column names "word", "lemma" and "pos")
- txt (a plain text file; will tokenize on whitespace and index word forms)
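To illustrate the column layout described for the tsv format, this sketch writes a tiny sample file and checks that every row has three tab-separated fields. The tokens and part-of-speech tags are made-up sample data, and the file location is a temporary directory:

```shell
# Write a tiny sample file with the column names the tsv format expects.
sample="$(mktemp -d)/sample.tsv"
printf 'word\tlemma\tpos\n'   >  "$sample"
printf 'The\tthe\tDET\n'      >> "$sample"
printf 'cats\tcat\tNOUN\n'    >> "$sample"
printf 'sleep\tsleep\tVERB\n' >> "$sample"

# Sanity check: count rows that do not have exactly three tab-separated fields.
bad_rows=$(awk -F '\t' 'NF != 3 {n++} END {print n+0}' "$sample")
echo "rows with a wrong field count: $bad_rows"
```

A check like this can catch stray tabs or missing columns before you start a long indexing run.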
To add support for your own format, you can either write a format configuration file or implement your own DocIndexer class (both are described below).
If you write a configuration file, specify the format name (which must match the name of the .blf.yaml or .blf.json file) as the FORMAT parameter. IndexTool will search a number of directories for format files, including the current directory and the input directory and its parent.
If you implement your own DocIndexer class, specify its fully-qualified class name as the FORMAT parameter.
Add your own format
The preferred way to add support for your own input format is to write an input format configuration file in either YAML or JSON. See the next section.
Expert: implementing a custom indexer
It is possible to implement your own DocIndexer class, which offers complete control over the indexing process, but we don't recommend this unless really necessary.
If you encounter limitations with the configuration file approach, please contact us.
Faster indexing
IndexTool will try to index two documents at the same time by default. If you have enough CPU cores and memory, you can increase this number with the --threads n option, where n is the number of threads to use (i.e. the number of documents to index at the same time).
If you find that IndexTool is running out of memory, or becoming very slow, try a lower number of threads instead.
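One way to choose a starting value for --threads is to derive it from the number of CPU cores. The sketch below uses half the cores with a minimum of two; that heuristic is an assumption of this example, not a BlackLab rule, and so is the position of --threads on the command line. The command is only printed, not run:

```shell
# Pick a thread count from the number of online CPU cores.
# Half the cores (minimum 2, the IndexTool default) is a heuristic, not a BlackLab rule.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)
threads=$(( cores / 2 ))
if [ "$threads" -lt 2 ]; then
  threads=2
fi

# Dry run: print the command. Where --threads goes on the command line is an assumption.
echo "java -cp \"blacklab.jar:lib\" nl.inl.blacklab.tools.IndexTool create --threads $threads /data/blacklab-corpora/my-corpus /data/input/my-tei-files tei"
```

If memory use or indexing speed gets worse, lower the number again.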
WARNING
- Multi-threaded indexing currently works per file, so if all your documents are in a single large file, only one thread will be used.
- Large files appear to gradually degrade indexing performance as indexing progresses through the file.
For these reasons, it is currently better to spread your documents over multiple files, although it is not necessary to limit yourself to 1 document per file. Just make sure your files aren't larger than a few MB.
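To spot input files that may be worth splitting, you can list everything above a size limit. This sketch builds two dummy files in a temporary directory so it runs on its own; the 5 MB limit is an arbitrary stand-in for "a few MB", and in practice input_dir would point at your own data:

```shell
# Demo input directory with one small (1 KB) and one large (6 MB) dummy file.
input_dir=$(mktemp -d)
dd if=/dev/zero of="$input_dir/small.xml" bs=1024 count=1 2>/dev/null
dd if=/dev/zero of="$input_dir/large.xml" bs=1024 count=6144 2>/dev/null

# Size limit in bytes; 5 MB is an example threshold.
limit_bytes=$(( 5 * 1024 * 1024 ))

# List regular files strictly larger than the limit ('c' means bytes).
oversized=$(find "$input_dir" -type f -size +"${limit_bytes}c")
echo "$oversized"
```

Files reported here are candidates for being split into several smaller files, each of which can still hold multiple documents.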