Create an index

IndexTool

IndexTool is a simple command-line application that creates a corpus and adds documents to it.

Get the BlackLab JAR and the required libraries (see Getting started). The libraries should be in a directory called lib that's in the same directory as the BlackLab JAR (or elsewhere on the classpath).

Start the IndexTool without parameters for help information:

java -cp "blacklab.jar:lib" nl.inl.blacklab.tools.IndexTool

(this assumes blacklab.jar and the lib subdirectory containing required libraries are located in the current directory)

(if you're on Windows, replace the classpath separator colon : with a semicolon ;)

To create a new index:

java -cp "blacklab.jar:lib" nl.inl.blacklab.tools.IndexTool create INDEX_DIR INPUT_FILES FORMAT

To add documents to an existing index:

java -cp "blacklab.jar:lib" nl.inl.blacklab.tools.IndexTool add INDEX_DIR INPUT_FILES FORMAT

If you specify a directory as the INPUT_FILES, it will be scanned recursively. You can also specify a file glob (such as *.xml; single-quote it if you're on Linux so it doesn't get expanded by the shell) or a single file. If you specify a .zip or .tar.gz file, BlackLab will automatically index the contents.
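The quoting advice can be checked without running IndexTool at all. The sketch below assumes a Bash shell and a scratch directory created with mktemp; the file names are made up for illustration. It shows what the shell would hand to a program for an unquoted versus a single-quoted glob.

```bash
#!/usr/bin/env bash
# Demonstrates why a glob meant for IndexTool should be single-quoted.
# The scratch directory and file names are hypothetical.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.xml" "$demo_dir/b.xml"

# Unquoted: the shell expands the pattern into the matching file names.
unquoted=( "$demo_dir"/*.xml )
# Single-quoted: the pattern is passed through literally, unexpanded.
quoted=( "$demo_dir"'/*.xml' )

echo "unquoted: ${#unquoted[@]} argument(s): ${unquoted[*]}"
echo "quoted:   ${#quoted[@]} argument(s): ${quoted[*]}"
```

With the unquoted form the program receives one argument per matching file; with the quoted form it receives the pattern itself and can do its own matching.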

For example, if you have TEI data in /data/input/my-tei-files and want to index your corpus to /data/blacklab-corpora/my-corpus, run the following command:

java -cp "blacklab.jar:lib" nl.inl.blacklab.tools.IndexTool create /data/blacklab-corpora/my-corpus /data/input/my-tei-files tei

Your data is indexed and placed in a new BlackLab index in the /data/blacklab-corpora/my-corpus directory.

If you don't specify a glob, IndexTool will index *.xml by default. You can specify a glob (like *.txt or * for all files) to change this.

TIP: Give Java enough memory

If you're indexing very large files, give Java more than the default heap memory using the -Xmx option. For really large files, and if you have the memory, you could use -Xmx6G, for example (note that there is no space between -Xmx and the size).
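As a minimal sketch of where the heap option goes, the snippet below builds the TEI example command with a 6 GB heap. The paths are the ones from the example above; the heap variable and the array-based approach are just for illustration. The command is printed rather than executed so it can be inspected first.

```bash
#!/usr/bin/env bash
# Sketch: build an IndexTool command line with a larger Java heap.
# -Xmx is a JVM option, so it goes before -cp and the class name,
# and the size is attached directly to the flag (no space).
heap="6G"   # adjust to the memory you actually have available
cmd=(java "-Xmx${heap}" -cp "blacklab.jar:lib" nl.inl.blacklab.tools.IndexTool
     create /data/blacklab-corpora/my-corpus /data/input/my-tei-files tei)

# Print the command for inspection; run it with: "${cmd[@]}"
printf '%s ' "${cmd[@]}"; echo
```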

To delete documents from an index:

java -cp "blacklab.jar:lib" nl.inl.blacklab.tools.IndexTool delete INDEX_DIR FILTER_QUERY

Here, FILTER_QUERY is a metadata filter query in Lucene query language that matches the documents to delete. Deleting documents and re-adding them can be used to update documents.
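The delete-then-add update cycle could look roughly like the sketch below. The metadata field title, its value, and the file path are hypothetical; substitute a field that actually exists in your corpus metadata. The commands are printed with shell quoting rather than executed.

```bash
#!/usr/bin/env bash
# Sketch: update one document by deleting it and adding it again.
# The metadata field "title", its value, and all paths are hypothetical.
index_dir="/data/blacklab-corpora/my-corpus"
updated_file="/data/input/my-tei-files/pride-and-prejudice.xml"
# Single quotes keep the shell from touching the query's double quotes.
filter_query='title:"Pride and Prejudice"'

indextool=(java -cp "blacklab.jar:lib" nl.inl.blacklab.tools.IndexTool)
delete_cmd=("${indextool[@]}" delete "$index_dir" "$filter_query")
add_cmd=("${indextool[@]}" add "$index_dir" "$updated_file" tei)

# Print both commands; run them with: "${delete_cmd[@]}" && "${add_cmd[@]}"
printf '%q ' "${delete_cmd[@]}"; echo
printf '%q ' "${add_cmd[@]}"; echo
```

Single-quoting the whole query matters here for the same reason as with globs: the double quotes and spaces belong to the Lucene query, not to the shell.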

Supported formats

BlackLab supports a number of input formats that are common in corpus linguistics:

tei (Text Encoding Initiative, a popular XML format for linguistic resources, including corpora. Will index content inside the 'body' element; assumes part of speech is found in an attribute called 'type')
sketch-wpl (the TSV/XML hybrid "word per line" input format the Sketch Engine/CWB use)
chat (Codes for the Human Analysis of Transcripts, the format used by the CHILDES project)
folia (a corpus XML format popular in the Netherlands)
tsv-frog (tab-separated file as produced by the Frog annotation tool)

BlackLab also supports these generic file formats:

csv (Comma-Separated Values file that should have column names "word", "lemma" and "pos")
tsv (Tab-Separated Values file that should have column names "word", "lemma" and "pos")
txt (A plain text file; will tokenize on whitespace and index word forms)

To add support for your own format, you can either write an input format configuration file (the preferred way) or implement your own DocIndexer class.

If you choose the first option, specify the format name (which must match the name of the .blf.yaml or .blf.json file) as the FORMAT parameter. IndexTool will search a number of directories for format files, including the current directory and the (parent of the) input directory.

If you choose the second option, specify the fully-qualified class name of your DocIndexer class as the FORMAT parameter.
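Before starting a long indexing run, it can help to confirm that a format file with the expected name is where IndexTool might look for it. The helper below is a hypothetical sketch, not IndexTool's actual search logic: it only checks the current directory, the input directory, and the input directory's parent, for the two file extensions mentioned above.

```bash
#!/usr/bin/env bash
# Sketch: look for FORMAT.blf.yaml or FORMAT.blf.json in a few of the
# locations the text mentions. find_format_file is a hypothetical helper
# and does not reproduce IndexTool's full search list.
find_format_file() {
    local format="$1" input_dir="$2" dir ext
    for dir in "." "$input_dir" "$(dirname "$input_dir")"; do
        for ext in blf.yaml blf.json; do
            if [ -f "$dir/$format.$ext" ]; then
                echo "$dir/$format.$ext"
                return 0
            fi
        done
    done
    return 1
}

# Example call with made-up names; prints a note if nothing is found.
find_format_file my-format /data/input/my-files \
    || echo "no format file found for my-format" >&2
```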

Add your own format

The preferred way to add support for your input format is to write an input format configuration file in either YAML or JSON format. See the next section.

Expert: implementing a custom indexer

It is possible to implement your own DocIndexer class, which offers complete control over the indexing process, but we don't recommend this unless really necessary.

If you encounter limitations with the configuration file approach, please contact us.

Faster indexing

IndexTool will try to index two documents at the same time by default. If you have enough CPU cores and memory, you can increase this number by setting the --threads n option, where n is the number of threads to use (i.e. documents to index at the same time).

If you find that IndexTool is running out of memory, or becoming very slow, try a lower number of threads instead.
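One way to pick a thread count is to derive it from the number of CPU cores, with a cap so memory use stays reasonable. In the sketch below, the cap of 4, the pick_threads helper, and the placement of --threads directly after create are assumptions; run IndexTool without parameters to confirm the expected option placement. The command is printed rather than executed.

```bash
#!/usr/bin/env bash
# Sketch: derive a --threads value from the number of CPU cores.
# The cap of 4 is an arbitrary, hypothetical choice.
pick_threads() {
    local cores="$1" cap="${2:-4}"
    if [ "$cores" -lt "$cap" ]; then echo "$cores"; else echo "$cap"; fi
}

# nproc is part of GNU coreutils; fall back to 2 if it is unavailable,
# which matches IndexTool's default number of threads.
cores=$(nproc 2>/dev/null || echo 2)
threads=$(pick_threads "$cores")

# Placing --threads right after "create" is an assumption.
cmd=(java -cp "blacklab.jar:lib" nl.inl.blacklab.tools.IndexTool
     create --threads "$threads"
     /data/blacklab-corpora/my-corpus /data/input/my-tei-files tei)
printf '%s ' "${cmd[@]}"; echo   # run it with: "${cmd[@]}"
```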

WARNING

Multi-threaded indexing currently works per file, so if all your documents are in a single large file, only one thread will be used.
Large files appear to gradually degrade indexing performance the further IndexTool gets into the file.

For these reasons, it is currently better to spread your documents over multiple files, although it is not necessary to limit yourself to 1 document per file. Just make sure your files aren't larger than a few MB.
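To spot input files that may be worth splitting, a size check along these lines can help. The 5 MB default is only one reading of "a few MB", and list_large_files is a hypothetical helper; the M size suffix is supported by GNU and BSD find but is not part of POSIX.

```bash
#!/usr/bin/env bash
# Sketch: list input files above a size limit so they can be split up.
# The 5 MB default is an arbitrary choice; list_large_files is hypothetical.
list_large_files() {
    local input_dir="$1" limit_mb="${2:-5}"
    # -size +NM matches files larger than N mebibytes (sizes are rounded up
    # to whole mebibytes before comparing).
    find "$input_dir" -type f -size "+${limit_mb}M"
}
```

Calling list_large_files /data/input/my-tei-files prints one path per oversized file, and an optional second argument changes the limit in MB.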
