Non-XML files
Indexing CSV/TSV files
BlackLab works best with XML files, because they can contain any kind of (sub)annotations, (embedded or linked) metadata, inline tags, and so on. It is also often easier to display them in BlackLab Frontend (using XSLT). However, if your data is in a non-XML type like CSV, TSV or plain text, and you'd rather not convert it, you can still index it.
For CSV/TSV files, indexing them directly can be done by defining a tabular input format. These are "word-per-line" (WPL) formats, meaning that each line will be interpreted as a single token. Annotations simply specify the column number (or column name, if your input files have them).
(Technical note: BlackLab uses Apache commons-csv to parse tabular files. Not all settings are exposed at the moment. If you find yourself needing access to a setting that isn't exposed via de configuration file yet, please let us know)
Here's a simple example configuration, my-tsv.blf.yaml
, that will parse tab-delimited files produced by the Frog tool:
fileType: tabular
## Options for tabular format
fileTypeOptions:
# TSV (tab-separated values) or CSV (comma-separated values, like Excel)
type: tsv
# Does the file have column names in the first line? [default: false]
columnNames: false
# The delimiter character to use between column values
# [default: comma (",") for CSV, tab ("\t") for TSV]
delimiter: "\t"
# The quote character used around column values (where necessary)
# [default: disable quoting column values]
quote: "\""
annotatedFields:
contents:
annotations:
- name: word # First annotation becomes the main annotation
valuePath: 2 # (1-based) column number or column name (if file has them)
sensitivity: sensitive_insensitive
- name: lemma
valuePath: 3
sensitivity: sensitive_insensitive
- name: pos
valuePath: 5
(Note that the BlackLab JAR includes a default tsv.blf.yaml
that is a bit different: it assumes a file containing column names. The column names are word, lemma and pos)
The Sketch Engine takes a tab-delimited WPL input format that document tags, inline tags and "glue tags" (which indicate that there should be no space between two tokens). Here's a short example:
<doc id="1" title="Test document" author="Jan Niestadt">
<s>
This PRO this
is VRB be
a ART a
test NOU test
<g/>
. SENT .
</s>
</doc>
Here's a configuration to index this format (sketch-wpl.blf.yaml
, already included in the BlackLab JAR):
fileType: tabular
fileTypeOptions:
type: tsv
# allows inline tags such as in Sketch WPL format
# all inline tags encountered will be indexed
inlineTags: true
# interprets <g/> to be a glue tag such as in Sketch WPL format
glueTags: true
# If the file includes "inline tags" like <p></p> and <s></s>,
# (like for example the Sketch Engine WPL format does)
# is it allowed to have separated characters after such a tag?
# [default: false]
allowSeparatorsAfterInlineTags: false
documentPath: doc # looks for document elements such as in Sketch WPL format
# (attributes are automatically indexed as metadata)
annotatedFields:
contents:
annotations:
- name: word # First annotation becomes the main annotation
valuePath: 1
sensitivity: sensitive_insensitive
- name: lemma
valuePath: 3
sensitivity: sensitive_insensitive
- name: pos
valuePath: 2
If one of your columns contains multiple values, for example multiple alternative lemmatizations, use a processing step (action: split
) to split the column value. See also here.
If you want to index metadata from another file along with each document, you have to use valueField
in the linkValues
section (see here). In the SketchWPL case, in addition to fromInputFile
you can also use any document element attributes, because those are added as metadata fields automatically. So if the document element has an id
attribute, you could use that as a linkValue
to locate the metadata file.
Indexing plain text files
Plain text files don't allow you to use a lot of BlackLab's features and hence don't require a lot of configuration either. If you need specific indexing features for non-tabular, non-XML file formats, please let us know and we will consider adding them. For now, here's how to configure a plain text input format (txt.blf.yaml
, included in the BlackLab JAR):
fileType: text
annotatedFields:
contents:
annotations:
- name: word
valuePath: .
sensitivity: sensitive_insensitive
Note that a plain text format may only have a single annotated field. You cannot specify containerPath or wordPath. For each annotation you define, valuePath must be "." ("the current word"), but you can specify different processing steps for different annotations if you want.
There is one way to index metadata information along with plain text files, which is to look up the metadata based on the input file. The example below uses processing steps; see the relevant section below, and see the section on linking to external files for more information on that subject.
To index metadata information based on the input file path, use a section such as this one:
linkedDocuments:
metadata:
store: true # Should we store the linked document?
# Values we need for locating the linked document
# (matching values will be substituted for $1-$9 below)
linkValues:
- valueField: fromInputFile # fetch the "fromInputFile" field from the Lucene doc
# (this is the original path to the file that was indexed)
process:
# Normalize slashes
- action: replace
find: "\\\\"
replace: "/"
# Keep only the last two path parts (which indicate location inside metadata zip file)
- action: replace
find: "^.*/([^/]+/[^/]+)/?$"
replace: "$1"
- action: replace
find: "\\.txt$"
replace: ".cmdi"
#- valueField: id # plain text has no other fields, but TSV with document elements
# could, and those fields could also be used (see documentPath
# below)
# How to fetch the linked input file containing the linked document.
# File or http(s) reference. May contain $x (x = 1-9), which will be replaced
# with (processed) linkValue
inputFile: http://server.example.com/metadata.zip
# (Optional)
# If the linked input file is an archive (zip is recommended), this is the path
# inside the archive where the file can be found. May contain $x (x = 1-9), which
# will be replaced with (processed) linkValue
pathInsideArchive: some/dir/$1
# Format of the linked input file
inputFormat: cmdi
# (Optional)
# XPath to the (single) linked document to process.
# If omitted, the entire file is processed, and must contain only one document.
# May contain $x (x = 1-9), which will be replaced with (processed) linkValue
#documentPath: /root/metadata[@docId = $2]
Indexing other files
For some types of files it is possible to automatically convert them to another file type that can be indexed.
Support for this feature works through plugins and is still experimental.
Add the following lines to your configuration file to convert your files before indexing them according to the rest of the configuration.
convertPlugin: OpenConvert
tagPlugin: DutchTagger
This setup will convert doc, docx, txt, epub, html, alto, rtf and odt
into tei
.
This will however not work until you provide the right .jar and data files to the plugins. Adding the following configuration to blacklab-server.yaml
will enable the plugins to do their work.
plugins:
OpenConvert:
jarPath: /path/to/OpenConvert-0.2.0.jar
DutchTagger:
jarPath: /path/to//DutchTagger-0.2.0.jar
vectorFile: /path/to/duthtagger/data/vectors.bin
modelFile: /path/to/dutchtagger/model
lexiconFile: /path/to/dutchtagger/lexicon.tab
Currently the files and exact version of OpenConvert are not publically available, but look at the plugins page for more information on how write your own plugin.