Miscellaneous
Forward index and multiple values
A note about forward indexes and indexing multiple values at a single corpus position: as of right now, the forward index will only store the first value indexed at any position. This is the value used for grouping and sorting on this annotation. In the future we may add the ability to store multiple values for a token position in the forward index, although it is likely that the first value will always be the one used for sorting and grouping.
Allow viewing documents
By default, BlackLab Server will not allow whole documents to be retrieved using /docs/PID/contents. This is to prevent accidentally distributing unlicensed copyrighted material.
You can allow retrieving whole documents by enabling the corpusConfig.contentViewable setting in the index format configuration file, or directly in the indexmetadata.yaml file in the index directory.
This setting can also be changed for individual documents by setting a metadata field with the name contentViewable to true or false.
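As a rough sketch, the corpus-wide setting might look like this in a format configuration file. Only corpusConfig and contentViewable come from the description above; the rest of the file is omitted, and the exact placement should be checked against your own configuration.

```yaml
# Sketch of the relevant part of a .blf.yaml format configuration file.
# All other top-level keys are omitted here.
corpusConfig:
  # Allow whole documents to be retrieved via /docs/PID/contents.
  # Only enable this if you are licensed to distribute the texts.
  contentViewable: true
```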
XPath support level
BlackLab supports two different XML processors: VTD and Saxon. While currently VTD is still the default, we would recommend Saxon for most users going forward.
VTD only supports XPath 1.0 and has some slight quirks (see below). Saxon uses more memory, but is often faster and supports XPath 3.1, which can make writing indexing configurations much easier.
Certain complex indexing features can be avoided when using Saxon; many things can be done in XPath directly. See XPath examples to get an idea of the wide range of possibilities.
To use Saxon, place this in your input format config (.blf.yaml) file (at the top level):
processor: saxon
This works for the current development version and releases 4.0 and up.
Using Saxon with BlackLab 3.0.1 and older
Older versions of BlackLab (release 3.0.1 and before) have basic Saxon support, but quite a few features are missing.
They also don't support the top-level processor key shown above; if you do want to use Saxon on these older releases, use:
fileType: xml
fileTypeOptions:
  processing: saxon # (instead of vtd, which is the default)
Beware of VTD quirks
If you do stick with the default processor VTD instead of switching to Saxon, be aware that in rare cases, a correct XPath may produce unexpected results. This one, for example: string(.//tei:availability[1]/@status='free'). There's often a workaround; in this case, changing it to string(//tei:availability[1]/@status='free') might fix it (although of course this means something slightly different, so do check thoroughly).
A future version of BlackLab will change the default from VTD to Saxon.
Unicode normalization
Unicode normalization refers to the process of converting different ways of encoding the same character to a single, canonical form. For example, the character é can be encoded as a single code point é (U+00E9), or as a combination of e (U+0065) and a combining acute accent (U+0301).
BlackLab's built-in indexers should automatically normalize to NFC (Normalization Form Canonical Composition). This should prevent any issues when sorting or grouping.
More about Unicode equivalence and normal forms
Automatic XSLT generation
If you're creating your own corpora by uploading data to BlackLab Frontend, you want to be able to view your documents as well, without having to write an XSLT yourself. BlackLab Server can generate a default XSLT from your format config file. However, because BlackLab is a bit more lenient with namespaces than the XSLT processor that generates the document view, the generated XSLT will only work correctly if you take care to define your namespaces correctly in your format config file.
IMPORTANT: generating the XSLT might not work correctly if your XML namespaces change throughout the document, e.g. if you declare local namespaces on individual elements.
Namespaces can be declared in the top-level "namespaces" block, which is simply a map of namespace prefix (e.g. "tei") to the namespace URI (e.g. http://www.tei-c.org/ns/1.0). So for example, if your documents declare namespaces as follows:
<doc xmlns:my-ns="http://example.com/my-ns" xmlns="http://example.com/other-ns">
...
</doc>
Then your format config file should contain this namespaces section:
namespaces:
  '': http://example.com/other-ns # The default namespace
  my-ns: http://example.com/my-ns
If you forget to declare some or all of these namespaces, the document might index correctly, but the generated XSLT won't work and will likely show a message saying that no words have been found in the document. Updating your format config file should fix this; re-indexing shouldn't be necessary, as the XSLT is generated directly from the config file, not the index.
Configuration versions 1 and 2
There's an experimental version 2 of the .blf.yaml format. To try it out, add version: 2 to the top of your format file.
Version 2 of the format file introduces a few breaking changes to be aware of:
- The default XML processor is now saxon (used to be vtd). Saxon is faster and supports modern XPath features, making it much more flexible.
- The baseFormat key (to inherit from a different format config) is no longer allowed. Instead, you should copy the format and customize it to suit your needs.
- word and lemma no longer have a special default sensitivity. All user-defined annotations now default to insensitive. To remain compatible with the old behaviour, explicitly specify sensitivity: sensitive_insensitive for word and lemma.
- A dash - in a field or annotation name will no longer automatically be replaced with an underscore _ (this was never necessary; field and annotation names must be valid XML names, which may contain dashes). If you rely on this quirk, replace dash with underscore manually in your config.
- The processing step default was renamed to ifempty, to better describe how it's commonly used.
- The inlineTags keys includeAttributes, excludeAttributes and extraAttributes have been removed. Instead, use the attributes key to specify which attributes to index. Add valuePath if this is an extra attribute (that doesn't actually appear on the tag, but should be added based on the XPath expression). Use exclude: true to exclude an attribute. If the first entry contains no name, only exclude: true, this means "exclude any attribute not in this list".
- The append processing step now has a prefix parameter in addition to the separator parameter. separator still defaults to a space, but is now only used to separate multiple metadata field values. prefix defaults to the empty string, and is used to prefix the value to be appended. This means you won't get an extra space by default when appending a value. Add prefix: ' ' (or whatever you set as separator) for the old behaviour.
- The multipleValues and allowDuplicateValues keys on annotations have been removed. Both work automatically now: if your config produces multiple values for an annotation, they will be indexed, and any duplicates that may arise are automatically removed.
- The mapValues key on metadata fields has been removed. Use the map processing step instead, which can be used anywhere processing steps are allowed.
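To make a few of these changes concrete, here is a minimal sketch of a version 2 format file that keeps the old sensitivity for word and lemma and uses the renamed ifempty step. The field name contents, the XPath expressions, the pos annotation and the action/value layout of the processing step are illustrative assumptions, not taken from a real corpus; check them against your own format file.

```yaml
version: 2

# In version 2, saxon is the default processor, so no "processor" key is needed.

annotatedFields:
  contents:                      # hypothetical field name
    containerPath: text          # hypothetical XPath
    wordPath: .//w               # hypothetical XPath
    annotations:
      - name: word
        valuePath: .
        # Restore the version 1 default for word and lemma.
        sensitivity: sensitive_insensitive
      - name: lemma
        valuePath: "@lemma"
        sensitivity: sensitive_insensitive
      - name: pos                # hypothetical annotation; defaults to insensitive
        valuePath: "@pos"
        process:
          # This step was called "default" in version 1.
          - action: ifempty
            value: unknown
```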
Extending formats (deprecated)
NOTE: THIS FUNCTIONALITY IS DEPRECATED
Don't rely on this feature as it is no longer supported in .blf.yaml format version 2. Instead, simply copy the format file and make any changes you need.
It is possible to extend an existing format. This is done by specifying the "baseFormat" setting at the top-level. You should set it to the name of the format you wish to extend.
It matters where baseFormat is placed, as it effectively copies values from the specified format when it is encountered. It's usually best to specify baseFormat somewhere at the top of the file. You can put it after 'name' and 'description' if you wish, as those settings are not copied.
To be precise, setting baseFormat does the following:
- copy type, fileType, documentPath, store, metadataDefaultAnalyzer
- copy the corpusConfig settings
- add all fileTypeOptions
- add all namespace declarations
- add all indexFieldAs entries
- add all annotatedFields entries
- add all metadata entries
- add all linkedDocument entries
In other words: setting a base format allows you to add or change file type options, namespace declarations, indexFieldAs entries, annotated fields or linked documents. You can also add (embedded) metadata sections.
Note that most blocks are not "merged": if you want to change annotated field settings, you will have to redefine the entire annotated field in the "derived" configuration file; you can't just specify the setting you wish to override for that field. It is also not possible to make changes to existing metadata sections.
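For reference, a derived version 1 format file might look roughly like the sketch below. The format names are hypothetical placeholders, and the namespace entry is only an example of a setting added on top of what is copied from the base format.

```yaml
# Deprecated: only works in version 1 of the .blf.yaml format.
name: my-derived-format          # hypothetical; name is not copied from the base
description: Variant of my-base-format with an extra namespace

# Placed near the top, because values are copied at the point
# where baseFormat is encountered.
baseFormat: my-base-format       # hypothetical name of the format to extend

# Added to the namespace declarations copied from the base format.
namespaces:
  my-ns: http://example.com/my-ns
```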