Miscellaneous

Forward index and multiple values

A note about forward indexes and indexing multiple values at a single corpus position: currently, the forward index only stores the first value indexed at any position. This is the value used for grouping and sorting on that annotation. In the future, we may add the ability to store multiple values per token position in the forward index, although the first value will likely remain the one used for sorting and grouping.

Allow viewing documents

By default, BlackLab Server will not allow whole documents to be retrieved using /docs/PID/contents. This is to prevent accidentally distributing unlicensed copyrighted material.

You can allow retrieving whole documents by enabling the corpusConfig.contentViewable setting in the index format configuration file, or directly in the indexmetadata.yaml file in the index directory.
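As a minimal sketch, assuming the setting sits under a top-level corpusConfig block (as its dotted name suggests), enabling document retrieval in a format configuration file might look like this:

```yaml
# Corpus-wide default: allow whole documents to be retrieved
# via /docs/PID/contents. Only enable this if you are allowed
# to distribute the full texts.
corpusConfig:
  contentViewable: true
```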

This setting can also be changed for individual documents by setting a metadata field named contentViewable to true or false.
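A per-document override could be sketched as an ordinary metadata field. The element names below (header, viewable) are hypothetical; the only name taken from the text is contentViewable, and the sketch assumes each source document stores a literal true or false at that location.

```yaml
metadata:
  containerPath: header          # hypothetical element holding the metadata
  fields:
    - name: contentViewable      # the field name described above
      valuePath: viewable        # hypothetical element containing "true" or "false"
```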

XPath support level

BlackLab uses the Saxon library to process XML.

Saxon on BlackLab 4.x

BlackLab v4.x used either VTD or Saxon as an XML processor. It is recommended to use Saxon there as well. Place processor: saxon in your .blf.yaml file to ensure you're using Saxon in that version.

Saxon supports XPath 3.1, which is very powerful and can help when writing complex indexing configurations.

Certain complex indexing features can be avoided when using Saxon; many things can be done in XPath directly. See XPath examples to get an idea of the wide range of possibilities.
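As a rough sketch of what this can look like, the fragment below selects Saxon (only needed on BlackLab 4.x) and computes annotation values with standard XPath functions rather than separate processing steps. The element and attribute names (doc, text, w, @lemma) are hypothetical, and the surrounding keys assume the usual annotatedFields layout of a format config.

```yaml
processor: saxon                 # only needed on BlackLab 4.x

documentPath: //doc              # hypothetical document element

annotatedFields:
  contents:
    containerPath: text          # hypothetical
    wordPath: .//w               # hypothetical
    annotations:
      - name: word
        # Collapse stray whitespace inside the word element
        valuePath: "normalize-space(.)"
      - name: lemma
        # lower-case() requires XPath 2.0 or later
        valuePath: "lower-case(@lemma)"
```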

BlackLab up to version 4 defaulted to the VTD-XML processor, with Saxon available as an option (processor: saxon in .blf.yaml). On the development branch (future BlackLab 5), Saxon is the only supported XML processor.

Namespaces

If your XML documents use namespaces, you must declare these in the namespaces section of your format config file so your XPath expressions work correctly. Note that the xml namespace is implicit and does not need to be declared (since dev/5.x).

Example:

```yaml
namespaces:
    tei: http://www.tei-c.org/ns/1.0

# What element starts a new document?
documentPath: //tei:TEI
```

You can also declare a default namespace (i.e. for elements without a prefix) by using an empty string as the prefix. So the following is equivalent to the above:

```yaml
namespaces:
    '': http://www.tei-c.org/ns/1.0
documentPath: //TEI
```

Note that if you do not declare namespaces in your .blf.yaml file, namespaces will be ignored during indexing. This can help to index 'messy' datasets where some documents have schema declarations and others do not.
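For comparison with the two examples above, here is a sketch of the namespace-free variant. With no namespaces section at all, the unprefixed path is intended to match TEI documents whether or not they declare the TEI namespace.

```yaml
# No namespaces section: namespaces are ignored during indexing,
# so this path applies to <TEI> with or without a namespace declaration.
documentPath: //TEI
```

Keep in mind that this leniency applies to indexing only; the generated XSLT described further down still relies on correctly declared namespaces.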

Unicode normalization

Unicode normalization is the process of converting different ways of encoding the same character to a single, canonical form. For example, the character é can be encoded as a single code point é (U+00E9), or as the letter e (U+0065) followed by a combining acute accent (U+0301).

BlackLab's built-in indexers should automatically normalize to NFC (Normalization Form Canonical Composition). This should prevent any issues when sorting or grouping.

More about Unicode equivalence and normal forms

Automatic XSLT generation

If you're creating your own corpora by uploading data to BlackLab Frontend, you will want to be able to view your documents as well, without having to write an XSLT yourself. BlackLab Server can generate a default XSLT from your format config file. However, because BlackLab is more lenient with namespaces than the XSLT processor that generates the document view, the generated XSLT will only work correctly if you declare your namespaces correctly in your format config file.

IMPORTANT: generating the XSLT might not work correctly if your XML namespaces change throughout the document, e.g. if you declare local namespaces on elements instead of declaring them all on the root element.

Namespaces can be declared in the top-level "namespaces" block, which is simply a map of namespace prefix (e.g. "tei") to the namespace URI (e.g. http://www.tei-c.org/ns/1.0). So for example, if your documents declare namespaces as follows:

```xml
<doc xmlns:my-ns="http://example.com/my-ns" xmlns="http://example.com/other-ns">
...
</doc>
```

Then your format config file should contain this namespaces section:

```yaml
namespaces:
  '': http://example.com/other-ns    # The default namespace
  my-ns: http://example.com/my-ns
```

If you forget to declare some or all of these namespaces, the document might index correctly, but the generated XSLT won't work and will likely show a message saying that no words have been found in the document. Updating your format config file should fix this; re-indexing shouldn't be necessary, as the XSLT is generated directly from the config file, not the index.

Configuration versions 1 and 2

BlackLab 4 introduced a version 2 of the .blf.yaml format. On dev/5.x, this version is the default, and the version: 2 declaration at the top of the format file is no longer necessary.

Version 2 of the format file introduced a few breaking changes to be aware of:

  • default XML processor is now saxon (used to be vtd). Saxon is faster and supports modern XPath features, making it much more flexible.
  • baseFormat key (to inherit from a different format config) is no longer allowed. Instead, you should copy the format and customize it to suit your needs.
  • word and lemma no longer have a special default sensitivity. All user-defined annotations now default to insensitive. To remain compatible with the old behaviour, explicitly specify sensitivity: sensitive_insensitive for word and lemma.
  • dash - in field or annotation names will no longer automatically be replaced with underscore _ (this was never necessary; field and annotation names must be valid XML names, which may contain dashes). If you rely on this quirk, replace dash with underscore manually in your config.
  • processing step default was renamed to ifempty, to better describe how it's commonly used.
  • inlineTags keys includeAttributes, excludeAttributes and extraAttributes have been removed. Instead, use the attributes key to specify which attributes to index. Add valuePath if this is an extra attribute (that doesn't actually appear on the tag, but should be added based on the XPath expression). Use exclude: true to exclude an attribute. If the first entry contains no name, only exclude: true, this means "exclude any attribute not in this list".
  • append processing step now has a prefix parameter in addition to the separator parameter. separator still defaults to a space, but is now only used to separate multiple metadata field values. prefix defaults to the empty string, and is used to prefix the value to be appended. This means you won't get an extra space by default when appending a value. Add prefix: ' ' (or whatever you set as separator) for the old behaviour.
  • The multipleValues and allowDuplicateValues keys on annotations have been removed. Both work automatically now: if your config produces multiple values for an annotation, they will be indexed, and any duplicates that may arise are automatically removed.
  • The mapValues key on metadata fields has been removed. Use the map processing step instead, which can be used anywhere where processing steps are allowed.
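Several of the changes listed above can be sketched in a single version 2 config fragment. This is an illustration under stated assumptions, not a complete format file: all element names and XPath expressions (text, w, s, sp, header, title) are hypothetical, and the value parameter on the ifempty and append steps is assumed to work the same way as in version 1 configs.

```yaml
version: 2                       # needed on BlackLab 4; implied on dev/5.x

annotatedFields:
  contents:
    containerPath: text          # hypothetical
    wordPath: .//w               # hypothetical
    annotations:
      # Restore the old default sensitivity for word and lemma explicitly
      - name: word
        valuePath: "."
        sensitivity: sensitive_insensitive
      - name: lemma
        valuePath: "@lemma"
        sensitivity: sensitive_insensitive
    inlineTags:
      - path: .//s
        attributes:
          # First entry has no name, only exclude: true,
          # so any attribute not in this list is excluded
          - exclude: true
          - name: id
          # Extra attribute that does not appear on the tag itself
          - name: speaker
            valuePath: "ancestor::sp/@who"   # hypothetical XPath

metadata:
  containerPath: header          # hypothetical
  fields:
    - name: title
      valuePath: title
      process:
        # Formerly called "default" (value parameter is an assumption)
        - action: ifempty
          value: untitled
        # prefix: ' ' restores the old behaviour of a space
        # before the appended value
        - action: append
          value: "(draft)"
          prefix: ' '
```

The baseFormat and mapValues keys are deliberately absent here, since version 2 no longer accepts them.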

Apache license 2.0