Metadata
The simple example included a way to index embedded metadata. Let's say this is our input file:
<?xml version="1.0" ?>
<root>
  <document>
    <text>
      <!-- ... document contents... -->
    </text>
    <metadata id='1234'>
      <meta name='title'>How to configure indexing</meta>
      <meta name='author'>Jan Niestadt</meta>
      <meta name='description'>Shedding some light on this indexing business!</meta>
    </metadata>
  </document>
</root>
To configure how metadata should be indexed, you can either name each metadata field you want to index separately, or you can use forEachPath to index a number of similar elements as metadata:
## Embedded metadata in document
metadata:
# What element contains the metadata (relative to documentPath)
- containerPath: metadata
  # What metadata fields do we have?
  fields:
  # <metadata/> tag has an id attribute we want to index as docId
  - name: docId
    valuePath: "@id"
  # Each <meta/> child element of <metadata/> corresponds with a metadata field
  - forEachPath: meta
    namePath: "@name" # name attribute contains field name
    valuePath: . # element text is the field value
It's also possible to process metadata values before they are indexed (see Processing values), although it's often preferable to do as much processing as possible in XPath.
As you can see, metadata is a list, so you can define several metadata blocks, each with their own containerPath.
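To make the forEachPath configuration above concrete, here is a minimal Python sketch that mimics the mapping on the sample input file. It uses the standard library's xml.etree.ElementTree rather than BlackLab itself, and the helper name extract_metadata is hypothetical. It only illustrates which field names and values the configuration would be expected to produce, not how BlackLab implements it.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0" ?>
<root>
  <document>
    <text></text>
    <metadata id='1234'>
      <meta name='title'>How to configure indexing</meta>
      <meta name='author'>Jan Niestadt</meta>
      <meta name='description'>Shedding some light on this indexing business!</meta>
    </metadata>
  </document>
</root>"""


def extract_metadata(document):
    """Hypothetical helper mirroring containerPath, a named field and forEachPath."""
    fields = {}
    container = document.find("metadata")       # containerPath: metadata
    fields["docId"] = container.get("id")       # name: docId, valuePath: "@id"
    for meta in container.findall("meta"):      # forEachPath: meta
        fields[meta.get("name")] = meta.text    # namePath: "@name", valuePath: .
    return fields


root = ET.fromstring(SAMPLE)
for doc in root.findall("document"):
    print(extract_metadata(doc))
```

Each meta element becomes one field, named after its name attribute, next to the explicitly configured docId field.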
Actually, starting from dev/5.x you can even nest metadata blocks, so you can use multiple levels of containerPaths:
metadata:
- containerPath: //metadata
  blocks:
  - containerPath: author # relative to //metadata
    fields:
    - name: authorName
      valuePath: name # relative to //metadata/author
    - name: authorYearOfBirth
      valuePath: yearOfBirth # relative to //metadata/author
  - containerPath: title # relative to //metadata
    fields:
    - name: titleLevel1
      valuePath: main # relative to //metadata/title
    - name: titleLevel2
      valuePath: sub # relative to //metadata/title
As you can see, this can help reduce duplication, keeping your XPath expressions short and readable.
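The way nested container paths resolve relative to one another can be sketched in Python. This is an illustration only: the sample input, the BLOCKS structure and the extract_nested helper are all hypothetical, and the code uses xml.etree.ElementTree instead of BlackLab's own XPath processing.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; element names follow the paths used in the configuration.
SAMPLE = """<document>
  <metadata>
    <author><name>Example Author</name><yearOfBirth>1970</yearOfBirth></author>
    <title><main>Main title</main><sub>Subtitle</sub></title>
  </metadata>
</document>"""

# Nested blocks as (containerPath, [(field name, valuePath), ...]) pairs.
BLOCKS = [
    ("author", [("authorName", "name"), ("authorYearOfBirth", "yearOfBirth")]),
    ("title", [("titleLevel1", "main"), ("titleLevel2", "sub")]),
]


def extract_nested(document):
    """Resolve each valuePath relative to its inner container."""
    fields = {}
    outer = document.find(".//metadata")          # containerPath: //metadata
    for container_path, field_defs in BLOCKS:
        inner = outer.find(container_path)        # relative to //metadata
        for name, value_path in field_defs:
            fields[name] = inner.findtext(value_path)  # relative to inner container
    return fields


print(extract_nested(ET.fromstring(SAMPLE)))
```

Because every lookup starts from the enclosing container, the individual value paths stay as short as name or main.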
Tokenize or not?
By default, metadata fields are tokenized, but it can sometimes be useful to index a metadata field without tokenizing it. One example is a field containing the document id: if your document ids contain characters that would normally indicate a token boundary, such as a period (.), each id would be split into several tokens, which is usually not what you want.
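The following Python sketch shows the problem, assuming a naive tokenizer that splits on anything that is not a letter or digit. BlackLab's actual tokenization is not reproduced here and may differ in detail; the naive_tokenize function and the sample id are hypothetical.

```python
import re


def naive_tokenize(value):
    """Split on runs of non-alphanumeric characters (a stand-in, not BlackLab's tokenizer)."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", value) if token]


doc_id = "corpus.2019.0042"  # hypothetical document id containing periods

tokenized = naive_tokenize(doc_id)  # the id falls apart into separate tokens
untokenized = [doc_id]              # the id is kept as one value

print(tokenized)
print(untokenized)
```

With the tokenized variant, an exact lookup of the full id no longer corresponds to a single indexed value, which is why an untokenized field suits identifiers better.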
To prevent a metadata field from being tokenized:
metadata:
- containerPath: metadata
  fields:
  # This field should not be split into words
  - name: docId
    valuePath: "@docId"
    type: untokenized
Numeric fields
To index a numeric field (currently supports only integer values):
metadata:
- containerPath: metadata
  fields:
  - name: year
    valuePath: publication/year
    type: numeric
We may consider adding other specific field types (floating point, date, vector) in the future.
Linking to external document metadata
Old linkedDocuments feature
Previously, external metadata could be indexed using the complex linkedDocuments feature. This was removed after 4.x.
The doc() approach here is a standard XPath technique and is significantly easier to use. It is available from dev/5.x but may not work properly in older versions.
If your metadata is stored in separate files, you can use the XPath doc() function to load the metadata file and extract the relevant information.
For example, if your document looks like this:
<?xml version="1.0" ?>
<document id="12345">
  <text>
    <!-- ... document contents... -->
  </text>
</document>
And the metadata file metadata/12345.xml looks like this:
<?xml version="1.0" ?>
<metadata>
  <title>How to configure indexing</title>
  <author>Jan Niestadt</author>
</metadata>
You can configure the metadata indexing like this:
## Embedded metadata in document
metadata:
# What element contains the metadata (relative to documentPath)
# (but here we actually point to an external file using doc())
- containerPath: doc(concat('metadata/', ./@id, '.xml'))
  # What metadata fields do we have?
  fields:
  # Load metadata from external file using doc()
  - name: title
    valuePath: ./metadata/title
  - name: author
    valuePath: ./metadata/author
Schemes
By default, doc() looks for files on the local filesystem.
However, you can prefix the path with a scheme to load files from other sources. For example, use https://example.com/some/path.xml to load from a web server.
You can also use archive: to load a file from an archive, for example:
containerPath: doc(concat('archive:metadata.zip/', ./@id, '.xml'))
You can even add your own schemes. For example, my-db:12345 might load a document from a database. Each scheme, such as archive, refers to an IndexSourceType plugin, and adding one isn't difficult.
(archive and custom schemes using plugin available from dev/5.x)
Using the original input file path
The variable $inputFilePath is available in XPath expressions if you need the path of the input file. For example, if your input file is content/doc0123.xml and you want to link to metadata/meta0123.xml, you could use this XPath:
containerPath: doc(concat('metadata/meta', replace($inputFilePath, '^.*doc(\d+)\.xml$', '$1'), '.xml'))
(available from dev/5.x)
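To check what the regular expression in the XPath above extracts, the same replace-and-concat logic can be reproduced in Python with re.sub. This is only a way to test the pattern outside BlackLab; the function name metadata_path is hypothetical.

```python
import re


def metadata_path(input_file_path):
    """Build the metadata path the same way the XPath expression does."""
    # XPath: replace($inputFilePath, '^.*doc(\d+)\.xml$', '$1')
    number = re.sub(r"^.*doc(\d+)\.xml$", r"\1", input_file_path)
    # XPath: concat('metadata/meta', ..., '.xml')
    return "metadata/meta" + number + ".xml"


print(metadata_path("content/doc0123.xml"))  # prints metadata/meta0123.xml
```

Note that when the pattern does not match, both re.sub and the XPath replace() function return the input unchanged, so an unexpected file name yields a path that most likely does not exist rather than an error at this step.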
Store (part of) a linked document
If you need to store the entire metadata XML content, this should work:
metadata:
- containerPath: doc(concat('metadata/', ./@id, '.xml'))
  fields:
  - name: metadata-xml
    valuePath: serialize(.)
Custom properties
Note that custom properties may be removed in a future version.
Just like with annotations, you can specify a displayName, description and uiType for a metadata field. This information is not used by BlackLab, but can be used by BlackLab Frontend or another application.
For example, see Metadata (Filters).
In the fields section, you can specify uiType for each field to override the default GUI widget to use for the field. By default, fields that have only a few values will use select, while others will use text. There's also a range type for a range of numbers.
Example:
metadata:
- fields:
  - name: author
    uiType: select
  - name: year
    uiType: range
  - name: genre
    uiType: text
Again, note that these properties may be removed from the .blf.yaml file specification in the future. It makes more sense to configure the frontend directly, for example using a custom script. See Customizing the interface.
Add a fixed metadata value to each document
You can add a field with a fixed value to every document indexed. This could be useful if you plan to add several data sets to one index and want to make sure each document is tagged with the data set name. To do this, simply specify value instead of valuePath.
metadata:
- containerPath: metadata
  fields:
  # Regular metadata field
  - name: author
    valuePath: author
  # Metadata field with fixed value
  - name: collection
    value: blacklab-docs
Corpus metadata
Each BlackLab corpus has its own metadata, recording information such as the time the index was generated and the BlackLab version used, plus information about annotations and metadata fields.
Some of this information is generated as part of the indexing process, and some of the information is copied directly from the input format configuration file if specified. This information is mostly used by applications to learn about the structure of the corpus, get human-friendly names for the various parts, and decide what UI widget to show for a metadata field.
The best way to influence the corpus metadata is by including a special corpusConfig section in your format configuration file. This section may contain certain settings that are copied directly into the index metadata when the index is created:
# The settings in this block will be copied into indexmetadata.yaml
corpusConfig:
  # Some basic information about the corpus that may be used by a user interface.
  displayName: OpenSonar # Corpus name to display in user interface
  description: The OpenSonar corpus. # Corpus description to display in user interface
  contentViewable: false # Is the user allowed to view whole documents? [false]
  textDirection: LTR # What's the text direction of this corpus? [LTR]
  # Metadata fields with a special meaning
  specialFields:
    pidField: id # unique persistent identifier, used for document lookups, etc.
    titleField: title # used to display document title in interface
    authorField: author # used to display author in interface
    dateField: date # used to display document date in interface
  # How to group metadata fields in user interface
  metadataFieldGroups:
  - name: First group # Text on tab, if there's more than one group
    fields: # Metadata fields to display on this tab
    - author
    - title
  - name: Second group
    fields:
    - date
    - keywords
If you add addRemainingFields: true to one of the groups, any field that wasn't explicitly listed will be added to that group.
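The effect of addRemainingFields can be pictured with a small Python sketch. It is an illustration under stated assumptions, not BlackLab's implementation: the resolve_groups helper and the extra fields genre and year are hypothetical, and if several groups set the flag this sketch simply uses the first one, which the text above does not specify.

```python
def resolve_groups(groups, all_fields):
    """Append fields not listed in any group to the group with addRemainingFields."""
    listed = {field for group in groups for field in group["fields"]}
    remaining = [field for field in all_fields if field not in listed]
    resolved = []
    assigned = False
    for group in groups:
        fields = list(group["fields"])
        if group.get("addRemainingFields") and not assigned:
            fields.extend(remaining)  # assumption: only the first flagged group gets them
            assigned = True
        resolved.append({"name": group["name"], "fields": fields})
    return resolved


groups = [
    {"name": "First group", "fields": ["author", "title"]},
    {"name": "Second group", "fields": ["date", "keywords"], "addRemainingFields": True},
]
all_fields = ["author", "title", "date", "keywords", "genre", "year"]

for group in resolve_groups(groups, all_fields):
    print(group["name"], group["fields"])
```

Here the unlisted fields genre and year end up on the second tab, after the fields that were listed explicitly.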
There's also a complete annotated index metadata file if you want to know more details about that.
There are also (hacky) ways to make changes to the corpus metadata after it was indexed: you can export the metadata to a file and re-import it later (older corpora had an external indexmetadata.yaml file that could be edited directly). Start the IndexTool with --help to learn more, but be careful, as it is easy to make the index unusable this way.