Spans

A span is a group of contiguous words that you want to index and search as a single unit. Examples are sentences and named entities.

The simple example is configured to index <s/> elements in the input document as spans:

yaml
# What tags occurring between the word tags do we wish to index? (relative to containerPath) 
inlineTags:
    # Sentence tags
    - path: .//s

This means we will be able to run searches like:

"oak" "tree" within <s/>

(more about using spans in queries here)

There are a few additional parameters you can set for inline tags.

Excluding attributes or indexing extra attributes

Make sure your .blf.yaml file starts with processor: saxon to ensure modern XPath compatibility.

yaml
# What tags occurring between the word tags do we wish to index? (relative to containerPath) 
inlineTags:
    # Sentence tags
    - path: .//s
      attributes:
        # Don't index unique ids unless you need them; 
        # they slow down indexing and searching and increase index size
        - name: "xml:id"
          exclude: true   # all attributes except this one will be indexed
    - path: .//p
      attributes:
        # Exclude all attributes except...
        - exclude: true
        # Attribute on tag
        - name: "type"
        # extra attribute using XPath
        # if e.g. input is <p xml:id="par-12">...</p> , index  number="12"
        - name: "number"
          valuePath: "substring-after(@xml:id, 'par-')"
          
    - path: .//ne
      displayAs: named-entity    # what CSS class to use (when using autogenerated XSLT)

As you can see, marking an attribute with exclude: true prevents a unique id from ballooning the index size (although of course you won't be able to search sentences by their id anymore). displayAs gives the span a different CSS class in the generated XSLT (see Automatic XSLT generation).

attributes can also be used to index attributes that are not actually present on the tag in the input document, by evaluating an XPath expression given in valuePath.

You can also apply process steps to attributes.
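As a rough sketch of what that could look like, the configuration below derives the same number attribute as above, but reads the whole xml:id value and strips the par- prefix with a processing step instead of doing it in XPath. It assumes that the process key and the replace action (with find and replace parameters) work on inline tag attributes the same way they do on token annotations, and that find is treated as a regular expression; check the processing steps documentation for the exact options before relying on it.

```yaml
inlineTags:
    - path: .//p
      attributes:
        # Exclude all attributes except the ones listed below
        - exclude: true
        - name: "number"
          # Start from the full id, e.g. "par-12"
          valuePath: "@xml:id"
          process:
            # Assumed step: remove the leading "par-" so "12" is indexed
            - action: replace
              find: "^par-"
              replace: ""
```

For a transformation this simple, the XPath version with substring-after is probably clearer. Processing steps are more likely to pay off when the same cleanup has to be applied to values that are awkward to handle in XPath.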

Standoff annotations for spans

Just like you can use standoffs to index regular token annotations, you can also use them to index spans.

This might fit better with how your input files are structured. Another advantage is that it allows you to index partially overlapping spans, which is not possible with regular (hierarchically nested) inline tags.

This is done using spanStartPath, spanEndPath and spanNamePath (instead of tokenRefPath used for token annotations).

So to index this XML:

xml
<doc>
    <w xml:id="w1">The</w>
    <w xml:id="w2">quick</w>
    <w xml:id="w3">brown</w>
    <w xml:id="w4">fox</w>
    <w xml:id="w5">jumps</w>
    <w xml:id="w6">over</w>
    ...
    <span from="w1" to="w4" type="animal" speed="fast" />
</doc>

You can use this standoffAnnotations configuration:

yaml
tokenIdPath: "@xml:id"

standoffAnnotations:
- path: .//span
  spanStartPath: "@from"
  spanEndPath: "@to"
  spanEndIsInclusive: true
  spanNamePath: "@type"
  annotations:
    - name: speed
      valuePath: "@speed"

Note the setting spanEndIsInclusive: true, which indicates that the to attribute refers to the last token of the span, not the first token after the span. (true is the default value for this setting, but it is included here for completeness.)
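For comparison, here is a sketch of the opposite case. It assumes a hypothetical input in which the to attribute names the first token after the span, so the same span would be written as from="w1" to="w5". The only change is setting spanEndIsInclusive to false.

```yaml
tokenIdPath: "@xml:id"

standoffAnnotations:
- path: .//span
  spanStartPath: "@from"
  spanEndPath: "@to"
  # Hypothetical input: <span from="w1" to="w5" type="animal" speed="fast" />
  # Here "to" points at the first token AFTER the span, so the span
  # still covers w1..w4 ("The quick brown fox") and excludes w5 ("jumps").
  spanEndIsInclusive: false
  spanNamePath: "@type"
  annotations:
    - name: speed
      valuePath: "@speed"
```

Using the wrong value for your input would make every span one token too long or too short, so it is worth checking a few indexed spans after changing this setting.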

The above would allow you to search for <animal/> containing "fox" or <animal speed="fast" /> to find "The quick brown fox".

Apache license 2.0