Parallel corpus querying

Supported from v4.0

Indexing and searching parallel corpora will be supported from BlackLab 4.0 (and current development snapshots).

A parallel corpus is a corpus that contains multiple versions of the corpus content, usually from different languages and/or time periods, and records the alignment between the versions at different levels (e.g. paragraph, sentence, word).

For example, you could have a parallel corpus of EU Parliament discussions in the various European languages, or a parallel corpus of different translations of a classic work such as Homer's Odyssey.

See indexing parallel corpora.

BlackLab's parallel corpus functionality uses cross-field relations to find alignments between the content versions available in your corpus.

The alignment operator ==> is specifically meant for finding alignments between versions in your corpus. It essentially means "capture all relations between (part of) the left and right span". The captured relations are returned as a list in the response.

Basic parallel querying

For example, if your corpus contains fields contents__en (English version) and contents__nl (Dutch version), and English is the default field (the first one defined in your indexing config), you can find the Dutch translation of an English word using:

"cat" ==>nl _

The hit for this query will be cat in the English field, and the match info will contain a group named rels with all alignment relations found (just the one in this case, between the word cat and its Dutch equivalent). The hit response structure will also contain an otherFields section containing the corresponding Dutch content fragment. The location of the Dutch word aligned with the English word cat can be found from the relation in the rels capture, which includes targetField, targetStart and targetEnd.
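As a rough illustration of reading those target positions on the client side, here is a minimal Python sketch. It assumes the match info for a hit has already been parsed from JSON into a dict, and that a relations group stores its relations in a list under a key called infos. That key name and the sample data are assumptions made for illustration; only targetField, targetStart and targetEnd come from the description above. The helper name aligned_target_spans is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def aligned_target_spans(
    match_info: Dict[str, Any], group: str = "rels"
) -> List[Tuple[str, int, int]]:
    """Return (targetField, targetStart, targetEnd) for each relation in a group."""
    # Assumption: the group keeps its relations in a list under "infos".
    relations = match_info.get(group, {}).get("infos", [])
    return [
        (rel["targetField"], rel["targetStart"], rel["targetEnd"])
        for rel in relations
    ]


# Hand-written sample for illustration only, not real server output.
sample_match_info = {
    "rels": {
        "infos": [
            {"targetField": "contents__nl", "targetStart": 12, "targetEnd": 13}
        ]
    }
}

spans = aligned_target_spans(sample_match_info)
```

Check the actual response of your BlackLab version before relying on this shape; the point is only that each relation carries enough information to locate the aligned span in the target field.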

Assuming your data has both sentence and word alignments, and you want to find all alignments for a sentence containing cat, you could use:

<s/> containing "cat" ==>nl _

This should find aligning English and Dutch sentences, including any word alignments between words in those sentences. You can also filter by alignment type, as we'll show later.

Required versus optional alignment

The ==> operator will require that an alignment exists. If you wish to see all hits on the left side of the ==>nl regardless of whether any alignments to the right side can be found, use ==>nl?.

For example, if you're searching for translations of cat to Dutch, with ==>nl you will only see instances where cat is aligned to a Dutch word; on the other hand, with ==>nl? you will see both English cat hits where the translation to Dutch was found, and cat hits where it wasn't.
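If you generate queries programmatically, the difference between the two forms comes down to a trailing question mark on the operator. The following Python sketch builds both variants as plain strings; alignment_query is a hypothetical helper, not part of BlackLab.

```python
def alignment_query(
    source: str, version: str, target: str = "_", optional: bool = False
) -> str:
    """Build a query string of the form: source ==>version[?] target."""
    # A trailing "?" on the operator makes the alignment optional.
    operator = "==>" + version + ("?" if optional else "")
    return f"{source} {operator} {target}"


# Only cat hits that have a Dutch alignment.
required_query = alignment_query('"cat"', "nl")

# All cat hits, with or without a Dutch alignment.
optional_query = alignment_query('"cat"', "nl", optional=True)
```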

Switching the main search field

If you want to search the Dutch version instead, and find alignments with the English version, you would use this query:

"kat" ==>en _

But of course, the main search field shouldn't be contents__en in this case; we want to switch it to contents__nl. You can specify a main search field other than the default with the BLS parameter field. In this case, if you specify field=nl, BlackLab will automatically recognize that you're specifying a version of the main annotated field and use the correct 'real' field, probably contents__nl in this case.
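A minimal Python sketch of assembling such a request URL is shown here. The server address and corpus name are placeholders, the hits endpoint path and the patt parameter are assumed to match your BlackLab Server setup, and hits_url is a hypothetical helper. Only the field parameter is taken from the text above.

```python
from typing import Optional
from urllib.parse import urlencode


def hits_url(
    base_url: str, corpus: str, pattern: str, field: Optional[str] = None
) -> str:
    """Build a hits request URL, optionally switching the main search field."""
    params = {"patt": pattern}  # assumed name of the query pattern parameter
    if field is not None:
        params["field"] = field  # e.g. "nl" to search the Dutch version
    return f"{base_url.rstrip('/')}/{corpus}/hits?{urlencode(params)}"


# Placeholder server and corpus name; adjust to your own installation.
url = hits_url(
    "http://localhost:8080/blacklab-server",
    "my-parallel-corpus",
    '"kat" ==>en _',
    field="nl",
)
```

Using urlencode takes care of escaping the quotes, spaces and operator characters in the query, which is easy to get wrong when concatenating strings by hand.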

Filtering the target span

In the previous example, we used _ as the target span. This is the default, and means "the best matching span".

But you can also specify a different target span. For example, to find where fluffy was translated to pluizig:

"fluffy" ==>nl "pluizig"

This will execute the left and right queries on their respective fields and match the hits by their alignment relations.

Multiple alignment queries

You can also use multiple alignment operators in a single query to match to more than one other version:

"fluffy" ==>nl "pluizig" ;
         ==>de "flauschig"
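When the set of target versions varies, a query like the one above can be assembled from a mapping of version codes to target patterns. This Python sketch shows one way to do that; multi_alignment_query is a hypothetical helper that only produces the query string.

```python
from typing import Dict


def multi_alignment_query(source: str, targets: Dict[str, str]) -> str:
    """Join one alignment clause per target version with ' ; '."""
    # Dicts keep insertion order, so clauses appear in the order given.
    clauses = [f"==>{version} {target}" for version, target in targets.items()]
    return f"{source} " + " ; ".join(clauses)


query = multi_alignment_query(
    '"fluffy"', {"nl": '"pluizig"', "de": '"flauschig"'}
)
```

The result is the same query as above, written on a single line.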

Only matching some (alignment) relations

Just like with other relations queries, you can filter by type:

"fluffy" =word=>nl "pluizig"

This will only find relations of type word. The type filter will automatically determine the capture name as well, so any relation(s) found will be captured as word in this case instead of rels (unless an explicit name is assigned, see below).

Renaming the relations capture

You can override the default name rels for the alignment operator's captures:

<s/> alignments:==>nl _

Now the alignment relations will be captured in a group named alignments.

Capturing in target fields

You can capture parts of the target query like normal, e.g.:

"and" w1:[] ==>nl "en" w2:[]

There will be one match info named w1 for the primary field searched (English in this case), and one named w2 for the target field (Dutch).

rfield(): get only hits from a target field

If you only want to see hits from the target field, you can use the rfield operator:

rfield("fluffy" =word=>nl "pluizig", "nl")

This can be useful when, after running a parallel query, you want to show the highlighted contents of one of the target fields. In that case, you only want the target hits (in contents__nl), not the source hits (in e.g. contents__en).
