QueryTool
What is it?
QueryTool is a simple command-driven search tool that provides a demonstration of the querying features of BlackLab. It allows you to search a corpus, for example to test that indexing was succesful.
Starting the QueryTool
Once you've indexed your data (see Getting Started), you can start QueryTool to test if indexing went well:
java -cp "blacklab-VERSION.jar:lib" nl.inl.blacklab.tools.QueryTool /data/blacklab-corpora/my-corpus
You can type a BCQL query at the prompt.
Here's a few things to try:
- Type
help
to see a list of commands. - For a simple query, just enclose each word between double quotes:
"the" "egg"
searches for "the" followed by "egg" - You can user regular expressions:
".*g"
searches for words ending with 'g' - If you want to get more of a feel for what kinds of matches were found, try grouping by matched text using the command "group match". Then, if you want to view one of the groups, use
group *n*
. - Type
struct
to see the structure of the corpus, including available annotations and metadata fields. - If you have a
lemma
annotation, you can use[lemma="work"]
to find forms of the verb "work". Similarly, if you have apos
annotation, you can use[pos="NOU-C"]
to find common nouns.
Paging, sorting and grouping
As a first query, type e.g. "the"
and press Enter. The first twenty hits for this query are shown. To page through the results, use the commands next
and previous
(or their one-letter abbreviations). You can also change the number of hits displayed per page by using the pagesize
command followed by a number.
You can sort the hits using the command sort <criterium>
. The criterium can be e.g. match
, left
or right
. match
sorts by matched text, left
sorts by left context (the text to the left of the matched text), and right
sorts by right context. You can also specify an annotation to sort on, e.g. word
, lemma
, or pos
. If you don't specify this, hits will be sorted by word
.
You can group the hits using group <criterium>
. Again, match
, left
and right
can be used. Here, left
and right
group on the single word occurring to the left or right of the matched text. Just like sort, you can optionally specify an annotation (word
, lemma
, pos
) to group on. word
is the default.
Once you group hits, you enter group mode. The groups are displayed in columns: group number, group size and group identity. If there are many groups, you can page through the groups using the same command as for hits. You can also sort the groups by identity
or size
.
To examine the hits in a group, enter group n
, where n is the group number displayed at the beginning of the line (the second number is the group size). To leave group mode and go back to showing all hits, enter hits
. To get back to group mode, enter groups
.
BlackLab Corpus Query Language (BCQL)
The demo starts out in BlackLab Corpus Query Language (BCQL) mode, which by far the most versatile of the supported languages. This query language expresses queries as sequences of token queries. It is therefore mainly useful to find specific types of phrases in a larger text.
An example of a simple query (note that the quotes are required):
"the" "tab.\*"
This searches for the word the followed by a word starting with tab, such as table. As you can see, regular expressions can be used to build token queries. Equivalent to the above query is:
[word="the"] [word="tab.\*"]
In addition to using regular expressions to express single-token restrictions, a similar notation can be used to express restrictions on sequences of tokens. For example:
"no.\*"{2,}
This query finds two or more successive words starting with no, for example no nonsense. You can also use the regular expression operators such as *
, +
and ?
to build multi-token regular expressions:
"in" "the"? "great" "la.\*"
If your corpus is tagged with lemma
(head word) and part of speech, you can search for these features as well:
[lemma="be"] [lemma="stay"]
This find forms of these verbs occurring together, e.g. is staying.
[pos="a.\*"]+ "man"
This finds the word man with one or more adjectives applied to it.
For much more about BCQL, see the Corpus Query Language documentation.
QueryTool reference
CorpusQL examples
Find words starting with "sta":
"sta.\*"
Find "man" preceded by at least 2 adjectives:
[type="a.\*"]{2,} "man"
Find "stad" and "dorp" with one word in between:
"stad" [] "dorp"
Find "stad" and "dorp" with 2-10 words in between:
"stad" []{2,10} "dorp"
Find all words:
[]
Find all bigrams:
[] []
[]{2}
"de" at the start of a named entity:
<ne\> "de"
"poorter" at the end of a named entity:
"poorter" </ne\>
Named entities containing "de":
<ne/\> containing "de"
"de" within a named entity:
"de" within <ne/\>
All named entities:
<ne/\>
All persons:
<ne type="per"/\>
Person names containing "van":
<ne type="per"/\> containing "van"
Locations starting with "de":
<ne type="loc" /\> containing <ne\> "de"
Other commands
Grouping
group match
group match lemma
group match pos
Sorting
sort left
sort right lemma
Paging through results:
n(ext)
p(revious)
Change context size (number of words around hit):
context 3
context 10
Show document title in KWIC view:
doctitle on
doctitle off
Case/diacritics-sensitivity:
sensitive on
sensitive off
Filter on metadata (uses Lucene query syntax):
filter title.level1:"courant"
filter author.level1:"jansen"
filter author.level1:"sterkenburg" author.level2:"sterkenburg"
filter (Filter weer leegmaken)
Show corpus structure:
struct
structure
Commandline editing
Commandline editing is available if the JLine JAR is found on the classpath.