BlackLab webservice API evolution

OLDER CONTENT

This page contains ideas that are partially obsolete. See API versions for the current state of the API.

The BLS API has quite a few quirks that can make it confusing and annoying to work with.

We intend to evolve the API over time, with new versions that gradually move away from the bad parts of the old API. This can be done using the api parameter to switch between versions, or by adding endpoints or response keys while still supporting the old ones for a while to allow time to transition.

For a comparison between the different API versions currently available, see API versions.

For some older ideas for example requests and responses, see here.

Fewer optional response keys

The BLS API has quite a few response keys that are only included if certain conditions apply. This can make working with the API trickier. Some of these keys should probably always be included (and be empty if not applicable). This is an (incomplete) list of such keys.

Regular API optional keys:

  • user.id: only if user.loggedIn == true (otherwise ""?)
  • summary.indexStatus: only if status != AVAILABLE (always?)
  • summary.matchInfos: only if there are matchInfos (otherwise empty object?)
  • summary.matchInfos[..].fieldName: only if it differs from summary.fieldName (always?)
  • summary.sample*: only if sampling was used (ok? but should be grouped); either summary.samplePercentage or sampleSize (better to have .type: percentage / .number: 30 ?)
  • summary.window*: only if window != null (but this probably never occurs..?)
  • summary.numberOfGroups / largestGroupSize: only if grouped (ok? but should be grouped?)
  • summary.resultsStats.stoppedBecauseTooMany: only if that is the case (always!)
  • summary.resultsStats.countOnly: only if stoppedBecauseTooMany (always!)
  • summary.subcorpusSize.tokens: only if available (when is it not available...?)
  • summary.subcorpusSize.annotatedFields: only if there are multiple annotated fields (always?)
  • hit/snippet docPid/start/end: only if docPid is not empty (should never happen; always include the fields?)
  • hit/snippet matchInfos: only if not empty (otherwise empty object?)
  • matchInfo attributes: only if present (otherwise empty object?)
  • indexProgress: only if status == INDEXING (ok?)
  • annotatedField/metadataField.indexName: only for /fields/... request? (remove?)
  • annotation.parentAnnotation: only if subannotation (ok? check whether subannotations can be removed)
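
Until such keys are always included, a client can fill in the missing ones itself. The following is a minimal sketch of that idea, assuming the summary has already been parsed from JSON into a Python dict. The function name normalize_summary and the chosen default values (empty object, false) are illustrative assumptions taken from the questions in the list above, not documented BlackLab behavior.

```python
import copy


def normalize_summary(summary):
    """Return a copy of a parsed summary with some optional keys filled in.

    The defaults below are assumptions for illustration only; they mirror
    the "(otherwise empty object?)" and "(always!)" notes in the list above.
    """
    result = copy.deepcopy(summary)

    # summary.matchInfos: currently only present if there are matchInfos.
    result.setdefault("matchInfos", {})

    # summary.resultsStats.*: currently only present when the condition holds.
    stats = result.setdefault("resultsStats", {})
    stats.setdefault("stoppedBecauseTooMany", False)
    stats.setdefault("countOnly", False)

    return result


raw_summary = {"fieldName": "contents"}
normalized = normalize_summary(raw_summary)
```

After this step, client code can read normalized["resultsStats"]["stoppedBecauseTooMany"] without first checking whether the key exists; the original dict is left unchanged.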

Parallel corpora are a relatively uncommon use case, so it makes sense not to "pollute" the regular API with parallel-only fields. We still might want to always include these fields for parallel corpora, though.

Parallel corpora optional response keys:

  • summary.pattern.otherFields: only if there are otherFields (otherwise empty array, if parallel?)
  • summary.matchInfos[..].targetField: only if it exists and differs from summary.fieldName (always if parallel?)
  • summary.subcorpusSize.docVersions: only if there are more docVersions than documents (always if parallel?)
  • hit/snippet otherFields: only if there are any (always if parallel?)
  • matchInfo relation targetField: only if different from the source field (always?)
  • matchInfo relation sourceStart/sourceEnd: only if not a root relation (ok?)

API evolution TODO

General guidelines:

  • Publish a clear and complete migration guide
  • Publish complete reference documentation
  • Use corpus/corpora in favor of index/indices.
  • Be consistent: if information is given in multiple places, e.g. on the server info page as well as on the corpus info page, use the same structure and element names (except one page may give additional details).
  • Return helpful error messages.
    (if an illegal value is passed, explain or list legal values, and/or refer to online docs)
  • JSON should probably be our primary output format
    (the XML structure should just be a dumb translation from JSON, for those who need it, e.g. to pass through XSLT; so e.g. no difference in concordance structure between JSON and XML)
  • Avoid custom encodings (e.g. strings with specific separator characters, such as used for HitProperty and related values); prefer a standard encoding such as JSON.
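
The guideline about custom encodings can be illustrated with a small round-trip test. This is a generic sketch, not BlackLab's actual HitProperty encoding: the comma separator and the helper names encode_custom and decode_custom are hypothetical. It shows how a separator-based string loses information as soon as a value contains the separator, while JSON handles the same values without special rules.

```python
import json

SEPARATOR = ","  # hypothetical separator, chosen only for this illustration


def encode_custom(values):
    """Join values with a separator character (no escaping)."""
    return SEPARATOR.join(values)


def decode_custom(text):
    """Split an encoded string back into values."""
    return text.split(SEPARATOR)


values = ["Smith, John", "1999"]

# The custom encoding cannot tell the separator from the comma in the value.
roundtrip_custom = decode_custom(encode_custom(values))

# JSON quotes and escapes each value, so the round trip is lossless.
roundtrip_json = json.loads(json.dumps(values))
```

A custom encoding can of course add its own escaping rules, but every client then has to implement them; with JSON, a standard library parser is enough.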

Already fixed in v4/5:

  • Ensure correct data types, e.g. fieldValues should have integer values, but are strings
  • Fix blacklabBuildTime vs. blackLabBuildTime
  • Added before/after in addition to left/right for parameters (response structure unchanged)
  • Don't include static info on dynamic (results) pages.
    (e.g. don't send display names for all metadata fields with each hits results; the client can request those once if needed)
  • Avoid attributes; use elements for everything.
  • Avoid dynamic XML element names
    (e.g. don't use map keys for XML element names. Not an issue if we copy JSON structure)
  • Add /corpora/* endpoints. This avoids ambiguity with e.g. /blacklab-server/input-formats, and also provides a place to update the API in parallel. That is, these new endpoints will not be 100% compatible but use a newer, cleaner version.
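
The points about avoiding attributes and dynamic XML element names can be sketched as a mechanical JSON-to-XML translation. The element names entry, key, value and item below are hypothetical and chosen only for this illustration; they are not necessarily the names BlackLab uses. The idea is that map keys become element content rather than element names, so a key such as "1999" or "my author" (neither of which is a valid XML element name) causes no problem.

```python
import xml.etree.ElementTree as ET


def to_xml(name, value):
    """Translate a parsed JSON value into an element tree without attributes.

    Element names "entry", "key", "value" and "item" are assumptions made
    for this sketch.
    """
    elem = ET.Element(name)
    if isinstance(value, dict):
        # Map keys go into a <key> element instead of becoming element names.
        for key, child in value.items():
            entry = ET.SubElement(elem, "entry")
            ET.SubElement(entry, "key").text = str(key)
            entry.append(to_xml("value", child))
    elif isinstance(value, list):
        for child in value:
            elem.append(to_xml("item", child))
    elif isinstance(value, bool):
        # Checked before the generic case because bool is a subclass of int.
        elem.text = "true" if value else "false"
    elif value is None:
        elem.text = ""
    else:
        elem.text = str(value)
    return elem


doc = to_xml("fieldValues", {"my author": 3, "1999": 12})
xml_text = ET.tostring(doc, encoding="unicode")
```

Because the translation is purely structural, the XML output follows the JSON structure automatically whenever the JSON response changes.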

TODO v4:

  • Make functionality more orthogonal. E.g. subcorpusSize can be included in grouped responses, but not in ungrouped ones.
  • Add a way to pass HitProperty as JSON in addition to custom encoding

DONE IN /corpora ENDPOINTS (e.g. v5):

  • Replace left/right in response with before/after
    (makes more sense for RTL languages)
  • XML: same concordance structure as in JSON
  • Handle custom information better.
    Custom information, ignored by BlackLab but useful for e.g. the frontend, like displayName, uiType, etc. is polluting the response structure. We should isolate it (e.g. in a custom section for each field, annotation, etc.), just pass it along unchecked, and include it only if requested.
    This includes the so-called "special fields" except for pidField (so author, title, date). (BlackLab uses the pidField to refer to documents)
  • Change confusing names.
    (e.g. the name stoppedRetrievingHits prompts the question "why did you stop?". limitReached might be easier to understand, especially if it's directly related to a configuration setting hitLimit)
  • Group related values.
    (e.g. numberOfHitsRetrieved / numberOfDocsRetrieved / stoppedRetrievingHits would be better as a structure "retrieved": { "hits": 100, "docs": 10, "reachedHitLimit": true } ).
  • Separate unrelated parts.
    (e.g. in DocInfo, arbitrary document metadata values such as title or author should probably be in a separate subobject, not alongside special values like lengthInTokens and mayView. Also, metadataFieldGroups shouldn't be alongside DocInfo structures.)
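
The "group related values" item can be made concrete with a small client-side adapter. This sketch assumes a parsed summary dict that contains the three flat keys named above; the function name group_retrieved is hypothetical, and the output follows the "retrieved" structure proposed in that item.

```python
def group_retrieved(summary):
    """Build the proposed grouped structure from the three flat keys."""
    return {
        "retrieved": {
            "hits": summary["numberOfHitsRetrieved"],
            "docs": summary["numberOfDocsRetrieved"],
            "reachedHitLimit": summary["stoppedRetrievingHits"],
        }
    }


old_style = {
    "numberOfHitsRetrieved": 100,
    "numberOfDocsRetrieved": 10,
    "stoppedRetrievingHits": True,
}
new_style = group_retrieved(old_style)
```

An adapter like this lets client code written against the grouped structure keep working with responses that still use the flat keys.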

DONE API v5:

  • Remove /blacklab-server/CORPUSNAME endpoints.
  • XML: When using usecontent=orig, don't make the content part of the XML anymore.
    (Escape it using CDATA (again, same as in JSON). Also consider just returning both the FI concordances and the original content (if requested), so the response structure doesn't fundamentally change because of one parameter value. Optionally have a parameter to include it as part of the XML if desired, to simplify response handling?)
  • Return HitPropertyValues as JSON instead of current custom encoding?

TODO v5:

  • Remove old custom encodings for HitProperty in favour of the JSON format?

Possible new endpoints/features:

  • If you're interested in stats like total number of results, subcorpus size, etc., it's kind of confusing to have to do /hits?number=0&waitfortotal=true; maybe have separate endpoints for this kind of application? (calculating stats vs. paging through hits)
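
For reference, the stats-only request mentioned above can be built as follows. This is a minimal sketch: the number=0 and waitfortotal=true parameters come from the item above, while the base URL, the corpus name, the /corpora/CORPUSNAME/hits path layout and the patt parameter name are assumptions made for the example. The helper stats_only_url is hypothetical.

```python
from urllib.parse import quote, urlencode


def stats_only_url(base_url, corpus, pattern):
    """Build a hits URL that asks for totals but no actual hits.

    Assumes the pattern is passed as "patt" and that hits live under
    /corpora/CORPUSNAME/hits; both are assumptions for this sketch.
    """
    params = {
        "patt": pattern,
        "number": 0,             # return no hits, only the summary
        "waitfortotal": "true",  # wait until the total count is known
    }
    path = f"/corpora/{quote(corpus, safe='')}/hits"
    return f"{base_url.rstrip('/')}{path}?{urlencode(params)}"


url = stats_only_url("http://localhost:8080/blacklab-server/", "my-corpus", '"the"')
```

A dedicated statistics endpoint would make the intent explicit instead of relying on this combination of parameters.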

This might be harder to do without breaking compatibility:

  • Try to use consistent terminology between parameters, response and configuration files.
    (e.g. use the term hitLimit everywhere for the same concept)

Maybe?

  • Support Solr's common query parameters, e.g. start, rows, fq, etc. as the preferred version.
    Support the lowerCamelCase version of query parameter names for consistency with responses and configuration options.
    Support the old query parameter names (but issue deprecation warning when first encountered?)
  • Don't send mayView for each document (until we implement such granular authorization); include it in corpus info instead. Although keeping it there doesn't hurt and prepares us for this feature.
  • Be stricter about parameter values.
    (if an illegal value is passed, return an error instead of silently using a default value)
  • Consider adding a JSON request option in addition to regular query parameters. There should be an easy-to-use test interface so there's no need to manually type URL-encoded JSON requests into the browser address bar.
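
Stricter parameter handling, combined with the "helpful error messages" guideline, could look roughly like this. It is a sketch of the general technique only: the ParameterError class and parse_choice function are hypothetical, and the legal values fi and orig for usecontent are assumed for the example. An absent parameter still gets a default, but an illegal value produces an error that lists the legal values instead of silently falling back.

```python
class ParameterError(ValueError):
    """Raised when a request parameter has an illegal value."""


def parse_choice(name, raw, legal, default):
    """Validate a parameter that must be one of a fixed set of values.

    Returns the default only when the parameter is absent (None); an
    illegal value is reported rather than replaced by the default.
    """
    if raw is None:
        return default
    if raw not in legal:
        raise ParameterError(
            f"Illegal value {raw!r} for parameter {name!r}; "
            f"legal values are: {', '.join(legal)}"
        )
    return raw


# Assumed legal values for this illustration.
USECONTENT_VALUES = ("fi", "orig")

try:
    parse_choice("usecontent", "original", USECONTENT_VALUES, "fi")
    error_message = None
except ParameterError as exc:
    error_message = str(exc)
```

A server could translate such an exception into an error response and, following the guidelines above, add a reference to the online documentation.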

Apache license 2.0