Query language

This page documents the search expression language used to query the dependency-parsed corpora in the Drevesnik online interface. It is based on the query language of the dep_search tool developed at the University of Turku. In addition to querying the morphological and dependency annotations of the Universal Dependencies scheme, it also supports searching by the language-specific JOS morphosyntactic tags (XPOS column in Slovenian CoNLL-U treebanks).

All expression examples below are links that search through the reference SSJ dependency treebank (randomized results, short sentences).

Token specification

Querying by word forms

Tokens with a particular word form are searched for by typing the token text as-is. Examples:

hodim searches for all tokens with the form hodim 'I walk', written in lowercase letters
Delo searches for all tokens with the form Delo (newspaper name, lit. 'work'), written with the first letter capitalized

The base form (lemma) is specified with the L= prefix:

L=hoditi searches for all tokens with the lemma hoditi 'to walk'

Querying by morphological features

Part-of-speech categories and other morphological features can be specified in two ways, as all corpora are annotated with both the cross-linguistically standardized Universal Dependencies (UD) annotation scheme and the language-specific JOS annotation scheme. Both schemes are well documented and describe Slovenian morphology in comparable detail, so the choice between them mostly depends on the user's preferences.

JOS morphosyntactic tags

JOS morphosyntactic tags (XPOS column in Slovenian CoNLL-U treebanks) can be specified using the X= prefix. Since each position in the tag represents a specific morphological feature with multiple possible values, two special operators are also supported: the dot operator (.), which matches any single character, and the asterisk operator (*), which matches zero or more repetitions of the preceding character. Some examples:

X=Ncfsl searches for all tokens with the JOS tag for feminine common nouns in locative singular
X=Ncf.l searches for all tokens with the JOS tag for feminine common nouns in locative and any number
X=Ncf.* searches for all tokens with the JOS tag for feminine common nouns in any case and number
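
The effect of the two operators can be tried out outside Drevesnik with Python regular expressions. This is only a sketch: it assumes that the pattern has to match the whole tag and that . and * behave as in standard regular expressions. The tag list is a made-up sample and matching_tags is a hypothetical helper, not part of the tool.

```python
import re

# Made-up sample of JOS tags (XPOS values); not taken from the corpus.
SAMPLE_TAGS = ["Ncfsl", "Ncfpl", "Ncfsn", "Ncmsl", "Rgp"]

def matching_tags(pattern, tags=SAMPLE_TAGS):
    """Return the tags fully matched by the pattern (assumed regex semantics)."""
    return [tag for tag in tags if re.fullmatch(pattern, tag)]

print(matching_tags("Ncfsl"))  # the exact tag only
print(matching_tags("Ncf.l"))  # any number, locative case
print(matching_tags("Ncf.*"))  # any case and number
```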

UD morphological features

The part-of-speech category can be specified by writing the tags as-is, while other morphological features are defined as attribute-value pairs in the form of Category=Tag.

NOUN searches for all tokens with the POS tag NOUN (common nouns)
VerbForm=Inf searches for all tokens with the infinitive verb form

Special operators

It is also possible to combine all above token specifications with the AND (&) and OR (|) operators:

L=delati|L=narediti searches for all tokens with the lemma delati 'to do' (imperfective) or narediti 'to do' (perfective)
NOUN&Number=Plur searches for all nouns in plural
L=prst&Gender=Masc searches for all tokens with the lemma prst 'finger' in masculine (as opposed to prst 'soil' in feminine)
lepo&X=R.* searches for all tokens with the word form lepo 'nice', which are marked as adverbs in JOS (and not adjectives, for example)

Word forms, lemmas and tags can also be negated by typing the negation operator ! before a feature. Some examples:

L=biti&!AUX searches for all tokens with the lemma biti 'to be', which are not marked as an auxiliary
ADJ&!X=A.* searches for all tokens annotated as an adjective in UD, but a different part-of-speech category in JOS
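
The way the three operators combine can be read as ordinary Boolean logic over the properties of a single token. The following sketch spells out three of the expressions above as Python predicates. The token records and their field names are hypothetical and serve only as an illustration; they are not the internal representation used by the tool.

```python
# Hypothetical token records; field names and values are illustrative only.
TOKENS = [
    {"form": "je", "lemma": "biti", "upos": "AUX", "feats": {"Number": "Sing"}},
    {"form": "bil", "lemma": "biti", "upos": "VERB", "feats": {"Number": "Sing"}},
    {"form": "dela", "lemma": "delo", "upos": "NOUN", "feats": {"Number": "Plur"}},
    {"form": "naredi", "lemma": "narediti", "upos": "VERB", "feats": {"Number": "Sing"}},
]

def biti_not_aux(tok):
    # L=biti&!AUX : lemma is biti AND the UD part of speech is not AUX
    return tok["lemma"] == "biti" and not tok["upos"] == "AUX"

def delati_or_narediti(tok):
    # L=delati|L=narediti : either of the two lemmas
    return tok["lemma"] == "delati" or tok["lemma"] == "narediti"

def plural_noun(tok):
    # NOUN&Number=Plur : a noun AND plural number
    return tok["upos"] == "NOUN" and tok["feats"].get("Number") == "Plur"

for name, query in [
    ("L=biti&!AUX", biti_not_aux),
    ("L=delati|L=narediti", delati_or_narediti),
    ("NOUN&Number=Plur", plural_noun),
]:
    print(name, [t["form"] for t in TOKENS if query(t)])
```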

A token can be left unspecified by typing an underscore character (_).

Dependency specification

Dependencies are expressed using the < and > operators, which mimic the "arrows" in the dependency graph.

A < B means that token A is governed by token B, e.g. rainy < morning
A > B means that token A governs token B, e.g. read > newspapers

The underscore character _ stands for any token, that is, a token on which we place no particular restrictions. Here are simple examples of basic search expressions that restrict dependency structures:

delo < _ searches for all cases of delo 'work' which are governed by some word
delo > _ searches for all cases of delo which govern a word
_ < delo searches for any token governed by delo

Note that the left-most token in the expression is always the target of the search and is the one highlighted in the search results (in green). While the queries delo > _ and _ < delo return the exact same graphs, the matched tokens differ.
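
The difference between the two queries can be made concrete on a toy tree. The sketch below is an illustration only: the three-word sentence, its analysis and the helper functions governs and governed_by are assumptions made for this example and are not part of the tool.

```python
# Toy tree for "dobro delo traja" ('good work lasts'); the analysis is illustrative.
# Each token: (id, form, head id, relation); head 0 marks the root.
TREE = [
    (1, "dobro", 2, "amod"),
    (2, "delo", 3, "nsubj"),
    (3, "traja", 0, "root"),
]

def governs(tree, form):
    """Targets of 'form > _': tokens with that form that have a dependent."""
    heads = {head for _, _, head, _ in tree}
    return [f for i, f, _, _ in tree if f == form and i in heads]

def governed_by(tree, form):
    """Targets of '_ < form': tokens whose governor has that form."""
    forms = {i: f for i, f, _, _ in tree}
    return [f for _, f, head, _ in tree if forms.get(head) == form]

print(governs(TREE, "delo"))      # same tree, target is delo itself
print(governed_by(TREE, "delo"))  # same tree, target is the dependent of delo
```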

The dependency type can be specified by typing it right after the dependency operator, e.g. _ <type _ or _ >type _. The | character denotes a logical or, so any of the given dependency relations will match.

_ <cop _ searches for all copula verbs (i.e. tokens which are governed through a cop dependency)
_ >nsubj _ searches for all words governing a nominal subject (i.e. various kinds of predicates)
_ <nsubj|<csubj _ searches for all words serving as a subject - either as a nominal or clausal subject

You can specify a number of dependency restrictions at a time by chaining the operators:

_ >obj _ >iobj _ searches for words that govern both direct and indirect objects (e.g. ditransitive predicates)
_ <amod _ >advmod _ searches for words that serve as adjectival modifiers and at the same time govern an adverbial modifier
_ >nmod _ >nmod _ searches for words that govern two distinct nominal modifiers

Priority is marked using parentheses:

_ >nmod _ >nmod _ searches for words that govern two distinct nominal modifiers (two nmod dependencies in parallel)
_ >nmod (_ >nmod _) searches for words that govern a nominal modifier which, in turn, governs another nominal modifier (a chain of two nmod dependencies)
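
The contrast between the parallel and the chained reading can be sketched on two toy trees. The tree encoding (token id mapped to a pair of head id and relation) and the helper names are hypothetical and chosen only for this illustration.

```python
# Two toy trees: token id -> (head id, relation); head 0 marks the root.
PARALLEL = {1: (0, "root"), 2: (1, "nmod"), 3: (1, "nmod")}
CHAIN = {1: (0, "root"), 2: (1, "nmod"), 3: (2, "nmod")}

def dependents(tree, head, rel):
    """Ids of the tokens attached to 'head' with relation 'rel'."""
    return [i for i, (h, r) in tree.items() if h == head and r == rel]

def parallel_nmod(tree):
    # _ >nmod _ >nmod _ : one token governs two distinct nmod dependents
    return [i for i in tree if len(dependents(tree, i, "nmod")) >= 2]

def chained_nmod(tree):
    # _ >nmod (_ >nmod _) : an nmod dependent governs an nmod of its own
    return [
        i
        for i in tree
        if any(dependents(tree, d, "nmod") for d in dependents(tree, i, "nmod"))
    ]

print(parallel_nmod(PARALLEL), parallel_nmod(CHAIN))
print(chained_nmod(PARALLEL), chained_nmod(CHAIN))
```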

Negation is marked using the negation operator !, which can be used to negate the < and > operators as well as specific dependency types. Some examples:

_ <nmod _ !>case _ searches for all nominal modifiers that do not govern a case marker (i.e. nominal modifiers that are not prepositional phrases)
_ <nmod _ >!case _ searches for all nominal modifiers that govern some word, but not a case marker
_ <nsubj _ !(>amod|>acl) _ searches for nominal subjects which do not govern adjectival or participial modifiers

Note that negating a relation (e.g. _ !>amod _) allows for the token not having any dependent, whereas negating a type (e.g. _ >!amod _) means that the token must have at least one dependent (which is not amod).
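
The difference between negating a relation and negating a type can be checked on a toy tree. The sketch follows the description in the note above; the tree encoding and the helper names are hypothetical.

```python
# Toy tree: token id -> (head id, relation); head 0 marks the root.
TREE = {1: (2, "amod"), 2: (3, "nsubj"), 3: (0, "root")}

def deps(tree, head):
    """Relations of all dependents of the given token."""
    return [rel for h, rel in tree.values() if h == head]

def not_governing_amod(tree, i):
    # _ !>amod _ : no amod dependent; a token without dependents also matches
    return "amod" not in deps(tree, i)

def governing_non_amod(tree, i):
    # _ >!amod _ : at least one dependent whose relation is not amod
    return any(rel != "amod" for rel in deps(tree, i))

for i in TREE:
    print(i, not_governing_amod(TREE, i), governing_non_amod(TREE, i))
```

Token 1 has no dependents at all, so it satisfies only the negated relation; token 3 governs an nsubj and satisfies both; token 2 governs only an amod and satisfies neither.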

The direction of a dependency relation can be specified using the operators @R and @L, which require the token following the operator to occur to the right or to the left, respectively, of the token that the restriction applies to.

VERB >nsubj@R _ searches for verbs which have an nsubj dependent to their right
_ >amod@L _ >amod@R _ searches for words that have two distinct adjectival modifiers (two amod dependencies in parallel), one must be at the left side, the other at the right side
_ <case@L _ searches for case markers where the governor token is at the left side, i.e. postpositions (as compared to prepositions)
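
The direction operators amount to comparing the linear positions of the two tokens. The sketch below illustrates this for VERB >nsubj@R _ on two toy sentences, assuming that token ids reflect word order; the token records and the helper name are hypothetical.

```python
# Toy analyses of "pride pomlad" and "pomlad pride" ('spring comes').
# Token ids follow the word order; head 0 marks the root.
VERB_FIRST = [
    {"id": 1, "form": "pride", "upos": "VERB", "head": 0, "rel": "root"},
    {"id": 2, "form": "pomlad", "upos": "NOUN", "head": 1, "rel": "nsubj"},
]
SUBJECT_FIRST = [
    {"id": 1, "form": "pomlad", "upos": "NOUN", "head": 2, "rel": "nsubj"},
    {"id": 2, "form": "pride", "upos": "VERB", "head": 0, "rel": "root"},
]

def verbs_with_subject_to_right(tokens):
    """Targets of 'VERB >nsubj@R _': verbs whose nsubj dependent follows them."""
    return [
        verb["form"]
        for verb in tokens
        if verb["upos"] == "VERB"
        and any(
            dep["head"] == verb["id"]
            and dep["rel"] == "nsubj"
            and dep["id"] > verb["id"]  # @R: the dependent is to the right
            for dep in tokens
        )
    ]

print(verbs_with_subject_to_right(VERB_FIRST))
print(verbs_with_subject_to_right(SUBJECT_FIRST))
```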

Combining queries

Several queries can be combined with the + operator. A query of the form query1 + query2 + query3 returns all trees which independently satisfy all three queries.

VERB >aux _ + Tense=Pres searches for trees with a simple and a complex verb phrase
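
Read this way, the + operator behaves like a logical AND over whole trees: each query is evaluated on its own, and the tokens they match need not be related. The sketch below illustrates this reading for the example above; the toy trees and the function names are hypothetical.

```python
# Toy trees; token ids follow the word order, head 0 marks the root.
JE_DELAL = [  # "je delal" ('was working')
    {"id": 1, "upos": "AUX", "feats": {"Tense": "Pres"}, "head": 2, "rel": "aux"},
    {"id": 2, "upos": "VERB", "feats": {}, "head": 0, "rel": "root"},
]
DELA = [  # "dela" ('works')
    {"id": 1, "upos": "VERB", "feats": {"Tense": "Pres"}, "head": 0, "rel": "root"},
]

def has_verb_with_aux(tree):
    # VERB >aux _ : some verb governs an aux dependent
    return any(
        tok["upos"] == "VERB"
        and any(d["head"] == tok["id"] and d["rel"] == "aux" for d in tree)
        for tok in tree
    )

def has_present_tense(tree):
    # Tense=Pres : some token carries the feature Tense=Pres
    return any(tok["feats"].get("Tense") == "Pres" for tok in tree)

def combined(tree):
    # VERB >aux _ + Tense=Pres : both queries must match somewhere in the tree
    return has_verb_with_aux(tree) and has_present_tense(tree)

print(combined(JE_DELAL), combined(DELA))
```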

Universal quantification

The operator '->' introduces a condition that all the matched tokens should fulfill (i.e. the tokens or structures preceding this operator). For example:

_ -> NOUN means "every token (_) must be a NOUN" and thus matches sentences with nouns only
NOUN -> NOUN >amod _ means "all nouns must govern an adjectival modifier" and thus matches sentences with modified nouns only
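
The second example corresponds to a "for all" check over the tokens of a sentence. The sketch below illustrates it on two toy trees. It additionally requires the sentence to contain at least one noun, which is an assumption of this sketch rather than documented behaviour; the token records and the function name are hypothetical.

```python
# Toy trees; head 0 marks the root.
ALL_MODIFIED = [
    {"id": 1, "upos": "ADJ", "head": 2, "rel": "amod"},
    {"id": 2, "upos": "NOUN", "head": 0, "rel": "root"},
]
ONE_BARE = [
    {"id": 1, "upos": "ADJ", "head": 2, "rel": "amod"},
    {"id": 2, "upos": "NOUN", "head": 0, "rel": "root"},
    {"id": 3, "upos": "NOUN", "head": 2, "rel": "nmod"},
]

def every_noun_has_amod(tree):
    # NOUN -> NOUN >amod _ : every noun must govern an adjectival modifier
    nouns = [tok for tok in tree if tok["upos"] == "NOUN"]
    # Assumption: a sentence without any noun is not counted as a match.
    return bool(nouns) and all(
        any(d["head"] == n["id"] and d["rel"] == "amod" for d in tree)
        for n in nouns
    )

print(every_noun_has_amod(ALL_MODIFIED), every_noun_has_amod(ONE_BARE))
```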