About the tool

CJVT Označevalnik is an online service for automatic grammatical annotation of Slovene text, which ascribes various morphological, syntactic, and semantic features to surface-level word forms. Such annotations are a great aid in any further text analysis, since they make it easy to retrieve specific language phenomena, as is often required in scientific research, data mining, or the development of complex language technologies.

The online interface is based on the CLASSLA-Stanza language processing tool, which builds its knowledge of the grammatical features of standard Slovene on a number of different language resources, such as the SUK training corpus, the Sloleks lexicon of inflected forms, CLARIN.SI word embeddings, and the Obeliks and ReLDI rule-based tokenizers. The CJVT Označevalnik service is kept up to date with the latest version of the CLASSLA-Stanza tool, producing identical results while offering more options for settings and output formats.

Using the service is straightforward: users type in or upload a text and select the types of annotation they are interested in. After they click the Annotate button, the results are shown in four different formats and can also be downloaded.

Input options

The annotation tool splits the provided text into separate paragraphs, sentences, and tokens. Each token is then ascribed the selected annotations according to the following annotation schemas:

  • Lemmas: basic forms of words following the JOS schema (e.g., miza ‘table’ for the word form mize ‘tables’)
  • Morphosyntactic tags: part-of-speech tags and other morphosyntactic features following the JOS and/or UD schemas (e.g., feminine genitive singular noun)
  • Syntactic relations: the syntactic functions based on the dependency framework in accordance with the JOS and UD schemas (e.g., subject)
  • Semantic roles: the thematic relations in accordance with the SRL schema (e.g., agent)
  • Named entities: proper names of various kinds following the JANES schema (e.g., personal proper name)

The specific annotation schema and the language of the annotation tags can be adjusted in the Advanced settings tab. The selected settings are saved and automatically applied the next time the tool is used. The advanced settings also include the option to switch to the model for processing non-standard Slovene, which is intended for annotating colloquial language such as informal social media posts.

The results

To account for the needs of different types of users, the annotation tool supports switching between four different modes of displaying results. In addition to the CoNLL-U standard, which is the default output format of the CLASSLA-Stanza tool, the results can also be viewed as a table, in the TEI XML format, and as graphical visualizations based on the Q-CAT tool, which are particularly useful in the analysis of syntactically or semantically annotated sentences.
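For readers who want to process a downloaded CoNLL-U file on their own, the following minimal sketch shows how such a file could be read with plain Python. The ten column names come from the CoNLL-U format itself; the sample sentence is hand-written for illustration and is not actual output of the tool, and the function name parse_conllu is hypothetical.

```python
# The ten tab-separated fields that the CoNLL-U format defines for each token line.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

# Hand-written sample for illustration only (assumption: not real tool output).
SAMPLE = "\n".join([
    "# text = Mize so lesene.",
    "\t".join(["1", "Mize", "miza", "NOUN", "_",
               "Case=Nom|Gender=Fem|Number=Plur", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "so", "biti", "AUX", "_",
               "Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin",
               "3", "cop", "_", "_"]),
    "\t".join(["3", "lesene", "lesen", "ADJ", "_",
               "Case=Nom|Degree=Pos|Gender=Fem|Number=Plur", "0", "root", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"]),
    "",
])


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dictionaries."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped in this sketch.
            continue
        values = line.split("\t")
        if len(values) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 fields, got {len(values)}: {line!r}")
        tokens.append(dict(zip(CONLLU_FIELDS, values)))
    if tokens:
        sentences.append(tokens)
    return sentences


if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token["form"], token["lemma"], token["upos"], token["deprel"])
```

The sketch ignores sentence-level comments and does not treat multiword tokens or empty nodes specially, so a dedicated CoNLL-U library may be a better fit for larger projects.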

In all four modes of display, the results can be saved in the form of a .conllu, .csv, .xml or .png file that can be imported into a number of other tools for further analysis.

Accuracy of the annotations

As with all natural language processing tools, the annotations produced by the CLASSLA-Stanza tool may contain errors. Performance evaluations of the current version of the tool show that the F1 score on standard written texts is approximately 99% for lemmatization and part-of-speech tagging, 98% for full morphological analysis, 91% for dependency parsing, 88% for named entity recognition, and 76% for semantic role labeling.
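As a reminder of what an F1 score expresses, the sketch below computes it as the harmonic mean of precision and recall. This is the general definition only; the exact evaluation protocol behind the figures above is not described here, and the counts in the example are hypothetical.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) for the given annotation counts."""
    predicted = true_positives + false_positives   # everything the tool annotated
    gold = true_positives + false_negatives        # everything in the reference
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / gold if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    precision, recall, f1 = f1_score(true_positives=90,
                                     false_positives=10,
                                     false_negatives=30)
    print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Because it is a harmonic mean, F1 is pulled towards the lower of the two values, so a high score requires both few spurious annotations and few missed ones.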

Impressum

Označevalnik

Online interface at orodja.cjvt.si
CJVT Tools

Ljubljana, 2024

This work is licensed under a Creative Commons licence:
Creative Commons Attribution-ShareAlike 4.0 International.

Interface development
Kaja Dobrovoljc
Leon Noe Jovan
Mihael Šinkec

CLASSLA-Stanza tool development
Nikola Ljubešić
Marko Robnik Šikonja
Luka Krsnik
Kaja Dobrovoljc
Mihael Šinkec
Simon Krek

Interface design
Gašper Uršič
(Studio Kruh)

Editorial board
Kaja Dobrovoljc
Špela Arhar Holdt
Jaka Čibej
Tomaž Erjavec
Polona Gantar
Nikola Ljubešić
Iztok Kosem
Simon Krek
Marko Robnik Šikonja

Published by
Centre for Language Resources and Technologies, University of Ljubljana

Citation
Označevalnik CJVT, orodja.cjvt.si/oznacevalnik, accessed on 21. 11. 2024.

Versions

Version
Označevalnik CJVT 2.1

Date of last update of the tool: 8. 8. 2023
Date of last update of the interface: 11. 3. 2024


Version
Označevalnik CJVT 1.2.0

Date of last update of the tool: 29. 6. 2022
Date of last update of the interface: 12. 7. 2022