Grammatical annotation pipeline for Slovene texts
CJVT Označevalnik is an online service for automatic grammatical annotation of Slovene text, which is based on the CLASSLA-Stanza annotation tool.
The current version of the CLASSLA-Stanza tool is 2.1.
Date of last update: 8. 8. 2023
The source code for the CLASSLA-Stanza tool can be accessed via the CLARIN.SI repository under the following license: Creative Commons Attribution-ShareAlike 4.0 International.
CJVT Označevalnik is an online service for automatic grammatical annotation of Slovene text, which ascribes various morphological, syntactic, and semantic features to surface-level word forms. Such annotations are a great aid in further text analysis, since they make it easy to retrieve specific language phenomena, as is often required in scientific research, data mining, or the development of complex language technologies.
The online interface is based on the CLASSLA-Stanza language processing tool, which builds its knowledge of the grammatical features of standard Slovene on a number of different language resources, such as the SUK training corpus, the Sloleks lexicon of inflected forms, CLARIN.SI word embeddings, and the Obeliks and ReLDI rule-based tokenizers. The CJVT Označevalnik service is up-to-date with the latest version of the CLASSLA-Stanza tool, producing identical results while offering more options for settings and output formats.
Using the pipeline is relatively straightforward—users type in or upload a text and select the types of annotation they are interested in. After clicking the Annotate button, the results are shown in four different formats and can also be downloaded.
The annotation tool splits the provided text into separate paragraphs, sentences, and tokens. Each token is then ascribed the selected annotations according to the following annotation schemas:
The specific annotation schema and the language of the annotation tags can be adjusted in the Advanced settings tab. The selected settings are saved and automatically applied the next time the tool is used. The advanced settings also include the option to switch to the model for processing non-standard Slovene, which is intended for annotating colloquial language such as informal social media posts.
To account for the needs of different types of users, the annotation tool supports switching between four modes of displaying results. In addition to the CoNLL-U standard, which is the default output format of the CLASSLA-Stanza tool, the results can be viewed as a table, in the TEI XML format, and as graphical visualizations based on the Q-CAT tool. The visualizations are particularly useful in the analysis of syntactically or semantically annotated sentences.
In all four display modes, the results can be saved as a .conllu, .csv, .xml, or .png file, which can then be imported into a number of other tools for further analysis.
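For users who want to process a downloaded .conllu file programmatically, the following minimal Python sketch shows how such a file could be read. It relies only on the general CoNLL-U layout (ten tab-separated columns per token line, comment lines starting with #, and a blank line between sentences). The function name parse_conllu and the sample sentence are illustrative assumptions: the sample was written by hand for this example and is not actual output of the tool.

```python
# Column names defined by the CoNLL-U format (ten tab-separated fields per token line).
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts.

    Hypothetical helper for illustration; it is not part of CLASSLA-Stanza.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Comment lines (e.g. "# text = ...") are skipped in this sketch.
            continue
        else:
            values = line.split("\t")
            if len(values) != len(CONLLU_FIELDS):
                raise ValueError(f"expected 10 columns, got {len(values)}: {line!r}")
            current.append(dict(zip(CONLLU_FIELDS, values)))
    if current:
        # The file may end without a trailing blank line.
        sentences.append(current)
    return sentences


# Hand-written sample ("Pes spi." = "The dog sleeps."), NOT real tool output.
SAMPLE = "\n".join([
    "# text = Pes spi.",
    "\t".join(["1", "Pes", "pes", "NOUN", "Ncmsn",
               "Case=Nom|Gender=Masc|Number=Sing", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "spi", "spati", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "Z", "_", "2", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for token in sentence:
            print(token["form"], token["lemma"], token["upos"], token["deprel"])
```

The sketch keeps every token line as a dictionary of strings and does not treat multiword-token ranges or empty nodes specially; a dedicated CoNLL-U library may be a better choice for larger analyses.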
As with all natural language processing tools, the annotations produced by the CLASSLA-Stanza tool may contain errors. Performance evaluations of the current version of the tool show that the F1 score on standard written texts amounts to approximately 99% for lemmatization and part-of-speech tagging, 98% for full morphological analysis, 91% for dependency parsing, 88% for named entity recognition, and 76% for semantic role labeling.
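For readers unfamiliar with the F1 measure, the sketch below shows the standard definition as the harmonic mean of precision and recall. The function name f1_score and the counts in the example are hypothetical; the exact evaluation protocol used for the CLASSLA-Stanza figures above is not described here, so this illustrates the metric in general rather than reproducing those evaluations.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Standard F1: harmonic mean of precision and recall.

    Hypothetical helper for illustration only.
    """
    predicted = true_positives + false_positives   # everything the tool labeled
    expected = true_positives + false_negatives    # everything in the gold standard
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / expected if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up counts: 88 correct labels, 12 spurious ones, 12 missed ones.
    print(f"{f1_score(88, 12, 12):.2f}")
```

With these made-up counts, precision and recall are both 0.88, so the F1 score is 0.88 as well.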
Online interface at orodja.cjvt.si
CJVT Tools
Ljubljana, 2024
This work is licensed under a Creative Commons licence:
Creative Commons Attribution-ShareAlike 4.0 International.
Interface development
Kaja Dobrovoljc
Leon Noe Jovan
Mihael Šinkec
CLASSLA-Stanza tool development
Nikola Ljubešić
Marko Robnik Šikonja
Luka Krsnik
Kaja Dobrovoljc
Mihael Šinkec
Simon Krek
Interface design
Gašper Uršič
(Studio Kruh)
Editorial board
Kaja Dobrovoljc
Špela Arhar Holdt
Jaka Čibej
Tomaž Erjavec
Polona Gantar
Nikola Ljubešić
Iztok Kosem
Simon Krek
Marko Robnik Šikonja
Published by
Centre for Language Resources and Technologies, University of Ljubljana
Citation
Označevalnik CJVT, orodja.cjvt.si/oznacevalnik, accessed on 21. 11. 2024.
Version
Označevalnik CJVT 2.1
Date of last update of the tool: 8. 8. 2023
Date of last update of the interface: 11. 3. 2024
Version
Označevalnik CJVT 1.2.0
Date of last update of the tool: 29. 6. 2022
Date of last update of the interface: 12. 7. 2022