Designing and evaluating a Russian tagset

Serge Sharoff, Mikhail Kopotev, Tomaž Erjavec, Anna Feldman, Dagmar Divjak

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

27 Scopus citations

Abstract

This paper reports the principles behind designing a tagset to cover Russian morphosyntactic phenomena, modifications of the core tagset, and its evaluation. The tagset and associated morphosyntactic specifications are based on the MULTEXT-East framework, while the design decisions aimed to balance the parameters important for linguists against the feasibility of detecting and disambiguating them automatically. The final tagset contains about 600 tags and achieves about 95% accuracy on the disambiguated portion of the Russian National Corpus. We have also produced a test set of tagging models and corpora that can be shared with other researchers.

Original language: English
Title of host publication: Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008
Publisher: European Language Resources Association (ELRA)
Pages: 279-285
Number of pages: 7
ISBN (Electronic): 2951740840, 9782951740846
State: Published - 1 Jan 2008
Event: 6th International Conference on Language Resources and Evaluation, LREC 2008 - Marrakech, Morocco
Duration: 28 May 2008 - 30 May 2008

Publication series

Name: Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008

Other

Other: 6th International Conference on Language Resources and Evaluation, LREC 2008
Country: Morocco
City: Marrakech
Period: 28/05/08 - 30/05/08
