Designing and evaluating a Russian tagset

Serge Sharoff, Mikhail Kopotev, Tomaž Erjavec, Anna Feldman, Dagmar Divjak

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Research, peer-reviewed

25 Citations (Scopus)

Abstract

This paper reports the principles behind designing a tagset to cover Russian morphosyntactic phenomena, modifications of the core tagset, and its evaluation. The tagset and associated morphosyntactic specifications are based on the MULTEXT-East framework, while the decisions in designing it were aimed at achieving a balance between parameters important for linguists and the possibility to detect and disambiguate them automatically. The final tagset contains about 600 tags and achieves about 95% accuracy on the disambiguated portion of the Russian National Corpus. We have also produced a test set of tagging models and corpora that can be shared with other researchers.

Original language: English
Title of host publication: Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008
Publisher: European Language Resources Association (ELRA)
Pages: 279-285
Number of pages: 7
ISBN (Electronic): 2951740840, 9782951740846
State: Published - 1 Jan 2008
Event: 6th International Conference on Language Resources and Evaluation, LREC 2008 - Marrakech, Morocco
Duration: 28 May 2008 to 30 May 2008

Cite this

Sharoff, S., Kopotev, M., Erjavec, T., Feldman, A., & Divjak, D. (2008). Designing and evaluating a Russian tagset. In Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008 (pp. 279-285). European Language Resources Association (ELRA).
@inproceedings{93bc24ce6eed4d418d79a3c92e1c690b,
  title = "Designing and evaluating a Russian tagset",
  abstract = "This paper reports the principles behind designing a tagset to cover Russian morphosyntactic phenomena, modifications of the core tagset, and its evaluation. The tagset and associated morphosyntactic specifications are based on the MULTEXT-East framework, while the decisions in designing it were aimed at achieving a balance between parameters important for linguists and the possibility to detect and disambiguate them automatically. The final tagset contains about 600 tags and achieves about 95{\%} accuracy on the disambiguated portion of the Russian National Corpus. We have also produced a test set of tagging models and corpora that can be shared with other researchers.",
  author = "Serge Sharoff and Mikhail Kopotev and Tomaž Erjavec and Anna Feldman and Dagmar Divjak",
  year = "2008",
  month = "1",
  day = "1",
  language = "English",
  pages = "279--285",
  booktitle = "Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008",
  publisher = "European Language Resources Association (ELRA)",
}

TY  - GEN
T1  - Designing and evaluating a Russian tagset
AU  - Sharoff, Serge
AU  - Kopotev, Mikhail
AU  - Erjavec, Tomaž
AU  - Feldman, Anna
AU  - Divjak, Dagmar
PY  - 2008/1/1
Y1  - 2008/1/1
AB  - This paper reports the principles behind designing a tagset to cover Russian morphosyntactic phenomena, modifications of the core tagset, and its evaluation. The tagset and associated morphosyntactic specifications are based on the MULTEXT-East framework, while the decisions in designing it were aimed at achieving a balance between parameters important for linguists and the possibility to detect and disambiguate them automatically. The final tagset contains about 600 tags and achieves about 95% accuracy on the disambiguated portion of the Russian National Corpus. We have also produced a test set of tagging models and corpora that can be shared with other researchers.
UR  - http://www.scopus.com/inward/record.url?scp=85021700467&partnerID=8YFLogxK
M3  - Conference contribution
SP  - 279
EP  - 285
BT  - Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008
PB  - European Language Resources Association (ELRA)
ER  -