TY - GEN
T1 - A Low-budget Tagger for Old Czech
AU - Hana, Jirka
AU - Feldman, Anna
AU - Aharodnik, Katsiaryna
N1 - Funding Information:
This research was generously supported by the Grant Agency Czech Republic (project ID: P406/10/P328) and by the U.S. NSF grants #0916280, #1033275, and #1048406. We would like to thank Alena M. Černá and Boris Lehečka for annotating the testing corpus and for answering questions about Old Czech. We also thank Institute of Czech Language of the Academy of Sciences of the Czech Republic for the plain text corpus of Old Czech. Finally, we thank anonymous reviewers for their insightful comments. All mistakes are ours.
Publisher Copyright:
© 2011 Proceedings of the Annual Meeting of the Association for Computational Linguistics. All rights reserved.
PY - 2011
Y1 - 2011
N2 - The paper describes a tagger for Old Czech (1200-1500 AD), a fusional language with rich morphology. The practical restrictions (no native speakers, limited corpora and lexicons, limited funding) make Old Czech an ideal candidate for a resource-light crosslingual method that we have been developing (e.g. Hana et al., 2004; Feldman and Hana, 2010). We use a traditional supervised tagger. However, instead of spending years of effort to create a large annotated corpus of Old Czech, we approximate it by a corpus of Modern Czech. We perform a series of simple transformations to make a modern text look more like a text in Old Czech and vice versa. We also use a resource-light morphological analyzer to provide candidate tags. The results are worse than the results of traditional taggers, but the amount of language-specific work needed is minimal.
AB - The paper describes a tagger for Old Czech (1200-1500 AD), a fusional language with rich morphology. The practical restrictions (no native speakers, limited corpora and lexicons, limited funding) make Old Czech an ideal candidate for a resource-light crosslingual method that we have been developing (e.g. Hana et al., 2004; Feldman and Hana, 2010). We use a traditional supervised tagger. However, instead of spending years of effort to create a large annotated corpus of Old Czech, we approximate it by a corpus of Modern Czech. We perform a series of simple transformations to make a modern text look more like a text in Old Czech and vice versa. We also use a resource-light morphological analyzer to provide candidate tags. The results are worse than the results of traditional taggers, but the amount of language-specific work needed is minimal.
UR - http://www.scopus.com/inward/record.url?scp=84867284251&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84867284251
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 10
EP - 18
BT - Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, LaTeCH 2011 at the 49th Annual Meeting of the Association for Computational Linguistics
A2 - Zervanou, Kalliopi
A2 - Lendvai, Piroska
PB - Association for Computational Linguistics (ACL)
T2 - 5th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, LaTeCH 2011 at the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011
Y2 - 24 June 2011
ER -