A cross-language approach to rapid creation of new morpho-syntactically annotated resources

Anna Feldman, Jirka Hana, Chris Brew

Research output: Contribution to conference › Paper › peer-review

20 Scopus citations

Abstract

We take a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons. The overall research question is how to exploit language resources and properties to facilitate and automate the creation of morphologically annotated corpora for new languages. This portability issue is especially relevant to minority languages, for which such resources are likely to remain unavailable in the foreseeable future. We compare the performance of our system on languages that belong to different language families (Romance vs. Slavic), as well as different language pairs within the same language family (Portuguese via Spanish vs. Catalan via Spanish). We show that across language families, the most difficult category is the category of nominals (noun homonymy is challenging for morphological analysis, and the order variation of adjectives within a sentence makes it challenging to create a reliable model), whereas different language families present different challenges with respect to their morpho-syntactic descriptions: for the Slavic languages, case is the most challenging category; for the Romance languages, gender is more challenging than case. In addition, we present an alternative evaluation metric for our system, where we measure how much human labor will be needed to convert the result of our tagging to a high-precision annotated resource.

Original language: English
Pages: 549-554
Number of pages: 6
State: Published - 2006
Event: 5th International Conference on Language Resources and Evaluation, LREC 2006 - Genoa, Italy
Duration: 22 May 2006 to 28 May 2006

Other

Other: 5th International Conference on Language Resources and Evaluation, LREC 2006
Country/Territory: Italy
City: Genoa
Period: 22/05/06 to 28/05/06
