A cross-language approach to rapid creation of new morpho-syntactically annotated resources

Anna Feldman, Jirka Hana, Chris Brew

Research output: Contribution to conference (Paper)

11 Citations (Scopus)

Abstract

We take a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons. The overall research question is how to exploit language resources and properties to facilitate and automate the creation of morphologically annotated corpora for new languages. This portability issue is especially relevant to minority languages, for which such resources are likely to remain unavailable in the foreseeable future. We compare the performance of our system on languages that belong to different language families (Romance vs. Slavic), as well as on different language pairs within the same language family (Portuguese via Spanish vs. Catalan via Spanish). We show that, across language families, the most difficult category is nominals: noun homonymy is challenging for morphological analysis, and the variable order of adjectives within a sentence makes it difficult to build a reliable model. Different language families, however, present different challenges with respect to their morpho-syntactic descriptions: for the Slavic languages, case is the most challenging category; for the Romance languages, gender is more challenging than case. In addition, we present an alternative evaluation metric for our system, which measures how much human labor is needed to convert the output of our tagging into a high-precision annotated resource.

Original language: English
Pages: 549-554
Number of pages: 6
State: Published - 1 Jan 2006
Event: 5th International Conference on Language Resources and Evaluation, LREC 2006 - Genoa, Italy
Duration: 22 May 2006 to 28 May 2006

Cite this

Feldman, A., Hana, J., & Brew, C. (2006). A cross-language approach to rapid creation of new morpho-syntactically annotated resources. 549-554. Paper presented at 5th International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy.