TY - GEN
T1 - Like finding a needle in a haystack
T2 - 7th International Conference on Language Resources and Evaluation, LREC 2010
AU - Street, Laura
AU - Michalov, Nathan
AU - Silverstein, Rachel
AU - Reynolds, Michael
AU - Ruela, Lurdes
AU - Flowers, Felicia
AU - Talucci, Angela
AU - Pereira, Priscilla
AU - Morgon, Gabriella
AU - Siegel, Samantha
AU - Barousse, Marci
AU - Anderson, Antequa
AU - Carroll, Tashom
AU - Feldman, Anna
PY - 2010
Y1 - 2010
N2 - This paper presents the details of a pilot study in which we tagged portions of the American National Corpus (ANC) for idioms composed of verb-noun constructions, prepositional phrases, and subordinate clauses. The three data sets we analyzed included 1,500-sentence samples from the spoken, the non-fiction, and the fiction portions of the ANC. This paper provides the details of the tagset we developed, the motivation behind our choices, and the inter-annotator agreement measures we deemed appropriate for this task. In tagging the ANC for idiomatic expressions, our annotators achieved a high level of agreement (>.80) on the tags but a low level of agreement (<.00) on what constituted an idiom. These findings support the claim that identifying idiomatic and metaphorical expressions is a highly difficult and subjective task. In total, 135 idiom types and 154 idiom tokens were identified. Based on the total tokens found for each idiom class, we suggest that future research on idiom detection and idiom annotation include prepositional phrases, as this class of idioms occurred frequently in the non-fiction and spoken samples of our corpus.
AB - This paper presents the details of a pilot study in which we tagged portions of the American National Corpus (ANC) for idioms composed of verb-noun constructions, prepositional phrases, and subordinate clauses. The three data sets we analyzed included 1,500-sentence samples from the spoken, the non-fiction, and the fiction portions of the ANC. This paper provides the details of the tagset we developed, the motivation behind our choices, and the inter-annotator agreement measures we deemed appropriate for this task. In tagging the ANC for idiomatic expressions, our annotators achieved a high level of agreement (>.80) on the tags but a low level of agreement (<.00) on what constituted an idiom. These findings support the claim that identifying idiomatic and metaphorical expressions is a highly difficult and subjective task. In total, 135 idiom types and 154 idiom tokens were identified. Based on the total tokens found for each idiom class, we suggest that future research on idiom detection and idiom annotation include prepositional phrases, as this class of idioms occurred frequently in the non-fiction and spoken samples of our corpus.
UR - http://www.scopus.com/inward/record.url?scp=84949915996&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84949915996
T3 - Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010
SP - 647
EP - 653
BT - Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010
A2 - Tapias, Daniel
A2 - Russo, Irene
A2 - Hamon, Olivier
A2 - Piperidis, Stelios
A2 - Calzolari, Nicoletta
A2 - Choukri, Khalid
A2 - Mariani, Joseph
A2 - Mazo, Helene
A2 - Maegaard, Bente
A2 - Odijk, Jan
A2 - Rosner, Mike
PB - European Language Resources Association (ELRA)
Y2 - 17 May 2010 through 23 May 2010
ER -