Like finding a needle in a haystack: Annotating the American national corpus for idiomatic expressions

Laura Street, Nathan Michalov, Rachel Silverstein, Michael Reynolds, Lurdes Ruela, Felicia Flowers, Angela Talucci, Priscilla Pereira, Gabriella Morgon, Samantha Siegel, Marci Barousse, Antequa Anderson, Tashom Carroll, Anna Feldman

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

This paper presents the details of a pilot study in which we tagged portions of the American National Corpus (ANC) for idioms composed of verb-noun constructions, prepositional phrases, and subordinate clauses. The three data sets we analyzed included 1,500-sentence samples from the spoken, the nonfiction, and the fiction portions of the ANC. This paper provides the details of the tagset we developed, the motivation behind our choices, and the inter-annotator agreement measures we deemed appropriate for this task. In tagging the ANC for idiomatic expressions, our annotators achieved a high level of agreement (> .80) on the tags but a low level of agreement (< .00) on what constituted an idiom. These findings support the claim that identifying idiomatic and metaphorical expressions is a highly difficult and subjective task. In total, 135 idiom types and 154 idiom tokens were identified. Based on the total tokens found for each idiom class, we suggest that future research on idiom detection and idiom annotation include prepositional phrases, as this class of idioms occurred frequently in the nonfiction and spoken samples of our corpus.
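
The abstract does not name the agreement coefficient that was used. As a minimal sketch, assuming a chance-corrected measure such as Cohen's kappa for two annotators, the function below shows how such a score is computed and why it can drop to zero or below when annotators disagree more often than chance would predict. The function name, the toy label sequences, and the tag names VN, PP, and SC are hypothetical and are not taken from the paper's tagset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items that both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: each annotator marks one sentence as idiomatic,
# but they never pick the same sentence.
is_idiom_a = ["idiom", "literal", "literal", "literal"]
is_idiom_b = ["literal", "idiom", "literal", "literal"]
print(cohen_kappa(is_idiom_a, is_idiom_b))  # negative: worse than chance

# Hypothetical tag labels on items both annotators already accept as idioms.
tags_a = ["VN", "PP", "SC", "PP"]
tags_b = ["VN", "PP", "SC", "PP"]
print(cohen_kappa(tags_a, tags_b))  # identical labels give the maximum score
```

In the first toy case the raw agreement is 50 percent, yet the coefficient is negative because two annotators who each label most sentences as literal would agree even more often by chance. This is one way a low score on what counts as an idiom can coexist with a high score on the tags themselves.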

Original language: English
Title of host publication: Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010
Editors: Daniel Tapias, Irene Russo, Olivier Hamon, Stelios Piperidis, Nicoletta Calzolari, Khalid Choukri, Joseph Mariani, Helene Mazo, Bente Maegaard, Jan Odijk, Mike Rosner
Publisher: European Language Resources Association (ELRA)
Pages: 647-653
Number of pages: 7
ISBN (Electronic): 2951740867, 9782951740860
State: Published - 1 Jan 2010
Event: 7th International Conference on Language Resources and Evaluation, LREC 2010 - Valletta, Malta
Duration: 17 May 2010 to 23 May 2010

Other

Other: 7th International Conference on Language Resources and Evaluation, LREC 2010
Country: Malta
City: Valletta
Period: 17/05/10 to 23/05/10

Fingerprint

Idiomatic Expressions
Idioms
Non-fiction
Prepositional Phrase
Tagging
Idiomatics
Annotation
Subordinate Clause
Nouns
Verbs
Fiction

Cite this

Street, L., Michalov, N., Silverstein, R., Reynolds, M., Ruela, L., Flowers, F., ... Feldman, A. (2010). Like finding a needle in a haystack: Annotating the American national corpus for idiomatic expressions. In D. Tapias, I. Russo, O. Hamon, S. Piperidis, N. Calzolari, K. Choukri, J. Mariani, H. Mazo, B. Maegaard, J. Odijk, ... M. Rosner (Eds.), Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010 (pp. 647-653). European Language Resources Association (ELRA).
@inproceedings{2620429d82a64f2b94a3013a5dee869f,
title = "Like finding a needle in a haystack: Annotating the American national corpus for idiomatic expressions",
abstract = "This paper presents the details of a pilot study in which we tagged portions of the American National Corpus (ANC) for idioms composed of verb-noun constructions, prepositional phrases, and subordinate clauses. The three data sets we analyzed included 1,500-sentence samples from the spoken, the nonfiction, and the fiction portions of the ANC. This paper provides the details of the tagset we developed, the motivation behind our choices, and the inter-annotator agreement measures we deemed appropriate for this task. In tagging the ANC for idiomatic expressions, our annotators achieved a high level of agreement (> .80) on the tags but a low level of agreement (< .00) on what constituted an idiom. These findings support the claim that identifying idiomatic and metaphorical expressions is a highly difficult and subjective task. In total, 135 idiom types and 154 idiom tokens were identified. Based on the total tokens found for each idiom class, we suggest that future research on idiom detection and idiom annotation include prepositional phrases, as this class of idioms occurred frequently in the nonfiction and spoken samples of our corpus.",
author = "Laura Street and Nathan Michalov and Rachel Silverstein and Michael Reynolds and Lurdes Ruela and Felicia Flowers and Angela Talucci and Priscilla Pereira and Gabriella Morgon and Samantha Siegel and Marci Barousse and Antequa Anderson and Tashom Carroll and Anna Feldman",
year = "2010",
month = "1",
day = "1",
language = "English",
pages = "647--653",
editor = "Daniel Tapias and Irene Russo and Olivier Hamon and Stelios Piperidis and Nicoletta Calzolari and Khalid Choukri and Joseph Mariani and Helene Mazo and Bente Maegaard and Jan Odijk and Mike Rosner",
booktitle = "Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010",
publisher = "European Language Resources Association (ELRA)",

}

TY - GEN

T1 - Like finding a needle in a haystack

T2 - Annotating the American national corpus for idiomatic expressions

AU - Street, Laura

AU - Michalov, Nathan

AU - Silverstein, Rachel

AU - Reynolds, Michael

AU - Ruela, Lurdes

AU - Flowers, Felicia

AU - Talucci, Angela

AU - Pereira, Priscilla

AU - Morgon, Gabriella

AU - Siegel, Samantha

AU - Barousse, Marci

AU - Anderson, Antequa

AU - Carroll, Tashom

AU - Feldman, Anna

PY - 2010/1/1

Y1 - 2010/1/1

N2 - This paper presents the details of a pilot study in which we tagged portions of the American National Corpus (ANC) for idioms composed of verb-noun constructions, prepositional phrases, and subordinate clauses. The three data sets we analyzed included 1,500-sentence samples from the spoken, the nonfiction, and the fiction portions of the ANC. This paper provides the details of the tagset we developed, the motivation behind our choices, and the inter-annotator agreement measures we deemed appropriate for this task. In tagging the ANC for idiomatic expressions, our annotators achieved a high level of agreement (> .80) on the tags but a low level of agreement (< .00) on what constituted an idiom. These findings support the claim that identifying idiomatic and metaphorical expressions is a highly difficult and subjective task. In total, 135 idiom types and 154 idiom tokens were identified. Based on the total tokens found for each idiom class, we suggest that future research on idiom detection and idiom annotation include prepositional phrases, as this class of idioms occurred frequently in the nonfiction and spoken samples of our corpus.

UR - http://www.scopus.com/inward/record.url?scp=84949915996&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84949915996

SP - 647

EP - 653

BT - Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010

A2 - Tapias, Daniel

A2 - Russo, Irene

A2 - Hamon, Olivier

A2 - Piperidis, Stelios

A2 - Calzolari, Nicoletta

A2 - Choukri, Khalid

A2 - Mariani, Joseph

A2 - Mazo, Helene

A2 - Maegaard, Bente

A2 - Odijk, Jan

A2 - Rosner, Mike

PB - European Language Resources Association (ELRA)

ER -
