Evaluating and automating the annotation of a learner corpus

Alexandr Rosen, Jirka Hana, Barbora Štindlová, Anna Feldman

Research output: Contribution to journal › Article › peer-review

10 Citations (Scopus)

Abstract

The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked tiers, designed to handle a wide range of error types present in the input. Each tier corrects different types of errors; links between the tiers allow capturing errors in word order and complex discontinuous expressions. Errors are not only corrected, but also classified. The annotation scheme is tested on a data set including approx. 175,000 words with fair inter-annotator agreement results. We also explore the possibility of applying automated linguistic annotation tools (taggers, spell checkers and grammar checkers) to the learner text to support or even substitute for manual annotation.

Original language: English
Pages (from-to): 65-92
Number of pages: 28
Journal: Language Resources and Evaluation
Volume: 48
Issue number: 1
DOI: 10.1007/s10579-013-9226-3
State: Published - 1 Jan 2014

Keywords

  • Czech
  • Error annotation
  • Learner corpus
  • Second language acquisition

Cite this

Rosen, Alexandr; Hana, Jirka; Štindlová, Barbora; Feldman, Anna. Evaluating and automating the annotation of a learner corpus. In: Language Resources and Evaluation. 2014; Vol. 48, No. 1, pp. 65-92.
@article{7651cd722bd14e7a9ab10a6d90507a72,
  title     = "Evaluating and automating the annotation of a learner corpus",
  keywords  = "Czech, Error annotation, Learner corpus, Second language acquisition",
  author    = "Alexandr Rosen and Jirka Hana and Barbora Štindlov{\'a} and Anna Feldman",
  year      = "2014",
  month     = "1",
  day       = "1",
  doi       = "10.1007/s10579-013-9226-3",
  language  = "English",
  volume    = "48",
  pages     = "65--92",
  journal   = "Language Resources and Evaluation",
  issn      = "1574-020X",
  publisher = "Springer Netherlands",
  number    = "1",
}
