TY - GEN
T1 - Automatic Identification of Learners’ Language Background based on their Writing in Czech
AU - Aharodnik, Katsiaryna
AU - Chang, Marco
AU - Feldman, Anna
AU - Hana, Jirka
N1 - Funding Information:
We would like to thank the native speakers of Czech for their participation in our experiment and Jan Štěpánek for tailoring his questionnaire system to our needs. We would also like to thank Jing Peng and the anonymous reviewers for their comments. This material is based in part upon work supported by the Grant Agency of the Czech Republic P406/10/P328 and National Science Foundation under Grant Numbers 0916280 and 1048406. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Publisher Copyright:
© IJCNLP 2013. All rights reserved.
PY - 2013
Y1 - 2013
AB - The goal of this study is to investigate whether learners’ written data in highly inflectional Czech can suggest a consistent set of clues for automatic identification of the learners’ L1 background. For our experiments, we use texts written by learners of Czech, which have been automatically and manually annotated for errors. We define two classes of learners: speakers of Indo-European languages and speakers of non-Indo-European languages. We use an SVM classifier to perform the binary classification. We show that non-content based features perform well on highly inflectional data. In particular, features reflecting errors in orthography are the most useful, yielding about 89% precision and the same recall. A detailed discussion of the best performing features is provided.
UR - http://www.scopus.com/inward/record.url?scp=85014618177&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85014618177
T3 - 6th International Joint Conference on Natural Language Processing, IJCNLP 2013 - Proceedings of the Main Conference
SP - 1428
EP - 1436
BT - 6th International Joint Conference on Natural Language Processing, IJCNLP 2013 - Proceedings of the Main Conference
A2 - Mitkov, Ruslan
A2 - Park, Jong C.
PB - Asian Federation of Natural Language Processing
T2 - 6th International Joint Conference on Natural Language Processing, IJCNLP 2013
Y2 - 14 October 2013
ER -