TY - JOUR
T1 - Towards better structured and less noisy Web data
T2 - 8th Workshop on Noisy User-Generated Text, W-NUT 2022
AU - Laippala, Veronika
AU - Salmela, Anna
AU - Rönnqvist, Samuel
AU - Aji, Alham Fikri
AU - Chang, Li Hsin
AU - Dhifallah, Asma
AU - Goulart, Larissa
AU - Kortelainen, Henna
AU - Pàmies, Marc
AU - Dutra, Deise Prina
AU - Skantsi, Valtteri
AU - Sutawika, Lintang
AU - Pyysalo, Sampo
N1 - Publisher Copyright:
© 2022 COLING. All Rights Reserved.
PY - 2022
Y1 - 2022
N2 - Web-crawled datasets are known to be noisy, as they feature a wide range of language use covering both user-generated and professionally edited content, as well as noise originating from the crawling process. This article presents one solution to reduce this noise by using automatic register (genre) identification, i.e., whether the texts are, e.g., forum discussions, lyrical texts, or how-to pages. We apply the multilingual register identification model by Rönnqvist et al. (2021) and label the widely used OSCAR dataset. Additionally, we evaluate the model on eight new languages, showing that its performance is comparable to previous findings on a restricted set of languages. Finally, we present and apply a machine learning method for further cleaning web-crawled text files of boilerplate remnants and other elements not belonging to the main text of the Web page. The register-labeled and cleaned dataset covers 351 million documents in 14 languages and is available at https://huggingface.co/datasets/TurkuNLP/register_oscar.
AB - Web-crawled datasets are known to be noisy, as they feature a wide range of language use covering both user-generated and professionally edited content, as well as noise originating from the crawling process. This article presents one solution to reduce this noise by using automatic register (genre) identification, i.e., whether the texts are, e.g., forum discussions, lyrical texts, or how-to pages. We apply the multilingual register identification model by Rönnqvist et al. (2021) and label the widely used OSCAR dataset. Additionally, we evaluate the model on eight new languages, showing that its performance is comparable to previous findings on a restricted set of languages. Finally, we present and apply a machine learning method for further cleaning web-crawled text files of boilerplate remnants and other elements not belonging to the main text of the Web page. The register-labeled and cleaned dataset covers 351 million documents in 14 languages and is available at https://huggingface.co/datasets/TurkuNLP/register_oscar.
UR - http://www.scopus.com/inward/record.url?scp=105007001395&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:105007001395
SN - 2951-2093
VL - 29
SP - 215
EP - 221
JO - Proceedings - International Conference on Computational Linguistics, COLING
JF - Proceedings - International Conference on Computational Linguistics, COLING
IS - 4
Y2 - 12 October 2022 through 17 October 2022
ER -