TY - JOUR
T1 - Hate and Offensive Content Identification in Indo-Aryan Languages using Transformer-based Models
AU - Ojo, Olumide Ebenezer
AU - Adebanji, Olaronke Oluwayemisi
AU - Calvo, Hiram
AU - Gelbukh, Alexander
AU - Feldman, Anna
AU - Sidorov, Grigori
N1 - Publisher Copyright:
© 2023 Copyright for this paper by its authors.
PY - 2023
Y1 - 2023
N2 - The open exchange of hate speech, insults, derogatory remarks, and obscenities on social media platforms can undermine objective discourse and facilitate radicalization by spreading propaganda and exposing people to danger. People who have been targeted by this offensive and hateful content often experience physiological effects as a result. In this work, we present our models for detecting hate speech and offensive content in two Indo-Aryan languages, submitted to HASOC 2023. Although Gujarati and Sinhala are considered low-resource languages, our models demonstrated commendable accuracy in detecting hate speech after fine-tuning on language-specific hate speech datasets. Our experiments employed and fine-tuned two transformer models, namely DistilBERT and mBERT, and we show that these transformer models were effective in detecting hate speech in Indo-Aryan texts. mBERT achieved a macro F1-score of 0.6 on the Sinhala text classification and excelled further with a score of 0.8 on the Gujarati text classification.
AB - The open exchange of hate speech, insults, derogatory remarks, and obscenities on social media platforms can undermine objective discourse and facilitate radicalization by spreading propaganda and exposing people to danger. People who have been targeted by this offensive and hateful content often experience physiological effects as a result. In this work, we present our models for detecting hate speech and offensive content in two Indo-Aryan languages, submitted to HASOC 2023. Although Gujarati and Sinhala are considered low-resource languages, our models demonstrated commendable accuracy in detecting hate speech after fine-tuning on language-specific hate speech datasets. Our experiments employed and fine-tuned two transformer models, namely DistilBERT and mBERT, and we show that these transformer models were effective in detecting hate speech in Indo-Aryan texts. mBERT achieved a macro F1-score of 0.6 on the Sinhala text classification and excelled further with a score of 0.8 on the Gujarati text classification.
KW - Gujarati
KW - Hate Speech
KW - Offensive Content
KW - Sinhala
KW - Transformers
UR - http://www.scopus.com/inward/record.url?scp=85193945531&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85193945531
SN - 1613-0073
VL - 3681
SP - 383
EP - 392
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
T2 - 15th Forum for Information Retrieval Evaluation, FIRE 2023
Y2 - 15 December 2023 through 18 December 2023
ER -