Hate and Offensive Content Identification in Indo-Aryan Languages using Transformer-based Models

Olumide Ebenezer Ojo, Olaronke Oluwayemisi Adebanji, Hiram Calvo, Alexander Gelbukh, Anna Feldman, Grigori Sidorov

Research output: Contribution to journal, Conference article, peer-review

1 Scopus citation

Abstract

The open exchange of hate speech, insults, derogatory remarks, and obscenities on social media platforms can undermine objective discourse and facilitate radicalization by spreading propaganda and exposing people to danger. People targeted by such offensive and hateful content often experience physiological effects as a result. In this work, we present the models we submitted to HASOC 2023 for detecting hate speech and offensive content in two Indo-Aryan languages, Gujarati and Sinhala. Although both are considered low-resource languages, our models demonstrated commendable accuracy in detecting hate speech after being fine-tuned on language-specific hate speech datasets. We fine-tuned two transformer models, DistilBERT and mBERT, and show that they were effective in detecting hate speech in Indo-Aryan texts. mBERT achieved a macro F1-score of 0.6 on the Sinhala text classification task and a higher score of 0.8 on the Gujarati task.
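
The abstract reports results as macro F1-scores. As a reminder of what that metric measures, the following is a minimal sketch in plain Python, not the authors' evaluation code. The function name `macro_f1` and the example labels `"HOF"` and `"NOT"` are assumptions made for illustration only. Macro F1 is the unweighted mean of the per-class F1 scores, so a minority class counts as much as a majority class.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)

    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels, chosen only to illustrate the computation.
gold = ["HOF", "HOF", "NOT", "NOT", "NOT"]
predicted = ["HOF", "NOT", "NOT", "NOT", "HOF"]
print(round(macro_f1(gold, predicted), 3))
```

In this toy example the per-class F1 scores are 0.5 for `"HOF"` and about 0.667 for `"NOT"`, so the macro average is about 0.583. Scores such as the 0.6 and 0.8 reported above are averages of this kind over the classes in each task.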

Original language: English
Pages (from-to): 383-392
Number of pages: 10
Journal: CEUR Workshop Proceedings
Volume: 3681
State: Published - 2023
Event: 15th Forum for Information Retrieval Evaluation, FIRE 2023 - Goa, India
Duration: 15 Dec 2023 to 18 Dec 2023

Keywords

  • Gujarati
  • Hate Speech
  • Offensive Content
  • Sinhala
  • Transformers
