AN AUTOMATIC APPROACH FOR THE IDENTIFICATION OF OFFENSIVE LANGUAGE IN PERSO-ARABIC URDU LANGUAGE: DATASET CREATION AND EVALUATION

Offensive language is unacceptable, impolite language directed at individuals, specific community groups, or society as a whole. With the advent of social media platforms, its use has been widely reported, creating a toxic online environment that poses real-life dangers to society. A prompt response is therefore needed to combat offensive content and foster a culture of respect and acceptance. Identifying offensive language remains challenging, however, particularly in low-resource languages such as Urdu.

Urdu text poses challenges because of its unique features, complex script, and rich morphology, so methods that work for other languages are difficult to apply directly. The task also requires exploring new linguistic features and computational techniques on a relatively large dataset so that the results generalize. Unfortunately, Urdu has received very limited attention from the research community because of the scarcity of language resources and the lack of high-quality datasets and models.

This study addresses these challenges in three steps. First, a dataset of 12,020 Urdu tweets was collected and annotated using the OLID taxonomy as a benchmark. Second, character-level and word-level features were extracted based on bag-of-words, n-gram, and TF-IDF representations. Finally, an extensive series of experiments was conducted on the extracted features using seven machine learning classifiers to identify the most effective features and classifiers. The experimental findings indicate that word unigrams, character trigrams, and word TF-IDF are the most effective features.
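To make the three feature families concrete, here is a minimal sketch in plain Python of word n-grams, character n-grams, and word-level TF-IDF. It is an illustration, not the study's pipeline: the function names are hypothetical, tokenization is a simple whitespace split (real Urdu tokenization is harder), the two sample sentences are benign placeholders, and the TF-IDF weighting shown (raw count times ln(N/df) + 1) is one common variant chosen here as an assumption.

```python
import math
from collections import Counter


def word_ngrams(text, n=1):
    """Return word n-grams from whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=3):
    """Return overlapping character n-grams (spaces are kept as characters)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bag_of_words(text):
    """Count how often each word unigram occurs in the text."""
    return Counter(word_ngrams(text, 1))


def tfidf(docs):
    """Compute word-level TF-IDF weights for each document.

    Assumed variant: weight = raw term count * (ln(N / df) + 1).
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {term: c * (math.log(n_docs / doc_freq[term]) + 1.0)
             for term, c in counts.items()}
        )
    return weights


# Placeholder sentences ("this is a good day" / "this is a bad day").
docs = ["یہ اچھا دن ہے", "یہ برا دن ہے"]

unigrams = word_ngrams(docs[0], 1)
trigrams = char_ngrams(docs[0], 3)
weights = tfidf(docs)
```

Words shared by both sentences receive the minimum weight under this variant, while the word that distinguishes them is weighted higher. That is the property that makes TF-IDF useful as a classifier input.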

Among the classifiers, logistic regression and support vector machine attained the highest accuracy (86%) and F1-score (75%).
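Accuracy and F1-score can differ noticeably when one class is rarer than the other, which may explain part of the gap between the two figures. The sketch below shows how both are computed from binary predictions. The labels are made-up toy data, the function name is hypothetical, and F1 is computed for the positive (offensive) class only; the abstract does not say which averaging scheme the study used.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 for the positive class) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Toy, imbalanced labels (1 = offensive, 0 = not offensive); not study data.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

acc, f1 = accuracy_and_f1(y_true, y_pred)
```

On this toy data the many correctly classified negatives raise accuracy but do not enter the F1 calculation, so F1 comes out lower than accuracy.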
