Advancing Sentiment Classification for Roman Urdu Text Using Novel SGD Integration and Comprehensive Analysis of Machine Learning Models

Authors

  • Muhammad Aqeel, School of Software, Northwestern Polytechnical University, Xi’an, China
  • Irfan Qutab, Department of Engineering, University of Modena and Reggio Emilia, Modena, Italy
  • Khawar Iqbal, Riphah School of Computing & Innovation, Riphah International University, Lahore, Pakistan
  • Habiba Fiaz, School of Mathematics & Statistics, Northwestern Polytechnical University, Xi’an, China
  • Hira Arooj, Department of Mathematics & Statistics, The University of Lahore, Sargodha, Pakistan

DOI:

https://doi.org/10.53560/PPASA(62-4)845

Keywords:

Roman Urdu Stemmer, TF-IDF, Stochastic Gradient Descent, Topic Classification, Machine Learning

Abstract

Roman Urdu content on social media and other internet platforms is highly informal, inconsistently spelled, and linguistically diverse, which makes it hard to process with conventional Natural Language Processing (NLP) techniques. To address these challenges, this paper proposes a robust topic-classification framework for Roman Urdu that integrates Stochastic Gradient Descent (SGD)-optimized machine learning, dictionary-assisted stemming, and custom lexical normalization. The method consists of structured preprocessing, reduction of repeated letters, rule-based normalization, TF-IDF feature extraction, and the evaluation of several classifiers: Logistic Regression (LR), Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), and K-Nearest Neighbors (KNN), alongside the proposed SGD-based model. In experiments on a four-class dataset comprising Politics, Sports, Education, and Religion, the proposed classifier outperformed all baseline models, reaching an accuracy of 95%. The results underscore the importance of stemming and normalization for improving feature quality and reducing orthographic variability in low-resource languages. Overall, this study provides a repeatable and efficient pipeline for Roman Urdu topic classification and lays a concrete foundation for further Roman Urdu NLP research.
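The pipeline described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the `CANONICAL` variant map, the rule of squeezing every repeated-letter run to a single letter, the smoothed IDF formula, the one-vs-rest logistic loss, the hyperparameters, and the eight toy sentences are all assumptions made for illustration, and the dictionary-assisted stemmer is omitted.

```python
import math
import random
import re
from collections import Counter

# Hypothetical variant-to-canonical map; the paper's dictionary is not reproduced here.
CANONICAL = {"hy": "hai", "nhi": "nahi", "nahin": "nahi"}

def normalize(text):
    """Lowercase, keep letter tokens, squeeze repeated letters, map spelling variants."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [re.sub(r"(.)\1+", r"\1", t) for t in tokens]  # "matchhh" -> "match"
    return [CANONICAL.get(t, t) for t in tokens]

def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the training set."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(doc, idf):
    """L2-normalized sparse TF-IDF vector; terms unseen in training are dropped."""
    counts = Counter(t for t in doc if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def score(x, w, b):
    return b + sum(w.get(t, 0.0) * v for t, v in x.items())

def train_sgd(X, y, labels, epochs=50, lr=0.5, seed=0):
    """One-vs-rest logistic regression, updated one example at a time (SGD)."""
    rng = random.Random(seed)
    weights = {label: {} for label in labels}
    bias = {label: 0.0 for label in labels}
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            for label in labels:
                w = weights[label]
                z = max(-30.0, min(30.0, score(X[i], w, bias[label])))
                grad = 1.0 / (1.0 + math.exp(-z)) - (1.0 if y[i] == label else 0.0)
                for t, v in X[i].items():
                    w[t] = w.get(t, 0.0) - lr * grad * v
                bias[label] -= lr * grad
    return weights, bias

def predict(x, weights, bias):
    return max(weights, key=lambda label: score(x, weights[label], bias[label]))

# Toy sentences invented for this sketch; they are not drawn from the paper's dataset.
TRAIN = [
    ("hukumat ne naya budget pesh kiya", "Politics"),
    ("wazir e azam ne election ka elaan kiya", "Politics"),
    ("cricket team ne match jeet liya", "Sports"),
    ("hockey ka final match kal hoga", "Sports"),
    ("school mein imtihan shuru ho gaye", "Education"),
    ("university ne dakhlay ka elaan kiya", "Education"),
    ("masjid mein namaz ada ki gayi", "Religion"),
    ("ramzan mein roza rakhna farz hai", "Religion"),
]

docs = [normalize(text) for text, _ in TRAIN]
y = [label for _, label in TRAIN]
labels = sorted(set(y))
idf = fit_idf(docs)
X = [tfidf(doc, idf) for doc in docs]
weights, bias = train_sgd(X, y, labels)

def classify(text):
    return predict(tfidf(normalize(text), idf), weights, bias)

print(classify("criiicket matchhh jeet liya"))
```

Because normalization runs before feature extraction, elongated spellings such as "criiicket" and "matchhh" collapse onto the same TF-IDF features as their plain forms, which is the reduction in orthographic variability the abstract refers to. Squeezing every run to one letter is deliberately aggressive ("school" becomes "schol"); a real system would likely pair a gentler rule with the dictionary.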

Published

2025-12-24

How to Cite

Aqeel, M., Qutab, I., Iqbal, K., Fiaz, H., & Arooj, H. (2025). Advancing Sentiment Classification for Roman Urdu Text Using Novel SGD Integration and Comprehensive Analysis of Machine Learning Models. Proceedings of the Pakistan Academy of Sciences: A. Physical and Computational Sciences, 62(4). https://doi.org/10.53560/PPASA(62-4)845

Section

Research Articles
