Jiahuai Ma, Kaixian Xu, Yu Qiao, & Zhaoyan Zhang. (2022). An Integrated Model for Social Media Toxic Comments Detection: Fusion of High-Dimensional Neural Network Representations and Multiple Traditional Machine Learning Algorithms. Journal of Computational Methods in Engineering Applications, 2(1), 1–12. https://doi.org/10.62836/jcmea.v2i1.0005

An Integrated Model for Social Media Toxic Comments Detection: Fusion of High-Dimensional Neural Network Representations and Multiple Traditional Machine Learning Algorithms

Social media platforms have become pivotal for global communication and information exchange but are increasingly challenged by the proliferation of toxic comments. These comments, characterized by abusive, discriminatory, or harassing language, threaten user safety and well-being, necessitating efficient detection systems. This paper proposes a novel hybrid approach to detecting toxic comments on social media that combines the feature extraction capabilities of Long Short-Term Memory (LSTM)-based neural networks with multiple machine learning models, including Random Forest, Logistic Regression, and K-Nearest Neighbors. High-dimensional feature representations from the neural network are integrated with predictions from the traditional classifiers, and a Random Forest optimizes the output weights to maximize performance. Evaluated on a Kaggle dataset, the proposed model achieves an accuracy of 89.78% and outperforms the individual models in handling the complexity of toxic comments. However, challenges such as overfitting, computational overhead, and interpretability remain. Future work aims to address these limitations through improved data augmentation, explainability methods, and more scalable architectures.
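The fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn's StackingClassifier as the fusion mechanism, uses synthetic vectors from make_classification as a stand-in for LSTM feature representations, and picks all hyperparameters (sample count, feature width, tree count, neighbors, folds) arbitrarily for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic 32-dimensional vectors stand in for the
# high-dimensional representations an LSTM encoder would produce.
X, y = make_classification(
    n_samples=600, n_features=32, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Traditional classifiers named in the abstract, used as base learners.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# A Random Forest meta-learner combines the base predictions.
# passthrough=True also feeds it the original feature vectors, so it
# sees both the representations and the base-model outputs.
fusion = StackingClassifier(
    estimators=base_models,
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    passthrough=True,
    cv=3,
)
fusion.fit(X_train, y_train)

accuracy = fusion.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

Because the data here is synthetic, the printed accuracy says nothing about the 89.78% reported on the Kaggle dataset; the sketch only shows how base-model predictions and feature vectors can be handed jointly to a Random Forest meta-learner.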

component; multi-model fusion; social media toxic comments detection; machine learning

Supporting Agencies

  1. Funding: Not applicable.