Multiclass Email Classification by Using Ensemble Bagging and Ensemble Voting

Ali Helmut, Danang Triantoro Murdiansyah

Abstract


Email is a common communication technology in modern life. The more emails we receive, the more difficult and time-consuming it is to sort them. One solution to this problem is to build a machine learning system that sorts emails. Each machine learning method and each data sampling technique results in different performance. Ensemble learning combines several learning models into one model to obtain better performance. In this study we built a multiclass email classification system that combines learning models, data sampling, and several data classes in order to measure the effect of the Ensemble Bagging and Ensemble Voting methods on the macro-average F1 score, and to compare them with non-ensemble models. The results show that the sensitivity of Naïve Bayes to imbalanced data is mitigated by Ensemble Bagging and Ensemble Voting, with ∆P (delta performance) in the range 0.0001–0.0018. The performance of Logistic Regression changes under Ensemble Bagging and Ensemble Voting by ∆P in the range 0.0001–0.00015. Decision Tree has the lowest performance compared with the others, with ∆P of -0.01.
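The two quantities the abstract reports, the macro-average F1 score and ∆P, can be illustrated with a minimal sketch. It is not the authors' implementation: the class labels, the three sets of model predictions, and the function names `macro_f1` and `hard_vote` are hypothetical, and ∆P is assumed here to mean the ensemble score minus the score of a single non-ensemble model, which the abstract does not define explicitly.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def hard_vote(predictions):
    """Majority vote per sample; ties go to the label listed first."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Hypothetical, imbalanced toy data: "work" dominates the other two classes.
y_true = ["work", "work", "work", "work", "promo", "promo", "social", "social"]

# Hypothetical outputs of three base classifiers on the same eight emails.
model_a = ["work", "work", "work", "work", "work", "promo", "social", "work"]
model_b = ["work", "work", "work", "promo", "promo", "promo", "social", "work"]
model_c = ["work", "work", "social", "work", "promo", "work", "social", "social"]

voted = hard_vote([model_a, model_b, model_c])

baseline_score = macro_f1(y_true, model_a)   # single, non-ensemble model
ensemble_score = macro_f1(y_true, voted)     # voting ensemble

# Assumed definition: delta performance = ensemble score - baseline score.
delta_p = ensemble_score - baseline_score

print(f"baseline macro F1: {baseline_score:.4f}")
print(f"ensemble macro F1: {ensemble_score:.4f}")
print(f"delta P:           {delta_p:+.4f}")
```

Because macro averaging weights every class equally, a model that leans toward the majority class is penalized on the minority classes, which is why this metric is a natural choice for imbalanced email data. Bagging would follow the same evaluation pattern, with the voters being copies of one model trained on bootstrap resamples rather than different model types.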

DOI: https://doi.org/10.33387/jiko.v6i2.6394
