International Journal of Data Science and Analysis
Volume 5, Issue 6, December 2019, Pages: 123-127
Received: Sep. 25, 2019;
Accepted: Nov. 8, 2019;
Published: Nov. 17, 2019
Samuel Adewale Aderoju, Department of Statistics and Mathematical Sciences, Kwara State University, Ilorin, Nigeria
Emmanuel Teju Jolayemi, Department of Statistics, University of Ilorin, Ilorin, Nigeria
Handling classification problems in class-imbalanced data has attracted growing attention from researchers in the last few years. The class imbalance problem arises when one of two classes has more samples than the other. The class with more samples is called the major class, while the other is referred to as the minor class. Most classification or prediction models focus on classifying or predicting the major class correctly and largely ignore the minor class. In this paper, various data pre-processing approaches for improving model accuracy were reviewed and applied to terminated pregnancy data. The data were extracted from the 2013 Nigeria Demographic and Health Survey (NDHS). The response variable is “terminated pregnancy” (women of reproductive age were asked whether or not they had ever experienced a terminated pregnancy), which has two possible classes (“YES” or “NO”) that exhibit class imbalance. The major class (“NO”) accounts for 86.82% of the sample and represents Nigerian women aged 15-49 years who had never experienced a terminated pregnancy, while the minor class (“YES”) accounts for 13.18%. Hence, different resampling techniques were used to handle the problem and to improve model performance. Among the resampling techniques considered, the Synthetic Minority Oversampling Technique (SMOTE) gave the greatest improvement in model performance. The following socio-demographic factors were significantly associated with having a terminated pregnancy in Nigeria: age, age at first birth, residential area, region, and the woman's education level.
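To make the resampling idea concrete, the following is a minimal sketch of SMOTE-style oversampling, not the authors' implementation. It assumes numeric features and Euclidean distance. The function name `smote_sketch`, the two-feature toy data, and the 868/132 class counts (chosen only to roughly mirror the 86.82%/13.18% split reported above) are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points (illustrative sketch only).

    Each synthetic point lies on the line segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        nearest = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(nearest)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (q - b) for b, q in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: 868 major-class and 132 minor-class observations.
data_rng = random.Random(42)
n_major, n_minor = 868, 132
minority = [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(n_minor)]

# A model that always predicts the major class already attains this accuracy,
# which is why plain accuracy is misleading on imbalanced data.
baseline_accuracy = n_major / (n_major + n_minor)

# Oversample the minor class until both classes have the same size.
synthetic = smote_sketch(minority, n_new=n_major - n_minor)
balanced_minority = minority + synthetic

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"minor class size after oversampling: {len(balanced_minority)}")
```

Because each synthetic point is an interpolation between two existing minor-class points, the oversampled class stays within the region already occupied by the minor class rather than simply duplicating rows.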
Samuel Adewale Aderoju, Emmanuel Teju Jolayemi, Issues of Class Imbalance in Classification of Binary Data: A Review, International Journal of Data Science and Analysis. Vol. 5, No. 6, 2019, pp. 123-127.