Issues of Class Imbalance in Classification of Binary Data: A Review

Samuel Adewale Aderoju; Emmanuel Teju Jolayemi

doi:10.11648/j.ijdsa.20190506.13

Peer-Reviewed

Published in International Journal of Data Science and Analysis (Volume 5, Issue 6)

Received: 25 September 2019 Accepted: 8 November 2019 Published: 17 November 2019

Abstract

Classification with class-imbalanced data has attracted growing attention from researchers in recent years. The class imbalance problem arises when one of two classes has more samples than the other. The class with more samples is called the major class, while the other is referred to as the minor class. Most classification or prediction models focus on classifying the major class correctly and tend to ignore the minor class. In this paper, various data pre-processing approaches for improving model accuracy are reviewed, with an application to terminated pregnancy data. The data were extracted from the 2013 Nigeria Demographic and Health Survey (NDHS). The response variable is “terminated pregnancy” (women of reproductive age were asked whether they had ever experienced a terminated pregnancy), which has two possible classes (“YES” or “NO”) that exhibit class imbalance. The major class (“NO”) makes up 86.82% of the sample and represents Nigerian women aged 15 to 49 years who had never experienced a terminated pregnancy, while the minor class (“YES”) makes up 13.18%. Hence, different resampling techniques were used to handle the problem and improve model performance. Among the resampling techniques considered, the Synthetic Minority Oversampling Technique (SMOTE) improved the model the most. The following socio-demographic factors were significantly associated with having a terminated pregnancy in Nigeria: age, age at first birth, residential area, region, and education level of the woman.
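The imbalance described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the sample size of 10,000, the function names `baseline_metrics` and `smote_like`, and the four two-feature minority points are all hypothetical, chosen only for illustration. The first function shows why plain accuracy is misleading at an 86.82% / 13.18% split; the second shows the core idea usually associated with SMOTE, namely creating synthetic minority points by interpolating between a minority point and one of its nearest minority neighbours.

```python
import random

def baseline_metrics(n_major, n_minor):
    """Accuracy and minority recall of a rule that always predicts the major class."""
    total = n_major + n_minor
    accuracy = n_major / total   # every major-class case counts as correct
    minority_recall = 0.0        # no minor-class case is ever predicted
    return accuracy, minority_recall

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbours (simplified SMOTE-style step)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Class shares from the abstract, applied to a hypothetical sample of 10,000 women.
acc, rec = baseline_metrics(n_major=8682, n_minor=1318)
print(f"always-NO accuracy = {acc:.4f}, YES recall = {rec:.1f}")

# Hypothetical two-feature minority points (illustrative values, not NDHS data).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like(minority, n_new=4)
print(len(minority) + len(new_points), "minority points after oversampling")
```

A rule that always answers “NO” reaches 86.82% accuracy on such a sample while identifying none of the “YES” cases, which is why resampling is paired with metrics that look at the minor class. In practice, synthetic points should be generated from the training split only, so that the evaluation data keep the original class proportions.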

Page(s)	123-127
Creative Commons	This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright	Copyright © The Author(s), 2024. Published by Science Publishing Group

Keywords

Classification, Class Imbalanced, Resampling Techniques, Logistic Model, Terminated Pregnancy

Cite This Article

APA Style

Samuel Adewale Aderoju, Emmanuel Teju Jolayemi. (2019). Issues of Class Imbalance in Classification of Binary Data: A Review. International Journal of Data Science and Analysis, 5(6), 123-127. https://doi.org/10.11648/j.ijdsa.20190506.13

ACS Style

Samuel Adewale Aderoju; Emmanuel Teju Jolayemi. Issues of Class Imbalance in Classification of Binary Data: A Review. Int. J. Data Sci. Anal. 2019, 5(6), 123-127. doi: 10.11648/j.ijdsa.20190506.13

AMA Style

Samuel Adewale Aderoju, Emmanuel Teju Jolayemi. Issues of Class Imbalance in Classification of Binary Data: A Review. Int J Data Sci Anal. 2019;5(6):123-127. doi: 10.11648/j.ijdsa.20190506.13

BibTeX

@article{10.11648/j.ijdsa.20190506.13,
  author = {Samuel Adewale Aderoju and Emmanuel Teju Jolayemi},
  title = {Issues of Class Imbalance in Classification of Binary Data: A Review},
  journal = {International Journal of Data Science and Analysis},
  volume = {5},
  number = {6},
  pages = {123-127},
  doi = {10.11648/j.ijdsa.20190506.13},
  url = {https://doi.org/10.11648/j.ijdsa.20190506.13},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijdsa.20190506.13},
  year = {2019}
}

RIS

TY  - JOUR
T1  - Issues of Class Imbalance in Classification of Binary Data: A Review
AU  - Samuel Adewale Aderoju
AU  - Emmanuel Teju Jolayemi
Y1  - 2019/11/17
PY  - 2019
N1  - https://doi.org/10.11648/j.ijdsa.20190506.13
DO  - 10.11648/j.ijdsa.20190506.13
T2  - International Journal of Data Science and Analysis
JF  - International Journal of Data Science and Analysis
JO  - International Journal of Data Science and Analysis
SP  - 123
EP  - 127
PB  - Science Publishing Group
SN  - 2575-1891
UR  - https://doi.org/10.11648/j.ijdsa.20190506.13
VL  - 5
IS  - 6
ER  -

Author Information

Samuel Adewale Aderoju
Department of Statistics and Mathematical Sciences, Kwara State University, Ilorin, Nigeria

Emmanuel Teju Jolayemi
Department of Statistics, University of Ilorin, Ilorin, Nigeria
