Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data

Shelmith Nyagathiri Kariuki; Anthony Waititu Gichuhi; Anthony Kibira Wanjoya

doi:10.11648/j.ajtas.20150403.26

Published in American Journal of Theoretical and Applied Statistics (Volume 4, Issue 3)

Received: 8 May 2015 Accepted: 18 May 2015 Published: 29 May 2015

Abstract

Missing data pose a major threat to observational and experimental studies. Analyses that ignore missingness yield estimates that are inefficient and biased. Many studies have sought to determine the best methods of dealing with missing data. These studies typically simulate missing data from complete data, impute the missing values using various methods, and judge the best method by the bias of the imputed estimates relative to the complete-data estimates and by the magnitude of the standard errors. This study aimed to establish the best method of dealing with missing data on the basis of goodness-of-fit tests. The study used data from KDHS 2010. The overall rate of missingness was about 80%. The missing data mechanism was tested and found to be missing at random (MAR). The missing data were then imputed using the Expectation Maximization (EM) algorithm and Multiple Imputation, giving two imputed datasets. Logistic regression models were fitted to both datasets, and goodness-of-fit tests were carried out to determine which of the two methods was better for imputing the data. The measures used were the AIC, the Root Mean Square Error of Approximation (RMSEA) and Cox and Snell’s R-squared. The predictive ability of the two models was also examined using confusion matrices and the area under the receiver operating characteristic curve (AUROC). On these measures, multiple imputation was the better method, since the logistic regression model fitted the multiply imputed data better than the data imputed using expectation maximization. The researchers recommend that the type of missingness present in data be examined. If the amount of missing data is large and the data are MAR, the data should be imputed using multiple imputation before any inferences are made. The researchers also suggest further research to determine the maximum rate of missing data that should be imputed.
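As a rough illustration of how several of the comparison measures named in the abstract can be computed, the following is a minimal sketch in plain Python. It is not the authors' code, and the software they used is not stated here. The outcome vector, the two sets of fitted probabilities (standing in for logistic models fitted to two differently imputed datasets), and the parameter count are hypothetical toy values, not KDHS 2010 data. RMSEA is omitted from the sketch.

```python
import math


def log_likelihood(y, p):
    """Bernoulli log-likelihood of binary outcomes y given fitted probabilities p."""
    return sum(math.log(pi if yi == 1 else 1.0 - pi) for yi, pi in zip(y, p))


def aic(loglik, n_params):
    """Akaike information criterion; lower values indicate a better fit/complexity trade-off."""
    return 2 * n_params - 2 * loglik


def cox_snell_r2(loglik_null, loglik_model, n):
    """Cox and Snell's R-squared from the null and fitted log-likelihoods."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)


def confusion_matrix(y, p, threshold=0.5):
    """Counts of true/false positives and negatives at a probability threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for yi, pi in zip(y, p):
        predicted = 1 if pi >= threshold else 0
        if predicted == 1:
            counts["tp" if yi == 1 else "fp"] += 1
        else:
            counts["fn" if yi == 1 else "tn"] += 1
    return counts


def auroc(y, p):
    """AUROC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy values for illustration only (not from KDHS 2010).
y = [1, 0, 1, 1, 0, 0, 1, 0]
p_model_a = [0.9, 0.2, 0.8, 0.6, 0.3, 0.1, 0.7, 0.4]  # assumed fit on dataset A
p_model_b = [0.6, 0.5, 0.7, 0.4, 0.5, 0.3, 0.6, 0.6]  # assumed fit on dataset B
p_null = [sum(y) / len(y)] * len(y)  # intercept-only model predicts the base rate
n_params = 2  # assumed: intercept plus one predictor

ll_null = log_likelihood(y, p_null)
for name, p in [("A", p_model_a), ("B", p_model_b)]:
    ll = log_likelihood(y, p)
    print(name, "AIC:", aic(ll, n_params))
    print(name, "Cox-Snell R2:", cox_snell_r2(ll_null, ll, len(y)))
    print(name, "confusion:", confusion_matrix(y, p))
    print(name, "AUROC:", auroc(y, p))
```

Under this kind of comparison, the dataset whose model shows the lower AIC, the higher Cox and Snell R-squared and the higher AUROC would be judged the better fit. With real data the fitted probabilities would come from an actual logistic regression rather than hand-entered values.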

Published in	American Journal of Theoretical and Applied Statistics (Volume 4, Issue 3)
DOI	10.11648/j.ajtas.20150403.26
Page(s)	192-200
Creative Commons	This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright	Copyright © The Author(s), 2015. Published by Science Publishing Group

Keywords

Missingness, Missing at Random, Multiple Imputation, Expectation Maximization

Cite This Article

APA Style

Shelmith Nyagathiri Kariuki, Anthony Waititu Gichuhi, Anthony Kibira Wanjoya. (2015). Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data. American Journal of Theoretical and Applied Statistics, 4(3), 192-200. https://doi.org/10.11648/j.ajtas.20150403.26

ACS Style

Shelmith Nyagathiri Kariuki; Anthony Waititu Gichuhi; Anthony Kibira Wanjoya. Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data. Am. J. Theor. Appl. Stat. 2015, 4(3), 192-200. doi: 10.11648/j.ajtas.20150403.26

AMA Style

Shelmith Nyagathiri Kariuki, Anthony Waititu Gichuhi, Anthony Kibira Wanjoya. Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data. Am J Theor Appl Stat. 2015;4(3):192-200. doi: 10.11648/j.ajtas.20150403.26

@article{10.11648/j.ajtas.20150403.26,
  author = {Shelmith Nyagathiri Kariuki and Anthony Waititu Gichuhi and Anthony Kibira Wanjoya},
  title = {Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data},
  journal = {American Journal of Theoretical and Applied Statistics},
  volume = {4},
  number = {3},
  pages = {192--200},
  doi = {10.11648/j.ajtas.20150403.26},
  url = {https://doi.org/10.11648/j.ajtas.20150403.26},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ajtas.20150403.26},
  year = {2015}
}

TY - JOUR
T1 - Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data
AU - Shelmith Nyagathiri Kariuki
AU - Anthony Waititu Gichuhi
AU - Anthony Kibira Wanjoya
Y1 - 2015/05/29
PY - 2015
N1 - https://doi.org/10.11648/j.ajtas.20150403.26
DO - 10.11648/j.ajtas.20150403.26
T2 - American Journal of Theoretical and Applied Statistics
JF - American Journal of Theoretical and Applied Statistics
JO - American Journal of Theoretical and Applied Statistics
SP - 192
EP - 200
PB - Science Publishing Group
SN - 2326-9006
UR - https://doi.org/10.11648/j.ajtas.20150403.26
VL - 4
IS - 3
ER -

Author Information

Shelmith Nyagathiri Kariuki

Department of Statistics and Actuarial Sciences, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya

Anthony Waititu Gichuhi

Department of Statistics and Actuarial Sciences, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya

Anthony Kibira Wanjoya

Department of Statistics and Actuarial Sciences, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya
