Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data
American Journal of Theoretical and Applied Statistics
Volume 4, Issue 3, May 2015, Pages: 192-200
Received: May 8, 2015; Accepted: May 18, 2015; Published: May 29, 2015
Authors
Shelmith Nyagathiri Kariuki, Department of Statistics and Actuarial Sciences, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya
Anthony Waititu Gichuhi, Department of Statistics and Actuarial Sciences, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya
Anthony Kibira Wanjoya, Department of Statistics and Actuarial Sciences, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya
Abstract
Missing data pose a major threat to observational and experimental studies. Analysing data while ignoring missingness yields estimates that are inefficient and biased. Several studies have sought to determine the best methods of dealing with missing data. Those studies typically simulate missing data from complete data, impute the missing values using various methods, and identify the best method by examining the bias of the imputed estimates relative to the complete-data estimates and the magnitude of the standard errors. This study aimed to establish the best method of dealing with missing data on the basis of goodness-of-fit tests. The study used data from KDHS 2010. The overall rate of missingness was about 80%. The missing data mechanism was tested and found to be missing at random (MAR). The missing data were then imputed using the Expectation Maximization algorithm and Multiple Imputation, giving two imputed datasets. Logistic regression models were fitted to both datasets, and goodness-of-fit measures were computed to determine which of the two methods was better for imputing the data. These measures were the AIC, the Root Mean Square Error of Approximation (RMSEA) and Cox and Snell's R-squared. The predictive ability of the two models was also examined using confusion matrices and the area under the receiver operating characteristic curve (AUROC). On these measures, multiple imputation was the better method, since the logistic regression model fitted the multiply imputed data better than it fitted the data imputed using expectation maximization. Based on these results, the researchers recommend that the type of missingness present in data be examined. If the amount of missing data is large and the data are MAR, the data should be imputed using multiple imputation before any inferences are made. The researchers suggest further research to determine the maximum rate of missing data that should be imputed.
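As an illustration of the comparison step described in the abstract, the following minimal Python sketch computes several of the named measures (AIC, Cox and Snell's R-squared, a confusion matrix and the AUROC) for two sets of predicted probabilities. It is not the authors' code. The function names, the toy labels and probabilities, and the parameter count of 2 are all assumptions made for illustration and are not KDHS results. RMSEA is omitted because it needs a chi-square fit statistic that this sketch does not produce.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 labels y given predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def aic(y, p, n_params):
    """Akaike Information Criterion, 2k - 2 log L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood(y, p)

def cox_snell_r2(y, p):
    """Cox and Snell R^2 = 1 - exp(2 * (logL_null - logL_model) / n)."""
    n = len(y)
    base_rate = sum(y) / n  # the intercept-only model predicts the base rate
    ll_null = log_likelihood(y, [base_rate] * n)
    return 1.0 - math.exp(2.0 * (ll_null - log_likelihood(y, p)) / n)

def confusion_matrix(y, p, threshold=0.5):
    """Return (tn, fp, fn, tp), classifying as positive when p >= threshold."""
    tn = fp = fn = tp = 0
    for yi, pi in zip(y, p):
        predicted = 1 if pi >= threshold else 0
        if yi == 1 and predicted == 1:
            tp += 1
        elif yi == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def auroc(y, p):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: one outcome vector and the fitted probabilities from
# two logistic models (standing in for fits to two differently imputed datasets).
y = [1, 1, 1, 0, 0, 0]
p_model_a = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
p_model_b = [0.7, 0.4, 0.6, 0.5, 0.3, 0.6]

for name, p in (("model A", p_model_a), ("model B", p_model_b)):
    print(name,
          "AIC=%.3f" % aic(y, p, n_params=2),
          "CoxSnellR2=%.3f" % cox_snell_r2(y, p),
          "confusion(tn,fp,fn,tp)=%s" % (confusion_matrix(y, p),),
          "AUROC=%.3f" % auroc(y, p))
```

Under this sketch, the model with the lower AIC, the higher Cox and Snell R-squared and the higher AUROC would be judged the better fit, which mirrors the kind of comparison the study reports between the two imputation methods.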
Keywords
Missingness, Missing at Random, Multiple Imputation, Expectation Maximization
To cite this article
Shelmith Nyagathiri Kariuki, Anthony Waititu Gichuhi, Anthony Kibira Wanjoya, Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data, American Journal of Theoretical and Applied Statistics. Vol. 4, No. 3, 2015, pp. 192-200. doi: 10.11648/j.ajtas.20150403.26