Preprint / Version 1

A Comparative Simulation Study of Cluster Ensemble Algorithms Integrated with Multiple Imputation for Clustering with Missing Data

Authors

  • Yui Tomo, Center for Surveillance, Immunization, and Epidemiologic Research, National Institute of Infectious Diseases
  • Funato Sato, Department of Clinical Data Science, National Center of Neurology and Psychiatry
  • Mari Oba, Department of Clinical Data Science, National Center of Neurology and Psychiatry

DOI:

https://doi.org/10.51094/jxiv.1116

Keywords:

Cluster analysis, Consensus clustering, Hierarchical clustering, k-means, Non-negative matrix factorization

Abstract

Cluster analysis methods usually cannot be applied directly to data with missing values, and various approaches have been investigated to handle this problem. Multiple imputation is one of the standard procedures for handling missing data. In cluster analysis, cluster ensemble methods, rather than Rubin's rules, have been proposed for combining the results obtained from multiply imputed datasets. However, it remains unclear which cluster ensemble algorithm performs best when integrated with multiple imputation. We therefore conducted numerical comparisons of several algorithms for integrating the results of k-means++ clustering applied to multiply imputed datasets. Our results suggest that the non-negative matrix factorization algorithm may be suitable for scenarios with class balance, whereas the agglomerative clustering algorithm may be suitable for scenarios with class imbalance. Before applying these methods to actual datasets, we still recommend performing simulation experiments under scenarios that reflect the characteristics of the datasets and the assumed missing-data mechanism.
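
To make the integration step concrete, the following is a minimal sketch of one agglomerative cluster ensemble, not the implementation used in the study. It assumes that each multiply imputed dataset has already been clustered (for example by k-means++) and that only the resulting label vectors are available. The function names, the toy label vectors, and the choice of average linkage on a co-association matrix are all illustrative assumptions; the abstract does not specify these details.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of partitions in which each pair of samples shares a cluster."""
    n = len(partitions[0])
    m = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def consensus_average_linkage(partitions, n_clusters):
    """Repeatedly merge the two clusters with the highest mean co-association
    until n_clusters remain (average linkage is an assumption of this sketch)."""
    co = co_association(partitions)
    clusters = [[i] for i in range(len(co))]
    while len(clusters) > n_clusters:
        best_pair, best_sim = None, -1.0
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(co[i][j] for i in clusters[a] for j in clusters[b])
            sim = total / (len(clusters[a]) * len(clusters[b]))
            if sim > best_sim:
                best_pair, best_sim = (a, b), sim
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Convert the list of member lists back into one label per sample.
    labels = [0] * len(co)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


# Hypothetical labels from clustering M = 3 imputed copies of 6 samples.
# Label values are arbitrary within each partition (label switching).
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]
consensus = consensus_average_linkage(partitions, n_clusters=2)
print(consensus)  # [0, 0, 0, 1, 1, 1]
```

Working on the co-association matrix rather than on the raw labels sidesteps the label-switching problem, since only whether two samples share a cluster matters, not which label they receive. In the toy input, the third sample is grouped with the first two in two of three partitions, so the consensus assigns it to their cluster.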

Conflicts of Interest Disclosure

The authors declare that there are no competing interests.

Posted

Submitted: 2025-03-03 12:40:35 UTC

Published: 2025-03-06 05:00:12 UTC

Section

Information Sciences