When clustering is the goal, the data partition itself should be regarded as the statistical parameter, although this view of clustering is relatively recent. In Bayesian clustering, a model is assumed for the data given the partition, and a prior distribution is placed on partitions. The goal is often to find the maximum a posteriori (MAP) grouping. Bayesian clustering models are often fitted with Markov chain Monte Carlo (MCMC) methods, so a measure of convergence defined on the partition space, a finite state space, is needed. Such a measure can also be used to quantify the efficiency of a sampler. A Pearson-like goodness-of-fit statistic is introduced for Bayesian models with analytically tractable marginal posteriors. Its asymptotic distribution is derived, providing a statistical significance test of convergence. The proposed method is demonstrated on MCMC clustering of high-dimension, low-sample-size metabolite data.
Joint work with Masoud Asgharian (McGill University) and Ioana Cosma (University of Cambridge).
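
To make the idea of a Pearson-like statistic on a finite partition space concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the statistic from the talk: it assumes the exact posterior probability of every partition is available, and it uses the textbook Pearson form, whose chi-square reference distribution with K - 1 degrees of freedom holds for independent draws only. Correlated MCMC output calls for the asymptotic distribution derived in the work, which is not reproduced here. The function names `canonical` and `pearson_statistic` are hypothetical.

```python
from collections import Counter


def canonical(labels):
    """Relabel cluster assignments in order of first appearance.

    Two label vectors that describe the same partition, such as
    (2, 2, 0) and (0, 0, 1), map to the same tuple, so the sampler's
    arbitrary cluster labels do not split one partition into several states.
    """
    mapping = {}
    return tuple(mapping.setdefault(lab, len(mapping)) for lab in labels)


def pearson_statistic(samples, posterior):
    """Textbook Pearson statistic: sum over states of (observed - expected)^2 / expected.

    samples   : iterable of label vectors visited by the sampler
    posterior : dict mapping each canonical partition to its exact
                posterior probability (assumed known and strictly positive)
    """
    states = [canonical(s) for s in samples]
    n = len(states)
    counts = Counter(states)

    # Every visited partition must have a known posterior probability.
    unknown = set(counts) - set(posterior)
    if unknown:
        raise ValueError(f"partitions without posterior probability: {unknown}")

    stat = 0.0
    for state, prob in posterior.items():
        expected = n * prob
        stat += (counts.get(state, 0) - expected) ** 2 / expected
    return stat


# Toy example: two items, so only two partitions exist
# (both items together, or each item alone). Probabilities are made up.
toy_posterior = {(0, 0): 0.5, (0, 1): 0.5}
toy_samples = [(1, 1), (0, 1), (3, 3), (1, 0)]
toy_stat = pearson_statistic(toy_samples, toy_posterior)
```

Small values of the statistic indicate that the visit frequencies agree with the target posterior, while large values suggest that the chain has not yet converged or mixes poorly. Mapping each draw to a canonical form before counting matters because the state space is the set of partitions, not the set of label vectors.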