Bayesian Statistical Methods in Clustering Single Nucleotide Polymorphisms

Joint research project

Project leaders: Fabrizio Ruggeri, Chuhsing Kate Hsiao
Agreement: TAIWAN - NSTC - National Science and Technology Council
Call: CNR/NSC 2012-2013
Department: Medicine
Thematic area: Biomedical sciences
Status of the project: New

Research proposal

The proposed research stems from the recent visit of Dr. Ruggeri to the Department of Public Health of National Taiwan University. During the stay, supported by CNR, Dr. Ruggeri and Prof. Hsiao had the opportunity to discuss and start a cooperation, which will also involve a Taiwanese student (expert in biostatistics) and an Italian researcher (expert in Bayesian cluster analysis and mixture models).
Genomics is a growing field which has attracted interest from researchers in different disciplines; statisticians, in particular, have been stimulated not only by problems of "traditional" inference and decision analysis but, even more, by the need to find significant relations and parsimonious representations within a huge amount of data. The proposed research will investigate a particular case of parsimonious representation, combining the complementary, but not disjoint, knowledge of the two groups. Prof. Hsiao is an expert in biostatistics and bioinformatics, with a background in Bayesian statistics, whereas Dr. Ruggeri has longstanding experience in Bayesian analysis, system modelling and applications, with a growing interest in biostatistics.
The analysis of high-throughput genomic data is usually based on one of two data formats: single nucleotide polymorphisms (SNP's) or genes. For SNP data, the difficulties encountered include multiple testing, especially when the number of SNP's is large, and the failure to incorporate the relations among SNP's. On the other hand, gene data are usually represented as sequences of bits, and a sequence can be so long that the number of possible compositions of bits on the segment becomes enormous. This may induce the curse of dimensionality in statistical analyses. To reduce these difficulties, scientists consider the haplotype as the unit for genetic analysis, where, loosely speaking, a haplotype can be considered a segment or cluster of SNP's showing association with each other. In recent years, association studies based on haplotypes have advanced greatly.
When the research focus moves to the interaction between haplotypes at different regions, other issues arise. For instance, the widely adopted linkage disequilibrium (LD) measure used for haplotype construction may not be appropriate for describing the relation between two haplotypes. This is because the LD within a haplotype is usually much larger than the LD between SNP's from two different haplotypes. In addition, when two haplotypes are considered simultaneously, the number of all possible patterns can be as large as the product m × n, where m and n are the numbers of possible compositions of the two haplotypes, respectively. In other words, the degrees of freedom in the statistical analysis will grow rapidly.
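For reference, one common way of quantifying LD between two biallelic loci is through the coefficient D and its normalised square r^2. The proposal does not state which LD measure is intended, so the choice of measure, the allele labels A and B, and the frequency symbols below are assumptions made only for illustration.

```latex
D = p_{AB} - p_A \, p_B ,
\qquad
r^2 = \frac{D^2}{p_A \,(1 - p_A)\; p_B \,(1 - p_B)}
```

Here p_A and p_B are the frequencies of allele A at the first locus and allele B at the second, and p_{AB} is the frequency of chromosomes carrying both. Under independence p_{AB} = p_A p_B, so D = 0 and r^2 = 0.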
This project will use tools from Bayesian statistics, information theory and bioinformatics to construct an association measure between two SNP clusters. An SNP cluster does not necessarily represent a haplotype. It can stand for a gene, or even the genes in a known pathway. The major contribution of the Italian researchers will be in the field of Bayesian statistics, whereas information theory and bioinformatics knowledge will be provided by the Taiwanese group. Prof. Hsiao and collaborators have been working on a distance between two clusters based on the Hamming distance, a well-known measure from the theory of error-correcting codes. In communication theory, the Hamming distance is used to measure the similarity between two sequences of the same length. However, for the SNP clusters considered in this project, the cluster sizes may not be the same. Last year her research team proposed measures which are not restricted by the assumption of equal length and are flexible enough for wider applications.
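As a point of reference, the standard equal-length Hamming distance mentioned above can be sketched in a few lines. This is only the textbook definition; the extension to clusters of unequal size proposed by Prof. Hsiao's team is not described in this document and is not reproduced here. The function name and the genotype vectors are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical genotype codes (0, 1, 2) of two SNP's on six individuals.
snp_a = [0, 1, 2, 1, 0, 2]
snp_b = [0, 2, 2, 1, 1, 2]
print(hamming_distance(snp_a, snp_b))  # 2
```

The explicit length check makes the equal-length restriction visible, which is exactly the limitation the project aims to overcome.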
The Italian contribution will be based on a Bayesian statistical approach, which is well suited to exploiting all external sources of information, including expertise from similar experiments or from genetics, and to finding rigorous methods, based on the laws of probability, to select among possible configurations. In particular, two approaches are envisaged to model association among SNP's: multinomial distributions and Bayesian clusters.
Multinomial Distributions
Every SNP takes the values 0, 1 and 2, corresponding to the three genotypes per locus, and we suppose this happens according to a given (multinomial) probability distribution. We assume that groups of SNP's share a common distribution, and the research aims to characterize the different distributions present in the population of SNP's and to assign every SNP to one of them. Various methods will be studied. The first one, widely studied by Dr. Ruggeri to compare probability measures, is based on an extension (called the concentration function) of the Lorenz curve, used in economics to describe inequalities in the distribution of wealth. This method, although well justified for identifying similar distributions, is remarkably complex from the computational point of view. A promising alternative method involves mixtures of distributions, which could be used to estimate the number and the parameters of the possible distributions and, finally, to assign each SNP to one of them. While a wide literature exists on mixtures of continuous distributions such as the Gaussian or gamma (e.g. Rios Insua, Ruggeri and Wiper, 2001), the same cannot be said for mixtures of multinomial distributions.
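To make the mixture idea concrete, the following is a minimal sketch of a mixture of multinomials fitted to per-SNP genotype counts. It is a simplification under stated assumptions, not the Bayesian procedure the project intends to develop: the number of components K is fixed rather than estimated, the fit uses maximum-likelihood EM rather than a posterior distribution, and the function name, the farthest-point initialisation and the count table are all hypothetical.

```python
import numpy as np


def em_multinomial_mixture(counts, K, n_iter=100):
    """EM for a K-component mixture of multinomials (K fixed, an assumption).

    counts: (J, C) array; row j holds how many individuals carry each of
    the C genotypes (C = 3 for codes 0, 1, 2) at SNP j.
    Returns weights (K,), genotype probabilities (K, C) and the allocation
    probabilities of every SNP to every component (J, K).
    """
    counts = np.asarray(counts, dtype=float)
    J = counts.shape[0]
    # Deterministic farthest-point initialisation on smoothed frequencies.
    freqs = (counts + 1.0) / (counts + 1.0).sum(axis=1, keepdims=True)
    chosen = [0]
    for _ in range(1, K):
        gaps = np.abs(freqs[:, None, :] - freqs[chosen][None, :, :]).sum(axis=2)
        chosen.append(int(gaps.min(axis=1).argmax()))
    theta = freqs[chosen].copy()
    weights = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: allocation probabilities (the multinomial coefficient cancels).
        log_r = np.log(weights) + counts @ np.log(theta).T
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted updates, with tiny constants to avoid log(0).
        weights = (resp.sum(axis=0) + 1e-9) / (J + K * 1e-9)
        theta = resp.T @ counts + 1e-6
        theta /= theta.sum(axis=1, keepdims=True)
    return weights, theta, resp


# Hypothetical genotype counts for six SNP's on 100 individuals each.
counts = np.array([[80, 15, 5], [78, 16, 6], [82, 14, 4],
                   [10, 20, 70], [12, 18, 70], [8, 22, 70]])
weights, theta, resp = em_multinomial_mixture(counts, K=2)
print(resp.argmax(axis=1))  # [0 0 0 1 1 1]
```

A Bayesian treatment would instead place priors (for example Dirichlet priors, an assumption here) on the weights and genotype probabilities and would treat K as unknown.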
Bayesian clusters

Given J SNP's observed for M individuals, we are interested in subdividing the SNP's into K elements of a cluster and in estimating K as well. Two models are considered. In the first model we treat the centres of the K elements as M-dimensional random variables (M being the number of individuals involved) and construct a model in which the observations depend on the centres. We specify a prior probability distribution on the positions of those centres, which, combined with the observations according to the Bayesian paradigm, provides a posterior distribution of the centres and a probability distribution for the allocation of the SNP's to the K elements. The first model is, despite its remarkable complexity, a typical example of Bayesian statistical analysis, whereas the second one is based on an original iterative technique for allocating the SNP's to the elements of the cluster, repeated until convergence, i.e. until no SNP's, or only a very small number of them, move from one element to another between successive iterations. At each iteration, the centres are chosen, with suitable weights, among the SNP's assigned to the element during the previous iteration. The allocation of each SNP is based on the largest posterior probability of belonging to a given element of the cluster. As a relevant novelty, the Hamming distance (widely used by Prof. Hsiao) will be introduced in the statistical model that describes the distance between observed data and centres, and will therefore affect the likelihood function; here the probabilities of the SNP's being allocated to a cluster are the parameters that are the object of inference. Finally, a complex computational aspect concerns the convergence of the iterative method (e.g. its speed and its robustness with respect to changes in the initial conditions).
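The iterative allocation described above can be illustrated with a deliberately simplified sketch. Several choices below are assumptions and not the project's actual model: K is fixed, the likelihood is taken proportional to exp(-lam * d) with d the Hamming distance to a centre, prior allocation probabilities are equal, and each centre is updated deterministically as the member SNP with the smallest total distance to the others (a medoid) instead of being drawn with weights. All names and the genotype matrix are hypothetical.

```python
import numpy as np


def hamming_matrix(X):
    """Pairwise Hamming distances between the rows of X (J SNP's x M individuals)."""
    return (X[:, None, :] != X[None, :, :]).sum(axis=2)


def allocate_snps(X, init_centres, lam=1.0, max_iter=100):
    """Alternate allocation and centre updates until no SNP changes element."""
    D = hamming_matrix(X)
    centres = list(init_centres)  # indices of the SNP's acting as centres
    labels = None
    for _ in range(max_iter):
        # Posterior allocation probabilities, assuming equal prior weights
        # and a likelihood proportional to exp(-lam * Hamming distance).
        log_post = -lam * D[:, centres]
        log_post = log_post - log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        new_labels = post.argmax(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # convergence: no SNP moved between iterations
        labels = new_labels
        # Update each centre to the medoid of its current members.
        for k in range(len(centres)):
            members = np.flatnonzero(labels == k)
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                centres[k] = int(members[within.argmin()])
    return labels, centres, post


# Hypothetical genotypes of six SNP's on eight individuals.
X = np.array([[0, 0, 0, 0, 1, 1, 2, 2],
              [1, 0, 0, 0, 1, 1, 2, 2],
              [0, 0, 0, 0, 1, 1, 2, 1],
              [2, 2, 1, 1, 0, 0, 0, 0],
              [2, 2, 1, 1, 0, 0, 0, 1],
              [1, 2, 1, 1, 0, 0, 0, 0]])
labels, centres, post = allocate_snps(X, init_centres=[0, 3])
print(labels.tolist(), centres)  # [0, 0, 0, 1, 1, 1] [0, 3]
```

Rerunning the function from different values of `init_centres` is a simple way to probe the robustness to initial conditions mentioned in the text.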

Research goals

We plan to develop methods to identify similarities among SNP's, considering them as realisations of discrete (multinomial) random variables. Mixture models are typically used in Bayesian statistics when elements of a population are supposed to belong to subpopulations with similar characteristics; each observation could therefore come from one of, say, K densities and, as a result of the inference, it is possible to estimate the number K of densities and their parameters, and to determine the posterior probability of allocation of each observation to each density. Once the densities have been identified, each observation can be allocated to the density to which it has the highest probability of belonging, and the SNP's are clustered according to the density to which they have been assigned. Mixtures are obtained by considering a kernel (e.g. a gamma distribution, as in Rios Insua, Ruggeri and Wiper, 2001) and a mixing distribution on its parameters; this distribution could be defined on a finite-dimensional set (as in Rios Insua et al., 2001) or on an infinite-dimensional one, where methods of Bayesian nonparametrics, thoroughly studied at CNR IMATI and by Dr. Ruggeri, could be used. Such work is typically carried out for continuous distributions; applications to multinomial distributions could be quite challenging. We plan to build Bayesian procedures which allow SNP's to be clustered into a number K (to be estimated) of elements, assuming models in which the parameters are either the centres of the cluster or the probabilities of allocation of each SNP to a cluster. We would also like to test them on real data and measure their performance, not only from a statistical viewpoint but also from a computational one.

Last update: 06/09/2025