Retrieval of whole-metagenome sequencing samples | Consiglio Nazionale delle Ricerche

Joint research project

Project leaders: Daniele Santoni, Hasan Ogul
Agreement: TURKEY - TUBITAK - Scientific and Technological Research Council of Turkey
Call: CNR-TUBITAK 2016-2017
Department: Engineering, ICT and technologies for energy and transportation
Thematic area: Engineering, ICT and technologies for energy and transportation
Status of the project: New

Research proposal

Metagenomics is the study of genomic material collected from natural environmental samples. It has become a very active research area as high-throughput sequencing technologies have advanced rapidly. Advances in these technologies are fueling a rapid increase in the number of metagenome sequence datasets. The increasing availability of such data raises new computational challenges and calls for new algorithms. Several studies have addressed this issue from different points of view: some focus on sequencing specific markers, while others are concerned with whole-metagenome sequencing (e.g. Human Microbiome Project Consortium, 2012; Qin et al., 2010). Although phylogenetic profiling can be carried out at a lower cost, whole metagenomes provide much more information about the population genetics of the community (Schloissnig et al., 2013) and its collective metabolism (Greenblum et al., 2012).
Finding similar metagenomics samples within large repositories is a significant and relevant problem for researchers. Recent studies have proposed content-based similarity measures and the retrieval of similar metagenomics datasets (Su et al., 2012; Jiang et al., 2012). These studies rely on quantifying abundances over a relatively small number of predetermined features, which requires existing annotation: they need known taxa, genes or metabolic pathways to retrieve relevant sequence samples. By contrast, there is only one study in which the similarity measures are based on raw sequencing reads and are therefore unbiased and insensitive to the quality of the existing annotation (Seth et al., 2014). In that study, instead of considering all sequences of a particular length, known as k-mers (Maillet et al., 2012), the authors developed a distributed string mining (DSM) algorithm to identify informative subsequences of any length. There is only one study (Qin et al., 2012) in which quantification and association testing were done for >4.3 million predefined genes. Seth et al. (2014) were also the first to apply in metagenomics an unsupervised feature selection method, a common information retrieval practice in text mining.

In this project, we aim to develop a computational framework that takes a whole metagenome sequence as a query and provides a list of relevant experiments retrieved from a given repository.

We will adopt an approach based on the analysis of k-mer frequencies, similar to that of Seth et al. (2014), to define a similarity measure between metagenomics samples. One of the main challenges of such an approach is managing the very large space of k-mer frequencies; for that reason, proper filtering and mono-dimensional feature selection methods are needed. We believe that large improvements and refinements are still possible in this research area using multidimensional feature selection. Unlike the cited works, we will use recently developed Feature Selection (FS) techniques to identify suitable k-mer subsets, together with suitable algorithms to normalize and discretize frequency values.
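As an illustration of what a k-mer frequency profile and a pairwise dissimilarity could look like, the following is a minimal sketch and not the project's method. The function names, the restriction to the ACGT alphabet, and the choice of the Bray-Curtis dissimilarity are all assumptions made for the example; the proposal does not commit to a specific measure.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(reads, k):
    """Return relative frequencies of all k-mers over the alphabet ACGT."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers with ambiguous bases
                counts[kmer] += 1
    total = sum(counts.values())
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    if total == 0:
        return {kmer: 0.0 for kmer in vocabulary}
    # Normalizing by the total makes samples of different depth comparable.
    return {kmer: counts[kmer] / total for kmer in vocabulary}


def bray_curtis(freq_a, freq_b):
    """Bray-Curtis dissimilarity between two profiles (assumed measure)."""
    num = sum(abs(freq_a[kmer] - freq_b[kmer]) for kmer in freq_a)
    den = sum(freq_a[kmer] + freq_b[kmer] for kmer in freq_a)
    return num / den if den else 0.0


# Hypothetical toy reads standing in for two sequencing samples.
profile_a = kmer_frequencies(["ACGTACGT", "TTGACA"], k=3)
profile_b = kmer_frequencies(["ACGTTTGA", "CCGTAC"], k=3)
dissimilarity = bray_curtis(profile_a, profile_b)
```

The profile has 4^k entries, so its size grows exponentially with k; this is the large feature space that motivates the filtering and feature selection discussed above.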

FS can be viewed as an independent task that pre-processes the data before classification, reconstruction or representation algorithms are employed, selecting the features that are relevant for the specific task to be accomplished. The main difficulty in FS lies in selecting, from a much larger set, a subset whose properties depend strongly on the subset as a whole; it is therefore not always appropriate to measure them by means of simple or low-order functions of the individual elements.
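The dependence on the whole subset can be seen in a small constructed example. The sketch below uses invented toy data (presence or absence of two hypothetical k-mers in four samples, with a label equal to their XOR) and empirical mutual information as the scoring function; both are assumptions chosen for illustration. Each feature scored alone carries no information about the label, while the pair determines it completely.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Toy data: two hypothetical binary features and a label equal to their XOR.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
label = [a ^ b for a, b in zip(f1, f2)]

mi_f1 = mutual_information(f1, label)                   # 0.0 bits
mi_f2 = mutual_information(f2, label)                   # 0.0 bits
mi_pair = mutual_information(list(zip(f1, f2)), label)  # 1.0 bit
```

A filter that ranks features one at a time would discard both features here, which is why a method that evaluates subsets jointly can be preferable.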

To tackle Feature Selection we will adopt a combinatorial approach paired with a robust metaheuristic solution algorithm, as described in Bertolazzi et al. (2015). Similar methods have already been used successfully in other applications involving genetic and biological sequences (Polychronopoulos et al., 2014; Weitschek et al., 2012).

REFERENCES

Human Microbiome Project Consortium. (2012) Structure, function and diversity of the healthy human microbiome. Nature, 486, 207-214.

Qin,J. et al. (2010) A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464, 59-65.

Schloissnig,S. et al. (2013) Genomic variation landscape of the human gut microbiome. Nature, 493, 45-50.

Greenblum,S. et al. (2012) Metagenomic systems biology of the human gut microbiome reveals topological shifts associated with obesity and inflammatory bowel disease. Proc. Natl Acad. Sci. USA, 109, 594-599.

Jiang,B. et al. (2012) Comparison of metagenomic samples using sequence signatures. BMC Genomics, 13, 730.

Su,X. et al. (2012) Meta-Storms: efficient search for similar microbial communities based on a novel indexing scheme and similarity score for metagenomic data. Bioinformatics, 28, 2493-2501.

Seth,S. et al. (2014) Exploration and retrieval of whole-metagenome sequencing samples. Bioinformatics, 30 (17), 2471-2479.

Maillet,N. et al. (2012) Compareads: comparing huge metagenomic experiments. BMC Bioinformatics, 13 (Suppl. 19), S10.

Qin,J. et al. (2012) A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature, 490, 55-60.

Bertolazzi,P., Felici,G., Festa,P., Fiscon,G., Weitschek,E. (2015) Integer Programming models for Feature Selection: new extensions and a randomized solution algorithm. European Journal of Operational Research.

Polychronopoulos,D., Weitschek,E., Dimitrieva,S., Bucher,P., Felici,G., Almirantis,Y. (2014) Classification of selectively constrained DNA elements using feature vectors and rule-based classifiers. Genomics, 104 (2), 79-86.

Weitschek,E., Lo Presti,A., Drovandi,G., Felici,G., Ciccozzi,M., Ciotti,M., Bertolazzi,P. (2012) Human polyomaviruses identification by logic mining techniques. Virology Journal, 9, 58.

Research goals

As sequencing technologies have advanced rapidly, the number and size of metagenomics datasets have increased quickly. Bioinformaticians face the problem of analyzing these data with efficient and useful computational methods. Finding similar metagenomics samples within large repositories has been one of the most important research issues for them. This study aims to develop a novel technique to find relevant metagenomics samples. The main focus of the study is to extract and select suitable features for representing whole-metagenome sequencing (WMS) samples and to define a pairwise dissimilarity measure for a collection of such samples.

In our design, an initial unsupervised feature selection step will be run offline to reduce the feature set to a tractable size; a second step will then tune different distance functions based on the selected k-mers in a supervised learning framework. The third and final step will be the actual recognition of a query sequence, based on a nearest-neighbor approach between the query and the samples in the adopted repository.
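The three steps can be sketched on toy data as follows. This is only an outline under stated assumptions: a simple variance filter stands in for the unsupervised feature selection (the project intends to use the combinatorial approach cited above), the candidate distances are Manhattan and Euclidean, tuning is done by leave-one-out 1-nearest-neighbor accuracy, and every name and value below is hypothetical.

```python
from statistics import pvariance


def select_features(profiles, n_keep):
    """Step 1 (offline, unsupervised): keep the n_keep highest-variance columns.
    The variance filter is a stand-in assumption, not the project's FS method."""
    n_cols = len(profiles[0])
    variances = [pvariance([row[j] for row in profiles]) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:n_keep])


def project(profile, columns):
    """Restrict a frequency profile to the selected columns."""
    return [profile[j] for j in columns]


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def loo_accuracy(vectors, labels, distance):
    """Leave-one-out 1-nearest-neighbor accuracy for one distance function."""
    hits = 0
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        nearest = min(others, key=lambda j: distance(v, vectors[j]))
        hits += labels[nearest] == labels[i]
    return hits / len(vectors)


def tune_distance(vectors, labels, candidates):
    """Step 2 (supervised): pick the candidate with the best LOO accuracy."""
    return max(candidates, key=lambda d: loo_accuracy(vectors, labels, d))


def retrieve(query, vectors, distance, top=3):
    """Step 3: rank repository samples by distance to the query."""
    order = sorted(range(len(vectors)), key=lambda j: distance(query, vectors[j]))
    return order[:top]


# Hypothetical repository: four samples described by four k-mer frequencies.
repository = [
    [0.50, 0.10, 0.25, 0.15],
    [0.48, 0.12, 0.25, 0.15],
    [0.10, 0.50, 0.25, 0.15],
    [0.12, 0.48, 0.25, 0.15],
]
labels = ["gut", "gut", "skin", "skin"]

columns = select_features(repository, n_keep=2)
reduced = [project(p, columns) for p in repository]
distance = tune_distance(reduced, labels, [manhattan, euclidean])
query = project([0.51, 0.09, 0.25, 0.15], columns)
hits = retrieve(query, reduced, distance, top=2)
```

Only the first step touches the full feature space, which is why it is run offline; the query-time cost in the third step depends on the reduced dimension and the repository size.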

We will evaluate the performance of the proposed method on the data used by Seth et al. (2014). The dataset consists of synthetic data and metagenomics samples from human body sites (Human Microbiome Project Consortium, 2012; Qin et al., 2010, 2012). We will also use external validation based on a ground-truth similarity between two samples.

Last update: 16/08/2025