Algoritmi basati su automi cellulari probabilistici per la soluzione di problemi di programmazione intera a grandi dimensioni nell'apprendimento supervisionato e nel Data Mining per la Bioinformatica

Progetto comune di ricerca

Responsabili di progetto: Giovanni Felici, Aldo Procacci
Accordo: BRASILE - CNPq - Conselho Nacional Desenvolvimento Cientifico y Tecnologico
Bando: CNR/CNPq 2012-2013
Dipartimento: ICT
Area tematica: Ingegneria, ICT e tecnologie per l'energia e i trasporti
Stato del progetto: Nuovo

Proposta di ricerca

In recent times a considerable effort has been spent in order to apply ideas coming from the physics of many particle systems to optimization problerms. With this respect a very promising technique that has been investigated recently is the sampling of discrete probability distribution via Probabilistic Cellular Automata (PCA). This idea, introduced in the '90 by Lebowitz and collaborators, found new applications by means of the overlapping weighting of the states, introduced first in (1) for the maximum clique problem.

The numerical results obtained in that work motivated a two-fold research activity: on one side the rigorous study of the convergence of the algorithm to the limit probability distribution (see (2)), on the other side a study of the possible generalizations of this idea to more general problems.

In (3) it has been proved that if certain quantities (two point correlations) decay sufficiently fast then it is possible to sample with PCA algorithms very general discrete probability distributions, and hence to find heuristically the minimum of functions of many discrete variables.

A rigorous technique that should be used in order to find new fields of application of the PCA algorithms is the so-called Cluster Expansion. With this respect the Brasilian proponents guarantee a very high scientific level, since Aldo Procacci is worldwide recognized as one of the best expert in the field (see for instance (7), (8) and (9)).

Although good computational results have been obtained with PCA approaches in previous applications, the proponents believe that it would be very important for the consolidation of this method to test the performances of the PCA optimization algorithms when they are applied to very large discrete spaces.

One such field of application is identified in Knowledge Extraction and Data Mining.

These techniques are used to identify relevant patterns and models in large amount of data. Of great interest in many application fields are methods that can extract synthetic logic formulas from very large data sets. One such method has been developed in CNR-IASI (4) and recently it has been applied with success to different Bioinformatics problems (see (5), (6)).

Typically, the problems to be solved are Discretization, Feature Selection and Formula Extraction, formulated as hard integer programming problems, that demand the development of ad hoc optimization algorithms as their size (in the range of tens or hundreds of thousands of observations and variables) cannot be dealt with general purpose solvers. In their general settings these problems are modeled using as a reference well known NP-hard optimization problems: Set Covering and Minimum Cost Satisfiability Problems.

Despite some efficient algorithmic approaches have been proposed and are currently used in the present implementation of the system (see quoted references) we still identify relevant research directions to be investigated to improve the quality of the solutions that can be obtained given the size of the instances and the computation time available.

Moreover in the Bioinformatics applications, that are of interest in this project, the attention is not simply towards the identification of a good or optimal solution of the problem, but rather on the identification of many solutions of good quality in order to explore the space of the concurrent explanations of the observed phenomena. We point out that such feature is currently not exploited in a formal way by the available Knowledge Extraction and Data Mining methods and, if available, could constitute a significant advancement in the analysis of biological and genetic experimental data.

The project will be devoted to the development and test of heuristic optimization algorithms based on PCA specifically designed for the Discretization, Feature Selection and Formula Extraction problems that arise in the CNR-IASI formulations. The proponents strongly believe that this approach is particularly well suited for these Bioinformatics problems, due to its ability to efficiently explore and sample the solution space, and to the fact that PCA approaches can be easily parallelized and scaled to large size instances.

References

1) Antonio Iovanella, Benedetto Scoppola, Elisabetta Scoppola Some Spin Glass Ideas Applied to the Clique Problem. Journal of Statistical Physics 126, n. 4-5, 895-915, (2007)

2) A. Gaudilliere, B. Scoppola, E. Scoppola, M. Viale Phase transitions for the cavity approach to the clique problem on random graphs Journal of Statistical Physics DOI 10.1007/s10955-011-0336-2 (2011)

3) Paolo Dai Pra, Benedetto Scoppola and Elisabetta Scoppola Sampling from a Gibbs measure with two body interaction by means of PCA Submitted to Journal of Statistical Physics

4) Felici G., K. Truemper, A Minsat Approach for Learning in Logic Domains, INFORMS Journal on Computing, volume 14, number, winter 2002, 20,36

5) Bertolazzi P., G. Felici, E. Weitschek: Learning to classify species with barcodes, BMC Bioinformatics, n. 10, (2009), pp. 1-12

6) I. Arisi, M. D'Onofrio, R. Brandi, A. Felsani, S. Capsoni, G. Drovandi, G. Felici, E. Weitschek, P. Bertolazzi, A. Cattaneo: Gene Expression Biomarkers in the Brain of a Mouse Model for Alzheimer's Disease: Mining of Microarray Data by Logic Classification and Feature Selection. Journal of Alzheimer's Disease 2011, 24(4)

7) Fernandez, R.; Procacci A.: Cluster expansion for abstract polymer models. New bounds from an old approach . Communications in Mathematical Physics 274, n.1, 123-140 (2007)

8) R. Bissacot, Roberto Fernandez, A. Procacci, B.Scoppola: An improvement of the Lovasz local lemma via Cluster Expansion. Combinatorics, Probability and Computing, 20, 709-719 (2011)

9) S. Ndreca, A. Procacci, B.Scoppola
Improved bounds on coloring of graphs. Subitted to European Journal of Combinatorics

Obiettivi della ricerca

L'obiettivo principale di questo progetto è lo sviluppo di metodi algoritmici euristici fortemente parallelizzabili per la soluzione di problemi difficili di ottimizzazione intera di grandi dimensioni che si presentano nella formulazione di problemi di apprendimento automatico e Data Mining, con riferimento a istanze applicative derivate da problemi biologici e genetici. Tali metodi algoritmici saranno basati sulla metodologia degli automi cellulari probabilistici e sulla tecnica della Cluster Expansion.

Tale obiettivo viene declinato in diversi sotto-obiettivi:

1) La verifica della validità dell'approccio PCA per risolvere problemi di ottimizzazione intera a grandi dimensioni di tipo diverso provenienti da contesti applicativi;

2) L'identificazione tramite Cluster Expansion delle condizioni analitiche che devono essere soddisfatte per la convergenza del metodo PCA nei problemi considerati;

3) La verifica della possibilità di produrre algoritmi basati sulla tecnica PCA in grado di campionare in modo soddisfacente uno spazio discreto delle soluzioni anche di grandi dimensioni, ed il relativo interesse applicativo di tale approccio di soluzione;

4) La comparazione degli approcci PCA per la categoria di problemi considerati con i metodi alternativi di soluzione già consolidati nella letteratura;

5) La condivisione dei risultati ottenuti tramite gli algoritmi PCA tramite la realizzazione di una versione open-source del codice di ottimizzazione sviluppato nel progetto.

Ultimo aggiornamento: 25/04/2024