Consiglio Nazionale delle Ricerche

Product type: Journal article
Title: Distribution-Preserving Stratified Sampling for Learning Problems
Year of publication: 2018
Author(s): Cervellera, Cristiano; Maccio, Danilo
Author affiliations: CNR, Inst Intelligent Syst Automat, I-16149 Genoa, Italy
CNR authors and affiliations
  • English
Abstract: The need for extracting a small sample from a large amount of real data, possibly streaming, arises routinely in learning problems, e.g., for storage, to cope with computational limitations, obtain good training/test/validation sets, and select minibatches for stochastic gradient neural network training. Unless we have reasons to select the samples in an active way dictated by the specific task and/or model at hand, it is important that the distribution of the selected points is as similar as possible to the original data. This is obvious for unsupervised learning problems, where the goal is to gain insights on the distribution of the data, but it is also relevant for supervised problems, where the theory explains how the training set distribution influences the generalization error. In this paper, we analyze the technique of stratified sampling from the point of view of distances between probabilities. This allows us to introduce an algorithm, based on recursive binary partition of the input space, aimed at obtaining samples that are distributed as much as possible as the original data. A theoretical analysis is proposed, proving the (greedy) optimality of the procedure together with explicit error bounds. An adaptive version of the algorithm is also introduced to cope with streaming data. Simulation tests on various data sets and different learning tasks are also provided.
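The abstract describes the algorithm only at a high level: a recursive binary partition of the input space, with one goal being a sample distributed like the original data. The Python sketch below is a rough illustration of that general idea and should not be read as the authors' procedure. Everything specific in it is an assumption made for illustration: the function name `stratified_sample`, the choice to split along the coordinate with the largest range, the proportional split point, and the rule of drawing one random point per final cell. It does not reproduce the F-discrepancy analysis, the optimality result, or the adaptive streaming version mentioned in the abstract.

```python
import numpy as np


def stratified_sample(points, n_samples, rng=None):
    """Draw n_samples rows from `points` (shape (n, d)) via recursive binary splits.

    Hypothetical illustration only, not the algorithm from the paper.
    """
    rng = np.random.default_rng(rng)
    points = np.asarray(points, dtype=float)
    if points.ndim != 2:
        raise ValueError("points must be a 2-D array of shape (n, d)")
    if not 1 <= n_samples <= len(points):
        raise ValueError("n_samples must be between 1 and the number of points")

    chosen = []

    def recurse(idx, k):
        # idx: indices of the points in the current cell; k: samples owed by it.
        if k == 1:
            chosen.append(rng.choice(idx))  # one random representative per cell
            return
        cell = points[idx]
        # Assumed rule: split along the coordinate with the largest range.
        axis = np.argmax(cell.max(axis=0) - cell.min(axis=0))
        order = idx[np.argsort(cell[:, axis], kind="stable")]
        k_left = k // 2
        # Give each half a share of points proportional to its share of samples,
        # clamped so both halves hold at least as many points as samples owed.
        n_left = round(len(idx) * k_left / k)
        n_left = min(max(n_left, k_left), len(idx) - (k - k_left))
        recurse(order[:n_left], k_left)
        recurse(order[n_left:], k - k_left)

    recurse(np.arange(len(points)), n_samples)
    return points[np.array(chosen)]


if __name__ == "__main__":
    data = np.random.default_rng(0).normal(size=(1000, 2))
    sample = stratified_sample(data, 50, rng=1)
    print(sample.shape)
```

Because every final cell holds roughly the same fraction of the data and contributes exactly one point, dense regions of the input space receive proportionally more samples, which is the intuition behind distribution-preserving stratification.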
Abstract language: English
Other abstract: -
Other abstract language: -
First page: 2886
Last page: 2895
Total pages: 10
Journal: IEEE Transactions on Neural Networks and Learning Systems
Active since 2012
Publisher: Institute of Electrical and Electronics Engineers, New York, NY, USA
Country of publication: United States of America
Language: English
ISSN: 2162-237X
Key title: IEEE Transactions on Neural Networks and Learning Systems
Journal volume number: 29
Journal issue: 7
Refereed: -
Publication status: Published version
Indexing (in controlled databases)
  • ISI Web of Science (WOS) (Code: 000436420400018)
  • Scopus (Code: 2-s2.0-85020701015)
Keywords: Adaptive sampling, binary recursive partition, F-discrepancy, stratified sampling
Link (URL, URI)
Parallel title: -
Embargo expiry: -
Acceptance date: -
Notes/Other information: -
CNR structures
  • INM — Istituto di iNgegneria del Mare
CNR modules/activities/subprojects: -
European projects: -
Distribution-Preserving Stratified Sampling for Learning Problems (private document)
Description: VoR Version of Record - final published version
Document type: application/pdf