ISST-CoNLL/TANL Corpora | Consiglio Nazionale delle Ricerche

Database

Institute

Istituto di linguistica computazionale "Antonio Zampolli" (ILC)

Contact

Simonetta Montemagni
E-mail: simonetta.montemagni@ilc.cnr.it

Description

ISST-CoNLL and ISST-TANL are two versions of a dependency-annotated corpus derived from the Italian Syntactic-Semantic Treebank (ISST), a multi-layered annotated corpus of Italian that was one of the main outcomes of the Italian national project SI-TAL (1999-2001). ISST underwent several revisions aimed at making it compliant with the de facto standard CoNLL representation format. The resulting dependency treebanks were used in international parsing evaluation campaigns (CoNLL 2007 and EVALITA 2009). The ISST-CoNLL corpus, developed through a cooperation between the Istituto di Linguistica Computazionale (ILC) of the National Research Council (CNR) and the University of Pisa (UniPi), was built specifically for the CoNLL 2007 shared task on Multilingual Dependency Parsing. ISST-CoNLL is a subset of the balanced ISST corpus comprising 79,654 word tokens (4,162 sentences in total), corresponding to the Corriere della Sera and periodicals partitions of ISST. It was obtained through semi-automatic conversion of the morpho-syntactic and dependency annotation levels of ISST into the CoNLL-2007 format. The ISST-TANL corpus originated as a revision of the ISST-CoNLL corpus, mainly with regard to the annotation tagset and guidelines. This corpus was used for training and testing in the shared task "Domain Adaptation for Dependency Parsing" of EVALITA 2011.

Access

freely downloadable

Data type

text enriched with morpho-syntactic and dependency annotation

Database type

annotated corpus in CoNLL format
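
As a rough illustration of what working with a corpus in this format might look like, the sketch below reads the ten-column, tab-separated layout used by the CoNLL-X/2007 shared tasks (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), with a blank line between sentences. The function name `read_conll`, the `Token` type and the sample sentences are assumptions made for this example: the sample is invented, and its tags and relation labels are placeholders rather than the actual ISST-CoNLL or ISST-TANL tagset. It also assumes plain integer token IDs and heads.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One token line; only the first eight CoNLL-X/2007 columns are kept."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int      # ID of the syntactic head, 0 for the root
    deprel: str


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences (lists of tokens) from CoNLL-X/2007 formatted lines."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], int(cols[6]), cols[7]))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Invented sample, NOT taken from the corpus; tags and labels are placeholders.
SAMPLE = (
    "1\tIl\til\tR\tRD\t_\t2\tdet\t_\t_\n"
    "2\tgatto\tgatto\tS\tS\t_\t3\tsubj\t_\t_\n"
    "3\tdorme\tdormire\tV\tV\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tPiove\tpiovere\tV\tV\t_\t0\tROOT\t_\t_\n"
)

sentences = list(read_conll(SAMPLE.splitlines()))
token_count = sum(len(s) for s in sentences)
print(len(sentences), "sentences,", token_count, "tokens")
```

To read a downloaded file instead of the sample string, an open file handle can be passed to `read_conll` directly, since it accepts any iterable of lines; the file name and encoding depend on the distribution and are not specified here.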