Title: Automatic Creation of Quality Multi-Word Lexica from Noisy Text Data
Year of publication: 2012
Author(s): Francesca Frontini, Valeria Quochi, Francesco Rubino
Author affiliations: CNR-ILC, Pisa
Abstract: This paper describes the design of a tool for the automatic creation of multi-word lexica, which is deployed as a web service and runs on automatically web-crawled data within the framework of the PANACEA platform. The main purpose is to provide a computationally "light" tool that creates a complete, high-quality lexical resource of multi-word items. Within the platform, the tool is typically inserted in a workflow whose first step is automatic web crawling, so the input data of the lexical extractor is intrinsically noisy. The paper evaluates the capacity of the tool to deal with noisy data, in particular with texts containing a significant amount of duplicated paragraphs. The accuracy of multi-word expression extraction from the original crawled corpus is compared with the accuracy of extraction from a later "de-duplicated" version of the corpus. The paper shows that the method can extract multi-word expressions with sufficiently good precision even from the original, noisy crawled data. The output of the tool is a multi-word lexicon formatted and encoded in XML according to the Lexical Markup Framework.
Volume title: Proceedings of the Sixth Workshop on Analytics for Noisy Unstructured Text Data
  • ACM, Association for Computing Machinery, New York (United States of America)
Peer-reviewed: Yes (international)
Keywords: Lexical induction, multi-word extraction, web-based distributed platform, noisy data
Conference title: AND 2012
Conference location: Mumbai, India
Conference date(s): December 9, 2012
Notes/Other information: ID_PUMA: /cnr.ilc/2012-A3-008
  • ILC — Istituto di linguistica computazionale "Antonio Zampolli"
  • IC.P02.005.002: Infrastrutture per l'interoperabilità e l'integrazione di Risorse e Tecnologie Linguistiche (Infrastructures for the Interoperability and Integration of Language Resources and Technologies)
