Istituto di linguistica computazionale "Antonio Zampolli"


Conference proceedings contribution

Type: Conference proceedings contribution

Title: Automatic Creation of Quality Multi-Word Lexica from Noisy Text Data

Year of publication: 2012

Format: Electronic

Authors: Francesca Frontini, Valeria Quochi, Francesco Rubino

Author affiliations: CNR-ILC, Pisa

CNR authors:

  • FRANCESCA FRONTINI
  • VALERIA QUOCHI
  • FRANCESCO RUBINO

Language: English

Abstract: This paper describes the design of a tool for the automatic creation of multi-word lexica that is deployed as a web service and runs on automatically web-crawled data within the framework of the PANACEA platform. The main purpose of our task is to provide a computationally "light" tool that creates a complete, high-quality lexical resource of multi-word items. Within the platform, this tool is typically inserted in a workflow whose first step is automatic web crawling; the input data of our lexical extractor is therefore intrinsically noisy. The paper evaluates the capacity of the tool to deal with noisy data, and in particular with texts containing a significant amount of duplicated paragraphs. The accuracy of the extraction of multi-word expressions from the original crawled corpus is compared to the accuracy of the extraction from a later "de-duplicated" version of the corpus. The paper shows that our method can extract multi-word expressions with sufficiently good precision even from the original, noisy crawled data. The output of our tool is a multi-word lexicon formatted and encoded in XML according to the Lexical Markup Framework.

Abstract language: English

Volume title: Proceedings of the Sixth Workshop on Analytics for Noisy Unstructured Text Data

ISBN: 978-1-4503-1919-5

Publisher: ACM, Association for Computing Machinery, New York (USA)

Peer-reviewed: Yes (international)

Keywords:

  • Lexical induction
  • multi-word extraction
  • web-based distributed platform
  • noisy data

URL: http://www.kde.cs.tut.ac.jp/~aono/pdf/COLING2012/AND/pdf/AND04.pdf

Conference name: AND 2012

Conference location: Mumbai, India

Conference date: December 9, 2012

Conference scope: International

Conference presentation type: Contribution

Other information: ID_PUMA: /cnr.ilc/2012-A3-008

CNR structures:

Modules:

Attachments: Paper (application/pdf)
