Consiglio Nazionale delle Ricerche

Product type: Presentation
Title: Multilingual Word-by-word alignment. Methodology and some preliminary outcomes towards the construction of multilingual Lexicon within the "Traduzione del Talmud Babilonese" project
Year of publication: 2019
Author(s): Angelo Mario Del Grosso
Author affiliations: Istituto di Linguistica Computazionale "A. Zampolli", ILC-CNR
CNR authors and affiliations
  • English
Abstract: Textual scholars have long exploited multilingual resources in their daily work to better understand the primary sources they investigate. Bitexts are parallel texts that prove useful in a number of cross-linguistic and comparative processing tasks. This talk will show the workflow adopted within the research activities conducted on the Italian translation of the Babylonian Talmud. More specifically, I will illustrate the ongoing work towards the construction of a multilingual Hebrew/Aramaic/Italian terminological resource by means of stochastic generative approaches to word-by-word text alignment. The related literature discusses plenty of techniques on this topic. The alignment tool I developed is grounded on generative models (i.e., IBM and HMM models), a collection of unsupervised machine learning algorithms, to calculate the probability of linking two words in a multilingual term pair. From a technical standpoint, besides the adopted models, which are based on an alignment function and on an unsupervised training procedure devoted to estimating the unknown probability distributions, other machine learning approaches to word alignment exist. These encompass discriminative techniques, which are based on a target function and on a supervised learning process exploiting a labeled training data set. The implemented models have been widely adopted in the literary domain, as they can profitably handle interpretative bitexts, also modeling deletion, insertion and transposition phenomena, without requiring an existing labeled data set. The workflow I will present encompasses four distinct phases:
1) The encoding of the parallel text, which has been carried out according to the latest TEI recommendations. In particular, the linking-target approach described within Module 16 of the guidelines was used.
2) The semi-automatic extraction of the Italian terms, which has been carried out by means of linguistic analysis technologies available at the Institute of Computational Linguistics (ILC-CNR). These tools include a stochastic component for terminology extraction.
3) The addition of Hebrew/Aramaic terms to the extracted Italian ones via word-by-word alignment, to automatically process the three main ancient languages appearing in the Talmud, namely Mishnaic Hebrew, Biblical Hebrew and Babylonian Aramaic.
4) Finally, the revision of the obtained results through an ad hoc web-based application. This final step is devoted to building a ground truth and/or a gold training set, allowing us to perform a complete validation of the alignment outcomes.
For the time being, 219,000 tokens have been analyzed, extracted from four tractates of the Babylonian Talmud that have been translated so far.
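To illustrate the kind of unsupervised generative alignment the abstract refers to, here is a minimal sketch of IBM Model 1 trained with expectation-maximization. It is not the tool presented in the talk: the function names, the toy German/English corpus (a stand-in, not project data), and the simplifications (no NULL word, no HMM component, no smoothing) are all assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm_model1(bitext, iterations=20):
    """Estimate lexical translation probabilities t(target | source) with EM.

    bitext: list of (source_tokens, target_tokens) sentence pairs.
    Simplified sketch: the NULL source word is omitted for brevity.
    Returns a dict-like object keyed by (target_word, source_word).
    """
    target_vocab = {w for _, tgt in bitext for w in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)  # expected link counts
        total = defaultdict(float)  # expected counts per source word
        # E-step: distribute each target word over the source words
        for src, tgt in bitext:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalise expected counts into probabilities
        t = defaultdict(float)
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


def align(src, tgt, t):
    """Link each target word to its most probable source word."""
    links = []
    for tw in tgt:
        best = max(src, key=lambda sw: t[(tw, sw)])
        links.append((best, tw))
    return links


# Hypothetical toy corpus, used only to exercise the code.
toy_bitext = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]

if __name__ == "__main__":
    model = train_ibm_model1(toy_bitext)
    print(align(["das", "Haus"], ["the", "house"], model))
```

No labeled links are needed: the probabilities emerge from co-occurrence alone, which is why models of this family suit settings where no gold alignment exists yet. A manually revised gold set, such as the one produced in phase 4, could then be used to evaluate the proposed links.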
Abstract language: English
Other abstract: -
Other abstract language: -
Pages from: -
Pages to: -
Total pages: -
Journal volume number: -
Volume title: -
Series volume number: -
Volume editor(s): -
Peer-reviewed: Yes, international
Publication status: Published version
Indexing (in controlled databases): -
Keywords: bilingual word alignment, translation
Link (URL, URI)
Conference title: Machine learning, données textuelles et recherche en sciences humaines et sociales
Conference venue: ENS de Lyon
Conference date(s): 25/11/2019 - 26/11/2019
Talk: Invited
Parallel title: -
Embargo expiry: -
Notes/Other information: -
CNR structures
  • ILC — Istituto di linguistica computazionale "Antonio Zampolli"
CNR modules/activities/subprojects
  • DUS.AD006.009.001 : Progetto Talmud
European projects: -
Multilingual Word-by-word alignment (private document)
Description: Multilingual Word-by-word alignment slides
Document type: application/pdf