Linguistic technologies and protection of minors | Consiglio Nazionale delle Ricerche

Linguistic technologies and protection of minors

The Internet continually raises the delicate question of how, and to what extent, two equally fundamental rights of citizens can be reconciled: the protection of minors from harmful content on the one hand and freedom of speech on the other. That this often becomes a genuine dilemma is shown by the recent decision of the United States Court of Appeals (June 29, 2004), in which, in order to uphold freedom of speech, the Court had to indirectly defend those who peddle sex over the Internet. The decision was motivated by the limited effectiveness of existing filtering software, which is still unable to filter Internet material in a "surgical" way. Designing and developing Internet filters that can filter without censoring remains an ambitious goal with a strong social and ethical impact.

The POESIA filtering system (Public Open-source Environment for a Safer Internet Access) is the result of the European project POESIA (IAP 2117/27572), carried out by a consortium of 10 academic and industrial partners from Italy, Spain, France and the UK. It can be seen as an important step towards intelligent filtering of Internet content to protect minors. The POESIA system combines components based on standard filtering methods, such as positive/negative URL lists and PICS, with components incorporating more advanced content-based techniques, such as image processing and NLP-enhanced text filtering.
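The combination of list-based and content-based components described above can be pictured as a simple cascade. The following is only a minimal sketch under stated assumptions, not POESIA's actual architecture: the host lists, the keyword weights, the threshold and the function names `content_score` and `decide` are all hypothetical, and the keyword scorer is a toy stand-in for a real NLP or image classifier.

```python
from urllib.parse import urlparse

# Hypothetical example lists; these are NOT taken from POESIA.
ALLOW_HOSTS = {"kids.example.org"}   # positive URL list
DENY_HOSTS = {"adult.example.com"}   # negative URL list

# Toy stand-in for a content-based classifier: weighted keyword scoring.
TERM_WEIGHTS = {"xxx": 0.6, "porn": 0.7, "adult": 0.3}


def content_score(text):
    """Return a score in [0, 1]; higher means more likely harmful."""
    tokens = text.lower().split()
    score = sum(TERM_WEIGHTS.get(token, 0.0) for token in tokens)
    return min(score, 1.0)


def decide(url, text, threshold=0.5):
    """Cheap list lookups first, content analysis only when lists are silent."""
    host = urlparse(url).hostname or ""
    if host in ALLOW_HOSTS:
        return "allow"
    if host in DENY_HOSTS:
        return "block"
    return "block" if content_score(text) >= threshold else "allow"
```

In this sketch the lists take precedence, so a page on an allowed host is never passed to the content scorer; a real system would have to decide how list verdicts and content verdicts are weighed against each other.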

The Istituto di Linguistica Computazionale (ILC) of the CNR (National Research Council) in Pisa played a twofold role: project coordinator and developer of the Italian filtering component. Starting from a well-tested platform of linguistic resources, methods, and tools for Italian language processing, ILC designed and developed the component for the analysis and classification of Italian web pages.

This component combines robust, well-tested linguistic technologies for Italian with dynamic, machine-learning-based techniques for acquiring lexical and grammatical knowledge from textual corpora. Integrating linguistic technologies with machine learning is crucial for Web-based applications, which need easily adaptable techniques to cope effectively with the protean nature of Internet content.

During the project, the filtering of textual content focused on the pornographic domain, with more limited attention given to the filtering of offensive language. The system was extensively tested against various sets of URLs, in both controlled and live environments, and consistently performed well, with 97% filtering effectiveness and 3% over-blocking.
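The text does not define the two reported figures. A minimal sketch, assuming that "filtering effectiveness" means the share of harmful pages that were blocked and "over-blocking" means the share of harmless pages that were wrongly blocked, shows how such rates could be computed from labelled test results. The function name `filtering_rates` and the input layout are hypothetical.

```python
def filtering_rates(results):
    """Compute (effectiveness, over_blocking) from labelled outcomes.

    results: iterable of (is_harmful, was_blocked) boolean pairs.
    Assumed definitions (not stated in the article):
      effectiveness = blocked harmful pages / all harmful pages
      over_blocking = blocked harmless pages / all harmless pages
    """
    harmful = harmless = blocked_harmful = blocked_harmless = 0
    for is_harmful, was_blocked in results:
        if is_harmful:
            harmful += 1
            blocked_harmful += int(was_blocked)
        else:
            harmless += 1
            blocked_harmless += int(was_blocked)
    if harmful == 0 or harmless == 0:
        raise ValueError("need at least one harmful and one harmless page")
    return blocked_harmful / harmful, blocked_harmless / harmless
```

Under these assumed definitions the two rates are computed over different populations, so they need not add up to 100%; that they do in the reported figures is not implied by the definitions.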

Because the POESIA system is freely available as open-source software, downloadable from http://sourceforge.net/projects/poesia/, it can be deployed widely. Open availability also provides a basis for developing and maintaining the system beyond the project's lifetime: desirable improvements include extending filtering to other domains, other Internet channels and other languages.