Automatic classification of documents in cold-start

Document classification is key to ensuring the quality of any digital library. However, classifying documents is a very time-consuming task. In addition, few or none of the documents in a newly created repository are classified. The lack of classification not only prevents users from finding information but also hinders the system’s ability to recommend relevant items. Moreover, the lack of classified documents prevents any kind of machine learning algorithm from automatically annotating these items. In this work, we propose a novel approach to automatically classifying documents that differs from previous work in that it exploits the wisdom of the crowds available on the Web. Our proposed strategy adapts an automatic tagging approach combined with a straightforward matching algorithm to classify documents according to a given domain classification. To validate our findings, we compared our methods against existing ones and performed a user evaluation with 61 participants to estimate the quality of the classifications. Results show that, in 72% of the cases, the automatic classification is relevant and well accepted by participants. In conclusion, automatic classification can facilitate access to relevant documents.

Authors:  Ricardo Kawase, Marco Fisichella, Bernardo Pereira Nunes, Kyung-Hun Ha and Markus Bick

PDF: kawase-wims2013

Boosting Retrieval of Digital Spoken Content

Every day, the Internet expands as millions of new multimedia objects are uploaded in the form of audio, video and images. While traditional text-based content is indexed by search engines, this indexing cannot be applied to audio and video objects, resulting in a plethora of multimedia content that is inaccessible to a majority of online users. To address this issue, we introduce a technique for the automatic generation of semantically enhanced descriptions of multimedia content. The objective is to facilitate indexing and retrieval of the objects with the help of traditional search engines. Essentially, the technique automatically generates static Web pages that describe the content of the digital audio and video objects. These descriptions are organized in such a way as to facilitate locating the corresponding audio and video segments. The technique employs a combination of Web services and also provides translation and semantic enhancement of the descriptions. Thorough analysis of the click-data, comparing accesses to the digital content before and after automatic description generation, suggests a significant increase in the number of retrieved items. This outcome, however, is not limited to visibility: by supporting multilingual access, the technique also lowers language barriers.

Venue: KES (Selected Papers) 2012

Authors:  Bernardo Pereira Nunes, Alexander Mera, Marco A. Casanova and Ricardo Kawase

PDF: nunes-kes(selected)2012

Automatically generating multilingual, semantically enhanced, descriptions of digital audio and video objects on the Web

Every day, millions of new images, videos and audio files are uploaded to the Web. However, unlike text-based content, audio and video objects cannot be indexed by search engines. Thus, much valuable multimedia content stays unreachable for a great majority of online users. To overcome this problem, we introduce a technique that automatically generates semantically enhanced descriptions of audio and video objects. The goal is to facilitate indexing and retrieval of the objects with the help of traditional search engines. Basically, the technique automatically generates static Web pages that describe the content of the digital audio and video objects, organized in such a way as to facilitate locating segments of the audio or video that correspond to the descriptions. The technique is a mashup of Web services that also provides translation of the descriptions and semantic enhancement. We thoroughly analyzed the click-data, comparing accesses to the digital content before and after the automatic generation of the descriptions. The outcomes suggest that the technique significantly improves the retrieval of items, not only in terms of visibility but also by bringing down language barriers through support for multilingual access.

Venue: KES2012

Authors:  Bernardo Pereira Nunes, Alexander Mera, Marco A. Casanova and Ricardo Kawase

PDF: nunes-kes2012