Automatic classification of documents in cold-start

Document classification is key to ensuring the quality of any digital library. However, classifying documents is a very time-consuming task, and few or none of the documents in a newly created repository are classified. Unclassified documents not only prevent users from finding information but also hinder the system’s ability to recommend relevant items. Moreover, the lack of classified documents prevents any kind of machine learning algorithm from automatically annotating these items. In this work, we propose a novel approach to automatically classifying documents that differs from previous work in that it exploits the wisdom of the crowds available on the Web. Our proposed strategy adapts an automatic tagging approach combined with a straightforward matching algorithm to classify documents according to a given domain classification. To validate our findings, we compared our methods against existing ones and performed a user evaluation with 61 participants to estimate the quality of the classifications. Results show that, in 72% of the cases, the automatic classification is relevant and well accepted by participants. In conclusion, automatic classification can facilitate access to relevant documents.

Authors: Ricardo Kawase, Marco Fisichella, Bernardo Pereira Nunes, Kyung-Hun Ha and Markus Bick

PDF: kawase-wims2013
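
The matching step mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a document already has a set of automatically generated tags and that each category in the domain classification comes with a list of keywords. The function name `classify`, the `top_k` parameter, and the sample categories are hypothetical and chosen only for illustration.

```python
def normalize(term):
    """Lowercase and trim a tag or keyword so comparisons ignore case and spacing."""
    return term.strip().lower()


def classify(tags, categories, top_k=1):
    """Rank categories by how many of their keywords appear among the tags.

    tags       -- iterable of tag strings assigned to one document
    categories -- dict mapping a category name to a list of keywords
    top_k      -- number of best-matching categories to return
    Categories with no overlapping keyword are never returned.
    """
    tag_set = {normalize(t) for t in tags}
    scores = {}
    for name, keywords in categories.items():
        overlap = tag_set & {normalize(k) for k in keywords}
        if overlap:
            scores[name] = len(overlap)
    # Sort by descending overlap count, then by name for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_k]]


# Hypothetical domain classification and tags, for illustration only.
categories = {
    "Finance": ["accounting", "investment", "banking"],
    "Information Systems": ["database", "software", "enterprise systems"],
}
tags = ["Database", "software", "banking"]

print(classify(tags, categories))
```

With these sample inputs, two of the three tags match keywords of "Information Systems" and one matches "Finance", so the former is ranked first. A real system would likely need a softer notion of matching (stemming, synonyms, or similarity scores) than the exact keyword overlap used here.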

Unsupervised Auto-tagging for Learning Object Enrichment

Ricardo Kawase presenting @ECTEL2011

An online presence is gradually becoming an essential part of every learning institute. As such, a large portion of learning material is becoming available online. Nevertheless, it is still a challenge for authors and publishers to guarantee accessibility and to support the effective retrieval and consumption of learning objects. One reason for this is that non-annotated learning objects pose a major problem with respect to their accessibility. Non-annotated objects not only prevent learners from finding new information but also hinder a system’s ability to recommend useful resources. To address this problem, commonly known as the cold-start problem, we automatically annotate specific learning resources using a state-of-the-art automatic tag annotation method: α-TaggingLDA, which is based on the Latent Dirichlet Allocation probabilistic topic model. We performed a user evaluation with 115 participants to measure the usability and effectiveness of α-TaggingLDA in a collaborative learning environment. The results show that automatically generated tags were preferred 35% more than the original authors’ annotations. Further, they were 17.7% more relevant in terms of recall for users. The implication of these results is that automatic tagging can facilitate effective information access to relevant learning objects.

Venue: ECTEL2011

Authors: Ernesto Diaz-Aviles, Marco Fisichella, Ricardo Kawase, Wolfgang Nejdl, Avaré Stewart

Award: ECTEL2011 Best Paper

PDF: diaz-ectel2011