on Deep Web Entity Monitoring
Gianluca Demartini, Andrea Calì, and Pierre Senellart (2016-06-02 10:00 - 11:30 in ZI-2126)
2 June, University of Twente, Building Zilverling, Room 2126
|9:30||Coffee and welcome|
|10:00||Gianluca Demartini (University of Sheffield, UK)|
|10:30||Andrea Calì (Birkbeck, University of London, UK)|
|11:00||Pierre Senellart (Télécom ParisTech, France)|
|12:30||PhD defense of Mohammad Khelghati|
Please register at: firstname.lastname@example.org to help us plan the required catering.
by Gianluca Demartini (University of Sheffield, UK)
Human Computation is a novel approach used to obtain manual data processing at scale. In this talk we will introduce the dynamics of crowdsourcing platforms and provide examples of their use to build hybrid human-machine information systems. We will then present ZenCrowd, a hybrid system for entity linking and data integration over linked data, and show how human intelligence at scale, combined with machine-based algorithms, outperforms traditional systems. In this context, we will discuss the efficiency and effectiveness challenges of micro-task crowdsourcing platforms.
by Andrea Calì (Birkbeck, University of London, UK)
The Deep Web is constituted by data that are accessible on the web, typically through HTML forms, but are not indexable by search engines due to their static nature. Processing queries on Deep Web data poses significant challenges, as data sources normally cannot be accessed with arbitrary queries. In this talk we illustrate techniques for processing queries on the Deep Web and survey some of the core problems underlying this task. We then propose a framework for identifying relevant sources in this context.
by Pierre Senellart (Télécom ParisTech, France)
Generic web crawling approaches cannot distinguish among various page types and cannot target content-rich areas of a website. We study the problem of efficient unsupervised web crawling of content-rich webpages. We propose ACEBot (Adaptive Crawler Bot for data Extraction), a structure-driven crawler that uses the inner structure of the pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the learning phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the intensive crawling phase, ACEBot performs massive downloading following the chosen navigation patterns. Experiments over a large dataset illustrate the effectiveness of our system.
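The two-phase idea described in this abstract can be illustrated with a small sketch. It is not ACEBot's implementation: the in-memory `SITE` dictionary, the pattern labels such as `main/list`, the numeric content scores, and the averaging rule used to rank patterns are all assumptions made for illustration only.

```python
from collections import defaultdict, deque

# Hypothetical toy site: url -> (content_score, [(nav_pattern, target_url), ...]).
# A nav_pattern stands in for the structural position of a link within a page.
SITE = {
    "/": (0, [("nav/menu", "/about"), ("main/list", "/post/1"), ("main/list", "/post/2")]),
    "/about": (1, [("nav/menu", "/")]),
    "/post/1": (10, [("main/next", "/post/3"), ("nav/menu", "/about")]),
    "/post/2": (8, [("main/next", "/post/4"), ("nav/menu", "/")]),
    "/post/3": (12, [("nav/menu", "/")]),
    "/post/4": (7, [("nav/menu", "/about")]),
}


def learn_patterns(site, start, max_pages, top_k):
    """Learning phase: fetch at most max_pages pages and rank navigation patterns."""
    seen = {start}
    queue = deque([(start, None)])  # (url, pattern that led to it)
    scores = defaultdict(list)
    fetched = 0
    while queue and fetched < max_pages:
        url, via = queue.popleft()
        content, links = site[url]
        fetched += 1
        if via is not None:
            # Credit the pattern with the content of the page it led to.
            scores[via].append(content)
        for pattern, target in links:
            if target not in seen:
                seen.add(target)
                queue.append((target, pattern))
    # Assumed ranking rule: average content score of the pages a pattern reached.
    ranked = sorted(scores, key=lambda p: sum(scores[p]) / len(scores[p]), reverse=True)
    return ranked[:top_k]


def intensive_crawl(site, start, patterns):
    """Intensive phase: follow only links that match the chosen patterns."""
    allowed = set(patterns)
    seen = {start}
    queue = deque([start])
    pages = []
    while queue:
        url = queue.popleft()
        pages.append(url)
        for pattern, target in site[url][1]:
            if pattern in allowed and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


if __name__ == "__main__":
    chosen = learn_patterns(SITE, "/", max_pages=5, top_k=2)
    print(chosen)
    print(intensive_crawl(SITE, "/", chosen))
```

In this toy setting the learning phase samples five pages, and the intensive phase then skips the low-content `/about` page because it is only reachable through the low-ranked `nav/menu` pattern.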