Fabian Panse, Dan Olteanu, Birgitta König-Ries, and Maurice van Keulen (2016-06-16 12:30 - 15:30 in University of Twente, Building Ravelijn, VIP room (RA 1315))
16 June 2016, University of Twente, Building Ravelijn, VIP room (RA 1315)
|12:30||Lunch and welcome|
|13:30||Fabian Panse (Universität Hamburg, Germany)|
|14:00||Dan Olteanu (University of Oxford, UK)|
|14:30||Birgitta König-Ries (Friedrich-Schiller-Universität Jena, Germany)|
|15:00||Maurice van Keulen (University of Twente, Netherlands)|
|16:30||PhD defense of Brend Wanders|
Participation is free. Please register through the registration form to help us plan the required catering.
Presentation 1: Querying Probabilistic Databases with Certain Data Applications
by Fabian Panse (Universität Hamburg, Germany)
Probabilistic databases can be processed by four types of applications: (a) certain-answer-based applications, which resolve data uncertainty by distinguishing between certain and uncertain query answers; (b) uncertainty-analyzing applications, which resolve data uncertainty by means of aggregate functions such as min, max, or exp; (c) probabilistic data applications, which do not resolve uncertainty in query answers but pass it directly on to the query result; and (d) certain data applications, which cannot handle data uncertainty at all and therefore require a single database instance as input. Whereas the first three types have been discussed extensively in prior research, the execution of certain data applications on probabilistic databases has received little attention in the database community so far, although the vast majority of existing applications are of this type. In this talk, I will present a view-based concept for processing certain data applications on probabilistic databases, propose a cost-based approach for selecting an application-specific possible world from a probabilistic input database, consider an efficient implementation of this selection approach for several probabilistic representation systems, and discuss several challenges with respect to data manipulation and view consistency that result from using this concept.
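To make the notion of selecting a single possible world concrete, here is a minimal Python sketch. It is not the cost-based approach from the talk, whose cost model is not described in the abstract. It assumes a hypothetical tuple-independent table (each tuple is present independently with its own probability) and simply picks the most probable world by brute-force enumeration. The table contents and the function names `possible_worlds` and `most_probable_world` are illustrative assumptions.

```python
from itertools import product

# Hypothetical tuple-independent table: (name, city, probability of presence).
tuples = [
    ("alice", "Hamburg", 0.9),
    ("alice", "Bremen", 0.3),
    ("bob", "Jena", 0.6),
]


def possible_worlds(rows):
    """Yield (world, probability) for every subset of the uncertain tuples."""
    for mask in product([True, False], repeat=len(rows)):
        world = []
        prob = 1.0
        for present, (name, city, p) in zip(mask, rows):
            if present:
                world.append((name, city))
                prob *= p
            else:
                prob *= 1.0 - p
        yield world, prob


def most_probable_world(rows):
    """Assumed selection criterion: the world with the highest probability."""
    return max(possible_worlds(rows), key=lambda wp: wp[1])


world, prob = most_probable_world(tuples)
print(world, prob)
```

A certain data application would then run unchanged on `world`, as if it were an ordinary database instance. Enumeration is exponential in the number of tuples, so this only illustrates the idea and says nothing about an efficient implementation.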
Presentation 2: Cultural Learnings of Factorized Joins for Make Benefit Glorious Family of Regression Tasks
by Dan Olteanu (University of Oxford, UK)
In this talk, I will give an overview of recent work on the compilation of join queries into lossless factorized representations. The primary motivation for this compilation is to avoid redundancy in the representation of query results in relational databases. The relationship between the standard tabular representations of relations and their equivalent factorized representations is on a par with the relationship between propositional formulas in disjunctive normal form and their equivalent nested formulas obtained by algebraic factorization. For any join query, we give asymptotically tight bounds on the size of its factorized results and show that these factorized results can be exponentially more succinct than their equivalent tabular representations.
I will also discuss an application of factorized joins to learning regression models. For a range of regression models, including polynomial regression and factorization machines, their parameters can be learned in one pass over factorized joins.
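As a rough illustration of the idea, the following Python sketch factorizes the join of two hypothetical binary relations R(A, B) and S(B, C) by grouping on the join attribute B, compares the number of stored values with the flat result, and computes the aggregate SUM(A*C) directly on the factorized form. Aggregates of this kind appear in the normal equations of linear regression, but the sketch is not the algorithm presented in the talk; the relations and the helper names `factorize` and `flat_join` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical input relations R(A, B) and S(B, C).
R = [(a, b) for b in range(2) for a in range(1, 4)]
S = [(b, c) for b in range(2) for c in range(10, 14)]


def flat_join(R, S):
    """Standard tabular join result: one (a, b, c) tuple per matching pair."""
    return [(a, b, c) for a, b in R for b2, c in S if b == b2]


def factorize(R, S):
    """Factorized join: for each b, a product (A-values) x (C-values)."""
    a_by_b = defaultdict(list)
    c_by_b = defaultdict(list)
    for a, b in R:
        a_by_b[b].append(a)
    for b, c in S:
        c_by_b[b].append(c)
    return {b: (a_by_b[b], c_by_b[b]) for b in a_by_b if b in c_by_b}


flat = flat_join(R, S)
fact = factorize(R, S)

# Number of stored values in each representation.
flat_size = 3 * len(flat)
fact_size = sum(1 + len(a_vals) + len(c_vals) for a_vals, c_vals in fact.values())

# SUM(A*C): distributes over the product, so one pass over the factorization suffices.
sum_ac_fact = sum(sum(a_vals) * sum(c_vals) for a_vals, c_vals in fact.values())
sum_ac_flat = sum(a * c for a, _, c in flat)

print(flat_size, fact_size, sum_ac_fact, sum_ac_flat)
```

The flat result repeats every A-value once per matching C-value, whereas the factorized form stores each value once per B-group. The aggregate exploits the same structure through the distributive law, in the way that a nested formula avoids the repetition of its disjunctive normal form.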
Presentation 3: Lifecycle Support for Scientific Data - Examples from Biodiversity Research
by Birgitta König-Ries (Friedrich-Schiller-Universität Jena, Germany)
Data plays an increasingly important role in scientific progress. Proper management of this valuable resource along its entire lifecycle is therefore crucial. In this talk, we will look at different stages of the lifecycle, from experiment planning and data acquisition to discovery, integration, analysis, and publication, and discuss some existing and needed contributions by computer science.
Presentation 4: Managing uncertainty in data: the key to effective management of data quality problems
by Maurice van Keulen (University of Twente, Netherlands)
Business analytics and data science are significantly impaired by a wide variety of 'data handling' issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems lies in data semantics and data quality. We have developed a generic method that models such problems as uncertainty in the data, i.e., as probabilistic data. Probabilistic databases make it possible, for example, to postpone the resolution of data problems and to assess their influence on analytical results. Furthermore, we develop technology for data cleansing, web harvesting, and natural language processing that uses this method to deal with the ambiguity of natural language and many other problems encountered when using unstructured data.