Our work on using large language models to automatically create data integration queries will appear in VLDB - Databases and Large Language Models Workshop 2023.
We will present our preliminary results on learning to join large tables efficiently at SIGMOD-aiDM Workshop 2020. In our systems, the scan operators learn efficient join strategies.
We will present our work on effective and efficient learning over large and noisy data without any cleaning or preprocessing at SIGMOD 2020. Our system enables users to learn over many datasets that could not be used before because cleaning them took a prohibitively long time.
We present our results on significantly improving the effectiveness of answering inexact queries, e.g., keyword queries, over large databases at SSDBM 2019. The larger a database is, the more non-relevant answers the database system returns, because the database contains many non-relevant answers for a query. A larger database, however, contains answers to more queries. We find subsets of the database that are sufficiently small so that the database system returns mostly relevant answers. We also develop techniques to send the query to a sufficiently large subset of the database that contains its answers.
A couple of new manuscripts:
In the first one, we show how to learn accurate models directly over heterogeneous and dirty data without cleaning them;
In the second one, we present a graph search algorithm that is robust to representational variations.
We present the fundamental ideas behind our VDBMS system, which usably manages large-scale variable and heterogeneous data, at VLDB-Poly 2018.
We have a couple of papers in the VLDB Journal 2018: 1) Yodsawalai has the paper Cost-Effective Conceptual Design Using Taxonomies, which addresses the tradeoff between the usability and overhead of organizing data in a structured form; and 2) Jose publishes the paper Logically Scalable and Efficient Relational Learning, which extends his work on designing efficient learning algorithms that are robust against the logical representations of the data.
Jose demonstrates CastorX, a system that efficiently learns over multiple heterogeneous databases using novel sampling techniques, at VLDB 2018. He presented a summary of its fundamental ideas at SIGMOD-DEEM 2018.
Ben presents his work on helping humans and large-scale data sources to progressively and automatically develop a mutual language for effective communication via reinforcement learning at SIGMOD 2018. His paper is selected as one of the best papers of the conference.
Jose demonstrates AutoMode, a system that automatically sets the language bias for learning systems over relational data, at ICDE 2018.
People usually believe that to get effective results for vague queries, e.g., ambiguous keyword queries, data systems have to spend a lot of time and explore many potential answers in the data. We present a lightning talk on how to query large databases both effectively and efficiently using caching techniques at ICDE 2018.