Home | IDEA | Oregon State University

We investigate the principles and challenges of building usable and scalable data-centric systems. We develop systems for reasoning over and learning from data. We are part of the Data Science and Engineering and Artificial Intelligence groups at the School of EECS, and members of the Collaborative Robotics and Intelligent Systems (CoRIS) Institute and the Center for Quantitative Life Sciences at Oregon State University.

Email: termehca@oregonstate.edu
Address: 3053 Kelley Engineering Center, Corvallis, OR 97330-5501

Recent News

Our benchmark study on analyzing the shifts in users' data focus in exploratory visual analysis will be presented at the 30th Annual ACM Conference on Intelligent User Interfaces (ACM IUI) 2025.
We will present our work on consistent language models using controlled prompting and decoding at the SIGMOD 2024 Workshop on Data Management for End-to-End Machine Learning.
We will demonstrate our system ShiftScope: Adapting Visualization Recommendations to Users’ Dynamic Data Focus at SIGMOD 2024.
Our work on User Learning In Interactive Data Exploration will be presented as a lightning talk at ICDE 2024.
Our work on learning statistical models over incomplete datasets without data cleaning will be presented in SIGMOD 2024.
Our work on modeling and analyzing user behavior during exploratory visual analysis will be presented in the AAAI workshop on Collaborative AI and Modeling of Humans in 2024.
We will present our work on investigating the challenges and tradeoffs of creating consistent language models using controlled prompting and decoding in AAAI - Neuro-Symbolic Learning and Reasoning in the Era of Large Language Models Workshop 2024.
Hallucinations and inaccurate information are important obstacles in using large language models reliably. Our proposal on creating large language models that are consistent with semantic constraints will appear in VLDB - Databases and Large Language Models Workshop 2023.
Our work on using large language models to automatically create data integration queries will appear in VLDB - Databases and Large Language Models Workshop 2023.
Our paper on automatic data querying and extraction from external data sources will appear in PVLDB 2023.
We will present our work on building accurate models on incomplete datasets without any manual effort in SIGMOD-DEEM Workshop 2023.
Our work on supporting human learning for model training will appear in SIGMOD 2023.
Our work on curating and analyzing heterogeneous biomedical knowledge graphs is published at BMC Bioinformatics 2023.
We will share our preliminary results on the role of human learning in model training in SIGMOD-HILDA 2022 Workshop.
We will present our work on building usable and scalable ML systems over relational data at SIGMOD 2021.
We will share our work on developing graph similarity search that is invariant to representational changes at SIGMOD 2021.
We will present our preliminary results on learning to join large tables efficiently at SIGMOD-aiDM Workshop 2020. In our systems, the scan operators learn efficient join strategies.
We will present our work on effective and efficient learning over large and noisy data without any cleaning and preprocessing at SIGMOD 2020. Our system enables users to learn over many datasets that could not be used before due to the prohibitively time-consuming efforts to clean them.
Our paper on data interaction game received an ACM SIGMOD Research Highlight Award.
We present our results on significantly improving the effectiveness of answering inexact queries, e.g., keyword queries, over large databases at SSDBM 2019. The larger a database is, the more non-relevant answers the database system returns, because a larger database contains more non-relevant answers for a query. A larger database, however, contains answers to more queries. We find subsets of the database that are sufficiently small so the database system returns mostly relevant answers. We also develop techniques to send the query to a sufficiently large subset of the database that contains its answers.
A couple of new manuscripts:
- In the first one, we show how to learn accurate models directly over heterogeneous and dirty data without cleaning them;
- In the second one, we present a graph search algorithm that is robust to representational variations.
We present the fundamental ideas behind our VDBMS system, which usably manages large-scale variable and heterogeneous data, at VLDB-Poly 2018.
Ben discusses the bases of autonomous entity integration at VLDB-Poly 2018.
We have a couple of papers in the VLDB Journal 2018: 1) Yodsawalai has the paper Cost-Effective Conceptual Design Using Taxonomies, which addresses the tradeoff between the usability and overhead of organizing data in a structured form; and 2) Jose publishes the paper Logically Scalable and Efficient Relational Learning, which extends his work on designing efficient learning algorithms that are robust against the logical representations of the data.
Jose demonstrates CastorX, a system that efficiently learns over multiple heterogeneous databases using novel sampling techniques, at VLDB 2018. He presented a summary of its fundamental ideas at SIGMOD-DEEM 2018.
Ben presents his work on helping humans and large-scale data sources to progressively and automatically develop a mutual language for effective communication via reinforcement learning at SIGMOD 2018. His paper is selected as one of the best papers of the conference.
Jose demonstrates AutoMode, a system that automatically sets the language bias for learning systems over relational data at ICDE 2018.
People usually believe that to get effective results for vague queries, e.g., ambiguous keyword queries, data systems have to spend a lot of time and explore many potential answers in the data. We present a lightning talk on how to query large databases both effectively and efficiently using caching techniques at ICDE 2018.
