HILDA '25: Proceedings of the Workshop on Human-In-the-Loop Data Analytics


Challenges in Using Conversational AI for Data Science

Large Language Models (LLMs) are transforming data science, offering assistance in coding, preprocessing, analysis, and decision-making. However, data scientists face significant challenges when interacting with LLM-powered agents and implementing their suggestions effectively. To explore these challenges, we conducted a mixed-methods study comprising contextual observations, semi-structured interviews (n=14), and a survey (n=114). Our findings reveal key obstacles, including difficulties in retrieving contextual data, crafting prompts for complex tasks, adapting generated code to local environments, and refining prompts iteratively. Based on these insights, we propose actionable design recommendations, such as data brushing for improved context selection and inquisitive feedback loops to enhance communication with conversational AI assistants in data science workflows.

From Precision to Perception: User Surveys in the Evaluation of Keyword Extraction Algorithms

Stricter regulations on personal data are causing a shift towards contextual advertising, where keywords are used to predict the topical congruence between ads and their surrounding media contexts — an alignment shown to enhance advertising effectiveness. Recent advances in AI, particularly large language models, have improved keyword extraction capabilities but also introduced concerns about computational cost. This study conducts a comparative, survey-based evaluation experiment of three prominent keyword extraction approaches, emphasising user-perceived accuracy and efficiency. Based on responses from 552 participants, the embedding-based approach emerges as the preferred method. The findings underscore the importance of human-in-the-loop evaluation in real-world settings.
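To make the "embedding-based approach" mentioned above concrete, here is a minimal, KeyBERT-style sketch of embedding-based keyword extraction: candidate phrases are ranked by their semantic similarity to the whole document. The model name and candidate-generation settings are illustrative assumptions, not the configuration evaluated in the paper.

```python
# Minimal sketch of embedding-based keyword extraction (KeyBERT-style).
# Model choice and n-gram settings are assumptions for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def extract_keywords(text: str, top_k: int = 5) -> list[str]:
    # Candidate phrases: unigrams and bigrams drawn from the document itself.
    candidates = CountVectorizer(ngram_range=(1, 2), stop_words="english") \
        .fit([text]).get_feature_names_out()

    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_emb = model.encode([text])             # (1, dim)
    cand_emb = model.encode(list(candidates))  # (n_candidates, dim)

    # Rank candidates by semantic similarity to the whole document.
    scores = cosine_similarity(doc_emb, cand_emb)[0]
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [phrase for phrase, _ in ranked[:top_k]]

print(extract_keywords("Electric vehicles rely on battery technology and charging networks."))
```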

ONION: A Multi-Layered Framework for Participatory ER Design

We present ONION, a multi-layered framework for participatory Entity-Relationship (ER) modeling that integrates insights from design justice, participatory AI, and conceptual modeling. ONION introduces a five-stage methodology: Observe, Nurture, Integrate, Optimize, Normalize. It supports progressive abstraction from unstructured stakeholder input to structured ER diagrams.

Our approach aims to reduce designer bias, promote inclusive participation, and increase transparency throughout the modeling process. We evaluate ONION through real-world workshops focused on sociotechnical systems in Ukraine, highlighting how diverse stakeholder engagement leads to richer data models and deeper mutual understanding. Early results demonstrate ONION's potential to host diversity in early-stage data modeling. We conclude with lessons learned, limitations, and challenges involved in scaling and refining the framework for broader adoption.

CREDAL: Close Reading of Data Models

Data models are foundational to the creation of data and any data-driven system. Every algorithm, ML model, statistical model, and database depends on a data model to function. As such, data models are rich sites for examining the material, social, and political conditions shaping technical systems. Inspired by literary criticism, we propose close readings of data models—treating them as artifacts to be analyzed like texts. This practice highlights the materiality, genealogy, techne, closure, and design of data systems.

While literary theory teaches that no single reading is "correct," systematic guidance is vital—especially for those in computing and data science, where sociopolitical dimensions are often overlooked. To address this gap, we introduce the CREDAL methodology for close readings of data models. We describe its iterative development and share results from a qualitative evaluation, demonstrating its usability and value for critical data studies.

Hierarchical Table Semantics for Exploratory Table Discovery

Exploratory table discovery in open data portals presents significant challenges due to unreliable metadata and ambiguous table semantics. Users typically lack prior knowledge of available datasets, making it difficult to identify relevant tables through traditional keyword search or value matching approaches, which often fail to capture semantic relevance across heterogeneous table representations.

We propose a new approach that automatically constructs hierarchical semantic representations of tables, encompassing specific column semantic types, shared concepts across column groups, and general table-level semantics. By leveraging these semantically rich representations, our method retrieves relevant tables through semantic alignment rather than traditional value or metadata matching, leading to improved accuracy and recall for table discovery queries.

To enhance interpretability and support human-in-the-loop exploration, our system presents users with semantically relevant tables alongside explanations of their relevance. Evaluation on real-world open data and a question-answering benchmark demonstrates the effectiveness of our approach, achieving up to 36% improvement in Recall@10 compared to embedding-based baselines, confirming the utility of hierarchical semantic representations for table discovery.
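As a rough illustration of hierarchical table representations, the sketch below scores a table against a query by embedding column-, group-, and table-level descriptions and taking the best-matching level. The labeling scheme, example data, and embedding model are assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch: layered (column / group / table) text representations
# scored against a query. Data and model are assumptions for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def table_representations(name, columns, groups):
    # columns: {"col": "semantic type"}; groups: {"group concept": [cols]}
    reps = [f"table: {name}"]
    reps += [f"column {c}: {t}" for c, t in columns.items()]
    reps += [f"concept {g}: {', '.join(cols)}" for g, cols in groups.items()]
    return reps

def score(query, reps):
    q = model.encode([query])[0]
    r = model.encode(reps)
    sims = r @ q / (np.linalg.norm(r, axis=1) * np.linalg.norm(q))
    return float(sims.max())  # best-matching level drives relevance

reps = table_representations(
    "air_quality_2023",
    {"pm25": "particulate matter concentration", "borough": "NYC district"},
    {"pollution measurements": ["pm25", "no2"]},
)
print(score("fine particle pollution by neighborhood", reps))
```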

Humans, Machine Learning, and Language Models in Union: A Cognitive Study on Table Unionability

Data discovery, and table unionability in particular, have become key tasks in modern data science. However, the human perspective on these tasks is still under-explored. This research therefore investigates human behavior in determining table unionability within data discovery. We designed an experimental survey and conducted a comprehensive analysis assessing human decision-making for table unionability. We use the observations from this analysis to develop a machine learning framework that boosts the (raw) performance of humans. Furthermore, we perform a preliminary study comparing LLM performance to that of humans, indicating that it is typically better to consider a combination of both. We believe this work lays the foundations for developing future Human-in-the-Loop systems for efficient data discovery.
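For readers unfamiliar with the task, the following is a deliberately simple, assumed baseline for table unionability: columns are greedily matched by name similarity and value overlap, and the per-column matches are averaged. It illustrates the task only; it is not the survey design or the ML framework described in the paper.

```python
# Toy unionability score: greedy column matching by name and value overlap.
from difflib import SequenceMatcher

def column_affinity(name_a, vals_a, name_b, vals_b):
    name_sim = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    a, b = set(map(str, vals_a)), set(map(str, vals_b))
    value_sim = len(a & b) / len(a | b) if a | b else 0.0
    return 0.5 * name_sim + 0.5 * value_sim

def unionability(table_a, table_b):
    # tables: {column_name: [values]}; average the best match per column of A.
    scores = []
    for name_a, vals_a in table_a.items():
        best = max(column_affinity(name_a, vals_a, n_b, v_b)
                   for n_b, v_b in table_b.items())
        scores.append(best)
    return sum(scores) / len(scores)

t1 = {"country": ["France", "Spain"], "gdp": [2.9, 1.4]}
t2 = {"nation": ["Spain", "Italy"], "gdp_usd_tn": [1.4, 2.1]}
print(round(unionability(t1, t2), 3))
```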

Interactive Coreset Selection for Tabular Data: Fairness-Aware, Explainable, and User-Guided

We present a human-centric extension of CoreTab, a novel coreset algorithm for tabular data recently accepted to VLDB 2025. A coreset is a compact, representative subset of a dataset that approximates full-data training performance while greatly reducing computational cost. While CoreTab already achieves state-of-the-art accuracy and efficiency, this demonstration focuses on its interactive components that let users audit, adjust, and guide data selection—addressing real-world concerns like fairness and representativeness. CoreTab is the first coreset method with built-in explainability, offering a decision-tree-based view of which data regions were included or excluded. It also introduces the first human-in-the-loop interface for coreset refinement. We highlight design insights and open challenges for building transparent and responsible data sampling workflows.
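The coreset idea itself can be demonstrated in a few lines: train on a small, representative subset and compare against full-data training. The KMeans-based selection below is an assumed stand-in for illustration, not the CoreTab algorithm.

```python
# Sketch of the coreset idea: compare training on a representative subset
# (here: points nearest to KMeans centers) against full-data training.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

km = KMeans(n_clusters=200, n_init=4, random_state=0).fit(X_tr)
dists = km.transform(X_tr)                 # distance of each point to each center
idx = np.unique(dists.argmin(axis=0))      # nearest training point per center

for name, (Xs, ys) in {"full": (X_tr, y_tr),
                       "coreset": (X_tr[idx], y_tr[idx])}.items():
    clf = RandomForestClassifier(random_state=0).fit(Xs, ys)
    print(name, round(accuracy_score(y_te, clf.predict(X_te)), 3))
```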

Bootstrapping Compositional Video Query Synthesis with Natural Language and Previous Queries from Users

With the emerging ubiquity of video data across diverse applications, the accessibility of video analytics is essential. Toward this goal, some state-of-the-art systems synthesize declarative queries over video databases using example video fragments provided by the user. However, finding examples of what a user is looking for can still be tedious. This work presents POLY-VOCAL, a new system that eases this burden. POLY-VOCAL uses multiple forms of user input to bootstrap the synthesis of a new query, including textual descriptions of the user's search and previously synthesized queries. Our empirical evaluation demonstrates that POLY-VOCAL significantly improves accuracy and accelerates query convergence compared with query synthesis from only user-labeled examples, while lowering the effort required from users.
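A highly simplified illustration of the bootstrapping idea follows: a new query is seeded from previously synthesized queries ranked by similarity to the user's textual description. The query strings and the similarity measure are invented placeholders, not POLY-VOCAL's actual query language or synthesis procedure.

```python
# Toy bootstrapping: reuse the prior query whose description best matches
# the new natural-language request. Queries here are hypothetical strings.
from difflib import SequenceMatcher

previous_queries = {
    "car turns left then a pedestrian crosses":
        "SEQ(object='car', action='turn_left'; object='person', action='cross')",
    "two bicycles travel side by side":
        "CONJ(object='bicycle', count=2, relation='parallel')",
}

def seed_query(description: str) -> str:
    best_desc = max(previous_queries,
                    key=lambda d: SequenceMatcher(None, d, description).ratio())
    return previous_queries[best_desc]

print(seed_query("a car turning left before someone walks across"))
```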

Emojis in Autocompletion: Enhancing Video Search with Visual Cues

Effective video search is increasingly challenging due to the inherent complexity and richness of video content, which traditional full-text query systems and text-based autocompletion methods struggle to capture. In this work, we propose an innovative autocompletion system that integrates visual cues, specifically representative emojis, into the query formulation process to enhance video search efficiency. Our approach leverages cutting-edge Vision-Language Models (VLMs) to generate detailed scene descriptions from videos and employs Large Language Models (LLMs) to distill these descriptions into succinct, segmented search phrases augmented with context-specific emojis. A controlled user study, conducted with 11 university students using the MSVD dataset, demonstrates that the emoji-enhanced autocompletion reduces the average query completion time by 2.27 seconds (a 14.6% decrease) compared to traditional text-based methods, while qualitative feedback indicates mixed but generally positive user perceptions. These results highlight the potential of combining linguistic and visual modalities to redefine interactive video search experiences.
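The two-stage suggestion pipeline described above (VLM caption, then LLM distillation into emoji-tagged phrases) could look roughly like the sketch below. The `caption_video` stub, the prompt wording, and the model name are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the VLM -> LLM suggestion pipeline. caption_video is a
# hypothetical stand-in for the Vision-Language Model step.
from openai import OpenAI

client = OpenAI()

def caption_video(video_path: str) -> str:
    # Placeholder for a VLM producing a detailed scene description.
    return "A dog jumps into a swimming pool while children cheer."

def emoji_suggestions(video_path: str) -> str:
    caption = caption_video(video_path)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   "Turn this scene description into 3 short search phrases, "
                   f"each ending with one fitting emoji:\n{caption}"}],
    )
    return response.choices[0].message.content

print(emoji_suggestions("clip_0042.mp4"))
```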

Utilizing Past User Feedback for More Accurate Text-to-SQL

In the classical problem formulation of Text-to-SQL in academia, each question is translated independently of the others into SQL. This differs from the setting in practice, where questions enter the Text-to-SQL system in sequence. Thus, for all but the first few questions, a translation history is available that contains past questions and how they were translated by the system. So far, it has not been sufficiently explored how Text-to-SQL systems can make use of the translation history to improve future translations. Another crucial difference from the academic setting is that in practice, users generally have a conversation with the Text-to-SQL system. More concretely, if the initial translation contains a mistake, the user can follow up with feedback messages to allow the system to fix the mistake. In this case, it might be helpful to remember these user feedback messages to avoid repeating past mistakes. Thus, in this paper, we explore how a history of such past conversations between users and the Text-to-SQL system can be used to make future Text-to-SQL translations more accurate. We explore several approaches for extracting relevant experiences and insights from this conversation history and show in an evaluation that utilizing them can improve translation accuracy by up to 14.9%.
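A minimal sketch of the idea of exploiting past conversations is shown below: the most similar past interactions (question, SQL, user feedback) are retrieved and prepended to the new translation request. The retrieval heuristic and prompt format are assumptions for illustration, not the paper's exact method.

```python
# Sketch: prompt assembly with retrieved (question, SQL, feedback) history.
from difflib import SequenceMatcher

history = [
    {"question": "How many orders were placed in 2023?",
     "sql": "SELECT COUNT(*) FROM orders WHERE year = 2023;",
     "feedback": "Use the order_date column, not a year column."},
]

def build_prompt(question: str, k: int = 1) -> str:
    ranked = sorted(history, key=lambda h: SequenceMatcher(
        None, h["question"], question).ratio(), reverse=True)
    examples = "\n".join(
        f"Q: {h['question']}\nSQL: {h['sql']}\nUser feedback: {h['feedback']}"
        for h in ranked[:k])
    return (f"Past interactions:\n{examples}\n\n"
            f"Translate to SQL, avoiding past mistakes:\nQ: {question}\nSQL:")

print(build_prompt("How many orders were placed in 2024?"))
```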

Explanations for Machine Learning Pipelines under Data Drift

Ensuring the robustness of data preprocessing pipelines is essential for maintaining the reliability of machine learning model performance in the face of real-world data shifts. Traditional methods optimize preprocessing sequences for specific datasets but often overlook their vulnerability to future data variations. This research introduces a vulnerability score to quantify the susceptibility of preprocessing components to data shift. We propose a linear regression approach to establish a predictive relationship between the vulnerability of the pipeline components and changes in the model's performance. These relationships act as explanations for system practitioners and help them quantify the robustness of the pipeline to data shift. For a given pipeline, we generate an explanation that highlights a tolerable threshold beyond which a component is considered shift-vulnerable and is likely to contribute to performance degradation. For the shift-vulnerable scenarios, we further suggest a new pipeline for system maintainers that preserves the model performance without retraining. The proposed framework delivers a risk-aware assessment, empowering practitioners to anticipate potential performance changes and adapt their pipeline strategies accordingly. Experimental results on several real-world datasets generate valid explanations for pipeline robustness and demonstrate the opportunities in this field of research.
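To make the explanation mechanism concrete, the sketch below regresses an observed accuracy drop on a component's vulnerability score and reads off the score beyond which the predicted degradation exceeds a tolerance. The data points and tolerance are synthetic assumptions, not results from the paper.

```python
# Sketch: linear relationship between vulnerability score and accuracy drop,
# then a tolerable threshold derived from it. Data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

# Vulnerability scores of an imputation step under simulated shifts,
# paired with the measured drop in downstream model accuracy.
vulnerability = np.array([[0.05], [0.2], [0.35], [0.5], [0.7], [0.9]])
accuracy_drop = np.array([0.00, 0.01, 0.03, 0.06, 0.10, 0.15])

reg = LinearRegression().fit(vulnerability, accuracy_drop)

tolerance = 0.05  # maximum acceptable accuracy drop (assumed)
threshold = (tolerance - reg.intercept_) / reg.coef_[0]
print(f"Component becomes shift-vulnerable beyond score {threshold:.2f}")
```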

From Similarities to Insights: Approaching Time Series Integration from a User Perspective

Cyber-physical systems such as buildings and power plants are increasingly monitored using large numbers of sensors, resulting in massive and heterogeneous time-series datasets. High-quality metadata, particularly measurement type and functional location, is essential to extract value from this data. However, such metadata is often incomplete or missing. While recent research addresses the issue of recovering functional location from raw time-series data, it focuses on discovering pairwise relationships and provides little guidance for end-users on how to apply these methods.

From the user's perspective, we identify three open challenges in the current research on functional location inference: selecting the appropriate relationship discovery algorithm, minimizing computational effort, and interpreting the results to assign locations. We examine each challenge in detail and explore potential solutions. As a first step towards interpretability, we demonstrate how to visualize pairwise similarities using matrix and scatter plots to keep the user in the loop. Using seven datasets and five pairwise relationship measures, we find that simulated annealing is effective for matrix reordering, while t-SNE and UMAP provide the best two-dimensional embeddings for preserving local structure.
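The visualization step discussed above can be sketched as follows: compute pairwise similarities between sensor time series and project them to 2-D so related sensors cluster together in a scatter plot. Synthetic signals and t-SNE stand in here for the real building data and the full set of measures compared in the paper.

```python
# Sketch: pairwise similarity between sensor series, embedded to 2-D with
# t-SNE for a scatter-plot overview. Signals are synthetic.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
series = np.stack(
    [np.sin(t + rng.normal(0, 0.3)) + rng.normal(0, 0.1, t.size) for _ in range(10)] +
    [np.cos(2 * t + rng.normal(0, 0.3)) + rng.normal(0, 0.1, t.size) for _ in range(10)])

similarity = np.corrcoef(series)           # pairwise similarity matrix
distance = 1 - np.abs(similarity)          # turn similarity into a distance
coords = TSNE(metric="precomputed", init="random",
              perplexity=5, random_state=0).fit_transform(distance)

plt.scatter(coords[:, 0], coords[:, 1], c=["C0"] * 10 + ["C1"] * 10)
plt.title("Sensors embedded by pairwise similarity")
plt.show()
```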

Toward Scalable Human-in-the-Loop Annotation Error Detection with Label Noise-Aware Training

Annotation errors in datasets have a substantial impact on the performance of AI systems, both during training and evaluation. Although accurate labeling is crucial, human-in-the-loop methods for error detection face limitations due to scalability challenges and susceptibility to fatigue. In this work, we investigate Annotation Error Detection (AED) techniques with an emphasis on the effects of noise-robust training strategies. We explore how label noise affects AED performance and examine the effectiveness of robust regularization and robust loss-based training in mitigating the negative effects of noisy labels. Our results show that standard models trained with noisy data experience a significant performance drop; however, simple techniques from robust regularization improve AED performance by 20% to 45%. Our study underscores the importance of integrating noise-robust methods into existing AED systems for improving overall dataset quality.
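One example of a noise-robust loss in this family is the generalized cross-entropy (GCE) loss of Zhang and Sabuncu (2018), which interpolates between cross-entropy and mean absolute error. Whether this particular loss is the one used in the paper is an assumption; the sketch below only illustrates the training idea.

```python
# Sketch of the generalized cross-entropy (GCE) robust loss:
# loss = (1 - p_y^q) / q, where p_y is the predicted probability of the label.
import torch
import torch.nn.functional as F

def gce_loss(logits: torch.Tensor, targets: torch.Tensor, q: float = 0.7):
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # prob of true label
    return ((1.0 - p_true.clamp_min(1e-7) ** q) / q).mean()

logits = torch.randn(8, 3, requires_grad=True)
targets = torch.randint(0, 3, (8,))
loss = gce_loss(logits, targets)
loss.backward()
print(float(loss))
```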

Responsive Retrieval of Consistent States in Pipelined Executions of Dataflows

Many modern analytics data workflows (i.e., dataflows) run on distributed data-processing systems to support real-time tasks such as fraud detection and product recommendation. During the execution of a workflow, users often want to inspect two types of states: the internal state of operators and in-flight data tuples between operators (i.e., on the edges) to understand their runtime behaviors for analysis and debugging purposes. While existing methods designed for fault-tolerance can support state retrieval, their response can be slow since they rely on the propagation of a special message (a.k.a. a "barrier" or "marker") on the edges. In this paper, we study how to retrieve a consistent state during a pipelined execution of a workflow with low latency. We focus on two scenarios: retrieving only operator states, and retrieving both operator and edge states. For each case, we leverage its unique properties to develop a novel retrieval method that does not require barrier propagation on the edges. We also compare these two methods with three existing checkpointing-based methods. We conducted experiments using real datasets and workflows to compare these methods and show that the two novel methods achieve lower latency for state retrieval.
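For readers unfamiliar with the barrier-based baseline the paper improves on, the toy simulation below propagates a marker through a two-operator pipeline; each operator snapshots its state when the marker arrives, so both snapshots reflect the same prefix of the input. This is a didactic illustration of marker propagation, not the paper's low-latency retrieval methods.

```python
# Toy marker-based snapshot in a linear two-operator pipeline.
from collections import deque

MARKER = object()

class Operator:
    def __init__(self, name):
        self.name, self.count, self.snapshot = name, 0, None
        self.outbox = deque()

    def process(self, item):
        if item is MARKER:
            self.snapshot = self.count   # capture state when the marker arrives
        else:
            self.count += 1              # stand-in for real operator state
        self.outbox.append(item)         # forward tuples and the marker downstream

source_items = ["t1", "t2", MARKER, "t3"]
op_a, op_b = Operator("A"), Operator("B")

for item in source_items:
    op_a.process(item)
while op_a.outbox:
    op_b.process(op_a.outbox.popleft())

print(op_a.snapshot, op_b.snapshot)  # both saw exactly 2 tuples -> consistent
```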