Proceedings of the ACM on Management of Data: SIGMOD: Vol. 3, No. 3. 2025

PACMMOD V3, N3 (SIGMOD), June 2025: Editorial

The Proceedings of the ACM on Management of Data (PACMMOD) is concerned with the principles, algorithms, techniques, systems, and applications of database management systems, data management technology, and science and engineering of data. It includes articles reporting cutting-edge data management, data engineering, and data science research.

We are pleased to present the third issue of Volume 3 of PACMMOD. This issue contains papers that were submitted to the SIGMOD research track in October 2024.

Papers accepted in this issue form the final pool of presentation candidates in the research track of the ACM SIGMOD Conference on Management of Data 2025, to be held in Berlin, Germany.

A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online Marketplaces

Wildlife trafficking remains a critical global issue, significantly impacting biodiversity, ecological stability, and public health. Despite efforts to combat this illicit trade, the rise of e-commerce platforms has made it easier to sell wildlife products, putting new pressure on wild populations of endangered and threatened species. The use of these platforms also opens a new opportunity: as criminals sell wildlife products online, they leave digital traces of their activity that can provide insights into trafficking activities as well as how they can be disrupted. The challenge lies in finding these traces. Online marketplaces publish ads for a plethora of products, and identifying ads for wildlife-related products is like finding a needle in a haystack. Learning classifiers can automate ad identification, but creating them requires costly, time-consuming data labeling that hinders support for diverse ads and research questions. This paper addresses a critical challenge in the data science pipeline for wildlife trafficking analytics: generating quality labeled data for classifiers that select relevant data. While large language models (LLMs) can directly label advertisements, doing so at scale is prohibitively expensive. We propose a cost-effective strategy that leverages LLMs to generate pseudo labels for a small sample of the data and uses these labels to create specialized classification models. Our novel method automatically gathers diverse and representative samples to be labeled while minimizing the labeling costs. Our experimental evaluation shows that our classifiers achieve up to 95% F1 score, outperforming LLMs at a lower cost. We present real use cases that demonstrate the effectiveness of our approach in enabling analyses of different aspects of wildlife trafficking.
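
The cost-saving pattern described above (label a small, carefully chosen sample with an LLM, then train a cheap specialized classifier) can be pictured with the following minimal sketch. It is an illustration only: `query_llm` is a hypothetical stand-in for a real LLM call, the TF-IDF plus logistic-regression model is an assumption, and the paper's diverse, representative sampling is replaced here by simply taking the first few ads.

```python
# Minimal sketch of LLM pseudo-labeling followed by a lightweight classifier.
# `query_llm` is a hypothetical stand-in; the keyword rule only keeps the
# example runnable. The paper gathers diverse, representative samples, while
# here we simply take the first `sample_size` ads.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def query_llm(ad_text: str) -> int:
    """Pretend LLM call: 1 if the ad likely offers a wildlife product, else 0."""
    return int(any(w in ad_text.lower() for w in ("ivory", "pangolin", "rhino horn")))

def build_classifier(ads: list[str], sample_size: int = 500):
    sample = ads[:sample_size]
    pseudo_labels = [query_llm(text) for text in sample]   # LLM cost paid once
    vectorizer = TfidfVectorizer(max_features=20_000)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(sample), pseudo_labels)
    return vectorizer, clf

def classify_all(ads: list[str], vectorizer, clf):
    # The cheap model then labels the full corpus at negligible cost.
    return clf.predict(vectorizer.transform(ads))
```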

A New Paradigm in Tuning Learned Indexes: A Reinforcement Learning Enhanced Approach

Learned Index Structures (LIS) have significantly advanced data management by leveraging machine learning models to optimize data indexing. However, designing these structures often involves critical trade-offs, making it challenging for both designers and end-users to find an optimal balance tailored to specific workloads and scenarios. While some indexes offer adjustable parameters that demand intensive manual tuning, others rely on fixed configurations based on heuristic auto-tuners or expert knowledge, which may not consistently deliver optimal performance.

This paper introduces LITune, a novel framework for end-to-end automatic tuning of Learned Index Structures. LITune employs an adaptive training pipeline equipped with a tailor-made Deep Reinforcement Learning (DRL) approach to ensure stable and efficient tuning. To accommodate long-term dynamics arising from online tuning, we further enhance LITune with an on-the-fly updating mechanism termed the O2 system. These innovations allow LITune to effectively capture state transitions in online tuning scenarios and dynamically adjust to changing data distributions and workloads, marking a significant improvement over other tuning methods. Our experimental results demonstrate that LITune achieves up to a 98% reduction in runtime and a 17-fold increase in throughput compared to default parameter settings given a selected Learned Index instance. These findings highlight LITune's effectiveness and its potential to facilitate broader adoption of LIS in real-world applications.

A Structured Study of Multivariate Time-Series Distance Measures

Distance measures are fundamental to time series analysis and have been extensively studied for decades. Until now, research efforts mainly focused on univariate time series, leaving multivariate cases largely under-explored. Furthermore, the existing experimental studies on multivariate distances have critical limitations: (a) focusing only on lock-step and elastic measures while ignoring categories such as sliding and kernel measures; (b) considering only one normalization technique; and (c) placing limited focus on statistical analysis of findings. Motivated by these shortcomings, we present the most complete evaluation of multivariate distance measures to date. Our study examines 30 standalone measures across 8 categories, 2 channel-dependency models, and considers 13 normalizations. We perform a comprehensive evaluation across 30 datasets and 3 downstream tasks, accompanied by rigorous statistical analysis. To ensure fairness, we conduct a thorough investigation of parameters for methods in both a supervised and an unsupervised manner. Our work verifies and extends earlier findings, showing that insights from univariate distance measures also apply to the multivariate case: (a) alternative normalization methods outperform Z-score, and for the first time, we demonstrate statistical differences in certain categories for the multivariate case; (b) multiple lock-step measures are better suited than Euclidean distance, when it comes to multivariate time series; and (c) newer elastic measures outperform the widely adopted Dynamic Time Warping distance, especially with proper parameter tuning in the supervised setting. Moreover, our results reveal that (a) sliding measures offer the best trade-off between accuracy and runtime; (b) current normalization techniques fail to significantly enhance accuracy on multivariate time series and, surprisingly, do not outperform the no normalization case, indicating a lack of appropriate solutions for normalizing multivariate time series; and (c) independent consideration of time series channels is beneficial only for elastic measures. In summary, we offer guidelines to aid in designing and selecting preprocessing strategies and multivariate distance measures for our community.

Accelerate Distributed Joins with Predicate Transfer

Join is one of the most critical operators in query processing. One effective way to optimize multi-join performance is to pre-filter rows that do not contribute to the query output. Techniques that reflect this principle include predicate pushdown, semi-join, the Yannakakis algorithm, Bloom join, predicate transfer, etc. Among these, predicate transfer is the state-of-the-art pre-filtering technique: it removes most non-contributing rows through a series of Bloom filters, thereby delivering significant performance improvements. However, the existing predicate transfer technique has several limitations. First, the current algorithm works only on a single-threaded system, while real analytics databases for large workloads are typically distributed across multiple nodes. Second, some predicate transfer steps may not filter out any rows in the destination table, thus introducing performance overhead with no speedup. This issue is exacerbated in a distributed environment, where unnecessary predicate transfers lead to extra network latency and traffic. In this paper, we aim to address both limitations. First, we explore the design space of distributed predicate transfer and propose cost-based adaptive execution to maximize the performance of each individual transfer step. Second, we develop a pruning algorithm to effectively remove unnecessary transfers that do not contribute positively to performance. We implement both techniques and evaluate them on a distributed analytics query engine. Results on standard OLAP benchmarks, including TPC-H and DSB with a scale factor up to 400, show that distributed predicate transfer can improve query performance by over 3× and reduce the amount of data exchanged by over 2.7×.
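
As a rough illustration of the pre-filtering idea behind predicate transfer (not the paper's distributed, cost-based algorithm), the sketch below builds a Bloom filter on one table's join keys and uses it to drop probe-side rows that cannot contribute to the join.

```python
# Toy illustration of Bloom-filter pre-filtering, the idea underlying predicate
# transfer: build a filter on one table's join keys and drop probe-side rows
# that cannot match before running the actual join.
from hashlib import blake2b

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = blake2b(repr((seed, key)).encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def prefilter(build_keys, probe_rows, key_of):
    bf = BloomFilter()
    for k in build_keys:
        bf.add(k)
    # False positives may slip through, but no contributing row is dropped.
    return [row for row in probe_rows if bf.might_contain(key_of(row))]

# Usage: keep only orders whose customer id appears among the customers.
customers = [(1, "Ada"), (2, "Bob")]
orders = [(10, 1), (11, 3)]
print(prefilter({c for c, _ in customers}, orders, key_of=lambda o: o[1]))
```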

Accelerating Graph Indexing for ANNS on Modern CPUs

In high-dimensional vector spaces, Approximate Nearest Neighbor Search (ANNS) is a key component in database and artificial intelligence infrastructures. Graph-based ANNS methods, particularly HNSW, have emerged as leading solutions, offering an impressive trade-off between search efficiency and accuracy. Many vector databases utilize graph indexes as their core algorithms, benefiting from various optimizations to enhance search performance. However, the high indexing time associated with graph algorithms poses a significant challenge, especially given the increasing volume of data, query processing complexity, and dynamic index maintenance demands. This has rendered indexing time a critical performance metric for users.

In this paper, we comprehensively analyze the underlying causes of the low graph indexing efficiency on modern CPUs, identifying that distance computation dominates indexing time, primarily due to high memory access latency and suboptimal arithmetic operation efficiency. We demonstrate that distance comparisons during index construction can be effectively performed using compact vector codes at an appropriate compression error. Drawing on insights gained from integrating existing compact coding methods into the graph indexing process, we propose a novel compact coding strategy, named Flash, designed explicitly for graph indexing and optimized for modern CPU architectures. By minimizing random memory accesses and maximizing the utilization of SIMD (Single Instruction, Multiple Data) instructions, Flash significantly enhances cache hit rates and arithmetic operation efficiency. Extensive experiments conducted on eight real-world datasets, ranging from ten million to one billion vectors, show that Flash achieves a speedup of 10.4× to 22.9× in index construction efficiency, while maintaining or improving search performance.
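
For readers unfamiliar with compact vector codes, the sketch below shows the generic product-quantization-style lookup that such approaches build on: vectors are stored as a few codebook ids, and a query precomputes small per-subspace distance tables so that candidate distances become table lookups rather than full floating-point computations. This is only a generic illustration; Flash's actual encoding, memory layout, and SIMD kernels are different.

```python
# Generic product-quantization-style asymmetric distance computation (not the
# paper's Flash encoding). Assumes at most 256 centroids per subspace so codes
# fit in uint8.
import numpy as np

def encode(vectors, codebooks):
    # codebooks: (num_subspaces, num_centroids, sub_dim)
    m, k, d = codebooks.shape
    parts = vectors.reshape(len(vectors), m, d)
    codes = np.empty((len(vectors), m), dtype=np.uint8)
    for s in range(m):
        dists = ((parts[:, s, None, :] - codebooks[s][None]) ** 2).sum(-1)
        codes[:, s] = dists.argmin(1)
    return codes

def query_distances(query, codes, codebooks):
    m, k, d = codebooks.shape
    q_parts = query.reshape(m, d)
    # One small table per subspace: distance from the query to every centroid.
    tables = ((q_parts[:, None, :] - codebooks) ** 2).sum(-1)     # shape (m, k)
    # Approximate distance = sum of table entries selected by each vector's codes.
    return tables[np.arange(m), codes].sum(axis=1)

# Usage sketch:
# rng = np.random.default_rng(0)
# codebooks = rng.normal(size=(8, 256, 16))   # 8 subspaces of a 128-dim space
# codes = encode(rng.normal(size=(1000, 128)), codebooks)
# print(query_distances(rng.normal(size=128), codes, codebooks).shape)  # (1000,)
```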

Accelerating Skyline Path Enumeration with a Core Attribute Index on Multi-attribute Graphs

As a building block of many graph-based areas, the s-t path enumeration problem aims to find all paths between s and t that satisfy a given constraint, e.g., a bound on the number of hops. In many real-world scenarios, graphs are multi-attribute, where vertices and edges are associated with numerical attributes, such as expense or distance in road networks. However, existing methods have not fully leveraged all attributes in s-t path analysis. Hence, in this paper, we study the problem of skyline path enumeration, which aims to identify paths that balance multiple attributes, ensuring that no skyline result is dominated by another, thus meeting diverse user needs. To efficiently tackle this problem, we design a task-oriented core attribute index, called CAI, to rule out all redundant vertices and edges not located in any skyline path. Additionally, we introduce a hop-dependency label propagation strategy to construct the CAI index in parallel, improving the indexing process. Based on this index, we further design a CAI-based querying strategy that reduces fruitless explorations between candidate vertices not in the same skyline path, significantly optimizing query processing time. Experimental evaluations on fifteen real-world graphs show that CAI outperforms existing methods by up to four orders of magnitude in speed while demonstrating enhanced scalability and well-bounded memory costs.

Adda: Towards Efficient in-Database Feature Generation via LLM-based Agents

Integrating machine learning (ML) analytics into existing database management systems (DBMSs) not only eliminates the need for costly data transfers to external ML platforms but also ensures compliance with regulatory standards. While some DBMSs have integrated functionalities for training and applying ML models for analytics, these tasks still present challenges, particularly due to limited support for automatic feature engineering (AutoFE), which is crucial for optimizing ML model performance. In this paper, we introduce Adda, an agent-driven in-database feature generation tool designed to automatically create high-quality features for ML analytics directly within the database. Adda interprets ML analytics tasks described in natural language and generates code for feature construction by leveraging the power of large language models (LLMs) integrated with specialized agents. This code is then translated into SQL statements using a predefined set of operators and compiled just-in-time (JIT) into user-defined functions (UDFs). The result is a seamless, fully in-database solution for feature generation, specifically tailored for ML analytics tasks. Extensive experiments across 14 public datasets, with five ML tasks per dataset, show that Adda improves the AUC by up to 33.2% and reduces end-to-end latency by up to 100x compared to Madlib.

AJOSC: Adaptive Join Order Selection for Continuous Queries

Multi-way join, which refers to the join operation among multiple tables, is widely used in database systems. With the development of the Internet and social networks, a new variant of the multi-way join query has emerged, requiring continuous monitoring of the query results as the database is updated. This variant is called continuous multi-way join. The join order of a continuous multi-way join significantly impacts the operation's cost. However, existing methods for continuous multi-way join order selection are heuristic and may fail to select the most efficient orders. On the other hand, the high cost of order computation becomes a system bottleneck if we directly transfer join order selection algorithms designed for static multi-way joins to the dynamic setting. In this paper, we propose a new Adaptive Join Order Selection algorithm for Continuous multi-way join queries, named AJOSC. It uses dynamic programming to find the optimal join order with a new cost model specifically designed for continuous multi-way join. We further propose a lower-bound-based incremental re-optimization algorithm to restrict the search space and recompute the join order at low cost when the data distribution changes. Experimental results show that AJOSC is up to two orders of magnitude faster than state-of-the-art methods.
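
The dynamic-programming core mentioned above is, at heart, the classic enumeration over join orders. The sketch below shows a textbook left-deep variant with a pluggable cost function and a made-up cost model; AJOSC's continuous-query cost model and its lower-bound-based incremental re-optimization are not reproduced here.

```python
# Textbook dynamic programming over left-deep join orders with a pluggable
# cost function; the cardinalities in the example are fictitious.
from itertools import combinations

def best_join_order(tables, join_cost):
    """join_cost(prefix_tuple, next_table) -> float; returns (order, total_cost)."""
    best = {frozenset([t]): ((t,), 0.0) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            candidates = []
            for t in subset:
                sub_order, sub_cost = best[subset - {t}]
                candidates.append((sub_order + (t,), sub_cost + join_cost(sub_order, t)))
            best[subset] = min(candidates, key=lambda c: c[1])
    return best[frozenset(tables)]

# Toy cost model: cost of adding a table grows with the sizes already involved.
card = {"R": 1_000, "S": 10, "T": 100}
toy_cost = lambda prefix, t: card[t] + sum(card[p] for p in prefix)
print(best_join_order(["R", "S", "T"], toy_cost))
```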

Alsatian: Optimizing Model Search for Deep Transfer Learning

Transfer learning is an effective technique for tuning a deep learning model when training data or computational resources are limited. Instead of training a new model from scratch, the parameters of an existing base model are adjusted for the new task. The accuracy of such a fine-tuned model depends on the suitability of the base model chosen. Model search automates the selection of such a base model by evaluating the suitability of candidate models for a specific task. This entails inference with each candidate model on task-specific data. With thousands of models available through model stores, the computational cost of model search is a major bottleneck for efficient transfer learning.

In this work, we present Alsatian, a novel model search system. Based on the observation that many candidate models overlap to a significant extent and following a careful bottleneck analysis, we propose optimization techniques that are applicable to many model search frameworks. These optimizations include: (i) splitting models into individual blocks that can be shared across models, (ii) caching of intermediate inference results and model blocks, and (iii) selecting a beneficial search order for models to maximize sharing of cached results. In our evaluation on state-of-the-art deep learning models from computer vision and natural language processing, we show that Alsatian outperforms baselines by up to 14x.

Approximate DBSCAN under Differential Privacy

This paper revisits the DBSCAN problem under differential privacy (DP). Existing DP-DBSCAN algorithms aim at publishing the cluster labels of the input points. However, we show, both empirically and theoretically, that this approach cannot offer any utility in the published results. We therefore propose an alternative definition of DP-DBSCAN based on the notion of spans. We argue that publishing the spans actually better serves the purposes of visualization and classification of DBSCAN. Then we present a linear-time DP-DBSCAN algorithm achieving the sandwich quality guarantee in any constant dimensions, as well as matching lower bounds on the approximation ratio. A key building block in our algorithm is a linear-time algorithm for constructing a histogram under pure-DP, which is of independent interest. Finally, we conduct experiments on both synthetic and real-world datasets to verify the practical performance of our DP-DBSCAN algorithm.
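
As background for the histogram building block mentioned above, the following sketch shows the standard pure-DP histogram obtained with the Laplace mechanism: adding or removing one point changes exactly one bucket count by 1, so per-bucket Laplace(1/ε) noise suffices. This is the generic textbook construction, not the paper's linear-time algorithm or its span-based DBSCAN.

```python
# Standard pure-DP histogram via the Laplace mechanism (sensitivity 1).
import numpy as np

def dp_histogram(points, grid_size, eps, rng=np.random.default_rng()):
    """points: array of shape (n, d), coordinates already scaled into [0, 1)."""
    points = np.asarray(points)
    cells = np.clip((points * grid_size).astype(int), 0, grid_size - 1)
    counts = np.zeros((grid_size,) * points.shape[1])
    np.add.at(counts, tuple(cells.T), 1)          # exact bucket counts
    # One unit of Laplace(1/eps) noise per bucket yields eps-DP.
    return counts + rng.laplace(scale=1.0 / eps, size=counts.shape)

# Usage: a noisy 32x32 grid over 2-D points with eps = 1.
# noisy = dp_histogram(np.random.rand(10_000, 2), grid_size=32, eps=1.0)
```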

Approximating Opaque Top-k Queries

Combining query answering and data science workloads has become prevalent. An important class of such workloads is top-k queries with a scoring function implemented as an opaque UDF - a black box whose internal structure and scores on the search domain are unavailable. Some typical examples include costly calls to fuzzy classification and regression models. The models may also be changed in an ad-hoc manner. Since the algorithm does not know the scoring function's behavior on the input data, opaque top-k queries become expensive to evaluate exactly or speed up by indexing. Hence, we propose an approximation algorithm for opaque top-k query answering. Our proposed solution is a task-independent hierarchical index and a novel bandit algorithm. The index clusters elements by some cheap vector representation and then builds a tree of the clusters. Our bandit is a diminishing-returns submodular epsilon-greedy bandit algorithm that maximizes the sum of the solution set's scores. Our bandit models the distribution of scores in each arm using a histogram and then targets arms with fat tails. We prove that our bandit algorithm approaches a constant factor of the optimal algorithm. We evaluate our standalone library on large synthetic, image, and tabular datasets over a variety of scoring functions. Our method accelerates the time required to achieve nearly optimal scores by up to an order of magnitude compared to exhaustive scan while consistently outperforming baseline sampling algorithms.
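
The bandit loop described above can be pictured with a plain epsilon-greedy sketch over cluster "arms". The paper's histogram-based modeling of score tails and its submodular objective are deliberately omitted here, and `score` stands in for an arbitrary opaque UDF.

```python
# Plain epsilon-greedy collection of high-scoring items from cluster "arms";
# a simplification of the abstract's idea, not the paper's algorithm.
import random

def epsilon_greedy_topk(clusters, score, budget, k, eps=0.1):
    """clusters: list of lists of items (arms); score: expensive opaque UDF."""
    pulls = [[] for _ in clusters]               # observed scores per arm
    seen = []
    for _ in range(budget):
        explorable = [i for i, c in enumerate(clusters) if c]
        if not explorable:
            break
        if random.random() < eps or not any(pulls):
            arm = random.choice(explorable)      # explore
        else:                                    # exploit the best-looking arm
            arm = max(explorable,
                      key=lambda i: sum(pulls[i]) / len(pulls[i]) if pulls[i] else float("-inf"))
        item = clusters[arm].pop()
        s = score(item)                          # the expensive black-box call
        pulls[arm].append(s)
        seen.append((s, item))
    return sorted(seen, key=lambda x: x[0], reverse=True)[:k]
```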

Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving

Large language model (LLM) inference serving systems are essential to various LLM-based applications. As demand for LLM services continues to grow, scaling these systems to handle high request rates while meeting latency Service-Level Objectives (SLOs), referred to as effective throughput, becomes critical. However, existing systems often struggle to improve effective throughput, primarily due to a significant decline in Time To First Token (TTFT) SLO attainment. We identify two major causes of this bottleneck: (1) memory-intensive KV cache that limits batch size expansion under GPU memory constraints, and (2) rigid batch composition enforced by the default First-Come-First-Serve scheduling policy. In this paper, we introduce Apt-Serve, a scalable framework designed to enhance effective throughput in LLM inference serving. Apt-Serve features a new hybrid cache scheme that combines KV cache with a memory-efficient hidden cache for reusable input hidden state vectors, allowing large batch sizes and improving request concurrency. Based on the hybrid cache, Apt-Serve employs an adaptive runtime scheduling mechanism that dynamically optimizes batch composition. We formally define the adaptive scheduling optimization problem and propose an efficient algorithm with theoretical guarantees. Extensive evaluations on three real-world datasets and LLMs ranging from 13B to 66B parameters demonstrate that Apt-Serve achieves up to 8.8x improvement in effective throughput compared to the state-of-the-art inference serving systems.

Are Database System Researchers Making Correct Assumptions about Transaction Workloads?

Many recent papers have contributed novel concurrency control and transaction processing algorithms that start by making an assumption about the transaction workload submitted by an application, and yield high performance (sometimes by an order of magnitude) under these assumptions. Two of the most common assumptions concern the following questions: (1) Is the read and write set of a transaction known (or easily derivable) directly by analyzing the application code in advance of transaction execution, or is the access set of a transaction dependent on the current state of the database? (2) Does the application send the entire transaction in a single request to the database system, or is the transaction sent via several requests "interactively", with application code run in between these requests? The database community has made tremendous progress in improving the throughput and latency of transaction processing when read/write sets are known in advance and for non-interactive transactions. However, the impact of this progress is directly dependent on the accuracy of these assumptions for both current and future applications. In this paper, we conduct an extensive study of 111 open-source applications, analyzing over 30,000 transactions to evaluate the accuracy of these assumptions both as they hold in the current codebases and in terms of how extensive the code changes would be for them to hold moving forward. Our study reveals that the second of these assumptions is stronger than the first. More specifically, for 90% of applications, at least 58% of transactions per application have read/write sets that can be inferred in advance. Furthermore, although only 39% of applications contain zero interactive transactions, the majority of the remaining 61% of applications can be converted to being completely non-interactive with minimal changes. These insights underscore the potential for further optimization and research in designing OLTP systems that balance transaction expressivity and performance.

Athena: An Effective Learning-based Framework for Query Optimizer Performance Improvement

Recent studies have made it possible to integrate learning techniques into database systems for practical utilization. In particular, the state-of-the-art studies hook the conventional query optimizer to explore multiple execution plan candidates and then choose the optimal one with a learned model. This framework simplifies the integration of learning techniques into the database system. However, these methods still have room for improvement due to their limited plan exploration space and ineffective learning from execution plans. In this work, we propose Athena, an effective learning-based framework for enhancing query optimizers. It consists of three key components: (i) an order-centric plan explorer, (ii) a Tree-Mamba plan comparator, and (iii) a time-weighted loss function. We implement Athena on top of the open-source database PostgreSQL and demonstrate its superiority via extensive experiments. Specifically, we achieve 1.75x, 1.95x, 5.69x, and 2.74x speedups over the vanilla PostgreSQL on the JOB, STATS-CEB, TPC-DS, and DSB benchmarks, respectively. Athena is 1.74x, 1.87x, 1.66x, and 2.28x faster than the state-of-the-art competitor Lero on these benchmarks. Additionally, Athena is open-sourced and can be easily adapted to other relational database systems, as all of the proposed techniques in Athena are generic.

Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables

Data cleaning is a long-standing challenge in data management. While powerful logic and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain-experts to first manually specify data-quality constraints specific to a given table, before data cleaning algorithms can be applied.

In this work, we observe that there is an important class of data-quality constraints that we call Semantic-Domain Constraints, which can be reliably inferred and automatically applied to any tables, without requiring domain-experts to manually specify on a per-table basis. We develop a principled framework to systematically learn such constraints from table corpora using large-scale statistical tests, which can further be distilled into a core set of constraints using our optimization framework, with provable quality guarantees. Extensive evaluations show that this new class of constraints can be used to both (1) directly detect errors on real tables in the wild, and (2) augment existing expert-driven data-cleaning techniques as a new class of complementary constraints.

Our code and data are available at https://github.com/qixuchen/AutoTest for future research.

Automated Validating and Fixing of Text-to-SQL Translation with Execution Consistency

State-of-the-art Text-to-SQL models rely on fine-tuning or few-shot prompting to help LLMs learn from training datasets containing mappings from natural language (NL) queries to SQL statements. Consequently, the quality of the dataset can greatly affect the accuracy of these Text-to-SQL models. Unlike other NL tasks, Text-to-SQL datasets are prone to errors despite extensive manual efforts due to the subtle semantics of SQL. Our study finds that a non-negligible (>30%) portion of incorrect NL to SQL mappings exists in the popular datasets Spider and BIRD.

This paper aims to improve the quality of Text-to-SQL training datasets and thereby increase the accuracy of the resulting models. To do so, we propose a necessary correctness condition called execution consistency. For a given database instance, an NL to SQL mapping satisfies execution consistency if the execution result of an NL query matches that of the corresponding SQL. We develop SQLDriller to detect incorrect NL to SQL mappings based on execution consistency in a best-effort manner by crafting database instances that likely result in violations of execution consistency. It generates multiple candidate SQL predictions that differ in their syntax structures. Using a SQL equivalence checker, SQLDriller obtains counterexample database instances that can distinguish non-equivalent candidate SQLs. It then checks the execution consistency of an NL to SQL mapping under this set of counterexamples. The evaluation shows SQLDriller effectively detects and fixes incorrect mappings in the Text-to-SQL dataset, and it improves the model accuracy by up to 13.6%.

BPF-DB: A Kernel-Embedded Transactional Database Management System For eBPF Applications

Developers rely on the eBPF framework to augment operating system (OS) behavior for the benefit of database management systems (DBMSs) without having to modify kernel code. However, eBPF's verifier limits program complexity and data management functionality. As a result, eBPF's storage options are limited to kernel-resident, non-durable data structures that lack transactional guarantees.

Inspired by embedded DBMSs for user-space applications, this paper presents BPF-DB, an OS-embedded DBMS that offers transactional data management for eBPF applications. We explore the storage management and concurrency control challenges associated with DBMS design in eBPF's restrictive execution environment. We demonstrate BPF-DB's capabilities with two applications based on real-world systems. The first is a Redis-compatible in-memory DBMS that uses BPF-DB as its transactional storage engine. This system matches the performance of state-of-the-art implementations while offering stronger transactional guarantees. The second application implements a stored procedure-based DBMS that provides serializable multi-statement transactions. We compare this application against VoltDB, with BPF-DB achieving 43% higher throughput. BPF-DB's robust and high-performance transactional semantics enable emerging kernel-space applications.

Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) is often used with Large Language Models (LLMs) to infuse domain knowledge or user-specific information. In RAG, given a user query, a retriever extracts chunks of relevant text from a knowledge base. These chunks are sent to an LLM as part of the input prompt. Typically, any given chunk is repeatedly retrieved across user questions. However, currently, for every question, attention layers in LLMs fully compute the Keys and Values (KVs) repeatedly for the input chunks, as state-of-the-art methods cannot reuse KV-caches when chunks appear at arbitrary locations or with arbitrary contexts. Naive reuse leads to output quality degradation. This leads to potentially redundant computations on expensive GPUs and increases latency. In this work, we propose Cache-Craft, a system for managing and reusing precomputed KVs corresponding to the text chunks (which we call chunk-caches) in RAG-based systems. We present how to identify chunk-caches that are reusable, how to efficiently perform a small fraction of recomputation to fix the cache and maintain output quality, and how to efficiently store and evict chunk-caches in the hardware for maximizing reuse while masking any overheads. With real production workloads as well as synthetic datasets, we show that Cache-Craft reduces redundant computation by 51% over SOTA prefix-caching and 75% over full recomputation. Additionally, with continuous batching on a real production workload, we get a 1.6× speedup in throughput for both the Llama-3-8B and 70B models and a 2.1× and 2× reduction in end-to-end response latency, respectively, compared to prefix-caching, while maintaining generation quality.
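
The chunk-cache idea can be pictured with a minimal content-addressed cache of per-chunk KV states. In a real serving stack the cached KVs depend on the chunk's position and surrounding context, which Cache-Craft repairs with partial recomputation; the `adjust` hook below is only a placeholder for that step, and the class is not the system's API.

```python
# Minimal content-addressed cache for per-chunk KV states (illustration only).
import hashlib

class ChunkKVCache:
    def __init__(self, compute_kv, max_entries=10_000):
        self.compute_kv = compute_kv         # chunk text -> KV tensors (expensive)
        self.cache, self.max_entries = {}, max_entries

    def get(self, chunk_text, adjust=lambda kv: kv):
        key = hashlib.sha256(chunk_text.encode()).hexdigest()
        if key not in self.cache:
            if len(self.cache) >= self.max_entries:
                self.cache.pop(next(iter(self.cache)))   # naive FIFO eviction
            self.cache[key] = self.compute_kv(chunk_text)
        # `adjust` stands in for the small recomputation that repairs the cached
        # KVs for the chunk's new position and surrounding context.
        return adjust(self.cache[key])
```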

CARINA: An Efficient CXL-Oriented Embedding Serving System for Recommendation Models

Embedding-based recommendation models (ERMs) require large memory to host huge embedding tables and involve massive data traffic to read the embeddings. As a new interconnect, CXL suits ERMs since it can scale up single-machine memory with performant remote memory devices. However, directly running DRAM-based ERM serving systems on CXL yields poor performance because the bandwidth of CXL is notably lower than that of DRAM and can be easily saturated, making CXL memory the bottleneck. The non-uniform memory access (NUMA) architecture in modern CXL servers further decreases system performance. In this paper, we design Carina for ERM serving on heterogeneous memory with CXL by considering such bandwidth asymmetry. In particular, Carina balances the memory access from different memory devices by storing hot embeddings with high access frequencies on DRAM and specifying the placement of embedding tables on the NUMA nodes. Moreover, Carina adopts bandwidth-aware task execution, which decomposes each batch of ERM requests into fine-grained tasks and schedules the tasks to control the real-time utilization of CXL bandwidth to avoid instantaneous saturation. We evaluate Carina on real CXL devices and find that it outperforms a CXL-oblivious baseline by an average of 5.38x and 4.04x in system throughput and request latency, respectively.

Clementi: Efficient Load Balancing and Communication Overlap for Multi-FPGA Graph Processing

Efficient graph processing is critical in various modern applications, such as social network analysis, recommendation systems, and large-scale data mining. Traditional single-FPGA systems struggle to handle the increasing size and complexity of real-world graphs due to limitations in memory and computational resources. Existing multi-FPGA solutions face significant challenges, including high communication overhead caused by irregular data transfer patterns and workload imbalances stemming from skewed graph distributions. These inefficiencies hinder scalability and performance, highlighting a critical research gap. To address these issues, we introduce Clementi, an efficient multi-FPGA graph processing framework that features customized fine-grained pipelines for computation and cross-FPGA communication. Clementi uniquely integrates an accurate performance model for execution time prediction, enabling a novel scheduling method that balances workload distribution and minimizes communication overhead by overlapping communication and computation stages. Experimental results demonstrate that Clementi achieves speedups of up to 8.75× compared to state-of-the-art multi-FPGA designs, indicating significant improvements in processing efficiency as the number of FPGAs increases. This near-linear scalability underscores the framework's potential to enhance graph processing capabilities in practical applications. Clementi is open-sourced at https://github.com/Xtra-Computing/Clementi.

Community Detection in Heterogeneous Information Networks Without Materialization

Community detection in heterogeneous information networks (HINs) poses significant challenges due to the diversity of entity types and the complexity of their interrelations. While traditional algorithms may perform adequately in some scenarios, many struggle with the high memory usage and computational demands of large-scale HINs. To address these challenges, we introduce a novel framework, SCAR, which efficiently uncovers community structures in HINs without requiring network materialization. SCAR leverages insights from meta-paths to interpret multi-relational data through compact vertex-based sketches, significantly reducing computational overhead and materialization overhead. We propose a sketch-based technique for estimating changes in modularity, improving both the precision and speed in community detection. Our extensive evaluations on diverse real-world datasets provide detailed comparative metrics, demonstrating that SCAR outperforms several state-of-the-art methods, including Gdy, Louvain, Leiden, Infomap, Walktrap, and Networkit, in execution time and memory consumption while maintaining competitive accuracy. Overall, SCAR offers a robust and scalable solution for revealing community structures in large HINs, with applications across various domains, including social networks, academic collaboration networks, and e-commerce platforms.

Computing Inconsistency Measures Under Differential Privacy

Assessing data quality is crucial to knowing whether and how to use the data for different purposes. Specifically, given a collection of integrity constraints, various ways have been proposed to quantify the inconsistency of a database. Inconsistency measures are particularly important when we wish to assess the quality of private data without revealing sensitive information. We study the estimation of inconsistency measures for a database protected under Differential Privacy (DP). Such estimation is nontrivial since some measures intrinsically query sensitive information, and the computation of others involves functions on underlying sensitive data. Among five inconsistency measures that have been proposed in recent work, we identify that two are intractable in the DP setting. The major challenge for the other three is high sensitivity: adding or removing one tuple from the dataset may significantly affect the outcome. To mitigate that, we model the dataset using a conflict graph and investigate private graph statistics to estimate these measures. The proposed machinery includes adapting graph-projection techniques with parameter selection optimizations on the conflict graph and a DP variant of approximate vertex cover size. We experimentally show that we can effectively compute DP estimates of the three measures on five real-world datasets with denial constraints, where the density of the conflict graphs highly varies.

Cracking SQL Barriers: An LLM-based Dialect Translation System

Automatic dialect translation reduces the complexity of database migration, which is crucial for applications interacting with multiple database systems. However, rule-based translation tools (e.g., SQLGlot, jOOQ, SQLines) are labor-intensive to develop and often (1) fail to translate certain operations, (2) produce incorrect translations due to rule deficiencies, and (3) generate translations compatible with some database versions but not others.

In this paper, we investigate the problem of automating dialect translation with large language models (LLMs). There are three main challenges. First, queries often involve lengthy content (e.g., excessive column values) and multiple syntax elements that require translation, increasing the risk of LLM hallucination. Second, database dialects have diverse syntax trees and specifications, making cross-dialect syntax matching difficult. Third, dialect translation often involves complex many-to-one relationships between source and target operations, making it impractical to translate each operation in isolation. To address these challenges, we propose an automatic dialect translation system, CrackSQL. First, we propose Functionality-based Query Processing that segments the query by functionality syntax trees and simplifies the query via (i) customized function normalization and (ii) translation-irrelevant query abstraction. Second, we design a Cross-Dialect Syntax Embedding Model to generate embeddings from the syntax trees and specifications (of a specific version), enabling accurate query syntax matching. Third, we propose a Local-to-Global Dialect Translation strategy, which restricts LLM-based translation and validation to operations that cause local failures, iteratively extending these operations until translation succeeds. Experiments show CrackSQL significantly outperforms existing methods (e.g., by up to 77.42%). The code is available at https://github.com/weAIDB/CrackSQL.

Credible Intervals for Knowledge Graph Accuracy Estimation

Knowledge Graphs (KGs) are widely used in data-driven applications and downstream tasks, such as virtual assistants, recommendation systems, and semantic search. The accuracy of KGs directly impacts the reliability of the inferred knowledge and outcomes. Therefore, assessing the accuracy of a KG is essential for ensuring the quality of facts used in these tasks. However, the large size of real-world KGs makes manual triple-by-triple annotation impractical, thereby requiring sampling strategies to provide accuracy estimates with statistical guarantees. The current state-of-the-art approaches rely on Confidence Intervals (CIs), derived from frequentist statistics. While efficient, CIs have notable limitations and can lead to interpretation fallacies. In this paper, we propose to overcome the limitations of CIs by using Credible Intervals (CrIs), which are grounded in Bayesian statistics. These intervals are more suitable for reliable post-data inference, particularly in KG accuracy evaluation. We prove that CrIs offer greater reliability and stronger guarantees than frequentist approaches in this context. Additionally, we introduce aHPD, an adaptive algorithm that is more efficient for real-world KGs and statistically robust, addressing the interpretive challenges of CIs.
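
For intuition about credible intervals in this setting, the sketch below computes a standard equal-tailed Bayesian interval for KG accuracy from a uniformly sampled, manually annotated subset, using a Beta(1, 1) prior. This is a textbook construction, not the paper's adaptive aHPD algorithm.

```python
# Equal-tailed Bayesian credible interval for an accuracy proportion, using a
# Beta(1, 1) prior; a standard construction, not the paper's aHPD procedure.
from scipy.stats import beta

def credible_interval(num_correct, num_annotated, level=0.95, a0=1.0, b0=1.0):
    a = a0 + num_correct                       # posterior Beta(a, b) parameters
    b = b0 + (num_annotated - num_correct)
    lo, hi = beta.ppf([(1 - level) / 2, 1 - (1 - level) / 2], a, b)
    return lo, hi

# e.g. 183 correct facts out of 200 annotated triples:
print(credible_interval(183, 200))
```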

cuMatch: A GPU-based Memory-Efficient Worst-case Optimal Join Processing Method for Subgraph Queries with Complex Patterns

Subgraph queries are widely used but face significant challenges due to complex patterns such as negative and optional edges. While worst-case optimal joins have proven effective for subgraph queries with regular patterns, no method has been proposed that can process queries involving complex patterns in a single multi-way join. Existing CPU-based and GPU-based methods experience intermediate data explosion when processing complex patterns following regular patterns. In addition, GPU-based methods struggle with issues of wasted GPU memory and redundant computation. In this paper, we propose cuMatch, a GPU-based unified worst-case optimal join processing method for subgraph queries. It avoids intermediate data explosions by processing even complex pattern queries in a single multi-way join. It is also memory-efficient and fast, due to a new partitioning format, scheduling method, and task fusion technique. Extensive experiments demonstrate that cuMatch outperforms state-of-the-art methods by orders of magnitude, without out-of-memory errors.

Dangers of List Processing in Querying Property Graphs

The workhorse of property graph query languages such as Cypher and GQL is pattern matching. The result of pattern matching is a collection of paths and mappings of variables to graph elements. To increase the expressiveness of post-processing of pattern matching results, languages such as Cypher introduce the capability of creating lists of nodes and edges from matched paths, and provide users with standard list processing tools such as reduce. We show that, on the one hand, this makes it possible to capture useful classes of queries that pattern matching alone cannot express. On the other hand, we show that this opens a backdoor to very high and unexpected expressiveness. In particular, one can very easily express several classical NP-hard problems by simple queries that use reduce. This level of expressiveness appears to be beyond what query optimizers can handle, and indeed this is confirmed by an experimental evaluation, showing that such queries time out already on very small graphs. We conclude our analysis with a suggestion on the use of list processing in queries that, while retaining its usefulness, avoids the above pitfalls and prevents highly intractable queries.

Data Enhancement for Binary Classification of Relational Data

This paper studies enhancement of training data D to improve the robustness of machine learning (ML) classifiers M against adversarial attacks on relational data. Data enhancing aims to (a) defuse poisoned imperceptible features embedded in D, and (b) defend against attacks at prediction time that are unseen in D. We show that while there exists an inherent tradeoff between the accuracy and robustness of M in case (b), data enhancing can improve both the accuracy and robustness at the same time in case (a). We formulate two data enhancing problems accordingly, and show that both problems are intractable. Despite the hardness, we propose a framework that integrates model training and data enhancing. Moreover, we develop algorithms for (a) detecting and debugging corrupted imperceptible features in training data, and (b) selecting and adding adversarial examples to training data to defend against unseen attacks at prediction time. Using real-life datasets, we empirically verify that the method is at least 20.4% more robust and 2.02X faster than SOTA methods for classifiers M, without degrading the accuracy of M.

Debunking the Myth of Join Ordering: Toward Robust SQL Analytics

Join order optimization is critical in achieving good query performance. Despite decades of research and practice, modern query optimizers could still generate inferior join plans that are orders of magnitude slower than optimal. Existing research on robust query processing often lacks theoretical guarantees on join-order robustness while sacrificing query performance. In this paper, we rediscover the recent Predicate Transfer technique from a robustness point of view. We introduce two new algorithms, LargestRoot and SafeSubjoin, and then propose Robust Predicate Transfer (RPT) that is provably robust against arbitrary join orders of an acyclic query. We integrated Robust Predicate Transfer with DuckDB, a state-of-the-art analytical database, and evaluated against all the queries in TPC-H, JOB, TPC-DS, and DSB benchmarks. Our experimental results show that RPT improves join-order robustness by orders of magnitude compared to the baseline. With RPT, the largest ratio between the maximum and minimum execution time out of random join orders for a single acyclic query is only 1.6x (the ratio is close to 1 for most evaluated queries). Meanwhile, applying RPT also improves the end-to-end query performance by ≈1.5x (per-query geometric mean). We hope that this work sheds light on solving the practical join ordering problem.

DFlush: DPU-Offloaded Flush for Disaggregated LSM-based Key-Value Stores

The rapid increase of storage and network bandwidth incurs higher CPU consumption in modern data systems. This phenomenon is particularly evident for log-structured merge key-value stores (LSM-KVS), which rely on resource-intensive background operations to flush and compact disk data. While extensive research has been conducted to reduce the CPU overhead of background compaction, less attention has been paid to background flushing, which can also consume a significant amount of valuable CPU cycles and disrupt CPU caches, ultimately impacting overall performance.

In this paper, we propose DFlush, a novel solution that uses DPUs to offload background flush operations to reduce their CPU cost. DPUs are an appealing choice for this goal due to their cost-effectiveness, ease of programming, and widespread deployment. However, their complex hardware architecture requires careful design of both the data and control planes. To fully harness the DPU's capabilities, DFlush decomposes a flush job into fine-grained steps, maps them to DPU hardware units, and accelerates them through pipeline, data, and channel parallelism, ensuring data-plane efficiency. It also introduces an adaptive control plane that dynamically schedules flush jobs from different LSM-KVS instances based on their priority, reducing write stalls and tail latency. Our experiments on a real DPU platform with an industrial-grade LSM-KVS show that DFlush delivers higher throughput, significantly lower tail latency, and saves up to dozens of CPU cores per LSM-KVS server while reducing energy consumption.

DIGRA: A Dynamic Graph Indexing for Approximate Nearest Neighbor Search with Range Filter

Recent advancements in AI have enabled models to map real-world entities, such as product images, into high-dimensional vectors, making approximate nearest neighbor search (ANNS) crucial for various applications. Often, these vectors are associated with additional attributes like price, prompting the need for range-filtered ANNS where users seek similar items within specific attribute ranges. Naive solutions like pre-filtering and post-filtering are straightforward but inefficient. Specialized indexes, such as SeRF, SuperPostFiltering, and iRangeGraph, have been developed to address these queries effectively. However, these solutions do not support dynamic updates, limiting their practicality in real-world scenarios where datasets frequently change.

To address these challenges, we propose DIGRA, a novel dynamic graph index for range-filtered ANNS. DIGRA supports efficient dynamic updates while maintaining a balance among query efficiency, update efficiency, indexing cost, and result quality. Our approach introduces a dynamic multi-way tree structure combined with carefully integrated ANNS indices to handle range filtered ANNS efficiently. We employ a lazy weight-based update mechanism to significantly reduce update costs and adopt optimized choice of ANNS index to lower construction and update overhead. Experimental results demonstrate that DIGRA achieves superior trade-offs, making it suitable for large-scale dynamic datasets in real-world applications.

Divide-and-Conquer: Scalable Shortest Path Counting on Large Road Networks

The shortest path counting problem is crucial for various applications in road networks, such as network robustness analysis, traffic flow distribution, and navigation optimization. Unlike traditional shortest path problems, it requires enumerating all possible shortest paths, making it computationally challenging, especially in dense urban networks with numerous equal-length paths. Existing methods, such as 2-hop labeling schemes, precompute shortest-path distances and counts for efficient queries but struggle to scale in large networks. In this work, we propose a novel divide-and-conquer approach based on recursive vertex bipartitioning to address this limitation. At its core, we establish a count reconstruction theorem that efficiently combines shortest subpath counts from smaller subgraphs to accurately reconstruct shortest path counts for the entire graph. This approach significantly reduces computational overhead and storage requirements. We also introduce a 2-hop count labeling scheme that integrates effectively with this divide-and-conquer framework. Experimental results show that our approach significantly outperforms state-of-the-art solutions, doubling query processing speed, reducing label construction time to one-fourth, and requiring only around 20% of labeling space.

Dupin: A Parallel Framework for Densest Subgraph Discovery in Fraud Detection on Massive Graphs

Detecting fraudulent activities in financial and e-commerce transaction networks is crucial. One effective method for this is Densest Subgraph Discovery (DSD). However, deploying DSD methods in production systems faces substantial scalability challenges due to the predominantly sequential nature of existing methods, which impedes their ability to handle large-scale transaction networks and results in significant detection delays. To address these challenges, we introduce Dupin, a novel parallel processing framework designed for efficient DSD processing in billion-scale graphs. Dupin is powered by a processing engine that exploits the unique properties of the peeling process, with theoretical guarantees on detection quality and efficiency. Dupin provides user-friendly APIs for flexible customization of DSD objectives and ensures robust adaptability to diverse fraud detection scenarios. Empirical evaluations indicate that Dupin consistently outperforms several existing DSD methods, achieving performance improvements of up to two orders of magnitude compared to traditional approaches. On billion-scale graphs, Dupin demonstrates the potential to enhance the prevention of fraudulent transactions by approximately 49.5 basis points and reduces density error from 30.3% to below 5.0%, as supported by our experimental results.

Efficient and Accurate Differentially Private Cardinality Continual Releases

Accurately estimating the number of unique elements that appear in data streams in real time is a fundamental problem with applications including network traffic monitoring and real-time social media analytics. Traditional sketch-based algorithms such as FM Sketch and HyperLogLog offer memory-friendly solutions for cardinality estimation but fall short in scenarios where the stream elements are privacy-sensitive and require differential privacy. Although recent approaches have incorporated differential privacy into the above cardinality estimators, they are limited to single-query settings, restricting their applicability. Previous methods for private cardinality continual release settings (i.e., releasing the cardinality after each new element in the stream) demand large memory resources and are thus difficult to apply in practice. In this paper, we present a novel cardinality estimation framework, FC, which ensures differential privacy under continual releases while simultaneously achieving low memory usage, high accuracy, and efficient computation. Our approach innovatively leverages an efficient cardinality estimator and privacy-preserving mechanisms to overcome the limitations of existing methods. Comprehensive experiments demonstrate that our method reduces memory usage by up to 504 times compared to the best previous method while maintaining nearly the same accuracy. Additionally, under identical memory constraints, our method improves the estimation accuracy by orders of magnitude.

Efficient Dynamic Indexing for Range Filtered Approximate Nearest Neighbor Search

Given a set O of objects consisting of n high-dimensional vectors, the problem of approximate nearest neighbor (ANN) search for a query vector q is crucial in many applications where objects are represented as feature vectors in high-dimensional spaces. Each object in O often has attributes like popularity or price, which influence the search. Practically, searching for the nearest neighbor to q might include a range filter specifying the desired attribute values, e.g., within a specific price range. Existing solutions for range filtered ANN search often face trade-offs among excessive storage, poor query performance, and limited support for updates. To address this challenge, we propose RangePQ, a novel indexing scheme that supports efficient range filtered ANN searches and updates, requiring only linear space. Our scheme integrates seamlessly with existing PQ-based indexes, a widely recognized, scalable index type for ANN search, to enhance range-filtered ANN queries and update capabilities. Our indexing method, supporting arbitrary range filters, has a space complexity of O(n log K), where K is a parameter of the PQ-based index and log K scales with O(log n). To reduce the space cost, we further present a hybrid two-layer structure to reduce space usage to O(n), preserving query efficiency without additional update costs. Experimental results demonstrate that our indexing scheme significantly improves query performance while maintaining competitive update performance and space efficiency.

Efficient Indexing for Flexible Label-Constrained Shortest Path Queries in Road Networks

The point-to-point shortest path query is widely used in many spatial applications, e.g., navigation systems. However, the returned shortest path minimizing only one objective fails to satisfy users' various routing requirements in practice. For example, the user may specify the order of using several transportation modes in the planned route. The Label-Constrained Shortest Path (LCSP) query under regular languages is powerful enough to express diversified routing demands in a labeled road network where each edge is associated with a label to denote its road type. The complex routing demand can be formulated by a regular language, and the edge labels along each path should form a word in the given regular language. Previous LCSP solutions were either inefficient in query processing or inflexible in their use of languages, since they made assumptions about the given language. In this paper, we propose an efficient index-based solution called Border-based State Move (BSM), which can answer LCSP queries quickly with flexible use of the language constraint. Specifically, our BSM builds indexes to skip the exploration between a vertex and its border vertices during query processing. Our experiments conducted on real road networks demonstrate the superiority of our proposed BSM. It can reduce the query time over state-of-the-art solutions by two orders of magnitude.

Extending SQL to Return a Subdatabase

Every SQL statement is limited to return a single, possibly denormalized table. This approximately 50-year-old design decision has far-reaching consequences. The most apparent problem is the redundancy introduced through denormalization, which can result in long transfer times of query results and high memory usage for materializing intermediate results. Additionally, regardless of their goals, users are forced to fit query computations into one single result, mixing the data retrieval and transformation aspect of SQL. Moreover, both problems violate the principles and core ideas of normal forms. In this paper, we argue for eliminating the single-table limitation of SQL. We extend SQL's SELECT clause by the keyword 'RESULTDB' to support returning a result subdatabase. Our extension has clear semantics, i.e., by annotating any existing SQL statement with the RESULTDB keyword, the DBMS returns the tables participating in the query, each restricted to the relevant tuples that occur in the traditional single-table query result. Thus, we do not denormalize the query result in any way. Our approach has significant, far-reaching consequences, impacting the querying of hierarchical data, materialized views, and distributed databases, while maintaining backward compatibility. In addition, our proposal paves the way for a long list of exciting future research opportunities. We propose multiple algorithms to integrate our feature into both closed-source and open-source database systems. For closed-source systems, we provide several SQL-based rewrite methods. In addition, we present an efficient algorithm for cyclic and acyclic join graphs that we integrated into an open-source database system. We conduct a comprehensive experimental study. Our results show that returning multiple individual result sets can significantly decrease the result set size. Furthermore, our rewrite methods and algorithm introduce minimal overhead and can even outperform single-table execution in certain cases.
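
To make the idea of a result subdatabase concrete, the sketch below emulates it on an unmodified DBMS, in the spirit of the SQL-based rewrite methods the abstract mentions: each base table is fetched separately, restricted to the tuples that participate in the join, so no denormalization occurs. The rewrite handles only a single equi-join and is purely illustrative; it is not the paper's RESULTDB syntax or algorithm.

```python
# Emulating a "result subdatabase" with one semi-join-restricted query per
# base table (illustrative only; deliberately limited to one equi-join).
import sqlite3

def result_subdatabase(conn, left, right, left_key, right_key):
    queries = {
        left: f"SELECT * FROM {left} WHERE {left_key} IN "
              f"(SELECT {right_key} FROM {right})",
        right: f"SELECT * FROM {right} WHERE {right_key} IN "
               f"(SELECT {left_key} FROM {left})",
    }
    return {table: conn.execute(sql).fetchall() for table, sql in queries.items()}

# Usage: each table comes back normalized, without join-induced redundancy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers(id INTEGER, name TEXT);
    CREATE TABLE orders(id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 1);
""")
print(result_subdatabase(conn, "customers", "orders", "id", "customer_id"))
```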

FAAQP: Fast and Accurate Approximate Query Processing based on Bitmap-augmented Sum-Product Network

For interactive data exploration, approximate query processing (AQP) is a useful approach that provides timely responses by trading off query accuracy. To reduce query latency, existing AQP methods use samples or models rather than the underlying data to answer queries. However, it is difficult for these methods to achieve satisfactory query accuracy and query latency simultaneously. For sample-based methods, more accurate query results require more samples and thus more processing time. Model-based methods have lower query latency, but they cannot return highly accurate approximate results because existing models cannot capture complex data distributions accurately. In this paper, we propose FAAQP, a fast and accurate AQP method. In FAAQP, we propose a novel unsupervised model, the bitmap-augmented sum-product network (BSPN), which combines the advantages of the sum-product network with bitmaps to capture the characteristics of the data distribution more accurately. We then propose a budget-aware BSPN construction method that builds BSPN models with maximum query accuracy for a given storage budget. Furthermore, to reduce the query latency of FAAQP, we propose a bitmap merging strategy that trades off query accuracy against query latency. Experimental results on real-world and synthetic datasets show that FAAQP outperforms state-of-the-art AQP methods, achieving 1.3×-9.0× improvements in query accuracy with low query latency.

Fair and Actionable Causal Prescription Ruleset

Prescriptions, or actionable recommendations, are commonly generated across various fields to influence key outcomes such as improving public health, enhancing economic policies, or increasing business efficiency. While traditional association-based methods may identify correlations, they often fail to reveal the underlying causal factors needed for informed decision-making. On the other hand, in decision-making for tasks with significant societal or economic impact, it is crucial to provide recommendations that are interpretable, justifiable, and equitable in terms of the outcome for both protected and non-protected groups. Motivated by these two goals, this paper introduces a fairness-aware framework that leverages causal reasoning to generate a set of interpretable and actionable prescription rules (a ruleset) toward improving an outcome while preventing the exacerbation of inequalities for protected groups. By considering group and individual fairness metrics from the literature, we ensure that both protected and non-protected groups benefit from these recommendations, providing a balanced and equitable approach to decision-making. We employ efficient optimizations to explore the vast and complex search space, considering both the fairness and the coverage of the prescription ruleset. Empirical evaluations and case studies on real-world datasets demonstrate the utility of our framework across different use cases.

Fast and Scalable Data Transfer Across Data Systems

Fast and scalable data transfer is crucial in today's decentralized data ecosystems and data-driven applications. Example use cases include transferring data from operational systems to consolidated data warehouse environments, or from relational database systems to data lakes for exploratory data analysis or ML model training. Traditional data transfer approaches rely on efficient point-to-point connectors or on general middleware with generic intermediate data representations. Physical environments (e.g., on-premise, cloud, or consumer nodes) have also become increasingly heterogeneous. Existing work still struggles to achieve both fast and scalable data transfer and generality across heterogeneous systems and environments. Hence, in this paper, we introduce a holistic data transfer framework. Our XDBC framework splits the data transfer pipeline into logical components and provides a wide variety of physical implementations for these components. This design allows for the seamless integration of different systems as well as the automatic optimization of data transfer configurations according to workload and environment characteristics. Our evaluation shows that XDBC outperforms state-of-the-art generic data transfer tools by up to 5x, while being on par with specialized approaches.

Fast Approximate Similarity Join in Vector Databases

Recent advancements in deep learning, particularly in embedding models, have enabled the effective representation of various data types such as text, images, and audio as vectors, thereby facilitating semantic analysis. A large number of massive vector datasets are maintained in vector databases. Approximate similarity join is a core operation in vector database systems: it joins two datasets and outputs all pairs of vectors, one from each dataset, whose distance is no more than a specified value. Existing approaches for similarity join are selection-based: they treat each data point in one dataset as an individual query point and search the other dataset with an approximate range query. Such methods do not fully capitalize on the inherent properties of the join operation itself. In this paper, we propose a new join algorithm, SimJoin. Our algorithm boosts join processing efficiency by leveraging relationships between partial join results (e.g., join windows). In brief, it accelerates the processing of a join window by reusing the join windows of previously processed data points. We then discuss how to optimize the join window order to minimize join costs. In addition, we discuss how to support k-similarity join and how to maintain a proximity graph index based on k-similarity join. Extensive experiments on real-world and synthetic datasets demonstrate the significant performance superiority of our proposed algorithms over existing state-of-the-art methods.

Fast Hypertree Decompositions via Linear Programming: Fractional and Generalized

Efficient query evaluation in databases and solving constraint satisfaction problems (CSPs) are crucial for improving performance in many real-world applications, from large-scale data management to decision-making systems. These problems naturally admit hypergraph representations and are efficiently solvable using hypertree decomposition techniques when the decomposition width is small. However, these techniques require finding small-width decompositions efficiently, which remains a significant challenge despite research from both the database and theory communities. In this work, we present Ralph (Randomized Approximation using Linear Programming for Hypertree-Decompositions), a fast algorithm that computes low-width fractional and generalized hypertree decompositions for input hypergraphs, as well as lower bounds for these widths. We build on the recent breakthrough by Korchemna et al. [FOCS 2024], which introduced the first polynomial-time approximation algorithm for fractional (generalized) hypertree width. Our approach combines this theoretical advancement with practical heuristic improvements utilizing (mixed-integer) linear programs. Along the way, we present new algorithms with strong theoretical guarantees. Through empirical evaluation on the nearly 3700 instances of HyperBench, a well-established benchmark suite for hypertree decompositions, we find near-optimal decompositions for all previously solved instances and low-width decompositions for all 500 previously unsolved instances, effectively pushing the state of the art.
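As a much-simplified building block of such LP-based approaches, the snippet below solves the fractional edge-cover LP for the triangle hypergraph with scipy. Ralph combines far richer (mixed-integer) programs and heuristics, so this only illustrates the kind of linear program involved.

```python
import numpy as np
from scipy.optimize import linprog

vertices = ["A", "B", "C"]
edges = [{"A", "B"}, {"B", "C"}, {"A", "C"}]        # hyperedges of the triangle query

# minimize sum_e x_e   s.t.   for every vertex v:  sum_{e : v in e} x_e >= 1,   x_e >= 0
c = np.ones(len(edges))
A_ub = np.array([[-1.0 if v in e else 0.0 for e in edges] for v in vertices])
b_ub = -np.ones(len(vertices))
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(edges))
print(res.x, res.fun)                               # optimal fractional edge cover: 0.5 per edge, value 1.5
```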

Fast Maximum Common Subgraph Search: A Redundancy-Reduced Backtracking Approach

Given two input graphs, finding the largest subgraph that occurs in both, i.e., the maximum common subgraph, is a fundamental operator for evaluating the similarity between two graphs in graph data analysis. Existing works for solving the problem are of either theoretical or practical interest, but not both. Specifically, the algorithms with a theoretical guarantee on the running time are known not to be practically efficient, while algorithms following the recently proposed backtracking framework McSplit run fast in practice but offer no theoretical guarantees. In this paper, we propose a new backtracking algorithm called RRSplit, which achieves better practical efficiency and at the same time provides a non-trivial theoretical guarantee on the worst-case running time. To achieve the former, we develop a series of reductions and upper bounds for reducing redundant computation, i.e., the time spent exploring unpromising branches that contain no maximum common subgraph. To achieve the latter, we formally prove that RRSplit incurs a worst-case time complexity that matches the best-known complexity for the problem. Finally, we conduct extensive experiments on four benchmark graph collections, and the results demonstrate that our algorithm outperforms the practical state of the art by several orders of magnitude.

Faster and Efficient Density Decomposition via Proportional Response with Exponential Momentum

Graphs are crucial for modeling complex relationships in fields like social networks and biology. A key aspect in graph theory is identifying dense subgraphs, with applications in various domains. Density decomposition refines this by analyzing a graph's global density structure. This concept has been independently rediscovered in research areas like graph mining, algorithm design, and economics.

Recent advancements in maximum-flow algorithms allow for nearly-linear time computation of the density vector, but they struggle with large real-world graphs. To address this, iterative first-order methods based on convex optimization, such as the Frank-Wolfe algorithm and momentum-based methods like accelerated FISTA, have been developed to approximate density vectors.

This work explores density decomposition through market dynamics, where edges represent buyers and nodes represent sellers in a Fisher market model. The iterative proportional response process, which converges to the Fisher market equilibrium, provides an alternative interpretation of density decomposition. In each step, agents allocate resources based on the proportion of benefit received, moving the system toward equilibrium.

Since each proportional response update can be seen as a gradient descent step, we investigate whether momentum methods can speed up convergence. Traditional momentum uses a linear combination of current and previous solutions, whereas our novel exponential momentum variant uses geometric interpolation, aligning better with proportional adjustments. Empirical evaluations on large-scale real-world and synthetic graphs confirm the effectiveness of our methods. Notably, the proportional response algorithm with exponential momentum outperforms existing methods, delivering improvements by several orders of magnitude in some cases. This advancement is significant, as the resulting density vector is crucial for many downstream tasks in graph mining and algorithm design.
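To make the contrast concrete, one plausible form of the two updates on an allocation vector x (an illustration consistent with the description above; the paper's exact update rule may differ) is:

```latex
\text{linear momentum:}\quad  y^{(t)} = x^{(t)} + \beta\,\bigl(x^{(t)} - x^{(t-1)}\bigr),
\qquad
\text{exponential (geometric) momentum:}\quad
 y^{(t)} = x^{(t)}\Bigl(\tfrac{x^{(t)}}{x^{(t-1)}}\Bigr)^{\beta}.
```

In other words, geometric interpolation is linear momentum applied in log-space, which keeps the extrapolated iterate positive and multiplicative, in line with the proportional-response update itself.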

Femur: A Flexible Framework for Fast and Secure Querying from Public Key-Value Store

With increasing demands for privacy, it becomes necessary to protect sensitive user query data when accessing public key-value databases. Existing Private Information Retrieval (PIR) schemes provide full security but suffer from poor scalability, limiting their applicability in large-scale deployments. We argue that in many real-world scenarios, a more practical solution should allow users to flexibly determine the privacy levels of their queries in a theoretically guided way, balancing security and performance based on specific needs. To provide provable guarantees, we introduce a novel concept of distance-based indistinguishability, which allows users to comfortably relax their security requirements. We then design Femur, an efficient framework to securely query public key-value stores with flexible security and performance trade-offs. It uses a space-efficient learned index to convert query keys into storage locations, obfuscates these locations with extra noise provably derived from the distance-based indistinguishability theory, and sends the expanded range to the server. The server then adaptively selects the best scheme to retrieve data. We also propose a novel variable-range PIR scheme optimized for bandwidth-constrained environments. Experiments show that Femur outperforms state-of-the-art designs even when ensuring the same full security level. When users are willing to relax their privacy requirements, Femur improves performance by up to 163.9X, demonstrating an effective trade-off between security and performance.
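The sketch below illustrates the client-side obfuscation idea at a high level: a learned-index position estimate is widened into a range whose placement around the target is randomized, and the server only ever sees the range. The noise mechanism, parameter names, and data here are hypothetical simplifications, not Femur's actual scheme.

```python
import bisect, random

def client_build_range(sorted_keys, key, max_error, privacy_width):
    pos = bisect.bisect_left(sorted_keys, key)       # stand-in for the learned index's position estimate
    offset = random.randint(0, privacy_width)        # randomly place the target inside the widened range
    lo = max(0, pos - max_error - offset)
    hi = min(len(sorted_keys), pos + max_error + (privacy_width - offset) + 1)
    return lo, hi                                    # the server only ever sees [lo, hi)

def server_answer(store, lo, hi):
    return store[lo:hi]                              # server returns the whole requested range

keys = sorted(random.sample(range(100_000), 1_000))
store = [(k, f"value-{k}") for k in keys]
target = keys[500]
lo, hi = client_build_range(keys, target, max_error=4, privacy_width=64)
rows = server_answer(store, lo, hi)
print(len(rows), any(k == target for k, _ in rows))  # the target row is always inside the returned range
```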

Finding Logic Bugs in Graph-processing Systems via Graph-cutting

Graph-processing systems, including Graph Database Management Systems (GDBMSes) and graph libraries, are designed to analyze and manage graph data efficiently. They are widely used in applications such as social networks, recommendation systems, and fraud detection. However, logic bugs in these systems can lead to incorrect results, compromising the reliability of applications. While recent research has explored testing techniques specialized for GDBMSes, it is unclear how to adapt them to graph-processing systems in general. This paper proposes Graph-cutting, a universal approach for detecting logic bugs in both GDBMSes and various algorithms in graph libraries. Our key idea is inspired by the observation that certain graph patterns are critical for various graph-processing tasks. Dividing graph data into subgraphs that preserve those patterns establishes a natural relationship between query results on the original graph and its subgraphs, allowing for the detection of logic bugs when this relationship is violated. We implemented Graph-cutting as a tool, GSlicer, and evaluated it on 3 popular graph-processing systems, NetworkX, Neo4j, and Kùzu. GSlicer detected 39 unique and previously unknown bugs, out of which 34 have been fixed and confirmed by developers. At least 8 logic bugs detected by GSlicer cannot be detected by baseline strategies. Additionally, by leveraging just a few concrete relationships, Graph-cutting can cover over 100 APIs in NetworkX. We expect this technique to be widely applicable and that it can be used to improve the quality of graph-processing systems broadly.
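A minimal illustration of the underlying metamorphic idea (not GSlicer's actual cutting rules): splitting a graph into its connected components preserves triangles, so a triangle count over the whole graph must equal the sum over the pieces, and a violation would flag a logic bug in the system under test.

```python
import networkx as nx

def triangle_count(g):
    return sum(nx.triangles(g).values()) // 3        # each triangle is counted at its 3 nodes

G = nx.gnm_random_graph(200, 400, seed=1)
whole = triangle_count(G)
pieces = sum(triangle_count(G.subgraph(c).copy()) for c in nx.connected_components(G))
assert whole == pieces, "metamorphic relation violated: possible logic bug"
print(whole, pieces)
```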

Galley: Modern Query Optimization for Sparse Tensor Programs

The tensor programming abstraction is a foundational paradigm which allows users to write high performance programs via a high-level imperative interface. Recent work on sparse tensor compilers has extended this paradigm to sparse tensors (i.e., tensors where most entries are not explicitly represented). With these systems, users define the semantics of the program and the algorithmic decisions in a concise language that can be compiled to efficient low-level code. However, these systems still require users to make complex decisions about program structure and memory layouts to write efficient programs.

This work presents Galley, a system for declarative tensor programming that allows users to write efficient tensor programs without making complex algorithmic decisions. Galley is the first system to perform cost-based lowering of sparse tensor algebra to the imperative language of sparse tensor compilers, and the first to optimize arbitrary operators beyond Σ and *. First, it decomposes the input program into a sequence of aggregation steps through a novel extension of the FAQ framework. Second, Galley optimizes and converts each aggregation step to a concrete program, which is compiled and executed with a sparse tensor compiler. We show that Galley produces programs that are 1-300x faster than competing methods for machine learning over joins and 5-20x faster than a state-of-the-art relational database for subgraph counting workloads, with minimal optimization overhead.

GPH: An Efficient and Effective Perfect Hashing Scheme for GPU Architectures

Hash tables are widely used to support fast lookup operations for various applications on key-value stores and relational databases. In recent years, hash tables have been significantly improved by utilizing the high memory bandwidth and large degree of parallelism offered by Graphics Processing Units (GPUs). However, there is still a lack of comprehensive analysis of the lookup performance of existing GPU-based hash tables. In this work, we develop a micro-benchmark and devise an effective and general performance analysis model, which enables uniform and accurate lookup performance evaluation of GPU-based hash tables. Moreover, we propose GPH, a novel GPU-based hash table that improves lookup performance guided by the benchmark results from our analysis model. In particular, GPH employs a perfect hashing scheme that ensures exactly one bucket probe for every lookup operation. In addition, we optimize the bucket requests to global memory in GPH by devising vectorization and instruction-level parallelism techniques. We also introduce an insert kernel in GPH to support dynamic updates (e.g., processing insert operations) on the GPU. Experimentally, GPH achieves over 8500 million operations per second (MOPS) for lookup processing on both synthetic and real-world workloads, outperforming all evaluated GPU-based hash tables.
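As a toy illustration of the perfect-hashing property GPH relies on (exactly one bucket probe per lookup), the sketch below searches for a hash seed that maps a small static key set collision-free into a quadratically sized table, FKS-style. GPH's GPU construction and memory layout are far more sophisticated; the key names here are hypothetical.

```python
def build_perfect_table(keys):
    size = max(1, len(keys) ** 2)              # quadratic space makes a collision-free seed likely (FKS-style)
    for seed in range(1_000_000):              # try seeds until no two keys collide
        slots = {}
        for k in keys:
            b = hash((seed, k)) % size
            if b in slots:
                break
            slots[b] = k
        else:                                  # no collision: a perfect seed was found
            table = [None] * size
            for b, k in slots.items():
                table[b] = k
            return seed, table
    raise RuntimeError("no perfect seed found; grow the table")

def lookup(seed, table, key):
    b = hash((seed, key)) % len(table)         # exactly one bucket probe
    return table[b] == key

keys = [f"sensor-{i}" for i in range(16)]
seed, table = build_perfect_table(keys)
print(all(lookup(seed, table, k) for k in keys))   # True: every key is found with a single probe
```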

Rule-Based Graph Cleaning with GPUs on a Single Machine

This paper studies cost-effective graph cleaning with a single machine. We adopt a rule-based method that may embed machine learning models as predicates in the rules. Graph cleaning with the rules involves rule discovery, error detection, and error correction. These tasks are both computation-heavy and I/O-intensive, as they repeatedly invoke costly graph pattern matching and produce a large volume of intermediate results, among other things. In light of this, no existing single-machine system is able to carry out these tasks even on moderately large graphs, even using GPUs. Thus we develop MiniClean, a single-machine system for cleaning large graphs. It proposes (1) a workflow that better fits a single machine by pipelining CPU, GPU, and I/O operations; (2) memory footprint reduction with bundled processing and data compression; and (3) a multi-mode parallel model for SIMD, pipelined, and independent parallelism, together with their scheduling, to maximize CPU-GPU synergy. Using real-life graphs, we empirically verify that MiniClean outperforms state-of-the-art single-machine systems by at least 65.34× and multi-machine systems with 32 nodes by at least 8.09×.

Graph Edit Distance Estimation: A New Heuristic and A Holistic Evaluation of Learning-based Methods

Graph edit distance (GED) is an important metric for measuring the distance or similarity between two graphs. It is defined as the minimum number of edit operations required to transform one graph into another. Computing the exact GED between two graphs is an NP-hard problem. With the success of deep learning across various application domains, graph neural networks have also recently been utilized to predict the GED between graphs. However, the existing studies on learning-based methods have two significant limitations. (1) The development of deep learning models for GED prediction has been explored in various research fields (e.g., databases, machine learning, information retrieval, and computer vision), yet cross-field evaluations have been quite limited. (2) More importantly, all these advancements have been evaluated against a simple combinatorial heuristic baseline, with their models shown to outperform it. In this paper, we aim to bridge this knowledge gap. We first conduct a holistic review of the existing learning-based methods, categorizing them into non-interpretable and interpretable GED prediction approaches, while highlighting their overarching design principles and the relationships among these models. Secondly, we present a simple yet effective combinatorial heuristic algorithm, App-BMao, for GED estimation, adapted from an existing exact GED computation algorithm. App-BMao provides interpretable GED estimation with controlled time and space complexity. Extensive empirical evaluations on three widely used datasets show that the new heuristic algorithm App-BMao outperforms all existing learning-based approaches for both interpretable and non-interpretable GED prediction.

GTX: A Write-Optimized Latch-free Graph Data System with Transactional Support

This paper introduces GTX, a standalone main-memory write-optimized graph data system that specializes in structural and graph property updates while enabling concurrent reads and graph analytics through ACID transactions. Recent graph systems target concurrent read and write support while guaranteeing transactional semantics. However, their performance suffers under updates with real-world temporal locality over the same vertices and edges, due to vertex-centric lock contention. GTX employs an adaptive delta-chain locking protocol on top of a carefully designed latch-free graph storage. It eliminates vertex-level locking contention and adapts to real-life workloads while maintaining sequential access to the graph's adjacency-list storage. GTX's transactions further support cache-friendly block-level concurrency control, cooperative group commit, and garbage collection. This combination of features ensures high update throughput and provides low-latency graph analytics. In our experimental evaluation, GTX does not sacrifice the performance of read-heavy analytical workloads, remains competitive with state-of-the-art systems, and achieves high read-write transaction throughput. For write-heavy transactional workloads, GTX achieves up to 11X higher transaction throughput than the best-performing state-of-the-art system.

High-Throughput Ingestion for Video Warehouse: Comprehensive Configuration and Effective Exploration

The innovative concept of Video Extract-Transform-Load (V-ETL), recently proposed in Skyscraper, reinterprets large-scale video analytics as a data warehousing problem. In this study, we aim to enable real-time, high-throughput ingestion of hundreds of video streams and to maximize overall accuracy by constructing a proper ingestion plan for each video stream. To achieve this goal, we construct a comprehensive configuration space that takes into account the configurable components of the entire ingestion pipeline, including numeric parameters and categorical options such as visual inference model selection. The new space is 10^7 times larger than those of existing approaches, rendering them sub-optimal points in our space. To effectively explore this huge and heterogeneous configuration space, we devise an accuracy-aware search strategy based on graph embedding and reinforcement learning to establish the runtime-quality Pareto frontier. To reduce the configuration exploration cost across all video streams, we cluster video streams with similar contexts and adopt mixed integer programming to maximize overall ingestion accuracy while ensuring the real-time ingestion requirement. In an experimental evaluation with one NVIDIA GeForce RTX 4090 GPU, our system Hippo supports real-time ingestion of 300 video streams and achieves an ingestion accuracy that exceeds its competitors by more than 30%.

HoneyComb: A Parallel Worst-Case Optimal Join on Multicores

To achieve true scalability on massive datasets, a modern query engine needs to be able to take advantage of large, shared-memory, multicore systems. Binary joins are conceptually easy to parallelize on a multicore system; however, several applications require a different approach to query evaluation, using a Worst-Case Optimal Join (WCOJ) algorithm. WCOJ is known to outperform traditional query plans for cyclic queries. However, there is no obvious adaptation of WCOJ to parallel architectures. The few existing systems that parallelize WCOJ do so by partitioning only the top variable of the WCOJ algorithm. This leads to work skew (since some relations end up being read entirely by every thread), possible contention between threads (when the hierarchical trie index is built lazily, which is the case in most recent WCOJ systems), and exacerbates the redundant computations already present in WCOJ.

HotStuff-1: Linear Consensus with One-Phase Speculation

This paper introduces HotStuff-1, a BFT consensus protocol that improves the latency of HotStuff by two network hops while maintaining linear communication complexity in the presence of faults. Furthermore, HotStuff-1 incorporates an incentive-compatible leader rotation design that motivates leaders to propose transactions promptly. HotStuff-1 achieves the reduction of two network hops by speculatively sending clients early finality confirmations after one phase of the protocol. Introducing speculation into streamlined protocols is challenging because, unlike stable-leader protocols, these protocols cannot stop the consensus and recover from failures. Thus, we identify the prefix speculation dilemma in the context of streamlined protocols; HotStuff-1 is the first streamlined protocol to resolve it. HotStuff-1 embodies an additional mechanism, slotting, that thwarts delays caused by (1) rationally incentivized leaders and (2) malicious leaders inclined to sabotage others' progress. The slotting mechanism allows leaders to dynamically drive as many decisions as network transmission delays allow before view timers expire, thus mitigating both threats.

How Good are Learned Cost Models, Really? Insights from Query Optimization Tasks

Traditionally, query optimizers rely on cost models to choose the best execution plan from several candidates, making precise cost estimates critical for efficient query execution. In recent years, cost models based on machine learning have been proposed to overcome the weaknesses of traditional cost models. While these models have been shown to provide better prediction accuracy, only limited efforts have been made to investigate how well Learned Cost Models (LCMs) actually perform in query optimization and how they affect overall query performance. In this paper, we address this by a systematic study evaluating LCMs on three of the core query optimization tasks: join ordering, access path selection, and physical operator selection. In our study, we compare seven state-of-the-art LCMs to a traditional cost model and, surprisingly, find that the traditional model often still outperforms LCMs in these tasks. We conclude by highlighting major takeaways and recommendations to guide future research toward making LCMs more effective for query optimization.

How to Grow an LSM-tree? Towards Bridging the Gap Between Theory and Practice

LSM-tree based key-value stores are widely adopted as the data storage backend in modern big data applications. The LSM-tree grows with data ingestion, either by adding levels with fixed level capacities (dubbed the vertical scheme) or by increasing level capacities with a fixed number of levels (dubbed the horizontal scheme). The vertical scheme leads the trend in recent system designs such as RocksDB, LevelDB, and WiredTiger, whereas adoption of the horizontal scheme in industry is declining. The growth scheme profoundly impacts LSM system performance in various respects, such as read, write, and space costs. This paper attempts to give new insight into a fundamental design question -- how should an LSM-tree grow to attain more desirable performance?

Our analysis highlights the limitations of the vertical scheme in achieving an optimal read-write trade-off and the horizontal scheme in managing space cost effectively. Building on the analysis, we present a novel approach, Vertiorizon, which combines the strengths of both the vertical and horizontal schemes to achieve a superior balance between lookup, update, and space costs. Its adaptive design makes it highly compatible with a wide spectrum of workloads. Compared to the vertical scheme, Vertiorizon significantly improves the read-write performance trade-off. In contrast to the horizontal scheme, Vertiorizon greatly extends the trade-off range by a non-trivial generalization of Bentley and Saxe's theory, while substantially reducing space costs. When integrated with RocksDB, Vertiorizon demonstrates better write performance than the vertical scheme, while incurring about six times less additional space cost compared to the horizontal scheme.
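A toy model of the two growth schemes discussed above (an illustration, not the paper's cost model): the vertical scheme fixes the size ratio and lets the number of levels grow, while the horizontal scheme fixes the number of levels and lets the per-level ratio grow.

```python
import math

def vertical_levels(n_entries, buffer_entries, T=10):
    """Vertical scheme: fixed size ratio T; the number of levels grows with the data."""
    ratio = max(1.0, n_entries / buffer_entries)
    levels = max(1, math.ceil(math.log(ratio, T)))
    return levels, T

def horizontal_levels(n_entries, buffer_entries, L=3):
    """Horizontal scheme: fixed number of levels L; the per-level ratio grows with the data."""
    ratio = max(1.0, n_entries / buffer_entries)
    T = max(2.0, ratio ** (1.0 / L))
    return L, T

for n in (10**6, 10**8, 10**10):
    print(n, vertical_levels(n, buffer_entries=10**5), horizontal_levels(n, buffer_entries=10**5))
```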

Aero: Adaptive Query Processing of ML Queries

Query optimization is critical in relational database management systems (DBMSs) for ensuring efficient query processing. The query optimizer relies on precise selectivity and cost estimates to generate optimal query plans for execution. However, this static query optimization approach falls short for DBMSs handling machine learning (ML) queries. ML-centric DBMSs face distinct challenges in query optimization. First, performance bottlenecks shift to user-defined functions (UDFs), often encapsulating deep learning models, making it difficult to estimate UDF statistics without profiling the query. Second, optimal query plans for ML queries are data-dependent, requiring dynamic plan adjustments during execution.

To address these challenges, we introduce Aero, an ML-centric DBMS that utilizes adaptive query processing (AQP) for efficiently processing ML queries. Aero optimizes the evaluation of UDF-based query predicates by dynamically adjusting predicate evaluation order and enhancing UDF execution scalability. By integrating AQP, Aero continuously monitors UDF statistics, routes data to predicates in an optimal order, and dynamically allocates resources for evaluating predicates. Aero achieves up to 6.4x speedup compared to a state-of-the-art ML-centric DBMS across four diverse use cases, with no impact on accuracy.

Incremental Rule Discovery in Response to Parameter Updates

This paper studies incremental rule discovery. Given a dataset D, rule discovery mines the set of rules on D whose supports and confidences are above thresholds σ and δ, respectively. We formulate incremental problems in response to updates Δσ and/or Δδ, to compute the rules added and/or removed with respect to σ + Δσ and δ + Δδ. The need for studying these problems is evident, since practitioners often want to adjust their support and confidence thresholds during discovery. The objective is to minimize unnecessary recomputation during the adjustments rather than restarting the costly discovery process from scratch. As a testbed, we consider entity enhancing rules, which subsume popular data quality rules as special cases. We develop three incremental algorithms, in response to Δσ, Δδ, and both. We show that, relative to a batch discovery algorithm, these algorithms are bounded, i.e., they incur the minimum cost among all incrementalizations of the batch algorithm, and parallelly scalable, i.e., they are guaranteed to reduce runtime when given more processors. Using real-life data, we empirically verify that the incremental algorithms outperform the batch counterpart by up to 658× when Δσ and Δδ are either positive or negative.

Integral Densest Subgraph Search on Directed Graphs

The densest subgraph (DS) search over a directed graph focuses on finding the subgraph with the highest density among all subgraphs. This problem has numerous applications, such as fraud detection and community detection. The state-of-the-art DS algorithms have prohibitively high costs or poor approximation ratios, making them unsuitable for practical applications. To address these dilemmas, in this paper, we propose a novel model called integral densest subgraph (IDS). We show that IDS can serve as a near-DS model that has a tight floor relationship with the density of the DS. To compute the IDS, we first propose a novel flow network named the (α,β)-dense network, based on which we design an exact network-flow algorithm GetIDS with O(p · log|V| · |E|^1.5) time complexity, where p is typically a small constant in real-world graphs. Additionally, we propose several non-trivial pruning techniques to further improve efficiency. Subsequently, we propose a novel (2 + ε)-approximation algorithm MultiCore with near-linear time complexity, providing a good approximation guarantee with high efficiency. Finally, our extensive experiments on 10 real-world graphs demonstrate the effectiveness of the proposed IDS model and the high efficiency and scalability of the proposed solutions.

Interactive Graph Search Made Simple

Interactive graph search (IGS) has proven to be a useful information retrieval paradigm in a diverse set of applications. Robust IGS algorithms are notoriously difficult to design because they are deeply rooted in graph theory. The current state-of-the-art algorithms either fail to achieve an optimal number of interaction rounds or rely on interfaces demanding tedious user inputs. Furthermore, previous research has paid little attention to the underlying computation bottleneck, which is currently dealt with using primitive implementations. This work remedies the above issues altogether. Utilizing novel findings on the problem characteristics, we develop an algorithmic framework for IGS that requires a designer to fill in the details for only two ''black-box'' operations. Our framework, when instantiated with surprisingly simple black-box implementations, yields optimal algorithms not only in all the scenarios explored before but also in new scenarios never studied. We accompany our framework, designed to minimize interaction rounds, with a new algorithm designed to reduce the CPU time complexity significantly. Extensive experiments on both real and synthetic data confirm both the efficacy and efficiency of the proposed techniques.

Intra-Query Runtime Elasticity for Cloud-Native Data Analysis

We propose the concept of Intra-Query Runtime Elasticity (IQRE) for cloud-native data analysis. IQRE enables a cloud-native OLAP engine to dynamically adjust a query's Degree of Parallelism (DOP) during execution. This capability allows users to utilize cloud computing resources more cost-effectively. We present Accordion, the first IQRE query engine. Accordion can adjust the parallelism of a query at any point during query execution without pausing data processing. It features a user-friendly interface and an auto-tuner backed by a "what-if" service to allow users to adjust the DOP according to their query latency constraints. The design of Accordion follows the execution model in Presto, an open-source distributed SQL query engine developed at Meta. We present the implementation of Accordion and demonstrate its ease of use, showcasing how it enables users to minimize compute resource consumption while meeting their query time constraints.

Learned Offline Query Planning via Bayesian Optimization

Analytics database workloads often contain queries that are executed repeatedly. Existing optimization techniques generally prioritize keeping optimization cost low, normally well below the time it takes to execute a single instance of a query. If a given query is going to be executed thousands of times, could it be worth investing significantly more optimization time? In contrast to traditional online query optimizers, we propose an offline query optimizer that searches a wide variety of plans and incorporates query execution as a primitive. Our offline query optimizer combines variational auto-encoders with Bayesian optimization to find optimized plans for a given query. We compare our technique to the optimal plans possible with PostgreSQL and recent RL-based systems over several datasets, and show that our technique finds faster query plans.

LICS: Towards Theory-Informed Effective Visual Abstraction of Property Graph Schemas

Property graph schemas are essential for organizing property graph data, serving both prescriptive and descriptive roles. This has led to the recent development of property graph schema languages such as PGSchema. However, understanding these languages requires familiarity with complex syntax, which poses usability challenges, particularly for domain experts who are not programmers. Current visual abstractions, such as the labeled schema graph (LSG), simplify representation but suffer from visual clutter and limited feature support. To address these challenges, we propose a novel, generic, and extensible visual abstraction, the labeled iconized composite schema (LICS), whose design is informed by theories and principles from HCI, cognitive psychology, and visualization. A novel LICS-based visual interface, coined PASCAL, is also proposed to facilitate the visualization of property graph schemas. Under the hood, it leverages the Map-Paint algorithm to create the visual components of LICS. A user study demonstrates that LICS is superior to the traditional LSG abstraction w.r.t. usability, effectiveness, query formulation efficiency, and schema comprehension.

Logical and Physical Optimizations for SQL Query Execution over Large Language Models

Interacting with Large Language Models (LLMs) via declarative queries is increasingly popular for tasks like question answering and data extraction, thanks to their ability to process vast amounts of unstructured data. However, LLMs often struggle with answering complex factual questions, exhibiting low precision and recall in the returned data.

This challenge highlights that executing queries on LLMs remains a largely unexplored domain, where traditional data processing assumptions often fall short. Conventional query optimization, typically cost-driven, overlooks LLM-specific quality challenges such as contextual understanding. Just as new physical operators are designed to address the unique characteristics of LLMs, optimization must consider these quality challenges. Our results highlight that adhering strictly to conventional query optimization principles fails to generate the best plans in terms of result quality.

To tackle this challenge, we present a novel approach to enhance SQL results by applying query optimization techniques specifically adapted for LLMs. We introduce a database system, GALOIS, that sits between the query and the LLM, effectively using the latter as a storage layer. We design alternative physical operators tailored for LLM-based query execution and adapt traditional optimization strategies to this novel context. For example, while pushing down operators in the query plan reduces execution cost (fewer calls to the model), it might complicate the call to the LLM and deteriorate result quality. Additionally, these models lack a traditional catalog for optimization, leading us to develop methods to dynamically gather such metadata during query execution.

Our solution is compatible with any LLM and balances the trade-off between query result quality and execution cost. Experiments show quality improvements of up to 144% over natural-language question answering and 29% over direct SQL execution, highlighting the advantages of integrating database solutions with LLMs.

Low-Latency Transaction Scheduling via Userspace Interrupts: Why Wait or Yield When You Can Preempt?

Traditional non-preemptive scheduling can lead to long latency under workloads that mix long-running and short transactions with varying priorities. This occurs because worker threads tend to monopolize CPU cores until they finish processing long-running transactions. Thus, short transactions must wait for the CPU, leading to long latency. As an alternative, cooperative scheduling allows transactions to yield, but it is difficult to tune for diverse workloads. Although preemption could potentially alleviate this issue, it has seen limited adoption in DBMSs due to the high delivery latency of software interrupts and concerns about wasting useful work under the read-write lock conflicts of traditional lock-based DBMSs.

In this paper, we propose PreemptDB, a new database engine that leverages recent userspace interrupts available in modern CPUs to enable efficient preemptive scheduling. We present an efficient transaction context switching mechanism purely in userspace and scheduling policies that prioritize short, high-priority transactions without significantly affecting long-running queries. Our evaluation demonstrates that PreemptDB significantly reduces end-to-end latency for high-priority transactions compared to non-preemptive FIFO and cooperative scheduling methods.

Low Rank Learning for Offline Query Optimization

Recent deployments of learned query optimizers use expensive neural networks and ad-hoc search policies. To address these issues, we introduce LimeQO, a framework for offline query optimization that leverages low-rank learning to efficiently explore alternative query plans with minimal resource usage. By modeling the workload as a partially observed, low-rank matrix, we predict unobserved query plan latencies using purely linear methods, significantly reducing computational overhead compared to neural networks. We formalize offline exploration as an active learning problem and present simple heuristics that reduce a 3-hour workload to 1.5 hours after just 1.5 hours of exploration. Additionally, we propose a transductive Tree Convolutional Neural Network (TCNN) that, despite higher computational costs, achieves the same workload reduction with only 0.5 hours of exploration. Unlike previous approaches that place expensive neural networks directly in the query processing ''hot'' path, our approach offers a low-overhead solution and a no-regressions guarantee, all without making assumptions about the underlying DBMS.
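A minimal sketch of the low-rank idea (not LimeQO itself): treat the query-by-plan latency matrix as low-rank, observe a fraction of its entries, and complete the rest with alternating least squares so that unobserved plan latencies can be predicted with purely linear algebra. The matrix sizes and observation rate below are arbitrary.

```python
import numpy as np

def als_complete(M, mask, rank=3, lam=1e-2, iters=50, seed=0):
    """Complete a partially observed matrix M (observed where mask is True) with ALS."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U, V = rng.normal(size=(n, rank)), rng.normal(size=(m, rank))
    reg = lam * np.eye(rank)
    for _ in range(iters):
        for i in range(n):                       # refit each row factor on its observed entries
            obs = mask[i]
            if obs.any():
                Vo = V[obs]
                U[i] = np.linalg.solve(Vo.T @ Vo + reg, Vo.T @ M[i, obs])
        for j in range(m):                       # symmetric update for the column factors
            obs = mask[:, j]
            if obs.any():
                Uo = U[obs]
                V[j] = np.linalg.solve(Uo.T @ Uo + reg, Uo.T @ M[obs, j])
    return U @ V.T                               # predicted values for every (row, column) pair

# Synthetic rank-3 "latency" matrix (50 queries x 8 plan hints), 60% of entries observed.
rng = np.random.default_rng(1)
truth = rng.uniform(1, 5, size=(50, 3)) @ rng.uniform(1, 5, size=(3, 8))
mask = rng.random(truth.shape) < 0.6
pred = als_complete(truth, mask)
print(np.abs(pred - truth)[~mask].mean())        # mean absolute error on the held-out entries
```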

LpBound: Pessimistic Cardinality Estimation Using ℓp-Norms of Degree Sequences

Cardinality estimation is the problem of estimating the size of the output of a query, without actually evaluating the query. The cardinality estimator is a critical piece of a query optimizer, and is often the main culprit when the optimizer chooses a poor plan.

This paper introduces LpBound, a pessimistic cardinality estimator for multi-join queries (acyclic or cyclic) with selection predicates and group-by clauses. LpBound computes a guaranteed upper bound on the size of the query output using simple statistics on the input relations, consisting of ℓp-norms of degree sequences. The bound is the optimal solution of a linear program whose constraints encode data statistics and Shannon inequalities. We introduce two optimizations that exploit the structure of the query in order to speed up the estimation time and make LpBound practical.
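As a concrete, standard instance of such degree-sequence bounds, consider a two-way join R(a,b) ⋈ S(b,c); by Cauchy-Schwarz, and by bounding one side's degrees, the output size satisfies

```latex
|R \bowtie S| \;=\; \sum_{b} \deg_R(b)\,\deg_S(b)
  \;\le\; \lVert \deg_R \rVert_2 \, \lVert \deg_S \rVert_2 ,
\qquad
|R \bowtie S| \;\le\; \min\bigl( |R| \cdot \lVert \deg_S \rVert_\infty,\;
                                 |S| \cdot \lVert \deg_R \rVert_\infty \bigr).
```

LpBound generalizes far beyond this two-relation case, handling cyclic multi-joins, selections, and group-bys through a linear program over ℓp-norm statistics and Shannon inequalities.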

We experimentally evaluate LpBound against a range of traditional, pessimistic, and machine learning-based estimators on the JOB, STATS, and subgraph matching benchmarks. Our main finding is that LpBound can be orders of magnitude more accurate than the traditional estimators used in mainstream open-source and commercial database systems, while having comparably low estimation time and space requirements. When injected with LpBound's estimates, Postgres derives query plans at least as good as those derived using the true cardinalities.

Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization

As the scale of models and training data continues to grow, there is an expanding reliance on more GPUs to train large-scale models, which inevitably increases the likelihood of encountering dynamic stragglers, i.e., devices that occasionally lag behind in performance. However, hybrid parallel training, one of the de facto paradigms for training large models, is typically sensitive to stragglers.

This paper presents Malleus, a straggler-resilient hybrid parallel training framework for large-scale models. Malleus quantifies stragglers at a nuanced, per-GPU granularity during training and develops a novel planning algorithm to deduce the optimal parallelization of GPU devices, pipeline stages, model layers, and training data, maximizing training efficiency in the presence of stragglers. In addition, once a shift in the straggler situation is detected, Malleus adaptively adjusts the parallelization via a re-planning process and seamlessly and efficiently migrates the model states on the fly, without sacrificing the stability of the training tasks. Empirical results on large language models with up to 110B parameters show that Malleus consistently outperforms existing parallel training frameworks under various straggler situations, delivering an average efficiency improvement of 2.63-5.28x.

MatCo: Computing Match Cover of Subgraph Query over Graph Data

Subgraph queries are applicable in various scenarios, such as fraud detection and cyberattack pattern analysis. However, computing a subgraph query usually traverses a huge search space, and many efforts have been made to reduce it. The size of the answer set can be exponential, providing a substantial lower bound on the search space. Additionally, different answers may overlap, and a single vertex can occur multiple times in different matches. In this paper, we propose a new problem: computing the match cover of a subgraph query. We define the match cover as a subset of answers whose vertices are exactly the same as those in the entire answer set. There can be more than one match cover; we return only one, as long as we can avoid the huge overhead of searching the entire set. Applying traditional subgraph query methods to compute the match cover is inefficient. Specifically, existing methods do not prune partial matches that could grow into full matches. For match cover computation, if the vertices in those full matches are already included in previously found matches, continuing the computation over such partial matches is a waste of time. We propose a new framework, called MatCo, to compute the match cover. In MatCo, we design a new data structure, called the local candidate space, to determine whether the future search scopes of partial matches have already been covered. The local candidate space is easy to maintain and supports efficient determination. We also reduce some Cartesian products, which are inevitable in existing methods, to linear enumerations, which significantly improves performance. Extensive experiments over various datasets confirm that our method outperforms competing ones by 1-3 orders of magnitude. Efficiently computing the minimum match cover could be an interesting direction for future work.

Maximus: A Modular Accelerated Query Engine for Data Analytics on Heterogeneous Systems

Several trends are changing the underlying fabric for data processing in fundamental ways. On the hardware side, machines are becoming heterogeneous with smart NICs, TPUs, DPUs, etc., but especially with GPUs taking a more dominant role. On the software side, the diversity in workloads, data sources, and data formats has given rise to the notion of composable data processing, where data is processed across a variety of engines and platforms. Finally, on the infrastructure side, different storage types, disaggregated storage, disaggregated memory, networking, and interconnects are all rapidly evolving, which demands a degree of customization to optimize data movement well beyond established techniques. To tackle these challenges, in this paper, we present Maximus, a modular data processing engine that embraces heterogeneity from the ground up. Maximus can run queries on CPUs and GPUs, split execution between CPUs and GPUs, import and export data in a variety of formats, interact with a wide range of query engines through Substrait, and efficiently manage the execution of complex data processing pipelines. Through the concept of operator-level integration, Maximus can use operators from third-party engines and achieve even better performance with these operators than when they are used within their native engines. The current version of Maximus supports all TPC-H queries on both the GPU and the CPU and optimizes the data movement and kernel execution between them, enabling the overlap of communication and computation to achieve performance comparable to that of the best systems available, but with a far higher degree of completeness and flexibility.

MIRAGE-ANNS: Mixed Approach Graph-based Indexing for Approximate Nearest Neighbor Search

Approximate nearest neighbor search (ANNS) on high-dimensional vectors is important for numerous applications, such as search engines, recommendation systems, and more recently, large language models (LLMs), where Retrieval Augmented Generation (RAG) is used to add context to an LLM query. Graph-based indexes built on these vectors have been shown to perform best but have challenges. These indexes can either employ refinement-based construction strategies such as K-Graph and NSG, or increment-based strategies such as HNSW. Refinement-based approaches have fast construction times but worse search performance and do not allow incremental inserts, requiring a full reconstruction each time new vectors are added to the index. Increment-based approaches have good search performance and allow incremental inserts, but suffer from slow construction. This work presents MIRAGE-ANNS (Mixed Incremental Refinement Approach Graph-based Exploration for Approximate Nearest Neighbor Search), which constructs the index as fast as refinement-based approaches while retaining search performance comparable to or better than increment-based ones. It also allows incremental inserts. We show that MIRAGE achieves state-of-the-art construction and query performance, outperforming existing methods by up to 2x in query throughput on real-world datasets.

Mitigating the Impedance Mismatch between Prediction Query Execution and Database Engine

Prediction queries, which apply machine learning (ML) models to analyze data stored in the database, are increasingly prevalent. Current database systems introduce Python UDFs to express prediction queries and call ML frameworks for inference. However, the impedance mismatch between database engines and prediction query execution poses a challenge for query performance. First, the database engine is oblivious to the internal semantics of prediction functions and evaluates each UDF holistically, which incurs repetitive inference context setup. Second, the invocation of prediction functions in the database does not account for the fact that batched inference with a suitable batch size is what achieves high performance in ML frameworks. To mitigate the mismatch, we propose a prediction-aware operator in database engines, which leverages an inference context reuse cache to achieve a one-off, automatic inference context setup and batch-aware function invocation to ensure desirable inference batching. We implement a prototype system, called IMBridge, based on the open-source database OceanBase. Our experiments show that IMBridge achieves a 71.4x average speedup over OceanBase for prediction query execution and significantly outperforms other solutions.

Mnemosyne: Dynamic Workload-Aware BF Tuning via Accurate Statistics in LSM trees

Log-structured merge (LSM) trees typically employ Bloom Filters (BFs) to prevent unnecessary disk accesses for point queries. The size of BFs can be tuned to navigate a memory vs. performance tradeoff. State-of-the-art memory allocation strategies use a worst-case model for point lookup cost to derive a closed-form solution. However, existing approaches have three limitations: (1) the number of key-value pairs to be ingested must be known a priori, (2) the closed-form solution only works for a perfectly shaped LSM tree, and (3) the model assumes a uniform query distribution. Due to these limitations, the available memory budget for BFs is sub-optimally utilized, especially when the system is under memory pressure (i.e., less than 7 bits per key).
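For reference, the standard Bloom-filter model that underlies such memory allocation (not Mnemosyne's full cost model, which additionally weighs per-file access statistics) relates the false-positive rate to the bits-per-key budget as

```latex
\mathrm{FPR} \;\approx\; \bigl(1 - e^{-kn/m}\bigr)^{k},
\qquad
k^{*} \;=\; \tfrac{m}{n}\ln 2
\;\;\Longrightarrow\;\;
\mathrm{FPR}^{*} \;\approx\; 0.6185^{\,m/n},
```

where n is the number of keys in a run, m the bits allocated to its filter, and k the number of hash functions; under memory pressure (small m/n), every saved bit matters, which is why accurate per-file access statistics become important.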

In this paper, we design Mnemosyne, a BF reallocation framework for evolving LSM trees that does not require prior workload knowledge. We use a more general query cost model that considers the access pattern per file, and we find that no system accurately maintains access statistics per file, and that simply maintaining a counter per file significantly deviates from the ground truth for evolving LSM trees. To address this, we propose Merlin, a dynamic sliding-window-based tracking mechanism that accurately captures these statistics. The upgraded Mnemosyne^+ combines Merlin with our new cost model. In our evaluation, Mnemosyne reduces query latency by up to 20% compared to RocksDB under memory pressure, and Mnemosyne^+ further improves throughput by another 10% when workloads exhibit higher skew.

Moving on From Group Commit: Autonomous Commit Enables High Throughput and Low Latency on NVMe SSDs

Achieving both high throughput and low commit latency has long been a difficult challenge for Database Management Systems (DBMSs). As we show in this paper, existing commit processing protocols fail to fully leverage modern NVMe SSDs to deliver both high throughput and low-latency durable commits. We therefore propose autonomous commit, the first commit protocol that fully utilizes modern NVMe SSDs to achieve both objectives. Our approach exploits the high parallelism and low write latency of SSDs, enabling workers to explicitly write logs in smaller batches, thereby minimizing the impact of logging I/O on commit latency. Additionally, by parallelizing the acknowledgment procedure, where the DBMS iterates through a set of transactions to inspect their commit state, we mitigate excessive delays resulting from single-threaded commit operations in high-throughput workloads. Our experimental results show that autonomous commit achieves exceptional scalability and low-latency durable commits across a wide range of workloads.

Nested Parquet Is Flat, Why Not Use It? How To Scan Nested Data With On-the-Fly Key Generation and Joins

Parquet is the most commonly used file format to store data in a columnar, binary structure. The format also supports storing nested data in this flattened columnar layout. However, many query engines either do not support nested data or process it with substantially worse performance than relational data.

In this work, we close this gap and present a new way to leverage relational query engines for nested data that is stored in this flat columnar file format. Specifically, we demonstrate how to process nested Parquet files much more efficiently. Our approach does not store a copy of the data in an internal format but reads directly from the Parquet file. During query computation, the required flat columns are scanned independently and the nesting is reconstructed using joins with on-the-fly generated join keys.

Our approach can be easily integrated into existing query engines to support querying nested Parquet files. Furthermore, we achieve orders of magnitude faster analytical query performance than existing solutions, which makes it a valuable addition.
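A toy illustration of the scan strategy described above, with hypothetical column names and data (the actual reconstruction works from Parquet's repetition and definition levels): read the flat leaf columns independently, generate surrogate join keys on the fly, and re-nest via a join.

```python
from collections import defaultdict

# Nested source: orders with a repeated field "items".
orders_nested = [
    {"oid": 1, "items": [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}]},
    {"oid": 2, "items": [{"sku": "c", "qty": 5}]},
]

# Step 1: the flat columnar view -- a parent table and a child table whose join key
# (the parent's row index) is generated during the scan, never stored in the file.
parent = [{"_row": i, "oid": o["oid"]} for i, o in enumerate(orders_nested)]
child  = [{"_parent_row": i, "sku": it["sku"], "qty": it["qty"]}
          for i, o in enumerate(orders_nested) for it in o["items"]]

# Step 2: a relational engine can now answer queries over the nested data with a join,
# e.g. total quantity per order.
totals = defaultdict(int)
for c in child:
    totals[c["_parent_row"]] += c["qty"]
print([{"oid": p["oid"], "total_qty": totals.get(p["_row"], 0)} for p in parent])
```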

NEXT: A New Secondary Index Framework for LSM-based Data Storage

Key-value databases with Log-Structured Merge (LSM) trees are increasingly favored by modern applications. Apart from fast lookups on the primary key, efficient queries on non-key attributes are also in high demand by many of these applications. To enhance query performance, many auxiliary structures, such as secondary indexes and filters, have been developed. However, existing auxiliary structures suffer from three limitations. First, creating a filter for every disk component yields low lookup efficiency, as all components need to be searched during query processing. Second, current secondary index designs require primary table accesses to fetch the data entry for each primary key output by the index. This indirect entry-fetching process involves significant point lookup overhead in the primary table and hence hinders query performance. Last, maintaining consistency between the secondary index and the primary table is challenging due to the out-of-place update mechanism of the LSM-tree. To overcome these limitations of existing auxiliary structures for non-key attribute queries, this paper proposes NEXT, a novel secondary index framework for LSM-based key-value storage systems. NEXT utilizes a two-level structure that is integrated with the primary table. In particular, NEXT creates secondary index blocks on each LSM disk component to map secondary attributes to their corresponding data blocks. In addition, NEXT introduces a global index component created on top of all secondary index blocks to direct secondary index operations to the target secondary index blocks. Finally, NEXT adopts two optimization strategies to further improve query performance. We implement NEXT on RocksDB and experimentally evaluate its performance against existing methods. Experiments on both static and mixed workloads demonstrate that NEXT outperforms existing methods for different types of non-key attribute queries.

OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment

Although multi-agent collaborative Large Language Models (LLMs) have achieved significant breakthroughs in the Text-to-SQL task, their performance is still constrained by various factors. These factors include the incompleteness of the framework, failure to follow instructions, and model hallucinations. To address these problems, we propose OpenSearch-SQL, which divides the Text-to-SQL task into four main modules: Preprocessing, Extraction, Generation, and Refinement, along with an Alignment module based on a consistency alignment mechanism. This architecture aligns the inputs and outputs of agents through the Alignment module, reducing instruction-following failures and hallucinations. Furthermore, we introduce SQL-Like (an intermediate language), optimize the structured Chain-of-Thought (CoT) based on SQL-Like, and develop a dynamic few-shot strategy via self-taught Query-CoT-SQL.

In terms of model selection, we directly applied the base LLMs without any post-training, thereby simplifying the task chain and enhancing the framework's portability. Experimental results show that OpenSearch-SQL achieves an execution accuracy (EX) of 69.3% on the BIRD development set, 72.28% on the test set, and a reward-based validity efficiency score (R-VES) of 69.36%, with all three metrics ranking first at the time of submission. These results demonstrate the comprehensive advantages of the proposed method in both effectiveness and efficiency.

Parallel k-Core Decomposition: Theory and Practice

This paper proposes efficient solutions for k-core decomposition with high parallelism. The problem of k-core decomposition is fundamental in graph analysis and has applications across various domains. However, existing algorithms face significant challenges in achieving work-efficiency in theory and/or high parallelism in practice, and suffer from various performance bottlenecks. We present a simple, work-efficient parallel framework for k-core decomposition that is easy to implement and adaptable to various strategies for improving work-efficiency. We introduce two techniques to enhance parallelism: a sampling scheme to reduce contention on high-degree vertices, and vertical granularity control (VGC) to mitigate scheduling overhead for low-degree vertices. Furthermore, we design a hierarchical bucket structure to optimize performance for graphs with high coreness values. We evaluate our algorithm on a diverse set of real-world and synthetic graphs. Compared to state-of-the-art parallel algorithms, including ParK, PKC, and Julienne, our approach demonstrates superior performance on 23 out of 25 graphs when tested on a 96-core machine. Our algorithm shows speedups of up to 315× over ParK, 33.4× over PKC, and 52.5× over Julienne.
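
The peeling process that such parallel frameworks build on is easy to state; the sketch below is a plain sequential version for illustration only (a made-up toy graph, not the paper's work-efficient parallel algorithm).

```python
from collections import defaultdict
import heapq

def core_numbers(adj):
    """Classic sequential peeling: repeatedly remove a minimum-degree vertex;
    the coreness of v is its degree at removal time, taken as a running maximum."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    heap = [(d, v) for v, d in degree.items()]
    heapq.heapify(heap)
    removed, core, current = set(), {}, 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != degree[v]:
            continue                      # stale heap entry
        current = max(current, d)
        core[v] = current
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                degree[u] -= 1
                heapq.heappush(heap, (degree[u], u))
    return core

# Tiny example: a triangle (1,2,3) with a pendant vertex 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(core_numbers(adj))   # {4: 1, 1: 2, 2: 2, 3: 2}
```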

PDX: A Data Layout for Vector Similarity Search

We propose Partition Dimensions Across (PDX), a data layout for vectors (e.g., embeddings) that, similar to PAX [6], stores multiple vectors in one block, using a vertical layout for the dimensions (Figure 1). PDX accelerates exact and approximate similarity search thanks to its dimension-by-dimension search strategy that operates on multiple vectors at a time in tight loops. It beats SIMD-optimized distance kernels on standard horizontal vector storage (on average 40% faster), relying only on scalar code that gets auto-vectorized. We combined the PDX layout with the recent dimension-pruning algorithms ADSampling [19] and BSA [52], which accelerate approximate vector search. We found that, on the horizontal vector layout, these algorithms can lose to SIMD-optimized linear scans even when they are themselves SIMD-optimized; when used on PDX, however, their benefit is restored to 2-7x. We find that search on PDX is especially fast if only a limited number of dimensions has to be scanned fully, which is exactly what the dimension-pruning approaches do. We finally introduce PDX-BOND, an even more flexible dimension-pruning strategy, with good performance on exact search and reasonable performance on approximate search. Unlike previous pruning algorithms, it can work on vector data "as-is" without preprocessing, making it attractive for vector databases with frequent updates.
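
The following numpy sketch (assumed shapes and random data; not the PDX code) illustrates the dimension-by-dimension distance accumulation that a vertical, dimension-major block layout enables, and where dimension-pruning methods could stop early.

```python
# Conceptual sketch: with a block of vectors stored dimension-major, partial squared L2
# distances to a query are accumulated one dimension at a time over all vectors in the
# block, in tight loops that compilers can auto-vectorize.
import numpy as np

rng = np.random.default_rng(0)
n_vectors, n_dims = 64, 128
block = rng.standard_normal((n_dims, n_vectors))   # vertical layout: one row per dimension
query = rng.standard_normal(n_dims)

partial = np.zeros(n_vectors)
for d in range(n_dims):                            # dimension-by-dimension accumulation
    diff = block[d] - query[d]
    partial += diff * diff
    # a pruning algorithm could discard, at this point, vectors whose partial distance
    # already exceeds the current search threshold

# sanity check against the usual horizontal (vector-at-a-time) computation
assert np.allclose(partial, ((block.T - query) ** 2).sum(axis=1))
```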

Physical Visualization Design: Decoupling Interface and System Design

Interactive visualization interfaces enable users to efficiently explore, analyze, and make sense of their datasets. However, as data grows in size, it becomes increasingly challenging to build data interfaces that meet the interface designer's desired latency expectations and resource constraints. Cloud DBMSs, while optimized for scalability, often fail to meet latency expectations, necessitating complex, bespoke query execution and optimization techniques for data interfaces. This involves manually navigating a huge optimization space that is sensitive to interface design and resource constraints, such as client vs server data and compute placement, choosing which computations are done offline vs online, and selecting from a large library of visualization-optimized data structures.

This paper advocates for a Physical Visualization Design (PVD) tool that decouples interface design from system design to provide design independence. Given an interface's underlying data flow, interactions with latency expectations, and resource constraints, PVD checks whether the interface is feasible and, if so, proposes and instantiates a middleware architecture spanning the client, server, and cloud DBMS that meets those expectations.

To this end, this paper presents Jade, the first prototype PVD tool that enables design independence. Jade proposes an intermediate representation called Diffplans to represent the data flows, develops cost estimation models that trade off latency guarantees against plan feasibility, and implements an optimization framework to search for the middleware architecture that meets the guarantees. We evaluate Jade on six representative data interfaces, comparing it to Mosaic and Azure SQL Database. We find that Jade supports a wider range of interfaces, makes better use of available resources, and can meet a wider range of data, latency, and resource conditions.

PilotDB: Database-Agnostic Online Approximate Query Processing with A Priori Error Guarantees

After decades of research in approximate query processing (AQP), its adoption in the industry remains limited. Existing methods struggle to simultaneously provide user-specified error guarantees, eliminate maintenance overheads, and avoid modifications to database management systems. To address these challenges, we introduce two novel techniques, TAQA and BSAP. TAQA is a two-stage online AQP algorithm that achieves all three properties for arbitrary queries. However, it can be slower than exact queries if we use standard row-level sampling. BSAP resolves this by enabling block-level sampling with statistical guarantees in TAQA. We implement TAQA and BSAP in a prototype middleware system, PilotDB, that is compatible with all DBMSs supporting efficient block-level sampling. We evaluate PilotDB on PostgreSQL, SQL Server, and DuckDB over real-world benchmarks, demonstrating up to 126X speedups when running with a 5% guaranteed error.

PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models

Number of Distinct Values (NDV) estimation of a multiset/column is a basis for many data management tasks, especially within databases. Despite decades of research, most existing methods require either a significant amount of samples through uniform random sampling or access to the entire column to produce estimates, leading to substantial data access costs and potentially ineffective estimations in scenarios with limited data access. In this paper, we propose leveraging semantic information, i.e., schema, to address these challenges. The schema contains rich semantic information that can benefit the NDV estimation. To this end, we propose PLM4NDV, a learned method incorporating Pre-trained Language Models (PLMs) to extract semantic schema information for NDV estimation. Specifically, PLM4NDV leverages the semantics of the target column and the corresponding table to gain a comprehensive understanding of the column's meaning. By using the semantics, PLM4NDV reduces data access costs, provides accurate NDV estimation, and can even operate effectively without any data access. Extensive experiments on a large-scale real-world dataset demonstrate the superiority of PLM4NDV over baseline methods. Our code is available at https://github.com/bytedance/plm4ndv.

Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System

Finding relevant tables among databases, lakes, and repositories is the first step in extracting value from data. Such a task remains difficult because assessing whether a table is relevant to a problem does not always depend only on its content but also on the context, which is usually tribal knowledge known to the individual or team. While tools like data catalogs and academic data discovery systems target this problem, they rely on keyword search or more complex interfaces, limiting non-technical users' ability to find relevant data. The advent of large language models (LLMs) offers a unique opportunity for users to ask questions directly in natural language, making dataset discovery more intuitive, accessible, and efficient.

In this paper, we introduce Pneuma, a retrieval-augmented generation (RAG) system designed to efficiently and effectively discover tabular data. Pneuma leverages large language models (LLMs) for both table representation and table retrieval. For table representation, Pneuma preserves schema and row-level information to ensure comprehensive data understanding. For table retrieval, Pneuma augments LLMs with traditional information retrieval techniques, such as full-text and vector search, harnessing the strengths of both to improve retrieval performance. To evaluate Pneuma, we generate comprehensive benchmarks that simulate table discovery workloads on six real-world datasets including enterprise data, scientific databases, warehousing data, and open data. Our results demonstrate that Pneuma outperforms widely used table search systems (such as full-text search and state-of-the-art RAG systems) in accuracy and resource efficiency.

PQCache: Product Quantization-based KVCache for Long Context LLM Inference

As the field of Large Language Models (LLMs) continues to evolve, the context length in inference is steadily growing. Key-Value Cache (KVCache), the intermediate representations of tokens within LLM inference, has now become the primary memory bottleneck due to limited GPU memory. Current methods selectively determine suitable keys and values for self-attention computation in LLMs to address the issue. However, they either fall short in maintaining model quality or result in high serving latency. Drawing inspiration from advanced embedding retrieval techniques prevalent in the data management community, we consider the storage and retrieval of KVCache as a typical embedding retrieval problem. We propose PQCache, which employs Product Quantization (PQ) to manage KVCache, maintaining model quality while ensuring low serving latency. During the prefilling phase, we apply PQ to tokens' keys for each LLM layer and head. During the autoregressive decoding phase, we use PQ codes and centroids to approximately identify important preceding tokens, then fetch the corresponding key-value pairs for self-attention computation. Through meticulous design of overlapping and caching, we minimize any additional computation and communication overhead during both phases. Extensive experiments demonstrate that PQCache achieves both effectiveness and efficiency, with 4.60% score improvement over existing methods on InfiniteBench and low system latency in both prefilling and decoding.
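
As a rough illustration of the underlying retrieval idea, the sketch below (random codebooks and hypothetical sizes rather than trained ones; not PQCache itself) encodes key vectors with product quantization and uses per-subspace lookup tables to approximate query-key scores before fetching the top tokens.

```python
# Minimal product-quantization sketch: keys are compressed into short codes, and
# approximate query-key inner products are computed from small lookup tables.
import numpy as np

rng = np.random.default_rng(1)
n_tokens, dim, n_sub, n_centroids = 1000, 64, 8, 16
sub_dim = dim // n_sub

keys = rng.standard_normal((n_tokens, dim))
codebooks = rng.standard_normal((n_sub, n_centroids, sub_dim))  # one codebook per sub-space

# Encode: each key becomes n_sub small centroid ids instead of dim floats.
codes = np.empty((n_tokens, n_sub), dtype=np.uint8)
for s in range(n_sub):
    sub = keys[:, s * sub_dim:(s + 1) * sub_dim]
    dists = ((sub[:, None, :] - codebooks[s][None, :, :]) ** 2).sum(-1)
    codes[:, s] = dists.argmin(axis=1)

# Query time: build a table of partial inner products per sub-space, then sum entries.
query = rng.standard_normal(dim)
tables = np.stack([codebooks[s] @ query[s * sub_dim:(s + 1) * sub_dim]
                   for s in range(n_sub)])              # shape (n_sub, n_centroids)
approx_scores = tables[np.arange(n_sub), codes].sum(axis=1)

top_k = np.argsort(-approx_scores)[:32]   # tokens whose full key-value pairs would be fetched
```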

Practical and Asymptotically Optimal Quantization of High-Dimensional Vectors in Euclidean Space for Approximate Nearest Neighbor Search

Approximate nearest neighbor (ANN) query in high-dimensional Euclidean space is a key operator in database systems. For this query, quantization is a popular family of methods developed for compressing vectors and reducing memory consumption. Among these methods, a recent algorithm called RaBitQ achieves the state-of-the-art performance and provides an asymptotically optimal theoretical error bound. RaBitQ uses 1 bit per dimension for quantization and compresses vectors with a large compression rate. In this paper, we extend RaBitQ to compress vectors with flexible compression rates: it achieves this by using B bits per dimension for quantization with B = 1, 2, ... It inherits the theoretical guarantees of RaBitQ and achieves asymptotic optimality in terms of the trade-off between space and error bounds, as we prove in this study. Additionally, we present efficient implementations of the extended RaBitQ, enabling its application to ANN queries to reduce both space and time consumption. Extensive experiments on real-world datasets confirm that our method consistently outperforms the state-of-the-art baselines in both accuracy and efficiency when using the same amount of memory.

Privacy and Accuracy-Aware AI/ML Model Deduplication

With the growing adoption of privacy-preserving machine learning algorithms, such as Differentially Private Stochastic Gradient Descent (DP-SGD), training or fine-tuning models on private datasets has become increasingly prevalent. This shift has led to the need for models offering varying privacy guarantees and utility levels to satisfy diverse user requirements. Managing numerous versions of large models introduces significant operational challenges, including increased inference latency, higher resource consumption, and elevated costs. Model deduplication is a technique widely used by many model serving and database systems to support high-performance and low-cost inference queries and model diagnosis queries. However, none of the existing model deduplication works has considered privacy, leading to unbounded aggregation of privacy costs for certain deduplicated models and inefficiencies when applied to deduplicate DP-trained models.

We formalize the problem of deduplicating DP-trained models for the first time and propose a novel privacy- and accuracy-aware deduplication mechanism to address the problem. We developed a greedy strategy to select and assign base models to target models to minimize storage and privacy costs. When deduplicating a target model, we dynamically schedule accuracy validations and apply the Sparse Vector Technique to reduce the privacy costs associated with private validation data. Compared to baselines, our approach improved the compression ratio by up to 35× for individual models (including large language models and vision transformers). We also observed up to 43× inference speedup due to the reduction of I/O operations.

PrivPetal: Relational Data Synthesis via Permutation Relations

Releasing relational databases while preserving privacy is an important research problem with numerous applications. A canonical approach is to generate synthetic data under differential privacy (DP), which provides a strong, rigorous privacy guarantee. The problem is particularly challenging when the data involve not only entities (e.g., represented by records in tables) but also relationships (represented by foreign-key references), since if we generate random records for each entity independently, the resulting synthetic data usually fail to exhibit realistic relationships. The current state of the art, PrivLava, addresses this issue by generating random join key attributes through a sophisticated expectation-maximization (EM) algorithm. This method, however, is rather costly in terms of privacy budget consumption, due to the numerous EM iterations needed to retain high data utility. Consequently, the privacy cost of PrivLava can be prohibitive for some real-world scenarios.

We observe that the utility of the synthetic data is inherently sensitive to the join keys: changing the primary key of a record t, for example, causes t to join with a completely different set of partner records, which may lead to a significant distribution shift of the join result. Consequently, join keys need to be kept highly accurate, meaning that enforcing DP on them inevitably incurs a high privacy cost. In this paper, we explore a different direction: synthesizing a flattened relation and subsequently decomposing it down to base relations, which eliminates the need to generate join keys. Realizing this idea is challenging, since naively flattening a relational schema leads to a rather high-dimensional table, which is hard to synthesize accurately with differential privacy.

We present PrivPetal, a sophisticated approach that addresses the above issues via a novel concept, the permutation relation, which is constructed as a surrogate to synthesize the flattened relation, avoiding the generation of a high-dimensional relation directly. The synthesis is done using a refined Markov random field mechanism, backed by fine-grained privacy analysis. Extensive experiments using multiple real datasets and the TPC-H benchmark demonstrate that PrivPetal significantly outperforms existing methods in terms of aggregate query accuracy on the synthetic data.

PrivRM: A Framework for Range Mean Estimation under Local Differential Privacy

The increasing collection and analysis of personal data driven by digital technologies has raised concerns about individual privacy. Local Differential Privacy (LDP) has emerged as a promising solution to provide rigorous privacy guarantees for users without relying on a trusted data collector. In the context of LDP, range mean estimation over numerical values is an important yet challenging problem. Simply applying existing work may introduce overly large noise sensitivity, since existing methods focus on statistical tasks (e.g., mean or distribution) across the entire domain. In this paper, we propose a novel framework for Private Range Mean (PrivRM) estimation under LDP. Two implementations of the framework, namely PrivRMI and PrivRM*, are developed, which are adaptable to all existing numerical value perturbation mechanisms. As an optimization of the framework, we also propose a distribution-aware Adaptive Adjustment (AA) strategy to dynamically confine the perturbation space for skewed data distributions. Extensive experimental results show that under the same privacy guarantee and query range, our framework PrivRM significantly improves over existing solutions.

Relevance Queries for Interval Data

A wide range of applications manage large collections of interval data. For instance, temporal databases manage validity intervals of objects or versions thereof, while in probabilistic databases attribute values of records are associated with confidence or uncertainty intervals. The main search operation on interval data is the retrieval of data intervals that intersect (i.e., overlap with) a query interval (e.g., find records which were valid in September 2020, find temperature readings with non-zero probability to be within [24, 26] degrees). As a query may return many results, we need mechanisms that filter or order them based on how relevant they are to the query interval. We define alternative relevance scores between a data and a query interval based on their (relative) overlap. We define relevance queries, which compute only a subset of the most relevant intervals that intersect a query. Then, we propose a framework for evaluating relevance queries that can be applied on popular domain-partitioning interval indices (interval tree and HINT). We present experiments on real datasets that demonstrate the efficiency of our framework over baseline approaches.
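
A toy sketch of the general idea follows; the concrete score used here (intersection length normalized by the query length) is only one plausible choice and not necessarily one of the paper's definitions.

```python
# Rank data intervals by how much they overlap a query interval, then keep the top ones.
def overlap(a, b):
    """Length of the intersection of two closed intervals given as (start, end)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def relevance(data, query):
    """Relative overlap: intersection length normalized by the query length."""
    return overlap(data, query) / (query[1] - query[0])

query = (24.0, 26.0)
readings = [(23.5, 24.5), (25.0, 27.0), (20.0, 30.0), (26.5, 28.0)]

# Keep only intersecting intervals and return the two most relevant ones.
ranked = sorted(((relevance(r, query), r) for r in readings if overlap(r, query) > 0),
                reverse=True)
print(ranked[:2])   # [(1.0, (20.0, 30.0)), (0.5, (25.0, 27.0))]
```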

Rethinking The Compaction Policies in LSM-trees

Log-structured merge-trees (LSM-trees) are widely used to construct key-value stores. They periodically compact overlapping sorted runs to reduce the read amplification. Prior research on compaction policies has focused on the trade-off between write amplification (WA) and read amplification (RA). In this paper, we propose to treat the compaction operation in LSM-trees as a computational and I/O-bandwidth investment for improving the system's future query throughput, and thus rethink the compaction policy designs. A typical LSM-tree application handles a steady but moderate write stream and prioritizes resources for top-level flushes of small sorted runs to avoid data loss due to write stalls. The goal of the compaction policy, therefore, is to maintain an optimal number of sorted runs to maximize average query throughput. Because compaction and read operations compete for the CPU and I/O resources from the same pool, we must perform a joint optimization to determine the appropriate timing and aggressiveness of the compaction. We introduce a three-level model of an LSM-tree and propose EcoTune, an algorithm based on dynamic programming to find the optimal compaction policy according to workload characterizations. Our evaluation on RocksDB shows that EcoTune improves the average query throughput by 1.5x to 3x over the leveling policy and by up to 2.5x over the lazy-leveling policy on workloads with varying range/point query ratios.

Revisiting Graph Analytics Benchmark

The rise of graph analytics platforms has led to the development of various benchmarks for evaluating and comparing platform performance. However, existing benchmarks often fall short of fully assessing performance due to limitations in core algorithm selection, data generation processes (and the corresponding synthetic datasets), as well as the neglect of API usability evaluation. To address these shortcomings, we propose a novel graph analytics benchmark. First, we select eight core algorithms by extensively reviewing both academic and industrial settings. Second, we design an efficient and flexible data generator and produce eight new synthetic datasets as the default datasets for our benchmark. Lastly, we introduce a multi-level large language model (LLM)-based framework for API usability evaluation, the first of its kind in graph analytics benchmarks. We conduct comprehensive experimental evaluations on existing platforms (GraphX, PowerGraph, Flash, Grape, Pregel+, Ligra, and G-thinker). The experimental results demonstrate the superiority of our proposed benchmark.

RLOMM: An Efficient and Robust Online Map Matching Framework with Reinforcement Learning

Online map matching is a fundamental problem in location-based services, aiming to incrementally match trajectory data step-by-step onto a road network. However, existing methods fail to meet the efficiency, robustness, and accuracy requirements of large-scale online applications, so this task remains challenging. This paper introduces a novel framework that achieves high accuracy and efficient matching while ensuring robustness in handling diverse scenarios. To improve efficiency, we begin by modeling the online map matching problem as an Online Markov Decision Process (OMDP) based on its inherent characteristics. This approach helps efficiently merge historical and real-time data, reducing unnecessary calculations. Next, to enhance robustness, we design a reinforcement learning method, enabling robust handling of real-time data from dynamically changing environments. In particular, we propose a novel model learning process and a comprehensive reward function, allowing the model to make reasonable current matches from a future-oriented perspective and to continuously update and optimize during the decision-making process based on feedback. Lastly, to address the heterogeneity between trajectories and roads, we design distinct graph structures, facilitating efficient representation learning through graph and recurrent neural networks. To further align trajectory and road data, we introduce contrastive learning to decrease their distance in the latent space, thereby promoting effective integration of the two. Extensive evaluations on three real-world datasets confirm that our method significantly outperforms existing state-of-the-art solutions in terms of accuracy, efficiency, and robustness.

RM2: Answer Counting Queries Efficiently under Shuffle Differential Privacy

Differential privacy (DP) is a leading standard for protecting individual privacy in data collection and analysis. This paper explores the shuffle model of DP, which balances privacy and utility by letting users' messages pass through a trusted shuffler before reaching an untrusted analyzer anonymously. We focus on efficiently implementing the matrix mechanism in shuffle-DP, where efficiency is defined by the number of messages each user sends. Our contributions include a baseline shuffle-DP mechanism that naively adapts the matrix mechanism, followed by an improved mechanism that reduces message complexity while maintaining error levels comparable to central-DP. We demonstrate the versatility of our approach across common query workloads, such as range queries and data cubes, achieving significant improvements in message efficiency. Experimental results confirm that our method outperforms the baseline solution while closely matching the accuracy of central-DP mechanisms.

Robust Privacy-Preserving Triangle Counting under Edge Local Differential Privacy

Counting the number of triangles in a graph is a fundamental task and has been extensively studied recently. In real-world applications, continuously releasing the triangle count of a graph poses a significant privacy risk for users. To protect sensitive edge information from a central server, we study the problem of estimating the number of triangles under edge local differential privacy (edge LDP). Existing approaches adopt a multi-round computing scheme, allowing the vertices to perform local triangle counting using the noisy graph constructed in the previous round. However, these algorithms not only restrict the noisy graph that can be downloaded to each vertex, but also have coarse upper bounds for the scale of noise added to the estimates. In this paper, we propose a vertex-centric triangle counting algorithm under edge LDP, which improves data utility by leveraging a larger part of the noisy adjacency matrix. Our approach fully exploits the local graph structure to obtain refined estimates of per-vertex triangle counts. We also devise tight bounds for global sensitivities to not only comply with privacy requirements but also control the scale of added noise. Furthermore, we perform a rigorous analysis of the L2 loss of our unbiased estimators and design optimizations for allocating the privacy budget to minimize L2 loss based on the input graph. Extensive experiments on 12 datasets validate the effectiveness and efficiency of our proposed algorithms.

RWalks: Random Walks as Attribute Diffusers for Filtered Vector Search

Analytical tasks in various domains increasingly encode complex information as dense vector data (e.g., embeddings), often requiring filtered vector search (i.e., vector search with attribute filtering). This search is challenging due to the volume and dimensionality of the data, the number and variety of filters, and the difference in distribution and/or update frequency between vectors and filters. Besides, many real applications require answers in a few milliseconds with high recall on large collections. Graph-based methods are considered the best choice for such applications, despite a lack of theoretical guarantees on query accuracy. Existing solutions for filtered vector search are either: 1) ad-hoc, using existing techniques with no or minor modifications; or 2) hybrid, providing specialized indexing and/or search algorithms. We show that neither is satisfactory and propose RWalks, an index-agnostic graph-based filtered vector search method that efficiently supports both filtered and unfiltered vector search. We demonstrate its scalability and robustness against the state-of-the-art with an exhaustive experimental evaluation on four real datasets (up to 100 million vectors), using query workloads with filters of different types (unique/composite), and varied specificity (proportion of points that satisfy a filter). The results show that RWalks can perform filtered search up to 2x faster than the second-best competitor (ACORN), while building the index 76x faster and answering unfiltered search 13x faster.

SBSC: A fast Self-tuned Bipartite proximity graph-based Spectral Clustering

Spectral clustering (SC) is well known for discovering the natural groups present in data by projecting them into Eigen-space based on a proximity graph, but it incurs cubic time in the data size (N) because all-pair proximities are used. To enhance the efficiency of SC techniques, the proximity between the data instances and their representatives (R) is captured through a bipartite similarity graph. However, extrinsic parameters, such as the number of representatives and the number of nearby representatives per data instance, influence clustering performance, time, and memory usage. Therefore, in this work, we construct a parameter-free bipartite graph to further improve the clustering quality and computational cost of SC by introducing a locality-based sparsification technique. First, the proposed method (SBSC) determines O(√N) well-distributed representatives in O(N lg N) time by applying Bi-means and K-means partitioning techniques. Next, SBSC utilizes the local neighbors of R to search for the nearby representatives, which reduces the search time to O(N). To the best of our knowledge, the proposed bipartite graph is the smallest (O(N) in size), and exploiting its high sparsity accelerates the Eigen-decomposition step of SBSC. The proposed algorithm takes only O(N(K² + lg N)) time overall to detect K clusters, and experimental results on eighteen large, diversified datasets suggest that SBSC discovers complex clusters much faster than the competing methods, with enhanced clustering quality.

Scalable Complex Event Processing on Video Streams

The rapid expansion of video streaming content in our daily lives has rendered the real-time processing and analysis of these video streams a critical capability. However, existing deep video analytics systems primarily support only simple queries, such as selection and aggregation. Considering the inherent temporal nature of video streams, queries capable of matching patterns of events could enable a wider range of applications. In this paper, we present Bobsled, a novel video stream processing system designed to efficiently support complex event queries. Experimental results demonstrate that Bobsled can achieve a throughput improvement over state-of-the-art ranging from 2.4× to 11.6×, without any noticeable loss in accuracy.

Self-Enhancing Video Data Management System for Compositional Events with Large Language Models

Complex video queries can be answered by decomposing them into modular subtasks. However, existing video data management systems assume the existence of predefined modules for each subtask. We introduce VOCAL-UDF, a novel self-enhancing system that supports compositional queries over videos without the need for predefined modules. VOCAL-UDF automatically identifies and constructs missing modules and encapsulates them as user-defined functions (UDFs), thus expanding its querying capabilities. To achieve this, we formulate a unified UDF model that leverages large language models (LLMs) to aid in new UDF generation. VOCAL-UDF handles a wide range of concepts by supporting both program-based UDFs (i.e., Python functions generated by LLMs) and distilled-model UDFs (lightweight vision models distilled from strong pretrained models). To resolve the inherent ambiguity in user intent, VOCAL-UDF generates multiple candidate UDFs and uses active learning to efficiently select the best one. With the self-enhancing capability, VOCAL-UDF significantly improves query performance across three video datasets.

Serf: Streaming Error-Bounded Floating-Point Compression

In IoT (Internet of Things) scenarios, massive floating-point time series data are generated in a streaming manner and transmitted within limited bandwidth for real-time analysis. To improve efficiency, it is common practice to compress the data before transmission. Existing floating-point compression methods are either designed for batched compression, which may cause long delays, or for streaming lossless compression, which has an unsatisfactory compression ratio when certain errors are allowed. In this paper, we propose Serf, the first Streaming ERror-bounded Floating-point compression method, which has two implementations: Serf-Qt and Serf-XOR. Serf-Qt first quantizes each floating-point value into an integer and then encodes the integer with Elias gamma coding. Serf-XOR is the first lossy floating-point compression scheme based on the XORing operation. To enhance the compression ratio of Serf-XOR, we propose a novel data offset technique to increase the leading zeros of the XORed values, and design a novel approximation technique to search for an error-qualified value that produces an XORed value with many trailing zeros. To improve compression efficiency, we propose a pruning strategy to accelerate the search for approximated values. We further build a streaming transmission prototype system based on a real development board and deploy the proposed methods on it. Extensive experiments using 13 datasets show that, compared with 17 competitors, both Serf-Qt and Serf-XOR achieve remarkable compression ratios with high efficiency in streaming scenarios. The transmission experiments based on the proposed system also show that Serf-XOR always takes the least overall time when bandwidth is limited.
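
To convey the XOR intuition only (this is not Serf-XOR's actual encoding; the values and error bound below are made up), the sketch XORs the bit patterns of consecutive doubles and shows how choosing an error-qualified approximation can maximize the zero runs that make the residue cheap to encode.

```python
# XORing the bit patterns of consecutive doubles yields long runs of leading/trailing
# zeros when the values are close; an error-bounded approximation of the new value can
# be chosen to make the XORed residue even sparser.
import struct

def bits(x: float) -> int:
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def xor_residue(prev: float, curr: float) -> int:
    return bits(prev) ^ bits(curr)

def zero_runs(r: int):
    s = format(r, '064b')
    return len(s) - len(s.lstrip('0')), len(s) - len(s.rstrip('0'))

prev, curr, error_bound = 23.170, 23.172, 0.005
print(zero_runs(xor_residue(prev, curr)))     # residue of the exact new value

# Any value within +/- error_bound is acceptable; reusing the previous value (when it
# qualifies) makes the residue all zeros, the cheapest possible case to encode.
approx = prev if abs(prev - curr) <= error_bound else curr
print(zero_runs(xor_residue(prev, approx)))   # (64, 64) here: residue is zero
```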

SHIELD: Encrypting Persistent Data of LSM-KVS from Monolithic to Disaggregated Storage

Log-Structured Merge-tree-based Key-Value Stores (LSM-KVS) are widely used to support modern, high-performance, data-intensive applications. In recent years, with the trend of deploying and optimizing LSM-KVS from monolithic to Disaggregated Storage (DS) setups, the persistent data of LSM-KVS (e.g., WAL and SST files) is vulnerable to unauthorized access from insiders and external attackers and must be protected using encryption. Existing solutions lack a high-performance design for encryption in LSM-KVS: they often focus on in-memory data protection with overheads of 3.4-32.5x, and lack the scalability and flexibility considerations required in DS deployments.

This paper proposes two novel designs to address the challenges of providing robust security for persistent components of LSM-KVS while maintaining high performance in both monolithic and DS deployments: a simple and effective instance-level design suitable for monolithic LSM-KVS deployments, and SHIELD, a design that embeds encryption into LSM-KVS components for minimal overhead in both monolithic and DS deployments. We achieve our objective through three contributions: (1) a fine-grained integration of encryption into the LSM-KVS write path to minimize performance overhead from exposure-limiting practices like using unique encryption keys per file and regularly re-encrypting with new encryption keys during compaction, (2) mitigating the performance degradation caused by recurring encryption of Write-Ahead Log (WAL) writes by using a buffering solution, and (3) extending confidentiality guarantees to DS by designing a metadata-enabled encryption-key-sharing mechanism and a secure local cache for high scalability and flexibility. We implement both designs on RocksDB, evaluating them in monolithic and DS setups while showcasing an overhead of 0-32% for the instance-level design and 0-36% for SHIELD.

SPACE: Cardinality Estimation for Path Queries Using Cardinality-Aware Sequence-based Learning

Cardinality estimation is a central task of cost-based database query optimization. Accurate estimates enable optimizers to identify and avoid expensive plans requiring large intermediate results. While cardinality estimation has been studied extensively in relational databases, research in the setting of graph databases has been scarcer. Furthermore, recent studies have shown that machine-learning-based methods can be utilized for cardinality estimation in both relational and graph databases. In this paper, we focus on the problem of estimating the cardinality of path patterns in graph databases, and we propose the Sequence-based Path Pattern Cardinality Estimator (SPACE). Our approach treats path patterns as sequences of node labels and edge types and assigns similar cardinalities to path patterns with similar node and edge order. SPACE uses a dual approach: it encodes the sequence of nodes and edges to capture structural characteristics of the path pattern, while also incorporating a cardinality-based encoding to integrate cardinality information throughout learning. In a comprehensive experimental evaluation, we show that our method outperforms the state of the art in terms of both accuracy (Q-error) and training time.

SpareLLM: Automatically Selecting Task-Specific Minimum-Cost Large Language Models under Equivalence Constraint

We introduce SpareLLM, Selecting Passable And Resource-Efficient LLMs, a novel LLM framework designed to minimize the inference costs (i.e., resource-efficient) of large-scale NLP tasks while ensuring sufficient result quality (i.e., passable). It enables users to specify an equivalence constraint in terms of the equivalence of outputs to those of the most powerful LLM. SpareLLM then generates results that deviate from the outputs of this LLM only with a probability below a user-defined threshold. SpareLLM employs a profiling phase that evaluates the performance of multiple LLMs to identify those that meet the user-defined equivalence level. It optimizes the tradeoff between profiling overheads and the anticipated cost savings resulting from profiling. Moreover, SpareLLM further reduces inference costs by strategically leveraging a mix of LLMs. Our experiments on five real-world datasets show that SpareLLM achieves significant cost savings, up to 8.6x, while generating equivalent outputs in 90% of cases compared to GPT-4-Turbo. Compared to recent LLM cascading baselines, SpareLLM demonstrates a superior tradeoff between cost and accuracy, accounting for 91.1% and 83.8% of the points on the Pareto curve for OpenAI and Llama models.

SPARTAN: Data-Adaptive Symbolic Time-Series Approximation

Symbolic approximations are dimensionality reduction techniques that convert time series into sequences of discrete symbols, enhancing interpretability while reducing computational and storage costs. To construct symbolic representations, first numeric representations approximate and capture properties of raw time series, followed by a discretization step that converts these numeric dimensions into symbols. Despite decades of development, existing approaches have several key limitations that often result in unsatisfactory performance: they (i) rely on data-agnostic numeric approximations, disregarding intrinsic properties of the time series; (ii) decompose dimensions into equal-sized subspaces, assuming independence among dimensions; and (iii) allocate a uniform encoding budget for discretizing each dimension or subspace, assuming balanced importance. To address these shortcomings, we propose SPARTAN, a novel data-adaptive symbolic approximation method that intelligently allocates the encoding budget according to the importance of the constructed uncorrelated dimensions. Specifically, SPARTAN (i) leverages intrinsic dimensionality reduction properties to derive non-overlapping, uncorrelated latent dimensions; (ii) adaptively distributes the budget based on the importance of each dimension by solving a constrained optimization problem; and (iii) prevents false dismissals in similarity search by ensuring a lower bound on the true distance in the original space. To demonstrate SPARTAN's robustness, we conduct the most comprehensive study to date, comparing SPARTAN with seven state-of-the-art symbolic methods across four tasks: classification, clustering, indexing, and anomaly detection. Rigorous statistical analysis across hundreds of datasets shows that SPARTAN outperforms competing methods significantly on all tasks in terms of downstream accuracy, given the same budget. Notably, SPARTAN achieves up to a 2x speedup compared to the most accurate rival. Overall, SPARTAN effectively improves the symbolic representation quality without storage or runtime overheads, paving the way for future advancements.

Subgroup Discovery with Small and Alternative Feature Sets

Subgroup-discovery methods find interesting regions in a dataset. In this article, we analyze two constraint types to enhance the interpretability of subgroups: First, we make subgroup descriptions small by limiting the number of features used. Second, we propose the novel problem of finding alternative subgroup descriptions, which cover a similar set of data objects as a given subgroup but use different features. We describe how to integrate both constraint types into heuristic subgroup-discovery methods as well as a novel Satisfiability Modulo Theories (SMT) formulation, which enables a solver-based search for subgroups. Further, we prove NP-hardness of optimization with either constraint type. Finally, we evaluate unconstrained and constrained subgroup discovery with 27 binary-classification datasets. We observe that heuristic search methods often yield high-quality subgroups fast, even with constraints.

SuSe: Summary Selection for Regular Expression Subsequence Aggregation over Streams

Regular expressions (RegEx) are an essential tool for pattern matching over streaming data, e.g., in network and security applications. The evaluation of RegEx queries becomes challenging, though, once subsequences are incorporated, i.e., characters in a sequence may be skipped during matching. Since the number of subsequence matches may grow exponentially in the input length, existing RegEx engines fall short in finding all subsequence matches, especially for queries including Kleene closure.

In this paper, we argue that common applications for RegEx queries over streams do not require the enumeration of all distinct matches at any point in time. Rather, only an aggregate over the matches is typically fetched at specific, yet unknown time points. To cater for these scenarios, we present SuSe, a novel architecture for RegEx evaluation that is based on a query-specific summary of the stream. It employs a novel data structure, coined StateSummary, to capture aggregated information about subsequence matches. This structure is maintained by a summary selector, which aims at choosing the stream projections that minimize the loss in the aggregation result over time. Experiments on real-world and synthetic data demonstrate that SuSe is both effective and efficient, with the aggregates being based on several orders of magnitude more matches compared to baseline techniques.

SWASH: A Flexible Communication Framework with Sliding Window-Based Cache Sharing for Scalable DGNN Training

Dynamic Graph Neural Networks (DGNNs) are effective at capturing multidimensional data and enable many important applications. As model training is computationally intensive, distributed DGNN training is employed to accommodate large data. Also, when training DGNNs, so-called sliding window training is used predominantly, as it enhances both accuracy and efficiency. However, current distributed frameworks, such as snapshot partitioning, chunk-based partitioning, and L-hop cache-based communication-free vertex partitioning, are inherently incompatible with sliding window training. While communication-based vertex partitioning supports sliding window training, its design for static graphs limits its effectiveness in distributed DGNN training. Specifically, existing partitioning strategies fail to optimize communication across snapshots, while existing cache reuse and communication scheduling strategies ignore opportunities for optimization between sliding windows. To support distributed sliding window training, we present SWASH, a scalable and flexible communication framework that utilizes a Sliding Window-based cAche SHaring technique. Specifically, we propose a flexible communication framework that supports ratio adjustment and timing selection, as well as hyperparameter settings and adaptive scheduling. We also propose a lightweight partitioning strategy tailored to sliding window-based DGNN training to reduce both partitioning and communication overheads. Finally, to alleviate decreases in accuracy due to reduced communication, we propose a cache-sharing technique based on sliding windows for sharing boundary vertex embeddings. Comprehensive experiments show that SWASH achieves training speedups averaging 9.44× over state-of-the-art frameworks while maintaining the accuracy of fully communicating, non-caching training frameworks.

SwiftSpatial: Spatial Joins on Modern Hardware

Spatial joins are among the most time-consuming spatial queries, remaining costly even in parallel and distributed systems. In this paper, we explore hardware acceleration for spatial joins by proposing SwiftSpatial, an FPGA-based accelerator that can be deployed in data centers and at the edge. SwiftSpatial contains multiple high-performance join units with innovative hybrid parallelism, several efficient memory management units, and an extensible on-chip join scheduler that supports the popular R-tree synchronous traversal and partition-based spatial-merge (PBSM) algorithms. Benchmarked against various CPU and GPU-based spatial data processing systems, SwiftSpatial demonstrates a latency reduction of up to 41.03x relative to the best-performing baseline, while requiring 6.16x less power. The performance and energy efficiency of SwiftSpatial demonstrate its potential to be used in a variety of configurations (e.g., as an accelerator, near storage, in-network) as well as on different devices (e.g., data centers where FPGAs are widely available or mobile devices, which also contain FPGAs for specialized processing).

Synthesizing Third Normal Form Schemata that Minimize Integrity Maintenance and Update Overheads: Parameterizing 3NF by the Numbers of Minimal Keys and Functional Dependencies

State-of-the-art relational schema design generates a lossless, dependency-preserving decomposition into Third Normal Form (3NF) that is in Boyce-Codd Normal Form (BCNF) whenever possible. In particular, dependency-preservation ensures that data integrity can be maintained on individual relation schemata without having to join them, but may need to tolerate a priori unbounded levels of data redundancy and integrity faults. As our main contribution, we parameterize 3NF schemata by the numbers of minimal keys and functional dependencies they exhibit. Conceptually, these parameters quantify, already at schema design time, the effort necessary to maintain data integrity, and allow us to break ties between 3NF schemata. Computationally, the parameters enable us to optimize normalization into 3NF according to different strategies. Operationally, we show through experiments that our optimizations translate from the logical level into significantly smaller update overheads during integrity maintenance. Hence, our framework provides access to parameters that guide the computation of logical schema designs which reduce operational overheads.

Styx: Transactional Stateful Functions on Streaming Dataflows

Developing stateful cloud applications, such as low-latency workflows and microservices with strict consistency requirements, remains arduous for programmers. The Stateful Functions-as-a-Service (SFaaS) paradigm aims to serve these use cases. However, existing approaches provide weak transactional guarantees or perform expensive external state accesses requiring inefficient transactional protocols that increase execution latency. In this paper, we present Styx, a novel dataflow-based SFaaS runtime that executes serializable transactions consisting of stateful functions that form arbitrary call-graphs with exactly-once guarantees. Styx extends a deterministic transactional protocol by contributing: i) a function acknowledgment scheme to determine transaction boundaries required in SFaaS workloads, ii) a function-execution caching mechanism, and iii) an early commit-reply mechanism that substantially reduces transaction execution latency. Experiments with the YCSB, TPC-C, and Deathstar benchmarks show that Styx outperforms state-of-the-art approaches by achieving at least one order of magnitude higher throughput while exhibiting near-linear scalability and low latency.

T3: Accurate and Fast Performance Prediction for Relational Database Systems With Compiled Decision Trees

Query performance prediction is used for scheduling, resource scaling, tenant placement, and various other use-cases. Here, the main goal is to estimate the execution time of a query without running it. To be effective, predictors need to be both accurate and fast. In contrast, neural networks that were used in recent work deliver very accurate predictions but suffer from high latency. In this work, we propose the Tuple Time Tree (T3), a new model that is both accurate and fast. It is orders of magnitude faster than comparable methods and has competitive accuracy to state-of-the-art approaches. Additionally, T3 works for new database instances without re-training because it generalizes across database instances. We achieve T3's speed by relying on a low-latency decision tree model that is compiled to native machine code. We maintain high accuracy with two novel techniques: pipeline-based query plan representation and tuple-centric prediction targets. In our pipeline-based query plan representation, T3 decomposes query plans into pipelines. Then, T3 predicts the execution time of each pipeline individually, instead of the whole query in one step. With tuple-centric prediction targets, T3 predicts the expected time it takes to push a single tuple through a pipeline. It then multiplies this predicted value by the input cardinality of the pipeline to estimate its execution time. As a result, T3 achieves state-of-the-art accuracy with a low-latency decision tree model.
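
The tuple-centric target reduces to simple arithmetic at prediction time, as the toy sketch below illustrates (the cardinalities and per-tuple times are invented; the real predictor is a compiled decision tree over plan features).

```python
# Back-of-the-envelope sketch: per-pipeline runtime is estimated as the predicted
# time-per-tuple multiplied by the pipeline's input cardinality, and the query estimate
# is the sum over pipelines.

pipelines = [
    # (estimated input cardinality, predicted seconds per input tuple)
    (1_000_000, 2.5e-7),   # scan + filter pipeline
    (120_000,   9.0e-7),   # probe-side join pipeline
    (5_000,     4.0e-6),   # aggregation pipeline
]

query_runtime = sum(card * per_tuple for card, per_tuple in pipelines)
print(f"estimated runtime: {query_runtime:.3f} s")   # 0.25 + 0.108 + 0.02 = 0.378 s
```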

Table Overlap Estimation through Graph Embeddings

Discovering duplicate or high-overlapping tables in table collections is a crucial task for eliminating redundant information, detecting inconsistencies in the evolution of a table across its multiple versions produced over time, and identifying related tables. Candidate duplicate or related tables to support this task can be identified via the estimation of the largest table overlap. Unfortunately, current solutions for finding it present serious scalability issues for heavy workloads: Sloth, the state-of-the-art framework for its estimation, requires more than three days of machine time to compute 100k table overlaps.

In this paper, we introduce ARMADILLO, an approach based on graph neural networks that learns table embeddings whose cosine similarity approximates the overlap ratio between tables, i.e., the ratio between the area of their largest table overlap and the area of the smaller table in the pair. We also introduce two new annotated datasets based on GitTables and a Wikipedia table corpus, containing 1.32 million table pairs overall, labeled with their overlap. Evaluating the performance of ARMADILLO on these datasets, we observe that it calculates overlaps between pairs of tables several times faster than the state-of-the-art method while maintaining good quality in approximating the exact result.
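
Stated as a formula (using hypothetical notation, with $\mathbf{e}_{T}$ denoting the learned embedding of table $T$), the target quantity is

$$\mathrm{overlap\_ratio}(T_1, T_2) \;=\; \frac{\mathrm{area}\bigl(\text{largest overlap of } T_1 \text{ and } T_2\bigr)}{\min\bigl(\mathrm{area}(T_1),\, \mathrm{area}(T_2)\bigr)} \;\approx\; \cos\bigl(\mathbf{e}_{T_1}, \mathbf{e}_{T_2}\bigr),$$

so that a single embedding pass plus cheap cosine comparisons replaces the expensive exact computation of the largest overlap for every table pair.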

TableDC: Deep Clustering for Tabular Data

Deep clustering (DC), a fusion of deep representation learning and clustering, has recently demonstrated positive results in data science, particularly text processing and computer vision. However, joint optimization of feature learning and data distribution in the multi-dimensional space is domain-specific, so existing DC methods struggle to generalize to other application domains (such as data integration). In data management tasks, where high-density embeddings and overlapping clusters dominate, a data management-specific DC algorithm should be able to interact better with the data properties to support data integration tasks. This paper presents a deep clustering algorithm for tabular data (TableDC) that reflects the properties of data management applications that cluster tables (schema inference), rows (entity resolution) and columns (domain discovery). To address overlapping clusters, TableDC integrates Mahalanobis distance, which considers variance and correlation within the data, offering a similarity method suitable for tabular data in high-dimensional latent spaces. TableDC also shows higher tolerance to outliers through its heavy-tailed Cauchy distribution as the similarity kernel. The proposed similarity measure is particularly beneficial where the embeddings of raw data are densely packed and exhibit high degrees of overlap. Data integration tasks may also involve large numbers of clusters, which challenges the scalability of existing DC methods. TableDC learns data embeddings with a large number of clusters more efficiently than baseline DC methods, which scale in quadratic time. We evaluated TableDC with several existing DC, Standard Clustering (SC), and state-of-the-art bespoke methods over benchmark datasets. TableDC consistently outperforms existing DC, SC and bespoke methods.

The Best of Both Worlds: On Repairing Timestamps and Attribute Values for Multivariate Time Series

Dirty data are often observed in multivariate time series, which not only degrades data quality but also adversely affects various downstream applications. Existing studies typically focus on repairing errors appearing in either the timestamps or the attribute values alone, relying on the assumption that the other part is clean. However, in real scenarios, owing to various reasons, both timestamps and attribute values can be erroneous. It is intuitive to repair timestamps and attribute values separately by calling different methods in turn. However, such a strategy may lead to over-repairing and introduce additional errors, because it ignores the mutual reference between timestamps and attribute values. Therefore, in this study, we consider repairing both attribute values and timestamps simultaneously. Our major contributions include (1) defining the multivariate speed constraints and formalizing the optimal repair problem with an NP-hardness analysis, (2) computing the exact solutions with pruning strategies and correctness guarantees, (3) designing a quadratic-time approximation algorithm with a performance guarantee, and (4) devising a linear-time algorithm and ensuring its approximation performance bound. Empirical results over real-world dirty datasets demonstrate the superiority and practicality of our algorithms against eleven competing methods, where our algorithm not only achieves the best accuracy but also incurs the lowest time cost.
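
As a simplified, univariate illustration of the kind of constraint involved (the paper's multivariate speed constraints are richer, and the bound and data below are invented), the sketch flags consecutive observations whose value change per unit time exceeds a maximum speed, marking them as repair candidates.

```python
# Flag consecutive observations that violate a maximum-speed constraint: either the
# timestamp or the value (or both) of a violating pair is a candidate for repair.
def speed_violations(points, s_max):
    """points: list of (timestamp, value) pairs sorted by timestamp."""
    violations = []
    for (t1, x1), (t2, x2) in zip(points, points[1:]):
        if t2 > t1 and abs(x2 - x1) / (t2 - t1) > s_max:
            violations.append(((t1, x1), (t2, x2)))
    return violations

series = [(0, 20.0), (1, 20.4), (2, 35.0), (3, 21.0)]   # a spike at t=2
print(speed_violations(series, s_max=2.0))
# both neighbors of the spike violate the constraint, flagging t=2 for repair
```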

Two Birds with One Stone: Efficient Deep Learning over Mislabeled Data through Subset Selection

Using a large training dataset to train a big and powerful model, a typical practice in modern deep learning, often suffers from two major problems: an expensive and slow training process and error-prone labels. Existing approaches, which target either speeding up training by selecting a subset of representative training instances (subset selection) or eliminating the negative effect of mislabeled data during training (mislabel detection), do not perform well in this scenario because each overlooks one of the two problems. To fill this gap, we propose Deem, a novel data-efficient framework that selects a subset of representative training instances under label uncertainty. The key idea is to leverage the metadata produced during deep learning training, e.g., training losses and gradients, to estimate the label uncertainty and select the representative instances. In particular, we model the problem of subset selection under uncertainty as the problem of finding a subset that closely approximates the gradient of the whole training dataset derived on soft labels. We show that it is an NP-hard problem with a submodular property and propose a low-complexity algorithm to solve it with an approximation ratio. Training on this small subset thus improves training efficiency while guaranteeing the model's accuracy. Moreover, we propose an efficient strategy to dynamically refine this subset during the iterative training process. Extensive experiments on 6 datasets and 10 baselines demonstrate that Deem accelerates the training process up to 10X without sacrificing model accuracy.

Understanding the Black Box: A Deep Empirical Dive into Shapley Value Approximations for Tabular Data

Understanding the decisions made by machine learning models is important for building trust and enabling the adoption of these models in real-world applications. Shapley values have emerged as a leading method for model interpretability, offering precise insights by quantifying each feature's contribution to a prediction. However, computing Shapley values requires exploring all possible combinations of features, which can be computationally expensive, especially for high-dimensional data. This challenge has led to the development of various approximation techniques, typically composed of estimation and replacement strategies, to compute Shapley values efficiently. Our study focuses on the interpretability of machine learning models for tabular datasets, one of the most common and widely used data types. However, the abundance of approximation options has created a substantial gap in determining the most appropriate technique for practical applications. Through this study, we seek to bridge this gap by comprehensively evaluating Shapley value approximations, covering 8 replacement and 17 estimation strategies across diverse regression and classification tasks. The evaluation is conducted exclusively on tabular data, leveraging 200 synthetic and real-world datasets and covering a wide range of model types, from conventional tree-based and linear models to modern neural networks. We focus on computational efficiency and the consistency of Shapley value estimates in handling high-dimensional feature spaces. Our findings reveal that traditional sampling-based approaches significantly reduce computational costs but fail to capture complex feature interactions. In contrast, model-specific approaches that exploit the structure of the underlying model consistently outperform model-agnostic techniques, delivering higher accuracy and faster computations. Through this study, we aim to encourage further research on Shapley value approximations, advancing data-centric explainable AI.
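
For readers unfamiliar with how such approximations work, the sketch below shows the classic permutation-sampling estimator of Shapley values combined with a simple baseline-replacement strategy for "removing" features. It is a generic textbook-style estimator, not any specific strategy evaluated in the paper, and the function names are ours.

```python
import numpy as np

def shapley_permutation_sampling(predict, x, baseline, n_permutations=200, seed=0):
    """Monte Carlo estimate of Shapley values for a single instance.

    predict:  callable mapping a (1, d) array to a scalar prediction.
    x:        (d,) instance to explain.
    baseline: (d,) reference values used to 'remove' features
              (a simple replacement strategy).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    phi = np.zeros(d)

    for _ in range(n_permutations):
        order = rng.permutation(d)
        current = baseline.copy()
        prev_value = predict(current[None, :])
        # Add features one by one in the sampled order; the change in the
        # prediction is that feature's marginal contribution for this order.
        for j in order:
            current[j] = x[j]
            new_value = predict(current[None, :])
            phi[j] += new_value - prev_value
            prev_value = new_value

    return phi / n_permutations

if __name__ == "__main__":
    # Toy linear model: the estimates should recover w * (x - baseline).
    w = np.array([1.0, -2.0, 0.5])
    predict = lambda X: (X @ w).item()
    x = np.array([2.0, 1.0, 4.0])
    baseline = np.zeros(3)
    print(shapley_permutation_sampling(predict, x, baseline))
```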

Using Process Calculus for Optimizing Data and Computation Sharing in Complex Stateful Parallel Computations

We propose novel techniques that exploit data and computation sharing to improve the performance of complex stateful parallel computations, such as agent-based simulations. Parallel computations are translated into behavioral equations, a novel formalism layered on top of the foundational process calculus π-calculus. Behavioral equations blend code and data, allowing a system to easily compose and transform parallel programs into specialized programs. We show how optimizations like merging programs, synthesizing efficient message data structures, eliminating local messaging, rewriting communication instructions into local computations, and aggregation pushdown can be expressed as transformations of behavioral equations. We have also built a system called OptiFusion that implements behavioral equations and the aforementioned optimizations. Our experiments show that OptiFusion is over 10× faster than state-of-the-art stateful systems on complex stateful workloads. Generating specialized instructions that are impractical to write by hand allows OptiFusion to outperform even hand-optimized implementations by up to 2×.

Wait and See: A Delayed Transactions Partitioning Approach in Deterministic Database Systems for Better Performance

Deterministic databases are revolutionizing batch transaction processing in shared-nothing architectures, with efficiency largely hinging on minimizing cross-partition operations. However, achieving a universal data partition that eliminates cross-partition operations is often impractical. Thus, developing effective transaction partitioning strategies becomes crucial. Existing methods tend to partition and optimize transactions individually, neglecting the overarching commonalities between transactions within a batch. This oversight results in suboptimal partitioning of transactions that share similar read-write sets, ultimately missing opportunities for global batch execution optimization. In this paper, we present DelayPart, a deterministic database transaction engine that employs a "wait and see" strategy to address contextual conflicts between transactions within each batch. DelayPart models transaction batch partitioning as a k-cut problem based on transaction similarity and employs an LSH forest-based approach to approximate solutions efficiently in linear time, factoring in the global overhead of remote operations for each batch. By postponing the allocation and execution of individual transactions, DelayPart systematically analyzes inter-transaction relationships, enhancing overall performance without compromising execution efficiency. We evaluated DelayPart's performance against various benchmarks on a large-scale cluster, demonstrating that it significantly outperforms state-of-the-art transaction partitioning methods.
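
As a rough illustration of similarity-based grouping of transactions, the sketch below buckets transactions by a MinHash signature of their read-write sets. It is a simplified stand-in for the kind of LSH machinery the abstract mentions, not DelayPart's LSH forest or its k-cut formulation, and all names are hypothetical.

```python
import random
from collections import defaultdict

def minhash_signature(keys, hash_seeds):
    """MinHash signature of a transaction's read-write key set."""
    return tuple(min(hash((seed, k)) for k in keys) for seed in hash_seeds)

def group_similar_transactions(txn_keysets, num_hashes=8, seed=42):
    """Bucket transactions whose read-write sets are likely similar.

    txn_keysets: list of non-empty sets of accessed keys, one per transaction.
    Transactions that collide on the full signature are grouped together;
    a real system would use banding or an LSH forest instead of exact collisions.
    """
    rng = random.Random(seed)
    hash_seeds = [rng.getrandbits(32) for _ in range(num_hashes)]
    buckets = defaultdict(list)
    for txn_id, keys in enumerate(txn_keysets):
        buckets[minhash_signature(keys, hash_seeds)].append(txn_id)
    return list(buckets.values())

if __name__ == "__main__":
    txns = [
        {"acct:1", "acct:2"},
        {"acct:1", "acct:2"},            # same keys as txn 0 -> same bucket
        {"acct:7", "acct:8", "acct:9"},
    ]
    print(group_similar_transactions(txns))
```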

Yannakakis+: Practical Acyclic Query Evaluation with Theoretical Guarantees

Acyclic conjunctive queries form the backbone of most analytical workloads, and have been extensively studied in the literature from both theoretical and practical angles. However, there is still a large divide between theory and practice. While the 40-year-old Yannakakis algorithm has strong theoretical running time guarantees, it has not been adopted in real systems due to its high hidden constant factor. In this paper, we strive to close this gap by proposing Yannakakis+, an improved version of the Yannakakis algorithm, which is more practically efficient while preserving its theoretical guarantees. Our experiments demonstrate that Yannakakis+ consistently outperforms the original Yannakakis algorithm by 2x to 5x across a wide range of queries and datasets.

Another nice feature of our new algorithm is that it generates a traditional DAG query plan consisting of standard relational operators, allowing Yannakakis+ to be easily plugged into any standard SQL engine. Our system prototype currently supports four different SQL engines (DuckDB, PostgreSQL, SparkSQL, and AnalyticDB from Alibaba Cloud), and our experiments show that Yannakakis+ is able to deliver better performance than their native query plans on 160 out of the 162 queries tested, with an average speedup of 2.41x and a maximum speedup of 47,059x.
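
For context, the sketch below shows the classic Yannakakis evaluation on a small path join: two semijoin passes that remove dangling tuples, followed by the final join. It illustrates only the baseline algorithm that Yannakakis+ improves, not the paper's optimizations, and the toy relations are our own.

```python
def semijoin(left, right, attrs):
    """Keep tuples of `left` whose projection on `attrs` appears in `right`."""
    keys = {tuple(t[a] for a in attrs) for t in right}
    return [t for t in left if tuple(t[a] for a in attrs) in keys]

def yannakakis_path_join(R, S, T):
    """Classic Yannakakis evaluation of R(a,b) JOIN S(b,c) JOIN T(c,d)
    on the join tree R - S - T: two semijoin passes, then the final join."""
    # Bottom-up pass (leaves towards the root S).
    S1 = semijoin(S, R, ["b"])
    S1 = semijoin(S1, T, ["c"])
    # Top-down pass (root back to the leaves).
    R1 = semijoin(R, S1, ["b"])
    T1 = semijoin(T, S1, ["c"])
    # Final join: after the passes, no intermediate tuple is dangling.
    out = []
    for s in S1:
        for r in R1:
            if r["b"] == s["b"]:
                for t in T1:
                    if t["c"] == s["c"]:
                        out.append({**r, **s, **t})
    return out

if __name__ == "__main__":
    R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
    S = [{"b": 10, "c": 100}, {"b": 30, "c": 300}]
    T = [{"c": 100, "d": 7}, {"c": 200, "d": 8}]
    print(yannakakis_path_join(R, S, T))   # only the (1, 10, 100, 7) row survives
```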

Zombie Hashing: Reanimating Tombstones in Graveyard

Linear probing-based hash tables offer high data locality and are considered among the fastest in real-world applications. However, they come with an inherent tradeoff between space efficiency and speed: when the hash table approaches full capacity, its performance tends to decline considerably due to an effect known as primary clustering. As a result, they are typically used only at low load factors.

Tombstones (markers for deleted elements) can help mitigate the effect of primary clustering in linear probing hash tables. However, tombstones require periodic redistribution, which, in turn, requires a complete halt of regular operations. This makes linear probing unsuitable for practical applications where periodic halts are unacceptable.

In this paper, we present a solution to forestall primary clustering in linear probing hash tables, ensuring high data locality and consistent performance even at high load factors. Our approach redistributes tombstones within small windows, deamortizing the cost of mitigating primary clustering and eliminating the need for periodic halts. We provide theoretical guarantees that our deamortization method is asymptotically optimal in efficiency and cost. We also design an efficient implementation within dominant linear-probing hash tables and show performance improvements.

We introduce Zombie hashing in two variants: ordered (compact) and unordered (vectorized) linear probing hash tables. Both variants achieve consistently high throughput and the lowest variance in operation latency among state-of-the-art hash tables across numerous churn cycles, while maintaining 95% space efficiency without downtime. Our results show that Zombie hashing overcomes the limitations of linear probing while preserving high data locality.
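
As background for the mechanics involved, the sketch below implements plain linear probing with tombstone deletion markers: deletes only mark slots, inserts may reuse them, and lookups skip over them. It shows only this baseline behavior; Zombie hashing's windowed, deamortized tombstone redistribution is the paper's contribution and is not reproduced here.

```python
EMPTY, TOMBSTONE = object(), object()

class LinearProbingTable:
    """Minimal linear-probing hash table with tombstone deletion markers."""

    def __init__(self, capacity=16):
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        start = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            yield (start + i) % len(self.slots)

    def insert(self, key, value):
        first_free = None
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                # Reuse an earlier tombstone slot if we passed one.
                self.slots[first_free if first_free is not None else idx] = (key, value)
                return
            if slot is TOMBSTONE:
                if first_free is None:
                    first_free = idx            # remember the reusable deleted slot
            elif slot[0] == key:
                self.slots[idx] = (key, value)  # update in place
                return
        if first_free is not None:
            self.slots[first_free] = (key, value)
        else:
            raise RuntimeError("table full")

    def get(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                return None                     # a true empty slot ends the probe
            if slot is not TOMBSTONE and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                return
            if slot is not TOMBSTONE and slot[0] == key:
                self.slots[idx] = TOMBSTONE     # mark instead of shifting: cheap deletes
                return

if __name__ == "__main__":
    t = LinearProbingTable()
    t.insert("a", 1)
    t.insert("b", 2)
    t.delete("a")
    print(t.get("a"), t.get("b"))               # None 2
```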