Proceedings of the ACM on Management of Data: SIGMOD: Vol. 3, No. 1. 2025


PACMMOD V3, N1 (SIGMOD), February 2025: Editorial

The Proceedings of the ACM on Management of Data (PACMMOD) is concerned with the principles, algorithms, techniques, systems, and applications of database management systems, data management technology, and the science and engineering of data. It includes articles reporting cutting-edge data management, data engineering, and data science research.

We are pleased to present the 1st issue of Volume 3 of PACMMOD. This issue contains papers that were submitted to the SIGMOD research track in July 2024.

Papers accepted in this issue are part of the pool of presentation candidates in the research track of the ACM SIGMOD Conference on Management of Data 2025, to be held in Berlin, Germany.

λ-Tune: Harnessing Large Language Models for Automated Database System Tuning

We introduce λ-Tune, a framework that leverages Large Language Models (LLMs) for automated database system tuning. The design of λ-Tune is motivated by the capabilities of the latest generation of LLMs. Unlike prior work, which leverages LLMs to extract tuning hints for single parameters, λ-Tune generates entire configuration scripts based on a large input document that describes the tuning context. λ-Tune generates alternative configurations and uses a principled approach to identify the best configuration out of a small set of candidates. In doing so, it minimizes reconfiguration overheads and ensures that evaluation costs are bounded as a function of the optimal run time. By treating prompt generation as a cost-based optimization problem, λ-Tune conveys the most relevant context to the LLM while bounding the number of input tokens and, therefore, the monetary fees for LLM invocations. We compare λ-Tune to various baselines, using multiple benchmarks and PostgreSQL and MySQL as target systems for tuning, showing that λ-Tune is significantly more robust than prior approaches.
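
To make the cost-based view of prompt assembly concrete, here is a minimal sketch of selecting context snippets under a token budget by value-per-token; the snippets, relevance scores, and token counts are hypothetical, and this greedy heuristic merely illustrates the idea rather than λ-Tune's actual optimizer.

```python
# Illustrative sketch: assemble a tuning prompt under a token budget.
# The snippets, relevance scores, and token counts are hypothetical.

def select_context(snippets, budget_tokens):
    """Greedy value-per-token selection of context snippets."""
    ranked = sorted(snippets, key=lambda s: s["relevance"] / s["tokens"], reverse=True)
    chosen, used = [], 0
    for s in ranked:
        if used + s["tokens"] <= budget_tokens:
            chosen.append(s)
            used += s["tokens"]
    return chosen, used

snippets = [
    {"text": "workload: 60% point lookups on orders(o_id)", "relevance": 0.90, "tokens": 14},
    {"text": "hardware: 64 GB RAM, NVMe SSD",               "relevance": 0.80, "tokens": 10},
    {"text": "full schema dump of 120 tables",              "relevance": 0.60, "tokens": 3500},
    {"text": "slowest query plans (top 5)",                 "relevance": 0.85, "tokens": 400},
]

chosen, used = select_context(snippets, budget_tokens=1000)
prompt = "Generate a PostgreSQL configuration for:\n" + "\n".join(s["text"] for s in chosen)
print(f"{used} tokens of context selected")
```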

Online Marketplace: A Benchmark for Data Management in Microservices

Microservice architectures have become a popular approach for designing scalable distributed applications. Despite their extensive use in industrial settings for over a decade, there is limited understanding of the data management challenges that arise in these applications. Consequently, it has been difficult to advance data system technologies that effectively support microservice applications. To fill this gap, we present Online Marketplace, a microservice benchmark that highlights core data management challenges that existing benchmarks fail to address. These challenges include transaction processing, query processing, event processing, constraint enforcement, and data replication. We have defined criteria for various data management issues to enable proper comparison across data systems and platforms. Through case studies with state-of-the-art data platforms, we discuss the issues encountered while implementing and meeting Online Marketplace's criteria. By capturing the overhead of meeting the key data management requirements that are overlooked by existing benchmarks, we gain actionable insights into the experimental platforms. This highlights the significance of Online Marketplace in advancing future data systems to meet the needs of microservice practitioners.

A Local Search Approach to Efficient (k,p)-Core Maintenance

The (k,p)-core model was recently proposed to capture engagement dynamics by considering both intra-community interactions (i.e., the k-core structure) and inter-community interactions (i.e., the p-fraction property). It is a refinement of the classic k-core, introducing an extra parameter p to customize the engagement within a community at a finer granularity. In this paper, we study the problem of maintaining all (k,p)-cores (essentially, maintaining the p-numbers of all vertices) for dynamic graphs. The existing Global approach conducts a global peeling, almost from scratch, for all vertices whose old p-numbers fall within a computed range [p⁻, p⁺], and is thus inefficient. We propose a new Local approach that conducts local searches starting from the two endpoints of the newly inserted or deleted edge and then iteratively expands the search frontier by including their neighbors. Our algorithm is designed based on several fundamental properties that we prove in this paper to characterize the necessary condition for a vertex's p-number to change. Compared to Global, our Local approach implicitly obtains the optimal affected p-number range [p⁻*, p⁺*] ⊆ [p⁻, p⁺] and further skips many vertices whose p-numbers are within this range. Experimental results show that Local is on average two orders of magnitude faster than Global.

A Rank-Based Approach to Recommender System's Top-K Queries with Uncertain Scores

Top-K queries provide a ranked answer using a score that can either be given explicitly or computed from tuple values. Recommender systems use scores, based on user feedback on items with which they interact, to answer top-K queries. Such scores pose the challenge of correctly ranking elements using scores that are, more often than not, uncertain. In this work, we address top-K queries based on uncertain scores. We propose to explicitly model the inherent uncertainty in the provided data and to consider a distribution of scores instead of a single score. Rooted in work on database probabilistic ranking, we offer probabilistic ranking as a tool of choice for generating recommendations in the presence of uncertainty. We argue that the ranking approach should be chosen in a manner that maximizes user satisfaction, extending the state of the art on quality aspects of top-K answers over uncertain data and their relationship to top-K semantics, and improving ranking with uncertain scores in recommender systems. Towards this end, we introduce RankDist, an algorithm for efficiently computing the probability of each item's position in a ranked recommendation. We show that rank-based (rather than score-based) methods computed using RankDist, which have not been applied in recommender systems before, offer guaranteed optimality in expectation and empirical superiority when tested on common benchmarks.
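
As a rough illustration of rank-based reasoning over uncertain scores (not the RankDist algorithm itself), the Monte Carlo sketch below estimates the probability that each item lands at each rank position when every item's score is modeled as a Gaussian with an assumed mean and standard deviation.

```python
import numpy as np

# Hypothetical items with uncertain scores modeled as Gaussians.
rng = np.random.default_rng(0)
means = np.array([4.2, 4.0, 3.9, 3.1])   # assumed score means
stds  = np.array([0.5, 0.1, 0.3, 0.8])   # assumed score uncertainty

n_items, n_samples = len(means), 20_000
pos_prob = np.zeros((n_items, n_items))  # pos_prob[i, r] = P(item i at rank r)

for _ in range(n_samples):
    scores = rng.normal(means, stds)
    order = np.argsort(-scores)          # rank 0 = highest sampled score
    for rank, item in enumerate(order):
        pos_prob[item, rank] += 1

pos_prob /= n_samples
print(np.round(pos_prob, 3))
# A rank-based top-K can now pick, e.g., the item maximizing P(rank 0)
# rather than the item with the highest mean score.
```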

Accelerating Core Decomposition in Billion-Scale Hypergraphs

Hypergraphs provide a versatile framework for modeling complex relationships beyond pairwise interactions, finding applications in various domains. k-core decomposition is a fundamental task in hypergraph analysis that decomposes hypergraphs into cohesive substructures. Existing studies capture the cohesion in hypergraphs based on the vertex neighborhood size. However, such decomposition poses unique challenges, including the efficiency of core value updates, redundant computation, and high memory consumption. We observe that the state-of-the-art algorithms do not fully address these challenges and are unable to scale to large hypergraphs. In this paper, we propose an efficient approach for hypergraph k-core decomposition. Novel concepts and strategies are developed to compute the core value of each vertex and reduce redundant computation across vertices. Experimental results on real-world and synthetic hypergraphs demonstrate that our approach significantly outperforms the state-of-the-art algorithm by 7 times on average while reducing the average memory usage by 36 times. Moreover, while existing algorithms fail on hypergraphs with tens of millions of hyperedges, our approach efficiently handles billion-scale hypergraphs using only a single thread.
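
For orientation, a neighborhood-based core decomposition can be computed with the textbook-style peeling baseline sketched below on a toy hypergraph, where a vertex's degree is its neighborhood size (distinct co-members across incident hyperedges); this is the kind of baseline the paper improves upon, not the proposed algorithm.

```python
# Minimal peeling baseline: neighborhood-based core decomposition of a hypergraph.
# Hyperedges are vertex sets; a vertex's "degree" is its neighborhood size.

def hypergraph_core_decomposition(hyperedges):
    neighbors = {}
    for e in hyperedges:
        for v in e:
            neighbors.setdefault(v, set()).update(u for u in e if u != v)

    core, remaining, k = {}, set(neighbors), 0
    while remaining:
        # Cascade-remove vertices whose current neighborhood size is <= k;
        # the survivors form the (k+1)-core, so removed vertices get core value k.
        changed = True
        while changed:
            changed = False
            for v in list(remaining):
                if len(neighbors[v] & remaining) <= k:
                    core[v] = k
                    remaining.discard(v)
                    changed = True
        k += 1
    return core

edges = [{"a", "b", "c"}, {"b", "c", "d"}, {"d", "e"}, {"a", "c", "d"}]
print(hypergraph_core_decomposition(edges))   # e.g. {'e': 1, 'a': 3, 'b': 3, 'c': 3, 'd': 3}
```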

Agree to Disagree: Robust Anomaly Detection with Noisy Labels

Due to the scarcity of reliable anomaly labels, recent anomaly detection methods leveraging noisy auto-generated labels either select clean samples or refurbish noisy labels. However, both approaches struggle due to the unique properties of anomalies. Sample selection often fails to separate sufficiently many clean anomaly samples from noisy ones, while label refurbishment erroneously refurbishes marginal clean samples. To overcome these limitations, we design Unity, the first learning from noisy labels (LNL) approach for anomaly detection that elegantly leverages the merits of both sample selection and label refurbishment to iteratively prepare a diverse clean sample set for network training. Unity uses a pair of deep anomaly networks to collaboratively select samples with clean labels based on prediction agreement, followed by a disagreement resolution mechanism to capture marginal samples with clean labels. Thereafter, Unity utilizes unique properties of anomalies to design an anomaly-centric contrastive learning strategy that accurately refurbishes the remaining noisy labels. The resulting set, composed of selected and refurbished clean samples, is then used to train the anomaly networks in the next training round. Our experimental study on 10 real-world benchmark datasets demonstrates that Unity consistently outperforms state-of-the-art LNL techniques by up to 0.31 in F-1 score (0.52 → 0.83).
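
The agreement-based selection step can be pictured with a toy sketch: given anomaly scores from two independently trained detectors, keep only samples whose binarized predictions agree with each other and with the noisy label, and defer the rest to disagreement resolution and refurbishment. The scores, labels, and threshold below are hypothetical, and this is not Unity's training loop.

```python
import numpy as np

# Hypothetical outputs of two anomaly networks on the same samples (higher = more anomalous).
scores_a = np.array([0.92, 0.15, 0.78, 0.40, 0.05, 0.88])
scores_b = np.array([0.95, 0.10, 0.30, 0.35, 0.07, 0.91])
noisy_labels = np.array([1, 0, 1, 1, 0, 1])   # auto-generated, possibly wrong

threshold = 0.5
pred_a = (scores_a >= threshold).astype(int)
pred_b = (scores_b >= threshold).astype(int)

agree = pred_a == pred_b
clean = agree & (pred_a == noisy_labels)      # both networks agree with the noisy label
deferred = ~clean                             # candidates for disagreement resolution / refurbishment

print("selected clean indices:", np.where(clean)[0])
print("deferred indices:      ", np.where(deferred)[0])
```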

An Adaptive Benchmark for Modeling User Exploration of Large Datasets

In this paper, we present a new DBMS performance benchmark that can simulate user exploration with any specified dashboard design made of standard visualization and interaction components. The distinguishing feature of our SImulation-BAsed (or SIMBA) benchmark is its ability to model user analysis goals as a set of SQL queries to be generated through a valid sequence of user interactions, as well as to measure the completion of analysis goals by testing for equivalence between the user's previous queries and their goal queries. In this way, the SIMBA benchmark can simulate how an analyst opportunistically searches for interesting insights at the beginning of an exploration session and eventually homes in on specific goals towards the end. To demonstrate the versatility of the SIMBA benchmark, we use it to test the performance of four DBMSs with six different dashboard specifications and compare our results with IDEBench. Our results show how goal-driven simulation can reveal gaps in DBMS performance that existing benchmarking methods miss, across a range of data exploration scenarios.

An Elephant Under the Microscope: Analyzing the Interaction of Optimizer Components in PostgreSQL

Despite an ever-growing corpus of novel query optimization strategies, the interaction of the core components of query optimizers is still not well understood. This situation can be problematic for two main reasons: On the one hand, it may cause surprising results when two components influence each other in an unexpected way. On the other hand, it can lead to wasted effort in both engineering and research, e.g., when an improvement for one component is dwarfed or entirely canceled out by problems of another component. Therefore, we argue that making improvements to a single optimization component requires a thorough understanding of how these changes might affect the other components. To achieve this understanding, we present results of a comprehensive experimental analysis of the interplay in the traditional optimizer architecture, using the widely-used PostgreSQL system as prime representative. Our evaluation and analysis revisit the core building blocks of such an optimizer, i.e., per-column statistics, cardinality estimation, the cost model, and plan generation. In particular, we analyze how these building blocks influence each other and how they react when faced with faulty input, such as imprecise cardinality estimates. Based on our results, we draw novel conclusions and make recommendations on how these interactions should be taken into account.
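
One simple way to probe the interaction between cardinality estimation and the rest of the optimizer in PostgreSQL is to compare estimated and actual row counts reported by EXPLAIN (ANALYZE, FORMAT JSON), as in the sketch below; the connection string and query are placeholders, and this is not the paper's experimental harness.

```python
import json
import psycopg2   # assumes a reachable PostgreSQL instance

def estimate_errors(dsn, query):
    """Walk the EXPLAIN ANALYZE plan tree and report estimated vs. actual rows per node."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(f"EXPLAIN (ANALYZE, FORMAT JSON) {query}")
        plan = cur.fetchone()[0]
        if isinstance(plan, str):          # depending on driver settings, may arrive as text
            plan = json.loads(plan)
        root = plan[0]["Plan"]

    def walk(node, depth=0):
        est, act = node["Plan Rows"], node["Actual Rows"]
        q_err = max(est, act, 1) / max(min(est, act), 1)   # q-error of this plan node
        print(f"{'  ' * depth}{node['Node Type']}: est={est} act={act} q-error={q_err:.1f}")
        for child in node.get("Plans", []):
            walk(child, depth + 1)

    walk(root)

# Hypothetical usage (placeholder DSN and query):
# estimate_errors("dbname=imdb user=postgres",
#                 "SELECT * FROM title t JOIN cast_info ci ON ci.movie_id = t.id "
#                 "WHERE t.production_year > 2000")
```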

An Experimental Comparison of Tree-data Structures for Connectivity Queries on Fully-dynamic Undirected Graphs

During the past decades, significant efforts have been made to propose data structures for answering connectivity queries on fully dynamic graphs, i.e., graphs with frequent insertions and deletions of edges. However, a comprehensive understanding of how these data structures perform in practice is missing, since not all of them have been implemented, let alone evaluated experimentally. We provide reference implementations for the proposed data structures and experimentally evaluate them on a wide range of graphs. Our findings show that the current solutions are not ready to be deployed in systems as is, as every data structure has critical weaknesses when used in practice. Key limitations that must be overcome are the space and time overhead incurred by balanced data structures, the degeneration of the runtime of space-efficient data structures in worst-case scenarios, and the maintenance costs for balanced data structures. We detail our findings in the experimental evaluation and provide recommendations for implementing robust solutions for answering connectivity queries on dynamic graphs.

AquaPipe: A Quality-Aware Pipeline for Knowledge Retrieval and Large Language Models

Knowledge retrieval methods such as Approximate Nearest Neighbor Search (ANNS) significantly enhance the generation quality of Large Language Models (LLMs) by introducing external knowledge, an approach known as retrieval-augmented generation (RAG). However, due to the rapid growth of data size, ANNS tends to store large-scale data on disk, which greatly increases the response time of RAG systems. This paper presents AquaPipe, which pipelines the execution of disk-based ANNS and the LLM prefill phase in a RAG system, effectively overlapping the latency of knowledge retrieval and model inference to enhance overall performance while guaranteeing data quality. First, a recall-aware prefetching strategy for ANNS enables the early return of partial text with acceptable accuracy, so the prefill phase can launch before the full results arrive. Then, we adaptively choose the remove-after-prefill or re-prefill strategy based on an LLM cost model to effectively correct pipelines disturbed by wrong early returns. Finally, the pipelined prefill dynamically changes the granularity of the chunk size to balance overlap efficiency and GPU efficiency, adjusting to ANNS tasks that converge at different speeds. Our experiments demonstrate the effectiveness of AquaPipe. It successfully masks the latency of disk-based ANNS by 56% to 99%, resulting in a 1.3× to 2.6× reduction in the response time of the RAG system, while the extra recall loss caused by prefetching is limited to approximately 1%.

Aster: Enhancing LSM-structures for Scalable Graph Database

There is a proliferation of applications requiring the management of large-scale, evolving graphs under workloads with intensive graph updates and lookups. Driven by this challenge, we introduce Poly-LSM, a high-performance key-value storage engine for graphs with the following novel techniques: (1) Poly-LSM embeds a new graph-oriented LSM-tree design that features a hybrid storage model for concisely and effectively storing graph data. (2) Poly-LSM utilizes an adaptive mechanism to handle edge insertions and deletions on graphs with optimized I/O efficiency. (3) Poly-LSM exploits the skewness of graph data to encode the key-value entries. Building upon this foundation, we further implement Aster, a robust and versatile graph database that supports the Gremlin query language, facilitating various graph applications. In our experiments, we compare Aster against several mainstream real-world graph databases. The results demonstrate that Aster outperforms all baseline graph databases, especially on large-scale graphs. Notably, on the billion-scale Twitter graph dataset, Aster achieves up to 17x throughput improvement over the best-performing baseline graph system.

Automatic Database Configuration Debugging using Retrieval-Augmented Language Models

Database management system (DBMS) configuration debugging, e.g., diagnosing poorly configured DBMS knobs and generating troubleshooting recommendations, is crucial in optimizing DBMS performance. However, the configuration debugging process is tedious and sometimes challenging, even for seasoned database administrators (DBAs) with sufficient experience in DBMS configurations and a good understanding of DBMS internals (e.g., MySQL or Oracle). To address this difficulty, we propose Andromeda, a framework that utilizes large language models (LLMs) to enable automatic DBMS configuration debugging. Andromeda serves as a natural surrogate for DBAs, answering a wide range of natural language (NL) questions on DBMS configuration issues and generating diagnostic suggestions to fix these issues. Nevertheless, directly prompting LLMs with these professional questions may result in overly generic and often unsatisfying answers. To this end, we propose a retrieval-augmented generation (RAG) strategy that effectively provides matched domain-specific contexts for a question from multiple sources: related historical questions, troubleshooting manuals, and DBMS telemetry, which significantly improves the performance of configuration debugging. To support the RAG strategy, we develop a document retrieval mechanism that addresses heterogeneous documents and design an effective method for telemetry analysis. Extensive experiments on real-world DBMS configuration debugging datasets show that Andromeda significantly outperforms existing solutions.

B-Trees Are Back: Engineering Fast and Pageable Node Layouts

Large main memory capacity and even larger data sets have motivated hybrid storage systems, which serve most transactions from memory but can seamlessly transition to flash storage. In such systems, the data structure of choice is usually a B-Tree with pageable nodes. Most academic B-Tree work considers only fixed-size records, making such designs unsuitable for most practical applications. Given the prevalence of B-Trees, surprisingly few available implementations and benchmarks of optimized B-Trees cover variable-sized records. In this paper, we describe an efficient B-Tree implementation supporting variable-sized records that incorporates six known node layout optimizations. We evaluate each optimization to guide future implementations and propose an optimized adaptive layout that can even compete with pure in-memory structures for many workloads. Our results show that well-engineered B-Trees can efficiently handle both in-memory and out-of-memory workloads.
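
A common way to fit variable-sized records into fixed-size B-Tree nodes is a slot directory: slots grow from the front of the page, record bytes grow from the back, and the slots keep the keys logically sorted without moving record bytes. The sketch below illustrates only that basic layout, under assumed page and slot formats; it omits the optimizations (e.g., prefix truncation or adaptive layouts) studied in the paper.

```python
import struct

PAGE_SIZE = 4096
SLOT_FMT = "<HH"                 # (offset, length) per slot, 4 bytes
SLOT_SIZE = struct.calcsize(SLOT_FMT)

class SlottedNode:
    """Fixed-size page holding variable-sized (key, value) records via a slot directory."""
    def __init__(self):
        self.page = bytearray(PAGE_SIZE)
        self.slots = []          # kept in Python for brevity; a real page stores them in the header
        self.free_end = PAGE_SIZE

    def _record(self, slot):
        off, length = slot
        key_len = struct.unpack_from("<H", self.page, off)[0]
        key = bytes(self.page[off + 2 : off + 2 + key_len])
        val = bytes(self.page[off + 2 + key_len : off + length])
        return key, val

    def insert(self, key: bytes, value: bytes) -> bool:
        payload = struct.pack("<H", len(key)) + key + value
        header_end = (len(self.slots) + 1) * SLOT_SIZE
        if self.free_end - len(payload) < header_end:
            return False         # node full -> caller would split
        self.free_end -= len(payload)
        self.page[self.free_end : self.free_end + len(payload)] = payload
        slot = (self.free_end, len(payload))
        # keep slots sorted by key; record bytes never move
        pos = sum(1 for s in self.slots if self._record(s)[0] < key)
        self.slots.insert(pos, slot)
        return True

    def lookup(self, key: bytes):
        for s in self.slots:     # a real node would binary-search the slot array
            k, v = self._record(s)
            if k == key:
                return v
        return None

node = SlottedNode()
node.insert(b"apple", b"fruit"); node.insert(b"kiwi", b"also fruit")
print(node.lookup(b"kiwi"))
```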

BⓈX: Subgraph Matching with Batch Backtracking Search

Subgraph matching is a fundamental problem in graph analysis. Recently, many algorithms have been developed, often using classic backtracking search. This traditional backtracking search matches one vertex at a time, denoted as BⓈ1, which can lead to redundant computations due to overlapping search spaces. To address this problem, we propose a novel batch-backtracking search framework that enables matching a set of data vertices X, denoted as BⓈX, in each backtracking step. BⓈX models the search space as a "search box", allowing for flexible search-space exploration and significantly minimizing the overlap between search spaces. It effectively selects batches to cluster data vertices with similar search spaces. For each search box, we introduce a refinement method to filter out unpromising candidate mappings. Furthermore, we propose a homomorphism-based termination condition to stop the backtracking process as early as possible and an efficient embedding enumeration method to list all embeddings within the search box simultaneously. Extensive experiments on real-world graphs demonstrate that BⓈX significantly outperforms existing state-of-the-art algorithms, achieving a speedup of one to two orders of magnitude on most graphs under the EPS metric.

BCviz: A Linear-Space Index for Mining and Visualizing Cohesive Bipartite Subgraphs

Finding the maximum biclique in a bipartite graph is a fundamental graph analysis problem. Existing methods for maximum biclique search are not very efficient: they cannot effectively reduce the size of a bipartite graph composed of large bicliques that are loosely linked together, because the graph reduction strategies they adopt only consider the local densities of vertices.

This paper proposes a novel approach to maximum biclique search. The unique feature of this approach is building a linear-space, data-driven index called BCviz that helps accurately identify subgraphs containing all bicliques with sizes no less than a certain threshold. The core technique of BCviz is determining a total order of vertices that reveals both the local density and the connectivity of the vertices. Notably, our work is the first to take connectivity into account in graph reduction. Interestingly, the total order of vertices also endows BCviz with an illustrative visualization of the distribution of cohesive subgraphs in the input graph.

To deeply understand BCviz, we carry out a theoretical study on its properties and reveal how it enables more effective graph reduction. Based on BCviz, we propose an exact maximum biclique search algorithm that searches for results on much smaller subgraphs than any existing method does.

In addition, we improve the efficiency of index construction with two techniques. One approximates an edge's local density with an upper bound that can be derived in linear time. The other is a lightweight vertex ordering method called one-spot ordering, which reduces unnecessary cohesion computations.

Extensive experiments indicate that the proposed maximum biclique search methods based on BCviz and its variants outperform the state-of-the-art search-based methods by 2--3 orders of magnitude. Compared with the state-of-the-art index for maximum biclique search, the improved BCviz index can reduce the index size by 1--2 orders of magnitude and the index construction time by up to 2 orders of magnitude.

Boosting OLTP Performance with Per-Page Logging on NVDIMM

When running OLTP workloads on flash SSDs, relational DBMSs still face write-durability overhead, which severely limits their performance. To address this challenge, we propose NV-PPL, a novel database architecture that leverages NVDIMM as a durable log cache. NV-PPL captures per-page redo logs and retains them on NVDIMM to absorb writes from DRAM to SSD. Our NV-PPL prototype, deployed on an actual NVDIMM device, demonstrates superior transaction throughput, surpassing the same-priced Vanilla MySQL by at least 6.9× and NV-SQL, a page-grained NVDIMM caching scheme, by up to 1.5×. Beyond write reduction, the page-wise logs on NVDIMM enable novel approaches such as redo-less recovery and redo-based multi-versioning. Compared to Vanilla MySQL, redo-less recovery reduces recovery time by one-third, while redo-based multi-versioning improves the latency of long-lived transactions in HTAP workloads by 3× to 18×.

Bursting Flow Query on Large Temporal Flow Networks

Recently, queries that find bursting patterns in temporal graph data have received increasing research attention. In particular, finding flows in temporal networks whose flow values are bursting in a time interval has numerous applications, such as detecting money laundering via the maximum average transfer flow in a transaction graph, or congestion via the maximum average traffic flow in a road network. Despite its usefulness, there is limited research on querying such a flow pattern. In this paper, we study a novel query that finds a flow pattern of burstiness in a temporal flow network. In a nutshell, this query aims to find the bursting flow f from a source node to a sink node such that the ratio of f's flow value to the length of f's time interval is maximized. To solve this query, we propose a first solution called BFQ that enumerates all the necessary time intervals and then computes the maximum flow value for each interval. Based on BFQ, we propose an efficient solution called BFQ*, which consists of optimization techniques that incrementally compute the maximum flows without recomputing the common parts of flows from scratch. The experimental results demonstrate the efficiency of our solutions. A case study on a real-world transaction network demonstrates the application of this bursting flow query to detecting abnormal transactions.

Capsule: An Out-of-Core Training Mechanism for Colossal GNNs

Cutting-edge platforms of graph neural networks (GNNs), such as DGL and PyG, harness the parallel processing power of GPUs to extract structural information from graph data, achieving state-of-the-art (SOTA) performance in fields such as recommendation systems, knowledge graphs, and bioinformatics. Despite the computational advantages provided by GPUs, these GNN platforms struggle with scalability challenges due to the colossal graphical structures processed and the limited memory capacities of GPUs. In response, this work introduces Capsule, a new out-of-core mechanism for large-scale GNN training. Unlike existing out-of-core GNN systems, which use main or secondary memory as operative memory and use CPU kernels during non-backpropagation computation, Capsule uses GPU memory and GPU kernels. By substantially leveraging the parallelization capabilities of GPUs, Capsule significantly enhances GNN training efficiency. In addition, Capsule can be smoothly integrated into the mainstream open-source GNN frameworks DGL and PyG in a plug-and-play manner. Through a prototype implementation and comprehensive experiments on real datasets, we demonstrate that Capsule can achieve up to a 12.02× improvement in runtime efficiency, while using only 22.24% of the main memory, compared to SOTA out-of-core GNN systems.

Cardinality Estimation of LIKE Predicate Queries using Deep Learning

Cardinality estimation of LIKE predicate queries plays an important role in the query optimization of database systems. Traditional approaches generally use a summary of the text data together with some statistical assumptions. Recently, deep learning models for cardinality estimation of LIKE predicate queries have been investigated. To provide more accurate cardinality estimates and reduce the maximum estimation errors, we propose a deep learning model that utilizes an extended N-gram table and a conditional regression header. We next investigate how to efficiently generate training data. Our LEADER (LikE predicate trAining Data gEneRation) algorithms utilize the shareable results across the relational queries corresponding to the LIKE predicates. By analyzing the queries corresponding to LIKE predicates, we develop an efficient join method and utilize the join order for fast query execution and maximal sharing of shareable results. Extensive experiments with real-life datasets confirm the efficiency of the proposed training data generation algorithms and the effectiveness of the proposed model.
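
For context, the traditional summary-based approach estimates LIKE selectivity from an n-gram table under an independence assumption; the sketch below implements that classic baseline on a toy column and is neither the proposed deep model nor the LEADER data generator.

```python
from collections import Counter

def build_ngram_table(values, n=3):
    """Count how many column values contain each n-gram at least once."""
    table = Counter()
    for v in values:
        grams = {v[i:i + n] for i in range(len(v) - n + 1)}
        table.update(grams)
    return table, len(values)

def estimate_like_substring(pattern, table, total, n=3):
    """Estimate |{v : v LIKE '%pattern%'}| by multiplying n-gram selectivities (independence assumption)."""
    grams = [pattern[i:i + n] for i in range(len(pattern) - n + 1)]
    if not grams:
        return total
    selectivity = 1.0
    for g in grams:
        selectivity *= table.get(g, 0) / total
    return selectivity * total

names = ["alice", "alicia", "bob", "malice", "carol", "aline"]
table, total = build_ngram_table(names)
print(estimate_like_substring("alic", table, total))   # estimate for LIKE '%alic%' (true count: 3)
```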

Centrum: Model-based Database Auto-tuning with Minimal Distributional Assumptions

Gaussian Process (GP)-based Bayesian optimization (BO), i.e., GP-BO, emerges as a prevailing model-based framework for DBMS (Database Management System) auto-tuning. However, recent work shows GP-BO-based DBMS auto-tuners are significantly outperformed by auto-tuners based on SMAC, which features random forest surrogate models; such results motivate us to rethink and investigate the limitations of GP-BO in auto-tuner design. We find that the fundamental assumptions of GP-BO are widely violated when modeling and optimizing DBMS performance, while tree-ensemble-BOs (e.g., SMAC) can avoid the assumption pitfalls and deliver improved tuning efficiency and effectiveness. Moreover, we argue that existing tree-ensemble-BOs restrict further advancement in DBMS auto-tuning. First, existing tree-ensemble-BOs can only achieve distribution-free point estimates, but still impose unrealistic distributional assumptions on uncertainty (interval) estimates, which can compromise surrogate modeling and distort the acquisition function. Second, recent advances in (ensemble) gradient boosting, which can further enhance surrogate modeling against vanilla GP and random forest counterparts, have rarely been applied in optimizing DBMS auto-tuners.

To address these issues, we propose a novel model-based DBMS auto-tuner, Centrum. Centrum achieves and improves distribution-free point and interval estimation in surrogate modeling with a two-phase learning procedure of stochastic gradient boosting ensembles (SGBE). Moreover, Centrum adopts a generalized, SGBE-estimated, locally-adaptive conformal prediction method to facilitate distribution-free interval (uncertainty) estimation and acquisition functions. To our knowledge, Centrum is the first auto-tuner that realizes distribution-freeness to enhance BO's practicality in DBMS auto-tuning, and the first to seamlessly fuse gradient boosting ensembles and conformal inference in BO. Extensive physical and simulation experiments on two DBMSs and three workloads show that Centrum outperforms 21 state-of-the-art (SOTA) DBMS auto-tuners based on BO with GP, random forest, gradient boosting, OOB (Out-Of-Bag) conformal ensemble, and other surrogates, as well as those based on reinforcement learning and genetic algorithms.
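
The distribution-free interval estimation that Centrum builds on can be illustrated with plain split conformal prediction: hold out a calibration set, take a finite-sample-corrected quantile of the absolute residuals, and widen every new prediction by that amount. The data below are synthetic, and the recipe is the generic one, not Centrum's locally-adaptive, SGBE-based variant.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical surrogate-model setting: predictions and observed DBMS latencies.
y_calib    = rng.normal(100, 15, size=200)          # observed performance on calibration configs
pred_calib = y_calib + rng.normal(0, 10, size=200)  # surrogate predictions (with error)
pred_new   = np.array([95.0, 120.0, 80.0])          # predictions for new candidate configs

alpha = 0.1                                          # target 90% coverage
residuals = np.abs(y_calib - pred_calib)
n = len(residuals)
# Finite-sample-corrected quantile of calibration residuals.
q = np.quantile(residuals, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

lower, upper = pred_new - q, pred_new + q
for p, lo, hi in zip(pred_new, lower, upper):
    print(f"prediction {p:.1f} -> interval [{lo:.1f}, {hi:.1f}]")
```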

Cohesiveness-aware Hierarchical Compressed Index for Community Search on Attributed Graphs

Community search on attributed graphs (CSAG) is a fundamental topic in graph data mining. Given an attributed graph G and a query node q, CSAG seeks a structural- and attribute-cohesive subgraph from G that contains q. Exact methods based on graph traversal are time-consuming, especially for large graphs. Approximate methods improve efficiency by pruning the search space with heuristics but still take hundreds of milliseconds to tens of seconds to respond, hindering their use in time-sensitive applications. Moreover, pruning strategies are typically tailored to specific algorithms and their cohesiveness metrics, making them difficult to generalize. To address this, we study a general approach to accelerate various CSAG methods. We first present a proximity graph-based, cohesiveness-aware hierarchical index that accommodates different cohesiveness metrics. Then, we present two optimizations to enhance the index's navigability and reliability. Finally, we design a compressed storage structure for space-efficient indexing. Experiments on real-world datasets show that integrating our index with existing mainstream CSAG methods results in an average 30.7× speedup while maintaining a comparable or even better attribute cohesiveness.

Computing Approximate Graph Edit Distance via Optimal Transport

Given a graph pair (G1, G2), the graph edit distance (GED) is defined as the minimum number of edit operations converting G1 to G2. GED is a fundamental operation widely used in many applications, but its exact computation is NP-hard, so the approximation of GED has gained a lot of attention. Data-driven learning-based methods have been found to provide superior results compared to classical approximate algorithms, but they directly fit the coupling relationship between a pair of vertices from their vertex features. We argue that while pairwise vertex features can capture the coupling cost (discrepancy) of a pair of vertices, the vertex coupling matrix should be derived from the vertex-pair cost matrix through a well-established method that is aware of the global context of the graph pair, such as optimal transport. In this paper, we propose an ensemble approach that integrates a supervised learning-based method and an unsupervised method, both based on optimal transport. Our learning method, GEDIOT, is based on inverse optimal transport and leverages a learnable Sinkhorn algorithm to generate the coupling matrix. Our unsupervised method, GEDGW, models GED computation as a linear combination of optimal transport and its variant, the Gromov-Wasserstein discrepancy, for node and edge operations, respectively, which can be solved efficiently without needing the ground truth. Our ensemble method, GEDHOT, combines GEDIOT and GEDGW to further boost performance. Extensive experiments demonstrate that our methods significantly outperform existing methods in terms of GED computation, edit path generation, and model generalizability.
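
The optimal-transport step can be made concrete with a minimal entropic Sinkhorn iteration: given a vertex-pair cost matrix and marginals, it produces a soft coupling matrix. The cost matrix below is hypothetical, and this generic routine stands in for neither GEDIOT's learnable Sinkhorn nor GEDGW's Gromov-Wasserstein component.

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.1, n_iters=200):
    """Entropic-regularized OT: returns a coupling matrix with row sums ~a and column sums ~b."""
    K = np.exp(-cost / reg)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Hypothetical vertex-pair discrepancy costs between two 3-vertex graphs.
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.7, 0.1]])
a = np.full(3, 1 / 3)                       # uniform marginal over G1's vertices
b = np.full(3, 1 / 3)                       # uniform marginal over G2's vertices

coupling = sinkhorn(cost, a, b)
print(np.round(coupling, 3))                # near-diagonal: vertex i of G1 maps to vertex i of G2
```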

Constant Optimization Driven Database System Testing

Logic bugs are bugs that can cause database management systems (DBMSs) to silently produce incorrect results for given queries. Such bugs are severe, because they can easily be overlooked by both developers and users, and can cause applications that rely on the DBMSs to malfunction. In this work, we propose Constant-Optimization-Driven Database Testing (CODDTest) as a novel approach for detecting logic bugs in DBMSs. This method draws inspiration from two well-known optimizations in compilers: constant folding and constant propagation. Our key insight is that for a certain database state and query containing a predicate, we can apply constant folding on the predicate by replacing an expression in the predicate with a constant, anticipating that the results of this predicate remain unchanged; any discrepancy indicates a bug in the DBMS. We evaluated CODDTest on five mature and extensively-tested DBMSs--SQLite, MySQL, CockroachDB, DuckDB, and TiDB--and found 45 unique, previously unknown bugs in them. Out of these, 24 are unique logic bugs. Our manual analysis of the state-of-the-art approaches indicates that 11 logic bugs are detectable only by CODDTest. We believe that CODDTest is easy to implement, and can be widely adopted in practice.
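
The core oracle can be reproduced in a few lines: evaluate a subexpression of a predicate against the current database state, substitute the resulting constant back into the query, and check that the result set is unchanged. The toy example below uses SQLite via sqlite3 with a hypothetical schema and only illustrates the principle; CODDTest's expression rewriting is far more general.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t(a INT, b INT);
    INSERT INTO t VALUES (1, 10), (2, 20), (3, 30);
""")

subexpr = "(SELECT MIN(b) FROM t)"
original = f"SELECT a FROM t WHERE b > {subexpr}"

# Constant folding: evaluate the subexpression once against the current state ...
constant = conn.execute(f"SELECT {subexpr}").fetchone()[0]
# ... and substitute the constant into the predicate.
folded = f"SELECT a FROM t WHERE b > {constant}"

rows_original = sorted(conn.execute(original).fetchall())
rows_folded = sorted(conn.execute(folded).fetchall())

# Any discrepancy would indicate a logic bug in the DBMS under test.
assert rows_original == rows_folded, f"logic bug suspected: {rows_original} != {rows_folded}"
print("results match:", rows_original)
```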

CRDV: Conflict-free Replicated Data Views

There are now multiple proposals for Conflict-free Replicated Data Types (CRDTs) in SQL databases aimed at distributed systems. Some, such as ElectricSQL, provide only relational tables as convergent replicated maps, but this omits semantics that would be useful for merging updates. Others, such as Pg_crdt, provide access to a rich library of encapsulated column types. However, this puts merge and query processing outside the scope of the query optimizer and restricts the ability of an administrator to influence access paths with materialization and indexes.

Our proposal, CRDV, overcomes this challenge by using two layers implemented as SQL views: The first provides a replicated relational table from an update history, while the second implements varied and rich types on top of the replicated table. This allows the definition of merge semantics, or even entire new data types, in SQL itself, and enables global optimization of user queries together with merge operations. Therefore, it naturally extends the scope of query optimization and local transactions to operations on replicated data, can be used to reproduce the functionality of common CRDTs with simple SQL idioms, and results in better performance than alternatives.

Data Chunk Compaction in Vectorized Execution

Modern analytical database management systems often adopt vectorized query execution engines that process columnar data in batches (i.e., data chunks) to minimize the interpretation overhead and improve CPU parallelism. However, certain database operators, especially hash joins, can drastically reduce the number of valid entries in a data chunk, resulting in numerous small chunks in an execution pipeline. These small chunks cannot fully enjoy the benefits of vectorized query execution, causing significant performance degradation. The key research question is when and how to compact these small data chunks during query execution. In this paper, we first model the chunk compaction problem and analyze the trade-offs between different compaction strategies. We then propose a learning-based algorithm that can adjust the compaction threshold dynamically at run time. To answer the "how" question, we propose a compaction method for the hash join operator, called logical compaction, that minimizes data movements when compacting data chunks. We implemented the proposed techniques in the state-of-the-art DuckDB and observed up to 63% speedup when evaluated using the Join Order Benchmark, TPC-H, and TPC-DS.
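
The question of when to compact can be illustrated with a fixed-threshold compactor that buffers undersized post-filter chunks and merges them into full chunks before passing them downstream. The chunk size and threshold below are hypothetical, and the sketch captures neither the paper's learned dynamic threshold nor the logical compaction inside the hash join.

```python
import numpy as np

CHUNK_SIZE = 2048                # vector size used by the engine (hypothetical)
COMPACT_THRESHOLD = 0.5          # compact chunks that are less than 50% full

class Compactor:
    """Buffers undersized chunks and emits full chunks to the next operator."""
    def __init__(self):
        self.buffer = []

    def push(self, chunk):
        if len(chunk) >= COMPACT_THRESHOLD * CHUNK_SIZE:
            yield chunk                              # large enough: pass through untouched
            return
        self.buffer.append(chunk)
        total = sum(len(c) for c in self.buffer)
        while total >= CHUNK_SIZE:                   # enough buffered rows for a full chunk
            merged = np.concatenate(self.buffer)
            yield merged[:CHUNK_SIZE]
            leftover = merged[CHUNK_SIZE:]
            self.buffer = [leftover] if len(leftover) else []
            total = len(leftover)

    def flush(self):
        if self.buffer:
            yield np.concatenate(self.buffer)
            self.buffer = []

# Simulate a selective hash join that shrinks every 2048-row chunk to ~10% of its size.
rng = np.random.default_rng(0)
compactor, out = Compactor(), []
for _ in range(100):
    post_join = rng.integers(0, 1000, size=int(CHUNK_SIZE * 0.1))
    out.extend(compactor.push(post_join))
out.extend(compactor.flush())
print(f"{len(out)} chunks emitted, sizes: {sorted({len(c) for c in out})}")
```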

DataVinci: Learning Syntactic and Semantic String Repairs

String data is common in real-world datasets: 67.6% of values in a sample of 1.8 million real Excel spreadsheets from the web were represented as text. Automatically cleaning such string data can have a significant impact on users. Previous approaches are limited to error detection, require the user to provide annotations, examples, or constraints to fix the errors, and focus independently on either syntactic errors or semantic errors in strings, ignoring that strings often contain both syntactic and semantic substrings. We introduce DataVinci, a fully unsupervised string data error detection and repair system. DataVinci learns regular-expression-based patterns that cover a majority of values in a column and reports values that do not satisfy such majority patterns as data errors. DataVinci can automatically derive edits for a data error based on the majority patterns, using row tuples associated with majority values as examples. To handle strings with both syntactic and semantic substrings, DataVinci uses an LLM to abstract (and re-concretize) the portions of strings that are semantic. Because not all data columns result in majority patterns, DataVinci can, when available, leverage execution information from an existing data program (which uses the target data as input) to identify and correct data errors that would not otherwise be identified. DataVinci outperforms eleven baseline systems on both data error detection and repair, as demonstrated on four existing and new benchmarks.
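
The majority-pattern idea behind detection can be sketched with a crude character-class abstraction: map each value to a shape, find the dominant shape, and flag values that deviate from it. The shapes and threshold below are simplifications (no LLM-based semantic abstraction, no repair derivation) and do not reflect DataVinci's learned patterns.

```python
import re
from collections import Counter

def shape(value: str) -> str:
    """Abstract a string into a character-class pattern, e.g. '2024-01-31' -> 'D-D-D'."""
    s = re.sub(r"[A-Za-z]+", "L", value)   # collapse letter runs first ...
    return re.sub(r"\d+", "D", s)          # ... then digit runs (placeholders stay distinct)

def detect_errors(column, majority=0.5):
    shapes = Counter(shape(v) for v in column)
    pattern, count = shapes.most_common(1)[0]
    if count / len(column) < majority:
        return []                          # no majority pattern -> fall back to other signals
    return [v for v in column if shape(v) != pattern]

dates = ["2024-01-31", "2023-12-05", "2024-7-4", "Jan 5, 2024", "2022-11-30"]
print(detect_errors(dates))                # -> ['Jan 5, 2024']; '2024-7-4' still matches 'D-D-D'
```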

Deep Overlapping Community Search via Subspace Embedding

Overlapping Community Search (OCS) identifies nodes that interact with multiple communities based on a specified query. Existing community search approaches fall into two categories: algorithm-based models and Machine Learning-based (ML) models. Despite the long-standing focus on this topic within the database domain, current solutions face two major limitations: 1) Both approaches fail to address personalized user requirements in OCS, consistently returning the same set of nodes for a given query regardless of user differences. 2) Existing ML-based CS models suffer from severe training efficiency issues. In this paper, we formally redefine the problem of OCS. By analyzing the gaps in both types of approaches, we then propose a general solution for OCS named Sparse Subspace Filter (SSF), which can extend any ML-based CS model to enable personalized search in overlapping structures. To overcome the efficiency issue in the current models, we introduce Simplified Multi-hop Attention Networks (SMN), a lightweight yet effective community search model with larger receptive fields. To the best of our knowledge, this is the first ML-based study of overlapping community search. Extensive experiments validate the superior performance of SMN within the SSF pipeline, achieving a 13.73% improvement in F1-Score and up to 3 orders of magnitude acceleration in model efficiency compared to state-of-the-art approaches.

DEG: Efficient Hybrid Vector Search Using the Dynamic Edge Navigation Graph

Bimodal data, such as image-text pairs, has become increasingly prevalent in the digital era. The Hybrid Vector Query (HVQ) is an effective approach for querying such data and has recently garnered considerable attention from researchers. It calculates similarity scores for objects represented by two vectors using a weighted sum of each individual vector's similarity, with a query-specific parameter α determining the weight. Existing methods for HVQ typically construct Approximate Nearest Neighbors Search (ANNS) indexes with a fixed α value. This leads to significant performance degradation when the query's α changes dynamically across different scenarios and needs.

In this study, we introduce the Dynamic Edge Navigation Graph (DEG), a graph-based ANNS index that maintains efficiency and accuracy with changing α values. It includes three novel components: (1) a greedy Pareto frontier search algorithm to compute a candidate neighbor set for each node, which comprises the node's approximate nearest neighbors for all possible α values; (2) a dynamic edge pruning strategy to determine the final edges from the candidate set and assign each edge an active range. This active range enables the dynamic use of the Relative Neighborhood Graph's pruning strategy based on the query's α values, skipping redundant edges at query time and achieving a better accuracy-efficiency trade-off; and (3) an edge seed method that accelerates the querying process. Extensive experiments on real-world datasets show that DEG demonstrates superior performance compared to existing methods under varying α values.

Density Decomposition of Bipartite Graphs

Mining dense subgraphs in a bipartite graph is a fundamental task in bipartite graph analysis, with numerous applications in community detection, fraud detection, and e-commerce recommendation. Existing dense subgraph models, such as biclique, k-biplex, k-bitruss, and (α,β)-core, often face challenges due to their high computational complexity or limitations in effectively capturing the density of the graph. To overcome these issues, in this paper, we propose a new dense subgraph model for bipartite graphs, namely (α,β)-dense subgraph, designed to capture the density structure inherent in bipartite graphs. We show that all (α,β)-dense subgraphs are nested within each other, forming a hierarchical density decomposition of the bipartite graph. To efficiently compute the (α,β)-dense subgraph, we develop a novel network flow algorithm with a carefully-designed core pruning technique. The time complexity of our algorithm is O(|E| + |E(R)|^1.5), where |E| denotes the number of edges and |E(R)| is the number of edges of the pruned graph, often significantly smaller than |E|. Armed with this algorithm, we also propose a novel and efficient divide-and-conquer algorithm to compute the entire density decomposition of the bipartite graph within O(p ⋅ log d_max ⋅ |E|^1.5) time, where p is typically a small constant in real-world bipartite graphs and d_max is the maximum degree. Extensive experiments and case studies on 11 real-world datasets demonstrate the effectiveness of our (α,β)-dense subgraph model and the high efficiency and scalability of our proposed algorithms.

Dialogue Benchmark Generation from Knowledge Graphs with Cost-Effective Retrieval-Augmented LLMs

Dialogue benchmarks are crucial in training and evaluating chatbots engaging in domain-specific conversations. Knowledge graphs (KGs) represent semantically rich and well-organized data spanning various domains, such as DBLP, DBpedia, and YAGO. Traditionally, dialogue benchmarks have been manually created from documents, neglecting the potential of KGs in automating this process. Some question-answering benchmarks are automatically generated using extensive preprocessing from KGs, but they do not support dialogue generation. This paper introduces Chatty-Gen, a novel multi-stage retrieval-augmented generation platform for automatically generating high-quality dialogue benchmarks tailored to a specific domain using a KG. Chatty-Gen decomposes the generation process into manageable stages and uses assertion rules for automatic validation between stages. Our approach enables control over intermediate results to prevent time-consuming restarts due to hallucinations. It also reduces reliance on costly and more powerful commercial LLMs. Chatty-Gen eliminates upfront processing of the entire KG using efficient query-based retrieval to find representative subgraphs based on the dialogue context. Our experiments with several real and large KGs demonstrate that Chatty-Gen significantly outperforms state-of-the-art systems and ensures consistent model and system performance across multiple LLMs of diverse capabilities, such as GPT-4o, Gemini 1.5, Llama 3, and Mistral.

DISCES: Systematic Discovery of Event Stream Queries

The continuous evaluation of queries over an event stream provides the foundation for reactive applications in various domains. Yet, knowledge of the queries that detect distinguished event patterns, i.e., potential causes of a situation of interest, is often not directly available. However, given a database of finite, historic (sub-)streams that have been gathered whenever a situation of interest was observed, one may aim at automatic discovery of the respective queries. Existing algorithms for event query discovery incorporate ad-hoc design choices, though, and it is unclear how their suitability for a database should be assessed.

In this paper, we address this gap with DISCES, an algorithmic framework for event query discovery. DISCES outlines a design space for discovery algorithms, thereby making the design choices explicit. We instantiate the framework to derive four specific algorithms, which all yield correct and complete results, but differ in their runtime sensitivity. We therefore also provide guidance on how to select one of the algorithms for a given database based on a few of its essential properties. Our experiments using simulated and real-world data illustrate that our algorithms are indeed tailored to databases showing certain properties and solve the query discovery problem several orders of magnitude faster than existing approaches.

Disco: A Compact Index for LSM-trees

Many key-value stores and database systems use log-structured merge-trees (LSM-trees) as their storage engines because of their excellent write performance. However, the read performance of LSM-trees is suboptimal due to the overlapping sorted runs. Most existing efforts rely on filters to reduce unnecessary I/Os, but filters fundamentally do not help locate items and often become the bottleneck of the system. We identify that the lack of an efficient index is the root cause of subpar read performance in LSM-trees. In this paper, we propose Disco: a compact index for LSM-trees. Disco indexes all the keys in an LSM-tree, so a query does not have to search every run of the LSM-tree. It records compact key representations to minimize the number of key comparisons and thereby minimize cache misses and I/Os for both point and range queries. Disco guarantees that both point queries and seeks issue at most one I/O to the underlying runs, achieving an I/O efficiency close to that of a B+-tree. Disco improves upon REMIX's pioneering multi-run index design with additional compact key representations that help improve read performance. The representations are compact, so the cost of persisting Disco to disk is small. Moreover, while a traditional LSM-tree has to choose a more aggressive compaction policy that slows down write performance to achieve better read performance, a Disco-indexed LSM-tree can employ a write-efficient policy and still have good read performance. Experimental results show that Disco can save I/Os and improve point and range query performance by up to 220% over RocksDB while maintaining efficient writes.

DiskGNN: Bridging I/O Efficiency and Model Accuracy for Out-of-Core GNN Training

Graph neural networks (GNNs) are models specialized for graph data and are widely used in applications. To train GNNs on large graphs that exceed CPU memory, several systems have been designed to store data on disk and conduct out-of-core processing. However, these systems suffer from either read amplification, when conducting random reads for node features that are smaller than a disk page, or degraded model accuracy, when treating the graph as disconnected partitions. To close this gap, we build DiskGNN for high I/O efficiency and fast training without model accuracy degradation. The key technique is offline sampling, which decouples graph sampling from model computation. In particular, by conducting graph sampling beforehand for multiple mini-batches, DiskGNN acquires the node features that will be accessed during model computation and conducts pre-processing to pack the node features of each mini-batch contiguously on disk to avoid read amplification during computation. Given the feature access information acquired by offline sampling, DiskGNN also adopts designs including a four-level feature store to fully utilize the memory hierarchy of GPU and CPU to cache hot node features and reduce disk access, batched packing to accelerate feature packing during pre-processing, and pipelined training to overlap disk access with other operations. We compare DiskGNN with state-of-the-art out-of-core GNN training systems. The results show that DiskGNN achieves more than an 8× speedup over existing systems while matching their best model accuracy. DiskGNN is open-source at https://github.com/Liu-rj/DiskGNN.

Dual-Hierarchy Labelling: Scaling Up Distance Queries on Dynamic Road Networks

Computing the shortest-path distance between any two given vertices in road networks is an important problem. A tremendous amount of research has been conducted to address this problem, most of which is limited to static road networks. Since road networks undergo various real-time traffic conditions, there is a pressing need to address this problem for dynamic road networks. Existing state-of-the-art methods incrementally maintain an indexing structure to reflect dynamic changes on road networks. However, these methods suffer from either slow query response time or poor maintenance performance, particularly when road networks are large. In this work, we propose an efficient solution, Dual-Hierarchy Labelling (DHL), for distance querying on dynamic road networks from a novel perspective, which incorporates two hierarchies with different but complementary data structures to support efficient query and update processing. Specifically, our proposed solution comprises three main components: a query hierarchy, an update hierarchy, and hierarchical labelling. The query hierarchy enables efficient query answering by exploring only a small subset of vertices in the labels of the two query vertices, while the update hierarchy supports efficient maintenance of the distance labelling under edge weight increases or decreases. We further develop dynamic algorithms that reflect dynamic changes by efficiently maintaining the update hierarchy and hierarchical labelling. We also propose a parallel variant of our dynamic algorithms by exploiting the labelling structure, which aligns well with parallel processing. We evaluate our methods on 10 large road networks, and the results show that our methods significantly outperform the state-of-the-art methods, achieving considerably faster construction and update times while being consistently 2-4 times faster in query processing and consuming only 10%-20% of the labelling space.

Efficient Index Maintenance for Effective Resistance Computation on Evolving Graphs

In this paper, we study the problem of index maintenance on evolving graphs for effective resistance computation. Unlike an existing matrix-based index, we show that the index can be efficiently maintained by directly preserving samples of random walks and loop-erased walks. This approach not only enables efficient storage and rapid query response but also supports effective maintenance. We propose a novel approach to convert edge updates into landmark node updates. Building upon this, we present two new update algorithms for random walk and loop-erased walk samples, respectively. Both algorithms update samples without requiring complete resampling, ensuring accuracy and high efficiency. A particularly challenging and innovative technique involves updating loop-erased walks. Here, we develop a novel and powerful cycle decomposition technique for loop-erased walks, enabling us to update samples at the cycle level rather than the node level, significantly enhancing efficiency. Furthermore, we show that both of our methods achieve an Õ(1) time complexity per edge update in real-world graphs under a mild assumption. We conduct extensive experiments using 10 large real-world datasets to evaluate the performance of our approaches. The results show that our best algorithm can be up to two orders of magnitude faster than the baseline methods.

Efficient Maximum s-Bundle Search via Local Vertex Connectivity

The s-bundle, a cohesive subgraph model that relaxes the clique, remains connected whenever fewer than n-s vertices are removed, where n is the number of vertices in the subgraph. Finding the largest s-bundle is a fundamental problem and has diverse applications in fields such as social network analysis, graph visualization, and bioinformatics. Existing studies for solving the problem follow the same branch-and-bound framework and improve efficiency by developing pruning techniques. As a result, all share the same worst-case time complexity of O*(2^n), where O* suppresses polynomial factors. In this paper, we propose a new branch-and-bound algorithm, called SymBD, which achieves improved theoretical guarantees and practical performance. It adopts the existing Symmetric-BK branching strategy, whose performance highly depends on the ordering of vertices. We explore various vertex orderings for improving the performance. In particular, we propose two novel vertex orderings based on local vertex connectivity. With the proposed vertex orderings, SymBD improves the worst-case time complexity to O*(λ_s^n), where λ_s is strictly less than 2. To further boost practical efficiency, we introduce a heuristic algorithm for computing a large initial solution and a divide-and-conquer strategy. Extensive experiments on 664 graphs demonstrate that our algorithm is up to five orders of magnitude faster than existing solutions.

Efficiently Counting Triangles in Large Temporal Graphs

In many real-world applications (e.g., email networks, social networks, and phone call networks), the relationships between entities can be modeled as a temporal graph, in which each edge is associated with a timestamp representing the interaction time. As a fundamental task in temporal graph analysis, triangle counting has received much attention, and several triangle models have been developed, including the δ-temporal triangle, the sliding-window triangle, and the (δ_{1,3}, δ_{1,2}, δ_{2,3})-temporal triangle. In particular, the δ-temporal triangle, which requires the gap between the timestamps of any two of its edges to be bounded by a threshold δ, has been demonstrated to be effective in many real applications, such as cohesiveness analysis, transitivity, clustering coefficients, and graph classification. In this paper, we study fast algorithms for counting δ-temporal triangles in a given query time window. We first propose an online algorithm, which enumerates all edges in the graph and, for each edge, calculates how many δ-temporal triangles end with that edge. We further develop an efficient index-based solution, which maps δ-temporal triangles into points of the 2-dimensional space and compactly organizes these points using hierarchical structures. In addition, we study the problem of binary δ-temporal triangle counting, which considers the existence of a δ-temporal triangle among three vertices. Experiments on large temporal graphs show that our online algorithm is up to 70× faster than the state-of-the-art algorithm, and our index-based algorithm is up to 108× faster than the online algorithm.
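
Using the definition above (the timestamps of any two edges in the triangle differ by at most δ), a brute-force counter over a query window is easy to state, as sketched below on a toy temporal graph; it is only a naive reference and nothing like the paper's online or index-based algorithms.

```python
from itertools import combinations

def count_delta_triangles(edges, delta, window):
    """Count δ-temporal triangles whose edges all fall inside the query window [ws, we]."""
    ws, we = window
    in_win = [(u, v, t) for (u, v, t) in edges if ws <= t <= we]
    count = 0
    for (e1, e2, e3) in combinations(in_win, 3):
        ts = [e1[2], e2[2], e3[2]]
        if max(ts) - min(ts) > delta:
            continue
        verts = {e1[0], e1[1], e2[0], e2[1], e3[0], e3[1]}
        pairs = {frozenset(e[:2]) for e in (e1, e2, e3)}
        if len(verts) == 3 and len(pairs) == 3:   # three edges on three distinct vertex pairs
            count += 1
    return count

# Toy temporal graph: (u, v, timestamp)
edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 3), ("a", "c", 10), ("c", "d", 4)]
print(count_delta_triangles(edges, delta=2, window=(0, 20)))   # -> 1 (the triangle at times 1,2,3)
```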

Efficiently Processing Joins and Grouped Aggregations on GPUs

There is a growing interest in leveraging GPUs for tasks beyond ML, especially in database systems. Despite extensive existing work on GPU-based database operators, several questions remain open. For instance, the performance of almost all operators suffers from random accesses, which can account for up to 75% of the runtime. In addition, the group-by operator, which is widely used in combination with joins, has not been fully explored for GPU acceleration. Furthermore, existing work often uses limited and unrepresentative workloads for evaluation and does not explore the query optimization aspect, i.e., how to choose the most efficient implementation based on the workload. In this paper, we revisit the state-of-the-art GPU-based join and group-by implementations. We identify their inefficiencies and propose several optimizations. We introduce GFTR, a novel technique to reduce random accesses, leading to speedups of up to 2.3x. We further optimize existing hash-based and sort-based group-by implementations, achieving significant speedups (19.4x and 1.7x, respectively). We also present a new partition-based group-by algorithm ideal for high group cardinalities. We analyze the optimizations with cost models, allowing us to predict the speedup. Finally, we conduct a performance evaluation to analyze each implementation. We conclude by providing practical heuristics to guide query optimizers in selecting the most efficient implementation for a given workload.

Entity/Relationship Graphs: Principled Design, Modeling, and Data Integrity Management of Graph Databases

Chen's Entity/Relationship (E/R) framework is a lingua franca for well-designed databases. We define E/R graphs as property graphs that are instances of E/R diagrams. As the latter are a subclass of PG-Schema, E/R modeling constitutes a methodology for designing graph databases that guarantee data integrity as well as the absence of data redundancy and update anomalies. In addition, E/R graphs provide the first graph semantics for E/R diagrams. Further to the unification of conceptual and graph data modeling, referential integrity for E/R graphs can be managed by directed edges, called E/R links, between nodes. As a consequence, redundancy and sources of potential inconsistency can be eliminated, minimizing update maintenance. This is achieved by E/R keys that use properties and E/R links to enforce entity integrity, in contrast to property keys that rely exclusively on properties to enforce entity integrity. We use the TPC-H benchmark as a running example and for extensive experiments that quantify the effort for i) managing entity integrity using property keys or E/R keys, ii) managing referential integrity using property redundancy or E/R links, and iii) query evaluation. In summary, E/R diagrams form a principled core of PG-Schema for well-designed property graphs, while E/R keys constitute an efficient core of PG-Key for data integrity management.

FastPDB: Towards Bag-Probabilistic Queries at Interactive Speeds

Probabilistic databases (PDBs) provide users with a principled way to query data that is incomplete or imprecise. In this work, we study computing expected multiplicities of query results over probabilistic databases under bag semantics, which has PTIME data complexity. However, does this imply that bag probabilistic databases are practical? We strive to answer this question from both a theoretical and a systems perspective. We employ concepts from fine-grained complexity to demonstrate that exact bag probabilistic query processing is fundamentally less efficient than deterministic bag query evaluation, but that fast approximations are possible by sampling monomials from a circuit representation of a result tuple's lineage. A remaining issue, however, is that constructing such circuits, while in PTIME, can nonetheless have significant overhead. To avoid this cost, we utilize approximate query processing techniques to directly sample monomials without materializing lineage upfront. Our implementation in FastPDB provides accurate anytime approximation of probabilistic query answers and scales to datasets orders of magnitude larger than competing methods.

Federated Heavy Hitter Analytics with Local Differential Privacy

Federated heavy hitter analytics enables service providers to better understand the preferences of cross-party users by analyzing the most frequent items. As with federated learning, it faces challenges of privacy concerns, statistical heterogeneity, and expensive communication. Local differential privacy (LDP), as the de facto standard for privacy-preserving data collection, solves the privacy challenge by letting each user perturb her data locally and report the sanitized version. However, in federated settings, applying LDP complicates the other two challenges, because the injected LDP noise deteriorates utility and the perturbation mechanism increases communication and computation costs. To tackle these problems, we propose a novel target-aligning prefix tree mechanism satisfying ε-LDP for federated heavy hitter analytics. In particular, we propose an adaptive extension strategy to address the inconsistencies between covering necessary prefixes and estimating heavy hitters within a party to enhance the utility. We also present a consensus-based pruning strategy that utilizes noisy prior knowledge from other parties to further align the inconsistency between finding heavy hitters in each party and providing reasonable frequency information to identify the global ones. To the best of our knowledge, our study provides the first solution to federated heavy hitter analytics in a cross-party setting while satisfying the stringent ε-LDP. Comprehensive experiments on both real-world and synthetic datasets confirm the effectiveness of our proposed mechanism.
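
For readers unfamiliar with the LDP primitive underlying such mechanisms, the sketch below shows plain k-ary (generalized) randomized response and its unbiased frequency estimator; it satisfies ε-LDP but is not the paper's target-aligning prefix tree mechanism, and all names and parameters are illustrative.

```python
import math
import random
from collections import Counter

def grr_perturb(value, domain, epsilon, rng=random):
    """k-ary (generalized) randomized response: report the true value with
    probability p = e^eps / (e^eps + k - 1), otherwise a uniformly random
    other value; this satisfies epsilon-LDP."""
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return value
    return rng.choice([v for v in domain if v != value])

def grr_estimate(reports, domain, epsilon):
    """Unbiased frequency estimates computed from the perturbed reports."""
    k, n = len(domain), len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = (1 - p) / (k - 1)
    counts = Counter(reports)
    return {v: (counts[v] - n * q) / (p - q) for v in domain}

# Toy example: one party's users report items under eps-LDP, and the collector
# estimates item frequencies to pick the heaviest hitters.
domain = ["a", "b", "c", "d"]
truth = ["a"] * 600 + ["b"] * 300 + ["c"] * 80 + ["d"] * 20
reports = [grr_perturb(v, domain, epsilon=2.0) for v in truth]
est = grr_estimate(reports, domain, epsilon=2.0)
print(sorted(est, key=est.get, reverse=True)[:2])  # most likely ['a', 'b']
```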

Graph-Based Vector Search: An Experimental Evaluation of the State-of-the-Art

Vector data is prevalent across business and scientific applications, and its popularity is growing with the proliferation of learned embeddings. Vector data collections often reach billions of vectors with thousands of dimensions, thus, increasing the complexity of their analysis. Vector search is the backbone of many critical analytical tasks, and graph-based methods have become the best choice for analytical tasks that do not require guarantees on the quality of the answers. We briefly survey in-memory graph-based vector search, outline the chronology of the different methods and classify them according to five main design paradigms: seed selection, incremental insertion, neighborhood propagation, neighborhood diversification, and divide-and-conquer. We conduct an exhaustive experimental evaluation of twelve state-of-the-art methods on seven real data collections, with sizes up to 1 billion vectors. We share key insights about the strengths and limitations of these methods; e.g., the best approaches are typically based on incremental insertion and neighborhood diversification, and the choice of the base graph can hurt scalability. Finally, we discuss open research directions, such as the importance of devising more sophisticated data-adaptive seed selection and diversification strategies.

H-Rocks: CPU-GPU accelerated Heterogeneous RocksDB on Persistent Memory

Persistent key-value stores (pKVS) such as RocksDB are critical to many internet-scale services. Recent works leveraged persistent memory (PM) to improve pKVS throughput. However, they are typically limited to CPUs.

We develop H-Rocks to judiciously leverage both the CPU and the Graphics Processing Unit (GPU) for accelerating a wide range of RocksDB operations. H-Rocks selectively accelerates performance-critical parts of RocksDB on the GPU. It uses operation sub-batching and key-value versioning to leverage the GPU's parallelism while maintaining compatibility with RocksDB. It harnesses the GPU's high-bandwidth memory while limiting data movement between the CPU and GPU. In YCSB workloads, H-Rocks outperforms CPU-based pKVSs like Viper, Plush, and pmem-RocksDB by 3-18×.

HyperMR: Efficient Hypergraph-enhanced Matrix Storage on Compute-in-Memory Architecture

Matrix-vector multiplication (MVM) operations, essential for modern hardware architectures, suffer from heavy I/O overheads and costly serial multiply-add operations. The emerging Compute-in-Memory (CIM) architecture alleviates these issues by enabling in situ MVM operations with O(1) time complexity, eliminating the need to move matrices. However, current storage schemes are still inefficient on CIM due to limited optimization objectives and inflexible support for various access patterns and matrix structures. To address this, we propose HyperMR, a hypergraph-enhanced matrix storage scheme for CIM architectures. First, we identify two performance optimization objectives that are tailored to CIM and prove their NP-hardness. We then introduce a hypergraph modeling approach with a novel access-aware hypergraph generation algorithm to handle diverse matrix structures and access patterns. Moreover, we present a two-phase hypergraph partitioning method to efficiently tackle the NP-hard optimization objectives. Experimental results show that HyperMR outperforms multiple state-of-the-art storage schemes, offering valid optimization for all evaluated matrices, compared to the best-performing baseline, which optimizes only 75% of them. HyperMR also achieves the best average optimization performance for matrix storage layouts, significantly improving efficiency in varied workload scenarios, with a 29.65% improvement on synthetic queries and up to 34.9% on scientific image filtering.

In-Database Time Series Clustering

Time series data are often clustered repeatedly across various time ranges to mine frequent subsequence patterns from different periods, which could further support downstream applications. Existing state-of-the-art (SOTA) time series clustering methods, such as K-Shape, can proficiently cluster time series data according to their shapes. However, the in-database time series clustering problem has been neglected, especially in IoT scenarios with large data volumes and high efficiency demands. Most time series databases employ LSM-Tree based storage to support intensive writes, which leaves the underlying data points out of order with respect to their timestamps. Therefore, to apply existing out-of-database methods, all data points must be fully loaded into memory and chronologically sorted. Additionally, out-of-database methods must cluster from scratch each time, making them inefficient when handling queries across different time ranges. In this work, we propose an in-database adaptation of the SOTA time series clustering method K-Shape. Moreover, to solve the problem that K-Shape cannot efficiently handle long time series, we propose Medoid-Shape, as well as its in-database adaptation for further acceleration. Extensive experiments are conducted to demonstrate the higher efficiency of our proposals, with comparable effectiveness. Remarkably, all proposals have already been implemented in an open-source commodity time series database, Apache IoTDB.
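
For context, K-Shape clusters series under the shape-based distance (SBD), defined from the maximum normalized cross-correlation over all shifts. The sketch below illustrates only that distance (assuming the series are already z-normalized, as K-Shape would do); it does not reflect the paper's in-database adaptation or Medoid-Shape.

```python
import numpy as np

def shape_based_distance(x, y):
    """Shape-based distance (SBD) used by K-Shape:
    SBD(x, y) = 1 - max over all shifts of the cross-correlation of x and y,
    normalized by the product of their norms; it ranges from 0 (same shape)
    to 2. Inputs are assumed to be pre-normalized."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cc = np.correlate(x, y, mode="full")           # cross-correlation at every shift
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0:
        return 1.0
    return 1.0 - cc.max() / denom

# A phase-shifted copy of a series has a small SBD; unrelated noise does not.
t = np.linspace(0, 4 * np.pi, 200)
a = np.sin(t)
b = np.roll(a, 15)                                  # same shape, shifted
c = np.random.default_rng(0).normal(size=200)
print(round(shape_based_distance(a, b), 3), round(shape_based_distance(a, c), 3))
```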

InTime: Towards Performance Predictability In Byzantine Fault Tolerant Proof-of-Stake Consensus

Performance predictability, ensuring low latency variability, is crucial for the reliability and efficiency of blockchain consensus. Byzantine Fault Tolerant Proof-of-Stake (BFT-PoS) consensus aims to achieve stable transaction processing latency by scheduling block generation at consistent intervals. However, BFT-PoS's incentive mechanisms grant all transaction tips to the block proposer, which can be exploited by delaying proposals to gain extra Maximal Extractable Value (MEV) rewards, thus undermining performance predictability. Existing solutions impose penalties for delays but lack a standard for measuring the extra rewards from delays or fail in malicious environments.

This paper introduces InTime, a novel approach to safeguard performance predictability in BFT-PoS by economically motivating timely block proposals. We first introduce the untimely MEV ratio, a reliable metric to measure the extra rewards gained from proposal delays, facilitating our countermeasures against deliberate delays. Furthermore, we propose the arrival rate incentive (ARI), aligning rewards with transaction arrival timing among nodes to reduce potential MEV manipulation. To make ARI robust against malicious behaviors, we establish a committee time witness (CTW) workflow to accurately gather and verify transaction arrival times. Extensive experiments demonstrate that InTime can effectively reduce latency variability by up to 95.9%.

ISSD: Indicator Selection for Time Series State Detection

Time series data from monitoring applications captures the behaviours of objects, which can often be split into distinguishable segments that reflect the underlying state changes. Despite the recent advances in time series state detection, indicator selection for state detection is rarely studied: most state detection work assumes that the input indicators have been properly or manually selected. However, this assumption is disconnected from practice. On one hand, manual selection is not scalable, as there can be up to thousands of indicators for the runtime monitoring of certain objects, e.g., supercomputer systems. On the other hand, performing state detection on a large number of raw indicators is both inefficient and redundant. We argue that indicator selection should be made an upstream task that selects a subset of indicators to facilitate state analysis. To this end, we propose ISSD (Indicator Selection for State Detection), an indicator selection method for time series state detection. At its core, ISSD attempts to find an indicator subset with as many high-quality states as possible, measured by channel set completeness and quality metrics that we devise based on segment-level sampling statistics. The indicator selection process is formulated as a multi-objective optimization problem, and an approximation algorithm is designed to solve the NP-hard search for a specific end point on the Pareto front. Experiments on 5 datasets and 4 downstream methods show that ISSD has significant selection superiority compared with 6 baselines. We also elaborate on two observations, selection resilience and channel sensitivity of existing state detection methods, and call for further research on them.

Largest Triangle Sampling for Visualizing Time Series in Database

In time series visualization, sampling is used to reduce the number of points while retaining the visual features of the raw time series. Area-based Largest Triangle Sampling (LTS) excels at preserving perceptually critical points. However, the heuristic solution to LTS that sequentially samples the point with the locally largest triangle area in each bucket (a.k.a. Largest-Triangle-Three-Buckets, LTTB) suffers from suboptimal results and query inefficiency. We address these shortcomings by contributing a novel Iterative Largest Triangle Sampling (ILTS) algorithm with convex hull acceleration. It refines the sampling results iteratively, capturing a broader perspective by integrating more points in each iteration. Remarkably, we prove that the largest triangle can always be found in the precomputed convex hulls, keeping the iterative sampling efficient. Experiments demonstrate increased visual quality over state-of-the-art baselines and significant speedups over the brute-force approach.
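
To make the baseline concrete, here is a minimal sketch of the classical LTTB heuristic mentioned above: per bucket, it keeps the point forming the largest triangle with the previously kept point and the average of the next bucket. It is the greedy method the paper improves upon, not the proposed ILTS algorithm, and the identifiers are illustrative.

```python
def lttb(points, n_out):
    """Largest-Triangle-Three-Buckets downsampling. `points` is a list of
    (x, y) pairs sorted by x; the first and last points are always kept."""
    n = len(points)
    if n_out >= n or n_out < 3:
        return list(points)
    sampled = [points[0]]
    bucket_size = (n - 2) / (n_out - 2)
    a = 0                                      # index of the last kept point
    for i in range(n_out - 2):
        start = int(i * bucket_size) + 1
        end = int((i + 1) * bucket_size) + 1
        nxt_start, nxt_end = end, min(int((i + 2) * bucket_size) + 1, n)
        # Average of the next bucket (falls back to the last point at the end).
        nxt = points[nxt_start:nxt_end] or [points[-1]]
        avg_x = sum(p[0] for p in nxt) / len(nxt)
        avg_y = sum(p[1] for p in nxt) / len(nxt)
        ax, ay = points[a]
        best, best_area = start, -1.0
        for j in range(start, end):
            bx, by = points[j]
            # Twice the triangle area, via the cross product.
            area = abs((ax - avg_x) * (by - ay) - (ax - bx) * (avg_y - ay))
            if area > best_area:
                best, best_area = j, area
        sampled.append(points[best])
        a = best
    sampled.append(points[-1])
    return sampled

pts = [(x, (x % 7) * (-1) ** x) for x in range(1000)]
print(len(lttb(pts, 50)))  # 50
```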

LCP: Enhancing Scientific Data Management with Lossy Compression for Particles

Many scientific applications opt for particles instead of meshes as their basic primitives to model complex systems composed of billions of discrete entities. Such applications span a diverse array of scientific domains, including molecular dynamics, cosmology, computational fluid dynamics, and geology. The scale of the particles in those scientific applications increases substantially thanks to the ever-increasing computational power in high-performance computing (HPC) platforms. However, the actual gains from such increases are often undercut by obstacles in data management systems related to data storage, transfer, and processing. Lossy compression has been widely recognized as a promising solution to enhance scientific data management systems regarding such challenges, although most existing compression solutions are tailored for Cartesian grids and thus have sub-optimal results on discrete particle data. In this paper, we introduce LCP, an innovative lossy compressor designed for particle datasets, offering superior compression quality and higher speed than existing compression solutions. Specifically, our contribution is threefold. (1) We propose LCP-S, an error-bound aware block-wise spatial compressor to efficiently reduce particle data size while satisfying the pre-defined error criteria. This approach is universally applicable to particle data across various domains, eliminating the need for reliance on specific application domain characteristics. (2) We develop LCP, a hybrid compression solution for multi-frame particle data, featuring dynamic method selection and parameter optimization. It aims to maximize compression effectiveness while preserving data quality as much as possible by utilizing both spatial and temporal domains. (3) We evaluate our solution alongside eight state-of-the-art alternatives on eight real-world particle datasets from seven distinct domains. The results demonstrate that our solution achieves up to 104% improvement in compression ratios and up to 593% increase in speed compared to the second-best option, under the same error criteria.

LeaFi: Data Series Indexes on Steroids with Learned Filters

The ever-growing collections of data series create a pressing need for efficient similarity search, which serves as the backbone for various analytics pipelines. Recent studies have shown that tree-based series indexes excel in many scenarios. However, we observe a significant waste of effort during search, due to suboptimal pruning. To address this issue, we introduce LeaFi, a novel framework that uses machine learning models to boost pruning effectiveness of tree-based data series indexes. These models act as learned filters, which predict tight node-wise distance lower bounds that are used to make pruning decisions, thus, improving pruning effectiveness. We describe the LeaFi-enhanced index building algorithm, which selects leaf nodes and generates training data to insert and train machine learning models, as well as the LeaFi-enhanced search algorithm, which calibrates learned filters at query time to support the user-defined quality target of each query. Our experimental evaluation, using two different tree-based series indexes and five diverse datasets, demonstrates the advantages of the proposed approach. LeaFi-enhanced data-series indexes improve pruning ratio by up to 20x and search time by up to 32x, while maintaining a target recall of 99%.

MAST: Towards Efficient Analytical Query Processing on Point Cloud Data

The proliferation of 3D scanning technology, particularly within autonomous driving, has led to an exponential increase in the volume of Point Cloud (PC) data. Given the rich semantic information contained in PC data, deep learning models are commonly employed for tasks such as object queries. However, current query systems that support PC data types do not process queries on semantic information. Consequently, there is a notable gap in research regarding the efficiency of invoking deep models for each PC data query, especially when dealing with large-scale models and datasets. To address this issue, this work aims to design an efficient approximate approach for supporting PC analysis queries, including PC retrieval and aggregate queries. In particular, we propose a novel framework that delivers approximate query results efficiently by sampling core PC frames within a constrained budget, thereby minimizing the reliance on deep learning models. This framework is underpinned by rigorous theoretical analysis, providing error-bound guarantees for the approximate results if the sampling policy is preferred. To achieve this, we incorporate a multi-agent reinforcement learning-based approach to optimize the sampling procedure, along with an innovative reward design leveraging spatio-temporal PC analysis. Furthermore, we exploit the spatio-temporal characteristics inherent in PC data to construct an index that accelerates the query process. Extensive experimental evaluations demonstrate that our proposed method, MAST, not only achieves accurate approximate query results but also maintains low query latency, ensuring high efficiency.

MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training

Nowadays, Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. However, long context training poses great challenges considering the constraint of GPU memory. It not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long context training, existing frameworks have adopted strategies such as recomputation and various forms of parallelisms. Nevertheless, these techniques rely on redundant computation or extensive communication, resulting in low Model FLOPS Utilization (MFU). In this paper, we propose MEMO, a novel LLM training framework designed for fine-grained activation memory management. Given the quadratic scaling of computation and linear scaling of memory with sequence lengths when using FlashAttention, we offload memory-consuming activations to CPU memory after each layer's forward pass and fetch them during the backward pass. To maximize the swapping of activations without hindering computation, and to avoid exhausting limited CPU memory, we implement a token-wise activation recomputation and swapping mechanism. Furthermore, we tackle the memory fragmentation issue by employing a bi-level Mixed Integer Programming (MIP) approach, optimizing memory reuse across transformer layers. Empirical results demonstrate that MEMO achieves an average of 1.97x and 1.80x MFU compared to Megatron-LM and DeepSpeed, respectively. This improvement is attributed to MEMO's ability to minimize memory fragmentation, reduce recomputation and intensive communication, and circumvent the delays associated with the memory reorganization process due to fragmentation. By leveraging fine-grained activation memory management, MEMO facilitates efficient training of 7B LLM with 1 million sequence length on just 8 A800 GPUs, achieving an MFU of 52.30%.

Minimum Spanning Tree Maintenance in Dynamic Graphs

Minimum Spanning Tree (MST) is a fundamental structure in graph analytics and can be applied in various applications. The problem of maintaining MSTs in dynamic graphs is significant, as many real-world graphs are frequently updated. Existing studies on MST maintenance primarily focus on theoretical analysis and lack practical efficiency. In this paper, we propose a novel algorithm to maintain MST in dynamic graphs, which achieves high practical efficiency. In addition to the tree structure, our main idea is to maintain a replacement edge for each tree edge. In this way, the tree structure can be immediately updated when a tree edge is deleted. We propose algorithms to maintain the replacement edge for each tree edge by sharing the computation cost in the updating process. Our performance studies on large datasets demonstrate considerable improvements over state-of-the-art solutions.
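
The replacement-edge idea can be illustrated with a naive baseline: after deleting a tree edge, the replacement is the cheapest non-tree edge that reconnects the two resulting components. The sketch below recomputes this from scratch with Kruskal plus union-find, whereas the paper maintains replacement edges incrementally and shares computation across updates; all identifiers are illustrative.

```python
class DSU:
    """Union-find with path halving."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True

def kruskal_mst(n, edges):
    """Edges are (weight, u, v); returns the MST edge list."""
    dsu, mst = DSU(n), []
    for w, u, v in sorted(edges):
        if dsu.union(u, v):
            mst.append((w, u, v))
    return mst

def replacement_edge(n, edges, mst, deleted):
    """Cheapest non-tree edge reconnecting the two components created by
    deleting a tree edge: the edge that would replace `deleted` in the MST."""
    dsu = DSU(n)
    for w, u, v in mst:
        if (w, u, v) != deleted:
            dsu.union(u, v)
    candidates = [(w, u, v) for w, u, v in edges
                  if (w, u, v) != deleted and dsu.find(u) != dsu.find(v)]
    return min(candidates, default=None)

edges = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 3), (5, 1, 3)]
mst = kruskal_mst(4, edges)
print(mst)                                          # [(1, 0, 1), (2, 1, 2), (3, 2, 3)]
print(replacement_edge(4, edges, mst, (3, 2, 3)))   # (4, 0, 3)
```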

Modyn: Data-Centric Machine Learning Pipeline Orchestration

In real-world machine learning (ML) pipelines, datasets are continuously growing. Models must incorporate this new training data to improve generalization and adapt to potential distribution shifts. The cost of model retraining is proportional to how frequently the model is retrained and how much data it is trained on, which makes the naive approach of retraining from scratch each time impractical. We present Modyn, a data-centric end-to-end machine learning platform. Modyn's ML pipeline abstraction enables users to declaratively describe policies for continuously training a model on a growing dataset. Modyn pipelines allow users to apply data selection policies (to reduce the number of data points) and triggering policies (to reduce the number of trainings). Modyn executes and orchestrates these continuous ML training pipelines. The system is open-source and comes with an ecosystem of benchmark datasets, models, and tooling. We formally discuss how to measure the performance of ML pipelines by introducing the concept of composite models, enabling fair comparison of pipelines with different data selection and triggering policies. We empirically analyze how various data selection and triggering policies impact model accuracy, and also show that Modyn enables high throughput training with sample-level data selection.

Multi-Level Graph Representation Learning Through Predictive Community-based Partitioning

Graph representation learning (GRL) aims to map a graph into a low-dimensional vector space while preserving graph topology and node properties. This study proposes a novel GRL model, Multi-Level GRL (simply, ML-GRL), that recursively partitions input graphs by selecting the most appropriate community detection algorithm at each graph or partitioned subgraph. To preserve the relationship between subgraphs, ML-GRL incorporates global graphs that effectively maintain the overall topology. ML-GRL employs a prediction model, which is pre-trained using graph-based features and covers a wide range of graph distributions, to estimate GRL accuracy of each community detection algorithm without partitioning graphs or subgraphs and evaluating them. ML-GRL improves learning accuracy by selecting the most effective community detection algorithm while enhancing learning efficiency from parallel processing of partitioned subgraphs. Through extensive experiments with two different tasks, we demonstrate ML-GRL's superiority over the six representative GRL models in terms of both learning accuracy and efficiency. Specifically, ML-GRL not only improves the accuracy of existing GRL models by 3.68 ~ 47.59% for link prediction and 1.75 ~ 40.90% for node classification but also significantly reduces their running time by 9.63 ~ 62.71% and 7.14 ~ 82.14%, respectively. Our source code is available at https://github.com/pnpy6elp/Multi_Level_GRL.

Nezha: An Efficient Distributed Graph Processing System on Heterogeneous Hardware

The growing scale of graph data across various applications demands efficient distributed graph processing systems. Despite the widespread use of the Scatter-Gather model for large-scale graph processing across distributed machines, the performance can still be significantly improved, as the computational capability of each machine is not fully utilized and the communication costs during graph processing are expensive in the distributed environment. In this work, we propose Nezha, a novel and efficient distributed graph processing system on heterogeneous hardware, where each machine is equipped with both CPU and GPU processors and all machines in the distributed cluster are interconnected via Remote Direct Memory Access (RDMA). To reduce the communication costs, we devise an effective communication mode with a graph-friendly communication protocol in the graph-based RDMA communication adapter of Nezha. To improve the computation efficiency, we propose a multi-device cooperative execution mechanism in Nezha, which fully utilizes the CPU and GPU processors of each machine in the distributed cluster. We also alleviate the workload imbalance issue at inter-machine and intra-machine levels via the proposed workload balancer in Nezha. We conduct extensive experiments by running 4 widely-used graph algorithms on 5 graph datasets to demonstrate the superiority of Nezha over existing systems.

OBIR-tree: An Efficient Oblivious Index for Spatial Keyword Queries on Secure Enclaves

In recent years, the widely collected spatial-textual data has given rise to numerous applications centered on spatial keyword queries. However, securely providing spatial keyword query services in an outsourcing environment has been challenging. Existing schemes struggle to enable top-k spatial keyword queries on encrypted data while hiding search, access, and volume patterns, which raises concerns about availability and security. To address the above issue, this paper proposes OBIR-tree, a novel index structure for oblivious (provably hiding search, access, and volume patterns) top-k spatial keyword queries on encrypted data. As a tight spatial-textual index tailored from the IR-tree and PathORAM, the OBIR-tree can support sublinear search without revealing any useful information. Furthermore, we present extension designs to optimize the query latency of the OBIR-tree: (1) combining the OBIR-tree with hardware secure enclaves (e.g., Intel SGX) to minimize client-server interactions; (2) building a Real/Dummy block Tree (RDT) to reduce the computational cost of oblivious operations within enclaves. Extensive experimental evaluations on real-world datasets demonstrate that the search efficiency of OBIR-tree outperforms state-of-the-art baselines by 25×~723× and is practical for real-world applications.

On Graph Representation for Attributed Hypergraph Clustering

Attributed Hypergraph Clustering (AHC) aims at partitioning a hypergraph into clusters such that nodes in the same cluster are close to each other with both high connectedness and homogeneous attributes. Existing AHC methods are all based on matrix factorization which may incur a substantial computation cost; more importantly, they inherently require prior knowledge of the number of clusters as an input which, if inaccurately estimated, shall lead to a significant deterioration in the clustering quality. In this paper, we propose Attributed Hypergraph Representation for Clustering (AHRC), a cluster-number-free hypergraph clustering method consisting of an effective integration of the hypergraph topology and node attributes for hypergraph representation, a multi-hop modularity function for optimization, and a hypergraph sparsification for scalable computation. AHRC achieves cutting-edge clustering quality and efficiency: compared to the state-of-the-art (SOTA) AHC method on 10 real hypergraphs, AHRC obtains an average of 20% higher F-measure, 24% higher ARI, 26% higher Jaccard Similarity, 10% higher Purity, and runs 5.5× faster. As a byproduct, the intermediate result of graph representation dramatically boosts the clustering quality of SOTA contrastive-learning-based hypergraph clustering methods, showing the generality of our graph representation.

Optimizing Block Skipping for High-Dimensional Data with Learned Adaptive Curve

In the realm of big data and cloud analytics, efficiently managing and retrieving high-dimensional data presents a critical challenge. Traditional indexes often struggle with the storage overhead inherent in large datasets. There is a growing interest in the adoption of Small Materialized Aggregates (SMA) among cloud database vendors due to its ability to maintain lightweight block-level metadata, facilitating efficient block skipping. However, SMA performance relies heavily on data layout. This is especially critical in scenarios with wide tables containing hundreds of dimensions, where the curse of dimensionality exacerbates the issue. In this paper, we propose AdaCurve, a novel approach aimed at enhancing block skipping in high-dimensional datasets through adaptive optimization of data layout. Unlike conventional static and non-adaptive space-filling curves (SFCs), AdaCurve leverages machine learning to develop an adaptive curve: a dynamically adjusting optimal projection function tailored to high-dimensional workloads and data characteristics. We introduce an attention-based network to handle high-dimensional data and a learnable objective for training adaptive curves in an end-to-end manner. Extensive experiments conducted on Spark with real-world datasets demonstrate the effectiveness of AdaCurve. We have shown that AdaCurve effectively scales to datasets with dimensions of up to 1,000 columns, achieving a 2.8× improvement in block skipping compared to SFCs.
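
To ground the block-skipping mechanism that SMA-style metadata enables, the sketch below keeps per-block min/max values and prunes blocks against range predicates; the toy layout also shows why the row order decides how many blocks can be skipped, which is the quantity a learned layout such as AdaCurve's would target. The helper names and synthetic data are illustrative, and the learned curve itself is not shown.

```python
def build_block_metadata(rows, block_size, dims):
    """Per-block min/max metadata (SMA-style) over the given dimensions."""
    blocks = []
    for i in range(0, len(rows), block_size):
        block = rows[i:i + block_size]
        meta = {d: (min(r[d] for r in block), max(r[d] for r in block)) for d in dims}
        blocks.append((block, meta))
    return blocks

def scan_with_skipping(blocks, predicates):
    """Scan only blocks whose metadata overlaps every range predicate
    {dim: (lo, hi)}; the remaining blocks are skipped without being read."""
    scanned, results = 0, []
    for block, meta in blocks:
        if any(meta[d][1] < lo or meta[d][0] > hi for d, (lo, hi) in predicates.items()):
            continue                        # block provably contains no match
        scanned += 1
        results += [r for r in block
                    if all(lo <= r[d] <= hi for d, (lo, hi) in predicates.items())]
    return results, scanned

# Rows laid out (sorted) by dimension 0: queries on dim 0 skip most blocks,
# queries on dim 1 skip almost none -- which is why the layout matters.
rows = sorted([(i % 97, (i * 37) % 101) for i in range(10_000)])
blocks = build_block_metadata(rows, block_size=500, dims=[0, 1])
_, scanned0 = scan_with_skipping(blocks, {0: (10, 12)})
_, scanned1 = scan_with_skipping(blocks, {1: (10, 12)})
print(scanned0, scanned1, "of", len(blocks))
```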

Pandora: An Efficient and Rapid Solution for Persistence-Based Tasks in High-Speed Data Streams

In data streams, persistence characterizes items that appear repeatedly across multiple non-overlapping time windows. Addressing persistence-based tasks, such as detecting highly persistent items and estimating persistence, is crucial for applications like recommendation systems and anomaly detection in high-velocity data streams. However, these tasks are challenging due to stringent requirements for rapid processing and limited memory resources. Existing methods often struggle with accuracy, especially under highly skewed data distributions and tight budgets for the fastest memory, where hash collisions are severe. In this paper, we introduce Pandora, a novel approximate data structure designed to tackle these challenges efficiently. Our approach incorporates the insight that items absent for extended periods are likely non-persistent, increasing their probability of eviction to accommodate potential persistent items more effectively. We validate this insight empirically and integrate it into our update strategy, providing better protection for persistent items. We formally analyze Pandora's error bounds to validate its theoretical soundness. Through extensive trace-driven tests, we demonstrate that Pandora achieves superior accuracy and processing speed compared to state-of-the-art methods across various persistence-based tasks. Additionally, we further accelerate Pandora's update speed using Single Instruction Multiple Data (SIMD) instructions, enhancing its efficiency in high-speed data stream environments. The code for our method is open-sourced.
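
As a point of reference for the task itself, the following memory-unbounded baseline computes exact persistence, i.e., the number of distinct windows in which an item appears; Pandora approximates this under tight memory, so the sketch is only a definitional illustration with hypothetical names.

```python
from collections import defaultdict

def persistence(stream, window_len):
    """Exact (memory-unbounded) baseline: an item's persistence is the number
    of distinct non-overlapping time windows in which it appears at least once.
    `stream` is a sequence of (timestamp, item) pairs."""
    windows_seen = defaultdict(set)
    for t, item in stream:
        windows_seen[item].add(t // window_len)   # index of the window the event falls in
    return {item: len(w) for item, w in windows_seen.items()}

def top_persistent(stream, window_len, k):
    p = persistence(stream, window_len)
    return sorted(p, key=p.get, reverse=True)[:k]

# "a" appears in every window, "b" bursts heavily inside a single window:
# persistence separates the steadily recurring item from the bursty one.
stream = [(t, "a") for t in range(0, 100, 10)] + [(t, "b") for t in range(30, 40)]
print(persistence(stream, window_len=10))          # {'a': 10, 'b': 1}
print(top_persistent(stream, window_len=10, k=1))  # ['a']
```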

Parallel kd-tree with Batch Updates

The kd-tree is one of the most widely used data structures to manage multi-dimensional data. Due to the ever-growing data volume, it is imperative to consider parallelism in kd-trees. However, we observed challenges in existing parallel kd-tree implementations, for both construction and updates.

The goal of this paper is to develop efficient in-memory kd-trees by supporting high parallelism and cache-efficiency. We propose the Pkd-tree (Parallel kd-tree), a parallel kd-tree that is efficient both in theory and in practice. The Pkd-tree supports parallel tree construction, batch update (insertion and deletion), and various queries including k-nearest neighbor search, range query, and range count. We proved that our algorithms have strong theoretical bounds in work (sequential time complexity), span (parallelism), and cache complexity. Our key techniques include 1) an efficient construction algorithm that optimizes work, span, and cache complexity simultaneously, and 2) reconstruction-based update algorithms that guarantee the tree to be weight-balanced. With the new algorithmic insights and careful engineering effort, we achieved a highly optimized implementation of the Pkd-tree.

We tested Pkd-tree with various synthetic and real-world datasets, including both uniform and highly skewed data. We compare the Pkd-tree with state-of-the-art parallel kd-tree implementations. In all tests, the Pkd-tree is consistently much faster in construction and updates than all baselines, while offering better or competitive query performance. We have released our code.

PoneglyphDB: Efficient Non-interactive Zero-Knowledge Proofs for Arbitrary SQL-Query Verification

In database applications involving sensitive data, the dual imperatives of data confidentiality and provable (verifiable) query processing are important. This paper introduces PoneglyphDB, a database system that leverages non-interactive zero-knowledge proofs (ZKP) to support both confidentiality and provability. Unlike traditional databases, PoneglyphDB enhances confidentiality by ensuring that raw data remains exclusively with the host, while also enabling clients to verify the correctness of query responses through the proofs it provides.

The main innovation in this paper is proposing efficient ZKP designs (called circuits) for basic operations in SQL query processing. These basic operation circuits are then combined to form ZKP circuits for larger, more complex queries. PoneglyphDB's circuits are carefully designed to be efficient by utilizing advances in cryptography such as PLONKish-based circuits, recursive proof composition techniques, and designing with low-order polynomial constraints. We demonstrate the performance of PoneglyphDB with the standard TPC-H benchmark. Our experimental results show that PoneglyphDB can efficiently achieve both confidentiality and provability, outperforming existing state-of-the-art ZKP methods.

Practical DB-OS Co-Design with Privileged Kernel Bypass

This paper revisits the longstanding challenge of coordinating database systems with general-purpose OS interfaces, such as POSIX, which often lack tailored support for DB requirements. Existing approaches to this DB-OS co-design struggle with limited design space, security risks, and compatibility issues. To overcome these hurdles, we propose a new co-design approach leveraging virtualization to elevate the privilege level of DB processes. Our method enables database systems to fully exploit hardware capabilities via virtualization, while minimizing the need for extensive modifications to the host OS kernel, thereby maintaining compatibility. We demonstrate the effectiveness of our approach through two novel virtual memory mechanisms tailored for database workloads: (1) an efficient snapshotting mechanism that captures memory snapshots at millisecond intervals for in-memory databases and HTAP workloads, and (2) a streamlined in-kernel buffer pool design. We introduce Libdbos, a lightweight guest kernel implementing these mechanisms. Our evaluations highlight significant improvements in latency and efficiency compared to existing snapshotting and buffer pool designs, underscoring the potential of the approach.

Progressive Entity Matching: A Design Space Exploration

Entity Resolution (ER) is typically implemented as a batch task that processes all available data before identifying duplicate records. However, applications with time or computational constraints, e.g., those running in the cloud, require a progressive approach that produces results in a pay-as-you-go fashion. Numerous algorithms have been proposed for Progressive ER in the literature. In this work, we propose a novel framework for Progressive Entity Matching that organizes relevant techniques into four consecutive steps: (i) filtering, which reduces the search space to the most likely candidate matches, (ii) weighting, which associates every pair of candidate matches with a similarity score, (iii) scheduling, which prioritizes the execution of the candidate matches so that the real duplicates precede the non-matching pairs, and (iv) matching, which applies a complex matching function to the pairs in the order defined by the previous step. We associate each step with existing and novel techniques, illustrating that our framework overall generates a superset of the main existing works in the field. We select the most representative combinations resulting from our framework and fine-tune them over 10 established datasets for Record Linkage and 8 for Deduplication, with our results indicating that our taxonomy yields a wide range of high-performing progressive techniques in terms of both effectiveness and time efficiency.
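
A schematic skeleton of the four steps might look as follows, with deliberately simple placeholder choices (token blocking for filtering, Jaccard similarity for weighting and matching, and a fixed comparison budget); these choices are illustrative and not the specific techniques evaluated in the paper.

```python
from collections import defaultdict
from itertools import combinations

def tokens(record):
    return set(record.lower().split())

def filtering(records):
    """Step (i): reduce the search space with a cheap blocking key
    (shared tokens here, an illustrative choice)."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        for tok in tokens(rec):
            blocks[tok].append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def weighting(records, pairs):
    """Step (ii): score each candidate pair (Jaccard similarity of tokens)."""
    return {(a, b): len(tokens(records[a]) & tokens(records[b]))
                    / len(tokens(records[a]) | tokens(records[b]))
            for a, b in pairs}

def scheduling(weights):
    """Step (iii): emit candidate pairs in decreasing likelihood of matching."""
    return sorted(weights, key=weights.get, reverse=True)

def matching(records, ordered_pairs, budget, threshold=0.6):
    """Step (iv): apply the (placeholder) matcher pay-as-you-go, stopping
    when the comparison budget is exhausted."""
    matches = []
    for a, b in ordered_pairs[:budget]:
        sim = len(tokens(records[a]) & tokens(records[b])) \
              / len(tokens(records[a]) | tokens(records[b]))
        if sim >= threshold:
            matches.append((a, b))
    return matches

records = {1: "acme corp berlin", 2: "ACME Corp Berlin", 3: "zenith ltd paris"}
pairs = filtering(records)
print(matching(records, scheduling(weighting(records, pairs)), budget=2))  # [(1, 2)]
```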

QURE: AI-Assisted and Automatically Verified UDF Inlining

User-defined functions (UDFs) extend the capabilities of SQL by improving code reusability and encapsulating complex logic, but can hinder performance due to optimization and execution inefficiencies. Prior approaches attempt to address this by rewriting UDFs into native SQL, which is then inlined into the SQL queries that invoke them. However, these approaches are either limited to simple pattern matching or require the synthesis of complex verification conditions from procedural code, a process that is brittle and difficult to automate. This limits coverage and makes the translation approaches less extensible to previously unseen procedural constructs. In this work, we present QURE, a framework that (1) leverages large language models (LLMs) to translate UDFs to native SQL, and (2) introduces a novel formal verification method to establish equivalence between the UDF and its translation. QURE uses the semantics of SQL operators to automate the derivation of verification conditions, in turn resulting in broad coverage and high extensibility. We model a large set of imperative constructs, particularly those common in Python and Pandas UDFs, in an intermediate verification language, allowing for the verification of their SQL translation. In our empirical evaluation of Python and Pandas UDFs, equivalence is successfully verified for 88% of UDF-SQL pairs (the rest lack semantically equivalent SQL) and LLMs correctly translate 84% of the UDFs. Executing the translated UDFs achieves median performance improvements of 23x on single-node clusters and 12x on 12-node clusters compared to the original UDFs, while also significantly reducing out-of-memory errors.
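
The kind of rewrite being verified can be illustrated with a toy scalar UDF and a hand-written SQL translation, run here on sqlite3 so the equivalence can be checked end to end; the example is not produced or verified by QURE itself.

```python
import sqlite3

# A scalar Python UDF with simple control flow...
def price_band(price):
    if price is None:
        return "unknown"
    return "high" if price >= 100 else "low"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items(id INTEGER, price REAL)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, 150.0), (2, 20.0), (3, None)])

# ... invoked row-by-row as a UDF, opaque to the query optimizer ...
conn.create_function("price_band", 1, price_band)
udf_query = "SELECT id, price_band(price) FROM items ORDER BY id"

# ... versus an equivalent native-SQL rewrite that can be inlined and
# optimized as an ordinary expression (hand-written here for illustration).
inlined_query = """
SELECT id,
       CASE WHEN price IS NULL THEN 'unknown'
            WHEN price >= 100 THEN 'high'
            ELSE 'low' END
FROM items ORDER BY id
"""

assert conn.execute(udf_query).fetchall() == conn.execute(inlined_query).fetchall()
print(conn.execute(inlined_query).fetchall())  # [(1, 'high'), (2, 'low'), (3, 'unknown')]
```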

Randomized Sketches for Quantile in LSM-tree based Store

Quantiles are costly to compute exactly but can be efficiently estimated by quantile sketches. Extensive work on summarizing streaming data, such as the KLL sketch, focuses on minimizing the memory cost of providing certain error guarantees. For the problem of quantile estimation of values in LSM-tree based stores, streaming methods have an expensive I/O cost linear in the data size N. Since disk components (chunks and SSTables) in the LSM-tree are immutable once flushed, quantile sketches can be pre-computed as a type of statistics to reduce I/O cost and accelerate queries. Unfortunately, to provide a deterministic additive εN error guarantee on the queried data, every pre-computed deterministic sketch of a queried chunk of size N_c must provide an εN_c error guarantee, resulting in no improvement over the linear I/O cost. In this study, we propose pre-computing randomized sketches which provide randomized additive error guarantees. Our major technical contributions include (1) randomized sketches for data chunks constructed in flush events, which are proved to be optimal and achieve an I/O cost proportional to √(N), (2) hierarchical randomized sketches for SSTables constructed in compaction events, which further improve the asymptotic I/O cost, and (3) the KLL sketch summarizing the proposed pre-computed sketches, which is proved to be more accurate than the one summarizing streaming data and to achieve sublinear I/O cost with the same memory complexity as in the streaming setting. Extensive experiments on synthetic and real datasets demonstrate the superiority of the proposed techniques. The approach is deployed in Apache IoTDB, an LSM-tree based time-series database.

Rapid Data Ingestion through DB-OS Co-design

Sequential data access for the rapid ingestion of large fact tables from storage is a pivotal yet resource-intensive operation in data warehouse systems, consuming substantial CPU cycles across various components of DBMSs and operating systems. Although bypassing these layers can eliminate access latency, concurrent access to the same table often results in redundant data fetching due to cache-bypassing data transfers. Thus, a new design for data access control is necessary to enhance rapid data ingestion in databases. To address this concern, we propose a novel DB-OS co-design that efficiently supports sequential data access at full device speed. Our approach, zicIO, liberates DBMSs from data access control by preparing required data just before DBMSs access it, while alleviating all known I/O latencies. The core of zicIO lies in its DB-OS co-design, which aims to (1) automate data access control and (2) relieve redundant data fetching through seamless collaboration between the DB and the OS. We implemented zicIO and integrated it with four databases to demonstrate its general applicability. The evaluation showed performance enhancements of up to 9.95x under TPC-H loads.

Reliable Text-to-SQL with Adaptive Abstention

Large language models (LLMs) have revolutionized natural language interfaces for databases, particularly in text-to-SQL conversion. However, current approaches often generate unreliable outputs when faced with ambiguity or insufficient context.

We present Reliable Text-to-SQL (RTS), a novel framework that enhances query generation reliability by incorporating abstention and human-in-the-loop mechanisms. RTS focuses on the critical schema linking phase, which aims to identify the key database elements needed for generating SQL queries. It autonomously detects potential errors during the answer generation process and responds by either abstaining or engaging in user interaction. A vital component of RTS is the Branching Point Prediction (BPP) which utilizes statistical conformal techniques on the hidden layers of the LLM model for schema linking, providing probabilistic guarantees on schema linking accuracy.

We validate our approach through comprehensive experiments on the BIRD benchmark, demonstrating significant improvements in robustness and reliability. Our findings highlight the potential of combining transparent-box LLMs with human-in-the-loop processes to create more robust natural language interfaces for databases. For the BIRD benchmark, our approach achieves near-perfect schema linking accuracy, autonomously involving a human when needed. Combined with query generation, we demonstrate that near-perfect schema linking and a small query generation model can almost match SOTA accuracy achieved with a model orders of magnitude larger than the one we use.

Revisiting the Design of In-Memory Dynamic Graph Storage

The effectiveness of in-memory dynamic graph storage (DGS) for supporting concurrent graph read and write queries is crucial for real-time graph analytics and updates. Various methods have been proposed, for example, LLAMA, Aspen, LiveGraph, Teseo, and Sortledton. These approaches differ significantly in their support for read and write operations, space overhead, and concurrency control. However, there has been no systematic study to explore the trade-offs among these dimensions. In this paper, we evaluate the effectiveness of individual techniques and identify the performance factors affecting these storage methods by proposing a common abstraction for DGS design and implementing a generic test framework based on this abstraction. Our findings highlight several key insights: 1) Existing DGS methods exhibit substantial space overhead. For example, Aspen consumes 3.3-10.8x more memory than CSR, while the optimal fine-grained methods consume 4.1-8.9x more memory than CSR, indicating a significant memory overhead. 2) Existing methods often overlook the memory-access behavior of modern architectures, leading to performance degradation compared to contiguous storage methods. 3) Fine-grained concurrency control methods, in particular, suffer from severe efficiency and space issues due to maintaining versions and performing checks for each neighbor. These methods also experience significant contention on high-degree vertices. Our systematic study reveals these performance bottlenecks and outlines future directions to improve DGS for real-time graph analytics.

RLER-TTE: An Efficient and Effective Framework for En Route Travel Time Estimation with Reinforcement Learning

En Route Travel Time Estimation (ER-TTE) aims to learn driving patterns from traveled routes to achieve rapid and accurate real-time predictions. However, existing methods ignore the complexity and dynamism of real-world traffic systems, resulting in significant gaps in efficiency and accuracy in real-time scenarios. Addressing this issue is a critical yet challenging task. This paper proposes a novel framework that redefines the implementation path of ER-TTE to achieve highly efficient and effective predictions. Firstly, we introduce a novel pipeline consisting of a Decision Maker and a Predictor to rectify the inefficient prediction strategies of current methods. The Decision Maker performs efficient real-time decisions to determine whether the high-complexity prediction model in the Predictor needs to be invoked, and the Predictor recalculates the travel time or infers from historical prediction results based on these decisions. Next, to tackle the dynamic and uncertain real-time scenarios, we model the online decision-making problem as a Markov decision process and design an intelligent agent based on reinforcement learning for autonomous decision-making. Moreover, to fully exploit the spatio-temporal correlation between online data and offline data, we meticulously design feature representation and encoding techniques based on the attention mechanism. Finally, to improve the flawed training and evaluation strategies of existing methods, we propose an end-to-end training and evaluation approach, incorporating curriculum learning strategies to manage spatio-temporal data for more advanced training algorithms. Extensive evaluations on three real-world datasets confirm that our method significantly outperforms state-of-the-art solutions in both accuracy and efficiency.

Schema-Based Query Optimisation for Graph Databases

Recursive graph queries are increasingly popular for extracting information from interconnected data found in various domains such as social networks, life sciences, and business analytics. Graph data often come with schema information that describes how nodes and edges are organized. We propose a type inference mechanism that enriches recursive graph queries with relevant structural information contained in a graph schema. We show that this schema information can be used to improve performance when evaluating recursive graph queries. Furthermore, we prove that the proposed method is sound and complete, ensuring that the semantics of the query is preserved during the schema-enrichment process.

SecureXGB: A Secure and Efficient Multi-party Protocol for Vertical Federated XGBoost

Extreme Gradient Boosting (XGBoost) demonstrates excellent performance in practice and is widely used in both industry and academic research. This extensive application has led to a growing interest in employing multi-party data to develop more robust XGBoost models. In response to increasing concerns about privacy leakage, secure vertical federated XGBoost is proposed. It employs secure multi-party computation techniques, such as secret sharing (SS), to allow multiple parties holding vertically partitioned data, i.e., disjoint features on the same samples, to collaborate in constructing an XGBoost model. However, the running efficiency is the primary obstacle to the practical application of existing protocols, especially in multi-party settings. The reason is that these protocols not only require the execution of data-oblivious computations to protect intermediate results, leading to high computational complexity, but also involve a large number of SS-based non-linear operations with high overheads, e.g., division operations in gain score calculation and comparison operations in best split selection. To this end, we present a secure and efficient multi-party protocol for vertical federated XGBoost, called SecureXGB, which can perform the collaborative training of an XGBoost model in an SS-friendly manner. In SecureXGB, we first propose a parallelizable multi-party permutation method, which can secretly and efficiently permute all samples before model training to reduce the reliance on data-oblivious computations. Then, we design a linear gain score that can be evaluated without involving division operations and has equivalent utility to the original gain score. Finally, we develop a synchronous best split selection method to secretly identify the best split with the maximum gain score using a minimal number of comparison operations. Experimental results demonstrate that SecureXGB can achieve better training efficiency than state-of-the-art protocols without the loss of model accuracy.
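
As background for the SS primitive such protocols build on, the toy sketch below shows additive secret sharing over a prime field, where addition is a purely local operation; it illustrates only the primitive, not SecureXGB's permutation, gain computation, or split selection, and the modulus and names are illustrative.

```python
import random

P = 2_147_483_647  # a public prime modulus; all arithmetic is done mod P

def share(secret, n_parties, rng=random):
    """Additively secret-share `secret`: random shares that sum to it mod P,
    so any proper subset of shares reveals nothing about the secret."""
    shares = [rng.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

def add_shared(shares_x, shares_y):
    """Each party adds its own shares locally: no communication, no leakage."""
    return [(sx + sy) % P for sx, sy in zip(shares_x, shares_y)]

# Three parties jointly hold two statistics without seeing each other's data.
gx = share(123456, 3)
gy = share(654321, 3)
print(reconstruct(add_shared(gx, gy)))  # 777777
```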

Sequoia: An Accessible and Extensible Framework for Privacy-Preserving Machine Learning over Distributed Data

Privacy-preserving machine learning (PPML) algorithms use secure computation protocols to allow multiple data parties to collaboratively train machine learning (ML) models while maintaining their data confidentiality. However, current PPML frameworks couple secure protocols with ML models in PPML algorithm implementations, making it challenging for non-experts to develop and optimize PPML applications, limiting their accessibility and performance.

We propose Sequoia, a novel PPML framework that decouples ML models and secure protocols to optimize the development and execution of PPML applications across data parties. Sequoia offers JAX-compatible APIs for users to program their ML models, while using a compiler-executor architecture to automatically apply PPML algorithms and system optimizations for model execution over distributed data. The compiler in Sequoia incorporates cross-party PPML processes into user-defined ML models by transparently adding computation, encryption, and communication steps with extensible policies, and the executor efficiently schedules code execution across multiple data parties, considering data dependencies and device heterogeneity.

Compared to existing PPML frameworks, Sequoia requires 64%-92% fewer lines of code for users to implement the same PPML algorithms, and achieves 88% speedup of training throughput in horizontal PPML.

Shapley Value Estimation based on Differential Matrix

The Shapley value has been extensively used in many fields as the unique metric to fairly evaluate player contributions in cooperative settings. Since the exact computation of Shapley values is #P-hard in the task-agnostic setting, many studies have been developed to utilize the Monte Carlo method for Shapley value estimation. The existing methods estimate the Shapley values directly. In this paper, we explore a novel idea: inferring the Shapley values by estimating the differences between them. Technically, we estimate a differential matrix consisting of pairwise Shapley value differences to reduce the variance of the estimated Shapley values. We develop a least-squares optimization solution to derive the Shapley values from the differential matrix, minimizing the estimator variances. Additionally, we devise a Monte Carlo method for efficient estimation of the differential matrix and introduce two stratified Monte Carlo methods for further variance reduction. Our experimental results on real and synthetic data sets demonstrate the effectiveness and efficiency of the differential-matrix-based sampling approaches.
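
For contrast with the proposed differential-matrix estimator, the sketch below shows the classical permutation-sampling Monte Carlo estimator that prior methods build on, applied to a toy additive game whose exact Shapley values are known; the names and the game are illustrative.

```python
import random

def shapley_permutation_mc(players, value_fn, n_samples, rng=random):
    """Classical Monte Carlo Shapley estimation: average each player's
    marginal contribution over random orderings of the players."""
    est = {p: 0.0 for p in players}
    for _ in range(n_samples):
        order = list(players)
        rng.shuffle(order)
        coalition, prev = set(), value_fn(set())
        for p in order:
            coalition.add(p)
            cur = value_fn(coalition)
            est[p] += cur - prev      # marginal contribution of p in this order
            prev = cur
    return {p: s / n_samples for p in players}

# Toy additive game: v(S) = sum of fixed weights, so each player's exact
# Shapley value equals its own weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
v = lambda S: sum(weights[p] for p in S)
print(shapley_permutation_mc(list(weights), v, n_samples=2000))
```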

SHARQ: Explainability Framework for Association Rules on Relational Data

Association rules are an important technique for gaining insights over large relational datasets consisting of tuples of elements (i.e., attribute-value pairs). However, it is difficult to explain the relative importance of data elements with respect to the rules in which they appear. This paper develops a measure of an element's contribution to a set of association rules based on Shapley values, denoted SHARQ (ShApley Rules Quantification). As is the case with many Shapley-based computations, the cost of a naive calculation of the score is exponential in the number of elements. To that end, we present an efficient framework for computing the exact SHARQ value of a single element whose running time is practically linear in the number of rules. Going one step further, we develop an efficient multi-element SHARQ algorithm which amortizes the cost of the single-element SHARQ calculation over a set of elements. Based on the definition of SHARQ for elements, we describe two additional use cases for association rules explainability: rule importance and attribute importance. Extensive experiments over a novel benchmark dataset containing 67 instances of mined rule sets show the effectiveness of our approach.

SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference

Large Language Models (LLMs) have revolutionized Natural Language to SQL (NL-to-SQL), dominating most NL-to-SQL benchmarks. But LLMs still face limitations due to hallucinations, semantic ambiguity, and lexical mismatches between an NL query and the database schema. Naturally, a lot of work in the ML+DB intersection aims to mitigate such LLM limitations. In this work, we shine a light on a complementary data-centric question: How should DB schemas evolve in this era of LLMs to boost NL-to-SQL? The intuition is that more NL-friendly schema identifiers can help LLMs work better with DBs. We dive deeper into this seemingly obvious, but hitherto underexplored and important, connection between schema identifier "naturalness" and the behavior of LLM-based NL-to-SQL by creating a new integrated benchmark suite we call SNAILS. SNAILS has 4 novel artifacts: (1) A collection of real-world DB schemas not present in prior NL-to-SQL benchmarks; (2) A set of labeled NL-SQL query pairs on our collection not seen before by public LLMs; (3) A notion of naturalness level for schema identifiers and a novel labeled dataset of modified identifiers; and (4) AI artifacts to automatically modify identifier naturalness. Using SNAILS, we perform a comprehensive empirical evaluation of the impact of schema naturalness on LLM-based NL-to-SQL accuracy, and present a method for improving LLM-based NL-to-SQL with natural views. Our results reveal statistically significant correlations across multiple public LLMs from OpenAI, Meta, and Google on multiple databases using both zero-shot prompting as well as more complex NL-to-SQL workflows: DIN SQL and CodeS. We present several fine-grained insights and discuss pathways for DB practitioners to better exploit LLMs for NL-to-SQL.

SPAS: Continuous Release of Data Streams under w-Event Differential Privacy

Continuous release of data streams is frequently used in numerous applications. However, when data is sensitive, this poses privacy risks. To mitigate this risk, efforts have been devoted to devising techniques that satisfy a formal privacy notion called w-event differential privacy. Nevertheless, a recent benchmark reveals that none of the existing works offer a universally effective solution across all types of data streams, making it challenging to select an appropriate scheme for unknown data streams in practical scenarios.

We identify that all existing methods are heuristic-based and make data-independent decisions. In this paper, we change this landscape by introducing SPAS, which is built on data-dependent strategies. Specifically, SPAS continuously predicts an optimal publishing strategy within each sliding window that minimizes the error of the released results based on the characteristics of the data stream. Additionally, we develop a weighted sparse vector technique to control data sampling and manage privacy budget consumption following that optimal publishing strategy. Comprehensive experimental evaluations demonstrate the efficacy of SPAS in adapting to diverse one-dimensional and multi-dimensional data streams for both data release and range query tasks. Our code is open-sourced.
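
For intuition, the sketch below shows a plain (unweighted) sparse-vector-style publisher for a count stream: it re-releases the last published value unless the stream has drifted past a noisy threshold. It only illustrates the sampling idea; SPAS's weighted sparse vector technique, strategy prediction, and w-event sliding-window budget accounting are not reproduced here.

import numpy as np

def svt_stream_release(counts, epsilon, threshold):
    """Simplified sparse-vector-style release of a count stream.
    Publishes a fresh noisy count only when the true count deviates from
    the last release by more than a noisy threshold; otherwise re-releases
    the previous value. Illustration only; formal w-event accounting omitted."""
    eps_test, eps_pub = epsilon / 2.0, epsilon / 2.0
    noisy_threshold = threshold + np.random.laplace(scale=2.0 / eps_test)
    last = 0.0
    released = []
    for c in counts:
        deviation = abs(c - last) + np.random.laplace(scale=4.0 / eps_test)
        if deviation > noisy_threshold:
            last = c + np.random.laplace(scale=1.0 / eps_pub)   # fresh noisy count
            noisy_threshold = threshold + np.random.laplace(scale=2.0 / eps_test)
        released.append(last)
    return released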

Subspace Collision: An Efficient and Accurate Framework for High-dimensional Approximate Nearest Neighbor Search

Approximate Nearest Neighbor (ANN) search in high-dimensional Euclidean spaces is a fundamental problem with a wide range of applications. However, no existing ANN method performs well in both indexing and query answering while providing rigorous theoretical guarantees on the quality of the answers. In this paper, we first design SC-score, a metric that we show follows the Pareto principle and can act as a proxy for the Euclidean distance between data points. Inspired by this, we propose a novel ANN search framework called Subspace Collision (SC), which can provide theoretical guarantees on the quality of its results. We further propose SuCo, which achieves efficient and accurate ANN search by designing a clustering-based lightweight index and query strategies for our proposed subspace collision framework. Extensive experiments on real-world datasets demonstrate that SuCo outperforms state-of-the-art ANN methods that provide theoretical guarantees in both indexing and query answering, achieving 1-2 orders of magnitude faster query answering with at most one-tenth of the index memory footprint. Moreover, SuCo achieves top performance (best on hard datasets) even when compared to methods that do not provide theoretical guarantees.
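
The toy numpy sketch below conveys the collision intuition only (it is not SuCo's index or SC-score): dimensions are split into subspaces, each point is assigned to its nearest randomly sampled centroid per subspace, and points that collide with the query in enough subspaces are re-ranked exactly.

import numpy as np

def subspace_collision_search(data, query, k, n_subspaces=8,
                              n_centroids=64, min_collisions=2, rng=None):
    """Toy collision-based candidate filter (assumes n_centroids <= len(data)).
    Brute-force nearest-centroid assignment is used for clarity."""
    rng = rng or np.random.default_rng(0)
    n, d = data.shape
    sub_dims = np.array_split(np.arange(d), n_subspaces)
    collisions = np.zeros(n, dtype=int)
    for dims in sub_dims:
        centroids = data[rng.choice(n, n_centroids, replace=False)][:, dims]
        assign = np.argmin(((data[:, dims][:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        q_assign = np.argmin(((query[dims][None] - centroids) ** 2).sum(-1))
        collisions += (assign == q_assign)
    cand = np.where(collisions >= min_collisions)[0]
    if len(cand) == 0:
        cand = np.arange(n)                       # fall back to exhaustive search
    dists = ((data[cand] - query) ** 2).sum(-1)   # exact re-ranking of candidates
    return cand[np.argsort(dists)[:k]]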

SymphonyQG: Towards Symphonious Integration of Quantization and Graph for Approximate Nearest Neighbor Search

Approximate nearest neighbor (ANN) search in high-dimensional Euclidean space has a broad range of applications. Among existing ANN algorithms, graph-based methods have shown superior performance in terms of the time-accuracy trade-off. However, they face performance bottlenecks due to the random memory accesses caused by searching the graph indices and the cost of computing exact distances to guide the search. To relieve these bottlenecks, a recent method named NGT-QG integrates quantization and graph: it (1) replicates and stores the quantization codes of a vertex's neighbors compactly so that they can be accessed sequentially, and (2) uses a SIMD-based implementation named FastScan to efficiently estimate distances from the quantization codes in batch to guide the search. While NGT-QG achieves promising improvements over vanilla graph-based methods, it has not fully unleashed the potential of integrating quantization and graph. For instance, it entails a re-ranking step that computes exact distances at the end, which introduces extra random memory accesses, and its graph structure is not jointly designed for the in-batch nature of FastScan, which wastes computation during the search. In this work, following NGT-QG, we present a new method named SymphonyQG, which achieves a more symphonious integration of quantization and graph (e.g., it avoids the explicit re-ranking step and refines the graph structure to be more aligned with FastScan). Based on extensive experiments on real-world datasets, SymphonyQG establishes the new state-of-the-art in terms of the time-accuracy trade-off: at 95% recall, SymphonyQG achieves 1.5x-4.5x the QPS of the most competitive baselines and 3.5x-17x the QPS of the classical library HNSWlib across all tested datasets. At the same time, its indexing is at least 8x faster than NGT-QG's.
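
For readers unfamiliar with batch distance estimation from quantization codes, the following plain-numpy sketch shows the lookup-table idea that FastScan vectorizes with SIMD; it assumes the query dimensions are split evenly to match the per-subspace codebooks, and it is not NGT-QG or SymphonyQG code.

import numpy as np

def batch_estimated_distances(query, codes, codebooks):
    """Estimate squared distances from `query` to many quantized points.
    codes: (n_points, n_subspaces) integer codes; codebooks: list of
    (n_centroids, sub_dim) arrays, one per subspace, matching the split of
    `query`. Builds per-subspace tables of query-to-centroid distances and
    sums table entries per code; SIMD layouts are omitted in this sketch."""
    n_subspaces = len(codebooks)
    sub_queries = np.array_split(query, n_subspaces)
    tables = [((cb - q[None, :]) ** 2).sum(axis=1)
              for cb, q in zip(codebooks, sub_queries)]
    est = np.zeros(codes.shape[0])
    for s, table in enumerate(tables):
        est += table[codes[:, s]]
    return est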

TGraph: A Tensor-centric Graph Processing Framework

Graphs are ubiquitous in real-world applications, and many graph processing systems have been developed. Recently, hardware accelerators have been exploited to speed up graph systems. However, such hardware-specific systems are hard to migrate across different hardware backends. In this paper, we propose the first tensor-based graph processing framework, Tgraph, which can be smoothly deployed and run on any powerful hardware accelerator (uniformly called XPU) that supports Tensor Computation Runtimes (TCRs). TCRs, which are deep learning frameworks along with their runtimes and compilers, provide tensor-based interfaces that let users easily utilize specialized hardware accelerators without delving into complex low-level programming details. However, building an efficient tensor-based graph processing framework is non-trivial. Thus, we make the following efforts: (1) propose a tensor-centric computation model for users to implement graph algorithms with easy-to-use programming interfaces; (2) provide a set of graph operators implemented with tensors to shield the computation model from the detailed tensor operators, so that Tgraph can be easily migrated and deployed across different TCRs; (3) design a tensor-based graph compression and computation strategy and an out-of-XPU-memory computation strategy to handle large graphs. We conduct extensive experiments on multiple graph algorithms (BFS, WCC, SSSP, etc.), which validate that Tgraph not only outperforms seven state-of-the-art graph systems, but also can be smoothly deployed and run on multiple DL frameworks (PyTorch and TensorFlow) and hardware backends (Nvidia GPU, AMD GPU, and Apple MPS).
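
As a flavor of the tensor-centric style (our own minimal PyTorch illustration, not Tgraph's operator set), the sketch below expresses BFS so that each frontier expansion is a single matrix-vector product that a TCR can offload to an accelerator.

import torch

def bfs_levels(adj, source):
    """Tensor-formulated BFS. adj is a dense float {0,1} adjacency matrix
    (n x n) on any device PyTorch supports; each iteration expands the
    frontier with one matrix-vector product. Minimal illustration only."""
    n = adj.shape[0]
    level = torch.full((n,), -1, dtype=torch.long, device=adj.device)
    frontier = torch.zeros(n, dtype=adj.dtype, device=adj.device)
    frontier[source] = 1
    level[source] = 0
    depth = 0
    while frontier.any():
        depth += 1
        reached = (adj.T @ frontier) > 0      # vertices reachable from the frontier
        new = reached & (level < 0)           # keep only unvisited vertices
        level[new] = depth
        frontier = new.to(adj.dtype)
    return level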

Tribase: A Vector Data Query Engine for Reliable and Lossless Pruning Compression using Triangle Inequalities

Approximate Nearest Neighbor Search (ANNS) is a critical problem in vector databases. Cluster-based indexes are used to narrow the search scope of ANNS, thereby accelerating the search process; due to their scalability, they are widely employed in real-world vector search systems. However, existing cluster-based indexes often suffer from coarse granularity, requiring query vectors to compute distances to vectors of varying quality, which increases query complexity. Existing work aims to represent vectors at minimal cost, for example using product quantization (PQ) or linear transformations, to speed up ANNS. However, these approaches do not address the coarse granularity inherent in cluster-based indexes. In this paper, we present an efficient vector data query engine that enhances the granularity of cluster-based indexes by carefully subdividing clusters using diverse distance metrics. Building on this refined index, we introduce techniques that leverage triangle inequalities to develop highly optimized and distinct search strategies for clusters and vectors of varying quality, thereby reducing the overhead of ANNS. Extensive experiments demonstrate that our method significantly outperforms existing in-memory cluster-based indexing algorithms, achieving up to a 10× speedup and a pruning ratio exceeding 99.4%.
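
The sketch below illustrates the classic triangle-inequality bound this line of work builds on: for a point x in a cluster with centroid c, |d(q,c) - d(c,x)| <= d(q,x), so a point whose lower bound already exceeds the current k-th best distance can be skipped without computing its exact distance. It is a textbook illustration, not Tribase's search strategies.

import numpy as np

def search_with_triangle_pruning(query, centroids, clusters, k):
    """clusters[i]: list of (vector, dist_to_centroid_i) pairs stored at
    index-build time. Uses |d(q,c) - d(c,x)| <= d(q,x) to skip exact
    distance computations that cannot enter the top-k."""
    d_qc = np.linalg.norm(centroids - query, axis=1)
    topk = []                                   # sorted list of (dist, vector)
    kth = np.inf
    for i in np.argsort(d_qc):                  # visit nearest clusters first
        for x, d_cx in clusters[i]:
            if abs(d_qc[i] - d_cx) >= kth:      # lower bound cannot beat k-th best
                continue
            d = np.linalg.norm(x - query)
            topk = sorted(topk + [(d, x)], key=lambda t: t[0])[:k]
            kth = topk[-1][0] if len(topk) == k else np.inf
    return topk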

U-DPAP: Utility-aware Efficient Range Counting on Privacy-preserving Spatial Data Federation

Range counting is a fundamental operation in spatial data applications. There is a growing demand to facilitate this operation over a data federation, where spatial data are separately held by multiple data providers (a.k.a., data silos). Most existing data federation schemes employ Secure Multiparty Computation (SMC) to protect privacy, but this approach is computationally expensive and leads to high latency. Consequently, private data federations are often impractical for typical database workloads. This challenge highlights the need for a private data federation scheme capable of providing fast and accurate query responses while maintaining strong privacy.

To address this issue, we propose U-DPAP, a utility-aware, efficient privacy-preserving method. It is the first scheme to rely exclusively on differential privacy for privacy protection in spatial data federations, without employing SMC, and it incorporates approximate query processing to further enhance efficiency. Our experimental results indicate that a straightforward combination of the two techniques has unacceptable impacts on data utility. Thus, we design two novel algorithms: one that makes differential privacy practical by optimizing the privacy-utility trade-off, and another that addresses the efficiency-utility trade-off in approximate query processing. The grouping-based perturbation algorithm reduces noise by grouping similar data and applying noise to the groups. The representative data silo selection algorithm minimizes approximation error by selecting representative silos based on the similarity between data silos. We rigorously prove the privacy guarantees of U-DPAP. Moreover, experimental results demonstrate that U-DPAP enhances data utility by an order of magnitude while maintaining high communication efficiency.
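
The sketch below conveys the grouping intuition only (it is not U-DPAP's algorithm, and the data-dependent sorting step shown would itself need to be privatized): perturbing one group mean with Laplace noise and sharing it across the group's members adds roughly 1/g of the noise of perturbing each count independently, at the cost of within-group approximation error.

import numpy as np

def grouped_laplace_release(counts, epsilon, group_size=4, sensitivity=1.0):
    """Grouping-style perturbation sketch: partition similar counts into
    groups, add Laplace noise to each group's mean (whose sensitivity is
    sensitivity / group size), and release the noisy mean for every member."""
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(counts)            # group similar values together (simplified)
    released = np.empty_like(counts)
    for start in range(0, len(counts), group_size):
        idx = order[start:start + group_size]
        noisy_mean = counts[idx].mean() + np.random.laplace(
            scale=sensitivity / (epsilon * len(idx)))
        released[idx] = noisy_mean
    return released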

Ultraverse: An Efficient What-if Analysis Framework for Software Applications Interacting with Database Systems

Existing what-if analysis systems are predominantly tailored to operate on either only the application layer or only the database layer of software. This isolated approach limits their effectiveness in scenarios with intensive interaction between applications and database systems. To address this gap, we introduce Ultraverse, a what-if analysis framework that seamlessly integrates both the application and database layers. Ultraverse employs dynamic symbolic execution to translate application code into compact SQL procedure representations, thereby synchronizing application semantics at both the SQL and application levels during what-if replays. A novel aspect of Ultraverse is its use of advanced query dependency analysis, which serves two key purposes: (1) it eliminates the need to replay irrelevant transactions that do not influence the outcome, and (2) it enables parallel replay of mutually independent transactions, significantly enhancing analysis efficiency. Ultraverse is applicable to existing unmodified database systems and legacy application code. Our extensive evaluation demonstrates substantial improvements in what-if analysis speed, with performance gains ranging from 7.7x to 291x across diverse benchmarks.
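
To illustrate the dependency idea (with a hypothetical transaction representation, not Ultraverse's SQL-level analysis), the sketch below builds conflict edges over transactions from their read and write sets and returns only those transitively dependent on the modified transaction; everything outside this set can be skipped during a what-if replay, and transactions inside it that do not depend on each other could be replayed in parallel.

from collections import defaultdict, deque

def transactions_to_replay(txns, changed_txn):
    """txns: ordered list of (txn_id, read_set, write_set), where the sets
    contain table or column names. Adds an edge when a later transaction
    reads or writes something an earlier one wrote, then returns the IDs
    transitively dependent on `changed_txn`. Simplified sketch only."""
    dependents = defaultdict(set)
    for i, (tid_i, reads_i, writes_i) in enumerate(txns):
        for tid_j, reads_j, writes_j in txns[i + 1:]:
            if writes_i & (reads_j | writes_j):
                dependents[tid_i].add(tid_j)
    affected, queue = {changed_txn}, deque([changed_txn])
    while queue:
        t = queue.popleft()
        for nxt in dependents[t]:
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    return affected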

User-Centric Property Graph Repairs

Property graphs serve as unifying abstractions for encoding, inspecting, and updating interconnected data with high expressive power. They are increasingly popular across application domains involving real users. However, graph data often contain inconsistencies; addressing the underlying constraint violations requires suitable transformations and often specific domain knowledge. In this paper, we propose an interactive, user-centric approach to repairing property graphs under denial constraints. Our approach includes a novel theoretical framework comprising a query-based inconsistency detection mechanism, a dependency graph for tracking violations, and an assignment algorithm that facilitates multi-user property graph repairs by leveraging independent sets. We evaluate our approach through several experiments on real-world and synthetic datasets, considering different levels of user expertise and comparing against various baselines. Even with multiple non-oracle users, our approach outperforms existing interactive and non-interactive baselines by 30% on average in terms of repair quality. Additionally, we conduct a user study to assess real user performance in property graph repairs.
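
The following toy sketch illustrates the independent-set intuition behind the assignment step (the violation conflict graph and user pool are hypothetical inputs, and the paper's assignment algorithm is more refined): violations that share no graph elements can be handed to different users and repaired without conflicting updates.

def assign_independent_violations(conflicts, users):
    """conflicts: dict mapping violation_id -> set of conflicting violation
    ids (violations touching shared graph elements). Greedily picks an
    independent set and round-robins it over the (non-empty) user list."""
    # Prefer violations with few conflicts so the independent set stays large.
    chosen, blocked = [], set()
    for v in sorted(conflicts, key=lambda v: len(conflicts[v])):
        if v not in blocked:
            chosen.append(v)
            blocked |= conflicts[v] | {v}
    return {v: users[i % len(users)] for i, v in enumerate(chosen)}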

VEGA: An Active-tuning Learned Index with Group-Wise Learning Granularity

Learned indexes, which model key-value data structures with machine learning models, have been extensively studied. However, the fastest immutable learned indexes (e.g., RMI) do not provide the same tight lookup bounds as classical indexes such as B-trees. Learned indexes that do provide tight bounds (e.g., PGM) fall short in query performance. This gives rise to an interesting open question: does there exist a learned index that simultaneously achieves state-of-the-art empirical performance and matching theoretical complexity?

In this paper, we give a positive answer to this long-standing question. We propose two new online model-building policies: (1) simplifying the key distribution by adopting a proper granularity (i.e., grouping multiple keys together for model building) and (2) actively tuning the distribution through key repositioning. Additionally, we introduce a general framework that combines these two policies for performance optimization under a given memory budget. We put everything together to design VEGA, a learned index that simultaneously achieves competitive theoretical and empirical performance compared to state-of-the-art learned indexes. Our extensive evaluation demonstrates that VEGA achieves both better lookup and better index-building performance.
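
To make the group-wise granularity policy concrete, the sketch below (our own illustration assuming fixed-size groups and per-group linear models; VEGA's active key repositioning and memory-budget tuning are not shown) fits one model per group of consecutive sorted keys, records its maximum prediction error, and bounds each lookup's local search by that error.

import numpy as np

class GroupedLearnedIndex:
    """Group-wise learned index sketch: one linear model per group of
    consecutive sorted keys plus its maximum prediction error, so a lookup
    predicts a position and searches only within that error bound."""

    def __init__(self, keys, group_size=256):
        self.keys = np.sort(np.asarray(keys, dtype=float))
        self.group_size = group_size
        self.models = []                      # (slope, intercept, start, max_err)
        for start in range(0, len(self.keys), group_size):
            g = self.keys[start:start + group_size]
            pos = np.arange(len(g), dtype=float)
            if len(g) > 1 and g[-1] > g[0]:
                slope, intercept = np.polyfit(g, pos, 1)
            else:
                slope, intercept = 0.0, 0.0   # degenerate group (duplicates / singleton)
            err = int(np.ceil(np.max(np.abs(slope * g + intercept - pos))))
            self.models.append((slope, intercept, start, err))
        self.firsts = self.keys[::group_size]  # first key of each group

    def lookup(self, key):
        i = max(int(np.searchsorted(self.firsts, key, side="right")) - 1, 0)
        slope, intercept, start, err = self.models[i]
        group_end = min(start + self.group_size, len(self.keys))
        guess = start + int(round(slope * key + intercept))
        # Error-bounded local search around the predicted position.
        lo = max(start, min(guess - err, group_end - 1))
        hi = min(group_end, max(guess + err + 1, start + 1))
        j = lo + int(np.searchsorted(self.keys[lo:hi], key))
        return j if j < len(self.keys) and self.keys[j] == key else None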