We are excited to announce the third issue dedicated to the Principles of Database Systems (PODS) research track of the Proceedings of the ACM on Management of Data (PACMMOD) journal. In its current form, the journal hosts both a SIGMOD and a PODS research track. The PODS research track aims to provide a solid scientific foundation for methods, techniques, and solutions to the data management challenges that continually arise in our data-driven society. More specifically, articles in the PODS track of PACMMOD present principled contributions to modeling, application, system building, and both theoretical and experimental validation in the context of data management. Such articles may be based, among other things, on establishing theoretical results, developing new concepts and frameworks that merit further exploration, providing experimental work that sheds light on the scientific foundations of the discipline, or conducting a rigorous analysis of widely used and/or recently developed industry artifacts.
Motivated by recent connections to factorised databases, we analyse the efficiency of representations by context-free grammars (CFGs). Concretely, we prove a recent conjecture of Kimelfeld, Martens, and Niewerth (ICDT 2025) that, for finite languages, representations by general CFGs can be doubly-exponentially smaller than those by unambiguous CFGs. To do so, we show the first exponential lower bounds for representation by unambiguous CFGs of a finite language that can be represented efficiently by ambiguous CFGs. Our proof first reduces the problem to proving a lower bound in a non-standard model of communication complexity. Then, arguing in a spirit similar to a recent discrepancy argument, we show the required communication complexity lower bound. Our result also implies that a finite language may admit an exponentially smaller representation as a nondeterministic finite automaton than as an unambiguous CFG.
Schema matching refers to the task of identifying corresponding attributes of different database relation schemas to enable the efficient integration of the associated datasets. We model the task of finding suitable 1:N/N:1 global matchings in relational schemas as an optimization problem. We show that this optimization problem is NP-hard. We then translate the optimization problem into the problem of minimizing a particular rational-valued function on binary variables. The latter enables us to utilize modern quantum algorithms for solving the global matching problem, a crucial stage in schema matching. We also report on preliminary experimental results that serve as a proof-of-concept for our approach.
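To make the translation concrete, here is a minimal sketch of how a 1:N matching instance can be encoded as an objective over binary variables in the QUBO style that quantum (and classical) annealers accept. The similarity scores, penalty weight, and constraint encoding below are illustrative assumptions rather than the paper's actual formulation, and the brute-force loop stands in for a QUBO solver.

```python
from itertools import product

# Hypothetical attribute-similarity scores between two schemas.
attrs_a = ["name", "addr"]
attrs_b = ["full_name", "street", "city"]
sim = {("name", "full_name"): 0.9, ("addr", "street"): 0.7,
       ("addr", "city"): 0.6, ("name", "street"): 0.1,
       ("name", "city"): 0.1, ("addr", "full_name"): 0.2}

pairs = list(product(attrs_a, attrs_b))  # one binary variable per pair

def objective(assign):
    # Reward matched similarity (we minimize, so it enters negatively).
    score = -sum(sim[p] for p in pairs if assign[p])
    # Soft quadratic constraint: each attribute of B is matched to at
    # most one attribute of A (1:N from A's side). The penalty weight
    # 2.0 is an illustrative choice.
    for b in attrs_b:
        matched = sum(assign[(a, b)] for a in attrs_a)
        score += 2.0 * matched * (matched - 1)
    return score

# Brute force over all binary assignments; a QUBO solver replaces this.
best = min(({p: (bits >> i) & 1 for i, p in enumerate(pairs)}
            for bits in range(2 ** len(pairs))),
           key=objective)
print({p for p, v in best.items() if v})
```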
Effective data discovery is a cornerstone of modern data-driven decision-making. Yet, identifying datasets with specific distributional characteristics, such as percentiles or preferences, remains challenging. While recent proposals have enabled users to search based on percentile predicates, much of the research in data discovery relies on heuristic methods, which often result in biased outcomes. This paper presents the first theoretically backed framework that unifies data discovery under centralized and decentralized settings.
More specifically, let P = {P_1, ..., P_N} be a repository of N datasets, such that each P_i ⊂ ℝ^d, where d is a constant. We study the percentile-aware indexing (Ptile) problem and the preference-aware indexing (Pref) problem in the centralized and the federated setting. In the centralized setting, we assume direct access to the datasets in P. In the federated setting, we are instead given, for every i ∈ [N], a synopsis S_{P_i}, which is a compressed representation of P_i that captures its structure. For the Ptile problem, the goal is to construct a data structure such that, given a predicate (a query rectangle R and an interval θ), it reports all indexes J such that j ∈ J if and only if |P_j ∩ R|/|P_j| ∈ θ. For the Pref problem, the goal is to construct a data structure such that, given a predicate (a query vector v⃗ and an interval θ), it reports all indexes J such that j ∈ J if and only if ω_k(P_j, v⃗) ∈ θ, where ω_k(P_j, v⃗) is the score (inner product) of the k-th largest projection of P_j on v⃗. We first show lower bounds for the Ptile and Pref problems in the centralized setting, establishing that we cannot hope for near-linear data structures with polylogarithmic query time. We then focus on approximate data structures for both problems in both settings. We show Õ(N)-space data structures with Õ(N) preprocessing time that can answer Ptile and Pref queries in Õ(1 + OUT) time, where OUT is the output size. The data structures return a set of indexes J such that: (i) for every P_i that satisfies the predicate, i ∈ J, and (ii) if j ∈ J then P_j satisfies the predicate up to an additive error of ε + 2δ, where ε is an arbitrarily small constant and δ is the error of the synopses.
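For reference, the two predicates have the following brute-force semantics in the centralized setting; the paper's data structures answer these queries without scanning every dataset, so the linear scan below only spells out the definitions (the random data and parameters are illustrative).

```python
import numpy as np

# N = 5 toy datasets in R^2; in the paper these live in a repository P.
rng = np.random.default_rng(0)
P = [rng.random((100, 2)) for _ in range(5)]

def ptile(P, rect_lo, rect_hi, theta):
    # Report j with |Pj ∩ R| / |Pj| in the interval theta.
    lo, hi = theta
    out = []
    for j, Pj in enumerate(P):
        inside = np.all((Pj >= rect_lo) & (Pj <= rect_hi), axis=1)
        if lo <= inside.mean() <= hi:
            out.append(j)
    return out

def pref(P, v, k, theta):
    # Report j whose k-th largest projection score onto v lies in theta.
    lo, hi = theta
    out = []
    for j, Pj in enumerate(P):
        scores = Pj @ v                       # inner products with v
        kth = np.partition(scores, -k)[-k]    # k-th largest score
        if lo <= kth <= hi:
            out.append(j)
    return out

print(ptile(P, np.array([0.2, 0.2]), np.array([0.8, 0.8]), (0.3, 0.5)))
print(pref(P, np.array([1.0, 1.0]), k=3, theta=(1.5, 2.0)))
```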
We study subgraph counting over fully dynamic graphs, which undergo edge insertions and deletions. Counting subgraphs is a fundamental problem in graph theory with numerous applications across various fields, including database theory, social network analysis, and computational biology. In database theory, we can use dynamic subgraph counting algorithms on layered graphs to maintain the sizes of joins of databases that undergo updates. Specifically, the problem of finding the number of elements in a cyclic join of size k is equivalent to counting the number of k-cycles in k-layered graphs. For example, let R, S, and T be relations with schemas (A, B), (B, C), and (C, A), respectively. Then the size of the join of R, S, and T is given by the number of triangles in the corresponding layered graph, where there is a layer for each attribute, the vertices are the attribute values, and the edges represent the tuples of attribute values in the relations. Maintaining the number of triangles in fully dynamic graphs is very well studied and has an upper bound of O(√m) on the update time [KNN+20]. There is also a conditional lower bound of Ω(m^{1/2-γ}) for any constant γ > 0 on the update time [HKNS15] under the Online Matrix-Vector (OMv) conjecture, implying that O(√m) is the "right answer" for the update time of counting triangles. More recently, [HHH22] studied the problem of maintaining the number of 4-cycles in fully dynamic graphs and designed an algorithm with O(m^{2/3}) update time, a natural generalization of the approach for counting triangles. They also studied the problem of counting 4-cliques, showing that the folklore upper bound of O(m) on the update time is tight under the static combinatorial 4-clique conjecture, by giving a lower bound of Ω(m^{1-γ}) for any γ > 0. Thus, it seems natural that O(m^{2/3}) might be the correct answer for the complexity of the update time for counting 4-cycles. In this work, we present an improved algorithm for maintaining the number of 4-cycles in fully dynamic graphs. Our algorithm achieves a worst-case update time of O(m^{2/3-ε}) for some constant ε > 0. We also show that the problem of counting 4-cycles is equivalent in layered graphs and general graphs. Our approach crucially uses fast matrix multiplication and leverages recent developments therein to get an improved runtime. Using the current best value of the matrix multiplication exponent ω = 2.371339, we get ε = 0.009811, and if we assume the best possible exponent, i.e., ω = 2, then we get ε = 1/24. There is also a lower bound of Ω(m^{1/2-γ}) for any constant γ > 0 on the update time [HKNS15, HHH22], so a large gap remains between the best-known upper and lower bounds. The key message of our paper is that O(m^{2/3}) is not the correct answer for the complexity of the update time.
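The join-to-triangle correspondence from the example above can be checked directly on a toy instance: encode each tuple as an edge between the layers of its two attributes and enumerate the cyclic join. A minimal sketch, with made-up tuples:

```python
from collections import defaultdict

# R(A,B) ⋈ S(B,C) ⋈ T(C,A): its size equals the number of triangles in
# the 3-layered graph with one layer per attribute, one vertex per
# attribute value, and one edge per tuple.
R = {(1, "x"), (1, "y"), (2, "y")}          # edges between layers A and B
S = {("x", 10), ("y", 10), ("y", 20)}       # edges between layers B and C
T = {(10, 1), (20, 2), (10, 2)}             # edges between layers C and A

# Join size by direct enumeration of the cyclic join.
S_by_b = defaultdict(set)
for b, c in S:
    S_by_b[b].add(c)
T_set = set(T)
join_size = sum(1 for a, b in R for c in S_by_b[b] if (c, a) in T_set)

# Each join result (a, b, c) is exactly one triangle crossing the layers.
print(join_size)  # 4
```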
In this paper, we study circuit size bounds for Conjunctive Queries (CQs) under different semiring semantics. Recent work established tight bounds for self-join-free CQs over the tropical semiring, among other results [16]. Here, we extend these results in two main directions. First, we prove a lower bound for any self-join-free CQ over the Boolean semiring by extending the classic lower-bound results of Razborov and of Alon and Boppana for the k-clique problem [1, 38]. Second, we characterize the circuit complexity of CQs with self-joins by relating them to appropriate self-join-free CQs. Interestingly, this correspondence crucially depends on the underlying semiring. To achieve this result, we present a novel technique for investigating the circuit complexity of a CQ with self-joins through the lens of endomorphisms.
In this paper, we study circuits and formulas for provenance polynomials of Datalog programs. We ask the following question: given an absorptive semiring and a fact of a Datalog program, what are the optimal depth and size of a circuit/formula that computes its provenance polynomial? We focus on absorptive semirings as these guarantee the existence of a polynomial-size circuit. Our main result is a dichotomy, for several classes of Datalog programs, on whether they admit a formula of polynomial size or not. We achieve this result by showing that for these Datalog programs the optimal circuit depth is either Θ(log m) or Θ(log^2 m), where m is the input size. We also show that for Datalog programs with the polynomial fringe property, we can always construct low-depth circuits of depth O(log^2 m). Finally, we give characterizations of when Datalog programs are bounded over more general semirings.
Complex Event Recognition (CER) establishes a relevant solution for processing streams of events, giving users timely information. CER systems detect patterns in real-time, producing complex events and generating responses to them. In these tasks, time is a first-class citizen. Indeed, the time-sequence model distinguishes CER from other solutions, like data stream management systems. Surprisingly, time constraints have until now usually been included in CER query languages and models only in a restricted way, and we still lack an understanding of the expressiveness of, and efficient algorithms for, this crucial feature: time.
This work studies CER under time constraints regarding its query language, computational models, and streaming evaluation algorithms. We start by introducing an extension of Complex Event Logic (CEL), called timed CEL, with simple time operators. We show that timed CEL aids in modeling CER query languages in practice, serving as a proxy to study the expressive power of such languages under time constraints. For this purpose, we introduce an automata model for studying timed CEL, called timed Complex Event Automata (timed CEA). This model extends the existing CEA model with clocks, combining CEA and timed automata in a single model. We show that timed CEL and timed CEA are equally expressive, giving the first characterization of CER query languages under time constraints. Then, we move towards understanding the efficient evaluation of timed CEA over streams concerning its determinization and efficient algorithms. We present a class of timed CEA that are closed under determinization; furthermore, we show that this class contains swg-queries, an expressive class of CER queries recently introduced by Kleest-Meißner et al. Finally, we present a streaming evaluation algorithm with constant update time and output-linear delay for evaluating deterministic monotonic timed CEA with a single clock, which use only less-than-or-equal and greater-than-or-equal comparisons.
Differential privacy is the gold standard for privacy in data analysis. In many data analysis applications, the data is a database of documents. For databases consisting of many documents, one of the most fundamental problems is that of pattern matching and computing (i) how often a pattern appears as a substring in the database (substring counting) and (ii) how many documents in the collection contain the pattern as a substring (document counting). In this paper, we initiate the theoretical study of substring and document counting under differential privacy. We give an ε-differentially private data structure solving this problem for all patterns simultaneously with a maximum additive error of O(ℓ • polylog(nℓ|Σ|)), where ℓ is the maximum length of a document in the database, n is the number of documents, and |Σ| is the size of the alphabet. We show that this is optimal up to an O(polylog(nℓ)) factor. Further, we show that for (ε, δ)-differential privacy, the bound for document counting can be improved to O(√ℓ • polylog(nℓ|Σ|)). Additionally, our data structures are efficient. In particular, our data structures use O(nℓ^2) space, O(n^2ℓ^4) preprocessing time, and O(|P|) query time, where P is the query pattern. Along the way, we develop a new technique for differentially privately computing a general class of counting functions on trees, which is of independent interest. Our data structures immediately lead to improved algorithms for related problems, such as privately mining frequent substrings and q-grams. For q-grams, we further improve the preprocessing time of the data structure.
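For concreteness, the two counts being privatized have the following plain, non-private semantics; this brute-force sketch only fixes the definitions that the private data structure approximates.

```python
def occurrences(doc, P):
    # All (possibly overlapping) occurrences of pattern P in doc.
    return sum(1 for i in range(len(doc) - len(P) + 1)
               if doc[i:i + len(P)] == P)

def substring_count(docs, P):
    # (i) Total number of occurrences of P across the database.
    return sum(occurrences(d, P) for d in docs)

def document_count(docs, P):
    # (ii) Number of documents containing P as a substring.
    return sum(1 for d in docs if P in d)

docs = ["banana", "bandana", "apple"]
print(substring_count(docs, "an"), document_count(docs, "an"))  # 4 2
```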
Cardinality estimation and conjunctive query evaluation are two of the most fundamental problems in database query processing. Recent work proposed, studied, and implemented a robust and practical information-theoretic cardinality estimation framework. In this framework, the estimator is the cardinality upper bound of a conjunctive query subject to "degree constraints", which model a rich set of input data statistics. For general degree constraints, computing this bound is computationally hard. Researchers have naturally sought efficiently computable relaxed upper bounds that are as tight as possible. The polymatroid bound is the tightest among these relaxed upper bounds. While it is an open question whether the polymatroid bound can be computed in polynomial time in general, it is known to be computable in polynomial time for some classes of degree constraints.
Our focus is on a common class of degree constraints called simple degree constraints. It was not previously known how to compute the polymatroid bound in polynomial time for this class of constraints. Our first main result is a polynomial-time algorithm to compute the polymatroid bound given simple degree constraints. Our second main result is a polynomial-time algorithm to compute a "proof sequence" establishing this bound. This proof sequence can then be incorporated into the PANDA framework to give a faster algorithm to evaluate a conjunctive query. In addition, we show computational limitations to extending our results to broader classes of degree constraints. Finally, our technique leads naturally to a new relaxed upper bound, called the flow bound, which is computationally tractable.
Despite the wide use of k-Nearest Neighbors as classification models, their explainability properties remain poorly understood from a theoretical perspective. While nearest neighbor classifiers offer interpretability from a "data perspective", in which the classification of an input vector x is explained by identifying the vectors v_1, ..., v_k in the training set that determine its classification, we argue that such explanations can be impractical in high-dimensional applications, where each vector has hundreds or thousands of features and it is not clear what their relative importance is. Hence, we focus on understanding nearest neighbor classifications through a "feature perspective", in which the goal is to identify how the values of the features in x affect its classification. Concretely, we study abductive explanations such as "minimum sufficient reasons", which correspond to sets of features in x that suffice to guarantee its classification, and counterfactual explanations based on the minimum-distance feature changes one would have to perform on x to change its classification. We present a detailed landscape of positive and negative complexity results for counterfactual and abductive explanations, distinguishing between discrete and continuous feature spaces, and considering the impact of the choice of distance function. Finally, we show that despite some negative complexity results, Integer Quadratic Programming and SAT solving allow for computing explanations in practice.
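To make the feature perspective concrete, the sketch below computes a minimum sufficient reason for a 1-NN classifier over {0,1}^d by brute force. The tiny training set, the Hamming distance, and the tie-breaking rule are illustrative assumptions; the exponential enumeration over feature subsets and completions is precisely the kind of cost the paper's complexity results address.

```python
from itertools import combinations, product

# Toy 1-NN classifier over {0,1}^3 (assumed training data).
train = [((0, 0, 0), "A"), ((1, 1, 1), "B"), ((1, 1, 0), "B")]

def classify(v):
    # 1-NN under Hamming distance; ties broken by training order.
    return min(train, key=lambda t: sum(a != b for a, b in zip(t[0], v)))[1]

def is_sufficient(x, F):
    # F is sufficient if every completion of x's values on F keeps the label.
    label = classify(x)
    free = [i for i in range(len(x)) if i not in F]
    for bits in product([0, 1], repeat=len(free)):
        y = list(x)
        for i, b in zip(free, bits):
            y[i] = b
        if classify(tuple(y)) != label:
            return False
    return True

def minimum_sufficient_reason(x):
    # Smallest-first search over feature subsets.
    for size in range(len(x) + 1):
        for F in combinations(range(len(x)), size):
            if is_sufficient(x, set(F)):
                return set(F)

print(minimum_sufficient_reason((0, 0, 1)))
# {0, 1}: fixing features 0 and 1 of x already guarantees class "A"
```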
One fundamental question in database theory is the following: Given a Boolean conjunctive query Q, what is the best complexity for computing the answer to Q in terms of the input database size N? When restricted to the class of combinatorial algorithms, it is known that the best known complexity for any query Q is captured by the submodular width of Q. However, beyond combinatorial algorithms, certain queries are known to admit faster algorithms that often involve a clever combination of fast matrix multiplication and data partitioning. Nevertheless, there is no systematic way to derive and analyze the complexity of such algorithms for arbitrary queries Q.
In this work, we introduce a general framework that captures the best complexity for answering any Boolean conjunctive query Q using matrix multiplication. Our framework unifies both combinatorial and non-combinatorial techniques under the umbrella of information theory. It generalizes the notion of submodular width to a new, stronger notion called the ω-submodular width that naturally incorporates the power of fast matrix multiplication. We describe a matching algorithm that computes the answer to any query Q in time corresponding to the ω-submodular width of Q. We show that our framework recovers, to the best of our knowledge, the best known complexities for Boolean queries that have been studied in the literature, and also discovers new algorithms for some classes of queries that improve upon the best known complexities.
We study differentially private algorithms for analyzing graph databases in the challenging setting of continual release with fully dynamic updates, where edges are inserted and deleted over time, and the algorithm is required to update the solution at every time step. Previous work has presented differentially private algorithms for many graph problems that can handle insertions only or deletions only (called partially dynamic algorithms) and obtained some hardness results for the fully dynamic setting. The only algorithms in the latter setting were for the edge count, given by Fichtenberger, Henzinger, and Ost (ESA '21), and for releasing the values of all graph cuts, given by Fichtenberger, Henzinger, and Upadhyay (ICML '23). We provide the first differentially private and fully dynamic graph algorithms for several other fundamental graph statistics (including the triangle count, the number of connected components, the size of the maximum matching, and the degree histogram), analyze their error, and show strong lower bounds on the error for all algorithms in this setting. Previously, only lower bounds for purely differentially private algorithms were known; our lower bounds give an exponential improvement in terms of the dependence on the number of time steps, while applying to algorithms satisfying pure as well as approximate differential privacy. We study two variants of edge differential privacy for fully dynamic graph algorithms: event-level and item-level. Under the former notion, two graph database update sequences are considered neighboring if, roughly speaking, they differ in at most one update; under the latter notion, they can differ only in updates pertaining to one edge. Differential privacy requires that for any two neighboring inputs, the output distributions of the algorithm are close. We give upper and lower bounds on the error of both---event-level and item-level---fully dynamic algorithms for several fundamental graph problems. No fully dynamic algorithms that are private at the item-level (the more stringent of the two notions) were known before. In the case of item-level privacy, for several problems, our algorithms match our lower bounds.
We study the minimization problem for Conjunctive Regular Path Queries (CRPQs) and unions of CRPQs (UCRPQs). This is the problem of checking, given a query and a number k, whether the query is equivalent to one of size at most k. For CRPQs we consider the size to be the number of atoms, and for UCRPQs the maximum number of atoms in a CRPQ therein, motivated by the fact that the number of atoms has a leading influence on the cost of query evaluation. We show that the minimization problem is decidable, both for CRPQs and UCRPQs. We provide a 2ExpSpace upper bound for CRPQ minimization, based on a brute-force enumeration algorithm, and an ExpSpace lower bound. For UCRPQs, we show that the problem is ExpSpace-complete, thus having the same complexity as the classical containment problem. The upper bound is obtained by defining and computing a notion of maximal under-approximation. Moreover, we show that for UCRPQs using the so-called simple regular expressions, consisting of concatenations of expressions of the form a^+ or a_1 + ⋯ + a_k, the minimization problem becomes Π^p_2-complete, again matching the complexity of containment.
This paper addresses one of the fundamental open questions in the realm of existential rules: the conjecture on the finite controllability of bounded derivation depth (BDD) rule sets (bdd⇒fc). We take a step toward a positive resolution of this conjecture by demonstrating that universal models generated by BDD rule sets cannot contain arbitrarily large tournaments (arbitrarily large directed cliques) without entailing the loop query ∃x E(x,x). This simple yet elegant result narrows the space of potential counterexamples to the (bdd⇒fc) conjecture.
Locality-sensitive hashing (Indyk-Motwani '98) is a classical data structure for approximate nearest neighbor search. After close to linear time preprocessing of the input dataset, it allows finding an approximately nearest neighbor of any fixed query in time sublinear in the dataset size. The resulting data structure is randomized and succeeds with high probability for every fixed query, independently of the randomness of the data structure. In many modern applications of nearest neighbor search, however, the queries are chosen adaptively. In this paper, we study the robustness of locality-sensitive hashing in Hamming space against adaptive queries. We present a simple adversary that can, under mild assumptions on the initial point set, provably find a query on which the approximate near neighbor search data structure fails. Crucially, our adaptive algorithm finds the hard query exponentially faster than random sampling.
We study the problem of privately releasing an approximate minimum spanning tree (MST). Given a graph G = (V, E, W), where V is a set of n vertices, E is a set of m undirected edges, and W ∈ ℝ^{|E|} is an edge-weight vector, our goal is to publish an approximate MST under edge-weight differential privacy, as introduced by Sealfon in PODS 2016, where V and E are considered public and the weight vector is private. Our neighboring relation is ℓ∞-distance on weights: for a sensitivity parameter Δ∞, graphs G = (V, E, W) and G' = (V, E, W') are neighboring if ||W - W'||∞ ≤ Δ∞. Existing private MST algorithms face a trade-off, sacrificing either computational efficiency or accuracy. We show that it is possible to get the best of both worlds: with a suitable random perturbation of the input, which by itself does not suffice to make the weight vector private, the result of any non-private MST algorithm will be private and achieve a state-of-the-art error guarantee. Furthermore, by establishing a connection to Private Top-k Selection [Steinke and Ullman, FOCS '17], we give the first privacy-utility trade-off lower bound for MST under approximate differential privacy, demonstrating that the error magnitude, Õ(n^{3/2}), is optimal up to logarithmic factors. That is, our approach matches the time complexity of any non-private MST algorithm and at the same time achieves optimal error. We complement our theoretical treatment with experiments that confirm the practicality of our approach.
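A minimal sketch of the "perturb once, then run any off-the-shelf MST algorithm" idea follows. The Laplace noise and its scale are illustrative stand-ins for the paper's calibrated perturbation, and Kruskal's algorithm plays the role of the arbitrary non-private MST algorithm.

```python
import random

random.seed(0)

def laplace(scale):
    # Laplace(0, scale) as the difference of two exponentials.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def kruskal(n, weighted_edges):
    # Any off-the-shelf non-private MST algorithm works here; Kruskal
    # with union-find is used for concreteness.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for w, u, v in sorted(weighted_edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree

def private_mst(n, edges, eps, delta_inf):
    # Perturb each edge weight once, then hand the noisy graph to the
    # non-private algorithm. The noise scale is an illustrative choice,
    # not the paper's calibration.
    noisy = [(w + laplace(2 * delta_inf / eps), u, v) for u, v, w in edges]
    return kruskal(n, noisy)

edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 5.0), (2, 3, 1.5)]
print(private_mst(4, edges, eps=1.0, delta_inf=0.1))
```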
One of the most celebrated results on computing join-aggregate queries defined over commutative semirings is the classic Yannakakis algorithm, proposed in 1981. It is known that the runtime of the Yannakakis algorithm is O(N + OUT) for any free-connex query, where N is the input size of the database and OUT is the output size of the query result. This is already output-optimal. However, only an upper bound of O(N • OUT) on the runtime is known for the large remaining class of acyclic but non-free-connex queries. Alternatively, one can convert a non-free-connex query into a free-connex one using tree decomposition techniques and then run the Yannakakis algorithm. This approach takes O(N^{#fn-subw} + OUT) time, where #fn-subw is the free-connex submodular width of the query. But none of these results is known to be output-optimal.
In this paper, we show a matching lower and upper bound of Θ(N • OUT^{1 - 1/fn-fhtw} + OUT) for computing general acyclic join-aggregate queries by semiring algorithms, where fn-fhtw is the free-connex fractional hypertree width of the query. For example, fn-fhtw = 1 for free-connex queries, fn-fhtw = 2 for line queries (a.k.a. chain matrix multiplication), and fn-fhtw = k for star queries (a.k.a. star matrix multiplication) with k relations. While free-connex fractional hypertree width is a natural and well-established measure of how far a join-aggregate query is from being free-connex, we demonstrate that it also precisely captures the output-optimal complexity of these queries. To our knowledge, this is the first polynomial improvement over the Yannakakis algorithm in the last 40 years, and it completely resolves the open question of computing acyclic join-aggregate queries in an output-optimal way. As a by-product, our output-optimal algorithm for acyclic queries also yields new output-sensitive algorithms for cyclic queries via tree decomposition techniques.
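As a sanity check on the line-query example, the snippet below verifies over the standard (+, ×) semiring that the join-aggregate of a length-3 line query coincides with a chain matrix product; the relation encodings as annotation matrices are illustrative.

```python
import numpy as np

# Line query Q(a, d) = Σ_{b,c} R(a,b) · S(b,c) · T(c,d) over the (+, ×)
# semiring equals the matrix product R @ S @ T when each relation is
# encoded as an annotation matrix indexed by its attribute values.
R = np.array([[1, 2], [0, 3]])   # annotation of R on A × B
S = np.array([[4, 0], [1, 5]])   # annotation of S on B × C
T = np.array([[2, 1], [0, 6]])   # annotation of T on C × D

brute = np.zeros((2, 2), dtype=int)
for a in range(2):
    for d in range(2):
        brute[a, d] = sum(R[a, b] * S[b, c] * T[c, d]
                          for b in range(2) for c in range(2))

assert (brute == R @ S @ T).all()
print(brute)
```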
We study the classical evaluation problem for regular path queries: Given an edge-labeled graph and a regular path query, compute the set of pairs of vertices that are connected by paths that match the query.
The Product Graph (PG) is the established evaluation approach for regular path queries. PG first constructs the product automaton of the data graph and the query and then uses breadth-first search to find the accepting states reachable from each initial state in the product automaton. Its data complexity is O(|V|⋅|E|), where V and E are the sets of vertices and edges, respectively, in the data graph. This complexity cannot be improved by combinatorial algorithms.
In this paper, we introduce OSPG, an output-sensitive refinement of PG whose data complexity is O(|E|^{3/2} + min(OUT⋅√|E|, |V|⋅|E|)), where OUT is the number of distinct vertex pairs in the query output. OSPG's complexity is at most that of PG and can be asymptotically smaller for small output and sparse input. The improvement of OSPG over PG comes from avoiding the time PG wastes in the breadth-first search phase when only few output pairs are eventually discovered. For queries without Kleene star, the complexity of OSPG can be further improved to O(|E| + |E|⋅√OUT).
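For reference, the PG baseline described above amounts to a breadth-first search in the product of the data graph and the query automaton, one search per source vertex. A minimal sketch, with a toy graph and an automaton for the query (ab)+ as assumptions:

```python
from collections import deque

# Toy edge-labeled data graph (adjacency lists) and an NFA for (ab)+.
graph = {0: [("a", 1)], 1: [("b", 2)], 2: [("a", 1)]}
nfa = {("q0", "a"): {"q1"}, ("q1", "b"): {"q2"}, ("q2", "a"): {"q1"}}
initial, accepting = "q0", {"q2"}

def product_graph_eval(graph, nfa, initial, accepting):
    # One BFS per source vertex over pairs (graph vertex, NFA state).
    answers = set()
    for source in graph:
        seen = {(source, initial)}
        queue = deque(seen)
        while queue:
            v, q = queue.popleft()
            if q in accepting:
                answers.add((source, v))
            for label, w in graph.get(v, []):
                for q2 in nfa.get((q, label), ()):
                    if (w, q2) not in seen:
                        seen.add((w, q2))
                        queue.append((w, q2))
    return answers

print(product_graph_eval(graph, nfa, initial, accepting))
# {(0, 2), (2, 2)}: the vertex pairs connected by a path matching (ab)+
```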
Given a vector x ∈ ℝ^n induced by a turnstile stream S and a non-negative function G: ℝ → ℝ, a perfect G-sampler outputs an index i with probability G(x_i)/Σ_{j∈[n]} G(x_j) + 1/poly(n). Jayaram and Woodruff (FOCS 2018) introduced a perfect L_p-sampler, where G(z) = |z|^p, for p ∈ (0, 2]. In this paper, we solve this problem for p > 2 by a sampling-and-rejection method. Our algorithm uses n^{1-2/p} • polylog(n) bits of space, which is tight up to polylogarithmic factors in n. Our algorithm also provides a (1+ε)-approximation to the sampled item x_i with high probability, using an additional ε^{-2} n^{1-2/p} • polylog(n) bits of space.
Interestingly, we show that our techniques can be generalized to perfect polynomial samplers on turnstile streams, a class of functions that is not scale-invariant, in contrast to the existing perfect L_p samplers. We also achieve perfect samplers for the logarithmic function G(z) = log(1+|z|) and the cap function G(z) = min(T, |z|^p). Finally, we give an application of our results to the problem of norm/moment estimation for a subset Q of coordinates of a vector, revealed only after the data stream is processed, e.g., when the set Q represents a range query, or the set [n]\Q represents a collection of entities who wish for their information to be expunged from the dataset.
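The target distribution of a perfect G-sampler can be spelled out by an offline reference implementation; the paper's contribution is achieving this distribution in sublinear space on turnstile streams, which the sketch below does not attempt.

```python
import math
import random

def g_sample(x, G):
    # Return index i with probability proportional to G(x_i); this is
    # the distribution a perfect G-sampler realizes up to 1/poly(n).
    weights = [G(xi) for xi in x]
    r = random.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(x) - 1

x = [3.0, -1.0, 0.0, 2.0]
p = 3
print(g_sample(x, lambda z: abs(z) ** p))            # L_p sampling, p > 2
print(g_sample(x, lambda z: math.log1p(abs(z))))     # logarithmic G
print(g_sample(x, lambda z: min(4.0, abs(z) ** p)))  # cap function
```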
Protecting sensitive information on data streams is a pivotal challenge for modern systems. Current approaches to providing privacy in data streams can be broadly categorized into two strategies. The first strategy involves transforming the stream into a private sequence of values, enabling the subsequent use of non-private methods of analysis. While effective, this approach incurs high memory costs, often proportional to the size of the database. Alternatively, a compact data structure can be used to provide a private summary of the stream. However, these data structures are limited to predefined queries, restricting their flexibility. To overcome these limitations, we propose a lightweight synthetic data generator, PrivHP, that provides differential privacy guarantees. PrivHP is based on a novel method for the private hierarchical decomposition of the input domain in bounded memory. As the decomposition approximates the cumulative distribution function of the input, it serves as a lightweight structure for synthetic data generation. PrivHP is the first method to provide a principled trade-off between accuracy and space for private hierarchical decompositions. It achieves this by balancing hierarchy depth, noise addition, and selective pruning of low-frequency subdomains while preserving high-frequency ones, all identified in a privacy-preserving manner. To ensure memory efficiency, we employ private sketches to estimate subdomain frequencies without accessing the entire dataset. Central to our approach is the introduction of a pruning parameter k, which enables an almost smooth interpolation between space usage and utility, and a measure of skew, tail_k, which is the vector of subdomain frequencies containing all but the largest k coordinates. PrivHP processes a dataset X over the input domain Ω = [0,1]^d using M = O(k log^2 |X|) space and, while maintaining ε-differential privacy, produces a synthetic data generator that is at distance O(M^{1-1/d}/(εn) + ||tail_k(X)||_1/(M^{1/d}n)) from the empirical distribution in the expected Wasserstein metric. Compared to the state of the art, PMM, which achieves accuracy O((εn)^{-1/d}) with memory O(εn), our method introduces an additional approximation error term of O(||tail_k(X)||_1/(M^{1/d}n)), but operates in significantly reduced space. Additionally, we provide interpretable utility bounds that account for all error sources, including those introduced by the fixed hierarchy depth, privacy noise, hierarchy pruning, and frequency approximations.
The resilience problem for a query and an input set or bag database is to compute the minimum number of facts to remove from the database to make the query false. In this paper, we study how to compute the resilience of Regular Path Queries (RPQs) over graph databases. Our goal is to characterize the regular languages L for which it is tractable to compute the resilience of the existentially-quantified RPQ built from L.
We show that computing the resilience in this sense is tractable (even in combined complexity) for all RPQs defined from so-called local languages. By contrast, we show hardness in data complexity for RPQs defined from the following language classes (after reducing the languages to eliminate redundant words): all finite languages featuring a word containing a repeated letter, and all languages featuring a specific kind of counterexample to being local (which we call four-legged languages). The latter include in particular all languages that are not star-free. Our results also imply hardness for all non-local languages with a so-called neutral letter. Lastly, we highlight some remaining obstacles towards a full dichotomy.
The chase is a fundamental algorithm with ubiquitous uses in database theory. Given a database and a set of existential rules (aka tuple-generating dependencies), it iteratively extends the database to ensure that the rules are satisfied in a most general way. This process may not terminate, and a major problem is to decide whether it does. This problem has been studied for a large number of chase variants, which differ by the conditions under which a rule is applied to extend the database. Surprisingly, the complexity of the universal termination of the restricted (aka standard) chase is not fully understood. We close this gap by placing universal restricted chase termination in the analytical hierarchy. This higher hardness is due to the fairness condition, and we propose an alternative condition to reduce the hardness of universal termination.
We embark on a study of the consistent answers of queries over databases annotated with values from a naturally ordered positive semiring. In this setting, the consistent answers of a query are defined as the minimum of the semiring values that the query takes over all repairs of an inconsistent database. The main focus is on self-join-free conjunctive queries and key constraints, which is the most extensively studied case of consistent query answering over standard databases. We introduce a variant of first-order logic with a limited form of negation, define suitable semiring semantics, and then establish the main result of the paper: the consistent query answers of a self-join-free conjunctive query under key constraints are rewritable in this logic if and only if the attack graph of the query contains no cycles. This result generalizes an analogous result of Koutris and Wijsen for ordinary databases, but also yields new results for a multitude of semirings, including the bag semiring, the tropical semiring, and the fuzzy semiring. Further, for the bag semiring, we show that computing the consistent answers of any self-join-free conjunctive query whose attack graph has a strong cycle is not only NP-hard, but it is even NP-hard to approximate the consistent answers within a constant relative approximation guarantee.
This paper considers statistical analysis on noisy datasets where near-duplicate elements need to be treated as identical ones. We focus on two basic problems, distinct elements and ℓ0-sampling, in the data stream model where the sequence of elements can only be scanned once using a limited space, under which a comprehensive data deduplication step before statistical analysis is not feasible. Previous streaming algorithms for these problems could only handle noisy datasets in O(1)-dimensional Euclidean spaces. In this paper, we propose sublinear-space streaming algorithms that work for noisy datasets in any metric space. We also give a lower bound result showing that solving the distinct elements problem on noisy datasets in general metric spaces is inherently more difficult than solving it on noiseless datasets and on noisy datasets in O(1)-dimensional Euclidean spaces.
The Shapley value, originating from cooperative game theory, has been employed to define responsibility measures that quantify the contributions of database facts to obtaining a given query answer. For non-numeric queries, this is done by considering a cooperative game whose players are the facts and whose wealth function assigns 1 or 0 to each subset of the database, depending on whether the query answer holds in the given subset. While conceptually simple, this approach suffers from a notable drawback: the problem of computing such Shapley values is #P-hard in data complexity, even for simple conjunctive queries. This motivates us to revisit the question of what constitutes a reasonable responsibility measure and to introduce a new family of responsibility measures -- weighted sums of minimal supports (WSMS) -- which satisfy intuitive properties. Interestingly, while the definition of WSMSs is simple and bears no obvious resemblance to the Shapley value formula, we prove that every WSMS measure can be equivalently seen as the Shapley value of a suitably defined cooperative game. Moreover, WSMS measures enjoy tractable data complexity for a large class of queries, including all unions of conjunctive queries. We further explore the combined complexity of WSMS computation and establish (in)tractability results for various subclasses of conjunctive queries.
Given a self-join-free conjunctive query Q and a set of tuples S, a synthetic witness D is a database instance such that the result of Q on D is S. In this work, we are interested in two problems. First, the existence problem ESW decides whether any synthetic witness D exists. Second, given that a synthetic witness exists, the minimization problem SSW computes a synthetic witness of minimal size. The SSW problem is related to the smallest witness problem recently studied by Hu and Sintos [22]; however, the objective and the results are inherently different. More specifically, we show that SSW is poly-time solvable for a wider range of queries. Interestingly, in some cases, SSW is related to optimization problems in other domains, such as the role mining problem in data mining and the edge concentration problem in graph drawing. Solutions to ESW and SSW are of practical interest, e.g., for test database generation for applications accessing a database and for data compression by encoding a dataset S as a pair of a query Q and database D.
We prove that ESW is in P, presenting a simple algorithm that, given any S, decides whether a synthetic witness exists in polynomial time in the size of S. Next, we focus on the SSW problem. We show an algorithm that computes a minimal synthetic witness in polynomial time with respect to the size of S for any query Q that has the head-domination property. If Q does not have such a property, then SSW is generally hard. More specifically, we show that for the class of path queries (of any constant length), SSW cannot be solved in polynomial time unless P = NP. We then extend this hardness result to the class of Berge-acyclic queries that do not have the head-domination property, obtaining a full dichotomy of SSW for Berge-acyclic queries. Finally, we investigate the hardness of SSW beyond Berge-acyclic queries by showing that SSW cannot be solved in polynomial time for some cyclic queries unless P = NP.
Hypertree decompositions provide a way to evaluate Conjunctive Queries (CQs) in polynomial time, where the exponent of this polynomial is determined by the width of the decomposition. In theory, the goal of efficient CQ evaluation therefore has to be a minimisation of the width. However, in practical settings, it turns out that there are also other properties of a decomposition that influence the performance of query evaluation. It is therefore of interest to restrict the computation of decompositions by constraints and to guide this computation by preferences. To this end, we propose a novel framework based on candidate tree decompositions, which allows us to introduce soft hypertree width (shw). This width measure is a relaxation of hypertree width (hw); it is never greater than hw and, in some cases, shw may actually be lower than hw. Most importantly, shw preserves the tractability of deciding if a given CQ is below some fixed bound, while offering more algorithmic flexibility. In particular, it provides a natural way to incorporate preferences and constraints into the computation of decompositions. A prototype implementation and preliminary experiments confirm that this novel framework can indeed have a practical impact on query evaluation.
Frequent pattern mining is widely used to find "important" or "interesting" patterns in data. While it is not easy to mathematically define such patterns, maximal frequent patterns are promising candidates, as frequency is a natural indicator of relevance and maximality helps to summarize the output. As such, their mining has been studied on various data types, including itemsets, graphs, and strings. The complexity of mining maximal frequent itemsets and subtrees has been thoroughly investigated (e.g., [Boros et al., 2003], [Uno et al., 2004]) in the literature. On the other hand, while the idea of mining frequent subsequences in sequential data was already introduced in the seminal paper [Agrawal et al., 1995], the complexity of the problem is still open.
In this paper, we investigate the complexity of the maximal common subsequence enumeration problem, which is both an important special case of maximal frequent subsequence mining and a generalization of the classic longest common subsequence (LCS) problem. We show the hardness of enumerating maximal common subsequences between multiple strings, ruling out the possibility of an output-polynomial time enumeration algorithm under P ≠ NP, that is, an algorithm that runs in time poly(|I| + N), where |I| and N are the size of the input and the number of output solutions, respectively. To circumvent this intractability, we also investigate the parameterized complexity of the problem, and show several results when the alphabet size, the number of strings, and the length of a string are taken into account as parameters.
#NFA refers to the problem of counting the words of length n accepted by a non-deterministic finite automaton. #NFA is #P-hard, and although fully polynomial-time randomized approximation schemes (FPRAS) exist, they are all impractical. The first FPRAS for #NFA had a running time of Õ(n^{17} m^{17} ε^{-14} log(δ^{-1})), where m is the number of states in the automaton, δ ∈ (0,1] is the confidence parameter, and ε > 0 is the tolerance parameter (typically smaller than 1). The current best FPRAS achieved a significant improvement in time complexity relative to the first FPRAS, with a running time of Õ((n^{10} m^2 + n^6 m^3) ε^{-4} log^2(δ^{-1})). The complexity of the improved FPRAS is still too intimidating to attempt any practical implementation.
In this paper, we pursue the quest for a practical FPRAS for #NFA by presenting a new algorithm with a time complexity of O(n^2 m^3 log(nm) ε^{-2} log(δ^{-1})). Observe that evaluating whether a word of length n is accepted by an NFA has a time complexity of O(nm^2). Therefore, our proposed FPRAS achieves sub-quadratic complexity with respect to membership checks.
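The membership check used as the yardstick above maintains the set of automaton states reachable on the prefix read so far, which costs O(m^2) per symbol. A brief sketch, with an example automaton chosen for illustration:

```python
def nfa_accepts(word, delta, initial, accepting):
    # delta: dict (state, symbol) -> set of successor states.
    current = set(initial)
    for symbol in word:
        # Each step touches at most m states with at most m successors
        # each, giving O(n·m^2) total for a word of length n.
        current = {q2 for q in current for q2 in delta.get((q, symbol), ())}
    return bool(current & set(accepting))

# NFA over {a, b} accepting words whose second-to-last symbol is 'a'.
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "a"): {2}, (1, "b"): {2}}
print(nfa_accepts("abab", delta, initial={0}, accepting={2}))  # True
print(nfa_accepts("abbb", delta, initial={0}, accepting={2}))  # False
```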
This paper studies the hardness of maintaining self-join-free conjunctive queries over a dynamic database, where tuples can be inserted or deleted. The worst-case complexity of this problem under arbitrary updates has been well understood. It is known that most practical queries require Ω(√|D|) maintenance time for each update to ensure O(1)-delay enumeration, barring a very restricted class of queries (known as "q-hierarchical" queries). Nonetheless, most real-world update sequences are not arbitrary and are far from the worst-case scenario; instead, they are so "nice" that queries can greatly benefit from their inherent structure in query maintenance. In this paper, we aim to understand the hardness of query maintenance under different update sequences, in particular insertion-only (or deletion-only), first-in-first-out (FIFO), and arbitrary worst-case sequences, as well as their "mixed" sequences. We first provide a comprehensive characterization of the queries that can be maintained in O(1) time for O(1)-delay enumeration over FIFO sequences. Then, we address mixed sequences, which may exhibit insertion-only or FIFO patterns on subqueries but lack a specific pattern in totality, and introduce a structural dichotomy for determining whether the input query can be maintained in O(1) time for O(1)-delay enumeration over mixed sequences.