JSON Schema validators are essential tools in the JSON ecosystem. However, today's validators are primarily designed to be used programmatically. Even web-based tools are not didactic in their feedback regarding the oftentimes subtle interactions between schemas and JSON instances. We present JTutor, a validation framework specifically designed with humans in mind that lets users explore the mutual dependencies between a schema and a candidate instance in JSON Schema validation: users navigate interactively between subschemas and parts of the JSON instance, guided by a detailed proof tree generated by a well-principled, rule-based validator. JTutor comes with a "Smart Format" feature that generates a focused view of the proof tree. "Smart Format" highlights branches that are relevant for understanding why an instance is valid or invalid, while deemphasizing others. This focused view can serve as an explanation of the validation. Although JTutor is designed to be educational, it can also help developers debug schemas. In this paper, we explain the internals of JTutor and demonstrate the tool itself.
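The abstract does not give JTutor's internals, but the core idea of a rule-based validator that emits a proof tree rather than a bare boolean can be sketched as follows. This is an illustrative Python sketch under assumed names (`validate`, `smart_format`, and the node fields are all hypothetical), covering only a few keywords, not JTutor's actual implementation:

```python
# Hypothetical sketch, not JTutor's code: each applied keyword yields a
# proof node, so the result is a tree explaining the verdict, not just a bool.

def validate(schema, instance, path="$"):
    """Return a proof tree node: {"path", "keyword", "valid", "children"}.
    Only handles "type", "required", and "properties" for illustration."""
    children = []
    if "type" in schema:
        expected = schema["type"]
        py_type = {"object": dict, "string": str, "number": (int, float)}[expected]
        children.append({"path": path, "keyword": f"type:{expected}",
                         "valid": isinstance(instance, py_type), "children": []})
    if "required" in schema and isinstance(instance, dict):
        for key in schema["required"]:
            children.append({"path": path, "keyword": f"required:{key}",
                             "valid": key in instance, "children": []})
    if "properties" in schema and isinstance(instance, dict):
        for key, sub in schema["properties"].items():
            if key in instance:  # validate the subschema against the member
                children.append(validate(sub, instance[key], f"{path}.{key}"))
    # an applicable schema is satisfied iff all applied keywords are
    return {"path": path, "keyword": "applied-keywords",
            "valid": all(c["valid"] for c in children), "children": children}

def smart_format(node):
    """Focused view: for an invalid node, keep only the failing branches
    (they explain the failure); for a valid node, keep all branches."""
    kept = node["children"] if node["valid"] else \
        [c for c in node["children"] if not c["valid"]]
    return {**node, "children": [smart_format(c) for c in kept]}
```

For example, validating `{"name": 42}` against a schema requiring a string `name` yields an invalid root whose focused view retains only the failing `$.name` branch.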
Writing correct and efficient recursive SQL queries is exceptionally challenging because recursive queries risk returning incorrect results, throwing runtime exceptions, or never terminating. Failures in recursive query execution fall into two categories: non-recoverable failures, such as nontermination or database errors, which must be identified statically, and recoverable failures, which arise from incorrect assumptions about program behavior - for example, a join order that is optimal for the input relations may no longer be optimal after execution begins - leading to severe performance degradation.
We present CaQL, a unified recursive query framework that statically enforces correctness properties at the type level while supporting runtime query optimization through multi-stage programming. CaQL unifies two existing systems under a shared architecture: TyQL, a statically typed, language-integrated query framework, and Carac, an adaptive recursive query engine that uses runtime metaprogramming to reoptimize queries based on observed behavior. CaQL leverages bidirectional information flow between TyQL and Carac to make static correctness checks more precise and runtime optimizations more aggressive, enabling behavior that neither system achieves independently. We evaluate CaQL on queries drawn from recursive SQL benchmarks, graph analytics, and program analysis, demonstrating the safety and convenience of language-integrated query with performance competitive with or superior to a state-of-the-art Datalog engine.
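The evaluation pattern that recursive engines like Carac optimize at runtime is the semi-naive fixpoint: each round joins only the newly derived facts against the base relations, and the query terminates once no new facts appear. The following is a generic Python sketch of that pattern for transitive closure (function and variable names are illustrative; this is not CaQL's implementation):

```python
# Generic semi-naive fixpoint for transitive closure, the evaluation
# strategy underlying recursive Datalog/SQL engines. Terminates on finite
# edge relations because each round adds only facts not yet derived.

def transitive_closure(edges):
    tc = set(edges)        # all derived facts so far
    delta = set(edges)     # facts derived in the previous round
    while delta:
        # join only the new facts against the base relation
        new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = new - tc   # keep only genuinely new facts
        tc |= delta
    return tc
```

On the edge set `{(1,2), (2,3)}`, two rounds suffice: the first derives `(1,3)`, and the second derives nothing new, so the fixpoint is reached.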
Several recent projects have argued that users would benefit from having a relational interface to operating system state. In this paper, we report on the progress of a tool to automate this data modeling process. While automating schema inference in the general case appears to be intractable, we argue that it is feasible in the limited domain of operating systems. Our approach combines static analysis of the tightly controlled coding styles common in OS kernels with hybrid techniques from the programming languages and databases literature for inferring and propagating constraints. We use a simple static analysis to extract relation names, types, and arity from constrained operating system source code patterns. We then leverage LLMs and iterative refinement to identify constraints. Our prototype implementation automates the extraction of 10% of the operating system structure into the relational model.
Previous work on UDF compilation strategies has shown that SQL engines can be used as efficient execution environments for imperative workloads over relational data. In this paper, we present BIRNE, a major extension of our Flummi compiler, with support for mixed-paradigm workloads, such that database-resident programs can be expressed using both imperative and functional constructs. To this end, we introduce a specialized form of control flow graph that we call control flow plans, which express both control flow and data dependencies in a way that lets us use plain SQL as the compilation target. We show that this approach allows us to compile a wide range of workloads into a single SQL query, and that it can outperform existing approaches in terms of overall runtime in many cases.
Join plan enumeration is a critical step in query optimization, impacting the performance of queries by orders of magnitude. Many queries contain complex non-inner joins, such as outer joins and semi joins, which make the efficient enumeration of all valid join plans difficult. Existing solutions are either incomplete or expensive. We improve on the state of the art in join plan enumeration, efficiently handling all join types, including outer joins. Our approach is both complete and efficient because it exploits the properties of relational join transformations.
Executing applications in the cloud is becoming increasingly popular, and such applications are primarily developed as microservices containing imperative code. In our previous work, we made the case that these applications can benefit from using dataflow-based runtimes in a cloud environment. In particular, dataflow-based runtimes offer significant advantages over imperative code, namely exactly-once processing, transparent message handling, and coarse-grained fault tolerance. However, developers rarely prefer dataflow programming. In this work, we bridge this gap: we present our progress towards creating a suitable intermediate representation (IR) that can be used to compile stateful imperative code into dataflows, enabling seamless migration to the cloud. We then present a compiler pipeline prototype that offers two key benefits: i) it enables program optimizations and data parallelism, and ii) it decouples the input program from the target execution environment. Preliminary experiments demonstrate that our IR optimizations improve p50 request latency by 267x on average.
We show that all relational first-order logic queries can be effectively evaluated on finitary relations. Finitary relations are finite unions of Cartesian products of finite and cofinite sets. Unlike classical relational calculus, this requires no safety restrictions on formulas. It relies on a translation to a relational algebra that includes an unrestricted complement operator and uses term rewriting on tensor terms, which are generally useful as compact data structures, also in conventional database systems.
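The key to supporting an unrestricted complement is that finite and cofinite sets are closed under complement, union, and intersection, so every such set remains finitely representable. The following is a minimal Python sketch of that idea for the unary case (the class name `FinSet` and its methods are illustrative assumptions, not the paper's representation, which additionally handles unions of Cartesian products for higher arities):

```python
# Illustrative sketch: a set over an infinite domain is stored either as a
# finite set of elements or as the complement of one (a cofinite set).
# Complement is unrestricted: it just flips the flag.

class FinSet:
    def __init__(self, elems, cofinite=False):
        self.elems = frozenset(elems)   # the explicitly listed elements
        self.cofinite = cofinite        # True: represents domain \ elems

    def complement(self):
        # complement of a finite set is cofinite, and vice versa
        return FinSet(self.elems, not self.cofinite)

    def intersect(self, other):
        a, b = self, other
        if not a.cofinite and not b.cofinite:       # finite ∩ finite
            return FinSet(a.elems & b.elems)
        if a.cofinite and b.cofinite:               # cofinite ∩ cofinite
            return FinSet(a.elems | b.elems, cofinite=True)
        fin, cof = (a, b) if not a.cofinite else (b, a)
        return FinSet(fin.elems - cof.elems)        # finite ∩ cofinite

    def contains(self, x):
        return (x in self.elems) != self.cofinite
```

For instance, the complement of `{1, 2}` contains every domain element except 1 and 2, and intersecting it with the finite set `{1, 5}` yields `{5}`, still a finite representation.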