Regular path queries (RPQs) are at the heart of navigational queries in graph databases. Motivated by new features of regular path queries in the languages Cypher, GQL, and SQL/PGQ, which require new approaches for indexing and compactly storing intermediate query results, we investigate a large corpus of real-world RPQs. Our corpus consists of 148.7 million RPQs occurring in 937.2 million SPARQL queries, used on 29 different data sets.
We investigate three main questions on these logs. First, what is the syntactic structure of these RPQs? Second, how much non-determinism (or ambiguity) do they have? Third, do they admit tractable evaluation under simple path and trail semantics?
Concerning the first question, we show that all the RPQs can be classified into only 572 different syntactic shapes, which we provide as a downloadable data set on Zenodo. Furthermore, we classify the relative use of various RPQ operators and the popular predicates used for transitive navigation. Concerning the second question, we show that although non-determinism occurs in the RPQs, fewer than one in ten million requires a deterministic finite automaton with more states than the size of the regular expression. This is remarkable because this blow-up is known to be exponential in the worst case. Concerning the last question, the vast majority of expressions admit tractable evaluation.
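As an aside (not part of the study), the worst-case blow-up referenced here is easy to reproduce: for the classic family of expressions (a|b)*a(a|b)^(n-1) ("the n-th symbol from the end is an a"), an NFA with n+1 states exists, while any equivalent DFA, including the one produced by the subset construction, needs 2^n states. A minimal Python sketch:

```python
# Minimal sketch (not from the paper): the classic exponential NFA-to-DFA
# blow-up for the expressions (a|b)* a (a|b)^(n-1), i.e. "the n-th symbol
# from the end is an 'a'". The NFA below has n+1 states; the subset
# construction reaches 2^n DFA states.

def nfa(n):
    # delta maps (state, symbol) -> set of successor states; state 0 is the
    # start state, state n is the only accepting state.
    delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
    for i in range(1, n):
        delta[(i, 'a')] = {i + 1}
        delta[(i, 'b')] = {i + 1}
    return delta

def dfa_states(n):
    delta = nfa(n)
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:                                  # subset construction (reachable part)
        s = todo.pop()
        for sym in 'ab':
            t = frozenset(q2 for q in s for q2 in delta.get((q, sym), ()))
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)

for n in range(1, 8):
    print(f"n={n}: NFA states={n + 1}, DFA states={dfa_states(n)}")  # DFA size is 2^n
```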
Stardog is a commercial Knowledge Graph platform built on top of an RDF graph database with SPARQL 1.1 support. This paper describes our journey of developing a more performant query execution layer and plugging it into Stardog's query engine. The new executor, called BARQ, is based on the known principle of processing batches of tuples at a time in the most critical query operators, particularly joins. Besides presenting BARQ, the paper describes the challenges of integrating it into a mature, tightly integrated system based on the classical tuple-at-a-time Volcano model. It offers a gradual approach to overcoming the challenges that small-to-medium-size engineering teams typically face. Finally, the paper presents experimental results showing that BARQ makes Stardog substantially faster on CPU-bound queries without sacrificing performance on disk-bound and OLTP-style queries.
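The underlying principle can be illustrated with a schematic sketch (not Stardog/BARQ code): a Volcano-style operator pays per-tuple interpretation overhead, typically one virtual call per tuple, whereas a batched operator amortizes that overhead over many tuples processed in a tight loop.

```python
# Schematic illustration (not Stardog/BARQ code) of tuple-at-a-time vs
# batch-at-a-time execution for a filter operator.

class VolcanoFilter:
    """Volcano model: one next() call, and thus one call into the child
    operator, per produced tuple."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def next(self):
        while True:
            t = self.child.next()          # one call per tuple
            if t is None or self.predicate(t):
                return t                   # None signals end-of-stream

class BatchFilter:
    """Batch model: one call into the child per batch; per-tuple work happens
    in a tight loop, which is friendlier to modern CPUs."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def next_batch(self):
        batch = self.child.next_batch()    # one call per batch of tuples
        if batch is None:
            return None                    # end-of-stream
        return [t for t in batch if self.predicate(t)]
```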
Recently developed query languages for property graphs - GQL and SQL/PGQ - have now been standardized, and their theoretical models have been introduced and analyzed. These models, however, do not yet have enough flexibility when it comes to processing data stored as properties within property graphs; in particular, they have limited ability to deal with various data types and operations on them. In a fully expressive query language, we should be able to perform meaningful computations on numerical and string data, among others. This includes queries that quantify over elements not explicitly present in the database.
In this paper, we take a first step toward formalizing such expressive query languages and developing mathematical models that support them. Our approach is inspired by work on constraint databases, which enabled query evaluation for declarative languages that can reason about infinite domains and operations on them. We extend this approach to recently proposed formal models of property graph query languages and analyze the gain in expressive power. This lays the foundation for further exploration and refinement of expressive query mechanisms for property graphs.
Graph Neural Networks (GNNs) have the potential to address real-world problems involving continuously evolving, graph-structured data, such as fraud detection, real-time recommendations, and traffic monitoring. These applications require the timely processing of streaming (possibly unbounded) data, highlighting the need to integrate GNN inference with dataflow stream processing systems like Apache Flink. In this paper, we present the first exploration of bridging this gap by designing a streaming GNN serving pipeline with Flink and PyTorch. We propose a dataflow architecture that offloads subgraph construction to Flink, leveraging its state management and distributed processing capabilities. While we achieve viable performance through asynchronous inference requests and careful parallelism tuning, we also identify significant limitations stemming from Flink's state scoping, lack of iterative processing, and computation pipelining. We propose solutions that mitigate these issues within Flink and discuss open challenges toward developing dataflow systems tailored to streaming GNN inference.
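The division of labor described above can be sketched in plain Python; the class and function names below are illustrative assumptions, not Flink or PyTorch APIs. A stateful operator accumulates per-node neighborhoods as edges arrive, and completed subgraphs are dispatched to the model through asynchronous inference requests.

```python
# Illustrative sketch (not the Flink API): a stateful operator maintains
# per-node neighborhoods as edges stream in, and each updated subgraph is
# sent to a model server via an asynchronous inference request.
import asyncio
from collections import defaultdict

class SubgraphBuilder:
    """Stands in for a keyed, stateful Flink operator: keeps the incoming
    neighborhood of each target node and emits a 1-hop subgraph per update."""
    def __init__(self):
        self.neighbors = defaultdict(list)      # node -> list of neighbors

    def on_edge(self, src, dst):
        self.neighbors[dst].append(src)
        return dst, list(self.neighbors[dst])   # (target node, its subgraph)

async def infer(node, subgraph):
    # Placeholder for an async request to a PyTorch model server
    # (e.g. an embedding or fraud score for `node` given its subgraph).
    await asyncio.sleep(0)                      # pretend network/model latency
    return node, len(subgraph)                  # dummy "prediction"

async def main(edge_stream):
    builder = SubgraphBuilder()
    # Asynchronous inference requests keep the pipeline busy while the model runs.
    tasks = [asyncio.create_task(infer(*builder.on_edge(s, d)))
             for s, d in edge_stream]
    for node, score in await asyncio.gather(*tasks):
        print(node, score)

asyncio.run(main([(1, 2), (3, 2), (2, 4)]))
```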
Graph Neural Networks (GNNs) are the de facto models for deep learning on graph datasets, but training GNNs on large-scale datasets remains a challenge. Data partitioning plays a critical role in distributed mini-batch GNN training, as it directly impacts memory usage, training speed, and model accuracy. This survey comprehensively compares prevalent partitioning strategies in the Deep Graph Library (DGL) framework across standard benchmarks and models of varying depths. We systematically analyze aspects such as partition sizes, training times, memory overhead, and the accuracy associated with each partitioning method. Through this analysis, we uncover practical insights and the inherent trade-offs of these strategies. Our findings reveal surprising cases where simpler partitioning approaches outperform more sophisticated schemes. We conclude by offering practical guidelines for GNN partitioning.
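For orientation, a typical experiment of this kind partitions the same benchmark graph with different strategies before distributed training. The sketch below is hedged: the exact dgl.distributed.partition_graph arguments and supported part_method values may differ across DGL versions, so treat it as an illustration rather than a verified recipe.

```python
# Hedged sketch: partition one benchmark graph with two strategies so that the
# resulting partitions can be compared for size, training time, and accuracy.
# Argument names/values may differ between DGL versions.
import dgl
from dgl.data import RedditDataset
from dgl.distributed import partition_graph

g = RedditDataset()[0]                      # a standard GNN benchmark graph
for method in ("metis", "random"):          # METIS vs random partitioning
    partition_graph(
        g,
        graph_name=f"reddit_{method}",
        num_parts=4,
        out_path=f"partitions/{method}",
        part_method=method,
    )
# The partition folders can then be fed to distributed mini-batch training,
# and per-partition memory use, epoch time, and accuracy compared.
```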
Property graphs are widely used in domains such as healthcare, finance, and social networks, but they often contain errors due to inconsistencies, missing data, or schema violations. Traditional rule-based and heuristic-driven graph repair methods are limited in their adaptability as they need to be tailored for each dataset. On the other hand, interactive human-in-the-loop approaches may become infeasible when dealing with large graphs, as the cost, both in terms of time and effort, of involving users becomes too high. Recent advancements in Large Language Models (LLMs) present new opportunities for automated graph repair by leveraging contextual reasoning and their access to real-world knowledge. We evaluate the effectiveness of six open-source LLMs in repairing property graphs. We assess repair quality, computational cost, and model-specific performance. Our experiments show that LLMs have the potential to detect and correct errors, with varying degrees of accuracy and efficiency. We discuss the strengths, limitations, and challenges of LLM-driven graph repair and outline future research directions for improving scalability and interpretability.
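The basic interaction pattern is straightforward to sketch; call_llm below is a hypothetical placeholder for whichever open-source model is queried, and the prompt and JSON reply format are illustrative assumptions rather than the protocol used in the evaluation.

```python
# Hedged sketch of LLM-driven repair of a single property-graph node.
# `call_llm` is a hypothetical stand-in for an open-source LLM endpoint;
# the prompt and expected JSON reply are illustrative assumptions.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your local LLM client here")

def repair_node(node: dict, schema: str) -> dict:
    prompt = (
        "You are repairing a property graph.\n"
        f"Schema constraints:\n{schema}\n"
        f"Node (JSON): {json.dumps(node)}\n"
        "Return the corrected node as JSON only."
    )
    reply = call_llm(prompt)
    try:
        return json.loads(reply)        # model proposes a repaired node
    except json.JSONDecodeError:
        return node                     # fall back to the original on bad output

# Example input: a Patient node with an implausible (negative) age.
broken = {"labels": ["Patient"], "age": -42, "name": "Ada"}
schema = "Patient.age must be an integer in [0, 130]."
# repaired = repair_node(broken, schema)   # once call_llm is wired up
```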
Property graph databases are widely used in applications such as social networks and fraud detection due to their ability to model relationships between entities efficiently. However, outsourcing graph data to third-party cloud providers introduces significant privacy risks, as memory access patterns and result sizes can leak sensitive information even when data is encrypted. While prior work has developed oblivious relational databases to mitigate such attacks, no existing solution supports oblivious query processing for property graphs, which pose unique challenges beyond those addressed in relational settings. We present a vision for an oblivious property graph database that supports Cypher queries on encrypted outsourced data. We envision leveraging secure hardware enclaves such as Intel SGX/TDX and AMD SEV-SNP to protect query execution while novel query processing algorithms mitigate side-channel vulnerabilities through doubly oblivious query processing schemes. Specifically, we discuss the challenges in supporting oblivious query processing algorithms for multi-hop and cyclic queries without revealing data or memory access patterns. Our work highlights a crucial gap and potential research directions in oblivious query processing for property graph databases.
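To make the leakage problem concrete: even a simple edge filter must be rewritten so that its memory accesses and output size are independent of the data. The schematic sketch below (not from this work, and without an enclave) shows the naive padded variant that dedicated oblivious graph operators need to improve upon.

```python
# Schematic sketch: a naively oblivious edge filter. Every edge is read exactly
# once and exactly one output row (real or dummy) is emitted per input row, so
# neither the access pattern nor the result size reveals which edges matched.
# Real systems would use constant-time selects inside an enclave and oblivious
# compaction instead of full padding.
DUMMY = {"src": -1, "dst": -1, "label": None}

def oblivious_edge_filter(edges, predicate):
    out = []
    for e in edges:                 # single linear scan, data-independent
        keep = predicate(e)
        out.append(e if keep else DUMMY)   # schematic select (not constant-time)
    return out                      # len(out) == len(edges), regardless of matches
```

Chaining such fully padded operators across multi-hop and cyclic patterns quickly becomes prohibitive, which is one reason dedicated oblivious query processing algorithms for property graphs are needed.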
SQL/PGQ is a new standard that integrates graph querying into relational systems, allowing users to freely switch between graph patterns and SQL. Our experiments show performance gaps between these two formalisms: semantically equivalent queries can perform differently depending on whether they are written as graph patterns or as plain SQL, suggesting that current approaches handle each query type separately and apply distinct optimizations to each formalism. We argue that a holistic optimization is necessary, where the system internally decides on the best algorithms regardless of whether queries are written in SQL or as graph patterns. We propose possible future research directions to unify these optimizations and mitigate performance gaps.