Accurate Sampling-Based Cardinality Estimation for Complex Graph Queries: Code, Datasets, and Experimental Results

Pan Hu and Boris Motik

This Web page contains the code, the datasets, and the results of the experiments described in the paper Accurate Sampling-Based Cardinality Estimation for Complex Graph Queries, published in ACM Transactions on Database Systems. The paper describes a novel sampling-based method for estimating the cardinality (i.e., the number of answers) of queries of arbitrary algebraic form. The proposed method is proven to be a consistent estimator of the query cardinality.
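To give a flavour of the general principle (this is only an illustrative sketch of sampling-based estimation, not the algorithm from the paper), one can estimate the cardinality of a join by uniformly sampling tuples from one relation, counting the matches of each sample in the other relation, and scaling the average by the relation size. All names below are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative sketch only (not the paper's method): estimate the
// cardinality of R(a,b) JOIN S(b,c) by sampling tuples of R uniformly
// at random, counting each sample's matches in S (precomputed as a map
// from b-values to match counts), and scaling by |R| / sampleSize.
// This is an unbiased and, as sampleSize grows, consistent estimator.
double estimateJoinCardinality(const std::vector<std::pair<int, int>>& R,
                               const std::unordered_map<int, std::size_t>& sCounts,
                               std::size_t sampleSize, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, R.size() - 1);
    double total = 0.0;
    for (std::size_t i = 0; i < sampleSize; ++i) {
        int b = R[pick(rng)].second;              // join attribute of the sample
        auto it = sCounts.find(b);
        if (it != sCounts.end())
            total += static_cast<double>(it->second);
    }
    return total / static_cast<double>(sampleSize) * static_cast<double>(R.size());
}
```

The expected value of the estimate equals the exact join cardinality; the paper's contribution is extending such guarantees beyond joins to queries of arbitrary algebraic form.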

System

The system is being actively developed on GitHub, but the exact version used in the paper is included in the CardinalityEstimator.zip archive. The system implements a small RAM-based database that supports relations of arbitrary arity with hash-based indexing. It can load data from CSV and RDF Turtle files, evaluate SPARQL-like queries, and provide cardinality estimates for such queries. The system is written in C++20 and can be compiled on macOS or Linux using the enclosed Xcode project or makefile; it should work on any 64-bit processor and was tested on the x86-64 and ARM architectures.
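The idea of hash-based indexing over relations of arbitrary arity can be sketched as follows. This is a hypothetical illustration, not the system's actual data structures: tuples are vectors of integer IDs, and an index maps the value of one column to the positions of the tuples containing that value.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of a hash index over tuples of arbitrary arity
// (the system's actual data structures are not documented here).
class ColumnIndex {
public:
    explicit ColumnIndex(std::size_t column) : column_(column) {}

    // Record that the tuple stored at the given position has
    // tuple[column_] in the indexed column.
    void insert(const std::vector<std::uint64_t>& tuple, std::size_t position) {
        index_[tuple[column_]].push_back(position);
    }

    // Return the positions of all tuples whose indexed column equals
    // value, or nullptr if there are none.
    const std::vector<std::size_t>* lookup(std::uint64_t value) const {
        auto it = index_.find(value);
        return it == index_.end() ? nullptr : &it->second;
    }

private:
    std::size_t column_;
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> index_;
};
```

A relation of arity n could maintain one such index per column (or per column combination) that needs to be queried efficiently.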

The files used in our experiments are large, so loading them can take a very long time. This was a considerable problem during system development, where the ability to run an experiment quickly was essential. Loading large files while running the experiments was also suboptimal. Therefore, the system also provides a way to precompile a dataset into a binary file (typically with the .bin extension), which essentially contains an exact dump of the data structures used to store and index the data. Binary files can thus typically be loaded in a matter of seconds, since loading involves just reading various parts of the file into preallocated memory blocks. The use of binary files as input does not affect the outcome of the experiments: the state of the system is exactly the same regardless of whether the data was loaded from plain text or from a binary file.
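The precompilation idea can be illustrated with a minimal sketch (the actual .bin format of the system is not documented here, so the code below is hypothetical): a structure is written as a length header followed by its raw contents, so that loading amounts to one size read plus one bulk read into a preallocated memory block.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical sketch of the precompiled-binary idea: dump a vector of
// 64-bit values as a length header followed by its raw contents.
void saveBinary(const std::string& path, const std::vector<std::uint64_t>& data) {
    std::ofstream out(path, std::ios::binary);
    std::uint64_t size = data.size();
    out.write(reinterpret_cast<const char*>(&size), sizeof(size));
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(size * sizeof(std::uint64_t)));
}

// Loading preallocates the target block once and fills it with a
// single bulk read, avoiding any per-record parsing.
std::vector<std::uint64_t> loadBinary(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::uint64_t size = 0;
    in.read(reinterpret_cast<char*>(&size), sizeof(size));
    std::vector<std::uint64_t> data(size);   // preallocated memory block
    in.read(reinterpret_cast<char*>(data.data()),
            static_cast<std::streamsize>(size * sizeof(std::uint64_t)));
    return data;
}
```

In contrast to parsing plain text, no tokenisation or index construction happens at load time, which is why binary loading is so much faster.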

The system does not provide a command line interface as such; rather, the actions that the system can perform are implemented as tasks listed in the /tasks directory. The /tasks/main.cpp file lists the tasks that were used to produce the results reported in the paper. The behaviour of the system can be customised by editing this file, and the way to do this should be relatively self-explanatory. The initial content of the /tasks/main.cpp file shows how to repeat the experiments with conjunctive queries.
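A task-based entry point of this kind might be organised along the following lines. This is purely a hypothetical sketch; the actual interface in /tasks/main.cpp may differ, and the task name used below is invented.

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical sketch of a task registry in the style described above:
// each task is a named callable, and the entry point simply looks up
// and runs the tasks selected by editing the source file.
using Task = std::function<int()>;

std::map<std::string, Task>& taskRegistry() {
    static std::map<std::string, Task> tasks;
    return tasks;
}

// Run the task with the given name; return -1 if no such task exists.
int runTask(const std::string& name) {
    auto it = taskRegistry().find(name);
    return it == taskRegistry().end() ? -1 : it->second();
}
```

Under such a design, customising the experiments amounts to editing which tasks are registered and invoked, which matches the described workflow of editing /tasks/main.cpp.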

Datasets

The following seven datasets, each provided in a separate archive, were used in the experiments. Each archive contains the original plain-text data, the precompiled .bin binary file, and a selection of conjunctive and complex queries. The AIDS, Human, and Yago datasets, as well as one of the LUBM variants, were obtained from the G-CARE system introduced in the paper G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching, published at the SIGMOD 2020 conference. The IMDB dataset was obtained from the paper NeuroCard: One Cardinality Estimator for All Tables, published at the VLDB 2020 conference.

Experimental Results

The Results.zip archive contains the detailed results of our experiments. The raw outputs produced by the system are contained in the Results-conjunctive-queries.txt and Results-complex-queries.txt files. Each of these two files contains the output for each of the datasets; moreover, for each dataset, the output contains a TSV-formatted table with the results for each query considered, followed by a statistical summary of the results. This summary was included directly in the paper; that is, no further post-processing was needed in most cases. The archive also contains the Results-conjunctive-queries.numbers and Results-complex-queries.numbers spreadsheets for Apple Numbers, which provide a more readable tabular view of the per-query TSV outputs.