Skip to content

DM-55216: Add Hyrise-based SQL parser backend#1030

Open
malensek wants to merge 19 commits into
tickets/DM-43715from
tickets/DM-55216
Open

DM-55216: Add Hyrise-based SQL parser backend#1030
malensek wants to merge 19 commits into
tickets/DM-43715from
tickets/DM-55216

Conversation

@malensek

Copy link
Copy Markdown
Member

Adds a new, optional SQL parser backend based on the Hyrise SQL parser (https://github.com/hyrise/sql-parser). The ANTLR4-based parser remains available as an option depending on whether QSERV_USE_HYRISE_SQL_PARSER is enabled. Based on our testing and benchmarking (also see DM-53046) this change should improve query processing speed.

A modified version of the Hyrise SQL parser that supports features we need in qserv (3-column identifiers, HAVING without GROUP BY, MOD, backtick-quoted identifiers, a few more centered around compatibility with mySQL) was added as a submodule under extern, branch qserv-compat, here: https://github.com/lsst/hyrise-sql-parser/ . Ideally the general modifications should be possible to submit as PRs for merging upstream. At some point we may want to evaluate what mySQL-derived features we want to explicitly drop or support in qserv, as that may improve compatibility with the unmodified library.

Primary functionality was implemented in src/ccontrol/HyriseAdapter.cc, which serves two purposes:

  • Traversing the Hyrise parse tree to validate queries against supported Qserv functionality
  • Converting the tree to Qserv IR for further analysis and processing downstream

There were previously two entrypoints to the parser, through ParseRunner::ParseRunner() and ParseRunner::makeSelectStmt, but those have been largely consolidated so that UserQueryFactory drives the parsing logic and selects the appropriate parser backend at compile time.

Finally, the test suite has been updated:

  • src/tests/ParserExpected.h helper to simplify normalizing a few non-semantic differences between the two backends.
  • new testHyriseGeneratedIR test (based on testAntlr4GeneratedIR) and expanded coverage for both IR suites
  • testParserCorpus + a dataset for stress testing and benchmarking to help catch performance regressions
  • testUserQueryType coverage for reworked SET/CALL (previously depended on the ANTLR parser)

The next milestone is adding integration tests that exercise the slow query paths that we observed previously to confirm performance has improved, and work to refactor / optimize the parse tree traversals for performance.

malensek added 14 commits June 19, 2026 11:00
These updates to the test suite slightly relax SQL comparisons so that
non-semantic differences in output from the query parser backend will still be
considered valid. This will allow for testing multiple parser backends.
This will be moved out of the qserv source tree later, but allows us to keep
this branch self-contained as changes land in the parser and adapter layers.
This commit covers three main changes:

* Additional query test coverage. It was possible to pass some of the
  test cases with simple string checks.

* Update test cases with Hyrise-specific strings.  Rather than forcing
  Hyrise adapter messages to match legacy ANTLR messages exactly, this
  allows us to retain both parser backends and test them individually.

* Simplify QuerySession::parseQuery call to parser. Previously this
  instantiated a ParseRunner, which avoided the makeSelectStmt() path
  and did not execute via Hyrise (if configured to do so).
The adapter previously took some shortcuts to pass test cases via string
normalization. These string manipulations have been removed from the
adapter and replaced with a full parse tree walk to confirm and reject
invalid queries.

This version is able to report more informative errors when failures
occur. Also did some minor cleanup for readability, renamed str()
function to nullToEmpty() since this does a better job of describing
what purpose the function serves, and removed temporary float conversion
and replaced with more robust std::to_chars.
This was previously unused and consolidates paths into the SQL parser so
it can be easily toggled between implementations (ANTLR/Hyrise in this
case)
We previously had a small helper (ParserNormalization.h) that applied
normalization to SQL strings, but also occasionally used compile-time
switches to differentiate between the backends in the test cases. This
commit adds a helper macro (PARSER_EXPECTED) and unifies the approaches
so that only compile-time checks + embedded normalization are used.
This previously was being routed through the ANTLR-based parser.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a compile-time switchable SQL parsing backend using a Hyrise-based parser/adapter to produce Qserv IR for SELECT statements, while retaining the existing ANTLR4-based path.

Changes:

  • Introduces ccontrol::HyriseAdapter and routes ParseRunner::makeSelectStmt() through it when QSERV_USE_HYRISE_SQL_PARSER is enabled.
  • Refactors parsing entrypoints so UserQueryFactory and QuerySession use ParseRunner::makeSelectStmt(), and updates tests to normalize backend output differences.
  • Adds a parser corpus stress test and adds/updates CMake + submodule wiring for the new dependency.

Reviewed changes

Copilot reviewed 35 out of 37 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/tests/QueryAnaHelper.h Removes ParseRunner helper declaration no longer used by tests.
src/tests/QueryAnaHelper.cc Removes ParseRunner helper implementation.
src/tests/ParserExpected.h Adds macro helper to normalize expected strings per backend.
src/qproc/testQueryAnaOrderBy.cc Updates expected ORDER BY serialization differences via PARSER_EXPECTED.
src/qproc/testQueryAnaGeneral.cc Normalizes numeric/operator formatting differences; updates parser invocation in tests.
src/qproc/testQueryAnaAggregation.cc Normalizes numeric literal formatting differences in expected strings.
src/qproc/QuerySession.cc Parses via ParseRunner::makeSelectStmt() instead of constructing a ParseRunner.
src/ccontrol/UserQueryType.h Adds helper APIs to identify CALL QSERV_RESULT_DELETE(...) and parse SET GLOBAL ....
src/ccontrol/UserQueryType.cc Implements regex-based CALL/SET extraction helpers.
src/ccontrol/UserQueryFactory.cc Consolidates SELECT parsing; adds Hyrise-specific handling for CALL/SET.
src/ccontrol/testUserQueryType.cc Adds unit tests for new UserQueryType helpers.
src/ccontrol/testParserCorpus.cc Adds parser corpus stress/perf regression test for IR construction.
src/ccontrol/testdata/parser-corpus/README.md Documents parser corpus usage.
src/ccontrol/testdata/parser-corpus/q01_small.sql Adds corpus SQL input.
src/ccontrol/testdata/parser-corpus/q02_medium.sql Adds corpus SQL input.
src/ccontrol/testdata/parser-corpus/q03_original.sql Adds corpus SQL input.
src/ccontrol/testdata/parser-corpus/q03_stripped.sql Adds corpus SQL input.
src/ccontrol/testdata/parser-corpus/q07_object_table_predicate_stress.sql Adds corpus SQL input.
src/ccontrol/testdata/parser-corpus/q08_object_source_join_wide.sql Adds corpus SQL input.
src/ccontrol/testdata/parser-corpus/q09_q3_expression_dense.sql Adds corpus SQL input.
src/ccontrol/testdata/parser-corpus/q10_having_limit.sql Adds corpus SQL input.
src/ccontrol/testdata/parser-corpus/q11_join_using_time_window.sql Adds corpus SQL input.
src/ccontrol/testdata/parser-corpus/q12_spatial_operator_mix.sql Adds corpus SQL input.
src/ccontrol/testCControl.cc Updates expected parse errors per backend; adds string-literal regression tests.
src/ccontrol/testAntlr4GeneratedIR.cc Expands ANTLR IR test coverage + unsupported-feature tests.
src/ccontrol/ParseRunner.cc Switches makeSelectStmt() to Hyrise adapter when enabled.
src/ccontrol/HyriseAdapter.h Declares Hyrise adapter entrypoint for building Qserv IR.
src/ccontrol/HyriseAdapter.cc Implements Hyrise parse-tree validation + conversion to Qserv IR.
src/ccontrol/CMakeLists.txt Adds adapter source, links Hyrise parser, adds corpus test, conditionally selects IR test suite.
doc/CMakeLists.txt Separates linkcheck doctree output directory.
CMakeLists.txt Adds build option and includes Hyrise parser subdirectory.
.gitmodules Adds extern/hyrise-sql-parser submodule definition.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread CMakeLists.txt
Comment thread src/ccontrol/CMakeLists.txt
Comment thread src/ccontrol/CMakeLists.txt
Comment thread src/ccontrol/HyriseAdapter.cc Outdated
Comment thread src/ccontrol/UserQueryType.cc Outdated
Comment thread src/ccontrol/UserQueryFactory.cc Outdated
malensek and others added 4 commits June 22, 2026 18:38
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Previously all nonzero values were considered 'true' by SET GLOBAL user
queries. This specifically accepts "0" or "1" for false/true or returns
a UserQueryInvalid otherwise.
Area restrictor are handled as a special case, and in some situations
queries with OR / NOT were silently converted to conjunctive (AND)
queries. This resolves the issue and also allows some conjunctive
cases that were previously restricted. Test cases were added to exercise
these paths.

This issue also appears to be present in the ANTLR-based parser path.
@malensek malensek changed the title DM-55216: Add optional Hyrise-based SQL parser backend DM-55216: Add Hyrise-based SQL parser backend Jun 25, 2026
@malensek malensek requested a review from Copilot June 25, 2026 23:43

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 35 out of 37 changed files in this pull request and generated 2 comments.

Comment on lines +448 to +452
return setBooleanSessionVariable(varName, varValue, _useQservRowCounterOptimization);
} else if (varName == "QSERV_DEBUG_CZAR_NO_MERGE") {
return setBooleanSessionVariable(varName, varValue, _debugNoMerge);
}
return uq;
return std::make_shared<UserQueryInvalid>("Unsupported SET variable: " + varName);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was intentional, to inform clients that their SET GLOBAL didn't work rather than accepting it and silently swallowing it (no-op). I'll leave this unresolved so others can review this behavior change.

Comment thread src/ccontrol/UserQueryType.h
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants