DM-55216: Add Hyrise-based SQL parser backend#1030
Open
malensek wants to merge 19 commits into
Open
Conversation
These updates to the test suite slightly relax SQL comparisons so that non-semantic differences in output from the query parser backend will still be considered valid. This will allow for testing multiple parser backends.
This will be moved out of the qserv source tree later, but allows us to keep this branch self-contained as changes land in the parser and adapter layers.
This commit covers three main changes: * Additional query test coverage. It was possible to pass some of the test cases with simple string checks. * Update test cases with Hyrise-specific strings. Rather than forcing Hyrise adapter messages to match legacy ANTLR messages exactly, this allows us to retain both parser backends and test them individually. * Simplify QuerySession::parseQuery call to parser. Previously this instantiated a ParseRunner, which avoided the makeSelectStmt() path and did not execute via Hyrise (if configured to do so).
The adapter previously took some shortcuts to pass test cases via string normalization. These string manipulations have been removed from the adapter and replaced with a full parse tree walk to confirm and reject invalid queries. This version is able to report more informative errors when failures occur. Also did some minor cleanup for readability, renamed str() function to nullToEmpty() since this does a better job of describing what purpose the function serves, and removed temporary float conversion and replaced with more robust std::to_chars.
This was previously unused and consolidates paths into the SQL parser so it can be easily toggled between implementations (ANTLR/Hyrise in this case)
We previously had a small helper (ParserNormalization.h) that applied normalization to SQL strings, but also occasionally used compile-time switches to differentiate between the backends in the test cases. This commit adds a helper macro (PARSER_EXPECTED) and unifies the approaches so that only compile-time checks + embedded normalization are used.
This previously was being routed through the ANTLR-based parser.
There was a problem hiding this comment.
Pull request overview
Adds a compile-time switchable SQL parsing backend using a Hyrise-based parser/adapter to produce Qserv IR for SELECT statements, while retaining the existing ANTLR4-based path.
Changes:
- Introduces
ccontrol::HyriseAdapterand routesParseRunner::makeSelectStmt()through it whenQSERV_USE_HYRISE_SQL_PARSERis enabled. - Refactors parsing entrypoints so
UserQueryFactoryandQuerySessionuseParseRunner::makeSelectStmt(), and updates tests to normalize backend output differences. - Adds a parser corpus stress test and adds/updates CMake + submodule wiring for the new dependency.
Reviewed changes
Copilot reviewed 35 out of 37 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/tests/QueryAnaHelper.h | Removes ParseRunner helper declaration no longer used by tests. |
| src/tests/QueryAnaHelper.cc | Removes ParseRunner helper implementation. |
| src/tests/ParserExpected.h | Adds macro helper to normalize expected strings per backend. |
| src/qproc/testQueryAnaOrderBy.cc | Updates expected ORDER BY serialization differences via PARSER_EXPECTED. |
| src/qproc/testQueryAnaGeneral.cc | Normalizes numeric/operator formatting differences; updates parser invocation in tests. |
| src/qproc/testQueryAnaAggregation.cc | Normalizes numeric literal formatting differences in expected strings. |
| src/qproc/QuerySession.cc | Parses via ParseRunner::makeSelectStmt() instead of constructing a ParseRunner. |
| src/ccontrol/UserQueryType.h | Adds helper APIs to identify CALL QSERV_RESULT_DELETE(...) and parse SET GLOBAL .... |
| src/ccontrol/UserQueryType.cc | Implements regex-based CALL/SET extraction helpers. |
| src/ccontrol/UserQueryFactory.cc | Consolidates SELECT parsing; adds Hyrise-specific handling for CALL/SET. |
| src/ccontrol/testUserQueryType.cc | Adds unit tests for new UserQueryType helpers. |
| src/ccontrol/testParserCorpus.cc | Adds parser corpus stress/perf regression test for IR construction. |
| src/ccontrol/testdata/parser-corpus/README.md | Documents parser corpus usage. |
| src/ccontrol/testdata/parser-corpus/q01_small.sql | Adds corpus SQL input. |
| src/ccontrol/testdata/parser-corpus/q02_medium.sql | Adds corpus SQL input. |
| src/ccontrol/testdata/parser-corpus/q03_original.sql | Adds corpus SQL input. |
| src/ccontrol/testdata/parser-corpus/q03_stripped.sql | Adds corpus SQL input. |
| src/ccontrol/testdata/parser-corpus/q07_object_table_predicate_stress.sql | Adds corpus SQL input. |
| src/ccontrol/testdata/parser-corpus/q08_object_source_join_wide.sql | Adds corpus SQL input. |
| src/ccontrol/testdata/parser-corpus/q09_q3_expression_dense.sql | Adds corpus SQL input. |
| src/ccontrol/testdata/parser-corpus/q10_having_limit.sql | Adds corpus SQL input. |
| src/ccontrol/testdata/parser-corpus/q11_join_using_time_window.sql | Adds corpus SQL input. |
| src/ccontrol/testdata/parser-corpus/q12_spatial_operator_mix.sql | Adds corpus SQL input. |
| src/ccontrol/testCControl.cc | Updates expected parse errors per backend; adds string-literal regression tests. |
| src/ccontrol/testAntlr4GeneratedIR.cc | Expands ANTLR IR test coverage + unsupported-feature tests. |
| src/ccontrol/ParseRunner.cc | Switches makeSelectStmt() to Hyrise adapter when enabled. |
| src/ccontrol/HyriseAdapter.h | Declares Hyrise adapter entrypoint for building Qserv IR. |
| src/ccontrol/HyriseAdapter.cc | Implements Hyrise parse-tree validation + conversion to Qserv IR. |
| src/ccontrol/CMakeLists.txt | Adds adapter source, links Hyrise parser, adds corpus test, conditionally selects IR test suite. |
| doc/CMakeLists.txt | Separates linkcheck doctree output directory. |
| CMakeLists.txt | Adds build option and includes Hyrise parser subdirectory. |
| .gitmodules | Adds extern/hyrise-sql-parser submodule definition. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Previously all nonzero values were considered 'true' by SET GLOBAL user queries. This specifically accepts "0" or "1" for false/true or returns a UserQueryInvalid otherwise.
Area restrictor are handled as a special case, and in some situations queries with OR / NOT were silently converted to conjunctive (AND) queries. This resolves the issue and also allows some conjunctive cases that were previously restricted. Test cases were added to exercise these paths. This issue also appears to be present in the ANTLR-based parser path.
Comment on lines
+448
to
+452
| return setBooleanSessionVariable(varName, varValue, _useQservRowCounterOptimization); | ||
| } else if (varName == "QSERV_DEBUG_CZAR_NO_MERGE") { | ||
| return setBooleanSessionVariable(varName, varValue, _debugNoMerge); | ||
| } | ||
| return uq; | ||
| return std::make_shared<UserQueryInvalid>("Unsupported SET variable: " + varName); |
Member
Author
There was a problem hiding this comment.
This change was intentional, to inform clients that their SET GLOBAL didn't work rather than accepting it and silently swallowing it (no-op). I'll leave this unresolved so others can review this behavior change.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds a new, optional SQL parser backend based on the Hyrise SQL parser (https://github.com/hyrise/sql-parser). The ANTLR4-based parser remains available as an option depending on whether
QSERV_USE_HYRISE_SQL_PARSERis enabled. Based on our testing and benchmarking (also see DM-53046) this change should improve query processing speed.A modified version of the Hyrise SQL parser that supports features we need in qserv (3-column identifiers,
HAVINGwithoutGROUP BY,MOD, backtick-quoted identifiers, a few more centered around compatibility with mySQL) was added as a submodule underextern, branch qserv-compat, here: https://github.com/lsst/hyrise-sql-parser/ . Ideally the general modifications should be possible to submit as PRs for merging upstream. At some point we may want to evaluate what mySQL-derived features we want to explicitly drop or support in qserv, as that may improve compatibility with the unmodified library.Primary functionality was implemented in
src/ccontrol/HyriseAdapter.cc, which serves two purposes:There were previously two entrypoints to the parser, through
ParseRunner::ParseRunner()andParseRunner::makeSelectStmt, but those have been largely consolidated so thatUserQueryFactorydrives the parsing logic and selects the appropriate parser backend at compile time.Finally, the test suite has been updated:
src/tests/ParserExpected.hhelper to simplify normalizing a few non-semantic differences between the two backends.The next milestone is adding integration tests that exercise the slow query paths that we observed previously to confirm performance has improved, and work to refactor / optimize the parse tree traversals for performance.