feat(workflow-operator): support COUNT(*) in the Aggregate operator#5896
feat(workflow-operator): support COUNT(*) in the Aggregate operator#5896tanishqgandhi1908 wants to merge 11 commits into
Conversation
Automated Reviewer SuggestionsBased on the
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5896 +/- ##
============================================
- Coverage 56.38% 56.34% -0.05%
Complexity 2986 2986
============================================
Files 1129 1129
Lines 43794 43800 +6
Branches 4743 4744 +1
============================================
- Hits 24693 24678 -15
- Misses 17650 17665 +15
- Partials 1451 1457 +6
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 360 | 0.22 | 27,899/32,782/32,782 us | 🔴 +23.9% / 🔴 +123.3% |
| 🔴 | bs=100 sw=10 sl=64 | 759 | 0.463 | 128,963/160,453/160,453 us | 🔴 +7.5% / 🔴 +50.1% |
| ⚪ | bs=1000 sw=10 sl=64 | 876 | 0.535 | 1,134,172/1,190,297/1,190,297 us | ⚪ within ±5% / 🔴 +16.3% |
Baseline details
Latest main a1a7eb0 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 360 tuples/sec | 392 tuples/sec | 786.27 tuples/sec | -8.2% | -54.2% |
| bs=10 sw=10 sl=64 | MB/s | 0.22 MB/s | 0.239 MB/s | 0.48 MB/s | -7.9% | -54.2% |
| bs=10 sw=10 sl=64 | p50 | 27,899 us | 22,520 us | 12,495 us | +23.9% | +123.3% |
| bs=10 sw=10 sl=64 | p95 | 32,782 us | 35,899 us | 14,784 us | -8.7% | +121.7% |
| bs=10 sw=10 sl=64 | p99 | 32,782 us | 35,899 us | 18,468 us | -8.7% | +77.5% |
| bs=100 sw=10 sl=64 | throughput | 759 tuples/sec | 777 tuples/sec | 991.49 tuples/sec | -2.3% | -23.4% |
| bs=100 sw=10 sl=64 | MB/s | 0.463 MB/s | 0.475 MB/s | 0.605 MB/s | -2.5% | -23.5% |
| bs=100 sw=10 sl=64 | p50 | 128,963 us | 124,399 us | 100,929 us | +3.7% | +27.8% |
| bs=100 sw=10 sl=64 | p95 | 160,453 us | 149,262 us | 106,894 us | +7.5% | +50.1% |
| bs=100 sw=10 sl=64 | p99 | 160,453 us | 149,262 us | 114,085 us | +7.5% | +40.6% |
| bs=1000 sw=10 sl=64 | throughput | 876 tuples/sec | 873 tuples/sec | 1,023 tuples/sec | +0.3% | -14.4% |
| bs=1000 sw=10 sl=64 | MB/s | 0.535 MB/s | 0.533 MB/s | 0.624 MB/s | +0.4% | -14.3% |
| bs=1000 sw=10 sl=64 | p50 | 1,134,172 us | 1,140,797 us | 983,835 us | -0.6% | +15.3% |
| bs=1000 sw=10 sl=64 | p95 | 1,190,297 us | 1,203,714 us | 1,023,777 us | -1.1% | +16.3% |
| bs=1000 sw=10 sl=64 | p99 | 1,190,297 us | 1,203,714 us | 1,053,883 us | -1.1% | +12.9% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,555.06,200,128000,360,0.220,27898.75,32782.08,32782.08
1,100,10,64,20,2635.89,2000,1280000,759,0.463,128963.37,160453.27,160453.27
2,1000,10,64,20,22819.99,20000,12800000,876,0.535,1134171.78,1190296.76,1190296.76|
@chenlica, what do you think? Should we support COUNT(*)? |
There was a problem hiding this comment.
Pull request overview
Adds first-class COUNT(*) support to the Aggregate operator so users can count rows without selecting a column, with backend execution/schema support and a small Aggregate-specific UI rule to disable the Attribute field when count(*) is selected.
Changes:
- Backend: introduce
COUNT_STAR("count(*)"), implement row-count aggregation, and rewrite both COUNT variants toSUMfor the global aggregation stage. - Backend/validation: make
attributeconditionally required via JSON-schema (attributerequired for all functions exceptcount(*)). - Frontend + docs: disable/clear the Attribute field when
count(*)is selected and document the new behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| frontend/src/app/workspace/component/property-editor/operator-property-edit-frame/operator-property-edit-frame.component.ts | Disables + clears Aggregate “Attribute” field when count(*) is chosen. |
| docs/reference/operators/data-cleaning/aggregate/aggregate.md | Documents count(*) and clarifies COUNT vs COUNT(*). |
| common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregationFunction.java | Adds COUNT_STAR("count(*)") enum value for JSON. |
| common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregationOperation.scala | Adds COUNT(*) semantics, conditional attribute requirement in schema, and global-stage rewrite logic. |
| common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregateOpExec.scala | Updates executor initialization to tolerate COUNT(*) without input-column lookup. |
| common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregateOpDesc.scala | Updates schema propagation to tolerate COUNT(*) without input-column lookup. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/aggregate/AggregateOpSpec.scala | Adds unit/integration coverage for COUNT(*) behavior and final-stage rewrite. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/aggregate/AggregateOpDescSpec.scala | Adds schema-propagation coverage for COUNT(*) output typing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@mengw15 Please review it. |
|
maybe a bit more to ask, but can we hide the alternatively, you can add a |
Thanks! Quick reasoning on both, and happy to change either: I disabled (greyed out) the Attribute instead of hiding it so the rows don't change height/jump when switching functions, the layout stays consistent. But hiding is a one-line change can do if team prefers it. The Attribute box is a single column-picker shared by every row and filled from the input schema. If we add a * option to it, that * would also show up for sum, min, max, etc, since the dropdown is the same for all functions. Limiting * to count only would need extra per-row conditional logic. |
|
yeah but if we don't hide/remove/reset the value input box, it may have this experience: user selects |
|
That case is already handled, when you switch to |
ok so it is doing reset already. that will do. to hide it is optional and I will leave you to decide for the best experience! thanks |
Yicong-Huang
left a comment
There was a problem hiding this comment.
Left comments in line
|
@mengw15 Please review it after the pass of @Yicong-Huang . |
|
Same comment as yicong, we should do function count and * as a special column you can select there. honestly I think it will be eaiser if we make attribute an optional parameter and if you don't select anything it means count(*) |
|
According to comments and suggestions from both of you ( @zuozhiw and @Yicong-Huang ) Would this be okay and make sense? |
|
@tanishqgandhi1908 the new proposal sounds good thx |
|
Updated the PR and its description accordingly! |
Yicong-Huang
left a comment
There was a problem hiding this comment.
LGTM, one more test to add
What changes were proposed in this PR?
This PR adds
COUNT(*)support to the Aggregate operator, so users can count all rows without selecting a column — useful when there's no natural column to count.The existing
countfunction now handles both cases via an optional attribute:countwith an empty Attribute →COUNT(*)— counts every row (including rows with nulls)countwith a column → counts that column's non-null values (unchanged)The Attribute stays required for every other function (
sum,average,min,max,concat), enforced by a conditional JSON-schema rule and surfaced in the UIwith the usual required marker.
Backend
countcounts all rows when the attribute is empty/blank, otherwise counts non-null values of the selected column.count(validated by Ajv, so an empty attribute on other functions marks the operator invalid).Frontend
count; the required marker (red*) shows for all other functions. Description hints that an empty attribute counts all rows.Docs
Screenshots
Any related issues, documentation, discussions?
Closes #3142.
How was this PR tested?
Automated (
AggregateOpSpec,AggregateOpDescSpec):countwith an empty (or null) attribute counts all rows including nulls.countwith a column counts only non-null values.getAggregationAttribute/ schema propagation / executor tolerate an empty attribute (no input-column lookup; result typedINTEGER).sum/average/min/max/concat/getFinalbehavior unchanged.All Aggregate tests pass:
sbt "WorkflowOperator/testOnly *aggregate*".Manual (UI): verified
count+ empty Attribute counts all rows (with and without Group By),count+ a column counts non-null values, and non-count functions with an empty Attribute are invalid (Run disabled).Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.8)