fix(sparksql): Default ignoreNulls to true for collect_set backward compatibility #16947
yaooqinn wants to merge 2 commits into facebookincubator:main
Conversation
jinchengchenghh
left a comment
Please don't include unrelated changes.
(force-pushed 565e97c to d07e804)
Rebased on latest main and removed the unrelated changes from the diff. The PR now contains only the one-file fix (CollectSetAggregate.cpp).
rui-mo
left a comment
Can you please add a test to verify the default behavior? Thanks.
(force-pushed d07e804 to 5dde4ff)
…ompatibility

The `ignoreNulls_` field in `SparkCollectSetAggregate` was defaulting to `false` (RESPECT NULLS), which breaks backward compatibility when the 1-arg signature is used. In this case, `setConstantInputs()` does not receive a boolean constant, so the default value is used, which must match Spark's default behavior of ignoring nulls.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
(force-pushed 5dde4ff to de0ff6a)
Added.
        SBase::allocator_);
  }
  // Intermediate results already have null filtering applied by the
  // partial step. Always preserve all elements (including nulls) here.
Intermediate results already have null filtering applied by the partial step.
Velox supports flushing during partial aggregation. When this happens, the intermediate results are left unaggregated, with the final aggregation step responsible for processing them. Could this cause any result issues?
Good point! This is safe because:

1. When the partial step flushes (`toIntermediate` is called), null filtering is already applied there. With `ignoreNulls_ = true`, null inputs become empty arrays (size 0), so the intermediate output contains no null elements. With `ignoreNulls_ = false`, `Base::toIntermediate` wraps each value (including nulls) into `[value]` arrays.

2. The final step receives pre-filtered data. Since `toIntermediate` already handles null filtering, `addIntermediateResults` just needs to merge arrays, so using `addValues` (preserve everything) is correct for both cases.

3. No behavior change: before this fix, the final/intermediate nodes had `ignoreNulls_ = {false}` by default, which also always used `addValues` in the intermediate path. My change makes this explicit and removes the dead `addNonNullValues` branch that was never reachable in the final step.
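The partial/final split described above can be sketched with a minimal toy model. This is not the Velox API: `Value`, `Array`, and the simplified `toIntermediate` / `addIntermediateResults` signatures below are illustrative stand-ins for the functions named in the discussion, assuming only the behavior stated there (nulls filtered in the partial step, final step merges without re-filtering).

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Toy model of the partial/final aggregation split (not the Velox API).
using Value = std::optional<int>;  // std::nullopt models a NULL input.
using Array = std::vector<Value>;

// Partial step: null filtering happens here, before flushing.
Array toIntermediate(const Value& v, bool ignoreNulls) {
  if (!v.has_value() && ignoreNulls) {
    return {};  // Null input becomes an empty array: no null elements emitted.
  }
  return {v};  // Wrap the value (possibly null) into a [value] array.
}

// Final step: merge intermediate arrays, preserving every element
// (the "addValues" behavior). No null filtering is needed here because
// the partial step already applied it.
Array addIntermediateResults(const std::vector<Array>& parts) {
  Array out;
  for (const auto& part : parts) {
    out.insert(out.end(), part.begin(), part.end());
  }
  return out;
}
```

Even if a partial flush happens early, the flushed intermediate arrays already reflect the null policy, so the final merge is correct for both `ignoreNulls` settings.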
@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this in D99321794.
@bikramSingh91 merged this pull request in 1dfcfbb.
Summary

Fixes a backward compatibility bug introduced in PR #16416.

Root cause

The `ignoreNulls_` field in `SparkCollectSetAggregate` was defaulting to `false` (RESPECT NULLS). When the 1-arg signature `collect_set(T)` is used, `setConstantInputs()` does not receive a boolean constant, so the default value is used, which must match Spark's default behavior of ignoring nulls (`true`).
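A minimal sketch of the failure mode and the fix, assuming the mechanism described above. The struct name carries a `Sketch` suffix and `setConstantInputs` takes a simplified `std::optional<bool>` because this is an illustration, not the actual Velox class.

```cpp
#include <cassert>
#include <optional>

// Hypothetical sketch: the aggregate holds a defaulted flag, and
// setConstantInputs() only overrides it when the 2-arg signature
// supplies a boolean constant.
struct SparkCollectSetAggregateSketch {
  // The fix: default to true (IGNORE NULLS), matching Spark.
  // Before the fix this was `= false` (RESPECT NULLS), so 1-arg
  // callers silently got the wrong null-handling behavior.
  bool ignoreNulls_ = true;

  void setConstantInputs(const std::optional<bool>& ignoreNullsConstant) {
    if (ignoreNullsConstant.has_value()) {
      ignoreNulls_ = *ignoreNullsConstant;  // 2-arg signature: explicit choice.
    }
    // 1-arg signature: no constant supplied, the default stays in effect.
  }
};
```

With `collect_set(T)`, no constant reaches `setConstantInputs()`, so whatever the member's default initializer says becomes the observed behavior; that is why the default itself must encode Spark's semantics.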
Impact

Without this fix, any downstream consumer (e.g., Gluten) using the native `collect_set` with the 1-arg signature would get null elements in the output array, causing a `NullPointerException` during Spark's result projection.

Testing

Verified in Gluten with `VeloxAggregateFunctionsDefaultSuite`: all 16 collect_set/collect_list tests pass after this fix.

Related: Gluten PR apache/gluten#11837