Skip to content

fix(sparksql): Default ignoreNulls to true for collect_set backward compatibility#16947

Closed
yaooqinn wants to merge 2 commits intofacebookincubator:mainfrom
yaooqinn:fix/collect-set-default-ignore-nulls
Closed

fix(sparksql): Default ignoreNulls to true for collect_set backward compatibility#16947
yaooqinn wants to merge 2 commits intofacebookincubator:mainfrom
yaooqinn:fix/collect-set-default-ignore-nulls

Conversation

@yaooqinn
Copy link
Copy Markdown
Contributor

Summary

Fixes a backward compatibility bug introduced in PR #16416.

The ignoreNulls_ field in SparkCollectSetAggregate was defaulting to false (RESPECT NULLS). When the 1-arg signature collect_set(T) is used, setConstantInputs() does not receive a boolean constant, so the default value is used — which must match Spark's default behavior of ignoring nulls (true).

Root cause

// Before (broken): includes nulls by default
bool ignoreNulls_{false};

// After (fixed): ignores nulls by default (Spark's default)
bool ignoreNulls_{true};

Impact

Without this fix, any downstream consumer (e.g., Gluten) using the native collect_set with the 1-arg signature would get null elements in the output array, causing NullPointerException during Spark's result projection.

Testing

Verified in Gluten with VeloxAggregateFunctionsDefaultSuite — all 16 collect_set/collect_list tests pass after this fix.

Related: Gluten PR apache/gluten#11837

@netlify
Copy link
Copy Markdown

netlify bot commented Mar 28, 2026

Deploy Preview for meta-velox ready!

Name Link
🔨 Latest commit 889d8d5
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/69cd37db07bb570008b483cc
😎 Deploy Preview https://deploy-preview-16947--meta-velox.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 28, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 28, 2026

Build Impact Analysis

Selective Build Targets (building these covers all 5 affected)

cmake --build _build/release --target spark_aggregation_fuzzer_test velox_functions_spark_aggregates_test velox_spark_query_runner_test velox_sparksql_coverage

Total affected: 5/556 targets

Affected targets (5)

Directly changed (2)

Target Changed Files
velox_functions_spark_aggregates CollectSetAggregate.cpp
velox_functions_spark_aggregates_test CollectSetAggregateTest.cpp

Transitively affected (3)

  • spark_aggregation_fuzzer_test
  • velox_spark_query_runner_test
  • velox_sparksql_coverage

Fast path • Graph from main@24e6ab97ba8015c3b6fae82e8184047cadf521df

Copy link
Copy Markdown
Collaborator

@jinchengchenghh jinchengchenghh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't include non-related change

@yaooqinn yaooqinn force-pushed the fix/collect-set-default-ignore-nulls branch from 565e97c to d07e804 Compare March 30, 2026 13:21
@yaooqinn
Copy link
Copy Markdown
Contributor Author

Rebased on latest main — removed unrelated changes from the diff. Now only the 1-file fix (CollectSetAggregate.cpp).

Copy link
Copy Markdown
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add test for verify the default behavior? Thanks.

@yaooqinn yaooqinn force-pushed the fix/collect-set-default-ignore-nulls branch from d07e804 to 5dde4ff Compare March 30, 2026 18:47
…ompatibility

The ignoreNulls_ field in SparkCollectSetAggregate was defaulting to false
(RESPECT NULLS), which breaks backward compatibility when the 1-arg signature
is used. In this case, setConstantInputs() does not receive a boolean constant,
so the default value is used — which must match Spark's default behavior of
ignoring nulls.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yaooqinn yaooqinn force-pushed the fix/collect-set-default-ignore-nulls branch from 5dde4ff to de0ff6a Compare March 31, 2026 03:57
@yaooqinn
Copy link
Copy Markdown
Contributor Author

Added defaultIgnoreNulls test verifying the 1-arg collect_set(c0) ignores nulls by default (Spark's default behavior). All 10 tests pass. Thanks @rui-mo!

SBase::allocator_);
}
// Intermediate results already have null filtering applied by the
// partial step. Always preserve all elements (including nulls) here.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intermediate results already have null filtering applied by the partial step.

Velox supports flushing during partial aggregation. When this happens, the intermediate results are left unaggregated, with the final aggregation step responsible for processing them. Could this cause any result issues?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! This is safe because:

  1. When partial flushes (toIntermediate is called), null filtering is already applied there:

    • ignoreNulls_=true: null inputs become empty arrays (size=0), so the intermediate output contains no null elements
    • ignoreNulls_=false: Base::toIntermediate wraps each value (including nulls) into [value] arrays
  2. Final step receives pre-filtered data: Since toIntermediate already handles null filtering, addIntermediateResults just needs to merge arrays — using addValues (preserve everything) is correct for both cases.

  3. No behavior change: Before this fix, the final/intermediate nodes had ignoreNulls_={false} by default, which also always used addValues in the intermediate path. My change makes this explicit and removes the dead addNonNullValues branch that was never reachable in the final step.

Copy link
Copy Markdown
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@rui-mo rui-mo added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Apr 2, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Apr 2, 2026

@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this in D99321794.

@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Apr 3, 2026

@bikramSingh91 merged this pull request in 1dfcfbb.

@yaooqinn yaooqinn deleted the fix/collect-set-default-ignore-nulls branch April 4, 2026 04:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants