fix(sparksql): Default ignoreNulls to true for collect_set backward compatibility #16947
yaooqinn wants to merge 2 commits into facebookincubator:main
Conversation
jinchengchenghh
left a comment
Please don't include unrelated changes.
(force-pushed 565e97c to d07e804)
Rebased on latest main and removed the unrelated changes from the diff. The PR now contains only the one-file fix (CollectSetAggregate.cpp).
rui-mo
left a comment
Can you please add a test to verify the default behavior? Thanks.
(force-pushed d07e804 to 5dde4ff)
…ompatibility

The `ignoreNulls_` field in `SparkCollectSetAggregate` was defaulting to `false` (RESPECT NULLS), which breaks backward compatibility when the 1-arg signature is used. In this case, `setConstantInputs()` does not receive a boolean constant, so the default value is used, which must match Spark's default behavior of ignoring nulls.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
(force-pushed 5dde4ff to de0ff6a)
Added.
        SBase::allocator_);
  }
  // Intermediate results already have null filtering applied by the
  // partial step. Always preserve all elements (including nulls) here.
Intermediate results already have null filtering applied by the partial step.
Velox supports flushing during partial aggregation. When this happens, the intermediate results are left unaggregated, with the final aggregation step responsible for processing them. Could this cause any result issues?
Good point! This is safe because:

1. When the partial step flushes (`toIntermediate` is called), null filtering is already applied there. With `ignoreNulls_ = true`, null inputs become empty arrays (size 0), so the intermediate output contains no null elements. With `ignoreNulls_ = false`, `Base::toIntermediate` wraps each value (including nulls) into `[value]` arrays.

2. The final step receives pre-filtered data. Since `toIntermediate` already handles null filtering, `addIntermediateResults` just needs to merge arrays, so using `addValues` (preserve everything) is correct for both cases.

3. No behavior change: before this fix, the final/intermediate nodes had `ignoreNulls_ = {false}` by default, which also always used `addValues` in the intermediate path. My change makes this explicit and removes the dead `addNonNullValues` branch that was never reachable in the final step.
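The partial/final split described above can be sketched with a minimal toy model. This is not the Velox API: `Value`, `Array`, and the simplified `toIntermediate` / `addIntermediateResults` signatures below are illustrative stand-ins for the functions named in the discussion, assuming only the behavior stated there (nulls filtered in the partial step, final step merges without re-filtering).

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Toy model of the partial/final aggregation split (not the Velox API).
using Value = std::optional<int>;  // std::nullopt models a NULL input.
using Array = std::vector<Value>;

// Partial step: null filtering happens here, before flushing.
Array toIntermediate(const Value& v, bool ignoreNulls) {
  if (!v.has_value() && ignoreNulls) {
    return {};  // Null input becomes an empty array: no null elements emitted.
  }
  return {v};  // Wrap the value (possibly null) into a [value] array.
}

// Final step: merge intermediate arrays, preserving every element
// (the "addValues" behavior). No null filtering is needed here because
// the partial step already applied it.
Array addIntermediateResults(const std::vector<Array>& parts) {
  Array out;
  for (const auto& part : parts) {
    out.insert(out.end(), part.begin(), part.end());
  }
  return out;
}
```

Even if a partial flush happens early, the flushed intermediate arrays already reflect the null policy, so the final merge is correct for both `ignoreNulls` settings.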
@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this in D99321794.
@bikramSingh91 merged this pull request in 1dfcfbb.
Summary

Fixes a backward compatibility bug introduced in PR #16416.

Root cause

The `ignoreNulls_` field in `SparkCollectSetAggregate` was defaulting to `false` (RESPECT NULLS). When the 1-arg signature `collect_set(T)` is used, `setConstantInputs()` does not receive a boolean constant, so the default value is used, which must match Spark's default behavior of ignoring nulls (`true`).
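A minimal sketch of the failure mode and the fix, assuming the mechanism described above. The struct name carries a `Sketch` suffix and `setConstantInputs` takes a simplified `std::optional<bool>` because this is an illustration, not the actual Velox class.

```cpp
#include <cassert>
#include <optional>

// Hypothetical sketch: the aggregate holds a defaulted flag, and
// setConstantInputs() only overrides it when the 2-arg signature
// supplies a boolean constant.
struct SparkCollectSetAggregateSketch {
  // The fix: default to true (IGNORE NULLS), matching Spark.
  // Before the fix this was `= false` (RESPECT NULLS), so 1-arg
  // callers silently got the wrong null-handling behavior.
  bool ignoreNulls_ = true;

  void setConstantInputs(const std::optional<bool>& ignoreNullsConstant) {
    if (ignoreNullsConstant.has_value()) {
      ignoreNulls_ = *ignoreNullsConstant;  // 2-arg signature: explicit choice.
    }
    // 1-arg signature: no constant supplied, the default stays in effect.
  }
};
```

With `collect_set(T)`, no constant reaches `setConstantInputs()`, so whatever the member's default initializer says becomes the observed behavior; that is why the default itself must encode Spark's semantics.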
Impact

Without this fix, any downstream consumer (e.g., Gluten) using the native `collect_set` with the 1-arg signature would get null elements in the output array, causing a `NullPointerException` during Spark's result projection.

Testing

Verified in Gluten with `VeloxAggregateFunctionsDefaultSuite`: all 16 collect_set/collect_list tests pass after this fix.

Related: Gluten PR apache/gluten#11837