fix(impala/pyspark): use regexp_replace to strip #11989

Open
deepyaman wants to merge 13 commits into main from deepyaman-patch-1

Conversation

@deepyaman
Collaborator

@deepyaman deepyaman commented Apr 6, 2026

Description of changes

Use regexp_replace instead of trim methods to implement strip methods, due to issues handling the form feed character.

This is almost certainly slower. I alternatively considered dropping \f from the set of characters to trim; however, data can include form feeds (e.g. from legacy systems), so correctness seems preferable.

As an added bonus, this fixes two xfails on the PySpark side.
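
To illustrate the approach outside of any backend, here is a minimal sketch using Python's standard `re` module rather than Impala or Spark SQL. The helper names (`lstrip_ws`, `rstrip_ws`, `strip_ws`) are hypothetical and are not part of this PR; `strip_ws` simply composes the two one-sided helpers rather than assuming a combined pattern.

```python
import re


def lstrip_ws(s: str) -> str:
    # Remove leading whitespace; \s covers space, \t, \n, \r, \v and \f.
    return re.sub(r"^\s+", "", s)


def rstrip_ws(s: str) -> str:
    # Remove trailing whitespace, including a trailing form feed.
    return re.sub(r"\s+$", "", s)


def strip_ws(s: str) -> str:
    # Strip both sides by composing the one-sided helpers.
    return rstrip_ws(lstrip_ws(s))


sample = "\f\t hello world \n\f"
print(repr(lstrip_ws(sample)))
print(repr(rstrip_ws(sample)))
print(repr(strip_ws(sample)))
```

For ASCII whitespace this matches what Python's built-in `str.strip()` does, which is the behavior the strip methods are expected to have.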

Issues closed

@github-actions github-actions bot added the tests Issues or PRs related to tests label Apr 6, 2026
@deepyaman deepyaman changed the title from "fix(pyspark): make sure trim doesn't remove _f_s" to "fix(pyspark): make sure trim does not remove f's" Apr 6, 2026
@deepyaman deepyaman force-pushed the deepyaman-patch-1 branch from 07b2fb4 to b78c203 April 6, 2026 01:17
@github-actions github-actions bot added the polars The polars backend label Apr 6, 2026
@github-actions github-actions bot added the sql Backends that generate SQL label Apr 6, 2026
@deepyaman deepyaman changed the title from "fix(pyspark): make sure trim does not remove f's" to "fix(compilers): make sure trim does not remove f's" Apr 6, 2026
@deepyaman deepyaman force-pushed the deepyaman-patch-1 branch from 0ae77dc to b5ad8f6 April 6, 2026 04:46
@github-actions github-actions bot added the impala The Apache Impala backend label Apr 7, 2026
@deepyaman deepyaman changed the title from "fix(compilers): make sure trim does not remove f's" to "fix(impala/pyspark): use regexp_replace to strip" Apr 7, 2026
@deepyaman deepyaman marked this pull request as ready for review April 7, 2026 15:31

Copilot AI left a comment

Pull request overview

Updates Impala and PySpark string lstrip/rstrip/strip compilation to use regexp_replace for correct whitespace handling (notably form feed), and adjusts tests/snapshots accordingly.

Changes:

  • Implement LStrip/RStrip/Strip in the PySpark and Impala SQL compilers via regexp_replace patterns using \\s.
  • Extend backend string method tests with a form-feed case and remove PySpark/Databricks xfails for lstrip/rstrip.
  • Refresh Impala SQL snapshots to match the new REGEXP_REPLACE output; includes a few small Polars backend cleanups.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| ibis/backends/tests/test_string.py | Adds a \f test value and updates expectations; removes PySpark/Databricks lstrip/rstrip xfails. |
| ibis/backends/sql/compilers/pyspark.py | Implements visit_LStrip/visit_RStrip/visit_Strip using regexp_replace. |
| ibis/backends/sql/compilers/impala.py | Switches strip variants to regexp_replace to avoid incorrect trimming behavior. |
| ibis/backends/polars/__init__.py | Minor cleanup (typo fix, exception type, type annotation tweaks). |
| ibis/backends/impala/tests/snapshots/test_string_builtins/test_string_builtins/lstrip/out.sql | Updates snapshot to REGEXP_REPLACE(..., '^\\s+', ''). |
| ibis/backends/impala/tests/snapshots/test_string_builtins/test_string_builtins/rstrip/out.sql | Updates snapshot to REGEXP_REPLACE(..., '\\s+$', ''). |
| ibis/backends/impala/tests/snapshots/test_string_builtins/test_string_builtins/strip/out.sql | Updates snapshot to `REGEXP_REPLACE(..., '^\s+ |

  /,
  *,
- params: Mapping[ir.Expr, object] | None = None,
+ params: Mapping[ir.Scalar, Any] | None = None,
Copilot AI Apr 7, 2026

The execute() params type annotation was narrowed to Mapping[ir.Scalar, Any], but this method passes params through to _to_dataframe()/compile() which are annotated to accept Mapping[ir.Expr, ...]. Because Mapping is invariant in its key type, this can introduce static type-checking errors. Consider aligning these annotations (e.g., keep Mapping[ir.Expr, Any] everywhere, or update the internal helpers to also accept Mapping[ir.Scalar, Any]).

Suggested change
- params: Mapping[ir.Scalar, Any] | None = None,
+ params: Mapping[ir.Expr, object] | None = None,
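
The invariance point can be sketched in isolation. The classes `Expr` and `Scalar` and the functions `compile_params` and `execute` below are hypothetical stand-ins, not the ibis classes or methods. The code runs fine at runtime; the assumption is that a static checker such as mypy would flag the marked call, because `Mapping[Scalar, Any]` is not a subtype of `Mapping[Expr, Any]` even though `Scalar` subclasses `Expr`.

```python
from collections.abc import Mapping
from typing import Any, Optional


class Expr:
    """Hypothetical stand-in for a general expression type."""


class Scalar(Expr):
    """Hypothetical stand-in for a scalar expression type."""


def compile_params(params: Optional[Mapping[Expr, Any]] = None) -> int:
    # Internal helper annotated with the wider key type.
    return 0 if params is None else len(params)


def execute(params: Optional[Mapping[Scalar, Any]] = None) -> int:
    # Public method annotated with the narrower key type.
    # Mapping is invariant in its key type, so a static checker is
    # expected to reject passing Mapping[Scalar, Any] here.
    return compile_params(params)  # expected static type error


scalar = Scalar()
print(execute({scalar: 1}))
```

Using the same key type in both annotations, as the suggested change does, avoids the mismatch.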

@deepyaman deepyaman removed the polars The polars backend label Apr 7, 2026
@deepyaman deepyaman added the pyspark The Apache PySpark backend label Apr 7, 2026
@NickCrews
Contributor

I agree that the expected behavior is that \f should be treated as whitespace, and therefore stripped.

To be clear, my understanding is (please correct if I'm wrong):

On main, the current behavior is:

  • on pyspark and databricks, it doesn't even compile, because Spark SQL LTRIM doesn't accept characters to trim. It does look like this param was added in pyspark 4.0, but we support pyspark back to 3.5. Can you please add a TODO comment to the implementation saying that when we drop support for pyspark<4, we should update to use this new syntax?
  • on impala, most other .strip() calls work, but it misses the \f case. It uses LTRIM(t0.string_col, ' \t\n\r\v\f'), but there is some bug in upstream impala such that it doesn't interpret the \f? Do you also see this as an upstream bug? Have you filed a bug with the impala folks?
  • All other backends work as expected.

After this change:

  • pyspark works as expected. We have to use regex because we need to keep supporting pyspark>=3.5.
  • impala now works because we work around the impala bug/limitation by using a regex.
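
One plausible reading of the linked issue ("strip removes leading and trailing "f" characters") is that the `\f` escape in the trim set is taken as two literal characters, a backslash and an `f`. The thread does not confirm this; the sketch below only shows the effect of such a misreading using Python's `str.strip` with an explicit character set, and the constant names are hypothetical.

```python
# Intended trim set: space, tab, newline, carriage return,
# vertical tab, and a real form feed character.
INTENDED_CHARS = " \t\n\r\v\f"

# Misread trim set (assumption): the escape is not interpreted,
# so the set contains a literal backslash and a literal "f".
MISREAD_CHARS = " \t\n\r\v\\f"

value = "fluff"

# With the intended set, ordinary letters are left alone.
print(repr(value.strip(INTENDED_CHARS)))

# With the misread set, leading and trailing "f" characters are
# removed, while a real form feed would no longer be stripped.
print(repr(value.strip(MISREAD_CHARS)))
print(repr("\fcoffee\f".strip(MISREAD_CHARS)))
```

A regex class such as `\s` sidesteps the question of how the trim set's escapes are interpreted, which is consistent with the workaround described above.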

Labels

impala The Apache Impala backend pyspark The Apache PySpark backend sql Backends that generate SQL tests Issues or PRs related to tests

Development

Successfully merging this pull request may close these issues.

bug: String expression strip removes leading and trailing "f" characters

3 participants