[#38059] Fix GCS glob matching to support ** and / in object names#38099
[#38059] Fix GCS glob matching to support ** and / in object names#38099hilaryRope wants to merge 4 commits intoapache:masterfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses an issue where GCS filesystem glob patterns were incorrectly filtered due to the standard library's path-aware matching logic. By implementing a custom regex-based matching system, the GCS connector now properly handles object names containing slashes and supports recursive globbing, ensuring consistent behavior across different Apache Beam SDKs. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
utafrali
left a comment
There was a problem hiding this comment.
The core approach of replacing filepath.Match with a custom globToRegex is correct and well-motivated. The main issue is that ? is translated to . (matches /) while * is explicitly restricted to [^/]*, creating an inconsistency that will surprise callers. There are also minor concerns around the hand-maintained isRegexSpecial list (use regexp.QuoteMeta instead) and the silent swallowing of malformed [ patterns.
| case '[': | ||
| // Character class - find the closing bracket | ||
| j := i + 1 | ||
| if j < len(pattern) && pattern[j] == '!' { |
There was a problem hiding this comment.
The ? wildcard is translated to . (matches any character including /), but * is explicitly translated to [^/]* (excludes /). This is inconsistent: a user writing dir?file.txt would unexpectedly match dir/file.txt.
Since the PR's design treats * as a single-segment matcher and ** as the cross-segment matcher, ? should follow the same rule and not cross /:
case '?':
result.WriteString("[^/]")The comment above the function even documents this asymmetry as intentional (? matches a single character (any character including /)), but no other major glob implementation (Python pathlib, Java PathMatcher) lets ? cross path separators when * does not.
There was a problem hiding this comment.
Done, resolved the inconsistency and updated the comment
| content := pattern[i+1 : j] | ||
| if len(content) > 0 && content[0] == '!' { | ||
| result.WriteByte('^') | ||
| content = content[1:] |
There was a problem hiding this comment.
isRegexSpecial maintains a hand-written list of regex metacharacters. The list is currently missing ] (harmless in Go RE2 outside [...], but still surprising) and is a maintenance risk if the code is adapted later. The default case in globToRegex can just call regexp.QuoteMeta directly, which is already imported and handles all metacharacters correctly:
default:
result.WriteString(regexp.QuoteMeta(string(c)))This removes the need for isRegexSpecial entirely.
There was a problem hiding this comment.
Agreed, makes a lot of sense. Done. Replaced with regexp.QuoteMeta() and removed isRegexSpecial altogether
| if j < len(pattern) && pattern[j] == ']' { | ||
| j++ | ||
| } | ||
| for j < len(pattern) && pattern[j] != ']' { |
There was a problem hiding this comment.
When there is no closing ], the code silently treats [ as a literal character. Previously, filepath.Match would return filepath.ErrBadPattern for the same input. This is a silent behavior change: a caller who was relying on the error to detect bad patterns will no longer get one. Consider returning an error instead:
if j >= len(pattern) {
return nil, fmt.Errorf("syntax error: unclosed '[' in pattern %q", pattern)
}There was a problem hiding this comment.
Done. Now returns fmt.Errorf("syntax error: unclosed '[' in pattern %q", pattern)
| // Character classes | ||
| {"file[0-9].txt", "file1.txt", true}, | ||
| {"file[0-9].txt", "filea.txt", false}, | ||
| {"file[!0-9].txt", "filea.txt", true}, |
There was a problem hiding this comment.
The ? test cases only verify matching a digit (file1.txt) and rejecting two characters (file12.txt). There is no test confirming whether ? matches or rejects /. Given the inconsistency with *, add an explicit case:
{"file?.txt", "file/.txt", false}, // ? should not cross /(Or true if crossing is intentional, but document it clearly.)
There was a problem hiding this comment.
Done. Added {"file?.txt", "file/.txt", false}, // ? should not cross /
| want []string | ||
| }{ | ||
| // Single * should only match top-level files | ||
| {dirPath + "/*.txt", []string{dirPath + "/file.txt", dirPath + "/other.txt"}}, |
There was a problem hiding this comment.
The integration test for /** lists all four objects including file.txt and other.txt. The /** pattern is compiled to ^.*$ which matches any string, including an empty one. This is correct for GCS, but it is also worth adding a case like dirPath + "/dir/subdir/**" to verify deeply nested ** matching, since the nested case (dir/subdir/file.txt) is the core scenario from issue #38059.
There was a problem hiding this comment.
Added dirPath + "/dir/subdir/**" test case to verify deeply nested matching, which is the core scenario from issue #38059
|
Assigning reviewers: R: @jrmccluskey for label go. Note: If you would like to opt out of this review, comment Available commands:
The PR bot will only process comments in the main thread (not review comments). |
Fix GCS filesystem glob matching to handle
/in object names and support**Fixes #38059
Summary
The GCS filesystem's List() method was using filepath.Match() to filter objects, which incorrectly treats
/as a path separator. Since GCS object names are flat (with/being just another character), patterns likegs://bucket/**failed to match objects containing/.Problem
/as a path separator, so*cannot match across/**for recursive matchingfileio.MatchFiles(scope, "gs://my-bucket/**")silently excluded objects likedir/subdir/file.txtSolution
Replaced filepath.Match with a custom globToRegex() function:
*→ matches any characters except/(single path segment)**→ matches any characters including/(recursive)**/→ matches zero or more path segmentsThis aligns the Go SDK with the Python and Java SDKs.
Testing
Added unit tests for globToRegex() and List() with nested object names.
addresses #123), if applicable.