[#38059] Fix GCS glob matching to support ** and / in object names by hilaryRope · Pull Request #38099 · apache/beam

hilaryRope · 2026-04-07T21:59:14Z

Fix GCS filesystem glob matching to handle `/` in object names and support `**`

Summary

The GCS filesystem's List() method was using filepath.Match() to filter objects, which incorrectly treats / as a path separator. Since GCS object names are flat (with / being just another character), patterns like gs://bucket/** failed to match objects containing /.

Problem

filepath.Match treats / as a path separator, so * cannot match across /
filepath.Match does not support ** for recursive matching
fileio.MatchFiles(scope, "gs://my-bucket/**") silently excluded objects like dir/subdir/file.txt

Solution

Replaced filepath.Match with a custom globToRegex() function:

* → matches any characters except / (single path segment)
** → matches any characters including / (recursive)
**/ → matches zero or more path segments

This aligns the Go SDK with the Python and Java SDKs.

Testing

Added unit tests for globToRegex() and List() with nested object names.

Mention the appropriate issue in your description (for example: addresses #123), if applicable.
Update CHANGES.md with noteworthy changes.
If this contribution is large, please file an Apache Individual Contributor License Agreement.

…ort **

gemini-code-assist · 2026-04-07T21:59:24Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an issue where GCS filesystem glob patterns were incorrectly filtered due to the standard library's path-aware matching logic. By implementing a custom regex-based matching system, the GCS connector now properly handles object names containing slashes and supports recursive globbing, ensuring consistent behavior across different Apache Beam SDKs.

Highlights

Custom Glob Matching: Replaced the standard library's filepath.Match with a custom globToRegex function to correctly handle GCS object names, which are flat and treat '/' as a standard character.
Recursive Matching Support: Added support for the '**' glob pattern to enable recursive matching across GCS object paths, aligning the Go SDK with Python and Java implementations.
Comprehensive Testing: Introduced new unit tests for the glob-to-regex conversion logic and verified GCS list operations with nested object structures.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

utafrali

The core approach of replacing filepath.Match with a custom globToRegex is correct and well-motivated. The main issue is that ? is translated to . (matches /) while * is explicitly restricted to [^/]*, creating an inconsistency that will surprise callers. There are also minor concerns around the hand-maintained isRegexSpecial list (use regexp.QuoteMeta instead) and the silent swallowing of malformed [ patterns.

utafrali · 2026-04-07T22:10:02Z

sdks/go/pkg/beam/io/filesystem/gcs/gcs.go

+		case '[':
+			// Character class - find the closing bracket
+			j := i + 1
+			if j < len(pattern) && pattern[j] == '!' {


The ? wildcard is translated to . (matches any character including /), but * is explicitly translated to [^/]* (excludes /). This is inconsistent: a user writing dir?file.txt would unexpectedly match dir/file.txt.

Since the PR's design treats * as a single-segment matcher and ** as the cross-segment matcher, ? should follow the same rule and not cross /:

case '?': result.WriteString("[^/]")

The comment above the function even documents this asymmetry as intentional (? matches a single character (any character including /)), but no other major glob implementation (Python pathlib, Java PathMatcher) lets ? cross path separators when * does not.

Done, resolved the inconsistency and updated the comment

utafrali · 2026-04-07T22:10:02Z

sdks/go/pkg/beam/io/filesystem/gcs/gcs.go

+				content := pattern[i+1 : j]
+				if len(content) > 0 && content[0] == '!' {
+					result.WriteByte('^')
+					content = content[1:]


isRegexSpecial maintains a hand-written list of regex metacharacters. The list is currently missing ] (harmless in Go RE2 outside [...], but still surprising) and is a maintenance risk if the code is adapted later. The default case in globToRegex can just call regexp.QuoteMeta directly, which is already imported and handles all metacharacters correctly:

default: result.WriteString(regexp.QuoteMeta(string(c)))

This removes the need for isRegexSpecial entirely.

Agreed, makes a lot of sense. Done. Replaced with regexp.QuoteMeta() and removed isRegexSpecial altogether

utafrali · 2026-04-07T22:10:02Z

sdks/go/pkg/beam/io/filesystem/gcs/gcs.go

+			if j < len(pattern) && pattern[j] == ']' {
+				j++
+			}
+			for j < len(pattern) && pattern[j] != ']' {


When there is no closing ], the code silently treats [ as a literal character. Previously, filepath.Match would return filepath.ErrBadPattern for the same input. This is a silent behavior change: a caller who was relying on the error to detect bad patterns will no longer get one. Consider returning an error instead:

if j >= len(pattern) { return nil, fmt.Errorf("syntax error: unclosed '[' in pattern %q", pattern) }

Done. Now returns fmt.Errorf("syntax error: unclosed '[' in pattern %q", pattern)

utafrali · 2026-04-07T22:10:02Z

sdks/go/pkg/beam/io/filesystem/gcs/gcs_test.go

+		// Character classes
+		{"file[0-9].txt", "file1.txt", true},
+		{"file[0-9].txt", "filea.txt", false},
+		{"file[!0-9].txt", "filea.txt", true},


The ? test cases only verify matching a digit (file1.txt) and rejecting two characters (file12.txt). There is no test confirming whether ? matches or rejects /. Given the inconsistency with *, add an explicit case:

{"file?.txt", "file/.txt", false}, // ? should not cross /

(Or true if crossing is intentional, but document it clearly.)

Done. Added {"file?.txt", "file/.txt", false}, // ? should not cross /

utafrali · 2026-04-07T22:10:02Z

sdks/go/pkg/beam/io/filesystem/gcs/gcs_test.go

+		want []string
+	}{
+		// Single * should only match top-level files
+		{dirPath + "/*.txt", []string{dirPath + "/file.txt", dirPath + "/other.txt"}},


The integration test for /** lists all four objects including file.txt and other.txt. The /** pattern is compiled to ^.*$ which matches any string, including an empty one. This is correct for GCS, but it is also worth adding a case like dirPath + "/dir/subdir/**" to verify deeply nested ** matching, since the nested case (dir/subdir/file.txt) is the core scenario from issue #38059.

Added dirPath + "/dir/subdir/**" test case to verify deeply nested matching, which is the core scenario from issue #38059

github-actions · 2026-04-07T22:47:31Z

Assigning reviewers:

R: @jrmccluskey for label go.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

corda.ilaria@gmail.com added 3 commits April 7, 2026 23:48

fix GCS filesystem glob matching to handle / in object names and supp…

612d5b4

…ort **

update CHANGES.md

8476e8f

update CHANGES.md

61e625d

github-actions bot added go io labels Apr 7, 2026

utafrali reviewed Apr 7, 2026

View reviewed changes

github-actions bot added the Next Action: Reviewers label Apr 7, 2026

address comments

5936119

Conversation

hilaryRope commented Apr 7, 2026

Fix GCS filesystem glob matching to handle / in object names and support **

Summary

Problem

Solution

Testing

Uh oh!

gemini-code-assist bot commented Apr 7, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

utafrali left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix GCS filesystem glob matching to handle `/` in object names and support `**`