refactor: rewrite parser core as explicit Block tree #285

Draft
Hocnonsense wants to merge 55 commits into master from syntax

Conversation

Hocnonsense (Contributor) commented Apr 9, 2026

Motivation

The existing parser (parser/parser.py, parser/syntax.py) has grown organically and is now difficult to reason about:
indent tracking, comment handling, parameter formatting, and fmt: directive handling are all interleaved in a single imperative loop.
Adding new features (e.g. # fmt: skip on keyword lines, or supporting more of mainline snakemake's features) requires touching many tightly coupled places.

Moreover, the compilation logic in the main snakemake repository and the planned LSP server (for code highlighting and symbol navigation) share similar parsing concerns.
Maintaining them separately risks redundancy, drift, and subtle inconsistencies.

This PR rewrites the parser core as an explicit Block tree, making each concern addressable independently — bringing the architecture back under human control.

What changed

New: snakefmt/blocken.py

A self-contained parser that walks the token stream and produces a tree of typed Block objects:

Block (abstract)
├── PythonBlock             | contiguous Python code at one indent level
├── ColonBlock              | any block ending with :
│   ├── NoSnakemakeBlock    (def / class / async)
│   ├── IfForTryWithBlock   (if / for / try / with / elif / else / ...)
│   ├── MatchBlock / CaseBlock
│   └── SnakemakeBlock      | snakemake keywords
│       ├── SnakemakeArgumentsBlock   (input / output / params / ...)
│       │   ├── SnakemakeUnnamedArgumentsBlock
│       │   └── SnakemakeInlineArgumentBlock  (include / workdir / ...)
│       ├── SnakemakeExecutableBlock  (run / shell / script / ...)
│       └── SnakemakeKeywordBlock     (rule / module / use rule / ...)
└── GlobalBlock             | top-level document root

Each Block exposes:

  • segment2format() -> Generator[tuple[str, str | None], None, None]: python/snakemake segments for combined formatting
  • compilation() -> str: pure Python for snakemake execution
  • components() -> Iterator[DocumentSymbol]: LSP document symbols (e.g. for VS Code highlighting)

Comment ownership and blank-line attribution are resolved once at tree construction time, not scattered across formatting passes.
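To make the interface concrete, here is a minimal sketch of the three methods above. The method names come from this PR; the stand-in DocumentSymbol type and ToyRuleBlock are illustrative only, not blocken.py's actual implementation:

```python
from abc import ABC, abstractmethod
from typing import Iterator


class DocumentSymbol:
    """Minimal stand-in for the LSP DocumentSymbol type."""

    def __init__(self, name: str, kind: str):
        self.name, self.kind = name, kind


class Block(ABC):
    @abstractmethod
    def segment2format(self):
        """Yield (segment_text, snakemake_context | None) pairs for formatting."""

    @abstractmethod
    def compilation(self) -> str:
        """Return a pure-Python equivalent for snakemake execution."""

    @abstractmethod
    def components(self) -> Iterator[DocumentSymbol]:
        """Yield LSP document symbols for navigation and highlighting."""


class ToyRuleBlock(Block):
    """Illustrative concrete block; not the real SnakemakeKeywordBlock."""

    def __init__(self, name: str, shell_cmd: str):
        self.name, self.shell_cmd = name, shell_cmd

    def segment2format(self):
        yield f"rule {self.name}:", "keyword"
        yield f'    shell: "{self.shell_cmd}"', "argument"

    def compilation(self) -> str:
        return f"workflow.rule(name={self.name!r})"

    def components(self):
        yield DocumentSymbol(self.name, "rule")
```

The point is that a formatter walks segment2format(), a compiler calls compilation(), and an LSP server collects components(); all three consumers read the same tree.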

Tests for the new design: tests/test_blocken.py

Unit tests for block parsing and DocumentSymbol extraction, independent of formatting output.
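For flavor, the token-level entry point these tests exercise looks roughly like this (the helper mirrors the one discussed in review; stdlib tokenize does the lexing, so tests can assert on parsing without going through the formatter):

```python
import io
import tokenize


def generate_tokens(source: str) -> list[tokenize.TokenInfo]:
    """Tokenize a Snakefile snippet; this raw stream is what the parser consumes."""
    return list(tokenize.generate_tokens(io.StringIO(source).readline))


tokens = generate_tokens("rule all:\n    input: 'a.txt'\n")
# "rule" is just a NAME token at this level; blocken.py decides it is a keyword
assert tokens[0].string == "rule"
assert tokens[0].type == tokenize.NAME
```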

Modified: tests/test_formatter.py

A small number of edge-case behaviors were adjusted where the previous implementation had unintentional quirks.
I'd welcome review of these changes specifically.
If any adjusted behavior should be preserved, I'm happy to discuss or restore it.

What this enables / will support

  • # fmt: skip on keyword lines: the Block and LogicalLine structure makes skip directives detectable before formatting begins, with no changes needed across multiple call sites
  • Compilation output: Block.compilation() stubs are present; planned to expose compilation functions to the main snakemake repository
  • LSP semantic tokens: Block.components() yields DocumentSymbol; planned to support a language server for code highlighting and symbol navigation
  • Eventual snakemake integration: once Block.compilation() is validated, snakemake could depend on snakefmt for parsing rather than maintaining parallel implementations

Questions for maintainer

  • Does the Block hierarchy cover all formatting cases you're aware of?
  • Are the adjusted test_formatter.py behaviors acceptable?
  • What's the right scope for a first mergeable chunk:
    wiring blocken.py in to replace the existing logic as-is,
    or first splitting it into smaller modules (e.g. TokenIterator, LogicalLine, FormatState/format_black as utilities, and the Block hierarchy as the core) and retiring the existing parser/ package?

The old parser/ and formatter are untouched in this branch, and blocken.py runs in parallel. The transition can be done incrementally.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced enhanced formatting engine with improved handling of Snakemake directives and mixed Python/Snakemake syntax
  • Chores

    • Updated code linting configuration to ignore additional flake8 error codes (E701, E704)


coderabbitai bot commented Apr 9, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1d6d3ab5-7bb1-4b95-b8da-90661531a826

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 90.58704% with 93 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.13%. Comparing base (e7201e1) to head (4ea65f6).

Files with missing lines | Patch % | Lines
snakefmt/blocken.py | 90.58% | 56 Missing and 37 partials ⚠️
Additional details and impacted files


@@             Coverage Diff             @@
##           master     #285       +/-   ##
===========================================
- Coverage   96.52%   83.13%   -13.39%     
===========================================
  Files          12       13        +1     
  Lines        1497     2485      +988     
  Branches      309      555      +246     
===========================================
+ Hits         1445     2066      +621     
- Misses         24      295      +271     
- Partials       28      124       +96     
Flag | Coverage Δ
unittests | 83.05% <90.48%> (-13.41%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
snakefmt/blocken.py | 90.58% <90.58%> (ø)

... and 4 files with indirect coverage changes



coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (10)
snakefmt/blocken.py (7)

1859-1866: Parameter input shadows Python builtin.

Consider renaming to source or content to avoid shadowing the builtin.

📝 Suggested fix
-def parse(input: str | Callable[[], str], name: str = "<string>"):
-    if isinstance(input, str):
+def parse(source: str | Callable[[], str], name: str = "<string>"):
+    if isinstance(source, str):
         tokens = tokenize.generate_tokens(
-            iter(input.splitlines(keepends=True)).__next__
+            iter(source.splitlines(keepends=True)).__next__
         )
     else:
-        tokens = tokenize.generate_tokens(input)
+        tokens = tokenize.generate_tokens(source)
     return GlobalBlock(0, TokenIterator(name, tokens), [])

1086-1086: Typo in comment.

"olgorithm" should be "algorithm".


57-65: Redundant assignment on line 61.

self.tokens = tokens appears redundant since self._live_tokens = tokens is already assigned on line 59. If self.tokens is intended for external access, consider documenting its purpose or removing if unused.


1800-1802: Typo: "sort_direcives" should be "sort_directives".

The attribute name has a consistent typo throughout the codebase. While it works, fixing it would improve code readability.

The typo also appears at:

  • Line 329 in FormatState
  • Line 1821 in get_formatted
  • Line 1881 in setup_formatter

1833-1842: Typo: "place_hode" should be "placeholder".

The variable names place_hode_str and place_hode appear to be typos for "placeholder".

📝 Suggested fix
-        place_hode_str = "o" * 50
+        placeholder_str = "o" * 50
         raw_str = "".join(python_codes)
-        while place_hode_str in raw_str:
-            place_hode_str *= 2
+        while placeholder_str in raw_str:
+            placeholder_str *= 2
         raw_str = "#\n"
         for python_code, (snakemake_code, indent) in zip(python_codes, snakemake_codes):
             if snakemake_code.count("\n") == 1:  # must at the end of line
-                place_hode = f"{indent}def l{place_hode_str}1ng(): ...\n"
+                placeholder = f"{indent}def l{placeholder_str}1ng(): ...\n"
             else:
-                place_hode = f"{indent}def l{place_hode_str}ng():\n{indent} return\n"
-            raw_str += python_code + place_hode
+                placeholder = f"{indent}def l{placeholder_str}ng():\n{indent} return\n"
+            raw_str += python_code + placeholder

96-99: Minor typos in docstring.

"Donot" should be "Do not" and "INDEDT" should be "INDENT".

📝 Suggested fix
     def next_block(self):
         """Returns a entire block, just consume until the end of the block.
-        Donot care if there are nested blocks inside or snakemake keywords inside.
+        Do not care if there are nested blocks inside or snakemake keywords inside.

-        it could be INDEDT -> [any content] -> DEDENT, or [any content] -> DEDENT
+        it could be INDENT -> [any content] -> DEDENT, or [any content] -> DEDENT
         """

443-448: Error message note may be misleading when line number is available.

The "line number may be incorrect" note is always appended, even when start_token is provided and the line number is adjusted. Consider only adding this note when the line number couldn't be determined.

📝 Suggested fix
         else:
             err_msg = str(e)
-        err_msg += (
-            "\n\n(Note reported line number may be incorrect, as"
-            " snakefmt could not determine the true line number)"
-        )
+        if start_token is None or not match:
+            err_msg += (
+                "\n\n(Note reported line number may be incorrect, as"
+                " snakefmt could not determine the true line number)"
+            )
         err_msg = f"Black error:\n```\n{err_msg}\n```\n"
tests/test_blocken.py (1)

29-32: Consider renaming parameter to avoid shadowing builtin.

The parameter input shadows the Python builtin. While harmless in this limited scope, using source or code would be cleaner.

📝 Suggested fix
-def generate_tokens(input: str):
+def generate_tokens(source: str):
     return list(
-        tokenize.generate_tokens(iter(input.splitlines(keepends=True)).__next__)
+        tokenize.generate_tokens(iter(source.splitlines(keepends=True)).__next__)
     )
tests/test_formatter.py (1)

2444-2444: Unused variable formatted0.

The unpacked variable formatted0 is never used in this test. Use underscore prefix to indicate it's intentionally ignored.

📝 Suggested fix
-        code1, formatted0 = TestSortFormatting.sorting_comprehensive
+        code1, _formatted0 = TestSortFormatting.sorting_comprehensive
.flake8 (1)

3-4: Comment is now outdated.

The comment says "the default ignores minus E704", but E704 is now explicitly added to the ignore list. Consider updating the comment to reflect the current state.

📝 Suggested fix
-# the default ignores minus E704
-ignore = E121,E123,E126,E226,E203,E24,E701,E704,W503,W504
+# the default ignores plus E701 and E704 for single-line class/def patterns
+ignore = E121,E123,E126,E226,E203,E24,E701,E704,W503,W504
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@snakefmt/blocken.py`:
- Around line 233-235: The condition is comparing self.body[0].type (an int) to
the string "**" which is always false; update the check in the method using
self.body[0] to verify both the token type and the token text — e.g., test that
self.body[0].type equals the appropriate token type constant and that
self.body[0].string (or .value) == "**" (i.e., replace the current if
self.body[0].type == "**" with a combined type-and-string check using
self.body[0].type and self.body[0].string).
- Around line 357-358: Remove the redundant pre-assignment to the variable
`match` before the walrus check: delete the line `match =
_FMT_DIRECTIVE_RE.match(comment)` so the code only uses the walrus expression
`if match := _FMT_DIRECTIVE_RE.match(comment):` in blocken.py; this eliminates
the overwritten assignment and preserves the intended control flow around
`_FMT_DIRECTIVE_RE.match`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c9fb18f7-ef63-4e85-a5c7-5d804a776e30

📥 Commits

Reviewing files that changed from the base of the PR and between e7201e1 and 5da1e6c.

📒 Files selected for processing (4)
  • .flake8
  • snakefmt/blocken.py
  • tests/test_blocken.py
  • tests/test_formatter.py


coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (6)
snakefmt/blocken.py (5)

446-446: Redundant str() call.

err_msg is already a string at this point. The str() conversion is unnecessary:

-        err_msg = f"Black error:\n```\n{str(err_msg)}\n```\n"
+        err_msg = f"Black error:\n```\n{err_msg}\n```\n"

532-534: Mutable class attribute defaults flagged by static analysis.

Ruff warns about mutable default values for class attributes (RUF012). While these appear to be intentionally shared registries, consider using ClassVar type hints to make the intent explicit:

from typing import ClassVar

class Block(ABC):
    subautomata: ClassVar[Mapping[str, "type[ColonBlock]"]] = {}
    deprecated: ClassVar[Mapping[str, str]] = {}

This also applies to similar patterns at lines 908, 959-960, 1725, 1785, and 1804.


360-360: Minor: Prefer next() over single-element slice.

Per Ruff RUF015, consider using next() for extracting the first element:

-            option = [opt.strip() for opt in (options or "").split(",")][0]
+            option = next(opt.strip() for opt in (options or "").split(","))

This is more idiomatic and avoids creating an intermediate list.


1833-1837: Confusing variable reuse - raw_str assigned then immediately overwritten.

The code assigns raw_str = "".join(python_codes) at line 1834, uses it for collision checking, then immediately overwrites it at line 1837 with raw_str = "#\n". While functionally correct (the collision check uses the first value), this is confusing:

-        placeholder = "o" * 50
-        raw_str = "".join(python_codes)
+        placeholder = "o" * 50
+        all_python = "".join(python_codes)
         while placeholder in raw_str:
             placeholder *= 2
-        raw_str = "#\n"
+        raw_str = "#\n"

Consider using a distinct variable name for clarity:

         placeholder = "o" * 50
-        raw_str = "".join(python_codes)
-        while placeholder in raw_str:
+        all_python = "".join(python_codes)
+        while placeholder in all_python:
             placeholder *= 2
         raw_str = "#\n"
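The collision-avoidance idea these lines implement can be shown standalone (the helper name here is illustrative, not blocken.py's API): keep doubling a seed string until it no longer occurs anywhere in the joined Python segments, so the generated stub names cannot clash with user code.

```python
def unique_placeholder(segments: list[str], seed: str = "o" * 50) -> str:
    """Return a marker guaranteed absent from every segment by repeated doubling."""
    joined = "".join(segments)  # the collision check must use the joined source
    placeholder = seed
    while placeholder in joined:
        placeholder *= 2
    return placeholder


# a short source never contains the 50-char seed, so the seed is returned as-is
assert unique_placeholder(["x = 1\n"]) == "o" * 50
# a source full of "o"s forces doubling until the marker outgrows the input
assert len(unique_placeholder(["o" * 200])) == 400
```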

1838-1852: Add strict=True to zip() calls to catch length mismatches.

The zip() calls on lines 1838 and 1847 pair python_codes with snakemake_codes. If there's ever a bug in the iteration logic that causes these lists to have different lengths, data would be silently dropped:

-        for python_code, (snakemake_code, indent) in zip(python_codes, snakemake_codes):
+        for python_code, (snakemake_code, indent) in zip(python_codes, snakemake_codes, strict=True):

And:

-        for formatted, (snakemake_code, _) in zip(formatted_split, snakemake_codes):
+        for formatted, (snakemake_code, _) in zip(formatted_split, snakemake_codes, strict=True):

This will raise ValueError if lists have different lengths, making bugs easier to detect.

tests/test_formatter.py (1)

2396-2401: # fmt: skip in directives produces unusual output.

The expected output shows x = [ 1,2,3] # fmt: skip followed by a trailing comma on its own line (TAB * 2 + ",\n"). This is noted in the comment as a limitation ("currently # fmt: skip in directives is not supported").

Consider whether this behavior should be:

  1. Fixed to properly handle # fmt: skip inline
  2. Documented as a known limitation in user-facing docs
  3. Tracked as a future enhancement issue

Would you like me to open an issue to track improving # fmt: skip support in directives?

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@snakefmt/blocken.py`:
- Around lines 617-623: The decorator-handling loop can hang if no recognized
block follows; wrap the calls to tokens.next_new_line() inside a try/except to
catch the TokenIterator/UnsupportedSyntax EOF case (or otherwise check
tokens.has_next()/peek() before calling next_new_line), and on EOF either raise
a clearer error or break out and handle the accumulated headers (call append_sub
with an appropriate fallback or abort with a descriptive exception). Update the
block around line handling where you detect line.body[0].string == "@" to use
self.recognize, tokens.next_new_line, and append_sub with this
EOF/unsupported-syntax guard so the loop cannot run forever.
- Around lines 233-235: Update the condition that detects unpacking operators on
the parameter token (currently checking self.body[0].string == "**") to also
treat the single-star unpacking operator as a new parameter start; e.g., change
the test to check for self.body[0].string == "*" or self.body[0].string == "**"
(or use membership like self.body[0].string in ("*", "**")) so both *args and
**kwargs trigger the early-return that starts a new parameter line.

---

Nitpick comments:
In `@snakefmt/blocken.py`:
- Line 446: The f-string that builds the Black error message unnecessarily calls
str() on err_msg even though err_msg is already a string; replace the f-string
expression "{str(err_msg)}" with "{err_msg}" so the line constructing err_msg
(the variable err_msg used to format the Black error block) uses the value
directly without redundant conversion.
- Around lines 532-534: The class-level mutable mappings subautomata and
deprecated should be annotated as ClassVar to indicate intentional shared
registries and silence RUF012; update their type hints to use ClassVar (e.g.,
subautomata: ClassVar[Mapping[str, "type[ColonBlock]"]] and deprecated:
ClassVar[Mapping[str, str]]), add the necessary typing import if missing, and
apply the same change to the other occurrences referenced (around the symbols at
the other noted locations) so all mutable class defaults are explicitly
ClassVar-annotated.
- Line 360: Replace the single-element list comprehension used to set the
variable option with an iterator + next() to avoid building an intermediate
list: change the assignment to use next((opt.strip() for opt in (options or
"").split(",")), "") so it yields the first stripped element (or a safe default)
— update the line that currently sets option = [opt.strip() for opt in (options
or "").split(",")][0] in snakefmt/blocken.py.
- Around lines 1833-1837: The variable raw_str is reused confusingly: it’s set to
"".join(python_codes) for the collision check then immediately overwritten with
"#\n"; rename the first usage to a distinct variable (e.g., joined_python or
python_blob) and use that in the while placeholder in <joined_variable> loop,
then keep raw_str = "#\n" as the intended subsequent value; update references in
the collision-checking loop (placeholder doubling) to use the new joined
variable and leave raw_str only for the "#\n" assignment.


📥 Commits

Reviewing files that changed from the base of the PR and between 5da1e6c and 4ea65f6.

📒 Files selected for processing (2)
  • snakefmt/blocken.py
  • tests/test_formatter.py

Comment on lines +233 to +235
        if self.body[0].string == "**":
            return True
        return False

⚠️ Potential issue | 🟡 Minor


Also handle *args unpacking operator.

The condition checks for **kwargs unpacking but not *args. Both should mark the start of a new parameter line. The missing check for * could cause *args parameters to be incorrectly appended to the previous parameter.

Consider:

Suggested fix
-        if self.body[0].string == "**":
+        if self.body[0].type == tokenize.OP and self.body[0].string in ("*", "**"):
             return True
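The suggested condition relies on how Python tokenizes parameter lines; a quick stdlib-only check confirms that both `*` and `**` arrive as leading OP tokens, so testing only `"**"` would misclassify `*args` lines:

```python
import io
import tokenize

def leading_token(src: str) -> tuple[str, str]:
    """Return (token type name, token string) for the first token in src."""
    tok = next(tokenize.generate_tokens(io.StringIO(src).readline))
    return tokenize.tok_name[tok.type], tok.string
```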

Comment on lines +617 to +623
elif line.body[0].string == "@":
    headers = [line]
    while True:
        headers.append(tokens.next_new_line())
        if block := self.recognize(headers[-1].body[0]):
            break
    append_sub(block, headers)

⚠️ Potential issue | 🟡 Minor

Potential infinite loop when decorators aren't followed by a recognized block.

The while True loop on line 619 consumes lines after a decorator (@) until self.recognize() returns a block type. If decorators are followed by unrecognized content (or EOF), this loop will continue until TokenIterator.__next__ raises UnsupportedSyntax.

Consider adding a safety check:

 elif line.body[0].string == "@":
     headers = [line]
     while True:
         headers.append(tokens.next_new_line())
+        if headers[-1].end.type == tokenize.ENDMARKER:
+            raise UnsupportedSyntax(
+                f"Decorator at line {line.body[0].start[0]} not followed by def/class"
+            )
         if block := self.recognize(headers[-1].body[0]):
             break
     append_sub(block, headers)
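The fix can be sketched as a standalone loop over plain strings; collect_decorator_headers, recognize, and the line format here are hypothetical stand-ins for the TokenIterator machinery, not the actual blocken.py API:

```python
class UnsupportedSyntax(Exception):
    pass

def collect_decorator_headers(lines, recognize):
    """Consume lines after a decorator until a recognized header appears.

    Raises UnsupportedSyntax instead of looping forever when the stream
    ends before a def/class-like header is found.
    """
    headers = []
    for line in lines:
        headers.append(line)
        if recognize(line):
            return headers
    raise UnsupportedSyntax(
        "decorator not followed by a recognized block before EOF"
    )
```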

@mbhall88
Member

mbhall88 commented Apr 9, 2026

I made an early attempt a couple of years ago at writing a grammar for snakemake and using Lark to build a parser that could then be used by all projects that need to parse snakemake code: https://github.com/mbhall88/snakemake-grammar. I think that code is probably not worth building on, but I did have a conversation with Johannes and he agreed this would be a wonderful first step. Also see the linked issue, which is the main hinge point here.

So rather than enhancing the snakefmt parser, I think we need to zoom out and develop a broader snakemake grammar, and thus a shared parser.

@Hocnonsense
Contributor Author

Thanks for the pointer to snakemake-grammar and the broader context. I agree completely that the right goal is a unified grammar that all snakemake-adjacent tools can build on, rather than each tool maintaining its own parser.

I've compared the snakefmt and snakemake parser implementations and distilled that into the Block hierarchy in this PR: it already covers snakefmt's formatting needs and validates the refactoring direction, and the goal is to build toward a broader solution.

On the question of lark specifically: I don't think it's the right foundation for a production snakemake parser, for a few reasons:

  • Directive values are arbitrary Python expressions. lark can identify the boundaries of a directive, but the value itself (including expand(...), lambdas, bracket-spanning multiline calls) needs a bracket-aware, Python-semantic parser. lark's CONTENT terminal captures these line by line and loses the expression structure entirely, so a second tokenize pass is always needed anyway.
  • No error recovery. One bad token and the entire file fails to parse. For a runtime parser and for an LSP, partial results on malformed input are essential.
  • No incremental parsing. Every edit re-parses the whole file from scratch, which is fine for a formatter but not for editor tooling.
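The first point can be illustrated with the stdlib tokenizer: only a bracket-aware pass can tell whether a directive value continues past a physical line, which line-by-line capture (as with Lark's CONTENT terminal) cannot (the helper name is mine):

```python
import io
import tokenize

def value_spans_lines(src: str) -> bool:
    """True when open brackets carry the expression across physical lines.

    Inside brackets, the tokenizer emits NL (non-logical newline) tokens,
    so a positive bracket depth at an NL means the value is still open.
    """
    depth = 0
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type == tokenize.OP and tok.string in "([{":
            depth += 1
        elif tok.type == tokenize.OP and tok.string in ")]}":
            depth -= 1
        elif tok.type == tokenize.NL and depth > 0:
            return True
    return False
```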

I think tree-sitter is a better fit here. It handles all of the above: the grammar can inherit from tree-sitter-python (so Python expressions inside directives are parsed correctly, not as opaque line fragments), and it has error recovery and incremental parsing built in. The output CST carries precise byte offsets for every node, which lets snakemake replace its current linemap dict with exact line/column tracking taken directly from the tree, fixing the known line-number inaccuracies in the current parser.

The concrete split would be: tree-sitter-snakemake produces the CST; snakemake walks it to build Python AST nodes with correct source locations and compiles them directly, without the text-substitution step; snakefmt walks the same CST for formatting, using the Block logic that's already implemented and tested in this PR.

I'm planning to start a tree-sitter-snakemake repository to work this out. Happy to coordinate with you and Johannes on the grammar design if that's useful.
