fix: await cancelled subscription tasks on ws shutdown#4319

Open
Flamefork wants to merge 4 commits into strawberry-graphql:main from Flamefork:fix/shutdown-await-tasks

Conversation


@Flamefork Flamefork commented Mar 19, 2026

Description

`cleanup_operation` cancels subscription tasks but does not await them, to avoid blocking the message loop. During WebSocket shutdown, this meant a task's `finally` block could run after shared state (DB pools, the event loop) had already been torn down.

This fix collects cancelled tasks during shutdown and awaits them via asyncio.gather before returning.
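The cancel-then-gather pattern can be illustrated with a self-contained sketch (the `subscription` coroutine below is a stand-in for a real subscription task; names are illustrative, not the actual handler code):

```python
import asyncio


async def shutdown_demo() -> int:
    """Cancel subscription-like tasks, then await them so their
    finally blocks complete before shutdown returns."""
    finished_cleanups = 0

    async def subscription() -> None:
        nonlocal finished_cleanups
        try:
            await asyncio.Event().wait()  # blocks until cancelled
        finally:
            finished_cleanups += 1  # e.g. release a DB connection

    tasks = [asyncio.ensure_future(subscription()) for _ in range(3)]
    await asyncio.sleep(0)  # let the tasks start running

    for task in tasks:
        task.cancel()
    # return_exceptions=True swallows the CancelledError raised in each
    # task, so shutdown itself is not aborted while the finally blocks run.
    await asyncio.gather(*tasks, return_exceptions=True)
    return finished_cleanups


result = asyncio.run(shutdown_demo())
```

Without the `gather`, the counter could still be non-zero at this point, because cancellation only takes effect on a later event-loop iteration.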

Types of Changes

  • Core
  • Bugfix
  • New feature
  • Enhancement/optimization
  • Documentation

Issues Fixed or Closed by This PR

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • I have tested the changes and verified that they work and don't break anything (as well as I can manage).

Summary by Sourcery

Ensure WebSocket shutdown waits for cancelled subscription tasks so their cleanup completes before shared resources are torn down.

Bug Fixes:

  • Fix WebSocket shutdown to await previously cancelled subscription tasks, preventing subscription cleanup from running after shared resources are closed.

Documentation:

  • Document this change as a patch release in RELEASE.md.

Tests:

  • Add a regression test verifying that WebSocket shutdown leaves no active infinity subscriptions after awaiting cancelled subscription tasks.

Contributor

sourcery-ai bot commented Mar 19, 2026

Reviewer's Guide

Ensures that subscription tasks cancelled during GraphQL over WebSocket shutdown are explicitly awaited so their cleanup/finally blocks run before shared resources are torn down, and adds a regression test plus release note for this behavior.

File-Level Changes

Change Details Files
Await cancelled subscription tasks during WebSocket shutdown to ensure proper cleanup ordering.
  • Augment shutdown() to collect all active operation tasks before invoking cleanup_operation()
  • After cancelling operations via cleanup_operation(), await all collected tasks with asyncio.gather(return_exceptions=True) so their finally blocks run before shutdown completes
  • Keep existing behavior of cancelling connection_init_timeout_task and reaping completed tasks
strawberry/subscriptions/protocols/graphql_transport_ws/handlers.py
Add regression test verifying cancelled subscription tasks complete before shutdown returns.
  • Patch DebuggableGraphQLTransportWSHandler.on_init to wrap shutdown with a tracking wrapper that captures active subscription count at the end of shutdown
  • Establish a long‑lived subscription and close the WebSocket, then verify cleanup has driven active_infinity_subscriptions back to 0 by the time shutdown finishes
  • Skip the test for ChannelsHttpClient where on_init cannot be patched, and wait briefly after close to allow shutdown to run
tests/websockets/test_graphql_transport_ws.py
Document the behavior change as a patch-level release note.
  • Introduce RELEASE.md with a patch release entry describing awaiting cancelled subscription tasks during WebSocket shutdown
RELEASE.md

Assessment against linked issues

Issue Objective Addressed Explanation
#4284 Ensure that during WebSocket shutdown, cancelled subscription operation tasks are awaited so their cleanup/finally blocks complete before shared state (e.g. DB pools, event loop) is torn down.
#4284 Add a regression test that verifies WebSocket shutdown waits for subscription cleanup to finish (no lingering active subscription state after shutdown).


Member

botberry commented Mar 19, 2026

Thanks for adding the RELEASE.md file!

Here's a preview of the changelog:


Await cancelled subscription tasks during WebSocket shutdown so their finally blocks run before shared state (DB pools, event loop) is torn down.

Here's the tweet text:

🆕 Release (next) is out! Thanks to Ilia Ablamonov for the PR 👏

Get it here 👉 https://strawberry.rocks/release/(next)

Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • In shutdown, you collect tasks for all operations before calling cleanup_operation, but you rely on op.task still being valid; consider iterating self.operations.values() directly and/or documenting the invariant that cleanup_operation won’t replace the task reference so future maintainers don’t accidentally break this ordering assumption.
  • The test test_shutdown_awaits_cancelled_subscription_tasks uses a fixed asyncio.sleep(0.5) to wait for shutdown; to avoid flakiness, consider synchronizing on a concrete condition (e.g., polling cleanup_done_at_shutdown_end or using an event) instead of a hardcoded delay.

## Individual Comments

### Comment 1
<location path="strawberry/subscriptions/protocols/graphql_transport_ws/handlers.py" line_range="95-104" />
<code_context>
             with suppress(asyncio.CancelledError):
                 await self.connection_init_timeout_task

+        cancelled_tasks: list[asyncio.Task] = []
         for operation_id in list(self.operations.keys()):
+            op = self.operations[operation_id]
+            if op.task:
+                cancelled_tasks.append(op.task)
             await self.cleanup_operation(operation_id)
+
         await self.reap_completed_tasks()
+        # cleanup_operation cancels but does not await tasks (would block
+        # the message loop). Safe to await here — no more messages to process.
+        await asyncio.gather(*cancelled_tasks, return_exceptions=True)

     def on_request_accepted(self) -> None:
</code_context>
<issue_to_address>
**issue (bug_risk):** Consider a timeout or bounded wait for cancelled tasks during shutdown.

Because `cleanup_operation` may cancel tasks that never finish (e.g., stuck in uninterruptible I/O or user code that ignores cancellation), this `asyncio.gather` can block shutdown indefinitely. Consider bounding the wait with `asyncio.wait_for` and a reasonable timeout, then logging and proceeding with shutdown if the timeout is hit.
</issue_to_address>
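A bounded wait along the lines of this suggestion could look like the following sketch. `bounded_shutdown_wait` is a hypothetical helper, not code from the PR; the `stubborn` task simulates user code whose cleanup outlives the shutdown budget:

```python
import asyncio


async def bounded_shutdown_wait(
    tasks: list[asyncio.Task], timeout: float = 5.0
) -> None:
    """Wait for cancelled tasks, but never block shutdown forever."""
    if not tasks:
        return
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    if pending:
        # A task ignored cancellation or has slow cleanup; log and
        # proceed with shutdown instead of hanging indefinitely.
        print(f"{len(pending)} task(s) did not finish within {timeout}s")


async def demo() -> tuple[int, int]:
    async def stubborn() -> None:
        try:
            await asyncio.sleep(3600)
        except asyncio.CancelledError:
            await asyncio.sleep(3600)  # "cleanup" that exceeds the budget

    async def polite() -> None:
        await asyncio.sleep(3600)

    tasks = [asyncio.ensure_future(stubborn()), asyncio.ensure_future(polite())]
    await asyncio.sleep(0)  # let the tasks start
    for task in tasks:
        task.cancel()
    await bounded_shutdown_wait(tasks, timeout=0.1)
    finished = sum(task.done() for task in tasks)
    for task in tasks:
        task.cancel()  # demo-only: finally tear down the stubborn task
    return finished, len(tasks)


finished, total = asyncio.run(demo())
```

Here only the well-behaved task completes within the budget, and shutdown proceeds anyway rather than waiting on the stuck one.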


Contributor

greptile-apps bot commented Mar 19, 2026

Greptile Summary

This patch fixes a race condition where subscription tasks' finally blocks (e.g., decrementing active subscription counters, closing DB connections) could execute after shared state was torn down on WebSocket shutdown. The handler's shutdown() method previously cancelled tasks via cleanup_operation but never awaited them, letting their cleanup run asynchronously after the connection was gone.

The fix is minimal and correct: task references are collected before each cleanup_operation call, and a single asyncio.gather(*cancelled_tasks, return_exceptions=True) at the end of shutdown ensures every cancelled task fully drains (including its finally block) before shutdown returns.

Key observations:

  • Awaiting a task multiple times is safe in Python — tasks that had already been awaited by reap_completed_tasks are simply returned from asyncio.gather immediately with their stored result.
  • return_exceptions=True is correctly used so that CancelledError from the tasks does not propagate and abort the shutdown.
  • The new test accurately exercises the fix by monkey-patching shutdown to read active_infinity_subscriptions after original_shutdown() returns.
  • Minor: the test uses asyncio.sleep(0.5), while the adjacent comparable test test_unexpected_client_disconnects_are_gracefully_handled uses asyncio.sleep(1), which could be slightly less robust in slow CI environments.
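The double-await observation above is easy to verify in isolation: awaiting an already-finished task again (directly or via `asyncio.gather`) does not re-run it, it just returns the stored result.

```python
import asyncio


async def demo() -> list:
    async def work() -> str:
        return "done"

    task = asyncio.ensure_future(work())
    first = await task  # completes the task
    # Awaiting the same task again does not re-execute the coroutine;
    # the stored result is returned immediately.
    again = await task
    gathered = await asyncio.gather(task, return_exceptions=True)
    return [first, again, gathered[0]]


results = asyncio.run(demo())
```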

Confidence Score: 4/5

  • Safe to merge — the fix is logically correct and the only concern is a minor test timing fragility.
  • The core handler change is small, well-reasoned, and handles all edge cases (already-done tasks, double-await safety, return_exceptions=True). The accompanying test correctly validates the fix. One point deducted for the 0.5-second sleep in the test being shorter than the 1-second used in comparable tests, which introduces a small risk of intermittent CI failures.
  • tests/websockets/test_graphql_transport_ws.py — review the asyncio.sleep(0.5) timing assumption.

Important Files Changed

Filename Overview
strawberry/subscriptions/protocols/graphql_transport_ws/handlers.py Adds task reference collection before cleanup_operation and an asyncio.gather after reap_completed_tasks to ensure cancelled subscription tasks' finally blocks complete before shutdown returns. Logic is sound and handles edge cases (already-done tasks, None tasks, double-await safety).
tests/websockets/test_graphql_transport_ws.py Adds test_shutdown_awaits_cancelled_subscription_tasks which patches on_init to wrap shutdown and asserts active_infinity_subscriptions is 0 after shutdown completes. Uses a 0.5-second sleep (inconsistent with the 1-second sleep in the similar adjacent test), which could be fragile on slow CI.
RELEASE.md New release notes file correctly classifies this as a patch release and accurately describes the fix.

Sequence Diagram

```mermaid
sequenceDiagram
    participant WS as WebSocket Client
    participant H as BaseGraphQLTransportWSHandler
    participant T as Subscription Task(s)
    participant EL as Event Loop

    WS->>H: disconnect
    H->>H: shutdown()
    Note over H: Collect op.task refs → cancelled_tasks[]
    loop For each active operation
        H->>T: cleanup_operation() → task.cancel()
    end
    H->>H: reap_completed_tasks()<br/>(awaits already-finished tasks)
    H->>EL: asyncio.gather(*cancelled_tasks,<br/>return_exceptions=True)
    EL->>T: raise CancelledError
    T->>T: finally: active_subscriptions -= 1
    T-->>EL: done
    EL-->>H: gather returns
    Note over H: All finally blocks complete<br/>before shutdown() returns
```

Last reviewed commit: "Fix formatting"


```python
await ws.close()

await asyncio.sleep(0.5)
```
Contributor


P2 Short sleep may cause intermittent failures

The existing similar test test_unexpected_client_disconnects_are_gracefully_handled uses asyncio.sleep(1) to wait for server-side shutdown. Using only 0.5 seconds here could make this test intermittently fail on slow CI systems, since the tracked_shutdown coroutine must fully complete (including the new asyncio.gather over cancelled tasks) within that window.

Suggested change:

```diff
- await asyncio.sleep(0.5)
+ await asyncio.sleep(1)
```

Contributor

Copilot AI left a comment


Pull request overview

Fixes graphql-transport-ws shutdown behavior so cancelled subscription tasks are awaited during WebSocket teardown, ensuring their finally blocks complete before shared resources (e.g., DB pools / event loop) are torn down.

Changes:

  • Track cancelled subscription operation tasks and await them during handler shutdown.
  • Add a regression test asserting subscription cleanup has completed by the end of shutdown.
  • Add a patch release note describing the fix.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
strawberry/subscriptions/protocols/graphql_transport_ws/handlers.py Adds tracking for cancelled tasks and awaits them during shutdown() via a new reap_cancelled_tasks() method.
tests/websockets/test_graphql_transport_ws.py Adds a regression test to ensure shutdown waits for subscription cancellation cleanup.
RELEASE.md Documents the patch release change.


Comment on lines 329 to 336
```python
async def cleanup_operation(self, operation_id: str) -> None:
    if operation_id not in self.operations:
        return
    operation = self.operations.pop(operation_id)
    assert operation.task
    operation.task.cancel()
    self.cancelled_tasks.append(operation.task)
    # do not await the task here, lest we block the main
```

Copilot AI Mar 23, 2026


cleanup_operation appends every cancelled operation task to self.cancelled_tasks, but reap_cancelled_tasks() is only called during shutdown(). For long-lived WebSocket connections that start/stop many subscriptions, this list can grow without bound (and can retain already-finished tasks that were reaped via reap_completed_tasks), causing a memory leak. Consider avoiding the instance-level list by collecting tasks locally inside shutdown() before calling cleanup_operation, or ensure cancelled_tasks is periodically drained without blocking the message loop (e.g., keep only pending tasks, and opportunistically reap task.done() ones in handle_message like reap_completed_tasks does).

Member


This is something that I noticed with the current code. It is not an issue with cancelled_tasks only, but completed_tasks as well.

Maybe we should consider creating a periodic task that cleans up those?

Will discuss this tomorrow with @patrick91, but maybe it can be done in a follow-up PR

Member


we could use task.add_done_callback (let's do this in a new PR)
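The `add_done_callback` idea suggested here keeps the tracking collection self-draining, so finished tasks never accumulate. A sketch under the assumption that tasks are tracked in a set (names like `track` are illustrative, not the handler's API):

```python
import asyncio


async def demo() -> int:
    tracked: set[asyncio.Task] = set()

    def track(task: asyncio.Task) -> None:
        tracked.add(task)
        # Remove the task from the set as soon as it finishes, so the
        # collection cannot grow without bound on a long-lived
        # connection that starts and stops many subscriptions.
        task.add_done_callback(tracked.discard)

    async def op(i: int) -> int:
        await asyncio.sleep(0)
        return i

    for i in range(100):
        track(asyncio.ensure_future(op(i)))
    await asyncio.sleep(0.05)  # let all tasks finish and callbacks run
    return len(tracked)


remaining = asyncio.run(demo())
```

At shutdown, whatever is still in the set is exactly the set of pending tasks that needs cancelling and awaiting.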

Comment on lines +1267 to +1269
```python
await asyncio.sleep(0.5)
assert handler is not None
assert cleanup_done_at_shutdown_end == 0
```

Copilot AI Mar 23, 2026


The shutdown sync point is currently a fixed await asyncio.sleep(0.5). This can be flaky on slow CI (shutdown may not have run yet, leaving cleanup_done_at_shutdown_end as None). Prefer synchronizing deterministically (e.g., set an asyncio.Event in tracked_shutdown and await asyncio.wait_for(event.wait(), timeout=...), or poll cleanup_done_at_shutdown_end is not None with wait_for) so the test fails only when the shutdown invariant is actually broken.
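The event-based synchronization suggested here could look like this sketch (the `tracked_shutdown` wrapper mirrors the one described in the test; the surrounding scaffolding is illustrative):

```python
import asyncio


async def demo() -> bool:
    shutdown_done = asyncio.Event()

    async def original_shutdown() -> None:
        await asyncio.sleep(0.05)  # stands in for the real shutdown work

    async def tracked_shutdown() -> None:
        await original_shutdown()
        shutdown_done.set()  # signal the exact moment shutdown finished

    asyncio.ensure_future(tracked_shutdown())
    # Instead of a fixed sleep, wait on the event with a generous
    # timeout: the test only fails if shutdown truly never completes.
    await asyncio.wait_for(shutdown_done.wait(), timeout=5)
    return shutdown_done.is_set()


ok = asyncio.run(demo())
```

This makes the test's wall-clock budget an upper bound rather than an exact timing assumption, which is what removes the CI flakiness.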


codspeed-hq bot commented Mar 23, 2026

Merging this PR will not alter performance

✅ 31 untouched benchmarks


Comparing Flamefork:fix/shutdown-await-tasks (8259402) with main (d722c53)

Open in CodSpeed

@bellini666
Member

Hi @Flamefork ,

Based on #4319 (comment), I ended up making this change to fix the memory leak: #4345

Maybe it also fixes the issue here? If not, could you rebase this PR to adjust it to the new code?



Development

Successfully merging this pull request may close these issues.

graphql-transport-ws: cancelled subscription tasks are not awaited during shutdown, allowing zombie cleanup to corrupt shared state

5 participants