fix: await cancelled subscription tasks on ws shutdown#4319

Open
Flamefork wants to merge 4 commits into strawberry-graphql:main from Flamefork:fix/shutdown-await-tasks

Conversation


@Flamefork Flamefork commented Mar 19, 2026

Description

`cleanup_operation` cancels subscription tasks but does not await them, to avoid blocking the message loop. During WebSocket shutdown, this meant a task's `finally` block could run after shared state (DB pools, the event loop) had already been torn down.

This fix collects cancelled tasks during shutdown and awaits them via asyncio.gather before returning.
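The cancel-then-gather pattern can be illustrated with a self-contained sketch (the `subscription` coroutine below is a stand-in for a real subscription task; names are illustrative, not the actual handler code):

```python
import asyncio


async def shutdown_demo() -> int:
    """Cancel subscription-like tasks, then await them so their
    finally blocks complete before shutdown returns."""
    finished_cleanups = 0

    async def subscription() -> None:
        nonlocal finished_cleanups
        try:
            await asyncio.Event().wait()  # blocks until cancelled
        finally:
            finished_cleanups += 1  # e.g. release a DB connection

    tasks = [asyncio.ensure_future(subscription()) for _ in range(3)]
    await asyncio.sleep(0)  # let the tasks start running

    for task in tasks:
        task.cancel()
    # return_exceptions=True swallows the CancelledError raised in each
    # task, so shutdown itself is not aborted while the finally blocks run.
    await asyncio.gather(*tasks, return_exceptions=True)
    return finished_cleanups


result = asyncio.run(shutdown_demo())
```

Without the `gather`, the counter could still be non-zero at this point, because cancellation only takes effect on a later event-loop iteration.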

Types of Changes

  • Core
  • Bugfix
  • New feature
  • Enhancement/optimization
  • Documentation

Issues Fixed or Closed by This PR

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • I have tested the changes and verified that they work and don't break anything (as well as I can manage).

Summary by Sourcery

Ensure WebSocket shutdown waits for cancelled subscription tasks so their cleanup completes before shared resources are torn down.

Bug Fixes:

  • Fix WebSocket shutdown to await previously cancelled subscription tasks, preventing subscription cleanup from running after shared resources are closed.

Documentation:

  • Document this change as a patch release in RELEASE.md.

Tests:

  • Add a regression test verifying that WebSocket shutdown leaves no active infinity subscriptions after awaiting cancelled subscription tasks.

Contributor

sourcery-ai bot commented Mar 19, 2026

Reviewer's Guide

Ensures that subscription tasks cancelled during GraphQL over WebSocket shutdown are explicitly awaited so their cleanup/finally blocks run before shared resources are torn down, and adds a regression test plus release note for this behavior.

File-Level Changes

Change Details Files
Await cancelled subscription tasks during WebSocket shutdown to ensure proper cleanup ordering.
  • Augment shutdown() to collect all active operation tasks before invoking cleanup_operation()
  • After cancelling operations via cleanup_operation(), await all collected tasks with asyncio.gather(return_exceptions=True) so their finally blocks run before shutdown completes
  • Keep existing behavior of cancelling connection_init_timeout_task and reaping completed tasks
strawberry/subscriptions/protocols/graphql_transport_ws/handlers.py
Add regression test verifying cancelled subscription tasks complete before shutdown returns.
  • Patch DebuggableGraphQLTransportWSHandler.on_init to wrap shutdown with a tracking wrapper that captures active subscription count at the end of shutdown
  • Establish a long‑lived subscription and close the WebSocket, then verify cleanup has driven active_infinity_subscriptions back to 0 by the time shutdown finishes
  • Skip the test for ChannelsHttpClient where on_init cannot be patched, and wait briefly after close to allow shutdown to run
tests/websockets/test_graphql_transport_ws.py
Document the behavior change as a patch-level release note.
  • Introduce RELEASE.md with a patch release entry describing awaiting cancelled subscription tasks during WebSocket shutdown
RELEASE.md

Assessment against linked issues

Issue Objective Addressed Explanation
#4284 Ensure that during WebSocket shutdown, cancelled subscription operation tasks are awaited so their cleanup/finally blocks complete before shared state (e.g. DB pools, event loop) is torn down.
#4284 Add a regression test that verifies WebSocket shutdown waits for subscription cleanup to finish (no lingering active subscription state after shutdown).


Member

botberry commented Mar 19, 2026

Thanks for adding the RELEASE.md file!

Here's a preview of the changelog:


Await cancelled subscription tasks during WebSocket shutdown so their finally blocks run before shared state (DB pools, event loop) is torn down.

Here's the tweet text:

🆕 Release (next) is out! Thanks to Ilia Ablamonov for the PR 👏

Get it here 👉 https://strawberry.rocks/release/(next)

Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • In shutdown, you collect tasks for all operations before calling cleanup_operation, but you rely on op.task still being valid; consider iterating self.operations.values() directly and/or documenting the invariant that cleanup_operation won’t replace the task reference so future maintainers don’t accidentally break this ordering assumption.
  • The test test_shutdown_awaits_cancelled_subscription_tasks uses a fixed asyncio.sleep(0.5) to wait for shutdown; to avoid flakiness, consider synchronizing on a concrete condition (e.g., polling cleanup_done_at_shutdown_end or using an event) instead of a hardcoded delay.

## Individual Comments

### Comment 1
<location path="strawberry/subscriptions/protocols/graphql_transport_ws/handlers.py" line_range="95-104" />
<code_context>
             with suppress(asyncio.CancelledError):
                 await self.connection_init_timeout_task

+        cancelled_tasks: list[asyncio.Task] = []
         for operation_id in list(self.operations.keys()):
+            op = self.operations[operation_id]
+            if op.task:
+                cancelled_tasks.append(op.task)
             await self.cleanup_operation(operation_id)
+
         await self.reap_completed_tasks()
+        # cleanup_operation cancels but does not await tasks (would block
+        # the message loop). Safe to await here — no more messages to process.
+        await asyncio.gather(*cancelled_tasks, return_exceptions=True)

     def on_request_accepted(self) -> None:
</code_context>
<issue_to_address>
**issue (bug_risk):** Consider a timeout or bounded wait for cancelled tasks during shutdown.

Because `cleanup_operation` may cancel tasks that never finish (e.g., stuck in uninterruptible I/O or user code that ignores cancellation), this `asyncio.gather` can block shutdown indefinitely. Consider bounding the wait with `asyncio.wait_for` and a reasonable timeout, then logging and proceeding with shutdown if the timeout is hit.
</issue_to_address>
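A bounded wait along the lines of this suggestion could look like the following sketch. `bounded_shutdown_wait` is a hypothetical helper, not code from the PR; the `stubborn` task simulates user code whose cleanup outlives the shutdown budget:

```python
import asyncio


async def bounded_shutdown_wait(
    tasks: list[asyncio.Task], timeout: float = 5.0
) -> None:
    """Wait for cancelled tasks, but never block shutdown forever."""
    if not tasks:
        return
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    if pending:
        # A task ignored cancellation or has slow cleanup; log and
        # proceed with shutdown instead of hanging indefinitely.
        print(f"{len(pending)} task(s) did not finish within {timeout}s")


async def demo() -> tuple[int, int]:
    async def stubborn() -> None:
        try:
            await asyncio.sleep(3600)
        except asyncio.CancelledError:
            await asyncio.sleep(3600)  # "cleanup" that exceeds the budget

    async def polite() -> None:
        await asyncio.sleep(3600)

    tasks = [asyncio.ensure_future(stubborn()), asyncio.ensure_future(polite())]
    await asyncio.sleep(0)  # let the tasks start
    for task in tasks:
        task.cancel()
    await bounded_shutdown_wait(tasks, timeout=0.1)
    finished = sum(task.done() for task in tasks)
    for task in tasks:
        task.cancel()  # demo-only: finally tear down the stubborn task
    return finished, len(tasks)


finished, total = asyncio.run(demo())
```

Here only the well-behaved task completes within the budget, and shutdown proceeds anyway rather than waiting on the stuck one.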


Contributor

greptile-apps bot commented Mar 19, 2026

Greptile Summary

This patch fixes a race condition where subscription tasks' finally blocks (e.g., decrementing active subscription counters, closing DB connections) could execute after shared state was torn down on WebSocket shutdown. The handler's shutdown() method previously cancelled tasks via cleanup_operation but never awaited them, letting their cleanup run asynchronously after the connection was gone.

The fix is minimal and correct: task references are collected before each cleanup_operation call, and a single asyncio.gather(*cancelled_tasks, return_exceptions=True) at the end of shutdown ensures every cancelled task fully drains (including its finally block) before shutdown returns.

Key observations:

  • Awaiting a task multiple times is safe in Python — tasks that had already been awaited by reap_completed_tasks are simply returned from asyncio.gather immediately with their stored result.
  • return_exceptions=True is correctly used so that CancelledError from the tasks does not propagate and abort the shutdown.
  • The new test accurately exercises the fix by monkey-patching shutdown to read active_infinity_subscriptions after original_shutdown() returns.
  • Minor: the test uses asyncio.sleep(0.5), while the adjacent comparable test test_unexpected_client_disconnects_are_gracefully_handled uses asyncio.sleep(1), which could be slightly less robust in slow CI environments.
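The double-await observation above is easy to verify in isolation: awaiting an already-finished task again (directly or via `asyncio.gather`) does not re-run it, it just returns the stored result.

```python
import asyncio


async def demo() -> list:
    async def work() -> str:
        return "done"

    task = asyncio.ensure_future(work())
    first = await task  # completes the task
    # Awaiting the same task again does not re-execute the coroutine;
    # the stored result is returned immediately.
    again = await task
    gathered = await asyncio.gather(task, return_exceptions=True)
    return [first, again, gathered[0]]


results = asyncio.run(demo())
```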

Confidence Score: 4/5

  • Safe to merge — the fix is logically correct and the only concern is a minor test timing fragility.
  • The core handler change is small, well-reasoned, and handles all edge cases (already-done tasks, double-await safety, return_exceptions=True). The accompanying test correctly validates the fix. One point deducted for the 0.5-second sleep in the test being shorter than the 1-second used in comparable tests, which introduces a small risk of intermittent CI failures.
  • tests/websockets/test_graphql_transport_ws.py — review the asyncio.sleep(0.5) timing assumption.

Important Files Changed

Filename Overview
strawberry/subscriptions/protocols/graphql_transport_ws/handlers.py Adds task reference collection before cleanup_operation and an asyncio.gather after reap_completed_tasks to ensure cancelled subscription tasks' finally blocks complete before shutdown returns. Logic is sound and handles edge cases (already-done tasks, None tasks, double-await safety).
tests/websockets/test_graphql_transport_ws.py Adds test_shutdown_awaits_cancelled_subscription_tasks which patches on_init to wrap shutdown and asserts active_infinity_subscriptions is 0 after shutdown completes. Uses a 0.5-second sleep (inconsistent with the 1-second sleep in the similar adjacent test), which could be fragile on slow CI.
RELEASE.md New release notes file correctly classifies this as a patch release and accurately describes the fix.

Sequence Diagram

```mermaid
sequenceDiagram
    participant WS as WebSocket Client
    participant H as BaseGraphQLTransportWSHandler
    participant T as Subscription Task(s)
    participant EL as Event Loop

    WS->>H: disconnect
    H->>H: shutdown()
    Note over H: Collect op.task refs → cancelled_tasks[]
    loop For each active operation
        H->>T: cleanup_operation() → task.cancel()
    end
    H->>H: reap_completed_tasks()<br/>(awaits already-finished tasks)
    H->>EL: asyncio.gather(*cancelled_tasks,<br/>return_exceptions=True)
    EL->>T: raise CancelledError
    T->>T: finally: active_subscriptions -= 1
    T-->>EL: done
    EL-->>H: gather returns
    Note over H: All finally blocks complete<br/>before shutdown() returns
```

Last reviewed commit: "Fix formatting"


```python
await ws.close()

await asyncio.sleep(0.5)
```
Contributor


P2 Short sleep may cause intermittent failures

The existing similar test test_unexpected_client_disconnects_are_gracefully_handled uses asyncio.sleep(1) to wait for server-side shutdown. Using only 0.5 seconds here could make this test intermittently fail on slow CI systems, since the tracked_shutdown coroutine must fully complete (including the new asyncio.gather over cancelled tasks) within that window.

Suggested change:

```diff
- await asyncio.sleep(0.5)
+ await asyncio.sleep(1)
```

Contributor

Copilot AI left a comment


Pull request overview

Fixes graphql-transport-ws shutdown behavior so cancelled subscription tasks are awaited during WebSocket teardown, ensuring their finally blocks complete before shared resources (e.g., DB pools / event loop) are torn down.

Changes:

  • Track cancelled subscription operation tasks and await them during handler shutdown.
  • Add a regression test asserting subscription cleanup has completed by the end of shutdown.
  • Add a patch release note describing the fix.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
strawberry/subscriptions/protocols/graphql_transport_ws/handlers.py Adds tracking for cancelled tasks and awaits them during shutdown() via a new reap_cancelled_tasks() method.
tests/websockets/test_graphql_transport_ws.py Adds a regression test to ensure shutdown waits for subscription cancellation cleanup.
RELEASE.md Documents the patch release change.


Comment on lines 329 to 336
```python
async def cleanup_operation(self, operation_id: str) -> None:
    if operation_id not in self.operations:
        return
    operation = self.operations.pop(operation_id)
    assert operation.task
    operation.task.cancel()
    self.cancelled_tasks.append(operation.task)
    # do not await the task here, lest we block the main
```

Copilot AI Mar 23, 2026


cleanup_operation appends every cancelled operation task to self.cancelled_tasks, but reap_cancelled_tasks() is only called during shutdown(). For long-lived WebSocket connections that start/stop many subscriptions, this list can grow without bound (and can retain already-finished tasks that were reaped via reap_completed_tasks), causing a memory leak. Consider avoiding the instance-level list by collecting tasks locally inside shutdown() before calling cleanup_operation, or ensure cancelled_tasks is periodically drained without blocking the message loop (e.g., keep only pending tasks, and opportunistically reap task.done() ones in handle_message like reap_completed_tasks does).

Member


This is something that I noticed with the current code. It is not an issue with cancelled_tasks only, but completed_tasks as well.

Maybe we should consider creating a periodic task that cleans up those?

Will discuss this tomorrow with @patrick91, but maybe it can be done in a follow-up PR

Member


we could use task.add_done_callback (let's do this in a new PR)
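The `add_done_callback` idea suggested here keeps the tracking collection self-draining, so finished tasks never accumulate. A sketch under the assumption that tasks are tracked in a set (names like `track` are illustrative, not the handler's API):

```python
import asyncio


async def demo() -> int:
    tracked: set[asyncio.Task] = set()

    def track(task: asyncio.Task) -> None:
        tracked.add(task)
        # Remove the task from the set as soon as it finishes, so the
        # collection cannot grow without bound on a long-lived
        # connection that starts and stops many subscriptions.
        task.add_done_callback(tracked.discard)

    async def op(i: int) -> int:
        await asyncio.sleep(0)
        return i

    for i in range(100):
        track(asyncio.ensure_future(op(i)))
    await asyncio.sleep(0.05)  # let all tasks finish and callbacks run
    return len(tracked)


remaining = asyncio.run(demo())
```

At shutdown, whatever is still in the set is exactly the set of pending tasks that needs cancelling and awaiting.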

Comment on lines +1267 to +1269
```python
await asyncio.sleep(0.5)
assert handler is not None
assert cleanup_done_at_shutdown_end == 0
```

Copilot AI Mar 23, 2026


The shutdown sync point is currently a fixed await asyncio.sleep(0.5). This can be flaky on slow CI (shutdown may not have run yet, leaving cleanup_done_at_shutdown_end as None). Prefer synchronizing deterministically (e.g., set an asyncio.Event in tracked_shutdown and await asyncio.wait_for(event.wait(), timeout=...), or poll cleanup_done_at_shutdown_end is not None with wait_for) so the test fails only when the shutdown invariant is actually broken.
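The event-based synchronization suggested here could look like this sketch (the `tracked_shutdown` wrapper mirrors the one described in the test; the surrounding scaffolding is illustrative):

```python
import asyncio


async def demo() -> bool:
    shutdown_done = asyncio.Event()

    async def original_shutdown() -> None:
        await asyncio.sleep(0.05)  # stands in for the real shutdown work

    async def tracked_shutdown() -> None:
        await original_shutdown()
        shutdown_done.set()  # signal the exact moment shutdown finished

    asyncio.ensure_future(tracked_shutdown())
    # Instead of a fixed sleep, wait on the event with a generous
    # timeout: the test only fails if shutdown truly never completes.
    await asyncio.wait_for(shutdown_done.wait(), timeout=5)
    return shutdown_done.is_set()


ok = asyncio.run(demo())
```

This makes the test's wall-clock budget an upper bound rather than an exact timing assumption, which is what removes the CI flakiness.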


codspeed-hq bot commented Mar 23, 2026

Merging this PR will not alter performance

✅ 31 untouched benchmarks


Comparing Flamefork:fix/shutdown-await-tasks (8259402) with main (d722c53)

Open in CodSpeed

@bellini666
Member

Hi @Flamefork ,

Based on #4319 (comment), I ended up making this change to fix the memory leak: #4345

Maybe it also fixes the issue here? If not, could you rebase this PR to adjust it to the new code?



Development

Successfully merging this pull request may close these issues.

graphql-transport-ws: cancelled subscription tasks are not awaited during shutdown, allowing zombie cleanup to corrupt shared state

5 participants