9 changes: 6 additions & 3 deletions apps/sim/executor/execution/block-executor.ts
@@ -36,6 +36,7 @@ import {
} from '@/executor/types'
import { streamingResponseFormatProcessor } from '@/executor/utils'
import { buildBlockExecutionError, normalizeError } from '@/executor/utils/errors'
import { withRetry } from '@/executor/utils/retry'
import {
  buildUnifiedParentIterations,
  getIterationContext,
@@ -120,9 +121,11 @@ export class BlockExecutor {
cleanupSelfReference?.()

try {
-      const output = handler.executeWithNode
-        ? await handler.executeWithNode(ctx, block, resolvedInputs, nodeMetadata)
-        : await handler.execute(ctx, block, resolvedInputs)
+      const output = await withRetry(
+        () => handler.executeWithNode
+          ? handler.executeWithNode(ctx, block, resolvedInputs, nodeMetadata)
+          : handler.execute(ctx, block, resolvedInputs)
+      )
Contributor

P1 Retry applied to all block types, not just LLM blocks

The withRetry wrapper is placed around every block's handler execution — not just AI/LLM blocks. This means any block that calls an external service (HTTP request blocks, webhook blocks, database blocks, email blocks, etc.) will also be retried up to 3 times on a 429 or 503 response.

For non-idempotent operations — e.g. sending an email, creating a database row, triggering a webhook — this can silently create 2–4 duplicate operations per execution. The PR description says "all agent blocks automatically retry," but the implementation actually retries all block types.

Consider either:

  1. Only retrying within the LLM/AI provider layer (not at the block executor level), or
  2. Checking the block type before applying retry logic:
const isLLMBlock = /* check block.metadata?.id against known LLM block types */
const output = isLLMBlock
  ? await withRetry(() => handler.executeWithNode ? ... : ...)
  : await (handler.executeWithNode ? ... : ...)
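A runnable sketch of option 2; the block type names in `LLM_BLOCK_TYPES` and the `block.metadata?.id` shape are illustrative assumptions, not the repo's actual constants:

```typescript
// Hypothetical gate: only blocks whose type is known to be an AI/LLM call
// are retried blindly. The set contents are assumptions for illustration.
const LLM_BLOCK_TYPES = new Set(['agent', 'router', 'evaluator'])

interface BlockLike {
  metadata?: { id?: string }
}

function shouldRetry(block: BlockLike): boolean {
  // HTTP, webhook, email, and database blocks may not be idempotent,
  // so they fall through to a single attempt.
  return LLM_BLOCK_TYPES.has(block.metadata?.id ?? '')
}
```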

Author

Fixed in 70ce97f — addressed all review feedback: scoped retries to LLM blocks only, added isNetworkError() for statusless errors, added Retry-After support for 503, and added afterEach vi.useRealTimers() to tests.


const isStreamingExecution =
output && typeof output === 'object' && 'stream' in output && 'execution' in output
74 changes: 74 additions & 0 deletions apps/sim/executor/utils/retry.test.ts
@@ -0,0 +1,74 @@
import { describe, it, expect, vi, beforeEach } from 'vitest'
import { withRetry } from './retry'

describe('withRetry', () => {
  beforeEach(() => {
    vi.useFakeTimers()
  })

  it('returns result immediately on success', async () => {
Comment on lines +6 to +13
Contributor

P2 Missing afterEach to restore real timers

vi.useFakeTimers() is called in beforeEach but there is no corresponding vi.useRealTimers() in an afterEach. If Vitest runs these tests in the same worker as other test files that depend on real timers (e.g. setTimeout, Date.now()), the leaked fake timer state can cause subtle test failures elsewhere.

afterEach(() => {
  vi.useRealTimers()
})

Author

Fixed in 70ce97f — addressed all review feedback: scoped retries to LLM blocks only, added isNetworkError() for statusless errors, added Retry-After support for 503, and added afterEach vi.useRealTimers() to tests.

    const fn = vi.fn().mockResolvedValue('ok')
    const result = await withRetry(fn)
    expect(result).toBe('ok')
    expect(fn).toHaveBeenCalledTimes(1)
  })

  it('retries on 429 and succeeds on second attempt', async () => {
    const error = Object.assign(new Error('rate limited'), { status: 429 })
    const fn = vi.fn()
      .mockRejectedValueOnce(error)
      .mockResolvedValue('ok')
    const promise = withRetry(fn, { initialDelayMs: 10, maxRetries: 3 })
    await vi.runAllTimersAsync()
    const result = await promise
    expect(result).toBe('ok')
    expect(fn).toHaveBeenCalledTimes(2)
  })

  it('retries up to maxRetries times then throws', async () => {
    const error = Object.assign(new Error('service unavailable'), { status: 503 })
    const fn = vi.fn().mockRejectedValue(error)
    const promise = withRetry(fn, { maxRetries: 3, initialDelayMs: 10 })
    await vi.runAllTimersAsync()
    await expect(promise).rejects.toThrow('service unavailable')
    expect(fn).toHaveBeenCalledTimes(4)
  })

  it('does NOT retry on 400 bad request', async () => {
    const error = Object.assign(new Error('bad request'), { status: 400 })
    const fn = vi.fn().mockRejectedValue(error)
    await expect(withRetry(fn, { maxRetries: 3 })).rejects.toThrow('bad request')
    expect(fn).toHaveBeenCalledTimes(1)
  })

  it('does NOT retry on 401 unauthorized', async () => {
    const error = Object.assign(new Error('unauthorized'), { status: 401 })
    const fn = vi.fn().mockRejectedValue(error)
    await expect(withRetry(fn, { maxRetries: 3 })).rejects.toThrow('unauthorized')
    expect(fn).toHaveBeenCalledTimes(1)
  })

  it('retries on network error with no status code', async () => {
    const error = new Error('network failure')
    const fn = vi.fn()
      .mockRejectedValueOnce(error)
      .mockResolvedValue('recovered')
    const promise = withRetry(fn, { maxRetries: 3, initialDelayMs: 10 })
    await vi.runAllTimersAsync()
    const result = await promise
    expect(result).toBe('recovered')
    expect(fn).toHaveBeenCalledTimes(2)
  })

  it('respects Retry-After header on 429', async () => {
    const headers = new Headers({ 'retry-after': '2' })
    const error = Object.assign(new Error('rate limited'), { status: 429, headers })
    const fn = vi.fn()
      .mockRejectedValueOnce(error)
      .mockResolvedValue('ok')
    const promise = withRetry(fn, { maxRetries: 3, initialDelayMs: 100 })
    await vi.runAllTimersAsync()
    await promise
    expect(fn).toHaveBeenCalledTimes(2)
  })
})
95 changes: 95 additions & 0 deletions apps/sim/executor/utils/retry.ts
@@ -0,0 +1,95 @@
import { createLogger } from '@/lib/logging/client'

const logger = createLogger('retry')

export interface RetryOptions {
  maxRetries?: number
  initialDelayMs?: number
  maxDelayMs?: number
  retryableStatusCodes?: number[]
}

const RETRY_DEFAULTS = {
  MAX_RETRIES: 3,
  INITIAL_DELAY_MS: 1000,
  MAX_DELAY_MS: 30000,
  RETRYABLE_STATUS_CODES: [429, 503, 529],
} as const
function calculateBackoffDelay(attempt: number, initialDelayMs: number, maxDelayMs: number): number {
  const exponential = initialDelayMs * Math.pow(2, attempt)
  const capped = Math.min(maxDelayMs, exponential)
  const jitter = Math.random() * capped * 0.2
  return Math.min(maxDelayMs, Math.floor(capped + jitter))
}

Jitter eliminated when backoff delay reaches max cap

Low Severity

In calculateBackoffDelay, capped is already clamped to maxDelayMs, but then jitter is added to capped and the result is clamped again by Math.min(..., maxDelayMs). When exponential >= maxDelayMs, the jitter is always clamped back to zero, producing a deterministic maxDelayMs with no randomization. This defeats the purpose of jitter (preventing thundering herd) once the exponential component reaches the cap. The jitter needs to be subtracted from (or applied within) capped rather than added on top of it.

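One way to preserve the jitter band once the exponential term saturates is to subtract the jitter from the capped value instead of adding it on top. A sketch of that variant (full-jitter schemes work equally well; this is not the repo's code):

```typescript
// Sketch: keep jitter inside the cap by subtracting up to 20%, so the delay
// stays randomized in [0.8 * capped, capped] even after the exponential term
// has saturated at maxDelayMs.
function backoffWithJitter(attempt: number, initialDelayMs: number, maxDelayMs: number): number {
  const exponential = initialDelayMs * Math.pow(2, attempt)
  const capped = Math.min(maxDelayMs, exponential)
  const jitter = Math.random() * capped * 0.2
  return Math.floor(capped - jitter)
}
```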


function parseRetryAfterHeader(headers: Headers): number | null {
  const retryAfter = headers.get('retry-after')
  if (!retryAfter) return null
  const seconds = parseFloat(retryAfter)
  if (!isNaN(seconds)) return Math.ceil(seconds * 1000)
  const date = new Date(retryAfter)
  if (!isNaN(date.getTime())) {
    const delayMs = date.getTime() - Date.now()
    return delayMs > 0 ? delayMs : null
  }
  return null
}

/**
 * Wraps an async function with retry logic using exponential backoff and jitter.
 * Respects Retry-After headers from LLM providers on 429 responses.
 * Only retries on transient errors (429, 503, 529) — never on user errors (4xx).
 */
export async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {}
): Promise<T> {
  const maxRetries = options.maxRetries ?? RETRY_DEFAULTS.MAX_RETRIES
  const initialDelayMs = options.initialDelayMs ?? RETRY_DEFAULTS.INITIAL_DELAY_MS
  const maxDelayMs = options.maxDelayMs ?? RETRY_DEFAULTS.MAX_DELAY_MS
  const retryableStatusCodes = options.retryableStatusCodes ?? RETRY_DEFAULTS.RETRYABLE_STATUS_CODES

  let lastError: unknown

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn()
    } catch (error) {
      lastError = error

      if (attempt === maxRetries) break

      const status = (error as any)?.status ?? (error as any)?.statusCode ?? null
      const responseHeaders: Headers | null = (error as any)?.headers ?? null

      const isRetryable =
        status === null ||
        retryableStatusCodes.includes(status)
Contributor

P1 All statusless errors trigger retries, including non-transient JS errors

When status === null (no .status or .statusCode on the thrown error), isRetryable is true. This catches genuine network errors — but it also retries any JavaScript runtime error, validation error, or unexpected exception that doesn't carry a status code.

For example, a TypeError, RangeError, or any internal error thrown by a block handler will be silently retried 3 times with up to 30 seconds of total delay before finally failing. This masks bugs and significantly degrades the user-facing error latency for logic errors.

Consider reversing the default: treat unknown errors as non-retryable unless the error is explicitly recognized as transient (e.g., it is a known network error type, or it matches a recognizable pattern):

const isRetryable =
  status !== null
    ? retryableStatusCodes.includes(status)
    : isNetworkError(error) // helper that checks error.code, error.name, etc.
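A hypothetical `isNetworkError()` along the lines suggested above; the socket error-code list and the undici `fetch failed` pattern are assumptions about common transient Node.js failures, not anything this PR defines:

```typescript
// Assumed set of Node.js socket/DNS error codes that indicate a transient
// network failure rather than a logic bug.
const NETWORK_ERROR_CODES = new Set([
  'ECONNRESET', 'ECONNREFUSED', 'ETIMEDOUT', 'ENOTFOUND', 'EPIPE', 'EAI_AGAIN',
])

function isNetworkError(error: unknown): boolean {
  const code = (error as { code?: string } | null)?.code
  if (code && NETWORK_ERROR_CODES.has(code)) return true
  // Node's fetch (undici) surfaces socket failures as TypeError: fetch failed
  return error instanceof TypeError && /fetch failed/i.test((error as Error).message)
}
```

With this helper, a `TypeError` or `RangeError` thrown by a block handler fails fast instead of being retried three times.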

Author

Fixed in 70ce97f — addressed all review feedback: scoped retries to LLM blocks only, added isNetworkError() for statusless errors, added Retry-After support for 503, and added afterEach vi.useRealTimers() to tests.


      if (!isRetryable) {
        logger.warn('Non-retryable error, aborting retry loop', { status, attempt })
        throw error
      }

      let delayMs: number
      if (responseHeaders && status === 429) {
        const retryAfterMs = parseRetryAfterHeader(responseHeaders)
        delayMs = retryAfterMs ?? calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
      } else {
        delayMs = calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
      }
Contributor

P2 Retry-After header not respected for 503 responses

The Retry-After header is parsed and used only when status === 429. However, RFC 7231 §7.1.3 specifies that servers sending 503 Service Unavailable may also include a Retry-After header to signal when the service will recover. Both OpenAI and Anthropic can send 503s with Retry-After.

// suggestion
      if (responseHeaders && (status === 429 || status === 503)) {
        const retryAfterMs = parseRetryAfterHeader(responseHeaders)
        delayMs = retryAfterMs ?? calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
      } else {
        delayMs = calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
      }

Author

Fixed in 70ce97f — addressed all review feedback: scoped retries to LLM blocks only, added isNetworkError() for statusless errors, added Retry-After support for 503, and added afterEach vi.useRealTimers() to tests.


      logger.warn('Retrying after transient LLM provider error', {
        attempt: attempt + 1,
        maxRetries,
        status,
        delayMs,
      })

      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }

  throw lastError
}
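The `parseRetryAfterHeader` helper above handles both delta-seconds and HTTP-date forms of the header. Copied out of the diff, it can be exercised standalone (assuming the `Headers` global available in Node 18+):

```typescript
// parseRetryAfterHeader copied verbatim from the diff, run outside the suite.
function parseRetryAfterHeader(headers: Headers): number | null {
  const retryAfter = headers.get('retry-after')
  if (!retryAfter) return null
  const seconds = parseFloat(retryAfter)
  if (!isNaN(seconds)) return Math.ceil(seconds * 1000)
  const date = new Date(retryAfter)
  if (!isNaN(date.getTime())) {
    const delayMs = date.getTime() - Date.now()
    return delayMs > 0 ? delayMs : null
  }
  return null
}

// Delta-seconds form: '2' becomes 2000 ms.
console.log(parseRetryAfterHeader(new Headers({ 'retry-after': '2' }))) // → 2000
// HTTP-date form: a date in the past yields null rather than a negative delay.
console.log(parseRetryAfterHeader(new Headers({ 'retry-after': 'Wed, 21 Oct 2015 07:28:00 GMT' }))) // → null
```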