LCORE-1591: Observability for Lightspeed Core#1482
LCORE-1591: Observability for Lightspeed Core#1482asimurka wants to merge 1 commit intolightspeed-core:mainfrom
Conversation
WalkthroughAdds a new OpenTelemetry tracing design spec for Lightspeed Core: YAML-driven global enablement with required OTLP exporter config, independent inbound/outbound W3C trace-context propagation toggles, FastAPI auto-instrumentation plus manual spans, lifecycle/resilience rules, deployment and env var guidance, and library-mode limitations. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant LCORE as Lightspeed Core (FastAPI)
participant Llama as Llama Stack / External Service
participant OTLP as OTLP Collector/Exporter
Client->>LCORE: HTTP request (incoming)
activate LCORE
LCORE->>LCORE: Extract inbound W3C trace-context (if enabled)
LCORE->>LCORE: FastAPI auto-instrumentation creates server span
LCORE->>Llama: Outbound HTTP call (inject trace-context via httpx hook if enabled)
Llama-->>LCORE: Response
LCORE->>OTLP: Export spans to OTLP endpoint (OTLP exporter)
deactivate LCORE
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/observability-opentelemetry/observability-opentelemetry.md`:
- Around line 193-243: Fix markdown lint failures MD022 and MD031 by ensuring
there's exactly one blank line before and after each heading and that fenced
code blocks are preceded and followed by a blank line; specifically adjust
spacing around the "Configuration model:", "Pros:", "Cons:", "Important
constraints:", and the "Option 2 — Manual SDK initialization with YAML-driven
hybrid configuration" heading blocks in the observability-opentelemetry.md
content (lines shown in the review) so headings have a single blank line
separation and add/remove blank lines around the fenced code block at line ~297
to ensure one blank line before and after the triple-backtick fence.
- Line 107: Add a new "Verification / Test Plan" section in
observability-opentelemetry.md (to satisfy R10) that contains explicit
step-by-step validation checks and expected results: include subsections for
"Startup validation" (checks that tracing initializes, config/env var checks,
sample commands and expected logs), "Inbound continuation" (send an incoming
request and verify trace IDs are continued with expected span relationships),
"Outbound propagation" (make an outbound call and verify propagation headers and
downstream spans), and "Exporter failure behavior" (simulate exporter downtime
and verify retry/drop behavior and observable error logs); for each subsection
list exact commands or curl examples, environment variables to set, acceptance
criteria, and sample evidence (log snippets or trace UI screenshots).
- Line 298: Fix the typos in the documentation text: change "initializaion" to
"initialization" in the section heading "Enrichment of Llama Stack server-mode
client initializaion" and change "primarly" to "primarily" where it appears
(line referenced as 582). Update those two words in the markdown content so the
normative spec uses correct spelling.
- Around line 56-57: Change the Setext-style line so it is treated as a normal
paragraph: replace the standalone line "When tracing is disabled, no spans are
created and no tracing-related processing is performed." so it is followed by a
blank line before the thematic break (the sequence of hyphens) or remove the
hyphen-only line immediately after; ensure the hyphen separator remains a proper
horizontal rule by having an empty line above it so markdownlint MD003 no longer
interprets that sentence as a Setext heading.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 96b8e5e7-b660-45b6-8800-46ea7d3f5849
📒 Files selected for processing (1)
docs/design/observability-opentelemetry/observability-opentelemetry.md
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: list_outdated_dependencies
- GitHub Check: radon
- GitHub Check: unit_tests (3.13)
- GitHub Check: unit_tests (3.12)
- GitHub Check: integration_tests (3.12)
- GitHub Check: integration_tests (3.13)
- GitHub Check: build-pr
- GitHub Check: mypy
- GitHub Check: bandit
- GitHub Check: Pyright
- GitHub Check: Pylinter
- GitHub Check: E2E: server mode / ci
- GitHub Check: E2E Tests for Lightspeed Evaluation job
- GitHub Check: E2E: library mode / ci
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
🪛 LanguageTool
docs/design/observability-opentelemetry/observability-opentelemetry.md
[grammar] ~298-~298: Ensure spelling is correct
Context: ...hment of Llama Stack server-mode client initializaion: python client_kwargs: dict[str, Any] = { "base_url": base_url, "api_key": api_key, "timeout": config.timeout, } if distributed_tracing_to_llama_enabled() and base_url is not None: client_kwargs["http_client"] = llama_stack_httpx_async_client( base_url=base_url, timeout=config.timeout, ) self._lsc = AsyncLlamaStackClient(**client_kwargs) Library mode: This wiring applies only...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.22.0)
docs/design/observability-opentelemetry/observability-opentelemetry.md
[warning] 56-56: Heading style
Expected: atx; Actual: setext
(MD003, heading-style)
[warning] 193-193: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 198-198: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 204-204: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 209-209: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 223-223: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 231-231: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 237-237: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 243-243: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 297-297: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
67f9919 to
e0e6b4f
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
docs/design/observability-opentelemetry/observability-opentelemetry.md (2)
194-244:⚠️ Potential issue | 🟡 MinorFix markdownlint blockers (MD022/MD031) in heading/fence spacing.
Line 194/199/205/210/224/232/238/244 need blank lines around headings, and Line 298 needs a blank line after the fenced code block before prose on Line 299.
Minimal formatting fix pattern
### Configuration model: + - All tracing configuration is provided via `OTEL_*` environment variables ... -``` -Enrichment of Llama Stack server-mode client initialization: +``` + +Enrichment of Llama Stack server-mode client initialization:Also applies to: 298-299
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/observability-opentelemetry/observability-opentelemetry.md` around lines 194 - 244, The markdown has lint errors MD022/MD031 due to missing blank lines around several headings and a missing blank line after a fenced code block; add a single blank line before and after each top-level and subheading such as "Configuration model:", "Pros:", "Cons:", "Important constraints:", "Option 2 — Manual SDK initialization with YAML-driven hybrid configuration" and the other listed headings, and ensure there is an empty line after any fenced code block before the following paragraph (e.g., the fenced block that precedes "Enrichment of Llama Stack server-mode client initialization:"); this will satisfy markdownlint spacing rules without changing heading text or content.
107-108:⚠️ Potential issue | 🟠 MajorAdd an explicit verification/test plan section to satisfy R10.
The requirement on Line 108 says docs must describe verification, but there is no concrete, stepwise validation procedure with expected evidence.
Suggested structure to add
+## Verification / Test Plan + +### 1) Startup validation +- Configure `observability.otel.enabled=true` with valid `export.endpoint/protocol/service`. +- Expected: app starts, tracer initializes, no config validation errors. + +### 2) Missing mandatory export fields +- Remove one mandatory field (for example `service`). +- Expected: startup fails with validation error. + +### 3) Inbound continuation +- Send request with `traceparent`. +- Expected: server span continues same trace ID. + +### 4) Outbound propagation (service mode) +- Enable `propagation.outgoing`. +- Expected: downstream call carries W3C trace headers and links in same trace. + +### 5) Exporter failure behavior +- Point endpoint to unreachable collector. +- Expected: user-facing responses unaffected; tracing errors observable in logs. + +### 6) Sensitive data check +- Inspect span attributes/events. +- Expected: no raw prompts/retrieved content.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/observability-opentelemetry/observability-opentelemetry.md` around lines 107 - 108, Add a new "Verification / Test Plan" section to the observability-opentelemetry.md doc under the existing "R10 – Documentation" heading that provides a concise, stepwise validation procedure: list required and optional env vars to set (e.g., OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_METRICS_EXPORTER), commands to run the service and otel-collector/test backends, exact actions to generate traces (e.g., sample HTTP requests or workflows calling instrumented functions), how to confirm traces/spans appear in the selected backend (Jaeger/OTLP/console) and sample log/trace evidence to look for, and pass/fail criteria (e.g., span counts, presence of service.name, correct attributes); include minimal example verification commands and expected outputs so reviewers can follow and mark R10 satisfied.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/observability-opentelemetry/observability-opentelemetry.md`:
- Around line 178-180: The document currently gives conflicting normative
guidance: Step 1 endorses Option 3 (hybrid YAML+env) while later text (around
"Line 250") favors env-driven auto-instrumentation (Option 2) causing ambiguity;
pick one normative path and update the doc so Step 1, the summary paragraph (the
sentence mentioning Option 1/2/3), and the later lines referencing env-driven
auto-instrumentation consistently state the chosen approach (e.g., if keeping
Option 3, change the "env-driven auto-instrumentation is generally preferred"
sentence to explain that env-vars are supported but the canonical approach is
configuration-driven via LCORE and requires explicit tracing fields; if choosing
Option 2, change Step 1 and the summary to promote env-driven
auto-instrumentation and note when hybrid YAML+env is acceptable), and ensure
references to "opentelemetry-instrument", "LCORE", "Option 2", and "Option 3"
are updated for consistency.
- Around line 175-176: The doc contains contradictory guidance about missing
mandatory OTEL fields: one place says startup should fail when enabled:true is
missing endpoint/protocol/service name, another suggests the app may instead
“refuse tracing”; pick one normative behavior (e.g., "startup MUST fail" or
"startup MAY continue but disable tracing") and make both occurrences consistent
by replacing the ambiguous phrase with the chosen policy; update the paragraphs
that mention "If `enabled: true`, YAML must contain mandatory endpoint,
protocol, and service name; startup should fail if missing" and the later "or
refuse tracing" sentence so they both state the exact chosen action, and ensure
any precedence wording about YAML vs OTEL_* env vars remains consistent with
that decision.
---
Duplicate comments:
In `@docs/design/observability-opentelemetry/observability-opentelemetry.md`:
- Around line 194-244: The markdown has lint errors MD022/MD031 due to missing
blank lines around several headings and a missing blank line after a fenced code
block; add a single blank line before and after each top-level and subheading
such as "Configuration model:", "Pros:", "Cons:", "Important constraints:",
"Option 2 — Manual SDK initialization with YAML-driven hybrid configuration" and
the other listed headings, and ensure there is an empty line after any fenced
code block before the following paragraph (e.g., the fenced block that precedes
"Enrichment of Llama Stack server-mode client initialization:"); this will
satisfy markdownlint spacing rules without changing heading text or content.
- Around line 107-108: Add a new "Verification / Test Plan" section to the
observability-opentelemetry.md doc under the existing "R10 – Documentation"
heading that provides a concise, stepwise validation procedure: list required
and optional env vars to set (e.g., OTEL_SERVICE_NAME,
OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_METRICS_EXPORTER), commands to run the service
and otel-collector/test backends, exact actions to generate traces (e.g., sample
HTTP requests or workflows calling instrumented functions), how to confirm
traces/spans appear in the selected backend (Jaeger/OTLP/console) and sample
log/trace evidence to look for, and pass/fail criteria (e.g., span counts,
presence of service.name, correct attributes); include minimal example
verification commands and expected outputs so reviewers can follow and mark R10
satisfied.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a79fc6b4-a3ec-477e-b931-22b2a9fabdad
📒 Files selected for processing (1)
docs/design/observability-opentelemetry/observability-opentelemetry.md
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: build-pr
- GitHub Check: E2E: server mode / ci
- GitHub Check: E2E: library mode / ci
- GitHub Check: E2E Tests for Lightspeed Evaluation job
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
🪛 markdownlint-cli2 (0.22.0)
docs/design/observability-opentelemetry/observability-opentelemetry.md
[warning] 194-194: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 199-199: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 205-205: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 210-210: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 224-224: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 232-232: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 238-238: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 244-244: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 298-298: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
tisnik
left a comment
There was a problem hiding this comment.
It looks ok!
Q. for discussion: what happen if we won't include any Llama Stack traces? IMHO not much harm will be done as the only relevant + visible part is LCORE. In Llama Stack, the flow can change in future, so tracing will be a bit unstable from (for example) point of view of engineer, who wants to improve performance.
Context: almost all LS calls are pretty low level, so even LCORE traces might have enough granularity.
But I'm open for discussion.
| - When both YAML and env vars define the same concern, **YAML mandatory fields take precedence** for endpoint, protocol, and service name; `OTEL_*` variables control optional advanced settings (sampling, headers, TLS files, instrumentor-only options). | ||
|
|
||
| **Recommendation:** Option 1 is rejected for maintainability. Option 2 remains viable if LCORE is run with `opentelemetry-instrument`. Option 3 leverages LCORE’s configuration-driven design to ensure mandatory fields are always explicit when tracing is enabled. | ||
|
|
There was a problem hiding this comment.
option #2 seems promising.
Q - will those options be visible in config endpoint? If not, how we help admins with identifying tracing-related problems?
Q. for the future - how to handle configuration changes in this situation?
There was a problem hiding this comment.
No, these options won't be visible in config endpoint by default. We can possibly implement some "injection" into v1/config response model if desired. Problem is that there is a lot of configurable options for OTEL (see this link) but it is possible to collect all environment variables with OTEL_ prefix and inject them into config.
To the best of my knowledge, OTEL_* environment variables are collected during process startup. As a result, any change to these variables requires the service process to be restarted/redeployed to apply changes.
There was a problem hiding this comment.
🧩 Analysis chain
@tisnik, I've created a GitHub issue to track the exposure of OTEL_* environment variables in the v1/config endpoint. The issue covers:
- The problem:
OTEL_*settings are not visible viav1/config, making it hard for admins to diagnose tracing configuration - The proposed solution: collect all
OTEL_-prefixed env vars and inject them into the config response - A note that changes to these variables require a process restart
I'll share the issue link once the script completes.
🏁 Script executed:
#!/bin/bash
gh issue create \
--repo lightspeed-core/lightspeed-stack \
--title "Expose OTEL_* environment variables in the v1/config endpoint" \
--body "## Summary
When OpenTelemetry tracing is configured via \`OTEL_*\` environment variables (Option 2 or Option 3 from the observability architecture), these settings are currently not visible through the LCORE \`v1/config\` endpoint. This makes it difficult for administrators to:
- Verify which tracing configuration is active
- Diagnose tracing-related problems
- Understand the effective OpenTelemetry configuration at runtime
## Proposed Solution
Collect all environment variables with the \`OTEL_\` prefix and inject them into the \`v1/config\` response model, providing administrators visibility into the active tracing configuration alongside other LCORE settings.
Note: As \`OTEL_*\` environment variables are read at process startup, any configuration changes require a service restart/redeployment to take effect. This should be clearly communicated in the config endpoint response or documentation.
## Related
- PR: https://github.com/lightspeed-core/lightspeed-stack/pull/1482
- Discussion: https://github.com/lightspeed-core/lightspeed-stack/pull/1482#discussion_r3062907258
- Requested by: `@tisnik`" \
--assignee tisnik
Length of output: 63
@tisnik, the issue has been created and assigned to you: #1499
It tracks exposing OTEL_* environment variables in the v1/config endpoint to help administrators verify and diagnose tracing configuration.
|
|
||
| **Recommendation:** | ||
| - Continue using the `/metrics` endpoint for all operational metrics. | ||
| - Expand the set of Prometheus metrics as needed to cover additional components (e.g., LLM calls, RAG retrieval, tool execution) as product needs evolve. |
There was a problem hiding this comment.
yes, it would be very valuable
| **Recommendation:** | ||
| - Continue using the `/metrics` endpoint for all operational metrics. | ||
| - Expand the set of Prometheus metrics as needed to cover additional components (e.g., LLM calls, RAG retrieval, tool execution) as product needs evolve. | ||
| - Ensure metrics align with existing naming conventions and maintain low cardinality to avoid high-memory cost in Prometheus servers. |
There was a problem hiding this comment.
is it a real problem now? Or just warning/note for the future. Because now (IIRC) the cardinality selectors are just provider+model...
There was a problem hiding this comment.
I think it isn't a real problem rn.
I have an impression that the general expectation in distributed systems is that when you have multi-service integration, spans from downstream services should be part of the same trace to preserve end-to-end observability. From what I’ve seen, Llama Stack already provides fairly rich and detailed tracing, and it would be valuable to leverage that (at least in server mode) rather than losing that level of visibility. Even with LLS spans, LCORE traces would remain complete and consistent on their own, but LLS spans would provide optional deeper insight into request-level behavior (especially latency breakdowns for more complex calls). |
Tbh, I don't think so. From what I can see in Jaeger, Llama Stack traces are pretty granular, with a lot of internal spans exposed per request. |
I agree, but my point is who will benefit from it. Admin/devops who look for perf. problems wont be probably interested in LS internals. More interesting is (IMHO) "is the RAG what caused the slowdown" or "which MCP call is the most time expensive?" etc. OTOH we migth be interested knowing tracing about LS providers, that would be good, indeed. |
Yeah, I agree but exactly for this we need LLS traces, right? Otherwise this would be completely hidden under a single LCORE responses request span. |
|
@asimurka just wondering why the spike document is not included in this PR? |
The document I propose IMO contains everything necessary for scoping the feature, including decision points. |
e0e6b4f to
bc1df2f
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (2)
docs/design/observability-opentelemetry/observability-opentelemetry.md (2)
107-108:⚠️ Potential issue | 🟠 MajorR10 verification requirement remains unfulfilled.
Requirement R10 mandates that the feature "shall include documentation describing how to... verify correct behavior," but the document contains no verification section, test plan, or acceptance criteria. This was flagged in a previous review and remains unresolved.
Without explicit verification procedures, implementation teams lack guidance on:
- How to confirm tracing is correctly initialized
- How to validate trace context propagation (inbound/outbound)
- How to verify exporter connectivity and span delivery
- How to test resilience when exporters fail
- What observable evidence confirms correct behavior
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/observability-opentelemetry/observability-opentelemetry.md` around lines 107 - 108, Add a new "Verification / Test Plan" section under R10 that provides concrete acceptance criteria and step-by-step checks: include a tracing initialization smoke test (start app, assert tracer provider registered and first span emitted), trace context propagation tests for inbound/outbound requests (inject/extract W3C traceparent and assert span linkage), exporter connectivity checks (configure OTLP/console exporters, confirm spans are received by a collector or logs), resiliency tests (simulate exporter failure and verify graceful degradation, buffer/drop behavior and error logs), and a list of observable evidence (examples of expected span names, resource attributes, log entries, and metrics) plus environment-variable toggle instructions and commands to run automated e2e/unit tests that validate each item; reference R10 and the observability-opentelemetry.md verification section when implementing.
535-546:⚠️ Potential issue | 🟠 MajorResolve contradiction between architecture recommendation and deployment guidance.
Lines 178-179 recommend Option 3 (hybrid YAML+env with manual SDK initialization), but lines 541-544 show deployment files using
opentelemetry-instrumentin the ENTRYPOINT, which implements Option 1 (auto-instrumentation with env-only config). These approaches are mutually exclusive:
- Option 1 (
opentelemetry-instrument): SDK initializes before application code runs; configuration comes entirely fromOTEL_*environment variables; manual SDK initialization is not possible.- Option 3 (manual SDK init): SDK is initialized explicitly in application code (e.g., FastAPI lifespan) after YAML is loaded; using
opentelemetry-instrumentwould conflict with manual initialization.Impact: Implementation teams will be confused about which approach to follow, potentially leading to broken deployments or wasted integration effort.
📋 Suggested resolution
Option A: If choosing Option 3 (hybrid, manual SDK init):
Update Step 11 to remove
opentelemetry-instrumentfrom deployment files:-**`Containerfile` (LCORE image)** — change the final **`ENTRYPOINT`** so the wrapper runs first, for example: -`ENTRYPOINT ["opentelemetry-instrument", "python3.12", "src/lightspeed_stack.py"]` -(instead of invoking `python3.12` alone). +**`Containerfile` (LCORE image)** — no changes required; tracing is initialized via manual SDK setup in application lifespan.-**`scripts/llama-stack-entrypoint.sh` (Llama Stack image)** — the shell entrypoint should become **`exec opentelemetry-instrument llama stack run …`**. +**`scripts/llama-stack-entrypoint.sh` (Llama Stack image)** — prefix with **`opentelemetry-instrument`** if Llama Stack uses auto-instrumentation (LCORE does not).Option B: If choosing Option 1 (auto-instrumentation):
Update Step 1 recommendation at line 178 to endorse Option 2 (env-first) and clarify that LCORE YAML only provides propagation flags, not export configuration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/observability-opentelemetry/observability-opentelemetry.md` around lines 535 - 546, The docs currently recommend Option 3 (manual SDK initialization) but the deployment examples use opentelemetry-instrument in ENTRYPOINT (auto-instrumentation), which conflicts; resolve by making the chosen approach consistent: either remove opentelemetry-instrument from the Containerfile ENTRYPOINT and llama-stack-entrypoint.sh and document that the application will call manual SDK init (retain references to manual init, YAML+env hybrid and functions like FastAPI lifespan), or change the architecture recommendation to Option 1/2 and update the prose to state that deployments use opentelemetry-instrument in ENTRYPOINT with configuration purely via OTEL_* env vars (and that manual SDK init must not be used). Ensure all mentions of ENTRYPOINT, opentelemetry-instrument, OTEL_* env vars, and "manual SDK initialization"/"Option 3" are updated to the chosen option.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/observability-opentelemetry/observability-opentelemetry.md`:
- Around line 519-520: Clarify that the documented global kill switch
OTEL_SDK_DISABLED only applies automatically to auto-instrumentation and must be
manually honored in the hybrid/manual SDK initialization path (Option 3); update
the doc to instruct developers using Option 3 to read
process.env.OTEL_SDK_DISABLED (or equivalent env access) at startup and skip
initializing the OpenTelemetry SDK/tracer provider/exporters when it is truthy
so the kill switch behaves as documented.
- Line 412: Clarify that OTEL_PYTHON_FASTAPI_EXCLUDED_URLS only takes effect
when FastAPI auto-instrumentation is in use (e.g., via opentelemetry-instrument
with opentelemetry-instrumentation-fastapi); for Option 3 (hybrid with manual
SDK initialization) explicitly state that you must either enable the FastAPI
auto-instrumentation or register the FastAPI instrumentation manually (e.g.,
FastAPIInstrumentor or equivalent configuration in your manual SDK bootstrap)
for OTEL_PYTHON_FASTAPI_EXCLUDED_URLS to be honored, and add a short example
note describing enabling FastAPI instrumentation in manual setup.
- Around line 600-603: Clarify the contradictory guidance about library mode
tracing: update the Open questions section to match the definitive statement
earlier by explicitly declaring AsyncLlamaStackAsLibraryClient (library mode) as
unsupported for unified tracing and documenting the technical rationale
(LCORE/LLS in-process context mismatch) and operational implications; or, if
undecided, replace the single-question bullet with a short decision-framework
listing the three options (document as unsupported, invest in in-process context
propagation requiring LCORE+LLS contract changes, or recommend service mode for
production), include decision criteria (effort, user demand, operational
complexity) and a decision deadline so the specification is unambiguous for
implementers.
- Around line 504-513: Add a Pydantic validator on the
OpenTelemetryConfiguration model to enforce that when enabled is True the export
attribute (type OpenTelemetryExportConfiguration) is not None and that its
required fields (endpoint, protocol, service_name) are present and non-empty;
implement this as a `@root_validator` (or class validator) on
OpenTelemetryConfiguration, check values.get("enabled") and validate
values.get("export") and its properties, and raise a ValueError with a clear
message if the export block or any required subfield is missing so configuration
loading fails early.
---
Duplicate comments:
In `@docs/design/observability-opentelemetry/observability-opentelemetry.md`:
- Around line 107-108: Add a new "Verification / Test Plan" section under R10
that provides concrete acceptance criteria and step-by-step checks: include a
tracing initialization smoke test (start app, assert tracer provider registered
and first span emitted), trace context propagation tests for inbound/outbound
requests (inject/extract W3C traceparent and assert span linkage), exporter
connectivity checks (configure OTLP/console exporters, confirm spans are
received by a collector or logs), resiliency tests (simulate exporter failure
and verify graceful degradation, buffer/drop behavior and error logs), and a
list of observable evidence (examples of expected span names, resource
attributes, log entries, and metrics) plus environment-variable toggle
instructions and commands to run automated e2e/unit tests that validate each
item; reference R10 and the observability-opentelemetry.md verification section
when implementing.
- Around line 535-546: The docs currently recommend Option 3 (manual SDK
initialization) but the deployment examples use opentelemetry-instrument in
ENTRYPOINT (auto-instrumentation), which conflicts; resolve by making the chosen
approach consistent: either remove opentelemetry-instrument from the
Containerfile ENTRYPOINT and llama-stack-entrypoint.sh and document that the
application will call manual SDK init (retain references to manual init,
YAML+env hybrid and functions like FastAPI lifespan), or change the architecture
recommendation to Option 1/2 and update the prose to state that deployments use
opentelemetry-instrument in ENTRYPOINT with configuration purely via OTEL_* env
vars (and that manual SDK init must not be used). Ensure all mentions of
ENTRYPOINT, opentelemetry-instrument, OTEL_* env vars, and "manual SDK
initialization"/"Option 3" are updated to the chosen option.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1dd68690-43fc-4a02-82c2-f68b723194fe
📒 Files selected for processing (1)
docs/design/observability-opentelemetry/observability-opentelemetry.md
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: unit_tests (3.13)
- GitHub Check: Pylinter
- GitHub Check: unit_tests (3.12)
- GitHub Check: build-pr
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: E2E Tests for Lightspeed Evaluation job
- GitHub Check: E2E: library mode / ci
- GitHub Check: E2E: server mode / ci
🧰 Additional context used
🪛 markdownlint-cli2 (0.22.0)
docs/design/observability-opentelemetry/observability-opentelemetry.md
[warning] 194-194: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 199-199: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 205-205: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 210-210: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 224-224: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 232-232: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 238-238: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 244-244: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 298-298: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
bc1df2f to
7e5db9c
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (8)
docs/design/observability-opentelemetry/observability-opentelemetry.md (8)
250-251:⚠️ Potential issue | 🟠 MajorResolve normative recommendation conflict between Step 1 and Step 2.
Step 1 recommends Option 3 (hybrid), but this paragraph says env-driven auto-instrumentation is generally preferred. Pick one normative implementation path for LCORE.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/observability-opentelemetry/observability-opentelemetry.md` around lines 250 - 251, Choose a single normative implementation for LCORE and update the text to remove the conflict between Step 1 (Option 3 / hybrid) and the recommendation favoring environment-variable-driven auto-instrumentation: either adopt Option 3 (hybrid) as the authoritative path and explain how env-driven auto-instrumentation is secondary/optional, or declare env-driven auto-instrumentation as the primary approach and update Step 1 to match; ensure the doc references the same canonical choice consistently and keep LCORE YAML propagation flags (`incoming` / `outgoing`) behavior described as compatible with that single chosen model.
599-602:⚠️ Potential issue | 🟠 MajorMake library-mode status consistent (decision vs open question).
Earlier sections state library mode is not covered, but this section frames it as undecided. Either declare it as a known limitation or define a decision framework with deadline.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/observability-opentelemetry/observability-opentelemetry.md` around lines 599 - 602, The document currently treats library-mode tracing status inconsistently; resolve this by making a single authoritative statement: update the Open questions section to either declare AsyncLlamaStackAsLibraryClient (library mode) as a known unsupported limitation for unified tracing (mention LCORE and LLS will not join traces) or state a concrete decision plan with a deadline and owner to investigate in-process context alignment (LCORE + LLS contract); ensure references to AsyncLlamaStackAsLibraryClient, LCORE, and LLS are used so readers can find the relevant sections and remove the contradictory language elsewhere so library-mode status is consistent across the doc.
454-455:⚠️ Potential issue | 🟠 MajorDocument manual handling of
OTEL_SDK_DISABLEDfor hybrid/manual init.In manual SDK initialization paths, this kill switch must be explicitly checked before tracer/provider setup; otherwise behavior won’t match this requirement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/observability-opentelemetry/observability-opentelemetry.md` around lines 454 - 455, Update the doc to state that in manual/hybrid SDK initialization paths the OTEL_SDK_DISABLED env var must be explicitly checked before performing any tracer/provider setup; call out that implementations of initOpenTelemetry(), setupTracerProvider(), configureTracer(), or any manual tracer initialization entry points must early-return when process.env.OTEL_SDK_DISABLED (or equivalent config) is truthy so no providers, span processors, or exporters are created, and include a short example checklist of where to add the check (entry/init functions) and the expected behavior when the switch is set.
298-298:⚠️ Potential issue | 🟡 MinorAdd blank lines around fenced code block (MD031).
The fence boundary around this snippet needs surrounding blank lines to pass markdownlint.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/observability-opentelemetry/observability-opentelemetry.md` at line 298, The fenced code block in the documentation is missing blank lines around its opening/closing triple-backtick markers; update the markdown so there is an empty line immediately before the opening ``` and an empty line immediately after the closing ```, thereby satisfying MD031 and ensuring the fenced code block has surrounding blank lines.
108-108:⚠️ Potential issue | 🟠 MajorAdd explicit verification/test-plan steps to satisfy R10.
R10 requires “how to verify correct behavior,” but there is no concrete test procedure with commands, expected outcomes, and evidence criteria.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/observability-opentelemetry/observability-opentelemetry.md` at line 108, Add a concrete verification/test-plan section to observability-opentelemetry.md that satisfies R10 by listing step-by-step commands to run (e.g., start the service with the documented env vars, run a sample request via curl or a test script, and push spans via OTLP/collector), the exact expected outcomes (span names, attributes, sampling decisions, and HTTP status codes), and the evidence criteria to collect (trace IDs, screenshots of Jaeger/UI queries, presence of spans in OTLP/collector logs or storage, and matching attribute values). Reference the R10 requirement explicitly and include example commands for starting the collector and service, running a test request, querying the tracing backend (Jaeger/Tempo/OTLP endpoint), and a pass/fail checklist for each verification step so reviewers can reproduce and validate tracing behavior.
194-244:⚠️ Potential issue | 🟡 MinorFix MD022 heading spacing in this section.
### Configuration model:,### Pros:,### Cons:, and### Important constraints:are missing required blank-line separation and will fail markdown lint checks.Suggested doc fix
The application does not explicitly configure the SDK; instead, it relies on OpenTelemetry’s default initialization behavior. ### Configuration model: + - All tracing configuration is provided via `OTEL_*` environment variables - YAML does not participate in SDK initialization - No runtime configuration merging occurs ### Pros: + - No application-level tracing setup required ...🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/observability-opentelemetry/observability-opentelemetry.md` around lines 194 - 244, Add a blank line before each second-level heading instances "### Configuration model:", "### Pros:", "### Cons:", and "### Important constraints:" so they are separated from preceding paragraphs (fixing MD022); update the section where Option 2 begins to ensure each of those headings has a single empty line above it in the markdown so the linter no longer reports missing heading spacing.
557-557:⚠️ Potential issue | 🟠 MajorRemove “fail fast or refuse tracing” ambiguity.
This line reintroduces two mutually exclusive startup behaviors for missing mandatory YAML export fields; specify one required behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/observability-opentelemetry/observability-opentelemetry.md` at line 557, Replace the ambiguous phrase so the doc mandates a single behavior: when LCORE tracing is set to enabled: true and any mandatory YAML export fields are missing or invalid, the system must fail fast and refuse to start (i.e., abort startup with a clear error and log), rather than allowing tracing to continue; update the sentence referencing LCORE, enabled: true, and export fields to state this explicit fail-fast requirement and remove the alternative "refuse tracing" wording that implies a choice.
412-412:⚠️ Potential issue | 🟡 MinorClarify
OTEL_PYTHON_FASTAPI_EXCLUDED_URLSdependency on FastAPI instrumentation.The text currently implies it always applies, but it only works when FastAPI instrumentation is active.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/observability-opentelemetry/observability-opentelemetry.md` at line 412, Update the note to clarify that OTEL_PYTHON_FASTAPI_EXCLUDED_URLS only takes effect when FastAPI instrumentation is enabled: change the sentence mentioning OTEL_PYTHON_FASTAPI_EXCLUDED_URLS to explicitly state its dependency on active FastAPI instrumentation (e.g., "These can be filtered via OTEL_PYTHON_FASTAPI_EXCLUDED_URLS when FastAPI instrumentation is enabled") so readers understand the env var has no effect without the FastAPI instrumentation being active.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@docs/design/observability-opentelemetry/observability-opentelemetry.md`:
- Around line 250-251: Choose a single normative implementation for LCORE and
update the text to remove the conflict between Step 1 (Option 3 / hybrid) and
the recommendation favoring environment-variable-driven auto-instrumentation:
either adopt Option 3 (hybrid) as the authoritative path and explain how
env-driven auto-instrumentation is secondary/optional, or declare env-driven
auto-instrumentation as the primary approach and update Step 1 to match; ensure
the doc references the same canonical choice consistently and keep LCORE YAML
propagation flags (`incoming` / `outgoing`) behavior described as compatible
with that single chosen model.
- Around line 599-602: The document currently treats library-mode tracing status
inconsistently; resolve this by making a single authoritative statement: update
the Open questions section to either declare AsyncLlamaStackAsLibraryClient
(library mode) as a known unsupported limitation for unified tracing (mention
LCORE and LLS will not join traces) or state a concrete decision plan with a
deadline and owner to investigate in-process context alignment (LCORE + LLS
contract); ensure references to AsyncLlamaStackAsLibraryClient, LCORE, and LLS
are used so readers can find the relevant sections and remove the contradictory
language elsewhere so library-mode status is consistent across the doc.
- Around line 454-455: Update the doc to state that in manual/hybrid SDK
initialization paths the OTEL_SDK_DISABLED env var must be explicitly checked
before performing any tracer/provider setup; call out that implementations of
initOpenTelemetry(), setupTracerProvider(), configureTracer(), or any manual
tracer initialization entry points must early-return when
process.env.OTEL_SDK_DISABLED (or equivalent config) is truthy so no providers,
span processors, or exporters are created, and include a short example checklist
of where to add the check (entry/init functions) and the expected behavior when
the switch is set.
- Line 298: The fenced code block in the documentation is missing blank lines
around its opening/closing triple-backtick markers; update the markdown so there
is an empty line immediately before the opening ``` and an empty line
immediately after the closing ```, thereby satisfying MD031 and ensuring the
fenced code block has surrounding blank lines.
- Line 108: Add a concrete verification/test-plan section to
observability-opentelemetry.md that satisfies R10 by listing step-by-step
commands to run (e.g., start the service with the documented env vars, run a
sample request via curl or a test script, and push spans via OTLP/collector),
the exact expected outcomes (span names, attributes, sampling decisions, and
HTTP status codes), and the evidence criteria to collect (trace IDs, screenshots
of Jaeger/UI queries, presence of spans in OTLP/collector logs or storage, and
matching attribute values). Reference the R10 requirement explicitly and include
example commands for starting the collector and service, running a test request,
querying the tracing backend (Jaeger/Tempo/OTLP endpoint), and a pass/fail
checklist for each verification step so reviewers can reproduce and validate
tracing behavior.
- Around line 194-244: Add a blank line before each second-level heading
instances "### Configuration model:", "### Pros:", "### Cons:", and "###
Important constraints:" so they are separated from preceding paragraphs (fixing
MD022); update the section where Option 2 begins to ensure each of those
headings has a single empty line above it in the markdown so the linter no
longer reports missing heading spacing.
- Line 557: Replace the ambiguous phrase so the doc mandates a single behavior:
when LCORE tracing is set to enabled: true and any mandatory YAML export fields
are missing or invalid, the system must fail fast and refuse to start (i.e.,
abort startup with a clear error and log), rather than allowing tracing to
continue; update the sentence referencing LCORE, enabled: true, and export
fields to state this explicit fail-fast requirement and remove the alternative
"refuse tracing" wording that implies a choice.
- Line 412: Update the note to clarify that OTEL_PYTHON_FASTAPI_EXCLUDED_URLS
only takes effect when FastAPI instrumentation is enabled: change the sentence
mentioning OTEL_PYTHON_FASTAPI_EXCLUDED_URLS to explicitly state its dependency
on active FastAPI instrumentation (e.g., "These can be filtered via
OTEL_PYTHON_FASTAPI_EXCLUDED_URLS when FastAPI instrumentation is enabled") so
readers understand the env var has no effect without the FastAPI instrumentation
being active.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 89e9f8c7-723f-4bde-b600-1098fa1dd938
📒 Files selected for processing (1)
docs/design/observability-opentelemetry/observability-opentelemetry.md
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Pylinter
- GitHub Check: build-pr
- GitHub Check: E2E: library mode / ci
- GitHub Check: E2E: server mode / ci
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: E2E Tests for Lightspeed Evaluation job
🧰 Additional context used
🪛 markdownlint-cli2 (0.22.0)
docs/design/observability-opentelemetry/observability-opentelemetry.md
[warning] 194-194: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 199-199: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 205-205: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 210-210: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 224-224: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 232-232: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 238-238: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 244-244: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 298-298: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
Description
Adds the feature architecture document for OpenTelemetry tracing in Lightspeed Core. Feature Specification only; no code changes.
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit