Runtime memory metrics use ObservableCounter instead of ObservableGauge, report 0 with delta temporality #9994

@tomachristian

🤖 DISCLAIMER: Found by my Claude bot; I hope it proves useful. Because I was not fully certain of the diagnosis, I opened an issue instead of a PR.

Description

The orleans-runtime-total-physical-memory and orleans-runtime-available-memory metrics in EnvironmentStatisticsProvider are created using CreateObservableCounter, but they represent point-in-time gauge values, not monotonically increasing cumulative totals.

This causes the metrics to report incorrect values when consumed by exporters that use delta temporality (e.g., Azure Monitor / Application Insights, Datadog). Specifically:

  • orleans-runtime-total-physical-memory effectively always reports 0 because the underlying value (GC.GetGCMemoryInfo().TotalAvailableMemoryBytes) is near-constant for the lifetime of a container — the delta between successive observations of a near-constant value is 0.
  • orleans-runtime-available-memory reports small positive/negative deltas instead of the actual available memory value.

With cumulative temporality exporters (e.g., Prometheus), the raw absolute values are visible at scrape time, masking the bug — but applying rate() or increase() to these metrics would still produce nonsensical results, since they are not cumulative quantities.
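
To make the mechanism concrete, here is a minimal sketch of what a delta-temporality exporter effectively does with an observable counter: it reports the difference between successive observations. The class name, method name, and sample values below are hypothetical and chosen only for illustration. They are not taken from Orleans or from any exporter's source, and the first observation is skipped for simplicity.

```csharp
using System;
using System.Collections.Generic;

public static class DeltaTemporalityDemo
{
    // Mimics a delta-temporality export of an observable counter:
    // each reported point is (current observation - previous observation).
    public static List<long> ToDeltas(IReadOnlyList<long> observations)
    {
        var deltas = new List<long>();
        for (int i = 1; i < observations.Count; i++)
        {
            deltas.Add(observations[i] - observations[i - 1]);
        }
        return deltas;
    }

    public static void Main()
    {
        // Hypothetical raw observations in MB (not production data).
        long[] totalMemoryMb = { 4096, 4096, 4096, 4096 };      // near-constant limit
        long[] availableMemoryMb = { 1500, 1519, 1509, 1509 };  // fluctuating value

        Console.WriteLine(string.Join(", ", ToDeltas(totalMemoryMb)));     // prints "0, 0, 0"
        Console.WriteLine(string.Join(", ", ToDeltas(availableMemoryMb))); // prints "19, -10, 0"
    }
}
```

A constant input collapses to zeros and a fluctuating input turns into small signed differences, which matches the shape of the symptoms described above.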

Evidence

Production telemetry from Orleans 10.0.1 on AKS with Azure Monitor (delta temporality):

  • orleans-runtime-total-physical-memory: every sample shows value: 0
  • orleans-runtime-available-memory: samples show values like +19, -10, 0, +1, +11, -2

The negative values (-10, -2) are the definitive evidence: available memory cannot be negative as an absolute measurement. These are deltas produced by the exporter subtracting successive observations of a fluctuating value.

Source

In EnvironmentStatisticsProvider.cs:

_availableMemoryCounter = Instruments.Meter.CreateObservableCounter(
    InstrumentNames.RUNTIME_MEMORY_AVAILABLE_MEMORY_MB, 
    () => (long)(_availableMemoryBytes / OneKiloByte / OneKiloByte), 
    unit: "MB");

_maximumAvailableMemoryCounter = Instruments.Meter.CreateObservableCounter(
    InstrumentNames.RUNTIME_MEMORY_TOTAL_PHYSICAL_MEMORY_MB, 
    () => (long)(_maximumAvailableMemoryBytes / OneKiloByte / OneKiloByte), 
    unit: "MB");

Expected Behavior

These should be CreateObservableGauge since they represent the current state of memory, not a running total:

  • Total physical memory is a near-static value (container memory limit / GC available memory)
  • Available memory is a fluctuating point-in-time measurement

Per the OpenTelemetry specification, an Asynchronous Gauge is the instrument for non-additive values, that is, values for which summing across multiple sources makes no sense (the specification uses room temperature as its example). Summing available memory across silos is similarly meaningless, because each silo's available memory is an independent measurement.

This is also consistent with other gauge-like Orleans metrics (e.g., orleans-catalog-activations, orleans-directory-partition-size) that correctly use ObservableGauge.
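
As a rough, standalone sketch of the gauge-based registration (not the actual Orleans patch), the example below registers an observable gauge on a `System.Diagnostics.Metrics.Meter` and reads it back through a `MeterListener`. The meter name, instrument name, class name, and the fixed `_availableMemoryBytes` value are assumptions made for illustration. In Orleans itself the change would presumably amount to swapping `CreateObservableCounter` for `CreateObservableGauge` on `Instruments.Meter`.

```csharp
using System;
using System.Diagnostics.Metrics;

public static class MemoryGaugeSketch
{
    private const double OneKiloByte = 1024;

    // Hypothetical stand-in for the field the provider refreshes periodically.
    private static long _availableMemoryBytes = 1500L * 1024 * 1024;

    // Registers an observable gauge and collects a single measurement from it.
    public static long ObserveAvailableMemoryMb()
    {
        using var meter = new Meter("Example.Runtime"); // hypothetical meter name

        // A gauge reports the current value as-is; no accumulation is implied.
        meter.CreateObservableGauge<long>(
            "example-runtime-available-memory",          // hypothetical instrument name
            () => (long)(_availableMemoryBytes / OneKiloByte / OneKiloByte),
            unit: "MB");

        long observed = -1;
        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            // Only subscribe to instruments from the meter created above.
            if (ReferenceEquals(instrument.Meter, meter))
            {
                l.EnableMeasurementEvents(instrument);
            }
        };
        listener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) => observed = value);
        listener.Start();

        // Invokes the gauge callback and forwards its value to the listener.
        listener.RecordObservableInstruments();
        return observed;
    }
}
```

Calling `MemoryGaugeSketch.ObserveAvailableMemoryMb()` should return the absolute value in MB (1500 for the hypothetical field value above), which is the number an operator would expect to see regardless of the exporter's temporality setting.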

Impact

  • Operators cannot monitor container memory limits or actual available memory via Orleans metrics with delta-temporality backends (Azure Monitor, Datadog)
  • Dashboards and alerts based on these metrics are broken
  • Orleans' internal behavior (grain collection, activation shedding via ActivationCollector) is NOT affected — it reads from IEnvironmentStatisticsProvider.GetEnvironmentStatistics() directly, not from the exported metrics

Secondary: metric name is misleading in containers

The metric orleans-runtime-total-physical-memory sources its value from GC.GetGCMemoryInfo().TotalAvailableMemoryBytes, which is documented as the total memory available for the garbage collector to use. In containers, this reflects the cgroup memory limit, not host physical RAM. The name total-physical-memory is a legacy artifact from bare-metal deployments and can be misleading in containerized environments.
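
One way to check this on a given deployment is to print the value the metric is derived from and compare it with the container's configured limit and the host's RAM. This is a minimal sketch; the class and method names are hypothetical, and the exact number reported can depend on the runtime's GC configuration.

```csharp
using System;

public static class GcMemoryLimitSketch
{
    // Returns the GC's view of total available memory, converted to MB.
    public static long TotalAvailableMemoryMb()
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        return info.TotalAvailableMemoryBytes / (1024 * 1024);
    }
}
```

Logging `GcMemoryLimitSketch.TotalAvailableMemoryMb()` once at startup inside the container is usually enough to see whether the reported figure tracks the container limit or the host.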

Environment

  • Orleans 10.0.1
  • .NET 10.0
  • Azure Monitor OpenTelemetry Exporter (delta temporality)
  • Running in Kubernetes (AKS) with Linux Alpine containers
