
fix(one shot load): adding metadata loading when reading from a dataset after writing #3305

Open
CarolinePascal wants to merge 1 commit into main from fix/one-shot-load

Conversation

Collaborator

@CarolinePascal CarolinePascal commented Apr 7, 2026

Type / Scope

  • Type: Bug
  • Scope: LeRobotDataset

Summary / Motivation

This PR adds metadata loading when reading from a dataset right after writing it (i.e. without re-instantiating a LeRobotDataset from the existing on-disk dataset).

Related issues

  • Fixes / Closes: # (if any)
  • Related: # (if any)

What changed

Add self._meta._load_metadata() in the DatasetReader.load_and_activate() method.
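To illustrate why this one-line change matters, here is a minimal self-contained toy (NOT the lerobot implementation; all class and attribute names below are hypothetical): a reader builds its index mapping from metadata, so if the metadata is not reloaded after a write, the index stays empty and indexing fails.

```python
# Illustrative toy of the read-after-write pattern fixed by this PR.
# ToyMetadata / ToyReader and the dict-based "store" are stand-ins for
# LeRobotDatasetMetadata, DatasetReader, and the on-disk metadata files.

class ToyMetadata:
    def __init__(self, store):
        self._store = store      # stands in for the metadata files on disk
        self.total_frames = 0    # stale until load() is called

    def load(self):
        # Re-read metadata from the backing store (disk, in the real dataset).
        self.total_frames = self._store["total_frames"]


class ToyReader:
    def __init__(self, store):
        self._meta = ToyMetadata(store)
        self._index = []

    def load_and_activate(self):
        # The fix: refresh metadata BEFORE building the index mapping,
        # so frames written since the last load become visible.
        self._meta.load()
        self._index = list(range(self._meta.total_frames))

    def __getitem__(self, i):
        return self._index[i]


store = {"total_frames": 0}
reader = ToyReader(store)
store["total_frames"] = 40   # simulate writing 2 episodes x 20 frames
reader.load_and_activate()
print(reader[0])             # works only because metadata was reloaded first
```

Without the `self._meta.load()` call inside `load_and_activate()`, `total_frames` would still be 0 and `reader[0]` would raise IndexError, which mirrors the failure described below.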

How was this tested (or how to run locally)

Here's a minimal reproducible example:

import shutil
import tempfile
from pathlib import Path
import numpy as np

from lerobot.datasets.lerobot_dataset import LeRobotDataset

def main():
    root = Path(tempfile.mkdtemp()) / "dummy_dataset"

    repo_id = "examples/dummy_dataset"
    fps = 10
    features = {
        "observation.image": {
            "dtype": "video",
            "shape": (64, 64, 3),
            "names": ["height", "width", "channels"],
        },
    }

    dataset = LeRobotDataset.create(
        repo_id=repo_id,
        fps=fps,
        features=features,
        root=root,
    )

    num_episodes = 2
    frames_per_episode = 20
    for ep in range(num_episodes):
        for step in range(frames_per_episode):
            frame = {
                "observation.image": np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8),
                "task": "reach the red block",
            }
            dataset.add_frame(frame)

        dataset.save_episode()
    dataset.finalize()

    try:
        # This line will fail without the suggested fix!
        first_frame = dataset[0]
    finally:
        shutil.rmtree(root.parent)

if __name__ == "__main__":
    main()

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated
  • CI is green

Reviewer notes

  • Anything the reviewer should focus on (performance, edge-cases, specific files) or general notes.
  • Anyone in the community is free to review the PR.

Copilot AI review requested due to automatic review settings April 7, 2026 13:19
@github-actions github-actions bot added the dataset Issues regarding data inputs, processing, or datasets label Apr 7, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a read-after-write bug in the LeRobotDataset read path by ensuring dataset metadata is loaded from disk before activating the Hugging Face dataset and building the index mapping—particularly important when reading immediately after recording/finalizing without re-instantiating the dataset.

Changes:

  • Reload metadata from disk in DatasetReader.load_and_activate() before loading the HF dataset.
  • Ensures episode/task metadata needed by __getitem__ (especially for video-backed datasets) is available in one-shot read-after-write flows.


Comment on lines 99 to +101
def load_and_activate(self) -> None:
    """Load HF dataset from disk and build index mapping. Call after data is on disk."""
    self._meta._load_metadata()

Copilot AI Apr 7, 2026


load_and_activate() now relies on LeRobotDatasetMetadata._load_metadata(), which is a private method on another class. To avoid coupling to a private API (and potential future breakage), consider adding/using a public metadata refresh method (e.g., reload() / load_from_disk()) on LeRobotDatasetMetadata and calling that here instead.

Suggested change

Before:

def load_and_activate(self) -> None:
    """Load HF dataset from disk and build index mapping. Call after data is on disk."""
    self._meta._load_metadata()

After:

def _refresh_metadata(self) -> None:
    """Refresh dataset metadata using a public API exposed by the metadata object."""
    for method_name in ("reload", "load_from_disk"):
        refresh = getattr(self._meta, method_name, None)
        if callable(refresh):
            refresh()
            return
    raise AttributeError(
        "LeRobotDatasetMetadata must expose a public metadata refresh method "
        "(for example, 'reload()' or 'load_from_disk()') before "
        "DatasetReader.load_and_activate() can refresh metadata."
    )

def load_and_activate(self) -> None:
    """Load HF dataset from disk and build index mapping. Call after data is on disk."""
    self._refresh_metadata()

Copilot uses AI. Check for mistakes.
Comment on lines 99 to 103
def load_and_activate(self) -> None:
    """Load HF dataset from disk and build index mapping. Call after data is on disk."""
    self._meta._load_metadata()
    self.hf_dataset = self._load_hf_dataset()
    self._build_index_mapping()

Copilot AI Apr 7, 2026


This change fixes a read-after-write path (metadata reloaded before building the index), but there’s no regression test covering the failing scenario described in the PR (create dataset with video features, write episodes, finalize, then access dataset[0] without re-instantiating). Adding a test in the existing DatasetReader/LeRobotDataset test suite would prevent this from regressing again.


Labels

dataset Issues regarding data inputs, processing, or datasets

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants