fix(one shot load): adding metadata loading when reading from a dataset after writing#3305
CarolinePascal wants to merge 1 commit into main
Conversation
Pull request overview
This PR fixes a read-after-write bug in the LeRobotDataset read path by ensuring dataset metadata is loaded from disk before activating the Hugging Face dataset and building the index mapping. This is particularly important when reading immediately after recording/finalizing, without re-instantiating the dataset.
Changes:
- Reload metadata from disk in `DatasetReader.load_and_activate()` before loading the HF dataset.
- Ensure episode/task metadata needed by `__getitem__` (especially for video-backed datasets) is available in one-shot read-after-write flows.
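The one-shot read-after-write flow described above can be sketched with a toy reader. Note that `ToyMeta` and `ToyReader` are illustrative stand-ins, not the actual LeRobot classes, which are not shown in this PR:

```python
import json
import tempfile
from pathlib import Path


class ToyMeta:
    """Stand-in for dataset metadata cached in memory."""

    def __init__(self, root: Path):
        self.root = root
        self.info = {"total_frames": 0}  # stale until reloaded from disk

    def reload(self) -> None:
        self.info = json.loads((self.root / "info.json").read_text())


class ToyReader:
    """Stand-in for DatasetReader: reload metadata BEFORE building the index."""

    def __init__(self, root: Path):
        self.root = root
        self.meta = ToyMeta(root)
        self.index: list[int] = []

    def load_and_activate(self) -> None:
        self.meta.reload()  # the fix: refresh metadata written by the recorder
        self.index = list(range(self.meta.info["total_frames"]))


root = Path(tempfile.mkdtemp())
reader = ToyReader(root)
# A recorder finalizes and writes updated metadata to disk...
(root / "info.json").write_text(json.dumps({"total_frames": 3}))
# ...then the same process reads immediately, without re-instantiating.
reader.load_and_activate()
print(len(reader.index))  # 3
```

Without the `reload()` call, the index would be built from the stale in-memory snapshot taken at construction time, which is the failure mode this PR addresses.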
```python
def load_and_activate(self) -> None:
    """Load HF dataset from disk and build index mapping. Call after data is on disk."""
    self._meta._load_metadata()
```
`load_and_activate()` now relies on `LeRobotDatasetMetadata._load_metadata()`, which is a private method on another class. To avoid coupling to a private API (and potential future breakage), consider adding a public metadata refresh method (e.g., `reload()` / `load_from_disk()`) on `LeRobotDatasetMetadata` and calling that here instead.
Suggested change:

```python
def _refresh_metadata(self) -> None:
    """Refresh dataset metadata using a public API exposed by the metadata object."""
    for method_name in ("reload", "load_from_disk"):
        refresh = getattr(self._meta, method_name, None)
        if callable(refresh):
            refresh()
            return
    raise AttributeError(
        "LeRobotDatasetMetadata must expose a public metadata refresh method "
        "(for example, 'reload()' or 'load_from_disk()') before "
        "DatasetReader.load_and_activate() can refresh metadata."
    )


def load_and_activate(self) -> None:
    """Load HF dataset from disk and build index mapping. Call after data is on disk."""
    self._refresh_metadata()
```
```python
def load_and_activate(self) -> None:
    """Load HF dataset from disk and build index mapping. Call after data is on disk."""
    self._meta._load_metadata()
    self.hf_dataset = self._load_hf_dataset()
    self._build_index_mapping()
```
This change fixes a read-after-write path (metadata is reloaded before building the index), but there is no regression test covering the failing scenario described in the PR (create a dataset with video features, write episodes, finalize, then access `dataset[0]` without re-instantiating). Adding a test to the existing DatasetReader/LeRobotDataset test suite would prevent this from regressing again.
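Such a regression test might look like the following sketch. It is pytest-style and uses a simplified `FakeReader` stand-in rather than the real `DatasetReader`/`LeRobotDataset` API, whose constructors and fixtures are not shown in this PR; the `reload_meta` flag only exists here to make the tested behavior explicit:

```python
import json
import tempfile
from pathlib import Path


class FakeReader:
    """Simplified reader; `reload_meta` toggles the fix under test."""

    def __init__(self, root: Path, reload_meta: bool = True):
        self.root = root
        self.reload_meta = reload_meta
        self.meta = {"total_frames": 0}  # stale snapshot from creation time
        self.index: list[int] = []

    def load_and_activate(self) -> None:
        if self.reload_meta:  # the fix: refresh metadata from disk first
            self.meta = json.loads((self.root / "meta.json").read_text())
        self.index = list(range(self.meta["total_frames"]))

    def __getitem__(self, i: int) -> int:
        return self.index[i]


def test_read_after_write_without_reinstantiation() -> None:
    root = Path(tempfile.mkdtemp())
    reader = FakeReader(root)
    # Simulate recording episodes + finalize updating metadata on disk.
    (root / "meta.json").write_text(json.dumps({"total_frames": 2}))
    reader.load_and_activate()
    assert reader[0] == 0  # raises IndexError when metadata is stale


test_read_after_write_without_reinstantiation()
```

A real test would instead create a `LeRobotDataset`, record and finalize episodes, then assert that `dataset[0]` succeeds in the same process.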
Type / Scope
Summary / Motivation
This PR adds metadata loading when reading from a dataset after writing it (i.e. without instantiating a `LeRobotDataset` class from an existing dataset).

Related issues
What changed
Add `self._meta._load_metadata()` in the `DatasetReader.load_and_activate()` method.

How was this tested (or how to run locally)
Here's a minimal reproducible example:
Checklist (required before merge)
- Lint checks pass (`pre-commit run -a`)
- Tests pass (`pytest`)

Reviewer notes