Improve loading performance in _absolute_to_relative_idx when remapping indices #3279

Open
hello3x3 wants to merge 1 commit into huggingface:main from hello3x3:fix-index-column-performance
@hello3x3 hello3x3 commented Apr 3, 2026

Type / Scope

  • Type: Performance
  • Scope: datasets / DatasetReader

Summary / Motivation

This PR improves the initialization performance of DatasetReader when episodes is specified and index remapping is required. Materializing the entire index column as Python / tensor objects becomes a significant bottleneck for large datasets. Reading the column with .to_numpy() instead is much faster.

Related issues

What changed

  • Replaced formatted index column loading with direct Arrow column access in DatasetReader
  • Optimized remapping dictionary construction for _absolute_to_relative_idx
  • Preserved existing remapping semantics and public API behavior
  • No breaking changes
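
For reviewers, the idea behind the remapping change can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `build_absolute_to_relative` and the sample values are hypothetical, and the NumPy array stands in for whatever a `.to_numpy()` call on the index column returns.

```python
import numpy as np


def build_absolute_to_relative(absolute_indices: np.ndarray) -> dict[int, int]:
    """Map each absolute frame index to its position in the filtered subset.

    Hypothetical helper: the real attribute is _absolute_to_relative_idx,
    but this function name and signature are assumptions for illustration.
    """
    # .tolist() converts the whole array to plain Python ints in one pass,
    # avoiding per-element tensor/scalar object creation inside the loop.
    return {
        abs_idx: rel_idx
        for rel_idx, abs_idx in enumerate(absolute_indices.tolist())
    }


# Assumed sample data: absolute indices of frames kept after episode filtering.
absolute = np.array([120, 121, 122, 300, 301], dtype=np.int64)
mapping = build_absolute_to_relative(absolute)
print(mapping[300])  # 3
```

The resulting dictionary has the same keys and values as one built by iterating over the formatted column, so lookups behave identically; only the construction cost changes.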

Test

from lerobot.datasets.lerobot_dataset import LeRobotDataset
from time import time

root = "/path/to/lerobot/libero_goal_image"
episodes = list(range(100))

start = time()

data = LeRobotDataset(
    "lerobot/libero_goal_image",
    root=root,
    episodes=episodes,
)

end = time()
print(f"loading {len(data.episodes)} episodes in {end - start:.2f} seconds.")
# Before: loading 100 episodes in 274.72 seconds.
# After: loading 100 episodes in 0.06 seconds.

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated
  • CI is green

github-actions bot added the dataset label (Issues regarding data inputs, processing, or datasets) Apr 3, 2026