
HDF5 schema

session.export(...) writes all time-series annotations into a single annotation.hdf5 file alongside the episode video and calibrations. This page is the canonical schema reference.

Top-level groups

annotation.hdf5
├── /depth           ─ per-RGB-frame depth maps, gzip-compressed
├── /cam-pose        ─ /camera/pose translations + rotations
├── /imu             ─ /device/imu samples
├── /hand-pose       ─ accumulated HandPose detections (only if buffered)
└── /metadata        ─ counts, durations, mcap start/end timestamps

Each group is independent: open the file with h5py and read whichever subset you need.

import h5py

with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
    print(list(f.keys()))
    print(f["depth"]["frames"].shape)
    print(dict(f["metadata"].attrs))   # metadata is stored as attributes on the /metadata group

/depth

Aligned per RGB frame, with one slot per frame.index. Frames with no matched depth are zero-filled and have valid[i]=False.

| Dataset | Shape | Dtype | Notes |
| --- | --- | --- | --- |
| frames | (num_rgb_frames, H, W) | uint16 | Millimetres. Gzip level 4. Chunked as (1, H, W) for fast random-frame reads. |
| timestamps | (num_rgb_frames,) | float64 | Depth-message timestamp; 0.0 when invalid. |
| valid | (num_rgb_frames,) | bool | True iff a depth match was found within max_depth_dt. |

Group attributes: units="mm", height, width.

Skipped when: no depth intrinsics on the session.
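
Because unmatched slots are zero-filled, it is usually worth consulting valid before using a frame. The sketch below shows one way to do that and convert millimetres to metres. The helper name depth_frame_metres is hypothetical and not part of the library; it only assumes the frames and valid datasets described above.

```python
import numpy as np


def depth_frame_metres(depth_group, index):
    """Return frame `index` in metres (float32), or None if no depth was matched.

    `depth_group` is the opened f["depth"] group. Hypothetical helper for
    illustration; it is not part of the exporting library.
    """
    if not depth_group["valid"][index]:
        return None
    # Reading a single index touches exactly one (1, H, W) chunk.
    frame_mm = depth_group["frames"][index]
    return frame_mm.astype(np.float32) / 1000.0


# Usage (requires an exported episode on disk):
# import h5py
# with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
#     depth_m = depth_frame_metres(f["depth"], 100)
```

Individual pixels inside a valid frame can still be 0; whether 0 means "no reading" depends on the depth sensor, so treat that as something to verify for your data.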

/cam-pose

One row per /camera/pose message (not per RGB frame).

| Dataset | Shape | Dtype | Notes |
| --- | --- | --- | --- |
| timestamps | (num_pose_samples,) | float64 | |
| translations | (num_pose_samples, 3) | float32 | Metres, world frame. |
| rotations | (num_pose_samples, 3, 3) | float32 | Camera-link → world rotation. |

Skipped when: no /camera/pose messages.
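
Since pose rows are not aligned to RGB frames, consumers typically need to look up a pose by timestamp. The following is a minimal sketch of a nearest-timestamp lookup plus assembly of 4x4 camera-link → world transforms. Both function names are hypothetical, and the lookup assumes that the pose timestamps are sorted in ascending order and share a clock with the query timestamps; neither is guaranteed by the schema above.

```python
import numpy as np


def nearest_pose_indices(pose_ts, query_ts):
    """For each query timestamp, return the index of the closest pose sample.

    Assumes `pose_ts` is non-empty and sorted ascending (an assumption,
    not something the schema states).
    """
    pose_ts = np.asarray(pose_ts, dtype=np.float64)
    query_ts = np.asarray(query_ts, dtype=np.float64)
    if pose_ts.size == 1:
        return np.zeros(query_ts.shape, dtype=np.intp)
    # Index of the first pose sample at or after each query, kept in range.
    right = np.clip(np.searchsorted(pose_ts, query_ts), 1, pose_ts.size - 1)
    left = right - 1
    pick_right = (pose_ts[right] - query_ts) < (query_ts - pose_ts[left])
    return np.where(pick_right, right, left)


def pose_matrices(translations, rotations):
    """Stack (N, 3) translations and (N, 3, 3) rotations into (N, 4, 4) transforms."""
    translations = np.asarray(translations, dtype=np.float32)
    rotations = np.asarray(rotations, dtype=np.float32)
    T = np.tile(np.eye(4, dtype=np.float32), (len(translations), 1, 1))
    T[:, :3, :3] = rotations
    T[:, :3, 3] = translations
    return T
```

Nearest-neighbour lookup is the simplest choice; interpolating between the two bracketing poses may be preferable when the pose rate is low relative to the camera motion.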

/imu

One row per /device/imu message.

| Dataset | Shape | Dtype | Notes |
| --- | --- | --- | --- |
| timestamps | (num_imu_samples,) | float64 | |
| linear_acceleration | (num_imu_samples, 3) | float32 | m/s². |
| angular_velocity | (num_imu_samples, 3) | float32 | rad/s. |
| orientation_xyzw | (num_imu_samples, 4) | float32 | xyzw quaternion (matches ROS convention). |

Skipped when: no /device/imu messages.
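
The scalar-last (xyzw) layout matters when converting orientations, because some libraries expect wxyz instead. As an illustration, here is a small NumPy sketch that turns orientation_xyzw rows into rotation matrices. The function name is hypothetical, and the code assumes the quaternions are non-zero; an all-zero row would produce NaNs.

```python
import numpy as np


def quat_xyzw_to_matrix(q):
    """Convert (..., 4) xyzw quaternions to (..., 3, 3) rotation matrices.

    Quaternions are normalised first. Assumes no all-zero rows.
    """
    q = np.asarray(q, dtype=np.float64)
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    x, y, z, w = q[..., 0], q[..., 1], q[..., 2], q[..., 3]
    R = np.empty(q.shape[:-1] + (3, 3), dtype=np.float64)
    R[..., 0, 0] = 1 - 2 * (y * y + z * z)
    R[..., 0, 1] = 2 * (x * y - z * w)
    R[..., 0, 2] = 2 * (x * z + y * w)
    R[..., 1, 0] = 2 * (x * y + z * w)
    R[..., 1, 1] = 1 - 2 * (x * x + z * z)
    R[..., 1, 2] = 2 * (y * z - x * w)
    R[..., 2, 0] = 2 * (x * z - y * w)
    R[..., 2, 1] = 2 * (y * z + x * w)
    R[..., 2, 2] = 1 - 2 * (x * x + y * y)
    return R


# Usage (requires an exported episode on disk):
# import h5py
# with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
#     R = quat_xyzw_to_matrix(f["imu"]["orientation_xyzw"][:])
```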

/hand-pose

Aligned per RGB frame, with separate datasets for the left and right hands. Entries are NaN-filled where no detection was attached for that frame.

Always written if session.add_hand_pose(...) was called at least once:

| Dataset | Shape | Dtype | Notes |
| --- | --- | --- | --- |
| timestamps | (num_rgb_frames,) | float64 | RGB-frame timestamps. |
| {left,right}_joints | (num_rgb_frames, 21, 3) | float32 | MANO joint order. NaN if no detection. |
| {left,right}_valid | (num_rgb_frames,) | bool | |
| {left,right}_confidence | (num_rgb_frames,) | float32 | |

Group attributes: coord_frame ("camera_3d" or "image_2d"), joint_layout (description string), and backend (e.g. "wilor", "hamer", "mediapipe") when the tracker stamped one.

Optional MANO + extras (WiLoR / HaMeR only)

Written only when the tracker attached the corresponding private fields to its HandPose outputs (save_mano_vertices=True etc.). NaN where missing.

| Dataset | Shape | Dtype | Notes |
| --- | --- | --- | --- |
| {left,right}_kpts_2d_rgb | (F, 21, 2) | float32 | Pixel coords in the RGB frame. |
| {left,right}_mano_vertices | (F, 778, 3) | float32 | Gzip-compressed (level 4). |
| {left,right}_mano_global_orient | (F, 1, 3, 3) | float32 | Hand root rotation. |
| {left,right}_mano_hand_pose | (F, 15, 3, 3) | float32 | Per-finger joint rotations. |
| {left,right}_mano_betas | (F, 10) | float32 | Shape coefficients. |
| {left,right}_pred_cam | (F, 3) | float32 | Weak-perspective [s, tx, ty]. |
| {left,right}_pred_cam_t | (F, 3) | float32 | |
| {left,right}_cam_t | (F, 3) | float32 | Translation in image space. |
| {left,right}_focal_length | (F, 2) | float32 | |

MANO extras are absent for MediaPipe, which doesn't produce them. Code that consumes them should check membership (name in f["hand-pose"]) rather than assume the dataset exists.
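
A minimal sketch of that membership check follows. The helper name read_optional is hypothetical; it relies only on h5py groups supporting the in operator and slice reads.

```python
def read_optional(group, name):
    """Return dataset `name` from `group` as an array, or None if it was not written.

    Hypothetical helper for illustration; not part of the exporting library.
    """
    if name not in group:
        return None
    return group[name][:]


# Usage (requires an exported episode on disk):
# import h5py
# with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
#     verts = read_optional(f["hand-pose"], "right_mano_vertices")
#     if verts is None:
#         print("no MANO vertices in this file")
```

Returning None keeps the absent case explicit at the call site, which tends to be easier to debug than a KeyError raised deep inside a processing loop.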

Skipped when: session.add_hand_pose was never called.

/metadata

Written as HDF5 group attributes (no datasets):

| Attribute | Type | Notes |
| --- | --- | --- |
| num_rgb_frames | int | |
| num_depth_frames | int | |
| num_pose_samples | int | |
| num_imu_samples | int | |
| duration_s | float | Recording duration in seconds. |
| start_time | float | First message timestamp (when summary available). |
| end_time | float | Last message timestamp. |
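
Because these values are attributes rather than datasets, they are read through the group's attrs mapping. The sketch below copies them into a plain dict; the function name read_metadata is hypothetical, and converting NumPy scalars to Python scalars is a convenience choice, not something the file format requires.

```python
import h5py
import numpy as np


def read_metadata(path):
    """Return the /metadata group attributes as a dict of plain Python values."""
    with h5py.File(path, "r") as f:
        attrs = f["metadata"].attrs
        return {
            # h5py returns NumPy scalars for numeric attributes; unwrap them.
            key: value.item() if isinstance(value, np.generic) else value
            for key, value in attrs.items()
        }


# Usage (requires an exported episode on disk):
# meta = read_metadata("episodes/run_01/annotation.hdf5")
# print(meta["num_rgb_frames"], meta.get("start_time"))
```

Using meta.get(...) for start_time guards against the case where that attribute was not written; the table only says it is present when a summary is available.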

Reading back examples

Get every detected right-hand wrist position

import h5py
import numpy as np

with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
    hp = f["hand-pose"]
    valid = hp["right_valid"][:]
    wrist = hp["right_joints"][:, 0]   # joint 0 = wrist
    print(wrist[valid].shape)          # (M, 3)

Reconstruct a depth point cloud for one frame

import h5py
import numpy as np

with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
    depth = f["depth"]["frames"][100]    # (H, W) uint16, millimetres
K = np.load("episodes/run_01/calibrations/depth_K.npy")
# ... feed into your back-projection of choice
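
For readers without a back-projection routine at hand, here is a minimal pinhole sketch. It assumes K is a 3x3 intrinsics matrix laid out as [[fx, 0, cx], [0, fy, cy], [0, 0, 1]], that lens distortion can be ignored, and that zero depth means "no reading"; none of these is stated by the schema, so check them against your calibration files. The function name backproject is hypothetical.

```python
import numpy as np


def backproject(depth_mm, K):
    """Back-project an (H, W) uint16 depth map in millimetres to (N, 3) points in metres.

    Points are expressed in the depth camera frame. Pixels with zero depth
    are dropped (assumed to mean "no reading").
    """
    K = np.asarray(K, dtype=np.float64)
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    h, w = depth_mm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel column and row grids
    z = depth_mm.astype(np.float64) / 1000.0          # millimetres -> metres
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]
```

With the variables from the snippet above, backproject(depth, K) would return the point cloud for frame 100.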

Load IMU samples into a pandas DataFrame

import h5py
import pandas as pd

with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
    imu = f["imu"]
    df = pd.DataFrame({
        "ts":   imu["timestamps"][:],
        **{f"a{ax}": imu["linear_acceleration"][:, i] for i, ax in enumerate("xyz")},
        **{f"g{ax}": imu["angular_velocity"][:, i]    for i, ax in enumerate("xyz")},
    })

See also