HDF5 schema

session.export(...) writes all time-series annotations into a single annotation.hdf5 file alongside the episode video and calibrations. This page is the canonical schema reference.

Top-level groups

annotation.hdf5
├── /depth           ─ per-RGB-frame depth maps, gzip-compressed
├── /cam-pose        ─ /camera/pose translations + rotations
├── /imu             ─ /device/imu samples
├── /hand-pose       ─ accumulated HandPose detections (only if buffered)
└── /metadata        ─ counts, durations, mcap start/end timestamps

Each group is independent: open the file with h5py and read whichever subset you need.

import h5py

with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
    print(list(f.keys()))
    print(f["depth"]["frames"].shape)
    print(dict(f["metadata"].attrs))   # /metadata is a group; its values are stored as attributes

/depth

Aligned to the RGB stream: one slot per frame.index. Frames with no matching depth message are zero-filled and have valid[i] = False.

Dataset	Shape	Dtype	Notes
`frames`	`(num_rgb_frames, H, W)`	`uint16`	Millimetres. Gzip level 4. Chunked as `(1, H, W)` for fast random-frame reads.
`timestamps`	`(num_rgb_frames,)`	`float64`	Depth-message timestamp; `0.0` when invalid.
`valid`	`(num_rgb_frames,)`	`bool`	True iff a depth match was found within `max_depth_dt`.

Group attributes: units="mm", height, width.

Skipped when: no depth intrinsics on the session.
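Because frames are stored as uint16 millimetres and unmatched frames are zero-filled, consumers usually want a float view with the missing data masked out. The following is a minimal sketch under stated assumptions: depth_to_metres is a hypothetical helper, not part of the library, and treating zero pixels inside a valid frame as "no reading" is an assumption about the sensor rather than something this schema guarantees. Synthetic arrays stand in for the datasets.

```python
import numpy as np

def depth_to_metres(frame_mm: np.ndarray, is_valid: bool) -> np.ndarray:
    """Convert one uint16 millimetre depth frame to float32 metres.

    Frames flagged invalid become all-NaN. Zero pixels inside a valid
    frame are also set to NaN (assumption: 0 means "no reading").
    """
    depth_m = frame_mm.astype(np.float32) / 1000.0
    if not is_valid:
        return np.full_like(depth_m, np.nan)
    depth_m[frame_mm == 0] = np.nan
    return depth_m

# Synthetic stand-ins for f["depth"]["frames"][i] and f["depth"]["valid"][i].
frame = np.array([[0, 1500], [2000, 250]], dtype=np.uint16)
metres = depth_to_metres(frame, is_valid=True)
print(metres)
```

With a real file, frame would come from f["depth"]["frames"][i] and is_valid from f["depth"]["valid"][i]. Reading a single index keeps the access pattern aligned with the (1, H, W) chunking.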

/cam-pose

One row per /camera/pose message (not per RGB frame).

Dataset	Shape	Dtype	Notes
`timestamps`	`(num_pose_samples,)`	`float64`
`translations`	`(num_pose_samples, 3)`	`float32`	Metres, world frame.
`rotations`	`(num_pose_samples, 3, 3)`	`float32`	Camera-link → world rotation.

Skipped when: no /camera/pose messages.
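Since pose rows are not aligned to RGB frames, a consumer has to match them by timestamp. The sketch below shows one way to do that with a nearest-neighbour lookup and then assemble 4x4 camera-link → world transforms. Both helper names are hypothetical, and the code assumes that pose timestamps are sorted ascending and share a clock with the query timestamps. Synthetic arrays stand in for the datasets.

```python
import numpy as np

def nearest_pose_indices(pose_ts, query_ts):
    """Index of the pose sample closest in time to each query timestamp.

    Assumes pose_ts is sorted ascending and non-empty.
    """
    pose_ts = np.asarray(pose_ts, dtype=np.float64)
    query_ts = np.atleast_1d(np.asarray(query_ts, dtype=np.float64))
    if len(pose_ts) == 1:
        return np.zeros(len(query_ts), dtype=np.intp)
    # Candidate neighbours on either side of each query time.
    right = np.clip(np.searchsorted(pose_ts, query_ts), 1, len(pose_ts) - 1)
    left = right - 1
    pick_right = (pose_ts[right] - query_ts) < (query_ts - pose_ts[left])
    return np.where(pick_right, right, left)

def pose_matrices(translations, rotations):
    """Stack (N, 3) translations and (N, 3, 3) rotations into (N, 4, 4) transforms."""
    n = len(translations)
    T = np.tile(np.eye(4, dtype=np.float32), (n, 1, 1))
    T[:, :3, :3] = rotations
    T[:, :3, 3] = translations
    return T

# Synthetic stand-ins for the /cam-pose datasets and a set of RGB timestamps.
pose_ts = np.array([0.0, 1.0, 2.0])
translations = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0]], dtype=np.float32)
rotations = np.tile(np.eye(3, dtype=np.float32), (3, 1, 1))
rgb_ts = np.array([0.1, 0.9, 1.6])

idx = nearest_pose_indices(pose_ts, rgb_ts)
T_world_cam = pose_matrices(translations, rotations)[idx]  # one transform per RGB frame
print(idx, T_world_cam.shape)
```

Nearest-neighbour matching is the simplest choice. If the pose rate is low relative to the camera motion, interpolating between the two neighbouring samples may be more appropriate.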

/imu

One row per /device/imu message.

Dataset	Shape	Dtype	Notes
`timestamps`	`(num_imu_samples,)`	`float64`
`linear_acceleration`	`(num_imu_samples, 3)`	`float32`	m/s².
`angular_velocity`	`(num_imu_samples, 3)`	`float32`	rad/s.
`orientation_xyzw`	`(num_imu_samples, 4)`	`float32`	xyzw quaternion (matches ROS convention).

Skipped when: no /device/imu messages.
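The xyzw ordering matters when converting orientation_xyzw to rotation matrices, because some libraries expect wxyz instead. As an illustration, here is a small NumPy-only sketch of the standard unit-quaternion-to-matrix formula. quat_xyzw_to_matrix is a hypothetical helper, not part of the library, and the input is normalised defensively, so an all-zero quaternion would produce NaNs.

```python
import numpy as np

def quat_xyzw_to_matrix(q):
    """Convert quaternion(s) in xyzw order, shape (..., 4), to rotation matrices (..., 3, 3)."""
    q = np.asarray(q, dtype=np.float64)
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)  # guard against slight denormalisation
    x, y, z, w = q[..., 0], q[..., 1], q[..., 2], q[..., 3]
    R = np.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - z * w),     2 * (x * z + y * w),
        2 * (x * y + z * w),     1 - 2 * (x * x + z * z), 2 * (y * z - x * w),
        2 * (x * z - y * w),     2 * (y * z + x * w),     1 - 2 * (x * x + y * y),
    ], axis=-1)
    return R.reshape(q.shape[:-1] + (3, 3))

# Synthetic stand-in for f["imu"]["orientation_xyzw"][:]:
# identity, then a 90-degree rotation about the z axis.
quats = np.array([
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, np.sin(np.pi / 4), np.cos(np.pi / 4)],
])
R = quat_xyzw_to_matrix(quats)
print(np.round(R[1], 3))
```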

/hand-pose

Aligned per RGB frame, with separate datasets for the left and right hands. Entries are NaN-filled for frames with no attached detection.

Always written if session.add_hand_pose(...) was called at least once:

Dataset	Shape	Dtype	Notes
`timestamps`	`(num_rgb_frames,)`	`float64`	RGB-frame timestamps.
`{left,right}_joints`	`(num_rgb_frames, 21, 3)`	`float32`	MANO joint order. NaN if no detection.
`{left,right}_valid`	`(num_rgb_frames,)`	`bool`
`{left,right}_confidence`	`(num_rgb_frames,)`	`float32`

Group attributes: coord_frame ("camera_3d" or "image_2d"), joint_layout (description string), and backend (e.g. "wilor", "hamer", "mediapipe") when the tracker stamped one.

Optional MANO + extras (WiLoR / HaMeR only)

Written only when the tracker attached the corresponding private fields to its HandPose outputs (save_mano_vertices=True etc.). NaN where missing. In the shapes below, F is shorthand for num_rgb_frames.

Dataset	Shape	Dtype	Notes
`{left,right}_kpts_2d_rgb`	`(F, 21, 2)`	`float32`	Pixel coords in the RGB frame.
`{left,right}_mano_vertices`	`(F, 778, 3)`	`float32`	Gzip-compressed (level 4).
`{left,right}_mano_global_orient`	`(F, 1, 3, 3)`	`float32`	Hand root rotation.
`{left,right}_mano_hand_pose`	`(F, 15, 3, 3)`	`float32`	Per-finger joint rotations.
`{left,right}_mano_betas`	`(F, 10)`	`float32`	Shape coefficients.
`{left,right}_pred_cam`	`(F, 3)`	`float32`	Weak-perspective `[s, tx, ty]`.
`{left,right}_pred_cam_t`	`(F, 3)`	`float32`
`{left,right}_cam_t`	`(F, 3)`	`float32`	Translation in image space.
`{left,right}_focal_length`	`(F, 2)`	`float32`

MANO extras are absent for MediaPipe, which doesn't produce them. Code that consumes them should check key in f["hand-pose"] rather than assuming presence.
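The presence check can be wrapped in a small helper so that callers handle a missing dataset explicitly. This is a sketch only: read_optional is a hypothetical name, and a plain dict of arrays stands in for the f["hand-pose"] group so the example runs without a file. The same `name in group` test works on an h5py group.

```python
import numpy as np

def read_optional(group, name):
    """Return the dataset `name` as an array, or None when the key is absent."""
    return group[name][:] if name in group else None

# A plain dict stands in for f["hand-pose"] from a MediaPipe run:
# the core datasets exist, the MANO extras do not.
hand_pose = {
    "right_joints": np.full((4, 21, 3), np.nan, dtype=np.float32),
    "right_valid": np.zeros(4, dtype=bool),
}

vertices = read_optional(hand_pose, "right_mano_vertices")
joints = read_optional(hand_pose, "right_joints")
if vertices is None:
    print("no MANO vertices in this file; falling back to joints", joints.shape)
```

Returning None rather than raising keeps the fallback path visible at the call site. Code that strictly requires the extras may prefer to fail early instead.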

Skipped when: session.add_hand_pose was never called.

/metadata

Written as HDF5 group attributes (no datasets):

Attribute	Type	Notes
`num_rgb_frames`	int
`num_depth_frames`	int
`num_pose_samples`	int
`num_imu_samples`	int
`duration_s`	float	Recording duration in seconds.
`start_time`	float	First message timestamp (when summary available).
`end_time`	float	Last message timestamp.

Read-back examples

Get the wrist position for every detected right hand

import h5py
import numpy as np

with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
    hp = f["hand-pose"]
    valid = hp["right_valid"][:]
    wrist = hp["right_joints"][:, 0]   # joint 0 = wrist
    print(wrist[valid].shape)          # (M, 3)

Reconstruct a depth point cloud for one frame

import h5py
import numpy as np

with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
    depth = f["depth"]["frames"][100]    # (H, W) uint16, millimetres
K = np.load("episodes/run_01/calibrations/depth_K.npy")
# ... feed into your back-projection of choice
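For readers without a back-projection routine at hand, the following is a minimal pinhole-model sketch. It assumes that K is a 3x3 intrinsic matrix with fx, fy on the diagonal and cx, cy in the last column, and that zero depth marks pixels to skip. Neither assumption is confirmed by this schema, and backproject is a hypothetical helper. Lens distortion is ignored.

```python
import numpy as np

def backproject(depth_mm: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project an (H, W) millimetre depth map to an (M, 3) point cloud in metres.

    Uses the pinhole model x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    Pixels with zero depth are dropped (assumption: 0 means "no reading").
    """
    h, w = depth_mm.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row grids
    z = depth_mm.astype(np.float32) / 1000.0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[z.reshape(-1) > 0]

# Tiny synthetic example: a 2x2 depth map and a made-up intrinsic matrix.
K_demo = np.array([[2.0, 0.0, 1.0],
                   [0.0, 2.0, 1.0],
                   [0.0, 0.0, 1.0]])
depth_demo = np.array([[1000, 0], [0, 2000]], dtype=np.uint16)
cloud = backproject(depth_demo, K_demo)
print(cloud)
```

The resulting points are in the depth camera's own frame. Moving them into the world frame would additionally require a pose from /cam-pose and any depth-to-camera-link extrinsics, which are outside the scope of this sketch.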

Load IMU samples into a pandas DataFrame

import h5py
import pandas as pd

with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
    imu = f["imu"]
    df = pd.DataFrame({
        "ts":   imu["timestamps"][:],
        **{f"a{ax}": imu["linear_acceleration"][:, i] for i, ax in enumerate("xyz")},
        **{f"g{ax}": imu["angular_velocity"][:, i]    for i, ax in enumerate("xyz")},
    })