HDF5 schema
session.export(...) writes all time-series annotations into a single annotation.hdf5 file alongside the episode video and calibrations. This page is the canonical schema reference.
Top-level groups
annotation.hdf5
├── /depth ─ per-RGB-frame depth maps, gzip-compressed
├── /cam-pose ─ /camera/pose translations + rotations
├── /imu ─ /device/imu samples
├── /hand-pose ─ accumulated HandPose detections (only if buffered)
└── /metadata ─ counts, durations, mcap start/end timestamps
Each group is independent: open the file with h5py and read whichever subset you need.
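import h5py

with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
    print(list(f.keys()))              # groups present in this episode
    print(f["depth"]["frames"].shape)  # (num_rgb_frames, H, W)
    print(dict(f["metadata"].attrs))   # /metadata is stored as group attributes
/depth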
Aligned per RGB frame, with one slot per frame.index. Frames with no matched depth are zero-filled and have valid[i] = False.
| Dataset | Shape | Dtype | Notes |
|---|---|---|---|
| frames | (num_rgb_frames, H, W) | uint16 | Millimetres. Gzip level 4. Chunked as (1, H, W) for fast random-frame reads. |
| timestamps | (num_rgb_frames,) | float64 | Depth-message timestamp; 0.0 when invalid. |
| valid | (num_rgb_frames,) | bool | True iff a depth match was found within max_depth_dt. |
Group attributes: units="mm", height, width.
Skipped when: no depth intrinsics on the session.
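As a rough illustration of consuming this group, the sketch below converts one depth frame from millimetres to metres and back-projects it through a pinhole camera model. It assumes that K is a 3×3 intrinsics matrix of the form [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] and that a zero depth value means "no measurement"; neither is stated by this schema. The helper names backproject_depth and frame_point_cloud are hypothetical.

```python
import numpy as np


def backproject_depth(depth_mm, K):
    """Back-project a (H, W) depth map in millimetres to (N, 3) points in metres.

    Assumes a pinhole model with K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    and treats zero depth as "no measurement" (both are assumptions).
    """
    depth_m = depth_mm.astype(np.float32) / 1000.0  # mm -> m
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    v, u = np.nonzero(depth_m > 0)  # pixel rows (v) and columns (u) with depth
    z = depth_m[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def frame_point_cloud(depth_group, index, K):
    """Return the points for one RGB frame, or None when valid[index] is False."""
    if not depth_group["valid"][index]:
        return None
    return backproject_depth(depth_group["frames"][index], K)
```

With an open file f, frame_point_cloud(f["depth"], 100, K) reads a single (1, H, W) chunk, so only the requested frame is decompressed.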
/cam-pose
One row per /camera/pose message (not per RGB frame).
| Dataset | Shape | Dtype | Notes |
|---|---|---|---|
| timestamps | (num_pose_samples,) | float64 | |
| translations | (num_pose_samples, 3) | float32 | Metres, world frame. |
| rotations | (num_pose_samples, 3, 3) | float32 | Camera-link → world rotation. |
Skipped when: no /camera/pose messages.
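Because poses are stored per message rather than per RGB frame, consumers usually need to look up the pose closest to a given frame time. The sketch below does a nearest-timestamp lookup and assembles a 4×4 homogeneous transform. It assumes the pose timestamps are sorted ascending and share a clock with the timestamp being queried; the schema does not state either. The name nearest_pose is hypothetical.

```python
import numpy as np


def nearest_pose(timestamps, translations, rotations, query_ts):
    """Return (row index, 4x4 camera-link -> world transform) nearest to query_ts.

    Assumes `timestamps` is non-empty and sorted ascending (an assumption,
    not a schema guarantee).
    """
    ts = np.asarray(timestamps)
    # First index whose timestamp is >= query_ts, clamped into range.
    right = int(np.clip(np.searchsorted(ts, query_ts), 0, len(ts) - 1))
    left = max(right - 1, 0)
    # Pick whichever neighbour is closer in time.
    i = left if abs(query_ts - ts[left]) <= abs(ts[right] - query_ts) else right
    T = np.eye(4)
    T[:3, :3] = rotations[i]
    T[:3, 3] = translations[i]
    return i, T
```

A typical call would pass f["cam-pose"]["timestamps"][:] together with the translations and rotations datasets, and an RGB-frame timestamp as query_ts.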
/imu
One row per /device/imu message.
| Dataset | Shape | Dtype | Notes |
|---|---|---|---|
| timestamps | (num_imu_samples,) | float64 | |
| linear_acceleration | (num_imu_samples, 3) | float32 | m/s². |
| angular_velocity | (num_imu_samples, 3) | float32 | rad/s. |
| orientation_xyzw | (num_imu_samples, 4) | float32 | xyzw quaternion (matches ROS convention). |
Skipped when: no /device/imu messages.
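The scalar-last (xyzw) ordering matters when converting orientations, since some libraries expect the scalar first. As a minimal sketch, the function below turns a batch of xyzw quaternions into rotation matrices using the standard unit-quaternion formula. The name quat_xyzw_to_matrix is hypothetical, and an all-zero quaternion would produce NaN rows because of the normalisation step.

```python
import numpy as np


def quat_xyzw_to_matrix(q):
    """Convert (N, 4) xyzw quaternions to (N, 3, 3) rotation matrices."""
    q = np.asarray(q, dtype=np.float64)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)  # renormalise float32 input
    x, y, z, w = q[:, 0], q[:, 1], q[:, 2], q[:, 3]
    R = np.empty((len(q), 3, 3))
    R[:, 0, 0] = 1 - 2 * (y * y + z * z)
    R[:, 0, 1] = 2 * (x * y - z * w)
    R[:, 0, 2] = 2 * (x * z + y * w)
    R[:, 1, 0] = 2 * (x * y + z * w)
    R[:, 1, 1] = 1 - 2 * (x * x + z * z)
    R[:, 1, 2] = 2 * (y * z - x * w)
    R[:, 2, 0] = 2 * (x * z - y * w)
    R[:, 2, 1] = 2 * (y * z + x * w)
    R[:, 2, 2] = 1 - 2 * (x * x + y * y)
    return R
```

With an open file f, quat_xyzw_to_matrix(f["imu"]["orientation_xyzw"][:]) yields one matrix per IMU sample.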
/hand-pose
Aligned per RGB frame, with separate datasets for the left and right hands. NaN-filled where no detection was attached to that frame.
Always written if session.add_hand_pose(...) was called at least once:
| Dataset | Shape | Dtype | Notes |
|---|---|---|---|
| timestamps | (num_rgb_frames,) | float64 | RGB-frame timestamps. |
| {left,right}_joints | (num_rgb_frames, 21, 3) | float32 | MANO joint order. NaN if no detection. |
| {left,right}_valid | (num_rgb_frames,) | bool | |
| {left,right}_confidence | (num_rgb_frames,) | float32 | |
Group attributes: coord_frame ("camera_3d" or "image_2d"), joint_layout (description string), and backend (e.g. "wilor", "hamer", "mediapipe") when the tracker stamped one.
Optional MANO + extras (WiLoR / HaMeR only)
Written only when the tracker attached the corresponding private fields to its HandPose outputs (save_mano_vertices=True etc.). NaN where missing.
| Dataset | Shape | Dtype | Notes |
|---|---|---|---|
| {left,right}_kpts_2d_rgb | (F, 21, 2) | float32 | Pixel coords in the RGB frame. |
| {left,right}_mano_vertices | (F, 778, 3) | float32 | Gzip-compressed (level 4). |
| {left,right}_mano_global_orient | (F, 1, 3, 3) | float32 | Hand root rotation. |
| {left,right}_mano_hand_pose | (F, 15, 3, 3) | float32 | Per-finger joint rotations. |
| {left,right}_mano_betas | (F, 10) | float32 | Shape coefficients. |
| {left,right}_pred_cam | (F, 3) | float32 | Weak-perspective [s, tx, ty]. |
| {left,right}_pred_cam_t | (F, 3) | float32 | |
| {left,right}_cam_t | (F, 3) | float32 | Translation in image space. |
| {left,right}_focal_length | (F, 2) | float32 | |
MANO extras are absent for MediaPipe, which does not produce them. Code that
consumes them should check key in f["hand-pose"] rather than assuming
the dataset is present.
Skipped when: session.add_hand_pose was never called.
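One way to follow the presence-check advice is to collect only the optional datasets that actually exist for a given hand. The sketch below does that; load_optional_extras and OPTIONAL_SUFFIXES are hypothetical names, and the suffix list is copied from the optional-datasets table in this section.

```python
# Optional dataset suffixes, as listed in the optional MANO + extras table.
OPTIONAL_SUFFIXES = (
    "kpts_2d_rgb", "mano_vertices", "mano_global_orient", "mano_hand_pose",
    "mano_betas", "pred_cam", "pred_cam_t", "cam_t", "focal_length",
)


def load_optional_extras(hand_group, side):
    """Return {suffix: array} for the optional datasets present for `side`.

    `hand_group` is f["hand-pose"] (any mapping of name -> array-like works);
    `side` is "left" or "right". Missing datasets are skipped, not an error.
    """
    if side not in ("left", "right"):
        raise ValueError(f"side must be 'left' or 'right', got {side!r}")
    extras = {}
    for suffix in OPTIONAL_SUFFIXES:
        key = f"{side}_{suffix}"
        if key in hand_group:  # presence check instead of assuming the key
            extras[suffix] = hand_group[key][:]  # read the full dataset
    return extras
```

Since the whole group is skipped when session.add_hand_pose was never called, callers would also want to check "hand-pose" in f before passing f["hand-pose"] to this helper.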
/metadata
Written as HDF5 group attributes (no datasets):
| Attribute | Type | Notes |
|---|---|---|
| num_rgb_frames | int | |
| num_depth_frames | int | |
| num_pose_samples | int | |
| num_imu_samples | int | |
| duration_s | float | Recording duration in seconds. |
| start_time | float | First message timestamp (when summary available). |
| end_time | float | Last message timestamp. |
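Since these values are attributes rather than datasets, they are read through the group's attrs mapping. The sketch below copies them into a plain dict and derives average per-second rates from the counts and duration_s. The function name summarize_metadata and the derived *_per_s keys are hypothetical and not part of the schema; attributes that are missing are simply skipped.

```python
def summarize_metadata(attrs):
    """Copy /metadata attributes into a plain dict and add average rates."""
    meta = {}
    for key, value in attrs.items():
        # h5py returns NumPy scalars; .item() converts them to Python numbers.
        meta[key] = value.item() if hasattr(value, "item") else value
    duration = meta.get("duration_s")
    if duration:  # skip rates when the duration is missing or zero
        for name in ("rgb_frames", "depth_frames", "pose_samples", "imu_samples"):
            count = meta.get(f"num_{name}")
            if count is not None:
                meta[f"{name}_per_s"] = count / duration
    return meta
```

With an open file f, summarize_metadata(f["metadata"].attrs) would return the stored attributes plus the derived rates.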
Reading back examples
Get every detected right-hand wrist position
import h5py
import numpy as np

with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
    hp = f["hand-pose"]
    valid = hp["right_valid"][:]
    wrist = hp["right_joints"][:, 0]  # joint 0 = wrist
    print(wrist[valid].shape)  # (M, 3)
Reconstruct a depth point cloud for one frame
import h5py
import numpy as np

with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
    depth = f["depth"]["frames"][100]  # (H, W) uint16, millimetres
K = np.load("episodes/run_01/calibrations/depth_K.npy")
# ... feed into your back-projection of choice
Stream IMU as a pandas DataFrame
import h5py
import pandas as pd

with h5py.File("episodes/run_01/annotation.hdf5", "r") as f:
    imu = f["imu"]
    df = pd.DataFrame({
        "ts": imu["timestamps"][:],
        **{f"a{ax}": imu["linear_acceleration"][:, i] for i, ax in enumerate("xyz")},
        **{f"g{ax}": imu["angular_velocity"][:, i] for i, ax in enumerate("xyz")},
    })
See also
- Episode export guide: how detections get attached.
- Episode layout: full directory tree.
- HandPose: what each tracker writes into the HDF5 file.