Concepts

Synced frames

How session.frames() pairs each RGB frame with the nearest depth image, camera pose, and IMU sample, and what fields a SyncedFrame holds.

What it is

A synced frame bundles the data streams that a Stera recording produces (RGB, depth, camera pose, IMU) at a single moment in time. The SDK yields one synced frame per RGB frame from session.frames():

for frame in session.frames():
    frame.rgb            # (H, W, 3) uint8
    frame.depth          # (H, W) uint16 mm, or None
    frame.camera_pose    # Pose6D in world frame, or None
    frame.imu            # dict, or None
    frame.depth_K        # (3, 3) depth intrinsics
    frame.rgb_K          # (3, 3) RGB intrinsics
    frame.timestamp      # seconds (RGB clock)
    frame.index          # 0..num_rgb_frames - 1

How synchronisation works

The RGB clock drives the loop. For each RGB frame at timestamp t_rgb:

  • Depth: nearest-neighbour match by timestamp, dropped if |t_depth - t_rgb| > max_depth_dt (default 0.1 s).
  • Camera pose: same nearest-neighbour rule against /camera/pose, with max_pose_dt (default 0.1 s).
  • IMU: nearest-neighbour with a tight 50 ms window. IMU rates are typically 100-200 Hz, so a sample almost always falls inside it.

If a stream is missing from the MCAP entirely (no depth topic, no poses), the corresponding field is None on every frame and the loop carries on.

You can tighten or loosen the matching windows:

for frame in session.frames(max_depth_dt=0.03, max_pose_dt=0.05):
    ...
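
To judge whether a window is too tight, it can help to count how many frames come back with a field set to None. The helper below is a minimal sketch, not part of the SDK: count_missing is a hypothetical name, and the SimpleNamespace objects stand in for SyncedFrame so the snippet runs without a recording.

```python
from types import SimpleNamespace


def count_missing(frames, fields=("depth", "camera_pose", "imu")):
    """Return (total_frames, {field: number of frames where it is None})."""
    missing = {name: 0 for name in fields}
    total = 0
    for frame in frames:
        total += 1
        for name in fields:
            if getattr(frame, name) is None:
                missing[name] += 1
    return total, missing


# Stand-in frames; with a real recording you would pass
# session.frames(max_depth_dt=..., max_pose_dt=...) instead.
demo_frames = [
    SimpleNamespace(depth=1, camera_pose=1, imu=1),
    SimpleNamespace(depth=None, camera_pose=1, imu=1),
    SimpleNamespace(depth=None, camera_pose=None, imu=1),
]
total, missing = count_missing(demo_frames)
print(total, missing)
```

If the depth count climbs sharply when you lower max_depth_dt, the window is probably narrower than the gap between depth and RGB timestamps.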

Depth and pose are loaded eagerly into memory before the loop starts, so the nearest-neighbour lookup is O(log N) per frame. RGB is streamed from the MCAP one frame at a time.
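
The matching rule can be illustrated with a binary search over sorted timestamps. This is a sketch of the general technique under the assumption that timestamps are sorted in ascending order; nearest_index is a hypothetical helper, and the SDK's internal implementation may differ.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def nearest_index(timestamps: Sequence[float], t: float, max_dt: float) -> Optional[int]:
    """Index of the timestamp closest to t, or None if it is more than max_dt away.

    `timestamps` must be sorted in ascending order.
    """
    if len(timestamps) == 0:
        return None
    i = bisect_left(timestamps, t)  # first index with timestamps[i] >= t
    # The nearest sample is either just before or just at/after the insertion point.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    best = min(candidates, key=lambda j: abs(timestamps[j] - t))
    if abs(timestamps[best] - t) > max_dt:
        return None  # outside the window: the field would be None
    return best


depth_ts = [0.00, 0.10, 0.20, 0.30]  # made-up depth timestamps in seconds
print(nearest_index(depth_ts, 0.12, max_dt=0.1))  # 1
print(nearest_index(depth_ts, 0.55, max_dt=0.1))  # None
```

Only the two neighbours of the insertion point need checking, which is what keeps the per-frame cost logarithmic.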

Intrinsics on every frame

Every SyncedFrame carries frame.depth_K and frame.rgb_K so downstream code (e.g. HandTracker.detect_hands) doesn't have to pass intrinsics around separately. They're set once from the session and copied into each yielded frame.

fx = frame.depth_K[0, 0]
fy = frame.depth_K[1, 1]
cx = frame.depth_K[0, 2]
cy = frame.depth_K[1, 2]

When the MCAP contains no dedicated depth-camera info topic, depth_K is computed by scaling rgb_K to match the depth image resolution.
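
As an illustration of what such a rescaling usually looks like for a pinhole model, the sketch below multiplies the first row of K by the width ratio and the second row by the height ratio. scale_intrinsics and the example resolutions are assumptions made for this example. The SDK's own computation may differ, for instance in how it treats pixel centres or aspect-ratio changes.

```python
import numpy as np


def scale_intrinsics(K, src_size, dst_size):
    """Rescale a 3x3 pinhole matrix from src_size to dst_size, both (width, height)."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    K_scaled = np.asarray(K, dtype=np.float64).copy()
    K_scaled[0, :] *= sx  # fx, skew, cx scale with image width
    K_scaled[1, :] *= sy  # fy, cy scale with image height
    return K_scaled


# Made-up RGB intrinsics for a 1280x720 image, rescaled to a 640x360 depth image.
rgb_K = np.array([[1000.0, 0.0, 640.0],
                  [0.0, 1000.0, 360.0],
                  [0.0, 0.0, 1.0]])
depth_K = scale_intrinsics(rgb_K, src_size=(1280, 720), dst_size=(640, 360))
print(depth_K)
```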

Coordinate conventions

  • frame.rgb and frame.depth are in the camera optical frame (X right, Y down, Z forward).
  • frame.camera_pose is in the world frame, with translation in metres.
  • frame.depth is uint16 millimetres (a value of 1230 means 1.23 m). Convert to metres with depth.astype(np.float32) / 1000.0.

See Coordinate frames for the full optical → link → world chain and how R_optical_to_link is derived from your TF tree.

Common patterns

Project a depth pixel into world space

import numpy as np
from stera.core.transforms import optical_to_world

K = frame.depth_K
fx, fy = K[0, 0], K[1, 1]
cx, cy = K[0, 2], K[1, 2]

u, v = 320, 240                              # pixel of interest (column, row)
z = float(frame.depth[v, u]) / 1000.0        # metres; assumes frame.depth is not None
pt_optical = np.array([
    (u - cx) * z / fx,
    (v - cy) * z / fy,
    z,
])

# Lift to world if a pose is present
if frame.camera_pose is not None:
    pt_world = optical_to_world(
        pt_optical[None, :],
        frame.camera_pose.rotation,
        frame.camera_pose.translation,
        session.R_optical_to_link,
    )[0]
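
The same pinhole arithmetic extends to a whole depth image. The sketch below back-projects every pixel with NumPy; depth_to_points is a hypothetical helper, the depth image and K are synthetic, and treating a depth of 0 as "no reading" is an assumption that the text above does not state.

```python
import numpy as np


def depth_to_points(depth_mm, K):
    """Back-project an (H, W) uint16 depth image in mm to (N, 3) optical-frame points in metres."""
    h, w = depth_mm.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    v, u = np.indices((h, w))                 # row (v) and column (u) index grids
    z = depth_mm.astype(np.float32) / 1000.0  # millimetres -> metres
    valid = z > 0                             # assumption: 0 marks a missing reading
    x = (u[valid] - cx) * z[valid] / fx
    y = (v[valid] - cy) * z[valid] / fy
    return np.stack([x, y, z[valid]], axis=1)


# Synthetic 3x4 depth image, 1 m everywhere except one missing pixel.
K = np.array([[500.0, 0.0, 2.0],
              [0.0, 500.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 4), 1000, dtype=np.uint16)
depth[0, 0] = 0
points = depth_to_points(depth, K)
print(points.shape)
```

With a real frame you would pass frame.depth and frame.depth_K. The single-point example passes optical_to_world a (1, 3) array, so it presumably accepts the (N, 3) result as well, but check its documentation before relying on that.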

Skip frames without depth

for frame in session.frames():
    if frame.depth is None:
        continue
    ...

Limit a run to N frames

The SDK doesn't have a max_frames knob, so break out of the loop yourself:

for i, frame in enumerate(session.frames()):
    if i >= 500:
        break
    ...
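
An equivalent option is itertools.islice from the standard library, which stops consuming an iterator after N items. The sketch assumes session.frames() returns an ordinary iterator; fake_frames is a stand-in so the snippet runs without a recording.

```python
from itertools import islice


def fake_frames():
    """Stand-in for session.frames(): an endless stream of dummy frames."""
    i = 0
    while True:
        yield i
        i += 1


# With a real session: for frame in islice(session.frames(), 500): ...
processed = [frame for frame in islice(fake_frames(), 500)]
print(len(processed))  # 500
```

Because islice is lazy, frames beyond the limit are never requested from the underlying iterator.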

See also