Synced frames
How session.frames() pairs each RGB frame with the nearest depth image, camera pose, and IMU sample, and what fields a SyncedFrame holds.
What it is
A synced frame bundles the data streams that a Stera recording produces (RGB, depth, camera pose, IMU) at a single moment in time. The SDK yields them from session.frames():
for frame in session.frames():
    frame.rgb          # (H, W, 3) uint8
    frame.depth        # (H, W) uint16 mm, or None
    frame.camera_pose  # Pose6D in world frame, or None
    frame.imu          # dict, or None
    frame.depth_K      # (3, 3) depth intrinsics
    frame.rgb_K        # (3, 3) RGB intrinsics
    frame.timestamp    # seconds (RGB clock)
    frame.index        # 0..num_rgb_frames - 1
How synchronisation works
The RGB clock drives the loop. For each RGB frame at timestamp t_rgb:
- Depth: nearest-neighbour match by timestamp, dropped if |t_depth - t_rgb| > max_depth_dt (default 0.1 s).
- Camera pose: same nearest-neighbour rule against /camera/pose, with max_pose_dt (default 0.1 s).
- IMU: nearest-neighbour with a tight 50 ms window. IMU rates are typically 100-200 Hz, so this almost always lands on a sample.
If a stream is missing in the MCAP entirely (no depth topic, no poses), the corresponding field is None and the loop carries on.
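Before tuning the matching windows, it can help to see how often each optional field actually comes back as None. The helper below is a minimal sketch, not part of the SDK: stream_coverage is a hypothetical name, and the SimpleNamespace objects only stand in for real SyncedFrame instances so the snippet runs on its own.

```python
from types import SimpleNamespace

OPTIONAL_STREAMS = ("depth", "camera_pose", "imu")

def stream_coverage(frames):
    """Count how many frames carry a non-None value for each optional stream.

    Hypothetical helper: it only relies on the attribute names listed above.
    """
    counts = {"total": 0, **{name: 0 for name in OPTIONAL_STREAMS}}
    for frame in frames:
        counts["total"] += 1
        for name in OPTIONAL_STREAMS:
            if getattr(frame, name) is not None:
                counts[name] += 1
    return counts

# Stand-in frames for illustration; in practice pass session.frames().
demo_frames = [
    SimpleNamespace(depth=1, camera_pose=None, imu={}),
    SimpleNamespace(depth=None, camera_pose=None, imu={}),
]
coverage = stream_coverage(demo_frames)
print(coverage)
```

A depth count well below the total suggests that max_depth_dt is too tight for the recording, or that the depth stream runs at a lower rate than RGB.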
You can tighten or loosen the matching windows:
for frame in session.frames(max_depth_dt=0.03, max_pose_dt=0.05):
    ...
Depth and pose are loaded eagerly into memory before the loop starts, so the nearest-neighbour lookup is O(log N) per frame. RGB is streamed from the MCAP one frame at a time.
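The nearest-neighbour rule with a maximum time difference can be sketched as a binary search over a sorted timestamp array. This is an illustration of the technique only, not the SDK's internal code: nearest_index is a hypothetical name, and the timestamps are made up.

```python
import numpy as np

def nearest_index(sorted_ts, t, max_dt):
    """Return the index of the timestamp closest to t, or None.

    sorted_ts must be a 1-D ascending array. The match is rejected
    when the closest timestamp is farther than max_dt from t.
    """
    n = len(sorted_ts)
    if n == 0:
        return None
    # Binary search: first index whose timestamp is >= t, in O(log N).
    i = int(np.searchsorted(sorted_ts, t))
    # The nearest sample is either just before or just after t.
    candidates = [j for j in (i - 1, i) if 0 <= j < n]
    best = min(candidates, key=lambda j: abs(sorted_ts[j] - t))
    return best if abs(sorted_ts[best] - t) <= max_dt else None

depth_ts = np.array([0.00, 0.10, 0.20, 0.30])   # made-up depth timestamps (s)
hit = nearest_index(depth_ts, 0.12, max_dt=0.1)   # closest sample is index 1
miss = nearest_index(depth_ts, 0.55, max_dt=0.1)  # 0.25 s away, so rejected
```

Because only the timestamps need to be sorted and searchable, the lookup stays cheap even for long recordings. The cost of eager loading is memory, not per-frame time.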
Intrinsics on every frame
Every SyncedFrame carries frame.depth_K and frame.rgb_K so downstream code (e.g. HandTracker.detect_hands) doesn't have to re-plumb intrinsics through. They're set once from the session and copied into each yielded frame.
fx = frame.depth_K[0, 0]
fy = frame.depth_K[1, 1]
cx = frame.depth_K[0, 2]
cy = frame.depth_K[1, 2]
When the MCAP contains no dedicated depth-camera info topic, depth_K is computed by scaling rgb_K to match the depth image resolution.
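For intuition, rescaling a pinhole intrinsics matrix to a different resolution usually means multiplying the first row by the width ratio and the second row by the height ratio. The sketch below shows that common approach. Whether the SDK does exactly this (for example, how it treats half-pixel offsets) is an assumption, and scale_intrinsics and the numbers are hypothetical.

```python
import numpy as np

def scale_intrinsics(K, src_size, dst_size):
    """Rescale a 3x3 pinhole matrix from src (width, height) to dst (width, height).

    Hypothetical helper; ignores any half-pixel principal-point correction.
    """
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    K_scaled = np.asarray(K, dtype=np.float64).copy()
    K_scaled[0, :] *= sx  # fx, skew, and cx scale with image width
    K_scaled[1, :] *= sy  # fy and cy scale with image height
    return K_scaled

# Made-up RGB intrinsics for a 1280x720 image.
rgb_K = np.array([
    [1000.0, 0.0, 640.0],
    [0.0, 1000.0, 360.0],
    [0.0, 0.0, 1.0],
])
depth_K = scale_intrinsics(rgb_K, src_size=(1280, 720), dst_size=(640, 360))
```

Scaling is only a good approximation when the depth image shares the RGB camera's field of view and is registered to it. A separate depth sensor needs its own calibration.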
Coordinate conventions
- frame.rgb and frame.depth are in the camera optical frame (X right, Y down, Z forward).
- frame.camera_pose is in the world frame, with translation in metres.
- frame.depth is uint16 millimetres (a value of 1230 means 1.23 m). Convert to metres with depth.astype(np.float32) / 1000.0.
See Coordinate frames for the full optical → link → world chain and how R_optical_to_link is derived from your TF tree.
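Putting the conventions above together, a whole depth image can be back-projected into optical-frame points in one vectorised step. This is a minimal sketch under stated assumptions: depth_to_points is a hypothetical helper, the intrinsics are made up, and treating a depth of 0 as "no reading" is an assumption about the sensor rather than something the SDK documents here.

```python
import numpy as np

def depth_to_points(depth_mm, K):
    """Back-project a uint16 millimetre depth image to (N, 3) points in metres.

    Output is in the camera optical frame (X right, Y down, Z forward).
    Pixels with depth 0 are skipped (assumed to mean "no reading").
    """
    z = depth_mm.astype(np.float32) / 1000.0  # millimetres -> metres
    v, u = np.nonzero(z > 0)                  # row (v) and column (u) of valid pixels
    z = z[v, u]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Tiny synthetic image with a single valid pixel at column 3, row 2.
depth = np.zeros((4, 4), dtype=np.uint16)
depth[2, 3] = 1230  # 1.23 m
K = np.array([[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]])
pts = depth_to_points(depth, K)
```

Note the index order: the image is addressed as depth[v, u] (row, then column), while the intrinsics pair u with cx and v with cy.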
Common patterns
Project a depth pixel into world space
import numpy as np
from stera.core.transforms import optical_to_world
K = frame.depth_K
fx, fy = K[0, 0], K[1, 1]
cx, cy = K[0, 2], K[1, 2]
u, v = 320, 240  # pixel of interest
z = float(frame.depth[v, u]) / 1000.0  # metres
pt_optical = np.array([
    (u - cx) * z / fx,
    (v - cy) * z / fy,
    z,
])
# Lift to world if a pose is present
if frame.camera_pose is not None:
    pt_world = optical_to_world(
        pt_optical[None, :],
        frame.camera_pose.rotation,
        frame.camera_pose.translation,
        session.R_optical_to_link,
    )[0]
Skip frames without depth
for frame in session.frames():
    if frame.depth is None:
        continue
    ...
Limit a run to N frames
The SDK doesn't have a max_frames knob, so break out of the loop yourself:
for i, frame in enumerate(session.frames()):
    if i >= 500:
        break
    ...
See also
- MCAPReader.frames: full signature.
- SyncedFrame: field-by-field reference.
- Coordinate frames: optical / link / world.