Concepts

Coordinate frames

The SDK works with three coordinate frames (optical, link, world). Understanding the chain between them is the difference between hand keypoints landing on the right pixel vs. drifting half a metre away in the 3D scene.

The three frames

FrameOriginAxesUsed by
OpticalPinhole centre of the RGB-D cameraX right, Y down, Z forward (looking out of the lens)Depth back-projection, raw model outputs (WiLoR/HaMeR/MediaPipe), frame.rgb/frame.depth
LinkSame origin, robotic-arm conventionX forward, Y left, Z upTF graph from the rig, R_optical_to_link
WorldWherever the SLAM system fixes the origin (typically first-frame pose)Y-up (visual-inertial SLAM convention)frame.camera_pose, session.all_camera_poses(), session.mesh(), session.point_cloud()

Translations are always metres.

The transform chain

opticalR_o2llinkcamera_poseworld

  • R_o2l (a fixed 3×3 rotation) is read from /tf at MCAP open time and exposed as session.R_optical_to_link. It's the only piece of static rig calibration the SDK needs.
  • camera_pose is per-frame and comes from your SLAM system over /camera/pose. It has rotation R_world and translation t_world.

To go from a point in the optical frame to world:

from stera.core.transforms import optical_to_world

pt_world = optical_to_world(
    pts_optical,                  # (N, 3) in metres, optical frame
    frame.camera_pose.rotation,   # (3, 3) link → world
    frame.camera_pose.translation,# (3,)   link origin in world
    session.R_optical_to_link,    # (3, 3) optical → link
)

The function applies R_o2l first, then the camera_pose. You don't normally call this directly, Visualizer and UpperBodyEstimator do it for you, but it's the canonical recipe when you want to lift your own depth-derived points into the SLAM map.

When a SLAM topic is missing or the pose for a particular RGB frame failed to match, frame.camera_pose is None and any world-frame computation is skipped. The SDK never throws a "missing pose" error mid-loop.

Why hand poses live in the optical frame by default

HandTracker.detect_hands(frame) returns HandPose objects whose joints are in the camera optical frame (metres, depth-anchored). This makes them useful even when no SLAM pose is present, depth alone is enough to pin the hand.

To put them into world coordinates for, say, a 3D viewer:

joints_optical = np.array([[k.x, k.y, k.z] for k in hand.all_keypoints])
joints_world   = optical_to_world(
    joints_optical,
    frame.camera_pose.rotation,
    frame.camera_pose.translation,
    session.R_optical_to_link,
)

The Visualizer.log_frame path does exactly this when both hands and a camera pose are present.

Quaternion / matrix conversions

Camera pose comes in as a quaternion on the wire. The SDK normalises everything to 3×3 rotation matrices. If you need to interop with a quaternion-based library:

from stera.core.transforms import rot_to_quat, quat_to_rot

qx, qy, qz, qw = rot_to_quat(R)
R = quat_to_rot(qx, qy, qz, qw)

Convention: (x, y, z, w) ordering, matching ROS / Rerun.

Constants

  • R_OPTICAL_TO_LINK, the rotation matrix corresponding to (qx=-0.7071, qy=0, qz=-0.7071, qw=0). The default fallback when a recording has no TF tree.
  • OPTICAL_QUAT_XYZW, [-0.7071, 0, -0.7071, 0]. The quaternion form of the same rotation, useful for Rerun Transform3D logging.

The SDK reads the actual TF in MCAPReader.R_optical_to_link lazily, falls back to these constants only when /tf is empty.

See also