Coordinate frames
The SDK works with three coordinate frames (optical, link, world). Understanding the chain between them is the difference between hand keypoints landing on the right pixel vs. drifting half a metre away in the 3D scene.
The three frames
| Frame | Origin | Axes | Used by |
|---|---|---|---|
| Optical | Pinhole centre of the RGB-D camera | X right, Y down, Z forward (looking out of the lens) | Depth back-projection, raw model outputs (WiLoR/HaMeR/MediaPipe), frame.rgb/frame.depth |
| Link | Same origin, robotic-arm convention | X forward, Y left, Z up | TF graph from the rig, R_optical_to_link |
| World | Wherever the SLAM system fixes the origin (typically first-frame pose) | Y-up (visual-inertial SLAM convention) | frame.camera_pose, session.all_camera_poses(), session.mesh(), session.point_cloud() |
Translations are always metres.
The transform chain
optical → R_o2l → link → camera_pose → world
R_o2l(a fixed 3×3 rotation) is read from/tfat MCAP open time and exposed assession.R_optical_to_link. It's the only piece of static rig calibration the SDK needs.camera_poseis per-frame and comes from your SLAM system over/camera/pose. It has rotationR_worldand translationt_world.
To go from a point in the optical frame to world:
from stera.core.transforms import optical_to_world
pt_world = optical_to_world(
pts_optical, # (N, 3) in metres, optical frame
frame.camera_pose.rotation, # (3, 3) link → world
frame.camera_pose.translation,# (3,) link origin in world
session.R_optical_to_link, # (3, 3) optical → link
)The function applies R_o2l first, then the camera_pose. You don't normally call this directly, Visualizer and UpperBodyEstimator do it for you, but it's the canonical recipe when you want to lift your own depth-derived points into the SLAM map.
When a SLAM topic is missing or the pose for a particular RGB frame failed
to match, frame.camera_pose is None and any world-frame computation is
skipped. The SDK never throws a "missing pose" error mid-loop.
Why hand poses live in the optical frame by default
HandTracker.detect_hands(frame) returns HandPose objects whose joints are in the camera optical frame (metres, depth-anchored). This makes them useful even when no SLAM pose is present, depth alone is enough to pin the hand.
To put them into world coordinates for, say, a 3D viewer:
joints_optical = np.array([[k.x, k.y, k.z] for k in hand.all_keypoints])
joints_world = optical_to_world(
joints_optical,
frame.camera_pose.rotation,
frame.camera_pose.translation,
session.R_optical_to_link,
)The Visualizer.log_frame path does exactly this when both hands and a camera pose are present.
Quaternion / matrix conversions
Camera pose comes in as a quaternion on the wire. The SDK normalises everything to 3×3 rotation matrices. If you need to interop with a quaternion-based library:
from stera.core.transforms import rot_to_quat, quat_to_rot
qx, qy, qz, qw = rot_to_quat(R)
R = quat_to_rot(qx, qy, qz, qw)Convention: (x, y, z, w) ordering, matching ROS / Rerun.
Constants
R_OPTICAL_TO_LINK, the rotation matrix corresponding to(qx=-0.7071, qy=0, qz=-0.7071, qw=0). The default fallback when a recording has no TF tree.OPTICAL_QUAT_XYZW,[-0.7071, 0, -0.7071, 0]. The quaternion form of the same rotation, useful for RerunTransform3Dlogging.
The SDK reads the actual TF in MCAPReader.R_optical_to_link lazily, falls back to these constants only when /tf is empty.
See also
- Synced frames, what a
SyncedFrameholds. Pose6D, rotation + translation container.stera.core.transforms,optical_to_world,quat_to_rot,depth_to_pointcloud.