Skeleton estimation
Lift an upper-body skeleton (head, shoulders, elbows, wrists) from the camera pose plus tracked wrist 3D positions, with tunable limb lengths.
What it does
UpperBodyEstimator derives a plausible upper-body skeleton from two inputs:
- The camera pose in world frame (head/eye is approximately co-located with the camera mount).
- The 3D wrist positions from HandTracker.detect_hands(...).
It uses inverse kinematics to solve elbow placement given fixed limb lengths and the wrist target. The result is a 10-joint skeleton (head, neck, spine, left/right shoulder/elbow/wrist, mount-cam) that you can log to Rerun or stash for downstream training.
This isn't a learned model; it's deterministic geometry. It's "free" once you already have hand tracking and camera poses, and it's usually plausible enough for visualization and as a soft prior in downstream RL/IL pipelines.
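To make the elbow step concrete, here is a minimal two-bone IK sketch based on the law of cosines. It is an illustration of the general technique, not the library's actual solver: the function name solve_elbow and the hint bend direction (which side the elbow swings to) are assumptions made for this example.

```python
import numpy as np

def solve_elbow(shoulder, wrist, upper_len, fore_len, hint=(0.0, -1.0, 0.0)):
    """Place an elbow between shoulder and wrist using fixed limb lengths.

    `hint` is a hypothetical bend direction (default: -Y, i.e. "down" in a Y-up world).
    """
    shoulder = np.asarray(shoulder, dtype=float)
    wrist = np.asarray(wrist, dtype=float)
    hint = np.asarray(hint, dtype=float)

    to_wrist = wrist - shoulder
    dist = np.linalg.norm(to_wrist)
    if dist < 1e-9:
        # Degenerate case: wrist on top of the shoulder, hang the upper arm along the hint.
        return shoulder + upper_len * hint / np.linalg.norm(hint)
    axis = to_wrist / dist

    # Clamp the reach so the shoulder-elbow-wrist triangle is always valid.
    d = np.clip(dist, abs(upper_len - fore_len), upper_len + fore_len)

    # Law of cosines: `a` is the distance along the shoulder->wrist axis to the
    # elbow's projection, `h` is the elbow's offset from that axis.
    a = (upper_len**2 - fore_len**2 + d**2) / (2.0 * d)
    h = np.sqrt(max(upper_len**2 - a**2, 0.0))

    # Bend direction: the part of `hint` perpendicular to the arm axis.
    perp = hint - np.dot(hint, axis) * axis
    norm = np.linalg.norm(perp)
    if norm < 1e-9:
        # Hint is parallel to the arm, so fall back to any perpendicular direction.
        perp = np.cross(axis, [1.0, 0.0, 0.0])
        if np.linalg.norm(perp) < 1e-9:
            perp = np.cross(axis, [0.0, 0.0, 1.0])
        norm = np.linalg.norm(perp)
    return shoulder + a * axis + h * (perp / norm)

# Limb lengths follow from the documented defaults: 0.60 * 0.55 and 0.60 * 0.45.
elbow = solve_elbow([0.0, 0.0, 0.0], [0.4, 0.0, 0.0], upper_len=0.33, fore_len=0.27)
```

When the wrist target is farther away than the full arm length, this sketch straightens the arm toward the target rather than failing, so the forearm simply does not reach the wrist.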
Basic usage
from stera.models import HandTracker, UpperBodyEstimator
tracker = HandTracker(model="mediapipe")
estimator = UpperBodyEstimator(session=session)
for frame in session.frames():
    hands = tracker.detect_hands(frame)
    skeleton = estimator.estimate(frame, hands=hands)
    if skeleton is not None:
        skeleton.joints            # (10, 3) world-frame metres, NaN where missing
        skeleton.visible           # (10,) bool
        skeleton.bone_lines()      # list of [[p1, p2], ...] for visible bones
        skeleton.visible_joints()  # (M, 3) only the visible rows
estimate returns None when the frame has no camera_pose (you can't anchor the skeleton without it).
Pass the result straight into the visualizer:
viz.log_frame(frame, hands=hands, skeleton=skeleton)
Joint layout
| Index | Name | Source |
|---|---|---|
| 0 | head | derived from mount_cam + neck_to_head_up |
| 1 | neck | derived from mount_cam + neck_back / neck_drop |
| 2 | spine | derived from neck + torso_drop |
| 3 | l_shoulder | derived from neck − neck_to_shoulder |
| 4 | l_elbow | IK-solved from shoulder + l_wrist target |
| 5 | l_wrist | left-hand wrist (or NaN if not detected) |
| 6 | r_shoulder | derived from neck + neck_to_shoulder |
| 7 | r_elbow | IK-solved from shoulder + r_wrist target |
| 8 | r_wrist | right-hand wrist (or NaN if not detected) |
| 9 | mount_cam | the rig's camera_pose.translation |
Default edges: shoulders → elbows → wrists, neck → both shoulders, spine connects shoulders, mount-cam connects to neck.
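If you index into the joints array by hand, a small name-to-index lookup avoids magic numbers. The sketch below copies the joint order from the table; JOINT_NAMES, JOINT_INDEX and named_joints are hypothetical helpers written for this example, not part of the library, and the arrays are stand-in data with the documented shapes.

```python
import numpy as np

# Joint order copied from the layout table (index 0..9).
JOINT_NAMES = [
    "head", "neck", "spine",
    "l_shoulder", "l_elbow", "l_wrist",
    "r_shoulder", "r_elbow", "r_wrist",
    "mount_cam",
]
JOINT_INDEX = {name: i for i, name in enumerate(JOINT_NAMES)}

def named_joints(joints, visible):
    """Return {name: xyz} for the visible joints only."""
    joints = np.asarray(joints, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    return {name: joints[i] for name, i in JOINT_INDEX.items() if visible[i]}

# Stand-in for skeleton.joints / skeleton.visible with the left wrist undetected.
joints = np.zeros((10, 3))
visible = np.ones(10, dtype=bool)
joints[JOINT_INDEX["l_wrist"]] = np.nan
visible[JOINT_INDEX["l_wrist"]] = False

by_name = named_joints(joints, visible)
```

With a real skeleton you would pass skeleton.joints and skeleton.visible instead of the stand-in arrays.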
Tuning the body proportions
SkeletonConfig exposes the knobs as metric distances. All lengths are in metres.
from stera.models.skeleton import SkeletonConfig
from stera.models import UpperBodyEstimator
config = SkeletonConfig(
    neck_back=0.10,         # how far behind camera the neck sits (m)
    neck_drop=0.20,         # how far below camera (m)
    shoulder_drop=0.12,
    neck_to_shoulder=0.18,
    torso_drop=0.45,
    arm_length=0.60,        # total shoulder→wrist
    upper_arm_ratio=0.55,   # upper-arm fraction; forearm = 1 - this
)
estimator = UpperBodyEstimator(session=session, config=config)
| Field | Default | What it controls |
|---|---|---|
| neck_back | 0.10 | Horizontal offset of neck behind the camera. |
| neck_drop | 0.20 | Vertical drop of neck below the camera. |
| shoulder_drop | 0.12 | Vertical drop of shoulders below neck. |
| neck_to_head_up | 0.10 | Distance from neck up to head. |
| neck_to_shoulder | 0.18 | Lateral offset of each shoulder from neck. |
| torso_drop | 0.45 | Drop from neck to spine. |
| arm_length | 0.60 | Total arm length (shoulder to wrist). |
| upper_arm_ratio | 0.55 | Upper-arm fraction of arm_length. Forearm = 1 - ratio. |
| up_axis | -1 | World up axis: -1 = auto-detect, 0/1/2 = X/Y/Z. |
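The defaults were tuned on adult-height egocentric recordings; scale the length fields down by roughly 15% for a smaller user. upper_arm_ratio is a fraction and up_axis is an axis selector, so leave those two as they are.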
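As a rough sketch of that adjustment, the snippet below scales the documented defaults by a factor and shows how arm_length and upper_arm_ratio split into upper-arm and forearm lengths. The DEFAULTS dict is copied from the table; scale_lengths and segment_lengths are hypothetical helpers for this example, and treating upper_arm_ratio and up_axis as non-lengths is an assumption drawn from their descriptions.

```python
# Defaults copied from the field table above.
DEFAULTS = {
    "neck_back": 0.10,
    "neck_drop": 0.20,
    "shoulder_drop": 0.12,
    "neck_to_head_up": 0.10,
    "neck_to_shoulder": 0.18,
    "torso_drop": 0.45,
    "arm_length": 0.60,
    "upper_arm_ratio": 0.55,
    "up_axis": -1,
}

# Assumption: these two fields are not distances, so they are never scaled.
NOT_LENGTHS = {"upper_arm_ratio", "up_axis"}

def scale_lengths(values, factor):
    """Scale every length field by `factor`, leaving non-length fields untouched."""
    return {
        key: value if key in NOT_LENGTHS else round(value * factor, 4)
        for key, value in values.items()
    }

def segment_lengths(values):
    """Split arm_length into (upper_arm, forearm) using upper_arm_ratio."""
    upper = values["arm_length"] * values["upper_arm_ratio"]
    fore = values["arm_length"] * (1.0 - values["upper_arm_ratio"])
    return upper, fore

smaller = scale_lengths(DEFAULTS, 0.85)   # roughly 15% smaller user
upper, fore = segment_lengths(DEFAULTS)   # about 0.33 m and 0.27 m with the defaults
```

The resulting dict could then be passed as keyword arguments, for example SkeletonConfig(**smaller), assuming the constructor accepts every field listed in the table.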
Auto-detected up axis
By default the estimator auto-detects the world up axis by averaging the camera's local "up" vector across the first 10 frames and choosing the axis with the largest mean component. Force a specific axis if you know your SLAM convention:
SkeletonConfig(up_axis=1) # Y-up (visual-inertial SLAM convention)
SkeletonConfig(up_axis=2) # Z-up
Patterns
Visualise alongside hands
for frame in session.frames():
    hands = tracker.detect_hands(frame)
    skeleton = estimator.estimate(frame, hands=hands)
    viz.log_frame(frame, hands=hands, skeleton=skeleton)
The visualizer renders the skeleton as LineStrips3D in the world scene with bone connectivity from skeleton.edges.
Get just the bones for custom rendering
if skeleton is not None:
    for p1, p2 in skeleton.bone_lines():
        # p1, p2 are [x, y, z] world-frame metres
        draw_line(p1, p2)
Reset between sequences
estimator.reset()
Currently a no-op (the estimator is mostly stateless), but reserved so future temporal smoothing changes won't break callers.
The skeleton is not written to annotation.hdf5 automatically. Stash skeleton.joints per-frame yourself if you want to persist it; it's cheap (10×3 floats).
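One way to stash the joints is to collect one (10, 3) array per frame and stack them into a single (T, 10, 3) array. The sketch below uses NumPy's .npz format rather than HDF5 to stay dependency-free; stack_joints is a hypothetical helper, the collected list is stand-in data, and filling frames where estimate returned None with NaN is an assumption chosen to keep frame indices aligned.

```python
import io
import numpy as np

def stack_joints(per_frame_joints, n_joints=10):
    """Stack per-frame (n_joints, 3) arrays into (T, n_joints, 3).

    Entries that are None (no skeleton for that frame) become all-NaN rows,
    so the first axis still lines up with the frame index.
    """
    missing = np.full((n_joints, 3), np.nan, dtype=np.float32)
    rows = [
        missing if joints is None else np.asarray(joints, dtype=np.float32)
        for joints in per_frame_joints
    ]
    if not rows:
        return np.empty((0, n_joints, 3), dtype=np.float32)
    return np.stack(rows)

# Stand-in for skeleton.joints collected over three frames (middle frame had no skeleton).
collected = [np.zeros((10, 3)), None, np.ones((10, 3))]
stacked = stack_joints(collected)

# An in-memory buffer keeps the example self-contained; pass a file path to write to disk.
buffer = io.BytesIO()
np.savez_compressed(buffer, joints=stacked)
buffer.seek(0)
restored = np.load(buffer)["joints"]
```

In a real loop you would append skeleton.joints (or None) once per frame and save after the sequence ends.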
See also
- UpperBodyEstimator API, full reference.
- Hand tracking, feeds the wrist inputs.
- Visualization, viz.log_frame(..., skeleton=...).