
Skeleton estimation

Lift an upper-body skeleton (head, shoulders, elbows, wrists) from the camera pose plus tracked wrist 3D positions, with tunable limb lengths.

What it does

UpperBodyEstimator derives a plausible upper-body skeleton from two inputs:

  1. The camera pose in world frame (head/eye is approximately co-located with the camera mount).
  2. The 3D wrist positions from HandTracker.detect_hands(...).

It uses inverse kinematics to solve elbow placement given fixed limb lengths and the wrist target. The result is a 10-joint skeleton (head, neck, spine, left/right shoulder/elbow/wrist, mount-cam) that you can log to Rerun or stash for downstream training.
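The elbow solve can be pictured as a standard two-bone IK problem: the shoulder, elbow, and wrist form a triangle whose side lengths are known, so the elbow lies on a circle around the shoulder-to-wrist axis. The sketch below is an illustration under stated assumptions, not necessarily how UpperBodyEstimator implements it. The function name solve_elbow and the hint bend direction are hypothetical.

```python
import numpy as np

def solve_elbow(shoulder, wrist, upper_len, fore_len, hint=(0.0, -1.0, 0.0)):
    """Place an elbow between shoulder and wrist for fixed bone lengths.

    `hint` is an assumed bend direction (world "down" in a Y-up scene);
    it picks one point on the circle of valid elbow positions.
    """
    shoulder = np.asarray(shoulder, dtype=float)
    wrist = np.asarray(wrist, dtype=float)
    hint = np.asarray(hint, dtype=float)

    to_wrist = wrist - shoulder
    dist = np.linalg.norm(to_wrist)
    if dist < 1e-9:
        # Degenerate case: wrist on top of the shoulder, hang the arm along the hint.
        return shoulder + upper_len * hint / np.linalg.norm(hint)
    axis = to_wrist / dist

    # Clamp the reach so the triangle (upper_len, fore_len, d) always closes.
    d = np.clip(dist, abs(upper_len - fore_len) + 1e-9, upper_len + fore_len - 1e-9)

    # Law of cosines: `a` is the elbow's offset along the axis, `h` its offset off the axis.
    a = (upper_len**2 - fore_len**2 + d**2) / (2.0 * d)
    h = np.sqrt(max(upper_len**2 - a**2, 0.0))

    # Bend direction: the part of the hint that is perpendicular to the arm axis.
    perp = hint - np.dot(hint, axis) * axis
    norm = np.linalg.norm(perp)
    if norm < 1e-9:
        # Hint is parallel to the arm; fall back to any perpendicular direction.
        other = np.array([1.0, 0.0, 0.0]) if abs(axis[0]) < 0.9 else np.array([0.0, 0.0, 1.0])
        perp = other - np.dot(other, axis) * axis
        norm = np.linalg.norm(perp)
    perp = perp / norm

    return shoulder + a * axis + h * perp

# Example with arm_length=0.60 and upper_arm_ratio=0.55 (upper arm 0.33 m, forearm 0.27 m).
elbow = solve_elbow([0.0, 0.0, 0.0], [0.4, 0.0, 0.0], 0.33, 0.27)
```

When the wrist target is farther away than the arm can reach, the clamp leaves the arm almost straight and the elbow no longer lands exactly one forearm length from the wrist.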

This is deterministic geometry, not a learned model. It costs almost nothing extra once you already have hand tracking and camera poses, and it is usually plausible enough for visualization and as a soft prior in downstream RL/IL pipelines.

Basic usage

from stera.models import HandTracker, UpperBodyEstimator

tracker   = HandTracker(model="mediapipe")
estimator = UpperBodyEstimator(session=session)

for frame in session.frames():
    hands    = tracker.detect_hands(frame)
    skeleton = estimator.estimate(frame, hands=hands)

    if skeleton is not None:
        skeleton.joints      # (10, 3) world-frame metres, NaN where missing
        skeleton.visible     # (10,) bool
        skeleton.bone_lines()       # list of [[p1, p2], ...] for visible bones
        skeleton.visible_joints()   # (M, 3) only the visible rows

estimate returns None when the frame has no camera_pose (you can't anchor the skeleton without it).

Pass the result straight into the visualizer:

viz.log_frame(frame, hands=hands, skeleton=skeleton)

Joint layout

| Index | Name | Source |
| --- | --- | --- |
| 0 | head | derived from mount_cam + neck_to_head_up |
| 1 | neck | derived from mount_cam + neck_back / neck_drop |
| 2 | spine | derived from neck + torso_drop |
| 3 | l_shoulder | derived from neck − neck_to_shoulder |
| 4 | l_elbow | IK-solved from shoulder + l_wrist target |
| 5 | l_wrist | left-hand wrist (or NaN if not detected) |
| 6 | r_shoulder | derived from neck + neck_to_shoulder |
| 7 | r_elbow | IK-solved from shoulder + r_wrist target |
| 8 | r_wrist | right-hand wrist (or NaN if not detected) |
| 9 | mount_cam | the rig's camera_pose.translation |

Default edges: shoulders → elbows → wrists, neck → both shoulders, spine connects shoulders, mount-cam connects to neck.
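If you index into skeleton.joints by hand, a small name lookup keeps the row numbers out of your code. This is a sketch built from the joint layout above; JOINT_NAMES, JOINT_INDEX, and named_joints are hypothetical helpers, not part of the library.

```python
import numpy as np

# Row order of the (10, 3) joints array, as listed in the joint layout.
JOINT_NAMES = [
    "head", "neck", "spine",
    "l_shoulder", "l_elbow", "l_wrist",
    "r_shoulder", "r_elbow", "r_wrist",
    "mount_cam",
]
JOINT_INDEX = {name: i for i, name in enumerate(JOINT_NAMES)}

def named_joints(joints, visible):
    """Return {name: (x, y, z)} for the visible joints only."""
    joints = np.asarray(joints, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    return {
        name: tuple(float(v) for v in joints[i])
        for i, name in enumerate(JOINT_NAMES)
        if visible[i]
    }
```

With a real result you would call named_joints(skeleton.joints, skeleton.visible) and read, for example, the "r_elbow" entry, after checking that the key is present.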

Tuning the body proportions

SkeletonConfig exposes the knobs as metric distances. All lengths are in metres.

from stera.models.skeleton import SkeletonConfig
from stera.models import UpperBodyEstimator

config = SkeletonConfig(
    neck_back=0.10,       # how far behind camera the neck sits (m)
    neck_drop=0.20,       # how far below camera (m)
    shoulder_drop=0.12,
    neck_to_shoulder=0.18,
    torso_drop=0.45,
    arm_length=0.60,      # total shoulder→wrist
    upper_arm_ratio=0.55, # upper-arm fraction; forearm = 1 - this
)

estimator = UpperBodyEstimator(session=session, config=config)

| Field | Default | What it controls |
| --- | --- | --- |
| neck_back | 0.10 | Horizontal offset of neck behind the camera. |
| neck_drop | 0.20 | Vertical drop of neck below the camera. |
| shoulder_drop | 0.12 | Vertical drop of shoulders below neck. |
| neck_to_head_up | 0.10 | Distance from neck up to head. |
| neck_to_shoulder | 0.18 | Lateral offset of each shoulder from neck. |
| torso_drop | 0.45 | Drop from neck to spine. |
| arm_length | 0.60 | Total arm length (shoulder to wrist). |
| upper_arm_ratio | 0.55 | Upper-arm fraction of arm_length. Forearm = 1 - ratio. |
| up_axis | -1 | World up axis: -1 = auto-detect, 0/1/2 = X/Y/Z. |

The defaults were tuned on adult-height egocentric recordings. For a smaller user, scale the length fields down by about 15%; upper_arm_ratio and up_axis are not lengths and should stay as they are.
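One way to apply a uniform scale is to start from the documented defaults and multiply only the length fields. The sketch below works on a plain dict so it runs without the library; DEFAULTS mirrors the table above, and scale_lengths and NON_LENGTH_FIELDS are hypothetical names.

```python
# Documented defaults, copied from the field table (lengths in metres).
DEFAULTS = {
    "neck_back": 0.10,
    "neck_drop": 0.20,
    "shoulder_drop": 0.12,
    "neck_to_head_up": 0.10,
    "neck_to_shoulder": 0.18,
    "torso_drop": 0.45,
    "arm_length": 0.60,
    "upper_arm_ratio": 0.55,
    "up_axis": -1,
}

# A ratio and an axis selector are not distances, so they are left untouched.
NON_LENGTH_FIELDS = {"upper_arm_ratio", "up_axis"}

def scale_lengths(params, factor):
    """Return a copy of params with every length field multiplied by factor."""
    return {
        key: value if key in NON_LENGTH_FIELDS else round(value * factor, 4)
        for key, value in params.items()
    }

smaller = scale_lengths(DEFAULTS, 0.85)  # roughly 15% smaller user
```

Assuming SkeletonConfig accepts every field in the table as a keyword argument, the result could then be passed as SkeletonConfig(**smaller).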

Auto-detected up axis

By default the estimator auto-detects the world up axis by averaging the camera's local "up" vector across the first 10 frames and choosing the axis with the largest mean component. Force a specific axis if you know your SLAM convention:

SkeletonConfig(up_axis=1)   # Y-up (visual-inertial SLAM convention)
SkeletonConfig(up_axis=2)   # Z-up
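
The auto-detection described above can be sketched as follows. This is an illustration, not the library's code: it assumes each pose is available as a 3x3 camera-to-world rotation matrix, that the camera's local "up" is -Y (a common image-style convention), and that "largest mean component" means largest in absolute value. The name detect_up_axis is hypothetical.

```python
import numpy as np

def detect_up_axis(rotations, local_up=(0.0, -1.0, 0.0), max_frames=10):
    """Guess the world up axis (0/1/2 = X/Y/Z) from early camera orientations.

    rotations: iterable of 3x3 camera-to-world rotation matrices (assumed layout).
    local_up:  the camera's own up direction; -Y is an assumption.
    """
    local_up = np.asarray(local_up, dtype=float)
    first = list(rotations)[:max_frames]
    if not first:
        raise ValueError("need at least one camera pose")

    # Express the camera's up vector in world coordinates for each frame.
    world_ups = [np.asarray(R, dtype=float) @ local_up for R in first]

    # Average over frames, then pick the dominant axis.
    mean_up = np.mean(world_ups, axis=0)
    return int(np.argmax(np.abs(mean_up)))
```

A recording where the user looks straight down or up for the first few frames would defeat this kind of heuristic, which is a good reason to set up_axis explicitly when the SLAM convention is known.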

Patterns

Visualise alongside hands

for frame in session.frames():
    hands    = tracker.detect_hands(frame)
    skeleton = estimator.estimate(frame, hands=hands)
    viz.log_frame(frame, hands=hands, skeleton=skeleton)

The visualizer renders the skeleton as LineStrips3D in the world scene with bone connectivity from skeleton.edges.

Get just the bones for custom rendering

if skeleton is not None:
    for p1, p2 in skeleton.bone_lines():
        # p1, p2 are [x, y, z] world-frame metres
        draw_line(p1, p2)
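
If you want different connectivity than the default edges, you can build the segments yourself from the joints and visibility arrays. The sketch below is an assumed equivalent of what bone_lines() returns, not its actual implementation. The helper name custom_bone_lines and the ARM_EDGES list (arm bones only, using the indices from the joint layout) are hypothetical.

```python
import numpy as np

# Arm bones only: shoulder -> elbow -> wrist on each side (indices from the joint layout).
ARM_EDGES = [(3, 4), (4, 5), (6, 7), (7, 8)]

def custom_bone_lines(joints, visible, edges):
    """Return [[p1, p2], ...] for every edge whose two endpoints are visible."""
    joints = np.asarray(joints, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    return [
        [joints[a].tolist(), joints[b].tolist()]
        for a, b in edges
        if visible[a] and visible[b]
    ]
```

With a real result this would be called as custom_bone_lines(skeleton.joints, skeleton.visible, ARM_EDGES).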

Reset between sequences

estimator.reset()

Currently a no-op (the estimator is mostly stateless), but reserved so future temporal smoothing changes won't break callers.

The skeleton is not written to annotation.hdf5 automatically. Stash skeleton.joints per-frame yourself if you want to persist it; it's cheap (10×3 floats).
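
A minimal way to persist the joints is to collect one (10, 3) array per frame and stack them, filling frames without a skeleton with NaN so the frame index stays aligned. The sketch below assumes NumPy and a hypothetical helper named stack_joints. The output filename is a placeholder.

```python
import numpy as np

NUM_JOINTS = 10

def stack_joints(per_frame):
    """Stack per-frame joints into a (T, 10, 3) array.

    per_frame: list of (10, 3) arrays, or None where estimate() returned None.
    """
    missing = np.full((NUM_JOINTS, 3), np.nan)
    rows = [missing if j is None else np.asarray(j, dtype=float) for j in per_frame]
    if not rows:
        return np.empty((0, NUM_JOINTS, 3))
    return np.stack(rows)

# Usage inside the frame loop from this guide:
#   per_frame.append(None if skeleton is None else skeleton.joints.copy())
# and after the loop:
#   np.save("skeleton_joints.npy", stack_joints(per_frame))
```

Keeping a NaN-filled row for skipped frames means row t of the saved array always corresponds to frame t of the session.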

See also