Skeleton estimation

Lift an upper-body skeleton (head, shoulders, elbows, wrists) from the camera pose plus tracked wrist 3D positions, with tunable limb lengths.

What it does

UpperBodyEstimator derives a plausible upper-body skeleton from two inputs:

The camera pose in world frame (head/eye is approximately co-located with the camera mount).
The 3D wrist positions from HandTracker.detect_hands(...).

It uses inverse kinematics to solve elbow placement given fixed limb lengths and the wrist target. The result is a 10-joint skeleton (head, neck, spine, left/right shoulder/elbow/wrist, mount-cam) that you can log to Rerun or stash for downstream training.

This is deterministic geometry, not a learned model. It is effectively free once you already have hand tracking and camera poses, and it is usually plausible enough for visualization and as a soft prior in downstream RL/IL pipelines.
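To make the elbow solve concrete, here is a minimal sketch of a standard analytic two-bone IK step using the law of cosines. It is not taken from the library: `solve_elbow` and its `bend_hint` parameter are hypothetical names, and clamping an out-of-reach wrist onto the reachable range is an assumption about how such a case might be handled.

```python
import numpy as np

def solve_elbow(shoulder, wrist, upper_len, fore_len, bend_hint):
    """Place the elbow of a two-bone arm (hypothetical helper, not the stera API).

    shoulder, wrist: (3,) world-frame points in metres.
    bend_hint: (3,) direction the elbow should bend towards (assumption).
    """
    shoulder = np.asarray(shoulder, dtype=float)
    wrist = np.asarray(wrist, dtype=float)
    hint = np.asarray(bend_hint, dtype=float)
    hint = hint / np.linalg.norm(hint)

    to_wrist = wrist - shoulder
    dist = np.linalg.norm(to_wrist)
    if dist < 1e-9:
        # Wrist on top of the shoulder: no arm axis, so just follow the hint.
        return shoulder + upper_len * hint
    axis = to_wrist / dist

    # Clamp the shoulder-wrist distance so the triangle can close.
    d = np.clip(dist, abs(upper_len - fore_len), upper_len + fore_len)
    d = max(d, 1e-9)

    # Law of cosines: offset of the elbow along the arm axis, then its height off it.
    along = (upper_len**2 - fore_len**2 + d**2) / (2.0 * d)
    height = np.sqrt(max(upper_len**2 - along**2, 0.0))

    # Bend direction: the part of the hint perpendicular to the arm axis.
    perp = hint - np.dot(hint, axis) * axis
    norm = np.linalg.norm(perp)
    if norm < 1e-9:
        # Hint is parallel to the arm; fall back to any perpendicular direction.
        perp = np.cross(axis, np.eye(3)[np.argmin(np.abs(axis))])
        norm = np.linalg.norm(perp)
    return shoulder + along * axis + height * (perp / norm)

# Default proportions: arm_length=0.60, upper_arm_ratio=0.55 -> 0.33 m and 0.27 m.
elbow = solve_elbow([0, 0, 0], [0.4, 0, 0], 0.33, 0.27, bend_hint=[0, -1, 0])
```

When the wrist is within reach, the returned elbow sits exactly `upper_len` from the shoulder and `fore_len` from the wrist. When it is out of reach, the arm is drawn fully extended towards the wrist.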

Basic usage

from stera.models import HandTracker, UpperBodyEstimator

tracker   = HandTracker(model="mediapipe")
estimator = UpperBodyEstimator(session=session)

for frame in session.frames():
    hands    = tracker.detect_hands(frame)
    skeleton = estimator.estimate(frame, hands=hands)

    if skeleton is not None:
        skeleton.joints      # (10, 3) world-frame metres, NaN where missing
        skeleton.visible     # (10,) bool
        skeleton.bone_lines()       # list of [[p1, p2], ...] for visible bones
        skeleton.visible_joints()   # (M, 3) only the visible rows

estimate returns None when the frame has no camera_pose (you can't anchor the skeleton without it).

Pass the result straight into the visualizer:

viz.log_frame(frame, hands=hands, skeleton=skeleton)

Joint layout

Index	Name	Source
0	`head`	derived from `neck` + `neck_to_head_up`
1	`neck`	derived from `mount_cam` + `neck_back` / `neck_drop`
2	`spine`	derived from `neck` + `torso_drop`
3	`l_shoulder`	derived from `neck` − `neck_to_shoulder` (lateral), lowered by `shoulder_drop`
4	`l_elbow`	IK-solved from shoulder + `l_wrist` target
5	`l_wrist`	left-hand wrist (or NaN if not detected)
6	`r_shoulder`	derived from `neck` + `neck_to_shoulder` (lateral), lowered by `shoulder_drop`
7	`r_elbow`	IK-solved from shoulder + `r_wrist` target
8	`r_wrist`	right-hand wrist (or NaN if not detected)
9	`mount_cam`	the rig's `camera_pose.translation`

Default edges: shoulders → elbows → wrists, neck → both shoulders, spine connects shoulders, mount-cam connects to neck.
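If you prefer names to raw indices, a small lookup built from the table above is enough. This is an illustrative sketch: `JOINT_NAMES`, `JOINT_INDEX` and `named_joints` are hypothetical helpers, not part of the library, and the input is assumed to be a 10×3 array or nested list such as `skeleton.joints`.

```python
import math

# Joint order copied from the layout table above.
JOINT_NAMES = (
    "head", "neck", "spine",
    "l_shoulder", "l_elbow", "l_wrist",
    "r_shoulder", "r_elbow", "r_wrist",
    "mount_cam",
)
JOINT_INDEX = {name: i for i, name in enumerate(JOINT_NAMES)}

def named_joints(joints):
    """Map joint name -> (x, y, z), skipping rows that contain NaN."""
    out = {}
    for name, row in zip(JOINT_NAMES, joints):
        xyz = tuple(float(v) for v in row)
        if not any(math.isnan(v) for v in xyz):
            out[name] = xyz
    return out
```

For example, `JOINT_INDEX["r_wrist"]` gives the row to read, and `named_joints(...)` drops an undetected wrist instead of returning NaN values.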

Tuning the body proportions

SkeletonConfig exposes the body proportions as tunable fields. All lengths are in metres; `upper_arm_ratio` is a dimensionless fraction.

from stera.models.skeleton import SkeletonConfig
from stera.models import UpperBodyEstimator

config = SkeletonConfig(
    neck_back=0.10,       # how far behind camera the neck sits (m)
    neck_drop=0.20,       # how far below camera (m)
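    shoulder_drop=0.12,   # how far below the neck the shoulders sit (m)
    neck_to_shoulder=0.18,  # lateral offset of each shoulder from the neck (m)
    torso_drop=0.45,      # drop from neck to spine (m)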
    arm_length=0.60,      # total shoulder→wrist
    upper_arm_ratio=0.55, # upper-arm fraction; forearm = 1 - this
)

estimator = UpperBodyEstimator(session=session, config=config)

Field	Default	What it controls
`neck_back`	`0.10`	Horizontal offset of neck behind the camera.
`neck_drop`	`0.20`	Vertical drop of neck below the camera.
`shoulder_drop`	`0.12`	Vertical drop of shoulders below neck.
`neck_to_head_up`	`0.10`	Distance from neck up to head.
`neck_to_shoulder`	`0.18`	Lateral offset of each shoulder from neck.
`torso_drop`	`0.45`	Drop from neck to spine.
`arm_length`	`0.60`	Total arm length (shoulder to wrist).
`upper_arm_ratio`	`0.55`	Upper-arm fraction of `arm_length`. The forearm takes the remaining `1 - upper_arm_ratio`.
`up_axis`	`-1`	World up axis: `-1` = auto-detect, `0/1/2` = X/Y/Z.

The defaults were tuned on adult-height egocentric recordings. For a smaller user, scale the length fields down by roughly 15% and leave `upper_arm_ratio` and `up_axis` unchanged.
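One way to apply a uniform scale is sketched below. The dictionary copies the default lengths from the table; `DEFAULT_LENGTHS` and `scaled_lengths` are hypothetical names, and the commented-out constructor call assumes `SkeletonConfig` accepts every listed field as a keyword argument.

```python
# Default lengths copied from the field table (metres).
DEFAULT_LENGTHS = {
    "neck_back": 0.10,
    "neck_drop": 0.20,
    "shoulder_drop": 0.12,
    "neck_to_head_up": 0.10,
    "neck_to_shoulder": 0.18,
    "torso_drop": 0.45,
    "arm_length": 0.60,
}

def scaled_lengths(factor, defaults=DEFAULT_LENGTHS):
    """Return a copy with every length multiplied by `factor`, rounded to the millimetre."""
    return {name: round(value * factor, 3) for name, value in defaults.items()}

smaller = scaled_lengths(0.85)  # roughly 15% shorter everywhere

# Ratios and the up axis are not lengths, so they are passed through unscaled:
# config = SkeletonConfig(**smaller, upper_arm_ratio=0.55)
```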

Auto-detected up axis

By default the estimator auto-detects the world up axis by averaging the camera's local "up" vector across the first 10 frames and choosing the axis with the largest mean component. Force a specific axis if you know your SLAM convention:

SkeletonConfig(up_axis=1)   # Y-up (visual-inertial SLAM convention)
SkeletonConfig(up_axis=2)   # Z-up
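The auto-detection described above can be approximated in a few lines. This is a sketch under stated assumptions, not the library's code: `detect_up_axis` is a hypothetical name, rotations are assumed to be 3×3 camera-to-world matrices, the camera's local up is assumed to be −Y (OpenCV-style axes), and the absolute value is used so that a flipped sign still resolves to the right axis.

```python
import numpy as np

def detect_up_axis(rotations, local_up=(0.0, -1.0, 0.0), n_frames=10):
    """Guess the world up axis (0/1/2 = X/Y/Z) from early camera orientations.

    rotations: iterable of (3, 3) camera-to-world rotation matrices (assumption).
    local_up:  camera-frame up direction; -Y is an assumption about the camera axes.
    """
    rots = np.asarray(list(rotations)[:n_frames], dtype=float)  # (N, 3, 3)
    ups_world = rots @ np.asarray(local_up, dtype=float)         # (N, 3)
    mean_up = ups_world.mean(axis=0)
    return int(np.argmax(np.abs(mean_up)))
```

Averaging over several frames keeps a single tilted frame (for example, the wearer looking down) from deciding the axis.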

Patterns

Visualise alongside hands

for frame in session.frames():
    hands    = tracker.detect_hands(frame)
    skeleton = estimator.estimate(frame, hands=hands)
    viz.log_frame(frame, hands=hands, skeleton=skeleton)

The visualizer renders the skeleton as LineStrips3D in the world scene with bone connectivity from skeleton.edges.

Get just the bones for custom rendering

if skeleton is not None:
    for p1, p2 in skeleton.bone_lines():
        # p1, p2 are [x, y, z] world-frame metres
        draw_line(p1, p2)

Reset between sequences

estimator.reset()

Currently a no-op (the estimator is mostly stateless), but reserved so that adding temporal smoothing later won't break callers.

The skeleton is not written to annotation.hdf5 automatically. If you want to persist it, store skeleton.joints for each frame yourself; it's cheap (10×3 floats per frame).
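A minimal way to collect the joints for a whole sequence is sketched below. `stack_joints` is a hypothetical helper, and the output filename in the usage comment is only an example; frames where `estimate` returned None are stored as NaN rows so frame indices stay aligned.

```python
import numpy as np

N_JOINTS = 10

def stack_joints(per_frame):
    """Stack per-frame joints into a (T, 10, 3) float32 array.

    per_frame: list of (10, 3) arrays, with None where no skeleton was estimated.
    """
    out = np.full((len(per_frame), N_JOINTS, 3), np.nan, dtype=np.float32)
    for t, joints in enumerate(per_frame):
        if joints is not None:
            out[t] = np.asarray(joints, dtype=np.float32)
    return out

# Usage sketch, inside the loop from "Basic usage":
#     per_frame.append(None if skeleton is None else skeleton.joints)
# and after the loop:
#     np.save("skeleton_joints.npy", stack_joints(per_frame))
```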