Hand tracking

Pick a backend (WiLoR, MediaPipe, HaMeR), get 21-joint hand poses anchored to depth, and stash MANO outputs for HDF5 export.

HandTracker is a unified wrapper around three backends. They all return the same HandPose schema, so the rest of your loop is backend-agnostic:

from stera.models import HandTracker

tracker = HandTracker(model="mediapipe")  # or "wilor", or "hamer"
for frame in session.frames():
    hands = tracker.detect_hands(frame)   # list[HandPose]

Picking a backend

| Backend | Setup | Speed | Accuracy | MANO output |
| --- | --- | --- | --- | --- |
| MediaPipe | pip install "stera-sdk[mediapipe]"; nothing else | ~30 ms / frame on CPU | Good for far / small hands | No |
| WiLoR | Clone WiLoR repo + run its requirements.txt | ~80 ms / frame on RTX 5090 | Tighter finger joints | Yes (vertices + betas + global orient + hand_pose) |
| HaMeR | Clone HaMeR repo + extract _DATA/; auto-downloads ~3 GB on first call | ~200 ms / frame on RTX 5090 | Best joint accuracy via ViTPose+ wholebody pipeline | Yes |

All three anchor 3D joint positions to the depth image when frame.depth is present, so the same loop works whether you're running CPU MediaPipe or GPU WiLoR.
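To make "anchored to depth" concrete, here is a minimal sketch of standard pinhole back-projection: a 2D joint at pixel (u, v) plus a depth reading becomes a 3D point in the camera optical frame. The function name and the example intrinsics are illustrative assumptions, not the SDK's internals, which may differ in how depth is sampled and filtered.

```python
def backproject(u, v, depth_m, K):
    """Lift pixel (u, v) with depth in metres into the camera optical frame.

    K is a 3x3 pinhole intrinsics matrix given as nested lists.
    This is textbook back-projection, not the SDK's own code.
    """
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Made-up intrinsics for a 640x480 image (illustrative values only).
K = [[600.0, 0.0, 320.0],
     [0.0, 600.0, 240.0],
     [0.0, 0.0, 1.0]]

# A wrist detected 60 px right of the principal point, 0.5 m away.
wrist_xyz = backproject(380.0, 240.0, 0.5, K)
```

Because only the depth lookup and the intrinsics enter this step, it is independent of which backend produced the 2D joints.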

MediaPipe

from stera.models import HandTracker

tracker = HandTracker(
    model="mediapipe",
    max_num_hands=2,
    min_detection_confidence=0.6,   # tunable, see Configuration below
)

for frame in session.frames():
    hands = tracker.detect_hands(frame)
    for hp in hands:
        print(hp.hand_side, hp.wrist.x, hp.wrist.y, hp.wrist.z)

Each HandPose carries 21 joints in MANO order: the wrist, then [mcp, pip, dip, tip] for each of thumb, index, middle, ring, and pinky. Joint coordinates are in metres in the camera optical frame when depth is available; otherwise they are pixel coordinates with z=0.
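If you work with the flat 21-joint list, a small index-to-name helper can save some arithmetic. This sketch assumes the flat order is exactly the one described above (wrist first, then four joints per finger); the helper name is hypothetical and not part of the SDK.

```python
FINGERS = ("thumb", "index", "middle", "ring", "pinky")
JOINTS = ("mcp", "pip", "dip", "tip")


def joint_name(i):
    """Map a flat joint index (0..20) to a readable name.

    Assumes index 0 is the wrist and each finger contributes
    four consecutive joints in the order listed in JOINTS.
    """
    if not 0 <= i < 21:
        raise IndexError(f"joint index out of range: {i}")
    if i == 0:
        return "wrist"
    finger, joint = divmod(i - 1, 4)
    return f"{FINGERS[finger]}_{JOINTS[joint]}"


names = [joint_name(i) for i in range(21)]
```

Under this assumption, flat index 8 is the index fingertip and flat index 20 is the pinky tip.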

The MediaPipe asset (~6 MB) is downloaded to ~/.cache/mediapipe/hand_landmarker.task on first call.

WiLoR

from stera.models import HandTracker

tracker = HandTracker(
    model="wilor",
    model_path="/opt/WiLoR",       # local clone of rolpotamias/WiLoR
    save_mano_vertices=True,       # default; needed for /hand-pose MANO export
)

for frame in session.frames():
    hands = tracker.detect_hands(frame)

WiLoR runs YOLO + ViTDet on each frame, then a MANO regressor on each hand bbox. The full MANO outputs (778×3 vertices, 10 betas, global orient, per-finger pose) are stashed onto each HandPose and written to annotation.hdf5:/hand-pose by session.export.

See Installation → WiLoR.

HaMeR

from stera.models import HandTracker

tracker = HandTracker(
    model="hamer",
    model_path="/opt/hamer",       # local clone of geopavlakos/hamer
)

HaMeR's pipeline:

  1. detectron2 (Cascade Mask R-CNN ViTDet-H or RegNet-Y) → human bboxes.
  2. ViTPose+-Huge wholebody → 133 keypoints; the last 42 = both hands.
  3. Build hand bboxes from confident hand keypoints.
  4. HaMeR → MANO regression on each hand bbox.
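
Steps 2 and 3 can be sketched in a few lines. This is an illustration rather than the wrapper's actual code: it assumes the COCO-WholeBody layout, where the final 42 keypoints are the left hand followed by the right hand, and it reuses the hand_keypoint_score_thresh and min_hand_keypoints defaults from the Configuration section. Whether the real check is "at least" or "more than" the minimum is an assumption here.

```python
def split_hands(wholebody):
    """Split 133 wholebody keypoints into (left_hand, right_hand).

    Each keypoint is an (x, y, score) tuple. Assumes the
    COCO-WholeBody order: left hand, then right hand, at the end.
    """
    if len(wholebody) != 133:
        raise ValueError("expected 133 wholebody keypoints")
    return wholebody[-42:-21], wholebody[-21:]


def hand_bbox(keypoints, score_thresh=0.5, min_keypoints=3):
    """Return (x0, y0, x1, y1) around confident keypoints, or None."""
    good = [(x, y) for x, y, score in keypoints if score > score_thresh]
    if len(good) < min_keypoints:  # assumed: "at least min_keypoints"
        return None
    xs = [p[0] for p in good]
    ys = [p[1] for p in good]
    return (min(xs), min(ys), max(xs), max(ys))


# Synthetic person: only three left-hand keypoints are confident.
wholebody = [(0.0, 0.0, 0.0)] * 133
wholebody[91:95] = [
    (100.0, 200.0, 0.9),
    (120.0, 210.0, 0.8),
    (110.0, 250.0, 0.7),
    (90.0, 230.0, 0.2),   # below threshold, ignored
]
left, right = split_hands(wholebody)
left_box = hand_bbox(left)
right_box = hand_bbox(right)
```

Here the left hand yields a box and the right hand is rejected, so only one crop would be passed on to the MANO regressor in step 4.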

Same HandPose outputs as WiLoR, same MANO extras. First call downloads ~3 GB of detector weights from FAIR public files; everything is cached after that.

The HaMeR + detectron2 + ViTPose stack pulls in mmpose, mmcv, and chumpy. In our reference env (wilor-fresh, Python 3.10) we pinned numpy<2 because xtcocotools ships precompiled wheels built against the NumPy 1.x ABI.

What detect_hands accepts

tracker.detect_hands(frame)                      # SyncedFrame → uses frame.rgb / .depth / .depth_K
tracker.detect_hands(rgb)                        # raw (H, W, 3) uint8 array
tracker.detect_hands(rgb, depth=d, intrinsics=K) # both explicit

When you pass a SyncedFrame, the wrapper plumbs frame.rgb, frame.depth, and frame.depth_K (or frame.rgb_K as fallback) for you. Pass raw arrays only when you've decoupled the loop from MCAPReader.

What you get back

hands: list[HandPose]

for hp in hands:
    hp.hand_side             # "left" or "right"
    hp.confidence            # float
    hp.wrist                 # Keypoint
    hp.fingers["index"][3]   # tip of the index finger
    hp.all_keypoints         # flat list of 21 Keypoints (or [wrist] for wrist-only)
    hp.has_fingers           # True if full hand was returned

See HandPose for the full schema.

For backends that produce MANO (WiLoR, HaMeR), private _mano_* attributes are also stashed on the HandPose and forwarded to HDF5 export. You don't normally read them yourself.
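
As a usage illustration, here is a sketch of a thumb-to-index pinch distance built only from the fields listed above. FakeKeypoint and FakeHandPose are hypothetical stand-ins so the snippet runs without the SDK; in a real loop you would pass the HandPose objects returned by detect_hands.

```python
import math
from dataclasses import dataclass, field


@dataclass
class FakeKeypoint:
    """Hypothetical stand-in exposing only x, y, z."""
    x: float
    y: float
    z: float


@dataclass
class FakeHandPose:
    """Hypothetical stand-in exposing only the fields used below."""
    hand_side: str
    fingers: dict = field(default_factory=dict)


def pinch_distance(hp):
    """Distance between thumb tip and index tip (joint 3 of each finger)."""
    thumb_tip = hp.fingers["thumb"][3]
    index_tip = hp.fingers["index"][3]
    return math.dist(
        (thumb_tip.x, thumb_tip.y, thumb_tip.z),
        (index_tip.x, index_tip.y, index_tip.z),
    )


base = FakeKeypoint(0.0, 0.0, 0.5)
hp = FakeHandPose(
    hand_side="right",
    fingers={
        "thumb": [base, base, base, FakeKeypoint(0.0, 0.0, 0.5)],
        "index": [base, base, base, FakeKeypoint(0.03, 0.04, 0.5)],
    },
)
gap = pinch_distance(hp)  # metres when depth was available
```

The result is only a metric distance when the joints were anchored to depth; with pixel coordinates it is a distance in pixels.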

Configuration

Pass any backend config field as a kwarg to HandTracker(...). The wrapper forwards it to the backend's Config dataclass.

MediaPipe

| Field | Default | Notes |
| --- | --- | --- |
| max_num_hands | 2 | |
| min_detection_confidence | 0.3 | Higher = fewer false positives. |
| min_presence_confidence | 0.3 | Lower = stickier tracking. |
| min_tracking_confidence | 0.3 | |
| depth_sample_radius | 7 | Pixel patch around joint for depth. |
| depth_buffer_size | 15 | Rolling-median window for depth smoothing. |
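
To give a feel for what depth_sample_radius and depth_buffer_size control, here is a rough sketch of patch sampling followed by a rolling median. The rolling median matches the table's description; taking the median of valid pixels in a square patch is an assumption about how the patch is reduced, and both names below are hypothetical.

```python
from collections import deque
from statistics import median


def sample_depth(depth, u, v, radius=7):
    """Median of valid (> 0) depths in a square patch centred on (u, v).

    depth is a 2D list indexed as depth[row][col]; zeros are treated
    as missing readings. Returns None if the patch has no valid pixel.
    """
    h, w = len(depth), len(depth[0])
    vals = [
        depth[r][c]
        for r in range(max(0, v - radius), min(h, v + radius + 1))
        for c in range(max(0, u - radius), min(w, u + radius + 1))
        if depth[r][c] > 0
    ]
    return median(vals) if vals else None


class RollingMedian:
    """Median over the last `size` valid samples."""

    def __init__(self, size=15):
        self.buf = deque(maxlen=size)

    def update(self, value):
        if value is not None:
            self.buf.append(value)
        return median(self.buf) if self.buf else None


# 5x5 depth map at 0.5 m with one hole and one outlier near the centre.
depth = [[0.5] * 5 for _ in range(5)]
depth[2][2] = 0.0
depth[1][1] = 3.0
patch = sample_depth(depth, u=2, v=2, radius=1)

# A one-frame spike to 2.0 m is suppressed by the window.
smoother = RollingMedian(size=3)
smoothed = [smoother.update(d) for d in (0.5, 0.52, 2.0)]
```

A larger radius tolerates more holes at the cost of mixing in background depth; a larger buffer smooths more but lags fast hand motion.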

WiLoR

| Field | Default | Notes |
| --- | --- | --- |
| wilor_dir | required | Path to local WiLoR clone. |
| yolo_conf | 0.4 | YOLO detection threshold. |
| rescale_factor | 2.0 | bbox padding for ViTDet. |
| batch_size | 16 | |
| detect_every_n | 1 | Skip detection on N-1 of every N frames for speed. |
| save_mano_vertices | True | 778×3 vertices stashed per hand. |
| depth_buffer_size | 15 | |
| depth_sample_radius | 7 | |
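
The detect_every_n cadence can be pictured as a simple schedule. This sketch assumes detection runs on the first frame of each group of N; which frame in the cycle actually triggers detection, and what the backend reuses on skipped frames, is not specified here.

```python
def detection_schedule(num_frames, detect_every_n=1):
    """Return one bool per frame: True where the detector would run.

    Assumes detection fires on frames 0, N, 2N, ... (an illustrative
    choice, not confirmed behaviour).
    """
    if detect_every_n < 1:
        raise ValueError("detect_every_n must be >= 1")
    return [i % detect_every_n == 0 for i in range(num_frames)]


schedule = detection_schedule(6, detect_every_n=3)
runs = sum(schedule)  # detector invocations over 6 frames
```

With the default of 1 every frame runs detection, so the setting only matters when you raise it.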

HaMeR

| Field | Default | Notes |
| --- | --- | --- |
| hamer_dir | required | Path to local HaMeR clone. |
| body_detector | "vitdet" | Or "regnety" (faster, smaller). |
| body_detector_score_thresh | 0.5 | |
| min_hand_keypoints | 3 | Confident wholebody hand keypoints required to accept. |
| hand_keypoint_score_thresh | 0.5 | |
| batch_size | 8 | |
| save_mano_vertices | True | |

All three backends also expose sanity-check thresholds (max_joint_abs, min_wrist_depth, min_palm_span, max_palm_span). Detections with implausible joint configurations are dropped silently.
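
A plausibility filter along these lines might look like the sketch below. The threshold names come from the paragraph above, but the default values, the definition of palm span (index mcp to pinky mcp), and the function itself are assumptions made for illustration; check the backend Config for the real defaults.

```python
import math

INDEX_MCP = 5    # flat index, assuming wrist + 4 joints per finger
PINKY_MCP = 17


def is_plausible(joints, max_joint_abs=5.0, min_wrist_depth=0.1,
                 min_palm_span=0.04, max_palm_span=0.15):
    """Check 21 (x, y, z) joints in metres, wrist first.

    All default thresholds are placeholder values, not the SDK's.
    """
    # Reject coordinates that are absurdly far from the camera origin.
    if any(abs(c) > max_joint_abs for joint in joints for c in joint):
        return False
    # Reject hands whose wrist sits implausibly close to the sensor.
    if joints[0][2] < min_wrist_depth:
        return False
    # Reject palms that are too narrow or too wide to be a real hand.
    span = math.dist(joints[INDEX_MCP], joints[PINKY_MCP])
    return min_palm_span <= span <= max_palm_span


good = [(0.0, 0.0, 0.5)] * 21
good[INDEX_MCP] = (-0.04, 0.0, 0.5)
good[PINKY_MCP] = (0.04, 0.0, 0.5)
too_close = [(x, y, 0.02) for x, y, _ in good]

ok = is_plausible(good)
rejected = is_plausible(too_close)
```

Since rejected hands are dropped without a log line, loosening these thresholds is worth trying first when detections disappear unexpectedly.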

Patterns

Attach to the session for HDF5 export

for frame in session.frames():
    hands = tracker.detect_hands(frame)
    session.add_hand_pose(frame.index, hands)

session.export("episodes/run_01")

The accumulated detections land in annotation.hdf5:/hand-pose as (F, 21, 3) arrays per side. See HDF5 schema.
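
The (F, 21, 3) layout can be pictured with plain lists. This sketch packs per-frame detections into one array per side; FakeHand is a hypothetical stand-in for HandPose, and filling frames without a detection with NaN is a placeholder choice, since how the exporter marks missing frames is not stated here.

```python
import math
from dataclasses import dataclass


@dataclass
class FakeHand:
    """Hypothetical stand-in: a side label plus 21 [x, y, z] rows."""
    hand_side: str
    xyz: list


def rows(value):
    """Build 21 identical [x, y, z] rows for demo purposes."""
    return [[value, value, value] for _ in range(21)]


def pack_side(per_frame, side):
    """Stack one side's joints into a nested list shaped (F, 21, 3)."""
    packed = []
    for hands in per_frame:
        match = next((h for h in hands if h.hand_side == side), None)
        if match is None:
            # Assumed fill value for frames with no detection.
            packed.append([[math.nan] * 3 for _ in range(21)])
        else:
            packed.append(match.xyz)
    return packed


per_frame = [
    [FakeHand("left", rows(0.1))],
    [],                                              # nothing detected
    [FakeHand("right", rows(0.2)), FakeHand("left", rows(0.3))],
]
left = pack_side(per_frame, "left")
shape = (len(left), len(left[0]), len(left[0][0]))
```

Keeping one row per frame, even when empty, keeps the array index aligned with frame.index.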

Visualize over RGB

from stera.viz import Visualizer
viz = Visualizer(session)

for frame in session.frames():
    hands = tracker.detect_hands(frame)
    viz.log_frame(frame, hands=hands)

viz.export("hands.rrd")

Hands appear as 2D overlays on the RGB panel and as 3D Points3D in the world scene when a camera pose is present.

When 3D isn't available

If your recording has no depth, joints come back with z=0 and Keypoint.x/y carry pixel coords. HandPose._kpts_2d_rgb is also stashed as a (21, 2) array.
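
If downstream code needs to know which units it received, one hedged heuristic is to check whether every z is zero. The helper below is hypothetical and works on plain (x, y, z) tuples; it relies only on the convention described above.

```python
def coordinate_units(joints):
    """Return "pixels" if every z is 0, otherwise "metres".

    joints is an iterable of (x, y, z) tuples. A hand anchored to
    depth cannot sit at z == 0 (that is the camera centre), so an
    all-zero z column is taken to mean no depth was available.
    """
    return "pixels" if all(z == 0 for _, _, z in joints) else "metres"


no_depth = [(412.0, 233.0, 0.0)] * 21
with_depth = [(0.05, -0.02, 0.48)] * 21

units_a = coordinate_units(no_depth)
units_b = coordinate_units(with_depth)
```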