Hand tracking

Pick a backend (WiLoR, MediaPipe, HaMeR), get 21-joint hand poses anchored to depth, and stash MANO outputs for HDF5 export.

HandTracker is a unified wrapper around three backends. They all return the same HandPose schema, so the rest of your loop is backend-agnostic:

from stera.models import HandTracker

tracker = HandTracker(model="mediapipe")  # or "wilor", or "hamer"
for frame in session.frames():
    hands = tracker.detect_hands(frame)   # list[HandPose]

Picking a backend

Backend	Setup	Speed	Accuracy	MANO output
MediaPipe	`pip install "stera-sdk[mediapipe]"`; nothing else	~30 ms / frame on CPU	Good for far / small hands	No
WiLoR	Clone WiLoR repo + run its `requirements.txt`	~80 ms / frame on RTX 5090	Tighter finger joints	Yes (vertices + betas + global orient + hand_pose)
HaMeR	Clone HaMeR repo + extract `_DATA/`; auto-downloads ~3 GB on first call	~200 ms / frame on RTX 5090	Best joint accuracy via ViTPose+ wholebody pipeline	Yes

All three anchor 3D joint positions to the depth image when frame.depth is present, so the same loop works whether you're running CPU MediaPipe or GPU WiLoR.
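Depth anchoring of this kind is typically a pinhole back-projection. The sketch below is not the SDK's implementation. It assumes a depth image in metres, a 3×3 intrinsics matrix K, and a median taken over a square patch around the joint (the default radius mirrors the `depth_sample_radius` field listed under Configuration). The name `backproject_joint` is hypothetical.

```python
import numpy as np

def backproject_joint(u, v, depth, K, radius=7):
    """Lift pixel (u, v) to camera-frame metres using a median depth patch.

    Hypothetical helper. Assumes `depth` is an (H, W) array in metres with
    0 or NaN marking invalid pixels, and K is a 3x3 pinhole intrinsics matrix.
    """
    h, w = depth.shape
    ui, vi = int(round(u)), int(round(v))
    if not (0 <= ui < w and 0 <= vi < h):
        return None  # joint projects outside the depth image
    # Clamp the sampling patch to the image bounds.
    patch = depth[max(vi - radius, 0):min(vi + radius + 1, h),
                  max(ui - radius, 0):min(ui + radius + 1, w)]
    valid = patch[np.isfinite(patch) & (patch > 0)]
    if valid.size == 0:
        return None  # no usable depth near this joint
    z = float(np.median(valid))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Standard pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)
```

A joint at the principal point maps to x = y = 0 at the sampled depth; a joint whose patch holds no valid depth yields `None`, which a caller could treat as "fall back to pixel coordinates".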

MediaPipe

from stera.models import HandTracker

tracker = HandTracker(
    model="mediapipe",
    max_num_hands=2,
    min_detection_confidence=0.6,   # tunable, see Configuration below
)

for frame in session.frames():
    hands = tracker.detect_hands(frame)
    for hp in hands:
        print(hp.hand_side, hp.wrist.x, hp.wrist.y, hp.wrist.z)

Each hand has 21 joints in MANO order: the wrist, then [mcp, pip, dip, tip] for each of thumb, index, middle, ring, and pinky. Joint coordinates are in metres in the camera optical frame when depth was available; otherwise they are in pixels with z=0.
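The joint order can be made concrete as a flat index. This is a sketch under one assumption: that a flat 21-joint list is laid out wrist first, then four joints per finger in the order given above. `FINGERS`, `JOINTS`, `flat_index`, and `JOINT_NAMES` are illustrative names, not SDK symbols.

```python
FINGERS = ("thumb", "index", "middle", "ring", "pinky")
JOINTS = ("mcp", "pip", "dip", "tip")

def flat_index(finger, joint):
    """Index into a flat 21-joint list: wrist at 0, then four joints per finger."""
    return 1 + FINGERS.index(finger) * len(JOINTS) + JOINTS.index(joint)

# Human-readable names in the same (assumed) flat order.
JOINT_NAMES = ["wrist"] + [f"{f}_{j}" for f in FINGERS for j in JOINTS]
```

Under that assumption the index fingertip sits at position 8 and the pinky tip at position 20.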

The MediaPipe asset (~6 MB) is downloaded to ~/.cache/mediapipe/hand_landmarker.task on first call.

WiLoR

tracker = HandTracker(
    model="wilor",
    model_path="/opt/WiLoR",       # local clone of rolpotamias/WiLoR
    save_mano_vertices=True,       # default; needed for /hand-pose MANO export
)

for frame in session.frames():
    hands = tracker.detect_hands(frame)

WiLoR runs YOLO + ViTDet on each frame, then a MANO regressor on each hand bbox. The full MANO outputs (778×3 vertices, 10 betas, global orient, per-finger pose) are stashed onto each HandPose and written to annotation.hdf5:/hand-pose by session.export.

See Installation → WiLoR.

HaMeR

tracker = HandTracker(
    model="hamer",
    model_path="/opt/hamer",       # local clone of geopavlakos/hamer
)

HaMeR's pipeline:

1. detectron2 (Cascade Mask R-CNN ViTDet-H or RegNet-Y) → human bboxes.
2. ViTPose+-Huge wholebody → 133 keypoints; the last 42 = both hands.
3. Build hand bboxes from confident hand keypoints.
4. HaMeR → MANO regression on each hand bbox.
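The keypoint-to-bbox step can be sketched as follows. This is an illustration, not the wrapper's code. It assumes one person's keypoints arrive as a (133, 3) array of (x, y, score) in COCO-WholeBody order, where the final 42 rows are the left hand followed by the right hand. The defaults reuse the `hand_keypoint_score_thresh` and `min_hand_keypoints` values from the Configuration tables; whether the real acceptance test is `>=` or `>` is an assumption here, and `hand_boxes` is a hypothetical name.

```python
import numpy as np

def hand_boxes(keypoints, score_thresh=0.5, min_keypoints=3):
    """Return {"left": bbox or None, "right": bbox or None} for one person.

    `keypoints` is a (133, 3) array of (x, y, score); the last 42 rows are
    assumed to be 21 left-hand keypoints followed by 21 right-hand keypoints.
    """
    hands = {"left": keypoints[-42:-21], "right": keypoints[-21:]}
    boxes = {}
    for side, kp in hands.items():
        confident = kp[kp[:, 2] > score_thresh]
        if len(confident) < min_keypoints:
            boxes[side] = None  # too few confident keypoints to trust a box
            continue
        x0, y0 = confident[:, :2].min(axis=0)
        x1, y1 = confident[:, :2].max(axis=0)
        boxes[side] = (float(x0), float(y0), float(x1), float(y1))
    return boxes
```

Each non-`None` box is the tight extent of the confident keypoints; a real pipeline would normally pad it before cropping.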

Same HandPose outputs as WiLoR, same MANO extras. First call downloads ~3 GB of detector weights from FAIR public files; everything is cached after that.

HaMeR + detectron2 + ViTPose pull in mmpose, mmcv, and chumpy. In our reference env (wilor-fresh, Python 3.10) we pinned numpy<2 because xtcocotools ships precompiled wheels built against a different NumPy ABI.

What detect_hands accepts

tracker.detect_hands(frame)                      # SyncedFrame → uses frame.rgb / .depth / .depth_K
tracker.detect_hands(rgb)                        # raw (H, W, 3) uint8 array
tracker.detect_hands(rgb, depth=d, intrinsics=K) # both explicit

When you pass a SyncedFrame, the wrapper plumbs frame.rgb, frame.depth, and frame.depth_K (or frame.rgb_K as fallback) for you. Pass raw arrays only when you've decoupled the loop from MCAPReader.

What you get back

hands: list[HandPose]

for hp in hands:
    hp.hand_side             # "left" or "right"
    hp.confidence            # float
    hp.wrist                 # Keypoint
    hp.fingers["index"][3]   # tip of the index finger
    hp.all_keypoints         # flat list of 21 Keypoints (or [wrist] for wrist-only)
    hp.has_fingers           # True if full hand was returned

See HandPose for the full schema.

For backends that produce MANO (WiLoR, HaMeR), private _mano_* attributes are also stashed on the HandPose and forwarded to HDF5 export; you don't normally read them yourself.
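As a small example of consuming these fields, the sketch below measures a thumb-to-index pinch. It uses a stand-in `Keypoint` dataclass so it runs without the SDK; the only things assumed about the real type are the `x`, `y`, `z` attributes and that index 3 of each finger is the tip, as shown above. `distance` and `pinch_distance` are hypothetical helpers.

```python
import math
from dataclasses import dataclass

@dataclass
class Keypoint:
    """Stand-in for the SDK's Keypoint; only x, y, z are assumed."""
    x: float
    y: float
    z: float

def distance(a, b):
    """Euclidean distance between two keypoints."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))

def pinch_distance(fingers):
    """Distance between thumb tip and index tip (index 3 is the tip)."""
    return distance(fingers["thumb"][3], fingers["index"][3])
```

With a real detection this would be called as `pinch_distance(hp.fingers)`. The result is in metres only when depth was available; without depth it is a pixel distance.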

Configuration

Pass any backend config field as a kwarg to HandTracker(...). The wrapper forwards it to the backend's Config dataclass.

MediaPipe

Field	Default	Notes
`max_num_hands`	`2`
`min_detection_confidence`	`0.3`	Higher = fewer false positives.
`min_presence_confidence`	`0.3`	Lower = stickier tracking.
`min_tracking_confidence`	`0.3`
`depth_sample_radius`	`7`	Pixel patch around joint for depth.
`depth_buffer_size`	`15`	Rolling-median window for depth smoothing.
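To illustrate what a rolling-median window does to per-joint depth, here is a minimal sketch. It is not the backend's code: the class name `RollingMedian` is hypothetical, and treating `None` or non-positive readings as dropouts is an assumption.

```python
from collections import deque
from statistics import median

class RollingMedian:
    """Median over the last `size` valid depth samples (hypothetical helper)."""

    def __init__(self, size=15):
        self._buf = deque(maxlen=size)  # oldest sample falls out automatically

    def update(self, z):
        # Skip missing readings so a dropout does not drag the median down.
        if z is not None and z > 0:
            self._buf.append(z)
        return median(self._buf) if self._buf else None
```

A single outlier (for example a 5.0 m spike among 1.x m readings) leaves the median near the true depth, at the cost of some lag that grows with the window size.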

WiLoR

Field	Default	Notes
`wilor_dir`	required	Path to local WiLoR clone.
`yolo_conf`	`0.4`	YOLO detection threshold.
`rescale_factor`	`2.0`	bbox padding for ViTDet.
`batch_size`	`16`
`detect_every_n`	`1`	Skip detection on N-1 of every N frames for speed.
`save_mano_vertices`	`True`	778×3 vertices stashed per hand.
`depth_buffer_size`	`15`
`depth_sample_radius`	`7`
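The effect of `detect_every_n` can be pictured with a generic frame-skipping loop. This sketch assumes the skipped frames reuse the most recent detector boxes, which the table does not state; `run_with_skips`, `detect`, and `refine` are hypothetical names standing in for the detector pass and the per-hand stage.

```python
def run_with_skips(frames, detect, refine, detect_every_n=1):
    """Run `detect` on every Nth frame and reuse its boxes in between."""
    boxes = []
    results = []
    for i, frame in enumerate(frames):
        if i % detect_every_n == 0:
            boxes = detect(frame)  # expensive detector pass
        # The per-hand stage still runs on every frame, on the latest boxes.
        results.append(refine(frame, boxes))
    return results
```

The trade-off is that boxes can go stale during fast motion, so larger values of N suit slow or static hands better.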

HaMeR

Field	Default	Notes
`hamer_dir`	required	Path to local HaMeR clone.
`body_detector`	`"vitdet"`	Or `"regnety"` (faster, smaller).
`body_detector_score_thresh`	`0.5`
`min_hand_keypoints`	`3`	Confident wholebody hand keypoints required to accept.
`hand_keypoint_score_thresh`	`0.5`
`batch_size`	`8`
`save_mano_vertices`	`True`

All three backends also expose sanity-check thresholds (max_joint_abs, min_wrist_depth, min_palm_span, max_palm_span); detections with implausible joint configurations are dropped silently.
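A plausibility check built on thresholds like these might look like the sketch below. Everything numeric here is illustrative: the default values are not the SDK's, palm span is taken as the index-MCP to pinky-MCP distance, and the joint list is assumed to be wrist-first with four joints per finger (so those MCPs sit at indices 5 and 17). `is_plausible` is a hypothetical name.

```python
import math

def is_plausible(joints, max_joint_abs=5.0, min_wrist_depth=0.05,
                 min_palm_span=0.03, max_palm_span=0.20):
    """Return False for hands whose geometry cannot be a real hand.

    `joints` is a list of 21 (x, y, z) tuples in metres, wrist at index 0.
    Threshold defaults are illustrative only.
    """
    # Any non-finite or absurdly large coordinate invalidates the hand.
    if any(not math.isfinite(c) or abs(c) > max_joint_abs
           for joint in joints for c in joint):
        return False
    # A wrist essentially on the lens is almost certainly a depth failure.
    if joints[0][2] < min_wrist_depth:
        return False
    # Palm span assumed to be index-MCP (5) to pinky-MCP (17) distance.
    span = math.dist(joints[5], joints[17])
    return min_palm_span <= span <= max_palm_span
```

Because rejected detections are dropped without a message, a check like this is also handy for debugging when a hand disappears unexpectedly.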

Patterns

Attach to the session for HDF5 export

for frame in session.frames():
    hands = tracker.detect_hands(frame)
    session.add_hand_pose(frame.index, hands)

session.export("episodes/run_01")

The accumulated detections land in annotation.hdf5:/hand-pose as (F, 21, 3) arrays per side. See HDF5 schema.
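To make the (F, 21, 3) shape concrete, the sketch below stacks per-frame detections for one side into a single array. It is an illustration only: the input format (a list of `(side, joints)` pairs per frame), the NaN fill for frames where that hand is missing, and the name `stack_side` are all assumptions; the actual layout is defined by the HDF5 schema.

```python
import numpy as np

def stack_side(per_frame, side, num_joints=21):
    """Stack one side's joints into an (F, 21, 3) array, NaN where missing.

    `per_frame` is a list with one entry per frame; each entry is a list of
    (hand_side, joints) pairs, where joints is a (21, 3) array.
    """
    out = np.full((len(per_frame), num_joints, 3), np.nan, dtype=np.float32)
    for f, hands in enumerate(per_frame):
        for hand_side, joints in hands:
            if hand_side == side:
                out[f] = joints
                break  # keep the first detection of this side per frame
    return out
```

Keeping one row per frame, even when no hand was seen, preserves alignment with the frame index used by other per-frame data.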

Visualize over RGB

from stera.viz import Visualizer

viz = Visualizer(session)

for frame in session.frames():
    hands = tracker.detect_hands(frame)
    viz.log_frame(frame, hands=hands)

viz.export("hands.rrd")

Hands appear as 2D overlays on the RGB panel and as 3D Points3D in the world scene when a camera pose is present.

When 3D isn't available

If your recording has no depth, joints come back with z=0 and Keypoint.x/y carry pixel coords. HandPose._kpts_2d_rgb is also stashed as a (21, 2) array.
