Hand tracking
Pick a backend (WiLoR, MediaPipe, HaMeR), get 21-joint hand poses anchored to depth, and stash MANO outputs for HDF5 export.
HandTracker is a unified wrapper around three backends. They all return the same HandPose schema, so the rest of your loop is backend-agnostic:
from stera.models import HandTracker
tracker = HandTracker(model="mediapipe") # or "wilor", or "hamer"
for frame in session.frames():
    hands = tracker.detect_hands(frame)  # list[HandPose]
Picking a backend
| Backend | Setup | Speed | Accuracy | MANO output |
|---|---|---|---|---|
| MediaPipe | pip install "stera-sdk[mediapipe]"; nothing else | ~30 ms / frame on CPU | Good for far / small hands | No |
| WiLoR | Clone WiLoR repo + run its requirements.txt | ~80 ms / frame on RTX 5090 | Tighter finger joints | Yes (vertices + betas + global orient + hand_pose) |
| HaMeR | Clone HaMeR repo + extract _DATA/; auto-downloads ~3 GB on first call | ~200 ms / frame on RTX 5090 | Best joint accuracy via ViTPose+ wholebody pipeline | Yes |
All three anchor 3D joint positions to the depth image when frame.depth is present, so the same loop works whether you're running CPU MediaPipe or GPU WiLoR.
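As a rough illustration of what "anchoring to depth" can mean, the sketch below lifts a 2D joint to camera-frame metres with a pinhole model. It is not taken from the stera source: the function name `backproject_joint`, the assumption that depth is in metres with 0 marking invalid pixels, and the 3x3 layout of `K` are all illustrative assumptions. The patch radius mirrors the `depth_sample_radius` config field.

```python
import numpy as np


def backproject_joint(u, v, depth, K, radius=7):
    """Lift pixel (u, v) to camera-frame metres using a median depth patch.

    Assumptions (hypothetical, not the SDK's documented behaviour):
    `depth` is an (H, W) array in metres where 0 means "no reading",
    and `K` is a 3x3 pinhole intrinsics matrix.
    Returns None when the pixel is off-image or the patch has no valid depth.
    """
    h, w = depth.shape
    ui, vi = int(round(u)), int(round(v))
    if not (0 <= ui < w and 0 <= vi < h):
        return None

    # Clamp the sampling patch to the image bounds.
    r0, r1 = max(vi - radius, 0), min(vi + radius + 1, h)
    c0, c1 = max(ui - radius, 0), min(ui + radius + 1, w)
    patch = depth[r0:r1, c0:c1]

    # The median over valid pixels is robust to holes and edge bleed.
    valid = patch[np.isfinite(patch) & (patch > 0)]
    if valid.size == 0:
        return None
    z = float(np.median(valid))

    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
```

With `fx = 600`, `cx = 320`, and a flat depth of 0.5 m, a joint at pixel `u = 380` lands 5 cm to the right of the optical axis.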
MediaPipe
from stera.models import HandTracker
tracker = HandTracker(
    model="mediapipe",
    max_num_hands=2,
    min_detection_confidence=0.6,  # tunable, see Configuration below
)
for frame in session.frames():
    hands = tracker.detect_hands(frame)
    for hp in hands:
        print(hp.hand_side, hp.wrist.x, hp.wrist.y, hp.wrist.z)
Each hand has 21 joints in MANO order: wrist + [mcp, pip, dip, tip] for thumb, index, middle, ring, pinky. Joint coordinates are in metres in the camera optical frame when depth was available, otherwise in pixels with z=0.
The MediaPipe asset (~6 MB) is downloaded to ~/.cache/mediapipe/hand_landmarker.task on first call.
WiLoR
tracker = HandTracker(
    model="wilor",
    model_path="/opt/WiLoR",  # local clone of rolpotamias/WiLoR
    save_mano_vertices=True,  # default; needed for /hand-pose MANO export
)
for frame in session.frames():
    hands = tracker.detect_hands(frame)
WiLoR runs YOLO + ViTDet on each frame, then a MANO regressor on each hand bbox. The full MANO outputs (778×3 vertices, 10 betas, global orient, per-finger pose) are stashed onto each HandPose and written to annotation.hdf5:/hand-pose by session.export.
See Installation → WiLoR.
HaMeR
tracker = HandTracker(
    model="hamer",
    model_path="/opt/hamer",  # local clone of geopavlakos/hamer
)
HaMeR's pipeline:
- detectron2 (Cascade Mask R-CNN ViTDet-H or RegNet-Y) → human bboxes.
- ViTPose+-Huge wholebody → 133 keypoints; the last 42 = both hands.
- Build hand bboxes from confident hand keypoints.
- HaMeR → MANO regression on each hand bbox.
Same HandPose outputs as WiLoR, same MANO extras. First call downloads ~3 GB of detector weights from FAIR public files; everything is cached after that.
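The "last 42 keypoints" and "confident hand keypoints" steps of the pipeline can be sketched as follows. This is an illustration, not the backend's code: `hand_bboxes_from_wholebody` is a hypothetical helper, the input is assumed to be a (133, 3) array of (x, y, score) in COCO-WholeBody order (left hand before right hand), and the defaults mirror the `hand_keypoint_score_thresh` and `min_hand_keypoints` config fields.

```python
import numpy as np


def hand_bboxes_from_wholebody(keypoints, score_thresh=0.5, min_keypoints=3):
    """Build per-hand boxes from wholebody keypoints.

    keypoints: (133, 3) array of (x, y, score), assumed to follow the
    COCO-WholeBody ordering, so the last 42 rows are left hand then right hand.
    Returns {side: (x0, y0, x1, y1)} for hands with enough confident points.
    """
    hands = {"left": keypoints[-42:-21], "right": keypoints[-21:]}
    bboxes = {}
    for side, kp in hands.items():
        confident = kp[kp[:, 2] > score_thresh]
        if len(confident) < min_keypoints:
            continue  # too few reliable keypoints: skip this hand
        x0, y0 = confident[:, :2].min(axis=0)
        x1, y1 = confident[:, :2].max(axis=0)
        bboxes[side] = (float(x0), float(y0), float(x1), float(y1))
    return bboxes
```

A hand with only one or two confident keypoints yields no box, so the MANO regressor never runs on it.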
HaMeR, detectron2, and ViTPose together pull in mmpose, mmcv, and chumpy. In our reference env (wilor-fresh, Python 3.10) we pinned numpy<2 because xtcocotools ships precompiled wheels with a different ABI.
What detect_hands accepts
tracker.detect_hands(frame) # SyncedFrame → uses frame.rgb / .depth / .depth_K
tracker.detect_hands(rgb) # raw (H, W, 3) uint8 array
tracker.detect_hands(rgb, depth=d, intrinsics=K)  # both explicit
When you pass a SyncedFrame, the wrapper plumbs frame.rgb, frame.depth, and frame.depth_K (or frame.rgb_K as fallback) for you. Pass raw arrays only when you've decoupled the loop from MCAPReader.
What you get back
hands: list[HandPose]
for hp in hands:
    hp.hand_side            # "left" or "right"
    hp.confidence           # float
    hp.wrist                # Keypoint
    hp.fingers["index"][3]  # tip of the index finger
    hp.all_keypoints        # flat list of 21 Keypoints (or [wrist] for wrist-only)
    hp.has_fingers          # True if full hand was returned
See HandPose for the full schema.
For backends that produce MANO (WiLoR, HaMeR), private _mano_* attributes are also stashed on the HandPose and forwarded to HDF5 export; you don't normally read them yourself.
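To show how these fields might be consumed downstream, here is a small sketch that measures the thumb-to-index pinch distance. The `Keypoint` and `HandPose` classes below are stand-ins that mirror only the attributes listed above so the snippet runs on its own. The `"thumb"` key in `fingers` and the helper `pinch_distance` are assumptions, not confirmed SDK names.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Keypoint:  # stand-in for the SDK's Keypoint (only x, y, z assumed)
    x: float
    y: float
    z: float


@dataclass
class HandPose:  # stand-in exposing the attributes listed above
    hand_side: str
    confidence: float
    wrist: Keypoint
    fingers: dict = field(default_factory=dict)

    @property
    def has_fingers(self):
        return bool(self.fingers)


def pinch_distance(hp):
    """Distance between thumb tip and index tip, or None for wrist-only poses.

    The result is in metres when depth was available, otherwise in pixels.
    Assumes fingers is keyed by finger name and index 3 is the tip.
    """
    if not hp.has_fingers:
        return None
    thumb_tip = hp.fingers["thumb"][3]
    index_tip = hp.fingers["index"][3]
    return math.dist(
        (thumb_tip.x, thumb_tip.y, thumb_tip.z),
        (index_tip.x, index_tip.y, index_tip.z),
    )
```

Because the helper only touches `has_fingers` and `fingers`, it should work unchanged on real HandPose objects if those attributes behave as described.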
Configuration
Pass any backend config field as a kwarg to HandTracker(...). The wrapper forwards it to the backend's Config dataclass.
MediaPipe
| Field | Default | Notes |
|---|---|---|
| max_num_hands | 2 | |
| min_detection_confidence | 0.3 | Higher = fewer false positives. |
| min_presence_confidence | 0.3 | Lower = stickier tracking. |
| min_tracking_confidence | 0.3 | |
| depth_sample_radius | 7 | Pixel patch around joint for depth. |
| depth_buffer_size | 15 | Rolling-median window for depth smoothing. |
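
To make the `depth_buffer_size` note concrete, the following sketch shows one way a rolling-median depth filter can work. `RollingMedianDepth` is a hypothetical class, not the backend's implementation, and the choice to skip missing or non-positive readings is an assumption.

```python
from collections import deque
from statistics import median


class RollingMedianDepth:
    """Median over the last `buffer_size` valid depth readings (illustrative)."""

    def __init__(self, buffer_size=15):
        self._buf = deque(maxlen=buffer_size)  # oldest readings fall off

    def update(self, z):
        # Assumption: None or non-positive values mean "no depth this frame"
        # and are ignored rather than stored.
        if z is not None and z > 0:
            self._buf.append(z)
        return median(self._buf) if self._buf else None
```

A single outlier (for example, a 3.0 m reading among 0.5 m readings) does not move the median, which is the usual reason to prefer it over a rolling mean here.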
WiLoR
| Field | Default | Notes |
|---|---|---|
| wilor_dir | required | Path to local WiLoR clone. |
| yolo_conf | 0.4 | YOLO detection threshold. |
| rescale_factor | 2.0 | bbox padding for ViTDet. |
| batch_size | 16 | |
| detect_every_n | 1 | Skip detection on N-1 of every N frames for speed. |
| save_mano_vertices | True | 778×3 vertices stashed per hand. |
| depth_buffer_size | 15 | |
| depth_sample_radius | 7 | |
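
The `detect_every_n` behaviour can be pictured as a detection stride. In the sketch below the detector runs on every N-th frame and the most recent boxes are reused in between. Reusing the previous boxes is an assumption about how the skipped frames are handled, and `run_with_detection_stride`, `detect`, and `regress` are hypothetical names.

```python
def run_with_detection_stride(frames, detect, regress, detect_every_n=1):
    """Yield regress(frame, boxes), running detect() only every N-th frame.

    detect(frame) -> list of boxes; regress(frame, boxes) -> per-frame result.
    Frames that skip detection reuse the boxes from the last detected frame.
    """
    last_boxes = []
    for i, frame in enumerate(frames):
        if i % detect_every_n == 0:
            last_boxes = detect(frame)  # refresh boxes on stride frames
        yield regress(frame, last_boxes)
```

With `detect_every_n=3`, six frames trigger two detector calls (frames 0 and 3). The trade-off is that fast-moving hands can drift out of a stale box before the next refresh.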
HaMeR
| Field | Default | Notes |
|---|---|---|
| hamer_dir | required | Path to local HaMeR clone. |
| body_detector | "vitdet" | Or "regnety" (faster, smaller). |
| body_detector_score_thresh | 0.5 | |
| min_hand_keypoints | 3 | Confident wholebody hand keypoints required to accept. |
| hand_keypoint_score_thresh | 0.5 | |
| batch_size | 8 | |
| save_mano_vertices | True | |
All three backends also expose sanity-check thresholds (max_joint_abs, min_wrist_depth, min_palm_span, max_palm_span); detections with implausible joint configurations are dropped silently.
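A plausibility filter built on thresholds like these might look as follows. Everything specific here is an assumption for illustration: the default values, the definition of palm span as the index-mcp to pinky-mcp distance, and the helper name `is_plausible`. Only the four threshold names come from the backends.

```python
import numpy as np

# Joint indices under the wrist + 4-per-finger layout described earlier
# (thumb 1-4, index 5-8, middle 9-12, ring 13-16, pinky 17-20).
WRIST, INDEX_MCP, PINKY_MCP = 0, 5, 17


def is_plausible(joints, max_joint_abs=5.0, min_wrist_depth=0.05,
                 min_palm_span=0.03, max_palm_span=0.25):
    """Return True if a (21, 3) joint array in metres looks like a real hand.

    All default values are hypothetical, not the SDK's defaults.
    """
    joints = np.asarray(joints, dtype=float)
    if not np.all(np.isfinite(joints)):
        return False
    if np.abs(joints).max() > max_joint_abs:
        return False  # a joint is implausibly far from the camera origin
    if joints[WRIST, 2] < min_wrist_depth:
        return False  # wrist too close to (or behind) the camera
    span = float(np.linalg.norm(joints[INDEX_MCP] - joints[PINKY_MCP]))
    return min_palm_span <= span <= max_palm_span
```

Filtering a detection list is then a one-liner such as `[j for j in candidates if is_plausible(j)]`, which matches the "dropped silently" behaviour in spirit.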
Patterns
Attach to the session for HDF5 export
for frame in session.frames():
    hands = tracker.detect_hands(frame)
    session.add_hand_pose(frame.index, hands)
session.export("episodes/run_01")
The accumulated detections land in annotation.hdf5:/hand-pose as (F, 21, 3) arrays per side. See HDF5 schema.
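To picture the "(F, 21, 3) arrays per side" layout, the sketch below stacks per-frame detections into one array per hand. It is not the exporter's code: `stack_hand_poses` is a hypothetical helper, the input format is invented for the example, and filling frames without a detection with NaN is an assumption (check the HDF5 schema for the actual convention).

```python
import numpy as np


def stack_hand_poses(per_frame, num_frames):
    """Stack detections into {"left": (F, 21, 3), "right": (F, 21, 3)}.

    per_frame: dict mapping frame index -> list of (hand_side, joints),
    where joints is a (21, 3) array. Frames with no detection for a side
    stay NaN (an assumed convention, not confirmed by the schema).
    """
    out = {side: np.full((num_frames, 21, 3), np.nan)
           for side in ("left", "right")}
    for idx, hands in per_frame.items():
        for side, joints in hands:
            out[side][idx] = joints
    return out
```

Keeping a fixed frame axis means index `i` in either array always refers to frame `i`, so left and right trajectories stay aligned even when one hand is missing.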
Visualize over RGB
from stera.viz import Visualizer
viz = Visualizer(session)
for frame in session.frames():
    hands = tracker.detect_hands(frame)
    viz.log_frame(frame, hands=hands)
viz.export("hands.rrd")
Hands appear as 2D overlays on the RGB panel and as 3D Points3D in the world scene when a camera pose is present.
When 3D isn't available
If your recording has no depth, joints come back with z=0 and Keypoint.x/y carry pixel coords. HandPose._kpts_2d_rgb is also stashed as a (21, 2) array.
See also
- HandTracker API, constructor and methods.
- HandPose, output schema.
- HDF5 schema → /hand-pose, what gets exported.