- Checkpoint-based scoring: Each task defines multiple weighted checkpoints, each scored from 0.0 to 1.0
- Anti-reward-hacking: Evaluators check actual state changes, not just UI clicks
- Resettable: Clear localStorage and reload for a fresh environment
- OSWorld-ready: Integrates directly with the OSWorld evaluation harness
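
To illustrate how weighted checkpoints and state-based checks could fit together, here is a minimal sketch. It is not this project's evaluator: the names `Checkpoint`, `score_task`, and the `cart`/`orders` keys are hypothetical, the state dict stands in for whatever the app persists, and normalizing by total weight is an assumption about how the final score is combined.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Sequence

# Hypothetical snapshot of the app's persisted state (e.g. parsed storage contents).
State = Mapping[str, Any]


@dataclass(frozen=True)
class Checkpoint:
    name: str
    weight: float
    # Inspects the resulting state, not the click trace; returns a score in [0.0, 1.0].
    check: Callable[[State], float]


def score_task(checkpoints: Sequence[Checkpoint], state: State) -> float:
    """Weighted mean of checkpoint scores (assumed normalization by total weight)."""
    total_weight = sum(cp.weight for cp in checkpoints)
    if total_weight <= 0:
        raise ValueError("checkpoints must have a positive total weight")
    weighted_sum = 0.0
    for cp in checkpoints:
        # Clamp so a misbehaving check cannot push the score outside [0.0, 1.0].
        raw = cp.check(state)
        weighted_sum += cp.weight * min(1.0, max(0.0, raw))
    return weighted_sum / total_weight


# Hypothetical task: add an item to the cart, then place an order.
CHECKPOINTS = [
    Checkpoint("item_in_cart", 1.0,
               lambda s: 1.0 if "sku-1" in s.get("cart", []) else 0.0),
    Checkpoint("order_placed", 3.0,
               lambda s: 1.0 if s.get("orders") else 0.0),
]

partial_state = {"cart": ["sku-1"], "orders": []}
print(score_task(CHECKPOINTS, partial_state))  # 0.25
```

Because each check reads the final state, an agent that clicks "Place order" without an order actually being recorded earns nothing for that checkpoint.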