Long‑horizon RL environments for frontier AI labs.

Realistic, long-horizon environments for code, computer-use, and enterprise workflows that challenge SOTA models.

What we build

RL environments, custom and off-the-shelf.

Executable worlds with programmatic graders across three frontiers, hard enough that today’s best models fail most tasks.

Coding

Multi-file repos with build, run and test loops. Agents plan, edit, execute and repair across long sessions, graded against SWE-bench.

Computer-use

Full desktop and browser control, judged on the end state of long, multi-step tasks. Benchmarked on OSWorld.

Enterprise workflows

CRMs, spreadsheets, ticketing and finance — the real work companies run, with custom graders on your data.

Coding · sample environments

Pass@8 mean · SOTA models · 20 tasks

Claude Opus 4.8: 31.4%

GPT 5.5: 27.2%

Sample environments · shared with customers

rails-ecommerce-bugs: Rails · Spree

react-frontend-bugs: React · Redux

typescript-video-impl: TypeScript

Difficulty: Pass@8 11–37% · eight trials (K=8), graded reward

Computer-use · sample environments

Pass@8 mean · SOTA models · 33 environments

Claude Opus 4.8: 22.5%

GPT 5.5: 18.0%

Sample environments · shared with customers

research-agent: multi-step

message-triage: inbox

content-publishing: web apps

Graded on the end state · eight trials, continuous reward

Enterprise workflows · sample environments

Pass@8 mean · SOTA models · 33 environments

Claude Opus 4.8: 19.6%

GPT 5.5: 15.3%

Sample environments · shared with customers

access-review-quarterly: access

quarterly-tax-prep: finance

customer-escalation: support

Custom graders on your data · eight trials, continuous reward

What makes us different.

Verifiable data quality · Expert network · Huzzle Labs

Data quality

Verifiable downstream model improvements

HuzzleWorld-8B, our 8B computer-use model, is trained entirely on our own computer-use environments, and ranks #11 on OSWorld, beside models many times its size.

Coasty CUA v1: 82.8

Holo3-35B-A3B: 82.6

HuzzleWorld-8B: 57.0

Scale

300k+ expert network

Built on Huzzle.com. Our AI recruiter sources vetted specialists for any domain, on demand.

100k monthly active · 88 expert NPS

The result: hundreds of high-quality tasks per week, thousands per month.

Comparison

RL-env startups: Deeptune · Mechanize

Human-data co’s: Mercor · Scale

Focus on RL environments

Expert access & operational scale

Get started

Request sample data.

Tell us what you’d like to see and we’ll tailor the sample to you.

Which environments?

or talk to the founders