Find the prompts that break frontier AI.
ARP turns reproducible model failures into rigorous eval data. Hunt for the cases the best models get wrong. Our judge verifies every claim, and you earn a reward for each one that holds up.
Any Google account · admin-approved · payouts in 48–72h
“Return ONLY JSON with keys sentiment and confidence. No prose.”
Wrapped the JSON in prose — violates “ONLY JSON”. Reproduced.
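A failure like the one above can be checked mechanically. The sketch below is illustrative only and is not ARP's judge: the function name `is_only_json` is hypothetical, and it assumes "ONLY JSON" means the whole reply must parse as a single JSON object with exactly the two requested keys.

```python
import json

# Keys named in the example prompt above.
REQUIRED_KEYS = {"sentiment", "confidence"}


def is_only_json(reply: str, required_keys=REQUIRED_KEYS) -> bool:
    """Return True only if the entire reply is one JSON object with exactly the required keys.

    Assumption: surrounding whitespace is tolerated, any other text is a violation.
    """
    try:
        # json.loads rejects leading prose ("Expecting value") and trailing prose ("Extra data").
        parsed = json.loads(reply.strip())
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == set(required_keys)


good = '{"sentiment": "positive", "confidence": 0.93}'
wrapped = 'Sure! Here is the JSON: {"sentiment": "positive", "confidence": 0.93}'

print(is_only_json(good))
print(is_only_json(wrapped))
```

A deterministic check like this suits format constraints. Failures such as fabricated citations need a rubric and a judge instead.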
Pressure-testing the frontier
Benchmarks are saturated. The real failures hide in the long tail.
Frontier models ace public benchmarks — then quietly hallucinate citations, break strict formats, ignore explicit constraints, and mishandle the edge cases that matter in production. ARP captures those misses as reproducible, scored cases the labs can actually learn from.
Confident, fabricated facts, links and quotes.
Ignores constraints, formats, or refusals.
Subtle reasoning and tool-use breakdowns.
From a broken prompt to lab-ready eval data.
Submit a failure
Paste the exact prompt that broke a model, what you expected, and the full transcript. Seven quick steps — no rubric-writing degree required.
The judge verifies
An automated LLM judge re-reads your transcript against the rubric and confirms the failure actually reproduces — no he-said-she-said.
Humans review
Reviewers triage a keyboard-first queue; claimed-vs-judged mismatches are flagged. An admin gives the final nod before anything counts.
You get paid
Every approved case earns a reward scaled to the model's tier, and ships as clean, scored JSONL the labs can train on.
What separates signal from noise.
Reproducible by design
We re-run your exact prompt. If the failure doesn't hold up, it never makes the dataset — so what ships is real.
Discriminating, not noisy
The best cases split the field: one model trips where another sails. Each is scored for difficulty and discrimination.
Judge-verified
An LLM judge grades every transcript against your rubric, catching mislabeled or wishful submissions automatically.
Reviewer-grade tooling
A fast, keyboard-driven review queue with search, filters, tiers and bulk triage keeps the quality bar high at scale.
Scored for the labs
Difficulty and discrimination land on every case, so accepted failures are immediately useful eval signal.
Clean JSONL export
Filter by tag or difficulty and export accepted cases as tidy JSONL — prompt, rubric, attempts, transcripts and scores.
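To make the export concrete, here is a minimal sketch of filtering accepted cases and writing them as JSONL, one JSON object per line. The field names (`tags`, `scores`, `difficulty`, `discrimination`) and the `to_jsonl` helper are assumptions for illustration; the actual export schema may differ.

```python
import json


def to_jsonl(cases, tag=None, min_difficulty=None) -> str:
    """Serialize cases as JSONL, optionally filtering by tag or by a difficulty floor.

    Assumption: each case is a dict with "tags" and "scores" entries (hypothetical schema).
    """
    lines = []
    for case in cases:
        if tag is not None and tag not in case.get("tags", []):
            continue
        if min_difficulty is not None and case["scores"]["difficulty"] < min_difficulty:
            continue
        lines.append(json.dumps(case, ensure_ascii=False))
    # One object per line, with a trailing newline when there is any output.
    return "\n".join(lines) + ("\n" if lines else "")


# Hypothetical accepted cases; all values are made up for the example.
cases = [
    {
        "prompt": "Return ONLY JSON with keys sentiment and confidence. No prose.",
        "rubric": "Reply must be a bare JSON object.",
        "attempts": 3,
        "transcripts": ["Sure! Here is the JSON: {...}"],
        "tags": ["format"],
        "scores": {"difficulty": 0.4, "discrimination": 0.7},
    },
    {
        "prompt": "Cite two sources for this claim.",
        "rubric": "Every cited source must exist.",
        "attempts": 2,
        "transcripts": ["According to a study that does not exist..."],
        "tags": ["hallucination"],
        "scores": {"difficulty": 0.8, "discrimination": 0.5},
    },
]

format_only = to_jsonl(cases, tag="format")
hard_only = to_jsonl(cases, min_difficulty=0.6)
print(format_only, end="")
```

Each output line parses on its own with `json.loads`, so a consumer can stream the file without loading it whole.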
Every verified failure pays.
Rewards scale with how hard the model is to break. The exact bracket for your model is shown before you submit, and approved cases pay out in 48–72 hours.
Start earning
Breaking the hardest, most capable models.
Solid, reproducible failures on standard models.
Questions, answered.
What is ARP?
ARP is Adzzat's LLM evaluation platform. Contributors find reproducible prompts where frontier models fail; an automated judge plus human reviewers verify each one; accepted failures become scored eval datasets — and contributors earn a reward for every case that holds up.
Who can join?
Anyone with a Google account can request access. New accounts join a short waitlist and an admin approves them. You don't need to be an ML researcher — if you're good at finding where AI breaks, you're who we want.
What makes a good failure case?
One clear task, an unambiguous expected outcome, and a failure a reviewer can understand without extra context — ideally one where some models fail and others succeed. Hallucinated facts, broken format, ignored constraints, and unsafe compliance are all great.
How much can I earn?
Every verified, approved failure pays a reward scaled to the model's tier — frontier models pay more than standard ones. Tiers and amounts are set by admins, and the exact bracket is shown on the submission form before you submit.
How are submissions verified?
We re-run your transcript through an automated LLM judge that grades it against the rubric and the expected behaviour. If the judge disagrees with your claim, the case is flagged for a human reviewer. A reviewer accepts, then an admin gives final approval.
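The verification flow described in this answer can be sketched as a small decision function. The status strings and the `next_status` name are hypothetical, and rejection paths are an assumption, since the text only describes the accepting path.

```python
from typing import Optional


def next_status(
    claimed_failure: bool,
    judged_failure: bool,
    reviewer_accepted: Optional[bool] = None,
    admin_approved: Optional[bool] = None,
) -> str:
    """Return where a submission stands, given the decisions made so far.

    None means that step has not happened yet. Status names are illustrative.
    """
    # A claimed-vs-judged mismatch is flagged for a human reviewer.
    mismatch = claimed_failure != judged_failure
    if reviewer_accepted is None:
        return "flagged_for_review" if mismatch else "awaiting_review"
    if not reviewer_accepted:
        return "rejected"  # assumption: a reviewer can decline a case
    # A reviewer has accepted; an admin gives final approval.
    if admin_approved is None:
        return "awaiting_admin"
    return "approved" if admin_approved else "rejected"


print(next_status(claimed_failure=True, judged_failure=False))
print(next_status(True, True, reviewer_accepted=True, admin_approved=True))
```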
When do I get paid?
Verified failures that pass review and admin approval pay out within 48–72 hours. You can track received and pending earnings on your payouts page at any time.
Ready to break some models?
Sign in with Google, get approved, and turn the failures you find into rewarded, lab-grade eval data.
Continue with Google