Question 1

What is ARP?

Accepted Answer

ARP is Adzzat's LLM evaluation platform. Contributors find reproducible prompts where frontier models fail; an automated judge plus human reviewers verify each one; accepted failures become scored eval datasets — and contributors earn a reward for every case that holds up.

Question 2

Who can join?

Accepted Answer

Anyone with a Google account can request access. New accounts join a short waitlist and an admin approves them. You don't need to be an ML researcher — if you're good at finding where AI breaks, you're who we want.

Question 3

What makes a good failure case?

Accepted Answer

One clear task, an unambiguous expected outcome, and a failure a reviewer can understand without extra context. Ideally, some models fail the case while others succeed. Hallucinated facts, broken format, ignored constraints, and unsafe compliance are all good candidates.
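To make those criteria concrete, here is a minimal sketch of how a single failure case might be represented and written out as one JSONL line. The field names, the `FailureCase` class, and the model names are illustrative assumptions, not ARP's actual submission schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FailureCase:
    # All field names here are hypothetical, chosen to mirror the criteria above.
    prompt: str               # one clear task
    expected_outcome: str     # unambiguous expected outcome
    observed_output: str      # what the model actually returned
    failure_type: str         # e.g. "ignored_constraint", "hallucinated_fact"
    models_failed: list[str]
    models_passed: list[str]

    def is_discriminating(self) -> bool:
        # True when at least one model fails and at least one succeeds.
        return bool(self.models_failed) and bool(self.models_passed)


def to_jsonl_line(case: FailureCase) -> str:
    # json.dumps escapes newlines inside strings, so the record stays on one line.
    return json.dumps(asdict(case), ensure_ascii=False)


example = FailureCase(
    prompt="Reply with exactly three words describing the sky.",
    expected_outcome="A reply containing exactly three words.",
    observed_output="The sky is blue and vast.",
    failure_type="ignored_constraint",
    models_failed=["model-a"],   # placeholder model names
    models_passed=["model-b"],
)
line = to_jsonl_line(example)
```

In this sketch, `is_discriminating()` captures the "some models fail and others succeed" preference as a simple check a contributor could run before submitting.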

Question 4

How much can I earn?

Accepted Answer

Every verified, approved failure pays a reward scaled to the model's tier — frontier models pay more than standard ones. Tiers and amounts are set by admins, and the exact bracket is shown on the submission form before you submit.

Question 5

How are submissions verified?

Accepted Answer

We re-run your transcript through an automated LLM judge, which grades it against the rubric and the expected behaviour. If the judge disagrees with your claim, the case is flagged for a human reviewer. Once a reviewer accepts the case, an admin gives final approval.
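The flow above can be pictured as a small state machine. This is only a sketch under stated assumptions: the status names and functions are hypothetical, and it assumes that cases the judge agrees with also go to a reviewer before admin approval.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Hypothetical status names; the platform's real states are not documented here.
    SUBMITTED = "submitted"
    JUDGE_AGREED = "judge_agreed"
    FLAGGED = "flagged_for_reviewer"
    REVIEWER_ACCEPTED = "reviewer_accepted"
    APPROVED = "approved"


@dataclass
class Submission:
    claims_failure: bool
    status: Status = Status.SUBMITTED


def record_judge_verdict(sub: Submission, judge_sees_failure: bool) -> Submission:
    # A disagreement between judge and contributor flags the case for a human.
    if sub.status is not Status.SUBMITTED:
        raise ValueError("judge runs only on newly submitted cases")
    agreed = judge_sees_failure == sub.claims_failure
    sub.status = Status.JUDGE_AGREED if agreed else Status.FLAGGED
    return sub


def reviewer_accept(sub: Submission) -> Submission:
    # Assumption: reviewers see both judge-agreed and flagged cases.
    if sub.status not in (Status.JUDGE_AGREED, Status.FLAGGED):
        raise ValueError("reviewer acts only after the judge has run")
    sub.status = Status.REVIEWER_ACCEPTED
    return sub


def admin_approve(sub: Submission) -> Submission:
    # Final approval is only possible after a reviewer has accepted.
    if sub.status is not Status.REVIEWER_ACCEPTED:
        raise ValueError("admin approval requires reviewer acceptance")
    sub.status = Status.APPROVED
    return sub
```

For example, `admin_approve(reviewer_accept(record_judge_verdict(Submission(claims_failure=True), judge_sees_failure=True)))` walks one case through every step, while calling `admin_approve` on a fresh submission raises `ValueError`.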

Question 6

When do I get paid?

Accepted Answer

Verified failures that pass review and admin approval are paid out within 48–72 hours. You can track received and pending earnings on your payouts page at any time.

Find the prompts that break frontier AI.

Benchmarks are saturated. The real failures hide in the long tail.

From a broken prompt to lab-ready eval data.

Submit a failure

The judge verifies

Humans review

You get paid

What separates signal from noise.

Reproducible by design

Discriminating, not noisy

Judge-verified

Reviewer-grade tooling

Scored for the labs

Clean JSONL export

Every verified failure pays.

Questions, answered.

Ready to break some models?