AutismBench Leaderboard

How it works

The model receives a prompt with N constraints and must write a single sentence that satisfies all of them at once. Every constraint is checked by code. No LLM judge, no ambiguity, pure binary pass/fail.

Example: Level 5

The model gets this prompt:

Write a single English sentence that satisfies ALL of the following constraints simultaneously.

Constraints:
1. The sentence must contain exactly 8 words
2. The sentence must contain an animal name (e.g., cat, eagle, wolf)
3. Every word must start with a different letter
4. The sentence must contain exactly one number written as a digit
5. The last word must end with the suffix "-ing"

IMPORTANT:
- Output ONLY the sentence, nothing else.
- No explanations, no numbering, no quotes.

A valid answer:

Big cats devoured 5 exotic fish near swimming

The validator checks each constraint independently:

✔ word_count: 8 words (expected 8)
✔ contains_animal: found "cats"
✔ unique_first_letters: B, c, d, 5, e, f, n, s (all unique)
✔ contains_number: found 1 digit
✔ last_word_suffix: "swimming" ends with "-ing"

Result: 5/5 passed, PERFECT SOLVE (score: 5 points × 2 bonus = 10)
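A minimal sketch of how such a validator could look for this Level 5 task. The function names, the tokenization rule, and the tiny ANIMALS set are assumptions for illustration; the benchmark's actual word lists and implementation are not shown on this page.

```python
import re

# Hypothetical animal lexicon; the real benchmark presumably uses a larger list.
ANIMALS = {"cat", "cats", "eagle", "wolf", "fish"}

def tokenize(sentence):
    # Assumption: words are whitespace-separated, surrounding punctuation ignored.
    stripped = (w.strip(".,;:!?\"'") for w in sentence.split())
    return [w.lower() for w in stripped if w]

def check_level5(sentence):
    ws = tokenize(sentence)
    digits = re.findall(r"\d", sentence)
    # One independent boolean per constraint, keyed by constraint name.
    return {
        "word_count": len(ws) == 8,
        "contains_animal": any(w in ANIMALS for w in ws),
        "unique_first_letters": len({w[0] for w in ws}) == len(ws),
        "contains_number": len(digits) == 1,
        "last_word_suffix": bool(ws) and ws[-1].endswith("ing"),
    }

results = check_level5("Big cats devoured 5 exotic fish near swimming")
print(results)
```

Each check looks only at the sentence, never at the other checks, which is what makes the result a plain pass/fail per constraint.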

Example: Level 10 (where it gets hard)

Constraints:
1. The sentence must contain exactly 9 words
2. The total number of characters (letters only) must be exactly 42
3. The word at position 3 must start with the letter 'f'
4. The sentence must contain a color name
5. The sentence must contain a profession
6. The first word must have exactly 4 letters
7. Every word must start with a different letter
8. The total number of vowels must be exactly 14
9. The sentence must use at least 15 different letters
10. The sentence must not contain the letter "z" anywhere

Now the model must juggle exact word count, exact character count, exact vowel count, positional rules, lexical requirements, and a forbidden letter - all within one sentence. This is where most models break.
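To make the interaction concrete, here is a hedged sketch of all ten Level 10 checks in one function. The COLORS and PROFESSIONS sets are hypothetical stand-ins, and treating "position 3" as 1-indexed is an assumption.

```python
# Hypothetical lexicons for illustration only.
COLORS = {"red", "blue", "green", "black", "white"}
PROFESSIONS = {"doctor", "pilot", "chef", "nurse"}

def check_level10(sentence):
    stripped = (w.strip(".,;:!?\"'") for w in sentence.split())
    ws = [w.lower() for w in stripped if w]
    # "Letters only": digits, spaces, and punctuation are not counted.
    letters = [c for w in ws for c in w if c.isalpha()]
    return {
        "word_count": len(ws) == 9,
        "char_count": len(letters) == 42,
        # Assumption: positions are 1-indexed, so position 3 is ws[2].
        "word_at_pos_starts_with": len(ws) >= 3 and ws[2].startswith("f"),
        "contains_color": any(w in COLORS for w in ws),
        "contains_profession": any(w in PROFESSIONS for w in ws),
        "first_word_length": bool(ws) and len(ws[0]) == 4,
        "unique_first_letters": len({w[0] for w in ws}) == len(ws),
        "vowel_count": sum(c in "aeiou" for c in letters) == 14,
        "unique_letters_count": len(set(letters)) >= 15,
        "no_letter": "z" not in letters,
    }

# A short attempt that gets the lexical and positional rules right
# but misses the exact numeric targets.
print(check_level10("Some pilot flew a red kite"))
```

Swapping a single word changes the word count, letter count, vowel count, and distinct-letter count at once, which is why repairing one failed check tends to break another.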

Why it's hard

Each constraint alone is trivial. "Write 9 words" - easy. "Include a color" - easy. But satisfying 10+ constraints simultaneously in a single sentence is a combinatorial search problem. Fix one constraint, break another. It's like juggling - each ball is light, but keeping 10 in the air is a different task entirely.

At level 15+, even frontier models satisfy only about 40-50% of the individual constraints and almost never achieve a perfect solve.

Scoring

+1 point per satisfied constraint.

×2 bonus if ALL constraints pass (perfect solve). A level-10 perfect solve is worth 20 points, while passing 8 of 10 constraints is worth just 8.

Validity = total constraints passed / total constraints attempted across all tasks.

Perfect solve rate = % of tasks where every single constraint was satisfied.
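The scoring rules above can be sketched as follows. Each task is represented as a list of per-constraint booleans; the function names are hypothetical, and summing task scores into one total is an assumption about how the leaderboard aggregates.

```python
def task_score(results):
    # +1 per satisfied constraint, doubled if every constraint passed.
    passed = sum(results)
    return passed * 2 if passed == len(results) else passed

def summarize(tasks):
    # tasks: non-empty list of non-empty lists of booleans, one list per task.
    total_passed = sum(sum(t) for t in tasks)
    total_attempted = sum(len(t) for t in tasks)
    perfect = sum(all(t) for t in tasks)
    return {
        "score": sum(task_score(t) for t in tasks),
        "validity": total_passed / total_attempted,
        "perfect_solve_rate": perfect / len(tasks),
    }

perfect_level10 = [True] * 10
near_miss_level10 = [True] * 8 + [False] * 2
print(task_score(perfect_level10))    # 20
print(task_score(near_miss_level10))  # 8
print(summarize([perfect_level10, near_miss_level10]))
```

Validity pools constraints across tasks, so a long task with many failures weighs more than a short one; perfect solve rate counts each task once.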

Constraint categories

Structural - hard numeric constraints on sentence shape

word_count:         "The sentence must contain exactly 8 words"
char_count:         "Total characters (letters only) must be exactly 42"
word_length_at_pos: "The word at position 3 must have exactly 5 characters"
all_words_min_len:  "Every word must be at least 4 characters long"

Lexical - sentence must include specific types of words

contains_color:      "Must contain a color name (red, blue, green...)"
contains_animal:     "Must contain an animal name (cat, eagle, wolf...)"
contains_number:     "Must contain exactly one digit (0-9)"
contains_profession: "Must contain a profession (doctor, pilot, chef...)"

Positional - rules about specific word positions

first_word_length:       "The first word must have exactly 5 letters"
last_word_suffix:        "The last word must end with '-ing'"
word_at_pos_starts_with: "The word at position 4 must start with 's'"

Relational - constraints between words

unique_first_letters:  "Every word must start with a different letter"
ascending_word_length: "Each word must be longer than the previous one"
first_letters_spell:   "First letters of all words must spell 'COLD'"

Meta-structural - global numeric properties

vowel_count:          "Total vowels (a,e,i,o,u) must be exactly 18"
unique_letters_count: "Must use at least 16 different letters"
word_length_sum:      "Sum of all word lengths must equal 45"
no_letter:            "Must not contain the letter 'x' anywhere"
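One plausible way to organize these categories is a registry that maps each constraint name to a small parametrized check. The sketch below covers a few of the relational and global constraints; the CHECKS table, the (name, parameter) task format, and the tokenization rule are assumptions, not the benchmark's actual code.

```python
def tokenize(sentence):
    # Assumption: whitespace-separated words with surrounding punctuation stripped.
    stripped = (w.strip(".,;:!?\"'") for w in sentence.split())
    return [w for w in stripped if w]

# Each check takes the word list and one parameter (ignored where not needed).
CHECKS = {
    "ascending_word_length": lambda ws, _: all(
        len(a) < len(b) for a, b in zip(ws, ws[1:])
    ),
    "first_letters_spell": lambda ws, target: (
        "".join(w[0] for w in ws).upper() == target.upper()
    ),
    "word_length_sum": lambda ws, total: sum(len(w) for w in ws) == total,
    "all_words_min_len": lambda ws, n: all(len(w) >= n for w in ws),
    "no_letter": lambda ws, letter: all(
        letter.lower() not in w.lower() for w in ws
    ),
}

def validate(sentence, constraints):
    # constraints: list of (constraint_name, parameter) pairs.
    ws = tokenize(sentence)
    return {name: CHECKS[name](ws, param) for name, param in constraints}

task = [
    ("first_letters_spell", "COLD"),
    ("word_length_sum", 17),
    ("ascending_word_length", None),
    ("no_letter", "x"),
]
print(validate("Cats often love dogs", task))
```

With a table like this, adding a new constraint type means adding one entry, and a task is just a list of names and parameters.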

Difficulty scaling

Level 3-5	Warm-up. Most models pass.
Level 6-10	Non-reasoning models start failing.
Level 11-15	Only strong models survive. Perfect solves become rare.
Level 16-20	Frontier models only. Validity drops below 50%.
Level 21-30	Resembles a hard combinatorial constraint-satisfaction search. No model survives.

Why this benchmark matters

No LLM judge - every constraint is binary pass/fail via code. Zero judge bias.
No ceiling - add more constraints for harder tasks. Combinatorial explosion.
Contamination-resistant - random constraint combos with random params = infinite unique tasks.
Cheap - ~500 API calls per model, $2-5 total.
Tests what matters - planning, working memory, precision. The same skills needed for coding, legal docs, engineering specs.
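The contamination-resistance point can be illustrated with a small task generator: pick a random subset of constraint types, then draw random parameters for each. The TEMPLATES table and its parameter ranges are hypothetical, and this sketch does not check that the sampled constraints are jointly satisfiable, which a real generator would need to handle.

```python
import random

# Hypothetical constraint templates: name -> function drawing a random parameter.
TEMPLATES = {
    "word_count": lambda rng: rng.randint(6, 12),
    "char_count": lambda rng: rng.randint(30, 60),
    "vowel_count": lambda rng: rng.randint(10, 20),
    "unique_letters_count": lambda rng: rng.randint(12, 18),
    "first_word_length": lambda rng: rng.randint(3, 7),
    "no_letter": lambda rng: rng.choice("jqxz"),
    "contains_color": lambda rng: None,
    "contains_animal": lambda rng: None,
    "contains_profession": lambda rng: None,
    "unique_first_letters": lambda rng: None,
}

def generate_task(level, seed):
    # Assumption: a level-N task has N constraints (N <= len(TEMPLATES) here).
    rng = random.Random(seed)  # seeded so a task can be reproduced exactly
    names = rng.sample(sorted(TEMPLATES), k=level)
    return [(name, TEMPLATES[name](rng)) for name in names]

for name, param in generate_task(level=5, seed=42):
    print(name, param)
```

Because both the subset and the parameters are sampled, the number of distinct tasks grows quickly with the number of templates, so a fixed prompt is unlikely to appear verbatim in training data.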

GitHub · 𝕏 @chiefofautism · Created by Chief of Autism