Single-sentence constraint satisfaction benchmark for LLMs
| # | Model | Score | Validity | Perfect Solve | Levels |
|---|---|---|---|---|---|
The model receives a prompt with N constraints and must write a single sentence that satisfies all of them at once. Every constraint is checked by code. No LLM judge, no ambiguity, pure binary pass/fail.
The model gets this prompt:
Write a single English sentence that satisfies ALL of the following constraints simultaneously.

Constraints:
1. The sentence must contain exactly 8 words
2. The sentence must contain an animal name (e.g., cat, eagle, wolf)
3. Every word must start with a different letter
4. The sentence must contain exactly one number written as a digit
5. The last word must end with the suffix "-ing"

IMPORTANT:
- Output ONLY the sentence, nothing else.
- No explanations, no numbering, no quotes.
A valid answer:
Big cats devoured 5 exotic fish near swimming
The validator checks each constraint independently:
✔ word_count: 8 words (expected 8)
✔ contains_animal: found "cats"
✔ unique_first_letters: B, c, d, 5, e, f, n, s (all unique)
✔ contains_number: found 1 digit
✔ last_word_suffix: "swimming" ends with "-ing"

Result: 5/5 passed, PERFECT SOLVE (score: 10 = 5 ×2 bonus)
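The checks above can be approximated in a few lines of Python. This is a minimal sketch, not the benchmark's actual validator: the `ANIMALS` set, the `words_of` tokenizer and the `validate` function are hypothetical names, and the animal lexicon is a tiny stand-in for whatever word list the real checker uses. It also assumes "exactly one number" means exactly one digit character.

```python
import re

# Hypothetical stand-in lexicon; the benchmark's real animal list is not shown.
ANIMALS = {"cat", "cats", "eagle", "wolf", "fish", "dog"}


def words_of(sentence: str) -> list[str]:
    """Split on whitespace and strip surrounding punctuation from each token."""
    tokens = (tok.strip(".,;:!?\"'") for tok in sentence.split())
    return [tok for tok in tokens if tok]


def validate(sentence: str) -> dict[str, bool]:
    """Check the five level-5 constraints independently; True means passed."""
    words = words_of(sentence)
    lowered = [w.lower() for w in words]
    first_chars = [w[0] for w in lowered]
    digits = re.findall(r"\d", sentence)  # assumption: one digit character
    return {
        "word_count": len(words) == 8,
        "contains_animal": any(w in ANIMALS for w in lowered),
        "unique_first_letters": len(set(first_chars)) == len(first_chars),
        "contains_number": len(digits) == 1,
        "last_word_suffix": bool(words) and lowered[-1].endswith("ing"),
    }


results = validate("Big cats devoured 5 exotic fish near swimming")
passed = sum(results.values())
print(f"{passed}/{len(results)} passed")
```

Because each check returns its own boolean, a failed constraint never hides the result of another, which is what makes per-constraint scoring possible.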
With 10 constraints, the task looks like this:

Constraints:
1. The sentence must contain exactly 9 words
2. The total number of characters (letters only) must be exactly 42
3. The word at position 3 must start with the letter 'f'
4. The sentence must contain a color name
5. The sentence must contain a profession
6. The first word must have exactly 4 letters
7. Every word must start with a different letter
8. The total number of vowels must be exactly 14
9. The sentence must use at least 15 different letters
10. The sentence must not contain the letter "z" anywhere

Now the model must juggle exact word count, exact character count, exact vowel count, positional rules, lexical requirements, and a forbidden letter, all within one sentence. This is where most models break.
Each constraint alone is trivial. "Write 9 words": easy. "Include a color": easy. But satisfying 10+ constraints simultaneously in a single sentence is a combinatorial search problem, because the constraints share the same words: fix one constraint, break another. It's like juggling. Each ball is light, but keeping 10 in the air is a different task entirely.
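A small illustration of "fix one, break another", assuming a hypothetical `profile` helper that measures a few of the quantities the constraints refer to. Suppose the target is exactly 13 letters with unique first letters: swapping one word repairs the letter count but introduces a repeated initial.

```python
def profile(sentence: str) -> dict[str, object]:
    """Measure quantities that several constraints depend on at once."""
    words = sentence.split()
    letters = [ch for ch in sentence.lower() if ch.isalpha()]
    initials = [w[0].lower() for w in words]
    return {
        "words": len(words),
        "letters": len(letters),
        "vowels": sum(ch in "aeiou" for ch in letters),
        "unique_initials": len(set(initials)) == len(initials),
    }


before = profile("The red fox jumps")  # 14 letters, initials t/r/f/j
after = profile("The red fox runs")    # 13 letters, but "red" and "runs" share r

print(before)
print(after)
```

Every word contributes to the word count, the letter count, the vowel count and the set of initials at the same time, so a local edit changes several measurements together.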
At level 15+, even frontier models pass only ~40-50% of constraints individually and almost never solve everything perfectly.
+1 point per satisfied constraint.
×2 bonus if ALL constraints pass (perfect solve). A level-10 perfect solve is therefore worth 20 points, while passing 8 of 10 is worth only 8 points.
Validity = total constraints passed / total constraints attempted across all tasks.
Perfect solve rate = % of tasks where every single constraint was satisfied.
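The scoring rules translate directly into code. The sketch below assumes each task result is recorded as a `(passed, total)` pair; `task_score` and `summarize` are illustrative names, not part of the benchmark's published code.

```python
def task_score(passed: int, total: int) -> int:
    """+1 per satisfied constraint, doubled when every constraint passes."""
    return passed * 2 if passed == total else passed


def summarize(tasks: list[tuple[int, int]]) -> dict[str, float]:
    """Aggregate (passed, total) pairs; expects at least one task."""
    total_passed = sum(p for p, _ in tasks)
    total_attempted = sum(t for _, t in tasks)
    perfect = sum(1 for p, t in tasks if p == t)
    return {
        "score": sum(task_score(p, t) for p, t in tasks),
        "validity": total_passed / total_attempted,
        "perfect_solve_rate": perfect / len(tasks),
    }


# Hypothetical run: a perfect 5/5, a perfect 10/10, and an 8/10 near miss.
run = [(5, 5), (10, 10), (8, 10)]
summary = summarize(run)
print(summary)
```

For this made-up run the score is 10 + 20 + 8 = 38, validity is 23/25 = 0.92, and two of three tasks are perfect solves. Note how the near miss earns less than half of what the perfect level-10 solve does, even though it satisfies 80% of the constraints.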
- word_count: "The sentence must contain exactly 8 words"
- char_count: "Total characters (letters only) must be exactly 42"
- word_length_at_pos: "The word at position 3 must have exactly 5 characters"
- all_words_min_len: "Every word must be at least 4 characters long"

- contains_color: "Must contain a color name (red, blue, green...)"
- contains_animal: "Must contain an animal name (cat, eagle, wolf...)"
- contains_number: "Must contain exactly one digit (0-9)"
- contains_profession: "Must contain a profession (doctor, pilot, chef...)"

- first_word_length: "The first word must have exactly 5 letters"
- last_word_suffix: "The last word must end with '-ing'"
- word_at_pos_starts_with: "The word at position 4 must start with 's'"

- unique_first_letters: "Every word must start with a different letter"
- ascending_word_length: "Each word must be longer than the previous"
- first_letters_spell: "First letters of all words must spell 'COLD'"

- vowel_count: "Total vowels (a,e,i,o,u) must be exactly 18"
- unique_letters_count: "Must use at least 16 different letters"
- word_length_sum: "Sum of all word lengths must equal 45"
- no_letter: "Must not contain the letter 'x' anywhere"
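Several of the counting and structural constraints reduce to one-line predicates. The functions below reuse the constraint names from the list but are otherwise a sketch under assumptions: the real checker's tokenization and punctuation handling are not documented here, so this version counts only alphabetic characters as letters and splits words on whitespace.

```python
VOWELS = set("aeiou")


def letters(sentence: str) -> list[str]:
    """Lowercased alphabetic characters; spaces, digits, punctuation dropped."""
    return [ch.lower() for ch in sentence if ch.isalpha()]


def char_count(sentence: str, expected: int) -> bool:
    return len(letters(sentence)) == expected


def vowel_count(sentence: str, expected: int) -> bool:
    return sum(ch in VOWELS for ch in letters(sentence)) == expected


def unique_letters_count(sentence: str, minimum: int) -> bool:
    return len(set(letters(sentence))) >= minimum


def no_letter(sentence: str, forbidden: str) -> bool:
    return forbidden.lower() not in letters(sentence)


def ascending_word_length(sentence: str) -> bool:
    """Each word strictly longer than the one before it."""
    lengths = [len(w) for w in sentence.split()]
    return all(a < b for a, b in zip(lengths, lengths[1:]))


def first_letters_spell(sentence: str, target: str) -> bool:
    initials = "".join(w[0] for w in sentence.split())
    return initials.upper() == target.upper()


sample = "Cold oceans look dark"
print(first_letters_spell(sample, "COLD"), char_count(sample, 18))
```

Keeping every check as a pure function of the sentence and its parameter makes it straightforward to generate a task by sampling constraint types and then to validate an answer by running each predicate in turn.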
| Levels | What to expect |
|---|---|
| Level 3-5 | Warm-up. Most models pass. |
| Level 6-10 | Non-reasoning models start failing. |
| Level 11-15 | Only strong models survive. Perfect solves become rare. |
| Level 16-20 | Frontier models only. Validity drops below 50%. |
| Level 21-30 | Approaching NP-hard constraint satisfaction. No model survives. |
GitHub · 𝕏 @chiefofautism · Created by Chief of Autism