Basketball Court Eval Design: Mapping LLM Application Test Coverage Visually to Find Failure Regions
Problem / Context
An LLM application's eval suite only covered happy-path queries from demos. After deployment, users submitted queries outside the tested distribution and the failure rate was unknown. Evals were either too narrow (only testing what already worked) or too broad (testing inputs users would never send), wasting engineering effort on the wrong failure regions.
Solution
Model your eval dataset as a basketball court where each data point is plotted by difficulty (distance from basket) and domain relevance (in/out of bounds). Easy common queries go near the basket; rare hard queries go far from the basket; out-of-domain queries are out of bounds and should not be in the dataset. Build the dataset incrementally: collect thumbs-up/thumbs-down signals from users, sample 100 random traces from observability logs weekly and read them manually, monitor community forums and X for reported failures. When you have the court mapped, structure evals with constants in data and variables in tasks: keep the test inputs fixed in the dataset, vary the system prompt, RAG strategy, or model in the task definition so you can A/B test prompt changes without rebuilding the dataset. For scoring: prefer deterministic pass/fail over 0-1 floats. Ask 'what am I looking for to know this failed?' and write code that checks exactly that. Use output-shaping tricks in test prompts (e.g., 'output your final answer in tags') to make string matching trivial even if you would not include that in prod prompts. Add evals to CI via Braintrust so every PR gets a report showing tiles flipped from red to blue and any regressions. Share the AISDK middleware abstraction between the production API route and the eval task so the same pre-processing code runs in both, making practice identical to the game.
Result
Coverage grid revealed a failure quadrant (high-difficulty, in-domain) concentrating 40% of user issues. CI eval via Braintrust caught 2 prompt regressions pre-deploy. Team prioritized that quadrant next sprint, cutting time-to-fix for top user complaints by half.