Car Wash Test on 53 AI Models: Consistency and Context in Simple Reasoning
Summary
The article presents a benchmark in which 53 AI models are tested on a simple car-wash reasoning task. Most models answer "walk" rather than "drive", and only a small subset answer correctly on every one of multiple runs. The article argues that context engineering matters for production reliability. It also describes the methodology, reports human baseline results, and notes the availability of the data.
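
The idea of being "consistently correct across multiple runs" can be made concrete with a small scoring sketch. This is a minimal illustration, not the article's actual harness: the function name summarize_runs, the model names, and the sample answers are hypothetical placeholders, and treating "drive" as the correct answer is an assumption inferred from the summary above.

```python
from collections import Counter

# Assumption: "drive" is the correct answer, inferred from the summary.
CORRECT = "drive"


def summarize_runs(answers, correct=CORRECT):
    """Return the pass rate and whether every run gave the correct answer."""
    if not answers:
        raise ValueError("need at least one run")
    # Normalize answers so "Drive " and "drive" count as the same response.
    counts = Counter(a.strip().lower() for a in answers)
    passes = counts[correct]
    return {
        "runs": len(answers),
        "pass_rate": passes / len(answers),
        "consistent": passes == len(answers),
    }


# Hypothetical placeholder data, NOT results from the article.
runs = {
    "model_a": ["drive", "drive", "drive", "drive"],
    "model_b": ["walk", "drive", "walk", "walk"],
}

report = {name: summarize_runs(answers) for name, answers in runs.items()}
for name, stats in report.items():
    print(name, stats)
```

Separating pass_rate from consistent reflects the distinction the summary draws: a model that answers correctly on some runs still fails a consistency requirement if any single run is wrong.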