The famous o3 "GeoGuessr" prompt did not work
Summary
This post analyzes the geolocation abilities of OpenAI's o3 using a GeoGuessr-style prompt. It compares a default prompt against a specialized prompt across 200 images and shows that the basic prompt often performs as well as or better than the specialized one. The result underlines why benchmarks are needed to separate hype from actual capability. The post also discusses how prompt engineering can mislead and why rigorous evaluation matters.
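To make the comparison concrete, here is a minimal sketch of how two prompts could be scored on the same set of images. It assumes each guess and each ground-truth location is a (latitude, longitude) pair in degrees and that error is measured as great-circle distance. The post's summary does not state which metric was used, so the metric and the names `haversine_km` and `compare_prompts` are illustrative assumptions, not the author's method.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against floating-point values marginally above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def compare_prompts(truth, basic_guesses, special_guesses):
    """Paired comparison of two prompts over the same images.

    Each argument is a list of (lat, lon) tuples, aligned by image.
    Returns median error per prompt and per-image win counts.
    """
    basic_err = [haversine_km(*t, *g) for t, g in zip(truth, basic_guesses)]
    special_err = [haversine_km(*t, *g) for t, g in zip(truth, special_guesses)]
    basic_wins = sum(b < s for b, s in zip(basic_err, special_err))
    special_wins = sum(s < b for b, s in zip(basic_err, special_err))
    return {
        "basic_median_km": median(basic_err),
        "special_median_km": median(special_err),
        "basic_wins": basic_wins,
        "special_wins": special_wins,
        "ties": len(basic_err) - basic_wins - special_wins,
    }
```

Scoring both prompts on identical images keeps the comparison paired, so a difference in median error or win count reflects the prompt rather than the difficulty of the images each prompt happened to see.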