GLM-OCR: Accurate × Fast × Comprehensive
Summary
GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction loss and stable full-task reinforcement learning, integrates a pre-trained CogViT visual encoder, and uses a two-stage PP-DocLayout-V3-based pipeline to achieve robust, high-accuracy OCR with efficient inferences. The project is open source and provides SDKs and multiple deployment options (cloud MaaS or self-hosted via vLLM, SGLang, or Ollama).