Cohere's new Command A Vision model is designed to handle images, diagrams, PDFs, and other types of visual data. Cohere says the model outperforms GPT-4.1, Llama 4 Maverick, Pixtral Large, and Mistral Medium 3 on standard vision benchmarks.
The model's OCR can recognize both the text and the structure of documents such as invoices and forms, outputting the extracted data in structured JSON. Command A Vision can also process real-world images, like identifying potential risks in industrial environments, the company says.

Command A Vision is available through the Cohere platform and for research on Hugging Face. The model can run locally with either two A100 GPUs or a single H100 using 4-bit quantization.