“Best AI model” depends on the job. The best language model tells you nothing about the best speech-to-text, vision, or embedding model. EyesInAI tracks the leaders at each layer — measured live where we benchmark, curated where we don’t yet.
Speech models are scored on word error rate (WER) and diarization error (DER/tcpWER), not on reasoning prompts. That is a different yardstick, so they live here as curated picks rather than on the text leaderboard.
Unified speech-to-text that transcribes up to 60 minutes of audio in a single pass, jointly producing who said what and when: transcription, speaker diarization, and timestamps in one model instead of stitched-together steps. 9B parameters, 64K-token window, 50+ languages with native code-switching, plus custom hotwords for domain vocabulary. MIT-licensed and available in Hugging Face Transformers.
The open-weight ASR baseline. Excellent raw transcription accuracy, but transcription only — speaker labels and word-level timestamps need separate tooling (pyannote, WhisperX). The point of comparison for any new long-form or diarizing model.
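For readers new to the speech yardstick: WER is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration, not the scoring code behind these picks. It assumes whitespace tokenization and no text normalization (casing, punctuation), both of which real evaluations usually handle first; the function name `wer` is our own.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()   # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference. Diarization metrics such as DER are time-based rather than word-based and need speaker-labeled segments, so they are not covered by this sketch.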
Image understanding, OCR, and document parsing — scored on task accuracy, not chat.
Text/vector embeddings ranked on retrieval quality (MTEB-style) and cost per million tokens.
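To make the two embedding criteria concrete, here is a hedged sketch. MTEB's retrieval tasks report nDCG@10 as their main score, so that is the metric assumed here; the exact metric behind this ranking is not stated above. The function names and the price in the usage example are hypothetical placeholders, not real figures.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k results (rank 1 is undiscounted)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k for one query.

    ranked_relevances: relevance labels in the order the model ranked the documents.
    Assumption: the list covers every judged document for the query, so sorting it
    in descending order gives the ideal ranking.
    """
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0


def embedding_cost(num_tokens, price_per_million):
    """Cost of embedding num_tokens at a given price per million tokens."""
    return num_tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    # The only relevant document was ranked second.
    print(ndcg_at_k([0, 1, 0]))
    # Hypothetical price: 0.10 per million tokens, 2.5M tokens to embed.
    print(embedding_cost(2_500_000, 0.10))
```

A benchmark score would average `ndcg_at_k` over all queries in a dataset; the per-query version is shown to keep the arithmetic visible.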