Voxtral Transcribe 2
- simonw - 23977 seconds ago
This demo is really impressive: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...
Don't be confused if it says "no microphone", the moment you click the record button it will request browser permission and then start working.
I spoke fast and dropped in some jargon and it got it all right - I said this and it transcribed it exactly right, WebAssembly spelling included:
> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?
- mnbbrown - 3420 seconds ago
Incredible! Competitive with (if not better than) Deepgram Nova-3, and much better than AssemblyAI and ElevenLabs in basically all cases on our internal streaming benchmark.
The dataset is ~100 8kHz call recordings with gnarly UK accents (which I consider to be the final boss of English-language ASR). It seems like it's SOTA.
Where it does fall down seems to be the latency distribution, but I'm testing against the API. Running it locally will no doubt improve that?
- iagooar - 16078 seconds ago
In English it is pretty good. But talk to it in Polish, and suddenly it thinks you speak Russian? Ukrainian? Belarusian? I would understand if an American company launched this, but for a company so proud of its European roots, I think it should have better support for major European languages.
I tried English + Polish:
> All right, I'm not really sure if transcribing this makes a lot of sense. Maybe not. A цьому nie mówisz po polsku. A цьому nie mówisz po polsku, nie po ukrańsku.
- dmix - 24818 seconds ago
> At approximately 4% word error rate on FLEURS and $0.003/min
Amazon's transcription service is $0.024 per minute, a pretty big difference: https://aws.amazon.com/transcribe/pricing/
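Those per-minute rates compound quickly at scale. A quick back-of-the-envelope comparison, using the two rates quoted above (the monthly volumes are hypothetical, just to show the spread):

```python
# Compare per-minute transcription pricing at various monthly volumes.
# Rates from the thread: Voxtral $0.003/min, Amazon Transcribe $0.024/min.
VOXTRAL_PER_MIN = 0.003
AWS_PER_MIN = 0.024

def monthly_cost(rate_per_min: float, hours: float) -> float:
    """Cost in dollars for `hours` of audio at `rate_per_min`."""
    return rate_per_min * hours * 60

for hours in (10, 1_000, 100_000):  # hypothetical monthly volumes
    vox = monthly_cost(VOXTRAL_PER_MIN, hours)
    aws = monthly_cost(AWS_PER_MIN, hours)
    print(f"{hours:>7} h/mo: Voxtral ${vox:,.2f} vs AWS ${aws:,.2f} ({aws / vox:.0f}x)")
```

At these list prices the ratio is a constant 8x regardless of volume.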
- maxdo - 588 seconds ago
https://www.tavus.io/post/sparrow-1-human-level-conversation...
How does it compare to Sparrow-1?
- pietz - 22418 seconds ago
Do we know if this is better than Nvidia Parakeet V3? That has been my go-to model locally, and it's hard to imagine there's something even better.
- janalsncm - 19155 seconds ago
I noticed that this model is multilingual and understands 14 languages. For many use cases, we probably only need a single language, and the extra 13 are simply adding latency. I believe there will be a trend in the coming years of trimming the fat off these jack-of-all-trades models.
- sbinnee - 1250 seconds ago
3 hours for a single request sounds nice to me. Although the graph suggests it's not going to perform as well as the OpenAI model I have been using, it is open source, and I will surely give it a try.
- observationist - 25635 seconds ago
Native diarization, this looks exciting. Edit: or not, no diarization in real time.
https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...
~9GB model.
- yko - 13808 seconds ago
Played with the demo a bit. It's really good at English, and detects language change on the fly. Impressive.
But whatever I tried, it could not recognise my Ukrainian and would default to Russian with absolutely ridiculous transcriptions. Other STT models recognise Ukrainian consistently, so I assume there is a lot of Russian in the training material, and zero Ukrainian. Made me really sad.
- jiehong - 15251 seconds ago
It's nice, but the previous version wasn't actually that great compared to Parakeet, for example.
We need better independent comparisons to see how it performs against the latest Qwen3-ASR, and so on.
I can no longer take at face value the cherry-picked comparisons from companies showing off their new models.
For now, NVIDIA Parakeet v3 is the best for my use case, and runs very fast on my laptop or my phone.
- fph - 15015 seconds ago
Is there an open-source Android keyboard that would support it? Everything I find is based on Whisper, which is from 2022. Ages ago, given how fast AI is evolving.
- mdrzn - 25042 seconds ago
There's no comparison to Whisper Large v3 or other Whisper models.
Is it better? Worse? Why do they only compare to gpt-4o-mini-transcribe?
- gwerbret - 16899 seconds ago
I really wish those offering speech-to-text models provided transcription benchmarks specific to particular fields. I imagine performance would vary wildly when using jargon peculiar to software development, medicine, physics, or law, as compared to everyday speech. Considering that "enterprise" use is often specialized or sub-specialized, it seems like they're leaving money on Dragon's table by not catering to any of those needs.
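For anyone wanting to run such a domain-specific benchmark themselves, word error rate is just word-level edit distance divided by reference length. A minimal sketch (the reference/hypothesis pair is a made-up medical example; real benchmarks also normalize casing and punctuation first, which this skips):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# Hypothetical domain-specific pair: one substitution out of four words.
print(wer("patient denies chest pain", "patient denies chess pain"))  # 0.25
```

Running this per domain (dev jargon, medical dictation, legal argument) against a few hundred ground-truth transcripts would give exactly the field-specific numbers the vendors don't publish.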
- satvikpendem - 22906 seconds ago
Looks like this model doesn't do realtime diarization; what model should I use if I want that? So far I've only seen paid models do diarization well. I've heard about Nvidia NeMo but haven't tried it, or even found where to try it out.
- XCSme - 19053 seconds ago
Is it just me, or is an error rate of 3% really high?
If you transcribe a minute of conversation, you'll have around 5 words transcribed wrongly. In an hour-long podcast, that's 300 wrongly transcribed words.
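The arithmetic holds up if you assume a typical conversational rate of about 150 words per minute (the speaking rate is an assumption, not a figure from the thread):

```python
SPEAKING_RATE_WPM = 150  # assumed typical conversational speaking rate
WER = 0.03               # the 3% word error rate discussed above

errors_per_minute = SPEAKING_RATE_WPM * WER
errors_per_hour = errors_per_minute * 60
print(errors_per_minute)  # 4.5
print(errors_per_hour)    # 270.0
```

So roughly 4-5 errors per minute and ~270 per podcast hour at that rate, in line with the comment's estimate.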
- serf - 25630 seconds ago
Things I hate:
"Click me to try now!" banners that lead to a warning screen that says "Oh, only paying members, whoops!"
So, you don't mean 'try this out', you mean 'buy this product'.
Let's not act like it's a free sampler.
I can't comment on the model: I'm not giving them money.
- aavci - 22814 seconds ago
What are the cheapest device specs this could realistically run on?
- antirez - 24351 seconds ago
Italian represents, I believe, the most phonetically advanced human language. It strikes the right compromise among information density, understandability, and the ability to speak much faster to compensate for the redundancy. It's as if it had error correction built in. Note that it's not just that it has the lowest error rate; it is also underrepresented in most datasets.
- sgt - 5393 seconds ago
What's the best way to train this further on a specific dialect, accent, or even terminology?
- ccleve - 9867 seconds ago
This looks great, but it's not clear to me how to use it for a practical task. I need to transcribe about 10 years' worth of monthly meetings. These are government hearings with a variety of speakers. All the videos are on YouTube. What's the most practical and cost-effective way to get reasonably accurate transcripts?
- Archelaos - 24506 seconds ago
As a rule of thumb for software that I use regularly, I find it very useful to consider the costs over a 10-year period, in order to compare it with software that I purchase once for lifetime use at home. That works out to $1,798.80 for the Pro version.
What estimates do others use?
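The quoted 10-year total implies a monthly subscription price, which is not stated in the comment but can be backed out (the $14.99/month figure below is inferred from the total, not confirmed):

```python
# Back out the implied monthly price from the quoted 10-year total.
TEN_YEAR_TOTAL = 1798.80  # Pro-tier total quoted in the comment
months = 10 * 12

monthly = TEN_YEAR_TOTAL / months
print(f"${monthly:.2f}/month")  # $14.99/month
```

The same one-liner works in reverse for comparing any subscription against a one-time license: multiply the monthly price by 120 and see which side comes out cheaper.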
- siddbudd - 18736 seconds ago
Wired advertises this as "Ultra-Fast Translation"[^1]. A bit weird coming from a tech magazine. I hope it's just a "typo".
[^1]: https://www.wired.com/story/mistral-voxtral-real-time-ai-tra...
- yewenjie - 18350 seconds ago
One week ago I was on the hunt for an open-source model that can do diarization, and I literally had to give up because I could not find any easy-to-use setup.
- jszymborski - 17195 seconds ago
I'm guessing I won't be able to finetune this until they come out with a HF Transformers model, right?
- antirez - 2078 seconds ago
Disappointing how this lacks a clear reference implementation, short of mixing in almost-unreleased vLLM (nightly) stuff. I'm OK with open weights being a form of OSS in the case of models, because frankly I don't believe that, for large LLMs, it is feasible to release the training data, all the orchestration stuff, and so forth. But it can't just be: here are the weights, we partnered with vLLM for inference. Come on. Open weights must mean that you put me in a position to write an implementation easily for any hardware.
P.S. even the demo uses a remote server via WebSocket.
- blobinabottle - 14009 seconds ago
Impressive results; tested on crappy audio files (in French and English).
- numbers - 13213 seconds ago
Does anyone know of any desktop tools I can use this transcription model with? E.g. something like Wispr Flow/WillowVoice but with custom model selection.
- tallesborges92 - 11590 seconds ago
I added it to my bot agent, let's see how it performs.
- atentaten - 9977 seconds ago
Nice. Can this be run on a mobile device?
- derac - 17903 seconds ago
Any chance Voxtral Mini Transcribe 2 will ever be an open model?
- ewuhic - 18516 seconds ago
Can it translate in real time?
- scotty79 - 9437 seconds ago
Do you know anything better than Whisper large-v3 through WhisperX for Polish-language, low-quality audio?
This combo has almost unbeatable accuracy and it rejects noises in the background really well. It can even reject people talking in the background.
The only better thing I've seen is Ursa model from Speechmatics. Not open weights unfortunately.
- dumpstate - 19759 seconds ago
I'm on voxtral-mini-latest, and that's why I started seeing 500s today, lol.
- boringg - 22076 seconds ago
Pseudo-related: am I the only one uncomfortable using my voice with AI, out of concern that once it is in the training data it is forever reproducible? As a non-public person, it seems like a risk vector (albeit a small one).