Posted by freediver 6 hours ago
I was aiming for an agent-like experience, and found the accuracy drop below what I'd consider useful levels even above the 1GB mark.
Perhaps for shorter, few-word sentences like "lights on"?
I would agree that the "tiny" model has a clear drop off in accuracy, not good enough for anything real (even if transcribing your own speech, the error rate means too much editing needed). In my experience, accuracy can be more of a problem on shorter sentences because there is less context to help it.
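To put a number on "error rate": the usual metric is word error rate (WER), the word-level edit distance between the reference text and the transcript, divided by the number of reference words. Below is a minimal sketch in plain Python. The function name and the lowercase-and-split normalisation are my own assumptions, not anything taken from Whisper or Moonshine tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper only: real evaluations normally also strip
    punctuation and normalise numbers before comparing.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("lights on", "light on"))
    print(word_error_rate("turn the kitchen lights on", "turn the kitchen light on"))
```

One wrong word in a two-word command already counts as half the utterance, while the same slip in a five-word sentence is a fifth of it, so short commands are unforgiving under this metric as well.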
I think for serious use (on GPU) it would be the "medium" or "large" models only. There is now a "large-turbo" model which is apparently faster than "medium" (on GPU) but more accurate than "medium" - haven't tried it yet.
On CPU for personal use (faster-whisper, CPU) I have found "base" is usable, "small" is good. On a laptop CPU though "small" is slow for real time. "Medium" is more accurate, though mostly just on punctuation, far too slow for CPU. Of course all models will get some uncommon surnames, place names wrong.
Since OpenAI have re-released the "large" models twice and now done a "large-turbo" I hope that they will re-release the smaller models too so that the smallest models become more useful.
These moonshine models are compared to original OpenAI whisper, but really I'd say they need to compare to faster-whisper: multiple projects are faster than original OpenAI whisper.
Was really hoping it would be a quick, brilliant solution to something I'm working on now, perhaps I'll dig in and invest in it, but I'm not sure I have the luxury right now to do the exploratory work... Hope someone else has better luck than I!
I would recommend being more specific, then. Did you have trouble installing it? Did it give you an error? Was there no output? Was the output wrong? Is it not working on your files, but working on the example files? Is it solving a different problem than the one you have?
UserWarning: You are using a softmax over axis 3 of a tensor of shape (1, 8, 1, 1). This axis has size 1. The softmax operation will always return the value 1, which is likely not what you intended. Did you mean to use a sigmoid instead? warnings.warn(
I know this isn't the right place for this (the right place is an issue on GitHub), but because you asked, I posted...
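For anyone puzzled by the warning quoted above: softmax normalises the values along an axis so they sum to 1, so along an axis of size 1 the single value is always 1, whatever the input was. Here is a minimal plain-Python sketch of that behaviour. It is not the code from the library that emitted the warning, and the function names are just illustrative.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic function, which maps a single score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


if __name__ == "__main__":
    # With a single element the input value is irrelevant: exp(0) / exp(0).
    print(softmax([3.7]), softmax([-100.0]))
    # A sigmoid still depends on the input, hence the hint in the warning.
    print(sigmoid(0.0), sigmoid(3.7))
```

Whether the warning matters in this case depends on why the model ends up with a size-1 axis at that point, which the message alone doesn't say.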
The section "3.2. Training data collection & preprocessing" covers what you're inquiring about: "We train Moonshine on a combination of 90K hours from open ASR datasets and over 100K hours from own internally-prepared dataset, totalling around 200K hours. From open datasets, we use Common Voice 16.1 (Ardila et al., 2020), the AMI corpus (Carletta et al., 2005), GigaSpeech (Chen et al., 2021), LibriSpeech (Panayotov et al., 2015), the English subset of multilingual LibriSpeech (Pratap et al., 2020), and People’s Speech (Galvez et al., 2021). We then augment this training corpus with data that we collect from openly-available sources on the web. We discuss preparation methods for our self-collected data in the following."
It does continue...
I think “Artemis” or “Luna” would work better.
You can play this game with every possible name.