4TB of voice samples just stolen from 40k AI contractors at Mercor

Posted by Oravys 1 day ago

4TB of voice samples just stolen from 40k AI contractors at Mercor (app.oravys.com)

555 points | 211 comments | page 2

embedding-shape 23 hours ago|

I wonder how many of the current text-to-speech ML models have large parts of leaked or "stolen" data in their training data? Almost none of the TTS releases seem to talk about exactly where they get their training data from, for some reason. I also wonder if we'll see an explosion in SOTA TTS in ~6 months from now.

nmacias 17 hours ago||

GOOG-411 was "competing" with a strong company (1-800-FREE411) by serving no ads in a category worth ~$3.5B at the time. It was inexplicable at the time, but they did this to get voice samples, way back when. For reasons like that, I expect that this category of training is baked — but I don't have current domain knowledge fwiw.

hirako2000 23 hours ago|||

It's already there. And keeps moving.

Even have a nice UI on top.

https://voicebox.sh/

jubilanti 23 hours ago|||

Not really, Mozilla Common Voice (the ImageNet of speech) is larger than this. Their English database has 3814 hours, 1.6 million sentences, from 100k speakers.

https://commonvoice.mozilla.org/en/languages
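
For a rough sense of scale, here is a back-of-the-envelope sketch that converts 4 TB into hours of audio. It assumes the whole dump is uncompressed PCM audio, which the thread does not establish (it may well include ID scans or compressed files), and the two encodings are illustrative guesses, not known properties of the leak.

```python
def audio_hours(total_bytes, sample_rate_hz, bytes_per_sample, channels):
    """Hours of uncompressed PCM audio that fit in total_bytes."""
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return total_bytes / bytes_per_second / 3600


LEAK_BYTES = 4 * 10**12  # 4 TB, decimal terabytes (assumption)

# Assumed encoding 1: 16 kHz, 16-bit, mono (common for speech corpora)
print(round(audio_hours(LEAK_BYTES, 16_000, 2, 1)))  # 34722

# Assumed encoding 2: 48 kHz, 24-bit, stereo (studio-style capture)
print(round(audio_hours(LEAK_BYTES, 48_000, 3, 2)))  # 3858
```

The result swings by roughly an order of magnitude depending on the assumed encoding, so a comparison in hours against the 3814-hour figure only holds once the actual format of the files is known.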

interludead 17 hours ago||

Yep, the silence around provenance is probably the most suspicious part

yesman_x 19 hours ago||

If this is real, the bigger issue might not even be the leak itself. It could be that we are quietly moving into a world where voice plus ID is enough to fully impersonate someone, and most systems are still not built for that reality.

deferredgrant 15 hours ago||

There is also an ugly labor story here. The people labeling and training these systems are often the least protected when the data pipeline itself turns into the attack surface.

john_strinlai 23 hours ago||

>Set up a verbal codeword with family and finance contacts. Pick a phrase that has never been spoken on a recording and never typed in chat. Brief the people who handle money on your behalf. If a call ever asks for a transfer, the codeword is mandatory.

good luck with this. most finance people deal with hundreds to thousands of clients. they obviously can't remember everyone's code word. commonly used finance systems aren't set up to securely store these codewords. they don't have processes or policies in place to implement or adhere to any sort of codeword verification.

>Rotate where voiceprints are still in use. [...] Do that now, ideally from a new recording in a different acoustic environment than the leaked sample.

would this even have an effect? i have never heard of "rotating" a voice print. isn't the whole point of a voice print that you can't really change it? if simply switching your environment completely changes your voice print, that would make voice prints utterly useless to begin with.

tenpointwo 22 hours ago||

With most US banks, you can ask them to put a note on your account file for a code word; it will show up anytime the account file is pulled up. Now, whether or not a customer service agent will know to check it is another question. Maybe as attack vectors like this are utilized more often it will become part of their SOP. Or they could just stop using voice verification. In my experience, even if you pass voice verification, it only grants access to the account to check the balance and transactions; anything more still requires information like a PIN or a code sent to the app or phone number. There are attack vectors for these as well, but they are not guaranteed to work.

The other use cases (like calling payroll, etc.) likely don't have the same protections, so an attack there would probably be more effective.

wongarsu 23 hours ago|||

Someone who has hundreds or thousands of clients presumably couldn't remember every client's voice either, so no meaningful security is lost. They are approximately as secure or insecure as before

john_strinlai 23 hours ago||

>presumably couldn't remember every client's voice either, so no meaningful security is lost

there are automated systems for this already. my bank, isp, etc. use them when you call in to skip the traditional verification steps. this fact is also highlighted in the article.

the problem is that there isn't typically a system in place for setting up or validating code words, so the advice given is not practical to implement.

iterateoften 23 hours ago||

Yeah, seems like nonsense advice. Have a code word that was never recorded? I don't see how that would protect anything. Like, the point of these systems is that they can convincingly say stuff you never said.

MarsIronPI 22 hours ago||

The idea is that the attacker doesn't know the codeword. If the attacker finds out about the codeword then the attacker could indeed fake it. Hence why you shouldn't say/write it in recordings or chat messages.
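
On the "systems aren't set up to securely store these codewords" point upthread, a minimal sketch of what storage and checking could look like, assuming the codeword is treated like a password: only a salted hash is kept, and comparison is constant-time. The function names, the normalization rule, and the scrypt parameters are all hypothetical choices for illustration, not how any bank or finance system actually does it.

```python
import hashlib
import hmac
import os
from typing import Tuple

# Hypothetical work-factor settings; a real deployment would tune these.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1, "dklen": 32}


def normalize(codeword: str) -> bytes:
    """Lowercase and collapse whitespace so spoken variants compare equal."""
    return " ".join(codeword.lower().split()).encode("utf-8")


def enroll(codeword: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest) to store on the account instead of the codeword."""
    salt = os.urandom(16)
    digest = hashlib.scrypt(normalize(codeword), salt=salt, **SCRYPT_PARAMS)
    return salt, digest


def verify(attempt: str, salt: bytes, digest: bytes) -> bool:
    """Re-derive the hash from the attempt and compare in constant time."""
    candidate = hashlib.scrypt(normalize(attempt), salt=salt, **SCRYPT_PARAMS)
    return hmac.compare_digest(candidate, digest)


salt, digest = enroll("Purple  Heron at Noon")
print(verify("purple heron at noon", salt, digest))  # True
print(verify("purple heron at dusk", salt, digest))  # False
```

This only covers storage. It does nothing about the process problem raised above: an agent still has to know to ask for the codeword before moving money.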

eolgun 22 hours ago||

The biometric pairing is what makes this particularly bad. A leaked password is recoverable. A leaked voiceprint combined with ID scans is permanent: you cannot rotate your voice.

The deeper problem is that most of these companies collected this data because they could, not because they needed it for the core service. 'Datensparsamkeit' is the right frame: the voice samples were a liability sitting on a server waiting for exactly this.

tracker1 21 hours ago||

I'm pretty sure Google and Apple already have some decent examples of a LOT of people's voices in concert with other data collation. Google Voice IIRC was bought for audio sampling voicemail in the first place. Not sure if Apple has done similar, but would be more surprised if they didn't... Let alone the voice search options for both.

flockonus 18 hours ago||

> How to check if your voice is being misused

I love that the answer here is basically.. - you don't -

But maybe mitigate at unreasonable personal costs.

How about services simply stop taking public information as proof of identity?

amarcheschi 23 hours ago||

I've been doing similar things on a different platform because as a uni student the pay is kinda nice, but I limit myself to tasks without voice/video, using only mouse/keyboard input for reinforcement learning/data tagging. No way I'm trusting these companies or the companies they contract the work out to.

meric_ 20 hours ago||

Is this post not just an ad for a vibe coded site / product? It adds no new info on the mercor breach and advertises something which I presume has even worse safety practices

AntiUSAbah 21 hours ago|

I'm curious: if I publish a sample of my voice online, might this make it a lot harder for an AI model to identify me if every training dataset ends up containing my particular voice sample?
