
BSIDES CPT 2019 - Put Words In My Mouth - Amy Manià

BSides Cape Town · 38:36 · 752 views · Published 2019-12
About this talk
Title: Put Words In My Mouth

Abstract: Money has been withdrawn from your account. You don’t remember making or authorising that transaction. When you follow up with the bank, they say you called earlier and requested the transfer. It was, after all, you speaking, right? Unbeknownst to you, your voice was stolen, and so was your money. As voice authentication biometrics rise, so will the opportunities to spoof them. Text-to-Speech APIs are constantly improving, with Google’s technology now being indistinguishable from the real human speaker. Threat actors have access to a target’s YouTube videos and social media posts. Even more invasive channels are certain vulnerable IoT devices, littered throughout homes and offices. Social media posts and IoT devices allow threat actors to listen to your voice, capture it and then manipulate it (all using free online tools). So what exactly can be done with a ‘stolen’ voice? This research explores the possibilities of fraud by using voice-spoofing to bypass authentication.

Speaker: Amy Manià
Twitter: @The_MunX

Speaker Bio: Amy is currently a Security Analyst at Telspace Systems. Amy is currently focused on penetration testing, but comes from a career as a Professional Architect. Her history in ‘making things’ has led to an interest in ‘breaking them apart’. With a curious mind, she innately has the ability to notice where well-meaning applications could become dangerous tools in the hands of an attacker. Amy is also passionate about the information security community and regularly attends community-based events.
Transcript [en]

...which I generally wonder. There's a rule to start a presentation with a joke, but I'd prefer not to share too much about my life at this moment. Great, with a bit of luck 10% of the audience will get my ridiculous sense of humor. If it's not immediately obvious, I work at Telspace. If anyone wants to contact me (I'm in the way of my own contact details, I don't know if everyone can see that), it's Amy at Telspace, or find me on Twitter at @The_MunX. I really value any kind of feedback, or if you want to send me information, because obviously there's always more that I can learn. A little bit about who I am: I'm currently a

junior security analyst at Telspace. I like to refer to myself as a recreational researcher, because I do like to learn, and I'm an OSINT enthusiast. A little bit of my background is that I'm actually an architect, so coming from that profession has taught me that if you want to learn how to break something, it's best to know how to make it in the first place. And as Keegan so kindly introduced me, my talk today is called Put Words In My Mouth, and really what I want to find out is: can I steal your voice? Can anyone identify this for me? Exactly. It's such an iconic image, and KITT is just one of the few characters

that is one of the voice assistants in any kind of film or popular culture reference where we see some kind of AI system that is very intelligent and can speak back to us. It's definitely had some kind of long-lasting impression, and we continue to see it in movies. The most recent one that comes to mind for me is Jarvis in the Avengers. It seems to be something that's just becoming more popular, and I think that's because you have something that's got some kind of cool factor: it's really fun to be able to speak to a machine and get some kind of information back. But when you couple that with convenience,

people are really going to go for it. It's so convenient to be able to say to a machine, "OK Google, what's on my calendar for today?" or "What's the weather like?" You know, what's not to like about convenience? And when you start to look at really reliable statistics, we see that there are already 1 billion Google Assistant devices in use. That's a fairly big number. So these voice assistants, and IoT devices that have some kind of microphone in them, are just becoming more and more popular. If we look about five years ahead, that's 75 billion devices. I don't know how accurate this is, but it just gives an idea of the kind of popularity of

these devices. I read an article recently which referred to IoT devices as the "infiltration of things", which I think is quite a nice way to think of them. If you're buying from a reputable brand like Google or Amazon, your odds of getting some kind of updates are quite high, because these brands have security awareness and they're not going to want to sell a bad product. But there are also a lot of cheap devices which have no way of updating their firmware or their software, so what you buy is what you're stuck with. And the problem is that if you do have some kind of voice assistant, they are designed to always be listening. It's

because they're waiting for that keyword, so that they can hear your instruction or your question and answer you. And always listening means always recording. Both Google and Amazon have confirmed that human staff QA the recorded audio, just to check that the information returned to you by the device is accurately answering your question. But that brings into question things like trust and your privacy, because even though these large brands are saying, "Yeah, it's anonymous, there's no way to tie these recordings back to you as a user," that's absolute nonsense: to have a Google Home assistant you have to have a Google account, which means that there's obviously a way

of tracing it back to you. Another huge problem with IoT devices is that there's best practice versus regulations, and if manufacturers are not forced to do something, and it's more work and more money for them, odds are that they're going to ignore it as part of the development. Obviously, as well, where is your data stored? If there's a breach, like any breach, it's a problem, but here we're talking about your personal data. But if we just pause for a second: I wanted to understand how machines understand human speech. To understand that, we have to learn the very basics of how a voice is made, and that essentially happens in two parts. So

there is the part of your body that makes the sound. That's really just vibrations that happen in your throat, and it's the equivalent of the strings of a violin. These are known as your vocal folds. The second part that's involved is what shapes the sound, so the equivalent is the body of a violin, and these are known as your supralaryngeal articulators. "Supralaryngeal articulators" is just a fancy name for everything that happens from the neck up: your nasal cavity, your mouth, your lips, everything that helps you articulate words so that you can make a sentence. But really all that we're doing when we speak is making noises, and our brains are able to

interpret that quite naturally as human beings. When we break these down, they're known as phonemes, and there's quite a lot of similarity, as you can see, because if you say the two sentences aloud, "wreck a nice beach" and "recognize speech", it's very difficult for a machine to differentiate between the two. You then add in a whole bunch of other stuff, like homonyms, which are quite obvious and would be easy for a human but a little bit more difficult for some kind of software, as well as things like our accents: how fast do you speak, do you speak clearly, what's your enunciation like? And then also certain others that I haven't put on here, like your

microphone quality, speaker quality, all things like that. So how does software understand what we're saying? It's really just making a very educated guess from context. It's going to try to understand words in a series, it's going to see how likely it is that you're asking about something, and it's going to return that answer to you. It's worth noting that a lot of this takes a huge amount of processing power. When you issue a request to a device like a home speaker voice assistant, all that device does is record it and upload it, and it's processed somewhere else, because that small device itself does not have the processing capacity to do it then and there. So once we speak to a

machine, that is captured graphically in one of two ways. In the images here, the top one is a wave and the bottom one is a spectrogram. This is a voice print of the word "yes". Once you understand what is being said, it becomes a little bit easier to read these graphics, so it's actually broken up quite nicely here, with the red lines, into each letter of "yes". The raw wave is quite a simplified way of looking at sound. The spectrogram is a little bit more in-depth, because it's actually helping to measure energy: the lighter parts of the diagram are more energy and the darker are less. There are actually people who view spectrograms and try and

identify what's actually being said. As far as voice verification goes, there are generally two systems that are used: text-dependent and text-independent. Text-dependent is obviously quite an easy concept to understand: the passphrase that's used in the verification process is the exact same phrase that's used when you sign up or create the account. Text-independent uses characteristics of your voice instead of that kind of match, so it is a little bit more technical as a system. The problem with "unique" is that we've always been sold this idea that everyone is different and we're all special, but I'm of the opinion that that's not necessarily the case. If these studies on uniqueness have been done

on sample groups, they're using a very limited number of people, and there's absolutely no factual way to know if your fingerprint or your iris or your voice is absolutely unique. As well as the fact that we're just putting more and more of our data online. So if some kind of voice authentication becomes very popular (it's unrealistic, but just work with me here), if every single person on earth used their voice as some kind of authentication and it was stored in a database, that's several billion people, right? But now if that information is stored, and you keep adding every year the people that are born and add their voice

onto the system, in ten years' time you wind up with a huge number of people, and it just lowers your odds of being unique. This ties in to POPI, or POPIA, which I think everyone is familiar with. Your voice is covered as biometric data: it is something that belongs to you. And the reason why this is troubling is because verbal agreements are just as binding as written agreements. So if you phone someone and you make some kind of contract, say you phone your lawyer and you ask them to do something, that's law. Generally verbal agreements are difficult to prove, but if it's recorded, it's no different from a contract, and

that's why having your voice spoofed is undoubtedly theft. There was a lot of speculation about it, people just speaking about what would be possible, and then it actually happened. In September there was a company that had two hundred and forty-three thousand dollars transferred, because one of the senior managers received a call from the CEO (allegedly the CEO) requesting this payment to be made. Reports of this say that the voice was done so well it even had the CEO's accent down to a tee, and the only reason they found out, or found out sooner, that this was a spoof is that once the first transaction was successful, the criminals phoned back and asked for a second amount. Not dodgy

at all. At the moment these kinds of attacks require quite a lot of preparation time, expertise and resources, so it's not a simple thing to carry off. But like anything, these things get better and it becomes much easier. At the moment only your higher-profile individuals are at risk. I want to contextualize "higher profile", because I think that if someone incredibly famous contacts you, you might be a little bit skeptical, but if it's the CEO of a company who phones someone else within the company and asks for some kind of very urgent payment to be done, it can be quite believable. This video, which you would have seen and now will not see,

is a spoof of Joe Rogan's voice. This was not done with his permission. He was a brilliant target, because he has such a huge amount of data online. I don't know if anyone watches his podcasts: at several hours every week, he has the perfect dataset for someone to spoof his voice. Even he was scared by these results; they were really, really good. That you will not see here, as well as some audio that you will not hear. So obviously Google are really good at what they do, and I think most people in this room will be familiar with the Google voice. What's interesting is that when I tried to find out who she is, they haven't disclosed her name, but

their tool is becoming so good that when people listen to a recording by this voice-over artist versus a generated audio sample, they are struggling to tell the difference between what's real and what's not. I was going to get you guys to guess, but here's the answer. I think I would really like to put this online, just in case anyone's interested in actually seeing the videos that are in this presentation. My humble apologies. This video shows Google's Duplex. Duplex is going to be Google's assistant. What's incredibly amazing about this software is that you can say to your voice assistant: please book a haircut for me on this date, between

this time and this time, and this software will phone your hairdresser and have a live discussion with a person on the other end of the line. When Google tested this, the targets that they phoned were not aware that they were part of this testing, and Duplex was able to make the appointment flawlessly. As far as tools that are available, there are several. The easiest one is the second on this list, known as Lyrebird. If anyone is interested, you just go on to Google and type in "Lyrebird". Your first results may be for the actual bird itself, but just look for the voice spoofing tool. It will give you sentences to read, and I think you

only need a small dataset for it to be able to spoof your voice, so if you're interested, have a look at that. A more technical tool that has been developed by Google is known as Tacotron. It's definitely nowhere near as user-friendly as something like Lyrebird: you will need a certain graphics card and a little bit of experience behind the terminal to be able to get it working. Adobe put together a beta version of a program called VoCo, and it was going to be marketed as the Photoshop for voice. It was never released, but it also had incredibly good results. What's worth mentioning here as well is that all of this software has a

legitimate application. So consider maybe there's a film set, and the sound guy was maybe not doing his job properly, and they lost a few seconds of data from one of the actors in the movie. You don't necessarily want to get that actor into the studio just to record two or three words, so you can use software like this to generate the words that you're missing. Any kind of call center is able to use this application to set up their menu and make it interactive. As well as people who suffer from any kind of muscular degenerative disease: if they don't want to lose their voice, they can make a copy of it and then still be able

to express themselves using their voice. I don't know if anyone saw either of these applications online. The first, on the left, is FakeApp. What it did is that it took a video or image, you then supplied a different person's face, and it could put the second person's face very convincingly in the video or the image that you provided. The tool wasn't online for very long; it got taken down shortly after it was released. If you do some careful searching, you're still able to find it though. The second one that's worth mentioning is DeepNude. It worked... what the hell is that timing? I mean, we couldn't have planned that

better. Every time. So DeepNude works on a similar kind of...

This is serious money. All right, so what DeepNude did is that you would provide it with a photograph, unfortunately of women only, and the software would undress the woman. As you can imagine, it also caused a huge uproar online and was also taken down. So this was also a dramatic reveal in PowerPoint, so feel free to gasp as I scroll down through my PDF. Yeah, if we name a tool for spoofing voices, I mean, what else do you call it? All right, so then a dramatic fade to black, because now you know we're doing dodgy... sorry, that's probably not allowed. All right, so how does Deep Throat actually work? The tool, guys. So you just have a little bit of

your input data, which is the voice that you want to spoof, magic happens, and then voilà, you can just attack whoever you want. If only it was that simple, right? So a rough breakdown of this is that you would obviously need to have your target in mind. You have your input data, which you pass through this tool. It chops it up into ten-second clips, which just makes it easier for the tool to work with. An optional step is passing it to a subtitle API, so that you can manually check if the tool is understanding what's being said correctly. It's then passed through a deep learning module. While this is all happening in the

background, your attacker can prepare what I call the attack script, and once your spoofed voice is generated, the app will automatically create the phrases that you put in the attack script. In order to make this attack vector as convincing as possible, the tool has a virtual audio cable which can pass the audio from the tool to a phone call app. When I did my testing I was using Skype. The great thing about Skype is that you can make international calls and the number's not easily traced back to you. Also, if you want a free account, you can make calls for like 30 days and not have to pay, if anyone wants to do that. Something that I've read consistently in

reports is about background noise. As soon as you have something like a baby crying, or airport noise, or something that adds to the urgency of your attack script, it makes it seem a whole lot more believable. And then all you have to do is attack. Here is a wonderful video; it was part of some of my early proofs of concept. What I did is that I took a spoof of my voice and I thought, who can I fool? The people that obviously know me best are my friends and family. So I just put together a very short sentence. I said, there's a problem, I really need you

to phone me. I had mall noise with a kid screaming, and then I sent this as a WhatsApp voice note to several people. I don't know if this is surprising or unsurprising, but it was good enough to fool both my father and a friend of 25-plus years. So there's definitely room for improvement, but it's convincing enough. So then I wondered, can I fool a bank? And the scary thing is, I think I can, and I got pretty close. The only difficulty I had was audio quality, because at that point I hadn't figured out the virtual audio cable yet, and I was doing it from a speaker through to the Skype call, which wasn't

great quality. But the lady on the other side of the call was convinced enough that she was speaking to a real person. What worries me more than a bank, though, is: what about fooling an entire nation? There are general elections next year in America. California has recently passed a bill where you cannot distribute a deepfake in the 60 days before an election. If you do, you're liable to a whopping four thousand dollar fine and, I think, a month in jail. Now if you think that you can sway the election in an entire country and possibly influence international politics, four thousand dollars is nothing. I also had a video

here. A journalist has deepfake software, where he's interviewing Putin. The one camera shows him where he's asking questions; he then runs to the other camera, which has the software enabled, and it makes his voice and his face look and sound like Putin. So it becomes very difficult: seeing is no longer believing, and neither is hearing. And then, as if by magic, this tool got released. There was also a dramatic pan, just so that if you missed the name you can see it nice and clear. And this tool is pretty impressive. So I was planning to do a live demo, which is

not going to happen. This is also a video; this was one of my tests. At least I can show you what the tool looks like and explain a little bit. What you do is take a recording of a voice. You only need five seconds; that's how good the software is becoming. The top spectrogram is the actual input, the recording that you make. You then input a sentence in the top right, and the synthesizer and vocoder turn this into spoofed audio. If you can see on the left, these circular dots are from the recordings and the X is the spoofed voice. So really, in terms of quality, you want

that X to be as close to the circles as possible. I don't know how clear this is, but you can see that there is one X quite close to the dots; that was good enough to fool my mother. All right, so how do we stop this from happening? Yeah, not a great kind of mitigation, right? If anyone wants to watch this video, it's amazing: it's deaf people teaching you how to swear. That's [inaudible], by the way, if anyone is interested. So yeah, learning sign language is not an option. I also didn't realize there were a whole bunch of different sign languages. I think this is a huge missed opportunity for a universal

language, but that's, I guess, a different talk. So, mitigation: trust but verify, and if you think something is dodgy, trust your gut. It's generally good advice. As I mentioned, seeing is not believing, and no longer is hearing: a phone call from someone that you know might not necessarily be from them at all. In my opinion, prevention is almost impossible as soon as someone has a tiny dataset. The irony of doing these talks is that I'm making myself vulnerable to this kind of attack, and I can appreciate the irony of that. But the only advice that I can really give to people is that if a service provider is using your

voice print as authorization, or as the only form of authorization, I would opt the hell out of it. There are already two service providers that I'm aware of in South Africa that are using this at this point in time. I'm not a big fan. And even though I don't necessarily have a problem with biometrics (I think that, again, convenience has its place), the massive difference for me between something like a password and biometrics is that a password you can change; your biometrics you're stuck with. So as soon as something of yours is compromised, there's nothing you can do about it. So yes, biometrics are not bad, but it's best to use them in combination with something

else. Thank you for bearing with me through this disaster. "...and at the Comedy Store, and some idiot ran up on stage. He comes up to me during the middle of my set and tells me that we are in a simulation. The guy was drunk out of his mind; he was so drunk that he couldn't stand up straight. So we all laughed at him and let security escort him out. But now that we have deepfakes and fake voices, I'm starting to believe that we're not far off from simulations after all." All right, so I don't know if anyone here watches Joe Rogan podcasts. If you do, that's pretty believable, right? No, everyone is here... what I'm doing, it's not

negotiable at all. This is great. What was next? Okay. "It turns out a big part of getting things done is making a phone call. We think AI can help with this problem. Let's say you want to ask Google to make you a haircut appointment on Tuesday between 10:00 and noon. What happens is the Google Assistant makes the call seamlessly in the background for you. So what you're going to hear is the Google Assistant actually calling a real salon to schedule the appointment for you. Let's listen."

"Hi... client... I'm looking for something on May 3rd." "Give me one..." Sorry, I'm just going to pause that for a second. What I want you to pay attention to is that when Duplex is talking, it's learned to use aspects of speech like "um", which is incredible, because generally what was easy to identify when you were listening to a machine is that it spoke perfectly, whereas this is learning the nuances of human speech. "At 12 p.m.?" "We do not have a 12 p.m. available. The closest we have to that is a 1:15." "Do you have anything between 10 a.m. and 12 p.m.?" "Depending on what service she would like. What service is she looking for?" "Just a woman's haircut for now." "Okay, we

have a 10 o'clock." "10 a.m. is fine." "Okay, what's her first name?" "The first name is Lisa." "Okay, perfect. So I will see Lisa at 10 o'clock on May 3rd." "Okay, great, thanks." [inaudible]

Feel free to applaud as well. This was actually a test, just to see if you guys could remember the order of my presentation, because clearly I can't. All right, so this is my proof of concept. "It is Wednesday the 17th of April. This is a proof of concept video from a Telspace research report. What I'm doing is seeing if I can use my spoofed voice to fool my friends and family. So I'm going to send a good friend of mine a voice note asking her to phone me."

"Okay, she's online and she's listening to the voice note. There we go." That's wonderful. Also, shout-out to Greg, who did all my video editing. Mad skills. Here's the one with Putin. "Mr. President? Hello, Mr. President, can you hear us?"

"Hello, Cambridge, good morning." "Thank you, Mr. President, it's wonderful to have you here with us. So as you know, people here are somewhat concerned. We have an election campaign coming up in 2020, and people are worried that Russia may interfere with it as it did in 2016, perhaps using more advanced technologies such as deepfakes. What do you have to say to that?" All right, so look, obviously this isn't the best deepfake. You can see around his forehead there's discoloration of the skin, and there are moments where the face freezes. But if this was done at good quality, I think it's so insidious, because it doesn't

matter about facts anymore. If you're able to create something and plant a seed in people's brains, I think it's incredibly problematic, and no one gives a [expletive] about a four thousand dollar fine.

I'm not actually sure, I'd have to check that for you. I think we can definitely try the demo. This is the one I was able to record, so the audio happens right at the end. Everything that I explained to you guys, I was going to explain while this video goes. Something that's quite helpful in the tool is that you want to have a really good quality spectrogram. Obviously the software can only work with what you give it, so if you have a rubbish recording, it's not going to be able to do anything good with it. These are fairly good quality spectrograms, both the input and the generated speech.
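The spectrogram view that the tool shows can be illustrated with a small, self-contained sketch. This is not the tool's code: the function names (`make_tone`, `spectrogram`, `loudest_bin_hz`), the frame size and the synthetic 1 kHz test tone are all assumptions made for illustration, and a naive DFT stands in for the optimized FFT a real tool would use. Each frame holds the energy per frequency bin, which is what the lighter and darker areas of a spectrogram image represent.

```python
import cmath
import math


def make_tone(freq_hz, sample_rate, duration_s):
    """Return a pure sine tone as a list of samples (the raw 'wave' view)."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of per-bin magnitudes for each overlapping frame.

    Hypothetical helper: a naive windowed DFT, slow but dependency-free.
    """
    # A Hann window tapers each frame so its edges do not smear the spectrum.
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_size - 1))
        for i in range(frame_size)
    ]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        chunk = [samples[start + i] * window[i] for i in range(frame_size)]
        magnitudes = []
        # Only bins 0..frame_size/2 are needed for a real-valued signal.
        for k in range(frame_size // 2 + 1):
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


def loudest_bin_hz(frame, sample_rate, frame_size):
    """Return the centre frequency (Hz) of the bin with the most energy."""
    k = max(range(len(frame)), key=lambda i: frame[i])
    return k * sample_rate / frame_size


if __name__ == "__main__":
    rate = 8000                       # assumed sample rate for the toy signal
    tone = make_tone(1000, rate, 0.1)
    spec = spectrogram(tone, frame_size=64, hop=32)
    print(len(spec), "frames; loudest bin at", loudest_bin_hz(spec[0], rate, 64), "Hz")
```

In practice a library routine such as SciPy's `scipy.signal.spectrogram` would replace the hand-written loop; the pure-Python version is only meant to show what is being measured. A noisy or clipped recording spreads energy across many bins, which is one way to picture why a rubbish recording gives the software little to work with.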

"I'm doing a test to see what my spoofed voice sounds like after only providing the tool with a few seconds of input." All right, so it's not the best; there's definitely room for improvement. I think what would make a huge difference, and something that I would want to add to my tool, is location, because the accent makes a huge amount of difference. Part of my plan is to create a South African dataset and see if I can get more convincing results using that. I think the only other thing I can possibly find is the audio from Google.
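One rough way to put a number on how close a spoof is to the real voice, in the spirit of the X-versus-circles plot in the tool, is cosine similarity between speaker embeddings. The sketch below is an assumption-laden illustration: the three-dimensional vectors are made-up toy values (real speaker embeddings have many more dimensions and come from a trained encoder), and `closeness_to_speaker` is a hypothetical helper, not part of any tool shown in the talk.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Return the element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def closeness_to_speaker(real_embeddings, spoof_embedding):
    """Score a spoof against the average of the real recordings.

    Hypothetical helper: values near 1.0 mean the spoof sits close to
    the cluster of real recordings, values near 0.0 mean it does not.
    """
    return cosine_similarity(centroid(real_embeddings), spoof_embedding)


if __name__ == "__main__":
    # Toy embeddings, invented for illustration only.
    real = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]    # the "circles"
    good_spoof = [0.95, 0.05, 0.0]               # an "X" near the circles
    poor_spoof = [0.0, 1.0, 0.0]                 # an "X" far away
    print("good spoof:", round(closeness_to_speaker(real, good_spoof), 3))
    print("poor spoof:", round(closeness_to_speaker(real, poor_spoof), 3))
```

A score like this only says how similar two embeddings are under one particular encoder; it is not a guarantee that a human listener, or a bank's verification system, would be fooled.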

All right, so these are not the ones that I'd put in my presentation, but they will suffice. I don't know if you guys can read, because it does say which is which. "She earned a doctorate in sociology at Columbia University." "She earned a doctorate in sociology at Columbia University." Okay, so there the first one was generated, the second one was recorded. "George Washington was the first president of the United States." "George Washington was the first president of the United States." To me that's a really good example of how close the generated speech is to the recording of the voice-over artist. And I think those are all the videos. I'm just having a quick look.

"...Two Minute Papers with Károly Zsolnai-Fehér. Today we are going to listen to some amazing improvements..." Is it just more examples? "...AI voice cloning. For instance, if someone wanted to clone my voice, there are hours and hours of my recordings on YouTube and elsewhere. They could do it with previously existing techniques. But the question today is: if we had even more advanced methods to do this, how big of a sound sample would we really need? Do we need a few hours? A few minutes? The answer is no, not at all. Hold on to your papers, because this new technique only requires five seconds. Let's listen to a couple of examples."

"The Norsemen considered the rainbow as a bridge over which the gods passed from Earth to their home in the sky." "Take a look at these pages for Crooked Creek Drive." "There are several listings for gas station." "Here's the forecast for the next four days." All right, so the order of that hasn't been very great. I hope you've been able to keep track and not get too confused. Questions?

Do you guys... is there a microphone that gets passed around? Is it okay? Sure. At the moment the tool that I've used has only done English, but because it's open source there are other people creating datasets for other languages. Yes, there was another question there. Right, well, there are several factors. If someone speaks really well and they enunciate properly, of course you're going to get better audio from them. It also depends on your actual device quality: is the microphone good, is the microphone terrible, how much ambient noise do you have? So there are quite a few factors that influence your ability to make a really good spoof. Oh, I didn't repeat the

question. I'm sorry, I'm bad at following instructions, apparently.

Oh, it's definitely being done, but not for the tools that I showed. So Tacotron is not the tool that I was showing; that was done just for English.