
Thanks, guys. I noticed earlier today that there was a discrepancy between the description for this talk on the website and the one in here, so there might be a little confusion about what this is about. This talk is about audio interface systems like Alexa and other voice assistants, and how we can do bad things with them. The genesis of this talk actually comes out of 2013, when there was a debate about what the future of interacting with a computer would look like. On one side you had Google Glass. How many of you remember Google Glass? Well, that was
a big thing. Do you remember how much Google Glass retailed for? $1,500. All right, and that was 2013. About two years later, Alexa came out, and it retailed for about a hundred and fifty dollars. The entire debate was: are we going to wear additional equipment on our bodies and turn ourselves more into cyborgs, or are we going to go back to how we naturally interact with each other and have that be an extension of how we interact with computers? Alongside that, you also had Siri coming up, and it's kind of a sad story here: Siri was one of the last
acquisitions that Steve Jobs made before he died. Even as he was sick, he was calling to make sure that acquisition came through, and they pushed it through, and we have Siri in our phones. But since then it's been allowed to atrophy, to where Siri doesn't really dominate the space anymore. As a kind of swan song, Siri has made a comeback as Shortcuts, which users have kind of propped up. If you get a chance, look up the "Hey Siri, I'm being pulled over" shortcut; there's a fun shortcut system there. But most of you probably saw audio assistants, or voice assistants, come through really strong in Christmas of last year; that
was when my family members got me one Alexa device, and that quickly multiplied into five Alexa devices. In fact, 11 million Alexa devices were sold this past Christmas, so many that I had random family members show up with them. Show of hands: who here has an Alexa device in their house? Anybody else have a Google Home? I actually got so excited about it that I went and got one of each, and then my nefarious children got them to argue with each other this past Christmas. We got so many of them that we put them all throughout the house, and it wasn't until after we ended up putting them in our cars that I started to
wonder, and it's kind of putting the cart before the horse: what are the security implications? That's actually the heart of where this entire talk comes from. I try to beat this into my kids: we shouldn't live in a world where we have technological magic around us; we should try to figure it out, and in that way you know what the attack surface looks like. So I want to start with a little bit of how a voice interface system works. Now, in order to know how a voice interface system works, you have to do a little bit of machine learning, and I promise this is going to be a really fast walkthrough. So, to
do this, I'm going to borrow from Shazam's white paper on how to do audio fingerprinting. It turns out audio fingerprinting is really old, and it's a fairly simple concept to grasp. The first thing you do is take segments of audio and slice them up into windows. So this is a segment of a song sliced up into 10-millisecond increments. One trade-off in the machine learning space here is that the wider the windows, the faster you can process the audio, but the more you lose out of that audio, and that's going to be very important later. The narrower the windows, the more accurate you can be, but the more data is required
to do the analysis. So the first thing we do is chop up the audio into windows. The next thing we do is take the waveform and throw out all the data except for the highest peak of the audio, the highest point of each signal at each point in time, and that gives us this nice constellation of points all along the x-axis. What that gives us is kind of a waveform, but then we take this and combine it with where the high point is at each point in time, plus the delta offset of all the other high points in that
sliding window. So what happens is, over time, you say: I've got a high point here, here, and here, or these are the frequencies that are the high points, and over time there's a convergence: this fingerprint matches what I've got stored over here. This is one fingerprint in time; you would generally take lots of different fingerprints over the span of many, many milliseconds. Okay, so that's a really rough overview of how audio is fingerprinted and how you can say this piece of audio sounds like that piece of audio over there. That's the first piece of how this whole voice audio interface works.
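To make that concrete, here's a minimal sketch of the constellation-and-hashing idea in plain Python. It assumes you already have a spectrogram (one list of frequency-bin magnitudes per window); the toy spectrogram, the fan-out of 3, and the (freq, freq, delta) hash shape are illustrative choices of mine, not Shazam's actual parameters.

```python
def peak_constellation(spectrogram):
    """Keep only the strongest frequency bin in each time window."""
    peaks = []
    for t, window in enumerate(spectrogram):
        # index of the highest-magnitude frequency bin in this window
        f = max(range(len(window)), key=lambda i: window[i])
        peaks.append((t, f))
    return peaks

def fingerprint(peaks, fan_out=3):
    """Hash each peak together with the next few peaks (delta offsets)."""
    hashes = set()
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            # (anchor freq, paired freq, time delta) identifies this pairing
            hashes.add((f1, f2, t2 - t1))
    return hashes

# Toy "spectrogram": 5 windows x 4 frequency bins of magnitudes
spec = [
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.8, 0.1],
    [0.7, 0.1, 0.0, 0.2],
    [0.1, 0.1, 0.2, 0.9],
    [0.3, 0.8, 0.1, 0.0],
]
peaks = peak_constellation(spec)
print(peaks)  # [(0, 1), (1, 2), (2, 0), (3, 3), (4, 1)]
prints = fingerprint(peaks)
```

The delta offsets are what make this robust: a match is "these frequencies, this far apart in time," which survives the absolute timing of when you hit record.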
The second piece is the reason I was talking about the windowing: the wider the windows, the more susceptible you are to error, and error in this case, well, there's a very big overlap in the literature between noise and silence. One of the reasons is that everything is noise at a certain point, and so silence in itself is actually pretty noisy. If you were to put a speaker in a room and just crank up the gain on it, it would eventually just hammer you with static, because there's still stuff going on in that room. Voice assistants have to tackle the far-field voice problem, which means you're not going to get
a very clean signal. The mics here on me are right next to my mouth, so they can get a pretty clean signal, at least that's the idea; a voice assistant needs to be able to pick up my voice across a room. It's actually pretty surprising how resilient an Alexa is to interference. So that's one thing. The next thing is that silence is actually a relative thing, and noise ends up hurting our fingerprinting. Going back to the fingerprint here: if there's noise interference while I'm taking a fingerprint, then I have to wait until the next time slot, the next window, to start fingerprinting again, and I'll
probably have to take multiple fingerprints, and I'm not going to get an exact match. I'm going to get maybe 80% of the fingerprints to match up, and then I'll have a certain confidence threshold that this fingerprint is a match. Noise and silence, when you're doing audio analysis, are roughly the same thing; like I said, everything is noise at a certain point. On decibels, this is a fascinating thing that I found out: breathing is about 10 decibels, a regular whisper is about 30 decibels, and an average normal conversation is 60 decibels, which is basically my sons' whispering. Another fascinating fact: Microsoft created a sound-dampening room on their Redmond campus, and it went down to negative 60 decibels.
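That fuzzy, threshold-based matching can be sketched in a few lines. The fingerprints here are just hashable tuples standing in for the constellation hashes the talk describes, and the 80% threshold is the illustrative figure from the talk, not a number from any real service.

```python
def match_confidence(query_prints, stored_prints):
    """Fraction of the query's fingerprints found in the stored set."""
    if not query_prints:
        return 0.0
    hits = sum(1 for fp in query_prints if fp in stored_prints)
    return hits / len(query_prints)

# Fingerprints on file for some known audio
stored = {(1, 2, 1), (2, 0, 1), (0, 3, 1), (3, 1, 1), (1, 0, 2)}
# A noisy capture: 4 of the 5 fingerprints survived the interference
query = [(1, 2, 1), (2, 0, 1), (0, 3, 1), (3, 1, 1), (9, 9, 9)]

conf = match_confidence(query, stored)
print(conf)  # 0.8
print("match" if conf >= 0.8 else "no match")  # match
```

A noisy room doesn't zero out the match; it just knocks out some fraction of the fingerprints, which is why the decision is a confidence threshold rather than an exact comparison.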
Negative 60 decibels is interesting because something happens in the human ear: we lower our ability to process sound based on what's ambient around us. You basically get used to the noise around you; if you've ever gone camping and it sounds too quiet, that's the reason: you're adjusted to the higher noise of a city. Well, if you're adjusted to negative 60, you start hearing really interesting things, like your own body's mechanics, all the way down to bones grinding on bones. So there's this whole other fascinating field of silence. Another thing with silence is that your cell phone has a lot of algorithms on it just to detect
silence, because the cell phone doesn't want to send information that it doesn't have to. You're probably wondering at this point: what does all this have to do with a voice assistant? The voice assistant is designed to not send all of the data that's around it. That's actually one of the things that I wanted to point out. When we put one in our car, the first thing I was worried about was that it was going to chew through all the bandwidth that I have on my cell plan, but it turns out that it doesn't use very much; it uses about as much as browsing an average website, and yes, the average website is about a meg now.
But this means that it's not piping raw audio up to the cloud; that would actually overwhelm the services, or it would just produce a whole lot more processing that Amazon would have to do. What they do is take a stream of fingerprints, and those are sent up to the cloud. Now your question is: could they reconstruct the voice that was talking to the Alexa from those fingerprints? Yes, they could. It wouldn't be the exact voice; it would be a very robotic voice. If you've ever been on a teleconference and somebody's audio starts getting really pixelated, that's what it would sound like. That said, it still contains enough
information that it could be useful to law enforcement investigators or somebody else; that's why there are at least three pending cases for Alexa data, but they're not looking for the raw audio. Another interesting point, remember the noise: the Alexa tries to cut out as much of that ambient sound as possible, so there's no CSI here; it's not going to allow you to listen in by some other means of bouncing sound around. The Alexa is designed to pick up on the primary person who triggers it and then take fingerprints from that source. The other interesting thing is that the fingerprints are sent over a secured connection. The first versions of
the Alexa, two years ago or so, had one leg of their connection, I think the provisioning process, that was not secured. From what I could see there are no raw pcaps of the data that's sent to Amazon, at least none that are public right now. Even skills are required to use secure connections. So that kind of raises the question: what can we do when it comes to audio? Here's how the entire loop goes: you tell something to the Alexa, it picks up on the primary speaker, the fingerprints are sent over to the Alexa service, which is a speech-to-text service, and that speech is chopped up. You've got the
hot word that triggers the device to start listening and to start taking fingerprints, and then it listens for what comes next. The next leg of the problem is natural language processing, NLP, because when you talk to the device you want to be able to use a more or less natural, conversational style. The problem is that doesn't actually work very well right now. With the Alexa, what it's actually looking for is the action: the skill, the intent, and then what are called slots, which are like the variables. The skill would be the application.
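To give a feel for the skill/intent/slot breakdown, here's a deliberately naive, hypothetical parser. The grammar, the skill name, and the slot handling are all invented for illustration; a real Alexa skill declares its intents and slots in an interaction model rather than string-matching like this.

```python
def parse_utterance(text, wake_word="alexa"):
    """Rough sketch: wake word, then 'ask <skill> to <intent> <slots...>'."""
    words = text.lower().split()
    if not words or words[0] != wake_word:
        return None  # no hot word: the device never wakes up
    # e.g. "alexa ask pizzabot to order a large pepperoni"
    if len(words) >= 5 and words[1] == "ask" and words[3] == "to":
        skill = words[2]       # the skill is the application being invoked
        intent = words[4]      # the action the skill should perform
        slots = words[5:]      # leftover words fill the slots (the variables)
        return {"skill": skill, "intent": intent, "slots": slots}
    # otherwise treat it as a built-in command like "alexa play jazz"
    return {"skill": "builtin", "intent": words[1], "slots": words[2:]}

cmd = parse_utterance("Alexa ask pizzabot to order a large pepperoni")
print(cmd["skill"], cmd["intent"], cmd["slots"])
# pizzabot order ['a', 'large', 'pepperoni']
```

Even this toy shows why the conversational style breaks down: anything that doesn't fit the expected pattern falls through to a much dumber interpretation.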
Okay, so here are the attack vectors in that chain, and here's how they go bad. First of all, the Alexa device is basically an Android device, and we could spend a lot of time here talking about jailbreaking an Alexa. From what I understand it's actually not entirely hard, but it's also not really an attack vector: if you've got one locked away in your car or in your house, it would require somebody to jailbreak it and then put malicious firmware on it. One thing to note is that Amazon did a really good job with the firmware of the Alexa itself; like I said, it has the certificates baked in to talk back to Amazon, and there are no public pcaps that I've been able to find just yet. It's also worth noting that Amazon thinks that their security is so good that they've sponsored the
law that just got passed, I think in California, that all IoT devices have to have really strong encryption. One way that Alexas can go bad is side-channel commands; we'll see an example of that in just a second. One of the big issues with an Alexa is that it is over-privileged. When you provision one, it is assigned to an Amazon account, and, well, they've added a few ways in the app to limit some of the access that it has. One of the things that you can limit is whether or not the Alexa device can complete a purchase, and that's because early on there were several cases of young children being able to order dollhouses and
cookies or whatever else on their parents' account. Other than that, though, anything you can do, you're doing as the primary account, and the reason for that is that the identification and authorization piece of audio security is not solved yet. And it's frightening, because about a year ago the bank that I actually use, Schwab, came out with a feature where you could voice-authorize yourself to your account, and they ripped it out about a month later, because that's really insecure. What Alexa does offer, though, is that you can have a primary account and sub-accounts that you tie into your account, and then what it'll do is try to figure out
maybe you, your wife, your children, and then try to switch between those accounts based on the voices that it hears talking to it, but that's not at all the same as authorizing you against those other accounts. Another way that voice assistants could go bad: there's an entire fascinating field of data capture and exfiltration using supersonic and subsonic channels; we'll get to that in a second. And the biggest one is that it just generally teaches users bad habits. How many of you have ever seen somebody talking into their device as a search term? I was walking around the track behind our house a while back and
I was watching these people talking into their devices and wondering if I could just walk around with a directional microphone and pick those conversations out. I wonder if it would freak people out if they heard, or were able to see, how easy it is to pull that back out. But in the rush to get these voice assistants out to market, we're teaching people bad habits: tying their accounts to these devices, giving them too many privileges, and just generally not solving some of the authorization problems. So here's the side-channel command that I was mentioning. Initially, when Alexa came out, it had no way to say: ignore your key
word, your name. If anybody said "Alex" or anything that even remotely sounded like "Alexa," it would wake up. There were several humorous stories of CNN reports and things like that where they would use the word "Alexa," and there was one story where they were reporting on Alexas being triggered accidentally to buy things, and the report then triggered all the listeners' Alexas to buy things. So Amazon came out, I think towards the summer here, with a way for advertisers to use the Alexa trigger keyword without triggering people's Alexas, and the way they did that, you can see
that this is a waveform of the word "Alexa" over time, but what's cut out is that middle band there. Alexa will see that and it doesn't trigger, because a normal human voice fills that band in, but something that's generated from a computer can leave it out. Now, I bring that up because this is a benign example of just not triggering Alexa because it's missing part of the waveform. The human ear fills stuff like this in, and this is the heart of audio security: we've evolved in such a way that we fill in a lot of data. We do this visually, and we do this audibly too. That's actually the way MP3s work: they cut out a whole range of sound that you don't really perceive anyway.
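Here's a toy version of that band-cutting trick in pure Python: synthesize a signal out of a low tone and a mid tone, then zero out the mid band with a naive DFT. The tone frequencies and band edges are arbitrary choices for illustration; real "non-triggering" ad audio would be engineered against the assistant's actual detector, which we don't have.

```python
import cmath
import math

N = 64  # number of samples in our toy clip

def dft(x):
    """Naive discrete Fourier transform (O(N^2), fine for a toy)."""
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT back to real samples."""
    return [(sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N).real
            for n in range(N)]

# A "voice-like" signal: a low tone (bin 3) plus a mid tone (bin 12)
signal = [math.sin(2 * math.pi * 3 * n / N) + math.sin(2 * math.pi * 12 * n / N)
          for n in range(N)]

X = dft(signal)
# Cut the "middle band": bins 10-14 and their mirrored counterparts
for k in list(range(10, 15)) + list(range(N - 14, N - 9)):
    X[k] = 0
notched = idft(X)

def band_energy(x, lo, hi):
    """Total spectral energy in bins lo..hi."""
    return sum(abs(c) ** 2 for c in dft(x)[lo:hi + 1])

print(band_energy(signal, 10, 14) > 1000)   # True: mid tone present
print(band_energy(notched, 10, 14) < 1e-6)  # True: mid band removed
```

The notched signal still carries the low tone untouched, which is the point: a machine checking only certain bands sees something very different from what a human ear, happily filling in the gap, perceives.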
There was this hole in my understanding: I didn't understand the analog-versus-digital debate for music until I started really studying how voice assistants work, because with voice assistants we're studying the waveform, but then you get into how MP3 is compressed, and MP3 just chops out all the additional data, sort of like a fingerprint but at a lot higher resolution. So the next thing that I did after analyzing the device and all that was to go into rolling your own voice assistant. It's actually possible. I really didn't like the idea of having all these devices in my house that I had no root access to and no way to really tell what they're
doing, other than just trusting Amazon, and they've done a pretty decent job, but I'd like to be able to extend it. One of the things I'd like to do is sound event detection, and we'll get to that in just a second, but there are a lot of other reasons: learning how voice is processed, the voice-to-text piece and then the text-back-to-speech piece, keeping data on premises, and one of the big pieces is voice style transfer. So here's a capture from some sound event processing that I ran in my own house when we were just watching a movie one night. This is just a normal
night in our house watching a movie, and we have things like "screaming," "smashing," "pig." I think I captured "pig" because there was a point where it was keying off of squealing or something like that. This goes to show that model development is really hard when it comes to what your fingerprints represent. Shazam was just taking a body of songs and fingerprinting all those songs, and then the service was: I can take my phone, hold it up to any source, and it will try to fingerprint against those songs, right? When it comes to voice, or speech-to-text, generalizing speech-to-text is a really hard problem.
You have to have a whole lot of data behind that. How many of you remember Dragon NaturallySpeaking, or just Dragon? Anybody have a guess of how old that is? At least 25 years, yeah, and in the initial versions you had to read it a book, and it would train itself to your voice. Okay, the problem that we have now is that we want the voice assistants, out of the box, to be able to pick up on anyone's voice. That's a really hard problem, and it takes a lot of data to train. So here are some of the open data sets that I've found: the Blizzard Challenge, LJ Speech. Now, the interesting thing
about these different data sets is not just the audio collected but the metadata around it, because when you train, you want to reinforce the training: here's what you thought it was, here's what it really was. So there are different bits that are added to it. LibriVox has a lot of just simple voice-audio-to-word samples; the Blizzard Challenge has a lot of that too. The Ryerson audio-visual database adds emotion into it, which is an interesting concept; Amazon recently patented the ability to tell whether or not you were sick by the change in your voice, which is a fascinating concept. Other
ones are the UrbanSound data set and Google's AudioSet, and I mention those separately because sound event detection is a whole other realm of audio security; the Alexa actually has a Guard mode now where it can listen for breaking glass and breaking wood. One of the saving graces here is that there's no black market for models, some model that I can just plug into any other system. All the machine learning scenarios that I've seen have to be trained in the same system that does the detection, which means that it's not yet possible to just
buy a model of someone's voice. So, speaking of that: how to steal a voice. There are some legitimate uses for cloning someone's voice. One of the interesting ones that I found was Roger Ebert regaining his voice in 2010 with the help of CereProc. He had throat cancer; they had to cut out his larynx, and he wasn't able to speak. He didn't want to go the route of Stephen Hawking and just have a computer-generated voice; he wanted to reclaim his voice. So CereProc offered to help. They said: we've got hundreds of hours of you speaking; we can take your voice, clone it, and you can have your own voice back.
It's pretty cool. There are services online that you can clone voices with: Lyrebird, and Adobe VoCo, which I think is only software-as-a-service; it's basically the Photoshop of voice. But you can also do this on your own. What they use under the hood is called a generative adversarial network, a GAN. This is two machine learning networks: one is generating possible outputs, and the other does the classification; there's a discriminator that will discriminate based off of what's been generated: does it match what the person wanted in the first place? So you have the original signal analysis and frame synthesis. Another way to look at this is, let's say
you have a training set of handwritten characters, and then you have a random noise generator generating these possibilities, and you have this possible image that the discriminator says: yes, that looks like an 8. Okay, that's a way that you could basically change the style of someone's handwriting, or write a note in someone else's style. Who here has heard of deepfake videos? This is effectively the building block of deepfake videos. Does anybody know how much source content it takes to create a deepfake video? It takes many hours of source content, I would dare say about 20 hours or so, to create a believable deepfake.
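The adversarial loop itself can be shown on a toy 1-D problem. Everything here is a sketch of the GAN training dynamic, not anything resembling a real voice or image model: the "real" data is just numbers near 5.0, the generator and discriminator are single linear/sigmoid units, and crude finite-difference gradients stand in for backprop.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-x))

# "Real" data: samples clustered around 5.0 (stand-in for real voice features)
def real_sample():
    return 5.0 + random.gauss(0, 0.1)

def noise_sample():
    return random.gauss(0, 1)

# Generator g(z) = gw*z + gb, discriminator d(x) = sigmoid(dw*x + db)
p = {"gw": 0.1, "gb": 0.0, "dw": 0.1, "db": 0.0}

def g(z, p):
    return p["gw"] * z + p["gb"]

def d(x, p):
    return sigmoid(p["dw"] * x + p["db"])

def d_loss(p):
    # Discriminator wants real -> 1 and generated -> 0
    x, z = real_sample(), noise_sample()
    return -math.log(d(x, p) + 1e-9) - math.log(1.0 - d(g(z, p), p) + 1e-9)

def g_loss(p):
    # Generator wants its output scored as real
    z = noise_sample()
    return -math.log(d(g(z, p), p) + 1e-9)

def train_step(p, loss, keys, lr=0.05, eps=1e-4):
    # Finite-difference gradient descent (a stand-in for real backprop)
    state = random.getstate()
    for k in keys:
        random.setstate(state)
        base = loss(p)
        random.setstate(state)  # same samples for the bumped evaluation
        p[k] += eps
        bumped = loss(p)
        p[k] -= eps
        p[k] -= lr * (bumped - base) / eps

for _ in range(2000):
    train_step(p, d_loss, ["dw", "db"])  # discriminator step
    train_step(p, g_loss, ["gw", "gb"])  # generator step

fakes = [g(noise_sample(), p) for _ in range(200)]
print(sum(fakes) / len(fakes))  # should have drifted from 0 toward 5
```

The important part is the alternation: the discriminator sharpens its boundary, the generator chases it, and the generated samples get dragged toward the real data. That tug-of-war, scaled up to deep networks over pixels or audio frames, is the deepfake building block.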
Going back to the generative adversarial network: you've got the random noise generator, and one interesting side effect of this is that you can tell what the training set visually looks like by running random noise through the network and just taking the image out of it. How many of you have seen Google's Deep Dream project? Deep Dream ends up looking like deep nightmares, but what it's actually showing you is what the model thinks the ideal form, in this case of a dog, looks like, and you'll notice that every facet has a nose, and eyes are kind of a part of it, ears maybe less so. But so
anyway, you run random noise through it, the discriminator is very indiscriminate, and that's how you end up with images like this. I would argue the biggest attack vector for audio is social engineering: personality being the user expression, the UX, of audio. There's a concern about how children interact with Alexa, and there's a story that I want to get to: Soupy Sales, 1965. Who's ever heard of Soupy Sales? That's awesome. I just came across this, so for those of you who either didn't live through it or haven't read about it: there was a guy, Soupy Sales, back on what was a kids'
program, and he basically told the kids one New Year's morning that their parents had partied really hard last night, so go in there, get those green pieces of paper that have men's pictures on them, and send them to him. Now, he claimed he got eighty thousand dollars from this. He did get fired from the network, but I don't think he got eighty thousand dollars. But think about that in the context of these voice assistants that are in homes. Now, I'm out of time, so I'm going to rush through some of this. Another piece of this is that audio interfaces are being used for counseling.
You've got sarcasm as a service; the reason you would have sarcasm as a service is that you tell it your ideas and it tells you that all of your ideas are crap, and so you end up arguing with your voice assistant, really defending your ideas, and what that produces is better ideas. The point is that these assistants are becoming more human-like as we get more data, train them, and improve the NLP processes. The bottom line is: don't trust the voice in your head. And in fact, going back to the amount of source data that it takes to clone a voice, researchers out of China have recently put out a paper showing that it takes
about 3.7 seconds of audio to clone someone's voice. So I leave you with that, and I want you to basically mistrust your Alexa as much as possible; treat it as an AI device that is always there in your house. But, I mean, that's just what I've told my kids too. All right, we've got prizes here in just a second. I've got a slides recap, and I'll tweet it out later for you guys. All right, we have three prizes. I'm drawing a blank right now, so does anybody remember... oh, questions? Yes, sir.
Right. The question was: won't 3.7 seconds give you the exact same voice? It has to do with the resolution of the voice that you're cloning, right.
Right, yes. So he's saying that you could clone the voice, but you can't clone the mannerisms; that's another issue in itself. You're right, and there is a style transfer as far as the NLP part of it, and there's a whole fascinating field of, for example, cloning child predators' mannerisms, how they speak, and applying that back. So there are two problems there: there's the voice itself, and then how the voice is used is the second part. So, to give away prizes, because I know that we're tight on time: first prize for the first person that can tell me the technical term for how voice is cloned, the type of network that
is used for a clone. Yes! Yes. The second one would be... oh yeah, yeah, another activation word. The gentleman up there, yeah. All right, last one. You have a choice, or a preference? All right. And
because I don't work for Amazon and I don't have Alexas.
All right, thank you. All right, one last thing to point out: I ran through a whole bunch of source material, and I do keep an awesome list of all the references. So if you're interested in the actual projects for doing this, come see me here or email me. Thank you, guys.