
Okay, welcome to this talk. Our speaker here is an engineering manager at CrowdStrike, and he'll be going over various aspects of audio security. I'd like to take a moment to thank our sponsors here, the Kennesaw State Department of Information Systems, the NSA, and the support of HelpSystems. So, without further ado.

So I'll be talking about audio security, but before I begin I just want to take a quick poll: what attracted you to come to this talk? There are several others going on, some really good ones. What do you think of when you think of audio security? I heard a couple:
"Surveillance." Surveillance, yeah. "Privacy." Privacy, all right. So, confession: I made it up. There's no such thing as "audio security," but there should be. The reason there should be is that audio systems are showing up in all kinds of places; they show up everywhere, and I would argue this is something that's really just in its very beginning phases. They're not going away. If you go back through computing history, the ability to just talk to your computer is something that's baked in all the way through our culture. How did people interact with KITT, the car from Knight Rider? They talked to it. How did we interact with Data on Star Trek? Talk. This is baked in, all the way through to a really old Apple ad where the whole ad is someone talking to the Macintosh and having it do things. And the ad wasn't completely far-fetched; they had a little of it working even then. On the old Newton device from Apple, many years ago, you could at least tell it to set a date. So the ability to interact with a computer by voice is a really old promise. Why, then, have all the pieces only come together recently to make it something that works?
So with that in mind, as a good security researcher you should think about the ways a device like this can be misused and abused, plus the privacy issues and everything else. Most people, when they think of audio security or audio devices or voice assistants, think of the physical, special-purpose voice assistant devices: the Amazon Echos, the Google Homes, and then, not really special-purpose, the HomePod, the Apple speaker with Siri baked in. But there are a whole bunch of others too. They really just consist of microphones, some thin processing, and a speaker of some sort; that's it. And on the IoT side, they're actually the best case of IoT: they have end-to-end encryption and the firmware is locked down. There was one version of the Amazon Echo, the very first version, where you could tap onto the board and get root access; that's how we know they're based on an Android operating system and have APKs installed. You can't install any APKs yourself unless you're Amazon, but other than that, the device is really locked down; there's not a whole lot else going on there. So what else is there to talk about? I want to introduce the idea of voice assistants as being disembodied. Think about Star Trek, where you're walking along the bridge and you tell the
computer to do a thing. That is a voice system, and it's not limited to the physical device; the device is really just a focal point for you to speak at. The capabilities are built into a whole bunch of other things: cars, phones, remotes, all over the place. They're showing up in public areas, which are starting to take a voice-assistant approach too. They're even showing up in ChatOps systems. For those of you who aren't aware, ChatOps is taking a chat system like Slack, letting it interact with your corporate tools, Jenkins, or some other servers, and then adding a voice overlay on top: just tell the bot to go build this. In other words, they're used more and more as shortcuts for helping us do the things we want to do. There are all kinds of apps built on the phone where your voice triggers things. Interesting side note: Siri used to go by a different name until the marketing team thankfully came along and said that's not how you sell a product to people. Which brings up HAL. Anybody familiar with HAL? No? Not familiar with 2001: A Space Odyssey either? I recommend it,
because that's one of the original talking computers; it's where "I'm sorry, Dave" comes from. HAL ends up locking the guy outside the airlock and trying to kill him, right? Which is why it's not a good idea to name your voice assistant after HAL. So, one thing we can say generally about voice assistants is that the Platonic ideal, and we are talking about ideal forms here, is really a combination of three different machine learning models. I know "machine learning" might give you some heartburn; it gets used in all kinds of contexts. But this is really the reason voice assistants are popular now and weren't when they were first conceived, which goes back even to the '50s: Bell Labs had something called the Audrey system that tried to detect spoken digits. That's all it was really designed to do, but it was a primitive sort of voice-to-text. Another bit of trivia there: the person who created Tetris used to work for the KGB before he came to the US, and he said one of his projects was speech recognition for a fighter plane, so the pilot could tell it to do things, because in high-G maneuvers it's harder to move your arms, but sound waves still travel at around 700 miles an hour, so you can tell the plane to do something when you can't reach the controls. I don't know that they ever got it implemented, but the notion goes all the way back to the '60s. Today we have the processing power to make this come together, and what we really have is the nascent phase of a usable voice assistant. There are still some huge gaps, but it's at least useful enough that my kids can use it. So, anyway: three different machine learning systems. Voice-to-text, the natural language processing part, and then text-to-speech.
Each one has its own threat model, and I only want to talk about the first two, because I've presented this at different times, and trying to even lightly cover three different machine learning models in one talk is way too much. So, just the first two. To begin with, we have to pick apart the acoustic landscape. Think about what we've actually experienced here in this classroom just since this talk started: there are multiple different sound events and sound occurrences in this landscape going on at the same time. We've got two different hallways with their own acoustic spaces bleeding in because the doors are slightly cracked, and then each one of you is your own source of sound, like when I'm sitting next to my kids in church. There's all this stuff going on in one audible space. So the first thing we have to do is pick apart that audio space, zero in on the sound events we're interested in, and throw the rest away. That sounds very simplistic when I say it, but actually doing it in processing is really hard. For instance, Bose and others make headphones with a voice-isolation feature built into them that tries to zero in on your voice.
The way that works is it picks apart the space and then tries to zero in on your voice, and one of the most hostile places you can take those headphones is an airport. I was talking to a colleague of mine once: while he was talking, the headset zeroed in on his voice, but when he stopped talking, it re-ranged and found the announcer's voice, or the person standing next to him, but mostly the announcements. Point is: lots of stuff going on, really hard to process. Now, in the whole sound landscape, the voice occupies a fairly narrow band: a male voice roughly 100 to 900 Hz, a female voice about an octave higher, with the core intelligibility band running from around 350 Hz up to 3 kHz. Bit of trivia: the male fundamental is typically around 100 Hz; James Earl Jones has something like 50 Hz as his natural range, and the lower your range, the more you have that radio voice. The human ear, meanwhile, has a really wide range, roughly 20 Hz up to 20 kHz. So, quiz time: who cares the most about the range of voice frequencies? What industry, which device manufacturers? Hearing aids? Definitely. And phones: in fact, cell phone manufacturers in particular have some of the best white papers on this; they try to really narrow in on voice. Does anybody know why? Any guesses? It saves bandwidth, exactly.
Now, you're tempted, as someone operating in this space, to cut out everything above 3 kHz, because the average female voice only goes up to about 3 kHz; maybe you cap it at 4 kHz or so. But the nature of speech is that it presses up into the higher frequencies, so if you don't include them, you lose things like consonants and the speech becomes harder to understand. So we generally need to capture the entire range of speech, and maybe a little higher too, which comes into play a bit later. Our brains are wired for this, though: the audio landscape is something we naturally process and have been processing forever, so our hearing plays little tricks that fill in parts of the frequency range we don't actually hear. And the last part of this: pulling sound events apart is inherently lossy. How many of y'all have seen a waterfall diagram of sound, or anything like that? It takes the entire spectrum and plots it on a graph, and even plotting it on a graph loses information by itself. The point is, the more you hunt for a signal, the more that signal degrades.
Going back to that trick of the brain: this is the Alexa Super Bowl commercial, one of the first commercials where Alexa was not actually triggered. There had been a whole rash of Alexa commercials and news reports triggering people's Alexas. There was a funny case of a news story about a kid who was able to buy things on their parents' Alexa; the story itself contained "Alexa, buy this," and the news report then triggered the Alexas of all the people listening to it. So there's this potentially infinite loop of Alexas buying things and people reporting about it. Amazon came up with a solution: they let commercials and news reports know that if you cut out a certain range of the frequencies, people will still hear the word "Alexa," but the device will not pick up on it, because the model is not trained for it.
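As a rough sketch of what that kind of notch filtering might look like, assuming Python with numpy and scipy; the actual band isn't public, so the 3 to 6 kHz range below is a placeholder assumption:

```python
# A rough sketch of the notch-filtering trick just described, assuming numpy
# and scipy. Amazon's actual frequency band is not public, so the 3-6 kHz
# band here is a placeholder assumption.
import numpy as np
from scipy.signal import butter, filtfilt

def notch_band(samples, fs, lo_hz=3000.0, hi_hz=6000.0):
    """Band-stop filter: remove lo_hz..hi_hz while leaving the rest intact."""
    b, a = butter(4, [lo_hz, hi_hz], btype="bandstop", fs=fs)
    return filtfilt(b, a, samples)

# Usage: a clip with 440 Hz and 4 kHz components keeps the 440 Hz tone,
# while the 4 kHz component inside the notch is suppressed.
fs = 16_000
t = np.arange(fs) / fs
clip = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 4000 * t)
filtered = notch_band(clip, fs)
```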
It's an easy way to just bypass the triggering. There's an interesting implicit security implication here, too: you could be tricked into hearing something that the device is not processing, because our ears are tuned to fill in that information. In other words, it would be a way to run a covert channel. Now, the way you pick apart the signals in an audio space is basically what's called a fast Fourier transform. You take the time domain, the lumpy combined signal over here, and you pick apart all the different component signals over there that make it up.
If you combine all of these component signals, you get this one; if you pick it apart, it gives you these. That is the heart of audio processing. It's actually the heart of video processing, image processing, and just about any other signal processing too, but especially audio processing. So we've got the frequency graph here that most people are used to looking at, but what we're looking for is specific audio events. Yes? "So the microphone is just going to give you changes in voltage as it vibrates, and that gives you the picture you showed in the time domain, and then the fast Fourier transform picks out the individual frequencies that had to be combined to produce it?"
Yes. Let me go back, actually. A microphone captures samples of the audio space across whatever frequency range it's tuned to. Not all microphones pick up the same range, but whatever range a microphone is tuned to, it captures a snapshot in time of what it hears at those frequencies. Each snapshot is basically a frame of data: the intensity of the signal at each frequency at that instant. For digital signal processing of speech, a typical rate would be around 16,000 samples per second. So what we're doing is taking that stream, 16,000 samples a second, and out of several seconds, or more likely milliseconds, of it, picking out what signals we hear. That's the other piece of audio processing: we take the audio stream, grab a window of sound, usually every 200 milliseconds or so, run a fast Fourier transform on that snapshot, and see what signals are in the space at that time. Does that make sense? If I lost you, message me later. But that is basically the processing step: figuring out what signals are in the space at what time.
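Here is a minimal sketch of that windowed-FFT pipeline, assuming plain numpy, 16 kHz audio, and the 200 ms windows just described:

```python
# A minimal sketch of the windowed-FFT pipeline just described, assuming
# plain numpy: chop 16 kHz audio into 200 ms snapshots and FFT each one
# to see which frequencies are present at each moment in time.
import numpy as np

FS = 16_000        # samples per second
WINDOW = FS // 5   # 200 ms window = 3,200 samples

def spectrogram_frames(samples):
    """Yield (time_sec, freqs_hz, magnitudes) for each 200 ms frame."""
    freqs = np.fft.rfftfreq(WINDOW, d=1 / FS)
    for start in range(0, len(samples) - WINDOW + 1, WINDOW):
        frame = samples[start:start + WINDOW] * np.hanning(WINDOW)
        yield start / FS, freqs, np.abs(np.fft.rfft(frame))

# Usage: a pure 440 Hz tone shows a single spike near 440 in every frame.
t = np.arange(FS) / FS                      # one second of timestamps
tone = np.sin(2 * np.pi * 440 * t)
for when, freqs, mags in spectrogram_frames(tone):
    print(f"{when:.1f}s: peak at {freqs[np.argmax(mags)]:.0f} Hz")
```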
So what we're doing is looking for sound events; specifically, human-voice sound events, if we're building a voice assistant. But not only that: Alexa has this interesting Guard mode you can enable, and the reason you don't have it enabled all the time is that some sound events sound very similar, like glass breaking versus dropping a glass on the floor. Detecting sound events can be very basic: a car alarm is one continual sound; glass breaking is a very similar, or rather a very distinct, sound.
In fact, here's a collection of different sounds and what their plots look like on what's called a mel spectrogram. A spectrogram is what I was talking about earlier: you take samples of sound, run the Fourier transform over them, and pick out the signals you detect in the space at each given time. If you zoom in, it looks like ribbons running down the plot, because it's a chunked collection of the sound space over time. You can either convince yourself you see it, or take my word for it, or reason your way into it, but the graph of each sound fits what it's describing. Take the hydraulic hammer: you can see the rhythm, the fact that it occupies all frequencies, then cuts off, then occupies all frequencies again. With the wind, the milling machine, and the generator, you can see that the generator is more intense than the milling machine and that the wind noise just generally floods everything. The two most interesting ones here are music and talking. Talking is very sparse: unless you're singing a sustained note in a song, your voice doesn't fill the audio spectrum at a constant rate.
That means your voice only makes an impact on the sound waves in an acoustic space for short slices of time, so the sampling in time really matters. Music is another story. Have any of you used Shazam or something like it to figure out what song is playing? Does anybody have a Pixel phone? The Pixel actually has this built in: you can download a chunk of audio fingerprints, about 500 MB, not for every song in the world but for the top ten thousand or so. What those fingerprints are doing is taking the audio space and recording that if you have this frequency here, and then 200 milliseconds later that frequency there, you get a constellation of peaks. That constellation is effectively the fingerprint of a song, and it takes somewhere around 350 milliseconds of audio data to figure out what song is playing. Not a whole lot, and that comes into play later.
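A hedged sketch of that constellation idea, reusing the spectrogram frames from earlier; real systems like Shazam use carefully tuned peak-picking and packed hashes, but the shape is roughly this:

```python
# A hedged sketch of constellation fingerprinting: keep the strongest peaks
# per frame, pair each peak with a peak one frame (~200 ms) later, and the
# (f1, f2, gap) triples become the song's fingerprint.
import numpy as np

def top_peaks(magnitudes, k=3):
    """Indices of the k strongest frequency bins in one frame."""
    return np.argsort(magnitudes)[-k:]

def constellation(frames, gap=1):
    """frames: list of per-frame magnitude arrays -> set of (f1, f2, gap)."""
    peaks = [top_peaks(m) for m in frames]
    prints = set()
    for t in range(len(peaks) - gap):
        for f1 in peaks[t]:
            for f2 in peaks[t + gap]:
                prints.add((int(f1), int(f2), gap))
    return prints

# Matching is then set intersection: the candidate song whose stored
# fingerprints overlap the microphone sample's fingerprints the most wins.
```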
Another thing before I keep going: the Alexa Guard feature doesn't work on every device. The more triggers you add to a device, the more classification it has to do, which is a machine learning operation, and the more processing power it needs; the original Amazon Echoes can't do it. I also listed gunshots up there, because one interesting aspect of audio detection is that you can triangulate where a sound is coming from. OK: voice-to-text is all about what are called phonemes. Phonemes are pieces of speech. Does anybody remember learning to read phonetically, like "these two letters make this sound"? Phonemes are the things we hear, and phonics was a style of teaching built on taking what we hear and transcribing it back to written language. The same thing needs to be done for voice-to-text: what we hear in the audio space are these vowels and consonants and their
combinations. One interesting thing about phonemes is that they change over time, and they change by region. Did anybody have GrandCentral, before it became Google Voice? You did? You're actually the first person I've ever run into who had it. Basically, GrandCentral started the project and Google bought them: the pitch was that you could replace your voicemail with this really cool new technology that would capture it, store it, and transcribe it. What they were really doing was building up their voice-to-text across a broad number of people in different regions, because it turns out that how you say words impacts how machine learning models get built to a really great degree. Google essentially started their voice assistant project over a decade ago; that's why they bought GrandCentral, to help build toward this. I had a particularly Southern friend who would always leave me voicemails, and the transcripts would always get some hilarious stuff wrong. Did anybody have Dragon NaturallySpeaking? Yeah, remember how long you had to read to it before the model was right? You had to read to it for an hour or more before the model got built, but the thing was, that model was built for your voice and no one else's; it was a model trained to you. The trick with modern voice assistants is that they're trained across a lot of voices. Not all, but a lot.
"Doesn't that have to do with how we start out more or less the same and then diverge? What I read was that it's really about the vowel sounds, and that the length of the consonants determines how strong an accent is. They may be different, but the problem I see is: your machine might hear something consistently long or short within one region, but in another region there may be a word that sounds the same and means the opposite." Right, the same word can be pronounced totally differently in different regions. Does anybody have a guess how the model handles that? "I can see how it would be handled if you know where the speaker is from, but the model doesn't know that. You'd have to say a lot of words before it could guess: nowhere else on earth would I expect this sound substituted for that one, so this must be one of these regional accents." Exactly. And one of the features of GrandCentral, later Google Voice, was that when it spat out its transcription of your voicemail, you could go in and edit it. What that gave them was basically reinforcement learning, free labor, reinforcement learning across thousands and thousands of people. And what they also got was
regional information. So with those two pieces together, you can say: somebody calling from New York, for instance, there's a certain set of models that probably applies to that person. So what you do is layer the different models. When somebody is speaking, first you do regional detection, and there are certain markers, like my Canadian friends and the "aboot," that are a dead giveaway, though not all of them have it. Then you apply the right model to turn the voice into text. So it's actually a little deceptive to say that voice-to-text is one entire machine learning system; it's really a collection of them, and one of the layers is figuring out where you're from, based on how you say certain words, and then applying the right model.
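A sketch of that layering in Python; the region labels and per-region models below are hypothetical stubs, not any vendor's actual pipeline, but they show the two layers:

```python
# A hedged sketch of model layering: layer one guesses where the speaker is
# from, layer two applies the ASR model trained on that region's phonemes.
from dataclasses import dataclass

@dataclass
class Transcript:
    text: str
    confidence: float

def detect_region(audio: bytes) -> str:
    """Layer 1: coarse accent detection. Stubbed here; in practice a small
    classifier over phoneme statistics (vowel lengths, 'aboot' markers)."""
    return "en-US-south"

# Layer 2: hypothetical region-tuned speech-to-text models.
REGIONAL_ASR = {
    "en-US-south": lambda audio: Transcript("fixin' to head out", 0.82),
    "en-CA":       lambda audio: Transcript("about to head out", 0.85),
}

def layered_transcribe(audio: bytes) -> Transcript:
    region = detect_region(audio)    # pick the model before transcribing
    return REGIONAL_ASR[region](audio)
```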
What I wanted to bring up here with phonemes is that models trained on US speakers, and really, getting down into it, even on regional speakers, aren't transferable. There's no generalized model where you can say "and now we know Korean, or Chinese"; each language has to be a whole different model, a whole different set of training. And what that means, this is the collection part, goes back to privacy with Amazon and Google. This case came out late last year.
I remember when it hit: a guy asked Amazon, "I want to know what you're collecting about me; can you send me my data?" and Amazon sent him recordings that included somebody else's. Putting on our investigator hats, there are certain things we can conclude from this. One: it is broadly possible to tie someone back to their Alexa device; otherwise they wouldn't have been able to produce any of the data in the first place. Two: Amazon has no interest in separating your audio from everybody else's audio; it's basically all put into one giant S3 bucket. And to be fair, I'm just guessing at the architecture based on this event. Effectively, an engineer at Amazon went "here's the audio in this range of this bucket, here you go." This was not a security breach as such; it was one guy who was asked a question and filled out the ticket incorrectly. It shouldn't alarm us that an engineer made a mistake. What it should inform us of, though, is that the data is stored, and that models are continually run over that data to figure out different properties. So yes, it is recording, and it is storing. I don't know what the retention on that storage is, no idea, but I wouldn't be surprised if it was very long, because you need a lot of data to build these models correctly.
There was also a related case; I don't have a picture of the news article for this one. Amazon employees were recently caught in their internal chat talking about different things they'd heard from customers. Well, if they have access to the buckets of all the information, then of course they have access to the individual information too, and apparently it's enough to amuse yourself with. Related to those two articles, I'll throw out another: there are two active court cases pending in the discovery phase where Amazon has been subpoenaed: give us the recordings from this Alexa, because there's a reasonable belief it was triggered and holds data about a crime that happened. As far as I know Amazon is still fighting them, but yes: it is recording, Amazon employees have access to it, and the degree to which you trust them is the degree to which you should be comfortable having one in your house or car. Now, one more thing before I go into the waterfall idea: the device is always listening, but it's not always streaming what's happening in the room. Notice I said the court case rested on a reasonable belief that the device had been triggered. The device doesn't start recording and shipping that information off until it's been triggered, until the wake word has been said.
Going back to sound events: the Guard feature basically pushes a lot more triggers down to the device; if you hear this kind of event, it's a trigger. But the trigger for almost every device is the wake word: "OK Google," "Hey Siri," "Alexa." Those are the triggers. The reason is that the device can't be doing that much processing; it's a small device, and the hardware at its heart is not that powerful. When it is triggered, that's when it starts streaming, and even then it doesn't capture the entire audio space. There are actually some really cool research articles from the Alexa science side; they have a research division, and one of the things they're working on is anchoring the wake word to the person who's speaking. You may have an Alexa in your house; I usually ask. And I give this talk partly because I love the promise of audio assistants. But in our house, when we go to play music, my kids will trigger the device, "Alexa," and while one is still thinking about what they want it to play, another kid will jump in with "play this," and it plays a different song. The point is, and we'll get to it a little more in a second, the device doesn't know who's speaking; it has no concept of identity. I'll flesh that out more in a second, but for now, all it's listening for is the wake word, and only once it hears that does it send the rest up.
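A hedged sketch of that "always listening, not always streaming" behavior; the frame source, cloud channel, and detect_wake classifier below are hypothetical stand-ins for the device's real components:

```python
# A rolling buffer stays on-device, a tiny local classifier watches for the
# wake word, and audio only leaves the device after a trigger.
import collections

FS = 16_000
FRAME = FS // 5                      # 200 ms of samples per frame
ring = collections.deque(maxlen=10)  # ~2 s rolling buffer, never uploaded

def detect_wake(frame) -> bool:
    """Stand-in for the small on-device wake-word classifier."""
    return False                     # replace with a real lightweight model

def listen_loop(mic_frames, cloud):
    streaming = False
    for frame in mic_frames:         # frames arrive every 200 ms, forever
        if not streaming:
            ring.append(frame)       # idle: audio stays in local memory
            if detect_wake(frame):
                streaming = True
                cloud.send(list(ring))  # wake word heard: open the channel
        else:
            cloud.send([frame])      # only now does room audio leave
```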
When it does send audio up, or tries to pick out who the speaker is, it can send just the fast Fourier transform data, which is actually a very small amount of data, but it's still enough that you can reconstruct the voice on Amazon's systems. So how can we protect the device from listening to us all the time? There's the network concept of a firewall that protects the ports from inbound connections; I propose that for audio security we have the concept of a waterfall. A waterfall floods out the sound behind it, and
that can actually make your device work better: noise helps us isolate sounds, the same way a noise-canceling headset does. I did this recently; we installed an AC unit right behind me, so it floods the space behind me, and after listening to all of this research I figured it would actually make my conference calls better. It did, because it floods out all the noises past it: the street, the kids, everything else. Another note here: silence and noise, as far as digital processing is concerned, are roughly the same thing. What we experience as silence is really just lowered intensity across the audio space. Put all of that together and we can do something
like putting a parasite onto a voice assistant device. What that looks like is this funny hat we have here, 3D printed, with a Raspberry Pi inside. This is Project Alias, if you're interested; all the plans and software are open source. What it does is let you teach it your own wake word, so you can rename your device. It listens for your wake word while constantly flooding the device underneath with sound, and once you trigger it with your wake word, it stops flooding the device, plays the real wake word to the device, and then you can interact with it as normal.
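In rough terms, and assuming hypothetical mic, speaker, and detector objects rather than Project Alias's actual code, the loop looks like this:

```python
# A hedged sketch of the Project Alias idea: keep the assistant's microphones
# flooded with noise, and only on hearing YOUR custom wake word, pause the
# noise and speak the vendor's wake word at the device.
import time

def alias_loop(mic, speaker, detect_custom_wake, vendor_wake_clip, noise_clip):
    while True:
        speaker.play(noise_clip)             # flood the device underneath
        frame = mic.read(ms=200)             # listen for the private wake word
        if detect_custom_wake(frame):
            speaker.stop()                   # lift the noise shield
            speaker.play(vendor_wake_clip)   # say "Alexa" to the device
            time.sleep(8)                    # pass-through window for the command
```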
Most people aren't going to spend the time to build this, so I'm not advocating it as a generalized solution; mostly it was a fun hobby and an excuse I gave my wife for spending a lot of time on it. But I bring it up because flooding devices out with audio still has some usefulness as a mentality; remember the old spy movies, where they sit in the bathroom with the water running? So when you're around a device, think about it: am I the only one speaking, or am I part of a crowd? How easy or hard would it be for that device over there to pick up my voice? That kind of thing. Which
brings me back to speaker identification. The most a device can do is detect that there are multiple people speaking in a space. What it can also do is perform very simple calculations on the frequency ranges of the voices that are speaking: is this voice high and squeaky, like my kids' friends? Probably a child. Is it lower than 350 Hz? Probably a man's voice. Up around 900 Hz? Maybe a woman's voice, somewhere around there. These are probabilistic calculations.
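A sketch of that kind of coarse, probabilistic guess, using a crude autocorrelation pitch estimate in numpy; the thresholds mirror the rough ranges above, and the point stands that this is a guess, not an identification:

```python
# Estimate the fundamental frequency of a voiced frame by autocorrelation,
# then bucket it into rough speaker categories.
import numpy as np

FS = 16_000

def fundamental_hz(frame: np.ndarray) -> float:
    """Crude pitch estimate: the lag of the strongest autocorrelation peak."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = FS // 1000, FS // 50    # search lags between 1 kHz and 50 Hz
    lag = lo + int(np.argmax(corr[lo:hi]))
    return FS / lag

def rough_speaker_guess(frame: np.ndarray) -> str:
    f0 = fundamental_hz(frame)
    if f0 < 350:
        return "probably an adult male"
    if f0 < 900:
        return "maybe an adult female"
    return "high and squeaky: possibly a child"
```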
The reason I spent so much time on the frequency ranges is that the amount of data that's in a voice is not enough to give you confidence, certainly nothing anywhere close to cryptographic certainty, that this is one particular person's voice and not another's. Even if you trained a model using Dragon, a model trained specifically on your voice, it's not enough for the model to say "this is absolutely this person"; the degree of collision between voices is the basic problem. While researching all this, I remembered the movie Sneakers, a very big movie among security people, in case you don't know it. Sneakers has a scene where they're trying to get access to somebody's account, so they hunt for recordings of the
person's voice saying "Hi, my name is... My voice is my passport. Verify me." Whole scenes of the movie are them capturing clips and phrases of the person saying pieces of that sentence. The funny thing is that the technology to clone a person's voice already existed when that movie was made, so all of that splicing was superfluous; nobody would actually have done it that way. After researching this, I wondered: who in their right mind would implement unlocking something with your voice? Well, it turns out a bank would, and specifically my bank. This is their page about how you can access your account on your phone; you just speak to it, and hey, my voice is my passport. We remember the movie, but we don't remember the lesson of the movie. On that same page: "Is my voice ID secure?" Sure, you have a proprietary algorithm, but you can't bend the laws of physics to put more bits of information into a voice and make it cryptographically secure. This is a dumb idea. And lest you think my bank is the only one that did this, multiple banks are in this business, including HSBC. The fun thing about HSBC is that somebody decided "this is a bad idea," went out, and proved it. Not only did they do it, they did it on
live BBC. They brought in the reporter's twin, and the twin got in. I love this picture because it kind of captures the whole thing: this was a dumb idea, protecting my bank account with my voice. I'm not going to go into all the other examples, but I will leave you with this: any time you see voice being used as an access control, run. I don't want to be too strong in the denunciation, but no. Pindrop is a great local company; they're researching the space of how to do this properly, but even for them it's in conjunction with other factors, as a second factor. I love Pindrop, but even they won't claim you can totally secure an account with
voice as the single factor. Yes? "Catch me afterwards, but how do we do it ourselves? When I'm listening to you, I've heard you speak before, I know your voice; even without seeing you, I pick up on cues." Yes, although I wouldn't assert that it's cryptographic certainty. "But when I hear a voice I recognize, I know it." Yeah, we know the characteristics of a voice. What I'm saying is that you broadly know it's me, and not my wife, and not some completely different person you've met; you have a kind of fuzzy matching. My point is that something else can come along and duplicate that voice. There's a whole other section of this talk about duplicating voices, and that's where the cryptographic-certainty point comes in: you think you know it's me, but once voices can be cloned, we don't really know. So, moving on, because there's a lot to get through: I also found recently that the Pixel lets you unlock your phone with your voice. Another bad security design, but at least Google... well, when you go to turn the feature on, they at least tell you that it makes your device a lot less secure. And then last month they completely removed the feature
because it's completely insecure. It's funny how these large companies keep putting out voice security features and then walking them back. All right, NLP. I don't know if I can do this in three minutes, but we'll try. This is how your Alexa utterance gets broken down: your wake word; your launch phrase, which is an application invocation; then the action and the variables. Wake word, application, action, variables: that's how it works. What you're doing is saying some of this, and the NLP on the Alexa side cheats and fills in the other information. So if you say "Alexa, play" and a song name, it cheats and decides: what you want is this music application, with the song as the variable.
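As an illustration only, here's that taxonomy as a toy parser; real Alexa NLP is statistical and much fuzzier than a regex, and the app and phrase names below are hypothetical:

```python
# Wake word, launch/invocation, action (intent), and variables (slots).
import re

UTTERANCE = re.compile(
    r"^(?P<wake>alexa|ok google|hey siri)[,;]?\s+"
    r"(?:(?P<launch>ask|tell|open)\s+(?P<app>\w+)\s+(?:to\s+)?)?"  # optional invocation
    r"(?P<action>\w+)\s*"
    r"(?P<variables>.*)$",
    re.IGNORECASE,
)

def parse(utterance: str) -> dict:
    m = UTTERANCE.match(utterance.strip())
    return m.groupdict() if m else {}

# wake='Alexa', launch='ask', app='MyJokes', action='tell', variables='a dad joke'
print(parse("Alexa, ask MyJokes to tell a dad joke"))
# No launch phrase: this is where the real NLP "cheats" in a default app.
print(parse("Alexa play some jazz"))
```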
To quickly go over what that means: companies tightly control the taxonomy, and to craft a better user experience they bake these cheats in. Basically, you don't know what app you're talking to, and this is a real security problem: you don't know what app you've triggered. When you ask Alexa to tell you a joke, you don't know which app is actually providing the joke, which matters when you want a joke from the broad pool of jokes and not a continual stream of Jimmy Fallon jokes, which is what Alexa started doing for a while. There's no way to tell. There's also no sense of context, no notion of a session. Can anybody guess the security implication of that, of having no session information? Your device was
authorized once, for your account, and anybody who is able to speak to that device has that authorization level. There's no elevating or lowering of authorization. Some of Alexa's features now let you broadly gate things on whether an adult is speaking; what they're trying to do is filter out children, but it's not super secure. Also part of the NLP story: there was the family that had their conversation recorded and sent to a contact. That was not a security flaw baked into the Alexa. What happened was that the NLP picked up on the wake word, then matched something that sounded like "call," then matched something that sounded like a contact's name, and after that the rest of the conversation was just an open channel. The model was trying to be super helpful, picking up all these keywords from a very long distance; that's called the far-field problem. So we talked about regional dialects as one issue; another issue is what speech sounds like across the span of a room, up to 30 feet away. Amazon's response to the incident was to limit the field the device listens across, and the effect of that is that now I have to yell at the Alexa in my house rather than being able to
just talk to it. That's how they dealt with the far-field problem. So, to sum up, and there's more to say about all of these pieces than one talk can do justice to, so if you want to catch me later, feel free; I've got business cards and everything. Three areas I want you to consider. One: voice identity and authorization is not something that audio systems do, or, I would argue, can do, given the physics of how much data is in a voice. Two: the apps on the Alexa side don't have a way to identify themselves to you, so you don't know which app you're talking to, and it's very easy to exploit that; there are research papers on injecting malicious apps into the skill store, although Amazon says they're very diligent about policing it. And those apps are conveying sensitive information, we never even got to that part, across a medium that we are designed to just hear and process. And the last one: privilege separation. There's basically one session token for the device, and it's a forever session. So with that, thank you all for coming. You've been asking questions all along, but are there any other questions we didn't get to, real quick? I don't want to be too unkind to the next speaker. Yes?
"OK, so for authentication: I've seen another company do this where they give you a string of numbers and say 'repeat these numbers,' instead of a set phrase that somebody could record and piece together. Is that a better security solution?" Basically, any audio you rely on, if voice is the single factor, can be spoofed; it can be replayed. "But you'd have to have... you mean a capture of the voice? You couldn't just replay it, because it's random, right? They say repeat these random numbers, so you'd have to piece a recording together and then play it." Not really, because the second half of this talk was all about voice cloning, and the nutshell there is: give me ten minutes or so of sample data of your voice, and I can play back whatever I want in it. "Oh, so you can just make it up." Right. "Of all the things you listed that need to happen, do you think they will?" That's an area of intense research. Yes, absolutely, sir. Absolutely, yes. Thank you all.