
Using Deep Learning to Undermine Tor

BSidesROC · 2019 · 18:04 · 149 views · Published 2019-03
About this talk
Website fingerprinting attacks can reveal which websites a user visits even over encrypted connections like Tor. This talk presents Deep Fingerprinting, a convolutional neural network attack that achieves over 98% accuracy against Tor and defeats recently proposed defenses including WTF-PAD and Walkie-Talkie. The work demonstrates both the power and the risk of applying state-of-the-art deep learning techniques to traffic analysis.
Original YouTube description
Talk Description: Website fingerprinting enables a local eavesdropper to determine which websites a user is visiting over an encrypted connection and can even reveal information sent over the Tor anonymity system. In this work, we present Deep Fingerprinting (DF), a new website fingerprinting attack against Tor that leverages a type of deep learning called Convolutional Neural Networks (CNN). The DF attack attains over 98% accuracy on Tor traffic and can even defeat some recently proposed defenses against website fingerprinting. The success of this attack shows the value of deep learning techniques in security applications.
Bio: Matt Wright is the Director of the Center for Cybersecurity at RIT and a Professor of Computing Security. He graduated with his PhD from the Department of Computer Science at the University of Massachusetts in May 2005, where he earned his MS in 2002. His dissertation work examined attacks and defenses for systems that provide anonymity online. His other interests include adversarial machine learning and understanding the human element of security.
Transcript [en]

Better get this show on the road. My name is Matt Wright. I am the director of the Center for Cybersecurity at RIT, and I'm gonna talk about how attackers could read your encrypted traffic and what maybe we could do about it. Now, what I just said might give you a little pause: wait a second, you said encrypted traffic and attackers reading it, that's not supposed to happen. But consider the case of Shelley. Shelley, you see, has an embarrassing problem: Shelley has athlete's shell. And as many of us do when we have a medical issue, she goes online to check out her symptoms and see what she should do about it. So of course she goes to

turtlehealth.com, as we all would, right? And of course she's using an encrypted connection because she wants to protect her privacy. And it's a good thing too, because her snoopy neighbor Sheldon is trying to sniff her wireless connection and read the traffic. But hey, it's encrypted, so all good, right? Unfortunately, turtles have more dangerous attackers to deal with. And what is it that Shredder is going to do? Shredder is going to create his own connections to turtlehealth.com, and he's going to be doing this to perform an attack called website fingerprinting. The attack takes two steps. First is this preparatory step where he makes these connections himself, and he goes to, say, the shell page,

the athlete's shell page, and one for, like, curled tail disease or whatever it is. When he goes to the athlete's shell page, he sends HTTP GET requests, gets some responses, sends some more requests, gets some more responses. But when he goes to the curled tail disease page, he gets a different pattern of requests and responses, different numbers of packets going back and forth. These create two patterns, P1 and P2, and he can save some information about these patterns into a database. So that's step one. Step two is that he can actually perform the attack on Shelley. So Shelley is connecting to turtlehealth.com, Shredder is again sniffing the wireless traffic, and he's got his database, P1

and P2, and he sees that this traffic going back and forth that Shelley is sending to the website is a match for P1. And he says, oh, okay, well, that means that Shelley probably has athlete's shell. And it turns out that this attack is pretty effective: in the prior literature it's got some 90-plus percent accuracy. To convince you that this is an important attack, let's consider the different possible threats to Shelley's privacy here. One is that it could be someone sniffing her wireless traffic, like I said. It could be someone who's compromised Shelley's DSL or cable modem router, and of course we know how vulnerable these are with their

default passwords and not being updated very frequently. It could be her ISP, right, or her workplace. And of course the US Congress a couple years ago said, hey, ISPs, it is okay to spy on your customers, oh, and go ahead and sell that data to other companies if you like. It could be any of the networks between her and the website, including again the ISP, this time over on the website side. So all of these are potential threats that could use the website fingerprinting attack. So I would recommend having a defense, and the defense that I would suggest would be Tor. How many of you know what Tor is? Great. So we won't go through the whole thing, but

the client is making these connections through a guard, middle, and exit node to get to the web server. That doesn't mean we completely prevent the attack. The attacker in this case could be anybody sitting between the client and the guard, so almost all of the same attackers, except for the ones that are close to the web server over there. These attackers between the client and the guard are the ones who can see the identity of the client, right, the IP address of the client, but otherwise should not be able to see what website it is that Shelley's going to or what it is that she's looking at online. The way the website fingerprinting attack works on Tor is there's still

the same two-step process. Shredder's gonna sit there and sniff his own connections through Tor to these different websites and generate a series of traffic patterns. He's going to then feed them into a machine learning algorithm to train up a classifier. So that's step one of the attack, the preparing to do it. Now notice one difference in the attack: before, we were looking at a specific website and pages in the site. Tor definitely makes the attack harder. One of the things it does is it breaks all the traffic up into 512-byte chunks, so all the packets are basically the same size. That's definitely going to make the attack

harder. And what the attacker is interested in here is not which page within the website you are going to, but which website you are going to, right? Which is definitely going to be an easier attack, except that we're trying to do it over Tor. Step two of the attack: again, here's your user going over Tor to these websites. Shredder is going to sniff on that connection, see this traffic pattern, and use the classifier to determine what website it is that Shelley is going to, in this case WebMD. So it turns out that this attack has been around for a while, and it's been getting better and better as people have used different

machine learning classifiers and different techniques to isolate the information. In consequence, recently a couple of folks have proposed different defenses that could be used in Tor to improve Tor's defense against these attacks. One of these is Walkie-Talkie. The idea of Walkie-Talkie is pretty clever. You've got two websites, site A and site B. You're going to take their traffic patterns and merge them to create one single traffic pattern, and whenever the user goes to either site A or site B, you get that same traffic pattern. So in theory you shouldn't be able to distinguish between the two sites, right? So the maximum possible accuracy should be about 50%. This adds some overhead to Tor, which is

already high in overhead, and it adds a 34% additional delay to Tor, and Tor is already pretty slow. If you've ever used it, you already know that. So it does add some additional delay to Tor in order to provide this defense, but it works, or at least it has worked in the past. Another defense is called WTF-PAD, and this is something we actually worked on with the Tor Project. The idea is fairly simple. You have real bursts of packets: if you look at web traffic, you've got really bursty behavior, a bunch of traffic, then less traffic, then more traffic. And if you look at one burst and then you look

at another burst, there can be a gap in the middle. WTF-PAD detects those gaps and fills them in with a dummy burst, like a fake burst of packets. This adds a little bit more overhead, but it doesn't specifically add delay, and it is considered the main candidate to be deployed in Tor, so it is something that is important to consider and study. So this is where we come in. The point of today's talk is to talk about Deep Fingerprinting. This is work with my student Payap, my former student Mohsen, and our colleague Marc from KU Leuven in Belgium. So you've all probably heard about deep learning. It's the new hotness and

buzzwords, blah blah blah. Let me just give you a quick refresher on why deep learning is such a hot new thing these days. So there's this dataset called ImageNet, and there are 1.2 million images in 1,000 categories, about 1,200 images per category. And they're things like, okay, there's a horse, a table, a car, a ship, a banana, right? Can a machine learning classifier detect the difference between these things? Can it guess which one it is? In addition, there are 120 breeds of dog. This is one breed of dog, that's another breed of dog, that's another breed of dog. If you can tell the difference, you're probably better than me.

The machine learning classifier's probably better than all of us these days. The reason that deep learning has gotten so hot: in 2011, that's the green bar on the far left, this is the error rate for giving the machine learning classifier five guesses. You've got five guesses; if the right class out of the 1,000 classes is among them, then you get it right, and otherwise you get it wrong. In 2011, the machine learning classifiers were getting it wrong 26 percent of the time. The next year, the first effective deep learning classifier came in and dropped that rate by almost 10 percentage points. It kept on dropping. We get to 2014. In

2014 it got to 6.7%, and one of the researchers in the field said, you know, that's really impressive, but I think that humans can still do better. And he trained himself. He sat there spending hours and hours looking at image after image to see if he could get really good, and he got it down to five percent. All right, humans are still better than computers. You know where that story goes: the next year, 3.6%, and it keeps getting better, right? So we have these algorithms that are really good at this specific task. They're superhuman, in fact, at these very precise, well-defined tasks. So can we give the website fingerprinting attacker a well-defined task that will also do the

same type of thing? To do that, we need to have a good data representation. The data representation we have is very simple. Remember that I said in Tor all the packets are the same size, essentially. So what we can do is just look at direction: we have these outgoing bursts of packets from the client to the server, and we have the incoming bursts from the web server back to the client. We say, okay, if it's outgoing, we say plus one, and if it's incoming, we say minus one. And that's it. That's the entire data representation, and that, it turns out, is all we need. Now, there has been some prior work in using deep learning to do

website fingerprinting. A CNN is a type of deep learning classifier, and some of the folks were using the one on the left, a design based on AlexNet from 2012. But a lot of stuff has happened since then, and the accuracy of these things has gone up. By 2016 you have this Inception-v4 network that was getting, I'd say, 80% accuracy on the ImageNet set where you only get one guess at a time. So given that, maybe we can take advantage of some of those advances and put them into our classifiers for website fingerprinting as well. In addition, we are interested to know, well, there are

these new defenses that I talked about. Can we break them, right? So essentially you can think of it as: if you have an image, and now with the defenses you've distorted the image, can you still figure out, oh, that's a dog in the upper left-hand corner, and that's unknown? So the Deep Fingerprinting model: I'm not going to go into any of the details because I don't really have time, but basically we're just using some of the latest state-of-the-art machine learning techniques that are used in image classification and bringing them over to this traffic analysis world. One of the things that we use is deeper networks with more layers, and in each layer

more stuff is happening. A good idea of what's going on: in images, you see on the far left you have the low-level features that are in the shallow layers of the network, and that's looking for, like, the edges between light and dark or one color and another. As you get deeper into the network you get more and more complex things. By the third panel there, you're starting to see stripes or swirls or colors or curves, right? And by the far right, you can actually see, in the bottom right-hand corner, those are wheels. You actually have visual features that you recognize as a person, and those are the same

features that the image classifier is using to pick out which class these things belong to. The same thing can be done for network traffic. On the far left, our low-level features are just the individual packets being sent back and forth, but as we get deeper into the network we can look at the whole pattern of traffic, the ups and downs, and say, you know, this is something recognizable for this website versus another website. So how does it do? Well, I wouldn't be talking to you now if it wasn't really successful, right? So we do get 98.3 percent accuracy against just vanilla Tor, the existing Tor as it is today. And of

course that in itself is dangerous. Now, you can see that the prior state of the art was getting some 97 percent, and I'm just gonna argue, well, on the one hand, yeah, that doesn't look like a big leap. On the other hand, in the real world, when you're dealing with many more potential sites than we have in these lab conditions, that extra accuracy can actually make a difference. But perhaps more importantly, we also studied these things against defenses, and we have these two defenses. One is WTF-PAD, and in our lab experiments we're getting 64 percent bandwidth overhead, so really high bandwidth overhead. Even with all of that extra dummy traffic, these fake

bursts that are being added, we're able to get 90.7 percent accuracy. So the defense that is most intended to be used in Tor does not look very safe right now. What about Walkie-Talkie? Well, Walkie-Talkie does better. Remember, it merges these two sites, so the best theoretical maximum accuracy is 50%. We get 49.7 percent. I think what's maybe worse is that if you give the classifier two guesses, it basically nails it: it's got 98.4 percent accuracy in that case. So the Walkie-Talkie defense I also wouldn't really recommend. All right, so in conclusion, we've looked at Tor and we've looked at applying these deep

learning models to do the website fingerprinting attack on Tor. It turns out it works pretty well. A deeper deep learning model is more effective, using all the state-of-the-art techniques, and it also can be effective against distorted traffic, essentially like these distorted images over here, still being able to recognize which website is which class. And with that, Shredder is back. And a quick shout-out to the National Science Foundation, which funded all this. I'll take any questions. Yes, sir?
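
The data representation described in the talk (plus one for an outgoing packet, minus one for an incoming packet) and the "N guesses" scoring behind the reported results can be sketched in a few lines of Python. This is a minimal illustration, not the speakers' code. The `(timestamp, is_outgoing)` capture format, the function names, the fixed `length`, and the zero-padding of short traces are all assumptions made here; the talk does not specify them.

```python
from typing import List, Sequence, Tuple

OUTGOING = 1    # client -> server, "plus one" in the talk
INCOMING = -1   # server -> client, "minus one" in the talk
PAD = 0         # assumption: short traces are padded with zeros


def to_direction_sequence(
    packets: Sequence[Tuple[float, bool]], length: int
) -> List[int]:
    """Turn (timestamp, is_outgoing) pairs into a fixed-length +1/-1 list.

    The tuple format is hypothetical. Timestamps are ignored because the
    representation in the talk keeps only packet direction.
    """
    seq = [OUTGOING if is_out else INCOMING for _, is_out in packets]
    seq = seq[:length]                      # truncate long traces
    seq += [PAD] * (length - len(seq))      # pad short traces
    return seq


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of traces whose true site is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# A toy trace: one request out, two responses in, one more request out.
trace = [(0.00, True), (0.05, False), (0.06, False), (0.10, True)]
print(to_direction_sequence(trace, length=6))   # [1, -1, -1, 1, 0, 0]

# Toy classifier scores for three sites (0, 1, 2). Sites 0 and 1 look
# almost identical, as they would under a defense that merges two sites.
scores = [[0.45, 0.44, 0.11], [0.45, 0.44, 0.11]]
labels = [0, 1]
print(top_k_accuracy(scores, labels, k=1))      # 0.5
print(top_k_accuracy(scores, labels, k=2))      # 1.0
```

The toy scores mirror the Walkie-Talkie result in the talk: when two sites share one pattern, a single guess is right about half the time, while two guesses recover the pair almost every time.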

Right, so we actually have a couple of techniques. Oh yeah, so the question is, do you have any other defenses, since these two are broken? We are working on a couple of things. It turns out there are ways to trick deep learning mechanisms. So there's this really cool stuff, like you can take a stop sign, you can put some weird stickers on it, and you can make it so that the deep learning classifier, let's say it's in a self-driving car, looks at that and goes, oh, that's a 45-mile-an-hour sign. So that should probably also give you pause. But yeah, so there are potential defenses against these things. Any other questions?

So the interesting thing about this: if you have a GPU, although usually, you know, with your laptop GPU, the way things are set up right now, it's a little hard to get the deep learning models to actually use the GPU. But you don't need a very fancy machine. If you've got, let's say, twenty-five hundred dollars to put together a machine, you can get a reasonable machine with a $1,000 or $1,500 GPU in it, and it'll work. The GPUs are really well designed to deal with this, and NVIDIA, their main product these days is not, well, I mean, I'm sure there's

a lot of gaming, but there's a lot of deep learning. That's a huge product. All right, well, thank you very much. [Applause]