
Hi, my name is Devin Gaffney. I run a company called International Persuasion Machines here in Portland, Oregon, and we work at the intersection of security, trust and safety, and influence operations: misinformation, abuse, manipulation by bots, those sorts of things. I'm here to talk to you about reverse engineering how WAFs (web application firewalls) like Cloudflare identify bots.

The basic wager of IPM is that there's a turn toward a sort of socio-technical security happening over the next few years. That is to say, as systems become more participatory, where people interacting with one another is the most important thing happening on a system, new incentives for exploits emerge that aren't purely technical in nature, but instead rely on hacking how we perceive these systems and how people interact with them. That's the umbrella way of thinking about, say, the Twitter election-misinformation stuff: it's a socio-technical security problem, not just a security problem.

If you want to talk about different types of fraud, there's a huge ad-fraud business out there. Uber spent, I think, 150 million dollars buying ads, then stopped spending 100 million of it, and nothing happened: there was no drop in their effective reach, because the rest of that spend was just being soaked up by bots clicking on links to make it look like lots of people were looking at things and make everybody on the team feel good about themselves. So this is a weird, nebulous space where our expectations about how machines work, and about what we're going to get in these online social spaces, can be hijacked. There's also a great paper, I think the only paper that talks about this in specifics; here's a heavily edited quote from a very long paragraph in its introduction.

There are a lot of examples out in the world: fake Yelp reviews, fake Twitter profiles, Cambridge Analytica stealing data and then exploiting it, people using bots to post comments on Amazon reviews, anything related to NFTs, anything related to crypto. You see lots of pump-and-dump schemes where you have masses of what look like real, genuine people bidding on real, genuine products, and then, once you've bought the thing, that whole crowd disappears.

One of the basic tools people use to stop bots is the web application firewall. There are companies like Cloudflare, PerimeterX (which I think got bought by HUMAN last week or so), Imperva, a bunch of companies that basically do the same thing: they produce a little page that says, "Hey, slow down, wait a second, something seems a little sketchy about you." Maybe it's something about your browser settings, or the fact that you're coming from a data-center IP, or whatever. And there's a whole industry of folks building systems to circumvent these tools and get around these walls at minimal cost and minimal effort, because they're also engineers, with access to Stack Overflow, with salaries, budget constraints, and Jira tickets, like anybody else. They're just trying to find the cheapest way over the wall. And if you go look at the packages used to circumvent this stuff, you can learn a lot about what they're building: you'll see popular URLs show up, a lot of NFT references, people packaging in MetaMask to open up wallets, stuff like that.

The way people use WAFs in practice, at least from my perspective, has been largely as security theater.
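As a rough illustration of what these interstitial pages look like from a bot's point of view, here is a minimal sketch of a heuristic that guesses whether an HTTP response is a WAF challenge rather than real content. The marker strings and status codes are assumptions drawn from commonly observed challenge pages, not a complete or vendor-confirmed list.

```python
# Heuristic guess at whether a response is a WAF challenge/block page.
# Marker strings below are assumptions based on commonly seen interstitials.

CHALLENGE_MARKERS = [
    "attention required",     # Cloudflare interstitial title
    "checking your browser",  # Cloudflare JS challenge
    "request unsuccessful",   # Imperva Incapsula block text
    "px-captcha",             # PerimeterX block-page element
    "access denied",          # generic WAF block text
]

CHALLENGE_STATUSES = {403, 429, 503}

def looks_like_waf_challenge(status_code: int, body: str) -> bool:
    """Return True if the response resembles a WAF challenge page."""
    text = body.lower()
    has_marker = any(marker in text for marker in CHALLENGE_MARKERS)
    # A suspicious status alone is weak evidence; a known marker is strong.
    return has_marker or (status_code in CHALLENGE_STATUSES and len(text) < 5000)

# A short 403 carrying Cloudflare's interstitial title gets flagged:
print(looks_like_waf_challenge(403, "<title>Attention Required! | Cloudflare</title>"))  # True
```

A real classifier, like the one described later in the talk, would look at many more signals (rendered DOM, headers, cookies), but this is the flavor of the decision.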
They throw up some sort of wall that demonstrates some capacity for preventing this kind of bad behavior, but because these are really complicated tools that look at a lot of different signals, with enough effort you can usually get over the wall and into the system. This one-size-fits-all approach doesn't actually solve anything if someone is dedicated enough; it's mostly a question of whether they want to spend the time.

So I ran a statistical test a while back, because I wanted to know what makes these WAFs tick. The basic idea: at IPM we have 44 different bot implementations spanning a spectrum of sophistication. At the low end, it's just sending a really low-effort HTTP request in Python; it doesn't even change its user agent, it's just "let me go to the URL and see what I get." At the high end are much more complicated things that try to obfuscate the fact that they're automated: running an actual browser rather than a bare HTTP request, executing JavaScript, detecting when they're hit by a CAPTCHA and then solving it, all of that.

With all of these design implementations, we can label each of the 44 bot types according to which features are present. Is it routing its traffic through something other than a data-center IP? Is it running an actual browser? Is it trying to hide the fact that it's an automated browser? Is it hiding its user agent? Is it actually mounting the browser on a display (or a virtual display), which is called headed mode? I don't know how much everybody here has played with Selenium and other browser-automation tools, but you can run them in headed mode to make it look like the browser is actually on a screen rather than on some headless Ubuntu machine in a data center somewhere.

So we can take all 44 bot implementations and encode them as the types they are. To test this, what I needed next was a fairly representative data set of websites that are high-profile enough to actually be using WAFs in a considered way; there's probably a person, or five people, at these companies thinking about how it's set up, and I wanted sites that people are actually making an effort to protect. I started with a set of retailers, and a set of URLs that was very easy to get, because it's a pain to look up URLs for a thousand different companies. Newsweek put out a great data set of about 200 companies that had all the URLs in it, so I grabbed those, threw them into our tool, and took screenshots and a bunch of telemetric data off each website. Now I had a data set of 44 different types of bots hitting these 200-or-so websites, once each.

Then I built a machine-learning classifier that looks at the responses from these servers and decides: is this response from a web application firewall or not? Does it look like I got hit with a wall? I did that by labeling a lot of cases manually, maybe about 1,500 of them, then projecting the rest out, cross-validating, and making sure I didn't screw it up too badly; it was pretty accurate. There are only a few of these large WAF companies, so the classifier was essentially looking for things like whether Cloudflare's challenge markup is on the page, and that's a pretty good signal for catching most of them.
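The design-feature labeling described above can be sketched as a small encoding step. The feature names and the three example configurations here are hypothetical, made up for illustration; the real set covers 44 implementations.

```python
# Hypothetical encoding of bot implementations as boolean design features.
# Feature names and example configs are illustrative, not the real 44.

FEATURES = [
    "runs_browser",       # real browser vs. bare HTTP request
    "executes_js",        # can execute JavaScript
    "headed",             # mounted on a (virtual) display vs. headless
    "stealth_patched",    # hides webdriver/automation fingerprints
    "spoofs_user_agent",  # sends a non-default user agent
    "non_datacenter_ip",  # routes traffic through residential/other IPs
]

def encode(config: dict) -> list:
    """Turn a bot config into a 0/1 feature vector in a fixed order."""
    return [int(bool(config.get(f, False))) for f in FEATURES]

naive_requester = {}                            # low-effort HTTP request
plain_selenium  = {"runs_browser": True, "executes_js": True}
stealth_headed  = {f: True for f in FEATURES}   # everything turned on

print(encode(naive_requester))  # [0, 0, 0, 0, 0, 0]
print(encode(plain_selenium))   # [1, 1, 0, 0, 0, 0]
print(encode(stealth_headed))   # [1, 1, 1, 1, 1, 1]
```

A fixed column order matters because these vectors later become the predictor matrix for the statistical test.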
Then I took that data set and encoded, for each test, whether I think it got hit by a web application firewall according to the classifier, and which design patterns were present in that particular test. So for one test, a standard Selenium instance went over to Patagonia's web page and didn't try to hide its IP: that's one configuration of design patterns, and the question is whether the classifier thinks we got hit with a WAF or not. Run through all of that, and we have a really nice data set we can throw into the next stage. I apologize: this is a very big Rube Goldberg machine for getting a reasonable answer out.

We can then throw that into a technique called dominance analysis, which basically says: the outcome you want to explain is whether or not you got hit with a web application firewall, and the candidate predictors are things like whether you're using a browser, whether you're trying to hide your IP, and so on. You test each predictor individually, and then together with the rest, and see how much lift a statistical model gets by having that predictor included. Is this feature actually driving the outcome? Is it mostly about the browser, mostly about the IP, or mostly about other factors? What drives any meaningful prediction of whether a web application firewall blocks this stuff? You end up with a weighted ranking of which things matter and in what order, and that gives you really interesting information about how people are building these systems: which thresholds they're tweaking the most, which ones matter, which ones don't.

If you're on the defensive side, this is useful for anticipating what someone will build next. The lowest-hanging fruit is whatever lets most of these bots through, so you should focus your energy on these features, in this rank order. If you're on the offensive side, you can use this as your hit list of what to include to make your bot work: if it has this, then this, then this, it will probably work, and anything further down the list is diminishing returns, so maybe go find a softer target.

So there's the entire Rube Goldberg machine in one screenshot: go from a Newsweek post, to a bunch of recordings of web pages, to a classifier, to a data set, to a dominance analysis, to a bar chart. The only thing we really care about is the bar chart. The story in the chart is that just using Selenium, something that can execute JavaScript, something really basic, drives most of your variation, most of the success you're going to get. Then there's a huge drop-off to headedness: is the browser actually mounted to a (virtual) display or not? Then, to a lesser degree still, are you trying to hide the fact that this is a Selenium browser or a Python HTTP request? That is, are you patching the state of the Selenium instance so it says, "oh no, this is not a robot, this is a very legit guy who lives in a data center in Santa Clara." And then, to almost no effect, IP obfuscation. That was the interesting finding: it doesn't really matter whether the traffic comes from low-reputation or high-reputation IPs once you account for the other factors, which really drive the story.

The other note here is that the R-squared is 0.31, which in very basic terms means the model is picking something up but not picking up the whole story.
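Dominance analysis itself is not exotic: for each predictor, you average the gain in R-squared from adding it, across every subset of the other predictors. Here is a self-contained sketch using an ordinary-least-squares fit on synthetic data; the feature names and data are made up for illustration, and a real run would use the logged test results.

```python
import itertools
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X, with an intercept; 0.0 for no predictors."""
    A = np.column_stack([np.ones(len(y)), X]) if X.shape[1] else np.ones((len(y), 1))
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def dominance_weights(X, y, names):
    """Average marginal R^2 contribution of each predictor over all subsets."""
    k = X.shape[1]
    weights = {}
    for j, name in enumerate(names):
        others = [i for i in range(k) if i != j]
        gains = []
        for r in range(len(others) + 1):
            for subset in itertools.combinations(others, r):
                base = r_squared(X[:, list(subset)], y)
                with_j = r_squared(X[:, list(subset) + [j]], y)
                gains.append(with_j - base)
        weights[name] = sum(gains) / len(gains)
    return weights

# Synthetic example: "blocked" is mostly driven by not running JavaScript,
# while IP hiding is noise. (Made-up data, not the talk's real results.)
rng = np.random.default_rng(0)
n = 500
runs_js = rng.integers(0, 2, n)
hides_ip = rng.integers(0, 2, n)
blocked = ((runs_js == 0) | (rng.random(n) < 0.1)).astype(float)
X = np.column_stack([runs_js, hides_ip]).astype(float)
w = dominance_weights(X, blocked, ["runs_js", "hides_ip"])
print(w)  # runs_js gets a much larger weight than hides_ip
```

A handy property of this decomposition: the per-predictor weights sum exactly to the full model's R-squared, which is what makes the bar chart a clean ranking.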
I attribute that partial fit to the fact that there are lots of different ways to set thresholds for these tools and design a particular implementation, and people are tweaking them all the time to miss exactly the cases they want to let in and catch the ones they don't. So there's a lot of variation, and the same test might work once and fail the next time, because there's some element of stochasticity here. The analysis is definitely picking up something meaningful, but not everything.

Just as a top-level view, this is what it looks like as a more basic data set. We see a huge drop-off: only about 15 percent of requests succeed when we just send plain HTTP requests, and it doesn't matter what kind of IP obfuscation we use; it's all pretty crappy, because if you're not running JavaScript, they're not going to play. Then it's kind of a wash among the rest once we're in the regime of using an actual browser: some configurations are slightly better than others, some worse. We see drops on residential proxy IPs partly because those connections are sometimes weak and get dropped, so some of that is attributable just to connectivity.

One more interesting thing: for every one of the 200 retailers I looked at, at least one browser was able to get through. That supports the theory that, yes, you can block this stuff, but with enough time and enough specific implementations, somebody will find the hole you haven't accounted for, or the hole you specifically left open because there's some other vendor that needs to take screenshots of your site for some tool somebody in marketing signed up for. So now you have to keep that little back door, and you just hope it's not a back door that the people building these bots care about. And that's it; I'd love to answer any questions about this stuff, or manipulation in general.

Great question. The question is: what was the methodology of data capture for the original data set? There are about 10,000 captures. It's not just screenshots; it's also the rendered DOM, the headers, the cookies, whatever was actually given to me in response to each request. Part of the question was about timing: did I run it at a time when people are usually shopping for Patagonia jackets or Burlington Coat Factory things? The answer is no; I ran it when I ran it, because I wanted to get this post out, so the timing was essentially arbitrary and I didn't try to control for it. It was a one-shot attempt to get a resolution from each server. I have internal systems that can determine whether there was an internal failure on my side, which is not the website's doing, and those cases are rejected and removed from the data set. I also removed any retailer where fewer than 20 percent of responses came back as blocked by WAFs, on the grounds that those companies aren't meaningfully trying to block this traffic in the first place; they're not doing enough for it to be worth asking whether their strategy is useful, because their strategy is basically to not do much.
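The filtering just described (drop captures that failed on the client side, then drop retailers that barely block anything) reduces to two passes over the data. A minimal sketch, with a made-up record format and made-up retailer names:

```python
from collections import defaultdict

# Hypothetical capture records; field names are invented for illustration.
captures = [
    {"retailer": "shopA", "internal_failure": False, "waf_blocked": True},
    {"retailer": "shopA", "internal_failure": False, "waf_blocked": False},
    {"retailer": "shopB", "internal_failure": True,  "waf_blocked": False},
    {"retailer": "shopB", "internal_failure": False, "waf_blocked": False},
]

# Pass 1: discard captures where the failure was on our side.
usable = [c for c in captures if not c["internal_failure"]]

# Pass 2: keep only retailers where at least 20% of usable responses were
# blocked, i.e. sites that are meaningfully trying to stop this traffic.
per_retailer = defaultdict(list)
for c in usable:
    per_retailer[c["retailer"]].append(c["waf_blocked"])

kept = {r for r, flags in per_retailer.items() if sum(flags) / len(flags) >= 0.20}
final = [c for c in usable if c["retailer"] in kept]
print(sorted(kept))  # ['shopA'] -- shopB blocked 0% of its usable requests
```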
Sure. The question is how deep the data-set capture went: did I just hit the website once, or actually go in and type a search for Patagonia sweaters or something? The answer is that it was just a ping on the root site, the top-level domain. That's a pretty non-controversial thing to do with a bot; if you have problems with that, well, I sort of have problems with it too, but we're in an environment where that's considered totally fine, even a preferred way of doing things. So it was just hitting the top level and taking a very cursory scan. The wager is that if you're not protecting the top level, you're probably not doing much beyond it. That's a loose assumption; the deeper you go into a system, the more interrogation of the browser there tends to be.

Sure. My data center is redacted, and I'm sure the results would change if you used a different one. I think that data center in turn runs on a lot of AWS infrastructure, as far as I can tell, because when you use some of their other products you're clearly just using AWS; I don't know if that helps you narrow down which provider it was. Obviously the results are going to change with different providers, and even with different regions on the same provider; that might trigger certain events. There's maybe more interrogation if you're visiting a very U.S.-centric website: WinCo is on this list of companies, and if you hit WinCo from Bulgaria, it's like, no. There are quirks like that that come up, for sure.

Go for it, real quick. The question is whether I set any request headers, like a referrer or a cookie, and whether that might have helped punch through the WAF. Nope: this was Joe Browser coming from Anywhere, USA. Do I think it might help? Yeah, potentially. There are tweaks at the margins of all of this that you can make to increase yield, as it were. This was the initial test: take a very naive approach, assume nothing, and don't do the work of looking up which headers would help for each of the 200 websites. But yes, there are plenty of circumstances where that would work. In fact, the way some web application firewalls work under the hood is that if you pass the first test, they give you a token, and you can use that token in curl or whatever else for the rest of the day and they'll treat you as fine. So there are definitely implementations that are prone to that sort of attack.

Go for it. Ooh, okay. So there are good bots out on the internet that people think are great, like Googlebot, and various non-Google competitors, Bing's bots maybe, and the question is: what if you just mimic them? Sounds more illegal. But yes, I think you could probably get away with that sort of thing: claim "this is definitely the Google crawler," and maybe use an IP that's in the same region somehow, by messing with proxies, so it looks like it's also in Mountain View or something. I don't know; I haven't tried it. That'd be interesting; you'd probably get some yield on it. Way in the back there.

So Google publishes their IPs, is that what you said? Oh, gotcha: Google publishes the addresses of its bots, so a sophisticated defender would be able to say this doesn't look super legit, because it's coming from a totally weird place that isn't one of those addresses. So, great.
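That defender-side check is one Google actually documents: verify a claimed Googlebot by doing a reverse DNS lookup on the IP (the hostname should end in googlebot.com or google.com), then forward-resolve that hostname and confirm it maps back to the same IP. Here's a sketch with the resolvers injectable so it can be exercised without live DNS; the example IP and hostname follow the pattern Google's documentation uses.

```python
import socket

def is_real_googlebot(ip: str,
                      reverse_lookup=lambda ip: socket.gethostbyaddr(ip)[0],
                      forward_lookup=socket.gethostbyname) -> bool:
    """Verify a claimed Googlebot IP via reverse-then-forward DNS confirmation."""
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False  # no reverse record at all
    if not host.endswith((".googlebot.com", ".google.com")):
        return False  # reverse record points somewhere else
    try:
        # Forward-confirm: the claimed hostname must resolve back to the IP.
        return forward_lookup(host) == ip
    except OSError:
        return False

# Exercised with fake resolvers so the example needs no live DNS:
fake_rdns = {"66.249.66.1": "crawl-66-249-66-1.googlebot.com"}
fake_dns = {"crawl-66-249-66-1.googlebot.com": "66.249.66.1"}
print(is_real_googlebot("66.249.66.1",
                        reverse_lookup=fake_rdns.__getitem__,
                        forward_lookup=fake_dns.__getitem__))  # True
```

The forward-confirmation step is what defeats the spoofing idea from the question: an attacker can set their own reverse DNS to a googlebot.com name, but they can't make Google's forward DNS point that name at their IP.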
Thank you so much.