
"Context Aware Content Discovery: The Natural Evolution"

BSides Canberra 2021 · 50:01 · 1.5K views · Published 2021-04 · Watch on YouTube
Speakers
Sean Yeoh, Patrick Mortensen, Michael Gianarakis, Shubham Shah

About this talk
Presented at BSides Canberra 2021, 9-10 April.
Transcript

All right, looks like we're good to go. Just by way of quick introduction, we just got introduced there, but the folks on the stage: I'm Michael, CEO at Assetnote; I have shubs, CTO at Assetnote; Sean on the end, engineering lead; and Pat, senior engineer. We'll be presenting today, but there are actually a few other contributors as well: we've got James down in the audience, and Jordan there, who also worked with us at Assetnote and had a bunch of interesting contributions, especially the killer ASCII art, which was James's contribution, plus a few other things. So we're going to be talking about content discovery today.

When it comes to web application security, and particularly offensive web application security, content discovery is a really key element. When you think about an organization and its attack surface, the application attack surface is often very large and varied, and understanding as much as possible about that attack surface is critical if you're looking to find security bugs. To do this you typically do some sort of content discovery, and typically that involves file and directory brute forcing across those web assets. The more effective you are at this, the more you can map out the application attack surface and the more likely you are to find various avenues for exploitation. Pretty straightforward.

So let's look at the current state of content discovery and where it's at. As I mentioned, the typical approach is just file and directory brute forcing using a wordlist. There are a number of great tools out there that are commonly used: wfuzz, dirsearch, and more recently things like gobuster, ffuf, rustbuster, and feroxbuster. Usually these tools are paired with some sort of hand-curated wordlist, maybe from a source like SecLists, or maybe something you've built yourself and added to over the years. These tools have evolved over time, but ultimately the underlying approach has remained the same.

You have the tool, you feed it a wordlist and a list of assets to point it at, and off it goes. There have been a number of improvements over the years, things like flexibility in where you can fuzz and better ways to filter the content and the output, but really, at the end of the day, there hasn't been much innovation in the content discovery techniques themselves. It's still ultimately doing the same thing; generally, each iteration and each new tool is primarily focused on going faster.

That's why you see the evolution from Python to Go to Rust: the tools are really optimized for speed. This leads to a cycle that reinforces itself: with more speed you can run a bigger wordlist against your target, and the faster you go, the more web apps you can cover in a shorter amount of time. Add in recent asset discovery tooling and it completes the cycle and just keeps reinforcing itself. That's been most of the innovation in this space; it's mostly been about speed.

Given a large enough attack surface, this approach can still be pretty effective. However, application development has also evolved. Modern application development is primarily framework driven and typically API focused, with complex routing, and modern applications have generally moved beyond the paradigm of files on a server to endpoints that are strictly defined in code. While there has been some movement in the tooling towards more context awareness, things like CommonSpeak and even our wordlists site, which auto-generates wordlists, most of the tooling lacks context. So what do we mean when we talk about context awareness?

Context awareness allows us to discover more API endpoints than typical content discovery tools, which simply guess for files and folders without any of the context. As we mentioned earlier, the more endpoints you discover, the larger the attack surface and the higher the chance that you're going to find some sort of security vulnerability. Without context awareness, certain APIs might not even respond at all, and with that you're probably missing a lot of the attack surface. So the only real way to cover that attack surface with modern applications and modern APIs is through context awareness; ultimately, you can't hack what you don't know is there. So let's dive into this concept of context a little more with some concrete examples.

Think about frameworks: they want really well-defined APIs. Legacy systems were ultimately defined by files on a server, but as frameworks such as Flask, Rails, and Express became more commonplace and took over, that's not really the way things are structured anymore. When modern frameworks became more prevalent, we noticed that endpoints were no longer just files on the web server; rather, they were APIs strictly defined in the code of the application. This has fundamentally shifted the way we see content discovery: we're no longer just searching for files and folders, we need to specifically look for these API endpoints.

So, to run through some simple examples, we'll start with Flask. It's a very basic bit of code, but there are a number of different elements of context. The first is that there's a parameter in the path, this key parameter, which takes an integer value. That's the first bit of context you need: if your wordlist doesn't have /123 or something like that on the end, you're not going to be able to hit this endpoint. If you look at the rest of the code, there are a number of other elements of context too; specifically, it responds and handles things differently based on the HTTP method supplied to it. For example, there's different output for PUT and DELETE, and for the GET, if the key is not found (say it's not in your wordlist), you're going to get a not-found exception. Now imagine the default GET handling isn't there and it's just PUT and DELETE: if you're just running through a wordlist, even if you have an integer in that parameter value, unless you configure your brute forcing to use a PUT request or a DELETE request, you're going to miss this entirely.
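For readers following along without the slides, here's a minimal Flask sketch of the kind of route being described; the route name, parameter, and handlers are illustrative stand-ins, not the actual slide code:

```python
# A minimal sketch of the kind of Flask route described above; the route name,
# parameter, and handlers are illustrative, not the exact code from the slide.
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
STORE = {}

@app.route("/records/<int:key>", methods=["GET", "PUT", "DELETE"])
def record(key):
    # GET: only succeeds if the key already exists, otherwise 404s, so a
    # wordlist without a valid integer on the end never sees a hit.
    if request.method == "GET":
        if key not in STORE:
            abort(404)
        return jsonify(STORE[key])
    # PUT: creates/updates the record; invisible to GET-only brute forcing.
    if request.method == "PUT":
        STORE[key] = request.get_json(silent=True) or {}
        return jsonify(STORE[key]), 201
    # DELETE: removes the record; also invisible to GET-only brute forcing.
    STORE.pop(key, None)
    return "", 204
```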

In a similar vein, we've got an example with Rails. Again there are multiple parameters in the path; in this case we've got a todo_id and also an item_id, and different combinations of these parameters, as well as request methods, do different things. In this case it responds to POST, PUT, and DELETE, with various context in the path as well. So again, if you're brute forcing with just GET requests, you will miss all of this; unless you're configuring your tooling specifically to look for this stuff, you're going to miss this attack surface.

And then finally, with Express: another simple example, but here we've got three parameters of different types in the path. Again, unless you're brute forcing with that context in there, you're going to miss this endpoint entirely. Even these simple examples demonstrate how contextually understanding an API is important for effectively mapping out the application's attack surface from a black-box perspective. Subtle changes in the request you send can produce dramatically varying results and can be the difference between discovering a valid endpoint or not, and current approaches to content discovery tooling don't really account for this context.

So we decided to work through this idea of context-aware content discovery. Context-aware content discovery is the process of finding endpoints in applications by sending requests with context. When we talk about context, we're talking about things like the HTTP method, parameters, values, headers, things like that. It aims to tackle the problem of modern APIs that only respond when the correct context is sent to them. As we went through in the previous examples, some APIs expect certain methods, certain parameters, values, or headers when they're being requested; by making brute-force attempts with the correct context, we can discover modern API endpoints.
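To make "sending requests with context" concrete, here's a rough hand-rolled illustration (this is not Kiterunner itself); the target URL, paths, and placeholder values are invented for the example:

```python
# Rough illustration of brute forcing *with* context: each candidate carries a
# method, a path with placeholders filled in, and a body, instead of a bare
# GET to a wordlist entry. Target and paths below are made up for the example.
import requests

TARGET = "https://api.example.com"          # assumed target
PLACEHOLDERS = {"{id}": "1", "{uuid}": "00000000-0000-0000-0000-000000000000"}

CANDIDATES = [
    ("GET",    "/api/todos/{id}/items/{id}", None),
    ("PUT",    "/records/{id}",              {"name": "test"}),
    ("DELETE", "/records/{id}",              None),
    ("POST",   "/users/create",              {"email": "a@b.c"}),
]

for method, path, body in CANDIDATES:
    for token, value in PLACEHOLDERS.items():
        path = path.replace(token, value)
    resp = requests.request(method, TARGET + path, json=body, timeout=5)
    # Anything that doesn't match the server's "not found" baseline is
    # potentially an endpoint a GET-only wordlist run would never have seen.
    print(method, path, resp.status_code, len(resp.content))
```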

So what did we do, what was our approach? To solve this problem, we decided to build a tool and a dataset capable of sending these API requests with the correct context. API specs such as Swagger define how APIs operate, to a high degree of specificity: what methods are required, what the parameters are, what types of values should be in those parameters, and any headers that are necessary to interact with those APIs. So we made it our mission to collect as many Swagger specification files as possible from across the internet via broad, internet-wide scanning; we built a dataset from all of the API specs collected, and then we built a tool capable of using this dataset to send fully formed, contextual requests to web servers for content discovery.

The next part was collecting all this data, getting all these Swagger files from the internet, so I'll go through a little bit of how we did that and all the different sources we ended up using. First off, we used BigQuery. I've talked about BigQuery quite a lot; it is quite magical, it lets you query terabytes of data in seconds. It's also really amazing because it has GitHub's public dataset. GitHub's public dataset isn't exactly the entirety of GitHub on BigQuery, but it is a decent portion of GitHub, and through that dataset we were able to query it and obtain all of the Swagger files from it.

This is what our query looked like in BigQuery: we first queried the files table to find anything that ended with swagger.json, api-docs.json, or openapi.json, and then we obtained the contents of those files by running the second query shown on the screen. After we obtained all these files, we had to parse the CSV file that was output, and then we obtained all of the Swagger files individually. Through this process we ended up with 11,000 de-duplicated, unique Swagger files.
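The exact queries aren't legible in the transcript, so here's an approximate sketch of that two-step approach against BigQuery's public github_repos dataset; treat the SQL as an assumption of what the real queries looked like:

```python
# Approximate sketch of the two-step BigQuery approach: query the files table
# for candidate spec paths, then join against the contents table to pull the
# raw file contents. The real queries from the talk may have differed.
from google.cloud import bigquery

client = bigquery.Client()

files_sql = """
SELECT id, repo_name, path
FROM `bigquery-public-data.github_repos.files`
WHERE path LIKE '%swagger.json'
   OR path LIKE '%api-docs.json'
   OR path LIKE '%openapi.json'
"""
candidates = list(client.query(files_sql).result())
print(f"{len(candidates)} candidate spec paths")

contents_sql = """
SELECT f.repo_name, f.path, c.content
FROM `bigquery-public-data.github_repos.files` AS f
JOIN `bigquery-public-data.github_repos.contents` AS c
  ON f.id = c.id
WHERE f.path LIKE '%swagger.json'
   OR f.path LIKE '%api-docs.json'
   OR f.path LIKE '%openapi.json'
"""
for row in client.query(contents_sql).result():
    # Each row carries the raw spec contents; dump them out for later
    # validation and de-duplication.
    print(row.repo_name, row.path, len(row.content or ""))
```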

Next we took a look at APIs.guru. This project is pretty interesting; its goal is to create a machine-readable Wikipedia for REST APIs. It's entirely open source, it documents publicly accessible APIs with Swagger definitions, and all the data from the OpenAPI directory can be accessed through a REST API itself. So we downloaded the REST APIs via a REST API (that's inception, I know); we extracted all of the Swagger spec URLs from it and were able to use something like wget to download all the Swagger files. It looked something like this: we just downloaded everything from APIs.guru, and we ended up with 14,000 unique Swagger files cumulatively at this point.
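As a sketch of that step: the APIs.guru directory publishes a list.json that maps each API to its versions, each of which exposes a swaggerUrl for the JSON spec. The field names below reflect that public listing as I understand it, so verify them against the live data:

```python
# Sketch of pulling specs via APIs.guru's own REST API: list.json maps each
# API to its versions, and each version exposes a swaggerUrl for the JSON spec.
import json
import pathlib
import requests

listing = requests.get("https://api.apis.guru/v2/list.json", timeout=30).json()
outdir = pathlib.Path("apis_guru_specs")
outdir.mkdir(exist_ok=True)

for name, entry in listing.items():
    preferred = entry["versions"][entry["preferred"]]
    spec_url = preferred.get("swaggerUrl")
    if not spec_url:
        continue
    spec = requests.get(spec_url, timeout=30).json()
    # One file per API, mirroring the wget-style bulk download from the talk.
    safe_name = name.replace("/", "_").replace(":", "_")
    (outdir / f"{safe_name}.json").write_text(json.dumps(spec))
```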

Then we also found another data source, which was SwaggerHub. We could list all the specs on SwaggerHub, but unfortunately it was limited to the first 10,000 results. That was a limit they were enforcing on the server side, not something you could bypass with a list of IPs or a list of proxies; it's just a hard limit of 10,000. Supposedly they have something like 400,000 Swagger specs in their database; if we could have grabbed all of that it would have been amazing, but we were only able to grab the first 10K, and we weren't able to bypass these limitations in any obvious way. This part was written in async Python; we downloaded around 10K Swagger files through this method and ended up with roughly 23,000 Swagger files by the end of it.
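The SwaggerHub listing format isn't shown in the transcript, so the sketch below sidesteps it and just assumes you already have a file of spec URLs; it only illustrates the capped-concurrency async Python download step:

```python
# Generic sketch of the "async Python" bulk download: given a list of spec
# URLs (however you enumerated them), fetch them concurrently with a cap on
# in-flight requests. The listing/enumeration step itself is not shown here.
import asyncio
import json
import aiohttp

async def fetch(session, sem, url, out):
    async with sem:                      # cap concurrent downloads
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
                if resp.status == 200:
                    out.append(await resp.json(content_type=None))
        except Exception:
            pass                         # dead links are expected at this scale

async def download_all(urls, concurrency=50):
    sem = asyncio.Semaphore(concurrency)
    out = []
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session, sem, u, out) for u in urls))
    return out

if __name__ == "__main__":
    urls = [u.strip() for u in open("spec_urls.txt")]
    specs = asyncio.run(download_all(urls))
    json.dump(specs, open("downloaded_specs.json", "w"))
```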

Then lastly, and I guess the most important part: we scanned the internet. James on our team spent some time scanning the internet, and we did some calculations: there are 3.7 billion IPs on the internet that are accessible. We didn't scan IPv6, but that's what the number would look like if we had, at least. We scanned ports 80 and 443, HTTP and HTTPS, and we scanned 22 different paths where we thought Swagger files could be available.

As James mentions in these slides, if you want to frustrate hackers, return a crisp 200 on all routes instead of a 404, and why not make all pages have the content type application/json. These are the paths we scanned and the number of hits we had; you can see that some routes are much more popular than others. For example, /v2/api-docs had 27,000 hits and /api/v2/swagger.json had 91,000 hits. These statistics are from before we did our deduplication, so they're just the raw numbers from the internet-wide scans. Our approach was as follows.

We skipped IPv6, because we're not going to scan that many IPv6 addresses and the technology isn't there yet to scan that many addresses. We preferred passive sources to supplement our data, so we used the Rapid7 HTTP and HTTPS datasets combined with an internet-wide masscan to fill in the deltas. As I said earlier, we only stuck to ports 80 and 443, as we knew that production servers predominantly use these ports. In future scans, if we want to develop this further, we could look at other ports like 8080, 8090 and so on that development servers could be running on, but for the moment we only focused on 80 and 443. We focused on known API documentation paths, and we scoped it to Swagger 2.0 compliant files.

As for the process and the infrastructure, we only had one 8GB Linode box. Linode is a great provider to do internet scans from; they're very friendly to security researchers, so I highly recommend them if you're looking to do something like this. We set up an rDNS record, and that's something we suggest you do as well, where you just explain the scan traffic; this reduces the number of abuse requests significantly, and you can continue going about your business without any problems. For software, we used masscan to fill in the deltas, zgrab2 to obtain the Swagger files and speak the HTTP and HTTPS protocols, and some custom validation tooling that we built in-house.

It looked something like this: we patched zgrab2 to add an API specification detection module, so every time it detected a swagger JSON or swagger YAML file it would flag it. It would make those 22 requests I showed earlier against every single host we had identified as being online on ports 80 and 443 on the internet. We fed these IPs to zgrab2, and at the end we ended up with roughly 20 gigs of output that we had to deserialize and deduplicate.
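Purely as an illustration of that per-host probe plus validation step (the real scan used the patched zgrab2 across the whole IPv4 space), something like the following captures the idea; the candidate paths are a small subset of the 22 mentioned, and the IP is a placeholder:

```python
# Sketch of the per-host probe + validation step (the real scan used a patched
# zgrab2 at internet scale); the candidate paths are a small illustrative
# subset of the 22 paths mentioned in the talk.
import json
import requests

CANDIDATE_PATHS = [
    "/swagger.json",
    "/v2/api-docs",
    "/api/v2/swagger.json",
    "/api-docs.json",
    "/openapi.json",
]

def looks_like_swagger(body: str) -> bool:
    # Minimal validation: it parses as JSON and declares a Swagger 2.0 version.
    try:
        doc = json.loads(body)
    except ValueError:
        return False
    return isinstance(doc, dict) and doc.get("swagger") == "2.0" and "paths" in doc

def probe(host: str):
    for scheme in ("http", "https"):
        for path in CANDIDATE_PATHS:
            url = f"{scheme}://{host}{path}"
            try:
                resp = requests.get(url, timeout=5, verify=False)
            except requests.RequestException:
                continue
            if resp.status_code == 200 and looks_like_swagger(resp.text):
                yield url, resp.text

for found_url, spec in probe("203.0.113.10"):   # placeholder IP (TEST-NET-3)
    print("swagger spec at", found_url)
```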

Here are some numbers. We scanned quite a lot of hosts and got quite a lot of valid data back, and the post-processed unique routes were sitting at just under a million, so quite a lot of routes actually gave us data. After deduplicating, we ended up with roughly 67,500 Swagger files. That's a pretty decent dataset, and we will be releasing it later today as part of this project.

Next we had to parse all this data. Why? Because we have a ton of Swagger files and we want to meet a number of conditions. We don't want to include any invalid specs. We want to exclude all duplicates; a lot of these specs were things like the Swagger Petstore example, so obviously useless to us. We want to exclude anything that's blacklisted, so in our parsing tool we can supply a list of blacklisted hosts that maybe only offer a bunch of generic specs that aren't very useful, and we want a way to keep those out. And we want to add any missing parameters or replace any invalid parameters.

Some of these specs have path parameters in the path but, for whatever reason, not defined in the actual parameters inside the spec, so we want to detect these and include them so we can use them later on in our other tooling. We want to try to detect any UUID or regex parameters and map them as such, so we can accurately guess those parameters when testing the APIs. And we want to add a KSUID to each spec, so that when we're running our tooling we can trace back to the original spec a result is coming from.

We want to parse all of this into a single file so that we can use Kiterunner to compile the wordlist. So we wanted to normalize the data. We had some basic rules like the ones I mentioned a moment ago, and we also wanted to resolve any JSON $ref values, because it becomes very difficult to do that at large scale unless you just do it each time you process an individual spec. We want to include the security definitions (excluding OAuth2, because they're kind of irrelevant), as well as the host or URL that the spec came from, and if a string appears to be either a UUID or a regex, we want to mark the format as such.

We also do our best to deduplicate any results that come through. This is an example of what I mean by a path that doesn't quite include the token, in this case, so we want to add that so our tooling is able to fill those values in and guess what it can. And what does this output look like? It basically looks like a bunch of these, where you've got the KSUID, the URL if it's included, any security definitions if they're included, as well as a bunch of paths and parameters.
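Here's a simplified sketch of those normalization rules (this is not kitebuilder's actual code): pull the {tokens} out of each path, add parameter stubs for anything the spec never declared, flag UUID-shaped defaults, and tag the spec with an ID for traceability; a uuid4 stands in for the KSUID the real tooling uses:

```python
# Simplified sketch of the spec-normalization rules described above (not
# kitebuilder's actual code): find {tokens} in each path, add parameter stubs
# for any the spec forgot to declare, flag UUID-shaped values, and tag the
# spec with an ID so results can be traced back to their source spec.
import re
import uuid

PATH_TOKEN = re.compile(r"\{([^}/]+)\}")
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I)

def normalize(spec: dict, source_url: str) -> dict:
    spec["_id"] = str(uuid.uuid4())      # the real tooling uses a KSUID here
    spec["_source"] = source_url
    for path, methods in spec.get("paths", {}).items():
        tokens = set(PATH_TOKEN.findall(path))
        for method, op in methods.items():
            if not isinstance(op, dict):
                continue
            params = op.setdefault("parameters", [])
            declared = {p.get("name") for p in params if isinstance(p, dict)}
            # Add a stub for any path token the spec never declared.
            for token in tokens - declared:
                params.append({"name": token, "in": "path",
                               "required": True, "type": "string"})
            # Mark UUID-shaped default values so they can be guessed later.
            for p in params:
                if isinstance(p, dict) and UUID_RE.match(str(p.get("default", ""))):
                    p["format"] = "uuid"
    return spec
```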

For writing the tool, we wanted to use Python because it's easy to work with JSON data and manipulate it. We originally started off using a Swagger/OpenAPI parser library, but we ended up giving up on it because it didn't scale very well, especially with that number of files, and it would sometimes just hang without giving you any insight into why unless you really dove into it. It originally looked like a disk-writing issue, which I spent a while trying to debug, only to realize the parser was just breaking on us. Some specs are very weird and the parser is too picky for them: we may have a mostly valid spec with a field that doesn't quite conform to the Swagger spec, but that data might still be relevant and we want to include the rest of it.

So there are limitations when using the Swagger library for parsing that we were never going to get past, so we just ended up writing our own version; I mean, it's basically just reading a bunch of dictionaries, and occasionally the parser would just die when reading a large batch of files. For generating our final output, this is just a screenshot of it running; it looks something like this in our logs, where it parses a bunch of specs, excludes any blacklisted ones, and also excludes any invalid JSON. On our dataset of about 67,500 Swagger files, we ended up with a single output.json file.

Collating all of those comes to about 2.8 gigabytes. This excludes a large chunk of the scraped data we had, such as Google APIs, Azure, Swagger Petstore examples, and Amazon, as these are all usually the same specs. These files are then ready to be compiled with Kiterunner. kitebuilder, which is the tool we use for parsing all these specs, is used essentially like this: you just run its parse command and pass in any of the parameters you want. It also includes a convenience tool for parsing the CSV data you saw shubs mention with BigQuery.

If you wanted to do that, you would just pass something like this, with the CSV file in the format that comes back from BigQuery, and it will save a bunch of JSON files into our scraping directory. We can then use the parsing tool to go through and parse the whole scraping directory into an output file, or whatever you want to call your JSON file; here we call it kitesub.json. So, as Pat said, we took a whole bunch of these Swagger files and we really wanted to formalize the spec into something usable for creating web requests. We took a bunch of JSON and rewrote our schema into Go.

Usually that's pretty good, except for the part where JSON is very dynamic and doesn't statically type very well, and developers are very liberal with how they put things into JSON. For example, Swagger allows you to specify the content type that a request expects and will respond with; yet here we have a developer who kindly decided to include the content type separately, completely ignoring the consumes field that's provided for them. Here we have a developer who decided to be very explicit about how they wanted to do query parameters: instead of specifying parameters inside the Swagger body, they decided to use their own custom format for the query parameters, resulting in something we can't really parse or use in any shape or form.

In this case we have a developer who probably forgot halfway through what they were doing; they kindly told us that there is a body to this request, they just don't tell us what the body contains, and I guess it's left as an exercise to the reader to figure out how to submit data to this request. Finally, we have a developer who thought they were being very helpful: they specify the format as something returned by the charts endpoint. That would be great if they had also included the charts endpoint in this API spec; unfortunately, they didn't.

And then we have what I think was my favourite from all of the data we analyzed: this developer wasn't quite sure how to format their data, so they left a placeholder. Thankfully for us, we know what that placeholder is. So, a few takeaways from learning to parse all of this into a strictly typed format. First of all, you can't really trust developers with anything, because they're not going to do the right thing. The second thing is that statically typing JSON is incredibly difficult, and the way we got around that was using an intermediate data type for how we store our format, which effectively becomes the kite format you will be using.

The second major challenge we had in engineering this tool was handling wildcards. If anybody has done some casual brute forcing before, the worst thing you can have is your output being flooded with a whole bunch of garbage results. As Mike previously explained, API brute forcing in this context-aware environment is fundamentally different from normal brute forcing. You often have directories and web servers that don't respond the same way anymore: previously they would 302 if you gave the full directory path but not the trailing slash, whereas now you have APIs that just straight up don't respond unless you supply the right endpoints or the right headers or even the right body type.

Even more so, the current brute forcing landscape tries to address wildcards and invalid results using status codes, but in this landscape you can't really trust status codes at all. You'll have web servers that give you 500 errors when the path is completely correct but you're just missing some header or some validation element. You'll have web servers that 404 you even though most of your path is correct but you've supplied the wrong user ID, so technically you're nearly there, but it's still a 404. And then, as before, you have web servers that just straight up lie to you and always return status 200, which is incredibly unhelpful.

The other case you have to think of is that we now have applications that can virtually route multiple APIs together. Here we have an nginx config that routes three different routes to three separate applications, and you have to keep in mind this means each application may handle invalid input differently. For example, the admin application might 500 if you supply a wrong path, whereas the mail application might straight up 403 or 401 you for invalid permissions. The way we solved this was not through machine learning but through a whole bunch of heuristic analysis. What do we mean by heuristic analysis? What we did was take a whole bunch of baselines. In the examples before, we have admin and mail: for each of these directories we send a whole bunch of requests, figure out the expected response code, content length, how many words come back, and how many lines come back, and use that to establish a heuristic for how the admin directory should respond. From that, anything that deviates from this baseline for the admin directory can be considered an interesting result rather than one of the default responses.
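As a stripped-down sketch of that baseline idea (the real heuristics live in Kiterunner's Go code and cover many more cases): request a few paths that shouldn't exist under a directory, fingerprint the responses by status, length, word count, and line count, and only treat deviations as interesting; the target URL here is a placeholder:

```python
# Stripped-down sketch of the per-directory baseline idea: fingerprint how a
# directory answers requests for junk paths, then flag only responses that
# deviate from that fingerprint. Target URLs below are placeholders.
import random
import string
import requests

def fingerprint(resp):
    return (resp.status_code, len(resp.content),
            len(resp.text.split()), resp.text.count("\n"))

def baseline(base_url, samples=3):
    prints = set()
    for _ in range(samples):
        junk = "".join(random.choices(string.ascii_lowercase, k=16))
        prints.add(fingerprint(requests.get(f"{base_url}/{junk}", timeout=5)))
    return prints

def is_interesting(resp, baselines):
    # A 200 that matches the wildcard fingerprint is noise; a 500 or 404 that
    # deviates from it may still be a real endpoint worth looking at.
    return fingerprint(resp) not in baselines

admin_baseline = baseline("https://target.example.com/admin")
candidate = requests.get("https://target.example.com/admin/users/1", timeout=5)
print("interesting:", is_interesting(candidate, admin_baseline))
```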

We've also implemented a whole bunch of other heuristics for cases we saw that you can't really detect through baselines like these, things like annoying cases where the AWS Elastic Load Balancer would always respond the same way if you supplied a certain header, so we have additional heuristics to help pick those up. Now, obviously all the security experts who use tools want to be in control, and some of them really hate technology, so we let you turn these features off if you really don't like our algorithms. We've also built this tool with the obvious edge cases in mind when you're running this kind of content discovery brute forcing. We've all had cases where you're sending a lot of requests and the web server is behind a certain BIG-IP style load balancer or web application firewall.

Then suddenly, after sending fifty-something requests in a second, all your requests start getting sinkholed; they just don't respond anymore. In these cases we intelligently detect that a host has started abnormally responding the same way, and we quarantine that host, both to save web requests and to make the user aware that this is probably a host that requires special care, whether through different rate limits, different delays, or maybe a different IP address. Now, obviously we've talked a whole bunch about how we do these detections, but that doesn't really help us if the tool runs really, really slowly.

I think over the past few years, as Mike mentioned, tools have gotten gradually faster and faster, and I think it's pretty easy at this point to write a tool in Go or Rust that will hit one host very, very quickly. What we've done is figure out how to go quickly on lots and lots of hosts, because at Assetnote, scanning one host really doesn't suit these really big attack surfaces. What we have here is, across the same number of threads, on the same size box, scaled across multiple hosts (in this case five concurrent hosts), we're averaging 30,000 requests every second, compared to existing tooling, which will probably cap out around five or six thousand requests per second against a single host.

Admittedly, this is a bit of a contrived example; realistically no web server is going to respond to thousands of requests per second from you. But in a more realistic workload, splitting say a thousand threads across maybe 200 hosts, we're averaging approximately 26,000 requests a second. Instead of having one server being slammed, we prefer to distribute our load across multiple, and this way we increase our scanning throughput and get a whole lot more results in a much shorter time period. As part of trying to go really, really fast, we encountered a whole bunch of challenges along the way.

The first challenge we hit was that we started hitting connection limits on servers: when you try to open a hundred TCP sockets, some servers just don't like that. The way we resolved that was to use long-lived connections, and we structured our concurrency model to strictly limit how many connections can be made to a server at any given time. Then, after you get over that scaling problem, you realize that you're now trying to open requests to hundreds of hosts at once and your laptop, with its measly 512 file descriptors, can't handle it. Unfortunately for us, there is no good solution for this aside from telling the user they're doing something stupid and asking them to fix it themselves; admittedly, the only way to stop this is to stop users from shooting themselves in the foot in the first place.
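Kiterunner itself is written in Go, but purely to make the structure concrete, here's a minimal asyncio sketch of the same idea: long-lived, reused connections, a hard cap on connections per host, and many hosts progressing in parallel. The numbers echo the demo settings later in the talk, and everything else is illustrative:

```python
# Illustrative asyncio sketch (the real tool is Go): one long-lived session,
# a hard per-host cap on concurrent connections, and many hosts in parallel.
import asyncio
import aiohttp

PER_HOST_LIMIT = 5          # connections allowed to any single host
PARALLEL_HOSTS = 200        # hosts being scanned at once

async def scan_host(session, host, paths):
    sem = asyncio.Semaphore(PER_HOST_LIMIT)   # per-host connection budget
    async def hit(path):
        async with sem:
            async with session.get(f"http://{host}{path}") as resp:
                return path, resp.status
    return await asyncio.gather(*(hit(p) for p in paths), return_exceptions=True)

async def main(hosts, paths):
    # keep-alive connections are reused instead of reopening sockets per request
    connector = aiohttp.TCPConnector(limit_per_host=PER_HOST_LIMIT)
    host_sem = asyncio.Semaphore(PARALLEL_HOSTS)
    async with aiohttp.ClientSession(connector=connector) as session:
        async def guarded(host):
            async with host_sem:
                return await scan_host(session, host, paths)
        return await asyncio.gather(*(guarded(h) for h in hosts))

# asyncio.run(main(["203.0.113.10"], ["/api/users", "/api/users/1"]))
```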

After that, you realize that now we can open lots of connections and send lots of requests, so where is all the time going? It's spent on context switching. There are two aspects to this. One is internal: since we use a lot of goroutines, we spend a lot of time waiting on each other for information, so we structured our concurrency to minimize those waits entirely by batching up work and handling it with quite a unique parallelism method. The second part is that you start getting stuck on kernel context switches, because now you're just waiting on syscalls for reads and writes on TCP sockets.

Admittedly, there is no good solution to this at the moment; I'd love to implement the tool with an io_uring-based HTTP client, but that just doesn't exist yet. And then finally, once we'd optimized our connections and our context switching, we realized we were stuck on garbage collection, since Go is a memory-managed language. This is quite unique to writing tooling like this, and the way we solved it was simply to strip out the allocations we didn't need. This means that while we're scanning, we get minimal interruptions from the Go garbage collector trying to clean up after us.

So, that being said, with the tool we're releasing today, what do you get? You get some mad ASCII art from our very own James Hempton. Beyond that, we also provide conversion between standard wordlists, JSON files, and our custom kite files, which means you can interchange between these three formats if you want to use the data with your own tooling or you want to modify what the kite files look like. We've also added built-in loading and caching of the wordlists.assetnote.io wordlists into the tool.

This means that if you're running the tool on a box where you don't have the wordlists yet, or you're after a different kind of content discovery and don't necessarily have, for example, XML or ASPX wordlists handy, you can automatically load the Assetnote wordlists with the tool itself just by specifying a parameter; we'll pull the wordlist down, cache it on disk, and you can reuse it in future scans. At the same time, we also add a whole bunch of filtering and control options for your scanning: you can add delays, you can skip our heuristic checks, you can filter on specific APIs you want to scan, you can force specific methods if you know APIs do certain things, you can add your own headers, and you can append as many wordlists as you require.

Another neat feature, as we found throughout testing, is that some targets and some attack surfaces decide to use Okta or some other type of access gateway across their entire attack surface. Instead of showing you millions of redirects, you can opt to blacklist specific hosts where those requests would go, which simply saves you the time and effort of filtering that stuff out. Finally, we also supply a way of reconstructing requests from the output. What this means is that often, when displaying the output, we can't be super verbose about what headers were sent, what the body was, and what specific things went into a request to cause an error.

So we offer a way to replay a request and also proxy it through your favourite intercepting proxy, like OWASP ZAP. In addition to API scanning, we also provide good old vanilla brute forcing, so you can take it back to your dirsearch roots and even use the same dirsearch wordlists, because we have compatibility for those kinds of things. So let's have a little chant to the demo gods; I'm just kidding, we pre-recorded this. Here we have Kiterunner running across an attack surface that's a couple of thousand hosts, and we're running it with 200 max parallel hosts and five connections each.

What's happening here is that for every baseline directory we're scanning, we perform a bunch of preflight requests and then start trying to figure out which requests are important and which are not. We're sending about 20 to 22,000 requests a second, which is a decent pace for what's effectively a medium instance on AWS. Now, you may be thinking: if we're sending so many requests, what's actually happening? There's our first result, and what we see here is that we might not know why this request returned a 500. So what you can do is take it to the replay feature, and this will reconstruct the entire request, including headers, and show you exactly why this request resulted in a certain response. Let's go back and see what else came through.

We can see that these specific API endpoints deviated from our baselines, and looking at this product endpoint, we can see that after we replay the request it gives us a whole bunch of API information. So now that Sean has built the tool and we've collected the dataset, what's left is to apply the concepts and really prove to everyone that our ideologies and our theories on content discovery have value when it comes to security impact. The first thing is: how do you find all the APIs worth brute forcing? There's a huge list of APIs I've included on this slide, and for each one of these we've created an API signature, such that you can identify the framework by the response on the index page or on a 404 page.

You can find those in this repo, in this directory, and in the signatures we've included a Censys link as well as the signature itself, for what you should be looking for in order to identify these APIs. Typically you want to be able to identify these APIs before you put hosts through something like Kiterunner; you don't want to just blindly run Kiterunner across every host, as the results may not be as amazing. So here's a meme: there's "Cannot GET /", which is the default response from Express when it can't find the route you're trying to request.

Now, I love that response, because it tells me that it's an Express JavaScript Node.js server, which means it's most likely running API endpoints, which means it's most likely a good candidate to put through something like Kiterunner. Here are some more examples. As I said earlier, here's Express; Express also has the header X-Powered-By: Express. Golang looks like this, where it just says "404 page not found". The Spring framework has its Whitelabel Error Page, and Flask has a specific not-found error page as well. So it's actually quite trivial to identify these API frameworks, and on an attack surface you can use these sorts of indicators to decide whether or not you should be running a host through Kiterunner with the context-aware brute forcing.
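As a toy version of that fingerprinting step (the published signature files in the repo are richer than this), something like the following is enough to decide whether a host looks like a framework-backed API worth a Kiterunner run:

```python
# Toy sketch of the framework-fingerprinting idea: fetch the index and a
# guaranteed 404 and look for framework tells before deciding to scan further.
import requests

SIGNATURES = [
    ("express", lambda r: "Cannot GET /" in r.text
                          or r.headers.get("X-Powered-By", "").startswith("Express")),
    ("golang net/http", lambda r: r.text.strip() == "404 page not found"),
    ("spring",  lambda r: "Whitelabel Error Page" in r.text),
    ("flask",   lambda r: "The requested URL was not found on the server" in r.text),
]

def fingerprint_host(base_url):
    hits = set()
    for path in ("/", "/thispagedoesnotexist-kr"):
        try:
            resp = requests.get(base_url + path, timeout=5)
        except requests.RequestException:
            continue
        for name, match in SIGNATURES:
            if match(resp):
                hits.add(name)
    return hits   # non-empty -> likely an API framework, worth a closer look

print(fingerprint_host("https://target.example.com"))   # placeholder target
```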

So here's an example. A normal content discovery tool would have made a GET request to /download, and that would have been a 404 Not Found, like we see in this response. Most content discovery tools would have stopped here, unless the wordlist was super extensive and had everything necessary after the download path. But with Kiterunner, we can see that we get a 200 response when we provide an integer after the download segment. The great thing about Kiterunner is that it pre-fills all of the placeholders, so if the Swagger specification had a placeholder for an integer or a string or a UUID, we automatically fill that out and request it.

We can see it's a 200 here; so what happens when we visit it? We see that it's actually reading a local file: it's trying to read 17506499.pdf from one of these directories locally. We would never have found this if it weren't for the fact that we were requesting this specific path with the integer filled out. This is a really good example of why using something like Kiterunner is more effective than most content discovery tools out there today. Another example is where you hit /api/user and that gives you a 404 Not Found, because that's all you had in your wordlist; you didn't have anything else beyond that.

But notice how, on this path, you can't rely on the response to tell you whether or not there is a further directory down; you don't know that there's something after /api/user, and there's no 301 or 302 response when hitting this path. With Kiterunner we're requesting fully formed paths, so we request /api/user/me, as that's something we found in our Swagger specifications, and later we were able to work out that it's actually pulling users based off a hash, and then we were able to return data for a user by hitting that specific endpoint.

So with Kiterunner you've got that extra context and that extra ability to find endpoints like these. Another example, and this one is great actually, because it really proves why Kiterunner is a tool worth using for content discovery, especially for APIs: this is an Express.js server, and you can see a GET request to /users/create is giving you a 404. So with all of the content discovery tools you use now, if you just use a huge wordlist and the default configuration of GET requests, you would have missed that anything existed here.

But with Kiterunner we can see that a POST request to /users/create is a 500 Internal Server Error, and that's really interesting, because if a GET is a 404 and a POST is a 500, that kind of indicates the endpoint actually does exist, and all other current tooling would have missed it. We can replay this request using Kiterunner's replay feature and see that it actually responds with a JSON error. With some more analysis, we used Param Miner to guess the rest of the parameters, and we were able to register users using this endpoint.

Another example: a GET to /images is a 404 Not Found, and /images/ is also a 404 Not Found, so traditional content discovery tools would miss this; you would not see a 200 response for either of those paths. But with Kiterunner, what we found is that anywhere in the path, if we had /images/ with something after it, so images plus a suffix, it led to a 301. We investigated a little further, went to that path, and it said something like "info ... does not exist". So then we tried etc/passwd, and it just returned the contents of the /etc/passwd file. Here's another example of where context-aware brute forcing is really valuable: the application was expecting something after the images path, and only because our context-aware brute forcing found that there was something after it, leading to a redirect response, were we able to investigate it and find the /etc/passwd file.

So here are some links before we end off this presentation, so you can get the tools: the Kiterunner tool and the kitebuilder tool are available there. We've also written a blog post that goes into depth about all of the work we've done and some of these examples, as well as the API signatures and the two datasets we're releasing today. routes-large will probably take a little bit longer to run but will have much more coverage, and routes-small is for when you're in a hurry and just want to brute force quickly.

The other thing I want to quickly note is that all of the data we've collected in this process is also available as a regular wordlist that you can use with your favourite content discovery tool. There's nothing locking you into using Kiterunner; it's just a tool that I'm sure Sean would love people to use, because he spent the last two months on it, but for the most part you can use any tool you want with the dataset we've created. So the Swagger wordlist (a plain .txt file) is something you can use with ffuf or whatever else you want to use.

The other thing I want to quickly mention is the effectiveness of Kiterunner across a large number of hosts. Something you might struggle with as a pentester, or as someone testing web applications, is that you've got a large list of hosts that you want to do content discovery on. Most current content discovery tools are not so great at this; Kiterunner, however, is very well suited to it, and you should really try it if you're doing something like that. So that's all, thanks guys.

[Audience question, inaudible.]

So I guess I'll answer this in two parts: there's the first part, which is establishing our corpus of data, and the second part, which is doing our research. For establishing the corpus of data, admittedly I don't think we hit that many blacklists; some people did complain, but in that first pass we only went over the whole set once, and using zgrab gives you a bit of reliability with retries. So yes, we would have missed some things, but at the current corpus size I think it's relatively comprehensive and sufficient for content discovery.

In terms of running Kiterunner itself against bounty targets, we have had some hosts that just straight up blacklist you after a certain number of requests, like when they have Incapsula or BIG-IP in front of them; in most of those cases we just hop onto a different EC2 instance and resume. So hopefully that answers your question about blacklists. I think that's a really good question. Currently our concurrency model spins up a supervisor thread for every host, and that supervisor then has its own pool of threads. Scaling out that many requests is actually more efficient with a supervisor per host, because it means all the targets you're scanning at once can proceed in parallel.

Traditional content discovery tools, if you try to scale them out, might throw everything into the same pool, and one incredibly slow host would just block the rest. Our current concurrency model does try to handle your one host that responds very slowly alongside many hosts at the same time; that's also how the delay option works, if you have a very sensitive set of hosts and you want one request per second per host. Okay.

[Audience question, inaudible.]

So, the dataset for the most part should be pretty solid to use for a long period of time, because we cover a lot of different, diverse data sources. However, we do plan to eventually scan more of the internet, for more of the ports we wanted to scan, and additionally we want to see if we can scrape more of SwaggerHub. As I mentioned, SwaggerHub has maybe 400,000 Swagger specifications; if we could get access to the full 400,000, that would be incredibly valuable for this security research, and that's something we're still in the process of trying to arrange.

Yeah, so it's not a reference to Khaled Hosseini. Kiterunner is actually a reference to a kite-flying festival that happens in my hometown in Gujarat, called Uttarayan, where you fly kites and cut other people's kites down. That's basically why it's named Kiterunner; that's the origin story. Just to add to that: shubs came up with the name, obviously with his history; the names that Pat and Sean came up with were terrible, so if you buy us a drink later we'll tell you what they were, but they weren't really good.

We ran all of our tests and requests from us-east-1, so it goes faster now.

Yeah, I mean, it's something you could probably use with Axiom pretty effectively, but I'm pretty sure pry0cc, the developer of Axiom, will beat us to it.

We've considered using things like HTTP/2 and HTTP/3 for our clients. The main difficulty there is that the current Go libraries supporting them aren't very efficient when you scale them up, so currently the tool only supports HTTP/1. I think pipelining and HTTP/2 might make it substantially faster, but most of the bottleneck is on the server side: since you're now sending API requests, the server might also be doing API processing, so even if you can send 500 requests in a second, you're going to be waiting quite a while for the server to respond to them. So it's doable, but probably not advisable. All right, we're almost there; let's just take a couple more questions.

Was the name Kiterunner inspired by Kiteworks? No, but we've found a lot of vulnerabilities in Kiteworks in the past; maybe a talk for another time. Does Kiterunner self-advertise its user agent, or did you feel the need to do any spoofing to prompt responses, an arms-race sort of thing? Generally, a pro tip is that we just use the Chrome user agent, so by default that's what Kiterunner sends; sorry, sysadmins. An oldie but a goodie: if you're targeting just one web API and it's a normally provisioned prod box, during your testing how long has it taken to scan with the short and with the long wordlist?

That's a good question. The answer depends on two things: one is the latency to that server, and two is how quickly that server can respond. My experience has been that with the short wordlist, if you run it in the same region, for example we guess it's in us-east and we spin up an EC2 instance in us-east with relatively low latency, you're looking at maybe a minute or two at most, maybe in the realm of 30 seconds if it's a very fast server. And this is the last question: are you running on one host at a time, and have you considered multiple hosts with C&C or other orchestration?

That's an exercise left for the reader. Otherwise, give a round of applause for these guys.
