
BG - BOLABuster: Harnessing LLMs for Automating BOLA Detection

BSides Las Vegas · 37:29 · Published 2024-09
About this talk
Breaking Ground · Wed, Aug 7, 12:30–13:15 CDT

BOLA poses severe threats to modern APIs and web applications. It is considered the top risk in the OWASP API Security Top 10 and is a regularly reported vulnerability class in HackerOne's Top 10. However, automatically identifying BOLAs is challenging due to application complexity, the wide range of input parameters, and the stateful nature of modern web applications. To overcome these issues, we leverage LLMs' reasoning and generative capabilities to automate tasks such as understanding application logic, revealing endpoint dependencies, generating test cases, and interpreting results. This AI-backed method, coupled with heuristics, enables full-scale automated BOLA detection. We dub this research BOLABuster.

Despite being in its early stages, BOLABuster has exposed multiple vulnerabilities in open-source projects. Notably, we submitted 15 CVEs for a single project, leading to critical privilege escalation. Our latest disclosed vulnerability, CVE-2024-1313, was a BOLA vulnerability in Grafana, an open-source platform with over 20 million users. When benchmarked against other state-of-the-art fuzzing tools, BOLABuster sends less than 1% of the API requests to detect a BOLA.

In this talk, we'll share the methodology and lessons from our research. Join us to learn about our AI journey and explore a novel approach to vulnerability research.

Speakers: Jay Chen, Ravid Mazon
Transcript [en]

So, thank you all for joining. Today we will introduce BOLABuster, which is our methodology for automating the detection of BOLA vulnerabilities using LLMs. But first things first, let's introduce ourselves. My name is Ravid; I'm a senior security researcher at Palo Alto Networks. I'm part of the WAAS team, which does web application security and API security, and in my free time I like watching football games, traveling the world, and taking care of my dog, Maple. Sorry for the delay; apparently AI doesn't solve everything, we still need humans in the loop, a lot of humans, so thanks to the technical team. And my name is Jay. I am a security researcher with Palo

Networks. My research has been focused on identifying risks and threats in cloud environments. Recently I switched my research focus toward generative AI, in particular studying the potential malicious use of generative AI. When I'm not working, I spend most of my time with my hyperactive twin boys, who behave just like Minions, and I also have two cats. When my kids go to bed I become a cat slave: feeding them, cuddling with them, and cleaning their litter box. OK, so let's quickly go over the agenda for today. We will introduce the concept of BOLA, we will see our methodology for automating BOLA detection with LLMs, we will see an actual test that detects a real BOLA, and

eventually we will show you how we hunted down 17 new BOLA vulnerabilities in the wild and what lessons we learned along the way. So, BOLA, or Broken Object Level Authorization. First of all, our motivation for this research: I'm not sure if you are familiar with BOLA, but it is the number-one risk in the OWASP API Security Top 10, and it's also the fourth most reported vulnerability class on HackerOne. So it's very common, it's very severe, and there is no automation tool today that actually detects BOLA at scale. For all of these reasons, we decided that we needed to develop our own

methodology to automate BOLA detection and solve an unsolved problem. If you take a look at this screenshot, you can see a patient application: a patient can make an API call with his visit ID and get his doctor's notes. But what would happen if the same patient queried another visit ID, one that belongs to another user? If he is able to fetch another user's sensitive data, we have an authorization issue, a BOLA. So BOLA is basically a vulnerability that arises when the application fails to validate whether the user is authorized to access, modify, or delete objects that do not belong to them.
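To make the missing check concrete, here is a minimal, hypothetical sketch (the data store and function names are invented for illustration, not from the talk): a vulnerable handler returns any visit by ID, while the fixed one verifies ownership first.

```python
# Hypothetical in-memory "visits" store; names invented for illustration.
VISITS = {
    101: {"patient": "alice", "notes": "doctor's notes for alice"},
    102: {"patient": "bob", "notes": "doctor's notes for bob"},
}

def get_visit_vulnerable(requester: str, visit_id: int) -> dict:
    # BOLA: no object-level check -- any authenticated user can read any visit.
    return VISITS[visit_id]

def get_visit_fixed(requester: str, visit_id: int) -> dict:
    visit = VISITS[visit_id]
    # Object-level authorization: the visit must belong to the requester.
    if visit["patient"] != requester:
        raise PermissionError("403 Forbidden: not your visit")
    return visit
```

The fix is one line of ownership checking, which is exactly why the bug is so easy to ship and so hard to spot from the outside.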

So imagine that I'm able to delete Jay's comment from Instagram or Twitter, which obviously I shouldn't be able to do; that is an authorization issue, and we would have a BOLA vulnerability. The consequences can be data leaks, data manipulation, or even a full account takeover. These are the challenges we faced during development. First of all, today's applications have multiple users, typically multiple roles and resources, and it's really difficult to understand which user is allowed to do what. Secondly, most applications today are stateful, which means that every action you take, every endpoint you call, changes the state of

the application. So imagine that you try to delete a comment from an article: first you need to create an article, then create a comment, and only then can you delete it. There are dependencies between endpoints that are really hard to recognize at first. Also, there is the problem of a lack of vulnerability indicators. Think of XSS or SQL injection: they all have a pattern, and we all know how to recognize them pretty quickly. BOLA is a logical error; it doesn't have a clear pattern, so it's really difficult to decide whether something is a BOLA or not. And lastly, the context of the

application: it was difficult to understand exactly which endpoints or parameters return sensitive data, and what the actual impact of each action is. All of this made it a real challenge to automate BOLA detection at scale. Thanks, Ravid, for covering the background; I hope everyone now understands why automating BOLA detection is not easy. So BOLA is not a new problem. It has been around for as long as we've had the internet. However, it was only two years ago that we realized AI might give this problem a glimmer of hope. In particular, the rapid advancement of AI provides us with new tools to solve

problems that were not possible to solve previously. It also happens that our particular challenge, extracting context and logic information from textual data, is exactly what large language models are extremely good at. That was the beginning of our journey in using AI to solve this problem. Here's a high-level overview of our methodology. The only required input is an OpenAPI spec (a Swagger spec). In the first stage we identify the endpoints that can be vulnerable to BOLA; not every endpoint can be. We use AI to analyze every endpoint and its parameters and select a subset of endpoints. This step helps us focus on a

smaller set of relevant endpoints only and avoid wasting time on endpoints that are not at risk. It's important to note that we call our target endpoints potentially vulnerable endpoints, or PVEs for short; I will switch between "target endpoint," "PVE," and "potentially vulnerable endpoint" in this talk. The next stage uncovers the dependency relationships between endpoints. Modern web applications are complex, with one endpoint depending on many others. For example, if I want to test an endpoint that updates an invoice, I first need to call the endpoint that creates the invoice, and before I can call that endpoint, I also first need to

call the endpoint that creates the transactions included in the invoice. So it is crucial to identify the dependent endpoints of a target endpoint before we can accurately test it. With the endpoint dependency relationships identified, we can then calculate the execution paths to each target endpoint, each potentially vulnerable endpoint, and create a test plan for each one. The next stage turns the test plan for each PVE into a set of executable bash scripts using large language models. There may be one or multiple execution paths to each target endpoint, and we aim to cover as many paths as possible. In the last stage, we set up an actual

API server and run all the executable bash scripts to send actual API requests to the target endpoints. The processes of user registration, user login, and token refresh have all been automated, and we also use AI to analyze the logs and responses during the test to determine if an endpoint is vulnerable to BOLA. Now let's dive into more detail on each stage. The first stage identifies the endpoints with input parameters that reference private, sensitive, or confidential information. These are the endpoints we primarily focus on: the target endpoints, the potentially vulnerable endpoints. Let's use the first endpoint as an example. The parameter username here suggests that this

endpoint may reference data associated with a particular user. As a result, if this endpoint is vulnerable to BOLA, an attacker could reset another user's password. Similarly, with the second endpoint here, if it is vulnerable to BOLA, an attacker may be able to change another user's email. Traditionally, pentesters manually look through every endpoint and its parameters to identify their target endpoints. This process is slow and cumbersome, especially for large applications with hundreds of endpoints. We leverage AI's capability for reasoning and understanding to automate this step. Here is a snippet of the prompt that we use to communicate with the AI.
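As a rough illustration of what this stage decides (not the actual prompt or model output), a purely heuristic version of the selection might look like the sketch below; the talk's point is that the real pipeline replaces the fixed pattern list with an LLM given rules and examples.

```python
import re

# Heuristic stand-in for the AI-based PVE selection (illustrative only):
# flag endpoints whose input parameters look like references to user-owned
# objects. The pattern list and endpoint data are invented for this sketch.
SENSITIVE_PARAM = re.compile(r"(user(name)?|email|account|order|invoice|id)$", re.I)

def select_pves(endpoints):
    """Return (method, path, suspicious params) for endpoints worth testing."""
    pves = []
    for ep in endpoints:
        hits = [p for p in ep["params"] if SENSITIVE_PARAM.search(p)]
        if hits:
            pves.append((ep["method"], ep["path"], hits))
    return pves

endpoints = [
    {"method": "PUT", "path": "/users/{username}/password", "params": ["username"]},
    {"method": "GET", "path": "/health", "params": []},
]
pves = select_pves(endpoints)
```

A regex like this misses parameters with unconventional names and flags harmless ones, which is exactly the gap the LLM-based rules are meant to close.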

Basically, this prompt instructs the AI with a set of rules and examples for identifying parameters that may reference sensitive information. The AI then returns the endpoints and parameters that meet any of the conditions. The next stage uncovers the dependency relationships between endpoints. As I mentioned, modern web applications are complex and stateful, meaning that the execution of any endpoint can change the state of the entire application and affect the outcomes of other endpoints. That's why it is crucial to identify the dependency relationships before we can accurately test any target endpoint. In this diagram, the endpoints on the right are what we call consumer endpoints, and the endpoints on the

left are producer endpoints. This is one of the most important concepts in our research for identifying dependency relationships: producer endpoints on the left output the values that consumer endpoints need as input. Again: the producers on the left produce, as output, the values that the consumers need as input. Let's use this as an example. The consumer here is delete username, and in order to correctly test this endpoint, we need to feed it an existing, valid username. If we feed in a random username, the test case will always fail and give us meaningless results. In this case, this consumer endpoint has four producers, and each of these producers can output an

existing, valid username that the consumer endpoint can use for testing. Each endpoint can be a producer, a consumer, or both. Let's look at another example. The consumer endpoint here is delete comment. It has two required inputs, slug and comment ID, and they can all come from its producers, get comment and post comment. In turn, these two producers also have a required input, slug, which they get from their own producers. And here is the snippet of the prompt we use to teach the AI to recognize dependency relationships between endpoints. It is important to note that although this process could be done heuristically, by using heuristics to match the output parameters

of a producer with the input parameters of a consumer, this heuristic matching algorithm is not reliable, for several reasons. First, developers may use different parameter names to reference the same data object within the system, and developers may also use the same parameter name to reference different objects. That's why we use AI to analyze the descriptions in the spec and match the endpoints by their functionality rather than just the parameter name.
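To see why name matching alone is fragile, here is a minimal sketch of the heuristic matcher (endpoint and parameter names are hypothetical): it links a producer to a consumer only when parameter names are equal, so a renamed parameter silently produces no match.

```python
def match_dependencies(producers, consumer_inputs):
    """Heuristic: pair each consumer input with the producers that output
    a parameter of the exact same name (illustrative only)."""
    deps = {}
    for name in consumer_inputs:
        deps[name] = [p["endpoint"] for p in producers if name in p["outputs"]]
    return deps

producers = [
    {"endpoint": "GET /comments",  "outputs": ["slug", "comment_id"]},
    # Same object returned under a different name -- the heuristic misses it:
    {"endpoint": "POST /comments", "outputs": ["slug", "commentId"]},
]
deps = match_dependencies(producers, ["slug", "comment_id"])
```

Here `comment_id` only matches one of the two producers even though both return the same object, which is the failure mode that pushes the pipeline toward LLM-based, functionality-level matching.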

The next stage turns the pairwise dependency relationships we identified in the previous stage into a dependency tree for each target endpoint. In this dependency tree, the root node represents our target endpoint, the PVE. So we create one dependency tree per target endpoint, and within this tree, any two directly connected endpoints represent a consumer-producer relationship. In this diagram, the PVE is the parent node and it is the consumer; N.1 is the child node of the PVE and it is a producer. Again, producers output the values that consumers need as input. Let's use this one as an example: N.1 is the PVE's producer, and N.3 and N.4 are N.1's producers. In the next

stage, with all the dependency relationships figured out, we calculate the execution paths to each target endpoint. In the dependency tree, a path from a leaf node to the root node represents an execution path, and there can be one or multiple execution paths to each target endpoint. Let's plug in some real endpoints to show how we calculate the execution paths. In this case the PVE is delete comment, and it has two producers, get comment and post comment. These two producers in turn each have two producers, so in this case our target endpoint, delete comment, has four execution paths.
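The path calculation can be sketched as a simple tree walk (the endpoint names are illustrative): each leaf-to-root chain is one execution path, so two producers that each have two producers yield four paths.

```python
def execution_paths(tree, node):
    """Enumerate execution paths (producers first, target last) for a
    dependency tree given as {consumer: [producers]}."""
    producers = tree.get(node, [])
    if not producers:
        return [[node]]
    paths = []
    for producer in producers:
        for sub in execution_paths(tree, producer):
            paths.append(sub + [node])
    return paths

# Illustrative tree for the "delete comment" example from the talk.
tree = {
    "DELETE /comments/{id}": ["GET /comments", "POST /comments"],
    "GET /comments":  ["GET /articles", "POST /articles"],
    "POST /comments": ["GET /articles", "POST /articles"],
}
paths = execution_paths(tree, "DELETE /comments/{id}")
```

Each returned path already lists the calls in executable order: the leaf producer runs first and the PVE is always the last request.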

Finally, in the last stage, we turn each execution path into an executable bash script and run the script to actually send API requests to the target endpoint. The process is more complicated than just generating and running the script, as Ravid will explain in the next few slides. Thank you, Jay. So before we get to a real BOLA test and a demo, let's see the rest of the stages before we can actually generate tests. First we need to register the users and collect the login data. These steps are also done by AI, with human verification. First we generate user credentials that meet the

criteria: the AI analyzes the OpenAPI spec and figures out the complexity requirements for usernames and passwords in the application. Next we create and execute the user registration phase, and lastly we fetch the login data and save it in a dedicated file for later use. Before we can actually create the tests, we need to isolate the test data. An OpenAPI spec can be really huge, and we don't want to feed the AI a huge spec; we want to include only the specific data relevant to each test. As Jay mentioned, we have a consumer and producers for each test.
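Spec trimming can be sketched like this (the dict layout is a simplification of a real OpenAPI document): keep the top-level metadata but only the paths that one test's consumer and producers need.

```python
def trim_spec(spec: dict, needed_paths: set) -> dict:
    """Return a copy of the spec containing only the endpoints one test needs."""
    trimmed = {key: value for key, value in spec.items() if key != "paths"}
    trimmed["paths"] = {
        path: item for path, item in spec["paths"].items() if path in needed_paths
    }
    return trimmed

spec = {
    "openapi": "3.0.0",
    "paths": {"/users": {}, "/articles": {}, "/health": {}},
}
small = trim_spec(spec, {"/users", "/articles"})
```

A real trimmer would also carry along shared schema components that the kept paths reference; this sketch only shows the path-level cut.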

We isolate only the data for that consumer and its producers and create a new, trimmed spec file, which is what we feed the AI. We do it for efficiency, we want only accurate data, to prevent the AI from making mistakes, and for cost, we want to send the AI the smallest number of tokens we can. When we generate a test script, it is saved as an executable bash script. You can see an example of a really simple BOLA test: we have a consumer, put user password, and the producer is get users, which fetches the username. This test will be saved in a

put/get directory: put is the parent, get is the child. The test generation actually runs asynchronously to save time, and right now we average about five seconds to generate each executable bash script. Lastly, when we run the tests against the application, we do it in a certain execution order, and the main goal is to avoid any test failing for technical reasons; we want a test to fail only if the PVE is not vulnerable to BOLA, which is what should happen. So we first run the tests that populate data and resources in the application, and only then try to fetch them.
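Assuming each generated script is keyed by its target call's HTTP method, the ordering idea can be sketched as a simple sort (the ranks are an illustrative choice, not the talk's actual scheduler):

```python
# Run state-creating calls first, reads next, updates/deletes last, so a
# test only fails when authorization is actually enforced -- not because
# a resource does not exist yet or was already deleted.
METHOD_RANK = {"POST": 0, "GET": 1, "PUT": 2, "PATCH": 2, "DELETE": 3}

def execution_order(tests):
    """tests: list of (method, path) for each generated script's target call."""
    return sorted(tests, key=lambda t: METHOD_RANK.get(t[0], 1))

tests = [("DELETE", "/users/{u}"), ("POST", "/users"), ("GET", "/users/{u}")]
ordered = execution_order(tests)
```

Python's sort is stable, so tests with the same method keep their original relative order.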

We do this to avoid fetching nonexistent resources, and we push the delete and update operations to the end; we don't want to delete a user and then try to fetch it, because the test would fail due to a technical issue. After we finished building our methodology, we ran an evaluation. We evaluated BOLABuster against RESTler. RESTler is an open-source API fuzzer created by Microsoft; it's one of the best API fuzzers you can find, at least among open-source tools, so we wanted to test our methodology against the best. Its goal is also to automate the testing of services through REST

APIs and, basically, find security vulnerabilities. You can see in this table what we did: we took three open-source applications, VAmPI, cAPItal, and crAPI, all of them deliberately vulnerable to the OWASP API Security Top 10, so they contain known BOLAs; you can see the number of BOLAs in each one. So again, we ran RESTler against these applications, and these are the results of the RESTler run: RESTler couldn't discover any BOLA vulnerabilities in any of the applications. I should say that we used the default configuration, and they claim that they do find BOLA with the default configuration, but they

didn't. And the number of API calls, as you can see, was pretty big, thousands and even hundreds of thousands of calls, which obviously puts a lot of load on the application. And these are our results, which were amazing: we found all of the BOLAs in all of the applications, and we did it with less than 1% of the API calls compared to RESTler. This is really huge in terms of load; we barely put any load on the application, and we were able to find all of the BOLAs. We also focused on the true positive rate. Our goal was that if an application has a BOLA, we want

to find it. We didn't really care at this stage about false positives, because it's basically impossible to avoid them, but we did have a 100% true positive rate: we found all the BOLAs in the applications. Now let's see an actual BOLA test. This is a high-level example of a test. We have two producers here and one consumer, which is the PVE. In this scenario, Alice will create an article, she will create a comment on this article, and eventually Bob will try to delete Alice's comment, which is the potential BOLA. So first of all, Alice and Bob log into the system, and they

each get a unique token. Afterwards, the sequence of the test begins. Alice creates a new article in the system, and we save the article title. Later, Alice creates a comment on this article, and we save the comment ID. Lastly, Bob attempts to delete Alice's comment, and this has two possible results: if he succeeds, we get a 200 OK, and this is a potential BOLA vulnerability for this endpoint; if everything is correct and the defenses are in place, it will of course be forbidden. Let's see what the code looks like. We have here Alice, who

creates an article. I hope you can all see it. First of all, we create a unique random string to use as the article title, so we get a unique title every time, and then we make a POST request to /articles, in this case to create a new article. You can see that the Authorization header is user A's token; this identifies Alice as the one making the request, and we use the random string as the article title. So this is the API call; we save the article title as the slug, and then Alice creates a

comment on the article. As you can see here, the slug is dynamically used in this API call; this is the slug we saved from the previous API call, so we are using the article she already created. She creates the comment, and we save the comment ID. Lastly, Bob tries to delete her comment. You can see the API call uses the slug and comment ID we saved before, but now the Authorization header is user B's token, so this is Bob's identity instead of Alice's. And basically, at the end, we

check if the test passed. Just one sentence about the check: in our case it's enough to mark a test as a potential BOLA if the PVE returns 200 OK. If you think about it, every test we make is malicious by definition; the last request is a user trying to perform an action that he shouldn't be allowed to, so we expect not to get a 200 OK. If we do, we can mark it as a strong BOLA candidate, and later on we perform human analysis and checks to verify whether it is a real BOLA.
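The check itself reduces to classifying the status code of that final, should-be-forbidden request; a minimal sketch (the category names and status handling are simplifications, not the exact pipeline logic):

```python
def classify_result(status_code: int) -> str:
    """Classify the final cross-user request of a BOLA test."""
    if 200 <= status_code < 300:
        return "potential BOLA"          # escalate for human verification
    if status_code in (401, 403):
        return "authorization enforced"  # the expected, safe outcome
    return "technical failure"           # e.g. 404/500: rerun or fix the test
```

Separating "authorization enforced" from "technical failure" matters: a 404 on a resource that should exist means the test setup broke, not that the endpoint is safe.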

OK, so for the fun part, let's see the CVEs that we found. But before that... OK, I'm not sure the demo will work. No, it's not working, I'm sorry, we're having technical issues. So let's see the BOLAs we actually found in open-source applications. First we have Harbor, which is a cloud-native container registry, basically equivalent to Docker Hub, which I'm sure all of you know. It's a CNCF graduated project, so it's very popular; it has been downloaded about 2 million times, and in 2024 we found a BOLA there. We also have Grafana, which I'm sure most of you are familiar with: it's

a very, very popular data visualization and monitoring tool; it has about 20 million users around the world, and we were able to find a BOLA there as well. And lastly we have Easy!Appointments, an appointment scheduling application. It is less popular, but it has almost 200,000 downloads, and we managed to find 15 new BOLAs there. Seven of these vulnerabilities are rated critical, with a CVSS score of 9.9. So imagine: seven vulnerabilities out of the 15 allow a full compromise of the application; you can become an admin and basically do whatever you want. So this was a pretty big

achievement for us. Now let's dive into the Harbor vulnerability. Harbor has projects, and the feature we found vulnerable is the project configuration metadata. Every project has users: you can be an admin, a maintainer, a developer, a guest, or a limited guest. I hope you can see in the screenshot that this is the configuration of a project in Harbor. You can do a bunch of stuff: you can change the project from private to public, you can set up automatic scanning of images for vulnerabilities, and much more. Harbor claimed that only the project admin can create, modify, or delete this

configuration, which makes a lot of sense; this is a crucial part of the project. But we actually found that when logged in as a maintainer, although we could not modify these attributes via the UI, as intended, we could do it via the API. So there was a discrepancy between the UI and the API, which basically allows an unauthorized user to create, edit, or delete the project configuration. The issue is that a maintainer can effectively extend his privileges: he can make a private project public, deploy unverified images, bypass vulnerability scanning, and more. The consequences can be really

bad: as a malicious maintainer, you can compromise the entire project's integrity and, basically, its security posture. Harbor acknowledged this vulnerability and issued a CVE; it was not long ago, basically two weeks ago, and they published the details in their security advisory, so if you want to take a look, it's open source, go ahead. And if you want to read more technical details about this vulnerability and our methodology, scan this QR code; it leads to our blog post describing the vulnerability. Yeah, thanks, Ravid. I hope everyone understands by now why we need AI and how

we use AI. We started the project by dropping an entire OpenAPI spec into ChatGPT and asking it to find all the BOLA endpoints. As you may imagine, the results were not great: we not only exceeded the token limit but also confused the AI a lot. After many, many trials and errors, we gradually learned how to collaborate with AI and optimize its performance. These are the most important lessons we learned throughout the research. First, AI isn't always the best solution. AI should not be used to solve simple problems that have heuristic solutions, like sorting, finding paths, or solving equations. These problems have existing, efficient, optimal solutions. Although we can use AI to solve these

problems, it usually does so at much higher cost and in more time. Remember: don't shoot a mosquito with a shotgun. If a problem has an existing heuristic solution, always choose the heuristic over AI. Second, don't trust; always validate. Blindly trusting the output of AI can be very dangerous, especially in critical applications where mistakes can cost millions, or even human lives. In our research, generative AI often gave us nonexistent parameters, endpoints, or shell commands. In our case these mistakes are not life-threatening, but they result in failed test cases, false positives, and false negatives. As a result, it is crucial to always validate the

output of AI before using it or passing it to the next stage. Lastly, make AI's job easier. Treat AI like a very capable junior colleague who can do simple tasks extremely well but starts making mistakes when the task gets more complex. It is thus the human supervisor's responsibility to simplify the AI's task. In our case, it was a bad idea to feed an entire API spec of thousands of lines into the AI; it confused the AI a lot. So we break each API spec into many smaller pieces and only feed the AI the relevant pieces for the current task. When working with AI, divide and conquer is always a good strategy.
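The divide-and-conquer step can be sketched as greedy batching under a budget (the budget and cost function are placeholders; a real pipeline would count tokens rather than characters):

```python
def batch_pieces(pieces, budget, cost=len):
    """Greedily group spec pieces into batches whose total cost stays
    within the given budget, one LLM call per batch (illustrative)."""
    batches, current, used = [], [], 0
    for piece in pieces:
        c = cost(piece)
        if current and used + c > budget:
            batches.append(current)
            current, used = [], 0
        current.append(piece)
        used += c
    if current:
        batches.append(current)
    return batches

batches = batch_pieces(["aaaa", "bbbb", "cc"], budget=8)
```

Swapping in a tokenizer-based `cost` function would make the same loop budget actual prompt tokens instead of characters.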

Although we have seen some promising and successful results from this research, we still have some remaining challenges and room for improvement. First, our methodology is very sensitive to the quality of the input API spec. Currently we treat the API spec as the absolute truth and build our entire test plan on top of it. However, throughout our research we found that many API specs of open-source projects are outdated or inconsistent with the actual API functionality. This inconsistency results in a lot of issues: failed tests, failed execution paths, false positives, and false negatives. Next, not every API application out there has an API spec available. Some

applications are not maintained or even documented, and there are applications deployed in more restricted environments, such as industrial control systems, to which we don't have direct access. As a result, in the next phase of the research, we want to explore more data sources, such as pcaps, flow logs, or even source code, to help the AI understand the application. Lastly, using generative AI models can be quite expensive, especially the more advanced ones. The cost of our methodology is proportional to the size of the API spec and the complexity of the application. Luckily, with the rapid advancement of AI and the intense market competition between AI service providers, the cost of AI has

dropped a lot, faster than we could have imagined. Compared to just six months ago, the cost of testing an application has been cut in half, while the performance and speed of the models we use have significantly improved. One amazing side effect of working with AI is that the performance of our system gets a free boost whenever a new generation of models becomes available. And with that, let's conclude our talk. I don't know if we have time for questions, but we can talk after. [Applause]

Thank you very much. Thank you.