Using Common Attack Database and Intent Clustering to protect websites, mobile apps and APIs

Name: Using Common Attack Database and Intent Clustering to protect websites, mobile apps and APIs
Uploaded: 2023-05-12
Duration: 25 min 59 s
BSides Prishtina | 25:59 | 602 views | Published 2023-05

Speakers

Paulina Cakalli

Tags

Category: Technical

Topic: Web AppSec

Style: Talk

Mentioned in this talk

Frameworks

scikit-learn

About this talk

Cyber-attacks are becoming more sophisticated every day. Thousands of attacks happen daily, targeting individuals and corporations. They hit your website, mobile apps and APIs in order to damage you financially or reputationally. In this presentation, I will talk about how we build a threat model to study the behaviour of traffic coming from malicious sources, and how we use historical data to understand and estimate the different bot attacks. These include web scraping, where competitors may attempt to scrape your site to obtain pricing and product information (which can be used as business intelligence to gain a competitive advantage) or to take content such as blog posts and relist it elsewhere to drive traffic to another website; account takeover (credential stuffing), where lists of usernames and passwords are run against the login portal to see which ones permit access; and fake account creation, where attackers create fake accounts to take advantage of trial periods and other user benefits. We have used all our experience to build a Common Attack Database, which combines heavily malicious visitors, IPs, user agents and data centres. We will describe where the malicious traffic is coming from and what levels of sophistication it represents, which behaviours we see day to day, and how we apply Intent Clustering to identify malicious visitors and serve them a captcha, or apply hard blocking if they become abusive or bypass the captcha. The Common Attack Database and Intent Clustering help to protect against cyber-attacks straight away, from the first request coming from a malicious visitor.

Transcript [en]

Good luck. Thank you. Thank you, everyone, for being here; it's great to be here. I want to thank Dardan and the team for organizing BSides. We know how much effort is needed, so well done, it's looking great. Today I'm going to speak about using a Common Attack Database and Intent Clustering to protect websites, mobile apps and APIs.

Okay, apparently this doesn't like me. Yeah, I'll just use this, thank you. First I want to give an introduction of who I am. My current job is lead data analyst at Netacea, a Manchester-based company, where I work closely with the threat research team and the data science team to identify recent threats and protect web services and APIs from them. I'm also a co-founder and organizer at BSides Tirana. We have the call for papers open; it's the second edition that we are running this year. If you want, you can have a look and submit your talk. If you are looking for any instructions, either myself or Rio are happy to support and give our feedback.

I'm also an organizer and team member at Illyrian Brains. Illyrian Brains is a community-based organization; the focus is to bring together Albanian professionals worldwide, do meetups and webinars, and have fun together. I'm interested in threat hunting, cyber security, threat analytics and threat protection. My name is Paulina Cakalli. You can also find me on social media: LinkedIn, Twitter or GitHub.

So, sorry. [Music] The agenda for today will be: what automated threats we are facing today; kill chain investigation; mapping attacks to the BLADE framework; threat attack sophistication levels; what the CAD (Common Attack Database) is; the clustering method used to identify malicious traffic; analyzing bad bot behaviours; and using machine learning to identify bots bypassing captchas. First, let's start with the automated threats. The most recent automated threats that I see in my day-to-day job are credential stuffing, account takeover, fake account creation, scalper and sneaker bots, web scraping and gift card abuse, and there are many other automated threats that we are facing nowadays that are causing either financial damage or reputational damage

to the businesses. I wanted to give an introduction to the BLADE framework. The BLADE framework is similar to the MITRE framework; I'm pretty sure that everyone here knows about the phases, tactics and techniques of the MITRE framework. BLADE is a curated knowledge base of adversary tactics and techniques based on real-world observations of business logic attacks. There are six phases, the distinct stages that a business logic attack may progress through; 24 tactics, the strategies employed by adversaries during specific phases; and 80 techniques, the specific actions or methods performed to achieve tactical goals. Then we will map attacks to the BLADE framework. We'll start with the credential stuffing kill chain overview. So what is a credential stuffing bot?

A credential stuffing bot is used to test previously leaked credentials, typically username and password pairs that you can, for example, buy on the dark web, to determine if they are valid on a target web service or API. These bots validate credential pairs against their target web service or API by automating login attempts, allowing adversaries to test and validate credentials at massive scale. So when we talk about credential stuffing bots, the endpoints that adversaries will try to attack are login endpoints. For the credential stuffing bot, the BLADE framework has six phases. The first phase is resource development, where the attacker produces the tools that they need to attack, and then we have the

reconnaissance stage, where the attacker identifies the specific target, which in this case, as I said before, is login endpoints; and then defense bypass, attack execution, actions on the objectives, and post attack. This kind of attack is the one seen most often recently. So are scalper and sneaker bots: I'm pretty sure you have heard about the PS5 launch, when we had this crazy stuff going on with the PS5s, and also recently, for the Eurovision Song Contest, tickets went for around 11,000 pounds because of bots scalping all the tickets and reselling them. I also wanted to give some information on one more bot.
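The six phases named so far can be written down as an ordered structure, which makes it easier to note what was observed at each phase of an attack. This is a minimal sketch: the names `BLADE_PHASES` and `map_kill_chain` are hypothetical, and only the phase names and the two credential stuffing notes come from the talk.

```python
# Phase names as listed in the talk; the data structure is an illustrative assumption.
BLADE_PHASES = [
    "Resource Development",
    "Reconnaissance",
    "Defense Bypass",
    "Attack Execution",
    "Actions on Objectives",
    "Post Attack",
]


def map_kill_chain(observations):
    """Return (phase, note) pairs in kill-chain order.

    `observations` maps a phase name to a free-text note. Phases without a
    note are kept with None so gaps in the investigation stay visible.
    """
    unknown = set(observations) - set(BLADE_PHASES)
    if unknown:
        raise ValueError(f"not a BLADE phase: {sorted(unknown)}")
    return [(phase, observations.get(phase)) for phase in BLADE_PHASES]


# Credential stuffing, using only what the talk states about it.
credential_stuffing = map_kill_chain({
    "Resource Development": "attacker produces the tools needed for the attack",
    "Reconnaissance": "target identified: login endpoints",
})

for phase, note in credential_stuffing:
    print(f"{phase}: {note or '(not covered here)'}")
```

Keeping the phases in one ordered list means every attack type is described against the same six slots, so two kill chains can be compared phase by phase.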

Fake account creation bots abuse the sign-up process of a web service to create user accounts in bulk using stolen or fake identities. So fake account creation attacks mostly target register endpoints. These bots automate multiple sign-up requests, which can be spread out over long periods of time, using IP addresses from different geolocations, to hide the fact that they are controlled by one person. Attacks coming from residential proxy networks are a good example of this kind of attack from fake account creation bots. Many advanced fake account creation bots can also bypass email, phone and captcha verifications. Mapping to the BLADE

framework, here we see five phases. They are pretty much the same phases, so I'm not going to go through them again, and I'm moving to the next attack, which is the scraper bot. Web scraping is the use of bots to gather content or data from websites. They can scrape product endpoints and the prices of the products. It might be a competitor who is trying to scrape your whole website, clone it and create the same website, or they can play with the prices: if you are selling the same products, they can manipulate the prices and make more

business than you. Again, this goes over six phases: resource development, reconnaissance, defense bypass, attack execution, actions on objectives, and post attack. The last one I wanted to talk about today is the scalper and sneaker bot group: breaking the kill chain with BLADE. It's interesting because some of the use cases that we mentioned before are part of this attack. The first phase of the scalper bot is monitoring targeted websites, creating accounts and scraping products, so fake account creation, credential stuffing, account takeover and product scraping are part of phase one of this attack. If I'm an attacker, I'll scrape your products and I'll either create fake accounts or compromise

existing accounts to do some product scraping. Then I'll add all the stuff into the basket: the add-to-cart stage, where bots hit the basket massively. They simulate many users, and we have speedy scripted bots. From my experience, these attacks last for seconds or minutes on web services or APIs, so they are very fast; that's why they use speedy scripted bots. The last phase is checkout abuse: bots hit checkout endpoints. In this last phase they do credit card and gift card fraud as well. The reason they do this is that they want to gain discounts and spend less money, and then they do

multiple purchases, exceeding limits, until the product is out of stock. This is mostly used for limited edition products like Adidas Yeezy or Nike shoes, or the PS5, or any other limited edition product. One important step in analyzing automated threats is the sophistication level. There are four levels; this is how I see it in my day-to-day job. The first one is the easy level, where the attack is a high volume of requests coming from a single combination of all categories: one single user, one single IP, one single user agent, one single data centre and country. If we have a high volume of requests coming from this combination of categories, it is very easy to spot.

Basically, we can see a peak in the number of requests, and mathematically we can analyze that quickly and identify the attack. The second is the moderate level: a high volume of requests coming from multiple users and clients, but with a single user agent, data centre and country. In this case it's still manageable, I would say: you can build a system with rules and still identify these attacks easily. Then we move to the more sophisticated levels. The sophisticated level is a high volume of requests coming from multiple users, clients, user agents and data centres, but still a single country. In this case I would say that it's manageable. I mean, we can

easily mitigate these attacks, as we can do geo blocking, for example. In the mitigation strategy you can block on a user ID, IP, user agent, data centre, country or ASN, or geographically, so these are still manageable. Then we get to the very sophisticated level of attack. We see a high number of requests from many sources, with multiple values in every category. These are mostly residential proxy network sources: what the bots do is compromise residential proxy networks and behave very similarly to a human. That's the most challenging level of attack that we see when we protect web services or APIs. I'll explain later how we do that.
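The four levels can be read as a question about how many distinct values each category takes within a high-volume burst. The sketch below grades a burst that way. The field names and the order in which the checks are applied are assumptions made for illustration; the talk only describes the four levels themselves.

```python
def sophistication_level(requests):
    """Grade a high-volume burst by the number of distinct values per category.

    `requests` is a list of dicts with the keys used below. Field names and
    the order of the checks are assumptions; the four levels follow the talk.
    """
    distinct = {
        key: len({r[key] for r in requests})
        for key in ("user", "ip", "user_agent", "data_centre", "country")
    }
    if distinct["country"] > 1:
        return "very sophisticated"   # spread across categories, geo blocking no longer enough
    if distinct["user_agent"] > 1 or distinct["data_centre"] > 1:
        return "sophisticated"        # still one country, so geo blocking remains an option
    if distinct["user"] > 1 or distinct["ip"] > 1:
        return "moderate"             # one user agent, data centre and country
    return "easy"                     # a single value in every category


# Five users on five IPs, but one user agent, data centre and country.
burst = [
    {"user": f"u{i}", "ip": f"10.0.0.{i}", "user_agent": "ua-1",
     "data_centre": "dc-1", "country": "GB"}
    for i in range(5)
]
print(sophistication_level(burst))
```

The function assumes the burst has already been flagged as unusually high volume; it only answers how widely the attack is spread, which in turn suggests what a block could be keyed on.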

To ensure the protection, one of the things that we use, which mitigates on the first request and doesn't need big science behind it, is the CAD. CAD is the Common Attack Database. It involves several lists. One of the lists that we have created is called the captcha recommended threat list. It uses all our experience: we serve, let's say, a part of the traffic a captcha, and then we can see whether something is showing human behaviour or bot behaviour. Then we aggregate the data and collect all users that have failed to pass the captcha 100 percent of the time. This is the captcha recommended threat list.
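The aggregation behind such a list can be sketched in a few lines. This is an illustration, not the speaker's implementation: the event format, the function name and the `min_served` guard are all assumptions, and "failed 100 percent of the time" is taken to mean that the visitor never passed a captcha it was served.

```python
from collections import defaultdict


def captcha_recommended_threat_list(events, min_served=3):
    """Return visitors that were served captchas and never passed one.

    `events` is an iterable of (visitor_id, passed) pairs, one per captcha
    served. `min_served` is a hypothetical guard so that a single failed
    attempt does not put a visitor on the list.
    """
    served = defaultdict(int)
    passed = defaultdict(int)
    for visitor_id, ok in events:
        served[visitor_id] += 1
        passed[visitor_id] += bool(ok)
    return sorted(
        v for v in served
        if served[v] >= min_served and passed[v] == 0
    )


events = (
    [("v1", False)] * 3                                # never passes
    + [("v2", True), ("v2", False), ("v2", False)]     # passes once
    + [("v3", False)]                                  # too few attempts to judge
)
print(captcha_recommended_threat_list(events))
```

Only `v1` ends up on the list: `v2` passed at least once, and `v3` has not been served enough captchas for the rule to apply.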

That list is part of the CAD. The next one is residential proxy IPs. We do this geographically: we collect all the residential proxy networks, so if we are in Kosovo, we will collect all the lists for Kosovo; if we do business with the United Kingdom, we'll collect the list for the United Kingdom. Then we have worst offenders. These are values or entities that have been collected through a system of rules. For example, I can build a regex to spot spoofed user agents, or I can build another rule that can see

if an IP is coming through a VPN or a private service. So this is a system of rules that we build to classify whether something is malicious or spoofed and not representing human behaviour. There are also some other lists, which come from the models' output. So why do we use the CAD? As I said before, you can mitigate on the first request. You need zero seconds of preparation, really. If you want protection for a web service or API, a model needs time to tune, to validate, to measure performance and all this. Here, the lists are already there.
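Mitigating on the first request works because a list lookup needs no warm-up: the decision is a set membership test. The sketch below shows one rule-built list and the lookup. The regex is a hypothetical rule that flags automation tokens in a user agent, the IP addresses come from documentation ranges, and the function names and the `"captcha"` / `"allow"` outcomes are assumptions.

```python
import re

# Hypothetical rule: user agents that name a scripting tool or headless browser.
# Real rules would be derived from observed traffic.
SUSPICIOUS_UA = re.compile(r"(python-requests|curl|headless)", re.IGNORECASE)


def build_worst_offenders(requests):
    """Collect user agents that match the rule above."""
    return {r["user_agent"] for r in requests if SUSPICIOUS_UA.search(r["user_agent"])}


def cad_decision(request, cad):
    """Decide on the very first request by plain set membership.

    `cad` maps a category name (for example "ip" or "user_agent") to the set
    of listed values for that category.
    """
    for category, values in cad.items():
        if request.get(category) in values:
            return "captcha"
    return "allow"


browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
seen = [{"user_agent": "python-requests/2.31"}, {"user_agent": browser_ua}]
cad = {
    "user_agent": build_worst_offenders(seen),
    "ip": {"203.0.113.7"},
}

print(cad_decision({"ip": "203.0.113.7", "user_agent": browser_ua}, cad))
print(cad_decision({"ip": "198.51.100.1", "user_agent": browser_ua}, cad))
```

Because the lists are built offline, the request path does no modelling at all, which is the property the talk relies on for zero preparation time.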

With them we can protect your website straight away. Then I'll move to the clustering method, which is the most challenging part of the detection, I would say. There is a lot of mathematics behind the machine learning models that are used to identify bots. First of all, the underlying problem is a classification one: we need to identify whether a user is a human or a bot. One of the most common methods that we use for it is a clustering method called DBSCAN, which is a density-based clustering method. There are some hyperparameters on the method. First we do dataset loading and aggregation: we aggregate the data so each row represents a

categorical value and has a time series array for that category. Then we have the filtering: we filter out categories with a small representation in the data and reduce the volume of data to be considered. When we speak about outliers, this method is really good because it leaves the outliers out, so they are not included in the clusters, and this makes it very efficient for the problem that we have. Then noise adjustment: all time series contain a roughly fixed amount of noise, which interferes with the distance metric due to scaling. For the distance metric, by default it's Euclidean distance, but you can explore the method itself. The clustering algorithm is DBSCAN.
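The preparation steps described above (aggregation into one time series per category value, filtering of thinly represented values, and rescaling before distances are computed) can be sketched with the standard library alone. Function names, the one-minute bucket and the thresholds are assumptions. The peak scaling is a simple stand-in and is not necessarily the noise adjustment the speaker uses.

```python
from collections import defaultdict


def aggregate(requests, n_buckets, bucket_seconds=60):
    """One row per category value: request counts per time bucket.

    `requests` is an iterable of (category_value, seconds_since_start).
    """
    rows = defaultdict(lambda: [0] * n_buckets)
    for value, ts in requests:
        bucket = int(ts // bucket_seconds)
        if 0 <= bucket < n_buckets:
            rows[value][bucket] += 1
    return dict(rows)


def filter_small(rows, min_requests=10):
    """Drop category values with too few requests to show a pattern."""
    return {v: s for v, s in rows.items() if sum(s) >= min_requests}


def scale(series):
    """Divide by the peak so distances compare shape rather than volume."""
    peak = max(series)
    return [x / peak for x in series] if peak else list(series)


rows = aggregate([("a", 0), ("a", 61), ("a", 65), ("b", 10)], n_buckets=2)
kept = filter_small(rows, min_requests=2)
scaled = {value: scale(series) for value, series in kept.items()}
print(scaled)
```

Each surviving row is a fixed-length vector, which is the shape a distance-based clustering method expects as input.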

DBSCAN is a density-based clustering method. Then we move to the reporting phase, where we have a list of cluster labels. We set some rules, we do a cluster selection, and those clusters are then used as recommendations for web services or APIs. If you want to explore the DBSCAN clustering method, you can go to the scikit-learn library and play around with the hyperparameters: you can play with the epsilon and the minimum samples and see what the impact is. The next one, as I said, is the cluster overview. So what does time series clustering do?
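For anyone who wants to try the scikit-learn route mentioned above, here is a small experiment on synthetic data. It assumes NumPy and scikit-learn are installed. The traffic shapes are made up, and the `eps` and `min_samples` values are illustrative choices, not the speaker's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
t = np.arange(48)

# Synthetic data: five scripted sources sharing one traffic shape,
# plus five independent sources with unrelated random activity.
bot_shape = (t % 12 < 3).astype(float)
bots = bot_shape + rng.normal(0, 0.02, size=(5, t.size))
humans = rng.random(size=(5, t.size))
X = np.vstack([bots, humans])

# eps: neighbourhood radius; min_samples: points needed to form a dense region.
labels = DBSCAN(eps=0.5, min_samples=3, metric="euclidean").fit_predict(X)
print(labels)
```

The rows that share a shape sit close together in Euclidean distance and form one cluster, while the unrelated rows have no dense neighbourhood and receive the label -1. Changing `eps` or `min_samples` and rerunning is a quick way to see how sensitive the grouping is.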

The time series clustering method identifies entities that have the same behaviours over time. Let's say I'm a human and I want to buy, I don't know, a dress today: I'll go home, dresses, red, and then checkout, payment, and that's it. If you are a bot, you will behave differently. The probability of me behaving the same as someone else here at the same time is, I'd say, very close to zero, as it's very rare that all people behave the same within the same time period. So what the method does is correlate entities together based on the traffic pattern.

So if you have automated some bots, even ones coming from residential proxy networks, they will be identified from their behaviour over time. So we have the cluster labels. In this case we have a lot of clusters, 26, which is a lot. The -1 cluster is usually unclassified data: we expect it to hold entities that could not be classified or correlated together from the pattern in the trend. This is interesting. Usually it's recommended not to use this cluster, but sometimes you can find that there are some unique bots within it. And then there are the other clusters, for which you build a system of rules.
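A rule-based selection step over cluster labels might look like the sketch below. The -1 cluster is set aside for manual review instead of being recommended, and the remaining clusters pass through rules. Both rules here, a minimum cluster size and a `protected` set of entities that should never be acted on automatically, are hypothetical examples of such rules.

```python
from collections import defaultdict

NOISE = -1  # label given to entities that did not fall into any cluster


def select_clusters(entities, labels, min_size=3, protected=frozenset()):
    """Split clustered entities into recommendations and a review list.

    `min_size` and `protected` are hypothetical rules for illustration.
    """
    clusters = defaultdict(list)
    for entity, label in zip(entities, labels):
        clusters[label].append(entity)

    review = clusters.pop(NOISE, [])        # kept aside, not recommended
    recommended = {}
    for label, members in clusters.items():
        if len(members) < min_size:
            continue                        # too small to trust as a pattern
        if any(m in protected for m in members):
            review.extend(members)          # risky cluster: inspect it manually
            continue
        recommended[label] = members
    return recommended, review


entities = ["a", "b", "c", "d", "e", "f", "g"]
labels = [0, 0, 0, 1, 1, 1, -1]
recommended, review = select_clusters(entities, labels, protected={"d"})
print(recommended, review)
```

Keeping a review list rather than discarding the unclassified entities leaves room to find the occasional unique bot among them.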

In terms of selecting clusters, some might be, let's say, traffic coming from semi-trusted data centres, and you don't want to block this kind of traffic as it's risky; it might cause false positives. This is a good method for captcha as a mitigation strategy, but if we are hard blocking, for instance for mobile traffic, it might be challenging, and we need to set up some extra rules to make sure that this will be valid. Then we will see analyzing behaviours. Here are some of the clusters. What is good about this method is that you can run the intent clustering at, let's say, a user ID level, but you can also run it at a client level, user

agent, so basically all the categories you have. Even at an ASN level you can still run it and find commonalities across ASNs, for instance. Here in the top left graph you will see the -1 cluster. Hopefully you can see the automated behaviour on the blue line, and then the orange and red ones. Usually this is unclassified, but it's still good evidence for finding some weird behaviour. Then you can see the other clusters, where the values correlate really well over time. In the top right one you will see that there are, how many lines, five or more, but they all have the same

behaviour over time, basically: when the traffic goes down, these sources are all doing the same. Then we have the bottom right graph. This is even more volumetric in terms of the unique entities that are sending the attack. Here, for instance, it might be a residential proxy network attack that is using a very distributed list of users and IPs; they use recent versions of browsers and also semi-trusted data centres. This makes it challenging, but we can still find out from the behaviour which IPs or users are doing this and recommend them straight away. That brings me to the last topic I wanted to talk about today.

It is bypassing captchas, a topic that is very challenging nowadays. For all the companies that have bot management as a strategy or as a product, this is one of the parts where we have a lot of discussions with customers who want to find a good solution. So what happens? Low-cost human labour is used to solve captcha images, and they share the same cookies to stay active. You can pay them very cheaply, maybe 0.50 to solve 1,000 captcha frames, and then they keep the cookie active and can reuse the cookie to do the activity. Then we have AI, chatbots, GPT, etc., that can solve captchas with some

human help, again. But then we also have sophisticated bots that can bypass captchas by using AI. Say we have a captcha like yesterday, when Dardan was sharing the awards for the raffle: we saw that you need to fill in the captcha images. You can build an AI model that will be able to fill that in and bypass the captcha; it's very easy, really. What we can do is stop bad bot users from sharing cookies for more than X hours. You can add a rule saying: OK, if a user is using this cookie for more than 12 hours, 24 hours, then stop it, expire this cookie.
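The cookie rule is simple enough to state as code. In this sketch the function name and the idea of storing a `first_seen` timestamp per cookie are assumptions; the 12-hour limit is one of the two example values given in the talk.

```python
from datetime import datetime, timedelta

MAX_COOKIE_AGE = timedelta(hours=12)   # the talk suggests 12 or 24 hours


def should_expire(first_seen, now, max_age=MAX_COOKIE_AGE):
    """True when a captcha-pass cookie has been in use for longer than max_age."""
    return now - first_seen > max_age


issued = datetime(2023, 5, 12, 0, 0)
print(should_expire(issued, datetime(2023, 5, 12, 11, 0)))
print(should_expire(issued, datetime(2023, 5, 12, 13, 0)))
```

The check depends only on when the cookie was first seen, so it applies equally to a cookie that has been passed between a solver and a bot.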

They then need to use a new cookie. This is a quick solution; it doesn't need a lot of exploratory data analysis and so on. The other feature that we investigate is the time that a bad bot needs to bypass the captcha, compared to the time needed by humans. This is the time-to-pass-captcha estimation: we visualise the distribution of time needed to bypass the captcha, and then we build the captcha abuse model. With the features that I mentioned before, sharing cookies and the estimated time to pass the captcha, we are able to build a captcha abuse model. What we do is visualise the distribution of humans and bots passing captchas, and then we

can see the clusters. Usually we'll have two clusters: one will represent bots and one will represent humans. Then it's very easy to spot, with some rules, which of the clusters is automated traffic, and we solve the problem with hard blocking in these cases. I've also visualised... so yeah, this is bypassing captchas. Here I've listed some captcha farms and the captcha versions they support; I'm pretty sure there are others as well. Most captchas can be manipulated and bypassed. The other thing I wanted to share some information on is how we visualise the distribution of bots bypassing captchas.
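The two-cluster picture of solve times can be reproduced with a one-dimensional two-means split. This is a stand-in technique chosen for illustration: the talk does not say which method builds the captcha abuse model, and the solve times below are invented.

```python
from statistics import mean


def split_two_clusters(times, iterations=20):
    """Split solve times (in seconds) into a fast and a slow group.

    Plain 1-D two-means: start from the minimum and maximum as centres,
    assign each time to the nearer centre, recompute, repeat.
    """
    if not times:
        return [], []
    lo, hi = min(times), max(times)
    fast, slow = list(times), []
    for _ in range(iterations):
        fast = [t for t in times if abs(t - lo) <= abs(t - hi)]
        slow = [t for t in times if abs(t - lo) > abs(t - hi)]
        if not slow:
            break                           # all values identical: one group only
        new_lo, new_hi = mean(fast), mean(slow)
        if (new_lo, new_hi) == (lo, hi):
            break                           # centres stopped moving
        lo, hi = new_lo, new_hi
    return fast, slow


solve_times = [9, 10, 11, 12, 38, 40, 42, 45]
fast, slow = split_two_clusters(solve_times)
print(fast, slow)
```

Which of the two groups is automated is left to rules, as in the talk. A time split is only one feature and works best alongside others, such as the cookie sharing check.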

Here are three different bots that were bypassing the captcha, and hopefully you can see how different and unique they are. You can see the purple line: that's a bot that is more stationary, and it needs up to maybe 50 seconds to bypass, and then we have some noise here at 90 seconds. So this bot is manipulating us: we can't set a threshold and say, OK, humans need 40 seconds and bots need, I don't know, 20 seconds. Before, they used to take around 12 seconds to bypass, but now it's completely the opposite: they are behaving like humans, and this makes it

even more challenging. Then we have the red bars: we can see that this bot is bypassing in under 20 seconds and with a much higher number of requests, up to 300. The third bot behaviour, I'd say, is bypassing in around 10 seconds, sending around 67 requests. So as I said, we see really different behaviours, and it's challenging. It's a new industry, I'd say, but it's definitely interesting. This was my talk for today. Thank you very much, I hope you enjoyed it. [Applause] I don't know why it wasn't working. No worries, no worries. Do we have any questions? Yes, we do. Let me go out there.

Thank you very much for the presentation; it was great insight regarding data security. My only question: since you mentioned AIs like ChatGPT, Microsoft's Sydney or Google's Bard, do you see companies or different solutions using AI as a protection measure, to counter these types of bots or engines that want to pass captchas and things like that? Do you see these AIs helping us, or helping companies, in mitigating threats to data security and web applications? Yeah, thank you for your question. From my experience, I don't think there is a ChatGPT version or something

else that is protecting against this kind of threat. It might come in the future, who knows; AI is evolving very quickly. At the end of the day, we're still building machine learning models to protect against this. If we could automate it at an AI level like ChatGPT, and say, OK, identify whether this is a bot or a user, that would be really good, but until now, to my knowledge, no. No worries, thank you. [Music] Do we have any other questions? Paulina, thank you so much, that was very insightful. Thank you.
