
OK, let's get started with the second lecture. I hope you've had some coffee and you're ready for what's coming, because right now we're going to enter the dark web. Our next speaker will tell us how big a mountain of incoherent data lurks in the dark web, and how you can actually extract workable intelligence from it, so you can know what is potentially happening and what might hit you down the road. So from Group-IB, please welcome Igor Kaplan, machine learning expert.
Hi, folks. My name is Igor, and I am a machine learning engineer at Group-IB. That's me on the slide. I worked on the development of an agentic AI system in our fraud protection product, and today we're going to talk about darknet intelligence. If you have ever worked with darknet data, you're very well aware that this data is really complex to work with and that it is very demanding to extract actionable insights from it. Today we're going to explore how you can employ AI to help with this.
Okay, it works. So our agenda for today is pretty straightforward. We will start by exploring a particular use case in which darknet intelligence can be a real game changer. Then we will formulate a problem statement: what we would like to achieve, and the obstacles in our way towards achieving that goal. Then we will explore the solution options we have on hand, dive deep into the particular one I'm going to present to you, and then I will tell you how AI is going to replace us all, and how you can make AI work for you. Okay. I know that most of you have very firm answers to this question, but I would answer it by asking another one: how can you discover an emerging threat well in advance and take the right actions? And one of the answers is to look behind the
scenes. And in this sense, the darknet is the right place to look. As you're all aware, the dark web constitutes only a tiny fraction of the global web, but it's a very important one. It is the part of the web where cyber criminals communicate with each other, where they sell their toolkits, and where they share their insights and experience. This is exactly the spot where you can gather the intelligence you need to be one step ahead. So let's look at a particular example, starting with the theory. Our example is a phishing campaign, and typically it can be split into four distinct stages. Stage one is weaponization: the threat actor releases a new phishing kit and starts to distribute it. Then comes testing and the start of exploitation. This is the phase where the first buyers of the phishing kit start testing it and sending the first phishing links to their victims, and the first victims get affected. This is usually also called the golden hour of the campaign: the time when the campaign can still be contained quite easily. The number of victims is still small, the amount of losses is not that large, and it is the right time to take the right actions. But it is also the time when the early warning signs start to appear, so you can actually see what's coming next. Then comes the industrialization phase. During this phase, the adversaries ramp up distribution really quickly. They use all the leaked databases they have on hand, and the buyers of the phishing kit start exploiting it really extensively. At this phase the number of victims grows at an exponential rate, and it is usually quite hard for the defenders to contain the attack without heavy losses. And then comes the final phase: full-scale operation. The phishing kit is fully industrialized and operating at full scale, and at this stage you can only respond to the attack, so you can basically only clean up the mess and try to reimburse the losses.
So our target is to detect the attack during the golden hour. Now that we are familiar with the theory, let's jump to a real case, and it's the worst-case scenario. Again, weaponization: this is a real message that was detected on the dark web. A threat actor releases a new phishing kit and starts promoting it on one of the darknet forums. This phishing kit targets retail banking customers. All right. During the golden hour there are intelligence blind spots: the defenders blinked, nobody saw anything, and the attack couldn't be anticipated in advance. So what happens? The bank actually detects the attack too late, only when the number of complaints from customers starts to explode and they start to see very suspicious links mentioned in the domains they observe. At this point they can do nothing; in the full-scale-operation phase they can only respond. So they finally do the hunting: they start searching for the right tools on the darknet just to understand how to fight it, they issue domain takedowns, and they start to warn their customers. But it's too late, and the amount of losses is already immense. And now let's look at the
ideal case, the ideal scenario. In this scenario the same thing happens, but during the golden hour we discover the signs of the attack at an early stage. We can attribute the threat, so we can start preemptively orchestrating mitigation early on: the hunting rules are updated, which can be done automatically; the takedown is initiated; an awareness campaign among the bank's customers is started. This is the proper scenario, what should have been done, and this is what we strive to achieve. Okay, how can we achieve that? Let's formulate our goal. We would like to automatically identify emerging threats on the darknet with little to no latency, so that we fit into that golden hour, and then we need
to accurately attribute those threats so that we can initiate the orchestration of preventive actions. But we have a number of obstacles, one of them being volume. The number of messages on the darknet, if you can source them properly, can be really large. If you know where to dig, you can dig through a lot of darknet forums, Telegram channels, and Discord channels. And although some of them are really credible and people don't just idly chat there, most of these channels are like channels everywhere: people talk a lot, and it is very noisy data. As always in the cybersecurity domain, it's very hard to find the needle in the haystack; the signal-to-noise ratio is really low, with a lot of spam that's not useful for anything. Then comes variety. Never mind the different types of threats and the different types of adversaries communicating on the dark web; people actually chat a lot there. They can discuss politics, they can discuss online shopping alongside exploiting online shoppers. It can be a real mess. Then there's also the language aspect. On the darknet people also communicate in many different languages. Of course, English is predominant, but still all the languages
exist there, so you need to take them into account somehow. There are also additional artifacts, such as links or pictures, that need to be processed somehow, so the variety is huge. Then quality. Of course, folks on the darknet don't try to communicate politely or in a very structured way; it's not a news feed from the BBC. And apart from that, they're interested in obfuscating their intent: they pretend to be chatting, but actually they are selling something, so it's hard even for a human to understand what's going on there. On top of that, there are a lot of special symbols, typos, and mixtures of languages, as often happens when people mix their own language with English, and so on. And of course there are a lot of duplicates, which need to be taken into account somehow, because adversaries copy-paste the same message, with some minor changes, across the whole darknet. So the quality is not great either. Then
we also need to take context into account, because different darknet forums have different credibility and different specializations, and the different threads should also be taken into account, because in one thread they can discuss politics and in another they can discuss how to exploit something. Then, of course, inside a single message context also changes everything, because there are a lot of, for instance, company names that are really ambiguous. A great example is booking: how can you distinguish whether someone is talking about booking flight tickets or about ripping off Booking customers? All of these problems need to be addressed somehow in order to achieve our goal, and if we approach this problem manually, it becomes really demanding. So let's try to approach it with modern technologies, and let's look at our target workflow. Darknet forums produce a stream of messages of varying quality and varying topics. Then comes the processing magic: something has to be done
so that we can produce high-quality insights and labeled messages for the target orchestration: some kind of SOAR system that can initiate prevention measures, or notifications to customers or analysts so that they are alerted in a timely manner. And then there is also storage for search and reporting, which is also very important, because if you store all those messages unprocessed, you cannot really search over them, you cannot observe the threat landscape on the darknet, and you cannot derive any insights: the amount of data is immense, and all the data-quality problems are still in place.
Okay. So what are our solution options?
First of all, of course, there are traditional rule-based systems: hunting rules. I'm sure you're all very well familiar with them. They are written manually by analysts, and they can detect threats pretty well. You can also extend such a system really quickly: you can add a new rule, and although it's not easy to come up with the correct hunting rule, adding it to the system is really straightforward. But it takes a lot of manual work to come up with a hunting rule, and the approach has low flexibility: hunting rules aim to be really precise, and if a message found on the darknet doesn't match the pattern exactly, then you miss that threat. And with a growing number of hunting rules, it very easily gets overwhelming to keep track of them, to navigate among them, and to keep them up to date. Some hunting rules can be really long and complex, so that a person who didn't write them cannot even understand what's written there. They also struggle to take context into account; the context inside the message, the threat itself, is not easy to fit into a rule. Of course you can use regex, but for context-heavy situations, for instance the booking example, regex is not a great help. So what we have as a
bottom line: the velocity of the system, once it works, is really high. When you have the right hunting rules in place, you get the predictions really quickly. Their accuracy varies from medium to very low. By medium I mean that well-written hunting rules can be pretty precise, although their patterns can also match data that wasn't intentionally targeted. And by very low I mean that if you don't have the right rules in place, your system simply doesn't work. The cost at creation is really high: you need to manually comb through all those messages that you find online and try to come up with a pattern.
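To make the rigidity concrete, here is a minimal sketch of such a rule-based matcher; the patterns and labels are invented for illustration, not actual hunting rules:

```python
import re

# Illustrative hunting rules (invented examples): each rule is a compiled
# regex paired with a threat label.
HUNTING_RULES = [
    (re.compile(r"\bphish(?:ing)?\s*kit\b", re.IGNORECASE), "phishing-kit-sale"),
    (re.compile(r"\bfullz?\b|\bcc\s*dumps?\b", re.IGNORECASE), "data-selling"),
    (re.compile(r"\binitial\s+access\b.*\bsell", re.IGNORECASE), "initial-access-broker"),
]

def match_rules(message: str) -> list:
    """Return the labels of every hunting rule the message matches."""
    return [label for pattern, label in HUNTING_RULES if pattern.search(message)]
```

Note how a slightly obfuscated variant such as "ph1shing k1t" slips past every rule, which is exactly the low-flexibility problem described above.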
So it's a really demanding job, as we have already discussed. The cost at runtime is really low: you don't really pay for anything, it's just rules. So: good, but not enough. And then you might all think, all right, we're in the AI world, so why don't we just take the best LLM there is and let it process all those messages? It's very straightforward: you just prompt it and that's it. Easy peasy.
But there are some problems with this approach. The first is that it is very costly. If we're talking about tens of thousands of messages, some of them are very short, which is not a big deal, but some of them are really long, intentionally long. They can include pieces of data, they can include artifacts. So the costs mount really quickly, especially with the cutting-edge LLMs. And if we use other LLMs that can be self-hosted, or simply more outdated versions of the modern models, they're not that good; their accuracy is not that high. There's also the question of how easy it is to extend such a system. Of course you can just type another line or two into the prompt, but that's not actually enough, because once you have a lot of patterns you cannot simply put all of them into a single prompt and hope the LLM does the work for you. It just doesn't work like that: the larger the context, the easier it is for the LLM to get confused. It is also hard to maintain consistent accuracy. By that I mean LLMs are not built for consistent predictions; they're good at generation, good at solving tasks, but not good at making the same prediction the same way every time, and in this sense it is quite hard to make them work consistently. Okay, what about the metrics? The velocity is lower than the rule-based system: it still takes some time for the LLM to make a prediction, but it's a matter of a couple of seconds, worst case minutes, so it's not a big deal in our case. Accuracy is quite high if you prompt-engineer it correctly, but it doesn't come out of the box. The cost at creation is pretty low if you know how to prompt-engineer, which I assume you all do. And the cost at runtime is relatively high; it depends on the model, but it can be really costly.
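The "just prompt an LLM" option can be sketched roughly like this; `call_llm` is a stand-in for whatever chat-completion API you use, and the category list and prompt wording are illustrative assumptions, not a production prompt:

```python
# Invented category set for the sketch.
CATEGORIES = ["phishing", "data-selling", "initial-access", "noise"]

def build_prompt(message: str) -> str:
    return (
        "You are a darknet threat analyst. Classify the message into exactly "
        f"one of these categories: {', '.join(CATEGORIES)}.\n"
        f"Message:\n{message}\n"
        "Answer with the category name only."
    )

def classify_with_llm(message: str, call_llm) -> str:
    """One LLM call per message; `call_llm` maps a prompt string to a reply."""
    answer = call_llm(build_prompt(message)).strip().lower()
    # Guard against free-form replies the model produces despite instructions.
    return answer if answer in CATEGORIES else "noise"
```

Every message costs one model call here, which is exactly where the runtime cost mounts, and the output-guarding line hints at the consistency problem.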
So I suppose we're missing something in between. We need some kind of a mixture: a rule-based system with some elements of AI, maybe some machine learning. My proposal here is a processing pipeline, a series of sequential steps that process our data, augment it, filter it, and make predictions, so that at the end we get a more accurate view. We start with text normalization, then we augment context, we score the messages based on what we have done in the first two steps, then we do the filtering, and finally we do threat attribution on the most important messages that survived the pipeline. Let's have a closer look. Again, we have a bunch of messages, and we start with text normalization. It's a really easy step, normally rule-based; maybe you use regex, maybe even less demanding rules. The goal here is just to standardize the texts so that they don't look so messy and so different from one another. Then we add some context. We use OCR models to process the attached pictures and extract the text from them. Sometimes pictures are only there to lure people, to attract their attention, or even to obfuscate the real intention, but in some cases the extracted text can be really helpful. Then we have a separate system for threat actor profiling. That means we keep track of threat actors: what they write, what their interests are, how credible they are, and so on. It can be really helpful, when navigating the sea of messages, to know who the author is. And finally, thread summarization. As I mentioned earlier, some threads, especially the ones where people discuss their favorite TV series, make no sense to summarize. But in some cases it is really important to understand what was written earlier in a thread: what the adversaries referred to in a particular message, what the initial message was that started the discussion, and so on.
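The normalization step might look something like this; the specific rules here are illustrative assumptions, and any real pipeline would tune its own set:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Rule-based normalization sketch: standardize messy darknet text."""
    text = unicodedata.normalize("NFKC", text)   # fold unicode look-alikes
    text = text.lower()
    text = re.sub(r"http\S+", "<URL>", text)     # replace links with a placeholder
    text = re.sub(r"(.)\1{3,}", r"\1\1", text)   # collapse runs like "!!!!!!"
    text = re.sub(r"\s+", " ", text).strip()     # squeeze whitespace
    return text
```

For example, `normalize("FREE   KITS!!!!!! http://evil.example/x")` yields `"free kits!! <URL>"`, so near-duplicate spam variants start to collapse onto the same string.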
Then, using the normalized and augmented messages, we can finally do the scoring. Here we can use a lighter model: not an LLM, not a large language model, but a language model such as BERT or one of its more advanced descendants. Yes, these are not cutting-edge models, but they understand language pretty well, they work really fast, and you can actually train them. You cannot train an LLM; I mean, you can, but that's a different story. With these language models, you just add a few new layers that do, for instance, classification for you. The model still understands the text pretty well, but as a prediction it gives you just a binary output: whether the message is spam or ham. It works pretty well, and you don't need an abundance of data to train the model on.
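The idea of scoring with a small model trained on only a handful of samples can be sketched as nearest-prototype scoring; the bag-of-words `embed` here is a toy stand-in for the dense embeddings a fine-tuned BERT-family encoder would produce:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; the real pipeline would use a
    # fine-tuned encoder producing dense vectors.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def prototype(samples):
    """Average a handful of labeled samples into a class prototype."""
    proto = Counter()
    for s in samples:
        proto.update(embed(s))
    return proto

def spam_score(message, spam_proto, ham_proto):
    """Crude score in [0, 1]: closer to the spam prototype means higher.
    Here "ham" means messages useful for intelligence, per the talk."""
    s = cosine(embed(message), spam_proto)
    h = cosine(embed(message), ham_proto)
    return s / (s + h) if (s + h) else 0.5
```

Keeping the decision threshold conservative, so only messages that score very close to the spam prototype are dropped, matches the filtering step described next.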
A handful of really high-quality samples is enough. Okay. At this stage, since it is binary classification, you end up with a probability of whether the message is spam or ham, spam or something useful, and you can do the filtering. You can also use some additional rules, but mostly you base the filtering on the severity score. This is a crucial step where you can get rid of two-thirds of the messages. Of course, you don't want to lose anything important, so you set the threshold conservatively: it's not that problematic to let some spam through at this point, but it is much more problematic to filter out something useful. Still, this works really well. And later we can finally get to threat attribution. At this stage, once we have filtered out all the
spam, we can finally concentrate on the real meat of the darknet messages, and we can apply another fine-tuned language model that classifies the threats into high-level categories. By that I mean not a specific attack vector or a particular attack, but higher-level categories such as phishing, initial access broker messages, data selling, and the like. What does this give us? Additional context: once we know the broader category of a threat, we can pull up all the related materials we know about this threat category, all the different attack vectors, everything we can extract from the MITRE ATT&CK matrix or a fraud matrix. And in some cases we don't even need any further threat attribution. For instance, if the message is about selling data, we already know what kind of attack it is; we're only interested in whose data the adversary is selling. In this case we can use named entity recognition models. Again, this is another kind of model. When we talked before about two classifiers,
those classified whole messages; here we're talking about classifying the individual words inside a message. Such models are called NER models, or named entity recognition models, and they are really good at extracting brands and countries based on context. Again, it is very important to distinguish between Amazon the river and Amazon the company. So if we don't need further attribution, we can just use this kind of model: it is very light and it works really fast. And if we do need further attribution, we can finally use an LLM. At this point we are dealing with only a fraction of the incoming messages, so the cost no longer matters much. And not only can we feed just the most important messages to the LLM, we can also feed it all the context we have gathered, plus everything we know about the threat category. The LLM then doesn't need to choose between a whole bunch of different options; it only needs to concentrate on a single domain of threats, extract the entities, so who is going to be attacked, and attribute the message to an exact attack vector, so what is actually going on.
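A real system would use a fine-tuned NER model here; this gazetteer-with-context-cues sketch (the brands and cue words are invented) only illustrates why context decides whether "booking" is a verb or a target brand:

```python
# Invented brand gazetteer with context cue words. A production system
# would use a trained NER model instead of hand-written cues.
BRAND_CUES = {
    "booking": {"customers", "accounts", "logs", "phish", "victims"},
    "amazon": {"customers", "accounts", "giftcards", "refund"},
}

def extract_brands(message: str) -> list:
    """Report a brand only when a suspicious context cue appears nearby."""
    tokens = set(message.lower().split())
    return [brand for brand, cues in BRAND_CUES.items()
            if brand in tokens and cues & tokens]
```

So "selling fresh booking accounts cheap" flags the brand, while "I am booking a flight tomorrow" does not, which is the disambiguation the talk keeps coming back to.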
Looks like the hackers don't want us to talk about AI.
Let's continue. So, where were we? I was going to tell you that finally, okay, yep, finally we have our output. And this output is far more than just a cleaned-up message. We have the severity score of the message, the affected brands and countries, the threat category, and, if we're lucky, even the attack vector. We know the threat actor's profile, and we have also processed all the related artifacts. With this final output we can feed a SOAR system, or we can feed it to our analysts, and this is already enough to produce actionable insights. So let's get back to our solution overview. The processing pipeline consists of multiple sequential processing steps, and each step can be run in parallel and scaled really quickly, so the system works fast and it scales. It is now consistent, in the sense that we have several steps where we can control the predictions we make with the language models. And it costs up to three times less than the pure-LLM system. Yes, it's not very fast to extend, but we'll talk about that in a sec.
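Structurally, the pipeline is just a sequence of independent steps (independent containers in production); this sketch with two stub steps shows how a filtering step can drop a message partway through:

```python
# The pipeline as a sequence of steps. Each step takes a message dict and
# returns it (possibly enriched) or None to drop it.
def run_pipeline(message, steps):
    for step in steps:
        message = step(message)
        if message is None:   # a filtering step dropped the message
            return None
    return message

# Two illustrative stub steps; the scoring logic is invented for the sketch.
def score_step(msg):
    msg["severity"] = 0.9 if "phishing" in msg["text"] else 0.1
    return msg

def filter_step(msg):
    return msg if msg["severity"] >= 0.5 else None
```

Because each step is a plain function over a message, steps can be scaled, swapped, or parallelized independently, which is the scalability point made above.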
So let's look at the metrics first. The velocity is still medium: we have a number of processing steps, so it cannot be instant, but it doesn't take that long either, a matter of minutes, and we can still process thousands of messages within that golden hour as they arrive. The cost at creation is relatively high, but only in the beginning: once you have set the system up, it doesn't cost you that much. And the cost at runtime is relatively low. So it seems like a feasible solution. Yeah, okay.
Yeah, let's talk about the problem I outlined here: the pipeline is not really fast to extend, in the sense that we need to collect and label data in order to train our models. It's not like writing a prompt. But actually, we cannot skip data collection and labeling even if we use a pure LLM, because in production systems you need to make sure, on a constant basis, that your system works the way it's supposed to. You need to control accuracy, and for that you need to collect and label data, so you cannot really avoid this step. On this slide I'm going to walk through the overall approach to data collection and labeling for AI-powered systems. You always start with a golden dataset, no matter whether you already have a running system and just want to add a new threat type, or you're setting up the whole thing from scratch. You need to start with cherry-picked, properly labeled pieces of data that you are sure are correct. You can start with hunting rules and collect data based on them, and then you aggregate more samples using semantic search: you vectorize your target messages and use vector similarity to gather more messages, and you end up with a handful of really high-quality samples that you cherry-picked yourself. Then you can augment the collected data. In AI-powered services, a very widespread solution is synthetic data, but when you work with data as complex as darknet messages, LLMs cannot imagine how the adversaries communicate. Even though LLMs were trained on Reddit and Twitter, where people don't exactly communicate nicely, darknet messages are a whole different dimension of complexity. So what you can do instead is play with the messages you have already collected: rephrase them, translate them into different languages, just play around. As I have already mentioned, you don't need thousands of high-quality samples; a few hundred is basically enough. So this is a very feasible solution.
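The semantic-search expansion of a golden set can be sketched like this, again with a toy bag-of-words embedder standing in for a real sentence-embedding model; the threshold value is an illustrative assumption:

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a sentence-embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def expand_golden_set(seed, corpus, threshold=0.5):
    """Pull corpus messages semantically close to any cherry-picked seed."""
    seeds = [embed(s) for s in seed]
    return [m for m in corpus
            if any(cosine(embed(m), s) >= threshold for s in seeds)]
```

The candidates this returns still go through human review before entering the golden set; the search only narrows down what the human has to look at.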
Then, during the setup stage, you can use two LLMs instead of one and prompt-engineer them. Why two instead of one? Because judging is easier than predicting. One of the LLMs acts as an annotator; you can use this one later even in your real-time processing. The other works as a judge, and the only task of the LLM-as-a-judge is to decide whether a prediction is correct or not. Finally, you need to set up monitoring: some portion of the production data being collected needs to be regularly checked by the LLM-as-a-judge, and the examples on which your system fails need to be added back into the training set. And you cannot really skip the human in the loop; humans are still needed to control this whole thing, because you cannot train your model, or even prompt-engineer your LLM, once and hope that it will run smoothly ever after. You need to regularly update your prompts, regularly make sure the system is working as it's supposed to, source high-quality data samples, add them back to training, and check that your LLM-as-a-judge is also judging them correctly.
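The annotator/judge split can be sketched as follows; both LLMs are passed in as plain functions, since the exact models and prompts are deployment-specific assumptions:

```python
# `annotate` maps a message to a label (the annotator LLM); `judge` maps
# (message, label) to True/False (the judge LLM). Judging an existing
# label is an easier task than producing one, hence the split.
def monitor(samples, annotate, judge, training_set):
    """Route samples whose labels the judge rejects back into training."""
    for text in samples:
        label = annotate(text)
        if not judge(text, label):               # judge disagrees
            training_set.append((text, label))   # queued for human relabeling
    return training_set
```

In a real loop the rejected pairs would go to a human reviewer, matching the human-in-the-loop point above; the sketch just collects them.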
So our final workflow no longer consists of a magical processing box; it consists of multiple processing steps, out of which we get high-quality data insights that can then be fed into the system. And as I promised you in the beginning, we can also talk a little bit about AI agents and explore how we can help analysts with search and reporting. Because even though we have filtered, labeled, and processed our data, the amount of data is still quite large, it is quite hard to navigate, and the analysts still need a broad overview. So we have implemented an AI-powered darknet analyst: an AI agent backed by this darknet intelligence system. What happens under the hood is that it queries the LLM with the user's request, and the LLM uses its access to a vector database with hybrid search to collect the messages the user asks about. Then it can use analytical tools and search tools to refine and analyze the answer, and so on. Why do we need a vector database? Because messages are texts and we need vector search. Why do we need hybrid search? Because we have also extracted a lot of attributes that are stored along with the messages, so we can filter by attributes too. And that's it for now. Thanks a lot for listening, and thanks to our hosts. If you have any questions, please.
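The hybrid search the agent relies on can be sketched as exact attribute filters combined with vector ranking; the toy embedder and the document fields here are illustrative assumptions:

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for the vector database's embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_search(query, filters, store, top_k=5):
    """Exact-match on extracted attributes, then rank by vector similarity."""
    qv = embed(query)
    hits = [doc for doc in store
            if all(doc.get(k) == v for k, v in filters.items())]
    hits.sort(key=lambda d: cosine(qv, embed(d["text"])), reverse=True)
    return hits[:top_k]
```

The attribute filters are what the pipeline's extraction stages make possible: without the stored `category`, `brand`, and similar fields, the agent could only do plain vector search.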
I like how you presented that you also use AI as a judge during training, because I heard that in the times of Deep Blue, the chess computers were very deterministic. They were trained on many different chess games, but that was it, a finite set. Nowadays, Stockfish and the other engines massively produce new games that were never played before and then learn from those, so they basically increase the volume of the AI's experience by using AI as part of the trainer. But here's what's puzzling me. You said you can train this with a small dataset, and I'm curious about one thing. A colleague of mine told me recently that he tried to create a GIF image with a transparent background, and the models, all of them, produced a GIF that doesn't actually have a transparent background: it has that checkerboard background, the fake representation of transparency. Probably because the models were trained on massive amounts of internet data where previews of transparent GIFs are rendered with that fake background, while the metadata says the image is transparent. So the neural net, in effect, adjusted its coefficients and weights so that this kind of image is classified as transparent, even though it's not. So in your case, do you have to wipe the pre-trained memory? If the model is pre-trained on general knowledge from the internet, where people are talking nonsense and are not actual threat actors, could this pollute your model? Do you need to wipe models before you train them, or can you basically adopt any new model from any vendor and just augment it with your training? So, it's generally not a really good idea to change anything in the core layers of the language model, simply because if you start
changing coefficients deep inside, your model can suffer what is called catastrophic forgetting: it starts forgetting things, and it can even start losing the connections and relations between words in the languages it already knows. When we talk about fine-tuning a language model, it is basically about freezing the whole model as it is and building, on top of that, just a small decision head, so that the model understands the message with the knowledge it already has, and only the head is trained to make the correct predictions on your data. For that you don't need a lot of data; the model can be steered in the right direction really easily. I understand. Any questions for Igor on this interesting topic? Yeah, two I see here.
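The freeze-and-add-a-head idea from this answer can be sketched in miniature: `frozen_encode` stands in for a frozen pre-trained encoder (here a made-up one-dimensional "embedding" that never changes), and only the two head parameters are ever updated:

```python
import math

# Pretend frozen encoder: a fixed, never-updated mapping from text to a
# 1-dimensional feature (fraction of "suspicious" words; word list invented).
def frozen_encode(text):
    suspicious = {"phishing", "kit", "dump", "fullz"}
    tokens = text.lower().split()
    return sum(t in suspicious for t in tokens) / max(len(tokens), 1)

def train_head(samples, epochs=500, lr=1.0):
    """Train only the sigmoid head (w, b); the encoder stays frozen."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for text, label in samples:
            x = frozen_encode(text)               # frozen encoder output
            p = 1 / (1 + math.exp(-(w * x + b)))  # sigmoid head
            w += lr * (label - p) * x             # only head weights move
            b += lr * (label - p)
    return w, b

def predict(text, w, b):
    x = frozen_encode(text)
    return 1 / (1 + math.exp(-(w * x + b))) > 0.5
```

A toy two-parameter head trained on four samples obviously isn't the real setup, but it shows why so little data suffices: the frozen encoder already does the language understanding, and only a tiny decision layer has to be fitted.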
Hi. Your pipeline seems to depend a lot on the accuracy and speed of the classifiers, so can you tell us a little bit about them? How large are your classifiers? Do you use a pre-trained classifier that you then train further, or do you start from scratch? What is your process there? So yeah, as I mentioned earlier, we take open-source pre-trained language models. Can you specify which, please? Yeah, XLM-RoBERTa was our model for that one. One of the first language models was BERT, then there were a couple of modifications of BERT, and RoBERTa was one of them; XLM is just the multilingual modification of that model. It's not really large.
It's a matter of hundreds of millions of parameters, not billions. That model is already good enough at understanding language, and all you are left with is training it on what is called a downstream task. For us, the downstream task is binary classification: just a couple of layers that you put on top of a frozen network, so you don't touch the internal layers at all; you just add new layers that are specifically trained to do the binary classification. And for that, using modern training approaches such as SetFit, for instance, you don't need a lot of data. We used maybe a few hundred high-quality samples, and of course those samples need to be pretty varied, representative of the whole dataset, but that is still more than enough to achieve very high accuracy on binary classification, for instance. With a multi-class, multi-label classifier it's a bit trickier. You can also play with one-vs-rest classification, where you don't train a classifier to choose among a number of classes, but train it to distinguish between the target class and all the rest. So you can play with those, but it is still feasible. Thank you. Can you pass it back? Thank you. I would like to ask about efficiency. The first part: which parts of your workflow, of your pipeline, do you see as having the most room for optimization in the next steps? Where would you benefit most from optimizing? And the second thing: if you have had the chance, or the luck, to see this workload perform in the wild, what were your first insights from it? Mm-hmm. Well, regarding the first question, I would say that one of the rooms for optimization is always which models you have and how well you train them. Once you have this system, you keep finding the edge cases, the corner cases, on which it doesn't really work well. And one of the features of the design is that each step can be scaled.
Because each step is a separate Docker container that can run in parallel, efficiency of the system is either about velocity, which can be addressed with parallelization, or about accuracy, which can be addressed by pinpointing the problems the system is bad at and fine-tuning the models even further. As for the insights, I would say one of them is that no matter how hard you try to cherry-pick your data points, you still end up with a lot of cases on which your system fails, so this constant data collection, monitoring, and fine-tuning is essential for the system to keep working. The adversaries don't use a standard language; it's not like they use Python to communicate with each other. They use different wordings, they talk about different things, so you need to constantly adapt. That's one of the insights you could have inferred from the beginning, but still. Thank you. For other questions, grab Igor over coffee, which is well-deserved now. And before you go, Igor, here you go. Before you all go: for those of you who love smart homes and smart houses, love hacking, love CTFs, go hack a house, but don't get jailed. There is a small house model in the lobby, and you can hack it. There is a link there; you can read all about it. So grab some coffee, investigate, you might win something. See you back in seven minutes.