Hunting Malicious Domains at Scale with AI-Augmented OSINT

Name: Hunting Malicious Domains at Scale with AI-Augmented OSINT
Uploaded: 2026-06-02
Duration: 37 min 28 s
Description: Zohar Buber presents a pipeline for shifting malicious domain hunting from a reactive, feed-driven approach to a proactive, AI-augmented one. Using LLM analysis of DOM content, HTML structure, and page artifacts, analysts can quickly classify suspicious domains, surface concise verdict explanations,

BSides Prague 202637:2825 viewsPublished 2026-06Watch on YouTube ↗

Speakers

Zohar Buber

Tags

CategoryTechnical

TopicAI/ML Security OSINT Threat Hunting Threat Intel

StyleTalk

About this talk

Zohar Buber presents a pipeline for shifting malicious domain hunting from a reactive, feed-driven approach to a proactive, AI-augmented one. Using LLM analysis of DOM content, HTML structure, and page artifacts, analysts can quickly classify suspicious domains, surface concise verdict explanations, and pivot from single-domain investigation to full-campaign infrastructure hunting. The talk covers architecture, real cases (fake CAPTCHA lures, Netflix phishing), code snippets, and limitations including evasion, false positives, and short-lived domains.

Show transcript [en]

Hello everyone. So, thank you for joining this presentation today. Um, I hope next time it will be a face-to-face presentation. Um, so today I'm going to speak about anting malicious domain at scale with AI. um and basically transform um the hunting from reactive approach to proactive approach um and not just using a block list or feed threat intelligence feed but understand the real intent of malicious domain of course using AI. Um so I will cover the problem that we have today when we try to add uh malicious domain or try to investigate them the architecture that I've built in order to implement such a project uh and to fix the this issue um real use cases

I've investigated using uh the AI project uh code uh overview that I will show you some snippets from my code so you can implement it in your project if you wish. Nothing special don't worry. Um limitations of course this project has limitations. Nothing is perfect as you know. Um so we will review some limitations. Um and something to expect from this presentation is that it's not uh a presentation about replacing analysts or researchers with AI. It's about reducing noise accelerating using automation and AI.

Now uh some information about myself. So my name is Zor. I'm working at Kato Networks at the control team. Um I'm 30 years old. I have uh two boys, two lovely boys, and I'm married. Um I have more than 10 years of experience. My core expertise are threat intelligence, reverse engineering, and malware analysis. and more. Um, so nice to meet you all. Now, let's deep dive into the challenges and the issues that we have today. So, first challenge that we have is scale and speed. So, new domains are created and weaponized within minutes these days, often before they ever appear in threat intelligence feed. That means that if you are using threat intelligence feeds in your system, most likely that they

are way back um before the threat actors cuz if threat actors will register a new domain, then it means that this new malicious domain or this threat will not be uh recognized by the threat intelligence feed and most likely that it will slip through by default. Another issue that we have these days is the reactive approach against the proactive approach. So um these days we are using the reactive approach the reactive data meaning that we're using threat intelligence feeds that there are deterministics uh meaning if a domain is found in uh threat intelligence feeds then it's malicious and if not then we assume it is safe. However, unfortunately, attackers move much faster than this threat intelligence feeds. So, in order

to fill the gap of this issue, um analyst and researchers have to investigate manually a lot of unknown domains. when they try uh to investigate an alert in their systems, they try to dig into sandboxes, reports, HTML data, DOM data, um JavaScript code, etc. And basically that that is not something that scale um and the result of this issue is I analyst fatigue, misreads and poor pri prioritization. Now in order to fix this issue um we need to shift our approach from asking is this domain is on a block list to a new approach which uh asking what is this domain trying to do to to try to understand using AI um what is the real intent of the

malicious domain. So um AI can assist in that instead of relying on historical data. AI analyzes the domain structure, the naming patterns, the page content and everything is done in real time. Um it can identify fishing, malware etc. And it can detect BR brand impersonation, credential arresting and suspicious scripts such as offuscated scripts or any other suspicious scripts that are used inside um the code of the domain. Now um a key advantage of the AI is that it can provide us uh explanation of what the the conclusion that he got to. So for example, if I give the AI to analyze a domain and he give me a oneline reason to why he got to the conclusion

that a domain is malicious then it will make me trust the result. It can reduce the investigation time. It can reduce the noise and even I can uh validate as analyst or as a researcher I can validate the result of the AI and basically validate whether the results are false positive or true positive and what it can provide me is that I'm acting much faster um and not consuming time. So basically it gives me the uh the the detection of suspicious domain. Basically it enables me to detect suspicious domain within minutes uh with automatic enrichment classification and automatic response. So I'm shifting the approach from reactive approach to a proactive approach. Now let's uh deep dive into the pipeline

of uh how we can implement it and how we can do it. Let's start with the first uh phase which is ingestions. We need to ingest the domains to our AI layer. So we can take the domains from the DNS logs from the URLs that we receive in our emails if we collect them. Of course, we can use threat intelligence feed that we ingest to our system. If you are also ingesting newly registered domain feeds, you can use them also um in order to ingest the AI. Now the next uh phase is the enrichment phase. So once I have all the list of the domains, I need to send them to a sandbox such as URL scan io.

So I will um get the HTML code, the the screenshots, the images, the redirects, the scripts, etc. Now I can uh relate this data with who is data um with hosting data about the hosting provider the ASN and all kind of other information that I have um in order to enrich the the data that I have about the domain. Now the next phase is the AI layer. The basically the analysis of all uh the data that I gave the AI. So what the AI does it evaluates the context and the page behavior such as HTML structure form scripts redirects images resources and of course uh if I gave him who is data or ASN or hosting provider data

then it can correlate all these data and uh combining all those signals it detects indicators like fishing pages, impersonation pages, obfuscation scripts and it can assign to the domain or to the analysis that I gave him a clear risk with a clear explanation that will explain me why the AI concluded that this is a malicious domain. Now the final uh step of this approach is of course react upon this analysis. So if I um making decision that uh the AI will analyze me a domains and then all the malicious domain will be sent to a slack alert to a through a slack alert to a slack channel. um I can review or react upon these alerts or even I can send it

directly to my seam or directly to my firewall rules or even uh review it manually if I don't trust um the analysis but basically instead of manual investigation I just need to figure out what u priority priority this alerts or this analysis can give me and act upon it. So basically it enables me faster response, better prioritization, reduce noise cuz I don't need to investigate every domain. The AI will do it to me. I just need to react upon this. And I'm not replacing the analyst or the researchers. I'm just augmenting them with much more data and which they can react to this data and take actions upon um the decision uh I am taking

using of course AI and automation. So again we are moving from reactive approach of investigation and analysis and um to proactive approach where we uh react to our automatic investigation. Now let me uh quickly walk through how this works and to end by examining the code. Let's see. So first um we have the URL scan io enrichment. So as you can see here I've implemented some script in Python that will implement a URL scan io um API. So basically I'm getting the the the domain or the data. In that case I just need I just needed the domain. So I'm sending this domain to the URL scan.io. URL scan.io enables me um to get the HTML data, the the

screenshot, the redirects, etc. Afterwards, the next step is that this data, the HTML, the screenshots, all the data that I got um moved and into the AI layer where I created very basic prompt which will uh try to analyze the data that I gave and uh to give me a clear verdict whether this is malicious, benign or clear. Um, and just if the the verdict will be malicious, then it will send it to me to a Slack alert to a special Slack channel that I've created. Of course, you can um change it easily. So, it will be sent to your SIM or your firewalls or whatever um you wish. So for example um this is uh uh select alert that I got

uh from my analysis from the AI analysis. So as you can see we can see the domain we can see the verdict which is malicious and we can see one line of explanation by the AI why it thinks that this is a malicious domain. Now importantly this is not uh something theor theor the theoretical this is something that are already implemented and uh this is the result of it as you can see um and it's very easy for implementation. So once uh we figured out how we doing it and uh what we need to do, let's deep dive to some real cases uh that I've investigating using this uh approach, this AI uh approach. So first of all um as you

all know fake capture is a known threat these days. um basically very very popular thread these days in order to infect machines with malware. So as you can see in the left side this is the prompt that I have inserted to it. You can see the malicious domain inside the prompt and then a snippet of uh the dome. However, this is just a snippet. But uh in the real case, I've inserted the old dome. So the AI will analyze the old dome. Now let's uh see how it looks on URL scan.io. So you can see that on the screenshot of URL scan.io um it presenting me a capture. Now before I analyze this capture or this domain, I

cannot know whether this is malicious or not. Of course um URL scan did uh mentioned to me that this is a malicious domain. However, I don't know um why this verdict is malicious. I don't know why this domain is considered as malicious. Um and I need to investigate it. But if I will send it to my AI prompt and using the AI in order to investigate it, I will very very fast will get um the reason that this domain is malicious. So as you can see first of all I can see the verdict malicious. Now that the AI with a oneliner not more than that can explain me what does he see in this domain that got him to the

conclusion that this is malicious domain. For example, he can mention that there is a fake Cloudflare page and some resources here that are suspicious. And combining all those um uh all those uh artifacts, he he got to the conclusion that this is a malicious domain and of course is correct and this is a true positive and this is a real fake capture. So in that case the the the AI investigation the AI um analysis was totally correct. Now let's move and break down all the malicious artifact that the AI analyze. Um so first of all um you can see on the left side um fake capture Cloudflare page we can see some HTML referencing to Cloudflare.

However, this is not a real capture. Now not everything in this artifact must be true or suspicious. However, when I conclude all the artifacts together, I need um to get to the conclusion that this is a malicious domain. So, the first uh I would say artifact could be suspicious, but I'm not sure. Now, let's move to the second artifact in the left side. So no indexing um directive meaning somebody um using HTML tags that uh will be uh enabling to avoid search engines to ignore this page. Basically um it directs the search engines to avoid indexing this page. This is something that could be suspicious, but again not something that um very very suspicious that will I I would say um

um be a red flag for me. Now in the third artifact I can see that there is some suspicious external assets. the page load images from unrelated domain and as you can see the AI uh flagged it as a major red flag um because services like Cloudflare don't put assets from random domains. This suggest this is a malicious um infrastructure. Now I I would say that uh in the force uh um artifact the the force artifact in the uh right side is I would say the most uh malicious artifact that we have. So what we have there the fake capture implementation the I frame that loads um a local file the capture HTML file. So basically in that um local file

there is the real fake capture or the real um I would say implementation of the capture and this is what got me and the AI to the conclusion that this is a fake capture and this alone however uh basically this alone artifact uh is enough um to flag this domain as malicious. However, when I conclude all the other artifacts, of course, I got to the conclusion that this is a suspicious malicious even malicious not suspicious domain. So that's how I investigated and this is how the AI investigated and he got to the conclusion that this is effect capture which is totally true. Let's move on uh to another case. So now we have a Netflix fishing domain. Let's

see the prompt. The prompt is basically the same. I give him the domain. I give him the dome. Here we have just some snippet of the dome. However, in my project, I gave him the all dome. Let's see how it looks like on URL scan.io. You can see the screenshot of um the page. Just by seeing the screenshot, I can identify this is a Netflix fishing. However, if I don't know to say it or don't know to figure it out, then of course URLs can mention this is a malicious domain, but I still don't know why. Let's see what the AI investigation will lead us to. So of course again we are seeing that the verdict that the AI

gave us is malicious. Now it explain us why. So we can see that the AI explained this is a deceptive Netflix uh domain with fake capture typical for uh fishing and uh uh filtering pages. Now let's uh of course this is true positive and again this is a real Netflix fishing domain and again the AI um analysis was true. Now let's break down again all the malicious artifacts. So first of all we have some uh in the left side some bot filtering evidence some costume mass challenge basically some uh a capture that um the the fishing uh infrastructure um implemented in order um to avoid um automatic analysis, automatic scan such as URL scan or any other uh

scanners that we have. Uh so it implemented a capture. Now that could be malicious, but uh I'm still not sure. I need to verify more artifacts and continue with my investigation. So I have uh the second artifact which is fake recapture uh claim. Basically the page uh claims that it's protected by rec by recapture but there is no actual capture on the script. It's just some HTML tags. Um so that um is done in order to build trust on security analyst or researcher or even automatic scans that uh scanned this um domain. Now let's move on. I have the hidden input field. Basically there is hidden input field again designed to trap bots and um

evade bots and evade some automatic scanning. Basically this is a basic bot filtering to protect the page from scanning. Again, this is could be something um malicious, but it could be also some benign uh activity. So, I need to move on and investigate much more uh artifacts in order to uh got to the conclusion whether this is malicious or not. Now, let's move to the right side. Um so again I have search engine blocking with the no index u meta tag which prevents search engine from indexing the page. Um again antibbot uh code. Now I think that um the most I would say malicious uh artifacts are coming now. They are the last artifacts. So first of all there is a known brand

suspicious asset. What it means? That means that the page loads local assets instead of official Netflix infrastructure. Now on a real Netflix page, it would be very suspicious cuz real Netflix page um use trusted CDNs uses is Netflix resources um and not some local unknown resources. Now this is very suspicious and there is missing branding indicator. What does it mean? It means that the first of all an empty title on a page of Netflix very very suspicious. This is something that is not regular cuz in a real uh Netflix page I would um I I would see a title of Netflix. I would see um something that will tell me that there this is a legitimate Netflix

page. And in this case we don't see it. So this makes this artifact very very suspicious. And if I can conclude all the artifact that we have seen. So none of them by uh itself is very very malicious. However, when I conclude all the artifacts that I saw, I can definitely say that this is malicious domain and I can definitely um flagged it as a malicious Netflix fishing domain. Um, and of course the AI explain me every artifact with a one line or if I don't need the deep dive analysis then I can uh go back to the one line which explains me why this is a Netflix fishing domain. All right. So um let's move on to a more

uh I would say uh advanced threat hunting um uh approach. So now um we have investigated single domains but AI can assist us um in order to analyze not just one single domain but the whole campaign. Now let me explain you why. Because attackers rarely rely on just one domain. when they um upload a campaign, they reuse the same infrastructure, meaning the same templates, the same script, the same hosting providers, the same IP ranges in order to register or um to build a an entire campaign with a lot of domains. Now if I I have identified using the AI and my automation single domain and all the artifacts that it has, I can identify basically the old

campaign and AI can assist me in that investigation very easily because it can compare all the assets and all the artifacts that he found very quickly to all other domains in the campaign. So for example, if there is the same HTML structure, um for example, identifying similar code patterns or DOM trees, um if there are the same scripts on different domains, um shared JavaScript, uh shared images, even same offuscation, same ASN or IP agent or TLS certificates, etc. All these artifacts when they are used um again and again on different domains and they we already concluded that they are malicious. I can use the AI to hunt the whole campaign and the whole infrastructure and not

just hunt um a single not just investigate a single domain. And basically this gives me the ability to uh shift my approach from reactive meaning blocking one domain or investigating one single domain and blocking only this domain to blocking the all um domains of this infrastructure or of this campaign and moving to a more proactive approach. Now um after we have concluded all uh I would say the artifacts and the investigation uh approach let's see what challenges there are in this projects. Of course nothing is perfect and there are challenges in this approach. So first of all evasion techniques. So threat actor can implement evasion techniques. In order to uh evade this kind of analysis for example gear restriction

they can block some g uh geographical um restriction in order to evade such analysis. They can block IP rangers um in order to evade automatic scanners. They can implement um a capture in order that uh our analysis our automatic analysis will not be able um to analyze their domains. For example, if they will implement a capture, then we will not be able to provide the data from URL scan IO and also they can implement a prompt injection meaning manipulating AI analysis and evading the AI using prompt injection. Of course they can implement sandbox detection and automatic scan detection and then evade these detections. So they have few techniques in order to evade our analysis. Now let's talk about

operational artless for example false positive. Of course this analysis is not perfect and there are false positive. So we might mark legitimate uh domains mostly login uh uh forms as fishing or as malicious domains. If there are unknown uh domains and there are not uh domains or infrastructure of a known vendor, then our AI uh analysis can mistakenly um flag them as malicious fishing domains or unknown domains and it can cause us false positive and therefore we need uh the human uh reviewer uh in order to review the results and still uh watch the results and verify the verdicts that the AI imple um basically gives us. Now another things um that is very common on this uh

infrastructure or um campaigns is the shortlived domains. meaning um when threat actors register or upload a new domain or a new infrastructure, it will not uh live forever. Most likely that after 24 hours, they will replace the domain, they will replace the infrastructure, they will reuse the same infrastructure to another domain. And then if we didn't analyze it um very quick quickly sorry then we will lose the data and we will not be able to determine if this domain is malicious or not and basically we will not be able to determine um that we are seeing now uh a threat actor campaign or a malicious artifact. As I said, we always need human in the

loop in order to verify the results of the AI because it's not perfect and it has false positive and we would like to verify it with humans and not replace them. So let's conclude the all presentation that we have um seen and learned today. Sorry. So first of all we need to transform our approach from pro sorry from reactive approach to proactive approach to automatic approach using AI. So we can reduce noise make it faster and be um one step ahead of the threat actors visibility. Um, we need to stop trying to understand whether a domain is blocked or not or known to a threat intelligence feed. But we need to investigate the real in the real purpose

of any domain in real time and we can do it only using AI as of course we cannot analyze every domain that we see in our network or in our uh logs. efficiency. Again with AI or automatic approach using AI we can reduce time of false of noise of false positive alerts. we can reduce the time of analysis. Um and we can we can basically reduce the time that we need to spend on this um I would say investigation and just focus on the highrisk alerts and on the malic the real malicious alerts and real malicious domain or infrastructure whatever we are investigating scale as we said we can scale this approach not just uh to investigate

single domain but to investigate a lot of domains or even if uh we want to add the all infrastructure the all campaign in a very very fast um approach and do it faster than the threat actors. So that's all from my end. Thank you uh for listening to this presentation today. I hope next time we will meet face to face. Uh but if you have any question feel free uh to drop me a question over the LinkedIn and I'll be much more than uh happier to answer you and assist you um in such projects or any question that you have about this approach or the presentation. So thank you very much. It was pleasure.

Hunting Malicious Domains at Scale with AI-Augmented OSINT

Related talks