BSidesSF 2026 - Security for AI Agents Using an Ensemble of Fine-tuned... (Lidan Hazout, Bar Kaduri)

Name: BSidesSF 2026 - Security for AI Agents Using an Ensemble of Fine-tuned... (Lidan Hazout, Bar Kaduri)
Uploaded: 2026-05-12
Duration: 38 min 27 s
Description: Security for AI Agents Using an Ensemble of Fine-tuned Small Language Models Lidan Hazout, Bar Kaduri As AI agents gain access to privileged tools and sensitive data sources, new risks are emerging at scale. This session introduces a runtime guardrail based on fine-tuned SLMs pre-tool invocation.

BSidesSF38:2775 viewsPublished 2026-05Watch on YouTube ↗

Mentioned in this talk

Protocols

LoRa

Standard

OWASP Top 10 for LLM Applications

Vendors

Mistral

About this talk

Security for AI Agents Using an Ensemble of Fine-tuned Small Language Models Lidan Hazout, Bar Kaduri As AI agents gain access to privileged tools and sensitive data sources, new risks are emerging at scale. This session introduces a runtime guardrail based on fine-tuned SLMs pre-tool invocation. It also features benchmarks on leading open-source models and live attack-detection demonstrations. https://bsidessf2026.sched.com/event/e6269b562767d2f00d5e4d2ddf224ae2

Show transcript [en]

For our next talk, we're going to have Barka Duri. She's going to talk about security for AI agents using using an example of fine-tuned small language models. Just a quick reminder, use Slido to send questions. Slido is accessible through BSidesSF.org/Q&A. Bar, the stage is is yours. Hi, good afternoon. How are you, BSides San Francisco? Yes, I'm super excited. This is my first time in BSides San San Francisco and uh um it's amazing. Uh I'm very honored and grateful to be here today. Um with me today uh Ali Dan was uh supposed to uh to join me, but he unfortunately couldn't make it. Uh but Lidan was uh the brain and the wind and the spirit behind a lot of the

innovation that I'm going to show you today. So, um you can contact him as well about questions and about cool SML tricks um as well. Before we get into it, I want to do a quick show of hands. Raise your hand if you used an AI agent in the past week. Could be any kind of agent, coding agent, presentation agent, personal agent. I don't believe the people that didn't raise their hand. Um now keep your hands up if you know exactly what your agent did or if it actually did all the things that you told it to do. Yeah, that's what I was thinking. I'm going to get my head my hand down too. Um and this is what what we're going to

talk about today. And just before I begin, I will say a few words about myself. Uh my name is Bar. Yes, this is my full name, three letters. Uh and I'm in cybersecurity for the past 14 years. I started as a threat intelligence analyst and um then I joined Check Point as a malware analyst for 5 years. After that, I joined Orca Security as a cloud security analyst and I uh finished as the head of research and detection. Over my time in Orca Security, we found over 25 vulnerabilities in cloud and companion products. And we also wrote detections for every kind of product that um Orca um is doing. Um a few months ago, after dipping my

toe in the AI ocean, I realized security products or everything I knew about application is not applicable anymore in the new era. I I found out that the best thing that we can do um to deal with uh uh the way that AI is unexpected and not deterministic is tackle it uh during runtime with AI. So, that's why I joined Capsule Security and this is what we're going to talk about today. So, what are we actually dealing with here? Um this diagram is taken from the OWASP Top 10 AI Security AI project. Um this is from my opinion is uh it's that one of the most important projects um most important project ever been in

in OWASP. Um and the main reason is that the pace of innovation and the uh uh the all the crazy things that we see every day and the adoption of AI um actually requires a lot of people working together, sharing their knowledge, and doing great stuff to actually map risks and help other organizations to to operate securely. And when I joined this project a few months ago, I saw so many people working together even from competitive organizations and contributing a lot um for the entire community to operate securely. Um this specific diagram was in the um the original document of the OWASP Top 10 AI Security AI. Um it was released uh this December and

it maps all the risks that as a community we define today for AI Security. It starts with inputs. Um and inputs could be anything starting from um starting from the user inputs, APIs, other agents, going through the integration or the processing layer, which are the thinking, the uh um the memory of the agent, and uh also the human in the loop side, and up to the outputs, which are again APIs, other agents, and so on and so forth. And I want to take you through two of these risks because I think they tell the entire story. The first one would be the agent goal hijack, which is ASI number one. And it's about how um the agent could be tricked into doing

things that it wasn't designated or planned to do. Um it could be uh done with prompt injection, which I think which I think many of you are familiar with. And it could be it could be anything else um that could allow the agent to follow an instruction set that it's not ours. It's not a legitimate one. It's a uh it's a malicious one. The second one that I want to talk about is the last uh ASI, ASI 10, rogue agent. And this one, as a researcher and as an attacker, is the most interesting one because it is talking about um a point where the agent is not compromised. There is not a malicious instructions that uh were sent to it.

Um its memory is okay and everything is fine. And yet, the agent decides to do something that it wasn't planned to do. A great story uh about that, a great incident, is uh Replit's agent. They reported that one of their agent actually deleted um a production DB in a whoops moment. It was like Oh, I will execute what you wanted and I will delete your database. Whoops, I did it. And that's it. That's a rogue agent. Um there there isn't an attacker in this loop. And if we look at these two uh um risks, we understand that many of those risks that we talk about in AI Security Security are related to how the agent operates

uh um um tools and um actions. This is why uh we decided to to tackle this problem from a very unique uh um point from the point where uh the agent interacts with actions or tools. >> [snorts] >> So, we actually thought about um creating some kind of uh a security gate between the brain and the hands of the agent. And we had three principles to guide us uh for that. First of all, we wanted it to be in real time. Um and this is crucial because we don't want to have these whoops moments of hey, I deleted your database. This is not what we wanted to do. Um so, runtime and it means that it also

has to be very very very fast. Uh we wanted it to be contextual because um a developer viewing um the entire uh um customer um database is weird, but a customer success engineer viewing this data is okay. So, we wanted we wanted uh to have the exact context that will allow us um to understand uh the full picture and to understand if something went wrong or not. The last one was uh we wanted to create some kind of um a an engine that gets decisions based on a particular knowledge. And this is where we decided to create an ensemble of um SLMs >> [snorts] >> that are uh specialized in specific areas and uh then they are able to

to be more accurate, more specific, and to deliver exactly what we wanted them to deliver. Here is what it looks like from the architectural standpoint. So, we have the agent trigger, which is everything that we talked about from the input side. We start with a user triggering the the agent or anything else. And then the model starts to to think. It starts to process the request. At this point, we have the the interception. We intercept the agent intent and we check it with security judges. We call the small SLM security judges and they check the intent, the reasoning process, what led the agent to get to the point where it wants to use the specific tool and

the specific call. And then the security judges um do a classification based on certainty. We will let it decide its own certainty and explain that. And then to classify it to allowed action, blocked action, or an action that I need to escalate. So, let me zoom in to the interception layer, which we said it's in the stage between the LLM reasoning and the security judges. So, I want you to think about it as a firewall for agents. Like you need a firewall in your organization for every inbound and outbound requests in the network, it's the same here. If you want to do some kind of action, call a tool, you have to pass through the security gate.

You have to pass through this firewall. So, it's mandatory. It's baked in the infrastructure. The agent has to use it every time to do the actual execution. I'm asked a lot how it happens technically. There are many There are many AI vendors today that bake these specific guardrails into their own infrastructure. Claude and Cursor call it hooks, and many other vendors are using a very similar concept so that every time that the agent pass this specific junction, this specific operation in time, this hook is being called and you can do whatever you want. You can do an API call. You can You can call an agent. You can call a small language model. Whatever you want.

The second thing, it's contextual. As I said, the context is key because there are actions that are allowed for some people but are not allowed for other people. There are actions that are relevant for one type of agent but not for another one. And the most important thing, it has to be fast. If you add a latency layer, another latency layer on top of people's work, they're going just to shut this solution down and it will not work. And that's why we decided to work with small language models and I will explain a little bit deeper the infrastructure. But before that, why did we choose to use small language models and not one big I don't know.

Claude Opus 4.6 to to check everything. So, there are three main reasons. First of all, the speed. Big models, they think a lot. Yes, they are very smart, but they are not fast enough to be put inside a user infrastructure. The second thing is that we wanted these small language models to be specialized. Just like you wouldn't wouldn't go to a general doctor for a heart surgery, you probably don't want to use a general model to identify both PII leakage and a rogue a rogue agent. You want to be very specific. Each one of these action looks differently, behaves differently, and could have different examples from the internet. Last thing is the cost. Running small language models is much more cheap

than than running one big language model. So, which judges we decided to to have? We wanted to have the biggest coverage of OS top 10 and we decided to have these three judges. First of all, the instruction violation. With that, we identify everything that tells us that the agent went off the script. So, we have ASI 10 that we talked about going under this definition. And for example, if an agent was told to summarize documents and it sends email to a third party, it would be blocked. The second one is the data security judge. We identify leakage of PII um sensitive information, credentials, everything around that. And these are very specific patterns. So, you can tell that this is a very

specific type of training to this model. And the last one is the threat detection judge. It covers everything that is related to adversarial footprints in the session, like prompt injection, jailbreak, all all the cool stuff that we that we see. All these agents run in parallel and it was very important for us to let them run in parallel to save time. And an attacker can't just fool one. We have a few agents looking at the same session. And even if the attacker, I don't know, managed to do something for one, something else will run. We will find something. Each one is trained on different data, different set of decisions, different prompts, and they run um

definitely independently. And the third piece of this infrastructure, and I think it's the coolest because it allows us reducing significantly false positives. Um As you all know, security products usually are not one-size-fits-all. Some organizations can find some products very beneficial, but other organizations don't. Um And the reason is every organization is different. Every organization behaves different. And that's why we decided to build a memory system. Every tool invocation decision, if it's approved or rejected, it goes into a repository. Over time, it became the system's understanding of the of the organization that we're running in. When a new request comes in, we do a semantic search. We find the most similar past invocations. And we try to see

if it looks similar and we saw 50 times of this similar behavior being approved in the past, this is a really good sign to approve it also in the 50 first time. Um And we also have kind of a feedback loop here. So, in cases where we need a human in the loop, the system can ask for a human in the loop and we learn from this behavior as well. Um so, that's a really cool way to reduce false positives and to actually fit ourselves into the organization. Um Now, let's talk about the small language models in in a a little bit of more focus. How do we actually train them? How do we create

them? The first stage would be the the data. Um like like all of you know, when you have a model, you want to train it on data. We use publicly available data of research, patterns, OS, MITRE, everything. And then we also synthetic data. This synthetic data is made by publicly available libraries that are available um online. And um we go to the next stage where we start labeling these examples. We use three big high-thinking um to do the labeling. And if we have um an agreement of all LLMs, um then we say uh then we label it in high certainty. If not, we lower the certainty. Um the next stage stage uh would be to

fine-tune the model. Uh we use small models as I said, uh 4 to 12 uh uh um billion parameters um tops. And then we fine-tune them with LoRA low um um low retrieval uh augmentation and calibrate uh confidence scores from all the areas. The last stage would be uh the quality validation. We do all the calculations uh that uh we always do uh when we train models, uh which is measure accuracy, precision, recall. And we also let a human expert going over this data. Um this is a very important stage. Um we believe that we need to have a sanity check of a human in the loop, and this is exactly where we doing it um to make

sure that we uh that everything worked as expected. Um now let me show you um a real-life example. Um this is the instruction violation judge uh that we wanted to check. Uh we took the top most common um small language models, and you can see that for the these scores, they're not so bad, right? Um also the the running duration is fair, and we can we can think of a improving these models instead of bringing something from our own like from scratch. Um so um it's very important to um to see that we need to train these models because having a 10% chance of blocking the wrong action in production um is pretty bad for users.

Um so how it looks after we do the uh the LoRA. It looks like that. Uh we took we took the uh Mistral, the Mistral that you've seen here, the the first one. It had the best F1 score. Uh we did the fine-tuning with LoRA, and we got almost a 100% um um in the score, in the accuracy. Um comparing to other models, this is a very very good improvement. Think about this improvement and also using uh um the RAG system that I just showed you, um I think we can get to really good uh um work in production and in runtime for users. Um so let's put it all together. How it actually looks like.

So we have the uh the call that we intercepted, that's the lowest point of this diamond. We capture the tool, um the intent. We have the agent, um the context, the context from the organization, every data uh that is possible. Then we run these SLMs um in parallel. And then we apply thresholds. We have auto deny and auto approve or human in the loop thresholds. Um we usually go with 95% um auto deny. It's 95% certainty that this is a malicious or unintended uh um call. We have auto approve for below 50%. And we have the RAG for the in-between cases. If we have a strong RAG for this organization, um we'll probably finish this entire

process pretty fast. If it's a fresh organization, it will take some time until we'll learn the entire um um behavior. And there's always option for human in the loop escalation if needed, but we try to keep it as low as possible. We don't want to uh to let people do too much uh around this. Um and that's it. We we anyway update the RAG after after the uh after the the entire loop. Um so we always have more information on how typical um allow or deny looks in the same organization. Now I always get the question, how does this scale? So this is how it scale. Let's think about it as um um a microservices

system rather than uh um a monolith. And this is exactly how it works. When you have small components that can uh uh scale or be updated, pulled out or in um independently, um then first of all, the work on each component is much faster, is much uh uh lower maintenance for each component and we also can add things uh for example, new risks, new patterns to each one of these SLMs pretty fast and pretty efficiently um all the time. Um so that's the main idea. >> [gasps] >> Um so another point of view here to what is actually innovative here. So we have uh the pre-interception in uh pre-invocation interception. Um most AI security tools today are not

doing the interception exactly this way, so this is a pretty new way to do that um to to look for the uh uh intent and to block it from there and not from um other places. Uh the second one is the specialized um SLM ensemble, which is um a new uh concept, and the self-improving RAG. So I know that each one of these uh um ideas on their own doesn't seem very very new or very innovative, but together they they make up a very good uh runtime security um application. Um so that's it. Um this is my summary. Um so let's bring it all together. Um we're in a very interesting moment in in the tech history uh in my opinion. Um

people are adopting AI agents very very uh uh widely and broadly. They're giving them tools, uh memory, autonomy. And as I showed you from OS, the attack surface is massive. Um what I presented you today is an architecture that does three things. First of all, intercepts, uh intercepts before execution, uses specialized SLM judges, and learns your patterns through RAG. Uh the result, you can actually deploy agents in production and sleep at night. Uh not because you just disregard your problems, but because you know that there's someone, something that takes care for it for you. So if you remember the show of hands that we did at the beginning, I hope that next time that we do that, all of

you can keep your hands up because yeah, there's I know there's someone who's taking care for that. Um and I hope it will help you uh ignite some ideas in this area and to contribute back to the community and make the entire uh AI adoption more secure the thing. Um so thank you. >> [applause]

>> So Bar, we have three questions. I'm going to start. Uh you say take small language model and train them. Can you give me examples? Sorry uh they like that and change. Can you give me examples of what models you took and trained, and how did you train? Yes. So this is what I showed you here. Um these are the small um the small s- the small language models that we uh checked for training. We took just the most popular. Uh we kept them in the range of 8 to 12 billion parameters. Um this is the list and how we trained them. Um this is this was this slide. Um we collect data, created some synthetic

data, labeled the data with um LLMs, uh with high-thinking big LLMs, and then um applied LoRA, um on those uh um on those uh uh SLMs, and did the quality validation with uh a human expert. Thank you, ma'am. The next question on Slido, how can you prevent a prompt injection specifically designed to bypass prompt injection judge? This is a great question. Um our our idea um it looks on the conversation from the perspective of I'm a judge needs to uh classify this behavior. So the um the small language model don't get the uh the instructions from the session like the original agent gets it. So, it's not affected from the prompt injection um in the same way.

Um also, it's a classification model. So, uh it it's not really it couldn't be actually affected from it and do bad stuff. Thank you. Next one. >> raising their hand. Excuse me. Can I just real quick? Um when you build your rag that basically has your custom rules that you're trying to create kind of a cap? Yeah. How do you do cache validation? Oh wow. >> [laughter] >> How do we do the cache invalidation? Um so, we keep this information. Um we have we have uh This is a very long explanation. >> [laughter] >> I expect a blog post. Thanks. Okay, so next one. How does stacking the SOMs cost add up, especially with this being added to the

existing token costs of the user prompts? Okay. So, um they're very very fast. As I showed you, their running time is super fast. So, um it adds some kind of uh token uh uh um cost for that. Um but, they're fast. We keep very specific areas from the context that we find that benefit uh the final conclusion. And uh this really helps us keep this the the cost pretty low. Um I hope it answers. Yeah. Next one. In in simple labeling, you mentioned top three LLMs quorum label data. How are these LLMs selected? Cost estimates for data set labeling? Okay. I don't have the cost uh the exact cost um like from the top of my head, but um we

were we were looking for uh we did small batches. We checked on small batches. We took uh the strongest and most known um models for these tasks. And we checked them. And after a small batch that we checked, we choose the top three. I think we checked nine at the beginning, and then we continued with the top three. On a small on a small batch, and then on the entire batch. Okay. And a follow-up question for that one. I guess you already answered the second part, but uh could you give examples of the base model you took? The base model uh that presentation. The base model that we took for the It's already answered. So, next one. What is the typical uh

workflow for coding agents? And is it a replacement for hooks or complementary? Oh, so coding agents uh they use hooks. Um Before Okay, they have actually a really really long list of all the uh junctions in the decision of the AI agent uh that could be intercepted by a hook. Um and the typical the typical flow is uh um for example, I'm a developer and I want to uh to read some kind of uh um a document document. Let's say I want to update our authentication method. Um I tell I tell it to go and read this file where we have our uh authentication method and upgrade it to I don't know, a new version.

So, um the first thing that it will do, it will ask you for uh read uh permissions. This is the first invocation of tool. And this would be intercepted and be allowed. And the second stage would be to update the code, to actually write on this uh file. And sometimes uh we will see another tool called, for example, sending an email or deleting a database. This is another tool that is being called, and it will be intercepted with the hook. >> [snorts] >> And also, the um um the legitimate right would be um interested intercepted by a hook. This is the usual flow. Next one was fine-tuning uh a small model and developing synthetic data, including running

infrastructure, better / cheaper than using cheap GPT-5 mini, etc. with instructions? Oh definitely. >> [laughter] >> Um because many be many of the reasons conclude to the fact that uh GPT is uh hallucinating. So, we needed to do um really small batches and to use um these models um um to avoid really weird um scenarios. We We tried that like for a brief moment, and it was very unrealistic. [laughter] Okay, so last one. It's a big one. For your HITL, are you worried primarily or only about prompt injection / rogue agent risks? Or is some of these about malicious human risks? How can you can he help there? This is a great question. Um so,

we try to also run another validation on the data for the organization to find possible risks of uh I'll call it risk risk from the inside. And um this happens on the uh post evaluation of the data. This is another layer that we add after that that is not uh in in this talk. Okay, thank you so much, Bar. She'll be at the front of the theater to answer more questions. Uh I think we have a last one there. Uh Okay.

I think it's GPU. Cuz the numbers are very large, though it's taking seconds. Mhm. I'm I'm not sure, but I can check. Yeah. Comment on the the data we used. Are you using the open-source data like Hammer bench? Qualifiers uh prompt injection data set? Yes, we used uh publicly available prompt injection data sets as well. Yeah. Also, other known um um benchmarks. Okay, I think we have the last one. We have time. Please. What is the quality evaluation data set for your fine-tuning these small language models? The evaluation Do Do you mean uh When you fine-tuning it? Yeah, yeah. That one. What is the evaluation data set you used? So, that way uh that's the first stage.

Um some of it is from publicly available information, research patterns, OWASP, and so on. And some of it was synthetic data set that we created from known uh um um data sets and collections. And uh known techniques to generate synthetic data sets. Okay, I think that's it. Thank you so much, Bar. Thank you all for attending. I have some announcements. I'm just going to go go through them. Please join us for the closing ceremony in theater 13 at 5:15. Uh the coat check will stay open until 7:00 p.m., but the rest of the city view is now closed. Uh finally, we need to be out of this theater by some minutes from now. So, please

quickly and safely make your your way to the exits. And thank you so much for being at BSides 2026.

BSidesSF 2026 - Security for AI Agents Using an Ensemble of Fine-tuned... (Lidan Hazout, Bar Kaduri)

Related talks