
Hi everyone, thank you for attending the talk. Today's session is a little different from some of the things we've been hearing today: it's going to be less about what our AI agents can do and more about what I'm scared they can do, and how I deal with my insecurity about agents. That's me on the slide, from before the agents became sentient; they didn't talk to me back then. So hi, I'm Rabimbo. I'm a Google Developer Expert and a scientist at PayPal, but today I'm going to be representing more of the Google side of things.
So let's get into it. How many of you are building something that has an LLM component in it? Awesome. Actually, when I asked this question, more people raised their hands than I expected, which makes me even more scared. Like any system, AI applications have vulnerabilities; we all know that, but have we actually comprehended what problems we can have? This slide is roughly the whole life cycle of issues I have run into while working on a couple of end-to-end scenarios using LLMs. We're going to talk through some of these, categorize them into specific areas, and then map how our AI agents can be affected by them. "Agent" is a troublesome term, because I still don't have a very good definition of what an agent is, but we're going to look at some of the things we think of as agents and how they're affected.

First, Google came out with a secure framework called SAIF, the Secure AI Framework. When it came out, a lot of us thought, "This is awesome, we just follow this and everything will be fine." It did not happen that way. What's happening, and what a lot of us are seeing, is that AI became a very fast-paced, dreamy scenario: every day there's a new model, every day there are new frameworks and new AI agents. Take OpenClaw: before you can even download OpenClaw, you realize it used to be called something else, then it became something else. By the time Cloudflare came out with a Moltbot runner, it wasn't even called Moltbot anymore. So it's very fast-paced, and the solutions explored in this talk are deliberately agnostic to this churn of new models.

So let's look at the whole AI system tool map. You have your models, and of course those models have to run somewhere, so you have your infrastructure. You have your data, which is the model's playground, and your application, which is what your users are actually exposed to. Then on top you have governance and assurance, which a lot of the time comes as an afterthought even though it sits right at the top, and that causes a lot of grief and problems down the line.
brings us a lot of grief and problem down the line. So let's compose like decompose this a little bit more. So when we look at application component you probably have some of these so you have a application you have a model plug-in in your model component you actually have your model be that be open or in-house model or you are using somebody's API and you of course have a way of interacting with it which is the input and output handling you are going to serve that model from somewhere in some capacity that's going to be where your model is get getting served if you are doing in-house model uh in-house AI Okay. And uh if you have a team which
proposed to you maybe six or seven months back that okay we can do this we have this kind of data we will come up with our own model we will fine-tune this and this is this is this and they did a very shiny demo you were convinced you gave them go ahead and after that once that goes to production level you are now b you now have a lot of problems. So you also have this infrastructure component where you are dealing with data storage. This is how the planes look like. So you have the control plane, you have the AI risk and risk. Why they are different? We're going to go to that. So we are
We're going to talk through each of these parts briefly; I'm not going to go too much into the application part today. So first, data. If your AI models are being fine-tuned, or maybe you're big enough to be retraining them, you will have some way of handling that data. Why does it matter? Where is your data coming from? If I may ask, have any of you, maybe just for fun, fine-tuned a model or run a Colab and done some fine-tuning? Awesome. Most of the time we just use an example dataset, or build something on top of one. And if you have a big team, they're probably using either your in-house data or some data from the internet to fine-tune the model or do something like that. We do not think about data poisoning or model poisoning. Why does that matter? There are two parts to it.

Imagine you're building a specific AI model that's supposed to look at all of your log files and say whether there's something weird going on in a log or not. How do you fine-tune it? You might have your own huge log dataset, or you might be tempted to take something somebody published on Hugging Face and build on top of it. How are you sure that that specific model wasn't fine-tuned on data that had been poisoned? How do you know about data provenance? And how do you know the model itself isn't poisoned?

Many products also use some kind of user reporting flow that can be abused, for example Gmail's manual reporting being fed false flags. If you have a fleet of bots, you create, say, a hundred accounts on some mail service, let's call it Zmail, you have a very specific way of sending a lot of specific emails, and you have another bot constantly marking them as spam, you can train the classifier to learn a very specific pattern it should not learn. That's a neat example that's probably not going to happen. But in your in-house modeling you should have a data sanitization step specifically for the training data, because any data validation and anomaly detection you do will shape how you end up training. So across this whole cycle you have user data management, training data management, and a data sanitization step.
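To make that concrete, here is a minimal sketch of what a training-data sanitization step could look like; it is not the pipeline described in the talk, and the field names, patterns, and thresholds are purely illustrative. It quarantines records containing secret-like content and records where identical text arrives with conflicting labels, which is the trace a coordinated false-reporting campaign tends to leave behind.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical secret-like patterns; extend with whatever your org must never train on.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                                 # AWS-style access key id
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),           # key material
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),   # email addresses
]

def sanitize_training_data(records):
    """Split records into (clean, quarantined).

    records: list of dicts like {"text": str, "label": str, "source": str}.
    Check 1: secret-like content. Check 2: identical texts with conflicting labels.
    """
    labels_by_digest = defaultdict(set)
    for r in records:
        digest = hashlib.sha256(r["text"].encode()).hexdigest()
        labels_by_digest[digest].add(r["label"])

    clean, quarantined = [], []
    for r in records:
        digest = hashlib.sha256(r["text"].encode()).hexdigest()
        if any(p.search(r["text"]) for p in SECRET_PATTERNS):
            quarantined.append((r, "secret-like content"))
        elif len(labels_by_digest[digest]) > 1:
            quarantined.append((r, "conflicting labels for identical text"))
        else:
            clean.append(r)
    return clean, quarantined
```

In practice the anomaly checks would be richer (near-duplicates, per-source rate spikes, label-distribution drift), but the structure stays the same: everything suspicious is quarantined and reviewed before it can reach training.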
Now, how do you prevent unauthorized access? Strict access control. That's been standard architecture for a long time. But in scenarios where you've fine-tuned a very specific model and it's open to your developers, they can query it, maybe to pull documentation out. How do you ensure the model is not going to answer an intern about something they're not supposed to know? So the access control we're talking about applies not only on the training side but also to things like retrieval-augmented generation: how do you ensure that things coming from your dataset are not exposed to agents or humans who should not have access to them?
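As an illustration only, here is a minimal sketch of scoping retrieval to the caller's entitlements before anything reaches the model; the chunk schema, role names, and the naive keyword scoring are all made up, and a real system would sit in front of a vector store instead.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    allowed_roles: frozenset  # who may ever see this chunk (hypothetical role names)

def retrieve_for_user(query: str, user_roles: set, corpus: list, top_k: int = 5):
    """Naive keyword retrieval, but the ACL filter runs before anything
    reaches the LLM context window: the model never sees what the caller may not."""
    permitted = [c for c in corpus if c.allowed_roles & user_roles]
    scored = sorted(
        permitted,
        key=lambda c: sum(w in c.text.lower() for w in query.lower().split()),
        reverse=True,
    )
    return scored[:top_k]

corpus = [
    Chunk("Q3 revenue forecast: ...", frozenset({"finance"})),
    Chunk("Public API usage guide: ...", frozenset({"finance", "eng", "intern"})),
]
# An intern's query is only ever answered from chunks interns may see.
print([c.text for c in retrieve_for_user("API guide", {"intern"}, corpus)])
```

The design point is simply that the filter is enforced outside the model, on the retrieval path, so no prompt-level instruction can undo it.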
That brings us to infrastructure, because you need certain things in place to ensure your training, fine-tuning, and serving are done properly. This is what it looks like for most in-house organizations, at least the smaller ones like mine: you have model and data storage, serving, training and evaluation, and frameworks; one thing missing here is benchmarks.

Now let's start with model source tampering. This is already old news: Hugging Face ran a study to see how many of the models it was hosting had been intentionally tampered with, and they found quite a lot. One example you can recreate yourself is model poisoning through Python's pickle. Pickle is Python's default serialization module; it converts objects to a byte stream and back. How can it be exploited? Through the __reduce__ method. How? Let's see. We're not going to dwell on this slide, because we're actually going to see a demo, if it runs.

Here we have two VMs running side by side; on one we just look at the code, and the other is the victim. The threat model is that you have already poisoned the pickle file and uploaded it to Hugging Face, or distributed it through some other channel, and the user has no idea it has been poisoned. So what they're going to do is simply load the model and interact with the chatbot the way they're supposed to, and on the back end we should be able to get access to their machine.
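This is not the demo's code, but a minimal, harmless reconstruction of the trick it relies on: pickle lets an object dictate what gets called at load time via __reduce__, so merely loading a "model" file can execute an attacker's command. The payload here just echoes a message; a real attack would open a reverse shell instead.

```python
import os
import pickle

class NotReallyAModel:
    def __reduce__(self):
        # At pickle.load() time this tuple is executed as os.system(cmd).
        # A malicious artifact would spawn a reverse shell here instead.
        return (os.system, ("echo code ran on model load",))

# "Publish" the poisoned artifact.
with open("model.pkl", "wb") as f:
    pickle.dump(NotReallyAModel(), f)

# Anywhere this file is loaded as a model, the command runs before anything else.
with open("model.pkl", "rb") as f:
    pickle.load(f)   # prints the echo: arbitrary code execution on load
```

This is why safetensors-style formats and scanning of serialized artifacts matter: the vulnerability is in the loading step itself, not in the model weights.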
Did I go to the second? Oh yeah.
Okay. So this is actually available on GitHub and Hugging Face, so you can play with it; just do not run it on your own computer, because it will open up a port and do as it pleases. When I tried this, six or seven months back, Windows did not mark it as a malicious file. So let's go forward and look at how it should look once you access it.
So this is the ONNX loader we can distribute: about 30 MB with the whole model loaded, and then they run it. Oh, this part we already saw. I do not like PowerPoint, by the way. The funny thing, as I was explaining this morning, is that this is my office laptop, some policy change happened, and I suddenly can't access my personal Google Slides anymore, so I had to download everything and depend on PowerPoint. Let's go forward. Wait a minute, I want to show... okay, this is the slide I wanted to show you.
So here they're accessing the application, a fairly simple Gradio application, on the right. You can see it immediately opened up a reverse shell, and the attacker can see whatever files are available on the file system. This is a very straightforward attack through Python's pickle, and it has been documented well enough. If you want to try it, you can scan this and run it on your own machine in a VM.

Going back to the talk: architectural backdoors in neural networks are a thing. You can train a model very specifically so that it works perfectly fine for every other response, but if certain keywords appear, it takes a very different path. Why is this important? We have seen models pulled directly from Hugging Face for certain classification and other tasks, where those models were loaded by an agent, and neither the agent nor the developer was aware of how the model behaved in one very specific scenario. It loaded a very specific command that told the agent to forget some of its guardrails and do certain things, and since these are multi-agent systems, everything after that descended into chaos.
Backdoored model code used to get remote access is what we just saw. This generally happens when you already have unauthorized training data, which you can still mitigate to some extent if you completely control your end-to-end data provenance; that becomes a problem if you're a shop that only fine-tunes existing models. On the AI-specific risk side, there are already models available that have been fine-tuned on backdoors, on CVEs, and on other problematic code. That opens up a huge threat model for us. But it's also interesting: you can use some of these internally for red teaming, to have an agent that constantly tries to break your system for a very specific task. Even though that idea is enticing, you still need complete end-to-end secure ML tooling for your team to build these systems, which again brings us to model and data storage, serving, and the whole control plane. How do we do that? Implement verifiable provenance using cryptography.
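Cryptographic provenance can get as elaborate as signed attestations over the whole build chain; the simplest version of the idea, sketched below under made-up file and manifest names, is just refusing to load any artifact whose digest is not pinned in a manifest you control (and whose signature you would verify separately).

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large model artifacts don't need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def verify_artifact(path: Path, manifest_path: Path) -> None:
    """Refuse to load or serve a model whose digest is not pinned in the manifest.
    In production the manifest itself should be signed and the signature
    verified before this check even runs."""
    manifest = json.loads(manifest_path.read_text())   # e.g. {"model.onnx": "<sha256>"}
    expected = manifest.get(path.name)
    actual = sha256_of(path)
    if expected is None or expected != actual:
        raise RuntimeError(f"provenance check failed for {path.name}: {actual}")

# verify_artifact(Path("model.onnx"), Path("artifact-manifest.json"))
```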
Implement very specific access controls inside your agents. How does your agent get its scope: does it get the scope from the user who is running it, or is it a system-level scope? What exactly is it supposed to work on, and what kind of data is it supposed to receive: is it expecting JSON and getting something inside that payload that doesn't look like JSON? All of this should be part of your system. We talked about model exfiltration, and that brings us to model and data storage. Then there are other attacks, like bearer token exposure and loss. If your model has been trained on a lot of your internal data, maybe code, have you ensured that code does not contain specific patterns of things your model should never expose, if you don't have the ability to do that kind of data sanitization? And does the model have very specific guardrails and classifiers, not only at the model stage but on the agentic side as well, to prevent that?
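As a small illustration of the "is it even the JSON you expected" point, here is a sketch of rejecting malformed tool payloads before an agent or LLM ever sees them; the field names and ranges are hypothetical.

```python
import json

def parse_tool_payload(raw: str) -> dict:
    """The agent declared it expects {"query": str, "max_results": int}.
    Extra keys, wrong types, or non-JSON smuggled into the string are
    rejected before the payload reaches any model or tool."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"payload is not JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"query", "max_results"}:
        raise ValueError("unexpected payload shape")
    if not isinstance(data["query"], str) or not isinstance(data["max_results"], int):
        raise ValueError("unexpected payload types")
    if data["max_results"] not in range(1, 51):
        raise ValueError("max_results out of allowed range")
    return data
```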
Remote model reconstruction is also a thing, where people try to reconstruct the weights of your proprietary model. One of the easier routes, not exactly reconstruction, is of course distillation of your model's capabilities, which took center stage for a while through one of the posts Anthropic put out. Now, throughout all of this, one of the huge things that keeps coming up for us is data access control. I'm going to go a little faster here. How do we ensure our models, and the agents accessing those models, require authentication?
The talk just before mine was excellent at giving you a lot of ideas about that. How we do it is by scoping each of the agents very specifically. For us, an agent is nothing but a very specific piece of code that is given certain responsibilities to do certain things using some LLM; that can be a model specific to that agent or something generic, but we make sure they are all siloed and that each agent's authentication is separate from that of the agents it is going to call.
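Here is one minimal way that "every agent is siloed with its own scope and credentials" can look in code; the agent names, scopes, and secret-manager references are illustrative, not the implementation described in the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentIdentity:
    name: str
    scopes: frozenset      # the only actions this agent may ever request
    credential_ref: str    # pointer into a secret manager, never the secret itself

AGENTS = {
    "log-triage":  AgentIdentity("log-triage",  frozenset({"logs:read"}), "sm://creds/log-triage"),
    "code-review": AgentIdentity("code-review", frozenset({"repo:read"}), "sm://creds/code-review"),
}

def authorize(agent_name: str, requested_scope: str) -> AgentIdentity:
    """Deny by default: an agent acts only under its own identity and scopes,
    never under the scopes of the agent (or user) that called it."""
    agent = AGENTS.get(agent_name)
    if agent is None or requested_scope not in agent.scopes:
        raise PermissionError(f"{agent_name} may not use scope {requested_scope}")
    return agent

# authorize("log-triage", "repo:read") raises PermissionError.
```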
So if you look at the whole scenario, these are all the things we have to take care of: exfiltration, deployment tampering, model poisoning, source tampering, training data tampering. And the way we deal with them is red teaming on one side, data and access control, secure ML tooling in your infrastructure, adversarial testing, and so on.

That brings us back to models, or agents. How do you safely process a user's input and a model's output? Model input handling and output handling is what it all leads back to. What is unsafe model output for us? Mostly it's a very specific pattern: specific information that should not reach the user. That can be private information, sensitive information, or harmful information of one style or another. Unsanitized output does lead to arbitrary code execution; this is very prevalent on the agentic side, where we have multiple agents and some of them deal with code execution. And then you also have adversarial training, which works on that.

How much time do I have? Okay, since I have five more minutes: some of the things we're doing include organizing red team exercises for model safety and scrutiny, and this is not as easy as it sounds. Prompt injection is a very small part of the equation, and for most of our systems it's mostly solvable by layering multiple defenses. What's more problematic are the creative attacks: for example, invisible image content used to hijack result accuracy, or logs that have been tampered with so that, when they pass through your log processing system, they specifically suppress some result. So input validation and sanitization is still a huge thing, and we're starting to have agents, or rather classification systems, that inspect the input even before it goes to any kind of LLM. What we've realized is that you should have dedicated input and output security classifiers and code sanitizers; just sandboxing, in our experiments, was not enough a lot of the time.
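The shape of that layering can be sketched very simply; the checks below are placeholders for whatever classifiers or rule sets you actually run, and the function names are made up for this example.

```python
def guarded_call(user_input: str, llm, input_checks, output_checks) -> str:
    """Run every input check before the LLM sees the text, and every output
    check before the user (or the next agent) sees the response.
    Each check is a callable returning (ok: bool, reason: str)."""
    for check in input_checks:
        ok, reason = check(user_input)
        if not ok:
            return f"Request blocked: {reason}"

    response = llm(user_input)              # your model or agent call goes here

    for check in output_checks:
        ok, reason = check(response)
        if not ok:
            return "Response withheld by output policy."
    return response

# Two rule-based checks standing in for real classifiers:
def no_invisible_chars(text):
    return (not any(ord(c) in {0x200B, 0x200C, 0x2060} for c in text), "invisible characters")

def no_key_material(text):
    return ("PRIVATE KEY" not in text, "looks like key material")
```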
Sensitive data disclosure is one of the huge issues if you're working for a healthcare or financial company, and that brings us to some of the privacy-enhancing technologies we apply, specifically on the training and tuning side. Your evaluations should capture not just the model's accuracy but, very specifically, what you want the model to output and also what you want it not to output. That has been a huge push on our side: you should have very specific evaluations that capture the things we must not let the model do.

Some of this sits on the model training side, such as differential privacy training, so the model does not learn and recall private information. This has been a very big push for us, specifically because a lot of the data we deal with becomes a bit less useful if you strip everything sensitive out of the training data just before training. But if you do train on that data, we can never guarantee it will not be exposed; we can show it empirically, but we cannot guarantee it. So differential privacy has been one of the tools we employ to give certain guarantees that, yes, this is not going to come out.
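For intuition, here is a minimal NumPy sketch of the core DP-SGD idea (per-example gradient clipping plus Gaussian noise) on a toy logistic regression; it is not the training setup from the talk, and real training would use a library with a proper privacy accountant rather than this hand-rolled loop.

```python
import numpy as np

def dp_sgd_logreg(X, y, epochs=10, lr=0.1, clip=1.0, noise_multiplier=1.0, batch=32, seed=0):
    """Per-example gradients are clipped to `clip`, then Gaussian noise scaled by
    noise_multiplier * clip is added to the batch sum: the two steps that bound
    any single record's influence on the learned weights."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch):
            b = idx[start:start + batch]
            preds = 1.0 / (1.0 + np.exp(-X[b] @ w))
            per_example = (preds - y[b])[:, None] * X[b]            # one gradient per record
            norms = np.linalg.norm(per_example, axis=1, keepdims=True)
            clipped = per_example * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
            noise = rng.normal(0.0, noise_multiplier * clip, size=w.shape)
            w -= lr * (clipped.sum(axis=0) + noise) / len(b)
        # A real pipeline would also track the (epsilon, delta) budget per epoch here.
    return w
```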
So we started with this whole stack: prompt injection, sensitive data disclosure, inferred sensitive data. That last one is very interesting, because inferred sensitive data is not data your model should be outputting at all. For example, we have seen attacks where a model is supposed to produce only code, and it does produce code, and then somebody asks it to produce very specific code which in turn produces a very specific kind of text or image the model should not have known anything about. Since we don't control the pre-trained model and only the fine-tuning was done on our side, it came out with information that was not part of its original design or its original outputs. And remember, this model just outputs code, so it should not be able to output any kind of text; nobody cared about private information leaking from it. Yet people were still able to ask it to write a certain kind of code which, when chained together, gave up information the model was not supposed to output. Input validation, adversarial training, output validation: all of these are part of the toolkit that lets us safeguard against this to some extent. We deployed all of them, and none of that stopped that agent from producing a certain kind of code output that disclosed our sensitive data. Hence, as I will keep saying until the end of the talk, this is a moving goalpost for us.
So the takeaway from this talk, or from my experience, is that AI risk, or agentic risk, specifically in these multi-agent architectures, is a combination of classical issues, the classical ways we dealt with systems even before AI came into the picture, together with very specific safeguards so that your inputs and outputs are sanitized and your whole training life cycle is sanitized too.

If you want to apply some of the things we talked about here: review your AI workflow end to end, from your modeling and science teams to the implementation team to the team providing the data. And over the next six months, maybe improve your auditing controls even further, because a lot of what we learned about what our agents were actually doing we could not have learned without very specific audit trails. We have also realized that the more multi-agent your system becomes, the more complex the data flow gets, and the more complex the audit trail becomes.
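For illustration, here is a minimal sketch of the kind of per-agent audit event that makes a multi-agent data flow reconstructable after the fact; the field names are illustrative rather than any particular schema from the talk.

```python
import json
import time
import uuid

def audit_event(trace_id, agent, action, scope, payload_digest, outcome):
    """One structured line per agent action; the shared trace_id is what lets
    you stitch a cross-agent data flow back together later."""
    event = {
        "ts": time.time(),
        "trace_id": trace_id,
        "agent": agent,
        "action": action,
        "scope": scope,
        "payload_sha256": payload_digest,   # log a digest, never the payload itself
        "outcome": outcome,
    }
    print(json.dumps(event))                # ship to your log pipeline instead of stdout
    return event

trace = str(uuid.uuid4())
audit_event(trace, "log-triage", "tool_call:search_logs", "logs:read", "ab12...", "allowed")
```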
So, top five practical recommendations, if you want to do this in your own system: if you scan that code, you'll find a very small, lightweight, complete end-to-end implementation of some of the things I talked about here, showing how you can have very specific access control on models, how you can sandbox, and how you can filter and sanitize all those outputs. With that, I'm at the end of my talk. Thank you. If you have any questions, I'll be happy to take them, and if you scan this, it will let you give feedback on the talk. Thank you.