Your AI Agent Just Got Pwned: A Security Engineer's Guide to Building Trustworthy Autonomous Systems

Name: Your AI Agent Just Got Pwned: A Security Engineer's Guide to Building Trustworthy Autonomous Systems
Uploaded: 2026-02-11
Duration: 26 min 36 s
Description: Presenter: Matt Maisel A recent AI red teaming study by the UK AI Security Institute achieved a 100% attack success rate against all tested agents, with successful exploits in as few as 10 queries. If you're a security engineer building or securing agents, this should motivate you to act. As we giv

BSides Philly26:3688 viewsPublished 2026-02Watch on YouTube ↗

Speakers

Matt Maisel

Tags

CategoryTechnical

StyleTalk

About this talk

Presenter: Matt Maisel A recent AI red teaming study by the UK AI Security Institute achieved a 100% attack success rate against all tested agents, with successful exploits in as few as 10 queries. If you're a security engineer building or securing agents, this should motivate you to act. As we give agents more autonomy, we expand the attack surface. Every autonomy level increases utility but also increases risk. Yet most teams ship agents with only basic prompt filtering. This talk delivers practical patterns for building secure agents. Attendees will learn what agentic systems are and why to use them, how each autonomy level creates new attack vectors, design patterns for agents, and guardrails that add security hooks to agent frameworks without breaking functionality. Attendees will leave with a security evaluation framework, code examples, and a pre-deployment checklist.

Show transcript [en]

Uh, thank you all. Good morning everyone. Thank you to the besides organizers and volunteers. Um, and thank you everyone here today in this room. U, my name is Matt Masel. I'm the CTO and co-founder of Cindera. We're a Philly based AI security startup. Uh, and since 2022, uh, I've been building, breaking, and securing generative AI systems. Um, and in my 15 years of of in cyber security, uh, I've really been at the intersection of of ML software, uh, and security engineering. Um, 25 is the year of of so-called agents. So, but maybe maybe some agents. Uh, you're right. We have deep research agents that digest information for us, coding agents that help us build software uh in if we're

brave enough, computer use agents that control our oss and our browsers. Um, but the reality is is that we still need more domain adaptation for different uh domain tasks and unlock reliability. Um, AI agents are systems that have the ability to perform increasingly complex and impactful goal-oriented actions across different domains with limited external control. Uh so agents are doing reading and writing. They use tools. They change the state of the world. They send emails. They query databases. They execute code. And this is the shift really from querying writing rag systems or or chat bots to mutating and writing uh is really breaking our traditional security models. So to secure an agent uh let's dissect it further. Um so the scaffold is what

is called uh really it's the the body uh that it's the code that wraps the LM and it gives it agency in the world it operates in. It's also our attack surface as security engineers. Um, and the harness, it's the control layer. This is where we're detecting and controlling those attacks. Vulnerabilities live in the scaffold and safety and security and reliability evaluations all live in the harness. Uh, so that scaffold provides that loop. It lets the the LLM, if it's not a reasoning model, um, do thinking, planning, and acting. It manages things like memory. And it's really effectively connecting the brain to the tools and then really those tools to the external world that it operates in. Um so there's

already scaffolds built in frameworks like Lion Graph or Google ADK which is what we'll be looking at today for some case studies. Um but you can also write your own. Um and then in testing uh in runtime for the evaluation har or sorry for testing at the relevation harness uh it's used for like running benchmarks or doing evalu harness is then used to enforce like guard rails do observability telemetry and apply policies. Um so we'll be looking at um specifically kind of things in the harness today. Um, so you can take any backbone LLM, right? Any one of the Frontier Labs uh reasoning models, load them into your scaffold of choice and get good results.

Um, this chart from Meter shows that the duration of tasks that an AI agent can autonomously complete at the 50% success rate. It's doubling every seven months. Um, at the same time, this other chart from OpenAI uh is showing that their performance is approaching par on industry expert tasks on economically viable tasks. So this is definitely something to be taken seriously but at the same time there's a catch right all these benchmark scores you know look great and on the surface um but there's fragility right so this is an one of the healthcare benchmarks it's just showing that you know the LLM and the agents that they're embody embodied in they might get the wrong answer they might

get the right answer for the wrong reason they might confabulate the reasoning altogether or they might completely fail when the input is just perturbed or changed slightly so this is just showing that there's this gap between the sort of increasing saturated benchmark scores that you see out there uh and real world sort of robustness um when that this is also precisely the issue we're going to look at today you know where achieving trustworthy uh AI arises and that brings us to our central question uh as we as we build these agents to be more autonomous and more capable how do we as security engineers uh make them trustworthy especially in highstakes environments where actions have irreversible consequences and

really the question becomes what does it take to trust an agent to do a task in one of these real world environments and it moves us uh from asking things beyond just you know isn't the L is the LLM and really is the agent accurate to things like is it safe is it secure is it reliable and I would say like that engineering trustworthy agentic systems is is a wicked problem it's a it's a really hard problem and uh right now we're going to look at some mitigating controls that we can that we can do and put in place so to understand this today we're going to look at a a sketch of like a workspace agent so I implemented

this and uh I'll release this is already released on GitHub today. This is a sketch of a Google ADK agent that's really a personal productivity uh assistant. It uses uh Gemini and one of the LM reasoning models and it has access to some native tools that I've built and it really has two core functions. It does email management, calendar management and uh it can create new meetings. It can send emails, delete emails and uh really to make all this work, we're we're giving it tools, right? and we're giving access to to being to those data sources and effectively it's it's a read write on my on my digital life and it allows uh it potentially you know strangers that

email you it potentially could trick that this agent into following those instructions. So if we deployed this today you know what could possibly go wrong. Well if we just like sample the last six months of uh sort of headlines in this space you know we can see that there's there's quite a lot these aren't just like theoretical attacks on these systems. there's uh really this whole wave of really what what we'll get into prompt injection and indirect prompt injection specifically against all major platforms uh like superbase uh they had a GitHub issue that ended up tricking an agent to publish a database uh to a public issue of course like Microsoft copilot there's been several like zero

click one was called echolak zeroclick vulnerabilities where simply retrieving a poison document uh ended up to ended up causing or leading to the path of of data excfiltration um now coding agents are the latest to this this dumpster fire. Um, so things like GitHub Copilot, Cursor or Windsurf, now anti-gravity, you know, these are highv value targets. They they sit on developers machines or in the cloud, you know, sort of running on behalf of developers. They have access to your source code, to your PRDS, to your specs, um, and potentially if you're not using sandboxes, uh, uh, can can use that and and get access to everything in your environment. Um, so why does this keep happening? Um, this

really brings us into prompt injection. Um so earlier this year, Grace Swan AI and the UK AI security institute they achieved a 100% attack success rate in a public uh red teaming competition against every agent that they tested. Uh it so some of these agents for some of these these models that only took 10 probes. Um since then this data set is has been assembled from this competition and it's now used by frontier labs like uh Anthropic to benchmark for pro prompt injection when they release their latest models. uh in traditional computing like SQL we have separate you know we separate code right the instruction from from the data the input but in large language models really transformer-based

architectures that boundary does not exist and to the to the transformer the the system prompt a user query and a retrieved email they're all at the same single stream of tokens it can't be reliably distinguished uh it can't reliably distinguish tokens uh really instructions from a data instruction uh from data instructions when processing so in a chatbot uh prompt injection right can be annoying or depending on the company, you know, cause brand reputation sort of loss, right, with producing hateful, harmful speech or images. But in an agent, prompt injection could be potentially catastrophic depending on what it's hooked up to because you gave it tools. Uh it's not just producing text, right? It's actually executing code potentially

moving money uh doing transactions or in in the case of windsurf and anti-gravity being used to like xfill data and source code. Um so this this vulnerability uh exists really when as as a vulnerability versus like a a threat uh vector. Prompt injection vulnerabilities exist when these three conditions are met, right? The agent takes a dangerous action or attempts to take a dangerous action. It does so without any human confirmation or or oversight and it does so acting on attacker controlled input. Um and then also too as owners like we haven't accepted that that vulnerability or that risk. So even more so it's it's getting even you know more precarious. researchers have been analyzing and just

in recent months and there's work being presented at Nurups uh this week uh in San Diego on this this area uh analyzing adaptive prompt injection. Uh these are techniques that automatically optimize the the prompt for injection and they're finding that really the security of this the refusal and the safety training of these models is no longer about finding you know really handcrafted sort of tokens. It's could be modeled as a math problem and it actually has somewhat predictable scaling properties or scaling walls. So this this attack success uh with injection is now becoming a predictable function of the compute resources. Um as an attacker if you apply uh compute like using RL or doing using genetic algorithms to build

out search like alpha evolve um you can mo you can evolve your your prompt and utilize models as well too that have high persuasion capabilities. The probability of of injection of finding a prompt injection or uh doing a jailbreak which is like bypassing the safety training of of the foundation model. Uh this approach is 100%. And these advanced optimization techniques are effectively shifting that difficulty curve and making highly highly capable target models vulnerable to automated attacks. It's also showing that at some point here, humans are not going to be the best AI red teamers. Um there'll be models that are doing it. Uh and so all of if anyone is has was an English major

or still is an English uh English major today out there, yes, you can even get single single turn universal injection with adversarial poetry. Uh so starting from this template uh you can go and splice like whatever sort of injection you're trying to do and evolve it and attempt to bypass that safety like again get uh get a refusal and get something accepted by the target model. So just to make this really concrete let's watch uh indirect prompt injection happen uh in in a demo. Um this is [clears throat] uh again the ADK web browser. I'm acting as a user here but this could be like an event triggering uh to an agent to say hey go read this email. So the agent

reads the email and then all of a sudden you see it's starting to send uh send emails in this trace. Uh so buried in the that first email there is a there is a command that tells it to you know read the last five emails and four of them to to another to another um another recipient. So like this is what that looks like. Um it really isn't that complicated. It's just a a command that says hey this is important. Go retrieve these emails and then forward them to this recipient. So if we were again to deploy this today without any controls um this is what is called like the lethal trifecta um this is coined by uh

Simone Willis and then more recently there's been was a great blog post on meta around like the mitigation strategies for this called the agents rule of two. Um so this whole this whole vulnerability here and the impact it's all around trying to break the simultaneous presence of three critical capabilities in the AI agent. um a uh pro processing untrusted inputs, b uh accessing the private data and then c having the ability to communicate externally or perform consequential uh con sorry consequential actions. So when an agent uh possesses all three of these properties that severity of the risk is is dramatically increased. Again it can lead as we showed in in the in the uh in

the ADK example data exfiltration or unauthorized access. And again hopefully as I motivated uh prompt ejection is a really hard thing. uh even the frontier labs haven't figured it out yet. Um we have to look at other types of strategies uh to mitigate these risks and uh and in this case for lethal trifecta try to break one of the boundaries and ensure that there's only two of these uh potential uh capabilities at any given time. Um so really right in security we've at this point all assumed breach right so the new phrase that you know was kind of talked about at black hat earlier this year uh was is assume prompt injection. Um so we have to design develop and

deploy our agents then accordingly and we can engineer these uh these agents uh to be trustworthy and to be resilient uh throughout the agent development life cycle. So sort of at design time at develop time and at uh at deploy time we can build in various controls that mitigate these risks um and ultimately uh you get utility while balancing the risks that they pose. So we'll start with some uh secure architecture foundations that address uh these prompt injection risks uh from the ground up. So first uh just a quick uh view of some of the emerging uh security and threat modeling frameworks in this space. Uh the Google now coalition for safe AI secure AI framework is a great uh the

NVIDIA AI killchain is also a helpful way to to do threat modeling. And then OASP has also done a lot of really great work. Uh the top 10 LLM's uh LLM top 10 is out. The Agentic top 10 is also coming out I think next week. Um, so these are emerging standards that are defining the the risks, the threats, the vulnerabilities and the risks for us to to work on uh to look at as we assess these systems. So here's like a simplified um you know sort of threat model that I did for this this uh email this really this workspace agent. So we talked about you know instruction manipulation um leading to indirect prompt ejection. Uh this is also

highlighting tool abuse. So our agent has excess agency. So it can take chained read and write permissions to these tools to read email to write email to create calendar events and that is the direct path for sensitive data disclosure. There's also destructive actions. So if we allow the agent to have uh the ability to like delete email or to delete calendar events without human loop, you know, this is potentially where it could go rogue. And then we also have persistence. So, uh, sort of a newer a newer concept in agents, you know, they can have memory and they could potentially be poisoned and those there's injections from the email that could be loaded every time,

you know, the agent is about to go do do some task. Um, there's also, uh, more governance layer models that we can look at. So, agentic profiles building them for these systems can help us characterize the properties and then inform the type of controls we want to put in place. And really, this gets to the root of like what agents are. uh agency is the capacity to act intentionally. So it exists across the spectrum of dimensions. It's not just one or two sort of levels. I think the first two though uh these levels autonomy and efficacy which I'll define are this is our attack surface as as security engineers right so autonomy that's the degree of independence you

know does the agent ask for permission or ask for forgiveness as autonomy increases it does does so without or less human oversight or scalable oversight and then efficacy is really the power to change the world to mutate the world so this is a combination of blending uh capability what it can do with permission what it's allowed to So if your agent can execute code or transfer funds, we could say that's you know high has high efficacy. It has the potentially the ability to do high consequence sort of actions. And then the third one to call out um because we haven't achieved AGI yet is I'd say goal complexity. So this is the planning horizon uh right so chat bots you know

they have a single turn or a couple turns typically so that's like low goal complexity. uh an autonomous coding agent though might have high complexity might have thousands of steps and this complexity makes it one difficult to monitor and understand like what what it's really doing but then also difficult to predict like what what things could go wrong and like when bad things are happening. So drilling into the the sort of spectrum of autonomy I would say it's really the heart of the agent design choice and considerations right now. So you can think of it as a slider. As we turn the slider from left to right, you know, from level one to level five, we're increasing the agents

utility and power, but we're also dramatically reducing the oversight. Um, as agents increase, uh, they can take more consequential actions without oversight, and we can monitor the utility. It makes it harder to monitor the utility and risk. So that level three, level four, that's where, uh, we have sort of potential like consent fatigue setting in where we're prompting or or escalating decisions to our user, to our human to look at all the time. And this is an area that you know is under active research right now to kind of figure out that balance and what we can also identify at uh at design time. So to help automate this process uh I'm releasing today a AI governance profiler

that I built with open hands that takes like a structured rubric output and helps produce some of this analysis that you can you can use to to build your own agent profiles for any systems that you're building. Um so in security right like we have the pre of course we live by or we try to the principle of lease privilege. we're only granting assets that's required to do the job. But in agents that privilege is not not enough. Um we need this new concept of of really choosing they they have this new concept where they can choose how and when to use those privileges. Um so this is why we have to really think about also

having the principle of lease autonomy where we're not giving that agent uh the power to decide uh something that it doesn't need to. We have to really effectively try to constrain the decision loop and uh and that is again could be done sort of at architecture time. So just to go through a couple of patterns that are useful. Um, so this is like the blunt instrument. This is an architecture pattern that can really help eliminate or try to prevent these patterns are help mitigating different types of prompt injection. First, like there's the action selector. This is like the router pattern. Uh, so the model of the LLM takes an input and then it like picks a predefined tool. It's

pretty dumb though because you can only you can't let it like sort of plan. You can't let it go retrieve an email and then act on it. Uh there's plan then execute or code then execute. So this is similar to uh these are both kind of related but it's when the agent the LLM generates a fixed static plan and then either calls tools uh with certain arguments that it retrieves or uh generates and then executes code. So even if if the email uh or the email in our example saying delete everything or send this email uh if you didn't ask it to do it and it didn't it wasn't in an existing static plan uh the agent won't

do it. Uh but this is unfortunately at times like a trade-off where we're losing some of the adaptability um sort of at execution time if you know the plan can't change. Uh there's map reduce which is typically you'll see this in like deep research agents where you know they might be processing a lot of untrusted emails or web pages. Uh and this is where you have like a parallel map uh function that's looking at the the sort of data and then you have a robust sort of reduced function that you know is a deterministic function that gives you like structured outputs. And again, the whole goal here is that even if you have uh injection in one document

or one email or whatever you're looking at, it's minimized to that particular document. Um, and then it's it's a way effectively acting as like a sanitization uh sort of pipeline. Uh there's context minimization. Uh so this is where the user's prompt is removed entirely from the LLM. Sorry, from the final response of the LLM. So this is a way to prevent uh direct prompt injection from the user. Uh but again, as we see with a lot of the indirect prompt injection happening right now, it's not as common in agentic workflows. And then the one that we'll look at today in more detail is dual LLM. Uh this again is an idea that was initially proposed by Simone Willis and has now

been sort of implemented in a couple different uh architectures. But the idea here is that we have a privileged LLM handling our trusted instructions and tool calls while a separate quarantine LLM processes that untrusted data in a sandbox environment. And specifically we'll look at the capabilities for machine learning uh camel uh implementation and and proposal by Google. Um so this is a good example of a dual LM pattern and it really splits u you know things into sort of three pieces here right first we have that privileged LM so this is what's driving really the control flow and pull comes from like control flow integrity uh sort of theory and it this is what actually

has access to the tools it creates the plan but it actually never sees any of the data it never reads that email body for uh for our workspace agent it's only looking at pointers or variables representing representing that email then we also have the quarantined LLM or the I which is handling all the data flow. This is what's you know doing a lot of the information flow control for us. It it's uh processing all these documents all this data. It might have injection but it can't execute code. It can't call tools. Its only purpose is really to output sanitized data that then loaded back into the third piece which is the interpreter. So this is

what's enforces the capabilities. Um right again from control flow integrity research and theory. The idea is that we can have unforgeable keys that could be derived from like your permissions identity system. Uh and as data flows into these tools, uh the interpreter checks for different tags, different tokens, right, that represent, you know, what data can flow to what places. So things like um internal sensitive data might might have a a token that might not allow it to be sent to, you know, untrusted u low integrity sort of recipients sort of in that email uh email tool. So this is great like this does a lot for us, but it's it's expensive. It's complex. um and it

requires more inference um and really it requires a combination of design and development which we'll get into but right now it's really the the best known way for kind of building robust uh agents for for really mitigating as much as we can against prompt injection. So uh you know we I've gone through a bunch of design patterns and uh sort of threat modeling. Now it gets into development. Uh so this is where we can focus on uh eval benchmarks. So the first thing is that uh for red teaming and for adversarial testing you know I would advise not to try to do this manually as much. Um the attackers as we said they move second. So let's try to

preempt them. So there's tools like Petri which I'll show and then uh the UK AI security institutes uh inspect AI tool. There's a handful of other tools out there now too, but they can help you uh do testing, do adversarial testing on your on your agents. There's a more recent shift to to do a lot of behavioral testing uh in uh on agents. So, a really good example of this uh in the in the last in the like last week's Opus 45 model card paper. Uh there was a uh a good example of a policy on flight evaluation that was violated. So, the policy was stated like no flight modifications, but the agent uh didn't

refuse. It actually found a loophole. it upgraded the cabin class which was allowed and then modified the flight. So this is a good example of uh following the letter of the law um but not uh not not it violated the spirit in this case and then finally metrics uh finding metrics that matter um you know for the sort of security controls we put into place like what's the actual success rate you know uh benign success rate and what's the actual uh success rate while under attack and then what's the what's the benign rate after all these controls are added uh sorry uh the utility after attack once the controls are added in this is what helps us really determine

like if we secure these agents, you know, can they do their job but, you know, not be like a brick and not actually, you know, be useless. Um, so for ADK, uh, I put this example together up on the repo where you can use user simulation. So this is a nice feature that ADK has that you you give it a uh a starting prompt and then a plan for a conversation and an L LM will like pretend to be the user and like generate a conversation and then uh you can use that to score that agent against like hallucination benchmarks um or safety benchmarks that uh that ADK specifically provides. But there's other tools like I

mentioned like Petri where [clears throat] you can do this in a more general sense. So this is a tool from anthropic and the idea here is that you can uh you can use this for auditing uh any sort of scenario that you create and from seed uh scenarios you can go and have the the agent generate a trace and then have like a judge that's scored against you know whatever criteria you're looking for. In the case of anthropic they're using this for like deception and psychopensy um and other sorts of safety related uh scenarios. Um and then finally um we get you know we have done all our eval uh we get to the deployment uh phase and this is where

really again real-time monitoring prevention come into play. Um, so there there's a ton out here for this. Um, this is where there's all kinds of guardrails, right, from the kind of traditional like say a traditional like as in like last few years of input output filtering sort of on the prompt. Um, so you have tools like Nemo guardrails or Llama firewall or you know other other products that do this. Um, and they're really either detecting or trying to do rewriting you of of your prompts. Um, there's uh then the second piece which is a good example of camel where we're doing like policy and privilege enforcement. There's the third piece which is more of the behavioral

monitoring um and where we're actually using tools like Petri on production systems to monitor uh or have other like guard agents monitor agents sort of looking uh at the trajectories um and really all of this like requires a balance. So I think just in interest of time uh I won't cover like everything here but the the goal the goal my takeaway here is that really again look at your attack success rate you know set up the evaluate uh set up the your evals and look that you know as you're adding in these controls like what uh what sort of impact does it have on your utility while also preventing uh some of the the common um sort of injection techniques

or other sorts of safety or alignment um and sort of reliability things that you're testing for. Um and just again to make this concrete with ADK they have a great plug-in system where you can hook into different stages of the agent life cycle. Others have this too langraph uh strands etc. And and this is what um you know we can look at specifically for one of the uh prompt filtering and prompt input and output filtering uh implementation. So this is a recent paper uh on this this idea of soft instruction control. Uh so this is helpful when you know you can't afford the latency or the complexity of the full uh the camel dual the dual LM

architecture and you want to just have a runtime guardrail. Um so this this concept is pretty simple but it's been found to be very effective uh by these researchers who were also involved in in the camel uh paper as well too. But the concept is that you're just going to defang the input. So you can uh right as attackers when you're doing injection you're relying on imperative instructions uh like ignore previous rules or send this email in the case of that ADK uh example that I showed. Um so what this does is it sits in front of the agent and it acts as a sanitizer. Um as untrusted email content comes in or web pages or prompts even from the user

in this case. uh it it rewrites it uh in this this sort of loop where it looks for um really uh checks to see if there's any instructions with another LM checks to see if any of those instructions are imperative and it rewrites it and then it uses like dummy instructions that it puts in there as canaries and checks to see if like they're still they still exist uh in like the rewritten prompt um afterwards and uh and ultimately uh if it if it can't do it it basically halts and raises an exception. So again, this is an example where like uh right, the complexities of camel uh they're great, but it's not as as as important as a uh

sorry, it's not um it's more effective and you can like again easily do this at at monitor sort of at deployment time. Um and yeah, so that's uh that's that's the example. So uh really what can you do tomorrow uh from all this again uh I think the responsibility of security engineers that are in the in the development life cycle for these agents. uh it's up to us uh it's up to the AI engineers that are building and the researchers that are also designing and coming up with new approaches here but at least uh right now you can go and take home you know the tools hopefully here map your agents autonomy level um understand where it sits on that L1 to

L5 spectrum uh go build an agent profile for it and then look at like what sort of threat model you can do on that second uh do a red team or do a behavioral eval with some of the tools out there a few that I showed on uh during the talk and then third uh implement at least one garbrail pattern um and this could just be as simple is monitoring you know the trajectory of the trace of the agent and just so you have observability I think uh that's where a lot of teams right now are struggling and you know from there you can get into prompt rewriting or into tool input output sanitization or into

other you know more complicated setups like camel or policy related guardrails so um with that uh thank you for listening to the talk today I hope you found it useful and uh yeah happy happy happy [applause] to take questions an hour after after

Your AI Agent Just Got Pwned: A Security Engineer's Guide to Building Trustworthy Autonomous Systems

Related talks