
It's my genuine pleasure to welcome our next speaker, Emily Choi-Greene, the CEO and co-founder of Clearly AI. Emily is at the forefront of automating security and privacy reviews for highly regulated industries, delivering advanced AI solutions exactly where they're needed most. Before founding Clearly AI, Emily honed her expertise as a security engineer working on AI systems like Amazon Alexa, gaining firsthand experience in building secure and trustworthy technology from the ground up. Today, she'll be sharing her knowledge in her talk, Beyond Vibe Coding: Building Reliable AI AppSec Tools, diving into the real-world challenges of evolving AI from prototype to robust, production-ready security applications. Let's give Emily a raucous BSides welcome.
>> Hey everyone. As you said, I'm Emily, and I know we've spent a lot of time today talking about attacking systems, so I'm going to talk a little bit more about the other side: how to think about and build technology to protect systems. My background is as an application security / product security engineer. I spent five years working on Amazon Alexa, where I secured a lot of the Alexa AI systems, and I've been in the AI space for about a decade now. So this talk is really around why AI is a great tool for solving application security problems, why AI can solve more problems than maybe you think, but also why it's difficult to build a production AI system that is reliable 100% of the time. I'll be talking about a lot of tools and techniques you can use to build AI systems that you can actually trust in production. So it's a little more of an AI talk than a security talk, but it's very much about applying this to your teams, and hopefully something you can take back and use and build. I promise this is not an AI hype talk. AI is a buzzword that is literally everywhere these days; if you walk the floor at most conferences, every single vendor has AI agents, AI agents everywhere. And as
security folks, we're naturally skeptical. With technology, we've seen how it fails more often, maybe, than we've seen how it succeeds. But we also need to embrace every tool at our disposal, because attackers are also embracing every tool at their disposal, right? You have to fight fire with fire sometimes, especially when this is the attitude from leadership. So I wanted to start by polling the room. Who here is using AI in some form at work today? Cool, almost everyone. Who is using AI to write code? Who is using AI to review code for security reasons? Okay, fewer than those using it to write code, which is problematic. But hopefully after
this talk, you can start thinking about ways you can use it to review code as well. So enterprises have really bought into using AI software this year. ChatGPT came out in November 2022, so it's now been almost three full years, but agentic software adoption within the enterprise really started picking up in 2024. A lot of it is that top-down pressure. A lot of it is also getting more comfortable with these systems, and getting more comfortable with the idea of, hey, if I don't innovate, then I might fall behind as an executive. And what that means is software releases are up 30% industry-wide. And vibe coding is pretty magical. There's a pretty viral tweet of someone who built a SaaS just with Cursor. He's making tons of money, it's been great. And two days later: API usage maxed out, subscriptions bypassed, his database is hacked. I'm not trying to attack this particular guy. It's awesome and incredible that software is more accessible now than it has been in the past. But it's pretty clear that with more software producers, there are also going to be many more security vulnerabilities. It's really the Jevons paradox in action. For those who don't know, the Jevons paradox is the idea that greater efficiency paradoxically leads to more demand. So more AI code leads to more AI code, which leads to more need to secure AI code. Good for all of us: the need for security is only going to increase, and the need for security experts to focus on the most critical problems is only going to increase further still. So why large language models? When applying AI to any process, we should stop and ask ourselves: is this an appropriate application, or am I just really excited to use this new hammer? Security tasks have characteristics that align really well with AI strengths. They rely on understanding diverse context types like codebases, documents, and images, potentially from many different sources, and the squishiness of LLM inputs makes them
pretty good at that. It requires taking that context and transforming it into some normalized output, and it's just a computer, so it's very good at running well-defined, consistent workflows like STRIDE. LLMs can handle scale and repetition. They don't get tired, they don't get hungry, they have infinite patience. They can operate in parallel and respond on a schedule: all the good things that software can do. So applying AI to tasks that are typically human-run can be really powerful, especially on the natural-language side, for the things we haven't been able to automate before because the information is unstructured, sitting in emails and Slack messages, some in the codebase, some in diagrams. When should you not use AI? Anything where it's missing very key information, because AI will try to make it up. AI models are trained to be helpful assistants; they are consistently reinforced with reinforcement learning to try to answer problems for you. So if you don't have clear information, if there's no clear right answer, AI is going to be very bad at solving that type of problem. And finally, at the end of the day, large language models are a probabilistic system. Different people have different takes on this, but my take is that it's generally a feature, not a bug: you want to use probabilistic systems for things that probabilistic systems are good at. You should use deterministic systems when you want 100% the same answer every time; that is the purpose of a deterministic system. These are not deterministic systems, and you have to think about that when designing a system that has an LLM component in it. So I've organized this talk into two components: building context, and then building resilience and making it actually reliable over time. So if you've decided to use AI, let's say we're going to be performing an initial design review. Maybe you have a high-level design document or a project proposal, maybe you have an architecture diagram. You actually need to build that context in order to use the information that you have. This is basically
how one uses context: you have a query, there's some concept of retrieval, like search, then you take all the stuff you've retrieved and you generate a response back to the user. The most basic version of this is called retrieval-augmented generation (RAG). It uses something called a vector store, and it chunks source materials into embedding vectors based on semantic similarity. When you search over this, your search is also encoded into an embedding, and you're basically trying to match the semantic similarity of what you're searching with what's in the vector store, and you come back with a bunch of results that are semantically similar to each other. But what does that really mean? It means it has a lot of the same keywords, a lot of the same intent, but it doesn't actually necessarily mean that it's the answer to your question. The problem with traditional RAG models is that it's kind of like getting a Google search result: it's saying, "Hey, these are a bunch of things that are generally similar to each other." And it's definitely difficult when you're searching something that's super niche, because if you have a generic embedding model, then most of your embedding space will be mapped to a very small area; if the whole corpus of information is niche, then it's all going to be mapped to that same niche corner. So there are a lot of issues with basic RAG systems.
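To make that concrete, here's a minimal sketch of basic RAG retrieval in Python. It's illustrative only: `embed()` is a hypothetical stand-in for whatever embedding model you use, and the chunking is deliberately naive. The point is the flow: chunk the source material, embed it, embed the query, rank by cosine similarity, and stuff the top matches into the prompt.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Stand-in for your embedding model/API; returns one vector per string.
    raise NotImplementedError("call your embedding model here")

def chunk(doc: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real systems split on semantic boundaries.
    return [doc[i:i + size] for i in range(0, len(doc), size)]

class VectorStore:
    def __init__(self, docs: list[str]):
        self.chunks = [c for d in docs for c in chunk(d)]
        self.vectors = embed(self.chunks)

    def search(self, query: str, k: int = 5) -> list[str]:
        q = embed([query])[0]
        # Cosine similarity between the query and every stored chunk.
        sims = (self.vectors @ q) / (
            np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q) + 1e-9
        )
        return [self.chunks[i] for i in np.argsort(-sims)[:k]]

def build_prompt(question: str, store: VectorStore) -> str:
    # The retrieved chunks are only *semantically similar* to the query --
    # they may still not contain the answer, which is the limitation above.
    context = "\n---\n".join(store.search(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```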
So we've kind of all graduated past that to what we call tool use, function calling, or even agents. Everyone has a different definition of agents; we got a definition earlier today. My most basic definition of an agent is something that can programmatically perform actions, and tool use is the most common form of those actions. You basically register a set of tools that the LLM can choose to call, and those tools then call APIs in backend systems. Once you have that information, the LLM is reprompted with that additional information and uses it to provide a tool-augmented output. So it's much better than RAG; it's really kind of RAG 2.0, because you can get a lot more context, and not just results from one specific type of data store. This is also where MCP ends up coming in, right? Those are tools that are basically created by other people that can be called, but at the end of the day, MCP is just another protocol. It's another way to call something to get more information; it's just intended to be a shared protocol, instead of you having to know the exact API to call, like Atlassian's specific Jira API.
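As a rough sketch of that tool-use loop, assuming a hypothetical `call_llm()` helper that stands in for your model provider's function-calling API, plus two made-up tools: the model picks a tool, your code executes it, and the result is fed back in until the model produces a final, tool-augmented answer.

```python
import json

# Illustrative tools, not a real Confluence/Jira integration.
def search_design_docs(query: str) -> str:
    return "...matching design doc excerpts..."

def get_jira_ticket(ticket_id: str) -> str:
    return "...ticket details..."

TOOLS = {"search_design_docs": search_design_docs, "get_jira_ticket": get_jira_ticket}

def call_llm(messages: list[dict], tools: dict) -> dict:
    """Hypothetical provider call: assume it returns either
    {"tool": name, "args": {...}} or {"answer": "..."}."""
    raise NotImplementedError

def answer_with_tools(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(5):  # cap the loop so it can't spin forever
        reply = call_llm(messages, TOOLS)
        if "answer" in reply:
            return reply["answer"]  # tool-augmented output
        result = TOOLS[reply["tool"]](**reply["args"])
        # Re-prompt the model with what the tool returned.
        messages.append({"role": "tool",
                         "content": json.dumps({"tool": reply["tool"], "result": result})})
    return "Stopped after too many tool calls."
```

An MCP server slots into the same picture: it's a standard way of exposing that tool registry so you don't have to hand-wire each backend API.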
So a lot of folks ask me if we fine-tune models at Clearly AI. A lot of people are like, "Oh, I want to build something with AI, I need to fine-tune something"; it's where a lot of people's heads first go. And it turns out you can go really, really far without fine-tuning, just by using the state-of-the-art foundation models and doing a lot of prompt engineering and context engineering. This is a graph from OpenAI: really, fine-tuning should be a last step and not a first step. You should try to do as much as you can to give the model all the context it needs before you start trying to change how the model needs to act, which is what fine-tuning does. Fine-tuning is basically reinforcing based on examples and saying, "Hey, when you got this input data, your output should be this thing instead." You're reinforcing certain patterns of behavior, versus just changing the amount of information that you're giving the model in the first place. And most foundation models are now generally competent at following instructions. The one that is generally the most competent at this point is probably GPT-4.1; GPT-5 can also do it really well with reasoning. They don't necessarily need to be told how to act by us; they generally know how to do that at this point. So you shouldn't need to dive into fine-tuning on day one; instead, think about whether you're really giving your LLM the right context.
So you've built this really cool tool. Maybe you built a cool AI review tool that pulls in information from Confluence and Google Drive, you're able to review design docs, and you have nice tool use so you can call your knowledge base. But now it's time to actually ship it to prod. The requirements for prod, as we all know, are not the same requirements as for a proof of concept. It doesn't just work, no matter how good the models are; it needs an engineered system around it, because security has correctness requirements. If you run ChatGPT a hundred times and ask it to output JSON, it still fails five times. If you want all of your developers to get the exact same advice on the way to do encryption at rest at your company, you don't want only a 90% success rate; you want everyone to get the same information. And this really matters when you think about reliability in protection systems in general. It's silly that we've suddenly lowered the bar; it doesn't make any sense. We're security people. We need multiple nines of reliability, accuracy, and consistency. If our security systems fail 5% of the time, we're all screwed. So we can't be thinking with that bar; we need to be thinking about the bar that prod deserves. And I think that requires a lot of this complexity. You're like, "Oh great, I've got my RAG and my tool use and I'm good to go," and you don't think about everything below the surface: they can hallucinate, they can take your instructions too literally, LLMs are very sycophantic, they have retrograde amnesia. They can do a lot of great things, but the best way to do this is to actually engineer a system around your model. And at the end of the day, it's
just good engineering practices that are going to win the day here, and your LLM ends up being a very small component of the overall system that you've built. So let's dig into that piece. The number one thing I think about when building a reliable AI system is preventing hallucinations. There are a bunch of ways to minimize hallucinations today, so I'm going to go through each of them. The first one is chain-of-thought reasoning. This is basically asking the model to explain its thought process as it goes. Because of the way large language models work, they're generative transformers: every subsequent word is based on the previous words that were written. So the more you ask it to explain its thinking, the more it's actually continuing to think out loud, and it ends up arriving at a more logical conclusion if you add that chain-of-thought reasoning as explainability. It's similar to asking a kid on a math test to show their work: they are more likely to arrive at the right answer if they show their work beforehand. The second big piece is source citation. I think of a lot of LLMs like students; everything we were taught as kids about why you should do this on a test applies here. Source citation is like an open-book test: you actually have access to your source material, you can go and find it, and you will be more likely to be right if you're asked to directly quote it than if you're trying to paraphrase it or remember it from memory. So forcing source citation is another great way to prevent hallucinations, because if it can't find a source, then you know that you're going down the wrong path. Another big one is prompt engineering. Because models were generally reinforced over time to be as helpful as possible, they want to be able to give you an answer. It's the same as when you're in a Socratic-method lecture class and your professor points at you and asks you a question: you try to answer, and you might be wrong, and you might be partially right, but you feel like you have to answer, because you're in front of your class and you want to be helpful and you want to give the right answer. So a big thing is letting your LLM off the hook and just telling it, hey, it's not only okay to say "I don't know," I want you to say "I don't know." If you don't know, tell me. Adding that one or two sentences to your prompt can drop your hallucination rates by like 10 to 20%. So just giving the LLM permission to say it doesn't know, instead of making something up, goes a really long way.
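As an illustration, these are the kinds of lines that end up in a system prompt to get chain of thought, forced source citation, and permission to abstain. The wording here is mine, not a magic incantation, and it would be tuned per task.

```python
# Illustrative system-prompt fragment combining the three mitigations above.
SYSTEM_PROMPT = (
    "You are reviewing a design document for security issues.\n"
    # Chain of thought: ask the model to show its work.
    "Before answering, reason step by step and explain your thinking.\n"
    # Source citation: force direct quotes from the provided context.
    "For every claim, quote the exact sentence from the provided context "
    "that supports it, and name the source document.\n"
    # Permission to abstain: let the model off the hook.
    "If the context does not contain the answer, say 'I don't know'. "
    "Do not guess or fill gaps from general knowledge.\n"
)
```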
And then the final one that I'm going to go through is LLM-as-judge, and I actually have a demo, so I'm going to try to run some code to show you. The goal of LLM-as-judge is that models perform much better when they're scoped to very narrow tasks. So when you have a second LLM judging the first one, that second LLM is scoped to a very narrow task: does the answer do this, yes or no? And it's much better at being accurate at that very scoped thing than at answering the broad initial question. So I will attempt to... you can't see that. Okay, one second.
You can just see my Slack messages. Oh, they're gone. Okay, there. Better. Cool. Okay. So what I have here is a few different pieces of code, and we use something called BAML, which is a prompting language for LLMs that basically enforces structured output. What we're doing here is the first step of this LLM-as-judge system: extracting all of the claims out of the initial answer from the LLM. So basically, if you have the LLM answering, for "what security controls protect this API," that the API uses JWT tokens with RS256 signing and enforces rate limiting, there are actually multiple claims in there. You want to extract them out into their own specific claims that you're checking for accuracy. That's part one. And then part two is when you have a bunch of context and you want to actually evaluate whether those claims are true based on that context. We'll run that particular test.
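The actual demo uses BAML functions, which aren't reproduced here, but the shape of the LLM-as-judge pipeline looks roughly like this in plain Python, again with a hypothetical `call_llm()` helper: one narrowly scoped call splits the answer into atomic claims, then one narrow yes/no call per claim checks it against the retrieved context.

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for your model call; assumed to return JSON text.
    raise NotImplementedError

def extract_claims(answer: str) -> list[str]:
    # Step 1: split the primary model's answer into individual factual claims.
    prompt = ("Split the following answer into a JSON list of atomic factual "
              f"claims:\n\n{answer}")
    return json.loads(call_llm(prompt))

def judge_claim(claim: str, context: str) -> dict:
    # Step 2: a narrowly scoped yes/no question per claim -- much easier for
    # the judge model to answer accurately than the original broad question.
    prompt = ('Is this claim supported by the context? Reply as JSON with keys '
              '"supported" (true/false) and "evidence" (a direct quote).\n\n'
              f"Claim: {claim}\n\nContext:\n{context}")
    return json.loads(call_llm(prompt))

def verify(answer: str, context: str) -> list[dict]:
    # e.g. "The API uses JWT tokens with RS256 signing and enforces rate
    # limiting" becomes two claims, each judged separately.
    return [{"claim": c, **judge_claim(c, context)} for c in extract_claims(answer)]
```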
So here you have the LLM judge verifying that each of these claims is actually proven true, with an explanation of where it got that answer. By having a second LLM system running as a judge over your primary system, you can start to break out where the hallucinations are occurring, and that can also help you pinpoint where you maybe have missing knowledge in your context, or unclear instructions in your prompting, in order to reduce hallucinations further down the road. I will now do my second code demo, because I don't trust that I'm going to be able to go back and forth between these two things. So basically, after hallucination prevention comes output normalization. That's the second big piece of making a reliable, resilient, production-grade AI system. This is really around the idea that in order for LLMs to talk nicely to engineered systems, you need to actually be able to extract variables from the LLM, right? You can't just rely on a free-text form field; you need a concept of output sanitization. This is very similar to what we know in general as security professionals: you don't just return untrusted content to a user. Similarly, LLM content should be considered semi-untrusted, so you don't want to return it all directly to the user; instead, you want to pass it through a layer of engineering first. This is again why we use something like BAML, because it allows you to take something very unstructured, say a pretty unstructured design document, and use the LLM to extract structured information from it. Here we're extracting trust boundaries specifically. You'll see that I created a bunch of different classes: trust boundaries, which have the specific boundary, what the actors are, and what sensitive data is actually passing over that boundary, and you can break that down into further enums and different types of structure. And by using something like BAML, which is an open-source library, you can verify and guarantee that the LLM will always, in the case of this enum, return one of these five options, 100% of the time. That allows you to build more engineering systems around it, where you can pass these variables back into your Python or TypeScript code. It's similar to parameterizing a query: you know that this is all you're getting back, and you're not getting all the other free-text pieces. So let's run this one as well. From that design doc, we extracted the different boundary names and the sensitive data that was handled, and you can then use this to create a threat model out of this design doc.
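The demo defines these classes in BAML; as a rough equivalent in plain Python, the same "the enum always comes back as one of these values" guarantee can be approximated by validating the model's JSON output against typed classes, here with pydantic. The field names and enum values below are illustrative, not the ones from the demo.

```python
from enum import Enum
from pydantic import BaseModel, ValidationError

class DataSensitivity(str, Enum):
    # Downstream code only ever sees one of these values.
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    PII = "pii"
    SECRETS = "secrets"

class TrustBoundary(BaseModel):
    name: str
    actors: list[str]
    sensitive_data: DataSensitivity

class ThreatModelExtract(BaseModel):
    trust_boundaries: list[TrustBoundary]

def parse_llm_output(raw_json: str) -> ThreatModelExtract:
    # Treat LLM output as semi-untrusted: validate before it reaches
    # anything downstream (threat DB, findings DB, Jira tickets, ...).
    try:
        return ThreatModelExtract.model_validate_json(raw_json)
    except ValidationError:
        # Re-prompt with the validation error, retry, or escalate to a human.
        raise
```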
Okay, now I'm going to attempt to reshare my slides. Okay, we did LLM-as-judge, great. We did output normalization, awesome. The last piece I didn't mention is that now you can actually normalize this and put it somewhere else, right? So if you have a threat DB, you have findings DBs, you want to open up a ticket in Jira, being able to take this unstructured output and formalize it into structured information reliably will allow you to engineer more reliable systems downstream of your large language models. Oh, this is how you can actually get the code that I just demoed. They have an editor online called Prompt Fiddle. It's nothing fancy, but it's something you can use as a starting point if you want to play around with this particular LLM programming language. Awesome. I should do that too. Cool. So now that I've sprinted through a bunch of AI engineering techniques that you can use for security tasks, and that you can use to make these AI applications more reliable, let's talk about some challenges. There's still the key challenge of garbage in, garbage out. LLMs are really only as good as the context you give them, the same way that humans are only as good as the context you give them. If you think of your LLM as an unreliable intern, it goes a long way in
how you treat them. If all the context in your organization is in people's heads, you can't assume that when you onboard a new employee they're just going to magically know how the system works. So this is still a major challenge, and I personally think the answer is for all of us, as organizations, to get better at documenting in some form, getting the ideas out of our heads. The cool thing is, now that we have so much AI, you can literally just talk at your computer to get it out of your head and it will transcribe it pretty well. And now suddenly you have something that you can use with large language models as an input. So you don't have to think as much about how to actually structure the information; the important thing is that you just get the information into the system somehow. The second challenge is really misinformation about AI. I've been in AI for a decade now, from when things were LSTM models, when BERT came out, when we moved through DNNs and RNNs, and there's a broad spectrum of machine learning systems; transformers are just the most recent one. I think there are definitely some bad things about AI, but in reality, the fundamentals are very similar
to those of other machine learning systems that we have been using for many years. And the more you understand the fundamentals and stop seeing it as this mysterious black box, the more you realize that a lot of the security threats AI systems have are actually security threats we've always had, like bad access control and too much trust in the people who are now using these agents on their computers. So things like that are, to me, still a big issue. AI security threats: we obviously had a whole talk about this, so I won't dive too much into it. I still think there's a big role for human in the loop here. In order to build a reliable AI system, human in the loop needs to be a piece of how you're considering it. You can't just set up automation for something immediately with absolutely no human intervention; you have to think about how you will make sure the system isn't going off the rails. Ensuring that there's a human in the loop at all times will actually help you get to the production state faster than if you initially PoC something that's entirely autonomous and just assume it'll work in prod. And then where I think we're going, on the defensive side of using this, is being able
to tie it all together: truly having apps that go end to end, from initial design through threat modeling out to secure code analysis and connecting back to pentesting, instead of having apps focused on vulnerability management and alerts, or having that separation between what's happening at build time and what's happening at runtime. I think the great thing about AI is that it can bring those two different data sources together and give you a more holistic view. All right, so my whole goal was to equip you with the tools you need to build AI-powered security automation in-house. This is a gigantic list of tools, mostly open source: lots of different types of inference providers, structured-output providers, ways you can fine-tune if you want to. I tried to mark where there are better privacy and security guarantees among some of these companies. So hopefully this is the push you need to start building and reviewing more things with AI, and not just letting your software engineers run wild. Awesome. Thank you.