
Okay, perfect. Sorry for the technical difficulties, but this is an AI talk a bit more than it is a security talk. As engineers, we should really be applying the best tools to our problems, and that means taking AI and applying it to solve security problems. I promise this is not an AI hype talk. AI is a buzzword that's everywhere these days, but it is a really good tool for the right problems. As security folks, we can be natural skeptics, but we also need to embrace every tool we have at our disposal, especially when this is the direction coming down from leadership.

So I wanted to start by polling the room. Who here is using AI in some form at work today? Awesome. Who's using it to write code? Most of you. Great. And who is using it to perform code reviews or security reviews?
Fewer, but still a good percentage. So, it's easier than ever to build software, and vibe coding is pretty magical. But it also puts tools in the hands of a lot of people who are not prepared, and unfortunately, here's the same person two days later: API usage maxed out, subscriptions bypassed, databases hacked. I'm not trying to attack this particular guy. It's awesome and incredible that software is more accessible now than ever. But it's pretty clear that with more software producers, we're also going to see many more security vulnerabilities. I think it's the Jevons paradox in action. For those who don't know, the Jevons paradox is the idea that making something cheaper and more efficient paradoxically leads to more total demand for it: cheaper AI code leads to more AI code, which leads to more need to secure AI code. That's where all of us come in. The need for security is only going to increase, and the need for security experts to focus on the most critical areas is going to increase further still.

So when we talk about why we'd use LLMs to solve problems like AppSec: when applying AI to any problem, we should ask whether this is an appropriate application, or whether we're just really excited to use a new hammer. Security tasks have a lot of characteristics that align really well with what AI is great at.
Security work requires understanding diverse context types, like code bases, documents, images, and rich content from potentially many different sources. The squishiness of LLM inputs also makes them pretty good at comparing initial design-time decisions against final implementation decisions. It requires taking context and transforming it into a normalized output, so you can have threat findings in a threat database, put them into your vulnerability management system, and actually see end to end what the development looks like; and it's just a computer. LLMs are very good at consistent, well-defined workflows and frameworks like STRIDE. They can handle scale and repetition. They don't get tired. They don't get hungry. They don't ask you for a raise. They have infinite patience with developers who aren't doing the right things. So generally, AI is great for AppSec-like problems.

When should you not use AI? When there are unclear goals and outcomes: if you don't know what done looks like, you can't tell a model what done looks like, so humans need to define a process before you can automate it. When there are trade-offs with no clear right answer, or when a major business decision or business judgment is needed, you should not be using AI; humans are the right ones to make those decisions. And if you have no tolerance for probabilistic outcomes, if it needs to be 100% right all of the time and not 99.5%, you should be doing it with deterministic things like code. That is why we all still have engineering components in our systems, and it's not entirely AI.
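To make "normalized output" concrete, here's a minimal sketch of what a finding record might look like; the shape and the field names are hypothetical, not a standard.

```python
# Minimal sketch of a normalized threat finding. The field names here are
# hypothetical -- the point is that every threat, regardless of source,
# lands in the same machine-readable shape so it can flow into Jira or a
# threat database downstream.
from dataclasses import dataclass, field
from enum import Enum

class StrideCategory(str, Enum):
    SPOOFING = "spoofing"
    TAMPERING = "tampering"
    REPUDIATION = "repudiation"
    INFO_DISCLOSURE = "information_disclosure"
    DENIAL_OF_SERVICE = "denial_of_service"
    ELEVATION_OF_PRIVILEGE = "elevation_of_privilege"

@dataclass
class ThreatFinding:
    title: str
    category: StrideCategory
    affected_component: str          # e.g. "admin portal", "BLE gateway"
    severity: str                    # rated against your org's own methodology
    mitigated: bool
    mitigation: str | None = None
    sources: list[str] = field(default_factory=list)  # citations back to docs/tickets
```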
So today, we're talking about applying AI to threat modeling specifically. I grabbed a specific architecture diagram that we'll threat model, and the same technique can be used for a lot of other security processes, but I thought threat modeling was a really good example to use today. This is an example service that connects to a bunch of Bluetooth devices. There's an admin portal, a server side, and a client side. We're going to threat model this.

Step zero is the naive approach: dump it into ChatGPT. It took a stab at it. We got some high-level information about what the system does, and it even offered to draw me a diagram explaining where the threat actors were and visually showing the different system components, and that's what it produced. Not great. Not a good start. ChatGPT can be great at answering very specific point questions, but it's insufficient when you need structure. The process needs to be systematized: you need methodology, and you need context about my enterprise environment and my policies. There's more to this than a simple diagram.

So the first big part is building context. That service does not exist in a vacuum, even though it looks like it when you look at an architecture diagram. There's a bunch of white space around the outside, but a ton of other context about the enterprise is missing there. To build context, there are basically a few components to the traditional concept of retrieval-augmented generation. You have a query: what are you trying to retrieve, what information do you need? You have a retrieval component that actually retrieves that information or performs a search, and a generative component that takes the newly retrieved information and generates new content, which feeds into the response.

Traditional retrieval-augmented generation uses a vector store for the retrieval component and chunks the source material. Basic RAG has a time and a place, but there are a lot of limitations. First, the results are entirely dependent on how you chunk the material: you're relying on an embeddings model to determine what the chunks broken down and put into your vector store look like. And when you perform a search over these vector stores, you get back a bunch of similar things. Like a Google search results page, it shows you snippets of text that might be similar to what you're looking for, but it's a bunch of sentences that may or may not actually be what you need. Second, if you're using a generic embedding model and you're doing something very specific and specialized like security, all of those embeddings might map very close together in the vector space, so your results might not be as relevant as you want them to be. Finally, you actually need to know what you're searching for in order to do this pre-processing and put it into a vector knowledge base. So to use this type of basic RAG, you have to decide up front what you actually want to be able to search.
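For reference, here's roughly what that basic RAG pipeline looks like; the embed() function is a placeholder for a real embeddings model, and the chunking is deliberately naive.

```python
# Minimal sketch of "basic RAG": chunk documents, embed them into a vector
# store, then retrieve the nearest chunks for a query. embed() is a stand-in
# for a real embeddings model; the quality of the whole system hinges on
# that model and on how you chunk.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: deterministic random unit vector per text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def chunk(doc: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking -- exactly the step the results depend on.
    return [doc[i:i + size] for i in range(0, len(doc), size)]

class VectorStore:
    def __init__(self) -> None:
        self.chunks: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, doc: str) -> None:
        for c in chunk(doc):
            self.chunks.append(c)
            self.vectors.append(embed(c))

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        scores = [float(q @ v) for v in self.vectors]  # cosine sim (unit vectors)
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        return [self.chunks[i] for i in top]

store = VectorStore()
store.add("All services must encrypt data at rest using AES-256...")
context = store.search("encryption at rest policy")
# `context` then gets pasted into the prompt for the generative step.
```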
So the industry as a whole has moved toward tool use and function calling. Everyone has different definitions of what an agent is, but at the most basic level, it's something that can programmatically perform actions based on an LLM decision. At a high level, tool use lets you use APIs to programmatically search internal systems and the web, call other agents, and so on. It's really RAG 2.0, because you get a lot more context: not just results from an embeddings model, but results from different underlying systems.

When we use tool use to actually analyze our threat model, we get context like this. We have that initial architecture diagram that was created. We've also pulled in related tickets from the dev team in Jira about the different things they're implementing. We have context about the organization: who this person is that's submitting the diagram, where they sit in the broader context of our org, all of our company's internal policies and procedures, and maybe the requirements specific to this type of device, which is being launched in the EU and therefore needs to comply with the Radio Equipment Directive, for example.
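Here's a rough sketch of that agent loop. call_model() and both tool functions are stand-ins for your own LLM API and internal systems; the exact shape of the tool-call protocol varies by provider.

```python
# Sketch of the tool-use loop behind an "agent": the model decides which
# tool to call, the harness executes it, and the result is fed back until
# the model produces a final answer. Everything here is a stand-in.
import json

def search_jira(query: str) -> str:
    """Hypothetical wrapper around your Jira API."""
    raise NotImplementedError

def fetch_policy(topic: str) -> str:
    """Hypothetical wrapper around your internal policy store."""
    raise NotImplementedError

def call_model(messages: list[dict], tools: dict) -> dict:
    """Stand-in for an LLM chat API that supports tool calling."""
    raise NotImplementedError

TOOLS = {"search_jira": search_jira, "fetch_policy": fetch_policy}

def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages, tools=TOOLS)  # text or a tool call
        if reply["type"] == "tool_call":
            result = TOOLS[reply["name"]](**json.loads(reply["arguments"]))
            # Feed the tool result back so the model can decide its next step.
            messages.append({"role": "tool", "name": reply["name"], "content": result})
        else:
            return reply["content"]  # final answer: the enriched threat model
    raise RuntimeError("agent did not finish within max_steps")
```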
Once we have something like this, you're able to create a threat model that is much richer in context. You have contextual diagrams. You're able to break out information by data flows. You're rating threat severities based on your own internal organization policies and risk-rating methodologies, and documenting what mitigations are present. Great next step. But now what? Cool, we have a threat model, but what's the point of a threat model? It's to find unmitigated issues, tell developers they should mitigate them, and ideally give them very clear and pointed suggestions for actually implementing the fix. So you want to take all the information out of this document and put it somewhere, probably into open findings in Jira or a threat database. So the next step is actually creating actionable, structured output, and doing this in a production environment.

The requirements for prod are not the same as the requirements for a proof of concept. It doesn't just work: no matter how good the models get, you always need an engineered system, because security has very high correctness requirements. If you run ChatGPT a hundred times and ask it to output JSON, it might fail five times. Suddenly you're not actually opening those findings or tickets, because you're just getting improperly structured output or other errors. That's why you need to build an engineered system around your AI to actually get the end results you're looking for. And just to illustrate the point: if every API call you made failed 5% of the time, that would be a disaster from an availability perspective. Yet somehow we seem to accept every LLM call failing 5% of the time. And we're security people. Our AI apps need multiple nines of reliability, accuracy, and consistency; the concepts are just a start. If our security systems fail 5% of the time, we're screwed. So when you're building AI systems for production, you have to think about all of your production requirements, not just have a nice shiny POC you can demo on ChatGPT, but actually figure out whether it works repeatably and consistently. And that requires a lot of complexity.
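A minimal sketch of the kind of harness that gets you there, assuming a Pydantic schema for the output and a stand-in complete() for the LLM call; the retry-with-error-feedback loop is the important part.

```python
# Sketch of a validate-and-retry harness that turns "usually valid JSON"
# into something you can actually file tickets from. complete() is a
# stand-in for your LLM call; ThreatFindingOut is a hypothetical schema.
from pydantic import BaseModel, ValidationError

class ThreatFindingOut(BaseModel):
    title: str
    severity: str
    component: str
    mitigated: bool

def complete(prompt: str) -> str:
    """Stand-in for your LLM chat/completions call."""
    raise NotImplementedError

def extract_finding(prompt: str, max_attempts: int = 3) -> ThreatFindingOut:
    last_error = ""
    for _ in range(max_attempts):
        raw = complete(prompt + last_error)
        try:
            return ThreatFindingOut.model_validate_json(raw)
        except ValidationError as e:
            # Feed the validation error back so the retry can self-correct.
            last_error = f"\n\nYour last output failed validation:\n{e}\nReturn only valid JSON."
    raise RuntimeError("LLM failed to produce valid structured output")
```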
Agents are hard to build. Tons of things go wrong. There's a huge gap between deploying an LLM and automating your workflows. They hallucinate. They can take your instructions too literally. They have poor taste. They have retrograde amnesia; it feels like 50 First Dates. But there's still a ton of benefit to doing it right. So how do you think about balancing these things? That's part two: building reliability. When I think about a reliable AI system, the first thing I think about is preventing hallucinations. There are a bunch of ways to minimize hallucinations today, and we'll go over several of them.

First, chain-of-thought reasoning. Similar to asking a student to show their work on a math test, the model is more likely to arrive at the proper answer by explaining its reasoning. A lot of the reason for that is that generative AI generates token by token, so as it's explaining its work, it's actually thinking through the problem to arrive at the correct solution. Chain-of-thought reasoning is super important, not just so you have insight into what your AI system is doing from an observability perspective, but because it actually helps you get more right answers.
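As a rough illustration, a chain-of-thought style prompt for one threat-modeling step might look like this; the wording is illustrative, not a canonical recipe.

```python
# Sketch of a chain-of-thought prompt for a threat-modeling step. The key
# is asking for the reasoning *before* the conclusion, since the model
# generates token by token.
COT_PROMPT = """You are reviewing the attached architecture for spoofing threats.

Before giving any findings, reason step by step:
1. List each trust boundary you can identify and why it is one.
2. For each boundary, describe how an attacker could impersonate a principal.
3. Only then, state your findings, each referencing the reasoning step it came from.
"""
```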
The next piece is prompt engineering and source citation. You have to actually engineer your prompting to give the LLM permission to say when information is not present and that it doesn't know. Most of the baseline models we're using today have been reinforced to be helpful assistants, and helpful assistants like to give you answers. But sometimes the answer is "I don't know." So you have to coax it, through prompt engineering, into flagging things that aren't said but should be said, and things that are said that are incorrect. We need prompt engineering so it can say: encryption at rest isn't present, it's not specified, that's a gap; and highlight where things might be missing.

The second piece is source citation. Whenever an LLM posits something, you need to have it link directly back to its source, so you can verify that the underlying information was not hallucinated. And by asking it to cite its sources, similar to a student with an open-book test, it is more likely to be right than if it's trying to answer off the top of its head.
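Concretely, that kind of prompt language might look something like this; the exact wording is illustrative.

```python
# Sketch of prompt language that grants the model permission to say "not
# specified" and forces citations. Wording is illustrative, not canonical.
GROUNDING_RULES = """Rules for your analysis:
- Only make claims you can support with the provided documents.
- After every claim, cite the source in brackets, e.g. [design-doc.md, section 3].
- If something is not specified (e.g. encryption at rest), say exactly that
  and flag it as a gap. "Not specified" is a valid and useful answer.
- Do not guess. "I don't know" is better than a wrong answer.
"""
```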
The final piece is LLM-as-judge. I actually have a demo for this, and it's what took a long time to set up properly, so hopefully this works. I'm going to show you how we do LLM-as-judge at Clearly, to give you some ideas of how you can use LLMs to help make your LLM processes better.
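As a preview of the idea, here's a rough sketch of LLM-as-judge in code; judge_model(), the rubric, and the threshold are all hypothetical.

```python
# Sketch of LLM-as-judge: a second model call grades the first one's output
# before it's allowed downstream. judge_model() is a stand-in; the rubric
# and acceptance threshold are hypothetical.
import json

def judge_model(prompt: str) -> str:
    """Stand-in for an LLM call used only for grading."""
    raise NotImplementedError

JUDGE_PROMPT = """You are grading a threat-model finding, not rewriting it.
Given the source documents and the finding below, return JSON:
{{"grounded": true/false, "cited": true/false, "score": 0-10, "reason": "..."}}

Source documents:
{sources}

Finding:
{finding}
"""

def accept(finding: str, sources: str, threshold: int = 7) -> bool:
    verdict = json.loads(judge_model(JUDGE_PROMPT.format(sources=sources, finding=finding)))
    # Only findings that are grounded, cited, and above threshold get filed.
    return verdict["grounded"] and verdict["cited"] and verdict["score"] >= threshold
```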