
To AI or not AI? Dos & Don'ts for applying AI to Threat Modeling

BSides Seattle · 2026 · 25:42 · 11 views · Published 2026-04 · Watch on YouTube ↗
About this talk
Emily Choi-Greene explores when and how to apply large language models to threat modeling workflows. The talk breaks down tasks where AI excels—understanding architecture and trust boundaries—versus where human judgment remains essential, with practical techniques for minimizing hallucinations and structuring AI outputs for downstream security tooling.
Original YouTube description
BSides Seattle, February 27-27, 2026, lecture. Presenter(s): Emily Choi-Greene
Transcript [en]

consumer side and the enterprise side. So I'm somewhat qualified to give this talk from ten years in the AI space, back when it was called machine learning. I started my first job in natural language understanding at Alexa, on LSTM models pre-transformers and then through the DNN rollout, and then worked at a company called Moveworks, which is enterprise AI for IT support. Those were BERT-based models, but nothing really holds a candle to when GPT-3.5 came out. We all remember that in November 2022, and that's when this hype train began. So I'm really going to be looking, from an AI model perspective, at what is great from a threat modeling perspective and where we still have a long way to go. Let's get started.

So I promise this is not an AI hype talk. I know that I am an AI person. I am an AI optimist. I'm an AI vendor. It's a buzzword that's everywhere these days. But it's also a really great tool for the right problems. As security folks, we're naturally skeptics, and I think it's not surprising that a lot of folks are excited to hear, "Ooh, maybe there are some don'ts for applying AI." But we also want to use all the tools that we have at our disposal. So, here's the structure I broke this talk down into.

We'll wait a second. No worries. >> Yeah, you can do it. >> We're ready to go. Okay. The privilege of being the first talk in track 4. Awesome.

So, I broke this talk down into three categories today. First, the tasks where AI really thrives: you have to do minimal work as a human at this point to have AI do a great job for you. I called the second one the tasks where AI survives, and this is the meat of my talk: basically, how to get the most out of AI for threat modeling, where you also have to apply a bunch of techniques on your side, things to minimize hallucinations, normalize outputs, and so on. And finally, I'm going to touch on the tasks where AI dies. This is the most philosophical part of my talk: not just covering tasks where AI doesn't work, but where I personally believe you should not use AI in this space, and where I think it's a mistake.

So, we're going to use this architecture diagram as the basis for the threat model that we're building together today. Okay. Now, we can see this really ugly architecture diagram. It is a simple Bluetooth device that can connect to a cloud service. It's got an admin portal and can connect to a variety of smaller Bluetooth devices, so a centralized gateway. This is going to be what we'll be threat modeling together.

I also wanted to emphasize what the goals of threat modeling are. I actually think threat modeling can turn into an almost religious debate with AppSec engineers about what it means to threat model. But for this talk, the terminology I'll be using covers the following activities: understanding the architecture, including the threat actors, trust boundaries, and data flows; determining attack vectors and threats; identifying any compensating controls or remediations for identified threats; and then surfacing any residual risk that may remain.

So where is AI great in 2026? Just pulling back: why should we use LLMs at all? When applying AI to any process, we need to think about whether we're just excited to use the newest, fanciest tool or whether it actually is the right tool for the job. Security tasks have a lot of characteristics that align really well with AI strengths. They require understanding diverse context types: codebases, documents, images, rich context from potentially many different sources. The squishiness of LLM inputs makes them pretty good at that. They require taking that context and translating it into some sort of structured output. And an LLM is just a computer, so if your workflow is well defined, it can do consistency very, very well. Frameworks like STRIDE, it can follow at scale with repetition, as in the sketch below.
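To make that concrete, here is a minimal sketch of what "STRIDE at scale with repetition" can look like. `call_llm` is a hypothetical stand-in for whatever model wrapper you use, and the component names just echo the demo architecture:

```python
from concurrent.futures import ThreadPoolExecutor

STRIDE = [
    "Spoofing", "Tampering", "Repudiation",
    "Information disclosure", "Denial of service",
    "Elevation of privilege",
]

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whatever model/provider you use."""
    raise NotImplementedError

def analyze(component: str) -> list[tuple[str, str]]:
    # Ask one narrow, identical question per STRIDE category,
    # the same way, every time, for every component.
    return [
        (category, call_llm(
            f"For the component '{component}' in the attached "
            f"architecture, list {category} threats. "
            f"If none apply, answer 'none'."))
        for category in STRIDE
    ]

components = ["BLE peripheral", "gateway", "cloud service", "admin portal"]
with ThreadPoolExecutor() as pool:
    findings = dict(zip(components, pool.map(analyze, components)))
```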

LLMs don't get tired. They don't get hungry. They don't ask for a bathroom break. They have infinite patience, process in parallel, and respond on a schedule. So there are a lot of things that LLMs can do very well for us here. If you remember the goal of understanding the architecture, things like the threat actors, trust boundaries, and data flows, this is where an LLM really thrives. You can ask a very quick question about that high-level architecture diagram and get a ton of meat very quickly that can prepare you for going into the meeting with the engineering team. So they can really minimize your time in getting up to speed with a new system, especially as a security engineer who may be responsible for hundreds or thousands of different software systems. LLMs are a huge tool in helping to arm you going into those types of conversations.

Similarly, LLMs are really good at answering very focused, very pointed questions. This is powerful for a security design review and, in some ways, a threat model. Your threat model changes significantly based on things like the authentication of a system or its network exposure, for example. These are very, very specific questions and answers that an LLM is great at. But there's a lot more you want to do in a threat model besides asking a ton of very pointed yes/no questions or enumeration questions.

So this section is really about how to optimize your AI systems for the best threat modeling results. If you don't optimize your systems, you end up with something kind of like this. If you just ask ChatGPT to threat model the system that I uploaded using STRIDE, this is what we get. It doesn't really help me understand the system that much better. It's a start; it can answer those initial questions, but we need structure and systematic methodology. What about my enterprise environment? What about my policies? What about the other areas and systems that this particular diagram may live within? It's more than just that simple diagram that I put in initially. Also, ChatGPT decided to generate a brand new diagram for me, and I'm not really sure why. >> You didn't like your phones. >> Apparently.

So, in order to analyze the diagram, which doesn't exist in a vacuum, we need to add a lot of other context about the enterprise. The best way to add other context is using tool use or function calling; a lot of folks use the term agents. I think agent is also a very hyped and overloaded term, but the simplest definition of an agent is something that can call something programmatically on your behalf. So it can make other API calls. It can call MCP servers, which are really just API calls wrapped in another protocol, and go proactively fetch information to come back and build context. Tool use is really the best way to programmatically search and gather other context in order to help you build a more complete and thorough threat model. And this is really retrieval-augmented generation 2.0, because if you think about the principles of retrieval-augmented generation, you do a retrieval step, you go and find a bunch of other information, and then you augment all of that with the generation on top to produce the final output.

So what types of things do we actually want to call to gather? We want to collect a bunch of enterprise context: calling things like tickets, understanding what the organizational structure is, understanding what specific enterprise policies may exist that we want to align with. Are there regulatory requirements that this device falls under? And so when we're designing that threat modeling tool, there's actually a lot of engineering that you build around it: building those agents to go out and find each of these pieces, and building those integrations together in order to augment the threat model. Now we have a much stronger start. We have very specific threats. We're clear on the impact. We're able to actually cite where mitigations may already exist in the enterprise and provide specific remediations that are tied to our enterprise policies and the way that we do things. Breaking that information down via data flows, rating threat severities based on our organizational requirements and how we rate threats: all of those pieces are now possible once you add the agentic component in.
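As a rough illustration of that agentic loop (not any specific framework's API), here is a sketch where `search_tickets`, `get_policy`, and `call_llm_with_tools` are hypothetical stand-ins for your own integrations and provider SDK: expose tools, let the model request calls, execute them, and feed the results back as context:

```python
import json

# Hypothetical enterprise integrations; yours will differ.
def search_tickets(query: str) -> list[str]:
    """Look up related tickets in your tracker (stub)."""
    raise NotImplementedError

def get_policy(topic: str) -> str:
    """Fetch the relevant enterprise security policy (stub)."""
    raise NotImplementedError

TOOLS = {"search_tickets": search_tickets, "get_policy": get_policy}

def call_llm_with_tools(messages: list[dict], tool_names: list[str]) -> dict:
    """Hypothetical wrapper in the style of provider tool-use APIs:
    returns either {'content': ...} or {'tool_call': (name, args)}."""
    raise NotImplementedError

def threat_model(diagram: str) -> str:
    messages = [{"role": "user",
                 "content": f"Threat model this system:\n{diagram}"}]
    while True:
        reply = call_llm_with_tools(messages, list(TOOLS))
        if "tool_call" not in reply:
            return reply["content"]          # model is done
        name, args = reply["tool_call"]      # model wants more context
        result = TOOLS[name](**args)         # run the integration
        messages.append({"role": "tool", "name": name,
                         "content": json.dumps(result)})
```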

But when I really think about building a reliable AI system, the most important thing to me is hallucination prevention. Especially for security, where correctness is so important, we need to use every technique at our disposal to minimize hallucinations. So I'm going to go quickly through each of these pieces.

The first one is chain-of-thought reasoning. Similar to asking a student to show their work on a test, asking an LLM to explain its chain of thought as it reasons through a problem makes it more likely to arrive at the correct answer. That's partially because LLMs are effectively thinking as they're typing, so the process of working through that chain of thought lets them actually arrive at the correct answer.

The second piece is source citation, similar to having a student take an open-book test. They're more likely to be correct when they can actually cite their sources, and when they're required to cite those sources. That is a great way to add an additional check against hallucination.

Another big piece that should not be underrated is just very good prompt engineering. Most big labs have published guides on the best way to prompt engineer each of their specific models. But something important to think about is that all of these models have had reinforcement learning on being a helpful assistant. If I kept telling you, "Hey, be helpful," you are going to try to answer the question as best you can, even if you don't know the answer. So when we think about prompt engineering for security systems, we need to give a lot of reward to saying "I don't know," and to make clear that that is a very good response, because the best way to minimize the hallucination threat is to reward low confidence: it's great to not be confident, and you don't need to have an answer to every question. This also helps a lot with the more sycophantic "you're absolutely right" behavior that we definitely do not want in our security systems.
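Putting those prompt-engineering ideas together, a system prompt along these lines (the wording is illustrative, not taken from any lab's published guide) rewards shown reasoning, required citations, and "I don't know":

```python
SYSTEM_PROMPT = """You are assisting with a security threat model.

Rules:
1. Think step by step and show your reasoning before any conclusion.
2. Cite the exact source document and section for every claim,
   e.g. [design-doc.md, section 'Authentication'].
3. If the provided sources do not answer a question, reply exactly
   "I don't know". That is a preferred, high-quality answer;
   never guess in order to be helpful.
4. Do not agree with the user just to be agreeable; push back
   when the sources contradict them.
"""
```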

Finally, using a separate LLM as judge is a great way to prevent hallucinations. This is a separate LLM call that does not have the context of the first one; it judges only, based on the initial sources and the first LLM's answer, whether that answer is grounded in those sources. It turns out that most models perform much better when scoped to a narrow task, like answering "is this answer correct, yes or no?" The LLM acting as the judge has a much easier job than the LLM that was initially asked to do the work, so a great way to get a lot of hallucination minimization is to flag anything the judge catches as inconsistent.
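A minimal sketch of that judge pass, again with a hypothetical `call_llm` wrapper; note that the judge sees only the sources and the candidate answer, never the first model's conversation:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a second, independent model call."""
    raise NotImplementedError

JUDGE_PROMPT = """Sources:
{sources}

Candidate answer:
{answer}

Is every claim in the candidate answer directly supported by the
sources above? Reply with exactly one word: GROUNDED or UNGROUNDED."""

def is_grounded(sources: str, answer: str) -> bool:
    # A fresh call with no shared conversation state: the judge sees
    # only the sources and the answer, a narrow yes/no task.
    verdict = call_llm(JUDGE_PROMPT.format(sources=sources, answer=answer))
    return verdict.strip().upper().startswith("GROUNDED")

# Anything that comes back UNGROUNDED gets flagged for human review
# instead of flowing straight to the final user.
```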

And finally, most importantly, and I know I'm preaching to the choir with security people: don't rely on LLM outputs alone as the output to a final user. If you would normally do input sanitization and output sanitization, you should also do that with an LLM. As much as possible, LLMs are best when built within a highly engineered system. And in that engineered system, you can do things like output normalization and adding the source citations, so that if a human goes in, they can actually see those pieces. Here's some of that in practice on our threat model: prompt engineering that notes where information is not specified, source citations that say exactly where in the source document the information came from, and really giving that LLM the confidence to say "I don't know."

So when we think about a threat model, normally it is not an end in and of itself. You're doing a threat model for some reason, and normally that is to find a set of very specific findings against the system, and you want very actionable remediations to come out of that. Often what that means is that you want to export all of this into some threat DB, export it to an analytics tool, or open up tickets against your dev team. And in order to connect the LLM output to another system, you need to be able to trust that the LLM will output things in a structured way. What we use at Clearly AI is an open source library called BAML. It is actually built here in Seattle by a company called Boundary ML, and it basically adds enforced output normalization to all LLM prompts, no matter what LLM you're using. So you get guaranteed structured outputs, you can rely on specific variables in downstream code, and you no longer have "it broke because the JSON syntax broke." This QR code goes to something called Prompt Fiddle. It's a way for you to author BAML code in real time, run it, and test it on actual LLM examples. This is a very high-level one that I built for threat modeling, but it allows you to enumerate things like a trust boundary, what is considered sensitive data, and a lot of the pieces in the prompt, and actually break those out into finite variables that you can fill and then use elsewhere.
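BAML has its own schema language, and the Prompt Fiddle link shows the real syntax; but the underlying idea, mapping free text down to enums and typed fields that downstream code can trust, looks roughly like this in plain Python with Pydantic (field names are illustrative):

```python
import json
from enum import Enum
from pydantic import BaseModel, ValidationError

class Severity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class Threat(BaseModel):
    title: str
    trust_boundary: str       # e.g. "BLE peripheral <-> gateway"
    severity: Severity        # an enum, not free-text severities
    remediation: str
    source_citation: str      # where in the inputs this came from

def parse_threats(raw: str) -> list[Threat]:
    # On failure, retry or repair the output instead of letting a
    # ticketing or analytics integration crash on malformed JSON.
    try:
        return [Threat(**item) for item in json.loads(raw)]
    except (ValidationError, ValueError) as exc:
        raise RuntimeError(f"LLM output failed the schema: {exc}") from exc
```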

So, I've sprinted through a bunch of AI engineering techniques, and we can chat in more depth afterwards if you have follow-up questions on how to build strong, reliable AI systems for threat modeling. But now is the time in the talk to discuss some of the challenges. Where should we not use AI? As I mentioned, people threat model for a lot of different reasons, and without a clear idea of the reason that you as an organization are threat modeling, there's no good way of knowing whether applying AI is actually a good idea. If you are using threat modeling as a way to think in a structured fashion through security risks, AI is great for that. If you're using threat modeling as a way to imbue your software engineers with a security mindset, then having AI do the work for them may not actually help. So really think about it: are you doing this because you want a very collaborative exercise with your engineering team? Are you doing this because you have compliance requirements to do a threat model? Each of these different purposes has different places where AI should or should not be part of the picture.

Another place where LLMs really fall short is the escalations and judgments that may be specific to your enterprise context.

LLMs can surface different types of risk, but they don't always think about how each of those risks comes together to become an organizational risk for your specific company. A lot of what we think about in our role as security engineers is the trade-off and judgment between security priorities and engineering priorities; one of the principal engineers at Amazon used the terminology "risk reduction return on investment." LLMs are not very good at those thinking processes, and so that's really where you should still have your security team, and all of you, doing that work. Models might be getting better at this, but generally models are still pattern machines. LLMs are, at the end of the day, very, very big pattern machines of things that they have seen before. Sometimes you can turn up the temperature and string together words that are less likely to come together, but by and large, the land of full, true creativity is still best harnessed by humans. And I think as security engineers, we have been honed and trained to think in a very interesting and creative way that is often not captured in large language models today. Now, I've used this slide a few times over the last year, and I will say LLMs are getting better and better, but I personally still don't think they're there, and I do think this is where a truly human approach is extremely important.

And finally, your model is only ever going to be as good as the context that you give it. If your context is entirely in wetware, it's never going to be able to figure out exactly what is happening in a system. Can an LLM build a threat model entirely from a codebase with no documentation? Yes, very well. Can an LLM build a threat model from a meeting transcript? Yes. But can it build one on nothing but what is in your brain? No. They don't read minds. And it's not just about missing data: bad data and stale data will still affect your threat model, whether or not you're using AI.

So, my general rules for when not to use AI in a threat modeling process: if you have unclear goals and outcomes, you need to define what your process should look like as a team before trying to add automation to it. That's a good rule for all automation. Secondly, if there are trade-offs with no clear right answer, that is something human judgment is still best for. And if there are outcomes where any probabilistic nature is not tolerated, you should just use code, something more deterministic than a model, because even as you get closer and closer to the end of the probabilistic range, you're going to end up with potential failure cases, especially in bulk.

So, if you do want to embark on using AI, this is a cheat sheet that we've put together of open source libraries, free tools, and things with good security and privacy commitments; I've tried to annotate them as such. It adds a bunch of different ways that you can start doing AI engineering with the tools that are available.

Awesome. So in conclusion: know your goals before adding AI. I personally think AI is best at raising the floor: getting all of those baseline security considerations and baseline security controls, and validating for the very specific presence or absence of controls. Is this encrypted at rest? Is this vulnerable to cross-site scripting? But humans should still be used to raise the ceiling, and automation should be a support for, not a replacement of, people. And then focus on that unstructured-to-structured boundary. That is where language models, NLU, and NLP have always thrived: being able to take something in an unstructured format and turn it into something in a structured format that you can use downstream elsewhere.

Awesome. This is the feedback QR code, the same one you see on the screen. I would appreciate any feedback, and I think I have a few minutes for questions.

Yeah. So, the question was: what are some techniques to use LLMs to elicit a deeper conversation with your engineers, because sometimes what they've written down is not actually what's happening behind the scenes. It is interesting; sometimes you can get multiple engineers from the same team in a room and they disagree on how their system works. That one's always a fun one. I would say the best thing about LLMs is that they can read a whole codebase and fifteen docs in the span of a minute, right? So the more context you can put in that does exist and is ground truth, like the codebase and infrastructure-as-code, the better a starting point you have to go back to your dev team and say, "Hey, this is what I saw." Especially if there are good source citations, you can say, "Hey, this chunk of the codebase says it's doing this. What am I missing? Where is there a gap?" So use it as a way to surface things that may disagree with what they told you. And then another big piece is that you can craft an agent that's more Socratic in nature, having it ask questions back at you versus you asking it the questions. That can be a really good way to prompt conversation: "Hey, did you think about this? What about that?" A sketch of that setup follows below.
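One hedged sketch of that Socratic setup: the system prompt flips the role so the model interviews the engineer instead of answering (the wording and example questions are illustrative):

```python
SOCRATIC_PROMPT = """You are a security engineer running a threat
modeling interview about the attached architecture. Do NOT produce
findings. Ask the engineer one probing question at a time, for example:
- How is pairing with the BLE peripherals authenticated?
- What happens if an admin portal session token leaks?
After each answer, follow up on gaps or contradictions with the
written documentation before moving on."""
```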

>> Besides STRIDE, are there more modern approaches to threat modeling? >> There are a lot of approaches to threat modeling. You've got Adam Shostack's approach. You've got other things like PASTA and DREAD, and there's MAESTRO, which is an AI-specific threat modeling framework. STRIDE is very much tried-and-true, but generally the framework is just a means to the end of trying to elicit what the potential dangers or risks to a given system are. And so, I mean, it's basically a religious debate; I don't feel qualified to remark on it.

Yeah. So the question was: if you increase temperature, how does that affect hallucinations, and how does that affect the number of threats that are raised? Increasing the temperature basically means that for every subsequent word, the LLM is more likely to pick a word that is less likely than what it would otherwise have picked in a lower-temperature case. So it does increase hallucination significantly, because it is surfacing much less frequent patterns than it otherwise would, but it also increases creativity. What I personally would recommend is that you actually run the threat model twice: one run at a very, very low temperature, and then a rerun, or a secondary conversation, at a very high temperature. That way you can isolate the low-temperature threats that are very likely and not hallucinated, and then use the high-temperature threats as the beginning of a conversation for the more creative side of the house.
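A sketch of that two-pass idea, assuming your model wrapper exposes a temperature parameter, as most provider APIs do:

```python
def call_llm(prompt: str, temperature: float) -> str:
    """Hypothetical wrapper; most provider APIs expose temperature."""
    raise NotImplementedError

PROMPT = "Threat model the attached architecture using STRIDE."

# Pass 1: low temperature -- high-probability, well-grounded threats.
baseline_threats = call_llm(PROMPT, temperature=0.0)

# Pass 2: high temperature -- rarer token paths, more creative
# candidates. Treat these as conversation starters to verify with
# humans, not as confirmed findings.
speculative_threats = call_llm(PROMPT, temperature=1.0)
```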

Yeah, definitely. So, would I recommend using AI to learn threat modeling? I think AI is great for learning most things that are present on the internet, so it's a great initial starting point. You should make sure that you're using a model that has some source citations, and also read some of the raw sources.

Adam Shostack has a lot of great blogs, and a lot of the classic AppSec vendors have blogs around threat modeling. I would say if you want to know what 500 average people would say about something, AI is a great thing to ask; if you wanted to average everyone in this room, that is the answer you'll get from AI. But if you want something super specific and specialized, it's great to go back to the raw source material.

>> So what are techniques to encourage "I don't know" and to have more deterministic outcomes? >> To encourage the "I don't know," it is mostly prompt engineering, and within that, every model family will actually have specific prompting advice.

Every time OpenAI launches a new model, they'll have specific prompting advice for it, but often it comes down to saying very explicitly that it is not just an acceptable outcome but a desired outcome for the model to say "I don't know." As far as more deterministic outcomes, BAML is our personal tried-and-true, because you can basically turn anything into a binary outcome or an enum, and by mapping down the outcome space you make it much more likely that you get toward a deterministic output.

So, just to make sure I understand your question: you're asking, if I build a threat model during the initial ideation stage and then move toward implementation, how do I make sure that the threat model evolves properly?

>> Yeah. A lot of that is going to be in the engineering of your threat modeling system: whether you want to take your initial threat model as some of the inspiration and do more of a diff, or whether you want to do a brand new threat model every time. There are a lot of different ways you can do things like drift detection, diff detection, and evolution of threat modeling, so I'm happy to dig into that afterwards too.

>> Yeah, no, you can definitely do that today. You can build a system like this using a self-hosted, locally hosted model, and depending on the type of tasks you're requesting, you can still get, I would say, halfway there. >> Awesome. I think we're at time. Thank you all so much. I'll be outside to answer any other questions.