
AI Agents: Augmenting Vulnerability Analysis and Remediation

BSides KC · 40:32 · 374 views · Published 2025-06
Category: Technical
Style: Talk
About this talk
"Are AI agents worth the hype? In this talk, we’ll explore the tangible impact of AI agents in cybersecurity, focusing on how they can be used to automate proactive security workflows at scale. AI agents can be used to augment traditional human-driven processes to identify, assess, and remediate vulnerabilities. We’ll highlight real-world case studies to show where AI agents excel, where they fall short, and lessons I've learned along the way. We'll also discuss the technical challenges of implementing agentic security solutions, from managing hallucinations and building human-in-the-loop workflows to integrating agents with existing security datasets for improved performance. We’ll also discuss the broader implications for security teams: how AI-driven automation is shifting the role of human analysts and changing the way organizations approach cyber resilience."

"The AI hype cycle is in full swing, making it increasingly difficult to separate reality from marketing buzz. Will AI revolutionize cybersecurity, or is it another over-promised technology? In this talk, I’ll share my personal journey using Large Language Models (LLMs) to automate my daily security workflows, with a focus on proactive security. We’ll explore using AI agents to assist human analysis related to identifying, assessing, and remediating vulnerabilities, CVEs, and misconfigurations. We’ll review real-world examples where AI agents genuinely shine and where human expertise remains crucial, challenges LLMs face when performing critical cybersecurity operations that leave no room for error, and discuss what the future likely holds.

Topics We'll Cover:

1. Real-World Use Cases: AI as a Force Multiplier

AI isn't replacing cybersecurity professionals anytime soon, but it can make you more efficient. 
We’ll break down examples of automating specific security workflows that AI agents can perform, including:

Triage: Given a result from a vulnerability scanner or tool, how can an AI agent help you determine whether it’s a false positive or a true positive?

Analysis: How bad is this vulnerability? Is there public proof-of-concept code related to this CVE? We’ll review how an LLM can be used as a super-powered Google to identify and analyze the data you need to quickly make a security decision.

Remediation: Tired of searching for vendor advisories? We’ll discuss how you can build an AI agent to automate generating a customized remediation plan for a given vulnerability and organization.

Lastly, we’ll explore how security teams can leverage AI to perform repetitive tasks so you can prioritize deliverables that push your organization forward.

Challenges in AI-Driven Security Operations

While AI has already transformed industries like customer support and sales, cybersecurity presents unique challenges that make teams hesitant to fully trust AI-driven analysis. Unlike other fields, where minor AI mistakes may be tolerable, cybersecurity has limited room for error: a single misstep can lead to unnecessary remediation efforts, while a missed threat could result in a major security breach. Strategically speaking, how do we ensure that AI-driven security decisions remain explainable, verifiable, and reliable? Technically speaking, how can we limit LLM hallucinations to build trust in scenarios where even a small error could have serious consequences? We’ll discuss strategies to minimize hallucinations, using retrieval-augmented generation (RAG) to ground LLM outputs in authoritative sources and implementing strict validation layers before acting on AI-generated recommendations. 
We’ll also discuss human-in-the-loop systems that allow security analysts to validate AI outputs before execution, leveraging confidence scoring and explainability techniques to clarify why AI reached a certain conclusion, and building audit trails to ensure AI-driven decisions can be reviewed and justified.

The Future of AI in Cybersecurity: Where Are We Headed?

As AI continues to evolve, what might the next five years look like for security teams? We’ll explore the possibility of AI-powered autonomous security agents that identify, prioritize, and patch vulnerabilities in real time. Will we become subservient to our AI overlords, or will human oversight remain a requirement?"
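The grounding-plus-validation pipeline the abstract describes (RAG with authoritative sources, a strict validation layer, confidence scoring, a human-in-the-loop gate, and an audit trail) can be sketched roughly as follows. This is an illustrative sketch, not code from the talk; every name here (`Finding`, `validate`, the 0.8 threshold) is a hypothetical placeholder, not a real vendor SDK:

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    """A model-generated answer about a CVE, plus its grounding metadata."""
    cve_id: str
    summary: str
    confidence: float                                  # model's self-reported confidence, 0.0-1.0
    sources: list[str] = field(default_factory=list)   # links the answer was grounded in

@dataclass
class Decision:
    finding: Finding
    action: str  # "auto_approve" | "needs_human_review" | "reject"

AUDIT_LOG: list[dict] = []

def validate(finding: Finding, threshold: float = 0.8) -> Decision:
    """Validation layer: reject ungrounded answers, route low-confidence
    ones to a human analyst, and record every decision for later audit."""
    if not finding.sources:
        action = "reject"              # no grounding sources: treat as a likely hallucination
    elif finding.confidence < threshold:
        action = "needs_human_review"  # human-in-the-loop gate
    else:
        action = "auto_approve"
    AUDIT_LOG.append({
        "cve": finding.cve_id,
        "action": action,
        "confidence": finding.confidence,
        "sources": list(finding.sources),
    })
    return Decision(finding=finding, action=action)

# Example: a well-grounded, high-confidence answer passes straight through,
# while an ungrounded one is rejected before anyone acts on it.
grounded = Finding("CVE-2024-4577", "PHP CGI argument injection RCE", 0.92,
                   ["https://nvd.nist.gov/vuln/detail/CVE-2024-4577"])
print(validate(grounded).action)  # auto_approve
```

The point of the sketch is the ordering: grounding is checked before confidence, so a confident hallucination never reaches an analyst's queue, and every decision, approved or not, lands in the audit trail.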
Transcript (en)

Thanks for having me, everybody. This talk is going to be about AI agents, which is a fairly new space, so it's something I'm excited about, and hopefully I'll get you excited about it as well. A bit about me: my background is that I was initially a software engineer and then moved into cyber, where I've been for about 10 years. I started my career in threat intel, building malware analysis pipelines and tracking threat actors. And then I moved

to CrowdStrike Services, on their cyber intrusion team. So I've seen a lot of cyber breaches, nation-state and e-crime: Russia, China, North Korea, Vietnam, pick whoever targets Western entities. I also split time between the incident response team and the red team, so I've hacked into a good chunk of the Fortune 1000 as well. I left in 2022 to start a business. We're backed by Google and we're building in this space. I promise this is not a vendor talk, I hate those, so hopefully we'll just be educating folks about language models today. So that's me. I just think

background is important to understand where I'm coming from. So, why should you care about this space? I think large language models will revolutionize most industries. They're a derivative of neural nets, and neural nets are what power self-driving cars. I don't think you have Waymos out here yet, but it's pretty fun seeing them drive around California: there's just no one in the car, and it drives pretty well. Essentially, I think you can apply that same idea, where you're going to have a machine taking a lot of actions on behalf of something a human used to do. And

I think that'll apply to a lot of different verticals. But with that said, cybersecurity has a low risk threshold, right? If we make a mistake, if we typo a command, if we accidentally delete System32, catastrophic problems can happen. So I think cybersecurity will be one of the last industries to fully adopt these models, which means your jobs are all safe. But I think what that also means is that there will be human-in-the-loop workflows, meaning you're pairing with an AI agent to do certain things, or to automate or augment what you do today. And so we'll walk through a lot of that

today. The point being, I really think this is a technology you should all try to adopt. Like email, it's going to be the new standard; everybody's going to be using it, so I think it's important to understand how and why to use it. So with that said, goals for this talk: first, build the building blocks to understand AI agent terminology and agentic workflow patterns. Similarly, make sure you understand the basic building blocks of AI agents, so that if you wanted to start writing your own, you could. And the last

thing is that there are some inherent risks, as I just mentioned, so hopefully we'll talk about some strategies to mitigate those risks as well. Now, I know there are probably different levels of familiarity with language models in the room, so I want to do a few quick slides on the basics, and then we'll get into some of the more fun technical material. So, what is a large language model, oversimplified? If there's an actual AI person in here, they'll probably not agree with this, but that's okay. Really, what it is is an advanced autocomplete system. The simplest version of this: if you

use Gmail for your email today, when you start typing a word, it knows what word you're typing. Essentially it's that, times ten, where it can start to predict paragraphs. It's predicting the next token in a sequence, where a token is just a word, a character, or a number, and it's based on its training set. Large language models have huge training sets, and it's pretty much everything: books, articles, the New York Times, YouTube data. They're cramming all of that into a training set and having the model learn from that

training set. I specifically put the New York Times there because there's an ongoing lawsuit between the New York Times and OpenAI, where the New York Times sued them for using their data without permission; I believe that's still ongoing. They pretty much used anything that was public and said, "Hey, we're going to build our language model off that. There are probably going to be a lot of lawsuits down the road, but for now, who cares?" But I'll caveat that it's not necessarily security data. There's some security data in books and articles, right? But is it DEF CON talks? Is it Black Hat talks? Is it

private research papers? No, right? It's pretty much just all the public data these vendors could get their hands on to train the model. And I'll say something that's really important: it's really expensive to train these models. GPT-4, which is OpenAI's generic model, cost about $100 million to train, and they trained that about a year ago. They recently released GPT-4.5, and that cost about $500 million to train. These are very expensive, which means you can't train them that often, right? That'll be a recurring theme as we go through these slides. In terms of thinking about different models, there's open source and then

there's proprietary. On the proprietary side there's OpenAI, Claude, and Google; on the open-source side there's Mistral, Llama, and DeepSeek, and that's six of, I don't know, twenty. But the general idea I want to share here is that I'm not sure this will be like search, where Google pretty much just won that space as a monopoly. I think we're starting to see these models develop specializations. As an example, Mistral is recently specializing in Arabic, Claude is trying to focus on the developer space a little bit, and I think OpenAI is more consumer-focused. So I

think, just broadly speaking about markets, we'll start to see some of these models specialize in certain areas. So maybe a little bit different than your traditional search. Okay. A few additional differences between open source and proprietary. With open source, you can run it on your laptop, so the cost is pretty low; it's just an infrastructure cost. Performance for open source is going to be slightly behind proprietary, but you can customize a lot more. And then I'd say it's a little bit less reliable, because you're

probably going to be running it on your own infrastructure. But with that said, because it's yours, you also have more privacy and control in terms of where your data is going. Okay, maybe one or two more slides on foundations. So each vendor, even just OpenAI, has a ton of different models, right? This is the release pattern of OpenAI's models over the last year, and I'm actually missing one; there's another they released in September. But as you can tell, because training models is so expensive, they only do it once every few months or even longer,

right? And they're releasing different types of models for different specialties: they have generic audio-visual models, they have a reasoning model. You can imagine how this list quickly grows when every vendor is releasing all these different models. So from our perspective, building in this space, the question is: how do you know which model to use? I think that's a really important question, and we'll revisit it as we move along. Okay. So how do folks decide which model to use? There are actually a lot of LLM benchmarks. These are pretty much websites you

can go to, and, without going into how they're grading these language models, they're essentially trying to grade them on various characteristics. It could be reasoning, it could be code, and it's basically just a scoreboard. And this scoreboard changes all the time, because there's a constant game of leapfrog where a vendor releases a new model and all of a sudden it's way better than the previous one. This is a screenshot from March 21st, right after Gemini 2.5 came out, and you can see it pretty much has the best ratings across the board. Then I took one this morning, and it looks

totally different, because OpenAI just released o3 and o4, I think last week. Claude will probably release something next, and then this will change again. So I think it's interesting building in this space: how do you consistently use the right model for your use cases, especially when models are changing so fast and you don't know which one to use? That's a topic we'll talk about. Some high-level notes: if you don't care about the output as much, as in you don't need it to be the best all the

time, and it's a simpler task, I would heavily recommend using an open-source model. But if you really care about the output, which is often the case in cybersecurity, sadly, you want to make sure you're using the best model because you want the best output, and there I'd highly recommend proprietary. But the main point I'm trying to make is at the very bottom here: due to all these changes, even if you say you're going to use proprietary models, it's important to build in a way that lets you easily plug and play different language models. Meaning, if

Gemini comes out with a better model, and for whatever reason Claude or OpenAI starts to be terrible, you want to be able to easily switch out the model you're using for whatever model you want to be using, right? We'll talk about strategies to do that. Okay, so that was all foundations. Now let's move to some more cybersecurity-focused workflows. So, CVE analysis. I feel like this is a pretty standard workflow that hopefully everybody's at least somewhat familiar with. When I'm looking at a CVE, my first question is: what is this CVE? Is

there threat actor activity affiliated with the CVE? Is there public proof-of-concept exploit code, and if yes, can we generate it? And then, more importantly, how do I fix this? That's usually my workflow. In terms of how I'd typically figure this out: I'd look it up in the National Vulnerability Database and read the reference links. I'd maybe Google for vendor advisories, blogs, threat intel, whatever it might be, basically trying to answer those questions. And if you have some vendors that you use, you'd probably add a few sources depending on who your vendors are for your

org. Okay, so just to level set, and then we're actually going to use a language model to try to automate all this. What is this CVE? This is CVE-2024-4577, a PHP RCE vulnerability: it affected the CGI component of PHP and allowed remote code execution on Windows systems with specific language configurations. It's a mouthful. It was used by a variety of threat actor groups, including ransomware groups, etc. There is public proof-of-concept code, and really the simple fix is just

to upgrade, but hopefully we'll get some more information with the language model. Okay. So I'm going to try to be a smart analyst and say, hey, I want to use a reasoning model, because we want to pick the right model for this complex workflow. So I'm going to use OpenAI o1, which came out in December 2024. Just to go back really quickly: this CVE was publicly reported in NVD in June, and public exploitation started in July of 2024. And we're going to use a model that was released by OpenAI in December 2024. And I wanted to poll

everybody: does everybody think that o1 will be able to answer these questions completely? Maybe just a raise of hands, just kind of curious. Nobody? Okay. Partially? All right. Not at all? One. Okay. So it is "not at all." And it's because, as we talked about, the language model was trained a long time ago, and it's super expensive to train, so even though they released this model in December, it has no knowledge of a CVE that was published in June. That gives you some context for why there are still a lot of problems with language models. But thankfully,

we're not the only ones with this problem. Most folks, regardless of whether you're in cybersecurity or in biology, have this problem where they can pretty much only ask the model questions about, I don't want to say stale data, but older data, right? So the big question is: how do we solve this problem so that language models have current, up-to-date context? Okay, so there's something called retrieval-augmented generation, or RAG. I'm going to oversimplify this again, but essentially language models have a context window, and this is in

addition to the training set they were trained on. You can picture it like this: if you pull up ChatGPT and start typing in that chat window, you're essentially typing into the context window. The context window was actually initially super small, but over time the proprietary vendors realized this is a really important feature, so they've started to extend it. That lets you solve the problem we just had, right? You can shove data into the context window, and then, in addition to the training set, the language model can analyze the context

window for additional information. Initially, some of the smallest context windows were around 8,000 tokens, which again are pretty much just words, and I think some of the larger context windows today are close to two million. So we've seen a lot of growth here, and I think everybody recognizes this is probably going to be the way of solving this problem moving forward. Okay. And here's a diagram of the architecture of a RAG application; unfortunately these slides are a little small, sorry. You have your user query, and RAG is

essentially this: the application recognizes that the data it needs is not in its training set, so it goes and queries that data somewhere else. It retrieves that data and throws it into the context window. So in addition to its training set, it has this extra data it just retrieved, and it uses both in combination to generate a text response. Cool. Is this going to work? We'll see. All right, so this is a demonstration. We're going to use 4o here, not o4, excuse me. GPT-4o came out a little bit

afterwards. And again, OpenAI also recognized that this is a problem. So I'm going to run this. I know you can't see it because it's way too small, but basically we just ask the same question: "Hey, can you describe this CVE?" And what's happening on the screen is that there's something in gray that says "searching the web." What OpenAI realized is: hey, we're not going to retrain our models all the time, so we need a way to find relevant, current data. This came out in the last probably three

or four months: they've integrated a pipeline where, if they recognize that their model doesn't have the right data, they'll query Google, get the links, and shove that data into the context window to generate a valid response. I'm just going to let this run out. So this is the exact same question, just a different model that has this RAG capability built in. What it's showing here is that there are links on the right-hand side, and it basically used those links. It's actually doing something really cool, I don't know if you can see this, but if you hover over a link,

it actually shows what information in the response was generated from that link. So it's using these links to dynamically generate an accurate response. Okay. The argument I'm trying to make here is that RAG matters especially for cybersecurity, where things are always changing really quickly, especially for threat intel, right? We want to know if a threat actor is actively exploiting something. We want to know about brand-new CVEs. We don't want to wait six months for a language model to train on that data. This RAG pipeline, in my opinion, is the most important piece for cybersecurity workloads. It pretty much

doesn't matter what you're doing; in my opinion, this is the most important part, and so we're going to talk a little bit about strategies to build these as well. Okay. I'm calling this "Google RAG." That is not a real definition; it's a made-up term. Google has called this "grounding," and there is a general term of grounding, where essentially you're using external data sources to ensure that the language model outputs correct data. But I just want to be clear that "Google RAG" is not a real technical term. So this is hopefully a slightly bigger screenshot of some of

the responses that were just in the video. It looks pretty good compared to the first response: it's actually giving information about the CVE, it's saying which versions are affected, and we can actually use this data as security practitioners. Is there threat actor activity affiliated with the CVE? Looks pretty good. It notes some ransomware groups and says there have been mass exploitation campaigns. It's a little vague, because it's using Google, which doesn't have the best threat intel, but decent. And then: is there public proof-of-concept code? It says yes, but it doesn't generate it for us. We'll talk about generating code in a little bit, but that's generally a

pretty good response with this RAG workflow. And then: how do I fix this? It gives us some versions to upgrade to. So, pretty good using Google, right? Now, the drawbacks. I don't want to have to manually go into OpenAI and type a question every time; that sounds terrible. I'd rather somehow automate this. Essentially, what we just showed is a point-in-time analysis using Google. We also don't get to control what sources the language model is using, right? It's just using Google, which is really just looking at search-engine-optimized articles. So we don't have control. If

you're an organization, you probably have your own threat intel, or maybe you want to use whatever sources you want to use, and you don't really have a choice in what the language model is pulling from. And then, more from an enterprise or organizational perspective, the answer also isn't customized to your network. It's pretty much a generic answer, with no context about your network. If you're looking up a CVE, it's probably because you're affected by it, or you're trying to figure that out, and a generic answer is just okay. So how do we improve this? Automate all the things. So

we're going to try to automate this. As I mentioned, we also want to integrate internal data that's relevant to our network. So not just public Google data: maybe if we have a CMDB, we want to pull that information into the context window so the LLM has that custom context. Same with custom threat intel: if you're a Mandiant, CrowdStrike, or Palo Alto Networks shop, pick your vendor, their threat intel is usually a lot better than what's on Google. And then last, if you have a SIEM or an EDR, Splunk, whatever it might be,

pulling that relevant data into the context window as well. We'll walk through some strategies here. Okay. So to automate this, we're going to build agentic systems, which is just automating exactly what we did manually; you're going to write some code to do it. I understand this is small again, but essentially each proprietary vendor has its own API, right? So you can write a Python script that integrates with it, makes the API call, and automates that question-and-answer loop. This is feasible, but going back to a topic from before: because of how quickly language

models are changing, if all of a sudden OpenAI starts to be terrible, we want to be able to quickly switch to Gemini or to a different model. And I don't want to have to write this six times, once for each model vendor I want to use, right? So can we make this a little easier, so that we can easily change language model vendors? I'm going to propose that we use agentic frameworks. This is still a relatively new space; it's probably about two years old, which I guess is about how long it's been since language models blew up. But essentially,

there are abstractions on top of the language models that allow you to be LLM-agnostic, so that you can plug and play, and I'll show an example of this. They also come pre-built with a lot of popular agentic patterns, which we'll walk through, and they let you provide the language model with tools, which we'll also get to. The one downside is that because this space is so new, even these libraries can be a little unstable. As examples: there's LangChain, and LangChain and LangGraph are actually from the same company, they just released different tools; CrewAI is a

separate startup; and AutoGen, on the right, is Microsoft's framework. Because of how new this space is, if you were using LangChain 0.1 and tried to upgrade to 0.2, most of your code would break. Thankfully, it's getting more stable, and that's not really a problem anymore. But I still highly recommend using these frameworks, so we'll walk through some examples. Here's one: you just need to change one line of code to change your model, and everything else stays the same, right? This is sort

of the benefit: if you wanted to use Gemini, Claude, or OpenAI, you just have three lines, you comment one out, and everything else stays the same. It really allows you to plug and play and figure out which model works for your use case. Okay, to go into a bit of depth on terminology: Anthropic came out with a blog post in December where they try to define some nomenclature for agents, and I'm going to use their definitions. There are "workflows," which are agentic patterns that have predefined

code paths. You can think of LangChain here; that's what the name suggests. You're chaining things together and you can't go backwards; you're just going from one step to the next. And then there are "agents," which give the language model more flexibility in how it accomplishes tasks. This is closer to a traditional DAG, a directed acyclic graph, where the language model can decide how it solves the problem. We'll walk through both of these as well. Okay. So, building workflows. We're going to try to

automate this, starting with the top question. If we want to describe what the CVE is, really what we're doing as humans is reading through all of the reference links included in the National Vulnerability Database entry and trying to summarize that text, right? That's pretty much what we're doing. Same thing for threat actor activity: we're going to read a variety of blogs and try to summarize that text as well. So really, it's a text summarization workflow. I pretty much just described this, but

essentially, this is the NVD references table, and all you're going to do is read through all those links and try to answer these questions. Okay. I'm not going to go over this in detail, but it's a good example of the power of LangChain. Cybersecurity isn't the only vertical trying to summarize text, right? So what LangChain has done is build pre-built workflows that automate common patterns, and one of them is a text summarization pattern. They do a bunch of fancy map-reduce stuff; I haven't looked at map-reduce since college, and I don't want to. And

so I just use their function to summarize all this data that we're going to throw at the model. Again, it's just one example of how LangChain has this complex text summarization workflow pre-built into the framework. There are also different types of workflows; let me show one more example: refine. Instead of using map-reduce, for each article in that NVD table, they have another pre-built workflow, called refine. It's going to ask a question, read the first article, and generate an

answer, and then there's a loop where it iterates through every remaining article, and if the answer changes, or if there's information in an article that contradicts its first answer, it will update its answer, right? That's called a refine workflow. There are different use cases for all these workflows, and you'll have to figure out what works for your specific use case. As an example, with the refine workflow, a little different from summarize, maybe you want to aggregate evidence of threat actor activity, right? It would read through each article and

update its answer based on activity that in each article, right? Versus if it was a summarize, maybe it'd miss some of that, right? Uh so it's sort of like you have to figure out which workflows work best for your questions that you're using the language models to analyze. Okay. Uh so agents uh so there's lang chain and there's lang graph. And if we were to go back to a few slides there's crew ai and then there's autogen. Uh I'll say that I mostly use lang chain and lang graph here uh because uh I like their marketing uh just in terms of like I think it's easier to understand if you're doing a lang chain. It's a
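The refine loop I just described can be sketched in a few lines. This is a minimal illustration, not LangChain's actual implementation: `fake_llm`, the articles, and the question are all stand-ins for a real model call.

```python
# A minimal sketch of the "refine" pattern with a stubbed model.
# fake_llm stands in for a real LLM call; the articles and the
# question are made up for illustration.

def fake_llm(question: str, article: str, prior_answer) -> str:
    """Stand-in for an LLM: revises the answer if the article adds evidence."""
    if "exploit" in article.lower():
        return "Public exploit code exists."
    return prior_answer or "No evidence of exploitation yet."

def refine(question: str, articles: list) -> str:
    # Read the first article and draft an initial answer...
    answer = fake_llm(question, articles[0], None)
    # ...then iterate through the rest, updating the answer whenever
    # a later article adds or contradicts information.
    for article in articles[1:]:
        answer = fake_llm(question, article, answer)
    return answer

articles = [
    "Vendor advisory: patch released for the flaw.",
    "Researcher blog: a working exploit was published on GitHub.",
]
final_answer = refine("Is there public PoC code for this CVE?", articles)
```

A plain summarize pass compresses everything once; the refine loop instead lets later evidence overturn an earlier answer, which is why it suits the threat-actor aggregation case.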

Okay, so: agents. There's LangChain and there's LangGraph, and if we went back a few slides, there's also CrewAI and AutoGen. I mostly use LangChain and LangGraph, partly because I find their framing the easiest to understand: a LangChain pipeline is a predefined code path, where you're just going one way, while LangGraph gives an agent the ability to go back, or to choose how it wants to do something. This is a basic example I pulled from their documentation: at certain points, the language model has the option to say "something's wrong, I want to go back to a previous state," or to choose which data source it wants to query for a certain data set. You're giving the language model the ability to choose how it solves a problem.

This agentic style is usually used for more complex workflows. If you've ever used Cursor or Copilot, code generation is agentic precisely because those tasks are so complex. I'd also argue that "how do I fix this?" should be agentic, especially if you're trying to customize it against whatever internal applications you use.
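That go-back-or-proceed routing can be sketched without the framework itself. This is plain Python standing in for LangGraph-style conditional edges; `decide` plays the role of the model choosing the next step, and all the names are illustrative.

```python
# A minimal sketch of LangGraph-style routing, with plain functions
# standing in for graph nodes. decide() plays the role of the model
# choosing whether to go back to a previous state or move on.

def query_source(state):
    state["attempts"] += 1
    # Pretend the first query returns incomplete data.
    state["data"] = "partial" if state["attempts"] == 1 else "complete"
    return state

def decide(state):
    # The "model" inspects its state and picks the next edge:
    # retry the query node, or proceed to the answer node.
    return "retry" if state["data"] != "complete" else "answer"

def run(state):
    while True:
        state = query_source(state)
        if decide(state) == "answer":
            return state

result = run({"attempts": 0, "data": None})
```

The contrast with a LangChain-style pipeline is that nothing here is a fixed sequence: the loop exists only because the decision node can route backwards.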

So let's walk through a security example of an agentic framework. Say we want to remediate CVE-2024-4577. We can have an orchestrator agent in the network, and the orchestrator gets to make a decision about the data it queries for. In one instance, maybe we want additional information about the affected asset, meaning which web application or host has this PHP installed, so it would know to query the CMDB, whether you're a ServiceNow shop, Freshworks, or any other CMDB. Or maybe (this isn't a great example for PHP) the vulnerability we're analyzing is in a Cisco product; Cisco, I'm pretty sure, doesn't have public documentation, so we'd need to build a different way for the language model to query Cisco's documentation. We can do that through a vector store, but that's a separate topic. The point I'm trying to make is that we're giving the language model the ability to query for the things it thinks it needs to solve the problem. It has more flexibility: we're not hard-coding "query this thing to get this answer," we're letting the model figure it out.

Once you get that data, you can have a synthesizer agent that takes it all and combines it. What's very common in more agentic patterns is an arrow from the synthesizer back to the front, so the synthesizer can also make a decision: "do I even have the right data, or do I need to go back and query for more?"

Now, retrieval-augmented generation, or RAG, was initially designed for the vector DBs at the bottom of this diagram: searching through a huge amount of text and retrieving contextual information relevant to the question.
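Here's a rough sketch of that orchestrator/synthesizer shape. The data sources, the routing rule, and the CVE details are all illustrative stand-ins; a real orchestrator would be an LLM choosing tools, not a keyword match.

```python
# A minimal sketch of the orchestrator/synthesizer pattern.
# SOURCES stands in for real integrations (CMDB, vendor docs, etc.).

SOURCES = {
    "cmdb": lambda cve: {"asset": "web-01", "software": "PHP 8.1"},
    "vendor_docs": lambda cve: {"fix": "upgrade to the patched release"},
}

def orchestrator(question: str) -> list:
    # Decide which data sources this question needs (a real agent
    # would reason about this; keyword matching is just a stub).
    wanted = []
    if "asset" in question or "host" in question:
        wanted.append("cmdb")
    if "fix" in question or "remediate" in question:
        wanted.append("vendor_docs")
    return wanted

def synthesizer(results: dict) -> dict:
    # Combine the queried data; a real synthesizer could also decide
    # here that the data is insufficient and loop back for more.
    needs_more = "cmdb" not in results
    return {"summary": results, "needs_more_data": needs_more}

question = "Which host is affected and how do we remediate CVE-2024-4577?"
results = {name: SOURCES[name]("CVE-2024-4577") for name in orchestrator(question)}
report = synthesizer(results)
```

The `needs_more_data` flag is where the backward arrow lives: if the synthesizer sets it, control returns to the orchestrator instead of terminating.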

Integrating with the CMDBs at the top, or with any application (Splunk, pick any security product), is really hard to do today. Building language model integrations is generally pretty tough: can the language model query an API to get the data it needs? There's no clean answer. But there's something on the horizon I want to briefly touch on: the Model Context Protocol, or MCP. The Anthropic team released it, I think in December, and it's starting to gather adoption. It's trying to help developers write these more complex agentic workflows, where the LLM is suddenly able to query different APIs easily. It's almost like a browser: the way a browser lets you interact with Gmail, MCP lets the language model interact with different products.

There's also a growing list of pre-built integrations. Anthropic has a GitHub repository with, I think, a few hundred Model Context Protocol servers, which are pre-built integrations with your favorite apps. Maybe you want the language model to query Google Calendar, your Gmail, Splunk, or ServiceNow; MCP makes that interaction a lot easier. I'm oversimplifying a little, but hopefully that makes sense. As a graphic, picture plugging a USB device into your computer: for those of you who like hacking, maybe it's a Wi-Fi adapter. You're plugging a USB into your language model, and suddenly it can query Slack, Gmail, and your calendar, and build those integrations more seamlessly.

A quick slide on the differences between MCP and plain API calls. The browser metaphor is the easy way to think about it: you're not making a single call, it's more of a stateful session that MCP lets the language model hold with the application, managing context and memory. This is getting into the weeds a little, but it definitely allows for more complex implementations against different data sources.
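The stateless-call versus stateful-session distinction can be sketched like this. To be clear, this is not the real MCP SDK; the class and method names are made up purely to illustrate the contrast.

```python
# A rough sketch of the difference: a one-shot API call versus an
# MCP-style session that keeps state across interactions. The names
# here are illustrative, not the actual MCP SDK.

def api_call(query: str) -> str:
    # Stateless: every call stands alone, with no memory of prior calls.
    return f"result for {query!r}"

class McpStyleSession:
    def __init__(self, app: str):
        self.app = app
        self.history = []  # context carried across calls

    def call(self, query: str) -> str:
        self.history.append(query)
        # The session can use accumulated context, like a browser tab
        # that stays logged in to Gmail between actions.
        return f"{self.app} result for {query!r} (turn {len(self.history)})"

session = McpStyleSession("splunk")
first = session.call("failed logins last 24h")
second = session.call("narrow to host web-01")
```

The second call can mean "narrow the previous results" only because the session remembers the first query; a bare `api_call` would have to restate everything each time.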

Okay, out of the technical weeds for a second. As we discussed at the very beginning, even with all these fantastic integrations, our risk threshold in cybersecurity is very low: we don't want a language model doing things that might take down the network. So how do we build trust in language models that augment our work? There's the idea of trust-but-verify: we keep humans in the loop for a lot of these interactions, and a human gets to say, "this looks good" or "this does not look good."

One of the core problems here is protecting against hallucinations. For folks who have been using language models for a while, hallucinations used to be really bad. I remember when I first looked at this, I'd ask the same "describe this CVE" question, and instead of saying "I don't know anything about this CVE," the model would reply with complete nonsense, talking about a different CVE that wasn't related at all. Some newer models have built-in protections, and I think a lot of this will naturally go away, but it's still a problem for the foreseeable future. If you're using open-source models, it'll be a problem for a while; if you use a proprietary vendor, they'll have a lot of built-in protections against the worst of it. Regardless, we want to manually verify the language model's response, and there are a few options for that. One is chain of thought, which we'll talk about briefly; another is to have the LLM cite its sources, which is also an interesting avenue.

To talk about chain of thought really quickly: it's actually pretty simple. You prompt the model to generate a step-by-step explanation of how it got to its answer. Interestingly, studies have shown that a model's output improves on its own just from being asked to do chain of thought, almost like forcing a third grader to show their math work and watching the answer get better. On top of that, you can read through the steps it used and manually verify them; if it makes a logical jump that doesn't make sense to you, you can flag it and say, "something's wrong here, I probably shouldn't trust this response."
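In practice, this is just prompt construction plus pulling the verdict out of the steps. A minimal sketch, where the prompt wording, the model response, and the `Answer:` convention are all assumptions for illustration:

```python
# A minimal sketch of chain-of-thought prompting: only the prompt
# changes, asking the model to show its steps before the verdict.

def with_chain_of_thought(question: str) -> str:
    return (
        f"{question}\n\n"
        "Think step by step. List each reasoning step on its own line, "
        "then give your final answer on a line starting with 'Answer:'."
    )

def extract_answer(response: str) -> str:
    # Pull out the final verdict so the steps can be reviewed separately.
    for line in response.splitlines():
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return ""

prompt = with_chain_of_thought("Is CVE-2024-4577 exploitable on this host?")

# A made-up model response, showing what a human would review:
response = (
    "Step 1: host runs PHP-CGI.\n"
    "Step 2: installed version predates the patch.\n"
    "Answer: likely exploitable"
)
verdict = extract_answer(response)
```

The steps are the audit trail: a reviewer checks each one, and a bad jump at any step is grounds to discard the verdict.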

One other option takes us back to the agentic frameworks: evaluators, which I touched on briefly. This is a very common pattern where one LLM call generates something and a separate language model assesses the output of the first. It's almost like pair programming, or working on a team: the second model is an independent assessor that can say, "this doesn't look right, regenerate your response." A very simple example: we build a lot of workflows where the model generates JSON or Markdown, and for whatever reason it will sometimes just generate garbage, meaning the Markdown or JSON syntax is wrong. If you have a secondary model that says, "this is invalid JSON, the syntax isn't right, redo it," it actually works pretty well. You do have to be a little careful not to get into infinite recursion and rack up your AI bill, but generally it's a good strategy that most companies building in this space use today.

Here's a more complex version of the same thing: code generation. There's a lot of back and forth, with multiple agents looking at the code: is the syntax right? Is this part of the solution correct? At certain points in the problem, it will just go back and regenerate a response. If you use Cursor, or write code with pretty much any modern LLM application, a lot of this type of workflow is happening, where each step is essentially an AI agent running tests. You can automate QA, you can do pretty much whatever you want. Leaning toward security, you could have it validate threat-detection syntax if you're generating Splunk rules, and so on. Be creative.
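The generate/evaluate loop with a retry cap can be sketched like this. The flaky generator is a stub standing in for a model that occasionally emits broken JSON, and strict parsing stands in for the second model's check.

```python
import json

# A minimal sketch of the generate/evaluate loop: one "model" emits
# JSON, a check validates it, and a retry cap prevents the
# infinite-recursion billing problem mentioned above.

attempts = {"n": 0}

def flaky_generator() -> str:
    attempts["n"] += 1
    # First attempt returns broken JSON (trailing comma), as real
    # models sometimes do; later attempts succeed.
    return '{"severity": "high",}' if attempts["n"] == 1 else '{"severity": "high"}'

def evaluator(candidate: str) -> bool:
    # A real setup might use a second LLM; for pure syntax,
    # strict parsing is enough.
    try:
        json.loads(candidate)
        return True
    except json.JSONDecodeError:
        return False

def generate_with_retries(max_retries: int = 3) -> dict:
    for _ in range(max_retries):
        candidate = flaky_generator()
        if evaluator(candidate):
            return json.loads(candidate)
    raise RuntimeError("model kept producing invalid JSON")

result = generate_with_retries()
```

The `max_retries` bound is the important part: without it, a model stuck in a bad loop regenerates forever, and every regeneration is a billed call.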

Okay, where are we going? To touch on some newer things in the space: I think folks are starting to realize that RAG is really important, and that cybersecurity in particular has a big problem with getting up-to-date information. One new development is Sec-Gemini, a specialized model Google released earlier this month, on April 4th. Taking this from their blog, it "combines Gemini with near real-time cybersecurity knowledge and tooling." They're purposely a little vague on how they're doing that, but they also say it integrates with Google Threat Intelligence (which is Mandiant, since they acquired Mandiant), OSV, and other key data sources. It's unclear whether that's fine-tuning, which we didn't really talk about today, or RAG; I hope it's RAG, because otherwise it won't stay current. Either way, I think specialized language models with pre-built integrations into different data sources are the future, and that's really what we're building here: integrations with different data sources. If you're looking at a CVE and you want a language model to augment what a vulnerability analyst does, you'll want it to query your vulnerability source, whether that's Tenable, Qualys, or whatever else; combine that data with your CMDB (ServiceNow, Freshworks, SolarWinds, it doesn't really matter); and your firewall, so you can answer questions like "does this host have an internet-facing connection?" You're integrating all these data sources so the language model can make the right decision. That will probably mostly happen over MCP, but that's also new, so we'll find out.

In summary: this space is still changing incredibly fast. I started writing this talk months ago and have had to update a ton of slides, and a portion of it will probably be out of date in three months; that's just the nature of where we are with language models. For folks looking to integrate this into internal workflows, I really do think AI agents are an improvement on traditional scripts. If you're using Bash, Python, or PowerShell today, agents give you more flexibility and more capability for augmenting the work you'd like to automate. We've all written scripts before, and if you can build on top of a language model, I think you'll be able to automate a lot more, faster, and more accurately as well. I encourage folks to start building. That's it; I appreciate everybody's time.