
Enhancing Cybersecurity Intelligence with Retrieval Augmented Generation - Shaikh Ahmad

BSides Albuquerque · 31:02 · Published 2025-08

All right, let's get going here with our next talk. Right on time. I just met you. We were expecting Lee Johnson, but he didn't make it. Your name is Felix. >> Yes. >> Felix Schroeder. >> Felix Schroeder. So, welcome Felix Schroeder and Shaikh Ahmad. Ahmad, sorry. >> All right. Take it away, please.

>> Good afternoon, everyone. How's everyone doing? Good. Had a good lunch? >> I saw those drumsticks and steaks were gone pretty quick. >> So yeah, we can't compete with that. So today we'll be talking about retrieval augmented generation and how it can help us enhance cyber security. We'll be going at it at a very high level, because with the time we've been allotted, which is about 30 minutes, it's very hard to get into the core of it, which is the vector database system. But we'll be giving you a high-level overview, and the cool part is we actually have a prototype demo that we're working on, and we get to play with

it towards the end of the presentation. So please stay with us, and we'll take you through it. My name is Shaikh Ahmad. I'm an IT project manager, and I have my colleague Felix Schroeder doing the presentation with me today. Our CTO, Lee Jensen, couldn't be here, so we're proxying on his behalf. So bear with us. So why does this matter? Well, to begin: we all care about cyber security, and that's basically why you're here. We handle situations every day, if not every hour, of our work life, and we do care about this. And another thing is, we've been

talking about this overused word, artificial intelligence, in every community. And of course, there are a lot of benefits, and it's kind of magical to see it work and get you a result within a matter of seconds. But it has its own limitations: it doesn't really change over time, it doesn't update itself. You just get to use a version that was trained last year or the year before, and then you get, basically, a text response. So we'll be talking about some of those limitations throughout the presentation. We also identified that, for

cyber security, since it's ever-changing, we need a more situationally aware system to address the problem when we're dealing with incidents and security response. And the last part of it is hallucinations, which we'll talk about more: AI is probabilistic, it gives you a likely answer, but it doesn't care whether that answer is accurate or not. So that's the biggest limitation that we try to resolve with the RAG system. Next slide, please. All right. So let's begin. In 2023, we saw the release of large language models like ChatGPT and others. And these were

fantastic. For the first time ever, you had a model that you could send a question or text, and it would respond back to you in text, and it seemed to respond to what you said fairly accurately. The way it worked behind the scenes is that ChatGPT and other LLMs studied a huge corpus of English text and learned the relationships between different words and what different words mean in different contexts. It's able to use this as a sort of autocomplete or auto-predict feature, where it can look at

the history and usage of these words in English and intelligently pick what should come next based on the context and information in the sentence. But it's important to realize that it picks the most likely answer in a given list, and the most likely answer isn't always correct. This is what's referred to as AI hallucination, where the model thinks it's completely correct, and it produces fluent syntax, but the answer is complete nonsense and isn't based on anything. And this might be why, if you've tried to use ChatGPT, you ask it a question one time and

it responds one way, and then you ask the exact same question in the exact same way, and it can respond a completely different way. This is because of that probabilistic model, and it really means these systems are not completely reliable. If they can produce different answers, and produce answers that are obviously not correct, then how could the system be reliable? This has been a huge problem. Here are two famous examples that I knew about just from the news: a lawyer used ChatGPT in a legal brief, and all of the cases it cited, or at least a lot of them, it just

completely made them up. They weren't based on anything. Another example of hallucination: Google did a demo, one of its bots hallucinated, and the company ended up losing over a hundred billion dollars of its stock value, which is kind of crazy: a flaw like this can have a devastating impact. But this is a real issue that needs to be solved in order to make LLMs actually usable and reliable. So, how do we do this? That's why we have RAG, which is what I'm here to talk to you about today. RAG stands for retrieval augmented generation. It

is sort of the idea that before you have ChatGPT, or any LLM, answer a question, you run a query that goes out and finds the data that answers the user's question, and then feeds that data directly to the chat model. The chat model can then answer the user's question based on the real-time data that was found. It's not just answering the question shooting from the hip, possibly making stuff up. It has its source pulled up, it's literally reading the source, and a lot of the time it's regurgitating

that information back to the user. So we'll run through an example. First, the user asks a question. In this example, it's a zoology bot: which animal sleeps the most? The system takes the user's question and needs to decide how it's going to search for this information; it only has the text "which animal sleeps the most." It can do this in a variety of ways. There are agentic RAG programs, which prompt the LLM and say, "Hey, here's a list of actions, here's a list of sources you can search. Which one makes the most sense for this context?" And depending on how the AI

responds, it will go out and get that source, get that text, and feed it to the actual bot, which will then read the text and answer based on it. Another important way of searching is semantic search. This is particularly useful when you have to search a huge number of documents. Using the same kind of transformer model, AI scientists have found a way to map out all of human language as multi-dimensional vectors. You can think of a vector in 2D space as a point on a plane, and in 3D space as a point in 3D space; these are thousand-dimensional vectors that exist in thousand-

dimensional space. But the AI is able to build relationships and construct these vectors so that words that are semantically similar, words that mean similar things, end up near each other in this vector space. We can then use that to numerically compute how similar a user's question is to, say, a data source, and use that to find the exact data source the user's question refers to. So, for example, with our question, "what animal sleeps the most?", it would compare it to all the different data in the data set. Maybe there's information about dogs, information

about cats. But it would be able to compare them semantically. And because the vector for "which animal sleeps the most" corresponds with the information about the animal that sleeps the most, it's able to match them. And I think this is pretty cool to see, because the text doesn't have to match spot on. It doesn't have to be direct keyword matching; it's just that they're semantically similar. The AI has found how to make that connection, connect those texts, and return the correct results. And after the model reads the result, it's able to give a
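The semantic matching described here can be sketched with toy numbers. Real systems use thousand-dimensional embeddings produced by a model; the 4-dimensional vectors and document names below are invented for illustration, but the similarity math is the same.

```python
import math

def cosine_similarity(a, b):
    # Similarity of two embedding vectors: near 1.0 = same direction
    # (semantically close), near 0.0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional embeddings (real ones have ~1,000+ dimensions).
query = [0.9, 0.1, 0.8, 0.0]                     # "Which animal sleeps the most?"
docs = {
    "animal_sleep_facts": [0.8, 0.2, 0.9, 0.1],  # doc about animals and sleep
    "dog_breeds":         [0.1, 0.9, 0.0, 0.7],  # doc about dog breeds
}

# Retrieval picks the document whose vector is closest to the query's.
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)
```

Note there is no keyword overlap check anywhere: the match comes purely from the vectors being close, which is why a semantically related document wins even when the wording differs.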

response in natural text that sounds like a human, behaves like a human, answers the question, and can cite sources based on the exact source we retrieved. So it's extremely accurate, reliable, fast, and cost-effective. And this ended up being what we used. >> All right. So now we'll talk about the augmented-generation part a little bit, and I think Felix covered it really well. The way it works is it ingests the data into the vector model first, into the vector database. Any time the user asks a question, it basically says, "this is the document you need to answer the question," and then it

goes to OpenAI and the chat model to refine that answer. So what's happening here: how many of you work with a state agency, or a city or county? Raise a hand. Great, so we have a few folks. Say you're looking for city codes, or something very specific, or maybe a law that just came out. You would have that information within your data bank and the vector database. So when it hands off to ChatGPT, it says, "this is the question my user needs answered, and here's the document you're going to use." So that
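The hand-off being described, telling the chat model which retrieved document to use, is essentially prompt construction. A minimal sketch, where the document text, file name, and question are all invented, and a real system would send the resulting prompt to the OpenAI chat API rather than print it:

```python
def build_grounded_prompt(question, retrieved_docs):
    # Give the model the retrieved text as its only allowed source,
    # so the answer is grounded instead of free-associated.
    context = "\n\n".join(
        f"[Source: {name}]\n{text}" for name, text in retrieved_docs
    )
    return (
        "Answer the user's question using ONLY the sources below. "
        "Cite the source name you used. If the sources do not contain "
        "the answer, say you do not know.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

# Hypothetical retrieval result for a city-code style question.
docs = [("city_code_2024.pdf", "Section 12: Permits must be renewed annually.")]
prompt = build_grounded_prompt("How often are permits renewed?", docs)
print(prompt)
```

The instruction to refuse when the sources are silent is what produces the "I don't have that document" behavior shown later in the demo.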

way you actually get a more grounded answer to your question. And coming back to the big word, cyber security: how do we relate it to that? When you're doing threat response, you have to produce reports frequently, and pretty quickly. When the definitions, and, say, the standards you use, are already within the data bank of the vector database, which you can update pretty easily, then you get a more refined answer, rather than the model pulling definitions from 1925 and giving you a response report

on that. Can you go back to the previous one? Sorry. So, basically, the RAG outputs: how are they different? The question is, why do you need RAG when you have ChatGPT and it's free? Or, it's not free, actually; we just don't know it yet, we'll end up paying for it one way or the other. So, basically, it's more grounded and verifiable. When you ask a question, it will give you the exact source: "this is my answer, this is what happened, and here's the source, so feel free to click on that source." You'll see more of that in the demo we'll

have at the end. And is it coherent, logical, safe? All these things, the three on the bottom, are taken care of by OpenAI, or whatever chat model you're using, because if you give it the right context, it usually comes up with accurate content. What I basically want to emphasize is that when you're interacting with OpenAI, it can help you a lot with the content, but it can't do the context for you. So what we're doing through RAG and the vector database is providing that additional context, so it comes up with a more relevant answer, not just

an internet search. So how does it really impact the industry? Here are some things we think it should help with right out of the gate. Threat intelligence summarization: if you already have your definitions, for example if you use ISO 27001 or NIST CSF, those definitions can be pre-uploaded, and you can code it to say, "use these definitions, or these advisories, to come up with the response," and it will use exactly those definitions or advisories. And then

also it will have those NIST, ISO, and compliance checks built in. The demo we have basically has a bunch of public documents that we can query, but these are some of the cyber security applications of RAG that could be very beneficial. And then security operations chatbots: how many of you use those on websites, when you're not sure what you need and you ask a question? Yeah, I'm sure you use them a lot. What those do is, in most cases, they use a widget. The widget just goes

and sends the question to OpenAI, or whatever GPT model they're using, and they come up with an answer. But with RAG, you can say, "these are the 20 documents that are relevant for my website, my state agency, or my company. Any time you answer a question, make sure you go through these." It does that automatically every time, so you get a more refined and accurate result. So, some of the benefits of using RAG instead of the plain large language model: it reduces token usage. Basically, any time you're using a

paid version of GPT, it has token limits and you have to pay for them. They cost maybe a tenth of a cent per query, but over time it accumulates. For example, if you're running an archive agency where people search for things thousands or tens of thousands of times per day, that cost can become significant over time, and our model helps mitigate that a little bit. Then there's snippet-level filtering: instead of looking at billions of documents, it hands over just the relevant one: here is the document, here's the question, go to OpenAI, refine an answer, and give
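The cost point can be made concrete with back-of-the-envelope arithmetic; the per-query price is the speaker's rough figure, not a published rate, and the query volume is the hypothetical archive-agency example above.

```python
# Rough annual cost at about one-tenth of a cent per query.
cost_per_query = 0.001          # dollars (speaker's ballpark figure)
queries_per_day = 10_000        # hypothetical busy archive agency

daily_cost = cost_per_query * queries_per_day
annual_cost = daily_cost * 365
print(f"${daily_cost:.2f}/day, ${annual_cost:.2f}/year")  # $10.00/day, $3650.00/year
```

So even a "fraction of a cent" rate turns into thousands of dollars a year at that volume, which is why trimming the tokens sent per query matters.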

it to me. We'll see that in action in a little bit. Also, sometimes when you have constrained hardware, which is often true for state, city, and county governments, it can be a little faster compared to the full GPT model. And another benefit, as you see here, is real-time info. With RAG, you can upload a few documents to your OneDrive or your shared drive every morning and say, "this is the new definition for today." So it's relevant and up to date. It's not really

using old information. The hallucination risk is also pretty low in the RAG model, because it's dealing with real-time data. And compliance help is automated: once you have your definitions uploaded and you code it to use them every time, you don't really have to do anything. You don't have to say "use NIST 27001", sorry, ISO 27001, every time it comes up with an answer. So, the next one. Here we'll get into the demo. We have a few modules within the demo, and Felix will help us with that. We have a file and image search, and we have

document summarization. The system has some pre-prompted questions already, but we can also take questions from the audience and see what it comes up with. It's important to keep in mind this is a work in progress; it's not a finished product yet. We're trying to get more feedback and refine it as we go. >> So I'm driving the demo over here. I'm going to start with the New Mexico Tax and Rev public documents. These are just public documents that are available, that you can use to find information on taxes and so on. I have a few

sample questions, but we can also ask whatever questions we want. "How do I file my taxes electronically?" You can see it's going in, it's searching, and it's very fast. It found its sources, and the LLM was able to respond. So you can see it gives a response, but not only that, it comes back with the sources it used. And these are actual PDFs. If I were logged in, I could view the files, but you can see here it cites the real documents and includes scores and thresholds. And the chat model is able to use this to more correctly give its

responses. And you can see that it's able to cite its sources within the response itself. It's even able to handle spelling mistakes. For example, "brackets" is not spelled correctly in this question. That was not on purpose, or, that was on purpose. I swear I can spell. But even though "brackets" was spelled incorrectly, it was still able to pick out what it needed and respond correctly. So that's pretty cool. >> So in the back, are you able to see the screen, or do you want me to read you some of the questions? >> It's good. Yeah, I see some thumbs up. So, basically, let's take

a question from the audience related to New Mexico tax and see what it does with the documents. Any questions? I know you have a lot when you go to your tax advisors. >> ...tax exemption take effect? >> A healthcare tax exemption take effect? >> Did you hear that? When does the VA healthcare tax exemption take... >> Property tax. >> Oh, sorry. Property tax. >> When does the property tax exemption take effect? I should also say that this has not been trained on the most recent documents; I think these are documents from back in November, so it might not include that. >> So one of the main things about this model is that if it

doesn't have the right document, it will actually tell you, "I don't have it." And the way we're designing it, you can then say, "okay, go to ChatGPT, go to OpenAI, find whatever is publicly available." These are grounded, specific documents, and as Felix mentioned, since we haven't uploaded the recent ones to the database, what it's saying is that this information is not in the system yet. But there's a way to override this and still say, "I want the ChatGPT answer." Does that help? Any other questions? Yes.
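The "I don't have it" behavior and the override described here can be sketched as a similarity-threshold check on the retrieval score. The function names, the threshold value, and the toy search index below are illustrative stand-ins, not the actual product code.

```python
def answer(question, search_index, llm_fallback,
           threshold=0.75, allow_fallback=False):
    """Answer from the vector database if a close-enough document exists;
    otherwise refuse, or fall back to the plain LLM if the user overrides."""
    doc, score = search_index(question)   # best match and its similarity score
    if score >= threshold:
        return f"Grounded answer from {doc}"
    if allow_fallback:
        return llm_fallback(question)     # ungrounded ChatGPT-style answer
    return "I don't have a document for that yet."

# Toy stand-ins for the real search index and LLM call.
index = lambda q: ("tax_guide.pdf", 0.9) if "tax" in q else ("none", 0.1)
llm = lambda q: f"(ungrounded guess about: {q})"

print(answer("How do I file my taxes?", index, llm))
print(answer("When is the county fair?", index, llm))
print(answer("When is the county fair?", index, llm, allow_fallback=True))
```

The key design point is that the refusal is the default and the ungrounded answer requires an explicit opt-in, which keeps hallucinated answers from silently passing as grounded ones.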

>> I can't hear that well, sorry. >> Sorry.

>> Okay. So, yes, you are able to have it compare documents, and a lot of the time, with the current RAG model, it will compare the top five documents that were semantically similar. But I've also worked with other AI where, well, I'll give you an example. We had road documents that were all semantically similar to each other, so this semantic search wouldn't have worked well for searching through all those different road documents. If you had asked, "who's responsible for this road?", it would have been very semantically similar to questions across documents across

the board. So what we ended up doing is we had ChatGPT read all the documents and create summaries of them that would specifically look for whatever incongruities we were looking for. In our case, we looked for who was responsible in each document, so it would extract that specific information about each document, and then we could build a chatbot that could respond with that information. So with that kind of aggregated data, you do have to do a lot more pre-processing, but it is possible. Any more questions? >> So the next segment we have is the image search. We want to demo

it real quick, or at least give you a quick overview, because since we're not logged onto the system, it doesn't work that well here. >> Yeah, it doesn't work. >> Yeah. But if you go down to the summary, we want to talk about the basic one, the generic one. >> So other things we're working on are potential ways of doing image search. It doesn't work here right now, but we had a demo where you could search for images of green chile and it would come up with results from a similar sort of search. It used the

CLIP model, which is basically a way of doing that vector-embedding stuff, but with images. Another cool thing about images and text is that you can also prompt ChatGPT to create a description of the image and then embed that too. So that's another way you can search for images. And these models are able to create pretty good, pretty descriptive captions of these images, as you can see here. >> Just to add to that: similar to what we do with text, you can also vectorize your images and use that data. Can you click on the slider? So
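The describe-then-embed approach mentioned here can be sketched as follows. The image file names and captions are invented, and `embed` is a deliberately crude bag-of-words stand-in for a real text-embedding model; the point is only the pipeline shape: image → generated caption → embedded caption → text search.

```python
# Hypothetical captions a vision model might generate for each image file.
captions = {
    "IMG_001.jpg": "roasted green chile peppers at a market stall",
    "IMG_002.jpg": "desert landscape with sandstone cliffs at sunset",
}

def embed(text):
    # Stand-in for a real text-embedding model: a bag-of-words set.
    # A real system would return a dense vector and compare by cosine similarity.
    return set(text.lower().split())

def search_images(query):
    # Rank images by overlap between the query's "embedding" and each caption's.
    q = embed(query)
    return max(captions, key=lambda img: len(q & embed(captions[img])))

print(search_images("green chile"))
```

Because the search runs entirely over the generated captions, the same text-retrieval machinery already built for documents works unchanged for images.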

basically the description you're seeing is a more comprehensive one. If you ask ChatGPT without the vector database, it will give you an answer like this, which is very generic. But when we have the other description, if we go back to that, it tells you exactly where that is; it recognizes characters and landmarks. So it's also pretty cool: for example, say you're in the tourism department and you want people to find specific images, this can really help. You can refine that data bank quite well for them to use. Any other questions? How are we doing on

time? Are we ahead? Five more minutes. More questions? Any feedback on the database and what we've covered would be really appreciated. >> Yes. >> I'm just wondering what other data sets you plan on incorporating. >> So for data sets, it depends. For state agencies, say, we could technically take all of the New Mexico governor's documents and embed them in the vector database. It's basically on demand. Say we have a project with the healthcare authority; we'd tell them, "okay, this is your data bank." We actually have our own file management system, so we have a storage

for that. So what we ask them is, "which documents do you want?" They'll give us, say, a list of 10,000 PDFs. We'll then figure out whether we need OCR or whether it will work as is. Then we convert those 10,000 documents into the vector database, and we give them their own tailored database and system to work with. In terms of government agency use, it saves time, because sometimes, say you're working in the legal profession, you need exact answers. You don't want it to hallucinate and give you court

cases that it imagines happened. >> Yes, I've got another question, with a microphone now. It seems like you're mostly applying this to the public sector. I'm kind of thinking about my job role. I don't work in the public sector, but there are times where I'm trying to get someone to secure some computer for XYZ reason, and I want to look for a policy within a ton of different policy sets. Do you have any plans to expand this to the private sector? >> Yeah, definitely. To be frank, we're not a cyber security company; we're generic IT. So it's

very difficult, or challenging, for us to pivot from that to cyber security all of a sudden. But this is, I want to say, a step in that direction. So we are exploring it, and I think the factors we talked about today are important for cyber security: for example, if you have the standard definitions or advisories already in your data bank, then anybody on your team can ask a question and get an answer based on those advisories, so they have the most relevant information. And we do have plans, but let's say we're not at the place to

make those decisions. Any other questions? No? Well, thank you very much. And I wanted you to give a hand for Felix. He's only, like, 17. >> I'm 20. >> 20, okay. Anyway, he's actually our intern and he's done a wonderful job. He's the main developer behind this, in association with our CTO, of course. He has done a tremendous job at the company, and doing this presentation at his age, I feel like it's awesome. So give him a hand. Thank you so much. >> It gives me hope, right? Hope for the future, that we have smart young people helping us out. Thank you for that. And also, just to say

thank you to Shay for all the efforts in volunteering and organizing this conference. So thank you. Yes, please give him a round of applause. All right, we're going to switch over the computer and get the slides up for our next speaker. While we do that, make sure we're good to go. We'll be right back in just a moment.