
Bots of the SOC

BSides Cheltenham · 19:45 · Published 2024-07
About this talk
Mike Kearney explores building an autonomous agent using large language models to automate threat detection against the Splunk 'Boss of the SOC' challenge. The talk contrasts agentic AI approaches with traditional deep learning for security operations, demonstrates multi-step reasoning with tool access, and covers practical engineering challenges like context window optimization and cost management.
Transcript [en]

Hi, so I'm Mike Kearney, and I'm going to give you a quick talk today about my research called Bots of the SOC, which is my attempt to build an autonomous agent to automate solving the Splunk Boss of the SOC challenge. A quick bit on the scope: I'll give a bit of background on who I am, go over some of the AI terminology to make sure we're all clear exactly what I mean by the terms I use, and then look at why the methodology here is different: I'm using an agentic way of solving threat detection rather than a traditional deep learning way.

I'm then going to go into a bit of explanation of how it works, and I'll do a demo which, fingers crossed, will work as intended. I'll briefly sum up some lessons learned and areas I think it would be interesting to develop this research further. A quick bit about myself: I'm Mike Kearney. I served 12 years in the Army, mostly in cyber operations and cyber intelligence, and I've been interested in and researching AI, and NLP in particular, since about 2018, starting with some funny projects back when hallucinations were just comical rather than actually seen as a threat, where I did some optimisation and retraining of GPT-2 offline to make a comedy Twitter bot.

I'm currently Head of Cyber Defence at Leonardo UK, which is an aerospace and defence prime, and this is my first time speaking at a cyber conference, though I've visited many times. So, some of the basic terminology. Artificial intelligence is a broad term for the study of anything to do with machine intelligence; it's the earliest and broadest term. A subfield of that is machine learning, which is where we talk about the more modern algorithms we might actually use to implement it. This is where you're talking about something a bit more tangible, actual algorithms like k-NN, linear regression, random forests and so on. Within that you've got a subfield called deep learning, still part of AI, still part of ML. Deep learning is generally when you don't have to manually extract features: you let the machine figure out the features from raw encodings. A subpart of that, and the most common part, is neural networks, and within that, transformers. So that's what the terms mean. The reason I'm highlighting this is that I think where the term artificial intelligence has now got to means there's a difference in approach: an AI or AGI-type, agentic approach to threat detection that goes beyond deep learning. With the deep learning based approach, and I'll come on to what I mean by that, the idea was that at some point someone hoped an AGI-type thing would emerge from it. So, what I mean by the different approaches. The deep learning approach to threat detection is: you take your data source, you either feature-extract or encode some data from it, you fit a model on that, you do some prediction, and then you detect your threat. That's how most, you know, neural networks, CNNs, most types of traditional (and I say traditional for technologies that have only existed for a few years) deep learning based threat detection would work.
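The pipeline described here (feature-extract, fit, predict, detect) can be sketched with a toy example. This is not the talk's code: the domains, the hand-picked features and the nearest-centroid "model" are all illustrative stand-ins for a real deep learning system.

```python
# Toy illustration of the traditional pipeline: feature-extract -> fit
# -> predict -> detect. The example domains are made up, and a trivial
# centroid-distance classifier stands in for a real model.

import math

def extract_features(domain):
    # Manually engineered features, as in pre-deep-learning ML.
    name = domain.split(".")[0]
    digits = sum(c.isdigit() for c in name)
    return [len(name), digits / max(len(name), 1)]

def fit(benign, malicious):
    # "Fit" a naive model: one feature centroid per class.
    def centroid(domains):
        feats = [extract_features(d) for d in domains]
        return [sum(col) / len(col) for col in zip(*feats)]
    return centroid(benign), centroid(malicious)

def predict(model, domain):
    good, bad = model
    f = extract_features(domain)
    # Detect: closer to the malicious centroid means "threat".
    return "threat" if math.dist(f, bad) < math.dist(f, good) else "benign"

model = fit(["google.com", "bbc.co.uk"],
            ["x9k2q8v1z7.net", "a1b2c3d4e5f6.biz"])
print(predict(model, "q7w8e9r0t1.org"))
```

The weights (here, centroids) are optimised for one narrow problem, which is exactly the limitation the agentic approach below tries to escape.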

The agentic way, or you could say, in inverted commas, the "light AGI" way, is that rather than building a model with weights that focus specifically on a certain threat category, you give some human-readable context to a machine, you give it tools, the same tools a human threat analyst would use, and the machine can then work with those tools, based on the context provided, to try and find the threat in a very similar way to how a person would. I've always been interested in this concept: rather than optimising weights for a specific purpose, could you get a machine to actually learn how to use the tools and attack the problem in a very similar way to how a human would? Now, I'm not saying in any way that what we currently have with LLMs is AGI, but this is the same methodology you would use if you did have AGI. And I do think there's a very interesting generalisation here, in that rather than having weights optimised for a specific problem, say finding a particular brand of malware, or finding a particular threat in DNS or network traffic, exactly the same weights you'd use for this problem could be used for giving legal advice, for writing comedy song lyrics, or for finding threats in data. So I think there is a light, lowercase-g generalisation about some of the more cutting-edge foundation models that opens up a lot of interest. I'll actually come to that one later on.

So I thought this was an interesting approach. I've been playing with the idea of using large language models to try and assist security operations since GPT-2 came out back in 2018, but it was really bad; it wouldn't really work at all. I even tried some local optimisation and retraining on some of the Splunk commands to see if that would help, and it still made too many errors: it confused the pipe symbol with Linux commands, and it used to continuously think it could just inject an OR into the middle of an SPL query. That was a really common theme in some of the older models. GPT-4, and particularly GPT-4o, seems to be a lot better at that. So I had this idea of developing an automated SOC bot, like a copilot for the SOC before that term came out, and I realised that Splunk had published the Boss of the SOC data sets, and these data sets have published answers, so they would make a really good evaluation to use for this type of agent. So I then focused my research on taking this as a benchmark evaluation and seeing whether I could build an agent, or an AI, that could perform well against it.

Let me go back to this general AI model. A lot of you have probably used ChatGPT, so you'll have had a conversation with it: you ask it some questions and it responds. In this context the AI can track your conversation within the conversation; like when I said the word "there" at the end, it knows that refers to the Moon. I'll go on to why this is important. These are your questions; as a user you get to put in input, which is your user input. The other bit of input you get to control is something called the system message. The system message is like a pre-prompt, something that just exists at the beginning of the conversation and frames the whole conversation. You control that, and you control your text, and when you use the ChatGPT web interface that's essentially what you see. When you use the API, though, it doesn't remember anything within the conversation: you have to give it the whole conversation for every single request, every time you want an answer.

Why that's important for agents is that when you use ChatGPT in the web interface, you only see yourself putting in data each time you add to it, and you end up with a very long conversation. When using the API, for every response I have to send the entire conversation up to that point, and because it's billed on token use rather than a monthly fee, you end up with agents with very, very large token usage before you get any work done. That means it can get very expensive, and you really need to think about how you optimise your context window and your token counts. That becomes one of the biggest challenges in developing these sorts of systems.
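The token-billing point can be made concrete with a toy sketch. The tokeniser (one token per word) and the message contents are made up; the point is simply that every request re-sends the whole history, so billed tokens compound as the conversation grows.

```python
# Sketch of why stateless chat APIs make agents expensive: every call
# must include the full message history, and billing is per token sent.

def count_tokens(text):
    # Toy tokeniser: roughly one token per word.
    return len(text.split())

class Conversation:
    def __init__(self, system_message):
        self.messages = [{"role": "system", "content": system_message}]
        self.tokens_billed = 0

    def ask(self, user_text, fake_reply):
        self.messages.append({"role": "user", "content": user_text})
        # Each request re-sends the ENTIRE history, so costs compound.
        self.tokens_billed += sum(count_tokens(m["content"])
                                  for m in self.messages)
        self.messages.append({"role": "assistant", "content": fake_reply})

convo = Conversation("You are a SOC analyst assistant.")
convo.ask("Find the attacker IP", "Querying Splunk now")
convo.ask("What did it download?", "An executable payload")
print(convo.tokens_billed)
```

With a long agent run the history dominates every bill, which is why context window optimisation matters so much here.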

So, how evaluations work. A lot of them, like the most famous ones such as MMLU, simply give the machine a single question; it replies with a single response, and it's marked correct or incorrect. A lot of reinforcement learning from human feedback uses these sorts of evaluations to optimise the model to give you the right answer, but that's not necessarily what I want in an agent. There are some other things you can do. There's something called n-shot learning, where you give it examples of different questions and responses, and it's been shown that this often gives better results. You can also do other tricks such as chain-of-thought reasoning, where you amend the system prompt, or the text, to say "work through it step by step"; because of how next-token prediction works, that's more likely to give you a better answer. But all of this, even the n-shot learning where you're giving examples, is still single challenge, single response. That's good for a chatbot, but it's not what I want for an agent. For an agent, I may have a question that the bot couldn't possibly know the answer to, either from its training data or from the effective world model it can infer from its training data. If I ask it a question like "what have I got here?", there's no way it could possibly know; I have to get it to tease that out through multiple steps. By giving it access to tools, I can get it to deduce what it might be, and then answer the question at the end of a series of steps. Now, I think I've got reason to believe that GPT-4o may be more optimised towards these sorts of multi-step deduction tasks, because it seems to perform a lot better at them than any of the others. But the previous way of evaluating models on benchmarks like MMLU is purely challenge-response, and as I say, even n-shot learning is still one challenge, one response, which optimises the model to try and go for the answer straight away, even if it's a really dumb response, rather than realising it doesn't have to do it in one step and can work itself through.
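The n-shot plus chain-of-thought idea can be sketched as simple prompt assembly. The example question/answer pairs below are invented for illustration:

```python
# Sketch: n-shot examples plus a chain-of-thought cue assembled into a
# single prompt. The Q/A pairs are made up for illustration.

FEW_SHOT = [
    ("What field holds the client browser string?", "user_agent"),
    ("What SPL command counts events per host?", "stats count by host"),
]

def build_prompt(question):
    lines = [f"Q: {q}\nA: {a}" for q, a in FEW_SHOT]
    # Chain-of-thought cue: nudges next-token prediction to reason first.
    lines.append(f"Q: {question}\nA: Let's work through this step by step.")
    return "\n\n".join(lines)

print(build_prompt("Which sourcetype holds HTTP traffic?"))
```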

So that's what I want for an agent: a model that realises it's part of a bigger conversation and doesn't have to get the answer first time. Going on to my architecture: a simple agent unit is effectively a single LLM and a tool, with what's known as a user proxy. The user gives the challenge, or the problem, to the bot. The bot is taught how to use the tool through the system prompt: you give it the function details and instructions about what inputs it takes and what outputs it will produce. It can then make function call requests and use the tool to answer its question. That's the most basic atomic unit for an agent. You can also do other architectures, such as group chat, where you have multiple agents in the same chat. You can give different agents different personalities, and you can even use different models: you could have a mix of Gemini, Claude and OpenAI models in the same group chat, which can have advantages. I've also found it can be quite useful to have a bot that's deliberately named a sense-checker bot: you tell it its role is to point out why the other bots are wrong and to evaluate their logic, to avoid rabbit holes.
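The agent unit described above, one LLM plus a user proxy that executes tool calls, might be sketched like this. The "LLM" here is a scripted stub and the Splunk tool is a fake in-memory lookup; a real build would call a model API and the Splunk REST API instead.

```python
# Minimal agent unit: an "LLM" (scripted stub) that requests tool calls,
# and a user proxy that executes them and feeds results back.

FAKE_INDEX = {"search index=botsv2 sourcetype=stream:http": "dest_ip=192.0.2.10"}

def run_splunk(query):
    # User-proxy side: actually executes the tool call.
    return FAKE_INDEX.get(query, "no results")

def scripted_llm(messages):
    # Stand-in for the model: first turn requests a tool call,
    # second turn answers using the tool output it was given.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "run_splunk",
                              "args": {"query": "search index=botsv2 sourcetype=stream:http"}}}
    return {"answer": "The destination IP is 192.0.2.10"}

def agent_loop(question, max_steps=5):
    messages = [
        {"role": "system",
         "content": "You may call run_splunk(query) to search Splunk."},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        reply = scripted_llm(messages)
        if "tool_call" in reply:
            result = run_splunk(**reply["tool_call"]["args"])
            messages.append({"role": "tool", "content": result})
        else:
            return reply["answer"]

print(agent_loop("What IP did the host talk to?"))
```

The system message carrying the tool description is doing the "teaching how to use the tool" that the talk describes.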

I've also found it can be really interesting to give the sense-checker bot a deliberately short memory. What happens otherwise is the bots end up agreeing with each other, and I've had many examples in my test runs where they go down a rabbit hole: they decide user_agent is the username, and they're not realising they're all treating the words "user agent" as meaning username; they get confused and don't correct each other. But if you ask ChatGPT "what's wrong with this?" as a single one-off prompt, it will tell you straight away what's going wrong. So what can be quite useful is to give your sense-checker bot a really short memory, so it only remembers the last few messages. There's another one you can do, and I have done this and tested it, which is hierarchical agents: you give one bot the ability to produce and spawn sub-agents as agent units, the idea being that you keep your context window clean for the big-picture problem and allow the deep detail to go into the sub-agents. In practice I found this doesn't work anywhere near as well as I'd hoped. So the key to this challenge, I found, is getting that function query correct.
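A minimal sketch of the short-memory sense checker, assuming a fixed-size message window; the "review" rule below is a toy stand-in for what would really be an LLM call.

```python
# Short-memory sense checker: it only ever sees the last few messages,
# so shared wrong assumptions from earlier in the chat can't drag it
# into the same rabbit hole as the other bots.

from collections import deque

class SenseChecker:
    def __init__(self, memory=3):
        self.history = deque(maxlen=memory)  # older messages are forgotten

    def observe(self, message):
        self.history.append(message)

    def review(self):
        # Toy rule standing in for "point out why the other bots are wrong".
        for msg in self.history:
            if "user_agent is the username" in msg:
                return "Wrong: user_agent is the browser string, not a username."
        return "No objections."

checker = SenseChecker(memory=3)
for msg in ["step 1 ok", "step 2 ok", "step 3 ok",
            "assume user_agent is the username"]:
    checker.observe(msg)
print(checker.review())
```

Because the window is bounded, the checker behaves like the fresh one-off ChatGPT prompt the talk mentions, rather than inheriting the group's accumulated context.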

Now, the actual function I'm using is quite complicated, and the trouble is that Splunk could produce, say, 50,000 rows of text; it could produce gigabytes in a single return. That's going to completely destroy my context window and blow the model out; it's not going to work. So to get around that, I basically have this sort of pseudo-code. I initially check the query is formatted correctly before passing it to Splunk, doing a bit of manual debugging. I'll then try the search, and if it doesn't work, I run a debug function. The functions with the little robot icon beside them are hybrid AI and classical programming functions: they actually call a large language model as part of their logic, just as a single shot, "here's a problem, solve this". So it will try to debug itself with an AI function before returning the result. And then the key one here is handling long responses. Currently GPT-4 is the only model that can actually write SPL well and is intelligent enough to work through some of the more high-level technical problems, but it has a limited context window and is expensive. So if I have a response above a certain number of characters, I send it to Gemini 1.5 Flash to summarise, because it has a million-token context window and is much cheaper, and this step doesn't need that same level of intelligence. It summarises the key bits in the context of the question: I don't just pass it the response, I also pass in the question I'm trying to solve, to give it a bit of context, so it can say "if you see this in this data, summarise it with regard to this question", and then pass that back to the main agent. That optimises the context window and doesn't pollute it with noise that isn't relevant, and then the result is returned to the agent.
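The pseudo-code steps above (validate, search, AI-assisted debug, summarise long responses) might look like this as a sketch. All the model calls are stubs, and the 4,000-character threshold and function names are my assumptions, not values from the talk.

```python
# Sketch of the query pipeline: validate SPL, run it, try an AI debug
# on bad input, and route oversized results through a cheaper
# long-context summariser before they reach the main agent.

MAX_CHARS = 4000  # assumed threshold, not the talk's exact value

def validate(spl):
    # Cheap manual check before wasting a Splunk round trip.
    return spl.strip().startswith(("search", "|"))

def run_search(spl):
    # Stub for the real Splunk call; returns a deliberately huge result.
    return "site=www.berkbeer.com\n" * 300

def ai_debug(spl, error):
    # Stub for a single-shot "here's a problem, solve this" model call.
    return "search " + spl

def summarise(text, question):
    # Stub for the long-context summariser (Gemini 1.5 Flash in the talk).
    # It gets the QUESTION too, so it summarises in context.
    return f"Relevant to {question!r}: {text.splitlines()[0]}"

def query_splunk(spl, question):
    if not validate(spl):
        spl = ai_debug(spl, "invalid SPL")
    result = run_search(spl)
    if len(result) > MAX_CHARS:
        result = summarise(result, question)  # keep the context window clean
    return result

print(query_splunk("index=botsv2 | stats values(site) as site",
                   "Which competitor site did she visit?"))
```

Routing by response size is the design choice here: the expensive model only ever sees a compact, question-focused summary.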

So here's what I did. I've now got Splunk as my tool, I've got my agent set up, and I've got a person. But I said at the start that I wanted to automate the Boss of the SOC data set without a human in the loop. So rather than having a person input, I've just given it a document in JSON format with all the questions for Boss of the SOC. I pass that into the machine and let it run as an evaluation run, and it works through the questions. It also has the answers in a separate JSON file, because sometimes it says "the answer is google.com" rather than just "google.com", and I didn't really want exact string matching. So I use another bot, my marker bot, to say "given this question and this answer, do you think this is correct?", to mark more like a human would. It then appends the result to the output, so I can just leave it to run evaluations automatically. I'd like to do more runs, but a run gets quite expensive, so I've been tweaking it. My last run, which I did earlier this week with GPT-4o and the Gemini update, covered a few things: I ran 20 questions through it on Boss of the SOC version 2, and it got seven of them incorrect and 13 correct. But as anyone who's done Splunk Boss of the SOC knows, the questions are weighted with different points, so its points score came out at 65% on my last evaluation run on Boss of the SOC version 2. I think that's pretty good; that's probably on par with a junior analyst or a real entry-level person's first attempt. It does make some dumb mistakes, though. It's amazing how a model can be so smart sometimes: there's a SQL injection question, which is quite a high-pointed one, and it gets it basically every time; but then it also makes dumb mistakes like thinking user_agent is a username. So it's not like a person: it gets some stuff really clever and some stuff really wrong.
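The marker bot idea, grading loosely rather than by exact string matching, can be sketched with a toy normaliser standing in for the LLM judgement call:

```python
# Sketch of lenient marking: instead of exact string matching, tolerate
# the noise an LLM's answers carry (scheme, "www.", trailing slash,
# case). A real marker bot would ask an LLM to judge correctness.

def toy_marker(expected, given):
    def norm(s):
        s = s.strip().lower()
        for prefix in ("https://", "http://", "www."):
            if s.startswith(prefix):
                s = s[len(prefix):]
        return s.rstrip("/")
    return norm(expected) == norm(given)

print(toy_marker("google.com", "https://www.google.com/"))
print(toy_marker("google.com", "bing.com"))
```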

So, demo time. What I'll do is move that down and do a quick run, and hopefully this will work if I move this across to here. I don't know how big the text is or how well you can see it, so I'll try to read out any relevant bits and point out anything interesting. This is just going to do a run on the Boss of the SOC version 2 question set, and initially it loads the question, which it did very quickly, so I'll scroll back up. So it loads the question here; you can see it's been given the question. I have given it the hint, which comes from the question, but I haven't given it the answer guidance you get on the actual proper one, where you know how many characters the answer is; that probably could have helped it. It's then constructed a query, hit my debug function first, managed to debug itself with the sense-checker bot, and now it's got itself a correct query. So here the question was to find which site Amber Turing had visited. It correctly identified that to do that it needs to run stats by her IP address, which it's done. It's now identified her IP address, and it's realised it needs to query sourcetype stream:http based on the IP address it's just inferred. I didn't give it the IP address, so it's worked this out itself. It's doing "stats values(site) as site", so it's going to give me a list of all the sites, which is going to be quite a big response. It is quite a big response, and it's picked out from that that the answer was berkbeer.com. Now, the question did say it's a competitor, and this is where the general intelligence comes in: it knows the company has something to do with beer, it knows it's a brewery, so it's figured out "oh, that one's called Berk Beer, that's probably a competitor, that's probably the answer". I think it's quite interesting how it uses general world knowledge. I've sometimes seen it do things like specifically search for "beer", "hop" or "brew" in the site title, and it just decided to do that itself, which I thought was quite interesting.
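The two-step deduction from the demo can be sketched in outline: first find the user's IP, then pivot to HTTP traffic for that IP. The SPL strings below show the shape of the queries only, not the bot's exact output, and the lookup tables are made up.

```python
# Sketch of multi-step deduction: the agent needs the IP (step 1)
# before it can build the query that answers the question (step 2).

STEP1_RESULTS = {"amber": "10.0.2.101"}                       # user -> src_ip
STEP2_RESULTS = {"10.0.2.101": ["www.berkbeer.com", "froth.ly"]}  # ip -> sites

def step1(user):
    spl = f"search index=botsv2 user={user} | stats count by src_ip"
    return spl, STEP1_RESULTS[user]

def step2(src_ip):
    spl = (f"search index=botsv2 sourcetype=stream:http src_ip={src_ip} "
           "| stats values(site) as site")
    return spl, STEP2_RESULTS[src_ip]

_, ip = step1("amber")   # the agent infers it needs the IP first
_, sites = step2(ip)     # then pivots on the IP it worked out itself
print(sites)
```

A single challenge-response evaluation could never credit this: the answer only emerges after the intermediate result feeds the second query.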

So that's a quick demo. I could let it run a bit longer, but it generally works its way through, and as I say it scores about 65% on average. I'll just quickly bring up some interesting examples of where it's gone wrong; I'll minimise that and bring this across, and again the text isn't very big on this. So this is where it can be really dumb. Here the summarise function has been called, and it has said that the answer is this attachment name, which you can't actually read here. The agent has then ignored that and decided it's still going to keep querying, because it didn't get the field right; it's got obsessed with getting its field name right and just keeps going. It's got the answer, the answer is right there in that text, but I think this is where large-context-window recall becomes a problem. A lot of research, such as the needle-in-a-haystack tests, has shown that GPT-4 recalls data well from the start of the context window or the end of it, but gets quite bad in the middle, and Gemini generally performs a lot better at that. So here it's got the answer, but it still continues, even though the answer is there; it didn't realise it had the answer and respond with it.

Moving on to next steps, then. I'd like to experiment with large language models other than GPT-4. Splunk's detection labs publish a lot of detection logic online, and it's very high quality training data, because they have the Splunk search and the plain-English description of what it does, and there's quite a lot of it. So I think by retraining on that you could probably get a local model optimised for Splunk searches. I'd also be interested, once you've got a local fine-tuned model, in maybe a hybrid group conversation: something like Llama 3 optimised for SPL combined with a higher-level brain, which is something like GPT-4. Then I'd look at something like reinforcement learning: get it to actually run, and every time it gets the right answer, give it some sort of feedback. I haven't quite worked out how I'm going to do this; maybe some sort of feedback on how many iterations or how many tokens it takes to get the answer, and then give it a reward function for reducing the number of tokens it takes to reach the answer. The only problem with that is I'd need quite a large data set; I'd probably need access to all eight of the current Splunk Boss of the SOC data sets.
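One way the reward floated above might be shaped, purely as an assumption on my part: reward correctness and subtract a per-token penalty, so shorter correct runs score highest. The 0.001 weight is illustrative only.

```python
# Sketch of a token-efficiency reward: correct answers earn reward,
# and every token spent getting there costs a little.

def reward(correct, tokens_used, penalty=0.001):
    return (1.0 if correct else 0.0) - penalty * tokens_used

print(reward(True, 800))    # efficient correct run
print(reward(True, 3000))   # correct but wasteful
print(reward(False, 500))   # wrong answer
```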

And then maybe a bit of randomisation, with various random seeds, to get that to work. My concern with reinforcement learning on the amount of data I've got available to me is that it would overfit to, and memorise, that one problem set. But yeah, these are areas I'd like to explore more. I'm definitely interested, and I think Gemini is the one showing the most promise, and I know Ultra 1.5 was released since I last updated my attempt at this, so I haven't actually tried running with that model. I briefly tried Claude and I briefly tried Llama, and both of them just did not work at all, but Gemini seems to be showing the most promise. So that's all my slides. For contact, I'm on LinkedIn; I publish some posts about my research on there every now and again, so if anyone's interested, reach out. That's all for my talk. Has anyone got any questions?