
I think the mic works; I can hear a little echo. I also have a loud voice, so if I start getting excited and yell too much, please let me know. OK, I think I'm going to get started; we're right on time, just one minute over. I was expecting far fewer people after lunch, so I'm going to try to be very entertaining and keep you awake, because it's a tough time slot. The first thing that can keep you awake is my name: it's really long, it has a lot of vowels, it's a Greek name, but I have a cool nickname that you may actually remember. My name is Xenia Mountrouidou, and you can call me Dr. X, or Xenia, or Zenia, or something like that. I'm a senior security researcher at Cyber adAPT, a startup, and my research sits at the intersection of machine learning and cyber security. Before private industry (before Cyber adAPT I was at another private company) I gave ten years to academia, so I like teaching; I really love sharing these kinds of things, and I hope you learn something. I have more research interests than these, but I put the basics up here. I love building stuff. I also built a website where I keep my blog, and there will be more related posts; I have a whole series related to this talk, so if you want to learn more, again, I'm very happy to share. I'm happy to share code too, because I learn by doing. OK, I have more hobbies than free time. I always put up that cool picture: I take crappy pictures underwater, but this is a cool one that a friend of mine took and gave me, and of course it doesn't look like me. I also love reading, playing the piano poorly, and comics. OK, so let's get
to the talk. Today I'm going to talk to you about ML and AI for cyber security. I had a pretty provocative start in my abstract, where I said I believe the next-generation cyber security engineer is going to be a data engineer who happens to specialize in cyber security. I think it's going to be a cool combination, and if this talk inspires you to get further into data, that's really cool; I think you're going to find a whole new world of cool things, discoveries, creativity, and all that. So I will give you the what, the what of the day-to-day. Yes, sometimes I call myself a data scientist, but I do specialize in network security, so I understand protocols, I even understand operating systems and things like that. So I will give you the nitty-gritty details of the what, and I will do it in two ways, which is what's really exciting about this talk: I will do it the traditional way, but I will also dwell a little on the new tools we have right now, almost zero-coding tools, so you can interrogate your data the AI way. I will say, though (and this meme has gone around; I'm not big on social media, so I had to Google a lot to find it)
that it is the story of my life right now. When I was doing my PhD, it was called statistics, and I had five people in the room when I was giving talks, all falling asleep. Then we called it machine learning, and that was pretty good: I had ten people in the room. And now it's called AI, and it's not the same thing. OK, there is a strict definition; I used to be in academia: machine learning is a subset of artificial intelligence. You've all heard of both. So what would you think is the big difference between artificial intelligence and machine learning? What would you say? Have you ever wondered? I mean, they're not the same thing. Say it again: the data sets? Well, if you Google it, you'll see it's not just the data sets. It could be, but the big thing with AI is that it can act as an independent agent, and I don't think we're there yet. We're getting there; the models are learning from us, but we're still training models. How many of you have used ChatGPT? Of course, probably all of you. Good; they're gathering data on all of us. GPT is still a trained model, and yes, the data set was huge, so they paid a few million to train it. So it is like an AI, but is it really? And it can make silly mistakes. It's not an independent agent. There are AI scientists (and I don't call myself an AI researcher) who are trying to truly make an independent agent: a robot, say, gets into a room, and there's so much stimuli, and it doesn't freak out, doesn't go crazy, doesn't feel completely threatened or anything like that. As humans we have this ability; they don't
yet. So I'm careful with what I say about our AI overlords, because you know, I need to be on the good side. OK, so there are differences. ML is us teaching machines, with smaller data sets too. I'm going to be using a smaller data set, doing the machine learning cycle for a cyber security person, and I'm offering this code free and open; please feel free to scrutinize it. I won't say it's performance-oriented data science code; it was created for a demo, in small chunks that I can show you, run, and use to give some understanding. So if you want, you can clone that repo and follow along; half of my talk is going to be that demo, actually. And please, if you run it and have trouble running it, contact me; I'll have my contact info at the end of the talk. OK, so why do I think that? How many of you handle data day to day and see tons of graphs in front of you? You've got to have some data, right? SOC analysts, you're looking at these screens full of data. So why are ML and AI a
match made in heaven for cyber? I do intrusion detection research, so marrying the deterministic ways of finding an incident with machine learning is really nice. I have my models, and next to them there may be rules, and they can enhance each other; they can give each other some confidence. There are also faster response times, not only in detecting something: think about when you're in panic mode, when an incident is happening, and you have fed a language model all your processes and all your manuals; it can really quickly answer what you need. So the response time, and the quality of the responses, goes up. Then there's automating the boring things. I cannot say how much I have suffered with boring tasks in cyber security, and these models can do this boring stuff for us. And of course being more adaptive, because we don't know what we don't know, and these models can recognize patterns. So all of this stuff is great. I'm a comics person, so I'm going to have a lot of pictures. What I have over here is the MLOps cycle, the machine learning operations cycle, which starts from the raw data, and it applies so well. Many of you are in the raw data; maybe you have your Splunk or Elastic, looking at your beautiful graphs. You curate the data, retrieve it, and look at it in pretty graphs, and you could go through the next steps, modeling and all of these fun things. So I really think that all of you can do this afterwards, especially with what the GPT models are giving us right now. OK, I'm going to go through all of these in detail in the next few slides, and with a demo. So first, what data do we have in security?
Of course, we store it in fast-retrieval databases. How many of you use Elastic? I've seen a booth out there with Elastic, too. A few? OK, a good chunk. So the first step with the data is EDA; I like abbreviations, and it's exploratory data analysis. This is what we data science people spend most of our time on: exploring this data as a black box. I get tons of packet captures in my day-to-day and explore them as a black box; I don't know what they're hiding, but they could have a very interesting story. The second part we spend tons of time on, say 40%, is feature engineering: picking the characteristics of the data that can tell us a good story, and tell the C-level execs a cool story, making a product or doing other cool things. The least amount of time we spend on modeling, and it's the most glorious thing at the end. Models can figure out whether something is malware or not, or forecast whether there's going to be the next big outage
or something like that. And yet, even though it's the most glorious part, it's the part we spend the least amount of time on. My day-to-day is mostly focused on the Internet of Things, so my day-to-day is trying to get data from these devices; I get packet captures and all that. Of course, many of you would be getting logs; I don't get that many logs. The sources can be the network, the system logs usually, and of course malware, and there could be more; you could give me some ideas of things I'm overlooking here, but these are the three sources I have mainly used. Then the types of data: packet captures from the network; malware binaries (there's been interesting analysis of binaries as pictures, transforming a binary into an image and recognizing malicious versus non-malicious), plus static and dynamic analysis. And then there are things we don't think of, like passwords: there are a few password lists out there that could give a data scientist some insight into how people pick passwords, and how to help them, in a more educated way, not pick those common passwords. So these are data sources that maybe you never think about. I've been doing a lot of focus on domains, because I do research on DGAs, algorithmically generated domains, and there's tons of open-source stuff out there, algorithms to create these generated domains. So there's a wealth of data that we have as security people, and it truly makes sense to use the data science tools, the machine learning and AI tools. Feel free to interrupt if you have a question, and I'll keep going with the slides. So next, let's get to the steps. We have our data, and we have some good
people gathering data for us. Now we're looking at it and we're like, what the heck? So usually we go to some pretty pictures first. Like I said, I'm a comics person and I'm pretty visual, and there are tons of ways to visualize data for different reasons. We can visualize data to understand how it spreads along an axis (that's a histogram), or with that violin plot, or to see correlations between features, between the columns in your spreadsheet or something like that. And for all of these ways you don't have to be a programming wizard: I have some one-liners that you can use to make these graphs (thank you, Python), and now there's also prompting you can do to create those graphs, which I'm going to show in my demo. Then, after the pretty pictures, there will be math. Sorry, it happens; it's called data science. Data science is mostly mathematicians who were not very good at theory and wanted to do more applied stuff. The math, though, you can approach as aggregating statistics: some really cool numbers that can give you some really good information on how your data relates. How does a source port relate to a source IP?
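Here is a minimal sketch of the kind of one-liners I mean; the column names (src_port, pkt_len, duration) are invented for illustration, not taken from my real captures:

```python
# A minimal EDA sketch: one-liners for statistics and correlations.
# Columns are invented stand-ins for engineered packet-capture features.
import pandas as pd

df = pd.DataFrame({
    "src_port": [443, 53, 443, 80, 53, 1900],
    "pkt_len":  [1500, 84, 1400, 512, 90, 300],
    "duration": [2.1, 0.1, 1.9, 0.7, 0.2, 0.4],
})

print(df.describe())   # one-liner: count, mean, std, quartiles per column
print(df.corr())       # one-liner: pairwise correlations between columns
```

On real data, `df["pkt_len"].hist()` is the same kind of one-liner for the histogram, if you have matplotlib around.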
Because you have different behavior inside the network and different behavior outside it, if it's an external IP; the statistics, the correlations, can give you that. Exploratory data analysis will not just give you numbers, either; it will also help you test hypotheses, which is really cool. How statistically significant is the relation between the total number of packets in my streams and the duration of those streams? Of course, there has to be some statistical correlation between those two. Lastly: Hacking Dave yesterday, I don't know how many of you were at the Security Onion Conference, a good chunk of you; Hacking Dave was talking about that long tail, the values that are rare. We have a lot of noise; if you're a SOC analyst there's so much noise, so many false positives, but then there's the rare event, and again, exploratory data analysis and statistics can help you find those rare events and see: wow, something out of the usual is going on, and maybe it needs my attention. And that's pure, straightforward tooling; thank you, data scientists and Python, for doing that for us. Now I'm going to the newer stuff, and I'm going to say up front: once you use my code, it may
break, because this is a moving target. You can do exploratory data analysis with a new framework. Data science people use these pandas; no, they're not cute little bears, although I love the analogy. They are data structures that are really Excel spreadsheets: two-dimensional arrays of your data, and what they do is facilitate how we process our data. We have columns and rows; life is good and neat. There is now a framework combining these pandas with OpenAI, and that's what I'm going to demonstrate. Of course, you can also use your own models; that's a little more elaborate. So now you can do exploratory data analysis, statistics, graphs, without having to know all of these libraries that I had to learn; it's just prompting GPT, as long as you're OK uploading your data over there. So there's this caveat. But that's really cool: it can help you learn, and it can also be a quick solution. So that was one part that data scientists do a lot: Dora the Explorer, explore the data. The second part is feature analysis, and my demo is going to have a lot of it. Feature engineering is not as
glamorous as you'd think; for data scientists it's mostly cleaning garbage. We start with a lot of garbage: lots of duplicates, lots of missing values, and we have ways of removing those. It used to be that I had to write elaborate code; now I can do it with prompts. After we curate our data, which can be painful, we have to extract features. There are mathematical ways (principal component analysis is one that is used a lot), and there's also domain knowledge, and that's where it comes in handy to be a cyber security engineer, because you understand source IP, destination IP, protocol, ports, services, and all of these good things. Domain knowledge is the part that's not science, it's art, and the data scientists may not have it; but you do, you're specializing in this. And then you have the holy grail of data science: if your features are good, the features that you produce, then life is good and your models are going to be pretty good. There are two types of features over here: the upper part is categorical, the lower part is numerical. Categorical is what language models deal with: words. But all of these words; the language models, the AIs, don't actually understand words. I hate to break it to you. I mean, you
have probably seen it; there's a lot of analysis of how this works. It's not words it understands, it's numbers. Tokenization, converting these words to numbers, is what it understands, and that's actually the brilliance of the AI we're using right now: how we do these conversions of words to numbers while keeping the context of the words, so that these models are smart like artificial intelligence. Then there's the numerical data, which GPT is not very good at; I'm going to show in my demo that, unfortunately, I can't do my models with GPT on numerical data. But it's really good for the data scientist to take their numerical data and massage it a little, because IPs, yes, they're numbers, but they're not just a single number for the model; we have ways of converting IPs to numbers while keeping the context of, say, being in the same subnet, and things like that. So: numerical and categorical data. Again, the sources I have are really good, and I will be posting my slides in the same git repository; I just wanted to wait until the conference ends. I have a source here, Daily Dose of Data Science: if you want to start your journey with data science, I love its intuitive explanations without too much math, and its pretty pictures. This figure is showing how we do this categorical encoding, and the top shows a really cool way. Let's think about an example from networks: TCP, UDP, HTTP, MQTT; these could be four protocols that I have in my packet capture, and I can do one-hot encoding, which looks like binary. It takes a little memory, but it's really efficient; I really like one-hot encoding, it has saved my life a few times. It's simple, it takes some memory, but it's a good translation of categorical data, and then there are other types of categorical encoding for that too.
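A minimal sketch of that one-hot translation with pandas; `pd.get_dummies` does exactly this, and the protocol column here is made up to match the example:

```python
# One-hot encoding of a categorical protocol column with pandas.
# Each protocol becomes its own 0/1 column, which models can digest.
import pandas as pd

df = pd.DataFrame({"protocol": ["TCP", "UDP", "HTTP", "MQTT", "TCP"]})
one_hot = pd.get_dummies(df["protocol"], prefix="proto")
print(one_hot)   # columns: proto_HTTP, proto_MQTT, proto_TCP, proto_UDP
```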
So yes, we transform our words to numbers, but we also need to pick the most important features from our data. If you've looked at a packet capture, there are tons of fields: the whole OSI stack, MAC addresses, IP addresses, blah blah blah. I don't want all of that; I don't have an OpenAI budget to train a model for days and pay AWS hundreds of dollars, or thousands, actually. So I want to extract these features with some cool math. Principal component analysis is one of the cool methods: it takes the data and says, OK, this is the data that is the most interesting, because it causes the most variance. So, say, IPs and ports are the most interesting in a specific packet capture, or packet sizes and all that. And it could be two features from your whole data set, or three; I have a model now that I thought would need 30 features, and it's actually six. It's all about variance: if I pull this feature out, does the data still change, is it still interesting? Think about variance as something like that.
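Here is a tiny PCA sketch with scikit-learn, on synthetic numbers rather than a real capture, just to show how you ask for the components that carry the most variance:

```python
# PCA: project correlated features onto the directions of maximum variance.
# Three synthetic numeric columns: two correlated ones plus pure noise.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               base * 2 + rng.normal(scale=0.1, size=(200, 1)),
               rng.normal(size=(200, 1))])

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)               # keep only the top two components
print(pca.explained_variance_ratio_)    # how much variance each explains
```

The first component soaks up the correlated pair, which is the "is it still interesting if I pull it out" intuition in numbers.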
And after all this math and all these things we had to learn for data science, we now have PandasAI doing it for us: it has a function to prompt GPT to generate features. I will say, if you're into Python, I'd welcome you to contribute to this project, because I think they can use some help; this prompt is a little simple. They've done a fantastic job, I'm not knocking them; I'm using the framework, the libraries. But it is a simple prompt, and you will see in my demo that it doesn't generate very interesting features. Still, with the prompt you can generate features and tell a story. OK, finally, the 20% of the data scientist's
work, which is also the most impressive part. The C-level people are like, yeah, models! And we're like, no, I spent all my time exploring the data and doing feature engineering; but oh well, yes, models. There are tons of them. Another resource I strongly recommend: Illustrated Machine Learning; see the pictures here, I have a thing for them. The models out there are plenty, and you can split them by three criteria: supervision, parameterization, and linearity. OK, supervision, let's take that, because we all know GPT: generative pre-trained... what's the T? Transformer. So where does this model actually stand? Believe it or not, what do you think? Is it a supervised model, where we supervise it completely and tell it how to react to external stimuli with labeled data (we tell it this is malware, this is not malware, and then we show it something unknown and it figures things out)? Is it unsupervised, where we just feed it a bunch of data and it figures out the patterns? Or is it semi-supervised? What do you think? Semi, exactly. And that's our life, right? Think about your life as a cyber security engineer. Semi-supervised means I have some malware samples that analysts have spent hours on; they have written cool blogs and they have labeled them for me, but there are only a hundred, because I can't work them seven days per week, 24 hours per day. So there are only a few of those samples, but then I'm downloading all these samples from VirusTotal and Malware Bazaar that they say are malware, but I don't know; they are unlabeled. So I use the labeled ones, the painstaking work of my analysts, to help label other things, to help train this semi-supervised model. These are the ones I resort to most of the time: a few labeled samples, and a pile of unlabeled ones.
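As a sketch of that semi-supervised workflow (my illustration, not the exact model from the demo), scikit-learn's SelfTrainingClassifier takes a handful of labeled samples, with -1 meaning unlabeled, and bootstraps labels for the rest:

```python
# Semi-supervised sketch: a few labeled samples plus many unlabeled ones
# (label -1). SelfTrainingClassifier iteratively labels the confident ones.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(1)
X_mal = rng.normal(loc=3.0, size=(50, 2))   # pretend-malicious cluster
X_ben = rng.normal(loc=0.0, size=(50, 2))   # pretend-benign cluster
X = np.vstack([X_mal, X_ben])
y = np.full(100, -1)                        # everything unlabeled...
y[:5], y[50:55] = 1, 0                      # ...except 5 samples per class

clf = SelfTrainingClassifier(LogisticRegression())
clf.fit(X, y)
print(clf.predict(X[:3]))
```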
Then there's parameterization (you can hypothesize about your data or not) and linearity, and that's about it. OK, last two slides and then we'll go to the demo; you can try to run it while I'm doing it, and to run it you will need an API key. So, last two slides: how do we train on that data, how do we teach the child to learn about the world? We take the data set and split it into three parts: training, validation, and test. The biggest part, of course, is the training; it has to learn somehow. The smaller parts are the validation and test. Validation asks, is it valid, is it finding the right things; test feeds it something completely unknown to see if it can actually do it. What I use in my demo, and what is de facto standard (so you will have to use it if you do data science stuff), is k-fold cross-validation: take the data, split it into five different parts, take four parts for training and one part for validation, then rinse and repeat five times with a different part held out for validation each and every time. This is the way to do it so that you create unbiased models: they eventually see all the data, without overfitting to one specific slice of it.
This one, again, is a good visual, because I use it in my demo to evaluate the model: we calculate a metric, say the accuracy, how good the model is, in every iteration, and then we average it. OK, so: the confusion matrix. I taught this in college; it's called confusion for a reason. True positives, false positives; you've seen them, especially if you're doing SOC work and all that. We want to be on the diagonal of this matrix: all true positives and all true negatives and nothing else. If only life were so ideal. Then we have other metrics in the data science realm that you can, again, calculate with one-liners: accuracy, precision, and recall, and these are some good visuals for them. Precision takes all the true positives, our true, relevant predictions, divided by true positives plus false positives: of the items that were retrieved, how many were relevant? And recall is the other side: of the relevant items, how many were retrieved? I think that's a pretty good intuition. We could spend tons of time on evaluating models.
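Those metrics really are one-liners; the labels below are invented just to show the calls:

```python
# Confusion matrix, precision, and recall as scikit-learn one-liners.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # 1 = malicious, 0 = benign (made up)
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # the diagonal is the happy place
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
```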
But I want to get to the fun part, so I'm skimming through this a little bit. Finally: large language models can do the modeling for us. They don't do the evaluation (I'm doing the evaluation over here), but still, it's not really bad, and they can give us some accuracy. You can see, from my actual demo, some accuracy results: GPT-3 a little less good than GPT-4, actually, and a little inconsistent. So without further delay, I'll go to the demo. First I will say, if you went to that website: I'm using VS Code, so I call it the Cadillac of IDEs; it's really good. But you can use anything you like for this demo. You can use Colab, Google Colab, and then you don't have to set up an IDE, and I do have a whole tutorial if you want to use that. VS Code has pretty colors, and I have a weakness for pretty colors, so there's that. I also use what day-to-day data scientists use, which is Jupyter notebooks. This is literally running your Python in small little chunks, because there's lots of code over there and I need to see results and see how I'm going to move on to the next part. They're really great for teaching, also, so yes, I strongly recommend using those in any way you want. They're also great for documentation: I have markdown documentation, and I can quickly go to the different parts of my code without a hitch. So let's start with this. I will walk you through the whole thing; we do have half an hour, I plan to do it in 20 minutes, and maybe I'll run a thing or two. I have sacrificed a couple of curls to the demo gods; sometimes it doesn't work, and that's
why I can't grow my hair any longer. Anyway, silly jokes. OK, so first: this is a pain for any data scientist, all those packages that we have to import. There are a lot of packages we have to import, but if you just start using this framework, PandasAI, there are very few packages you have to import. If you hit play, you see that this has run, and happily, it worked. The previous part is just the installation of those packages, so I'm not going to drill into it too much, but again, if you have questions, if you are a beginner, shoot me a message and I'll be happy to answer. So I've imported my packages. Now, you're security people (I hope everybody can see that in the back over there; you can see, right? It's big enough? I can make it bigger; OK, cool). You're security people, so you don't want to expose your secrets in your code. So there's this one line of code in Python, load_dotenv, and my .env is a file that I never check into git, so nobody sees my secrets. Been there, done that, got the T-shirt of getting my secrets exposed. It happens; when you develop a lot of code, you can make that mistake.
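That one line, sketched out; this assumes the python-dotenv package is installed and a .env file with an OPENAI_API_KEY line sits next to the script:

```python
# Keep secrets out of code: the key lives in a .env file (never checked
# into git), and load_dotenv pulls it into the process environment.
# python-dotenv is a third-party package; this sketch degrades gracefully.
import os

try:
    from dotenv import load_dotenv   # pip install python-dotenv
    load_dotenv()                    # reads KEY=value lines from ./.env
except ImportError:
    pass  # fine for this sketch; export the variable in your shell instead

api_key = os.getenv("OPENAI_API_KEY", "")  # empty string if not set
print("key loaded:", bool(api_key))
```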
OK, it happens, but this is good hygiene. Now, my use case is packet captures, so I chose here to read a couple of packet captures. One is Mirai, the grandfather (like the speaker in the morning said) of all IoT malware. We see variations of Mirai; I analyze a lot of IoT malware, and we see it every day, every week, so it's here to stay. I've analyzed a packet capture from Kaggle, actually, which is an open data collection website many data scientists use, and the benign packet capture is from Kaggle as well; it's just normal IoT behavior. So that's all there is to that, and then
there will be a lot of preprocessing and all of these things. So the first thing I need to do: I have a packet capture that I've read with... OK, anybody know what library we usually use to read packet captures, or even create packet captures? You may have seen it in my imports. Very good, excellent: Scapy is the library. But then we have these objects that are not pandas DataFrames, and we want pretty little bears, so we have to write a little code. Again, I'm not going to say this is the most efficient code, but I use Scapy, I extract the IP layers, and this is where domain knowledge helps: if you know what you want to extract, you just write a quick function with if-statements (does it have TCP or UDP? That's all I want to analyze), I create a list of my data, and then I use this pretty pandas DataFrame constructor, and now it's a DataFrame. In order to avoid redoing that, and to do it faster, I have also saved my data into pickles, which are saved snapshots of my data. So here is what the packet capture looks like at the end of the day: you see familiar things that you may see with Wireshark, like timestamps, and source and destination
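A sketch of that Scapy-to-pandas step; the capture filename is hypothetical, and if Scapy is not installed the code falls back to a hand-made row so the DataFrame part still runs:

```python
# Turn a packet capture into a pandas DataFrame.
import pandas as pd

def packet_rows(packets):
    """Keep only IP packets carrying TCP or UDP, flattened into dicts."""
    from scapy.layers.inet import IP, TCP, UDP
    rows = []
    for p in packets:
        if IP in p and (TCP in p or UDP in p):
            layer4 = p[TCP] if TCP in p else p[UDP]
            rows.append({
                "time": float(p.time),
                "src_ip": p[IP].src, "dst_ip": p[IP].dst,
                "src_port": layer4.sport, "dst_port": layer4.dport,
                "proto": p[IP].proto, "length": len(p),
            })
    return rows

try:
    from scapy.all import rdpcap
    rows = packet_rows(rdpcap("mirai.pcap"))   # hypothetical capture file
except (ImportError, OSError):
    rows = [  # stand-in row with the same columns, for the sketch
        {"time": 0.0, "src_ip": "192.168.1.5", "dst_ip": "10.0.0.1",
         "src_port": 49152, "dst_port": 23, "proto": 6, "length": 60},
    ]

df = pd.DataFrame(rows)
print(df.head())
```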
IPs, and so forth. You also see a lot of missing values (NaN, None values; yes, I said it right), and we will take care of those. Or, actually, the AI is going to take care of them. So here we go, the moment you've been waiting for. If you want to be a data scientist, you can omit all the parts of my code that are traditional and just use the AI parts, adjusted to your own needs. All we have to do is have an API token, or our own model, but for now it's much easier to use OpenAI. So we need that API
token, and then the PandasAI framework. And then we need to do one more conversion, unfortunately: convert our data to a SmartDataframe, because they have used this library called LangChain to break down your data and chain it toward the GPT model. So we have to use another object, and then we can use one of their functions. After we convert this pandas DataFrame, which was only a little bear, to a SmartDataframe, which is a smart little bear, we can use their functions, like the data cleaning. I used to have to write regular expressions (and I suck at those) and all of these painful things, and now I can just call a function. So you can see the original data frame was some 764,000 lines and change, and now it's 197,000 lines, because all of the "None" values (that's n-o-n-e; my accent doesn't help the difference) are cleaned up. So you can see it did a decent cleanup, and that was an out-of-the-box function; I didn't even write the prompt, though there are other ways where you write it yourself.
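For reference, the traditional, by-hand version of that cleanup (roughly what the out-of-the-box call replaces) is a couple of pandas one-liners, with invented columns:

```python
# Manual cleanup, the pre-PandasAI way: drop exact duplicates, drop rows
# that are entirely missing, and fill the remaining gaps.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "src_port": [443, 443, np.nan, 53],
    "dst_port": [51000, 51000, 51002, np.nan],
})

cleaned = (df.drop_duplicates()   # exact duplicate rows go
             .dropna(how="all")   # rows with no values at all go
             .fillna(0))          # remaining gaps get a placeholder
print(len(df), "->", len(cleaned))
```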
You can actually write the prompt yourself, too. So then, again, the same demo for the benign data, and we have less data; let there be clean data, without the manual garbage collection I used to do. OK, so now, sometimes I want to experiment and see if I can get some interesting insights by shrinking my data: extracting streams, not analyzing single packets, using source IP, destination IP, source port, destination port. That was just to make my model faster. You can use that function; it's not amazing, but I use streams from now on because it's less data.
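The stream extraction boils down to a groupby on that four-tuple; here is a minimal sketch with invented rows:

```python
# Collapse single packets into streams: group by the
# (src_ip, dst_ip, src_port, dst_port) tuple and aggregate.
import pandas as pd

pkts = pd.DataFrame({
    "src_ip":   ["10.0.0.5", "10.0.0.5", "10.0.0.9"],
    "dst_ip":   ["10.0.0.1", "10.0.0.1", "10.0.0.1"],
    "src_port": [49152, 49152, 50000],
    "dst_port": [23, 23, 80],
    "length":   [60, 1500, 400],
})

streams = (pkts.groupby(["src_ip", "dst_ip", "src_port", "dst_port"])
               .agg(n_packets=("length", "size"),
                    total_bytes=("length", "sum"))
               .reset_index())
print(streams)   # three packets become two streams
```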
What I actually want to insist on is that I still have a pandas DataFrame now. It's not a smart DataFrame, it's just a cute little bear, but the power of pandas (and I recommend you check it out) is that I have one-liners that can give me even more insight into the data. A single line of code: all the columns of my data. A single line of code: all the data types. So I can see that I don't have all-numerical data, which means I have to do something. Since I don't have numerical data, I will use an IP library to convert the IPs to numeric, so that I can finally have all-numerical data; I do this here with the IP-to-numeric function. Now, here's the thing with pandas: it can be inefficient, so you have to forget your for-loops and start using this apply function. This is unfortunate, and hopefully, eventually, we will do all of this with prompting and not have to. But that's the thing with apply: I'm not doing a for-loop through all the lines of my spreadsheet, I'm just passing the function name in a lambda, and that's about it.
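One simple way to do that conversion uses the standard-library ipaddress module, so addresses in the same subnet stay numerically close, applied per row with .apply instead of a for-loop:

```python
# Convert dotted-quad IPs to integers while keeping subnet locality:
# addresses in the same /24 end up numerically adjacent.
import ipaddress
import pandas as pd

def ip_to_numeric(ip: str) -> int:
    return int(ipaddress.ip_address(ip))

df = pd.DataFrame({"src_ip": ["192.168.1.5", "192.168.1.6", "8.8.8.8"]})
df["src_ip_numeric"] = df["src_ip"].apply(lambda ip: ip_to_numeric(ip))
print(df)
```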
This is unfortunate, even though pandas is a great tool. And now you see that after I run these functions, I have all numerics in my data frame, so I can do more exploration. I can get stats, because I cannot get stats if I have words. Stats, again, are a single line. If you retrieve one thing from this, it's that you don't have to be an algorithm wizard to be doing this data analysis. With your data frames and a single line of code you can get descriptive statistics: count, mean, standard deviation.
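That one-liner is pandas' describe(); a tiny example with a made-up port column:

```python
import pandas as pd

df = pd.DataFrame({"dst_port": [53, 80, 443, 53, 53]})

# One line of code: count, mean, std, min, quartiles, max per column.
stats = df.describe()
print(stats)
```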
With four lines of code you can get a heat map of your statistics. This is still not AI, so I'll breeze through it. After that, you can look at the relations in your data: source and destination port, protocol, number of packets. This is a bunch of numbers, so I need to draw it. This one doesn't look as good; I'm going to make it smaller. Okay, so this one is a heat map again. Even if you don't understand the math here: the brighter the color, the higher the correlation. So I can say here that protocol has an interesting correlation with the numeric destination IP. This is what my correlation numbers found. Of course the diagonal is all very bright red, because the source port is completely correlated with the source port, the destination port is completely correlated with the destination port, and so forth. These visuals can give you a story of the data, and you can see the malicious data show a different heat map than the non-malicious data. So it can give you some really good insights just by looking at pretty pictures. But I know you're getting tired of these, so I'm going to breeze through the hypothesis testing and outliers, and get to the AI.
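The numbers behind such a heat map come from the pairwise correlation matrix; a small sketch with made-up stream columns (the heat map itself is then one more call, e.g. seaborn's heatmap):

```python
import pandas as pd

df = pd.DataFrame({
    "src_port": [5353, 44321, 5353, 60000],
    "dst_port": [53, 53, 53, 80],
    "protocol": [17, 17, 17, 6],
})

# Pairwise Pearson correlations. The diagonal is always exactly 1.0,
# which is why it shows up bright red on the heat map.
corr = df.corr()
print(corr)
# Heat map version, if seaborn is available:
#   import seaborn as sns; sns.heatmap(corr, annot=True)
```

In this toy data, dst_port and protocol move in lockstep (the one non-DNS row is also the one TCP row), so their correlation is exactly -1.0, the kind of bright off-diagonal cell the talk points out.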
All of these things that I've been doing, where I had to learn, had to read the documentation about pandas and understand how to use these functions and all that: I can do them with AI. I can interrogate my data. Now I'm using a slightly different version of PandasAI, so that's why I have these two versions. Now I'm doing my own prompting; I'm not using their out-of-the-box function, I'm writing my own prompts and asking the AI: what are the top five source IPs? Or, what are the most popular destination ports? And actually, this was a recent question: which are the most rarely used ports within the range of... and do you want to do this for the malicious data? This could take a while, and like I said, maybe I didn't sacrifice enough hair-pulling. Okay, so which are the most rarely used ones? Oh, it did it. Really? Ah, it didn't say anything. Womp, womp, womp. Okay, so this one was 69, but this one was warped, so maybe we can use a different one, maybe the streams. So at this point, there's a caveat: you are not using a deterministic mathematical function from Python, you're using an AI, and so that may not be as good as you would like it to be.
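The prompting pattern here can be sketched with PandasAI's SmartDataframe interface. Everything in this sketch is an assumption: the API shown matches PandasAI 1.x/2.x, the CSV filename is hypothetical, and it only actually runs if pandasai is installed and an OpenAI key is configured (each question is an API call, which is also why it can be slow):

```python
import pandas as pd

# The kind of questions asked in the demo.
questions = [
    "What are the top 5 source IPs?",
    "What are the most popular destination ports?",
    "Which are the most rarely used ports?",
]

try:
    from pandasai import SmartDataframe
    from pandasai.llm import OpenAI

    df = pd.read_csv("mirai_streams.csv")              # hypothetical file
    sdf = SmartDataframe(df, config={"llm": OpenAI()})  # needs an API key
    for q in questions:
        print(q, "->", sdf.chat(q))
except Exception as exc:
    print("PandasAI not available or not configured here:", exc)
```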
So, did it answer? Actually, it did answer, and it answered with a table this time. So it's definitely something that I use mostly as a black box right now, but I would like to dig into why it gave me only one answer the previous time and now it's giving me a whole table. So now, here's the thing that blew my mind: you can actually use the AI to do graphs. That's not a pretty graph, but again, you didn't have to go through all the documentation to make graphs. And I had this brilliant idea to use one of those and do another graph live, and this is going to take an hour... it does take some time. So let's do a histogram with the malicious data. But it's worth it; I mean, it's time you can spend sitting there pretending to be compiling, playing a video game or something, instead of reading the documentation. But I will say, look at the rest of my notebook, because I do have the standard ways. Ah, it did it faster than expected. So there, and it looks like a long tail, so that's pretty good.
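Those "standard ways" in the notebook aren't shown in the talk; a minimal pandas/matplotlib sketch of the same kind of long-tailed histogram, with made-up packet lengths, looks like this:

```python
import matplotlib
matplotlib.use("Agg")        # headless backend, no display needed
import pandas as pd

# Made-up packet lengths with a long tail, like the demo's malicious data.
lengths = pd.Series([60] * 50 + [90] * 20 + [150] * 5 + [1500] * 2,
                    name="packet_length")

ax = lengths.plot.hist(bins=20)
ax.set_title("Malicious stream packet lengths")
ax.figure.savefig("hist.png")
```

This is the trade-off the talk describes: a few documented lines that run instantly, versus a natural-language prompt that takes a while but needs no documentation reading.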
And data scientists have spent a lot of time doing this, so there's a ton of libraries. Like, look at that: a nice exploration of all your data with a single line of code. I have a lot more of those pretty packages, but this is not AI, so I'll skip ahead to the AI. So here we're getting to models. I'm going to just show what models I used. I used three models here, because I'm doing supervised learning: I know that I have Mirai and some benign data, and so I use three different supervised classifiers.
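The talk doesn't name which three classifiers were used; a hedged sketch with three common scikit-learn choices, each fit in a single line, on toy stream features (feature names, values, and the particular classifiers are all assumptions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy numeric features standing in for the stream features;
# labels: 1 = Mirai, 0 = benign.
X = pd.DataFrame({"dst_port":  [23, 23, 80, 443, 23, 8080],
                  "n_packets": [900, 850, 12, 8, 910, 15]})
y = [1, 1, 0, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

# Three supervised classifiers, one line each.
models = {
    "tree":   DecisionTreeClassifier(random_state=0).fit(X_train, y_train),
    "forest": RandomForestClassifier(random_state=0).fit(X_train, y_train),
    "logreg": LogisticRegression(max_iter=1000).fit(X_train, y_train),
}
for name, model in models.items():
    print(name, model.score(X_test, y_test))
```

Each classifier exposes the same fit/score interface, which is what makes swapping models a one-line change and leaves the real work in the data preparation already shown.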
They're all supervised, all single lines of code, no glory here, except that you can explore the parameters. I used a lot of curves that you can look into, and they can help you evaluate your model, but I'm not going to get into those; I'm going to get to the AI classifier. Okay, so the AI classifier was a little womp-womp-womp for me, compared to the other things I could do with actual Python packages, because it only works on words. So if you see here, I'm only choosing the payload from my packet captures; I'm feeding it the payload and the label. So it's going to look only at the payload and try to figure out whether the behavior is malicious or non-malicious. In this case I'm using a completely different package, called Scikit-LLM; scikit is a very well-known family of packages. And I can be brave and start running this, but it's going to take a while, so I'm going to show the results first. You see, again, and this is what a data scientist may not like, there's inconsistency: GPT-3 is doing a little better now than GPT-4. How? Well, I'm giving it very little data, a hundred or so payloads, so, you know, it just happened. Here GPT-3 is doing a little less well than GPT-4.
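A sketch of the Scikit-LLM zero-shot classifier described here. The import path matches recent scikit-llm releases but is an assumption, the payloads are made up, and the predict step only runs with the package installed and a real OpenAI key (and, as noted in the talk, each prediction costs money):

```python
# Example payloads and candidate labels, standing in for the pcap-derived data.
payloads = ["GET /bot.sh HTTP/1.1", "GET /index.html HTTP/1.1"]
labels = ["malicious", "benign"]

try:
    from skllm.config import SKLLMConfig
    from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

    SKLLMConfig.set_openai_key("sk-...")   # placeholder, use a real key
    clf = ZeroShotGPTClassifier(model="gpt-4")
    clf.fit(payloads, labels)              # zero-shot: this just records the label set
    print(clf.predict(["POST /cgi-bin/login HTTP/1.1"]))  # one API call per payload
except Exception as exc:
    print("scikit-llm not available or not configured:", exc)
```

The scikit-learn-style fit/predict interface is the point: the LLM is dropped in where a classical classifier would go, even though under the hood each prediction is a prompt.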
So I'll start running this, and I'm going to stop. I think I have 10 minutes, right? Perfect. Okay, so I'm going to stop for questions. I also have to test you on some things; some of you were very kind to answer things, so maybe I should be handing these out. But yeah, I'll start running it. So, a caveat on that: if you start using GPT-4, and this is using the API for GPT-4, you do spend a little more money. I spent like $5 to prepare for this conference; it's not too bad, but if you're a college student, maybe you would need to watch that. Okay, so any questions or any observations? Yes, you're asking...
So the question is: how much confidence can I have in the results I'm getting from the LLM when I don't see exactly the process? Am I rephrasing that correctly? So, I don't have too much confidence. I trust it, but I verify it. The scatter plot that I did, because I had to go through the process of learning how to do this, I would verify. However, there are some things you can verify without being a data scientist, right? If you ask it to give you the top ports, you can do this with a spreadsheet; you can export your packet capture to a spreadsheet and verify it. It answers pretty well, because they're feeding it the data properly. For simple questions, the answers are pretty good. For more complicated questions, and the most complicated thing I'm asking is to classify things... so the deterministic stuff, like make a graph, pick the top 10, or do some summary statistics, it's pretty good at, because it's simple logic that it's being taught to perform. The stuff that is more like a model trying to build a model inside itself to recognize malicious packets versus non-malicious packets, that I don't trust at all, for now. I would need to look; I haven't dissected the code of the zero-shot classifier. I know what zero-shot prompts are, or few-shot prompts, but yeah, this is hocus pocus right now, so that part I don't recommend. I do recommend PandasAI for interrogating your data, because it's pretty deterministic, it's doing a good job, and you don't have to write much code for it; you can just ask really good questions. So thank you for the question.
Yeah, so the question is: how does PandasAI handle parametric versus non-parametric data sets, and especially the speed, correct? So, I have to say that I did try to experiment with the original packet captures in the beginning, and the time was unbearably slow, because again, I'm going through an API call. When I'm uploading 700,000 rows compared to 200,000 rows, there's a huge difference. When you say parametric, do you mean the parameters of a model, hyperparameters, or do you mean the features of my data set and the volume? Is that what you mean? So, the performance is not good; I wouldn't wait that long. However, in my day-to-day, not "kind of," in my actual day-to-day, I'm training my own model right now. So you can train a model on only packet captures, let's say. You decide to make a language model that understands how to read packet captures, and the only thing you need to do is learn how to teach it. Okay, it's not as easy as I'm presenting it, but if you have tons of packet captures, you can definitely do that. With that being said, because of the API calls that I make, I'm getting throttled; there's a lot of things they're doing. I have a paid account, but it's not an enterprise account. I think a local model would work much better for that. So I recommend that you get one of these free, open-source models that are small, and train your own. It's very doable. Again, you don't need... you need good domain knowledge to train a model. I may do a talk next year on how to train a model as a security engineer. All you need is really good data; it's not even that much math. But yeah, the good data is tough to get.