
So, I'm Kartik Sharma — you can also call me Kart. I'm currently working as a senior associate information security engineer at Equinix, a data center company. Today we'll be talking about how we can detect insider threats, or attacks, using graph neural networks.

First, a little about me. I currently live and work in Seattle, and I went to UO for my master's in CS, so that's my PNW connection — and go Ducks, we're the number one football team in the nation right now, so that's great. My work is at the intersection of data analytics, AI, and cybersecurity, and my hobbies are typical PNW hobbies: hiking, drinking overpriced coffee, and flannels.

So that's about me; let's talk about the agenda. First we'll talk a little about insider threats, then I'll introduce GNNs, or graph neural networks, then we'll talk about an approach to detect insider threats using GNNs, and finally we'll look at some challenges and future directions.

Alright, let's get started. What is an insider threat? I'll just read the definition, which is from IBM: insider threats are cybersecurity threats that originate with authorized users — such as employees, contractors, and business partners — who intentionally or accidentally misuse their legitimate access, or have their accounts hijacked by cybercriminals. A big example: in
2023, some Tesla workers leaked a lot of sensitive data — a lot of employee data — to a German newspaper. That was a big one.

Now, the cost of insider threats: an incident can cost an organization $16.2 million, and it can take 86 days to contain. There are three different types of insider incidents, with different costs. The first is the easiest one — a mistake — which can cost you around $500,000 per incident. The second is malicious, around $700,000. The last is the outsmarted insider — think stolen credentials — around $680,000. That's a lot of money.

Let's talk about types of insiders. Malicious employees just want to misuse their access — they hate the company, or whatever. Contract workers come with security risk, because sometimes they'll be using their own laptops, and you probably don't have your whole security stack installed on their laptop — that's a big security risk. Unintentional: sometimes there's a lack of awareness, and security is compromised unintentionally. And sometimes security standards simply aren't in place, and that can cause an insider threat.

Now, insider threat activities. IT sabotage: the whole IT infrastructure could get destroyed. IP theft: sensitive company information can be stolen. Fraud: data manipulation — maybe someone working for a different company is also working in your company, and they can manipulate data to the other company's advantage. Espionage: spying for competitive advantage — which goes back to the last point — or for national security purposes; that's a big one.

Now, the levels of insider threats. The first is low level: careless, unintentional. It's rated low — I mean, it happens a lot of the time, but we always put it at a low risk level. The medium level is malicious intent from an employee with limited goals — maybe stealing some data, or building their own startup; it could be anything. The high level is when an insider is highly skilled, or they could
be a foreign spy.

Now let's talk about the traditional strategies we use for insider threat detection. The first is behavior-based detection, which focuses on deviations from typical user behavior — if a user's behavior goes off the charts, you can flag them or look into them more. Maybe they're logging in at unusual times: they're logging in at 12 a.m. but they normally work 6 a.m. to 2 p.m., so you probably want to look at them more closely. Rule-based detection: you can predefine rules for your organization, and if someone goes outside those rules you can flag them or investigate further. Anomaly detection: just get a lot of data and see what deviates from the norm — maybe someone is accessing really sensitive company data at a huge scale, and you can look at that. Signature-based detection: if a malicious pattern is already known, you can look into the logs and, using matching techniques, match those patterns. Heuristic-based detection: if there are heuristics you know — and there always are, from best practices — you can use those to detect threats.

So those were the strategies; now let's look at the traditional techniques. First, statistical techniques: we can use clustering, time-series analysis, or other statistical methods to find deviations from normal behavior. Machine learning techniques: we can use k-means clustering or SVMs to classify behavior as normal or malicious. Matching techniques: maybe use regular expressions to match known-malicious activity in the logs. Deep learning techniques: we can use LSTM autoencoders or other deep models to detect subtle insider threats based on a lot of different features — we can build our own features from user behavior, like how many times they accessed a file, and so on. So those are the traditional techniques we currently have in place
for insider threat detection. But they have some limitations. Limited nonlinear detection: a lot of the time, classic ML or statistical techniques struggle to find deviations in nonlinear data, and most real-world data is nonlinear — that's why it can be hard to detect these threats without a deep learning model. Novel threats: it's hard to detect novel threats with these traditional techniques. High false positive rates: we don't want to flag someone without actually knowing there's an insider threat — that could be bad for any organization. Overfitting is a big one: with some deep learning techniques we can detect threats in nonlinear data, but there's also the possibility of overfitting to that data. And because these techniques are traditional, there's always some vulnerability to evasion.

So how can a graph neural network help? A GNN — which is really an extension of deep learning models — can help detect insider threats in a few ways. It can model complex relationships, because it works on graph data, so it can model relationships between users, devices, and events. It can adapt to evolving threats, because those relationships are modeled in a graph, so it can adapt to new or novel threats. It can give you contextual insights — again, because it models a graph, it can see a user's neighbors and the different activities users are doing, and that context can make your detection better. And the big one: it can help you reduce false positives, which is always huge in any threat detection.

So what is a GNN, a graph neural network? A GNN is a machine learning model, based on deep learning like any other deep model, but the new thing is that it learns from connected data, and it can uncover hidden patterns because it learns from a node's neighbors and the relationships between them. Nodes in a graph can represent any entity or object — users, access logs, even a row in a log, anything you want. Edges are relationships — for example, if two users are friends, or if they talk on Microsoft Teams, you can create an edge between those users, and so on for other activities.

So how does it actually learn those node representations? Through message passing. Each node passes its feature vector — the "message" here is a feature vector — to all of its neighbors, and all of its neighbors pass their feature vectors back. Then each node aggregates the messages, the feature vectors it got from its different neighbors, and uses them to update its own feature vector. It's that simple. If you have six neighbors, you get six feature vectors from six different neighbors, you use some aggregation technique — say, a mean or a sum — to combine them, and then you use that aggregated vector to update your original feature vector. And you can repeat this for many iterations — you could do it a thousand times, depending on how much compute you have and how good the results need to be.
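The message-passing loop just described can be sketched in a few lines of plain Python. This is an illustrative toy, not any particular GNN library's API — the node names, feature values, and the mixing weight `alpha` are all made up:

```python
# One round of toy message passing: each node averages its neighbors'
# feature vectors (aggregate) and blends the result into its own vector
# (update). Real GNN layers use learned weight matrices instead.

def message_passing_round(features, neighbors, alpha=0.5):
    """features: {node: [float, ...]}, neighbors: {node: [node, ...]}.
    alpha controls how much of the aggregated message is mixed in."""
    updated = {}
    for node, own in features.items():
        nbrs = neighbors.get(node, [])
        if not nbrs:
            updated[node] = own[:]  # isolated node keeps its own features
            continue
        # Aggregate: element-wise mean of the neighbors' feature vectors
        agg = [sum(features[n][i] for n in nbrs) / len(nbrs)
               for i in range(len(own))]
        # Update: blend own vector with the aggregated message
        updated[node] = [(1 - alpha) * o + alpha * a
                         for o, a in zip(own, agg)]
    return updated

# A tiny three-user graph with two-dimensional behavior features
features = {"alice": [1.0, 0.0], "bob": [0.0, 1.0], "carol": [1.0, 1.0]}
neighbors = {"alice": ["bob", "carol"], "bob": ["alice"], "carol": ["alice"]}

for _ in range(3):  # the number of rounds is a hyperparameter
    features = message_passing_round(features, neighbors)
```

Each round lets information flow one hop further, so the number of rounds controls how much of its neighborhood a node can "see". In practice, libraries such as PyTorch Geometric implement the same aggregate-and-update pattern with trainable weights.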
Because GNNs can give really amazing results on graph data, they're being used a lot in social network analysis, drug discovery, knowledge graph completion, and recommender systems — Amazon has been using them for a long time (not forever, but a long time), and Pinterest uses them. I know a lot of drug companies are using GNNs now to find new drugs or connections between different drugs. Social network analysis has always been a huge part of it — Twitter published a lot of papers about GNNs in the past, and I'm sure Meta and other social network companies are using them too.

So let's compare GNNs with traditional neural networks. Input structure: for a GNN, the input is a graph of any size — a million nodes, a billion nodes, whatever you like — but for a traditional neural network it's an array, maybe with 100 features at every index; a grid-like input, not a graph. Relationships: a GNN model can learn from the relationships between different nodes, but a traditional neural network assumes independence most of the time. (There are models like LSTMs that don't assume independence, since they learn from temporal, time-series data — but a lot of the time independence is assumed.)

Node-level tasks: you can classify a node — if a user is a node, you can classify the user, or a device, or anything that's a node. You can do node regression, and node clustering — for example, if you're trying to find a rumor network, which is pretty common on Reddit or Twitter, or a spam network (people have probably seen a lot of Bitcoin spam online), you can use node clustering to find those networks. In traditional neural networks it's only sample-level classification or regression, so it's hard to do that kind of thing — again, because they assume independence.

Edge-level tasks: you can predict a link between two drugs — whether they're similar, or whether they can treat the same disease — or maybe predict a link between two spammers on a social network. Edge classification is pretty similar: you can classify an edge, for example whether two IP addresses are related to an attack or not. With a traditional network you can't really do any of that. Graph-level tasks: you can literally classify the whole graph — for example, you can model a drug molecule as a graph and classify the whole molecule, say, whether it will treat a given disease or not, and you can set up any graph-level problem like that. Similarly, you can do graph regression; in a traditional neural network, you can't do any of that.

Permutation invariance: a great thing about graphs is that if you feed the graph in in a different order, the result is still the same — but in a traditional network, if you change the input ordering, maybe putting what was at index 10 at index 0, the results can be different. That's a big one. And interpretability — everyone is talking about this with LLMs, how you have no idea what's going on inside them, the black-box thing. I'm not saying GNNs are not a
black box — they probably are, most of the time — but you can use important nodes or edges to get some kind of context. That's hard to do in a traditional network; I know a lot of work has been done there, and I'm generalizing, but it's still very hard to interpret a neural network. That's a harsh truth, but it's true. So that was the comparison between GNNs and traditional neural networks.

Let's talk about the milestones. The initial concepts were formed in 2005. Then spectral approaches came in 2013, and spatial methods — where you actually work with the whole graph — came in 2016. 2017 was a really important year, because the graph convolutional network (GCN) came out; it's still used by a lot of different organizations, it's a really important model, and a lot of later models are built on top of it. Then the graph attention network came out in 2018 — instead of learning from all the nodes, it chooses the important nodes in a neighborhood and learns from those, and that gave some really great results; it's another really important model in GNNs. After that, GNNs started being used in the different applications we talked about, and advancements are still going on. We still need to figure out how to scale to billion-node graphs — and maybe in the future we'll get to a trillion nodes too, who knows; data is growing at an insane rate — so scalability is still a big research direction.

So those are the milestones in GNNs. If you look at GNN publications in papers, they've skyrocketed — not skyrocketed in a literal sense, but there are a lot more publications now compared to 2013 and 2015. And at the important conferences: in 2015 there were barely a couple of publications, but last year — it's 2024, right — last year
there have been a lot of publications at the different important machine learning conferences, so a lot of people are actually researching and using GNNs now.

That was the introduction to GNNs; now let's talk about how we can use GNNs for insider threat detection — what a detection workflow can look like. First, data collection: you have to have data — that's the demanding part of any machine learning model. Then you have to build a graph out of that data, and that's a really important step with GNNs: how are you going to model a graph — model the relationships in your organization between users, or between devices, or whatever you're trying to classify? That's where you really have to think hard. For example, there could be a relationship between two users if they're emailing each other, or messaging each other on Microsoft Teams or Slack or any other messaging software. Or you can think about it the other way: if two users have similar behavior, you can create an edge between them because they're doing similar things. You really have to think about this, because some users communicate with very few people and end up isolated — if a user has only two neighbors, you're probably not going to learn a lot about that user — so you have to figure out ways to make connections for those users, or for isolated devices, and represent them well enough in the graph structure.

Next, extract the node and edge features — that part is like a normal machine learning flow. What features can a device or a user have? Maybe how many times they accessed a certain website. You could build a list of websites you think are risky — WikiLeaks, or other sites where people can upload company data; I'm just making examples up, but sites like that — and count how many times a user accessed them. Or how many times a user talks to different users, how many emails they send to external people. You really have to think hard about features, for a GNN or any deep learning framework.

Once you have that, it's all about selecting the best graph neural network model. In most deep learning problems, what you do is look at published papers — find ones that solved a similar problem and look at their results — and that's how you narrow down the models, because there are thousands of GNN models (I'm sure there are millions — I'm exaggerating, but there are a lot) and you can't try all of them. Look at your problem, read the literature, and see which model fits your problem best.

Then train the model. You've got to have compute — it's probably going to be really hard to train on your Intel processor. I mean, if you only have 100 people in your organization maybe it's easy, but then it's probably not going to learn a lot; you really need compute for training. And finally, once you're done: detect the anomaly. That's the basic detection workflow. It can be extended — you can add more steps, like hyperparameter tuning, which comes in when training the model: which hyperparameters are you going to use, and so on. There could
be a lot of different things added here.

So now let's look at how a graph convolutional network is used to detect insider threats — we'll try to fit it into the same workflow I showed you before. Data collection: the data was the CMU CERT version 4.2 dataset, an open-source insider threat detection dataset — you can go to the Carnegie Mellon website and get access to it. Graph construction: the graph is a user behavior interaction graph — I think the features are the behaviors of the users, plus how users interact with each other; basically, in this case the nodes are users, and we'll talk more about that in the later slides. Then extract the node and edge features: user features are extracted — we'll go through those in a bit. Select the best GNN model: a GCN is used here. Finally, train the model and detect the anomaly.

Now let's talk about the dataset used here. It's from CMU, Carnegie Mellon University, and it simulates the activity logs of 1,000 users from January 2010 to May 2011. The types of logs collected were logon/logoff, email communication, file access, web browsing, and device usage. If you look at it, you can actually get all of these logs in your own organization too — whatever SIEM you all use, you can probably find something similar.

So how is the user behavior interaction graph formed? Nodes are very simple: each user is a node in the network — each employee, contractor, whoever you have. Edges here come in two different kinds, and they represent interactions or similarities between users. The first kind is very simple, a direct connection: if two users are talking over email, Microsoft Teams, or any other communication platform, you create an edge between those two users. But as I was saying about isolation, some people just don't talk to each other — not everyone talks to everyone; if you have 100,000 people in your organization, that's not going to happen — so you'd have a lot of isolated nodes, maybe nodes with only two or three neighbors, and you don't want that. For that, a similarity-based connection is built: take the behavioral activity of all the users, use cosine similarity — a statistical technique — to find the similarity between different users' behaviors, and if the similarity is more than 50%, create an edge between those two users. That's kind of an innovative way to create a graph, and you can think of different ways too. It looks simple in hindsight, but first you really have to think about it — so that's how the graph was built in this case.

Now let's talk about the user features. Logon/logoff: what times they log on and log off, off-hours logins — the times they don't normally work — and the devices used to log in: mobile devices, personal devices, company devices. Device usage: again, what kinds of devices they use, connections made to different devices, off-hours activity on a device. File access: what kinds of files they're accessing, and whether they access those files in off hours or during office hours. Email: who they're sending emails to — internal people or external — the size of the email, sentiment analysis (which is super easy now), the topic of the email, things like that. Web browsing: what kinds of pages they're browsing — are they looking at WikiLeaks-type sites where you can upload insider information, or keylogger sites — and the sentiment of their browsing. So those are the features used in this case.

Now let's look at the model architecture — it's very straightforward. You have all the logs, and you create a graph from those logs; each node carries all the different features of a user. Then you put it into a GCN model and message passing happens: all the neighbors send their messages — which just means their features — to each user. You aggregate all those features using some aggregation technique (there are a lot of different ones in the literature — mean, sum, and so on), and you get a new feature vector: the new user feature, or node feature, for that user. And you can do this as many times as you want — 10 times, 50 times; it's actually a hyperparameter here. Finally, once you've learned all these new node representations, you take all of them and pass them through a fully connected layer, and use a softmax activation to figure out which users are malicious and which are not. In this architecture you just put everything through one fully connected layer and a softmax activation, but in other cases you can ensemble models — you could run an LSTM after learning those new node representations, or maybe use ten fully connected layers; it really depends on how you want to build your model.

So that was the model architecture, and the results were pretty good. It was compared against different baselines — random forest, SVM, logistic regression, CNN — and it beat all the different baselines here. The precision was 95%, which is very important in insider threat detection — I would say more important than accuracy — and it beat the other benchmarks on accuracy as well as recall. So I think it's pretty straightforward from this comparison that the GNN performed better than the other baselines.
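To make the case study concrete, here is a minimal pure-Python sketch of the two distinctive steps just described: cosine-similarity edges between user behavior vectors (threshold 0.5, i.e. the "more than 50%" rule), and a per-user softmax score. Everything here is illustrative — the users, the feature values, and the hand-set weight matrix stand in for a trained GCN and its fully connected output layer:

```python
import math

def cosine(u, v):
    """Cosine similarity between two behavior vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-user behavior vectors, e.g.
# [off-hours logins, emails to external addresses, files accessed]
behavior = {
    "alice": [1.0, 2.0, 3.0],
    "bob":   [1.1, 2.1, 2.9],   # behaves very much like alice
    "carol": [5.0, 0.0, 0.0],   # behaves differently
}
users = sorted(behavior)

# Similarity-based edges: connect users whose behavior vectors have
# cosine similarity above 0.5 (direct email/chat edges would be added too).
edges = {(a, b) for a in users for b in users
         if a < b and cosine(behavior[a], behavior[b]) > 0.5}

# Classification head: a hand-set linear layer plus softmax, standing in
# for the trained model's fully connected layer. Weights are made up.
W = [[0.1, -0.2, 0.05],   # weights producing the "benign" logit
     [0.2, -0.1, 0.10]]   # weights producing the "malicious" logit
prob_malicious = {}
for u in users:
    logits = [sum(w * x for w, x in zip(row, behavior[u])) for row in W]
    prob_malicious[u] = softmax(logits)[1]

# The alert threshold is tunable: 0.5 is the naive cutoff, but for
# insider-threat flagging you may want a much higher bar, such as 0.8.
flagged = [u for u in users if prob_malicious[u] > 0.8]
```

Raising the threshold trades recall for precision — which, given how costly a false accusation is in this setting, is usually the right trade for insider-threat alerts.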
Now let's look at some challenges and future directions. It's really challenging to build any machine learning model here — I have to be real with you all. First of all, malicious data is not out there: the data we saw was simulated, and real-world malicious data is very scarce. Even if a company has it, they're not going to share it with you, because it's their own internal data. And even if you have data, you then have to label that data, which is really tough — so we have to do something about that, which we'll talk about on the next slide. Evolving user behavior: user behavior changes over time, and cybersecurity threats are always evolving, so that's really important too. Scalability: as I was saying, these networks can be huge — there could be millions or billions of edges between all the users — and that can make any graph neural network really hard to train for a normal organization. And incorporating low-interaction users: as I talked about before, some users are very, very isolated — how are you going to make them part of the larger graph? You have to think about that too. So those are the challenges.

Now, future directions. Real-time detection systems: it would be great if you could detect insider threats in real time and save all the money and time it takes to contain them. Handling data imbalance: what do we do when we have very few labels, or none? We can use unsupervised learning, where you don't need labels — clustering techniques, for example — or few-shot learning, to build a model from the small amount of labels you have. Scalable architectures: I think everyone is researching this, but a lot still needs to be done — we need scalable architectures for any machine learning problem. LLMs actually did a pretty good job of shedding light on this, and I think a lot of people are working in that direction, but I sometimes feel a lot of other models get put on the back burner, because people think LLMs have more usage compared to other machine learning models — so the architectures are often built with LLMs and Transformer models in mind, not GNNs or other deep learning models. And explainability of AI is huge for anything out there, but for insider threat detection it's definitely going to be huge, because it's really hard to flag anyone for malicious activity — that's why insider threats are so hard to talk about in any organization — so explainable AI is super important here too.

I think I finished in almost 40 minutes, so y'all can ask me questions.

Q: Thanks for this material, it's really great. One of the things you said is that you're using a softmax at the end of the layer, which seems like it produces a ranked score of the users. Do you have any advice or suggestions about where the cutoff is — how do you identify what's considered an insider
in that ranking?

A: Great question. It doesn't spit out a rank; it actually spits out a probability between 0 and 1. A lot of the time people say: if it's more than 0.5 you can call it malicious, and less than 0.5 non-malicious — or the other way around, depending on your problem. But you'll probably have to tune that threshold sometimes — maybe you only classify someone as malicious if the model says so with 80% probability. Especially in these kinds of tasks, as I said, it's really hard to classify anyone as malicious, so maybe you need a really high probability before classifying anyone as malicious. It really depends on your problem and what you're trying to do.

Q: Just one more question — and you kind of mentioned this. Sometimes the behavior has temporal dependencies, like someone downloaded a thing and then uploaded it to a different place, where it has to happen in order. And you mentioned ensemble methods a little — do you have any more specific ideas about what order, or what kinds of models, you might use?

A: Off the top of my head: you can create a graph and learn all the embeddings using the neighbors, then also throw in an LSTM model, which is really good at learning sequential data, and finally combine the outputs of those two models and use that to predict the final output. That's probably how I'd do it.

Q: Thank you very much again for this material. Do you have a threshold for the minimum amount of data needed to create an accurate model?

A: Not really — I would say at least 100,000 data points, but yeah.

Q: Okay, thank you. I was looking at the last few steps, when you were training the model, and I think you said you used a
fully connected layer — and there's one area that's not fully connected: in the graph with all the edges, there are edges that could potentially be there but aren't. Is that what you meant by fully connected layer?

A: No — a fully connected layer is just a linear layer, with a weight matrix. Basically, once you're done learning all the embeddings — all the node embeddings — you extract the node embedding from each user, put them into an array, and pass that array to a fully connected layer with weights matching the number of users. So it comes after the representation learning with the graph convolutional network.

Q: Thank you. You said your model was 95% accurate — is that 95% accurate at detecting the anomalies in the dataset, the flagged malicious ones?

A: Just against the labels it had — malicious or not malicious.

Q: And did your dataset have artificially boosted malicious content in it, or what was the percentage of malicious activity?

A: It was created by CMU, and I think there was a really, really small number of malicious logs in there — I don't know the exact number off the top of my head, but it was really small compared to the non-malicious ones. That was kind of the real-world part of it, but it was an artificial dataset.

Thank you, everyone.