
Hello everybody, my name is Thomas Deutman, very excited to be here. I'm a security research team lead at Cato Networks. Cato is a security company, actually a SASE company: our goal is to converge both network and security and to deliver this product to organizations, to enterprises around the world, converging network and security into a single cloud-based platform. Today we are going to talk about threat intelligence, and more specifically how we build our threat intelligence pipeline and use machine learning feedback during the process. Let's go.

We are going to cover several ideas. First, we are going to introduce some key terms about threat intelligence.
Later on we are going to talk about the challenges; as you know, there are many challenges in this process. The main focus of the presentation is the architecture of our pipeline, how we actually do it at Cato, and we will also introduce the machine learning feedback: how we use feedback from the production environment to optimize our process. Finally, we are going to go over some case studies and conclusions.

So, what is threat intelligence? I guess most of you know threat intelligence. To understand it, we have to talk about three key terms. The first one is the IOC, the indicator of compromise.
The most explainable definition of an IOC is any observable item in a relevant environment. Let's ground that a bit: it can be any artifact in the network or on the host that tells us the host or the network has potentially been compromised. It can be a suspicious domain name, a suspicious IP address or IP range, or a user agent; an IOC can be anything that tells us, with some confidence, that we have been compromised. Our goal is to collect most of the IOCs in the world and to prevent our customers from accessing network flows that contain these IOCs.
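To make the term concrete, here is a minimal Python sketch of how an IOC record might be represented once collected; the field names and types are illustrative, not Cato's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class IocType(Enum):
    DOMAIN = "domain"
    IP = "ip"
    URL = "url"
    USER_AGENT = "user_agent"

@dataclass
class Ioc:
    value: str          # e.g. "evil-updates.example" or "203.0.113.7"
    ioc_type: IocType   # what kind of artifact this is
    source: str         # which feed or vendor reported it
    confidence: float   # 0.0-1.0, how sure the source is

# A hypothetical IP-type IOC as it might arrive from a paid feed:
ioc = Ioc("203.0.113.7", IocType.IP, source="vendor-feed-x", confidence=0.9)
print(ioc)
```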
But there are millions and billions of IOCs in the world; we can't fetch them one by one. So the next term is the threat intelligence feed. A threat intelligence feed is a list, an aggregation of many IOCs, and we get these feeds from many kinds of sources. We can get them for free, from GitHub or from public lists on the internet, and we can also pay vendors whose actual business is supplying threat intelligence. A threat intelligence feed is just a list of IOCs: it can be a list of domains, a list of IPs, a list of URIs, anything.
The last key term I want to talk about is the TIP, the threat intelligence platform. The goal of a TIP is to manage multiple feeds. So the first question we can ask is: why do we need a TIP if we already have the feeds we get from somewhere? There are two goals for a TIP. The first one is to manage all of this operation: we get these threat intelligence feeds, and then we have to deliver them to the production environment, deploy them, and have some cycle for deploying them. The other, and actually the most important reason, is to clear false positives.
These threat intelligence feeds are not holy. We know that many times we have false positives; it can be a false positive on a domain or an IP address, and shipping an IOC which is a false positive may cause damage in production, because customers will complain, and all of those not-nice things. So clearing false positives is actually the main goal of a TIP.

Let's jump to the end. This is our reputation service at Cato, a TIP in action. What we have here is our main page, very simple: we enter a potential IOC, in this case an IP address, and we just want to get a bottom-line verdict about what we know about it: is it malicious, suspicious, or unknown? In this case we got a malicious score, and that's the main goal of the entire process.
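As a hedged sketch of that bottom-line verdict step: a scoring model emits a 0-to-1 score, and the UI maps it to one of the three labels. The thresholds below are invented for illustration, not Cato's real cutoffs.

```python
def verdict(score: float) -> str:
    """Map a 0-1 maliciousness score to the verdict shown in the UI."""
    if score >= 0.8:        # hypothetical cutoff for "malicious"
        return "malicious"
    if score >= 0.5:        # hypothetical cutoff for "suspicious"
        return "suspicious"
    return "unknown"

print(verdict(0.93))  # -> malicious
```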
Here is another example, from the Cato XDR product; XDR stands for extended detection and response. Here we see how the TIP we have built is used in production. We see an incident that the customer got about a CVE exploitation attempt, and in the graph and in the small table we see two targets that our TIP flagged as malicious, along with the frequency of requests to these targets. So this is an example of how we can use our TIP in a production environment, in incidents we report to customers.
Okay, so far we've covered the basics; now let's step back and talk about the challenges of building that TIP from scratch. I think the first challenge, the most important one, is the volume of the data. As I said, we have tens of millions, billions actually, of IOCs in the wild, and the first question is how we choose: which source to take, which GitHub page to track over time. This is a very hard question, and we're going to answer it. In addition to vendors and GitHub, we have blogs, security forums, many, many sources we may want to scrape and fetch.
So we need to decide, and the most difficult thing is deciding what is best. The second challenge is what I call response time. If we have the best threat intelligence in the world but our response time is three days, it's not so good. We want to publish new IOCs to the production environment as soon as possible, because the malware authors are not waiting for us; we need to be very responsive in the field. That's the challenge: how to keep track of so many malware campaigns in the wild.
And the third challenge, which I find the most difficult, is the ever-present trade-off between false positives and false negatives; we have to put a cutoff somewhere. With too many false negatives our customers will be at risk, and with too many false positives our customers will hate us, because we are spamming them with blocks and false alarms.

Now we move on to the high-level architecture of the TIP at Cato. We have above 250 IOC feeds we get from multiple sources; some of them are free, some come with a paid license, and we get those IOC lists into our TIP.
The first layer is a filtering layer: if one of these IOC providers flags google.com as a malicious domain, we must filter it out; we can't deliver it to production, because our customers wouldn't be able to access Google. So the first step is this filter layer. The next layer is the machine learning model; we'll talk about it later, but this machine learning model, which encompasses many security features and context features about the IOC, finally outputs the bottom-line score we saw before. Based on this score we decide whether we block this IOC in production or not.
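A minimal sketch of that first filtering layer, assuming we maintain a never-block set seeded from something like a top-sites list; the set and function names are ours, not Cato's.

```python
# Hypothetical never-block set, e.g. seeded from a top-sites list.
NEVER_BLOCK = {"google.com", "microsoft.com", "amazon.com"}

def filter_feed(iocs: list[str]) -> list[str]:
    """Drop obvious false positives before they reach the scoring model."""
    return [ioc for ioc in iocs if ioc.lower() not in NEVER_BLOCK]

print(filter_feed(["google.com", "evil-updates.example"]))
# -> ['evil-updates.example']
```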
Finally, all the IOCs that we found to be malicious go to the Cato Cloud, which is the production environment, and block customers' traffic in real time. And after these IOCs get to production, we have the back-propagation step: we actually get feedback. We'll talk later about the kinds of feedback we get.

We've talked about aggregating data feeds, but apart from the IOC feeds, which we get from third parties somewhere on the internet, we have the internal data we work with at Cato. This internal data is used most of the time to filter out false positives.
The first internal data source we use is the network flows data warehouse. Since Cato is a SASE company, we see all the network requests of our customers; we have the data of all of our customers' requests to the internet, and we can understand quite clearly what communication to an IOC looks like. Communication to a C2 server most of the time looks a bit different from communication to a well-known website, and based on this data warehouse we can derive many insights about a potential IOC. The next important data source is popularity.
Popularity means how popular this potential IOC is in the wild. If I look up google.com, I will probably see tons of requests to this domain, because it's a very popular domain, but malicious or suspicious IOCs will usually have much lower popularity records in this data source. This is quite an efficient and simple metric for filtering out false positives, most of the time, but not always.
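As a sketch of how this popularity check might look, assuming a lookup of request counts built from the network flows data warehouse; the table, cutoff, and names are all illustrative.

```python
# Hypothetical request counts aggregated from the flows data warehouse.
POPULARITY = {"google.com": 25_000_000, "evil-updates.example": 12}

POPULARITY_CUTOFF = 100_000  # invented threshold for "very popular"

def looks_like_false_positive(ioc: str) -> bool:
    """Very popular destinations are much more likely to be false positives."""
    return POPULARITY.get(ioc, 0) >= POPULARITY_CUTOFF

print(looks_like_false_positive("google.com"))            # True
print(looks_like_false_positive("evil-updates.example"))  # False
```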
The last internal data source we have at Cato is security incidents. As you remember, we saw before the page where threat intelligence is used in the XDR product; we can actually see in our internal records how each IOC has been involved in security incidents in the wild. If we find that a potential IOC appears in multiple security incidents, it may indicate that this is a real IOC, a true positive, and we should continue to block it.

We've mentioned the machine learning model several times, so let's deep-dive into it. As I said, this is the second layer: after we filter the IOCs, we want to get a bottom-line score. This machine learning model has many kinds of data sources; most of the time they are third parties.
Their input is just the IOC, say a domain name, and their output is generally a big JSON that describes many aspects of this IOC. For domains it can be the registration date, the registrar, and many other details. We take all of this information together and convert it to a numeric vector, since machine learning models are most comfortable working with numeric vectors. After the conversion, we feed the vector to the machine learning model, which in our case is a random forest, which we find is the most plug-and-play model for such cases. The model output is just a 0-to-1 verdict about the maliciousness of this IOC.
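Here is a hedged sketch of that step: flatten the enrichment JSON into a fixed numeric vector and score it with a random forest. The feature names and the toy training data are made up; only the overall shape (JSON to vector to 0-to-1 score) follows the talk.

```python
from sklearn.ensemble import RandomForestClassifier

# Invented feature set; a real pipeline would use many more features.
FEATURES = ["domain_age_days", "num_subdomains", "popularity_rank", "tls_valid"]

def to_vector(enrichment: dict) -> list:
    """Flatten the third-party enrichment JSON into a fixed numeric vector."""
    return [float(enrichment.get(f, 0)) for f in FEATURES]

# Toy training rows; labels: 1 = malicious, 0 = benign.
X = [[3, 5, 0, 0], [4000, 1, 1, 1], [10, 4, 0, 0], [2500, 1, 1, 1]]
y = [1, 0, 1, 0]
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

candidate = {"domain_age_days": 7, "num_subdomains": 6,
             "popularity_rank": 0, "tls_valid": 0}
score = model.predict_proba([to_vector(candidate)])[0][1]  # P(malicious)
print(f"maliciousness score: {score:.2f}")
```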
Once we have this maliciousness score, we publish based on the verdict: if the model finds the IOC malicious, we publish it to production for blocking whenever a customer tries to access it. So far we've covered the whole pipeline up to the machine learning model: all of the IOCs that passed the filters went to production in block mode. Now let's talk about the reverse process, the feedback. Our random forest model made a verdict about the IOC, and now we turn to the field feedback. I'm going to talk about three kinds of feedback.
The first one is popularity; we already mentioned it. If I find this IOC malicious, I can count how many blocks we see on it. If we find that many customers and many hosts were blocked on this one, it may be a kind of alarm, because most of the time malicious or suspicious IOCs are not blocked that much. That's a simple metric that can tell us where we stand. The second one, which we already talked about, is security incidents: how many security incidents did we find involving this IOC?
Security incidents carry much more weight than popularity alone, because they contain actual exploitation attempts. So if we see that this IOC was involved in many exploitation attempts, we get a green light that maybe we are in the right direction and this IOC should continue to be blocked. And the third one is the most painful: user complaints. A customer can complain: why did you block this IP address? It's not malicious, I use it regularly in my organization. We get user complaints or escalations on a daily basis, and we tune our pipeline according to them.

The last thing, which we are going to talk about extensively, is the IOC feed score.
As we mentioned before, we don't get IOCs one by one; we get lists of IOCs, and generally these lists are generated by the same people, the same organization. Evaluating such an IOC list as a whole generally tells us the quality of the source. So let me explain how we calculate this feed score: sometimes we are paying a third-party source for these IOCs, and we want to understand the quality of what we're getting. I'm going to talk about six metrics which we find tell us whether an IOC feed has high quality or not.
The first one, maybe the simplest, is freshness. If we have a great feed but it's updated once a month, it's nice, but it's not enough. As I said, we want to be very responsive in the field; we can't wait that long, and we want a source that is updated on a daily or even hourly basis. The second one is popularity: if I have an IOC feed that contains many high-popularity domains, that's again an alarm; I may suspect that this feed has many false positives, and I won't pay for it. The third metric relates to the manual allowlist we keep at Cato.
When customers complain, we usually insert the IOCs from their complaints into an allowlist. So if we are trying to integrate a new threat intelligence feed and we see that many of the IOCs in this feed appear in the allowlist, I may say: this feed is going to make a lot of trouble for us, let's not pay for it. The fourth metric is overlap, and overlap is a bit tricky: it measures how much this feed overlaps with the other feeds we already know.
This is an interesting metric. If we have a feed that correlates or overlaps massively with other feeds, say 90 percent, I would ask what the additional value of this feed is, and why I should pay for it when other feeds already cover it. That's one side. On the other side, if a new threat intelligence feed has no overlap with any other feed, that's quite suspicious, because feeds usually share at least a few percent of their items with other feeds, which gives you a sense that the feed is tracking the same kinds of IOCs. So I would say the overlap should not be 100 percent, but it also should not be zero; we want it somewhere in the middle.
The fifth metric I'm going to talk about is security incidents: as I said, if this feed helped us find many security incidents, it may indicate that it has high quality and that the feed is a good ROI. And the last metric is total item count. Like freshness, this one is quite simple: I won't pay for a feed of ten IOCs; I generally want a few thousand at minimum, and the feeds are usually at a much higher scale, a few millions, so it's something I can feel good about paying for.
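Putting the six metrics together, here is a sketch of a composite feed-quality score. The weights, normalizations, and cutoffs are invented; only the list of metrics comes from the talk.

```python
def feed_score(
    hours_since_update: float,
    share_high_popularity: float,   # fraction of IOCs that are very popular
    share_on_allowlist: float,      # fraction hitting our manual allowlist
    overlap_with_known: float,      # fraction also present in existing feeds
    incidents_helped: int,          # incidents this feed contributed to
    total_items: int,
) -> float:
    """Composite 0-1 quality score over the six metrics (invented weights)."""
    freshness = 1.0 if hours_since_update <= 24 else 24 / hours_since_update
    # Overlap should be neither ~0 (suspicious) nor ~1 (redundant):
    overlap_ok = max(1.0 - abs(overlap_with_known - 0.3) / 0.7, 0.0)
    size_ok = min(total_items / 10_000, 1.0)
    incidents_ok = min(incidents_helped / 50, 1.0)
    return (
        0.2 * freshness
        + 0.2 * (1 - share_high_popularity)
        + 0.2 * (1 - share_on_allowlist)
        + 0.15 * overlap_ok
        + 0.15 * incidents_ok
        + 0.1 * size_ok
    )

print(round(feed_score(12, 0.01, 0.00, 0.25, 40, 2_000_000), 2))
```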
So, we've talked about the machine learning feedback. Once we get the feedback, one of the first questions we should ask is: should we retrain or fine-tune the model, say every few days or weeks? Generally the answer is yes, but this process of retraining the random forest model has some pros and cons; let's touch on them briefly. The pros, I think, are quite straightforward: we got feedback from the field, we have additional data, and we want to retrain the model, which can improve our results; fewer complaints from the field, fewer complaints from the customers, sounds good, no? And our precision, accuracy, all the metrics should improve.
But this process also has some cons. The first one is resources: retraining the model from scratch requires resources, of course, cost, et cetera. The second, and I think even more important, is overfitting. If this week I got some feedback from the field, well, ultimately people produce this feedback, and if it contains misleading information, like misclassified IOCs, benign items labeled as malicious, what I'm trying to say is that it may poison our model with wrong data. And this is a production model, so poisoning it would have severe consequences for our performance in production.
And of course it's time-consuming: this training doesn't take one minute. The whole process also requires a data scientist who will do the retraining, and it usually takes several hours to prepare the data and do everything; the training itself also takes time. So much for fine-tuning the model.
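One common way to mitigate the poisoning and overfitting risks when retraining on field feedback is to gate promotion of the new model on a trusted, manually verified holdout set. The sketch below is our illustration of that idea, not Cato's actual procedure.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score

def retrain_with_guard(old_model, X_new, y_new, X_trusted, y_trusted,
                       max_precision_drop=0.01):
    """Retrain on fresh feedback, but keep the old model if the candidate
    regresses on a manually verified holdout set, which is the symptom
    poisoned feedback labels would typically cause."""
    candidate = RandomForestClassifier(n_estimators=100, random_state=0)
    candidate.fit(X_new, y_new)
    old_p = precision_score(y_trusted, old_model.predict(X_trusted))
    new_p = precision_score(y_trusted, candidate.predict(X_trusted))
    return candidate if new_p >= old_p - max_precision_drop else old_model
```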
Next, from our experience building this threat intelligence system, I want to go over several conclusions we reached during the process. The first one: always use automatic filtering. The system we built filters out, every day, millions of IOCs which it finds to be false positives, and 99 percent of the false positives are filtered by this automatic mechanism, which is great. Another thing is metrics: we strongly advise that everyone who builds such a system have some metrics to understand how it performs in production: how many complaints we get from customers, how many filterings we have. If one day we see tons of filterings by the system, much higher than usual, it may indicate that something went wrong.
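Following that metrics advice, one simple guard is to flag any day whose filter count is an outlier against recent history; the function and numbers below are illustrative.

```python
from statistics import mean, stdev

def filtering_spike(history: list, today: int, z: float = 3.0) -> bool:
    """True if today's filtered-IOC count is far above the recent baseline."""
    mu, sigma = mean(history), stdev(history)
    return today > mu + z * max(sigma, 1.0)

history = [1_100_000, 1_050_000, 980_000, 1_200_000, 1_020_000]
print(filtering_spike(history, 5_000_000))  # True -> investigate the pipeline
```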
The final conclusion, which is always true, is to always have some manual verification of the data. We want an analyst, someone who checks from time to time the data that the system outputs, to see that we are okay, that we are aligned, that we didn't change things too much.

Now let's go briefly over the architecture of our system. On the left we see the IOC feeds; as an example, we have hundreds of IOC feeds, and each of them has millions of IOCs. Next, these feeds flow through the scraping manager; to fetch feeds, you usually build scrapers. After the scraper we still have a list of many, many IOCs, and we move to the enrichment level.
There, all of the IOCs go into a queue, and then the TI service fetches each of these IOCs and enriches it with the many third parties we talked about. So for every IOC we get a huge JSON that tells us essentially everything we can know about it: which servers it relates to, when it was registered, anything we can get. After the data augmentation we do with the third parties, we go into the random forest model we talked about, and this random forest model outputs the 0-to-1 verdict we talked about.
According to this verdict, we understand whether it goes to blocking in production or not. Then, of course, we also save this information in our database; we don't want to lose it, we paid for it, we invested effort in it. It also goes to blocking in the production environment, and from the production environment, as we said before, we finally get feedback from the field: we fine-tune our model, or we understand that an IOC should be entered into the allowlist and should not be blocked anymore. So far, that was our architecture.
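To tie the walkthrough together, here is a compressed, hypothetical sketch of the stages just described: scrape into a queue, enrich, score, then persist and block. Every function is a stub standing in for a real service.

```python
from queue import Queue

def scrape(feeds):
    """Scraping manager: flatten all feeds into one work queue."""
    q = Queue()
    for feed in feeds:
        for ioc in feed:
            q.put(ioc)
    return q

def enrich(ioc):
    """Stand-in for the TI service's third-party enrichment (a big JSON)."""
    return {"ioc": ioc, "domain_age_days": 7}

def score(enrichment):
    """Stand-in for the random forest; see the model sketch earlier."""
    return 0.9 if enrichment["domain_age_days"] < 30 else 0.1

def run(feeds, block_threshold=0.8):
    q = scrape(feeds)
    while not q.empty():
        ioc = q.get()
        s = score(enrich(ioc))
        action = "BLOCK" if s >= block_threshold else "allow"
        print(f"{ioc}: score={s:.2f} -> {action} (saved to DB)")

run([["evil-updates.example"], ["203.0.113.7"]])
```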
Now some thoughts about future directions for the world of TIPs. The first one is contextual analysis; some big words, but here is the idea. So far we have talked about IOCs in quite a static way: every IOC is quite binary, I can either block it or allowlist it. But many times we also want to understand the propagation process of an IOC in the network. Let's say on Sunday I saw this IOC on two hosts, but one week later the whole network started communicating with it. Seeing this propagation process can tell us a lot about the IOC, and it can also help us understand lateral movement, if it's a malicious IOC.
The second possible direction is collaborative TIPs. We introduced our TIP, but as you know there are many other TIPs in the world, some of which you can even purchase, so why not make a committee of TIPs to get a better conclusion from all of the TIPs together? It would be a kind of autonomous TIP: something that ranks IOCs according to their maliciousness across platforms, and maybe the top 10 percent we block, while for the rest we take another directive, things like this.
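As a speculative sketch of that committee idea: average the scores from several TIPs, rank, and block only the top slice. Everything here, including the 10 percent cutoff, is hypothetical.

```python
def committee_block_list(iocs, tips, block_fraction=0.1):
    """Rank IOCs by their average score across TIPs; return the top slice."""
    avg = {ioc: sum(t.get(ioc, 0.0) for t in tips) / len(tips) for ioc in iocs}
    ranked = sorted(avg, key=avg.get, reverse=True)
    cutoff = max(1, int(len(ranked) * block_fraction))
    return ranked[:cutoff]  # the slice we would block outright

tip_a = {"a.example": 0.9, "b.example": 0.2}
tip_b = {"a.example": 0.8, "b.example": 0.4}
print(committee_block_list(["a.example", "b.example"], [tip_a, tip_b]))
# -> ['a.example']
```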
That's all for today. We talked about the feedback loops in our TIP, the architecture of how we built our TIP at Cato, and some possible future directions for what a TIP can be. Thank you very much.

[Applause]
Awesome, thank you very much, it was a very interesting talk. We do have a little time for questions; any questions in the room? Hardway is going to provide the mic.

Hello, thank you very much for the talk. I have one question regarding timeliness. One of the challenges is, for example, an indicator like an IP address: it's a malicious IP now, and in 15 minutes it might not be a malicious IP anymore. How do you handle that?

That's actually a very good question, because we've encountered it several times. What we found is that blocking an IP in general is much more complex than blocking anything else.
What we do most of the time is block the combination of the IP and the domain together. In the case of shared hosting, most of the time we want to detect that the IP is shared across multiple domains, and then we generally won't block the IP itself; we'll try to block the communication by other IOC types, the domain or anything else.
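A hedged sketch of that answer: when an IP is shared across many domains, emit a combined IP-plus-domain rule instead of blocking the bare IP. The cutoff and rule format are invented.

```python
SHARED_HOSTING_MIN_DOMAINS = 5  # hypothetical cutoff for "shared" IPs

def block_rules(ip, malicious_domain, domains_on_ip):
    """Prefer (IP, domain) rules for shared-hosting IPs."""
    if domains_on_ip >= SHARED_HOSTING_MIN_DOMAINS:
        # Blocking a shared IP outright would break unrelated sites.
        return [("ip+domain", ip, malicious_domain)]
    return [("ip", ip)]

print(block_rules("203.0.113.7", "evil-updates.example", domains_on_ip=120))
# -> [('ip+domain', '203.0.113.7', 'evil-updates.example')]
```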
One quick question, maybe? Hello, yeah. If our customers are not interested in investing more in GPUs and machine learning, do you have any suggestions for achieving this in a cost-effective manner?

You mean whether training the model is cost-effective for the blocking? Actually, yes. This machine learning model has saved us a lot. Generally, since it's based on dozens of features, it's quite accurate, and we find it very cost-effective and also very easy to train, easy to build into this pipeline, because we use very straightforward features which are available to us, and training the model is very straightforward, because it's a very simple model, a random forest.
So yes, we strongly recommend adopting this option.

Okay, yeah, thank you. And that's it; I heard that you're around here for any further discussions. Okay, thanks again. Thank you.