
between security guys and business leaders up in the boardroom, and they always end up the same way: the security guy goes to the business leader and says, I need more money. Why? Because we need more security, a bigger budget, more infrastructure, better security software, and basically more people. The biggest bottleneck in cybersecurity right now is finding the right people to fill the open slots. And for the first time, in a survey just a couple of weeks ago, we see these two numbers: cyber leaders and business leaders practically share the same vision of how a cybersecurity incident could be
impacting their organization, and that's important. The next stats that came out of that survey, or report, involve the fact that about 44% of all organizations expect geopolitical risks to substantially affect their cybersecurity strategy. So everybody is worried; it's not just small and medium-sized businesses, and not just large organizations. Pretty much everyone expects the current geopolitical context to influence their cybersecurity strategy for the next two years, whether through a global rise in attack rates or a targeted attack against them specifically.

What's even more interesting is breaking that number down into how SMBs and how large organizations perceive that risk. About 48% of SMBs expect geopolitical instability to substantially influence the likelihood of confronting a catastrophic cyber event in the next two years, compared to 58% of large organizations. At first glance that's not a big difference, about ten points. But if you think about it, it's kind of misleading, because this perception does not match reality if you look at how data breaches actually occur and why they occur. I have a slide with more detail on that, and I'll come back to it.
So: we've seen how cybersecurity leaders and business leaders perceive the global threat landscape, and how SMBs and large organizations perceive the impact a cybersecurity incident would have on their organization. But what is the actual reality? If you look at our own threat telemetry, only about 30% of malware out there is compiled binaries, portable executables. The remaining 70% is non-PE: fileless threats, scripts, pretty much anything you can throw at an endpoint that bypasses traditional security solutions. That's interesting for the simple reason that if you break down the file types for non-PE malware, and then look at the most prevalent non-PE malware families, you'll see that PDFs, Python scripts, ZIP archives and the like are the file types most likely to be abused by fileless threats. Why? Very simple: PDFs support, for example, JavaScript; Office documents support VBA macros; you can run PowerShell out of a boatload of them.

According to our telemetry, some of the most prevalent families are PDF-based phishing lures. You've probably heard of these; they're mostly used in spear-phishing and phishing campaigns. You get that little email with an attachment saying, hey, you've just won, I don't know, a prize, click here to get your voucher, your redeem code, prove you're not a robot. As soon as you do that, because it's a PDF, it has the ability to execute additional code embedded inside it.

And remember the previous slide, where I told you how SMBs and large organizations perceive how they'll be impacted by a cyber attack? According to the recent IBM Cost of a Data Breach report, released I think a couple of months ago, about 83% of organizations suffered a data breach in 2022, 76% of SMBs were impacted by at least one cyber attack, and about 20% of breaches occurred because of a compromised business partner. What does that tell us? When 20% of breaches start with a breach at a business partner, we're basically talking about SMBs, small and medium businesses, which according to the survey do not perceive themselves to be at as high a risk as large organizations. The actual reality is that SMBs are routinely used as an initial attack vector for adversaries to pivot into the larger organizations they provide services to. Again, this is the main difference I wanted to point out: the way cybersecurity is perceived versus the way reality actually looks.
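As a toy illustration of why PDFs are such a handy carrier for phishing payloads: the file format defines name objects that let a document run code when opened. The sketch below is a deliberately naive triage heuristic, not a real scanner — real-world PDFs compress and obfuscate these markers inside object streams, so a production tool must actually parse and decode the file. The marker list and the sample bytes are illustrative assumptions, not an exhaustive rule set.

```python
def pdf_risk_markers(data: bytes) -> list[str]:
    """Return suspicious PDF name objects found in raw bytes.

    Crude triage only: matching raw markers misses anything
    hidden inside compressed or encoded object streams.
    """
    markers = [b"/JavaScript", b"/JS", b"/OpenAction",
               b"/AA", b"/Launch", b"/EmbeddedFile"]
    return [m.decode() for m in markers if m in data]

# Minimal hand-crafted fragment resembling a script-on-open PDF
sample = (b"%PDF-1.7\n1 0 obj\n"
          b"<< /OpenAction << /S /JavaScript /JS (app.alert('hi')) >> >>\n"
          b"endobj")
print(pdf_risk_markers(sample))
```

A clean invoice PDF would typically trip none of these markers, while a script-on-open lure trips several at once, which is what makes even this crude check useful as a first filter.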
Looking at reality: last year we saw a 95% increase in cloud exploitation cases, and that's a number presented in our recent Global Threat Report, published just yesterday. If you haven't had a chance to look at it, go ahead and download it; it has tons of information. We've also seen the number of adversaries that have picked up TTPs, tactics, techniques and procedures, to target cloud environments increase threefold in the past year.

So if you look at all this, what do you see? You're seeing more breaches. You're seeing SMBs used as an initial attack vector and as a way to pivot inside larger organizations, with around 20% of breaches coming in through partners. And you're seeing that most infrastructure, blame it on the pandemic if you want, is now cloud-based, and adversaries have shifted their focus to targeting cloud infrastructure because everybody is using it, whether an SMB or a large organization. These four numbers tell the story of the entire threat landscape as we see it today.

What does all of that translate into? Cost. If you're a business decision maker or a cybersecurity leader and you've had a data breach, the first thing you'll be thinking about is how much money you have to spend to make this problem go away, or how much you have to invest to make sure it doesn't happen again. The global cost of a data breach was recently estimated at about $4.35 million in 2022, roughly a 13% increase over the past couple of years. That may not sound like a lot, but in the US the cost of a data breach is considerably higher than the global average: $9.44 million.

Now, I mentioned we'd look at some examples. I said about 30% of malware out there is PE while the remaining 70% is not, so I'm going to use two examples, one for Windows and one for Mac, based on some of the most recent research we've published on our blog. If you've never checked it out, please go ahead and do it; we post a lot of interesting research there, whether adversary-related, data science, or threat research. It's a good place to find interesting stuff.
Let's take GuLoader first, which has been under active and constant development. The reason it turned out to be such a popular tool is that it's used as a downloader, bringing additional malware to the endpoint: ransomware, keyloggers, bots and whatnot. What's interesting is that very early in the kill chain it employs a series of tactics and techniques associated with fileless attacks. For example, it uses a Visual Basic dropper that writes a payload into a registry key, then uses PowerShell to unpack and execute that payload directly in memory, completely bypassing any file-based detection solution that works by analyzing files written to disk. This technique of daisy-chaining Visual Basic, PowerShell, WMI or any other scripting language has become widely adopted, whether you're talking about ransomware, a cryptocurrency miner, a downloader, or any other garden-variety malware. What that says is that if you're strictly thinking about detecting threats from a PE point of view, you're going to miss a lot of what's going on.

The second example I wanted to show you is that this is not a problem endemic to Windows; it's also present on macOS. On macOS you've got binary files and non-executable files, which pretty much covers everything to do with DMG files, packages, Office documents, applications, and the regular scripts: bash, Python, AppleScript. One of the most popular recent malware families abusing these non-executable files is EvilQuest, a well-known ransomware family that has been distributed via DMGs and packages and has been known to use a boatload of scripts, mostly to ensure persistence on the machine once it gets there.

And if you're thinking ransomware is not such a common problem for macOS users, remember there are a lot of folks out there who want to disable the built-in security mechanism on the Mac that stops you from installing software from third-party marketplaces; the moment they do that, they're exposing themselves to risk. We've also seen malvertising campaigns that rely heavily on fileless techniques, Shlayer being one of them. So the point is, regardless of whether it's Windows, Mac or Linux, that 30-to-70 ratio of PE to non-PE malware is roughly constant across platforms; you could almost say non-PE malware is platform-agnostic at this point.

Now, what's the solution to all this? I'm not here to scare anybody; scare tactics only go so far, and there's always a solution to a problem. The way we've approached it is by using machine learning and artificial intelligence.
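One way to hunt for the registry-staging trick from the GuLoader example above is to look for registry values that decode to an executable. This is a simplified, detection-side sketch under loud assumptions: real loaders also split, XOR or reverse the blob, and `fake_hive` is just an invented stand-in for registry data, not a real API.

```python
import base64
import binascii

def find_staged_payloads(reg_values: dict[str, str]) -> list[str]:
    """Flag registry values that look like base64-encoded PE payloads.

    Heuristic: decode each value and check for the 'MZ' DOS header
    that every Windows executable starts with.
    """
    hits = []
    for name, value in reg_values.items():
        try:
            blob = base64.b64decode(value, validate=True)
        except (binascii.Error, ValueError):
            continue  # not valid base64 -> not this staging pattern
        if blob.startswith(b"MZ"):
            hits.append(name)
    return hits

# Invented registry snapshot: one benign value, one staged payload
fake_hive = {
    "WallpaperStyle": "10",
    "UpdaterBlob": base64.b64encode(b"MZ\x90\x00" + b"\x00" * 64).decode(),
}
suspicious = find_staged_payloads(fake_hive)
print(suspicious)
```

The broader point from the talk stands regardless of this particular heuristic: if your detection only ever inspects files on disk, a payload parked in the registry and rehydrated by PowerShell never crosses its field of view.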
Broken down to its most basic concepts, machine learning pretty much involves the ability to analyze large volumes of data with speed and accuracy, the ability to uncover patterns in that data that may be hidden to the human eye, and ultimately the automation of repetitive manual tasks so the interesting detections surface on their own. If you want to sum up machine learning and artificial intelligence in a way your grandmother would understand, those are the three steps. And if you were explaining it to Marian over here: you have an input X, you create a function that describes the behavior, capturing various behavioral characteristics of the input, and the outcome is the ability to make a prediction based on the data set.

What AI and machine learning are not is three things. Number one, it's not intelligence; don't think Terminator, don't think Skynet, it's not self-aware. But it does provide algorithms and techniques for solving hard problems, problems that would take a lot of manpower and time to solve. Number two, it's not a silver bullet: you cannot use AI to solve every problem out there. What it can do, and it does this very well, is analyze large volumes of data with speed, scale and accuracy, something we humans are notoriously bad at. Number three, AI is not replacing analysts. There's a lot of talk that AI will take over everybody's jobs, and some jobs may well become obsolete, but that doesn't mean there isn't room to grow and expand into other areas. Think about it: ten years ago there was no such thing as a social media manager; with the rise of social media, we now have jobs built around it. So AI is not replacing analysts, but it can automate repetitive manual tasks; it can become a tool that helps you do your job better. I'm going to stop explaining machine learning here before I ruin it, and leave it to Marian to tell you how we implement machine learning across the breadth of our technologies at CrowdStrike.

Cool, thanks; I thought I was growing roots over there. So, like Liviu said, one thing we want to do with AI is take the scale problem out of the equation for our expert analysts. We use AI across the entire CrowdStrike platform, and what I mean by that is: we deploy AI in the cloud, we deploy AI on sensors, we have behavioral ML that looks at events in real time and provides a verdict, and we have expert machine learning that helps specialized teams like Intelligence, OverWatch and Falcon Complete give answers to customers. Even before a sample or an attacker reaches the sensor, we have an AI model, part of Falcon Identity, that looks at logins and, based only on login attempts, tries to understand whether somebody is using a compromised password or trying to hack into the box, and cuts that attack off.
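The input-function-prediction picture from a moment ago can be written down in a few lines. This is a toy sketch, not any model we actually ship: the feature names, weights and bias below are invented for illustration, and in practice the weights are learned from training data rather than set by hand.

```python
import math

def f(x: list[float], w: list[float], b: float) -> float:
    """The 'function describing the behavior': a weighted sum of
    input characteristics squashed to a probability in (0, 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features for a file: [entropy, has_macro, size_in_mb]
weights = [0.9, 2.5, -0.1]   # made-up values; normally learned, not hand-picked
bias = -4.0

benign_score = f([3.1, 0.0, 1.2], weights, bias)  # low entropy, no macro
dodgy_score  = f([7.8, 1.0, 0.4], weights, bias)  # packed-looking, macro present
print(f"benign={benign_score:.2f} dodgy={dodgy_score:.2f}")
```

The prediction is just the output of that function: anything the function scores above some threshold gets treated as malicious, and everything else in the pipeline — training, hardening, retraining — is about making that function assign the right scores.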
[Audience question, inaudible] It's based on login behavior; think of it like that, yes.

So, like I said, we deploy models on multiple platforms, and I'm talking here about Windows, macOS, and a lot of versions of Linux, because there are a lot of them. As an example, our on-sensor ML model for Windows is about 40 megabytes, which, as Liviu was joking, is about the same size as the rest of the sensor's code base. In terms of numbers, on Windows 99% of malware is detected pre-execution, so it doesn't even get a chance to run, and we have plenty of third-party testers, like AV-TEST and AV-Comparatives, agreeing that we provide 99% detection coverage.

[Audience question, inaudible] Yes, 99% protection coverage using machine learning: no signature scanning engines, no traditional behavioral scanning engines built into the sensor. That's why I made the joke that a sensor close to 40 megabytes is a really small footprint for a security vendor, while others can easily run to hundreds of megabytes. It's also interesting that the Windows model is the largest; just because there are so many attacks against that platform, the model tends to be a bit bigger, while the on-sensor models for Mac and Linux are much smaller. What that says is that you can get the same detection efficacy with a far more optimized solution using machine learning.

In a nutshell, the way we process data and use our machine learning models to protect customers is this: we retrieve feature vectors and metadata from our sensors, and when I say sensors I mean servers, laptops, even handheld devices, everything. That data goes to the cloud. In the cloud, our machine learning models process it and provide feedback to our CrowdStrike analysis teams, who use it to give a definitive answer: yes, this is an attack, yes, this is a malware campaign we've been able to pinpoint, or no, it's a benign sample. Those verdicts are later used to augment the corpus we train new versions of our machine learning models on, so the analysts don't need to look at the same attack over and over again. That's what I mean by taking the volume problem out of the equation for the analysis teams: you shouldn't have to review the same attack twice. We're not 100% there, and I don't think we will be anytime soon, but we're chewing off more of it every week.

So remember when I said AI won't be taking your job any time soon? Right, because you still need a human in the loop to provide that feedback, whether a sample is clean or malicious, so you can retrain the model and make it better every time. That's correct.

One of the frameworks we're very passionate about is our AI-powered indicators of attack. AI-powered IOAs are basically a cloud detection platform that lets us take events and behaviors from the sensors, analyze them with AI in the cloud, and provide an answer back to clients in real time. This is very different from existing on-sensor ML, where the model lives on the sensor and only gets updated every few days, and from traditional cloud security, where to make a prediction in the cloud you either need the file already there, or the metadata already processed, with decisions made asynchronously and kept around in case you ever encounter that metadata again. What AI-powered IOAs do is let us provide an insight or a detection to our customers even when we've seen one particular event or file exactly once, for the first time: prevalence zero, or in this case, seen once. That's basically the most powerful framework we put at the disposal of our clients, and it's powered by the cloud. That pretty much sums up the power of the cloud: even if you've seen a file just once, you can extract all the relevant information from it and compare it against the breadth of our threat intelligence, against what we've seen on millions of endpoints. That was the initial design and the root of the company's name: the power of the crowd to strike against malware.
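The "never review the same attack twice" idea above boils down to deduplicating the analyst queue by attack pattern rather than by event. The sketch below is a stand-in, not the real pipeline: it fingerprints each event by hashing a handful of behavioral fields, where the production system would use proper feature vectors, and the event dictionaries are invented examples.

```python
import hashlib

def triage(events: list[dict], seen: set[str]) -> list[dict]:
    """Route only first-seen attack patterns to human analysts.

    Each event is reduced to a coarse fingerprint so the same attack
    replayed across thousands of endpoints is reviewed only once.
    """
    novel = []
    for ev in events:
        canonical = "|".join(sorted(f"{k}={v}" for k, v in ev.items()))
        fp = hashlib.sha256(canonical.encode()).hexdigest()
        if fp not in seen:
            seen.add(fp)
            novel.append(ev)
    return novel

seen: set[str] = set()
stream = [
    {"parent": "winword.exe", "child": "powershell.exe", "cmd": "-enc ..."},
    {"parent": "winword.exe", "child": "powershell.exe", "cmd": "-enc ..."},  # replay
    {"parent": "excel.exe", "child": "wscript.exe", "cmd": "dropper.vbs"},
]
novel = triage(stream, seen)
print(len(novel))
```

Once an analyst labels a fingerprint, that verdict can flow back into the training corpus, which is exactly the feedback loop described above.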
So, I started my career as a malware analyst, and even then, X number of years ago, malware was trying to evade detection. Back then we had signatures, and malware authors stopped placing their code at the entry point of the PE; they were hiding it after the first bytes, or randomly inside a function somewhere. Then we had polymorphism, then metamorphism, then runtime packers, over and over, a cat-and-mouse game; if you talk to the old AV guys, they'll tell you it has always been a cat-and-mouse game. What we want to do with our adversarial framework is be both the cat and the mouse.

I went a little too fast there, because what I wanted to say first is that the same evasion game happens not only against signatures but also against ML. Attackers are now trying to bypass machine-learning-based protection, for example by finding vulnerabilities in the frameworks themselves. We know Google provides TensorFlow for whoever wants to build and deploy neural networks; they do not, however, provide the same assurances for every TensorFlow issue, because they don't consider all of it production-ready code, and over the past couple of years more and more vulnerabilities have been found and fixed in TensorFlow. I just want to plug our TensorFlow-to-Rust framework here: for that particular problem we've already published two blog posts, and we're about to release the TensorFlow-to-Rust tooling to everybody, make it open source, so stay tuned for another blog post on that.

Coming back to model hardening: we want to take the fight back to our own ground. We don't want to wait for attackers to find vulnerabilities in our systems; we want to find those vulnerabilities ourselves, before an attacker gets the chance, and fix them. So we invented and built this adversarial framework. At its core, it takes an arbitrary file, either at random or one the user gives it, and leverages an adversarial configuration that describes the set of adversarial changes to apply to that file. If it's a PE, say, zero out some data or move some data into a code section; if it's a PDF, rotate an image inside the PDF; if it's a script, randomize the variable names. With that adversarial configuration it goes through our API, which, based on rules our experts have written, makes these changes and generates new files.
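The script-mutation case mentioned above, randomizing variable names, can be sketched in a few lines. This is a deliberately naive toy, not the framework itself: it uses plain string replacement where a real tool would rewrite the abstract syntax tree, and `rename_variables`, the variable list and the sample snippet are all invented for illustration. The key property it demonstrates is the framework's core rule: behavior unchanged, surface features changed.

```python
import random
import string

random.seed(2024)  # deterministic for the demo

def rename_variables(src: str, names: list[str]) -> str:
    """Adversarial mutation for scripts: swap variable names for
    random identifiers. Naive string replace; a real tool would
    rewrite the AST to avoid collisions."""
    out = src
    for name in names:
        new = "".join(random.choices(string.ascii_lowercase, k=12))
        out = out.replace(name, new)
    return out

original = (
    "total = 0\n"
    "for item in [1, 2, 3]:\n"
    "    total = total + item\n"
    "result = total * 2\n"
)
mutated = rename_variables(original, ["total", "item", "result"])

# Run both variants and compare observable behavior
ns1, ns2 = {}, {}
exec(original, ns1)
exec(mutated, ns2)
print(ns1["result"], max(v for v in ns2.values() if isinstance(v, int)))
```

Both namespaces end up with the same values under different names: identical behavior, different bytes on disk, which is exactly the gap a signature or a shallow feature extractor falls into.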
It then runs classifiers, one or many, against these new files and generates a report. That report helps us identify weaknesses in our models before attackers get to leverage those weaknesses, or discover them for themselves.

In the next slide I want to talk a bit about what the adversarial framework can do. This is just a very short snippet of what we support, but we support Linux, Mac and Windows PEs; MS Office documents, Word, Excel, PowerPoint; URLs; scripts, VBS, PowerShell, Python, AppleScript. Very few examples, because it's one slide, but I'll take the URL case as an example. Depending on how you do feature extraction in your URL-scanning AI model, you might care about the sequence of the key-value pairs, or you might not. If you do care about the sequence, then randomizing that sequence shouldn't affect the true verdict for that URL, but it would affect the output of your ML model. These are the kinds of changes we can make to the input to test the generalization of the model.
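The URL case above is easy to make concrete. A web server treats the query string as an unordered bag of key-value pairs, so shuffling them leaves the request semantically identical, while a model whose features encode pair order may score the shuffled URL completely differently. A minimal sketch, with an invented example URL:

```python
import random
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def shuffle_query(url: str, seed: int = 0) -> str:
    """Reorder the key=value pairs in a URL's query string.

    Semantics-preserving: servers treat the query as unordered,
    so only order-sensitive features see a difference."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    random.Random(seed).shuffle(pairs)
    return urlunsplit(parts._replace(query=urlencode(pairs)))

u = "http://example.test/dl?id=42&token=abc&redir=http%3A%2F%2Fevil.test&lang=en"
print(shuffle_query(u, seed=1))
```

A hardened URL classifier should assign the shuffled variant the same score as the original; if the score moves, the feature extraction has learned an accidental ordering artifact rather than anything about maliciousness, and that's precisely the weakness the framework's report surfaces.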
Going back to the adversarial framework and how we generate the attacks: once we have the modification report, we can assess the strength of the classifier. We can see, say, that we have a big problem with certain classes of attacks. Then what we do is take those attacks, generate even more samples, and use those samples in training for that classifier. The result is that we're now helping the model generalize better for those types of attack, so when we see those attacks used against us in the field, we will not be vulnerable to them. And I have a slide where you can see very clearly that this works, and works very well.

This is an example of a classifier that wasn't trained on adversarial samples, which we then asked to score a bunch of samples generated with the adversarial framework. I don't remember which classifier it was, it's not relevant, but let's say it's our URL classifier: we didn't train it on adversarial samples, we gave it a bunch of adversarially generated URLs and asked it for predictions. Ideally, for a hardened classifier, the expectation would be that all the decision values sit on the diagonal from (0,0) to (1,1), no change whatsoever; or even better, in the green zone, meaning an adversarially generated URL now gets a worse score as clean, that is, a higher probability of being detected as malware. In reality, you see a lot of points in the red zone. As an example: a URL that had a 99% probability of pointing to malware, after the adversarial change, in this case reordering the key-value pairs, is considered clean, with a decision value of 0.0, which leads to a clean verdict. Here you see the power of the adversarial framework.

In the next slide you see what happens when you include adversarially generated samples in training. The green curve is basically the ROC plot of the model without adversarial samples, and the other curve is what happens when you train the model with adversarial samples and test it against adversarial samples; not the same samples we trained on, of course, but others, because we can combine different attacks together. If you have, say, three attacks, you can generate one variant with each, variants with two of them, and one with all three.

And the cherry on top, what I promised you: it's really an assessment that we ran in about an hour, no cherry-picking whatsoever. We just took the top 100 ELF samples by vendor detections that day, the top 100 macro samples, and the top 100 PE samples, put them through our adversarial framework, let it run for about ten minutes, asked for about five randomly generated adversarial variants of each file, and then looked at the worst results. You can see that ELF sample number one started with 17 vendor detections, and by only editing in place, changing less than 600 bytes, we were able to make it almost fully undetected: 15 of the 17 vendors dropped their detection.
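A report like the one just described reduces to a simple per-sample summary: for each original, how many of its mutated variants fell below the detection threshold. The sketch below invents its own report structure and scores purely for illustration; the real framework's report format is not public.

```python
def summarize_assessment(report: dict, threshold: float = 0.5) -> list[tuple]:
    """Summarize an adversarial run.

    `report` maps sample name -> (original_score, [variant_scores]).
    Returns (name, original_score, evasions, variants), worst first:
    the top rows show where the model most needs hardening."""
    rows = []
    for name, (orig_score, variant_scores) in report.items():
        evasions = sum(1 for s in variant_scores if s < threshold)
        rows.append((name, orig_score, evasions, len(variant_scores)))
    return sorted(rows, key=lambda r: r[2], reverse=True)

# Invented scores: sample 1 is fragile, sample 2 is robust
report = {
    "elf_sample_01": (0.99, [0.02, 0.10, 0.97, 0.01, 0.03]),
    "elf_sample_02": (0.95, [0.91, 0.88, 0.93, 0.90, 0.96]),
}
worst = summarize_assessment(report)
print(worst[0][0], f"{worst[0][2]}/{worst[0][3]} variants evaded")
```

The fragile samples at the top of the list are the ones fed back into training, closing the loop between assessment and hardening.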
And it's the same story over and over again; no cherry-picking, nothing whatsoever. This took an hour of our time, actually of AWS's time, because it's completely automated: we click a button and look at the results.

I really wanted to point out the state of the industry here. Everybody is testing their signatures, everybody is testing their code base for vulnerabilities, but not a lot of folks are testing their ML models against bypasses and evasion. A lot of frameworks for attacking AI models have started appearing on GitHub; we've looked at those and incorporated some of their principles into our code, but it's fair to say we have the most mature framework out there. And the thing is, it's a lot easier to reshuffle Linux malware features to make a sample undetected, which in my opinion is a sad state of affairs, because Linux is the backbone of the internet. PE is a bit better; AV vendors have been doing PE malware detection for twenty years, so let's say they have more experience. And if you have a Mac laptop, don't think you're safe either; the proof is right here. So yes, it's a lot easier to bypass Linux detection. But we've already taken up too much time, so we'll just conclude with a couple of
takeaways. Number one: before you start building any machine learning model, any detection mechanism, especially in the cybersecurity industry, you've got to know the threat landscape. You have to understand what the threats are, not locally, on a limited set of endpoints, but how they act and behave at a global scale, and for that you need a lot of telemetry. The whole point is: know the threat landscape, so you know what you're building protection against.

Number two: machine learning models need high detection coverage and efficacy against the current landscape, and Marian can speak to this better than I can. It's very difficult to balance detection against false positives, that is, wrongly flagging a malicious file as clean or a clean file as malicious. It's very hard to balance efficacy, the ability to say with 99% accuracy that a file is malicious, against the risk of incorrectly tagging a clean file as malicious. That matters because if you tag a clean file as malicious, all sorts of automated procedures kick in: containment, remediation, putting the endpoint in lockdown. There are costs attached to that, costs in person-hours; an IT guy has to show up at that computer, put it back online, reinstall the operating system; there's downtime that loses the company money because a server isn't running, all caused by an incorrect detection from a model that was poorly trained because you didn't understand the threat landscape.

On top of that, one of the main reasons we built the adversarial framework was to improve the generalization capacity of our models. Not only do we want to discover the flaws in our models before attackers do, we also want our models to generalize better by giving them examples of variants they're likely to see in the field but that we don't have yet. Think of image recognition: you can take the same image, rotate it 45, 90 or 180 degrees, and that's new input data to train on. It's a bit more difficult with files, with binaries, because you can't just rotate a binary; if you want to move code around, you have to do it in a way that's still executable, fixing things up so the result still runs and achieves the same behavior. That's what the adversarial framework does: it's a way for us to improve the generalization of our models and at the same time discover vulnerabilities in their detection capacity before attackers do, because let's be honest, we know attackers are testing against our stack; they're actively looking for a vulnerability.

We've taken up twice as much time as we'd planned, so: Q&A, if you have questions; otherwise, you're the next runner-up to do the next presentation.
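The efficacy-versus-false-positive balance in takeaway number two is just the classic threshold trade-off, which is easy to show numerically. The scores and labels below are invented toy data; the point is only the shape of the trade-off, not any real model's numbers.

```python
def confusion_rates(scores: list[float], labels: list[int],
                    threshold: float) -> tuple[float, float]:
    """Detection rate (TPR) and false-positive rate at a threshold.

    labels: 1 = malicious, 0 = clean; a score >= threshold convicts."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives

# Invented model outputs for six files
scores = [0.97, 0.90, 0.62, 0.45, 0.30, 0.08]
labels = [1,    1,    0,    1,    0,    0]

for t in (0.9, 0.6, 0.4):
    tpr, fpr = confusion_rates(scores, labels, t)
    print(f"threshold={t}: detection={tpr:.2f} false_positives={fpr:.2f}")
```

Lowering the threshold from 0.9 to 0.4 catches the last malicious file but starts convicting a clean one, and as the talk notes, every one of those convictions triggers containment, lockdown and person-hours of remediation. That asymmetric cost is why the threshold can't simply be pushed down for coverage.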
[Audience question about whether the mutated samples still run on the operating system] So, one of the key principles of the adversarial framework is that all adversarially generated data, I won't even say files, all the data we generate adversarially, still has the same functionality. We only change the way it looks; we don't change what it does. For scripts, we can run them in a sandbox and confirm nothing is broken, that the scripts still run and produce the same results; same for binaries, apps, macOS and so on. If that doesn't hold, we're wasting our time.
[Audience question about model size: do you reduce the features considered by the model, or tune the model's parameters?] Both, and some extra things. The model we deploy on the sensor, for example, has a reduced feature set compared to what the cloud can do: we look at the best-performing features we use in the cloud and select those for the sensor. At the same time, we leverage the TensorFlow-to-Rust framework, where instead of deploying TensorFlow binaries and TensorFlow code, the model is actually transcribed to pure Rust code, so we carry none of the boilerplate that TensorFlow brings along, and none of the attack surface that code provides, and we're able to reduce the size of these models significantly. I can't give exact numbers because it really depends on the application, but the reduction is significant. There are other solutions too, like neural network compression. Yes, exactly. And please start asking questions, because I forgot to mention: whoever asks first actually gets a giveaway.
[Audience question, partly inaudible, about splitting training and test data by time] Yes, we did do some experiments where we tried to separate train and test based on time. The problem is that time isn't the whole story: some attacks work against some companies and not others, so attackers tend to reuse code; we like to think they do the least amount of work needed to bypass a vendor. What we're aiming for here is to make it very cheap for us to change our techniques so we can detect a new attack, while significantly increasing the attackers' cost of coming up with a new bypass. But yes, we tried the time-based split and it had limited gains; it didn't work that well, because again, attackers change very little if they don't need to change much. Sometimes one simple evasion technique works just fine. That was based on the file data set we tested with.
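The time-based split the question refers to is straightforward to set up: train on samples first seen before a cutoff date, test on everything after, which simulates deployment more honestly than a random shuffle. The corpus entries below are invented placeholders.

```python
def time_split(samples: list[dict], cutoff: str) -> tuple[list, list]:
    """Split a labeled corpus by first-seen date: past -> train,
    'future' -> test. ISO-8601 date strings compare correctly as text."""
    train = [s for s in samples if s["first_seen"] < cutoff]
    test = [s for s in samples if s["first_seen"] >= cutoff]
    return train, test

# Invented corpus entries (hashes truncated, labels: 1 = malicious)
corpus = [
    {"sha256": "aa..", "first_seen": "2022-03-01", "label": 1},
    {"sha256": "bb..", "first_seen": "2022-09-15", "label": 0},
    {"sha256": "cc..", "first_seen": "2023-01-20", "label": 1},
]
train, test = time_split(corpus, cutoff="2023-01-01")
print(len(train), len(test))
```

As the answer notes, the gains from this were limited in their experiments, plausibly because attackers reuse so much code that the "future" test set isn't as novel as the timestamps suggest.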
[Audience question] You brought up right at the end that there's been a growing presence of malware-evasion tooling in the open-source community, sometimes just posted on GitHub, questionably open source or not, the whole spectrum. And that's really true. I'm sure you have a large private library of generation techniques that isn't on GitHub; I was curious what percentage you're picking up from freely available sources these days.

So, we looked at this about a month ago, maybe a month and a half, and we were only picking up around three ideas, for injection faults or changes to the files; everything else we already had covered. One thing we don't do is look at faults or vulnerabilities in the ML frameworks themselves, because, for one, what we deploy on sensors is pure Rust code, so we don't use any of the third-party libraries like TensorFlow or PyTorch or XGBoost there, and in the cloud it again doesn't really matter. So no, not a lot; not a significant amount.
I'm being told there's a five-minute break, so no more questions for now. We'll take a five-minute break and then come back. Thank you, everybody.