
Okay, so my name's Craig. Thanks for coming to my talk. This is a talk about hunting supply chain threat activity, specifically in cloud environments, using anomaly detection, and it's a continuation of work I've been doing for a long time now on anomaly detection and cloud threat hunting and detection. My background: 20 years ago I volunteered to work on detection features in a security product that was kind of an insider threat hunting product, something we might call UBA today. Nobody wanted to do that work, so I just volunteered, and I've been working on detection and analytics ever since. I worked on a couple more big-data, large-scale security products along the way, and I spent some time at two of the largest AWS customer organizations in the Boston area, where I'm from, and got interested in anomaly detection there. This is my sixth BSides talk, but my first time at Rochester, so thanks for having me; it's great to be here. These are my last few years of presentations, and the way I got here is that back in the 2016-2017 time frame I was working on security
at this large AWS organization, and as we worked through security requirements, we had customers come to us and say things like: you must have a next-gen firewall, or you must have this network intrusion detection system, or you must have this or that, because these were their policies, carved on stone tablets, and you had to follow them. What got me really interested is the fact that a lot of that stuff is not very effective, or even relevant, in cloud environments, because so much of what's going on involves pure API-level services. Yes, there are large fleets of virtual machines and containers, and increasingly people are using serverless things like Lambda, but apart from the virtual machines, maybe the container instances, and the virtual networks, there aren't many useful places to apply the conventional security technology we're used to. Even with some of the host-based stuff, we can put EDR agents and instrumentation on virtual machines and servers, but that still doesn't help us at the service layer and the API layer, because there isn't really a way for us to instrument that, apart from collecting logs and doing analytics on the logs, and that gets interesting, as I'll explain as we go. So, background: working on threat detection, working on cloud threat hunting and detection. In 2018 I went on a bit of a tangent, a bit of a walkabout, and worked on an open source SIEM project using open source tools, and published all of that in 2018-2019. Then I started working on anomaly detection and doing much more serious work applying machine learning techniques to threat hunting and detection, starting in 2019. So for
the past few years I've been spending a lot of time on that, not just for cloud threat hunting and detection but also for endpoint and network events. This talk is mainly about cloud, though, because what I found is that anomaly detection is good to have in general across all of these data domains, but in the cloud world it's really essential. This is what I'm trying to advocate: putting a name to what I do, and I'm going to publish a longer description of this. The reason I invented the term detection science is to put some definition around what it means to fuse conventional security research and threat hunting and detection techniques with machine learning and data science. It's not exactly data science, because in security detection not everything is a data science problem, and not everything can be solved by data science; some of it can be solved by technologies and tools we already have. The goals aren't exactly aligned either: the goal of data science and machine learning is usually to do really good data science and machine learning, and not necessarily to deliver high fidelity, high efficacy detections to the users that need them. And of course, at the same time, not everything can be solved by machine learning either. So I like to call it detection science, just to define this field of study. My goal is basically to get the best possible detections at the lowest cost, which is what I mean by resource-to-kill ratio. It also means getting away from tool orientation: in security practice I think we sometimes tend to get very oriented around our NSM or EDR tool, around network or host data, or something else, and sometimes, especially in the cloud domain, it's necessary to throw that stuff out and start over, because there are a lot of things that leave no network or host based evidence, only cloud activity logs. And finally, it's not a product, and it's not something that can be productized; that's why I like the term detection science, the practice of detection science. It's something that belongs to us. It can never be turned into a product or a use case or something like that, so it's ours. In terms of contents, I'm going to
talk a little bit about supply chain and why I'm using this theme for the presentation; there's a story behind that. Then we have four public case studies: as a quick overview of cloud threat hunting and detection, there are four good, large public-record case studies out there right now. There are probably more than four, but these four are really big and well known. Then I want to go into what I've been doing to try to solve this for the last few years, and what I'm doing now, which is a little different from what I was doing a couple of years ago,
so I want to show that along the way. So, in terms of retro: the reason I'm using this theme is that it wasn't that long ago when supply chain attacks were sort of theoretical, in that we had not yet imagined they were possible; it was a failure of imagination, as people sometimes put it. Maybe 10 to 14 years ago I was in a threat modeling session at a large organization, and looking at the data flow diagrams and the network diagrams, I said: hey, what about this? There's a lot of access via this supply chain route to this particular vendor, or these vendors, so what if somebody comes in that way? Rather than coming in the front door, what if they tunnel through the parking garage underneath the bank and come in a way that's never been expected before? The initial response I got, because at the time it was beyond our experiential base, was: isn't that kind of an Ocean's Eleven scenario? Meaning it was just something we couldn't quite imagine yet. But
over the last decade we've realized it. Starting in 2011, there was the RSA breach. I wouldn't say that was a supply chain issue in the sense that the supply chain itself was affected; it was more a case of going after the authentication mechanisms, probably secondary to another objective, a means to an end: there was somewhere else they wanted access to, and they needed to get around or through the two-factor auth that was in place. In 2013 we had the Target breach which, if memory serves, involved an HVAC vendor that had more remote access to the network than was perhaps expected. And then of course the big one is 2020. Who's heard of Sunburst? That's probably the biggest one so far: a network monitoring tool actually had malicious code inserted into its build system, and that code was included in a build of the product that was distributed to the user base. It had quite an extensive impact, with a lot of incident response and cleanup required. That might be the first time we've seen a supply chain attack of that level
and sophistication, so it was really eye opening. Now, in cloud threat hunting and detection, most of the time we're talking about credentialed access. Once in a while you'll see more interesting research where people try to cross certain security boundaries, and once in a while you'll see much more advanced kinds of attack research, but in most cases we're talking about credential access, where somebody's credentials, in the form of a key or a username and password, have been compromised, leaked, or spilled, and they're in the hands of a threat actor. So somebody is impersonating a legitimate user, and this is not dissimilar from insider threat
hunting. I found a quote, I think the best quote I've found so far, that sums up the problem. It's from Peter Neumann, from a long time ago. He argued that a compromised user account is sometimes logically indistinguishable from an insider threat: the two are logically similar enough that it's hard to distinguish or disambiguate between them. So most of the time, when we have a case of credential access where somebody is impersonating a user, or simply hijacking a user account, it's not dissimilar from hunting insider threats, and it's genuinely hard to find. This is the best problem statement I've found so far, better than anything I've ever come up with on the subject. And the
reason that credential access in cloud activity in particular is hard to find is that it's simply hard to distinguish threat activity from legitimate activity, because the differences are usually just a matter of nuance, and I'll give you some examples of what that looks like. The most recent example in the public domain is from just this year: over the course of early this year and late last year there was a disclosure by a large software provider, a password manager. Has anyone read about this? It's probably top of mind. This one was interesting because, according to what's been published, it looks like the attackers went after a developer's home system that was running a media server. From what I can read, it sounds like there was a media server with a listening port that was internet facing, and the threat actors, whoever they were, were able to identify the user's home IP address, find that port, find a working exploit for that media server, and then gain persistence on the developer's home system, which they had apparently been doing work from. They were able to harvest credentials from that home system and then use those credentials to gain access to the cloud environment. Now,
whether they used those credentials from the compromised laptop, or took them away somewhere else, I don't think we know. Usually you'll see attackers take the credentials away and use them directly from somewhere else, rather than assuming they'll be able to persist on the initial target indefinitely. Then, 2021 was the first time we had what appears to be an actual case of insider threat. According to the indictment, there was a fairly privileged user, somebody in engineering, who went rogue and started to access data in ways that were not authorized, and took a few steps to try to obfuscate his identity. It sounds like he was still using his normal day-to-day credentials, but he was coming in from a VPN service, probably to create the appearance that his credentials had been compromised and were in the hands of someone else. One of the things that's interesting about these indictments is that, maybe without realizing it, they often leave a breadcrumb trail to anomaly detection techniques, because many of them cite unusual method or action activity. In this case they talk about a call named GetCallerIdentity that was apparently a focus of the investigation. GetCallerIdentity is basically a whoami for the AWS API: it just tells you what your current user context is. This was another case of large scale data exfiltration; in fact many of these cases are similar in that they largely consist of credential access, sometimes some lateral movement, followed by large scale data exfiltration. 2019 was the second one of
these we've seen, and it was a little more complicated, in that to obtain working credentials, according to what's been published, the attackers bounced through a WAF service running on a virtual machine. They were able to get command execution on the virtual machine running the WAF service, and from there they interrogated the metadata service in order to obtain valid credentials. If you're not familiar with the metadata service, it's a service where a virtual machine instance can ask questions about itself, including what it's authorized to do, but it can also perform what are called assume-role operations, where it assumes privileges in order to do transactions in various services, depending on what it's authorized to do. So they basically harvested credentials that way, came back in from a VPN service, and used those credentials to start doing bulk data exfiltration from the storage services. And the first one is 2016. Anybody remember this? This was the first big public case. It was interesting because there are actually no virtual machines, no virtual hosts of any kind, involved here. That's not to say there weren't; of course there were a lot of
dimensions to this intrusion set, including host-based dimensions, but in this particular aspect of it, the exfiltration took place just by sharing snapshots. It wasn't necessary to persist on a virtual machine or a virtual host of any kind; it was just snapshot sharing used to forklift data, presumably from the victim account to the attacker's account. So, in terms of detection, one of the first questions I found myself asking as I started working on this was: if we know what actions are being used to do things like
exfiltration, lateral movement, and privilege elevation, and we know what it looks like, then how can we alert on these? Can we detect this stuff using conventional alerting? In these four case studies, three talk about specific actions that were unusual in some way; for the most recent one we don't have that level of detail. I would speculate there were probably some privilege-related methods or actions in there, but a lot of it was probably just S3 data operations and bulk data exfiltration. In the first three, though, they talk about how there was unusual or anomalous usage of these three methods; that was one of the things I noticed. But if we ask whether we can simply alert on these, we really can't, because in the CloudTrail data of a medium or large organization, depending on how big your AWS footprint is, these actions can occur in the tens of thousands, or even hundreds of thousands, of instances in the logs over a time frame of one to four weeks. Most of the time I don't see CloudTrail data retained longer than four weeks, at least not hot data, so most of the time I'm looking at between one and four weeks. That's way too many to even think about simple alerting, obviously. S3 audit events are even harder, and that's something I'm going to work on next, because in a large environment where you're auditing S3 access, down to the level of who's accessing which objects in which bucket and what people are doing, the volumes get enormous: the highest I've seen gets into the billions of events per week.
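To get a feel for these volumes, the first step is usually just a frequency tally over the logs. Here's a minimal sketch that counts CloudTrail records by action and by (principal, action) pair; the field names (`eventName`, `userIdentity.arn`) follow the standard CloudTrail record format, and the records below are synthetic stand-ins for a real export:

```python
import json
from collections import Counter

def count_actions(lines):
    """Tally CloudTrail records by action and by (principal, action) pair.

    Input is one JSON record per line, using the standard CloudTrail
    eventName and userIdentity.arn fields.
    """
    by_action = Counter()
    by_pair = Counter()
    for line in lines:
        rec = json.loads(line)
        action = rec.get("eventName", "unknown")
        principal = rec.get("userIdentity", {}).get("arn", "unknown")
        by_action[action] += 1
        by_pair[(principal, action)] += 1
    return by_action, by_pair

# Synthetic records standing in for a real CloudTrail export:
records = [
    json.dumps({"eventName": "GetCallerIdentity",
                "userIdentity": {"arn": "arn:aws:iam::111111111111:role/ci"}}),
    json.dumps({"eventName": "GetCallerIdentity",
                "userIdentity": {"arn": "arn:aws:iam::111111111111:role/ci"}}),
    json.dumps({"eventName": "ListBuckets",
                "userIdentity": {"arn": "arn:aws:iam::111111111111:user/alice"}}),
]
by_action, by_pair = count_actions(records)
print(by_action["GetCallerIdentity"])  # 2
```

Running something like this over real data is what produces the scatter plots discussed below: a few principals with enormous counts, and a long tail of principals with small ones.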
In some cases it can even go north of 10 billion events a week with large-scale S3 auditing, which is far too many to think about alerting on. For the GetCallerIdentity method, the other reason it's hard to do any kind of alerting is this: what I did here is plot, in a scatter plot, the number of unique user contexts calling the method against the number of times they call it, and you can see there are a few user contexts that call this method vast numbers of times, and a long tail of user contexts that call it only a small number of times. It's a distribution that's hard to deal with in terms of finding what's unusual. Same thing for source IPs. The case studies talk about source IPs, and sometimes people will try to go down the road of alerting on source IPs, or making lists of good and bad, or known and unknown, source IPs. But that's really hard too, because even for this one method there are a few source IPs that generate vast numbers of events, and a long tail of source IPs that generate only a few, so it's well beyond
anything we could handle with any kind of list or definition. ListBuckets figured in some of these cases, because if somebody is going after S3 to do bulk data exfiltration, one of the first things they're likely to do is issue a ListBuckets command, which does what it sounds like: it enumerates S3 buckets by name. So if we could alert on a suspicious ListBuckets command, that could potentially be one way to get earlier in the cycle, to get on this before data exfiltration starts to ramp up.
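One kind of check that does scale here is statistical rather than match-based: flag a day whose ListBuckets volume for a principal sits far outside that principal's own baseline. This is a crude sketch of the idea over synthetic daily counts, using a leave-one-out z-score so that a single huge burst doesn't inflate its own baseline; it is not the exact method used in the talk:

```python
from statistics import mean, stdev

def spike_days(daily_counts, threshold=3.0):
    """Flag days whose count is more than `threshold` standard deviations
    above the mean of the *other* days (leave-one-out z-score)."""
    flagged = []
    for i, count in enumerate(daily_counts):
        rest = daily_counts[:i] + daily_counts[i + 1:]
        mu, sigma = mean(rest), stdev(rest)
        if sigma > 0 and (count - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Synthetic daily ListBuckets counts for one principal: a steady baseline,
# then the kind of burst automated discovery tends to produce.
counts = [120, 130, 118, 125, 122, 127, 910]
print(spike_days(counts))  # [6]
```

The same shape of check reappears later in the talk as the standard-deviation-based spike and surge analyses.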
But again, this is a very popular, heavily used command. It can occur thousands of times a day, sometimes even millions of times a day at the high end in a really large S3 environment, so trying to find a handful of suspicious instances in the mix is really hard, and even trying to do it through manual data sifting is largely infeasible. Again, that's for ListBuckets. One of the dimensions discussed in the case studies is that commands like ListBuckets, discovery activity, followed by large scale exfiltration activity, were coming from source IPs that were unusual or had not been seen before. At first, one direction I thought about was using that, but source IPs have one of the highest cardinalities: the source IP is one of the fields in CloudTrail logs with the largest number of unique values. There are simply too many of them, and there's a long tail of IPs that are relatively rare or unusual but completely normal, just because modern networks are so
hyper-distributed. So that's kind of out too. There are projects out there today with alert rules on some of these methods, even things like AssumeRole. AssumeRole is basically how you obtain privileges to do what you want to do: to work in a service, you call AssumeRole and you get an access token that allows you to do work in that particular service. But these are some of the most heavily used methods in general; some of them can occur hundreds of thousands of times per month. So it's really infeasible to alert on these, at least with any kind of simple alert that just says the action was AssumeRole or GetCallerIdentity. Even just turning them into events is quite expensive, and turning them into alerts by themselves is out of the question. So then what do we do? In the case of snapshot sharing, as you might know, this one actually has its own technique in the ATT&CK matrix, with a technique assigned under the tactic of exfiltration. The detection discussion there says that because of the volume of normal snapshot activity, it's going to be impractical to just alert on it generally, and what they recommend is looking for untrusted accounts or unusual activity; basically, the guidance is to look for unusual snapshot activity. So then the question is: how do we do that? Most rule languages and most security products are search based or query based, so you can write rules with searches or queries that match, say, "the action is ModifySnapshotAttribute" (the call used to share a snapshot) or "the source IP is X."
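To make that concrete, here's roughly what a search/query-style rule amounts to: a predicate over known field values. The field names follow CloudTrail conventions, and the IP is a documentation-range placeholder. The point is what's missing: there's no way to say "unusual" in a stateless match, only values you already know.

```python
def match_rule(event, criteria):
    """A query-style rule: true only when every listed field equals the
    specified value. This can express "action is ModifySnapshotAttribute"
    but not "snapshot activity unusual for this user"; that requires
    state about past behavior, which a stateless match doesn't have."""
    return all(event.get(field) == value for field, value in criteria.items())

rule = {"eventName": "ModifySnapshotAttribute",
        "sourceIPAddress": "203.0.113.7"}

event = {"eventName": "ModifySnapshotAttribute",
         "sourceIPAddress": "203.0.113.7",
         "awsRegion": "us-east-1"}

print(match_rule(event, rule))  # True
```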
You can match any particular value in a field, but most of these rule languages don't have a notion of anomaly detection, where you compose a rule that says: show me anomalous snapshot activity, activity that is very unusual compared to day to day. In order to do this, the one thing that works in our favor is that at the macro level, if we draw an attack tree, the number of patterns is relatively small; most of the time it breaks down to one of five routes. If somebody has compromised credentials that work in the cloud console, and they're logging into the console, then they're coming in via a console username and password, or possibly a SAML token. If they've compromised a persistent key, a key with a long lifetime, they could be coming in from anywhere: a laptop on a desk, a virtual machine or server in the cloud, or some combination of the two. If they're not successful in finding compromised credentials, or they just decide to go a different route, the other route is execution and persistence on a virtual machine. From there you may be able to obtain keys from the file system, since keys are sometimes persisted in a file, or you may be able to do assume-role operations and obtain short term temporary access tokens, which would let you do operations from that EC2 instance; or you could take the key away and use it from somewhere else. Most of the time it breaks down to one of these five patterns, with one of these five areas exhibiting new, unusual, or outlier behavior that doesn't normally
manifest. What I'm displaying here at the bottom are the actual anomaly detection methods I'm using. Many of them are tuple-based, or partitioned, in that they consider a combination of fields. For example: if you decorate CloudTrail events with geo information, say with MaxMind, you can enrich them with the source country, the region, and the AS name, and that's great because the cardinality there is very small. The cardinality of source IP is massive, but the cardinality of geo and ASN information is normally much smaller. So essentially I'm looking for things like an action coming from an unusual geography, or a new action for a user that we haven't seen before, plus some standard deviation based analyses that look for, say, the top five largest spikes and surges in either errors or events. One of the ones I'm planning on adding is a new source IP for an account, or a new source IP generally, and the reason that's going to work is that doing it over a long term, basically an indefinite learning period, seems to work well. And over here I'm doing rare service for a user, not rare action for a user, and I'll show you why; I'm doing new action for a user instead, because that actually has better efficacy. A new role for a key is a good one, because most of the time, whether it's a virtual machine, a CLI, automation, or a human, entities tend to reuse the same roles over and over for whatever key they're using. One thing that works in our favor is that much of the time entities tend to do the same kinds of work every day. Not always: people start new projects and start using new services, and things change over the course of months or a year, but day to day it's often fairly predictable. (How are we on time? We're good.) All right, so what I've tried to do here is depict why I'm using the combination of the three: functions that look for rare combinations of fields, functions that look for new behaviors, new combinations of fields not previously observed, and spikes. What
I've tried to do here is diagram what each of those three is good at detecting. This is a simple chart where the vertical axis is event density, basically the volume of events for either an action or a user context, and the horizontal axis is the time frame and duration, because the question of whether an intrusion is new or long-standing is often relevant. If it's day one, and somebody has just begun using compromised credentials, credential access started an hour ago or just today, then sometimes things like a new action for a user are good at finding that, and finding it early, possibly early enough to make a difference. New action for a user has actually been really productive, and I like it because it's one of the anomaly detections that aligns well with the thesis we see in the case studies: in many of those examples we're looking for activity that is rare for a user, but in many of them we're actually looking for user activity that is completely new and has never manifested before, so it's a more precise way of getting at it in many cases. However, in a case where somebody has been persisting long term, if initial
access was last week or last month, or if it happens today but we don't start doing anomaly detection until sometime in the future, then whatever the threat actor is doing may not evaluate as new activity for that user context or entity, but it may still be rare. Especially in the first few days or weeks, you can often find it by means of the fact that the activity, while not brand new, is rare for the user compared to all the transactions and work the user normally does; what the threat actor is doing tends to be relatively rare, more sparse. But of course there are exceptions to that; there are exceptions to everything. A threat actor may start doing large scale activity, like bulk data exfiltration, or, one of the things that happens a lot, somebody gets execution on a virtual machine and one of the first things they do is start tons of automated discovery: running automation to enumerate roles, call methods, and attempt transactions, basically playing "what can I get": try everything you'd like to do and then figure out what operations are
successful and which are not, where you're successful and where you're shut down. In most environments, when that starts happening, it usually results in an EC2 instance or virtual machine trying to move laterally across services or actions it doesn't normally use, or trying to access services or resources it doesn't have access to because they're not part of what it does, and in that case there will often be a huge surge in authorization errors: AccessDenied, or one or two others. So when somebody has just landed on an instance and is doing discovery, lateral movement, or privilege elevation, sometimes you can see it there first, and that's one way of potentially finding this stuff early. But of course, like everything, there are exceptions to that too; none of these are perfect. If the compromised instance, or the credentials somebody has obtained, have god-like, nearly unlimited access to the account, then it's likely there will be no surge in authorization errors. However, if they start doing large scale automated discovery, it's likely there will be a statistically significant surge in events for that user context, and that can be visible
in a time series histogram; if we have time for a demo, I'll show you what one looks like. So, in terms of applying these to the examples: in the GetCallerIdentity case, the calls were coming from a rare source country and network, because they were coming from a VPN provider that was just not normally used
here. Also, there was some defensive evasion in the course of this: things like setting certain log retention policies to one day, presumably to try to keep log retention down to one day, so that anyone who went to investigate what was going on would find no data older than 24 hours, just to make it hard to investigate. Those operations are not abnormal in themselves, but they would have been coming in from an unusual source geography, from this VPN provider, and from an unusual network. It could also depend on what this person normally does: much of the time, environments are spun up and torn down by automation, so most of the time it's the automation account doing this kind of housekeeping and setup, setting things like log retention and all that instrumentation, and it can be unusual for a human user to start actually working in those services and making changes. So there are another two possible techniques there: a user accessing a service they don't normally access, like modifying the logging service, and also a new action; if this is somebody that doesn't normally go around changing retention policies,
that would be a new action for the user. Then this one, the original example, is kind of similar. The question here is: why is the WAF role coming from outside? Not only is it not coming from the WAF instance, it's not even coming from the cloud environment; it's coming from a VPN provider somewhere else. So it's coming from a source geography and network that is anomalous. The next question is: why is the WAF role accessing S3, which it has never done before? So there's new-action behavior there for the user context. And for the snapshot example, my impression is that this probably would have been both a rare service for a user and a new action for a user, because from what I've read about it, it sounded like the user here is probably somebody that is not normally in the console sharing snapshots and doing things like that. So all three of those techniques would potentially have found it. If you're wondering why not just simply look for rare actions (a method in the API is essentially what I mean by an action), the reason is that this is what the distribution of actions looks like, and it's kind of a mess: it's not a normal distribution. It's programmatic behavior, largely machine behavior rather than human behavior, so a lot of those assumptions break down, and there's a very long tail of outliers that are just completely normal. The reason I don't do rare-action-for-user is that even in a medium-sized data set with, say, 300 million CloudTrail events, there are too many normal outliers. If you say, show me the top N least frequently occurring actions for a user, there are just too many normal results, because, like the scatter plot I showed earlier, most user contexts call the same actions over and over again many times, but there's a long tail of actions that they call
occasionally. So what I'm doing instead, and this is the output of it here, is looking at new actions for a user: cases where a user has called an action today, or say in the past two or three days, that the user has not called previously in the past 30 days. That actually has really good signal to noise: with 300 million events I only get eight results, and most of them are actually interesting. It's looking back 30 days at the moment, but I'll show you what I plan to change about that in a moment.
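The new-action-for-a-user check just described (an action called in the recent window that the user never called in the prior lookback) can be sketched as a single anti-join. The schema here is invented for illustration:

```python
import sqlite3

# Toy CloudTrail slice; user_identity / event_name / day are illustrative
# column names, not the real CloudTrail schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_identity TEXT, event_name TEXT, day TEXT)")
history = [("alice", a, f"2024-03-{d:02d}")
           for a in ("DescribeInstances", "GetObject") for d in range(1, 29)]
recent = [("alice", "DescribeInstances", "2024-03-30"),   # seen before
          ("alice", "GetFederationToken", "2024-03-30")]  # never seen before
conn.executemany("INSERT INTO events VALUES (?,?,?)", history + recent)

# New action for a user: present in the recent window, absent from that
# user's prior history (an anti-join via NOT EXISTS).
new_actions = conn.execute("""
    SELECT DISTINCT user_identity, event_name
    FROM events AS recent
    WHERE recent.day >= '2024-03-29'
      AND NOT EXISTS (
            SELECT 1 FROM events AS hist
            WHERE hist.user_identity = recent.user_identity
              AND hist.event_name   = recent.event_name
              AND hist.day < '2024-03-29')
""").fetchall()
print(new_actions)   # only the never-before-seen action surfaces
```

The repeated DescribeInstances call is suppressed by the history check, which is what keeps the result count so small even at scale.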
So the case study I have for you is from the first time I turned this on. This is the third project I've done on anomaly detection: I first worked with some anomaly detection functions in an open-source platform, then spent a couple of years building anomaly detection machine learning jobs in an unsupervised machine learning engine in a commercial platform, and right now, for purposes of experimentation, I've decided to do it a little differently. I'm doing these in SQL, and rather than computing relative anomalies in an unsupervised learning model, I'm just computing the top N absolute anomalies: the top N rarest combinations, the top N behaviors that are new, and the top N surges or spikes, meaning the top five or so largest standard deviations in surges in events and errors. And I'm actually liking the results I'm getting from that. The first time I turned this on, I was expecting to find just more of the usual: more potentially suspicious activity, more conventional kinds of threat hunting and detection. But what I found instead were AssumeRole events coming in from a source country that was
extremely rare: only two of those events had manifested. What's being displayed here is that of the 42 million AssumeRole events in this data, only two came from that country, and of the 2.2 million DescribeInstances events, only one came from that country. That was interesting because, in this particular case, there wasn't a normal business relationship with that country; there weren't users sitting in that country. So we dug into it, and these are just some more numbers. There were some EC2 operations: one DescribeInstances action and a couple of others. What I found is that it was actually coming from a third party. You may have noticed there are a lot of third-party services operating in cloud accounts now, and some of them are doing management, automation, and orchestration: you can outsource management of your Kubernetes instances to someone else, or outsource management of your entire cloud environment or just aspects of it, or you can subscribe to security services to do things like security monitoring. What typically happens then is that some kind of federated authentication provider gets trusted and cross-account access gets enabled, so that the third-party provider, the vendor, is able to authenticate into the customer's account to call actions, do transactions, and do work in support of whatever they're doing. There are a number of cloud management service offerings out there and a number of security services. The reason snapshot events are so voluminous today compared to, say, five years ago is that there's a lot of agentless security scanning going on, where vendors will authenticate into your cloud account and take snapshots.
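The kind of rarity the case study surfaces (2 of 42 million AssumeRole events from one country) can be computed as a group-by plus a share-of-total ranking, which matches the top-N-absolute-anomalies approach. This is a sketch with made-up columns; deriving source_country from the caller IP is assumed to happen upstream, since CloudTrail records an IP, not a country:

```python
import sqlite3

# Toy data with an invented source_country column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_name TEXT, source_country TEXT)")
rows = [("AssumeRole", "US")] * 10000 \
     + [("AssumeRole", "DE")] * 500 \
     + [("AssumeRole", "XX")] * 2        # the two-in-millions style outlier
conn.executemany("INSERT INTO events VALUES (?,?)", rows)

# Count events per (action, country), then rank each combination by the
# share of that action's total it represents; the smallest shares are the
# top N rarest combinations.
rare = conn.execute("""
    SELECT event_name, source_country, n,
           1.0 * n / SUM(n) OVER (PARTITION BY event_name) AS share
    FROM (SELECT event_name, source_country, COUNT(*) AS n
          FROM events GROUP BY event_name, source_country)
    ORDER BY share ASC
    LIMIT 3
""").fetchall()
print(rare[0][:3])   # the 2-event country surfaces first
```

Ranking by share rather than alerting on every small count is what keeps this a top-N report instead of a noisy rule.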
They'll run those snapshots through a malware engine using YARA rules or behavioral rules or some combination of that and something else, so there can be tons of snapshot activity. But this was interesting. During the course of my career, one of the things I've found is that whenever I do something new, I tend to find something that nobody knew was there; sometimes that creates a little bit of chaos, but it's always interesting, and this was one of those cases. There was maybe some awareness that the third party had admin access to this account, but nobody was expecting to see the vendor coming in and just doing transactions out of the blue, when there wasn't any actual collaboration going on: there wasn't a service request open, there wasn't a question into the vendor, there wasn't anybody asking the third party to come in and do this. It just happened. And I think this is another reason that anomaly detection is going to be really key here. Given the number of third parties working in cloud accounts now, how are we possibly going to find this stuff without anomaly detection? It's hard enough just to find suspicious activity coming from compromised credentials or credential access, but when it's coming from a third party it's even harder, because it's hard to even know what's normal and abnormal, what's expected and not. Is it normal for them to do this or to do that? Is it normal for them to be coming from this network or this source geography? Trying to figure that out manually with searches and analysis would take
forever. So I think anomaly detection is possibly even more important for managing this kind of supply chain risk in these accounts, in addition to conventional threat detection. Some other examples: in the chart I mentioned, there's some geographic detection I'm doing, looking for things like a user account or user context coming from two or more different countries at the same time. That's a technique that has generally worked well for me with authentication events and many kinds of events. In the post-COVID world it's not as reliable, though, because users are all over the place: users can do work from everywhere, and users roam around. I have users that come to the US, go home to whatever country they're from, stay there for a few weeks, and come back, and they might actually come in from two countries on the same day in the process. Everybody has multiple devices coming from multiple IPs and multiple networks. So this is another area where I'm planning on applying a new algorithm: rather than looking for the same user in multiple countries, we look for a source country that we haven't seen before for a user, something completely new, and for new source IPs. New country for user has an example at the top here: the first result is a user context authenticating in from France, which is just a new country that they had never come in from before, and on the bottom is new source IP. Rare source IP does not produce good results; just trying to alert on that would have way too many normal results. But new source IP is working well for me.
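A sketch of the new-source-country-for-a-user check, written as a set difference between the recent window and the lookback. Column names are again illustrative, and note that a brand-new user's first activity surfaces here too:

```python
import sqlite3

# Hypothetical layout: user_identity, a geo-derived source_country, and
# the event day, just enough to express the check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_identity TEXT, source_country TEXT, day TEXT)")
conn.executemany("INSERT INTO events VALUES (?,?,?)", [
    ("bob",   "US", "2024-03-01"), ("bob", "US", "2024-03-15"),
    ("bob",   "CA", "2024-03-20"),   # seen in history: not new
    ("bob",   "FR", "2024-03-30"),   # first-ever appearance for bob
    ("carol", "US", "2024-03-30"),   # brand-new user: all activity is new
])

# (user, country) pairs in the recent window, minus every pair that user
# has already shown in the lookback before it.
new_countries = conn.execute("""
    SELECT DISTINCT user_identity, source_country FROM events WHERE day >= '2024-03-29'
    EXCEPT
    SELECT DISTINCT user_identity, source_country FROM events WHERE day <  '2024-03-29'
    ORDER BY user_identity
""").fetchall()
print(new_countries)
```

The previously seen Canada logon is suppressed, which is the difference between "new country" and the noisier "rare country".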
Even at the 300-million-event scale, I typically find only a handful of new source IPs; in this case there are four. And possibly the best example so far is a persistence method that was just recently being talked about; there was research published on this last month. There's an obscure method called GetFederationToken where you can create a federated user, and that federated user will survive and continue to function even if the original user that created it is disabled. There are some limitations: it only works in the console, and it's not supposed to work in the CLI and appears not to. But this is apparently one way people have been trying to persist, or trying to avoid getting evicted. If somebody figures out which account was compromised and either disables it, deletes it, changes the password, or whatever, this federated user will continue to function unless that is dealt with as well. And that's interesting. In this case, this is an action that I don't see normally in most of the data I've looked at; it's just not normally used, it's too obscure. So this is something we could alert on; making a simple alert rule for it is, I think, completely feasible. The problem is that until last month I didn't know, and I think most of us didn't know, that we needed to alert on this, because until maybe the last 60 days or so we just didn't realize that this was an evolving technique people were using. The great thing about new action for user is that the first result here, from January 29th, actually precedes the threat intel and the research on this. It's one way of finding things, because often in the cloud world
we're looking for things that, in threat hunting in general but in the cloud domain in particular, are hard to find: emerging threats. We know we need to be looking for emerging threats, but we don't always know what we're looking for, because we don't know what's coming next. So here is what I plan on doing with the new-observable functions. Right now each one is a query that runs across whatever hot data exists, whether that's 7 or 30 days. What we plan to do, though, is make that into an Airflow pipeline that will run in the SQL database where this data lives (I'm using SQL today). The Airflow pipeline will run continuously and essentially just rerun the query, looking for new actions for a user, new source IPs for an account, new this, new that, and it will persist that output in a table somewhere in order to have memory, so it becomes like a simple form of learning. As long as that is running, even if you only have seven days of hot data, the new function will still be able to reason about what things are new for as long as it's been running. So we'd potentially be able to find things that are new in the last year or more, while still only keeping 7 or 30 days of data hot, because if you've looked at storage costs, trying to keep hot data for a year is usually cost-prohibitive.
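One way to sketch that memory idea: keep a small state table of every (user, action) pair ever observed, write into it each pipeline cycle, and report only the pairs that weren't already there. The table and function names here are invented; in practice something like this would run from the Airflow job described above:

```python
import sqlite3

# Invented state table: one row per (user, action) pair ever seen, so
# "new" is judged against the table's whole lifetime even when only a
# week of raw events stays hot.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE seen_actions (
    user_identity TEXT, event_name TEXT, first_seen TEXT,
    PRIMARY KEY (user_identity, event_name))""")

def observe(day, pairs):
    """One pipeline cycle: record the cycle's (user, action) pairs and
    return only the ones never seen in any earlier cycle."""
    new = []
    for user, action in pairs:
        known = conn.execute(
            "SELECT 1 FROM seen_actions WHERE user_identity=? AND event_name=?",
            (user, action)).fetchone()
        if not known:
            conn.execute("INSERT INTO seen_actions VALUES (?,?,?)",
                         (user, action, day))
            new.append((user, action))
    return new

# The first run seeds the baseline; a run weeks later is judged against
# everything the table remembers, not just the current hot window.
observe("2024-01-01", [("alice", "GetObject"), ("alice", "DescribeInstances")])
novel = observe("2024-02-09", [("alice", "GetObject"), ("alice", "GetFederationToken")])
print(novel)
```

The state table only grows by distinct pairs, so it stays tiny compared to the raw event volume it summarizes.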
It would just be too expensive. We still have about eight minutes of time, so I'll show you what this looks like. What I'm doing today: each card here is displaying the output of a SQL query. At the top, this first one is looking for users and roles coming from multiple countries at the same time, and this last card is looking for actions or transactions coming from an unusual geography. Then down here we have new action for a user, where some of the output is interesting, and new factor for a key, which has no output, and that's actually a good thing. That's a case where no news is good news: nothing is absolute, but much of the time you see the same access keys using the same roles and doing the same transactions day after day, so this is one case where it's not unexpected to see no output here; in fact, having no output here is a good thing. Anomalous service activity is a case where there's a very small number of events for the combination of the user and the service: a user working in a service that they don't normally use. Again, day to day most user contexts work in the same services over and over again, and there's usually not that much change. There will be change in the long term, over weeks and months, as new projects start or new things are built, and sometimes you'll see users in new services, but on a day-to-day basis it's usually not that much. Then down here are the top five surges or spikes, by top five standard deviations. When you take the top five, the signal to noise is pretty good, because these standard deviations are often in the thousands. It's not like one or two or three standard deviations, which would give us a mix of interesting and uninteresting peaks and troughs; the top five standard deviations are so large that they're usually interesting. You can see that two of these error spikes are authorization errors, and those are interesting because there really isn't a normal explanation for them. If you've got a huge surge in authorization errors, much larger than what normally occurs, then it's one of two things: either something is broken or doesn't have the permissions it needs to do what it's trying to do, or somebody is persisting with credentials and just doing massive discovery and setting off authorization errors. It's one or the other; there's not really a normal explanation for a huge flood of errors. And the same thing for events: over here I'm plotting the top five spikes in events by standard deviation, and you can see DescribeVolumes, which is actually being caused by automation (there's a script running), and Pacu activity. If you've heard of Pacu, it's essentially a security auditing and pen testing tool for cloud environments; there's another one called barq that's becoming more popular, but they're both good. Running either of those automation tools will typically generate an unusually large spike in events, especially for that user or that instance, and, as long as you're not in a user context that has admin access to everything, most of the time a lot of errors too. The other interesting thing about rare errors is that if you plot very rare errors or new errors, sometimes those will find subtle discovery, privilege elevation, or lateral movement activity that is not voluminous; it's very small, maybe somebody doing it by hand, but it will generate rare errors. I've also found that rare and new errors can sometimes predict that a service failure is imminent; my SREs and folks use that, because a service that's going to fail in the next few minutes will often be preceded by a rare or new
error. So that is my talk, and we're pretty much on time, so I still have a few minutes for questions. [Question: it sounds like you essentially have a baseline. Is that dependent on the user, the machine, the egress? What are you evaluating? And on top of that, based off the events, do you apply a weighted factor, so that a combination of certain events may trigger where individually they wouldn't stand out?] Yes to both. It is similar to a baselining function, especially the new functions: whether it's new action, new source IP, or whatever the new observable for an entity is, you could call it baselining, because as it runs, whether on just 30 days of hot data or long term, it's essentially learning what the normal baseline is. Well, it's learning what is unusual for a user context or an entity, and by inverting that we can infer that anything else that's not showing up is, for the most part, probably normal baseline. In terms of combinations and correlations, I think that may actually be a really promising direction, because a lot of the conventional rules tend to have really large output; a lot of cloud-relevant rules have too many alerts. I think that combining different detection types together, looking for cases where different detection methods agree that some CloudTrail event pattern is anomalous, is going to be possibly the most promising way to do this. One of the things we're going to do next: we're doing these anomaly detection techniques today and they're working well, and one of the next things I plan to try relates to something that's really popular right now, creating rules that act on sequences or sets of events, where you say, we see these three or these five events for the same key within, say, 15 minutes. Those are popular, and a lot of people are doing them today. So over the next few months, over the summer, we're going to try applying a machine learning model to that: rather than looking for hardcoded lists of events, we'll look for anomalous or unusual clusters of events, either sets or sequences, in order to try to find which of those are actually suspicious. Other questions?
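One possible shape for that "methods agree" combination, sketched with made-up detector names and findings: collect each detector's (entity, detail) pairs and rank entities by how many distinct methods flagged them.

```python
from collections import defaultdict

# Hypothetical combiner; detector names and findings are invented for the
# sketch. Each detector emits (entity, detail) pairs, and entities flagged
# by several independent methods rank above entities flagged by one.
detector_output = {
    "new_action":  [("waf-role", "first-ever S3 GetObject")],
    "rare_geo":    [("waf-role", "source country seen in <0.01% of events"),
                    ("build-bot", "new ASN for account")],
    "error_surge": [("waf-role", "AccessDenied spike, z > 4")],
}

hits = defaultdict(list)
for detector, findings in detector_output.items():
    for entity, detail in findings:
        hits[entity].append((detector, detail))

# Rank entities by how many distinct detection methods agree on them.
ranked = sorted(hits.items(), key=lambda kv: len(kv[1]), reverse=True)
for entity, evidence in ranked:
    print(entity, len(evidence))
```

Agreement across independent detectors is the weighting here; a richer version might weight detectors differently, but even a plain count pushes the multiply-flagged entity to the top.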
[Question: from what you're seeing in CloudTrail, whether it's a new user or an existing user, could the age of the user serve as an indicator of anomalous events? So that you can tell somebody new is doing discovery, or it's a new account that was created for somebody and they're trying to understand the environment they're working in?] Yeah, that's actually a good idea. What I could get today is that if a brand-new user started doing lots of transactions, they might tend to show up in some of those new functions. But I think you're onto something, and in addition to that, what I probably should do is create an event for, say, CreateUser. Not an alert on CreateUser, because I don't want to bother you with alerts when a lot of that is normal, but create an event and then basically do a join: if I see a CreateUser event and it correlates with either anomalies or rule-based detections, then give it a higher priority. The other thing I plan to do is start using either a lookup table or a query to reason about which source IPs actually belong to the organization and which ones don't. Because if the call is coming from inside the house, from another AWS account, nothing is going to be unusual: the source geography is going to be whatever one of the five geographies that the regions are in, and the source ASN is not going to be unusual, it's going to be AWS. Most of the time it should show up as a new source IP, but in order to see it as new you have to catch it relatively early. So one of the things I want to do is start labeling source IPs in the events, or, since that might be too expensive, maybe just develop a query that looks for alerts or detections from an IP address that is not associated with any of the accounts belonging to the organization: an AWS IP, but not associated with any of your VPCs. Then possibly turn that into a detection, because there aren't that many normal cases for activity to be coming in from somebody else's account, unless it's a third-party service provider or vendor, but I think we could probably deal with those by excluding
those. [Inaudible audience comment.] Yeah, exactly. And once that happens, those IPs, the IPs of the internet gateways and their infrastructure, could be excluded. That will require a little bit of housekeeping, because those can change, but they don't normally change day to day or hour to hour. We would see it if, say, a vendor account starts doing something it's never done before, or working in a service it's never used before, or if it starts generating a huge surge in, say, snapshot events or some other kinds of events or errors. I think the application here for finding unusual supply chain activity is as good as finding in-house activity. But I want to start typing source IPs as self or not-self, because if it's coming from an AWS IP and it's doing authenticated transactions in your accounts, and it's not coming from a vendor or service provider, then there's not really much explanation for that. And I think we're just about out of time. [Applause] Thanks, thank you so much.