
[Music]
Thanks for showing up. Yeah, it's nice to be here. Let's get into it. So, what's in the cloud? We'll be talking about the Azure cloud here. Let's look at the agenda for today: a quick introduction about me, what I do, my background. Morning guys, welcome. Then a quick introduction to what we are going to be talking about today: why traditional threat hunting doesn't really work in today's cloud world. Then a deeper dive into O365 logs: what does the schema look like, what are some of the fields we should be looking for, and things to take care of while threat hunting. Then I'll actually show you my Jupyter notebook, where we'll deep dive into doing some actual threat hunting, using some machine learning here and there to find anomalies, followed by some takeaways and references. My name is Kaiier. I work as a security engineer at Amazon. I do a lot of coding, build pipelines, and do research, and whenever I think my research is really good, I write a technical blog about it. I maintain a couple of open source repositories for various organizations, support open source, and am a huge advocate for data privacy. Outside work, I do a lot of marathons and triathlons, watch a lot of anime, and read a lot of manga.
You should be able to find me at Kay on Twitter, Medium, LinkedIn — pretty much everywhere. Disclaimer: the views, opinions, and content presented in this presentation are solely my own and do not reflect the views, policies, or positions of my employer or any affiliated organizations. The presentation is based on my personal research, experiences, and perspectives, and should not be interpreted as an official statement from my employer. Had to put that out there. Now let's get into the introduction. Threat hunting — let's read the definition: basically, an umbrella term for the techniques and tools organizations use to proactively identify cyber threats. What we try to do is be proactive rather than reactive. Reactive means something happened and then we respond to it; proactive means we detect those threats in the early stages of the kill chain — in MITRE ATT&CK terms, in the initial access or execution phase, not in the C2 or impact phase. How early can we detect those threats and react to them? That's the whole idea of threat hunting. Traditional threat hunting was focused on the expertise of the analyst or threat hunter performing the investigation, but modern threat hunting is focused on automated tools, enrichment with CTI, and a lot of automation, which gets the job done far better. As for types of threat hunting, we classify them into three: structured, unstructured, and ad hoc. Structured is when we are looking for TTPs — MITRE ATT&CK tactics, techniques, and procedures. Unstructured is when we are looking for IOCs: a breach occurred somewhere, and your customer runs to you saying, "Here are some IOCs provided by some research company or a CISA advisory or some other platform — can you run them against my infrastructure and see if any of them get a hit?"
Ad hoc is when your customer says, "My third-party vendor got hit by ransomware — can you make sure I'm safe?" So you hunt on demand. Structured is looking for TTPs; unstructured is looking for IOCs. Now, the limitations of threat hunting — why traditional threat hunting does not work in today's world. First, too much data. Earlier, one terabyte of data per day was the maximum an organization generated; today one terabyte is generated in an hour. How are you going to look for threats in that abundance of data — way too much data being ingested into your cloud, with no defined schema, landing in some data lake or object storage? You cannot expect a human being to sit through all of that data and find the threats. That's why automation is needed to aid the threat hunting investigation. Second, limited scalability — again, a lot of data. For your threat hunts or detection engineering queries, whether you're using Splunk or KQL or any other platform, it's very important that your queries are fine-tuned to look for specific things. If you're using wildcards and searching for everything and anything, that won't scale. For one day's worth of data wildcards are fine because they can still fetch results, but for 90 days, 100 days, or more, wildcards are not going to help. Third, the static nature of hunts: threats evolve every day, and every day someone comes up with a new vulnerability or a new pathway to do the exact same thing. If you are using something signature-based — looking for a particular IOC, a hash, an IP address, anything that static — it won't tell you what to look for; you have to be dynamic. Fourth, the inability to address unknown threats. Suppose you know there is a threat out there with a known TTP: it uses this particular vulnerability, and this is its pathway in. You know how to look for it because you know what it looks like. But what if you don't know what it looks like — like a zero-day? You don't know how to look for a zero-day because you haven't seen one. How are you going to look for an anomaly when you don't know what it looks like?
That's what we are going to discuss here: how to look for things when you don't know what they look like. In the typical data flow, you have your log sources — endpoints, network, cloud, O365, object storage, whatever they are. Your log sources forward all the data through forwarders into your data lake; in Azure terms, that's Azure Data Lake, or you can put it into blob storage — call it the data lake, the database where all the data from your customers comes in. From there it is fed into your SIEM solution. But why am I not pointing my Jupyter notebook at the SIEM? Because the data that goes into a SIEM is always filtered: you only send the data for which you have detections or use cases. You might have 100 terabytes of data, filter it down to maybe 70 terabytes, and put that into the SIEM, because SIEM ingestion costs you and you want to reduce that. That's why we are not tapping the SIEM — we tap directly into the Azure Data Lake. In the traditional flow, your analyst investigates from the SIEM; what we want to do instead is use a Jupyter notebook, tap into the data lake, and do the threat hunting there.
Let's look at a couple of things about O365 before we dive in and do the actual threat hunting, because we need to understand what normal looks like before we start looking for what is bad. The O365 Management Activity API schema comes in two layers: a common schema and service-specific schemas. The common schema covers the fields that are common across multiple services. A service-specific schema is, well, very service-specific — OneDrive, SharePoint, and so on. If you're looking at one particular service — Outlook, SharePoint, Azure — you care about that service's specific schema; if you're looking across Azure and across O365, you go for the common schema. In file and page activities, some fields look very interesting to me when I do threat hunting: file access and file downloads. If somebody has hacked your account and gotten access to it, what does the file download behavior look like? Are they bulk-downloading, say, 100 files at the same time? Are they accessing sensitive files, or SharePoint sites they should not have access to? That's a very good indicator of how the normal behavior of a user differs from an outsider who has gotten into the account. Then there are the sign-in activities: user logged in and user logged off. These are sign-in events, but they are logged into O365 — the usual sign-in events go into Entra ID, so these are logged differently. Under directory administrative activities there is Set Domain Authentication, one of my favorite things whenever somebody is attempting a Golden SAML attack — this field is very important. You need admin privileges to change it, so if you see domain authentication being set and that event logged, that's a very strong indicator something is going on. The person doing it should have a very valid reason; otherwise, suspect an indication of a Golden SAML attack and start your investigation there. Exchange mailbox activities are all about accessing email items — new inbox rules and set inbox rules. When you create a new inbox rule, saying any email from this domain gets forwarded here or moved to a different folder, it's logged under new inbox rules; when you modify one, it goes under set inbox rules. On the identity side — Entra audit, Entra sign-in, and Entra provisioning: Entra audit is any change applied to your tenant, Entra sign-in covers the Azure sign-in events, and Entra provisioning is about resource provisioning. I think that's enough introduction — let's get into Jupyter and see the actual stuff. For starters, I do some imports, basic things. I'm using pandas here because I have a smaller data set; I just want to show you how it works.
At scale, you might want to use something like Apache Spark, and instead of CSVs you'd probably want the Parquet format, which is much better for storage. But this is a demo, so pandas and CSVs it is. Load the file-downloads data set and look for bulk downloads: flag any user doing more than 100 downloads — why should a user be downloading more than 100 files? — with a 24-hour time filter applied. That threshold depends on your environment: if you see a user with legitimate reasons to download 100 files, and it's a normal occurrence, bump the threshold up to 200 or 300. That's a fine line you have to figure out for your environment — why is the user downloading that many files in that short a time? If there's a business use case, fine; if not, flag it. Reading the file-download CSV gives about 360,000 rows, with these columns — the features. I take the relevant fields: user ID, file name, IP address, user agent, and file size. These are the fields, or features, I think are relevant for my detection. This is called feature engineering: selecting the relevant fields from all the available fields. Then comes the encoding part. Machine learning cannot be done on characters or alphanumeric values — it runs on numbers. So you convert your strings into numbers; essentially, any machine learning is matrix multiplication. You take your data set, select the right fields, convert them into a numeric matrix, and then multiply matrices.
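A rough sketch of that feature selection and one-hot encoding step in pandas. The column names and sample values here are illustrative placeholders, not the exact O365 audit-schema field names:

```python
import pandas as pd

# Hypothetical sample of file-download events; columns are illustrative.
df = pd.DataFrame({
    "UserId":        ["alice", "alice", "bob", "bob", "eve"],
    "FileName":      ["a.docx", "b.xlsx", "a.docx", "c.pdf", "dump.zip"],
    "ClientIP":      ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2", "203.0.113.9"],
    "UserAgent":     ["Edge", "Edge", "Chrome", "Chrome", "python-requests"],
    "FileSizeBytes": [120, 340, 120, 560, 99000],
})

# Feature engineering: keep only the fields relevant to the detection.
features = df[["UserId", "FileName", "ClientIP", "UserAgent", "FileSizeBytes"]]

# One-hot encode the categorical fields into a numeric matrix a model can use.
encoded = pd.get_dummies(
    features, columns=["UserId", "FileName", "ClientIP", "UserAgent"])
print(encoded.shape)
```

The numeric matrix `encoded` is what the clustering or autoencoder step actually consumes.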
Here I'm using one-hot encoding, and the model is a Gaussian mixture — again, unsupervised clustering. Clustering takes all the data, plots it, and forms clusters: anything normal, anything with similar behavior, groups into one cluster, and whatever doesn't sit with that central cluster goes to the outskirts. Those are the events that don't sit with the rest of the data, and we need to investigate why. Anomalies here use a 95% score: whatever sits with 95% of the data goes into the central cluster, and the 5% that doesn't is treated as anomalous. That's an assumption — you can always tweak the factor up or down. Here we see 507 anomalies out of 300,000-something events. If I have 24 hours of data and I'm flagging 507 rows — okay, fine, but do I really want to throw 507 alerts at my tier-one or tier-two SOC investigators? That's on the high end; 507 is not a reasonable number for one use case.
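The clustering idea can be sketched like this — fit a Gaussian mixture on the encoded events, then treat the small cluster that sits on the outskirts as the anomalous slice. The data here is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
normal_events = rng.normal(loc=0.0, scale=1.0, size=(950, 4))  # baseline behaviour
odd_events    = rng.normal(loc=8.0, scale=1.0, size=(50, 4))   # unusual behaviour
X = np.vstack([normal_events, odd_events])

gm = GaussianMixture(n_components=2, random_state=7).fit(X)
labels = gm.predict(X)

# The component holding the minority of events is the outskirts cluster.
sizes = np.bincount(labels)
outskirt = int(np.argmin(sizes))
anomalies = np.flatnonzero(labels == outskirt)
print(len(anomalies))  # roughly the 5% that does not sit with the rest
```

In practice you would fit on the one-hot-encoded feature matrix and tune the number of components and the anomaly fraction to your environment.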
Five hundred alerts a day across all use cases might be okay, but for one use case it's a bit too much — that's why I don't want to stop at clustering. Let's see if there's a better way. First, let's actually visualize what I was talking about: the central cluster and the outskirts. I take two fields here, because you need two points for a plot: user and user agent. If I use Microsoft Windows on a daily basis and then one random day I decide to use a MacBook, that's a different user agent, and it should be flagged very clearly. If you look at this graph, the central cluster is the 95% of the data with similar factors binding it together, and everything on the outskirts doesn't sit with the data in the center — I'm suddenly using an Android device instead of my Windows machine, or a Raspberry Pi, or a Python crawler; any user agent that doesn't fit the regular pattern in that data set gets flagged on the outskirts. It depends on which fields you want to use. I used user and user agent; you could use IP address, or for file activity the number of files or the size of files — whatever features work in your environment, choose those and play around. This is plug-and-play: take this code, choose your fields, and go. Those approaches were good when we were doing one-to-one mapping — we knew what we were looking for, found a threshold, and said: okay, bulk file downloads.
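A non-graphical version of the same user/user-agent idea: for each user, compute how rare each user agent is relative to that user's own history, and surface the one-off pairings. Field names and values are illustrative:

```python
import pandas as pd

# One user who is normally on Windows/Edge, with two one-off agents.
events = pd.DataFrame({
    "UserId":    ["alice"] * 100,
    "UserAgent": ["Windows-Edge"] * 98 + ["MacBook-Safari", "python-crawler"],
})

# How often does each (user, user agent) pair occur within that user's activity?
counts = events.groupby(["UserId", "UserAgent"]).size().rename("n").reset_index()
totals = events.groupby("UserId").size().rename("total").reset_index()
share = counts.merge(totals, on="UserId")
share["fraction"] = share["n"] / share["total"]

# Agents making up under 5% of a user's activity are the outskirt points.
rare = share[share["fraction"] < 0.05]
print(rare[["UserId", "UserAgent"]])
```

The 5% cutoff plays the same role as the cluster boundary in the scatter plot — tune it per environment.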
But what if I don't know what I'm looking for? I want to figure out that something unusual is going on in my environment without knowing what it looks like. That's the case most of us are trying to solve, and that's where machine learning comes in. I take the exact same file-downloads data set and again do feature engineering — user, file name, IP, platform, and file size — and then drop the duplicates, because I don't want to deal with duplicated rows. Then I use an autoencoder model, one of my favorite things. Again unsupervised learning, easy to train, easy to get outputs from, and the best part is it needs relatively little data: give it 30, 60, or 90 days of data and it's easily trainable, easily deployable. The actual logic of an autoencoder is: take all the input data and compress it to a lower dimension — say from 100 dimensions down to 10 — and then expand it back. That's also why autoencoders are a really good fit for image or file compression: compress the input, then recreate it. Anything that is normal baseline activity goes through that compression and comes back close to the same baseline. Speaking loosely in numbers: if the data is 2, 4, 6, 8, each value compresses and reconstructs back to roughly itself. But if there's a 30 in among the 2, 4, 6, 8, when that 30 goes through compression and expansion it comes back badly reconstructed, standing apart from the other numbers — and that is your anomaly. That's the mathematical intuition for how autoencoders work. So I take all the relevant fields, feed them to an autoencoder — there's some encoding, a mean squared error loss, ReLU activations, a bit of machine learning terminology — and run it for 50 epochs of training and testing. I split my data set into training and testing: 80% for training, 20% for testing, to check whether the model actually works, plus the same kind of split for validation, with a batch size of 32 — each batch takes 32 events — across 50 epochs, or 50 iterations. Here you can see the loss and validation loss, and finally the output: one, two, three, four, five. Out of 300,000 events, I have five that are anomalous. Clustering gave me around 500; the autoencoder gives me five. Throwing five at my analysts is way better than 500, so this is a much more sensible way to do detection engineering — one that is not going to flood my tier-one and tier-two analysts.
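The autoencoder step can be sketched with scikit-learn's `MLPRegressor` standing in for the Keras model used in the talk: train the network to reproduce its input through a narrow hidden layer, then flag the rows it reconstructs worst. The data is synthetic and illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 0.1, size=(500, 8))  # normal, compressible rows
outliers = rng.normal(5.0, 0.1, size=(5, 8))    # the "30" among the 2, 4, 6, 8
X = np.vstack([baseline, outliers])

# Narrow hidden layer = the compression step; ReLU activation, squared loss.
autoencoder = MLPRegressor(hidden_layer_sizes=(2,), activation="relu",
                           max_iter=200, random_state=0)
autoencoder.fit(baseline, baseline)  # learn to reconstruct normal data only

# Per-row reconstruction error; the highest errors are the anomalies.
errors = ((autoencoder.predict(X) - X) ** 2).mean(axis=1)
flagged = np.argsort(errors)[-5:]
print(sorted(flagged.tolist()))
```

Training on normal data only is the key design choice: the network never learns to reproduce the outliers, so their reconstruction error stays high.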
So what is this showing? Let's take one more example and figure out exactly what's going on. That was file downloads; now let's look at file access. Anybody who has compromised my Azure or O365 account will try to browse around and figure out which files they have access to — not necessarily downloading them, because a lot of attackers assume bulk downloads will trip a flag or an alert. They just access files to see whether the account can reach confidential data: architecture documents, business or financial data. So here we're talking about file access, and again the features matter — select the right ones. I used user ID, IP address, application display name (which application is being used to access the files), and authentication type. You can choose your own fields; these worked for me, so I chose them. Same autoencoder, trained for 50 epochs, and I got 267 hits — a bit more than I expected. The reason is that we have way too many applications doing way too many things. If you actually know what normal behavior looks like in your environment — say, Microsoft Office uses OAuth for some activity, perfectly fine — then create a list of known activity: this application is known to use this particular authentication method for this activity. Put that in as a baseline of fine-tuned normal behavior and feed it into the model, so the model knows those activities are perfectly fine and need not be flagged. 267 is not a great number, but not too bad either. These machine learning use cases can be extended to any file type and any activity — sign-in logs, file downloads, email access, new mail rules, any of those Microsoft-specific actions.
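The baselining step just described — suppressing known-good application/authentication pairs before anomaly scoring — can be sketched as a simple pre-filter. The allowlist and column names are hypothetical:

```python
import pandas as pd

# Illustrative file-access events.
events = pd.DataFrame({
    "UserId":             ["alice", "bob", "eve"],
    "AppDisplayName":     ["Microsoft Office", "Microsoft Office", "CustomScript"],
    "AuthenticationType": ["OAuth", "Basic", "OAuth"],
})

# (application, authentication method) pairs baselined as normal behaviour.
known_good = {("Microsoft Office", "OAuth")}

is_known = events.apply(
    lambda row: (row["AppDisplayName"], row["AuthenticationType"]) in known_good,
    axis=1)

# Only the unexplained application/auth combinations go forward for review.
to_review = events[~is_known]
print(to_review["UserId"].tolist())
```

Feeding only `to_review` (or weighting `known_good` rows down) into the model cuts the 267-style result counts without hiding genuinely odd combinations.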
Now let's talk about something for when machine learning is too much pain: regular use cases you can deploy right away. These queries are written in Python, so they're easily convertible to KQL or Splunk's query language or whatever you use. Here we are looking for multiple file downloads in a short window — 10 files downloaded within 180 seconds, roughly a three-minute interval. How many users got flagged? 26 users. Now, if I plot this as a graph, I can see how many users are downloading files from a managed device. The important field here is the managed-device flag: am I downloading those files from a device that is registered, a device I know exists in Azure AD? If I'm using my own laptop, my MacBook, and downloading files from it — perfectly fine. But what if the file downloads are coming from Android? Why is the user downloading files from Android?
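The 10-downloads-in-180-seconds query above can be sketched with a pandas time-based rolling window. Timestamps, usernames, and thresholds are illustrative:

```python
import pandas as pd

# One user bursts 12 downloads 10 seconds apart; another downloads twice,
# an hour apart.
events = pd.DataFrame({
    "UserId": ["mallory"] * 12 + ["alice"] * 2,
    "Time": (list(pd.date_range("2024-01-01 09:00", periods=12, freq="10s"))
             + list(pd.date_range("2024-01-01 09:00", periods=2, freq="1h"))),
})

flagged = set()
for user, grp in events.sort_values("Time").groupby("UserId"):
    grp = grp.assign(one=1)
    # Count of this user's downloads inside each trailing 180-second window.
    window_counts = grp.rolling("180s", on="Time")["one"].sum()
    if (window_counts >= 10).any():
        flagged.add(user)

print(flagged)
```

The same shape of query translates directly to KQL's `summarize ... by bin()` or a Splunk `streamstats` over a time window.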
Or why is the user downloading files from an operating system that was never registered before? That's a question that needs answering, and that's where this graph helps. I'm a huge fan of graphs. Terabytes of data in a tabular format don't make much sense, but plot them in even a simple graph and they make a lot of sense. Here you get two clusters — managed device: yes, and managed device: no — and the points in between are the things that need looking into. Why is there a third group downloading files from unknown devices? Is that even okay in this environment? Is there a good business reason for it? These graphs are so good for that. I didn't want to make this one too complicated, so the labels are off — if you look at the code, it says the label option is off. Set that value to true and you can see exactly what each node represents. Next: inbox rule manipulation — anybody trying to do some manipulation in your O365 Outlook environment.
We are looking for add-inbox-rule, modify-inbox-rule, and new-inbox-rule operations. If I set a new inbox rule saying any email from, say, the CEO of this firm gets forwarded to a Gmail account — any email with business-confidential information, any email containing certain keywords, forwarded to Gmail — that's basically exfiltration. Look for those things. Federation services: as I said earlier, Set Domain Authentication is a very strong indicator of a Golden SAML attack, so look for that one. Phishing and malware detections: Microsoft has the capability to flag any incoming email as phishing or spam. Here's the thing, though — we have also seen plenty of phishing emails that Microsoft does flag still get into the inbox, not purged or moved to a spam folder; they actually get in. The idea here is that all those flagged emails give you a data set of known bad. That data set can be used to train your model on what a phishing or spam email looks like in your environment, so your models learn to tell business emails from phishing emails. What exactly are we looking for here? Events where the Microsoft 365 verdict equals phish and the event action is TI URL click data — which means the phishing URL was actually clicked. As simple as that. So we've got a handful of emails here, known spam and phishing and the like — a good harvest of data. One of the questions I usually get is: to train a really good machine learning model you need bad data — what does a threat look like, and where do I get bad data from? Your own environment. You have a lot of malware and phishing coming into your environment on a daily basis; just extract it and use it as training data. These outputs show exactly that — delivered as spam, and so on. Azure AD risky sign-on — one of my favorite things to do, actually. Any time a user signs in to an Azure or O365 application, that is logged, and Microsoft looks at the IP address it's logged from, the user agent, the location, and some other parameters to classify the login as risky or regular.
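Harvesting those flagged emails as labeled training data might look like the sketch below. The column names ("Verdict", "EventAction") are placeholders for whatever your email-security logs actually expose, not official schema fields:

```python
import pandas as pd

# Illustrative email-security events.
email_events = pd.DataFrame({
    "MessageId":   ["m1", "m2", "m3", "m4"],
    "Verdict":     ["Phish", "Clean", "Phish", "Spam"],
    "EventAction": ["TIUrlClickData", "Delivered", "Delivered", "Delivered"],
})

# Everything already flagged as phish/spam becomes labeled known-bad data.
known_bad = email_events[email_events["Verdict"].isin(["Phish", "Spam"])].copy()
known_bad["label"] = 1  # 1 = malicious, for a future classifier

# Clicked phishing URLs are the highest-priority subset.
clicked = known_bad[(known_bad["Verdict"] == "Phish")
                    & (known_bad["EventAction"] == "TIUrlClickData")]
print(len(known_bad), len(clicked))
```

The `known_bad` frame is the "harvest" — a continuously refreshed, environment-specific set of malicious examples for model training.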
What we do is take those risky logins where the risk state is anything but none — at risk, remediated risk, any of those — and where the app display name is a null value. I actually wrote this detection after working an incident response for a client: the attackers got into the Intune account, had admin privileges, and it ended up as ransomware. The app display name was a null value; they got in and then logged into Office Home. Look for those parameters — this is actually one of the TTPs of a known APT group. So now Azure AD is making sense. But what if my client uses an MFA provider that isn't Microsoft? If Microsoft MFA is used, all those MFA activities are logged into Azure. But if they're using something like Okta, the MFA takes place outside of Azure and is not logged there. So how are you going to correlate a login activity that started in Azure with the MFA that happened in Okta? You need a field you can map across both tenants: in Azure that's the user principal name — the username or email ID — and the exact same parameter in Okta is the alternate ID. So you take both data sets for the same time frame. There might be a few seconds of delay, since the flow starts in one and finishes in the other, so the timestamp alone can't match them exactly — take an hour of data. On the left-hand side I've got Azure logs; on the right-hand side, Okta. I join the user principal name on the left to the alternate ID on the right, and then figure out: was the MFA successful, and how was it done — an authenticator application, a push notification, an OTP code? When you correlate two different sources describing the exact same activity, you can figure out a lot more.
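That Azure/Okta correlation can be sketched as a pandas join over a shared one-hour slice. Column names and values are illustrative:

```python
import pandas as pd

azure = pd.DataFrame({
    "UserPrincipalName": ["kai@example.com", "bob@example.com"],
    "SigninTime": pd.to_datetime(["2024-01-01 09:00:01", "2024-01-01 09:05:00"]),
})
okta = pd.DataFrame({
    "AlternateId": ["kai@example.com"],
    # MFA lands a few seconds after the Azure sign-in started.
    "MfaTime": pd.to_datetime(["2024-01-01 09:00:04"]),
    "Factor": ["push"],
})

# Join Azure's user principal name to Okta's alternate ID.
joined = azure.merge(okta, left_on="UserPrincipalName",
                     right_on="AlternateId", how="left")

# Sign-ins with no matching MFA record stand out immediately.
no_mfa = joined[joined["Factor"].isna()]
print(no_mfa["UserPrincipalName"].tolist())
```

A left join is the right shape here: every Azure sign-in survives, and the missing Okta side is itself the finding.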
I think that's the Jupyter part of it. Now let's move to the takeaways — basically everything we looked at in these slides. The challenges with traditional threat hunting: why it doesn't work today and why we need to do something more. Jupyter-based threat hunting — I love it because it's so easy to implement in any environment. You don't have to buy a SIEM solution or set up your environment specially for it. All you need is to download the Jupyter notebook or app, set it up, and read data from wherever it's available — a database, an S3 bucket, Azure Data Lake, anywhere. You read the data, run your queries, and generate output as a CSV. That CSV can be sent to your SIEM solution — Splunk, Sentinel, Elasticsearch, any of them — with just an API call back. So you're running your detections outside your SIEM, enhancing it, and sending alerts. If you don't want to use a SIEM at all, perfectly fine: send the alert to wherever your analysts are working, a Jira ticket or TheHive — applications you can send that alert to.
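Shaping the notebook's CSV output into alert payloads for a SIEM or ticketing API might look like the sketch below. The payload fields and endpoint are hypothetical — adapt them to whatever Sentinel, Splunk HEC, Jira, or TheHive actually expects:

```python
import csv
import io
import json

# Stand-in for the CSV the notebook produced.
csv_output = io.StringIO(
    "UserId,Detection,Score\n"
    "eve,bulk_file_download,0.97\n"
)

alerts = []
for row in csv.DictReader(csv_output):
    alerts.append({
        "title": f"{row['Detection']} by {row['UserId']}",
        "severity": "high" if float(row["Score"]) > 0.9 else "medium",
        "source": "jupyter-threat-hunting",
    })

payload = json.dumps(alerts)
# The API call back to the SIEM would go here, e.g.
# requests.post(siem_url, data=payload, headers={...})  # url/headers per product
print(payload)
```

Keeping the payload construction separate from the HTTP call makes it trivial to retarget the same detections at a different SIEM or ticketing system.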
Automating the threat hunting methodology: what we saw was a manual threat hunt, but it can run on a cron basis — every 24 hours, every 12 hours — just run these queries, generate the output, and send it to Sentinel. Ease of adoption and integration is what I talked about. Now I'm open for questions.

Q: Where do you get your use cases?
A: These are use cases I wrote after seeing a lot of attack vectors. They're specifically mapped to TTPs of APT groups and their activities, so I look at those threat actors, see what they're trying to do, and come up with use cases based on that.

Q: If you already have all the data from your data lake in your SIEM, is there still an advantage to pulling things out into the Jupyter notebook?
A: The advantage is that all the data does not go into the SIEM — you always use a filter. But if you are sending everything into the SIEM, and your SIEM supports aggregated machine-learning-based queries — Splunk has Splunk Machine Learning — you can run the exact same thing on Splunk as well. These are methodologies you can add to enhance the SIEM's capabilities: run them in Jupyter, in the SIEM, wherever you want.

Q: How long does it take to train the models and run your data?
A: These were trained on seven days' worth of data: take the initial seven days, do the initial training, and run that model so it has a baseline. That will start working, but if you need really good output with fewer false positives, I'd say 30 days is a good bet.

Q: What is the recommended minimum resource to train this type of model with a data set like this use case?
A: For a Jupyter notebook able to query a terabyte of data, all you need is about 32 GB of RAM. You can run it locally or in your cloud; if you have GPUs, it's way faster.

Q: So you can't scrape by with less?
A: 16 GB is fine — it's just going to take a lot of time, and sometimes the kernel might crash because there's too much data. In that case do batch processing: split the data into smaller chunks and do the training on those.

Q: If you're already running something like RITA, what kind of solution would marry the two together and bridge that gap?
A: A lot of publicly available open source libraries support combining multiple platforms in Jupyter itself. The notebook can read from multiple different sources; all you need is a wrapper around that function — any supported SDK or developer tooling. If a tool has developer support or an SDK, pack that into Jupyter, bring the data in, and do the cross-correlation right there.

References. Thank you guys. [Applause]