
So, my name is James. I'm a principal incident responder within the CERT at Coinbase, and over the last 10 years I've spent quite a bit of time in different incident response, MDR, and research roles. At the beginning of my career, for my sins, I also spent a little bit of time as a sysadmin before seeing the light and moving over to the security side. As you might have guessed from my current role, I have a particular interest in incident response, detection engineering, and continuous improvement, which is just a fancy term for looking at the processes that detection and response teams go through on a daily basis and finding ways to make those processes incrementally more efficient, with a view to scaling them and enhancing a team's overall capability. That's really the foundation of today's talk.

On a personal note, I've been coming to BSides London for about 10 years now; the first cyber security conference I ever went to was BSides London in 2012, so it's great to be here after 10 years as an attendee and to get to present today. Hopefully you all take something away from this talk.

A nice, lightweight agenda to kick things off. I'll be talking about the challenges of scaling detection and response teams, and that's a huge topic area, far more content than I could possibly talk through in the next 40 minutes.
So really, what I'll do first of all is scope today's conversation down to a particular area, which is enabling efficient investigations, with the aim of helping teams scale and grow. Even as I talk about detection and response teams, that can mean a lot of different things: from one person at a company or an org who's responsible for all things security, all the way to the other side of the spectrum where you've got an entire security department with multiple teams within it, with dedicated roles and plenty of resources and time to make things incrementally better. I'm hoping that what I talk about today is relevant to teams across that spectrum. It does lean slightly towards the more mature organizations, just because they have a little more time, they have some of the fundamentals out of the way, and they can look for ways to make themselves incrementally better.

Once I've established a common problem space with you all, I'm going to take you through three case studies that we've implemented at Coinbase over the last couple of years to help solve these problems. The three case studies are: automating investigation and response workflows, building context into detections, and bringing employees into the triage process.

So, why focus on efficient investigations?
When I think about how teams can scale properly, I find it beneficial to think through the different processes that teams have to have enabled, and have maturity in, in order to detect and respond to threats effectively. The four stages I've put on the screen, logging, detection, investigation, and response, are roughly adapted from the Funnel of Fidelity by Jared Atkinson from SpecterOps. If you haven't heard of the Funnel of Fidelity before, give it a Google when you get home and read the blog post; it's a great way to think through the different roles and responsibilities that exist inside a detection and response team that need to be working correctly in order for the team to scale properly.

While today I'll be talking about investigation maturity in particular, we will touch on detection maturity, and we'll cover response maturity a little bit as well. That's not to say the other pillars aren't important; logging and detection are critically important. But as I look across the industry, I feel like the more mature organizations already have logging and detection in a reasonably mature state, and there are some great resources out there to help you in these areas. There are modern SIEM solutions like Splunk, Elasticsearch, Snowflake, and LogRhythm that you can purchase or deploy yourself to help you get to a reasonable level of maturity in this area.
When it comes to detection maturity, this is a super important area that I don't think we're ever going to fully solve, but again, you can pay vendors to get access to their detection capabilities, and there are some great open-source projects like the Sigma project that give you SIEM-agnostic detection rules you can take and apply to your own use cases. On the response side of life, I think this is an area where, as an organization, a company, or an industry, we're not putting that much focus, but it's a low-volume problem. What I mean by that is, if we're looking to scale our teams more effectively, we want the best return on investment, and hopefully you're not performing response every single day; it's a comparatively rare occurrence. Investigations, in comparison, are something we do day in and day out, so if we can work out how to scale that area, we have a good chance of helping our overall team work more effectively.

Now, as I look at logging and detection, I feel like as an industry we've gotten to a reasonably good position in some places, and we put a lot of time and effort into making sure they're working properly. But at the point where a detection is generated, we want to push it to a human to investigate,
and we sort of have this attitude of: we'll stick it into a ticketing system, probably Jira, and we'll tell the analyst to go and figure out whether this detection is a true positive or a false positive by digging through all the data that exists in our SIEM. Never mind that it's terabytes of data, millions and millions of log entries; it's your responsibility to go and dig through it and figure out whether something is a true positive or a false positive. Good luck, have fun. So how can we get a bit better in this area, or rather, what are the problems? What I thought would be useful is to go through an example detection together and try to recognize some of the problems that exist when we have low investigation maturity.

At the top of the screen we've got an example detection: an anomalous login has succeeded for a user from an IP address that's geolocated in France. I chose this detection because, across every company I've ever worked at, I have seen some variation of it: some impossible-travel logic, where someone's written a detection rule that looks for an anomalous pattern of behavior, and an analyst is left to respond to this type of detection. In some cases this is all the information they get: a Jira ticket gets created with this as the title, and it's now up to them to choose their own path to investigate it.

So, as an analyst, let's put our thinking hats on: what questions might we ask, and what data sources might we use, to investigate this detection?
First of all, you might ask what recent logins this user has. Do they typically log in from this IP address? Do they typically log in from France? Is the detection logic just bad? What team does the user work in, and where are they located? Are they supposed to be in France, or do they work in the United States? What devices are assigned to the user? If this login is for a Windows machine and you primarily use MacBooks across your estate, that might lean you towards this being a true positive, but if you check your asset management tool and find out the user is a developer who's just been assigned a Windows machine for testing, that might lean you in the other direction. Is the user online? Are their devices online? Can we just message them on Slack and say: hey, are you in France, are you using a VPN, is this you? And then a really important one: what other alerts do we have for this user or their devices over the last hour, day, or week? Because the last thing we want is to miss a detection from two days ago, say AnyDesk being installed on their device, and fail to correlate those two detections together.
Just across those five questions that I've quickly rattled through, you can see we've hit a wide variety of logs and data sources: identity provider logs, potentially 2FA logs, we had to sign into an HR tool, query an asset management tool, check Slack, maybe sign into an EDR console, and then maybe also search Jira for previous alerts and previous incidents. Even if all of that data is in a SIEM, ready for you to query, you still have to write queries to get it out. You have to copy queries from a runbook and change the timestamps and the user you're searching for, fix the SQL error you inevitably make, get the data back, and interpret it. Which means, just for those five questions, maybe it's taking you two minutes per question to get an answer. So before we've even really started investigating this detection, just forming that baseline understanding, we're already 20 minutes in. And when you think about the detection here, this is an anomalous login that could be from a threat actor; 20 minutes is a long time, even if you pick it up the moment it comes in.
Would you be comfortable letting a threat actor have access to your Confluence for 20 minutes? Your code base? Your Slack instance? Probably not. As I've been talking through that example detection, hopefully for some of you in the room who have worked in a SOC, this problem space has resonated, and I think there are a few key symptoms that come from limited investigation maturity. The first, which we've just covered, is delayed alert triage: we're already behind the curve, and it takes us too long just to get those key investigation facts out of the way so we can start performing a deeper investigation. The next one, which is a really key thing that I try to talk about a lot, is a lack of a shared investigative baseline. What I mean here is that when we give a detection like the one we just saw to an analyst, we're allowing them to choose their own path and ask their own questions, which means there's no common understanding between analysts of how to investigate an alert. If you give that detection to an analyst who's been at your company for three months and an analyst who's been there for two years, they're going to ask different questions and take their own paths to investigate the same detection.
Maybe your more experienced analyst knows better questions to ask, or better queries to run, or even just the better column to bring back in a query, the one that shows the data point that makes it a true positive rather than a false positive. I've seen this have anything from a minimal impact, where it simply takes one analyst longer to investigate than another, up to a really significant impact, like one analyst resolving an alert as a false positive while another analyst, investigating the exact same alert, reaches the opposite resolution. And then finally, just talking through this problem gives me a little bit of fatigue; there's an analyst fatigue problem here. If your analysts are responding to 10 alerts a day, and a lot of people including myself would argue that's too many alerts for one analyst to be investigating, and every single time they're having to ask the same questions and run the same queries just to get to that baseline level of understanding, there's going to be a big difference between how they investigate the first alert of the day and how they investigate the seventh. They're also expending a lot of focus, concentration, and effort merely reaching that baseline level of understanding, which means they're fatigued before they even start to dig into what we really care about, the nitty-gritty details.
So, to wrap this up in a bow, if you will: it just takes too long per detection to perform a meaningful investigation, and that hinders our ability to scale effectively. As your company starts bringing on more data sets and generating more detections, even if they are high fidelity, if your high-fidelity rules are still generating one or two false positives every two months, over time that adds up and you're going to struggle to scale. So that's the problem space; hopefully some of you recognize these issues from working in a SOC and can feel the pain.

What I want to do now is take you through three case studies we've implemented at Coinbase to try to solve some of these problems. Some of them will be new to you, some might be old to you, and if you have other ways to solve these problems, please come talk to me afterwards, I'd love to hear about them. The first one is automating investigation and response workflows.

When I started looking at this problem space, I began by looking at the different detections our analysts were responding to and trying to group them together. What I noticed is that when a detection fires, there's always a starting point, some fact the rule ran on to identify something anomalous.
The starting point could be a serial number, an email address, an IP, a hash, and so on and so forth, but there's always some kind of starting point. What I did was cluster all those starting points together, and I recognized that there isn't an infinite number of them; there are actually quite few, maybe 15 or so. And when we started looking at the different questions our analysts were asking for each of those groupings, I noticed they were all asking similar questions, with some variation, but all trying to get to the same answers. If you've got a detection for a user login, we always want to know who the user is, what assets are assigned to them, what their single sign-on history looks like, and so on and so forth.

So we came up with this idea: how do we automate this? Wouldn't it be cool if we had a central application where you could provide a starting point, an email address, a serial number, a hash, an EDR GUID, a cloud ARN, and as a group we could crowdsource the best queries to run? So when you provide that email address as a starting point, it goes and hits historical logs and brings back all the things we care about: single sign-on logs, asset inventory logs, SaaS logs, and so on and so forth.
While we were thinking through this problem, we realized that historical logs are great, but sometimes we want information right now, live. Is this user's account suspended? Is their endpoint isolated? So we thought, why don't we get the application to plug into some of our systems as well and pull back that data via an API? We don't want our analysts having to open a new tab, go to their EDR console, sign into their SSO platform, and get that information manually. And then finally, we were thinking: at some point our analysts are going to investigate a threat, figure out it's a true positive, and want to perform some kind of response. Again, we don't want them to have to open yet another tab in their browser, sign into the EDR console or the SSO platform, and worry about suspending the account, suspending them in 2FA, invalidating their backup tokens, performing all these response tasks, and then wrapping it all up and updating a ticketing system. Let's automate that as well. That was the high-level vision we were trying to achieve.
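To make the shape of that idea a little more concrete, here is a minimal sketch of the fan-out pattern. It is not Coinbase's implementation: the workflow names, the stub functions, and the example email address are all placeholders for whatever SIEM queries and vendor APIs a team already has.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Each "workflow" is a named callable that turns a starting point into data.
# In practice these would wrap your SIEM queries and vendor APIs; here they
# are stubs so only the overall shape of the idea is shown.
def sso_logins(starting_point: str) -> List[dict]:
    return []  # e.g. run the team's best SSO-login query against the SIEM

def assigned_devices(starting_point: str) -> List[dict]:
    return []  # e.g. query the asset-management API

def edr_host_status(starting_point: str) -> List[dict]:
    return []  # e.g. live lookup: is the endpoint online or isolated right now?

def related_alerts(starting_point: str) -> List[dict]:
    return []  # e.g. search the ticketing system for the last seven days

# Crowd-sourced queries live in one registry, so every analyst benefits from
# the best version instead of it sitting in one person's notes.
WORKFLOWS: Dict[str, Callable[[str], List[dict]]] = {
    "sso_logins": sso_logins,
    "assigned_devices": assigned_devices,
    "edr_host_status": edr_host_status,
    "related_alerts": related_alerts,
}

def investigate(starting_point: str) -> Dict[str, List[dict]]:
    """Fan out every registered workflow for one starting point (an email
    address, serial number, EDR GUID, ...) and collect the results."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, starting_point) for name, fn in WORKFLOWS.items()}
        return {name: future.result() for name, future in futures.items()}

baseline = investigate("analyst-test@example.com")  # one search, many answers
```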
So we started building this out, and along the way we ran into some really nice, unexpected side benefits. By abstracting a lot of the complexity into a central place, we realized we could more closely align responsibilities to problems. Our experienced analysts, as they work, are constantly building and refining better queries to answer their questions, and typically those queries were living in someone's OneNote, or in Confluence, or in their brain, which means everyone else in the team doesn't get to benefit from them. What we realized is that, over time, instead of keeping those queries in OneNote, they could contribute them to the central application, and the entire capability of the team would grow and grow. That meant that when we onboarded a new analyst, what used to take a month or two, getting up to speed, learning all the data sets, learning all the right queries to run and how to combine it all together before they could meaningfully contribute to investigations, they could now do within a week or two: take a detection, take a starting point, plug it into the application, and benefit from all the best-in-class crowdsourced queries we had as a team.

It also meant that on the back end our analysts and engineers could roll data sets off and roll new data sets on. A good example is when we moved from one EDR platform to another: there was a period of time where some of our machines were on one EDR and some were on another.
At the point where an analyst wants to isolate an endpoint, they don't care which EDR you're using; they just want to stop the bleed and get that response action going. They don't want to have to log into two consoles and work out which EDR this machine is running. By abstracting that into, essentially, a button that says "isolate this endpoint," we allow the analyst to focus on what they care about, which is responding to the threat.

So that's the background, and what I thought would be useful now, instead of me just talking over what we tried to build, is to show you. What I want you to keep in mind as I play this video is that detection we went through together: all those questions we had to ask, how long it took to get answers, and the fact that you were doing that every single time you had a detection. What we tried to do is automate that workflow by taking a single investigation fact, plugging it into an application, and making it do the work for us. On the right-hand side you have the core body of the application. I should mention this is a quite heavily redacted version, and it's also cropped so you can't see the cool branding at the top and bottom; I just wanted it to be visible on the screen.
On the right-hand side you've got the data that's presented to the analyst and how they can interact with it, and on the left, which I think we're going to zoom in on, you'll see the different workflows that we've slowly been building out over a period of about a year and a half. We started with groupings of user investigation tools, things like converting an IP address to a user, all the way down to web activity searches, finding installed applications across our estate, who's been accessing certain production environments and what they've been doing, integrations with threat intelligence tooling, searching for certain indicators across our estate and alerting on them going forward, endpoint triage and phishing workflows, all the way down to the really fun stuff, which is issuing new response tasks.

So we're going to stick with the most commonly used view, which is the user investigation tool, and we're going to take an investigation starting point, my email address in this case, but it could be a serial number, an EDR ID, you name it. We put my email address in, hit search, and the application goes off on my behalf to all those different data sources and APIs.
It uses our best-in-class queries to grab all that data and bring it back fairly rapidly; there are no SQL errors, it just works. Straight away we can see that, thankfully, my account is active and I'm not suspended, my 2FA status is active (if I were suspended you'd see that reflected here), I'm active on our messaging platform, and other surface-level information like when my account was created, when my password was last changed, and so on. We also pull back information like the team I work in, what org I work in, my hire date, who my manager is, whether I'm a full-time employee or a contractor, and whether I have any risky permissions assigned to my account that might change how we respond to a potential account compromise, plus my job title, my location, and so on and so forth. We also hit our asset management tool and bring back all the devices assigned to me; I have one, other users have three or four. For each device we pull back the serial number, hostname, EDR ID, which EDR is installed, the platform type, and when it was last online, and then the important stuff: is it online right now, can I connect to it, is it contained, is it locked, is it in the process of being locked or erased.
We can then do some light-touch response tasks, like suspending a user account or forcing a password change, without having to drill into the full response functionality. Along the top we have other workflows that we've built out to dig into other behaviors, like login activity, what 2FA devices are assigned to my account, whether a new 2FA device has been registered recently, my URL logs, installed applications, all the way through to the really important stuff: what detections my machines have, what contextual detections we have, and what related incidents we have.

We're going to dig into two quick areas. The first is login activity. Just from that one search, we've gone off and run the best-in-class query we have for bringing back SSO logs, and we can see that, for every application I've signed into, that redacted tab would show you what the application was and what data I was viewing. So if my account were compromised, we'd be able to see what the threat actor had gone and done. We then pull in other facts, like what my IP address was, what country and city I was logging in from, my user agent, my ISP, and so on, all those useful little detection facts that might help you identify that something is a true positive.
In addition to SSO logs, we also pull in, I think we go to 2FA logs here, ah, visualizations. We also try to visualize this information a little to make it more accessible, so you can see here that I typically log in from the United Kingdom, funnily enough, but I also have a couple of logins from the United States, which might appear a little more anomalous, especially if they're from a non-MacBook in my case. For 2FA activity we also pull in how the 2FA challenge was satisfied, because it's one thing if it was satisfied with a security key, which is great, but if it was satisfied through a bypass code that was generated, that might be a red flag you want to investigate: why was a bypass code generated?

And then I think the final thing we're going to look at is, as I mentioned when we went through the detection together, related alerts and incidents. The application is tied into our incident and ticketing system, and it will pull back all the related detections and incidents relating to my user or my machines to help you understand what's been going on.
We can see here that, geez, my user has three incidents. One of them is for remote access tooling being installed, RustDesk in this case; it was investigated by someone, and I think in this case it was incorrectly resolved as a false positive, again a hypothetical situation. All of this is just about gathering the information to form a baseline of understanding for our analysts, so they don't have to expend energy investigating the baseline facts and are free to dig deeper when it matters.

The second demo, which is much shorter, I promise, is about response. In the same way as I mentioned earlier, we didn't want our analysts leaving this application and having to juggle multiple tabs to perform a response task, so we also maintain a catalog of response capabilities built by our analysts, which you can scroll through and self-discover the different things we can do. What's really nice is that not only do you get that self-discovery, and we can do cool things like retrieving remote files, retrieving browser histories, suspending user accounts, and so on, but there's also a description of what each response task does. Here we're going to search for suspending a user account; we know roughly what the task does, but under the hood we can see that it's going to invalidate my user not just in my single sign-on account but also in my workspace account, it's going to invalidate tokens, and it's going to delete backup codes, so you get a comprehensive idea of what's going on.

We can also tie a response task to an active incident or investigation: we pick an ongoing incident or investigation, and when we run a response task the application takes on the responsibility of recording who, what, where, why, and when the task was run, putting it in the ticket, and also putting a message in the incident channel so everyone else is aware of what's been going on and what actions have been taken.
Then this is the final view, where you can pick one user or a series of users that you want to perform the response task against, and choose either to suspend that user or, in this case, I think we're going to do a forced sign-out, actually killing their sessions right there and then. We also force our analysts to provide context: why are you doing this, why does it matter? That gets added to the incident ticket and the incident channel as well, to give that visibility to the rest of the team. At this point the user would click "issue task" and be forced through a 2FA verification flow to make sure they are who they say they are and are an authorized user, but in this case we're just going to exit out.

We now see the last view, which is the response task history. Every time we perform a response task we track it: we can see who the author of the task was, what task they ran, who the target was, and then, critically, if that task was to bring back an artifact, say browser history or a file from an endpoint, it's stored in a central place, and on the right-hand side you get links to that artifact. So if one analyst runs a response task, the result doesn't sit in that one analyst's downloads folder and force the other analysts to run the same task again.
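As an illustration of what might sit behind one of those buttons, here is a hypothetical sketch, not the actual Coinbase implementation; the stub functions stand in for whichever SSO, MFA, ticketing, and chat integrations a team actually has.

```python
from datetime import datetime, timezone

# Stubs standing in for real integrations (SSO, MFA, ticketing, chat).
def sso_suspend(user: str) -> dict: return {"sso": "suspended"}
def sso_kill_sessions(user: str) -> dict: return {"sessions": "invalidated"}
def mfa_delete_backup_codes(user: str) -> dict: return {"backup_codes": "deleted"}
def add_ticket_comment(incident_id: str, body: dict) -> None: pass
def post_to_incident_channel(incident_id: str, body: dict) -> None: pass

def suspend_user(actor: str, target: str, incident_id: str, reason: str) -> dict:
    """Suspend a user across systems, then write who/what/why/when back to
    the incident ticket and channel so nobody has to reconstruct it later."""
    results = [
        sso_suspend(target),
        sso_kill_sessions(target),
        mfa_delete_backup_codes(target),
    ]
    record = {
        "task": "suspend_user",
        "actor": actor,
        "target": target,
        "reason": reason,
        "when": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }
    add_ticket_comment(incident_id, record)        # audit trail in the ticket
    post_to_incident_channel(incident_id, record)  # visibility for the team
    return record
```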
So that was a really quick whistle-stop tour of the application. It started off with three or four workflows, and over the period of about a year or so we've been identifying new workflows that we perform a lot and building them into the application, with the overall goal of saving time. We track usage of the application and how much time we think we've been saving, and take it with a pinch of salt because metrics are hard, but we think nowadays we save between 30 and 50 hours of analyst time per month by automating these workflows and freeing analysts up to focus on what really matters.

That was case study number one, the longest case study by far. We've got two left to go, and the next is building context into detections.
The application we just saw queries a whole load of different data sources under the hood, but two that are really, really critical are the tables on the screen here. At the top we've got an example user profile table, and at the bottom an example machine profile table. Essentially, for every employee at Coinbase we keep a user profile record, and for every endpoint device we keep a machine profile, and these are consolidated key data from a load of different data sources. At the top you can see an example for my user account: my SSO ID, my login email, my location, what org I work in, what my role is, what 2FA devices are assigned to my account, what endpoints are assigned to my user, and then, as I log into different systems, if it's a trusted login satisfied by a YubiKey, what IP addresses are strongly associated with me as an entity at Coinbase, and so on; we bring in other information as well.

Similarly, we do the same for machines. For every endpoint device we have a machine profile row, and we track the endpoint's serial number, who the owner is, what platform type it is, which EDR is installed and what the EDR ID is,
whether the device is online, whether it's been erased, what known IP addresses we have for that individual device, all the way through to what USB devices are connected, and so on and so forth. Essentially we have a system that runs on a schedule, queries all the upstream data sources, the SSO system, the HR and org system, the 2FA provider, the EDR, and consolidates this information into these two context tables. The system runs at different intervals depending on how fresh we need the data to be: my job role isn't going to change very often, so we can bring that in once per day, but my IP addresses and my USB devices probably change quite a lot, so we query those on a much more frequent interval.

The reason it's important to have these two tables is that they provide a pivot point for our lookups. If a detection comes in with just a serial number, this machine with this serial number has done something bad, an analyst would previously have to look up that serial number in an asset management tool or an EDR platform, find out who the owner is, then go and find out everything about that owner, running five or six queries to get to that point. With these two tables we can say: alright, we've got a detection for a serial number, so we do a lookup against the machine profile table for the row containing that serial number, and now we know everything about that device: the platform, which EDR is installed, what USB devices it has, and critically, who the owner is. And then there's a link between the owner in the machine profile and the user profile, so just from that serial number we get all the information about the machine, and we can pivot to all the information about the owner: James works in London, he works as an incident responder, he has these 2FA devices assigned, and so on and so forth.
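As a toy illustration of that pivot (the table contents and field names below are invented, but the shape matches the idea), a single lookup by serial number gets you the machine, and the owner field links you straight to the user profile:

```python
# Invented example rows; in practice these tables are built by a scheduled
# job that consolidates the SSO, HR, 2FA, and EDR sources described above.
MACHINE_PROFILES = {
    "C02EXAMPLE1": {
        "owner": "adam@example.com",
        "platform": "macOS",
        "edr": "ExampleEDR",
        "edr_id": "guid-1234",
        "known_ips": ["203.0.113.7", "198.51.100.2"],
        "online": True,
    },
}

USER_PROFILES = {
    "adam@example.com": {
        "team": "Marketing",
        "location": "London",
        "mfa_devices": ["security-key-1", "security-key-2"],
        "assigned_devices": ["C02EXAMPLE1", "C02EXAMPLE2"],
    },
}

def enrich_by_serial(serial: str) -> dict:
    """Everything we know about a device and its owner, from one serial number."""
    machine = MACHINE_PROFILES.get(serial, {})
    owner = USER_PROFILES.get(machine.get("owner", ""), {})
    return {"machine": machine, "owner": owner}

print(enrich_by_serial("C02EXAMPLE1"))
```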
This is what it ends up looking like. This is the kind of detection you might get: a low-prevalence binary, a keychain dump was executed on a device, my MacBook Pro, and a serial number. Not a completely useless detection; we have a detection fact, a serial number, that we can pick up. We could plug that into an asset management tool and run all those queries manually, or we can take that serial number, query our two context tables, and bring back the relevant information at the point where it's displayed to the analyst. So, in this small representative example, we now know that this serial number belongs to Adam, who works in the marketing team in London, we know what his EDR GUID is, and we can see that this user has two devices assigned to their account. We've just removed the need for an analyst to go and bring back all that information manually.

What's also really nice is that this has been really useful for our threat detections. A while ago we were worried about a threat actor compromising an account and enrolling their own 2FA device: if they go and buy a YubiKey, compromise an account, and find a way to get that YubiKey enrolled, there's a risk they can keep coming back time and time again and satisfy the 2FA requirement.
Fortunately, we get 2FA enrollment events coming in through our logs, so we started looking at them. The problem is that when you work for a really large org, and especially when you provide multiple backup YubiKeys to your users, there end up being far too many of these events for your analysts to respond to; if you tried to put a detection like this into production, you'd overwhelm your analysts immediately. But when we started looking at the logs, we realized we didn't just get the account that the 2FA device was being enrolled into, we also got the IP address the enrollment was taking place from. So what we could do is a join against the machine profiles table: every time an enrollment comes in, look at the email address, find the owner, find all the machines assigned to that owner in the machine profiles, and then, for each of those machines, check all the historical IPs that have ever been associated with it. If the 2FA enrollment came from any of those historical IPs, we know it's being enrolled from their corporate device and it's much less likely to be an issue.
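A rough sketch of that suppression logic is below; the field names and tables are placeholders, and in practice this would be a join in the SIEM rather than Python, but the check is the same: has the enrollment IP ever been seen on one of the owner's corporate machines?

```python
# Hypothetical toy tables, mirroring the user/machine profile idea above.
USER_PROFILES = {"adam@example.com": {"assigned_devices": ["C02EXAMPLE1"]}}
MACHINE_PROFILES = {"C02EXAMPLE1": {"known_ips": ["203.0.113.7", "198.51.100.2"]}}

def enrollment_from_corporate_device(event: dict) -> bool:
    """True if the 2FA enrollment IP has been seen on any of the owner's devices."""
    devices = USER_PROFILES.get(event["user"], {}).get("assigned_devices", [])
    return any(
        event["source_ip"] in MACHINE_PROFILES.get(serial, {}).get("known_ips", [])
        for serial in devices
    )

event = {"user": "adam@example.com", "source_ip": "198.51.100.99"}
if not enrollment_from_corporate_device(event):
    print("2FA enrollment from an IP never seen on the user's corporate devices: escalate")
```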
If it's not associated with their corporate device, it's more likely that either they're enrolling it from their personal device, which is a problem in and of itself, or it could be a threat actor enrolling it from their machine, allowing them to come back time and time again in the future. This meant we took a detection that wasn't working for us at all in production, tied it into this data set, and now it's been in production for a long time without causing us any real issues. We've got detection coverage that was only achievable because we can keep track of all this information in a key context table. And that's case study number two.
The final case study, which is probably my favorite, is bringing employees into the triage process. The two case studies we just talked about have been really, really helpful and we use them all the time, but sometimes we just get detections like this: something anomalous has happened, a login has occurred for the user foo.bar, originating from John's MacBook Pro, from an IP address geolocated in France. Yes, we could use both of the case studies we just talked about, but sometimes it's just easier to message the user on Slack and say: was this you, are you in France right now, are you using a VPN, do you recognize this login?

So I started thinking about how we could use our employees as a data source more effectively. Now, in the same way that I'm here today talking to you about some ideas I've had, I realized other companies have done the same thing, so I wasn't the first person to have this thought by any means. Dropbox, Elastic, and Palantir have all had a similar idea and thought about how they could use their employees as a data source, and it all revolves around creating some kind of bot, whether it's a Slack bot or a Teams bot, and codifying a mechanism where, when a detection fires, you message the user, ask them to triage the detection, get a response, and then do something with that response.
This is what it looks like on our end, the Slack example. Whenever we get a detection for an anomalous login that meets certain criteria, we summarize that information. In this case we've got an anomalous login attempt for james.d@coinbase.com, and the detection facts are the time it occurred, the device name (John's MacBook Pro), that it's coming from Germany, that it's a Google Chrome browser, and that it's coming from Vodafone. We can pass that to the end user and say: hey, do you recognize this login, was this you?
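For a sense of the mechanics, here is a minimal sketch of the outbound half of such a bot, using Slack's chat.postMessage API with Block Kit buttons. The channel, token, and detection fields are placeholders, and handling the button click requires a separate interactivity endpoint that isn't shown here.

```python
import os
import requests

def ask_user_to_triage(channel: str, detection: dict) -> None:
    """Send the detection summary to the user with 'was this you?' buttons."""
    summary = (
        f"Anomalous login for *{detection['user']}* from {detection['country']} "
        f"on {detection['device']} ({detection['browser']}, {detection['isp']}). "
        "Do you recognize this login?"
    )
    blocks = [
        {"type": "section", "text": {"type": "mrkdwn", "text": summary}},
        {"type": "actions", "elements": [
            {"type": "button", "action_id": "triage_yes",
             "text": {"type": "plain_text", "text": "Yes, this was me"},
             "value": detection["id"]},
            {"type": "button", "action_id": "triage_no", "style": "danger",
             "text": {"type": "plain_text", "text": "I don't recognize this"},
             "value": detection["id"]},
        ]},
    ]
    requests.post(
        "https://slack.com/api/chat.postMessage",
        headers={"Authorization": f"Bearer {os.environ['SLACK_BOT_TOKEN']}"},
        json={"channel": channel, "text": summary, "blocks": blocks},
        timeout=10,
    )
```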
Within a predetermined amount of time they can triage that information and tell us what they think. They can either say "I don't recognize this," and then we can page security or bump up the triage priority, or they can say "this was me, I recognize this login," in which case we push them through a 2FA flow and decide whether or not we want to trust their response.

Another example: sometimes detections don't relate to individual users, which is what we've talked about quite heavily so far. (Appreciate it's moving around a little bit, so if we wait until they click cancel... there we go.) This detection is for an access key being created in an AWS account that we deem highly sensitive. It's a hypothetical example, but it probably wouldn't be a bad idea to have a detection like this. There's no one person responsible for this account, but there may be a team, so we can codify a message that goes to that team's channel, the infrastructure security team maybe, and say: hey, we just detected an access key being created in this account that you've told us is really, really sensitive; should this be happening? And if it should be happening, there should be a ticket for it, so please give us the ticket so we can make sure it's all good and safe.
Whoever's around can then review it, and in this case they're saying: hey, we should never be creating access keys in this account, please declare a security incident. They can then escalate to security on-call and we can go and respond. What's nice about this is they've taken on the triage workload, and the relevant team that would need to be brought in anyway to help resolve the issue is now aware; they're spinning up their incident response lifecycle, we can get them in the same channel, and they've been presented with the information they need to respond accordingly.

Now, the key thing here is that we treat this data source as a data source, not an authority. There are certain occasions when you can't trust the user's response completely. An insider threat is a good scenario: they'd be able to click "yes, this was me, this is safe" and verify that through a 2FA flow, but we need to be careful about trusting that end user. Similarly, there's a risk that someone could be socially engineered into providing access to their account, which means there's a chance they could also be socially engineered into responding to the prompt in a certain way. So we treat this as a data source, not necessarily as an authority.
This is what it looks like from the analyst's perspective. We've sent a detection to an end user for review, they've clicked the button "I don't recognize this login," and they've provided context: hey, this isn't me, I don't use a MacBook, I'm currently located in New York, and this login says it's from France. At this point an analyst potentially has enough information to perform a response task straight away. They might decide they have everything they need to go and suspend the user account, or force a sign-out and invalidate their active sessions, to mitigate the immediate threat, and then perform the investigation more methodically and more slowly without worrying about stopping the bleed. In fact, you could actually codify this: you could say, if the user responds "I don't recognize this login," fire off a SOAR workflow to perform that action anyway; it doesn't need to involve the analyst at all.
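A sketch of what codifying that could look like, purely illustrative, with placeholder functions standing in for the SOAR workflows or API calls a team would actually wire up:

```python
# Placeholder actions; in practice these would be SOAR workflows or API calls.
def force_sign_out(user: str) -> None: pass
def page_on_call(detection: dict, context: str = "") -> None: pass
def resolve_detection(detection_id: str, resolution: str) -> None: pass
def escalate_to_analyst(detection_id: str) -> None: pass

def handle_triage_response(detection: dict, response: dict) -> None:
    """Act on the employee's answer, treating it as a data point, not an authority."""
    if response["action"] == "triage_no":
        force_sign_out(detection["user"])  # stop the bleed before anyone investigates
        page_on_call(detection, context=response.get("comment", ""))
    elif response["action"] == "triage_yes" and response.get("mfa_verified"):
        resolve_detection(detection["id"], resolution="benign, confirmed by user")
    else:
        escalate_to_analyst(detection["id"])
```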
I pulled some stats for what this has looked like at Coinbase over the last 12 months. We've had just over 700 interactions, anything from context gathering to notifications to asking for a triage resolution, and about 200 of those have resulted in detections that were resolved from end to end. That's where we've detected something, passed it to an end user, they've said it's okay, we've applied logic that says we're happy with their response, and the detection has been resolved without us ever having to put it in front of an analyst and use up their investigation cycles.

So that's the end of the three case studies, and these are the key takeaways if you take anything from today's talk. First, try to find ways to centralize and automate investigation tasks. There are going to be things your analysts are doing all the time, running the same queries, that are just drowning them in work; if you can, find a way to automate them. It doesn't have to be the way we did it, but we found that automating, and also centralizing, this process was incredibly beneficial. I mentioned how much time it has saved us; it's now a natural part of our workflow, it didn't take that much dev time to do, and it's been incredibly impactful for our ability to scale properly.

Secondly, try to build these investigation baselines.
I see far too often, as I was talking about earlier, that one analyst investigating a detection and another analyst investigating the same detection will take very different paths, and there's no baseline understanding across the team of how to investigate something or what the key facts of that detection are. Try to establish a baseline, and then incrementally, over time, build that baseline up and up and up, which will enhance your team's overall capabilities.

And then finally, try to enrich detections with key investigative facts. No one wants to receive a detection that says "hey, this serial number has done something bad." It's just not helpful, and you're putting a lot of emphasis on the analyst to then go and convert that serial number to a machine, that machine to an owner, that owner to an entity, and so on and so forth, which just wastes cycles. Again, it doesn't really matter how you do it, I've shown you how we do it, but try to enrich your detections with the common facts we need in order to investigate them effectively.

That's it, that's the entire talk. I hope it was somewhat useful to you. If you have other ways in which you've been able to solve these challenges, genuinely, please come talk to me, I'd love to understand how you've been doing it as well. Happy to take any questions.

[Applause]
Hi, thanks for your awesome talk. I was wondering, have you ever experimented with weighting all of that information to come up with some kind of score, to prioritize and look into certain things first?

Yeah, absolutely. I think it's a little too common that we have detections that could be anything from low, medium, high, to critical, but they all just go into the same queue and we don't treat them any differently. So we have been toying around with alert aggregation: having multiple detections for a single entity, assigning a score to them, trying to establish a baseline, and then looking for variations in that scoring. Is that what you were talking about? Yeah. It's a really interesting problem space, and I think some companies do it really, really well and some companies really struggle with it. It just takes a lot of time to get right; trying to establish what normal looks like in your environment is not an easy problem, it's a data science problem in my opinion, so it's a nice cross-functional project. I definitely don't want to say we have it solved; we have an approach to it, but it's still a work in progress.
Hi James, thank you for the talk. On the first use case: it's a great tool you've built at Coinbase, but at the same time that means you've had to work with, I don't know, full-stack engineers and developers to create it, to use different APIs to connect to it, and to maintain it, maybe in your own environment. The tool looks great and it has given you a good return on investment, like you said, but you have to build it, create it, and maintain it, and at the same time you have to secure it. If you have any gap in that tool, and I imagine you're using the APIs of five to ten other security tools and so on, then if that one tool is compromised, it's quite an important tool.

Yeah, you're absolutely right. To give a little bit of background, the application you saw started off as my Christmas project about two years ago, and it's certainly come a long way since then. This is what I was talking about earlier: some of the things I reference in this talk are for the more mature organizations. I certainly don't expect organizations with one or two people in them to try to achieve this; you need an engineering team. What has worked really well for us is that we have a set of tools, building blocks if you will.
Our analysts have complete freedom within those building blocks to go and build and experiment. So for the ability to query different APIs, we can build a SOAR solution or invest in one, and then, within that SOAR solution that holds all the API keys and credentials, give our analysts the freedom to effectively go and play and build things. That's really what enabled me to build this: we had a framework that allowed you to quite rapidly build web applications, and we had another platform we used to build logic blocks and workflows, and much smarter people than I had already put those in place. That then enabled me, as an analyst, to ask what problems I was trying to solve; someone had already given me the building blocks, so I could go and play around with them. So you're absolutely right, this isn't achievable by every company out there, but I will say it took less time than we thought; V1 of that platform was built over a Christmas break. Looking under the hood you'd probably be shocked at some of the code, don't get me wrong, there's some awful code under there, but it works and the impact is there.
Thank you for the talk, just a quick question. You mentioned that the second case study involves polling upstream systems, and I wondered if you'd considered a push-based architecture there. Is that something the upstream systems would even support, so that you don't have that lag between retrieving the data and it changing?

I'm really sorry, I couldn't see who was talking and was trying to find where you were; would you mind repeating the question?

Yeah, not a problem. It was a question around the architecture of what you've got. You are polling these upstream systems to centralize the data into your context tables; I wanted to understand if you'd considered a push-based architecture instead, and whether that's something these upstream systems would even support, pushing the data into your security tool.

Yeah. From an architecture perspective, we're in a fortunate position where a lot of this data is already in a SIEM, some of it comes from APIs, and then, built on that same SOAR tool I was talking about, we run regular SQL queries against the data that's already in the SIEM and put it into a table, or essentially just curl requests to the APIs to bring data back. So per user: go to your SSO provider, what is the SSO ID; go to the EDR provider, what is the EDR ID; and then put it in the table.
It's honestly just a long list of queries and curl commands that run and build the tables at the end of the day.

Awesome, thank you for that amazing talk, it was definitely interesting. The question I had, the gentleman just asked part of it, was about the tool you were using; do you have any plans to open-source this at all?

On the tool question, you might have noticed I've been a bit squirrelly about giving the answer. Come up at the end and I'll tell you exactly what it is. I'm very keen to give the company we use the props for the tools they've developed; what I don't want to do, since I don't really know when the recording ends, is put out into the world "this is our vendor," for obvious reasons. But genuinely, I'm happy to give you the answer and give that company a shout-out; hopefully that's useful to you. On the open-sourcing front, as I mentioned, I would be hideously embarrassed to show some of the code that runs underneath it, and it would take a lot of time to clean it up and make it generalized enough to work in every environment. It's very, very dependent on our SSO system.
It could absolutely be converted to the one you use, but building it to be generalized would take quite a bit of time, and as my mate over there said, I'm one person, not an entire team of engineers, and it's not my full job role. Maybe this Christmas's project, who knows.

Awesome, well done. Let's give a big thank you to James.