
Reinventing ETL for Detection and Response Teams

BSidesSF 2024 · 29:11 · 1.1K views · Published 2024-07
About this talk
Reinventing ETL for Detection and Response Teams, presented by Josh Liburdi. Join this session to hear about the unique data collection (ETL) requirements of Detection and Response teams, and learn practical strategies for enriching event logs at scale without breaking the bank. https://bsidessf2024.sched.com/event/9b6d4ccd957578ba446537bb15b3172a
Transcript [en]

So, without further ado, here's [Applause] Josh.

Hey everyone, my name is Josh, and this talk is called "Reinventing ETL for Detection and Response Teams." Thanks for coming, and obviously thanks to the organizers for inviting me. I have probably 45 slides to get through in 29 minutes, so let's just get started. If you don't know me or haven't seen one of my talks before, there are a few things you should know: my talks are really dense. Lots of information, lots of opinions, lots of meta-commentary, and I don't use notes. So this is the perfect opportunity to put your phone down and close your laptop if it's open, because if you're not paying attention to the talk, there's a really good chance that in a few minutes you're going to wonder where the heck you are.

I'm kind of surprised anyone came to this talk, because I think ETL is definitely the most boring topic at BSides this year, and it may surprise some of you to hear that I also think it's pretty boring. But I think it's really important, and that's why I'm here. This talk is going to get interesting later on, but we have to cover the basics first. If you don't know what ETL is, it's really just the way that data moves through your organization: you have data somewhere, you want the data to go somewhere else, and that somewhere else is usually a SIEM or a data lake or a data warehouse or whatever they're going to sell you at RSA this year. That's it; it's pretty simple.

If you're a visual person, maybe this is helpful: you have data in systems and services, you want it in your SIEM, and you get it there through some kind of data pipeline system. Honestly, none of this is very interesting to me. I'd rather talk about the people that use the data: that's your SOC team doing alert triage, your incident response team doing investigations, and your hunt team doing research and, in some cases, threat detection. If you don't have these teams in your security organization, don't worry, because I don't either in mine. Just know these are real people, they have jobs to do, and they all use the same data, but they use the data in different ways, so they have different requirements for it. And in, I'd say, medium-to-large-sized organizations, these people are all very bad at talking to one another: data goes into their workflows, it gets siloed, and knowledge and information all stay within the team.

So I think the point of ETL is to turn data into information, or turn it into knowledge; that would be good ETL, if that's happening. I guess then the opposite of good ETL would be bad ETL, which is just collecting data and putting it somewhere. And for that reason, I don't think it would surprise anyone to hear that I think the current state of security ETL is not in a good place. I think it's actually quite bad, and when you see some examples, it starts to make sense. We give people this kind of data and we make them answer the question, "Is this an intrusion in our organization?" That's literally what your SOC team, anyone doing alert triage, does every day, and it's terrible. But wouldn't it be better if we gave them data that looked like this? A little easier to read, a lot more context; more on that later. And honestly, I just like

that there are words where there used to be random numbers; I'm a fan of that. But you know, we pay hundreds of thousands of dollars a year, millions of dollars a year, for security platforms to take all the bad data and put it in there, so that must make the data good, right? Well, no, it's actually still quite bad at that point, because then your people are writing queries in those platforms that look like this. Seeing this live is actually very funny to me, because I'm not confident everyone in the room can read it, so I think this worked out pretty well. I think there's a correlation between how large your average SIEM or data lake query is and how bad your security ETL processes are. Because wouldn't it be better if, instead of writing this, you were writing something like this instead? I think most people in the room can at least read this, even if you can't understand what it's doing at a glance. And by the way, those security platforms we pay millions of dollars a year for sometimes look like this, and that's cool, I guess, if you need your detection and response platform to have that classic swirling-toilet-bowl look. That's great.

So what is the problem with security ETL? I think it's that we take bad data from security vendors, from non-security vendors, and sometimes from our own internal engineering teams, and we put it into a SIEM or a data lake or a data warehouse, whatever you want to call it; I don't really care, they're all the same thing to me. And those solutions are okay; I'd say they're actually quite bad at the reason you bought them, which is analysis. The result of all that is that we put pressure on the people using these systems to understand this data, to have domain expertise, and to not get fatigued trying to figure out how to write that 40-line-long data lake query, which all leads to the potential for inaccurate conclusions. You don't want to be writing 40- or 50-line-long SIEM queries at 2 a.m. during a ransomware

intrusion, I can promise you that. So for all those reasons, I think this talk is better titled "Inventing ETL for Detection and Response Teams," because I'm not really sure that it ever existed to begin with. So let's talk about that instead. I think this is probably what good security ETL is. There might be more things; this is just all I could think of to put in my presentation, and these are ordered for a particular purpose. It's really easy to collect data; this is actually what most of your security data vendors do. They just collect data; they don't do these other things very well, to be honest. That's also the easiest thing to do, and in addition, it makes them the most money. What's really hard is contextualizing data and making data actionable; that's actually when data becomes information to people.

But this wouldn't be very fun if I just stopped there, so why don't we go further, and I'll impart some commentary on where the industry is going, and maybe where it is right now. I keep hearing the phrase "federated SIEM," and I'm really confused why anyone thinks this is a good idea, mostly because I just don't understand how you deal with the data governance aspect of having data somewhere that you don't control. How can I rely on that data if it is managed by my vendor, or just not managed by my security team? Maybe I'm just too old-school on that point. I also now think, after having done this at Brex for a few years, that unified data models are the standard. If you're not using them, your team is deficient, your team is malnourished, think of it however you want; this is just the way things should be now. I also think my job should be easy. I really just want to not work very hard; that's my goal every day: don't work very hard, make security easy. And I think it should be effortless to derive insight from data, and I would say pretty much every security data vendor fails at this point.

So at this point in the talk, you're probably thinking: "Okay, we're decades into this whole SIEM thing, and data lakes are probably going to be around for a while too, so it would be nice if my data was better. But is this actually a problem? Because I've had this problem for literally forever." And my answer to that is: you're pretty lucky they didn't give me a three-hour lecture slot and only gave me 30 minutes, because yeah, it is a problem. I just don't have the time to go into every aspect of it. But I'll give you one

example that's pretty good. It's not one I've ever heard anyone in our industry talk about, but it is actually quite well known in the larger ETL space or community, if one even exists, and it's this idea of data decay. Data decay is really easy to understand with a simple example. Let's say after the talk you meet me in the hallway and I give you my phone number, and I say, "Hey, you can text me, we'll chat." Then tomorrow I go to T-Mobile and change my phone number, and now you can't text me. That's data decay: my phone number at the time I gave it to you was accurate, but through no fault of your own, the information became outdated. That's literally what data decay is.

And I don't think it would surprise anyone to know that I came with examples of data decay in your environment, probably happening right now without you knowing about it. The simplest example is IP address analysis. I don't think anyone really thinks about the fact that the analysis we do in the SOC, on the hunt team, and so on is so dependent on the attributes of the data not changing. To be very clear: if you're putting IP address data into your SIEM, and you come back a few days later and run your SIEM's fancy geolocation function on that data, you're susceptible to data decay, because your SIEM (and I promise you, no SIEM vendor or data security vendor is solving this problem) is taking data from today and applying it to data that was previously observed days or weeks or months ago, and so on. I'm not going to talk about everything on this slide, but I do want to call out the impressive scale of this problem with things like the Luminati proxy: 8 million plus active IPs with an 11% daily churn rate is a pretty impressive number of servers joining and leaving the network. If you're not doing this analysis right when the activity is observed, you have a data decay problem.

The funny thing about data decay is that once you know it's a thing that exists in reality, you kind of start to see it literally everywhere. I talked about IP addresses, but it also applies to DNS domains, web pages, the cloud in general, files on your network and on your endpoints, and even more abstract concepts like threat intelligence. It is literally everywhere. If you go into your SIEM tomorrow and look at your data with this concept in mind, you will see it everywhere. But I'm not here just to complain, so let's talk about solutions.
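The IP example above can be sketched in a few lines. This is a minimal sketch, not Substation's API: the lookup table is a hypothetical stand-in for a real geolocation or proxy-intel feed, and the field names are illustrative.

```python
from datetime import datetime, timezone

# Hypothetical lookup table standing in for a real geolocation/proxy-intel
# feed. In production this data churns constantly (e.g. 11% daily for a large
# proxy network), which is exactly why we snapshot it at observation time.
IP_CONTEXT = {
    "203.0.113.7": {"country": "NL", "asn": 64496, "proxy": True},
}

def enrich_at_ingest(event: dict) -> dict:
    """Attach IP context when the event is observed, not when it is queried.

    The enrichment is copied into the record, so a later change to the feed
    (data decay) cannot silently rewrite history.
    """
    ctx = dict(IP_CONTEXT.get(event.get("src_ip"), {}))  # copy, don't alias
    return {
        **event,
        "src_ip_context": ctx,
        "enriched_at": datetime.now(timezone.utc).isoformat(),
    }

event = enrich_at_ingest({"src_ip": "203.0.113.7", "action": "login"})

# Simulate churn: the IP leaves the proxy network tomorrow...
IP_CONTEXT["203.0.113.7"]["proxy"] = False

# ...but the stored event still reflects what was true at observation time.
```

Run a decayed-lookup version of this (read the feed at query time instead) and the stored answer flips under you; that is the whole problem in miniature.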

Where should we be doing this kind of work, and where is it not really effective? Well, I don't think it's very effective at the people layer, again through no fault of their own. It's also not really solved by a SOAR platform, because a SOAR platform at that point is just automating bad data analysis. I don't think it's very good in the SIEM either. There is a scenario in which it's probably okay-ish, which is when your SIEM generates an alert, that alert gets enriched with geolocation or whatever data, and then it's sent to the SOC team, and the SOC team has a really strict SLA for responding to alerts. The lower the time window from the data being observed to someone actually looking at it, the lower the chance you have a data decay problem. But I actually think the best place to do this is in the data pipeline. This is usually your first opportunity to exert influence over the data, to actually change it, and typically you have data coming from a system and it hits the data pipeline within seconds to minutes. So if you're doing all the enrichment there, you don't really have a data decay problem, because you're just putting enriched data into your SIEM.

So, at Brex I've worked on a project called Substation for a few years now. We've been using it in production for over 3 years, and it's been open source for

over two years. I'm not going to read everything on this slide to you; I will say that what's written here is no BS, and this solution is a fraction of the cost of the similar vendor solutions you're going to see at RSA. But I do want to talk about use cases, because the whole point of building something is to solve a problem. These are probably the most common use cases you see for a project like this and for similar security data platforms you'll pay a lot of money for. The first one is a very boring use case, and I'm not going to talk about it beyond the next few moments: it's the "route data to and from anywhere" use case. This is the use case that some vendors have pitched and made a lot of money off of over the years, as in, "Gee, wouldn't it be great if you didn't have to pay Splunk all that money?" That's the use case. It's really easy to do, and if this is your only use case, you shouldn't pay a vendor to do it. If you're already in AWS, look at a project like Substation (there are others as well); you can do it at much less cost to you.

The second use case is one I've already touched on a few times: normalizing data to a data model. I really believe that this is just table stakes now. If you call yourself a modern detection and response team and you don't do this: bye, don't talk to me. The third use case is the most compelling one; it's also the hardest one to solve, and I like working on tough problems. It's enriching data to generate context, and this is really the point of this talk, which you'll see in a few moments. You know, if you talk to your security data vendor, your SIEM vendor, whomever, and you say, "Hey, you got context?" they'll be like, "Yeah, we got context, check it out: you can upload a CSV here, there's your context." I'm like, okay, cool, we've been doing that in SIEMs for literally decades, so why am I buying a new tool to upload a CSV file? And by the way, if you upload it through S3, that doesn't make it any better. Some vendors might be like, "Yeah, we have context: we scan your AWS account once a day, generate EC2 metadata, put it in a table, and then you can figure out what to do with it." It's like, okay, cool, thanks. Not only is that data not fully up to date at the time you scan it, but you're also not really helping me. So really what I'm talking about for the rest of this presentation is the hardest use case, which is real-time data

contextualization. So let's talk about something interesting. I'm going to show you three, I'll call them "ETL solutions" (it's kind of a weird phrase, I didn't know what else to put here), three solutions to this problem that you won't see at RSA. You probably won't see them at RSA for the next few years either, because I don't think anyone is crazy enough to do this and sell it as a product, but I'm going to show you how we do it, and I'm pretty sure this is the first time I remember ever saying some of these things publicly. Oh, I also want to say there are going to be architecture diagrams; I know those are boring, and you're going to see some examples of how this actually changes data in a moment. There's a lot of AWS iconography; if you see these things and you're like, "Well, I like Kafka and I don't like Kinesis," or "I like Kubernetes and I don't like Lambda," that's great for you. Just feel free to insert whatever complementary tech you want into these diagrams, and this will still work.

So, this first solution is what I call the time travel pattern. It's when you take data from a data stream (whatever the data source is; EDR is heavily featured throughout this presentation) and you use it to enrich itself. And the way time travel works is

you have a data stream and you have multiple consumers reading the data from that stream at differing frequencies. One is reading it much faster; that's what we call the enrichment function, or enrichment consumer. That consumer's only job is to take data and do high-latency data enrichment: it could be DNS resolution, it could be calling an external API, it could just be making the data look pretty before it needs to be consumed by some other service. It puts the results into a database, and it processes the data pretty fast. Then we have a transform function, or consumer, and the job of that function is just to do data modeling; that's basically its only job. But the cool thing is, because it's reading the data at a higher latency, say 10 seconds behind the enrichment function, by the time it receives the data, the data it's receiving is already fully enriched and ready to be accessed for the data model. So it gets data and it says, "Hey, I have a process event. I'm going to check to see if there's any more context in the database. Yep, there is. I'm going to take that context, put it in my event, and happily send it along to the SIEM."

So what does it look like when you do this in production? Well, it looks kind of like this: we've got an example of the data before processing and after processing. If you've never seen an EDR event before, there are a couple of things I think you should know. Usually those events are of different categories, like network connections, process executions, file writes, and hundreds of other things, but they lack a lot of information. They usually just have a reference to the process by ID, which is just a number or a string, and maybe, if you're lucky, a process name, like Spotify. The way time travel works here is we take that process ID and check the database: "Okay, do you have a process execution record for this? Yep, we do." Then we just insert all that information, so we get the full command line of what was executed, the start time of that process, and dozens of other things I can't put on this slide because it would make it illegible. But the neat thing is that we also get the parent process, and so we do it again: we take that, and then we get that command line and the start time and all these other things I can't show you today because they won't fit on the slide. At that point we know that runningboardd is the process that spawned Spotify, and why would we stop the second time when we didn't stop the first time? So let's do it again, and we just keep doing this on and on until we end up with launchd. If you don't know what launchd is, it's basically the process on your MacBook that controls all the other applications running on your system; that's the simplest way to describe it. So what we end up with is an event that is way more contextualized. I don't need to spend however long it takes me to figure out writing a three-join SQL query to get all this data into one event. And I want to be very clear: this is just showing up in our SIEM. We're not doing anything in the SIEM to make this possible, and every event in Brex's SIEM looks like this, by the way.
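The two-consumer idea above can be sketched like this. It's a minimal sketch only, assuming an in-memory dict standing in for the context database (DynamoDB in the talk) and a list standing in for the stream; the names like `enrichment_consumer` are illustrative, not Substation's API.

```python
# "Time travel": a fast consumer indexes the stream into a context table, and
# a slower consumer reading the same stream finds the table already populated.
process_table: dict[str, dict] = {}

def enrichment_consumer(event: dict) -> None:
    """Fast reader: index every process-execution event by process ID."""
    if event["type"] == "process":
        process_table[event["pid"]] = {
            "cmdline": event["cmdline"],
            "ppid": event.get("ppid"),
        }

def transform_consumer(event: dict) -> dict:
    """Slow reader: by the time it sees an event, the ancestors are already
    in the table, so it can stitch the full process lineage inline."""
    lineage, pid = [], event.get("pid")
    while pid and pid in process_table:
        record = process_table[pid]
        lineage.append(record["cmdline"])
        pid = record["ppid"]  # walk up to the parent
    return {**event, "process_lineage": lineage}

stream = [
    {"type": "process", "pid": "1", "ppid": None, "cmdline": "/sbin/launchd"},
    {"type": "process", "pid": "7", "ppid": "1", "cmdline": "/usr/libexec/runningboardd"},
    {"type": "process", "pid": "42", "ppid": "7", "cmdline": "/Applications/Spotify.app/Contents/MacOS/Spotify"},
]

for e in stream:  # the fast reader runs ahead of...
    enrichment_consumer(e)
enriched = [transform_consumer(e) for e in stream]  # ...the slow reader
```

After this runs, `enriched[-1]["process_lineage"]` walks Spotify up through runningboardd to launchd; the "differing frequencies" from the talk are modeled here simply by running one loop before the other.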

All right, that's the first one; I've got another one for you. The second one's called the telephone pattern, telephone like the children's game. The way telephone works is: think about the data in your organization, all of your data sources and datasets. They each know a little bit; they each have a little bit of knowledge about what is happening in your business. So the way telephone works is we take multiple data streams and share that information in real time to contextualize the events. In this example we've got an identity platform dataset, device management, device inventory, and an EDR cloud service, because EDR is the example I went with for today's presentation. I don't need to tell you that these look different, but anyone who's ever looked at an EDR event knows that the event on the right has information in it that is impossible for your EDR to ever access or know about.

So let's talk about how this works. Oh, by the way, this EDR vendor doesn't put the host name in their events, which is really infuriating; some of you, I'm sure, know who this is. So this change here, where we actually insert the host name, is itself a form of time travel; we don't use a lookup to do that, we do it in real time. But what we do then is, once we have the host name, we say, "Okay, I have a host name. Wouldn't it be great if I knew who was assigned that laptop?" So we take the host name, look it up in the database, and say, "Yep, the device inventory service does know who is assigned that laptop: it's Alice." Cool, now we can just search our data by user, and we don't have to think about host names or host IDs or anything complicated, because I want my job to be easy. But we don't stop there; why wouldn't we just keep doing this, because it's so awesome? We do it again: we take the email that we enriched, and then we search the data from our IDP, and then we know that Alice is a person, she has a role at the company, maybe she's a manager in the engineering organization. And we do something I would refer to as derived context, where we actually use the events flowing through the IDP to understand if her account is active, or maybe suspended, or deactivated. This is all done in real time, without CSVs or dynamic lookups or whatever you want to call them. And again, this information is just in the SIEM: there is no enrichment happening in the SIEM, this data is just freely available to anyone who has access to it. It's an insane luxury to have.
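The hostname-to-person chain above can be sketched as a couple of joins done at ingest. This is a minimal sketch under stated assumptions: the tables and field names are hypothetical stand-ins for a real device-inventory service and IDP, not any vendor's schema.

```python
# "Telephone": each dataset knows a little; the pipeline passes along what it
# knows so the event arrives in the SIEM already contextualized.
DEVICE_INVENTORY = {
    "MACBOOK-1234": {"assigned_to": "alice@example.com"},
}
IDP = {
    "alice@example.com": {
        "name": "Alice",
        "role": "Engineering Manager",
        "status": "active",  # derived context: active / suspended / deactivated
    },
}

def telephone_enrich(edr_event: dict) -> dict:
    """Hostname -> assigned user (device inventory) -> identity (IDP)."""
    event = dict(edr_event)
    device = DEVICE_INVENTORY.get(event.get("hostname"), {})
    user = device.get("assigned_to")
    if user:
        event["user"] = {"email": user, **IDP.get(user, {})}
    return event

event = telephone_enrich(
    {"hostname": "MACBOOK-1234", "type": "network", "dst": "198.51.100.9"}
)
```

The payoff is the search ergonomics: analysts query by `user` instead of juggling host names or host IDs, and `event["user"]["status"]` reflects the IDP's view at enrichment time.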

Here's the last example. This one might be the most confusing, not because it's complicated (it's actually really simple), but because I think people would initially think, "Why would you ever do this if we have a SIEM?" This one I call the NXR pattern, and it is as simple as taking whatever queries you were going to write in your SIEM and running them in your data pipeline instead. So why would you want to do that when you have a platform you paid millions of dollars a year for? Well, there are a few reasons, but one of them is that when you take your threat information (let's refer to alerts as threat information) and you just make it data, it becomes so much more useful to your organization. For example, instead of that threat information being locked in a knowledge silo in the SOC or the IR team, suddenly anywhere that data is replicated, it becomes accessible to the team who needs it, and now the fact that your SOC and IR team and hunt team are bad at talking to one another matters a little less, because they all have the same base set of information. You can do some other fancy things with it, but that's the core reason to do it.
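Running the "SIEM query" in the pipeline can be sketched as a detection rule evaluated at ingest, with its context attached to the matching event as plain data. The rule, fields, and reference URL below are illustrative assumptions (the ATT&CK mapping T1548.004 is my reading of the `security_authtrampoline` behavior discussed next in the talk, not something stated on the slide).

```python
# "NXR": a detection evaluated in the pipeline; the result is not an alert,
# just searchable fields on the event. Everything here is illustrative.
RULE = {
    "name": "auth_trampoline_prompt",
    "description": (
        "Authentication prompt spawned via AuthorizationExecuteWithPrivileges"
    ),
    "attack": {"tactic": "privilege-escalation", "technique": "T1548.004"},
    "references": ["https://example.com/authtrampoline-writeup"],  # placeholder
    "risk_score": 40,
}

def apply_detection(event: dict) -> dict:
    """Tag matching events with threat information as ordinary data."""
    event = dict(event)
    if event.get("process") == "security_authtrampoline":
        event.setdefault("threat_info", []).append(RULE)
    return event

tagged = apply_detection(
    {"process": "security_authtrampoline", "host": "MACBOOK-1234"}
)
```

Because `threat_info` lands in the SIEM like any other field, anyone can search by the rule name, the technique ID, or even the reference URL, which is exactly the "threat information as data" argument above.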

Again, I don't need to tell you these things are different, but I do want to call out something that I think is really clever here, and it kind of gets to the point of just making my job easy. If you don't know what the security_authtrampoline process is in macOS (and I wouldn't expect basically anyone to know what that is), that's fine: you don't need to know, because we have a description right here. It says it's an authentication prompt that is called by an API that executes with privileges. This is great for people who don't have domain knowledge, and it lets anyone just jump in and start searching (again, this is just data) for keywords that are relevant to their investigations. One thing I really like about this is that these become analytical signposts for your SOC team and your IR team. It even goes beyond detection: when you have a threat signal, it is really easy to focus your attention there versus everything else you might not need to care about. We give it a short name; this one uses a MITRE ATT&CK tactic and technique, but you can do whatever you want, obviously.

But that stuff is not even the interesting part; this is the interesting part: we also put references in our data. Again, this is just data, this is not an alert; it's just data sitting in the SIEM that anyone can search at any time. And so we have references to anything applicable to the signal; in this case, it's a reference to the Objective-See blog. The neat thing about making this data is that you can search it, so you could just search our SIEM for any reference to the Objective-See blog and see what events come back. It's pretty novel. And obviously we risk-score it, because it's 2024 and everyone has a risk score. But the cool thing about having a risk score in your event is that you're not really shackled to what your SIEM tells you a risk has to be. A lot of SIEMs are focused on hosts and users, and yeah, those things typically are risky, but what if you wanted to calculate a risk score for a server on the internet, or an individual process running on a host, or maybe an entire team in your organization? When it's just data, you can do that, and I think that's really powerful.

So we're nearing the end here, and I'm getting the wrap-it-up signs, so what's left to say? You can deploy these right now with Substation. You could probably build it yourself too, but I don't know that I'd recommend that; we tried to make it as easy as possible for you to test these things and see how they work, with the full knowledge that this is all very complex. And yeah, I don't think there's much more to say than that: go get the open source if you're interested. If you want to do more reading, it's on Medium. If you want to reach out to me, LinkedIn is a great place to do it; if you find me on Twitter, don't message me, because I won't read it. I don't use Twitter anymore. That's it, that's the

presentation.

Awesome, thank you, Josh. Believe it or not, we have time for one question, so make it a good one. Well, the zeroth question is "your manager should give you a raise," so that's the most important question. (Jason, where are you? It says Jason.) But the next one: what do you think about using GenAI for schema normalization to something like OCSF, to make ETL easier? Yeah, you should use it for that. All right, and here's one: how do you handle significant throughput load on the context table at terabytes of logs per day? You use DynamoDB, and it makes it easy. GenAI was asked about already. Cool, and a final question (see, we're getting through all the questions): how does it not take very long to enrich many layers of process events in a large-scale environment? We use serverless services, so if you were to do it with something self-managed, like Kubernetes, I think it would be more difficult. But we have a really small team; we don't want to deal with servers or managing processes, so we just let AWS do it for us. And finally: will there be a commercial version of Substation? No comment. All right: do you use AI tools for your SE attacks? I don't know how to answer that question, sorry. Yeah, I'm not quite sure about that one.

All right, hey, we're ending right on time. Thank you so much, Josh, for that awesome presentation; we did manage to have time for questions. First, I have a very special gift for Josh; let me find it. And here it is, from Socket Security, one of our sponsors: a very special speaker gift. So thank you for that.