
To Normalized Logs, and Beyond - Building a Threat Detection Platform from Scratch

BSidesSF · 2023 · 49:39 · 360 views · Published 2023-05 · Watch on YouTube ↗
Speakers: David Levitsky, Brian Maloney
Category: Technical
Style: Talk
About this talk
David Levitsky, Brian Maloney

You've been asked to build out a threat detection platform from scratch - now what? Join us for a deep dive on building a scalable and lean detection pipeline. We'll show how to automate data ingestion, use detections-as-code, filter data, and more to build a serverless platform to detect threats.

https://bsidessf2023.sched.com/event/1IKX2/to-normalized-logs-and-beyond-building-a-threat-detection-platform-from-scratch
Transcript [en]

Hello everyone, and welcome to "To Normalized Logs, and Beyond" with our speakers Brian Maloney and David Levitsky.

Hey everyone, good afternoon. Thanks for sticking with us for one of the last talks of the day. My name is Brian Maloney, this is David Levitsky, and we worked together to build a cloud-native threat detection pipeline and platform for Benchling. Now, what is Benchling, you might ask? Benchling is the software-as-a-service platform that powers some of the world's most cutting-edge biotechnology research, with customers ranging from two-scientists-and-a-molecule startups to some of the largest biotech companies around. Which is to say, Benchling protects some extremely sensitive data on behalf of its customers, and we needed a scalable and reliable solution for detecting threats to that data while retaining maximum flexibility for an unknown future. Our work resulted in a platform built on replaceable components using a combination of AWS technologies and off-the-shelf software, which is the subject of this talk. One last note before we begin: we are hiring, so if you're interested, come talk to me.

But first, let's briefly go over the steps of threat detection, for those of you who are primarily offensive practitioners or new to the field - think of this as a brief blue team 101. The overview of the steps runs from collection, to normalization, to enrichment, to detections running on that data, to security alerts feeding investigations, and then response actions.

So, what about collection? We start on the left with collection - we need the data to do our work - and I think of collection as the hard part. Any modern organization of significant size is going to have a ridiculous number of security data sources: your enterprise identity and HR applications that you use to run your business, the productivity software you use to communicate with your customers, and the product itself. We're a software-as-a-service company, so our software obviously generates a ton of data. And these sources are not uniform; they arrive in many formats - raw text logs, delimited formats like CSV and space-delimited, JSON, even some binary formats - and via different protocols. Some of you may not have to deal with syslog today, but a lot of people still do. Buckets are a common way of delivering logs as well, HTTP APIs via either pull or push are another common method, and cloud event buses are an up-and-comer in this space. You need a way to get all of that data into your platform easily.

Then, once you've got the data into your system, you want to normalize it. Why normalize? Because you need to be able to run efficient queries on that data, and to do that you have to pick some kind of schema, make sure you have the essential pieces of data you need, and track things like timestamps and log sources. You also have to think about what goes into your normalization. For example, raw text logs may involve regexes - regular expressions - which carry some operational risks that we'll cover in depth a little later. For CSV logs, perfectly standardized CSV probably won't give you any trouble, but sometimes you get quotes in the wrong places that break your parsing. So normalization engines need to be limited so they don't run away, and they need to be able to handle erroneous data.
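The talk doesn't show the normalization code itself; as a rough illustration of that last point, here is a minimal sketch of a bounded, error-tolerant parser. The length cap, field names, and fallback behavior are illustrative assumptions, not Benchling's actual implementation.

```python
import csv
import io
import json

MAX_LINE_LEN = 64 * 1024  # assumed cap so one huge record can't stall the engine

def parse_line(raw: str) -> dict | None:
    """Best-effort parse of one raw log line; never raises.

    Tries JSON first, then CSV, and quarantines anything else
    rather than letting malformed data break the pipeline.
    """
    if len(raw) > MAX_LINE_LEN:
        return None  # route to a dead-letter location instead
    try:
        return json.loads(raw)
    except ValueError:
        pass
    try:
        # csv handles quoted fields; a stray quote yields a
        # best-effort row instead of an unhandled exception
        row = next(csv.reader(io.StringIO(raw)))
        return {"fields": row}
    except (csv.Error, StopIteration):
        return None
```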
Once you have that normalized data, you're going to want to enrich it. Enrichment provides the insights you need to detect threats to your platforms. The enrichment data could come from internal sources, like your enterprise data systems - HR systems and the like: who's doing what, how do we attribute things. It could be metadata about your production resources, and it could be indicators of compromise for your detective controls - you may have indicators of compromise that are specific to your platform. It may also come from external sources: indicators of compromise from public databases, like things on GitHub that you can grab, or from vendors that provide that kind of threat data for use with your normalized logs.

Finally, once you have enriched, normalized data, you can run detections on it to find the actual bad actors in your systems. That may mean predefined indicators of compromise, anomaly detection using statistical methods or unsupervised ML models, or - everybody's favorite topic - maybe large language models in the future. It's important to be able to do iterative development on those detections, because they change over time: you want to tune their performance as things change in your environment, and you want to reduce false positives, because there's a human cost in the next step. Those alerts go to the response team, who need the details about what happened and how they can investigate it, so you need some kind of system for alert delivery.

Moving on to the last step, and then we'll get into the architecture: you want to be able to investigate what happened, and you're going to need that same data for it. One log source might indicate something's wrong, and then you need additional context from other sources. From there you can move to confirming whether it's a malicious actor or benign. Most of the time it will be benign, but you still have to check every time one of these events fires. In the case that it is malicious - again, just wrapping up our blue team 101 - you're going to want to contain the intrusion, close the vulnerabilities that may have allowed somebody in in the first place, prepare and execute your recovery plan, and then, as always with any incident, operational or security, conduct a lessons-learned session afterwards.
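The abstract mentions detections-as-code, though the talk doesn't show rule code. A minimal sketch of what one version-controlled, unit-testable rule might look like - with a hypothetical indicator set and field names - is:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    rule_id: str
    severity: str
    event: dict

# hypothetical indicator set, e.g. synced from a threat feed
KNOWN_BAD_IPS = {"203.0.113.7", "198.51.100.23"}

def ioc_ip_match(event: dict) -> Alert | None:
    """Fire when a normalized event's source IP is a known IOC."""
    if event.get("source_ip") in KNOWN_BAD_IPS:
        return Alert(rule_id="ioc-ip-match", severity="high", event=event)
    return None
```

Because rules like this live in source control, tuning a threshold or retiring a noisy detection becomes an ordinary code review.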
So that's what we want to do - now, how did we actually do it? We're going to jump into the architecture. I'll go over two parts, the inputs and the detection side, and then briefly show a slide where we put the two together. On the left-hand side of the screen, we support multiple different source types - HTTP APIs, syslog, buckets, etc. - and those go into a series of Lambdas. We are an AWS shop, so this will be an AWS-specific architecture, but it could be applicable to many other clouds. We use Lambdas for inputs, except in the case of syslog; it's not really easy to tie syslog to a Lambda, so for that you'd want a more traditional kind of compute platform with a load balancer. From those Lambda inputs we add the origin metadata and run everything through Kinesis Firehose, which allows us to aggregate huge amounts of data into a semi-structured data bucket - our golden storage bucket where we do all of our work from. Once the data is in - the same bucket appears at the bottom left of the diagram - we can connect both AWS's analytics tools and third-party analytics tools to it, which lets us use the right tool for the job, one of the most important things we tried to accomplish with this architecture. Putting the two sides together, this is what the entire thing looks like. It's a little big and a little complex, but it's all managed by infrastructure as code and requires very little care and feeding.

Now we'll get into deep dives on the different parts of this architecture, beginning with data collection, on the left-hand side of the diagram we were just looking at. This takes the raw data and brings it into our platform. As we covered in the overview, the data sources are extremely diverse. We have a preference for pull, because there aren't as many reliability concerns with a pull data source: caching is inherent - you grab the data when you're ready to receive it - and that's generally very straightforward, since it allows data to buffer on the origin side. With a push data source, like syslog or an HTTP receiver, the reliability requirements on the input go up: if your input goes down for hours or days, that data will - maybe - be buffered on the source and eventually lost, or it may be lost right away. That's why we have a strong preference for pull inputs, even though we support both in our architecture.

The scale can be very large, so we needed an architecture that is elastically scalable and cloud-native, with scaling that grows and shrinks as the input changes over time. We want it to handle bursts - the old Slashdot effect - and also the general seasonality of data, where nobody uses the platform on a weekend, so we don't need to run a full-capacity cluster at that time. Another goal is for cost growth to be no more than linear with volume. Ideally you get some economies of scale, where costs grow more slowly than volume, but at the very least you don't want the per-unit cost going up as you take in more and more data, because that limits your long-term scalability. We also wanted a system that required limited coding: for a lot of these inputs we wanted collaborators to be able to help - anybody should be able to get data to us - so we developed a set of wrapper functions that let you write many types of inputs in just a few lines of Python (a sketch of the idea follows below). We wanted to be resilient to failure and have health monitoring, which we'll cover the specifics of in a few slides. And most importantly, stay cloud-native; otherwise you'll be paying extra and you won't be efficient on the cloud you're running on.
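The wrapper functions themselves aren't shown in the talk; a minimal sketch of the pattern - assuming an SQS-triggered Lambda and a hypothetical Firehose stream name - might look like this, where an input author writes only the small fetch function at the bottom:

```python
import json
import boto3

firehose = boto3.client("firehose")
STREAM = "central-log-stream"  # hypothetical delivery stream name

def log_input(fetch):
    """Turn a tiny fetch function into a full SQS-triggered Lambda
    handler: one wrapper, many inputs."""
    def handler(event, context):
        for record in event["Records"]:       # SQS batch
            job = json.loads(record["body"])  # EventBridge payload
            batch = [{"Data": (json.dumps(msg) + "\n").encode()}
                     for msg in fetch(job)]
            if batch:
                # chunking to Firehose's 500-record batch limit elided
                firehose.put_record_batch(
                    DeliveryStreamName=STREAM, Records=batch)
    return handler

# an input author writes only this part:
@log_input
def fetch_example_api(job):
    # call the vendor API for the time window described by `job`
    yield {"source": "example-api", "message": "..."}
```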
For pull sources, the architecture is similar one way or another. In our architecture, a notification of data arrives on an EventBridge bus and routes to an SQS queue sitting in front of an input Lambda. The reason we use SQS queues instead of going directly from EventBridge to Lambda is that SQS lets your Lambdas run synchronously and gives you some retry benefits - it's easier to redrive data if things don't go well in your Lambda, which can sometimes happen. After the notification arrives in that SQS queue, the function is triggered, and the data is collected and written to that central Kinesis Firehose, which then flows into our bucket.

Now, what about when we're pulling from an API rather than an S3 bucket? S3 buckets are pretty much the trivial case, but with an API you may not get a webhook notification - I haven't really seen that. EventBridge has a scheduler built in, which is an AWS service, but it has some limitations: it has no ability to catch up if it gets behind - it just keeps monotonically generating cron-type events - and it doesn't offer sub-minute granularity if you need that. So we have a tiny event generator for our pull sources that runs in Fargate and simply generates events onto our EventBridge bus to trigger those inputs. Here's the visual of what that looks like: on the left-hand side we have log buckets, which generate bucket notifications onto the EventBridge bus, and our job generator, which stores bookmarks in DynamoDB but otherwise also publishes onto the same EventBridge bus. Those trigger the SQS queues, which trigger the Lambdas in turn, which pull data from either those buckets or public log APIs, and the logs are written to the Kinesis Firehose and then on to our bucket.

Now, what about those push sources? As I said, syslog still exists in the world, and you might need an HTTP receiver. AWS is pretty thoroughly optimized for building high-volume HTTPS services, so we want to use a similar type of architecture. Lambda is a very good tool for this, because it typically doesn't need to process things for long, so you can simply meet this need by building a receiver Lambda behind an API Gateway (a sketch follows below). If you build something like this, you'll need to make your own decisions about how resilient you want it to be: multiple availability zones within one region, or multiple regions with resilience across them? In general, you can reuse the code from the pull inputs and apply the same formatting and filtering in your push inputs. Design for high volume, of course, and allow for batched uploads - if you make an HTTP request for every single log message, you'll quickly overwhelm even AWS's scalability. And finally, since this is a security talk, you do need to threat model an HTTPS input and decide what risks you're willing to accept. In some cases I don't think it's particularly risky to have this on the internet as long as it's write-only - write-only being the important part - but your mileage may vary, so be sure to threat model this for your own organization.
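A minimal sketch of such a receiver - assuming an API Gateway proxy integration and the same hypothetical Firehose stream as above - might look like:

```python
import base64
import boto3

firehose = boto3.client("firehose")
STREAM = "central-log-stream"  # hypothetical delivery stream name

def handler(event, context):
    """API Gateway proxy handler: accept a batched upload of
    newline-delimited log lines and forward them to Firehose."""
    body = event.get("body") or ""
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode()
    records = [{"Data": (line + "\n").encode()}
               for line in body.splitlines() if line.strip()]
    if records:
        firehose.put_record_batch(
            DeliveryStreamName=STREAM, Records=records)
    # write-only: nothing a caller could read back out
    return {"statusCode": 202, "body": ""}
```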
Once you have that reliable log endpoint, you can use collectors to gather things that don't speak HTTP. There are many open-source tools for this - easy to get, easy to install - and from there you can run a centralized syslog collector and forward it over HTTP, pick up files from machines, or collect metrics, and it all goes into the same pipeline as everything else. So once again, everything is centralized; everything is hunky-dory.

So far so good - but what if something goes wrong? These things do happen: you sometimes get malformed logs, you sometimes get a format change. How do you detect that? Once again we use an AWS-native approach: CloudWatch alarms that watch for messages in the dead-letter queues of our SQS queues, too many messages, too few messages, Lambda failures, and Lambda runtime - because Lambda has a hard limit; you can set a timeout on each Lambda, but the longest you can set is 15 minutes, so as you start to approach that you need to tune. Typically we only see failures in response to somebody making a change. You can also see format changes downstream, but again, that's usually due to a human, and likewise other failures, like runtime exceeded, are usually due to a human as well. So we haven't needed a lot of care and feeding, and problems are usually traceable to something in our GitOps process.

So now we've collected the data and put it all in a centralized S3 bucket - or rather, as part of putting it into the S3 bucket, we take this normalization step; this is the normalization and enrichment stage. We developed a companion system for managing enrichment data, which David will cover a little later on, but for now we'll talk briefly about normalization. We covered the why earlier: structured logs are more performant and easier to analyze, and normalization is also an opportunity to standardize your field names - common things like username and IP address can make your analysis code much easier to write if you use a standardized format. We also have instrumentation inside our normalization engine, which lets us spot when something starts performing poorly or when we get less data than we expect - another point of instrumentation in addition to CloudWatch.

One important choice when you build your own normalization is whether to keep the raw message. The whole point of normalization is to reorganize the message into something easy to use - so do you keep the original? For us, the costs are low enough that we have chosen to keep the raw message, which allows us to reprocess in the future if our schema changes.

Now, I mentioned log schemas. If you're building from scratch like we did, you have a very important choice to make in which schema to use. There are some very high-capability schemas with a whole lot of data types in them; they require a more advanced normalizer, and a more advanced normalizer requires more care and feeding - the more regular expressions you use, the more likely something goes wrong. Popular choices among those more advanced schemas are the Elastic Common Schema and the Open Cybersecurity Schema Framework, both of which are supported by fairly large chunks of the industry. But what did we do? We went with a very lightweight wrapper schema. Since we retain the original message, we keep a fairly good ability to take actions further downstream on that original, or slightly restructured, message. Our wrapper simply records where the message came from, metadata about how it got there, when we first saw it, a deduplication ID, the message itself, and a few other less important fields.
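A sketch of what that envelope could look like, with illustrative field names rather than Benchling's actual schema:

```python
import hashlib
import json
import time

def wrap(raw: str, source: str, collector: str) -> str:
    """Wrap one raw log message in the lightweight envelope
    described above; the payload is kept verbatim."""
    return json.dumps({
        "origin": source,                # where it came from
        "collector": collector,          # how it got here
        "first_seen": int(time.time()),  # when we first saw it
        "dedup_id": hashlib.sha256((source + raw).encode()).hexdigest(),
        "message": raw,                  # raw message retained
    })
```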
Since this is an architecture deep dive, we'll briefly talk about how we actually accomplish that normalization, which I consider a cooperative design. The inputs provide the basic format: they verify that the format matches what's expected and structure the data so it's already in a structured form that can be embedded into JSON, without needing to pull the data out again later. From there it goes to Firehose, and Firehose has the ability to run arbitrary functions on the messages passing through, so there's an additional normalization Lambda where we can do centralized processing: for example, adding enrichment data into the message if there's something transient we might need in the future, and emitting metrics on the data as it goes by in the normalizer (a sketch of such a transformation function appears at the end of this section). That wraps up the normalization phase, and we'll now move into the detection phase, which David is going to take over.

Thanks, Brian. Going back to our original architecture diagram: thank you, Brian, for covering the left half, the very complex phase of taking a lot of different heterogeneous data sources and placing them into a single bucket where you have a nice source of truth. Now we're going to focus on the right side of the equation: what do we do with this data, from a detection and response perspective, as it flows into our data lake? Taking one quick step back - what is detection, really, blue team 101? We're looking at some sort of data source and trying to identify whether malicious events are happening that somebody needs to respond to, either automatically or manually. We've broken it up into two different classifications: we have streaming-style detections, where you're looking at events as the
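Returning to the transformation step mentioned above: a minimal sketch of a Firehose transformation Lambda, following AWS's record-transformation contract, with an illustrative stand-in for the enrichment lookup:

```python
import base64
import json

def handler(event, context):
    """Kinesis Data Firehose transformation Lambda: decode each
    record, apply central enrichment, and re-encode. Records we
    can't parse are flagged rather than silently dropped."""
    out = []
    for rec in event["records"]:
        try:
            msg = json.loads(base64.b64decode(rec["data"]))
            msg["enrichment"] = {"env": "prod"}  # illustrative lookup
            out.append({
                "recordId": rec["recordId"],
                "result": "Ok",
                "data": base64.b64encode(
                    (json.dumps(msg) + "\n").encode()).decode(),
            })
        except (ValueError, KeyError):
            out.append({"recordId": rec["recordId"],
                        "result": "ProcessingFailed",
                        "data": rec["data"]})
    return {"records": out}
```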