
All right, I'm not going to waste any time because we've got a lot of content to get through. I'm Allan, and this talk is about a pain in the ass, one I think you all can relate to, which is why you're here. Feel free to ask questions during the talk; I'll repeat the question and answer as we go. It's pretty dense, so just let me know.

So, who am I? I'm a data engineer by background. I used to work at Cloudflare on the data team, processing all of the HTTP and DNS events from the entire network and making them useful and usable for the operations and analytics teams. It was a very big data pipeline and I really cut my teeth there, learned a lot. Then I went to a company called Segment, which did user analytics. That was interesting because it was similar work for a much different application, so a lot of the challenges were a little different. Now I'm building RunReveal. I've always been a hacker at heart and really interested in the security community. My first security conference was actually BSides 2015, at OpenDNS in San Francisco, so it's a big honor to be able to speak here. The community has always meant a lot to me, and I'm happy to come give a little back. RunReveal is a security data platform helping security teams detect threats in their networks and hopefully stop attackers.

So: SIEM really sucks. SIEM, for those who don't know, is security information and event management. The term was coined a long time ago by Gartner, the company everybody loves, which analyzes what's happening in the industry and helps teams make decisions.
Obviously I'm a founder in this space, so I can't possibly be unbiased, but for me logging has been a bit of an obsession ever since I watched the Logstash announcement talk in 2012 or 2013. The author said he was doing hate-driven development, which is to say he was frustrated with the lack of tooling available to teams trying to build logging pipelines. Logstash, of course, eventually got acquired by Elastic and became part of the ELK stack, which is now one of the most prominent SIEMs today.

But let's ask ourselves: why does SIEM suck? Something that really grinds my gears in the security community is that when an organization or a hospital or a school gets hacked, a lot of people get defeatist and say, well, it's because they didn't care about security. There may be some truth to that, and there are some messed-up incentive structures within those organizations, but I'd like to challenge that notion and suggest that maybe it's because we haven't built the right tools to do these detections at scale when you have so many data sources: parsing all of those sources into something that lets us build a story and correlate an attacker's path through a corporate network.

In my mind, it's more of an economics problem. A lot of the SIEMs built in the past come from an age when logging was very different, computers were very different, and big data wasn't yet a standard practice for how companies process data. I'm here to talk about what we can do to improve the economics of data processing, specifically as it relates to security logs. With a better understanding of that, we can be more informed, make better decisions in our organizations, and do detection better, at a scale on the order of terabytes of data per day or per month.
So what's this talk about? It's really about making security data more generally useful, not locked up in the silo that is your SIEM. It's about defining the problem and its scale, with a brief history of detection systems and how we've thought about them in the past. Then we'll talk about the requirements for building something that can do detection across terabytes of data across all your infrastructure, and about what a modern data pipeline looks like in other areas of an organization, because SIEMs have been built on their own little island and haven't realized a lot of the benefits we've had in the data space for a long time. And then, of course, we'll talk about how to use open-source software to build your own detection pipeline, particularly one that can handle this much data, and recap a bit.

Like I said, I can't be unbiased; I'm a founder in the space, and what I'm about to show you is essentially our architecture. But it's also essentially the architecture of any major data processing pipeline doing anything interesting, so in my mind it's not really a secret. I'm giving this talk because I want to see organizations everywhere improve their detection, because I'm tired of hearing that somebody got popped and they don't actually know how it happened, couldn't find what was compromised, or couldn't even put together a story of what happened. I really hate that.

What is this talk not about? It's not about setting up a Graylog or an Elasticsearch instance. It's really about a larger
scale of detection, at orgs of roughly 500 to 1,000 people and up.

So the problem is really that there's just so much data everywhere. I think everybody can relate to this: anybody who has tried to set up any kind of centralized logging in their organization, or even in their home lab, gets inundated with all of these different tools and devices that each have their own formats. The thing is, these tools do expose a lot of security logs, but that's a catch-22: it's a problem too, because now we have terabytes of data to sort and sift through to find anything useful. Some of the biggest companies are ingesting petabytes per month or per year, and it's spread across all of these surface areas. You really want to capture all of them, because together they hold the whole story of an attacker trying to get into a network. If you're missing some, it becomes really difficult to put the puzzle together, because you have to go into each individual product's UI, look at the audit logs there, and collect all of that data by hand. Doing that during an incident, maybe at 3:00 a.m., is a huge pain, and it often leads to extended remediation periods, a lot of confusion, a lot of stress and chaos.

SaaS applications are notorious for generating lots of logs, but of course we also have devices and network appliances; there are just a lot of places to collect things from. Furthermore, to make things even worse, there are dozens of data formats out there. SaaS logs are comparatively easy: most of them are exported in JSON. JSON is actually one I kind of like, because as an industry we've spent literally billions of dollars optimizing the processing of JSON, for better or worse. And it's
structured: you can look at a JSON object and have some semblance of what each field is supposed to represent. Historically, though, logs were plain text. Logging has been the original telemetry since the dawn of computers, since the first person wrote "hello world" in a C program. Since then it has evolved: we have XML, we have CSV. GCP logs are a particular pain of mine, because Google decided protobuf is a good idea internally, but they need to expose these logs to everybody and all of their other logs are in JSON, so they decided to just base64-encode the protobuf, stick it inside a JSON log, and call it good.
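To make that concrete, here's a minimal Python sketch of unwrapping a base64-encoded binary payload embedded in a JSON log line. The field names ("data", "logName") are hypothetical stand-ins, not the exact GCP schema.

```python
import base64
import json

def unwrap_log(raw_line: str) -> dict:
    """Pull the base64-encoded binary payload out of a JSON log envelope."""
    outer = json.loads(raw_line)
    blob = base64.b64decode(outer["data"])  # still protobuf bytes at this point
    # A real pipeline would decode `blob` with the generated protobuf classes
    # for that log type; here we just surface it for a downstream stage.
    return {"log_name": outer.get("logName"), "payload_bytes": blob}
```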
When you actually have to integrate with all of these dozens of data formats, it just becomes kind of insane. Not to mention timestamps, which are a whole talk on their own; dealing with timestamps and time zones gets pretty miserable. And across all of this, schemas aren't standardized either. So there's this big normalization problem, where we have to look at every single log and decide: what is the source IP? Who is the actor? What is the resource they're trying to access or change? It's this big problem where nothing looks alike, but we're still trying to get it into a form where we can do something interesting with it. Fortunately, structured logging is kind of the norm now, so at least across SaaS applications we can get some baseline and pull out fields from the JSON. But at any organization, you'll probably still have to pull in some legacy log sources, which need a little more handholding and a little more munging to become useful.
But I did want to talk a little about the history of how we got here, because that's core to this whole talk. I strongly believe that Splunk and Elastic have actually been really good for the industry as a whole. Before Splunk, we didn't really have a good detection-and-response system that did centralized log processing. Splunk made that standard in the industry, and they were willing to go integrate with every single product out there to get the logs in and processed. Before that, at the beginning, it was basically grep and syslog. Then Splunk came along and made logs indexable and fast.

Elastic was the first major open alternative, and it came about almost by accident: Logstash was being developed by a passionate developer who realized Elasticsearch would be a pretty good way to search your logs, and then Kibana attached on top of that to give you a nice visualization layer. But ultimately Elastic falls short, because its history is as a document search database, and documents are quite large: they tend to be PDFs, with a lot of information that isn't structured at all. Elasticsearch as a log database works, but it doesn't make sense in the long term, because logs themselves are a kilobyte, maybe; some get into the tens of kilobytes, and if you're processing single logs that are megabytes, you'd better talk to that vendor and figure out why there's so much data in one log. In addition to ELK, other open-source projects came along, like Graylog and Fluentd. Those are actually still quite good for collecting logs, but they don't do a whole lot in terms of aggregation or detection, so we won't get too much into them.

The big thing I want to highlight is that in the last 10 to 20 years we've had major advances in how big data architectures are built, and that has translated into every modern SIEM. Talk to any vendor today: they're all doing the same thing under the hood, just skinning it and selling it the way they want to sell it. We're going to talk a lot about
that.

So why is it so expensive today? Splunk being what Splunk is, acquired by Cisco, they're going to charge a lot, because it's actually expensive for them to index and process those logs, and a lot of support has to be given to the organization. It's not like you can be handed a Splunk instance and be off to the races; you often need to hire consultants or employ engineers to do the normalization and ingestion. So Splunk as a starting point is a bit harder. And when you're running things in the cloud (Splunk has a cloud, Elastic has a cloud, and most SIEMs produced today are cloud-native), the prices start to add up, because you're paying your vendor and the vendor is paying the cloud, and those things are expensive. Outbound network transfer on AWS is 9 cents per gigabyte, which doesn't seem like a lot at first, but when you're processing terabytes and terabytes of logs, that gets very expensive very quickly, so minimizing the amount of network I/O you're doing is really important. Storage costs also tend to add up. With high-performance storage there's quite a big latency gap between your local NVMe storage, your S3-style storage, and your cold storage that's ultimately written to disk somewhere, and every layer you add also adds compute cost, because the system has to wait and load that data slowly from storage; and then there are retrieval costs. It all stacks, and when you're buying a SIEM from a vendor who's paying their own cloud vendor, those costs stack and stack and it gets quite expensive.

In the real world, I talk to a lot of security teams. I try to understand what they're struggling with; I'm really passionate about building a product that
works well for people. And, not really having a background in security operations, I was astonished by how much people are paying for just their basic log collection. I was pretty passionate about this before, but when I heard some of the bills, it was just insane. It's not surprising to hear seven-figure contracts for even medium-sized organizations for a commercial SIEM. Even then, they're discarding a lot of logs just to be able to do anything useful. In fact, there's a company called Cribl, founded by former Splunk engineers who left Splunk to build a company that filters data going into Splunk and saves money, and now they're a multi-billion-dollar corporation built on saving money on people's Splunk bills. It's insane to me; I just don't understand it. Meanwhile, mid-sized companies can't carry huge contracts with these vendors, and it completely prices everybody else out. If you're at an organization with a limited security budget, a lot of times people just skip it, and this is honestly one of the most useful tools, if done right, because no other tool goes across your entire attack surface and pulls data from all of your different SaaS products and infrastructure.

So what we're seeing more and more is that companies either forgo buying a SIEM entirely, or, if they're an engineering-forward company with a security team that understands these costs, they build it themselves. I'm going to talk about Brex and Rippling in this talk, because both are companies that built their own and talked about it publicly, particularly because they recognized the value of owning it internally and the other things you can do with the data once you have it. They're able to collect much more data at just a fraction of the
cost.

Because of these issues, some of the companies that forgo buying a SIEM instead rely on point solutions like Wiz and CrowdStrike for their detection and response. The problem with that, once again, is that you don't have visibility into all the other areas of your organization, particularly all the SaaS and shadow-IT applications people just sign up for. AI is becoming a thing and people are throwing data everywhere without really caring about it. Ultimately, when you're buying point solutions, it's not going to cut it. But I actually see this trend as a good thing, because it sends a message to the greater community that these tools aren't doing the job people are paying seven figures for. It really makes sense to me that these companies decided to forgo it. The flip side is that now you have dozens of security vendors you're paying, and it's not clear you're getting the full value out of them. But we keep paying the piper and setting money on fire, and as long as CISOs get taken to nice dinners, they'll keep buying products that eventually sit on the shelf.

But there is hope. Despite all of these issues, data processing has come a long way. So let's talk about how you might actually build one yourself. We'll start with the requirements you'll need if you're brave enough to do this, then the general architecture for these sorts of systems, the specific tools you can use in your organization if you can get approval, how to do it incrementally, and examples from Brex and Rippling along the way.
So let's start with requirements. The first thing you have to think about is: where's my data? What do I want to ingest? You can't ingest everything; 100% visibility across your entire organization is just impossible. Computers produce an unbelievable amount of data. But if you can prioritize the sources that are most important to your business, you can build something that works for those and extend it as you go. You probably want your infrastructure logs from AWS, like CloudTrail. Your SaaS audit logs are very helpful, because that's where people do most of their work these days. If you still have a physical network, your routers and printers might belong there too. It depends on your organization, but it's a good exercise to figure out what's most important so you can start bringing it in.

The next thing we'll want to do is enrich the data, because a single audit log is not very useful in isolation. You want to correlate it across different environments and add context so you know what it is you're trying to detect. Something very useful is IP-based enrichment: knowing whether somebody came from a VPN or a Tor exit node. It's not necessarily a threat, but it's a good signal; you can tell when somebody is trying to hide what they're doing from your organization.
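As a concrete illustration, here's a minimal Python sketch of an IP-based enrichment stage. The Tor exit list and VPN ranges are assumed to be refreshed from a threat-intel feed you maintain; the values and field names below are hypothetical placeholders.

```python
import ipaddress

# Placeholder indicators; in practice these come from a refreshed feed.
TOR_EXIT_IPS = {"203.0.113.7", "198.51.100.23"}
VPN_RANGES = [ipaddress.ip_network("100.64.0.0/10")]

def enrich(event: dict) -> dict:
    """Attach network context to a normalized event (expects a src_ip field)."""
    ip = event.get("src_ip")
    if not ip:
        return event
    addr = ipaddress.ip_address(ip)
    event["enrichment"] = {
        "is_tor_exit": ip in TOR_EXIT_IPS,
        "is_vpn": any(addr in net for net in VPN_RANGES),
    }
    return event
```

A signal like `is_tor_exit` isn't an alert by itself, but it's cheap to compute at ingest time and pays off later when you're correlating.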
The other thing you'll probably want to build is filters: dropping logs that aren't necessary. You can do some of this at the collection layer, which I'll get to in a bit, but you'll probably still be collecting things from some SaaS vendors that you need to drop. The more you can drop, the more efficient your detection will be, the faster everything runs, and the lower the cost.
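A filter stage can be as simple as a predicate applied before anything is written downstream. A tiny sketch, with hypothetical event names:

```python
# Event types that add volume but (in this example) no detection value.
NOISY_EVENT_TYPES = {"health_check", "metrics.scrape", "cdn.cache_hit"}

def keep(event: dict) -> bool:
    """Return False for events to drop before storage and detection."""
    return event.get("event_type") not in NOISY_EVENT_TYPES
```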
Another important piece is normalization and transformations; the two go hand in hand. Normalization is bringing the data in, pulling out the important fields, and making them consistent, so you can index them, query against them, and have a good idea of what's in the logs. Transformations are part of normalization in a sense, but sometimes you just want to clean up the data a little before you pipe it into a system: obfuscating sensitive data like credit card info, or simply moving fields around, is pretty useful.
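Here's a hedged sketch of what a normalization-plus-redaction pass might look like. The per-vendor field names and the card-number pattern are illustrative only; real PAN detection needs more care than one regex.

```python
import re

# Map vendor-specific field names onto a canonical schema (illustrative).
FIELD_MAP = {"ipAddress": "src_ip", "client_ip": "src_ip", "user": "actor"}

# Crude pattern for card-number-like strings; a stand-in for real detection.
PAN_RE = re.compile(r"\b\d{13,16}\b")

def normalize(event: dict) -> dict:
    """Rename fields to the canonical schema and redact card-like strings."""
    out = {FIELD_MAP.get(key, key): value for key, value in event.items()}
    for key, value in out.items():
        if isinstance(value, str):
            out[key] = PAN_RE.sub("[REDACTED]", value)
    return out
```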
The next thing to think about is how we're going to store the data, which is a slightly tricky subject. Some logs you'll need to store for compliance reasons but aren't actually useful for detection at all. For those, you'll want a cold storage medium, something like S3: something cheaper, so very long retention periods don't cost a lot, and it won't affect the cost of your fast storage, whatever you choose for that.

Fast storage is trickier, because there are a lot of solutions out there and a lot of ways to build it. There are a lot of great open-source databases that are really good at this, but they have a lot of knobs and a lot of configuration to do. For the requirements, what we really want is sub-second queries and enough retention to do long investigations, at least enough to cover a campaign. The most important thing, in my mind, is being able to submit a query and get something back within a few seconds at most, ideally sub-second, so you can actually do the investigation without sitting there, going for coffee, and getting distracted, and stay productive while using your SIEM.

There also has to be some indexing strategy in the fast storage. When I was at Segment, we had an incident, and the security team was looking for how many times an IP had accessed our infrastructure. We had stored all of our logs in S3 (we were still a startup and didn't really have a SIEM yet), and when we submitted the queries using Athena, they took a long time. But that wasn't the big issue. The big issue was that those queries cost thousands of dollars each, and we were sitting there waiting for them to finish without knowing that. Ultimately, it was because the logs weren't indexed. If you're just throwing logs into S3, you're not going to get a very good result, and you can't really hope for any kind of real-time or live detection, which is kind of the whole point.

So, in summary, we want at least two storage tiers, and if we have two tiers, we want some way to route between them: a way to indicate, as logs are ingested, which tier each one goes to.
Now for the big juice of this: detections. Detections are also a tricky subject; there are dozens of different ways to do them. If you're using Splunk, you're using something like Splunk's query language. If you're using Elastic, you might be using Elastic's. What I've come to realize is that a lot of security engineers like that piped query-language style. Figuring out which detection format you're going to use at the start is pretty important, because whichever one you pick imposes different requirements on the storage system underneath it.

You're also going to want to be able to test those detections. If you're a Panther user, you're probably writing Python, and you can run tests against sample data in your repo. Panther also more or less invented and popularized detections-as-code: storing your detections in a repository so you can track changes over time and know exactly what changed where, and if there's ever a regression in your detections and you miss something, you can go back, revert that change, and understand what happened.
The summary is: there are a lot of different detection formats out there. Figure out which one is going to work for you, stick to it, and make your build decisions based on that. I will plug PQL, an open-source detection format. Really, it's just a query language right now, but if you like the Kusto query language or Splunk query language style, it's a way to write those queries against any SQL database. We find it pretty useful because, as we'll learn, most of these big data tools are built on SQL, since that's what analytics and data engineers are used to using. So we built PQL as a bridge between the two worlds of security engineering and log analysis, compiling the piped syntax down into SQL. It's pretty cool; you should check it out.
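Purely as an illustration of the piped style and the kind of SQL it maps to (the exact PQL syntax and generated SQL may differ from this sketch):

```python
# A Kusto-style piped query, roughly the flavor PQL supports...
pql_query = """
logs
| where src_ip == '203.0.113.7'
| summarize count() by actor
"""

# ...and approximately the SQL a compiler like PQL's would produce.
equivalent_sql = """
SELECT actor, count(*) AS count_
FROM logs
WHERE src_ip = '203.0.113.7'
GROUP BY actor
"""
```

The appeal is that security engineers keep the pipe syntax they're used to while the database sees plain SQL it can optimize.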
So here's the big secret: all big data applications look like this. Every modern data architecture is essentially a pipeline shaped like this one. I've annotated it for a security use case, but pretty much every vendor or project doing big data well has an extract step, a transform step, and a load step: ETL. You've probably heard of it. It's no different in logging, and it shouldn't be different in logging; it simplifies the concept so we can actually build something useful.

So for our home-built SIEM, we're going to have collectors that go out to the network and collect logs from devices, applications, and other vendors, and we'll store those in some sort of buffer or staging area. These are the raw logs: not enriched, not transformed, not normalized. We want them in that staging area so we can better process them in the next stages, but it has a side benefit too: if you don't want a dedicated cold storage tier later, this can be your cold storage tier as well. It works really nicely to have that layer in between. It's not a hard requirement, but I highly recommend it. For that buffer, I really recommend any S3-like storage medium; there are a lot of compatible products out there today, including open-source ones. You could also use something like Kafka, though with Kafka you have a few more issues around retention; still, you can make it work.
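A sketch of that raw-buffer stage: batch raw events and land them in an S3-style staging bucket, partitioned by source and hour, so both the processors and later investigations can find them. The bucket and prefix names are made up, and this assumes boto3 with AWS credentials configured.

```python
import gzip
import json
import time

import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")

def flush_batch(source: str, events: list) -> None:
    """Write one gzipped NDJSON batch of raw events to the staging bucket."""
    key = (f"raw/{source}/{time.strftime('%Y/%m/%d/%H')}/"
           f"{int(time.time() * 1000)}.ndjson.gz")
    body = gzip.compress(
        "\n".join(json.dumps(event) for event in events).encode())
    s3.put_object(Bucket="security-raw-logs", Key=key, Body=body)
```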
Once everything is in that buffer layer, we transform it, and this is where a lot of the meat of getting your logs ready for detection lives. The processors, things like Flink, Spark, and Vector, go read all of those logs, parse them into a data format, and then you can start doing transformations, normalizations, and enrichments, and really build out something much more useful. I'll go into detail on all of these, so I won't get too far into the weeds here, but those processors are the technologies that make it possible.

Finally, we load the results into some hot storage database. I really like ClickHouse. Cloudflare is a big user of ClickHouse, which is where I first learned about it. It's an open-source database that's column-oriented; we'll talk about what that means in a bit. You could also use something like Snowflake. I know that's not OSS, but Iceberg is an open-source storage format; it's not a full database, more like the Lego building blocks for one, and you can query it with SQL. What they all have in common is that they're column-oriented databases, which is really, really efficient for processing logs.

Once the data is loaded into this hot tier, that's when you can start doing analysis, correlation, and investigations, using something like Flink or Spark, or just SQL scheduled to run somewhere and analyze the data. From that analysis, sometimes you're inserting insights, signals, and threats back into the database so you can correlate against them later, or you're just using it for an investigation or for reporting. But of course the most important thing is that you're writing some queries, some rules, that actually send alerts out to your PagerDuty or whatever you use for incidents; that's what ultimately wakes somebody up, hopefully not too frequently, to respond to things.

One last note: you might also want real-time detection. A product like Panther does real-time detections in addition to its Snowflake backend, and there's a lot of value in alerting immediately when a single log is very suspicious, but you're not going to get a lot of correlation out of that. Your mileage may vary, but I recommend having some way to alert both from the processing stage (the transforms) and from the analysis stage, which has a longer context window and can see and correlate across more data.
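A sketch of the scheduled-analysis flavor of alerting: run a SQL rule over the hot store on an interval and post matches to a paging webhook. The table, columns, threshold, and webhook URL are all hypothetical; this assumes the clickhouse-connect package and a reachable ClickHouse server.

```python
import clickhouse_connect  # assumes the clickhouse-connect package
import requests

client = clickhouse_connect.get_client(host="localhost")

RULE_SQL = """
SELECT actor, count() AS failures
FROM logs
WHERE event_type = 'login_failed'
  AND ts > now() - INTERVAL 10 MINUTE
GROUP BY actor
HAVING failures >= 20
"""

def run_detection() -> None:
    """Run one brute-force rule; page for each actor over threshold."""
    for actor, failures in client.query(RULE_SQL).result_rows:
        requests.post(
            "https://example.invalid/alert-webhook",  # placeholder endpoint
            json={"rule": "brute_force_login",
                  "actor": actor, "failures": int(failures)},
            timeout=10,
        )
```

Schedule it with cron or your orchestrator of choice; the longer lookback window is what buys you correlations a single-event rule can't see.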
So this is actually Brex's architecture, at a very high level. Their project is called Substation. It's built primarily using Lambda functions as the processors; as I understand it, those are typically written in Python there. You'll see this looks very similar to the architecture I just described: sources are ingested into a raw stream. They don't go into a ton of detail about how those sources get populated, but I believe a lot of them come from SaaS logs directly. Then in the processors they do the enrichments and the normalization, and put the results into the processed stream, or write them to the raw bucket for long-term storage. After processing, they either write to another SIEM, which does the big database work and alerts on certain detections, or they write to some of their other destinations: DynamoDB for some other types of detections, or a processed S3 bucket if they just want the normalized, enriched data available to analyze later for other reasons. Because if you look at the raw S3 bucket, that data isn't normalized, so it's not as easy to load back in for investigations and that sort of thing. They've done a lot here and saved a lot of money doing it; it's a pretty cool architecture and I'm a big fan. Rippling, I don't have an architecture diagram for, but it's kind of similar: also built on Lambda functions, ultimately writing to a database where they run the same kinds of queries.

Now that we've got the requirements and architecture out of the way, let's talk about what implementing something like this looks like. The first step is collection, and honestly this is probably one of the hardest steps, because, as I was mentioning, the attack surface area is growing dramatically. We have so many different devices and applications, and they live in various places: sometimes physical networks, sometimes virtual networks, sometimes your cloud, sometimes your brother's cloud. We need a strategy for actually going out, getting those logs, and getting them into our data pipeline. Some strategies involve API polling; if you're a GitHub user, that's the standard way to get your GitHub logs out. Some SaaS applications natively support exporting to S3 buckets, which is the preferred way and is slowly becoming a standard: you can start processing and normalizing immediately and sidestep a lot of the collection work.
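For the API-polling case, a collector can be a small loop per source. Here's a sketch modeled loosely on GitHub's organization audit-log endpoint (GET /orgs/{org}/audit-log, which requires an enterprise plan and an appropriately scoped token); treat the details as an approximation, since pagination and auth vary by API.

```python
import requests

def poll_audit_log(org: str, token: str):
    """Yield audit-log events for an org, following Link-header pagination."""
    url = f"https://api.github.com/orgs/{org}/audit-log"
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    while url:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()
        yield from resp.json()
        # requests exposes parsed RFC 5988 Link headers via resp.links.
        url = resp.links.get("next", {}).get("url")
```

In practice you'd also persist a cursor (for example, the last event timestamp) so a restarted collector doesn't re-ingest everything.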
But for everything else, like the servers you deploy in your own infrastructure, you can use something like Vector or Bento (formerly known as Benthos). These are essentially logging agents. Vector is written by Datadog, in Rust; it's extensible, and you can hook it up to collect all the logs from a particular box and send them basically wherever you want. It's very efficient; all you Rustaceans out there probably know all about it. Bento is a Go-based alternative that sits at a slightly higher level: it's not specifically about logs or telemetry, it's about any kind of data, so it's a little more flexible in that regard. It depends on your use cases and on whether it supports the integrations you need out of the box, so analyze your environment and your infrastructure and see which one works best for you. A bit older is Fluent Bit, essentially a rewrite of Fluentd, which was a Ruby implementation. It's also very popular, has a lot of integrations, and is very fast and useful. I don't want to say antiquated; it's the more established option, and still useful.

Then there's schema management, which is kind of an unsolved problem; there are just so many different ways people expose their logs. There's JSON, of course, and having some sort of JSON Schema validation as you ingest is pretty important. There's also protobuf. Parquet is a columnar format you can store in S3, and a lot of vendors have started supporting writing it directly to S3. MessagePack is a binary-optimized format that's semi-compatible with JSON. Whatever you use, as you're building these integrations and ingesting things, it's really important to version-control your data definitions, because I guarantee you they will change, and if they aren't version-controlled, managing them long-term becomes a nightmare. Overall, I would highly recommend JSON and S3. Like I said at the beginning, the industry has spent billions and billions of dollars optimizing that particular workflow, and it's pretty easy to use, understand, and work with. In my opinion, some of the binary data formats are more effort than they're worth, and you don't necessarily save that much in the long run if you're spending far more time managing them directly.
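A sketch of schema checking at ingest, using the jsonschema package: validate each event against a version-controlled JSON Schema and surface failures as schema drift rather than silently mangling data. The toy schema below loosely imitates two fields of an Okta-style event; the real definitions would live as versioned files in your repo.

```python
import json

from jsonschema import Draft7Validator  # assumes the jsonschema package

# In practice this is a versioned file in your repo, not inline code.
OKTA_EVENT_V1 = {
    "type": "object",
    "required": ["eventType", "published"],
    "properties": {
        "eventType": {"type": "string"},
        "published": {"type": "string"},
    },
}
validator = Draft7Validator(OKTA_EVENT_V1)

def validate_line(raw_line: str):
    """Parse one log line; return (event, list of schema violations)."""
    event = json.loads(raw_line)
    errors = list(validator.iter_errors(event))
    return event, errors  # non-empty errors => schema drift; alert on it
```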
The next step is processing: the data refinery. You've gone out to all these different places and collected the data, it's coming in in different formats, and now we really need to start making it useful. This is where you do your data normalization, your enrichment, your filters: adding context and building up a log that actually tells you more, like where an IP is actually from, what city, what country, how they're accessing your service. Perhaps most importantly, part of normalization is getting the data into a format that's easily indexable. One of the challenges with Elastic and Splunk is that a lot of people spend all their time trying to get data into a shape that indexes quickly; with a modern data pipeline, you do that work up front, so your indexed fields are already normalized before they're inserted into the database.

Filters matter here too: you need them to reduce the data you're actually collecting, because otherwise it's just too much to analyze. There is a limit to these systems, and there's not a whole lot of value in collecting things like Kubernetes logs wholesale. Some Kubernetes logs can be useful, but in my experience, as software engineers we're taught to log everything and let the logging system handle the filtering and extraction. I don't actually subscribe to that; maybe it's Kool-Aid, maybe it's propaganda from the log vendors, but the reality is you're going to have to filter out data at some point. Not everything is worth logging.

So what tools can you use for this? If you had to pick one, Vector is actually pretty good at this: we talked about Vector for collection, but it's also good at pulling data directly from an S3 bucket and doing these kinds of normalizations, enrichments, and filters. Bento and Benthos are also pretty good options. If you're looking for much higher scale, Spark Streaming and Apache Flink are two very popular, very well-supported big data streaming projects: you can connect them directly to an S3 bucket, write your rules, your enrichments, your transforms, and plug them into a sink.
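As a hedged example of the higher-scale route, here's a minimal PySpark Structured Streaming job that reads raw JSON from the staging bucket, applies a drop-filter and a rename-style normalization, and writes Parquet back out. Paths, column names, and the tiny schema are placeholders (streaming file sources in Spark require an explicit schema).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("log-refinery").getOrCreate()

# Streaming file sources need a declared schema; this one is a toy.
schema = StructType([
    StructField("event_type", StringType()),
    StructField("ipAddress", StringType()),
])

raw = (spark.readStream
       .schema(schema)
       .json("s3a://security-raw-logs/raw/"))

refined = (raw
           .filter(F.col("event_type") != "health_check")  # drop-filter
           .withColumnRenamed("ipAddress", "src_ip"))      # normalization

query = (refined.writeStream
         .format("parquet")
         .option("path", "s3a://security-processed-logs/normalized/")
         .option("checkpointLocation", "s3a://security-processed-logs/_chk/")
         .start())
query.awaitTermination()
```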
The nice thing about those two is that they're built in Java but support Java, Scala, and Python, so if any of those languages are already used in your organization, it's a lot easier than something like Vector or Bento, which are written in Rust and Go and don't offer that kind of dynamism; they're less flexible if you need a lot of flexibility. Either way, the refinery is a big part of the system.

Then there's storage. Storage is the bread and butter of investigations: if you want to correlate things across a wide range of data, you need a good storage layer. Personally, I can't recommend ClickHouse enough. I've been a user for many years, and it's just incredibly fast; it kind of blows my mind how much it can support. You can insert millions of rows per second on consumer-grade hardware, and being able to do that is huge, because you can then query it at a similarly high speed. The big thing it provides is a column-oriented database. For those who aren't familiar, column-oriented databases store data that looks similar together, in a localized format on disk. When you do that, your compression ratios increase dramatically, and increasing compression ratios means increasing query speed, decreasing query latency, and building something that can correlate effectively across a large amount of data. Again, S3 plus Iceberg is also columnar, but it's more of a do-it-yourself thing: a bit more difficult to set up, still quite useful. And then you'll want the tiered data strategy I mentioned before. For a sense of scale, Cloudflare uses ClickHouse for massive amounts of data every day.

I did want to give a quick five-second lesson on why column-oriented databases actually make things faster and more efficient, because it gets at the crux of why SIEM sucks today: the vendors all built their own bespoke databases, and most of them aren't really doing this. Some of the new ones are doing it under a different skin, but this is the reality underneath. Row-based databases, the classical kind, build indexes that point to a single row on disk, and it's actually quite inefficient for a computer to traverse the index and pull an individual row out of a small block in a file. Column-oriented databases flip the script: they store the columns together, and the indexing isn't indexing in the classical sense. It's a sparse index: you only keep marks every hundred or thousand or so rows (in ClickHouse, the default is 8,192). With that layout, you can load those runs of rows and scan them very quickly on consumer CPUs, which lets you trivially parallelize your queries and scan gigabytes of data per second, looking for the signals you're trying to pull out of your data.
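Here's what leaning on that layout can look like in practice: a hedged sketch of a ClickHouse log table, sorted so similar data sits together, with dictionary-style encoding via LowCardinality and the default sparse-index granularity spelled out. The columns and codec choices are illustrative, not a complete schema.

```python
import clickhouse_connect  # assumes the clickhouse-connect package

client = clickhouse_connect.get_client(host="localhost")

client.command("""
CREATE TABLE IF NOT EXISTS logs (
    ts         DateTime64(3),
    source     LowCardinality(String),  -- dictionary-encoded, few distinct values
    event_type LowCardinality(String),
    actor      String,
    src_ip     String,
    raw        String CODEC(ZSTD(3))    -- keep the original log, compressed hard
)
ENGINE = MergeTree
ORDER BY (source, event_type, ts)       -- similar data lands adjacent on disk
SETTINGS index_granularity = 8192       -- one sparse-index mark per 8192 rows
""")
```

The ORDER BY key is the decision that matters most: it determines which queries can skip data via the sparse index rather than scanning everything.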
If you take anything away from this talk, it's this: think hard about the database you're using to store your security data, because it ultimately makes a big difference to the bottom line of how quickly you can detect, how quickly you can respond, and how effective your SIEM is. These advances have massively improved query and storage efficiency; again, ClickHouse, Snowflake, Iceberg. Any data that looks similar should be stored close together. The specific advances are delta encoding, storing only the differences between adjacent elements in a column, which is great for timestamps and numbers, and dictionary encoding, which is incredibly effective for logs, because every language only has so many words, so you're basically storing pointers to the same repeated strings.
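A toy illustration of the delta-encoding idea: monotonically increasing timestamps collapse into tiny deltas when stored column-wise, and small repetitive integers compress far better than the raw values.

```python
# Column of event timestamps (seconds). Stored row-wise these are large,
# high-entropy integers; stored column-wise we can keep just the deltas.
timestamps = [1700000000, 1700000001, 1700000001, 1700000003, 1700000007]

deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
print(deltas)  # [1700000000, 1, 0, 2, 4]: mostly tiny, highly compressible
```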
Finally, the thing that has really changed in the hardware realm, which is kind of my background, is that storage prices have dropped insanely: flash that cost around 38 cents per gigabyte is now around 5 cents per gigabyte, and that's just in the last 10 years. So if you're paying a big premium for extra retention, that just doesn't make sense anymore, and you should really put pressure on vendors and ask why additional retention costs so much.

Then there's the thing I maybe glossed over a little: operations. Keeping the lights on. When your schemas change, you want to get notified; if a source goes down, you want to get notified. A lot of this is just writing cron jobs, or queries that run on a schedule, to check that each source is still there and still sending some reasonable amount of volume.
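That baseline health check can literally be a cron'd script. A sketch, again assuming the clickhouse-connect package, with the thresholds and webhook as placeholders:

```python
import clickhouse_connect  # assumes the clickhouse-connect package
import requests

client = clickhouse_connect.get_client(host="localhost")

# Minimum hourly volume we expect per source; values are illustrative.
MIN_EVENTS_PER_HOUR = {"cloudtrail": 1000, "okta": 50}

def check_sources() -> None:
    """Alert when a log source goes quiet (run this from cron)."""
    rows = client.query("""
        SELECT source, count() AS n
        FROM logs
        WHERE ts > now() - INTERVAL 1 HOUR
        GROUP BY source
    """).result_rows
    seen = {source: n for source, n in rows}
    for source, minimum in MIN_EVENTS_PER_HOUR.items():
        if seen.get(source, 0) < minimum:
            requests.post("https://example.invalid/alert-webhook",
                          json={"rule": "source_volume_low",
                                "source": source,
                                "seen": seen.get(source, 0)},
                          timeout=10)
```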
That's the baseline. If you want to get more advanced, you can do things like AI-based correlation, but that gets difficult to tune; the important part is that something monitors the data flow. Then there's scaling. Autoscaling can be finicky; a lot of the tools I've pointed out support it somewhat out of the box, so it's not too bad, but you really need something there, because otherwise you could take a DoS attack during an incident and not be able to use your SIEM, which would be pretty bad. And finally, permissions and access control are always a challenge. Figuring out how to audit the audit log is also tricky, and figuring out how to implement that when you're building it yourself is, I think, kind of an unsolved problem today. There are tons of products out there doing just this, but that one is probably the hardest.

So, in review: there can be a ton of benefits to building your own detection pipeline, your own security data pipeline. Not the least of which is having ownership and control over the data: you can use it for other applications, extend it, retain it for longer, and even get better performance than you would buying from a big old vendor. Some of the drawbacks, of course: you have to get permission to do this, and you have to put some resources into it, so it's still a little challenging. But if you want to do something like this, it's not that far-fetched today. Keep it simple, keep the architecture simple, know the parts that go into it, and you can really start to build something interesting and query terabytes of logs. It's my hope that by understanding these modern data architectures, we can start to improve detection pipelines for everybody, across all organizations, because today there's really no excuse for not being able to tell what path an attacker took through your network, or for missing some of the most basic detections just because the data wasn't in your SIEM or was too hard to process.

That's it, thank you. Any questions?

Audience: How bad has the schema management hell been? These log formats are always changing.

So the question was: how bad do I think the schema management hell has been? It really depends on the vendor and the source. For your established cloud vendors, it's really not that bad; they occasionally add some logs or add some fields, which, if you're clever, shouldn't break your SIEM. It's really the smaller vendors, the startups, the SaaS products used by, I don't know, hundreds of organizations rather than thousands, that tend to have schema changes pretty frequently, especially if they ship often, on the order of every other month or so. So it's really variable. But I'd say the more important thing is just knowing when a source goes down, and a lot of people miss that step to begin with. While the schema problem is bad, just getting baseline health, confirming you're still receiving data, is the most important thing you can do.

All right, I'll be around after the talk if anybody has any other questions. Thank you.