
Thank you. Hello everyone, and thank you for coming to this talk. We're going to talk a little bit about what it means to do detection at scale — actually, at hyperscale. Sorry, they told me that if I put "hyper" in front of it, it sounds even better. But first, the obligatory introduction: who am I, and why am I talking about this? My name is Domagoj. I'm originally from Croatia, and I'm currently a security engineering manager at Google. I've had a lifelong interest in security — as I'm sure many of you have — and started doing security somewhere around high school. I finished university, and two or three years later I started at Google, and I've basically been there ever since — twelve years now. It's been quite a privilege. This is not going to be a detailed talk about my life journey, but if you're interested in any of the experience from those twelve years, you can catch me later around the conference. I started as an engineer, an IC in security, and I worked in the detection and response organization — and I stayed there for all twelve years. So my journey is pretty boring in a way: nothing special, pretty smooth sailing, but I've really enjoyed it. There was one change somewhere in the middle of those twelve years — I kind of stopped counting — I became a manager. People management is my newest skill. This presentation is supposed to be a little bit technical, but if that doesn't work out, I can always start talking about roadmap prioritization and performance management techniques — happy to do that as well. So that's me, that's a little bit of an intro. I'll try to keep this talk on the shorter side, because I'd actually like you to ask me questions — I'm hoping this can turn into a little bit of a live ask-me-anything session. Now, let
me introduce Google's detection and response organization, and the security organization in general. Briefly, what is our mission? Not surprisingly, our mission is to protect, respect, and defend our users, Googlers, and the internet. In my own team in detection, when we say "users" we mean everyone — Googlers and external users alike. We need to protect everyone; that's our mission. Of course, many different teams at Google contribute to this. Google has a big security organization with many moving parts; detection and response is just one sub-organization, and inside detection and response many teams contribute to the mission — digital forensics, incident management, and so on. This talk is going to be mostly about detection, but of course we don't forget our partners. The way we approach this problem space from the human perspective is that engineering and operations are combined in one role. Security engineers in detection and response — specifically in detection — do both operational work and engineering work. They're two sides of the same coin, and being excellent at both is crucial to our overall success and our ability to deliver on the mission. And we're not alone: strong partnerships within the company are essential to our success, and you'll see shortly what I mean. Most notable is our partnership with the software engineering organizations. We have a lot of software engineers at Google, as you can imagine, and some of them are exclusively dedicated to building the security infrastructure we then use to protect the company. That partnership is critical to our success; we spend a lot of time building it and working together. All of this contributes to the overall mission of detection and response.

So why is intrusion detection at scale a problem? I think it should be fairly obvious: the huge amount of variety you have to deal with every single day. Everything can happen at scale. When you're monitoring millions of endpoints, thousands of them have a bad day every day — they're broken, something's wrong. Behaviors you think should never happen happen daily, because at this scale it's basically a statistical certainty. If you design your rules around "this only happens if an attacker is in the network" — guess what, no, this happens essentially every day. It turns out employees do weird stuff. It turns out that when all these different systems combine, weird stuff pops out. And of course, things go completely bonkers all the time. For detection, logs are your bread and butter — but then you have logs arriving from the distant future, and logs with timestamps from the middle ages. How do you deal with that? You have logs that should have arrived but haven't, constantly, all day, every day. That's the challenge of being at scale: you have to deal with pretty much everything, because when you connect the dots across the logs, you're essentially overseeing the whole company through its own logs, and you have to figure out what actually makes sense and how to deal with it.

There was a great talk today about metrics and how metrics can be misleading, so I thought I should put
some metrics up here for all of you — just quick stats to give you a sense of the scale, a little bite of what we have to deal with. In terms of data sources — the different types of logs we have — we're on the order of 500, and that number is a little outdated; there are more by now. We ingest different types of logs to make sure we can cover different threats on different surfaces. In raw terms, that translates to around 7 trillion log lines per day — each an individual log line we have to process and ingest to protect the company and write our rules against. The figures struck through on the slide were in the billions — we had something like 120 billion, then over a couple of years it grew, and it grows constantly. That's to give you a sense that it's never static: as the company grows, you have to adapt. In terms of services and applications we monitor, we're on the order of thousands. Google also buys a lot of companies, as I'm sure you're familiar, and we're responsible for protecting them too. Each acquired company is a completely different architecture with completely different systems, and we have to integrate and monitor them — another dimension to the complexity and the scale. We have around 50 different acquisitions, as we call them, that we monitor.

In terms of data processing, what do we do to combat all that? We have around 1,200 rules — probably even more; I actually don't know the exact number. I'm a manager, I don't need to know that. Luckily, there's a lot of internal technology we use. We have our own endpoint agents that we roll out to get really healthy telemetry. We use machine learning — yes, of course we use it, if you're interested — and all kinds of statistical analysis for detection. We use plain old manual hunting: we have a lot of security engineers who really know how to dig into the logs and find badness and attackers just by being very creative and exploring the data. And we have a lot of automated systems that process all of this — that's the second part of the slide, on your right. It's an oversimplification, but say we have a million events per year. Many of them are handled completely automatically: automated systems analyze those events and decide on the badness, and the leftovers go to security engineers and analysts, who look at them and make the final decision on whether something is really bad. If it's an incident, we have incident management procedures to handle that as well. I hope that gives you a sense of the scale and what we're dealing with.

Given all of this and all of these problems, how do we tackle it? This is also part of my journey at Google. A couple of years back, we ran into an issue:
we were actually having trouble scaling our detections, our new rules. How do we keep up with the growth of the company? How do we cover more threats, when our systems weren't good enough? So we really sat down and thought hard about our detection and how to make sure we're sustainable for the future. I'm going to start very philosophically here and then jump straight into some examples, so it'll be an interesting jump. From the philosophical perspective, we devised three big pillars of detection at scale.

First is modeling. Given all the problems you have with your logs and data sources, you just cannot leave them lying around as they are. You need to take active control and do a lot of data modeling to represent the domains you want to protect. This is where we invested a lot of energy and resources, and it's one of the very key aspects of what it means to do security engineering work at Google: you look at the data sources and telemetry you have, and you model them into abstractions that represent your world — abstractions on top of which you build your rules. Those abstractions put distance between you and the raw logs, which makes them easier to maintain and enables you to scale. That's our first pillar.

Second is intent-based detection. When you write detections directly on top of logs, you often run into the problem that you have to wrangle the actual data sources: you have to think about how the data processing pipeline works, what you need to do to make sure the pipeline or the query even finishes, and so on. None of that scales for security engineers, because you don't have time — it's not what you should be thinking about.
You should be thinking about how to find evidence of bad behavior. So we invested a lot of time and energy in designing a system with a unified API that security engineers can use to express their intent — what they actually want to detect — on top of the abstractions modeled earlier.

Then, equally important, context: bringing in relevant information just in time, when it's needed. You cannot make automated decisions without information. Integrating the key information sources in your environment — simple examples being asset inventory and HR information — pulling them in when you have a detection, enriching it, and drawing additional conclusions, is the way to scale.

Those are the three philosophical pillars of the detection we want to build, and we invested a lot of time, in partnership with our software engineers, in actually building the systems that enable us to do that. Let's jump quickly into a code example to see how this works. The code I'm going to show you is mock code — it's obviously not exactly how it really looks, but it's actually not that far off; it's simplified for the purposes of this presentation.

And of course, we're going to start with "back in my day." When I started, we were obviously writing rules, and the way we wrote them was like this — a little more complex in reality, but basically we wrote a large number of SQL queries. You can think of almost every SQL query as one detection rule: when you run the query and get results, you can consider that a detection — something bad happened. The most trivial rule you can possibly imagine is written up on the screen; this would be the basic thing.
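The slide itself isn't captured in the transcript, so here is a hedged reconstruction: a rule of roughly the shape described, sketched in Python with SQLite, with entirely made-up table and column names.

```python
import sqlite3

# Toy stand-ins for the real log tables; every name here is invented.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE execution_logs (host TEXT, uid TEXT, file_path TEXT, sha256 TEXT);
CREATE TABLE vt_hashes (sha256 TEXT, positives INTEGER);
INSERT INTO execution_logs VALUES ('host1', '1001', '/tmp/dropper', 'abc123');
INSERT INTO execution_logs VALUES ('host2', '1002', '/bin/ls', 'ffffff');
INSERT INTO vt_hashes VALUES ('abc123', 34);  -- hash flagged by many AV engines
""")

# One detection rule == one SQL query; any row it returns is a detection.
RULE = """
SELECT e.host, e.file_path, e.sha256
FROM execution_logs e
JOIN vt_hashes v ON e.sha256 = v.sha256
WHERE v.positives > 0
"""
for row in db.execute(RULE):
    print("DETECTION:", row)
```

Note that everything flowing through this query is a bare string — which is exactly the problem described next.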
You query your execution logs for cases where a file's hash matches a VirusTotal hash with known-bad signatures, and say: okay, I want to detect this. We wrote rules like that, and obviously it was very painful. There are a couple of problems here that really prevent you from scaling. First and foremost, these are all just strings most of the time, because we used very simple log parsing. The host is just a string — and guess what, with hundreds of thousands of hosts, they don't have unified names, so you can't rely on it. UID, file path, arguments — all just strings. It's hard to compare them, hard to reason about them; any kind of advanced data processing is not easy. Then, when you want to join across different data sources, as in this example, you have to really know joins. Any SQL specification, and the documentation of its implementations — especially the internal Google tooling — runs to hundreds of pages, and our security engineers knew what was on the left page of that specification, so to speak. That's not good, because you end up spending your time wrangling the super-detailed mechanics of joining data. This is a very trivial join; of course we had more complex joins for more complex rules. Then you run into performance issues with those joins — not a trivial thing — and the maintenance burden on those rules is immense. And finally, through all those SQL operations you're hiding your intent. Somebody comes along later, wants to read or modify the rule, and says: okay, I have to understand this complex join — why is it here, what is it telling me, what's its purpose? And of course, no documentation, ever. It's very hard to keep documentation clean, even at Google.

So what do we have today, and why do we think it's much better? Our
rules today look more like this — again, not exactly, but closer. Protocol buffers, of course, to save us from all of that. If you look here, this is an analog of the previous rule. We have an execution event — one of the strongly modeled, strongly typed domains I was talking about before. You invest in this, you model it; security engineers have decided this is how the data structure should look. These are all strongly typed data structures, and you can see they contain further data structures: a file has its own primitives, machines have their own primitives like the host identifier. To the host identifier you can attach operations that are specific to hosts. You're using perfectly normal software engineering principles to model your domain. Every execution event — and there are millions and millions per second — creates one of these messages in storage.

On the right-hand side is the processing pipeline, and this is the API — the intent-based API for writing your rules. You use something like this to express your intention, and it's much easier to reason about. You just write something like logs.execution_events — that's how you read execution events from the logs. You don't need to know where they come from; the platform automatically handles late-arriving logs, early-arriving logs, overlaps in the logs. You don't have to think about any of that — you can really focus on your intent. You implement some filters, you implement some enrichments — pulling in that context at the right moment to make sure your rules are high quality. And then you output what we call facts. These facts are themselves structured, so you can build on top of facts — they're recursively structured — and compile more complex rules. You can model your way from base facts to more complex facts and then output alerts when something really bad is happening.

This is now serving us really well. It solves a lot of the maintenance issues, and it really enables you to scale. It's a lot of investment, but reducing the gap between software engineering and security engineering — bringing those two together — is what really enabled us to move the needle and push forward.

And that's it; that's what I wanted to tell you today. A little bit of
code, a little bit of philosophy. I hope I made some points. And guess what — this thing is not perfect. It doesn't work perfectly, and we haven't migrated everything to our satisfaction, so we need smart people. We are hiring: there's a Google booth, come visit us — you'll probably want to meet Juan there, he has some things for you. I invite you to spend some time with us there; I can go into more detail on whatever you're interested in — my path at Google, the technologies, the technical details, happy to talk about any of it. And now I'd like you to ask me questions. I believe we have about ten minutes, so I'm going to answer. Yeah.
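For readers of the transcript: the intent-based rule from the demo isn't reproduced here, but a very rough sketch of the *shape* of such an API — strongly typed events, an enrichment step, a filter, and structured facts as output — might look like the following. Every class, field, and function name is hypothetical, and the real platform's handling of late, early, and overlapping logs is omitted entirely.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable

# Hypothetical stand-ins for the strongly typed, proto-like messages.
@dataclass(frozen=True)
class ExecutionEvent:
    host_id: str
    file_sha256: str
    vt_positives: int = 0  # filled in by the enrichment step

@dataclass(frozen=True)
class Fact:
    kind: str
    host_id: str
    detail: str

class Pipeline:
    """Toy intent-based pipeline: read -> enrich -> filter -> emit facts.
    The real platform also handles late/early/overlapping logs; this doesn't."""

    def __init__(self, events: Iterable[ExecutionEvent]):
        self._events = list(events)

    def enrich(self, fn: Callable[[ExecutionEvent], ExecutionEvent]) -> "Pipeline":
        self._events = [fn(e) for e in self._events]
        return self

    def filter(self, pred: Callable[[ExecutionEvent], bool]) -> "Pipeline":
        self._events = [e for e in self._events if pred(e)]
        return self

    def emit(self, to_fact: Callable[[ExecutionEvent], Fact]) -> list[Fact]:
        return [to_fact(e) for e in self._events]

# Pretend enrichment source, e.g. a cached VirusTotal lookup table.
VT_POSITIVES = {"abc123": 34}

facts = (
    Pipeline([ExecutionEvent("host1", "abc123"), ExecutionEvent("host2", "ffffff")])
    .enrich(lambda e: replace(e, vt_positives=VT_POSITIVES.get(e.file_sha256, 0)))
    .filter(lambda e: e.vt_positives > 0)
    .emit(lambda e: Fact("known_bad_execution", e.host_id, e.file_sha256))
)
print(facts)
```

Because the output facts are themselves structured, they could feed further pipelines — the recursive build-up from base facts to complex facts to alerts that the talk describes.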
[Applause]
>> So, you said that... what's the main challenge with the model?
>> I think currently the main challenge is more about how we even expand it further. How do we integrate it with AI, which is definitely coming? We see some of the gaps there — what's the next thing — and we need smart people who can think about that as well. So that would be one challenge. Another is end-to-end testing of these things, which is a bit of an unexplored, underserved area where we want to invest more. When you have these big rules that detect some threats — okay, how do you know they actually work? That's a very interesting question, so we'd want to invest there as well. Those are examples.
[Music]
>> Ah yes, that's a big advantage of this — I'm going to go back here. Take the execution event, for example. When we acquire another company, and say that company has some other agent on its endpoints logging executions, the only thing we need to do is create a specialized parser that translates those logs into our execution event — and then every single rule that covers executions just works out of the box. Sometimes we have these funny moments where we see logs start flowing and actually triggering rules, and realize we forgot to send the announcement that we'd onboarded a company — the pipeline was already ready and just did it automatically. Before this, it was a nightmare: for every single company we'd have to duplicate the rules, tweak every single SQL query for that acquisition, and say, okay, now this set of rules works for this company. The marginal cost of every additional acquisition was huge for our team and our resources.
>> What is Google doing to stop people using Google infrastructure against other enterprises — whether it's infrastructure or Gmail?
>> I can't answer that question completely. Google has a lot of different teams that deal with abuse; they do all kinds of crazy, advanced stuff to fight it. That's not really my area, so I won't comment further. Yeah. Go ahead.
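Going back to the earlier answer about onboarding acquisitions: the "specialized parser" is essentially a schema translator into the canonical event shape, after which existing rules apply unchanged. A toy sketch — the third-party log format and every field name are invented for illustration:

```python
import json

# A hypothetical log line from an acquired company's endpoint agent.
raw = '{"machine": "acme-laptop-7", "binary_hash": "abc123", "cmd": "/tmp/dropper -x"}'

def parse_acme_execution(line: str) -> dict:
    """Translate the acquired company's log schema into the canonical
    execution-event shape; every existing execution rule then covers
    the new fleet with no per-company rule duplication."""
    rec = json.loads(line)
    return {
        "host_id": rec["machine"],
        "file_sha256": rec["binary_hash"],
        "command_line": rec["cmd"],
    }

event = parse_acme_execution(raw)
print(event)
```

The per-acquisition cost collapses to writing one such translator, instead of re-tweaking every rule.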
>> Oh, I wish we could have flipped a switch. Yeah, it required serious engineering, serious dedication, and serious support from leadership, obviously. It took a couple of years to build those systems and do the migration. It didn't come for free, for sure. Go ahead.
>> For people running detection against... what kind of top...
>> Are you asking me from the customers' perspective?
>> Yeah. What are some of the things people get wrong?
>> I can't say what the classic things people get wrong with GCP are — I don't think I can comment on that. But what I can say is: for all of our customers, don't worry, we've got you covered. I mean it — our mission is to protect the entirety of Google, including its production infrastructure. By extension we protect Cloud, and by extension Cloud's customers. For example, we have rules that can detect sandbox escapes from GCP, and so on — that protects everyone. So I don't really know what mistakes customers would be making. You wanted something...
>> Question over there.
>> Yes, I'm pretty sure it did. I'm not directly involved in that relationship, but this is invaluable information and an invaluable partnership for us. Yeah. Go ahead.
>> Ah yeah. It's an art, I can say that. It gets difficult from time to time, because operations is driven by interrupt work and its priorities: if an incident comes up, if something urgent comes up, you have to do it, while engineering work requires more sitting-down-and-thinking time. We've designed a lot of our team structure and protocols around this — you have a scheduled on-call, so you can plan ahead and know when you'll be on call, and so on. Google has invested a lot in our team — I'm really happy about that — so we have a follow-the-sun model, and people get normal working hours and normal on-call hours. It really requires a good partnership with leadership and engineering to put that structure in place so you can manage it. But we think it's invaluable, because being in touch with operations — handling those incidents and alerts — informs your engineering, and vice versa.
>> Go ahead. Yeah.
>> Yes.
>> Yeah, great question. I wanted to put it in the presentation, but I didn't know how far to go into the real details. That's just one pattern the API supports; we have multiple patterns we use to break up the domain. One of them, for example, is a correlation pattern: we can correlate different events across different time frames. There are different patterns we support that you can use to express your intent.
>> Go ahead — question.
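A correlation pattern of the kind just mentioned — pairing related events on the same entity within a time window — could be sketched like this; the event kinds, the window, and all names are invented for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical event stream: (timestamp, host, kind).
events = [
    (datetime(2024, 1, 1, 10, 0), "host1", "suspicious_download"),
    (datetime(2024, 1, 1, 10, 3), "host1", "new_binary_executed"),
    (datetime(2024, 1, 1, 12, 0), "host2", "new_binary_executed"),
]

def correlate(events, first_kind, second_kind, window: timedelta):
    """Emit a hit when `second_kind` follows `first_kind` on the same
    host within `window` -- a toy version of a correlation pattern."""
    hits = []
    for t1, h1, k1 in events:
        if k1 != first_kind:
            continue
        for t2, h2, k2 in events:
            if h2 == h1 and k2 == second_kind and t1 <= t2 <= t1 + window:
                hits.append((h1, t1, t2))
    return hits

print(correlate(events, "suspicious_download", "new_binary_executed",
                timedelta(minutes=10)))
```

A real platform would express this declaratively and run it over streaming data; the nested loop here is just to show the intent.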
>> Ah, yeah — actually, I don't fully know the correct answer here; even in the first example, it's really just a language. The implementation relies fundamentally on Google's data processing technologies — internal data processing engines that run these things at massive scale. How they actually go and read the logs from the data centers, that plumbing, is three layers below this. What you're seeing here is the API that security engineers use; the platform translates it into instructions that Google's data processing platforms understand, and those platforms then go and do their magic for us. This is another big thing: we rely heavily on Google's immense experience and skill at processing huge amounts of data.
>> Unfortunately, no. You'd have to implement the API yourself first, and then you could run it. Go ahead.
>> I don't think I understand the question correctly — are you asking whether this is publicly documented?
>> Yeah.
>> Ah yes — it's not publicly documented, and it won't be, unfortunately. Sorry. And I'm giving you a slightly simplified version here; this is internal stuff. What would be the closest crossover? The Splunk API, for example — the way you do searches in Splunk — but that's oriented towards text-based searches, and this is not, so there's a significant difference there. Go ahead.
>> How big is your team?
>> Good question. I think I can say that detection, around the world, is something like 130, 140 people. It covers the entire company, and we have different offices — in Europe, in the US, in Sydney, in India — so the team in any one office is a bit smaller. Yeah. Go ahead.
>> I just want to ask, in your previous process, every...
>> Sorry, it's a bit of a complex question — maybe you can talk to me a bit later; I didn't quite follow all of it. I think this is the final question. Go ahead.
>> Sorry — what kind of... advice?
>> Get good at data analysis — understand how to do it, especially at scale — and do some coding. Those are really critical skills. Yeah. Thank you.