
Trawling for IOCs: Catching C2 in a Sea of Data

BSidesSF · 2025 · 30:23 · 86 views · Published 2025-10 · Watch on YouTube ↗
About this talk
Moses Schwartz presents a data-driven approach to detection engineering, showing how to efficiently extract indicators of compromise (IOCs) and build threat feeds and detection rules from large security datasets. The talk covers three practical examples: pivoting from code-signing identities to build hash feeds, hunting for malicious GitHub repositories in sandbox execution data, and generating detection rules via template automation while navigating regex pitfalls.
Original YouTube description
Trawling for IOCs: Catching C2 in a Sea of Data
Moses Schwartz
In the vast sea of security data, how do we efficiently find malicious activity and turn it into actionable intelligence? This presentation introduces data-driven detection engineering, showcasing a data-first approach to building detection rules and threat feeds. https://bsidessf2025.sched.com/event/b15a703ac7e09437840f5bdb114fd5bf
Transcript [en]

We're going to go ahead and get started with an excellent talk from Moses Schwartz on trawling for IOCs: catching C2 in a sea of data. Big round of applause for Moses. [Applause]

Thank you, everyone. So, the conference theme this year is "here be dragons." I really liked that, and I've leaned into it by drawing an analogy with fishing. Caveat: I don't know the first thing about fishing or boats or ships, so there are probably details in here that are wrong. Please mock me if you spot what I got wrong. But the analogy is that we've got so much security data we can pull interesting things out of, and sometimes it's kind of hard. I'm hoping to give a few examples of how to do it.

I'm not going to spend a lot of time here, but who am I? My name is Moses. I work at Google on the Security Operations product; my team mostly writes rules and develops feeds that go into that product. My passions are focused on automation and eliminating toil: getting the boring work that security engineers and analysts do down to the absolute minimum so we can focus on the fun parts. Which leads me into what I want to talk about here: data-driven detection engineering.

I didn't invent this term; I've copied it from other places. But I think it's a really good idea, because today, detection engineering on the whole is kind of artisanal, right? A really smart analyst or engineer looks at a threat report, or an event, or whatever other source of data they have, and thinks really, really hard about how they might detect it. This is my generalization of the Feynman approach to the scientific method: look at a problem, think really hard, make a guess, see if you're right. And that works. You go through tuning and testing, push the rule out into your prod environment, and as you see new variants, you keep tuning.

But that process only scales linearly with the number of people you have. If you want twice the number of rules, you need twice the number of engineers. And once you have a big backlog of rules, you need even more people to maintain them, because it's like tech debt: you have to keep tuning and maintaining. So instead, the concept I want to move toward is a data-first approach: looking at what we can pull out of all the data available to us to start building labeled telemetry, curated data sets with really interesting stuff in them. Then we look at the results, mark things true positive or false positive, come up with new analysis strategies, and have everything become a feedback loop where the data gets better and better and our detections improve. That was all very handwavy, to be honest, but it's the grand vision. I want to go into some specific examples of how we can pull things out of data and work toward it.

This talk is split into three parts. First is trawling for signer identities: I'll go through how we can take signer identities from binaries and build a feed of hashes from them. Second, I'll use data from VirusTotal sandbox executions to find interesting GitHub repos that get downloaded by a variety of tools. And the third part is about generating rules and regular expressions, and some of the gotchas and dragons there. I forgot to say this on the first slide, but I think I'll have time for questions at the end, and I really like questions. I'll also show this again later; please put questions into the Slido. Okay. The example I'm going to use for the signer-identity pivot is remote access tools.

Now, a threat feed of just remote access tool hashes isn't really much of a threat feed; it's just a feed. You could use it in multiple ways. A lot of the time you might be able to filter out false positives, or maybe you could use it to identify any remote access tool in your environment that isn't the one you actually intended to use at your company. Remote access tools have been in the news quite a bit; they're a common part of many attacks. And like most executables these days, because we seem to live in the future now, everything is digitally signed or it pretty much just won't run. They've all got digital signatures. So what we can do is take the data we have in VirusTotal and pivot from knowing who the signer is to getting a list of hashes.

I'm going to pick on UltraVNC, not because it's malware, but because I like it; it's one of the OG clients. I'm just going to download their latest distribution. It's actually not very new, it's from 2023, but I guess VNC doesn't change that frequently. And we can double-check that hash. I would click download, but I'm actually really lazy, so I'm going to copy and paste the hash into VirusTotal, because someone else probably already did the work. And yes, they did: that's the same file name we have at the top. I'll click on the details tab to see the signer information, but we can also double-check: same hash. Interestingly, the first-seen date is the same as the day it was uploaded to their site, so this made it into VirusTotal really fast. You might also notice that three security vendors already flagged this as malicious. I don't know which ones, and I don't think I agree with them, but it's a dual-use tool, right?

When we look at what that signer identity is, they have this "UVNC BVBA". I have no idea what the BVBA stands for, and we can ignore pretty much everything else higher up the chain; I'm just going to take that name and use it to pivot and search. Before the next slide, though, note the date. This is just kind of neat: you can see how their development process went. The file was signed the day before it was uploaded to their website.

So I'll use the VirusTotal web UI and search for signature "UVNC BVBA". Cool: that matches about 6,000 files. I went through and spot-checked some of them, and they all look like the right stuff: WinVNC, VNC Viewer, UltraVNC. The next question is what you do with that. Like I said, you can use feeds not just for threats but for non-malicious things, too. If we knew we were using UltraVNC, maybe we could pull in just this list of hashes and build a lookup table (you'd call it a feed when it gets really big, I guess) and set up rules that alert on anything calling itself UltraVNC without being in that list. That might be interesting. There are a lot of ways you could use this data that are generic to pretty much any platform, but I did want to give a little peek into what we do inside Google Security Operations. We have a reputation for doing things big, and I'm hoping that'll be kind of interesting.

The secret sauce is really that we can run SQL queries against everything. That's what makes a lot of our analysis so much easier: we don't have to pivot through the public APIs, we can write SQL and join across multiple data sets. (If we do any particularly sensitive joins, we have to ask lawyers first.) It's a really powerful way to access the data. So we set up something like this: a SQL query that pulls the list of hashes from the VirusTotal data by searching for that exact signer identity. Then we have a bunch of those in a big configuration file. We've got one for UltraVNC, one for tools like RustDesk (I ran across that one a while back), and a whole bunch of others. Those all get run periodically against the data in the database. We take the results of all of them, aggregate them into a single feed, put that into a development threat feed in our product where we do some testing and internal checks, and then it gets pushed out into the product. Again, I would really love questions, and I wanted to include the QR codes again.
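The pipeline described here can be sketched roughly as follows. Everything in this sketch is illustrative: the table name (`vt_files`), the column names, and the second config entry are invented, since the internal schema isn't public; only the "UVNC BVBA" identity comes from the talk.

```python
# Sketch of the signer-identity feed pipeline: one SQL query per configured
# signer identity, results aggregated into a single deduplicated hash feed.

SIGNER_CONFIG = [
    {"feed": "ultravnc", "signer": "UVNC BVBA"},
    {"feed": "example-rat", "signer": "Example Signer Co"},  # placeholder entry
]

def build_query(signer: str) -> str:
    """Build a SQL query selecting hashes signed by an exact signer identity."""
    # Double up single quotes so the signer name can't break out of the literal.
    escaped = signer.replace("'", "''")
    return f"SELECT sha256 FROM vt_files WHERE signer_name = '{escaped}'"

def aggregate_feed(results_by_feed: dict[str, list[str]]) -> list[str]:
    """Merge per-signer hash lists into one deduplicated, sorted feed."""
    merged = {h for hashes in results_by_feed.values() for h in hashes}
    return sorted(merged)
```

In a real deployment each query would run periodically against the backing database, with the aggregated output published to a development feed before promotion.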

Oh, if anyone wanted to take a picture, I can go back. Okay. Next up: trawling for Git repositories. A lot of tools download additional stages and modules from, well, the internet, from any free hosting platform, but it's especially common in the PowerShell world. We have tools like PowerSploit and PowerShell Empire, and they just download stuff straight from their GitHub repos. If there are just a few of those, it's pretty easy to hardcode a list; I'd like to know when PowerSploit shows up in any of my command lines. But what about stuff that isn't as well known, or stuff that's really ephemeral, like repos that are created for just a few days? How could we find those in all of this data?

If these strings occur in binary executables, we can just write a regular YARA rule and search: if the file has that string, the rule will find it. That works in some cases, but for a lot of stuff, especially PowerShell, you don't have a static binary in the same way, and there's a lot of obfuscation, so it can be hard to find. It's really powerful to look at the strings in your sandbox executions instead, so you can see the commands. That's what we're looking at in this graphic: PowerShell going and grabbing something from githubusercontent.com, blah blah blah.

Most of the time, anything that isn't just for fun and education is actually going to be obfuscated, so you may have to jump through some extra hoops to find it on the command line: looking for Base64-encoded versions, looking for other indications of obfuscation like bitwise XOR. And while in production environments it's sometimes hard to match up network logs and host logs and get a good picture of what happened, with a sandbox environment it's very easy: we can just look at the network logs, see which samples reached out to GitHub, get the URL, and build from that. So that's what I'm going to do here.
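Hunting for Base64-encoded versions of an indicator is mechanical enough to automate. A string's Base64 encoding depends on its byte offset in the surrounding data, so a standard trick is to encode the indicator at all three possible alignments and trim the boundary characters that mix with neighboring bytes. A small sketch (for PowerShell's `-EncodedCommand`, which is Base64 over UTF-16LE, you would encode `ioc.encode("utf-16-le")` instead):

```python
import base64

def b64_substrings(ioc: bytes) -> list[str]:
    """Base64 fragments that match `ioc` at any byte offset inside a larger
    Base64 blob: encode at the three alignments, then trim the leading and
    trailing characters that depend on surrounding bytes."""
    variants = []
    for pad in range(3):
        enc = base64.b64encode(b"\x00" * pad + ioc).decode().rstrip("=")
        start = {0: 0, 1: 2, 2: 3}[pad]  # chars tainted by the pad bytes
        # Drop the final char when the last input group is partial, since its
        # low bits would mix with whatever byte follows the indicator.
        end = len(enc) - (1 if (pad + len(ioc)) % 3 else 0)
        variants.append(enc[start:end])
    return variants

# Any of these three strings appearing in a command line is a hit:
print(b64_substrings(b"githubusercontent.com"))
```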

We can go write a rule for VirusTotal Livehunt. This is just regular YARA that you can run inside their platform, but if you import the vt module, you can look up details of the sandbox execution behaviors: not just the static analysis of the file, but what commands it ran and what network connections it made. That's all this is: we're looking for anything with PowerShell that reaches out to githubusercontent. I left that running for a little bit and also ran a retrohunt, and started getting results pretty much right away. Then I went to the API to pull down some details.
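For reference, such a Livehunt rule can look something like the sketch below, kept here as a Python string so the example stays self-contained. The vt-module field names (`vt.behaviour.command_executions`, `vt.behaviour.http_conversations`) are from memory of VirusTotal's Livehunt documentation; treat the exact names as an assumption and check the current docs before relying on them.

```python
# A sketch of a VirusTotal Livehunt rule matching sandbox behaviour:
# a PowerShell command execution plus an HTTP request to githubusercontent.
LIVEHUNT_RULE = r'''
import "vt"

rule powershell_pulls_github {
  condition:
    for any cmd in vt.behaviour.command_executions : (
      cmd icontains "powershell"
    )
    and
    for any http in vt.behaviour.http_conversations : (
      http.url icontains "githubusercontent.com"
    )
}
'''
```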

Through the UI, I don't think there's any easy way to say "show me all the network connections from this set of samples," especially if you need to regex them out. So I pulled the results from my hunting ruleset, and then for each of those, requested the behavior summary with more detail. Then, for each of those, I went through the HTTP connections, pulled out anything matching githubusercontent, and printed them. I find this part really neat, and it's something I'm pretty sure you can do even with free API access; you're just rate-limited.
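That post-processing loop can be sketched like this. The shape of the behaviour summary (a "data" object containing an "http_conversations" list of entries with a "url" field) follows my reading of the VirusTotal v3 API; treat the exact keys, and the raw.githubusercontent.com-only pattern, as assumptions.

```python
import re

# Matches raw.githubusercontent.com URLs and captures the user and repo parts.
REPO_RE = re.compile(r"https?://raw\.githubusercontent\.com/([^/]+)/([^/]+)/")

def extract_repos(behaviour_summary: dict) -> set[str]:
    """Return the set of 'user/repo' strings seen in HTTP conversations."""
    repos = set()
    for conv in behaviour_summary.get("data", {}).get("http_conversations", []):
        m = REPO_RE.match(conv.get("url", ""))
        if m:
            repos.add(f"{m.group(1)}/{m.group(2)}")
    return repos
```

Running this over every behaviour summary returned for the hunting ruleset yields the repo list used in the rest of the talk.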

Even in this random sample of stuff in VirusTotal, we see some really interesting-looking repositories. I have no idea what "evildev" is. Is that any of you? It's interesting when you look into it: their repo named "M" is either private or was deleted, but their token-stealer repo is public. Maybe it's pen testing, maybe it's just for fun, but a lot of this stuff is really interesting. Other things we need to tune out: Spicetify, I'm pretty sure, is a Spotify client. You still might not want it in an enterprise environment, but it's probably not the same as the "RainV exploit" there.

Okay. In the last example, we took the data and built a feed, because we had thousands and thousands of hashes. In this case, we tend to run into smaller numbers, especially since some of these can be ephemeral. Some will be around forever, PowerSploit, PowerShell Empire, but if you go check on evildev: it was still there yesterday, I don't know about today. So sometimes it's more efficient to build rules rather than constructing everything into a feed. This is going to be pretty straightforward, but then I'll get into all the gotchas.

I'm going to start with the end result: what should a rule that matches this look like? Here's a basic rule. It's written in the YARA-L language, which again is what we use in Google SecOps, but if you squint, it looks like every other detection language. The core part is really the regular expression and the fact that we're filtering down to network connections, here using PowerShell Empire as the example. I think this is pretty straightforward and I don't need to go through it line by line, but we're using a regular expression to pull out just the repo, then putting it in the outcome section, which makes it available as a summary of the results from the rule.

Then we go backward from there. Once we've got that list of repositories, and we know what our end rule should look like, we can turn it into a template. All I really did was replace some of these things with the double-curly-brace syntax that Jinja templates use, and then wrote a tiny bit of Python to go through our list of repos, open the template, and render the result. This is very easy; now you've got as many rules as you have repos.

I'll give you a sneak peek into some of the problems, though. The rule name you see in the metadata changes based on each repo, but the rule name in the top line, which is the one that actually matters for our system, is the same for every rendered rule. So I'll have to add the repo URL into that name, except of course we can't have slashes, and I'm not even sure about dots; I think we can't. So you end up needing a bunch of special handling just to convert things, get everything into the right format, and manage it across each different data source you're pulling from.
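The template-plus-Python step just described can be sketched with nothing but the standard library (the talk uses Jinja; plain string substitution is enough to show the shape). The rule text below is a simplified stand-in for the YARA-L rule, and the sanitization line is the "special handling" for slashes and dots in rule names:

```python
import re

# Simplified stand-in for the real rule template; double curly braces mark
# the spots the renderer fills in, as in Jinja.
RULE_TEMPLATE = """\
rule github_repo_{{ rule_id }} {
  meta:
    description = "Network connection to {{ repo }}"
  events:
    re.regex($e.target.url, `githubusercontent\\.com/{{ repo_regex }}/`)
}
"""

def render_rule(repo: str) -> str:
    """Render one detection rule for a GitHub 'user/repo' string."""
    # Rule identifiers can't contain '/' or '.', so sanitize the repo name
    # before it goes into the rule name.
    rule_id = re.sub(r"[^A-Za-z0-9_]", "_", repo)
    return (RULE_TEMPLATE
            .replace("{{ rule_id }}", rule_id)
            .replace("{{ repo }}", repo)
            .replace("{{ repo_regex }}", re.escape(repo)))

print(render_rule("PowerShellMafia/PowerSploit"))
```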

Okay, a few more places where there are dragons. Regular expressions are always problematic, and they're also kind of most of what we do when writing rules. A lot of the time this doesn't cause a problem, but in "example.com", that dot is going to be interpreted as the match-any-character metacharacter, so you'd also match "exampleXcom". It very rarely makes a difference in a case like this, looking at domains; I can't think of a case where it really broke something, but it can be a source of surprising behavior. In fact, any of the regex special characters end up being super difficult to escape. Backslashes in particular are deeply problematic. There are a lot of cases where you basically have to triple-escape, because the string gets interpreted first in the Python file, then again when you send it through some other system, and then, if you got it wrong, by the time it reaches the regular expression engine you have an escaped backslash where you wanted an unescaped one, or vice versa. It's perplexing: when you have a double backslash, suddenly you need eight of them, except when you don't, because, well, it's complicated.
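Both pitfalls are easy to demonstrate in a few lines of Python:

```python
import re

# Pitfall 1: an unescaped dot matches any character, so a pattern meant
# for example.com also matches exampleXcom. re.escape fixes it.
assert re.search("example.com", "exampleXcom")
assert not re.search(re.escape("example.com"), "exampleXcom")
assert re.search(re.escape("example.com"), "example.com")

# Pitfall 2: to match ONE literal backslash, the regex engine needs \\,
# and writing that in a regular (non-raw) Python string takes four
# characters. Pass the pattern through one more quoted layer (JSON, SQL,
# a template) and it doubles again; that is where "suddenly you need
# eight of them" comes from.
assert re.search("\\\\", r"C:\Windows")  # four chars -> one literal backslash
assert re.fullmatch(r"\\", "\\")         # raw string: two chars do the same
```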

That leads me to talking a little bit about normalization. This is something you hear people in data science talk about a lot: you've got to normalize your data, and your data set has to be really clean to get good results out. I don't think they're dealing with the same character-escaping issues that I am, at least not day-to-day, but it turns out normalization solves that problem pretty well. I built a bunch of normalizers where we can take, say, a regular expression that matches any legitimate Windows drive letter, or that server root, and replace it with a special character or a special string. Then we can pass that through all the rest of our processing and copy and paste it between files, and when we go to actually write a rule out, we denormalize it and turn it back into the text that we know is the right regular expression. That is a really powerful way to get around some of that escaping hell.
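A minimal version of the normalize/denormalize idea might look like this; the placeholder tokens and their expansions are illustrative, not the actual internal ones:

```python
import re

# Map readable placeholder tokens to the escape-heavy regex fragments they
# stand for. Authoring, copying, and templating all happen on the normalized
# form; denormalize() runs once, at rule write-out time.
NORMALIZERS = {
    "<DRIVE>": r"[A-Za-z]:\\",  # any Windows drive letter plus backslash
    "<DOT>": r"\.",             # one literal dot
}

def denormalize(pattern: str) -> str:
    """Expand placeholder tokens into their real regex fragments."""
    for token, fragment in NORMALIZERS.items():
        pattern = pattern.replace(token, fragment)
    return pattern

# The readable form survives any number of copy/paste and quoting layers;
# the fiddly escaping lives in exactly one place.
rule = denormalize("<DRIVE>Users")
assert re.search(rule, r"C:\Users\alice")
```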

The last part, and I've said this a couple of times, is ongoing maintenance. That's one of the dragons people sometimes forget about: every time you make a rule, that's one more rule you have to maintain. It's just a different kind of tech debt.

Okay, back to that Google-scale view of what we do. Again, we run SQL queries against everything, and that makes it easier to put together some of these pipelines. When a new repository shows up in the data, it's relatively easy to get a new rule for it quickly, but we do have to look at aging out old ones and keeping it to a manageable number, and we still have humans in the loop for all of these steps. I have dreams where this is fully automated and all we need to do is check in once in a while, verify things, click the LGTM button, and we have all the detections. But we are not there yet; all of us still have jobs for quite a while longer.

Okay, this is going to be my last slide, and we have some time left, so again, if anyone has questions, I would love to answer them. Okay: bringing in the catch. I hope that's the right one. Sorry, Moses, just to say to the audience: we do have several questions in the Slido. If you want to get yours in, get them in now. Awesome, thank you. Where was I? I think I was going to say I know nothing about fishing; I hope that's the right term. Okay. Again: data-driven detection engineering. I really like this concept.

I think the idea is just taking data and making detection engineering a little more rigorous, a little more software-like, and a little less artisanal. Again, this is BSides, and this is not a vendor pitch or anything; all my examples were Google because that's where I work, and my manager would be sad otherwise. But I hope I talked through it in a way where you can take this to any kind of data source you have and use the same ideas. And I'll reiterate: some of this stuff isn't limited to finding bad things. "Threat feeds" is a little misleading, because good-thing feeds are totally legitimate and useful for a lot of detections. And at the end of the day, it is all human-driven. I want to build toward this continuous feedback loop where we reduce the toil for engineers and improve our detections, but it's still going to take work. We're going to have to keep coming up with new analysis strategies, keep coming up with different ways to detect threats, and keep triaging the results. It never ends, but that is hopefully how we can keep up with the emerging threat landscape.

Cool. Everyone, let's give Moses a big round of applause. And Moses, thank you so much for presenting here in the imposing IMAX theater. Moses is a three-time presenter here at BSides, so let's give him an extra round of applause for that. Thank you. And thank you, folks, for putting questions in the Slido; there are more than we can handle, so we can do a little real-time stuff here. If you want to get in there and vote, do; otherwise, I'm going to take the top three questions that seem reasonable and put them up here so Moses can see them as well. I'll be happy to talk after this, too. Oh yes, Moses says he will be available in the Sky View if you can find him. I think you might have a crowd of people, from the questions I'm seeing here.

So, the first question is... that's not a question, that's a comment. Thank you for your comments. "Have you considered using deep learning approaches to capture unique data?" Yes, I have considered it. I think there is a lot of opportunity there, and we've played with it a little bit. Honestly, Google's in such a great place: we have the tooling to do all this stuff, but we haven't gone deep into it because we can still get really far without it. Regular expressions are still getting us a lot of really interesting data, and that's absolutely a place I want to take this work. I think that's such an important point: we assume we have to have all this advanced deep-learning AI, but there's so much basic work that can get you pretty far, right? Yeah, we're still just scratching the surface.

Okay, here's one about network logs: when do they work, if you follow your data-centric approach, and what's the challenge outside of a sandbox? That one is up on the screen if you want to review it again. And we've got a couple of upvotes, so we'll go through those next. So, the issue with network logs is that you have different sources of them. Say you have your endpoint logging, and hopefully you have logging for your network connections, but you may not have that perfectly; then you have to match it up with your firewalls or your proxies or whatever else. And it's not always straightforward how you map one of those connections to what you actually saw on the host computer. It seems like something that should just work, but I have worked in a lot of places where we had to spend a lot of time figuring out ways to pivot into one system, pull the data back, and reconcile it. That's why a sandbox, where it's all right there and you know it's the same system at the same time, makes the data easy to use. In the real world, it ends up getting messy and you have to interoperate between multiple systems. And if the traffic is encrypted, maybe you only have the domain; it all depends on what your log sources are.

All right, and if you want to dive deeper into this, find Moses during the happy hour; I'm sure he'd love to geek out with you on it. Okay, we got two upvotes, so we're going to take those. No, we're not upvoting that one. "Could a SecOps user provide the scope of your TI feeds' SQL queries?"

A SecOps user... hmm, I'm not exactly sure how to interpret this one. Why don't you have a look at the question and see if you can respond; it's the second one right there. I think I understand. So, this is all internal in the way that we do it. For a user of the system, right now at least, you'd use things like, what do you call them, data tables, and APIs for queries. The short answer is that we don't have a way for our users to directly run SQL queries like this, at least not yet. You could definitely talk to us if this is something that would be useful. I don't really have a better answer there, I'm sorry. That's all right; we've got plenty more on the list here.

So, let's look at this one: "Are you correlating rules with other sources, like threat techniques?" I think you might have answered this one, but have a look and tell me what you think. We are trying kind of everything. One of the neat experiments we did: if you look in VirusTotal, the sandbox labels all the executions with MITRE ATT&CK techniques. So we thought, what if we search through all the samples by technique and just generate rules for those? You can get some useful stuff out of it, but the problem is that those verdicts are per sample, not per command line, so you can't just say this particular command line was persistence or something. It's not that straightforward; it ends up being a case where the data is messy. I think you probably can get some good information out of it, but everything is all edge cases.

All right, I think we're taking one more question right now. Let's see: we did deep learning already. Oh my, we've got two more that showed up here. What is this? Okay, dealer's choice on the last two questions: you pick it and I'll read it. How's that? I think, let's do that one. Okay. "For the signing data, are there any specific results you found, such as software that frequently changes signatures, or malware that may be missed otherwise?" Like maybe you're not catching them from the signatures of the malware. So yeah, there are some tools that have like eight different signatures, and I don't know why, and we might be missing some. I hope I'm not giving anyone ideas, but if someone wanted to get a new certificate for their company every month or something, that would make it a lot harder to detect their samples using a technique like this. Although, I've brainstormed this: once you have some samples, you could probably pivot based on other similarity and start pulling things in. To some extent, that's the cat-and-mouse game: it's easy to hide things, but then you can try to work through it. The short answer is yes, it happens, but I don't know that much in the way of details. In my experience it's mostly been "for some reason this one has eight different signatures that we see," but I haven't seen anything really compelling or story-worthy.

All right, that's our last question for today. One more big round of applause for Moses, everyone.