
BG - A Serverless SIEM: Detecting All Baddies - Chen Cao, Daniel Stinson-Diess

BSides Las Vegas · 35:06 · 194 views · Published 2021-08 · Watch on YouTube ↗
About this talk
BG - A Serverless SIEM: Detecting All Baddies - Chen Cao, Daniel Stinson-Diess. Breaking Ground, BSidesLV 2021 - Camp Stay At Home - August 1. Video Tags: bslv2021-bg-serverless_siem-1046268
Transcript [en]

This next talk is from our Breaking Ground track. Please stay tuned for Chen Cao and Daniel Stinson-Diess presenting their talk, "A Serverless SIEM: Detecting All Baddies."

Welcome to our talk, everyone. This is us presenting on a serverless SIEM on a budget: how we detect all baddies at Cloudflare with our custom SIEM. We're glad everyone was able to find this room, and luckily they gave us the biggest room, since we knew this talk was going to fill up. My name is Daniel, I'm a security engineer on the Cloudflare team, and Chen's going to give a little intro about our team and ourselves.

Hi, my name is Chen, and I'm also a security engineer on

the detection and response team at Cloudflare. This is my fourth hacker summer camp, and it's Daniel's third. We've both been with Cloudflare for more than two years, and over the past few years we've spent half of our time building this project and the other half on incident response and building alerts with it. Today we're going to talk about how we built this serverless SIEM that meets all of our requirements at Cloudflare, and why we built it. First, let's talk about what Cloudflare is. Cloudflare is an internet company that offers lots of network performance and security solutions for internet sites and teams. As you can

see here, there are 25 million internet properties using our services, our network is in over 200 cities and more than 100 different countries, we block 70 billion cyber threats every day, and we have over 1,900 employees all over the world. That means we have a lot of logs. As you can probably imagine, we run tons of different services on our production servers, and we have many, many SaaS applications to support our daily business, and lots of these log sources scale with the number of data centers we have, the number of requests we're serving on our network, and the number of employees we have as a company. We're growing very, very quickly as well,

which also means the log volume grows very quickly. So the question is: where do we put all these logs? The traditional answer is a SIEM; it's more or less the core tool for any detection and response team right now. And at Cloudflare we have a giant ELK stack that we already push tons of logs to, so the obvious answer is to just use the ELK stack. But hold up: the easy choice may not always be the best option for us. Personally, I've had the experience of managing a SIEM cluster, and that's probably the worst experience of my career. Patching vulnerabilities is never fun for anyone, but in a security system I don't want

any security issues either. Besides that, as more and more logs were pushed to the SIEM, we just kept seeing random 500 errors, and a lot of the time we'd have to wait minutes for a query to execute. It's also very hard to scale: for example, if we're under some large-scale attack, we might receive tons of logs within a short period of time, and a traditional SIEM usually won't work well if the log volume changes a lot. Also, if we want to ingest logs from our global network, most SIEMs require a logging agent to be installed on every single server; the performance of those agents is usually not great, and at Cloudflare

we're also very careful when we have to install a third-party program on our servers. Luckily, our detection and response team has a strong software engineering background with an interest in security work, and we try to engineer our way through problems for a long-term fix rather than buying a short-term solution. So why don't we start over, think about what the best option for us could be, and then build our own thing? In the design meeting, I looked over at Daniel and told him: let's do this live, and we'll build our own SIEM. Let's start from the beginning and see how a SIEM should work. In the high-level flow from left to right, we have some log sources; for sure, that's

the most important thing for anything. Then we need a way to ingest the logs into the system. After that, we need some way to build detection rules to find any suspicious activity in the log events. Then we need a way to notify the team about the potential threats in those events. And finally, we need an automation method to automatically resolve incidents or enrich alerts, so that when a real human works an incident, it's easier to resolve. Basically, from left to right, more and more events get filtered out, until we have only the real, actionable

potential incidents at the end of the pipeline. From that high-level overview, we came up with these requirements. We have a lot of servers in our network, so the SIEM needs to be able to receive logs from our global network easily. Performance is very important for our servers, so the process of transferring these logs needs to be efficient, and as a security product, we definitely want the transfer process to be secure as well. For a better experience using the SIEM, search needs to be efficient and easy to use; no one wants to wait minutes for the results of a search query. Rules should also be

easy to add, and they need to align with industry standards so that we can migrate community rules to our SIEM easily. And of course, it must be affordable as well: the whole system shouldn't require dozens of engineers for management or millions of dollars for usage. To meet all the requirements, after evaluating and comparing different options in the industry, we decided to build our own SIEM using serverless architecture on GCP. Obviously, serverless doesn't mean everything is running up in the air; it just means no servers for us, so we don't need to patch the system, it auto-scales, it's cheaper, and there's no operational cost for managing servers.

GCP also has a lot of great services that allowed us to build this system very, very quickly. We use its different storage tiers: Cloud Logging is our hot storage, mainly used for quick searches, and it has the benefit of sinks that push everything to both GCS and BigQuery. BigQuery is our warm storage, for searches during incidents and for detections. GCS we use for backups and cold storage in case of emergency. Our main computing logic lives in Cloud Functions, and we chose Go as our primary programming language because Go is very good at handling concurrency; as you can probably imagine, in our SIEM we need to pull logs from many different telemetry sources and process them in parallel.
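The fan-out concurrency pattern described here, pulling from several telemetry sources in parallel, can be sketched in a few lines of Go. This is a minimal stand-in, not Cloudflare's code; the source names and event shape are made up:

```go
package main

import (
	"fmt"
	"sync"
)

// pullSource simulates fetching audit logs from one telemetry source.
// In a real pipeline each source would page through a SaaS API.
func pullSource(name string) []string {
	return []string{name + ":event-1", name + ":event-2"}
}

// pullAll fans out one goroutine per source and collects all results.
func pullAll(sources []string) []string {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		events []string
	)
	for _, src := range sources {
		wg.Add(1)
		go func(src string) {
			defer wg.Done()
			got := pullSource(src)
			mu.Lock()
			events = append(events, got...)
			mu.Unlock()
		}(src)
	}
	wg.Wait()
	return events
}

func main() {
	events := pullAll([]string{"gsuite", "edr", "vpn"})
	fmt.Println(len(events), "events pulled")
}
```

In the real system each goroutine would presumably publish to Pub/Sub rather than return a slice, but the fan-out/collect shape stays the same.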

Go has a lot of benefits for doing that. We also want everything to be Terraformed, so that we get a better deployment experience and easier access controls. Finally, we can turn the previous high-level diagram into a more detailed pipeline; this now looks more actionable for us to start working on. I'll hand over to Daniel to talk about the details of how we implemented this diagram step by step.

So at this point, Chen has gone through the motivation for why we built our own SIEM at Cloudflare instead of using existing solutions. Now let's walk through what the life of an event looks like

throughout the SIEM, starting with probably the most important phase: log ingestion. Some SIEMs require complicated parsing logic and complicated configurations, so we had to figure out how we could ingest our data. As you can tell, there are a lot of different log sources that we need to parse and understand, and the single largest box on this slide is log ingestion. That's because it's probably the most important job of this entire project: it needs to be the most reliable, since we can't drop anything; we need to be able to query it; and it needs to be the best source of truth. There are lots of ways to get data into this GCP setup, and lots of different data we

need to get in, and not all methods are going to be good for all logs. There are a lot of logs from SaaS products where you're not going to have a syslog endpoint for them, and lots of logs from EDR products and other things that need a lot of correlation and pre-processing, so let's figure out the best method for each. The first method, and probably the easiest one to think about from an operations point of view, is just writing logs to a GCS bucket. This is really easy for your operations team, who might already be writing backups of your logs to some S3 bucket; instead, they can also copy the query

logs and the audit logs for that same server to GCS as well. This is really easy because it means fewer installations, and once they write it there, we handle all the processing in batches, so it can absorb large fluctuations really quickly. One example where we use this internally, which is probably our most successful and highest-volume log source, is Cloudflare's own logging. Cloudflare supports a product for our customers, which we use ourselves, called Cloudflare Logs, where you can export your logs in a stream directly to a GCS bucket. It updates a file as new logs arrive, up until a certain size, and then starts a new one. This makes it really easy: every time a new incremental write is written,

we get a GCS notification that says, hey, there's new data. We grab it in compressed form, write it to Cloud Logging, and start running detection logic on it. This is super simple, and it's a lot more efficient than even the primitives GCP gives you for GCS ingestion; they have a managed service called Dataflow, which actually turns out to fail a lot of the time, is a lot more expensive, and doesn't fit our serverless principles. Another way we get data into our SIEM is through BigQuery. A lot of services, especially within the Google ecosystem, support native exports into BigQuery, like your Google Workspace logs.
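Staying with the GCS path for a moment: the notification-driven ingestion described above might look roughly like this in Go. The struct fields and function names are assumptions for illustration; a real Cloud Function would use GCP's event types and client libraries:

```go
package main

import "fmt"

// GCSEvent mirrors the fields of a GCS object-change notification that
// matter here; the real payload has more fields (these names are assumptions).
type GCSEvent struct {
	Bucket string
	Name   string
}

// handleNotification is a sketch of the ingestion function: on each
// incremental write we would fetch the (compressed) object, forward it to
// hot storage, and run stream detections on the new events.
func handleNotification(ev GCSEvent, fetch func(bucket, object string) []string) (ingested int) {
	for _, line := range fetch(ev.Bucket, ev.Name) {
		_ = line // real code: write to Cloud Logging, then evaluate stream rules
		ingested++
	}
	return ingested
}

func main() {
	// fakeFetch stands in for reading and decompressing the object from GCS.
	fakeFetch := func(bucket, object string) []string {
		return []string{`{"ip":"203.0.113.7"}`, `{"ip":"198.51.100.2"}`}
	}
	n := handleNotification(GCSEvent{Bucket: "logs", Name: "http/0001.gz"}, fakeFetch)
	fmt.Println("ingested", n, "events")
}
```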

If you're in that environment, you can also take data from Google Analytics, or just a spreadsheet for ad hoc analysis, and put it into BigQuery. This is really easy, and we tried this configuration at first, but the main problem we realized is that it can take up to three days for a login event to go from the event happening all the way to being ingested into BigQuery, and that's just horribly slow. If you think about it, three days for an attacker to have unfettered access before you even know there was a bad login is horrible, and that's really not good for us. So how can we do this quicker? That's

when we came up with our generic log-pull method. We realized that almost every SaaS API, and even internal applications that support audit logs, expose them through their own interfaces, so let's create one method that we can adapt to all of them and run on a set schedule. As you can see in the architecture at the top right, every five minutes we reach out to Google's APIs, fetch all of their logs, and write them to the proper destination. At the bottom right you can see the Go contract that all of our log-pull methods must conform to: an interface where they fetch all the audit logs

and write them to their topic, and that's really all the clients have to do. This is a really simple method conceptually, and over half of our logs use it; the real con is that you get to implement it for every unique log source. Over time we've developed lots of libraries that help keep each implementation shorter and shorter as we reuse them, and it's turned out to be really useful. We have this running on five-minute crons, or for something like G Suite and other sources that take more time to update their data, on a one-day schedule, or whatever you need; it's pretty customizable.

Once you get a live event through ingestion, the next topic is doing analysis on it, for detections or maybe incident response. We think of this in three main buckets: stream analysis, where you look at one event; batch analysis; and ad hoc search. How we differentiate these: whether we need to look at more than one event for the detection is what differentiates stream from batch, and it's also batch if you need to look at events across different telemetries, which we should definitely support joining across. Then of course, during an incident you need to be able to query those logs easily and quickly, and ideally, if you've done a similar query in a past incident,

you should have a really easy way to pull up that saved search and share it across the team. So let's look at how this works. For stream analysis, we use Google's Common Expression Language (CEL), which has a really nice Go library, and it's pretty easy for an analyst to read, understand, and write the rules. At the bottom left you can see an example CEL rule, and it uses a bunch of different features. It checks whether the log field is present, because sometimes a field isn't there and you want to run the rule on every log even when fields are missing, so you can check for that;

you can do a strings-contains search to see whether a substring is in a field; and you can do a third-party lookup with a macro that you can write yourself, here checking what the vendor is for a MAC address. We've written a lot of these custom macros that do things like IP address enrichment, looking up someone's groups, and other things. You can see a stream rule on the right, which is roughly how they look, and it's pretty easy to get the gist: if someone's exporting something, let's create an alert.
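The features described for that CEL rule (field presence, substring match, a vendor-lookup macro) can be mimicked in plain Go to show the logic. This is not actual CEL syntax or Cloudflare's rule; the field names and the macro are invented:

```go
package main

import (
	"fmt"
	"strings"
)

// evalStreamRule mimics what the CEL rule in the talk does: check that a
// field is present, substring-match another field, and consult a
// vendor-lookup macro. Field names here are illustrative.
func evalStreamRule(event map[string]string, vendorOf func(mac string) string) bool {
	mac, ok := event["mac_address"]
	if !ok {
		return false // field absent: the rule cannot fire on this log
	}
	if !strings.Contains(event["action"], "export") {
		return false
	}
	return vendorOf(mac) == "SuspiciousVendor"
}

func main() {
	// lookup stands in for the custom third-party-lookup macro.
	lookup := func(mac string) string {
		if strings.HasPrefix(mac, "aa:bb") {
			return "SuspiciousVendor"
		}
		return "KnownGood"
	}
	ev := map[string]string{"mac_address": "aa:bb:cc:dd:ee:ff", "action": "bulk export"}
	fmt.Println("alert:", evalStreamRule(ev, lookup))
}
```

In the real system the expression would be written in CEL and compiled once via the cel-go library, then evaluated against every incoming event.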

The other way of doing analysis is batch analysis: we put all the data into BigQuery and you write your detection logic as a SQL query. We run these as BigQuery scheduled queries, at whatever interval and time frame someone defines in their query logic. At the bottom left you can see the time frame configured as 15 minutes for this rule, where we're looking for failed MFA: every 15 minutes it runs the SQL expression above and checks whether there have been more than four failed MFA attempts in the past 30 minutes. Running that every 15 minutes makes sure we see it, and the type "any" says that if there are any results at all, we want to create an alert.

Other types we support include "new added field": that's useful if you're looking at, say, all of the DNS resolvers on some Linux host, and a new DNS resolver gets added to the host's resolver file; we should look at that, because maybe it's deviating from our standard cloud configuration. We wanted to make sure this was really easy to write from the development point of view for our detection engineering team, because this is not the common solution everyone's used to. So we have all of our detections written as code, managed with CI, and everyone has a local testing environment where they can check their rules against the correct syntax,

have data piped to them to evaluate whether the logic is correct, and then once you check your code in, CI will test all of your rules, make sure there aren't too many false positives or false negatives, and make sure your queries run performantly. The rule format, if you've noticed, is largely a Sigma rule. We've made this a lot easier for our team to write with some VS Code plugins, which do the validation and remove a lot of the boilerplate for creating a new rule; there's a macro you can run that will generate the new rule for you.
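Since the format is described as largely Sigma, a generic Sigma-style skeleton gives an idea of what that macro would scaffold. Every value here is a placeholder, not one of Cloudflare's rules:

```yaml
title: Excessive Failed MFA Attempts
id: 00000000-0000-0000-0000-000000000000   # UUID generated by the editor macro
status: experimental
date: 2021/08/01
tags:
  - attack.credential_access
  - attack.t1110
logsource:
  product: example_idp        # hypothetical log source
detection:
  selection:
    outcome: FAILURE
    factor: mfa
  condition: selection
level: medium
```

The ATT&CK tags are what the editor extension mentioned below expands into human-readable technique names.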

As you get to certain fields that have specific options, it shows you a dropdown of what you need to write, and it creates the new UUID, the date, all those things that just take up time you don't want to spend. We also recommend that everyone installs Red Canary's MITRE ATT&CK extension while they're in VS Code, because then it shows, right in the detection rule, what each of the ATT&CK tags means. So this is the life of an event through both ingestion and detection; now Chen's going to show you what it looks like once we've filtered down to the events we think are malicious.

Thanks, Daniel. So after the engines find some suspicious activity in the log events, we need to send them over and notify the team about the potential threats. Currently we have a Cloud Function called Sirens to handle notifications. It's a lighter version of an alert router, but it meets all of our requirements. There are three main functionalities here. The first is deduplication: for example, if we found someone had five failed MFA attempts within 30 minutes, and then that same person made a sixth attempt, then a seventh, we won't generate three different tickets; instead there will be just one ticket, and all the other events will be added as comments on the original ticket.

When we run the Cloud Functions, we store every ticket ID in a Redis instance, then check ticket status, ticket creation time, and ticket labels for duplicates at runtime. It also has a local testing mode: if a rule is in testing status, the alert won't trigger any real notifications; instead the alert is stored in a Cloud Logging file, so we can check those without bothering the whole team. Most importantly, our notifications can be configured very easily and flexibly. As you can see on the left side, we have a notification policy file that ships with the code, and you can specify whatever notification policy you want. For now we support three different notification methods.
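The deduplication behavior can be sketched as a keyed lookup; here a Go map stands in for the Redis instance the talk mentions, and ticket creation is injected as a callback. All names are illustrative:

```go
package main

import "fmt"

// Deduper sketches the Sirens dedup logic: the first alert for a key opens
// a ticket, later alerts for the same key are appended as comments. A map
// stands in for the Redis instance used in the real system.
type Deduper struct {
	open map[string]string // dedup key -> ticket ID
}

func NewDeduper() *Deduper { return &Deduper{open: map[string]string{}} }

// Handle returns the ticket ID and whether a new ticket was created.
func (d *Deduper) Handle(rule, subject string, createTicket func() string) (string, bool) {
	key := rule + ":" + subject
	if id, ok := d.open[key]; ok {
		return id, false // duplicate: real code would add a comment instead
	}
	id := createTicket()
	d.open[key] = id
	return id, true
}

func main() {
	d := NewDeduper()
	next := 0
	create := func() string { next++; return fmt.Sprintf("SEC-%d", next) }
	for i := 0; i < 3; i++ { // three failed-MFA alerts for the same user
		id, created := d.Handle("failed-mfa", "alice", create)
		fmt.Println(id, "new:", created)
	}
}
```

The real version would also consult ticket status and creation time, so a closed or aged-out ticket stops absorbing new alerts.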

Those are Hangouts messages, Jira tickets, and PagerDuty alerts. If you want to specify a new Jira ticket notification method, you can choose which project you want, the issue type (it can be a story, a task, or whatever you have), and you can specify the assignee and reporter as well. The assignee and reporter can be a static email address or a PagerDuty schedule ID, so that when a ticket is created it's automatically assigned to whoever is on call. You can also add custom labels to the Jira ticket.

On the right side, every single rule has a field called policies. In this case we use chen-test as an example: all the alerts triggered by this rule will send a notification to the chen-test notification policy, which is a Hangouts message plus a Jira ticket. As we added more and more rules to the system, we started getting more and more alerts as well, and we realized we were spending too much time responding to those incidents; we should probably have some way to automatically resolve incidents and enrich alerts. Our automation platform is powered by GCP Composer, which runs Apache Airflow underneath, with lots of customized plugins that fit

our environment. It's also very flexible: if we want to create an alert first and then use automations to enrich the alert, that's supported; if we want to run automations first and then decide whether we need a ticket, we can do that as well. Here's an example of how we use automations to make our day-to-day life easier. Initially, our detection rules find that an employee is trying to modify their MFA settings. Our automation first finds the IP the employee is using, then reviews logs in our data store automatically and leaves the result as a comment on the incident. In the next step, the automation pings the user: hey, we think you were trying to modify

your MFA settings using this IP at this time; can you confirm whether that's expected? The user receives an email and replies in the ticket directly: yeah, confirming it was me; I just learned I can use Face ID for my MFA settings, so I was enabling that. It's worth noting that there is no D&R team interaction in this whole process, and it all happened after business hours. I'll hand over to Daniel to talk about our current status and our future plans.

So today this is the detection and response team at Cloudflare's primary SIEM; we've been using it for over a year now, and we have

over time added some more nice-to-have engineering features, like monitoring for when one of our logging pipelines goes down. This happens sometimes because APIs change, credentials accidentally get rotated, or whatever it is; now we get paged and can quickly fix it, making sure we're not missing data. Another thing we've noticed is that almost half of our SIEM cost is GCP's Cloud Logging. What's cool about this is that it's an entirely optional part of our ingestion: we could remove it tomorrow, cut half of our costs, and it wouldn't impact our detection and response ability to write these rules or do investigations. We've realized that our costs are about

50 to 70 cents per gigabyte, which as far as we know is the most affordable option, factoring in that we keep the data searchable for a full year. From what we can tell, this is 20 times cheaper than traditional SIEMs. Yeah, they might have more features than us, and they might be easier to use, but we don't think that's worth the 20x cost. Another cool thing is that this is entirely managed by Terraform, which allows us to have multiple development environments: right now I have two running, for testing new logging pipelines, new automation frameworks, and API integrations, and we

always have a staging environment and a production environment. These all run pretty affordably, which is why we don't mind running so many of them. What's also cool about having it in Terraform is that all of our access control and all changes are very explicit, and we can do rollbacks really quickly if we need to. What's next? We're just getting started with this project, even though it's been in development for two years. We need to do a lot more log normalization and enrichment, and we need to integrate a lot of the threat intelligence that Cloudflare has to offer in-house, plus the external third-party resources the infosec community has. We currently have some machine

learning work in progress that generates anomaly-based alerts off of any data set, but we need to make it easier to use and, more importantly, easier to understand for the people who are on call and get these alerts, since it's not always clear why an alert fired. And of course, the constant problem for detection and response teams is continually evaluating new log sources to see how we can make them useful against emerging threats. If these sound like fun problems to you, we're hiring on our detection and response platform team at Cloudflare, and we're happy to take any questions people have.

Hello, and welcome to the Q&A for our serverless SIEM talk. I'm here with Chen Cao and Daniel Stinson-Diess, who were our presenters. Thanks, folks, for joining us. First off, I loved watching your presentation. It's great to see you working with these modern DevOps methods to produce this tooling; we've seen attackers more and more adopting these kinds of high-leverage techniques to develop and deploy ransomware and the like, so detection and response need to keep up if we're going to have any hope of iterating fast enough to contain the threats. So to that end, let's ask the

hardest question first, right? A lot of the success of these sorts of cloud-native platforms in areas outside of pure infosec, where they came from, like observability, orchestration, containerization, and deployment, from the systems-engineering side as opposed to security signals and events, has really been related to open source and the formation of non-profit community bodies that democratize and commoditize the kind of work you're doing here. Do you have any plan, or any hope, that you might eventually be able to move this out into a place

where people can not just adopt your methodology but actually take bits of code, whatever isn't tied closely to Cloudflare and isn't anything you're worried about leaking? Yeah, that's a really good question. We definitely have that on our roadmap, but our team is still fairly small, and everyone has other jobs to do as well: reviewing detections, doing incident response, a lot of things like that. But we are also hiring; we're hiring some dedicated security software engineers for this project, and we're also hiring a director for this

specific project. So if anyone is interested, come join us, and we can build this together and make it open-sourced. That's a great pivot right there, from open source to "hey, come join us"; chef's kiss, by the way. Thank you. Yeah, that's actually the only question we had prepared, because we knew people would ask about this, especially since we've spent two years building this thing; it's definitely been a lot of time and a lot of effort. But, you know, open-sourcing

this thing is not just clicking a button and you're done. Yeah, absolutely. And to be clear, things like Prometheus and Kubernetes took decades to go from internal tools to public Cloud Native Computing Foundation projects, right? So please don't take what I'm saying as "hey, why isn't it here already?" And as we've been developing this, we've kept most of the Cloudflare-specific things as separate modules, so we definitely built it with the intent of doing that, and we'll be doing it piece by piece

if we can't do it all at one time. There are certain parts, like our VS Code plugin for Sigma rules, where there's no reason we can't share it, since I think other people are using Sigma rules in VS Code, so that's something immediately useful outside of the rest of the SIEM. But we definitely want to do that, and like Chen said, we need more people than just Chen and me working on this to be able to support it, because we don't want to just release it and not have anyone to answer questions or help people when they have problems. Yeah, absolutely, sure. You know, it's

one thing to be a 10x engineer; it's another to be 100x, plus the sales and engineering and marketing and all the other things. No, absolutely. So then, on top of that, you talked a lot about what you did, which was all very great and exciting, but what would you say are the things to watch out for if you're a security organization looking to create a new cloud-native platform like this for yourself? Say they see your talk, they grab the slides, they go to their team and say "let's do this," and they need to put everything

together themselves, because it's not open source. What are the pitfalls to watch out for? I think one of the big things is just to know whatever platform you're developing on from a quotas and pricing point of view. Depending on how you're developing it, different services, like Cloud Functions and Pub/Sub, which we use in GCP for messaging and for our compute, have different pricing models, where it's really smart to do things a particular way. And in our data warehouse, BigQuery, there's one really silly checkbox about partitioning your

data by day, and if you don't click it and then accidentally run a SELECT * over everything, you can rack up a really, really big bill by accident. If you can justify it as being for some security incident, cool, no one's going to ask questions; but sometimes you're just testing something and you've racked up a bill, and that's kind of painful. So consult your local cloud economist and put in some safeguards. Yeah, I like that.
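One guardrail for that pitfall is to require a partition filter on the table itself, so a bare SELECT * over a year of data is rejected outright. A Terraform sketch with hypothetical names; check the provider docs for where `require_partition_filter` currently lives, since it has moved between the `time_partitioning` block and the top level across provider versions:

```hcl
resource "google_bigquery_table" "events" {
  dataset_id = "siem"
  table_id   = "events"

  time_partitioning {
    type = "DAY"
  }

  # Reject any query that does not filter on the partition column,
  # so an accidental SELECT * cannot scan the whole year of data.
  require_partition_filter = true
}
```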

Related to that, one of the participants in the Discord channel is asking: is it really 50 cents per gig per day for 365 days? I don't know if that's disbelief or what, but any comments, does that sound familiar? I think it's 50 cents per gig per month for the year we're storing it. Is that right, Chen? Yeah, kind of. We can only see our monthly bill; we can't see a year ago either. For the storage, not all the data is stored for

the same amount of time, so our monthly bill covers the new data ingested this month plus all the old data that's been stored for up to 360 days, and it works out to around 50 to 70 cents, somewhere in that range. It's also worth mentioning that half of the cost is spent on Cloud Logging, which is totally optional; we use it just to make search more efficient. If you just want detections and rules, and you do your searches using SQL queries, you can definitely skip it, and that can save a lot of

money as well. But we still use it, because people may not be too familiar with the SQL stuff, so they may still want a more visual tool to search logs, and we use Cloud Logging for that hot storage. So yeah, it's an accessibility benefit for your user base. Exactly, because a lot of the people coming to our company as analysts, and we're hiring for that too, come from a Splunk or ELK stack background, and it's a jump from that to "here are the schema tables for all of our data; write a SQL expression that does your own un-nesting of

your data"; you can't just pipe it to a count like you would in Splunk. Yeah, exactly. Luckily, with our Terraform and IAM setup, we make sure that people can't drop anything. We keep everything in Cloud Logging as more of a place where you can write really quick searches and just regex across everything, and the nice thing about Cloud Logging right now is that you're not charged per query; you're only charged for ingesting the data at all. So once it's ingested, you can search as much as you want, it's pretty quick, and there's no real downside to just getting familiar with

the data there. So yeah. One of the things I really loved about this, as we talked about before, is how it enables these DevOps models you're using: because the infrastructure is cheap enough, you can do things like blue-green deployments, or say "hey, we're going to run a one percent experiment over here," spin up some new stuff to see whether it helps with catching something, or whether it keeps you from blowing stuff away when you're trying out something new, and that's

hugely powerful, so very, very cool. Is there anything else, other aspects of the system that you didn't get to dig into as much, stuff you didn't really get to highlight in the talk because of the time constraints? Oh, I think another thing worth mentioning is that we've done a great job on the automation stuff. For example, we use Apache Airflow; for people who may not be very familiar with it, it's basically Python-based, and you can build your own DAGs. Oh, we're very familiar with Airflow; we used it. Yeah, actually we had our

own in-house system that we replaced with Airflow a couple of companies ago. Nice, nice. So basically we use it so that when we're deploying some automations, we can add them easily, and we do a lot of things with it: we can automatically suspend users, isolate endpoints, and do forensic collections, without even needing real human interaction the whole time, which I feel is really cool. But Apache Airflow itself could be a separate talk, so we didn't spend too much time talking about it; we do have a lot of our own customized plugins and

operators, and we've built a lot of DAGs using it to help us respond to incidents quickly. At Cloudflare we believe that anything not urgent should be done automatically, and we like to engineer our way to solving problems, so we like it a lot. So what would you say, in your current system, are there any sources of complexity left that you'd really like to pull out and go at in the next iteration? Anything where you think, "ah, that was kind of a blind alley," or it left you in a sort of awkward position

where you're spending a lot on maintenance or something? I think a lot of the work is classifying, when we get a new chunk of data, which log index it should go to: classifying all the different data and making sure, since we're building this in Go and have static types everywhere, that we don't drop data. If a new telemetry source adds a new field, we don't want to miss picking up the new fields in our data types. We're currently not doing any normalization, because we don't want to lose the context

that comes with each event, but that's something we want to do over time. So I think looking at the data and figuring out how we normalize it, how we structure it, and how we make it searchable is the next phase that needs a little bit of love. Yeah, fair enough. Okay, well, then I guess I'm out of questions at this point, unless either of you had anything else you wanted to add. Okay, in that case, I very much appreciate your time. I think this is very exciting, and judging from what I've seen, there will be at least

a few other companies and organizations out there that are really eager to dig into this as you put this information out. So, well done, and thank you. Thanks for asking the questions; see you next year in Vegas, maybe.