
Hello and welcome to our talk. If you're here because you read the title, "Hyperscale Detection and Response," maybe you've faced a similar problem with a SIEM that's scaling. And if you were just passing by and thought this talk looked cool, we've got you covered too. In the next few slides we'll go over our introductions, the problem we faced, how we approached it, how we solved it, a short demo, and then we'll leave you with some future thoughts.

So who am I? I'm Nija Sonam. I manage eBay's security analytics and detection engineering team, and I led the development and delivery of Argus, the hyperscale detection and incident response platform for eBay. When I'm not working, I'm usually attending music concerts, or dancing, and if not that, I'm fostering kittens from the local rescue center. Over to you, Kiran.

Thank you, Nija. Hi everyone, I'm Kiran Shirali. My team and I are Nija's customers at eBay: I lead the detection engineering team. I know she said security analytics and detection engineering, so I guess I work for her too. We are the team that writes security detections based on logs, does security research, and looks at malicious content in support of our 24/7 incident response team. That's what my team does. When I'm not in endless meetings at work, I like to go out for hikes, especially on sunny days; if the day isn't sunny, I like to sit back with a good fiction book.

So let me set the problem statement: what we wanted to talk about and what we wanted to solve. I'll also give you the context on how eBay functions, because that context is very important. Before I jump into that, I'm curious:
a quick show of hands from all of you who work in incident response, or who support incident response through a sister function like detection engineering, security tools, or SIEM development. Who all in the room? Okay, that's quite a few. Awesome.

So let me set some context on how we function; I'm pretty sure this is very similar to how your teams work. We've got very specialized teams at eBay, and we've got assets that we need to monitor. Somebody from detection engineering will go work with the asset owners: we do modeling on what potential attacks could happen on those assets, we analyze what logging artifacts exist on those systems, and we then work with a log onboarding team to get those logs onto a centralized SIEM. We write security content on those logs and push it to production, and when something malicious happens, alerts are generated from that content and our incident response team, a.k.a. CIRT, responds to them. While it's not on this figure, we primarily use ServiceNow as a case management system, where all our incident engineers work, but all log analysis and investigation is done on the centralized SIEM.

Now let me spend a quick moment on infrastructure, because it's important. We've got multiple zones, but at a very high level, think of two large zones inside eBay's networks.
One is MP, what we call Marketplaces: that's any infrastructure that supports ebay.com, the production infrastructure, and there's an infrastructure engineering team that manages that whole zone. Then we also have Corp, which is anything that has to do with employees and employee services; there's a separate infrastructure engineering team for that. So two infrastructure engineering teams, each with their own stuff and their own CMDBs, which adds a level of complexity. On top of that, we are predominantly on-prem. We manage our own data centers, and almost 95% of our deployment is on-prem; we've got a little bit of footprint on public cloud, like Google Cloud and Azure, but it's predominantly on-prem. Also, being a mature company, we essentially have a lot of custom stuff, so out-of-the-box connectors don't really work for us; we have to build things to be able to get logging artifacts. All of this adds complexity to our monitoring challenge.

Now, remember I mentioned the scale of Marketplaces: there's a lot of scale, a lot of servers, and that adds the problem of scaling in terms of large security logging data sets. I'll share numbers in a little bit about our day-to-day logging ingest, but scale is a problem.
And then finally, when we started this journey, we were on a lot of unreliable data transport protocols, so log loss had become a huge issue for us. All of these were the problems we needed to solve, and that's what this talk is about: what did we do, and how did we solve it?

Our monitoring, detection, and response journey started almost a decade back, somewhere around 2014. At that time we wanted to expand coverage, so we went out, looked at the more popular SIEMs, and settled on the most popular one. We bought the SIEM, the infosec team deployed it and set up all the infrastructure (indexers, forwarders, search head clusters), and we reached out to the teams within Marketplaces and Corp and got all the syslog. I know syslog is very unreliable, but nonetheless we got it, started writing detection rules on it, and piped our alerts into ServiceNow for our incident response team to work on. One terabyte per day of logging became two, two became four, four became eight. Some of you are in incident response, some of you work with incident response: any time I talk to my incident response team, they go, "oh yeah, get us more logs, get us more data, get us more context; to err on the side of caution, just get us everything." So to meet that need we kept on expanding and expanding, and then we started feeling the burn of licensing fees. At 20 terabytes per day, we realized this was not tenable, not scalable; we had to do something. So we went back to the drawing board and started thinking: how can we still ensure we're reducing risk for eBay and expanding monitoring coverage, but without breaking the bank? We started looking internally at what eBay's own infrastructure teams could provide to solve this problem, and that is when Nija's team came in and helped us solve this issue.
So I'm going to hand it back to her to walk us through the rest of the story.

Thank you, Kiran. As Kiran was saying, we had two major problems to solve: a reliability issue and a scaling issue. How did we go about the reliability issue? While we were researching, we realized that one of eBay's internal infrastructure teams, called Unified Monitoring Platform, leverages Elastic Beats to reliably collect logs from Marketplace hosts. These Elastic Beats agents reliably send logs to eBay's infra ingress, and from that ingress we added our pipeline, which has three components: Kafka, Apache Flink, and Hadoop.

Why did we choose these three components? The choice was straightforward for us. Kafka is a distributed streaming platform for log collection and transport that integrates seamlessly with Apache Flink, and it can be used to move data reliably between the different components of a SIEM. Flink is a very robust, very powerful stream processing engine that worked well for us at massive scale; you can basically scale it out to whatever you need. That solved our reliability problem.

Then, moving on to the scaling problem, we wanted to store these logs (huge petabytes of logs) reliably and cost-effectively, and Hadoop was a no-brainer for us because it has practically flat, linear cost as ingestion grows, and we had an eBay internal team supporting our Hadoop clusters. So the synergy of Kafka, Apache Flink, and Hadoop was the way to go for us.

While we were working through the data pipeline reliability and scaling issues, we realized we could leverage Flink and its massive computing power to build a detection, or event, pipeline. We leveraged Flink SQL to build the detection logic: reliably pull in the data, run the detection logic, and sink alerts to ServiceNow, which is again our case management system.
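To make the archival leg of such a pipeline concrete, here is a minimal Flink SQL sketch that continuously reads a log topic from Kafka and lands it on HDFS as date-partitioned Parquet. This is our illustration, not eBay's actual setup; all table names, fields, topics, and paths are hypothetical.

```sql
-- Source: raw logs arriving on a Kafka topic (hypothetical schema).
CREATE TABLE raw_logs (
  event_time TIMESTAMP(3),
  host       STRING,
  message    STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'security-logs',
  'properties.bootstrap.servers' = 'kafka:9092',
  'properties.group.id' = 'log-archiver',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);

-- Sink: date-partitioned Parquet files on HDFS, cheap to retain at scale.
CREATE TABLE log_archive (
  event_time TIMESTAMP(3),
  host       STRING,
  message    STRING,
  dt         STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = 'hdfs:///security/logs/archive',
  'format' = 'parquet'
);

-- The streaming job itself: one continuous INSERT from Kafka into Hadoop.
INSERT INTO log_archive
SELECT event_time, host, message, DATE_FORMAT(event_time, 'yyyy-MM-dd')
FROM raw_logs;
```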
The whole concept is called Argus, and these three open-source components work as both a SIEM and a data analytics platform for us. I'll quickly go over a short demo to show how easy it is for a detection engineer to write a detection use case using Flink SQL, and for an incident response engineer to go to ServiceNow, look at the alerts, and perform investigations using the Hadoop UI.
So this is what the Flink SQL dashboard looks like; when you try to build your own, it may look a bit different. This is a Rheos dashboard (Rheos is the team that maintains Apache Flink and Kafka for us), and they've added a SQL editor. When a detection engineer goes to write a detection script, they'll use this editor: write the script, name it, pick the SQL version they want, and then go on editing their Flink SQL script. I'll go over the different components of the script one by one and show you how you can leverage the power of Flink SQL to build a detection use case.

As you can see, you can import any user-defined function here. A user-defined function can be written in Java, so the possibilities are endless: you create a function from your imported UDF and then use it in your script.
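As a small illustration, here is a sketch of registering and using a Java UDF from Flink SQL. The class name, table, and fields are hypothetical.

```sql
-- Hypothetical: a Java UDF, packaged on the cluster classpath, that
-- extracts the registered domain from a URL.
CREATE TEMPORARY FUNCTION extract_domain
  AS 'com.example.udf.ExtractDomain' LANGUAGE JAVA;

-- Once registered, it can be used like any built-in function.
SELECT event_time, src_ip, extract_domain(url) AS domain
FROM proxy_logs
WHERE extract_domain(url) = 'evil.example';
```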
Now let's talk about the three components of the script, the first being the log source table. Think of this table as sitting on a Kafka stream: the log source table pulls the data from the input log stream. Then we run some detection logic on that log stream, and finally we sink the relevant alerts, the logs that match your detection logic, back to a Kafka stream, and from there you can route them anywhere you want. So I define a log source table with the relevant fields, and I create an alert sink table and define it with the relevant fields. Here I'm using the Kafka connector type, but you can use any connector type that matches your architecture. The third component is the actual detection logic in the script, where we insert into the alert sink. It's pretty simple: you plug in your detection logic, and the fields from the log source table that match it are inserted into the alert sink.
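Putting those three components together, a complete detection script could look something like this minimal sketch, a brute-force login rule over a five-minute window. The topics, fields, and threshold are our hypothetical examples, not eBay's actual content.

```sql
-- 1) Log source table: authentication events read from Kafka.
CREATE TABLE auth_logs (
  event_time TIMESTAMP(3),
  username   STRING,
  src_ip     STRING,
  action     STRING,
  WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'auth-logs',
  'properties.bootstrap.servers' = 'kafka:9092',
  'properties.group.id' = 'detections',
  'format' = 'json'
);

-- 2) Alert sink table: matched events go out to an alerts topic, from
--    which they can be routed on to the case management system.
CREATE TABLE alert_sink (
  alert_name STRING,
  window_end TIMESTAMP(3),
  username   STRING,
  src_ip     STRING,
  failures   BIGINT
) WITH (
  'connector' = 'kafka',
  'topic' = 'alerts',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json'
);

-- 3) Detection logic: more than 20 failed logins for one user from one
--    source IP within a 5-minute tumbling window.
INSERT INTO alert_sink
SELECT
  'brute-force-login' AS alert_name,
  TUMBLE_END(event_time, INTERVAL '5' MINUTE) AS window_end,
  username,
  src_ip,
  COUNT(*) AS failures
FROM auth_logs
WHERE action = 'login_failure'
GROUP BY TUMBLE(event_time, INTERVAL '5' MINUTE), username, src_ip
HAVING COUNT(*) > 20;
```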
Once the detection logic is complete, the detection engineer saves the script and spins off a Flink job from it. These are all the running Flink jobs; they run on streaming data, so all these detections run continuously against real-time data. This is where the detection engineer's work stops and the incident response engineer's work starts.

So this is a ServiceNow dashboard where you can see test alerts, and I'll walk you through one of them. When an incident response engineer gets an alert, they see the alert metadata fields and some additional information. We've also worked with our internal ServiceNow automation team to include a customized SIEM drill-down link in the alert itself.
When a CIRT engineer clicks that SIEM drill-down link, it redirects them to a UI. This UI is what we call Zeta, but you can use any other UI of your choice: any OLAP front end, or something as simple as a Hive query to interface with your Hadoop backend.

Let me explain the drill-down. An alert can be triggered by one matching event or by multiple matching events, and when an incident response engineer sees an alert, they want additional context and additional information, and they want to be able to correlate it with additional data to make an informed decision. So when they are routed to the Hadoop UI, a prepopulated query, the drill-down search, runs and gives them the results and the relevant data in seconds. Again, you can plug and play your own logic here, and once the drill-down produces its output, they can easily use it to make an informed decision on whether the alert is benign or a true positive.
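As a sketch of what such a prepopulated drill-down could look like on the Hadoop side, here is a Hive query using variable substitution. The database, table, fields, and variables are hypothetical; the idea is that the values get prefilled from the alert before the query runs.

```sql
-- Hypothetical drill-down: pull the user's surrounding authentication
-- activity around the alert time.
SELECT event_time, host, src_ip, action
FROM sec.auth_events
WHERE dt = '${hivevar:dt}'                      -- partition pruning first
  AND username = '${hivevar:username}'
  AND event_time BETWEEN '${hivevar:start_ts}'  -- e.g. alert time - 30 min
                     AND '${hivevar:end_ts}'    -- e.g. alert time + 30 min
ORDER BY event_time
LIMIT 1000;
```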
Now, what should our future state be? We are currently leveraging Argus as our secondary SIEM, and we are in the process of migrating all the content from our third-party SIEM to Argus. In the future state we want to completely get rid of the third-party SIEM: build an ingress that gets logs from Corp as well as Marketplaces, leverage Argus as our centralized SIEM, and completely move off our dependency on a third-party SIEM. And how did this approach help us? Let's hear it from the customer. Over to you, Kiran.

Thank you, Nija. That was pretty cool, huh? That's a lot of content we had to put together so you could digest it in a 20-minute demo.
But the gist of this whole presentation is that open-source components work. They have worked for us, and they will work for you too. We used Apache Kafka, Apache Flink, and Hadoop; you can too. You can use Elastic Beats to collect your data sets; we built our own custom ingress, but you can use something like Logstash. So if you are thinking about open-source components for your incident response team or your detection team, come and talk to us: there are a lot of things we've already evaluated, and we can give you suggestions. And if you think a detailed blog post about how we set this up would help you, let us know.
Our contact details are right at the end of this talk, so you can reach out to us, or just come up after this conversation. Also, you may be thinking, "is this a lot of coding and scripting? How would my security engineers be comfortable with this?" We use the principle of keeping it simple. You saw Nija walk us through a Flink SQL script, essentially a detection rule, that has been templated. Her team gave us these templates, so my team can focus only on the SQL content that is needed to look for that detection; a skeleton of what such a template could look like follows.
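To make that concrete, here is a hypothetical skeleton: the source and sink boilerplate is prefilled, and a detection engineer only edits the marked placeholders. This is our illustration of the templating idea, not eBay's actual template.

```sql
-- Hypothetical rule template: the <angle-bracket> placeholders are the
-- only parts the detection engineer fills in; the tables are predefined.
INSERT INTO alert_sink
SELECT
  '<rule_name>' AS alert_name,    -- fill in: unique rule identifier
  event_time, host, username, src_ip
FROM log_source
WHERE <detection_condition>;      -- fill in: e.g. action = 'login_failure'
                                  --          AND src_ip NOT LIKE '10.%'
```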
Then, when alerts fire and incident response is looking at one, the drill-down searches immediately go to a templatized query on Hadoop, so an incident engineer only needs to tweak the query and run it; they don't need to relearn a lot of these things. By making it simple for the security engineers on our teams, we were able to adopt and embrace the system.

Now, in summary, how did this help us? First of all, remember I said we were breaking the bank at 20 terabytes per day: we are now at 55 terabytes per day of security logging, we have higher coverage, and our CISO is no longer complaining about SIEM costs. We are now able to query large data sets, and this is important because we're thinking about our future; hold on to that thought, I'll come back to it in a few seconds. We're no longer dependent on the customer support of a vendor: if we need anything, if we need something customized, all we do is turn around (Nija is sitting right next to us) and go, "hey Nija, can you build this for us?" and we get that support. We have embraced open source, which means we can contribute back to the community. And finally, we are thinking about machine learning, because we've now got these large data sets, which we never had in our third-party SIEM, and we are looking at what we can build using ML, even simple ML models, to get better insights. Our hope is that a year from now we can come back, stand on the same stage, and walk you through our journey of ML and Argus. But till then, this was our talk. Thank you for coming. Any questions?

[Applause]
Hi, thank you, this was a really great talk. In terms of the open source, you mentioned contributing back: how do you budget time for engineers to contribute back to these open-source libraries, and what types of things are you trying to contribute back?

Hi, I can take that. Just as an example of how we do that: there has been a lot of customization while connecting to public clouds like GCP. Our Flink deployment is tweaked to our system and our architecture, so if a connector needs to be built to leverage the power of GCP Pub/Sub and get those logs onto our pipeline, onto Argus, we end up doing a lot of custom work. Recently we had the example of a synchronous Pub/Sub connector, the Flink GCP connector, and we had to build an asynchronous one; funnily enough, there wasn't an asynchronous Flink GCP connector out there. While we were building our own connector, we thought, hey, we can just contribute it back. So these are some examples where, while we're tweaking these systems for our own custom architecture, we realize we can set aside that time and turn the work into open-source contributions. Thank you.
I guess this makes it simpler for everyone, so thank you, this is a great presentation. I do want to know what some of the alternatives were that you discussed or considered in this process. You presumably didn't immediately go, "wow, this is great, we can just go directly to Kafka and automate this"; nobody has that insight right away. So what were some of the considerations? There's also the consideration of saving some of the FTE, the people power within the company, and outsourcing that to someone who has an off-the-shelf solution that could handle maybe some of these use cases, or things like Spark, which I know has a streaming service too. So were there other alternatives that you considered and eliminated?

We did look at a couple of solutions out there when we were evaluating this situation; we were looking at other SIEMs and so on. But one push that came all the way from the top is that there's a culture of open source that our CTO and our CISO are embracing, and that has been a strong factor: because we use it, there's also the aspect of contributing back. The second thing that factored in very strongly is that a lot of these platforms were already being used by infrastructure teams within eBay, so we didn't need to reinvent the wheel. We could just leverage an existing platform, take a portion of it just for us (think of it as a cluster just for us), and write things specifically for us. So it was that combination. For our scale and what we wanted to achieve, weighing the pros and cons, this was the best option, and that's why we went here. And you're right, it's not as if we woke up one morning and said, "yeah, today we're going to do Flink and Flink SQL."
Like I said, the journey started all the way back in 2014, and the solution, I think it was around 2018 or 2019 when we started on this plan of Flink and the data pipeline, which eventually became an event pipeline for us.

Can I quickly follow up on that? Sure. So outside of these considerations for your specific company's requirements, are there things the audience can take away: considerations that were eliminated along that process, or heavy considerations that people should be weighing here?

The best takeaway I would offer is that we fear a lot of these technology stacks, at least we did, because they were unknowns. Once we started using them, we quickly realized they're very easy to use: as long as you have some kind of engineering background, you can write code that reaches out to REST APIs and things like that. So that's what I would say: just look at them, play with them, and you will see that a lot of these tech stacks actually do work. That should be a takeaway from this. And in case you're also asking about technologies we tested and decided weren't working for us:
we did try a few of those. We tried Druid, we tried Elastic, we tried S3; we tried a bunch of them, and I think these three components were the best for us. So in terms of technology, yes, we did try a few things.

I think I speak for a lot of people here when I say a write-up on this would be awesome; this is a lot of good information. Probably the second thing: how many millions was Splunk costing you in the first couple of months?

We should talk after the talk. Yeah, you should talk after the talk. Yes. Our licensing fees did become very expensive, and because of our scale this made a lot of sense for us; this would work for us. I know we are out of time, so if you have any questions, we'll be right here after the talk; please come up. But thank you so much for attending our talk.