
San Francisco, Theater 15. Good evening — this is the last session of the day before happy hour. Thank you so much for being here. I'm Andrew, your room host, here with Aditi and UA from Netflix, talking about decoding fraud: Netflix's fraud metrics.

All right, thank you, everyone, for coming closer. I know we stand between you and the happy hour, so we're really, really grateful that you chose us over the drinks.
All right. Around two years back, when the Ukraine–Russia conflict started, we saw an increase in cyber attacks throughout the world on industries and governments. Specifically, we saw an increase in DDoS — distributed denial of service — which is when the attacker sends so much traffic at your infrastructure that you really don't have the capacity to serve good users. When this was happening at Netflix, we were seeing an increase in DoS that was causing impact, and the question our leadership was asking us was: is there a correlation between the recent increase in DoS attacks and the Ukraine–Russia conflict? Are we seeing more DoS than we usually see, or are we seeing new kinds of DoS? And the answer to that question was: we didn't know.

The reason we didn't know was the way we used to do incident response. We used real-time metrics to look at the data, and that data is typically persisted for just two weeks, because it's a lot of data. So if you asked what we saw in the last two weeks, we could look at the graph of incoming requests — the moment you see a spike, that's a DoS — and say, this is what happened, and this is how we did at blocking it. But honestly, we didn't have long-term metrics. If you asked us about trends over months or years — how are these trends shifting? — we really didn't have a good answer. And this problem isn't isolated to DoS; it applies to any kind of fraud or security threat you're dealing with. So we built security metrics — but before I go deeper into that, let me quickly introduce who we are and what it is we do.

I'm Aditi Gupta. I'm a staff security software engineer at Netflix, transitioning into an engineering management role very soon. I've been here four years, and I lead our DDoS strategy.

And hi, everyone — I also work in the Trust and Safety team at Netflix. My role is more like a security analytics engineer, working together with Aditi.

Which is to say, basically, all the cool stuff that happens behind this is what she did. We work in the Trust and Safety team at Netflix, and our job is basically to sit and watch videos all day — no, I'm kidding, that's not the case. We make sure that you can watch videos without having to worry about any trust and safety issues. We build scalable systems and data analytics to make sure that your data, the consumer products, everything remains secure. The kinds of problems we solve are account fraud, DoS, content theft, piracy, and things like that.

With that, let me jump back to what we'll be talking about: why we actually ended up building a lot of fraud metrics, how we did it — we'll go into some specifics with case studies — and at the end, we'll give you some tidbits on what you can do if this is something you're interested in doing at your company.
So, with that: why? The first thing is visibility. This is something we touched on with that story about the conflict. From an executive point of view, they want to know what we're seeing in our ecosystem; but even from an operations perspective, you want to know: are we seeing new kinds of DoS? Are we seeing more? Are we seeing different attacks? Building fraud metrics was essential for us to get better visibility into what was happening in our ecosystem, how the attacks were changing, and how our response was doing.

The second was investigations. Earlier, when a DoS happened that had impact and wasn't blocked by our systems, we would have to spend hours and hours on the investigation to figure out what was going on — it was very time-intensive. Building the fraud metrics helped us build data foundations that reduced this investigation time to about ten or fifteen minutes. For example, having something like a false-negative metric, which says "I did not block enough," helps us understand that this is a DoS we need to dive deep into, and these are the signals that can help us do that much faster.

The third was operations. Having metrics like false positives and false negatives helped us understand how our system was operating overall. When you're building a defense system, it's not just one thing you do — there are a whole bunch of different knobs. These metrics helped us understand: this rule, this knob, we can turn tighter and go more aggressive on, whereas this one is actually starting to affect good users, so let's be a bit more lax on it. These operational metrics let us very quickly tweak how our systems were performing.
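The false-positive and false-negative metrics described here can be sketched roughly as follows. This is an illustrative toy, not Netflix's actual pipeline — the `Decision` record shape and the sample data are assumptions made up for the example:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    blocked: bool      # did the defense block this request?
    malicious: bool    # ground-truth (or best-effort) label

def fp_fn_rates(decisions):
    """False-positive rate: share of good traffic we blocked.
    False-negative rate: share of bad traffic we let through."""
    good = [d for d in decisions if not d.malicious]
    bad = [d for d in decisions if d.malicious]
    fp_rate = sum(d.blocked for d in good) / len(good) if good else 0.0
    fn_rate = sum(not d.blocked for d in bad) / len(bad) if bad else 0.0
    return fp_rate, fn_rate

# Toy sample: 1 good request blocked out of 4; 1 bad request missed out of 2.
sample = [
    Decision(blocked=True, malicious=False),
    Decision(blocked=False, malicious=False),
    Decision(blocked=False, malicious=False),
    Decision(blocked=False, malicious=False),
    Decision(blocked=True, malicious=True),
    Decision(blocked=False, malicious=True),
]
fp, fn = fp_fn_rates(sample)  # fp = 0.25, fn = 0.5
```

In practice the `malicious` label has to come from whatever ground truth or best-effort labeling is available — which is exactly the challenge the talk turns to next.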
It wasn't all nice and easy, though. When we started on this, we sort of underestimated how complex it was going to be. It's not like there's one data table and you just build a dashboard on top of it. The data complexity was actually pretty intense: when we built out the foundations for this, we ended up with more than 20 tables and pipelines working on an immense amount of data — we're thinking about users, requests, things like that. And because the data is huge, it has very short retention: some of it was stored for just 3 days, some for just 7 days. If you don't extract your metrics out of it, it's lost. The data was also spread across our ecosystem — some data in a request table, some in a playback table — so you need to bring it all together. And the biggest data-complexity challenge for me was the difference between implicit and explicit. When you're doing something manually, you can say, "this, this, and this together don't look right, so this is actually a bad user and not a good user." But when you try to take that implicit knowledge in your head and build it into an ETL or an algorithm, it gets trickier — you have to think much, much harder. That translation from implicit to explicit was very challenging and very interesting to work on.

The second challenge is that the threat landscape is constantly changing, which means the security defenses we have also have to constantly change, which means the metrics you build to measure all this sometimes need to change too. As an example, we have defenses throughout our system; at some point we said, let's add one more defense here, and when we did, our metrics were no longer correct — they were missing that part — so there was work to reconcile them. So how do you build your metrics in a way that minimizes those changes, instead of always having to go back and forth to fix them?

The third was success metrics: how do you even define that your defenses are successful, that they're doing their job? You have to look at the false positives and the false negatives, and it's always a trade-off between risk and growth. You could block everything and have zero risk, but then you're impacting good users; or you could let everyone through, but then you're increasing your risk. Defining your success metrics in a way that actually balances these two was another challenge we dealt with when defining our fraud metrics.

All right — with all of that said, I'm sure you now want to know how we actually ended up building this. With that, I'm going to hand it over to UA to talk about what we did here.
Thanks. Hi, everyone. I'm going to show you the metrics framework we used to build Netflix's fraud metrics. I'll also share some of the challenges we faced along the way and some of the trade-offs we made. But first, a question for all of you: raise your hand if you have ever struggled to show an increasing problem of fraud and threats, and its impact. Thank you so much for your response — to be honest, I can't see all the hands because of the lights, but I do believe this is a very common challenge for all of us here. That's precisely why we want to introduce the three-layer metrics framework we've been using. Let's get started from the ground up.

The first layer we call operational metrics. These are very important to our technical teams: they provide very detailed, actionable data for the engineers who handle these threats on a daily basis. Some examples are real-time alerts, which could be based on either service-health metrics or decision-health metrics; these metrics help us identify anomalies quickly and react to them in real time.

Moving up from the operational data, we have what we call business metrics. Here we basically translate the operational data into business insight, which helps our stakeholders and managers understand how fraud affects business operations and objectives. Some of the metrics we build at this level include the cost of fraud — for example, as Aditi mentioned, we have DoS attacks with user impact, so how many users will be impacted, and what is the cost of the downtime due to DoS? Similarly, the account-takeover rate, the cost of account takeovers, and some of the customer-service costs that come from account-security contacts.

Finally, we consolidate both the business-level data and the operational data to get our C-level metrics. As you can imagine, the audience for this level is our executives. The purpose of these metrics is mainly to give our executives a high-level view of our fraud landscape and help them make informed decisions about resource allocation, priorities, and long-term planning. Some of the metrics we build or group here include the risk exposure from fraud and the return on investment of some fraud-prevention technologies.

To summarize our framework: we built it around who the audience is and how we can use these metrics to communicate with them effectively. Because of this framework, we can ensure that all the levels in our organization are informed and involved whenever we have new fraud detection and mitigation.

With this framework, let's move to our first case study, which is DoS. As Aditi mentioned, we define DoS as any malicious attempt to disrupt normal traffic by overwhelming the target. Before I jump to the metrics we built, I want to show one challenge we have: there is no direct label that says when a DoS happened. Usually what we have looks like this — we just see a sudden, massive spike in our traffic. Initially, all our metrics involved a lot of manual review: our engineers would look at the real-time alerts, spot unusual spikes, and say, this was a DoS on our system. In order to build long-term metrics, what we did was apply anomaly detection. I understand anomaly detection is not perfect for every single DoS attack — it can't catch everything — but at least it gives us some labels to start with. So if you have a similar challenge of finding ground truth, you'll probably need to make some trade-offs here.
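A minimal version of that anomaly-detection labeling might look like the following rolling z-score check. This is purely a sketch — the window size, threshold, and bucket granularity are made-up assumptions, and a production detector would be far more robust:

```python
import statistics

def spike_labels(counts, window=12, threshold=4.0):
    """Label time buckets whose request count is an extreme outlier
    relative to a trailing window (rolling z-score).
    counts: per-bucket request totals, e.g. requests per minute."""
    labels = []
    for i, value in enumerate(counts):
        history = counts[max(0, i - window):i]
        if len(history) < 3:
            labels.append(False)  # not enough history to judge
            continue
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history)
        z = (value - mean) / stdev if stdev > 0 else 0.0
        labels.append(z > threshold)
    return labels

# Steady traffic around 100 req/min, then a sudden ~10x spike.
traffic = [100, 103, 98, 101, 99, 102, 100, 97, 1000, 104]
flags = spike_labels(traffic)  # only the 1000-bucket is flagged
```

The labels this produces are imperfect ground truth, as the talk notes, but they are enough to bootstrap long-term trend metrics.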
to make some tradeoff here um with this foundational data what we have here is I just want to quot out in this case study is how did I identify the questions I built for the metrix so the first one is I talk with our stakeholders uh who is our like managers or product manager so to understand their questions so I C Business label Matrix so their main question about oh do we observe more DWS over time and how about our uh response so what what I have done oh in order to build this metrix answer these questions I need to understand oh how many of the D over town what's the attack size average response Town based on all the
anomaly detection I just applied and then I talk with our engineering team who handled the doll threat every day the main question they ask me is oh yeah do I need to update our current mitigation strategy and how so in order to help them uh make that decision the key components of this Matrix or I have built here is I called operational Matrix is I we build some pop line to track all the attack uh patterns some service Health metrics and our response characteristic for example we track how many of the D dos utilizing IP randomization or j3 has minimization and with this metrix it can give us our uh engineer more insights where they can take look we also track
like a rule Effectiveness as a TD mentioned uh at beginning so if the rule effectivess decrease what's the what's our reaction what should we do so we can tell a lot of things from our operational metrix to answer how to update our current strategy and at the end I also talk with our leadership senior leadership what they care more is oh should we invest more resource since the is increasing so in order to answer that question what I have done here is I calculate some of the cost because of d uh like impact to our user how many uh stream we may missed or what's the downt of our DS and the correspondent what's the cost here to help them make a lot of
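As a toy illustration of such a cost-of-DoS business metric — every figure and parameter below is hypothetical, echoing the talk's own disclaimer that its numbers are fake:

```python
def dos_cost_estimate(users_impacted, downtime_minutes,
                      revenue_per_user_minute, incident_response_cost):
    """Rough cost-of-DoS model: lost engagement during downtime
    plus the fixed cost of responding to the incident.
    All inputs are illustrative, not real figures."""
    lost_engagement = users_impacted * downtime_minutes * revenue_per_user_minute
    return lost_engagement + incident_response_cost

# Hypothetical incident: 50,000 users, 20 minutes of degradation.
cost = dos_cost_estimate(
    users_impacted=50_000,
    downtime_minutes=20,
    revenue_per_user_minute=0.001,   # made-up engagement value
    incident_response_cost=5_000,    # made-up ops cost
)
# cost = 50_000 * 20 * 0.001 + 5_000 = 6000.0
```

The point of a model like this is not precision but translation: it turns operational facts (users, minutes of downtime) into a dollar figure leadership can weigh against other investments.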
OK, let's move to the second case study: account takeover, or ATO. Again, the main challenge here is that we don't have ground truth labeling every single account that has been compromised. Because of the time limitation, I'll just mention that we used flags from our customer-service teams, pulled out behavioral features, and built a machine-learning model to identify suspicious activity. With this foundational data, we started building metrics like ATO rate, volume, and cost over time.

What I want to call out in this case study is the operational metrics, because when I talked with our engineers, the main question they asked me was: we have so much data and so many ideas — which one should we do first? The key word I want to mention here is actionable: how do you build actionable metrics? Once we have these ATO labels, costs, and rates, I can build a lot of insights, but what I focused on instead was building main-driver metrics — understanding the entry points for the loopholes we may have in our ecosystem.

First of all, all the numbers I'm going to mention are fake numbers. Say about 20% of users use a weak password or compromised credentials, so it's very easy for their accounts to be taken over; and on the other side, we find that about 40% of the successful fraudster logins actually come from credential-stuffing attacks. With these metrics, it's easy to say: because credential stuffing is more important and contributes most of our ATOs, we propose prioritizing improvements to our bot detection and building better MFA to prevent those ATOs. Then, after the engineering team ships the new bot detection, ideally it's all reflected in our metrics: we'd want to see both the ATO rate and the ATO cost decrease over time, which is also a very strong indicator that says, because we launched this thing, our metrics are decreasing. So we start from measurement, move to the actionable items, and then re-measure using our metrics; if anything is wrong, we probably need to redefine either our metrics or our main-driver metrics.
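A sketch of how such main-driver shares could be computed from labeled login events — all event fields, thresholds, and driver names here are assumptions for illustration, not Netflix's actual signals. The credential-stuffing heuristic mirrors the pattern the speakers describe: many login attempts from one IP/JA3 source with a very low success rate:

```python
from collections import Counter, defaultdict

def label_stuffing_sources(events, min_attempts=50, max_success_rate=0.05):
    """A source (ip, ja3) looks like credential stuffing when it makes
    many login attempts with a very low success rate."""
    attempts = Counter()
    successes = Counter()
    for e in events:
        key = (e["ip"], e["ja3"])
        attempts[key] += 1
        successes[key] += e["success"]
    return {k for k in attempts
            if attempts[k] >= min_attempts
            and successes[k] / attempts[k] <= max_success_rate}

def driver_shares(events, stuffing_sources):
    """Share of successful logins attributable to each hypothetical driver."""
    counts = defaultdict(int)
    for e in events:
        if not e["success"]:
            continue
        if (e["ip"], e["ja3"]) in stuffing_sources:
            counts["credential_stuffing"] += 1
        elif e["weak_password"]:
            counts["weak_password"] += 1
        else:
            counts["other"] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Toy data: one noisy source making 100 attempts with only 2 successes,
# plus ordinary logins, some of them on weak passwords.
events = (
    [{"ip": "198.51.100.7", "ja3": "abc", "success": i < 2,
      "weak_password": False} for i in range(100)]
    + [{"ip": "203.0.113.5", "ja3": "def", "success": True,
        "weak_password": True} for _ in range(3)]
    + [{"ip": "203.0.113.9", "ja3": "ghi", "success": True,
        "weak_password": False} for _ in range(5)]
)
sources = label_stuffing_sources(events)
shares = driver_shares(events, sources)
```

A driver breakdown like `shares` is what makes the metric actionable: it directly answers the engineers' "which one should we do first?" question.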
With these two case studies done, let me show you how, now that we've built all the metrics, we communicate with our leadership and stakeholders. To answer Aditi's question from the beginning: when a manager asked, "is DoS increasing?", what we used to answer was, "yes, it's increasing, but we don't have any data — just trust me." With the metrics, what we now report to the engineering team is: yes, we see 3x more DoS volume, but at the same time we see a 30% decrease in our response time, and the top targeted path is our login, so we should deep-dive our login rules to see whether anything happened and whether we need to improve our mitigation strategy on the login path. And to our leadership team, what we now report is: because of DoS, more than 100K users were impacted, at a cost of about 2 million US dollars — again, these are fake numbers. It's much easier for us to communicate with our executives, ask for more resources, and get a higher priority on considering fraud and security concerns in new product launches. OK — with this information, I'm going to pass it back to Aditi to conclude our presentation.
Thank you. So — I'm sure you still want the answer to this question: was there a correlation between the conflict and the increase in DoS? Unfortunately, there's no way to know, because we really didn't have those data foundations back then, and as I mentioned, we don't save the raw data for that long, so it was hard to go back and backfill the metrics. We'll never know. But there is a silver lining: now that we've built this, if someone asks us a question like that, we can answer it. As an example, a few days or weeks back there was threat intel about an activist group claiming it would launch attacks, and Netflix's name was on the list — they said that in the next 12 hours they were going to DoS all of these companies. From our side, we didn't see any impact; there was no incident where we said, something is happening, we need to respond. But when we went back and looked at our data for that day, we did see a significant increase in DoS activity on our infrastructure. Thankfully, our automated defenses blocked it, so that was all good — but now we're able to get these insights into what's happening, so we're much better informed.

So, quickly, to summarize: if you're interested in building your own metrics, there are a couple of things to keep in mind. First, break them down into business-level, C-level, and operational metrics. One way to do this is to talk to your stakeholders: identify which questions you need answers to, and do this at every level. Talk to them, because you may think you know all the questions, but they have a different perspective — this helps you build metrics for the right audience. And third, focus on actionable metrics. It's easy to say, "this looks like a fun number to have," or "this looks like an interesting fact," but if it's not actionable, you're not going to do anything with it, so it's important to focus on actionable metrics.

With that, thank you so much. I just want to add the plug that we are hiring across multiple roles on my team and my sister teams: we're hiring software engineers who can build scalable systems and have an interest in security, we're hiring a security analytics engineer, and also a TPM. So if you're interested, please do apply. And with that, thank you so much.

All right, thank you so much — thank you, Aditi and UA. You have a lot of questions.
Oh boy, there's a big list of questions, so I'm going to go through them as fast as I can and still try to end on time — we have three minutes. And a reminder for those of you who may be departing right now: the happy hour is happening now and will continue for a while, and the party — which is like the happy hour but more dystopian, maybe — is happening at 6:30. Both of them are at City View on the fourth floor, so please come to both.

First question — which got all the upvotes, by the way: did you ever get an understanding of any motive for launching DoS against Netflix?

Interestingly, this is the question we get asked the most, and we're working on it. We do have some insights, but this is something we're still focusing on building right now. For some of the DoS attacks, we know it's an activist group; for some, it's just someone doing it; one was actually our own testing. But this is something we're still working towards, so I don't have great answers for this question.

OK, next question: how are the main-driver metrics, such as weak passwords and credential stuffing, derived — from historical user actions, ingesting data dumps, etc.?

Yes, I think we have a few plays here. The first one — I'm not sure how much we can share, but we definitely have some vendors who provide some of these leaked credentials, so we check the overlap with them to understand password quality. At the same time, for credential stuffing: most of these attacks share very simple characteristics or patterns. For example, they usually come from the same IP or the same JA3 hash, and they try to log in many times because they want to do this credential mapping, so you can see the login success rate is super low. Those kinds of indicators let us identify what could be credential stuffing, and then we map that to each account ID and say, this could be the main reason. Hope that helps.

OK, another "how" question: how are you labeling and classifying fraudulent requests during a DoS attack? How do you segment and filter anomalous traffic from legitimate traffic?

A legitimate traffic spike looks different. It looks different if you look at the distribution of IPs, how long it lasts, where exactly it's hitting, and what it's doing in our ecosystem. So there are ways — we use our internal signals and internal data to be able to tell those two apart.

OK, we have six more questions and we're out of time, so I hope you can take a look at them on the Slido. Again, thank you very, very much for being here today — really awesome presentation. Let me give you your thank-you gifts. Thank you. Thanks.