
All right. Is everyone in the theater ready for the next edition of the Matrix movie? You're all here for that, right? Okay. No, you guys are here in the amazing IMAX Theater at BSides San Francisco. We're about to get started with our next session. So, please give a warm welcome and a round of applause to our next two speakers, Aditi and Hui. They're going to go ahead and introduce their talk. Take it away.

All right. Do I speak here? Yeah, speak right into the mic. It should pick you up. You can push it down. There you go. You're all set.

All
right. Sounds good. Cool. Thanks, everyone, for coming to our talk and choosing us over the lunch lines. I really appreciate it. Today we are going to be talking about "Navigating the Unknowns: Fraud Mitigation for Netflix Live Events."
So two years back, in 2023, Netflix started to experiment with streaming live content. Live content refers to content that is broadcast in real time, which allows our users to experience the event as it happens. These events are meant to engage our audience with unique, time-sensitive experiences. For example, you can think of sports events, comedy shows, boxing matches, and so on. 2024 was a big year for us. We added several titles to our content slate, both big and small. Quick question for the audience: how many of you have heard of Netflix live events or watched at least one live event? All right, that's great to see. So, as we got more into these
events, it got the attention of the world. It drove up user engagement. It got some media attention. All eyes were on us. Unfortunately, this included the eyes of malicious actors as well. They were watching us and thinking, "How can I disrupt this event?" or "How can I profit from this new Netflix feature?" So, when we talk about disruption by these malicious actors, let's talk about what could go wrong. So, that was supposed to come before. Okay. So what could go wrong, and what disruptions can these fraudsters cause? Let's take GenAI's help to figure out the answer. You can try this prompt in your favorite GenAI tool and ask, what are
the top security threats that Netflix should worry about with its live products? And you can even throw out your guesses. Any guesses what could go wrong? DDoS. DDoS, distributed denial of service. Any other guesses? Redistributing the stream. So, content piracy. I'm sorry? Black screens. Yes, which can be caused by multiple things, but also by a DDoS. So, when we asked the GenAI tool, this is what it came up with. Some of the top threats in this list are things like content piracy, which is when third-party websites are illegally streaming and distributing your content. Another risk is DDoS, or distributed denial of service, where the attackers
are trying to overwhelm your infrastructure and disrupt the availability of the event. And then there could be account takeovers, where someone uses attack techniques like credential stuffing to get into your account, lock you out, and then watch Netflix content from your account or maybe resell it. These top risks actually closely align with what we anticipated as well, and these are the kinds of problems that my team works on.

So before I go further, a quick introduction. I am Aditi Gupta. I lead the software engineering arm of Trust and Safety at Netflix, and my team is responsible for building services and systems that can automatically mitigate fraud and abuse at scale.

And
hi, everyone. I'm a security analytics engineer on the Trust and Safety team. I'm also leading some of the projects, such as metrics, signal development, and operational work. I'm honored to be here to share some of our work today.

So we are part of the Trust and Safety team at Netflix. We build services and intelligence to combat fraud and abuse at scale. Think of problems like DDoS, content piracy, account takeovers, and so on. This is how the talk is structured. We'll start by talking about why fraud mitigation matters during live events and the challenges that we faced. Then we'll talk more specifically about how we
prepare for this, and then I'll summarize with the key takeaways for the talk.

So why do we care about fraud mitigation? What happens if we don't do anything about it? The first and foremost reason is our users' trust and experience. Live events are designed to engage users in unique, time-sensitive experiences. If you have a scaled attack during a live event, that can disrupt this user experience and steal that moment of joy. For example, you're watching that NFL game on Christmas and the service goes down right when there's about to be a touchdown, or you miss that really crucial punch in that boxing match. The thing with a live event is
you really can't replay it in the moment. You can watch it later, but the moment has passed, and as you can imagine, this really doesn't do well for user satisfaction. The second reason is our business and our brand. Netflix operates in a competitive landscape. We have several competitors who also offer live streaming services. We think of security as a competitive advantage. By effectively mitigating these attacks, we are not just protecting our platform but also differentiating ourselves from others in the marketplace.

So now that we have established that this is important to focus on, what are some of the unique challenges we actually faced while
solving this problem? The first challenge was the challenge of the unknown itself, which was trying to understand what could go wrong. Live streaming was new to Netflix. We didn't know how things would look different, how things would look the same, which assumptions would still hold, and which assumptions would change. We had to understand the problem before we even started diving into live preparedness.

Let's get more specific about fraud mitigation for live and what was challenging and different compared to traditional streaming. The first challenge was understanding what is normal. As I mentioned, live events were new to us and we didn't really know what was expected. What was the baseline? If
you think of any security detection software, you generally expect to have a baseline of the traffic: this is how the traffic is supposed to look. If there's a deviation from that, you say, well, this looks like an anomaly and possibly an indicator of fraud. For live events, we didn't have that. And to further add to the complication, the traffic during these events can look very erratic. You might see a spike at the beginning of the show or a spike during the key moments of the show. Some events are big, some are small. Different events attract different kinds of audiences. They
have different durations. So things look very different, and it's hard to define what is normal and what the baseline is.

The second challenge was conflicting priorities. On one side we had product, who wanted to ensure a good user experience and make sure that our fraud controls were not blocking good users or stopping them from signing in. Our tolerance for false positives was really, really low. On the other side, we had to protect our systems against any kind of scaled impact that could bring down the availability of our systems. So we wanted to make sure that we get
really, really good at blocking bad traffic, which meant we had to maximize our true positives. So if you look at this, it was like a tug of war: one side expecting us to go more conservative and the other side trying to get us to go more aggressive. For us, the top priority was the user experience. So we were walking this tightrope between these two conflicting priorities.

The third challenge was unexpected traffic volume. You can anticipate a bit what your traffic will be, but your estimate can be off as well. I mean, our estimates for the NFL Christmas Day, you know, they weren't that
accurate. So what happens when you actually get a volume surge? We have services that are mitigating fraud. Can these services handle the scale of a traffic spike or a scaled attack? We had to make sure that we actually scale those services properly.

The fourth and final challenge was new attack tactics. We operate in a landscape that's constantly changing. We have emerging threats. GenAI has lowered the bar for what it takes to actually launch an attack. We are seeing new attacks out in the wild every single day, and we had to prepare ourselves for anything new that shows up. It's easy to
prepare for what you know. It's hard to prepare for the unknowns. So with that, I'll hand it over to Hui to talk in more detail about what we actually did to tackle these challenges.

Thanks, Aditi, for giving us such a great overview of our fraud landscape, especially during live streaming events at Netflix. In the next section, let's dive deep into the details. First thing, guess what: I would say the AI answer basically covered the top three fraud types we also observed in our real cases: content piracy, account takeover, and DDoS attacks. So I'm going to share some real examples in the next section and
also talk about some of the steps we have taken to identify and stop them. I'm also going to share some of our strategy for ensuring our good users are not disrupted. Let's get started.

For the first one, let's talk about content piracy. I believe it's definitely not a surprise. This is a screenshot we captured during the Paul vs. Tyson fight. If you zoom in, you can see more than half a million people were watching from this piracy website. We also found other piracy websites like this hosted on very cheap or suspicious domains, and we also found some of them on top social media
platforms.

The next one is account takeover. We see a lot of questions like, "Where can I watch this live event?" Even when people know the content is on Netflix, they are curious, or sometimes they will try to look for ways to access Netflix for free. And if you take a look at our Q4 earnings, you will probably realize that all these big live shows, such as the NFL games or the boxing match, drive a lot of engagement. This motivates fraudsters to take over more accounts. So it's quite possible for us to see a small or big spike in account takeovers, especially before a big live event.

Another common attack, as Aditi mentioned, is the DDoS attack. This is a malicious attempt to take down our service by overwhelming it. We will probably see multiple spikes in our incoming traffic. And all our live show times are public, so if Netflix is targeted, it's very easy for a bad actor to find the exact showtime and try to take down our service. Basically, streaming companies, including Netflix, are always top targets in the industry. Our competitors have also called out that they have experienced service outages due to DDoS attacks. So with these, you know,
examples we observed at Netflix, we often see DDoS attacks, content piracy, and account takeover. As Aditi mentioned at the beginning, each show is different and every live event has its own challenges. We don't know what kind of fraud scenario will happen or how it will unfold. That leads us to the next question: how do we prepare for the unknowns?

There are multiple things we have done, but I'm going to share our action items from three perspectives: before, during, and after the live events. Before the live events, we actually did a lot of things, but I have selected only some examples I can share here. So I'm going to use
DDoS as an example here. Typically, as I mentioned, it is a malicious attempt to overwhelm our service. As a data person, I am definitely very interested in the data pattern. We will see certain spikes here, and then we apply anomaly detection to detect these unusual patterns. For example, we will find that some suspicious IPs sent us a high number of requests in a very short time window. But the question is: what about the spikes during live events? I also prepared one example here. You can see that during this live event we also see certain spikes, but we also see some
spikes that slowly increase at the beginning. So the question for our fraud team is: if we see such traffic spikes during a live event, how do we identify whether this is good or fraudulent traffic? How can we ensure our current strategy for detecting these anomalies will not affect our good users? As Aditi mentioned at the beginning, resolving this issue is definitely not easy. So we are using distribution analysis to help here. By comparing the different behaviors of fraudulent traffic versus our good users, we can spot what is unusual. We also analyze different fraud types, including DDoS, piracy (especially through compromised devices),
credential stuffing, and account validation, which may also contribute to some of the spikes. For this case, we also analyze different user behaviors. For example, we spent some time identifying and analyzing CGNAT IPs. This is a method where multiple users access the internet through the same single IP. A very simple example of a NAT IP: maybe you want to watch Netflix from a public IP, like in a Starbucks; those IPs may be NAT IPs. So we can imagine CGNAT IP usage will increase during live events. We also spent time identifying those user behaviors so we could change
our fraud prevention configurations. What we did here is create a lot of customized live configurations to protect our good users as well as block fraudulent traffic.

Before I move to the next thing we did to prepare for live, I do have a question, but it's not a fraud-related one: how many of you watched the Paul vs. Tyson fight, and who were you supporting? If you supported Paul, you can raise your hand. Oh, okay, no hands. So then I'll just assume everyone here was supporting Tyson. Okay. If you ask me the same question, I
will support Netflix. I was on call for that show. To me, it was not a fight between Paul and Tyson. It was a fight between Netflix and the bad actors, and I definitely hoped Netflix would be the winner. Just like in a boxing match, where each fighter gets ready for their opponent's next move, we also conduct proactive testing to get ready for potential threats. Sometimes we fail, and then we fix, we adapt, we try again, and deliver another punch, and finally we get to something we're really happy with. So this kind of
proactive testing definitely boosts our confidence significantly and also increases our service capacity to handle all the potential threats.

Okay. I have shared a lot about how we prepare before the live events: proactive testing, distribution analysis, and grouping by different fraud types to do more analysis. Let's move the clock to the live day. When my manager told me, "You will be the on-call for the big event," I was thinking I was going to sit in a big control room with monitors showing all the fancy lines, like in this picture I generated. I mean, AI
generated; I used a prompt. I wish we were; that would be very cool. But actually, what I did was basically just sit in my apartment, look at the monitor, and find all the anomalies. What I want to share here is that there are two key actions: one is monitor, the second is respond. First, we have a very clear on-call process to monitor unusual activities in real time, and we also use a lot of investigation rules to help us identify if something is happening. For example, if we detect some fraudulent traffic, we can apply some
simple fix, like a quick allow list or block list. And as Aditi mentioned, we may see a new attack during a live streaming show, for example a brand new DDoS, a very complicated version we have never seen before. We also prepare emergency thresholds or measures that can quickly stop the bleeding.

Okay. So after spending a few hours in the fire, I'm still fine, and the live event comes to an end. Then it's time for us to answer the question: how did we do? This is when we reflect on what happened as well as look for ways
to improve. I summarized the work we did post-event into a feedback loop. First, of course, we conduct a post-event analysis to help us revisit what went well and what didn't. We cross-validate a lot of the assumptions we made before and during the live event. For example, with the CGNAT analysis, we revisited a lot of the assumptions we made there. We also analyze new patterns or attack techniques we observed during the live event, and we review our on-call monitoring process during the live event as well. Next, we use this real data to help us improve our fraud prevention
measures. For example, we further customize our live configurations and optimize our false positive and false negative controls. The last thing is that we re-measure and monitor them again, especially the effectiveness of our improvements. Most importantly, we also reassess them during the next live event. This feedback loop ensures our strategy works as intended and also helps us prepare for future events. So I will pass it back to Aditi to conclude our presentation today.

Thanks, Hui. So let's bring it all together. Let's look at what Hui just described in a timeline view. When we are
preparing for the live events, there is a before, a during, and an after. Before the event, we invested in improving our data and our configurations, building tooling that simplifies the response levers if something goes wrong, building the right dashboards, and proactively testing to get ahead of the curve. If there are any vulnerabilities, we want to identify them before someone else does. During the live event, we focused on timely discovery. We monitored, and we also had our response levers ready to go in case something bad happened. And then after the event, we did a post-event analysis and used this feedback loop to
further analyze how the event went and whether there were any surprises, and bring that feedback back into our systems.

So, final slide. You may be wondering how to take this talk and apply it to your own work. If you are building something where you are navigating the unknowns, or you are building a real-time service and you have some of the same challenges, there are basically three key takeaways. The first is that with things like this, the challenges are unknown. What worked before may not work anymore. The assumptions have changed and you need to
think differently. You need to think outside the box. The first step we took was understanding the problem, scoping out what needed to be done, and then defining some guiding principles: this is how we want to approach our defense. Building that strategy gives you a North Star for how to even tackle these challenges.

The next thing is minimizing risk. As we know, it's not possible to completely eliminate risk. We can bring it down, but there's no such thing as perfect security. So we focused on minimizing the risk, which involved improving our operations: building tooling, building these simplified response levers that actually bring down the human time to
respond in case something goes bad. We also took a fresh look at all our configurations and all our fraud controls to make sure that we are balancing fraud control with the user experience and minimizing our false positives. Then we also invested in building adaptive mitigation. What that means is that fraud controls behave differently when you have a live event versus when you don't, and they can behave differently for some live events than for others.

And then the final thing I would say is to focus on the entire life cycle. When we are talking about minimizing the risk, it's not just about
minimizing the risk before the event, but also about minimizing the risk during the event and using the learnings after the event to minimize it further. The first thing is to focus on the operations. How would you build in the operations? Identify the key metrics that you need to look at. How do you structure your on-call? How do you structure your monitoring for the live events? And then there is this feedback loop, where you take the learnings, do the post-event analysis, and verify your assumptions. Did you actually make the right assumptions? Did the fraud controls you have do a good job? Was the attack traffic what
you expected it to be, or was it different? Use these learnings and bring them back into your fraud prevention services to do a better job for the next live event.

So with that, that's the end of the talk. I do want to add the caveat that all of the work we presented here is not just the two of us. This is the effort of our entire Trust and Safety team and our cross-functional partners. So, a big shout-out to all of them for the great work they have done. And thank you for being a fantastic audience.

Let's hear it for Aditi and Hui. We've got a few questions here, ladies, so we'll pull them up in just a minute. I've got them on the Slido for you. We do have a little time. Lunch is still going on, so please stick with us; we have a few questions we'd like to get through. And then, where's our photographer? The photographer will be around here somewhere. We'll try to make sure we get a good picture with you guys here at the theater, whether it's at the end of the talk or beyond that. All right, so let's see what we've got on the Slido for
us. Okay, we've got two questions in so far. Anonymous asks: do you use ML or AI to help with the detection in real time? Please answer into the mic.

So we talked through three different kinds of attack types. For some of them, we use machine learning. We are looking at AI and seeing how we can bring it in to improve our defenses, but there's more that we can do there.

All right. And we've got one more question on the Slido: when you identify an anomaly, how automated is your process to remediate? How do you build the automations and processes when given an unknown landscape? The question is here too if
you want to refer to it. And for anyone in the room, we will do a little bonus question. You're going to have to tell it to me so I can repeat it, if there's something you'd like to ask our presenters before you head off for lunch.

So, yeah, most of our defenses are actually automated. We have a system that follows certain guiding principles, as I was saying earlier, and most of our systems automatically mitigate fraud and abuse. That being said, something new can show up, in which case we would actually have a manual intervention. There was another thing I was going to say. Um,
well, we'll save that for... Okay, we got to go. Go ahead.

Oh, I can add one thing here about the second question, the unknown part. I feel this is also a challenge for our team, but what we definitely focus on is false positive control. When we do automatic detection of anomalies, we always add some consideration for false positives. For example, we use our historical data, especially from good users, to help us navigate a lot of the potential risk that comes with this automation.

And I remembered what I was going to say. Even if there are unknowns, there are still similarities with what you already have. So, with
our streaming and live streaming, we focused on understanding what was similar and what was different. And sometimes the similarities can take you a long way, even if you get hit by something unknown.

All right. Well, I think we're going to wrap it here, and let's have a round of applause for Aditi and Hui. If you guys have any other questions for them, I bet you can find them around City View. Thank you, ladies, so much for an excellent talk.
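
Editor's sketch: the talk describes flagging IPs that send a high number of requests in a short window, relaxing fraud controls during live events, giving shared (NAT) IPs extra headroom, and using a quick allow list as a response lever. The Python below is a minimal illustration of how those ideas could fit together. It is not Netflix's implementation; the class names, the thresholds, and the multiplier are all hypothetical values chosen for the example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class RateLimits:
    """Hypothetical thresholds: max requests per IP inside one window."""
    window_seconds: float = 10.0
    default_limit: int = 100        # assumed limit for normal traffic
    live_limit: int = 300           # assumed, more tolerant limit during a live event
    shared_ip_multiplier: int = 20  # assumed headroom for known shared (NAT) IPs


class IpRateMonitor:
    """Sliding-window request counter per IP with live-aware thresholds."""

    def __init__(self, limits, shared_ips=frozenset(), allow_list=frozenset()):
        self.limits = limits
        self.shared_ips = set(shared_ips)   # IPs believed to front many users
        self.allow_list = set(allow_list)   # quick manual override ("lever")
        self._hits = defaultdict(deque)     # ip -> timestamps inside the window

    def threshold_for(self, ip, live_event):
        """Pick the limit that applies to this IP right now."""
        limit = self.limits.live_limit if live_event else self.limits.default_limit
        if ip in self.shared_ips:
            limit *= self.limits.shared_ip_multiplier
        return limit

    def observe(self, ip, timestamp, live_event=False):
        """Record one request; return True if the IP now exceeds its limit."""
        if ip in self.allow_list:
            return False
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop requests that have fallen out of the sliding window.
        cutoff = timestamp - self.limits.window_seconds
        while hits and hits[0] <= cutoff:
            hits.popleft()
        return len(hits) > self.threshold_for(ip, live_event)


if __name__ == "__main__":
    limits = RateLimits(window_seconds=10.0, default_limit=5, live_limit=8)
    monitor = IpRateMonitor(limits, shared_ips={"203.0.113.7"})
    for second in range(7):
        flagged = monitor.observe("198.51.100.1", float(second))
        print(second, flagged)
```

A fixed threshold like this is the simplest possible stand-in for the anomaly detection and distribution analysis mentioned in the talk; the point of the sketch is only that the same traffic can be judged against different limits depending on whether a live event is running and whether the IP is known to be shared.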