
Army of Proxies! How Netflix scales identity based zero trust architecture

BSidesSF · 2024 · 29:53 · 323 views · Published 2024-07
Category: Technical
Style: Talk
About this talk
Grant Callaghan

In nearly 10 years of identity-based security, the Netflix strategy has evolved significantly. From crafting novel solutions to migrating to common off-the-shelf tools and an expanding definition of workforce, Netflix's Security organization embraces context over control to help entertain the world.

https://bsidessf2024.sched.com/event/6763cd586c54ea3bfe722f2d13398bb6
Transcript [en]

uh please welcome Grant Grant Callaghan is a staff security software engineer at Netflix where he has been working since 2018 his work includes development of tools and services that defend the Netflix platform such as design and implementation of authentication platforms and control planes he has consulted on open-source and community projects such as SPIFFE and Envoy proxy prior to Netflix he has worked at PDI Cisco VMware Google as well as a number of R&D startups in Silicon Valley please welcome Grant for his

talk hi everybody thanks for coming um I want to thank the uh organizers and all the volunteers um this has been a great event so far I've been really enjoying myself um and thanks for having me uh have this talk um so uh my name is Grant Callaghan I'm a member of the access control and engineering team um at Netflix in the information security organization um I've been there a little over 5 years um pretty much in the same role um the names of the teams have changed um but we're in charge of providing tools services libraries integrations architectural review um and advice on corporate and workforce authorization systems um so today I'm going to discuss um a couple of the

tools and techniques um that we use um along with various platform teams uh to scale the adoption deployment um and operations of zero trust architectural components in a collaborative way um that hopefully enables the business while balancing um risk and speed so uh before we get to the primary programming um I'd like to clarify a few terms that I just used um and possibly provide some limits to how this applies across the engineering ecosystem so I'm going to talk about scaling in a different sense um zero trust architecture can mean uh a lot of different things to a lot of different people um and uh Netflix has been around for a while so what I'm going to talk

about is kind of the latest and greatest uh but it's kind of an evolving um Target so often when we talk about scaling at Tech conferences we're talking about RPS right how how much um traffic can we serve um what I'm going to refer to today is kind of how we scale human operations to avoid imposing too much friction on business efforts um so and in terms of zero trust um it can be you know co-opted by architecture marketing terms um you know even within our uh our teams we kind of disagree on what exactly it means um in my mind it's kind of a Northstar what we kind of try to achieve um in this talk I'm going to

focus primarily on uh kind of the concepts introduced in Google's BeyondCorp and BeyondProd white papers so um service identities identity-aware proxies um and similar style architectures um speaking of proxies uh we're going to use them in the traditional sense in terms of like Blue Coat and um API gateways um but we're also going to talk about a little bit in terms of how as security organizations you might be able to um leverage other teams to proxy your interests and embed in their business processes so like I said before much of this is discussing the latest and greatest that's available to our developers uh but in no way encompasses the vast uh Netflix technical ecosystem

um so and older legacy applications are migrating at their pace and that's okay um changing authorization systems is hard um we've witnessed Google struggle with it um my heart went out when I saw the Meta team struggling with it and I can't confirm or deny but um you may be looking at an engineer who's responsible for taking down streaming studio and corporate all at once so uh we like to avoid that so finally um let's see I've changed uh some of the names you're going to see a lot of logos uh which might be different than some of the um uh presentations you've seen before uh what I'm hoping to do here is uh I'm replacing a lot of

internal systems that we have with kind of external common off the shelf ones hopefully if you haven't heard of them hopefully you'll get introduced to them if you have uh maybe you'll have a better understanding of the types of services um and where they stand in this type of architecture um let's see and finally oops um sorry so we seek to entertain Netflix is an entertainment company um this is after lunch it's Cinco de Mayo uh hopefully uh if you're hanging out at BSides much of this might be remedial to you um what I hope is new and maybe brings a smile to your face is I'm going to personify um our services um I like a

good uh a good caper uh and so I'm going to do it to the theme of Netflix's um Army of Thieves and Army of uh the Dead so let's talk about how it started versus how it's going because there are a lot of scary things to defend out there but only so much time energy and headcount so it started at Netflix kind of with a program way before my time called LISA the Location Independent um Security Approach and that's where we really began using identity as the perimeter um this approach leaned heavily on VPN and SSO for a strong network boundary um and this ensured strong authentication for everyone entering the private corporate network uh but once you got

past that moat it was pretty wide open um and the permissions were uh fairly lax you could uh get what you needed to get your job done and at the beginning with a fairly small highly technical workforce and a sole um streaming product this made a lot of sense um and it worked um it reduced um any risk to uh the main product or customer joy while allowing us to um iterate technically unfortunately it made a couple of assumptions one that this flat network would remain desirable um and two that um services would kind of naturally harden over time to get outside the VPN without a catalyst of sorts so today um we have a network topology

that looks uh something like this um the names and services here have been replaced uh with different examples um so we don't use all of these uh but we definitely use similar um in uh spirit types of services many of them didn't necessarily exist when we needed them so we had to build versions of our own and the reason we had to do this is because of our evolving Workforce so Netflix is no longer primarily a streaming only engineering focused organization today we have quite a variety of audiences that interact with Netflix for business processes Netflix Studios is one of the largest content producers on the planet game and Animation Studios have joined the Netflix ranks and now live production

events each one of these audiences has a very different job function uh very different job duration but they all share the need to access a subset of protected corporate resources um so that traditional flat network lacking segmentation and authorization um is no longer appropriate for the accelerating business needs and changing risk landscape so without further ado let's meet the crew uh the cast of characters required to pull off this type of high security caper implementing zero trust components while leaving as little trace as possible at least as far as uh end users are concerned so today I'm going to give a brief introduction to the gatekeepers sitting on the edge handling initial requests and interactions the bouncers

checking credentials and directing people where to go the hideout where our users and service identities hang out become known associates and how they got there in the first place the bodyguards security close at hand for any app that needs it the mastermind because somebody has got to direct all these divas the getaway driver for when you need to get from A to B fast the talent scout because security is a team sport so always be recruiting and the client because who uses this stuff anyway so let's start with the front door of the system API gateways specifically those that incorporate identity-aware proxies so this was um kind of popularized by Google's BeyondCorp um but there are a lot of different

systems that can be set up um to kind of be the front door systems like Kong or native cloud services like Kubernetes ingress controllers or even um customized versions of open source services like Netflix's Zuul so these proxies sit on the edge they're in a highly leveraged position they provide a singular choke point for external traffic to pass through when crossing this threshold requests are inspected for valid credentials these are often exchanged from external to internal representations and finally these requests get routed to internal services to get fulfilled additionally engineers might need to get direct instance access um for debugging so our network uses bastions and jump hosts protected with an SSH certificate authority called BLESS um this

allows um engineers to go through that protected proxy um centralizing authorization logic and reducing our key material sprawl so zooming in here on that main diagram we saw this is where we had an example instance of Kong sitting as an API gateway so if you show up at the front door without a ticket our gateways are often going to redirect you here the bouncer or in our case the identity provider so identity providers and bouncers inspect your credentials closely is it valid is it username and password or passkey they can even provide some basic validation on top of the verification is the requester of age is their ID vertical or horizontal is the requester an employee or a contractor
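The gateway and bouncer flow described above (inspect the credential, redirect to the identity provider if it is missing, swap the external credential for an internal representation, then route) can be sketched roughly as follows. This is a minimal illustration and not Netflix's implementation: the URL, route table, token values, and the `handle_request` function are all hypothetical, and a dictionary lookup stands in for real credential verification.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values for illustration only; none of these come from the talk.
IDP_LOGIN_URL = "https://idp.example.com/login"
ROUTES = {"/reports": "reports-service", "/tools": "tools-service"}

# Stand-in for real credential verification: external token -> (user, worker type).
KNOWN_TOKENS = {
    "tok-alice": ("alice", "employee"),
    "tok-bob": ("bob", "contractor"),
}


@dataclass
class Decision:
    action: str                              # "redirect", "route", or "reject"
    target: str                              # IdP URL, internal service name, or reason
    internal_identity: Optional[dict] = None


def handle_request(path: str, token: Optional[str]) -> Decision:
    """Edge decision: redirect unauthenticated callers, otherwise exchange and route."""
    # No ticket (or an unknown one): send the caller to the identity provider.
    if token is None or token not in KNOWN_TOKENS:
        return Decision("redirect", IDP_LOGIN_URL)

    # Exchange the external credential for an internal representation.
    user, worker_type = KNOWN_TOKENS[token]
    internal = {"sub": user, "worker_type": worker_type}

    # Route to the first internal service whose prefix matches the path.
    for prefix, service in ROUTES.items():
        if path.startswith(prefix):
            return Decision("route", service, internal)
    return Decision("reject", "no route for path", internal)
```

Keeping the external token at the edge and forwarding only the internal representation means downstream services never have to parse or trust raw external credentials.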

often identity providers can proxy some of these credential validations behind the scenes and they can either do full specification compliant redirects to upstream identity providers like G Suite or do direct API validation of credentials this allows you to federate and aggregate various forms of user audiences so here in that original diagram if we zoom in we can see Keycloak serving that uh um that use case for us here so all right so where do our identities gather who do they meet up with how do they get there these are the purposes of the various processes that make up the hideout of our principal population so at Netflix we're using HR business processes and application

registries as central registries for internal services and authorization systems to reference so examples of these for other organizations could be Workday or ADP for organization and hierarchy information combined with something like SPIFFE or SPIRE for identities these sources of principals can then be aggregated in a shared grouping service like Active Directory LDAP or a custom version whenever our system needs to check a user for their identifying characteristics and known

affiliations so here I have an example uh logo of Active Directory internally we have a custom-built directory
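The aggregation step described above, where people from an HR source and services from an application registry land in one shared grouping service, might look something like the sketch below. The record shapes, group naming scheme, and function names are assumptions made for illustration; the talk does not describe the internal directory's data model.

```python
from collections import defaultdict

# Hypothetical source records; field names are assumptions, not from the talk.
HR_RECORDS = [
    {"id": "alice", "org": "studio-eng", "worker_type": "employee"},
    {"id": "bob", "org": "studio-eng", "worker_type": "contractor"},
]
APP_REGISTRY = [
    {"id": "spiffe://example.org/render-service", "owner_org": "studio-eng"},
]


def build_groups(hr_records, app_registry):
    """Aggregate user and service principals into one group -> members mapping."""
    groups = defaultdict(set)
    for person in hr_records:
        groups[f"org:{person['org']}"].add(person["id"])
        groups[f"type:{person['worker_type']}"].add(person["id"])
    for app in app_registry:
        groups[f"org:{app['owner_org']}"].add(app["id"])
        groups["type:service"].add(app["id"])
    return dict(groups)


def groups_for(principal, groups):
    """Return the sorted group names (known affiliations) for a principal."""
    return sorted(name for name, members in groups.items() if principal in members)
```

An authorization system can then ask one question, `groups_for(principal, groups)`, whether the principal is a person or a service identity.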

so up close and personal bodyguards and service mesh operate within intimate trust boundaries in our topology we use a customized version of Envoy and an authz sidecar similar to Open Policy Agent we call it Gandalf so these processes can be deployed as sidecars or daemons in the Kubernetes world or even traditional Linux processes like systemd on the instance they can help with things like automatic network operation metadata like request tracing and metrics including handling client certificate authentication OAuth and path-based authorization for interprocess communication so this aids in implementing a system very similar to the one described in Google's BeyondProd with service-to-service communication protected by service identities and client certificate authentication
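As a rough picture of what a sidecar authorization check can look like, the sketch below takes the caller's service identity (which the proxy would have extracted from a verified client certificate) together with the request method and path, and evaluates them against a default-deny policy. It is not Gandalf or Open Policy Agent; the policy format, the SPIFFE-style IDs, and `is_allowed` are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical policy: which caller identity may use which method on which path.
POLICY = [
    {"caller": "spiffe://example.org/billing", "method": "GET", "path": "/v1/invoices/*"},
    {"caller": "spiffe://example.org/reports", "method": "*", "path": "/v1/reports/*"},
]


def is_allowed(caller_id, method, path, policy=POLICY):
    """Default-deny check on caller identity, HTTP method, and request path.

    caller_id is assumed to come from an already verified client certificate;
    this sketch does not perform any certificate validation itself.
    """
    if not caller_id:
        return False
    for rule in policy:
        if (
            rule["caller"] == caller_id
            and fnmatchcase(method, rule["method"])
            and fnmatchcase(path, rule["path"])
        ):
            return True
    return False
```

Because nothing is allowed unless a rule matches, a new internal endpoint stays closed until someone grants access to it.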

so here when we zoom in we can see an instance where we have an Envoy sidecar running um we've got uh the Open Policy Agent running as an authorization system and we have the application process and possible middleware all

interacting so next the mastermind the control plane so we need to onboard and offboard users and permissions and integrations so even if you have all of these control points the API gateway proxying requests at the edge service mesh protecting every single internal endpoint unless those are choreographed the potential can go to waste so to coordinate all of these moving pieces you need a player that can see the entire board a control plane so we need a way to quickly group people when they are hired or change roles we need workflows to enable and disable these users and permissions depending on the life cycle of the business relationship and this can be highly variable with those audiences that we

just saw so we also need to integrate with various pieces of our platform to reduce the need for manual configuration that can be done incorrectly so and finally we need to see the difference between plans and reality and make the appropriate changes when developers are first setting up access permissions they might not know the ultimate target audience or how the business is going to use their system ideally they can specify a slightly wider net to grant permissions and have this right-sized over time based on observation so at Netflix we have systems like Repokid and squash SSH to repossess unused permissions and make policy modifications and suggestions automatically over time based on this observed access

history so we've got all these control points we've got a way to control uh the enforcement points but if that's all manual click ops can only get you so far and even if you have all the configuration and features available when you really want to step on the gas you got to meet the developers where they are so in our case we've integrated tightly with a number of processes including continuous deployment at Netflix we use Spinnaker and a custom declarative plugin called Managed Delivery but many other orchestration systems exist such as Argo CD Jenkins pipelines lots of ways to get this done so there's a lot of data and configuration um that can be mined from

application manifests um examples of these manifests um appear out in the wild such as the um open app model by Goa but when you have these dense delivery specifications um you can extract interesting things like DNS domain um that can automatically configure those API gateways um and configure automatic OAuth clients um with the proper redirect URLs and when you get this automated enough you can even get it down to the granularity of tying into pull request review with an ephemeral environment so UI developers can get a full end-to-end protected zero trust network on every single change so this has been particularly appealing because uh our UI developers are easily able to demonstrate their changes in production or in production

like systems um which is much more than static code review um can fully

portray so you've got the infrastructure it's optimized you've got distributed enforcement points orchestration but you still need to win the hearts and minds of your organization and that's where talent scouts come in so we need to find new members new teams to initiate into this as yet unknown and sometimes unproven system so at the size and scope of Netflix um the engineering organization and the sheer number of microservices it would be impossible to accomplish migrations like this manually so especially if it had to be iterative if we had to ask for more than one change at a time so what we've done is we've been embedding code and tools in various scaffolding tools similar to Yeoman

generator at Netflix we call it um Newt the Netflix Workflow Toolkit so we're also able to do things um this is able to set up a boilerplate uh with those API gateways for us and all the um basics of authentication additionally we're able to team with uh pair up with other teams um such as uh the UI team and embed authentication information in UI component libraries um this helps us uh tightly couple um our deployment and identity-aware proxies with our UI um and various applications so you might recognize uh these libraries as Bootstrap or Material UI at Netflix we have an internal component library we uh call

Hawkins so why do I do all of this who uses this right so I would say we've got a bunch of end users um application security um has been able to use and leverage a lot of these things uh most notably when we had Log4j um our uh API gateway was able to give us a lot of observability was able to help us block uh huge classes of attacks while we were able to remediate things on the back end um incident response is able to look at all of these centralized systems logs data metrics we're able to do session revocation if necessary um we have uh probably our biggest um end users aside from end

users are our technical support operations about 30% of our tickets um or their tickets tend to be users asking for access to do their job so at the end of the day we have a series of systems and teams in place to deploy and manage a fairly complicated topology that gives us a lot of flexibility so it looks something like this again um we don't use most of the um logos up here uh but we use similar or build similar services and tools ourselves so most of the teams may understand a small piece and intersection at which their services operate um but really the entire squad needs to work together and perform their role so that we can pull off deploying

zero trust components without so much impact so all right thank you very much um that's the presentation uh [Applause] questions all right we have first question in slido how do you determine permission is unused by developers and thus requires repossession what if I only require access to a particular resource once a year so we do have to pick a time frame so um we do have various metrics um and observability but we do have to pick a back trace time frame and say hey you know is it a year is it a year and a half like how many uh what's the time frame in which we can determine that

this has been um is stale and we can repossess it um part of that is also making it easy to regain it just in time in case um you do require it again please go ahead shout it out and we'll repeat the question thanks great talk I was wondering with doing certificate based identity are you doing revocation with

that um so could you repeat the question for the recording please uh so the question is with certificate based identity um are you doing revocation did you go with a CRL how are you doing that um so I believe I'm not sure because that's not exactly my team so I can't speak uh definitively to that um so I'd rather not give inaccurate information yeah go ahead yeah how do you at Netflix determine whether it's build versus buy this is a constant debate um I would say uh we're often looking um one of the biggest problems is we've run into uh problems that are larger at scale than the market has solutions for

at the time so we'll often have to build um and then later on once the market has caught up we'll look and evaluate regularly to see when it's time to replace and to buy any other questions question right please go ahead uh you mentioned 30% of the tickets from support are for getting access um you like what is your how do you evaluate those requests what are the yeah so please repeat the question for recording please uh so the question is um I mentioned that 30% of uh our tickets were requests for access um so how do we evaluate um those requests um so these actually come to our support team so they're slightly indirect um and so 30% of their tickets

are access requests um generally there's um kind of a decision matrix um does your team member have this access already um we'll have a list of what we call standard profiles that can be referenced um right now some of it's manual some of it's automated um we have some specific um uh groups or access policies that are uh more protected and they require so like access to PCI data um requires background checks and those are more enforced um so it depends uh but generally there's a business process to evaluate um to fairly quickly get the access they request unless there's elevated risk in which case we take more steps we have question on slido a couple

of questions uh how does the zero trust model work for data use cases like ML models that are in microservices yeah this is a complicated interaction I will try my best to explain um I don't know well enough to explain it well but we have a combination of the service identity and the user identity and so um we kind of when like a Spark job or a Python notebook runs um it can't run with more permissions than the user running it so the user running it has to have access to the same um underlying uh data tables um and then there's another permission of you know is there uh does this service

application have permission to access that so we kind of do a matrix of the permissions and combine the identities all right uh next question from slido what is your process for reevaluating technologies in your stack how often and with what teams uh I don't think there's uh an official process um it's generally I think pain-based you know how much uh pain are you experiencing running your own identity provider internally um and you compare that with how much pain you can ask your uh consumers to do to change versus how much pain it's going to be to move on to a different identity provider and you just have to constantly look and evaluate and use your best

judgment right next question ZTA is usually some combo of device identity and network how does the device factor into the security model you've implemented um so device is definitely kind of further down the maturity model we're working um we're working on that a little bit uh part of the bring your own device makes it a little more difficult uh but there are various ways around it um we've done uh things with um Chrome extensions and plugins uh to interact and make sure that your um we call it Stethoscope uh make sure that your device is in good health and register it uh with uh our fleet um but we also have various uh different um classic

systems uh running inside the ecosystem like Jamf awesome there are no new questions on slido if anybody has a question in the audience just shout out the question we still have a couple of minutes so you mentioned during your talk that there's some sort of ability you have to provision a zero trust network for individual developers is there any additional risk for having that ability [inaudible]

[inaudible] um so what I think sorry I may ask you to repeat the question thank you um the question if I can summarize is um is there any additional risk with um introducing uh the infrastructure at like the PR or you know work in progress level and doing that kind of fairly regularly um and I think uh anecdotally what we found and this is kind of that pain based um we do a lot of our own um support so anecdotally what we found is as much as we can automate the better so if you can automate it you can do it more regularly but you also reduce misconfigurations and misconfigurations um have been just really uh the biggest

uh either source of vulnerabilities or source of business friction okay we have question on slido have you run into any trouble with architecting this in a hybrid cloud scenario and were there any challenges setting up this across cloud and on-prem that is one of the benefits of working at Netflix is we're pretty much all in on AWS so we haven't really had to look at hybrid clouds too much all right uh no new questions on slido anybody in the audience has a

question all right let's thank Grant once again for his talk [Applause]