
For our next session, we have Ashwin and Fletcher, who will be presenting a blueprint for building a generic authorization service for your organization. So, let's start with a round of applause.
>> Hello everyone. My name is Ashwin and I'm joined by Fletcher. We are part of the platform security team at Rob.
organization. The agenda will be this way. We'll talk a bit about the problem and why exactly it's important for us to solve, and the journey we took arriving at this blueprint. We'll go a bit into the architecture and the lessons that we learned along the way, and recap some of the key takeaways. That said, let's look at the problem. As we look at the sheer scale of modern infrastructures, companies are running anywhere from around a thousand to around 4,000 microservices, and they did this because there are benefits to moving away from monoliths to microservices, be it scalability, where you can
independently scale services, and it comes with increased reliability, and it also enables individual teams to rapidly iterate on their product. But it has also created a massive challenge. Access control, which would have been done at a manageable number of points, is suddenly being done at thousands of isolated locations in your infrastructure. Access control at this scale becomes untenable for security teams to maintain, secure, or audit. So what happens when you try to manage access control across thousands of these independent services without a unified strategy? You end up with a fragmented authorization landscape. When I say fragmented, it could look like every team
building their own authorization solutions for their own specific needs. This means defining authorization within their code or in a specific config file, or piggybacking on another authorization solution. Sometimes this could also mean implementing rudimentary access control which is very coarse-grained, leading to teams and identities having persistent broad access. All this creates cluttered, brittle, and redundant logic when you talk about access within your fleet. This spaghetti architecture doesn't just look messy. It fundamentally slows down your developer velocity as well. A lot of times, as I mentioned, access-related
configurations are defined within a config file within your repo. So every time
waiting for 20 minutes to even day
cognitive load to the engineers who are working with many different systems. They do not know where to go to get the access they need to get their work done, and they also need to figure out whether they already have that access. So this creates a very high cognitive load. And as every engineering team's authorization needs change from coarse-grained to granular, they're reinventing the same wheel by re-engineering their existing solution. This happens in different teams at different points in time based on their needs. And every time a new authorization solution is introduced into your infrastructure, it
becomes a security bottleneck, because security teams have to do a deep dive to verify that the authorization solution is implemented correctly and that the team is not shooting themselves in the foot. And while your developers are being bogged down by these bottlenecks, your security posture is degrading, leading to severe security risks. The most common of these is zero central oversight. Your information security team does not know which identity has access to what across your infrastructure. The reason is, as I mentioned, that authorization and access controls are being defined in different parts of your infrastructure. It could be some database in a service or a specific file in a specific
repository. And every time an incident happens, you need to get insight into what access an identity
there was no single capable solution. Teams would have built these rudimentary access controls that we refer to
And what happens when an identity gets offboarded? Because these are being defined in a static manner, no one is going back and cleaning them up, leading to orphaned permissions. Even the whole process of provisioning access will be broken as well.
The rightful owners are uh that kind of tool
or when a team member wants access to something manages.
And these security risks are not just theoretical. They are the most sought-after vulnerabilities in the industry today. In 2013, broken access control was not even in the top five risks. But since 2021, it has been the number one vulnerability that hackers are looking to
control related vulnerabilities are some of the highest-paid bugs that researchers regularly find and get paid for. So why is that? It's because solving the access control problem is hard. When you plug in a scanner, it can't find a logic bug, because these access controls are being defined at multiple points within your infrastructure. So now comes the question: how do we go about solving this problem at scale, which is draining your developer velocity and also exposing us to the number one security risk, which is broken access
control. We need a fundamentally new approach for managing authorization. And this is where we get to our blueprint. To fix this, we need to decouple authorization from individual application code entirely. What we are recommending is a single source of truth for all your access policies, which then provides a way to enforce these policies in a distributed manner based on your customers' requirements. Here, customers are internal services and tools for us. Today, what we will share is just that: a battle-tested architecture using open-source tools that can handle millions of
authorization decisions per second. Let's take a look at the journey of how we arrived at this solution. To bring this blueprint into reality, we had to clearly define our technical requirements, and the most important one is being highly available. Authorization is in the critical path of every request that the services get. So if authorization goes down, your applications and your customers will be very angry with you. So extreme care needs to be taken in coming up with a solution to make sure that authorization itself is highly reliable. The other part of this is that it needs to be efficient. The capacity for
your authorization should ideally scale with the customer's traffic. This means there is no wasted capacity just lying around. And the solution that we are coming up with should be extremely flexible. On the granularity front, let's say an internal team is responsible for building out a very complicated UI and they want granular access control on it. They want to be able to say that, depending on who is using their UI, certain components are loaded or not. And even every single field needs to be granularly provisioned, and not
also be able to be supported on this platform. Granularity is not the last of it. It should also be able to support use cases that are highly performant and high traffic but low granularity.
We're looking at millisecond P99 latency for authorization
at the solutions. We evaluated three different
but when we looked into some of the options that were available, in some cases we had to ship our policy data to an external database of sorts so that we could then do authorization over there. This definitely had security and privacy risks. There were also solutions where we could license an authorization solution and run it internally. But from our findings, most of these
the other option, which was our backup option if we couldn't find something in the open-source world that we could leverage and build something out of.
for us was a hybrid solution, where we could build a management layer ourselves if we found a mature open-source engine that was specifically designed to run as a high-performance sidecar. That would solve a lot of our problems. Here, Open Policy Agent stood out, and another open-source solution called Topaz stood out, and this is where
decided on Topaz. The reason we went with Topaz is that it was designed to be run as a sidecar. It provides a very clean and intuitive framework for representing your access control needs. For example,
engine
disparate systems to implement granular authorization needs, and it's also AuthZEN compliant. AuthZEN is
this means that if all our applications are using the AuthZEN-compliant interface, we can swap Topaz with some other AuthZEN-compliant solution down the line if the need arises. So this is where we ended up. The solution for us is internally referred to as Guard. It's a custom multi-tenant control plane. Multi-tenant here means nothing more than that every use case is considered a different tenant for us. So what we allow is for different use cases to come in and define their own
policies and policy data on this central control plane, and it acts as the source of truth for all the policies. But it doesn't just end there. We provide ways for services to integrate those policies into their service, so that we can implement an end-to-end authorization solution. And it's flexible in a manner that not only addresses the use cases of most of our internal services through the Topaz integration, but also enables us to support internal infrastructure use cases as well, where having a sidecar is actually a less available solution. Fletcher is going to go into it a bit and this all
to talk about the architecture. >> Hey, I just want to remind you for Q&A, and so we don't have to run the mic all the way up there: please go to bsidessf.org/qna (Q-N-A, Quebec November Alpha) and you'll get a link to Slido to enter your questions, and we'll announce them out here. Thank you. >> And so, to talk about the architecture: first, what is an authorization solution? They're generally composed of three components. The first is the application that you want to secure, and we call that the enforcement point. It's ultimately responsible for enforcing a decision. But we want to make this part as easy as possible. Ideally we shouldn't
even have to make the decision at this level. We should be able to delegate that to a third party that we call the decision point. This is Topaz in some examples. It simply makes authorization decisions based on some sort of policy. But where does the policy come from? That comes from the administration point, and this is Guard in our use case. It's a central, multi-tenant API service, and it should be the source of truth for all authorization policies in the organization. A quick note on reliability. Authorization is non-optional. If you can't authorize, then the only secure fallback is to deny everything. But with that in mind we can make some
distinctions between what we call the control or config plane, which is authorization policy updates. This is the administration point. This is Guard. But there's also the data plane, and this is the part that cannot go down. So we want to, as much as possible, keep the administration point out of the data plane. So with that in mind, let's go ahead and break that rule with the central evaluator. This is where the administration point is the decision point. This can be useful for cases where you need really high policy freshness and where there are relaxed latency requirements. But the thing you need to keep in mind here is that it's fundamentally
lower availability, because the life cycle of the administration point is not directly tied to the enforcement point. An example here could be a web application gating internal human access. Something like this may also be necessary if you have very large authorization policies, for reasons we'll get into in a minute. This is where they're of a size that you kind of need your own database for them. But our general recommendation is to go with the sidecar evaluator. This is with Topaz. With this, the decision point is the Topaz sidecar, which runs next to the application. There's an initial download of policy data from the administration point, from Guard,
but that's where the data plane dependency ends. So deployment is the only time that the administration point has to be up, and after that the sidecar will run on the last known data. This can be a reasonable security decision for a bounded time frame. But the caveat here is that the policy size is limited by the sidecar's heap memory. That's because, to support millisecond response times, it keeps the entire database in memory. So we can go to an example of this. This is a case study where we have a data broker service that acts essentially as a front end to a database. And some of the
properties it had: it was extremely high traffic, north of a million authorization decisions per second. It had very high availability requirements and it was reasonably latency sensitive, on the order of 10 to 20 milliseconds. But in this case, because most of the clients were internal services, policy propagation delay was acceptable. They were not on an "I click something, I get access, I'm expecting a change" workflow. This is for internal services. This is a great application of the sidecar model. This is the sort of critical service that you can expect to have success with using this model. With that, I'll hand off to Ashwin for another case study.
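As a rough illustration of the sidecar behavior described above (an initial policy download, decisions then served from memory on the last known data, and deny as the only fallback), here is a minimal Python sketch. It is not Topaz's API. `SidecarEvaluator`, `fetch_policy`, `max_staleness_s`, and the flat (identity, action, resource) grant tuples are hypothetical names chosen for illustration, and the one-hour default is an assumed value for the "bounded time frame".

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional, Set, Tuple

Grant = Tuple[str, str, str]  # (identity, action, resource); illustrative policy model


@dataclass
class SidecarEvaluator:
    """In-memory decision point that fails closed and serves last known data."""

    fetch_policy: Callable[[], Set[Grant]]  # hypothetical download from the administration point
    max_staleness_s: float = 3600.0         # assumed bound on serving stale data
    clock: Callable[[], float] = time.monotonic
    _grants: Optional[Set[Grant]] = field(default=None, init=False)
    _synced_at: float = field(default=0.0, init=False)

    def refresh(self) -> bool:
        """Pull a full policy snapshot; keep the previous one if the pull fails."""
        try:
            grants = set(self.fetch_policy())
        except Exception:
            return False  # administration point unreachable: keep last known data
        self._grants = grants
        self._synced_at = self.clock()
        return True

    def is_allowed(self, identity: str, action: str, resource: str) -> bool:
        if self._grants is None:
            return False  # nothing synced yet: deny everything
        if self.clock() - self._synced_at > self.max_staleness_s:
            return False  # snapshot older than the allowed window: fail closed
        return (identity, action, resource) in self._grants
```

The design choice this is meant to show: after the first successful `refresh()`, a failed refresh does not change any decision, so the administration point stays out of the request path until the staleness bound is exceeded.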
>> Okay, let's talk about the cool kid in the room: AI agents. You can use the same framework for securing your MCP servers as well. This framework is agnostic to what kind of identity you're securing against. In our use case it could be humans, agents, or even workload identities. This is a common use case that we are seeing: how to go about securing your MCP servers. Let's just walk through a simple example of how flexible the system is and how you can secure your MCP servers. Let's say you have an internal MCP server
that has tools to enable source-code-related changes, or communication tools, whatever you use for your IM, or documentation-related tools. You don't want an external customer-support-facing agent having access to the MCP server at all. That's something that you can easily configure on this specific MCP server's tenant on Guard. The Topaz sidecar syncs policies specifically for that internal MCP server, and it allows all the internal agents, like the employee assistant and the coding assistant, to get access to this MCP server. It doesn't stop there. You can even configure tool discoverability as well. Depending on which agent is hitting your list API
endpoint, you can return the relevant tools. You can go even further. On the MCP server, you can configure certain standard attributes that are applied across all the tools. Let's say you can somehow compute a risk score. You can say: even though the coding agent is allowed to hit the source code tool, if for whatever reason the risk is not low, don't accept it. That's something that you can easily configure using Topaz. And you can go further to be more granular at the tool level itself, but here we would recommend creating a separate tenant for the tool itself so that the Topaz
sidecar heap memory is always in control. And here, this is just an example showing that if an agent is trying to access a specific user's repository, you can make sure that they do own that access, and this is something that you can sync from your GitHub team policies. Because you control this central plane, you get the flexibility to automate a lot of these use cases. Sending it back over to Fletcher. >> So we've talked about a couple of use cases, even for critical service needs, but there are some platform-level needs that we don't even want to take
a data plane dependency on any Guard component for. So we want near-zero availability degradation on this. That also means that we have no local network hop for authorization, which can save a couple of milliseconds on each authorization. The way that we decided to do that: we have mature platforms like Istio and Kubernetes, and they have their own authorization engines and their own authorization policy languages. So the idea was to have our central policy format and transpile a subset of authorization policies to the platform-native format, and those are applied directly to the platforms. So if we look at a case study with a service mesh use case running on Istio,
latency here is absolutely critical. We can't be adding one or two milliseconds to every request; that's too much. So even though Istio supports externalized authorization, we don't want to consider that for all services. Instead, we have a reasonable default granularity of authorization that can be applied to all services, something like the service name, endpoint, and HTTP method. For more granular use cases, we recommend the sidecar model, but that's on a case-by-case basis. Similarly, there are also applications for Kubernetes RBAC. This is less of a latency issue and more of a reliability issue. We want every decision that a cluster needs
to make to stay within the cluster. We don't want to have an external call for any authorization decision here. So with that, we have the full picture. This is Guard. To reiterate: the central evaluator is useful for either high-cardinality, very large authorization policies or high freshness requirements, or just for ease of integration. The sidecar evaluator is for fast, resilient authorization for all your critical services. And for really extreme use cases needing near-zero availability degradation, we recommend the template service, which does the policy translation. That's for foundational platforms. All of these are not easy, but it only needs to be done once, not
for every microservice you have. So let's talk about some of the lessons we had rolling this out. Localhost is too slow. We ran into this even in the case of the data broker service, because 1 to 2 milliseconds is fast, but it was a measurable latency increase to the overall service. So how can we speed this up more? One option is Topaz as a library. Topaz is open source. You can use the authorization logic from GitHub. And this model is great: you merge the two data plane components into one. So the policy decision point becomes a policy
enforcement point and you can hide this behind an authorization SDK. But it has downsides. Topaz is written in Go, so you have limited language support. One thing you could try is compiling to WebAssembly and embedding a Wasm runtime. If you try this, please let us know. We'd like to hear how it goes. But a simpler option on our end was to just add a cache. You can add this in the authorization SDK. Just set the cache expiry to less than the periodic refresh period and you'll have the same security properties. And this decreased our latency on almost
all requests from a couple of milliseconds to tens of microseconds. This is a decision cache: anytime we get something from Topaz, just cache it and have a reasonable expiry. Next lesson: cache authorization policies. You have the administration point and those periodic refreshes from the Topaz sidecar. Those are full policy refreshes and they're expensive, and they're almost always the same for each tenant. So there's no real need to call the database on every single request. So we just add a short-lived cache on that API, and on that API only, so you still get the freshness guarantees of the central evaluator. So,
for example, a 10-second refresh on the export cache would be fine if your sidecars are only refreshing every minute. Another issue we ran into: we had our applications configured to start after Topaz started. But when Topaz starts, it has to make a round trip to the administration point to download the policy data. So if the application has started and the policy data hasn't been loaded yet, Topaz is a sensible authorizer, so it denies everything. The solution to that is to wait until Topaz has received the policy data before even starting the application container. This is something that we were able to help
upstream. So we can now watch Topaz to see if the first initial sync has completed. But that fine. >> Okay. So it's great that you got buy-in from your security leadership to build this out, but how you drive adoption is the next hiccup that you may run into. Here are some of the things that we would recommend. Make it the default for services when applicable. Fletcher alluded to it previously. You should enable level one granularity, which is API and endpoint level, across all your microservices using that native mesh integration. For anything more granular, you would want
to hook into all the new projects that are coming in. For us, it's the application security review flow, where we understand what the project is and then recommend whether Guard would be a good fit for them or not. It's much easier for teams to integrate right from the get-go, and as the product evolves, their authorization model also evolves on Guard. Next comes how you go about targeting existing mature services: only target the risky services which do not have granular access control. This makes it very easy for you to convince them to onboard to Guard. Otherwise, it's going to be a very
high-effort project, and they wouldn't have motivation to move to a system which is not giving them enough benefits. The third thing that we would recommend is enabling it on the platform layer. When you enable it on the platform layer, for example in the cases that we talked about, mesh or Kubernetes, you're usually interfacing with one or two teams, and you get massive visibility into who has access to what across your fleet. So what are the key takeaways from this presentation? Let's talk about what this solution is achieving for our organization. First off, it's bringing us visibility. We
took a highly fragmented map of isolated silos and turned it into a single pane of glass for access visibility. This also means that security now has the ability to answer which identity has access to what across the different components of your infrastructure. Second, the effort that we are taking to standardize the interface for how access is provisioned and how access is evaluated opens up a bunch of new avenues to invest in. We can start investing in building a governance layer, where we have a standard way of requesting access to any service that's already onboarded to Guard, and we can build in the right governance primitives there to
define who the rightful owners of a resource are, and build the tooling for how access requests are initiated, be it by humans or, in the future, AI agents as well. It also enables us to apply the principle of least privilege through just-in-time access and granular access, because access is no longer defined in a file somewhere and is much more dynamic. You can auto-expire access and grant access when required. The third thing that we are moving towards is building tooling around removing unused permissions, because we know who has persistent access on this platform and whether that access was being used or not. We
can tie those two data sets together to figure out if we need to start removing these unused permissions. And last but not least, the thing that we're working towards is instant threat detection. Because authorization decisions are being emitted in a standardized format recording which identity tried to access which resource, we can start writing rules and dashboards that light up as soon as an identity's credential is compromised and someone is trying to probe what this identity has access to, so that we can immediately respond to that threat and remove that access. So the recap of the blueprint:
the hub-and-spoke model works. Distributed enforcement and centralized control is working for us. Do check out Topaz. It's a wonderful open-source project and it opens up clean primitives for enabling any of your access control needs. Use native evaluators for critical systems like mesh or Kubernetes or whatever orchestrator or networking system you use. Decoupling authorization logic from the application helps enable security to achieve their goals. Standardizing authorization primitives for developers enables better velocity and productivity, because they're not reinventing the same wheel time and again. That's it, guys. Thank you so much. Happy to answer any questions you
may have. >> All right. Thank you very much. We do have a few questions in Slido. So let's start with the first one. Who writes the policies, and how are they maintained across thousands of distributed microservices? >> So as we mentioned, the strategy we are going with is providing the hooks for level one granularity for all services. Level one granularity is whatever your orchestrator supports; for us, our orchestrator is Istio. So it's API level and API method level. They have a UI or a YAML file where they can come in and describe what kind of identities have access to what, and whoever the service owners are, they get
to manage that level one granularity. For level two granularity, where authorization policies go beyond what Istio can manage, it's again very flexible. A team can come in and say, hey, I'm building this new UI tool and I want to own all the policies. Great. Guard is a multi-tenant service where you can configure a tenant to have specific owners, and they get to manage access policies. Did you have anything to add? >> All right. Here's another one. >> All right. Which authentication is needed to make a central authorization system like Guard work? >> It's again agnostic on the
authentication. We don't have heavy-handedness on how you need to authenticate. As long as it goes through an authentication solution, an authentication layer, and the request comes along with the authenticated identity, that's what we use for authorization. For humans, ideally it would be your IdP, like Okta or Keycloak or whatever you use internally, and for machines it can be any authentication mechanism: on the less secure side, API keys, or on the more secure side, mTLS or JWTs. >> Thanks. All right, how do you encourage good policy hygiene as teams author their policies? >> That's a very good question. We haven't given a lot of thought right now
on that. We do give close recommendations as we see their use case, for example on the kinds of identities that we support from infosec. You can't come in and say, hey, I have this lunchbox identity. No, you can either have a human identity, an agent identity, or a service account. We give those recommendations, but over time it can drift, and we haven't run into that problem yet because we are still too early in our journey. >> All right. Do these policies apply to any sort of authorization request, only machine to machine, or
does it work with application users as well? >> Yeah, it's very flexible. Again, take a look at Topaz. The primitives it opens up are based on objects and relations to other objects. So here one object can be any identity and another object can be any resource. Did you want to add anything to that? >> Yeah, I mean, it's just that for the type fields you can specify arbitrary types, so the sky's the limit there. >> Yeah. That's why there are concerns about people shooting themselves in the foot again there. So we do make recommendations when we are doing that
initial design process with them. >> All right. To help prevent people from shooting themselves in the foot, what's the one thing you tell a team just getting started with this today? >> Do not build your own authorization solution. Please use the existing system. >> All right. Well, time for one more. What strategy did you use to adapt the new policies to mimic older auth patterns where needed? >> That's an amazing question again. Whenever we are trying to swap an existing system for a new one, we come up with a rollout strategy where we are running the existing authorization solution's decisions alongside Guard's authorization
solution and putting them up on a dashboard to see whenever there is a mismatch. Whenever there is a mismatch, we go in and see what exactly happened, address that gap, and keep going until the dashboard isn't showing any differences at all. And then we roll it out. >> Actually, let's do one more. How many microservices have you onboarded into Guard? >> Oh, that's a question that I don't know I can answer accurately here in a public setting. But nope, my leadership over here is saying I can't share that. >> I think a competitor asked that. >> All right. Well, great. Again, thank you for your time. We got these parting
gifts for you. Thank you, everybody. Round of applause. Thank you, guys.
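The rollout strategy from the last technical answer (run the existing authorizer and the new one side by side, surface every mismatch, and cut over only once none remain) can be sketched in a few lines of Python. This is an assumption-laden illustration, not the speakers' tooling: `shadow_compare`, `ready_to_cut_over`, and the (identity, action, resource) request shape are hypothetical, and a real deployment would feed a dashboard rather than return a list.

```python
from typing import Callable, Iterable, List, Tuple

Request = Tuple[str, str, str]            # (identity, action, resource); illustrative shape
Decider = Callable[[str, str, str], bool]
Mismatch = Tuple[Request, bool, bool]     # (request, legacy decision, candidate decision)


def shadow_compare(
    requests: Iterable[Request], legacy: Decider, candidate: Decider
) -> Tuple[int, List[Mismatch]]:
    """Evaluate each request with both authorizers and collect disagreements."""
    total = 0
    mismatches: List[Mismatch] = []
    for request in requests:
        old = legacy(*request)      # the decision that is still being enforced
        new = candidate(*request)   # the shadow decision, recorded but not enforced
        total += 1
        if old != new:
            mismatches.append((request, old, new))
    return total, mismatches


def ready_to_cut_over(total: int, mismatches: List[Mismatch]) -> bool:
    """Cut over only after traffic was observed and no decision differed."""
    return total > 0 and not mismatches
```

Each mismatch keeps both decisions, so a reviewer can tell whether the new policy is too permissive or too strict before closing the gap and rerunning the comparison.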