Identity Observability with Knowledge Graph: Beyond Monitoring to Enforcement

Name: Identity Observability with Knowledge Graph: Beyond Monitoring to Enforcement
Uploaded: 2026-04-29
Duration: 29 min 22 s
Description: Chen Xi explores identity observability at scale using knowledge graphs to address identity and access management challenges in large enterprises. The talk covers how to model complex permission structures, enable real-time enforcement decisions beyond simple group membership, and automate access re

BSides Seattle 2026 · 29:22 · 27 views · Published 2026-04

Speakers

Chen Xi

Tags

Category: Technical

Topic: Cloud IAM, Detection Engineering

Research: Case Studies and Incidents Analysis, Technical Deep-dives

Style: Talk

Mentioned in this talk

Platforms

Neo4j

About this talk

Chen Xi explores identity observability at scale using knowledge graphs to address identity and access management challenges in large enterprises. The talk covers how to model complex permission structures, enable real-time enforcement decisions beyond simple group membership, and automate access remediation based on actual usage patterns—moving from reactive monitoring to proactive, context-aware security.

Original YouTube description

Bsides Seattle February 27-27, 2026 lecture Presenter(s): Chen Xi

Transcript [en]

All right, let's get started. Thanks for joining this session. Today we're going to talk about identity observability with the knowledge graph. There's a small subtitle there, beyond monitoring to enforcement, because we're not just talking about monitoring but also enforcement. The first slide is about myself. My name is Chen Xi, a security engineer at Boston for just a month, and I'm also an author of a Kubernetes Secrets book. As you can see, my specific domain lies in secrets management, key management, local identity and infra security. One month ago, when I sent the proposal, I was still working at Uber. At Uber I built the identity

observability system and zero trust architectures. Before we get started, there's one disclaimer I have to mention here: all the numbers, metrics and data points here are just for education. Don't treat them very seriously, because these are fake numbers. I have to do that because I have already left Uber; I'm a former employee of Uber. Don't use the data points here to judge what Uber's systems look like; I won't confirm that. And also, all the opinions are my own. Here's today's agenda. We're going to quickly go through the problems and talk about the system we're building, and then, most importantly, we'll talk about the enforcement. The last part

is about the new challenges we're having. Oh wait. So the problem we have is the IAM challenge at scale. It's nothing about AI, nothing about threats. It's just the IAM system. The first point is that a modern IAM system always grows from the beginning. If you look at companies, when the company grows, the IAM system grows. Take Uber as a quick example. It had a very quick growth stage where we had a lot of new employees joining the company. Everybody comes into the company and gets assigned a task: hey, you need to fix this tomorrow. So they check the code

repo and look at different databases and internal tools. What is the quickest and easiest way to grant them permission? Groups. Yeah, we built the internal group system, and on the first day they join, we just assign them to the group and they automatically get the permissions. All the systems across the whole ecosystem just check: hey, does this person belong to this group? Great, now let's just let him access the resource. We're growing so fast, everybody needs to do their job, so let's do the quick and dirty thing. The second stage is frequent reorgs. We have managers, we have new people joining the team. The manager feels like, I don't want to

manually add people to the groups one by one. You need to give me a new feature: I want to put whole groups into other groups and automatically grant my whole team the permissions to unblock them as quickly as I can. That is what we define as birthright access, but there are other necessary groups that expand the birthright access to be bigger and bigger. Now the situation becomes this: a few years later we ask a question, hey, who can access what right now? Regarding these assets, who can access them? Every time we ask the question, it takes a bunch of

hops: go to different systems, ask them to aggregate the data and come back to us. So a few years later the IAM system has literally become deeply nested, and it's very, very hard to answer some very basic questions because the whole ecosystem has grown so big. Again, these are fake but simulated numbers. At Uber we have more than 80k users. These users are not drivers, nothing related to drivers; they're internal users, employees, and people who go to the hub, pick up the phone and answer the drivers' questions. These are Uber employees too. We have more than a thousand SaaS applications. And most importantly we

have more than 100k groups. The asset number is a fake one, but we have more than a million. And we also have to define assets here. The assets here include... oh wait, some technical issue. Is it back? I think there's some connection issue. Oh, okay, we're back. So, continuing with the assets: it's an abstract concept. The assets may include Hive tables, secrets, all the internal microservices, data assets and all of them, so definitely more than a million. But if you look at this scale, the IAM system we initially built really falls short. The first issue is the hidden blast radius, because we

cannot answer, or it's not an easy job to answer, the basic questions like who can access what. Also, every time we make some small change, a minimal group change, the impact cascades into thousands of assets: every time you make a small change to a group, people automatically get granted permissions to thousands of assets. The second part is there's no access visibility. We literally don't know who can access what. The last one is that we have a large security organization. Everybody wants to do some project and make an impact, but what exactly does the impact look like? We have to have some way to measure what the top priority is for

us: what is the data, what is the number you can solve? So this comes to the second part. We are trying to build a system, with data and a platform, to answer all the narratives of the IAM system we're building. But before that we need to answer the first-layer question: what kind of data do we have today? If you look at the internal systems, for users we have HR systems, we have employment types, when they joined the company, we have the org, we have the management chain. As an engineer, when I first entered the HR system, I

was super impressed that it's so complex. We also have the IDP system. Yes, we do have a centralized IDP system, but it integrates with different systems. We also have internal cluster resources and cloud resources, and the internal SSO system generates service accounts, a service kind of user, which are granted and inherit the group roles. Lastly, some smart people invented a system named the owning system. We wanted to make the system more generic and abstracted, so they map the groups to the internal assets. We have that data. We also have the pipeline where a bunch of teams try to classify the users based on

their employee types and define the risk scores. And then for all the assets we do a lot of work to classify the sensitivity of the data because it's a compliance requirement. It's a separate effort, but some team is driving it. Is all the data there? Yes, but we never got a chance to have a system to connect it together. That's the major topic for today. Based on the data, we put a bunch of effort into leveraging the graph DB to build a security knowledge graph. This is a pretty much oversimplified graph DB model. There are three core nodes: we have users, groups and

assets. Between these nodes, we build the edges. There are sample edges here: a user is a member of a group, but also an indirect member of a group, because groups can potentially be deeply nested and even cyclic. From the group to the assets, we check a bunch of systems. We build an edge like "has accessed" based on the audit log, and we also build an edge like "can access", showing which paths grant them the permission. Again, this is oversimplified. Internally we have more than 50 edges between the three nodes, including the permissions, policies and everything. After we built the system with the data and graph DB, now finally we can answer the very basic

questions with a query. Hey, how many users can reach these assets? Now it's one query and you get it. What is the blast radius if I change some group? What exactly is the impact? And the most important part: in a lot of cases we have to do a reverse lookup. Beyond who has access to a specific system, we look in reverse at what else he has access to: does he have access to the data systems, or just an internal database or other microservices? I probably won't go through the details of this slide, but this is a high-level comparison between the relational DB and a graph DB. In the IAM world, if you

try to answer the questions on the left side: with a relational DB, every time we need to ping the team and talk with them, and after the complex joins ask them, hey, can you run the query and get the result back to me, or export the result into an Excel spreadsheet or something and share it with us? But now with the graph DB we link all the data together, and it's much, much easier. Most importantly, I want to emphasize that it supports cyclic graph queries, it is bidirectional by default, and cascade analysis is not a big issue now. Oh, more importantly, for a relational DB, a join

is not fun; a join across multiple tables is always not fun. You have to trust the data engineers to always give you the correct answer, but the reality is, well, not good. All right, so we won't pause the story there. Yes, we have the observability, we can answer the questions. Is that enough? If you look at the IAM system, it's a house. It's a beautiful house, but it's a broken house. Because the company grew so fast, we have so many security efforts, but we also need to admit that the existing IAM system is a broken house. The permissions accumulated over years. Nobody knows the full access picture.

Changes happen. It works. Yeah, it works, but not great. Now, with observability, let's take a step back. If you have a broken house, what are you going to do? The first thought is, hey, let's try our best to fix this broken house. Whichever part is broken, let's fix it. But more importantly, how about we build a new house? It's a running system. It's easy to say let's build a new house, but it's not easy to do. What are you going to build? What does a new system look like? So identity observability fits into the gap in two parts. The first part helps you understand

the structural issues. Take nested groups as a quick example: deeply nested groups cause a lot of structural issues. And most importantly, the identity observability system helps you understand what the initial user intent is. If you look at the original question, a user joins the company: hey, I need the permission to do XYZ. That is the user intent. If you look at the user intent, there's nothing related to groups, nothing at all. So why does that happen? Because you want a quick and dirty way. Our identity observability system can help you print out the blueprint based on the user intent, to tell you, hey, for the

new architecture, how it looks, and what the existing data looks like. Oh, it's broken again. >> I don't know what's happening. They're back. >> Yeah, it's back. Okay. With that said, let's jump into the next slide, beyond observability. To me, I think this is the core part if you try to build a new house. We link the data, we put it into the graph DB, and we can do continuous queries. Now we can finally answer a bunch of questions. If you look at the life cycle of auth end to end, there are a few stages. The first stage is the pre-auth stage. The pre-auth stage is like, hey, a user wants to do something. That is the pre-auth stage, and with

existing data you can validate and say, hey, there are other ways you can do this. If we build a relationship from a user to the assets directly in the pre-auth stage, you don't have to go through a group. Or if you have a JIT or PAM solution, you can say, hey, we have the JIT solution for the target assets, so you don't have to go through the jungle of groups. The second part is during auth. Step back to the previous problem I mentioned: all the systems try to make it very simple. The simple question is whether this user belongs to this group or not. If yes, just let it go.
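
A minimal Python sketch of the group traversal behind that membership check, assuming hypothetical sample data. The names `MEMBER_OF`, `GRANTS`, and `access_paths` are illustrative and do not come from the talk. It walks nested group membership with a visited set, so cyclic nesting terminates, and returns the chain of groups through which a user reaches an asset:

```python
from collections import deque

# Hypothetical sample data; none of these names come from the talk.
MEMBER_OF = {                 # user or group -> groups it is a direct member of
    "alice": ["team-a"],
    "team-a": ["eng"],
    "eng": ["all-staff"],
    "all-staff": ["eng"],     # deliberate cycle: nested groups can loop
}
GRANTS = {                    # group -> assets it grants access to
    "eng": ["repo:payments"],
    "all-staff": ["wiki"],
}


def access_paths(user, asset):
    """For each group granting `asset`, return the shortest membership
    chain from `user` to that group, with the asset appended."""
    paths = []
    queue = deque([[user]])
    seen = {user}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if asset in GRANTS.get(node, []):
            paths.append(path + [asset])
        for group in MEMBER_OF.get(node, []):
            if group not in seen:      # visited set stops cycles from looping
                seen.add(group)
                queue.append(path + [group])
    return paths


if __name__ == "__main__":
    for p in access_paths("alice", "wiki"):
        print(" -> ".join(p))
```

A chain with several hops is the kind of path a pre-auth check could flag as a candidate for a direct or JIT grant instead; that policy choice is left out of the sketch.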

With all the existing data we have, we can patch, we can do real-time blocking at the choke point. Identity observability can support the during-auth stage so that it's not just that you have to belong to a group, but also who you are and what specific user attributes you must have to access the target assets. It's a runtime [guardrail] system you can build for the during-auth stage. After the first two stages, there's the after-auth stage. The auth has already happened; the IAM auth is already done. Now you get an audit log there, and you can do the detection. But if you look at the detection, there's another

problem: ID aggregation. I didn't specifically mark the data here, but imagine a specific user. If you look at the internal systems, how many IDs can map to one guy named Chen Xi? It's very easy to say there are more than eight or ten. I have my username. I have my employee ID. I have my email. I have my SPIFFE ID, UUID, plus, you know, did I mention the SPIFFE ID? Yes, we have service accounts, all those things. All the IDs actually map to one person. In the detection stage, one of the challenges they have is, hey, I look at

some logs, and these logs show me some ID. I don't know exactly who they are. So one cool thing we did with the knowledge graph is we actually export the IDs into a separate table and tell the detection team, hey, just query it and enrich all the data into one format. It's very easy for them to query and define rules. And there's one part I intentionally skipped: automatic remediation. There are two kinds of data I mentioned before, can access and has accessed. The percentage number is around 1%. What does that mean? If you look at the users within the

group, what the users really touch in the system is only 1% of the permissions they have. So that means more than 90% of users can be removed from the group, even the group that granted them the permission. So we actually built an automatic remediation engine that periodically checks the data and says, hey, these people need to be removed from this group. This is a very quick case study of how we use graph-driven access reduction. As I mentioned, we purely check the access log and find the people who haven't accessed the system, and also their peers, because we have all the data now and we know who their peers on the same team are. We also look

at the logs from their peers and determine, say, hey, you have never accessed the target assets within the last 19 days, so we think you don't need that permission anymore. So we run the engine and remove the people from the group. How much time? Okay, three minutes. Oh, we can talk a little bit more. So another thing is the risk score. Oh my god. I don't know. Give me one second. Let me see if I can fix it. Hey, please come back. Okay, it's come back. So I mentioned the unused permissions. It's very, very useful because we have a broken house, but we have to keep it running. So we

continue to do the remediation, but we also have the risk scores mapped to the user. If we find a user frequently uses the system to talk to some assets his peers never access, or this user has some specific attribute, we adjust the risk score a little bit. Then, as a loop coming back, during runtime we challenge them with the JIT and PAM system and ask them, hey, can you ask your manager to quickly approve, or can you quickly give us a Jira ticket? Then we allow you into the system. Yeah, it's a lot of fun using the data to p[ower]

the different cases. In the last slides I'm going to quickly talk about AI. If you look at the system we're building, we're pretty much focused on the user and the group, but there are new challenges with AI. There's a typical question, shadow AI. There are a bunch of AI tools that try to get permissions granted by employees; they just get OAuth-granted permissions, beyond a security review. If you look at the data, it's just, hey, a user is frequently talking to the target system. We don't know whether it's an AI agent or not. So there's a bunch of effort to expand

the graph to cover more cases. We collect the data and set the boundary to say, hey, this is pretty sensitive data, and we don't want these people to access it. We also loop back and map the data flow paths, and from the [unclear] log check, hey, why are you frequently talking to this system? Oh, you're an AI. Then we circle back, automatically tag it and say this identity belongs to an AI. Long story short, if you think about the graph nodes, we're actively expanding the user node to the AI agent node. And again, the last part:

all the effort is not just for observability, and we shouldn't pause there. We invented a system named context-aware enforcement where, during runtime, we aggregate the risk scores, all the behaviors and attributes, and try to give the requester a little bit of a hard time, even if it is an AI. All right, I'm about at time, so I'll try to summarize the key takeaways. As you can see, it's a lot of fun to build a graph DB for a modern IAM system. Simply three nodes and more than 80 edges can answer a lot of questions a traditional relational DB cannot answer, especially if you have larger-scale systems across different departments that aggregate the data back to you.

The second part is that the system should not pause at observe only. Enforcement is key: you have to leverage the data during the access path and give either people or AI agents a little bit of a hard time. The third part is there are a lot of business decisions you can leverage data to answer: which one is important, which one has high risk, which one we should address. The last part is, if you look at the whole IAM system, you feel like it's not good, but it's a running system. What should we do? Build a new house. But how?

You need data to answer questions like what the initial user intent looks like, and to be the blueprint for the new house. Last part, very quick comments: it's not an easy job. AI agents introduce new challenges, but we're exploring the usage of the graph DB. All right, that's all for today. Again, thank you for joining this session. Feel free to leave your questions and comments. >> Oh, I saw one question. >> Yeah. A little bit of an implementation question, for your runtime blocking enforcement or your continuous monitoring. >> Yeah. >> Are you going back all the time to run a slower multi-hop graph query

or are you normalizing to a relational schema or something else like that? >> Excellent question. We actually do both, depending on the usage. There are different runtimes and different latency requirements. Let's take a quick example. If you're someone who is still happy getting a response in less than 500 milliseconds, then in the graph DB layer we have the cache. We verified that most queries can hit the cache and give you a result back in around 100 milliseconds, so you're happy. But there are still cases where you're not happy; you want to run as fast as you can. Yes,

you're right. We actually normalize the queries and export the data into a relational DB. This kind of data is pretty much static data, for example the UUID and user attributes. They don't change frequently, because your upstream IDP system is changing them every 24 hours, right? So every 12 hours we refresh: run the query again and export into a place. >> Okay. >> Yeah. >> Thank you. >> Yeah. New question. >> So first off, I was being kind of disappointed with the talk, because you saved it to the end, because you magically sprinkled some of that magic

AI dust at the end. So thank you for recovering that. >> Yeah. >> But the question is, when you're making the changes, when you're recommending changes, automated or reviewing parts of the house, >> Yeah. >> are you looking at the edges of the graph and saying, here we can do things which have less impact and not as many dependencies, and then building that up over time? Or are you trying to change things at the center of the graph, which have a lot more dependencies, but that's probably the biggest bang you can get? How do you determine when to start making changes at the center of the

graph? >> Yeah, excellent question. I hope I can answer this with all the details, but the short answer is we are pretty much focused on the edges, not the nodes. Let me think about how to answer more, because there are a lot of user scenarios we have to simulate and try to build relationships for. We don't want to touch the nodes too much. In fact, as I mentioned, we need to expand the nodes in very limited cases, but for most cases we still want to

build relationships with edges. Yeah, because they're also related to the query. You don't want to jump multiple hops to get results. >> Go ahead. >> Hi. First off, I like the amount of AI in your talk. I think that was nice, a little bit at the end there. But my real question is, as you were working on this graph, did you start to have interactions with other teams? For example, I don't know if there's a team specifically who wanted to do things like look at it and say, oh, this user was compromised, or this database was compromised, and we want to understand the

impact of this database being compromised. But then you also need to know something about what is in the database and deal with that kind of data tagging. How did that go? >> Yeah, this is an excellent question. Actually, there's a small story. Before we launched the project, we already had an effort from the internal SR team, because these are exactly the questions they care about. Every time, they're suffering, because they need to talk with, for example, the database team, and for all the microservices they need to talk with the compute team, and they need to understand the whole architecture. It's very painful. I would say

most of the teams are security folks. They're security folks, they're not infra folks. The software engineers know the enforcement part, how the existing IAM system works, but every time we have to be the bridge, the software engineers working in the security space talking with the different teams. So first, yes, you're right: this project was initially driven by the red team, because we wanted to draw the attack graph. But after we built the system we went beyond that, because we want to use the data to power the runtime [guardrail]. So, I hope I answered your question. >> Yeah.
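
The impact questions raised in this exchange (who can reach a compromised database, and what else those identities can reach) map to a reverse and a forward traversal over the same edges. A minimal Python sketch, assuming hypothetical edge lists and helper names that are not from the talk, and ignoring direct user-to-asset grants for brevity:

```python
from collections import defaultdict

# Hypothetical edges, for illustration only.
MEMBER_OF = [("alice", "db-admins"), ("bob", "analysts"), ("analysts", "db-admins")]
CAN_ACCESS = [("db-admins", "db:orders"), ("analysts", "dash:revenue")]


def _index(edges, reverse=False):
    """Build an adjacency map, optionally with every edge flipped."""
    idx = defaultdict(set)
    for src, dst in edges:
        if reverse:
            src, dst = dst, src
        idx[src].add(dst)
    return idx


def _reach(start, idx):
    """All nodes reachable from `start` by following `idx` (cycle-safe)."""
    seen, stack = set(), [start]
    while stack:
        for nxt in idx.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def who_can_access(asset):
    """Blast radius of a compromised asset: users who can reach it via groups."""
    granting = {g for g, a in CAN_ACCESS if a == asset}
    members_of = _index(MEMBER_OF, reverse=True)   # group -> direct members
    principals = set()
    for group in granting:
        principals |= _reach(group, members_of)
    all_groups = {g for _, g in MEMBER_OF} | {g for g, _ in CAN_ACCESS}
    return principals - all_groups                 # drop intermediate groups


def what_else(user):
    """Reverse lookup: every asset a (possibly compromised) user can reach."""
    groups = _reach(user, _index(MEMBER_OF))
    return {a for g, a in CAN_ACCESS if g in groups}


if __name__ == "__main__":
    print(sorted(who_can_access("db:orders")))
    print(sorted(what_else("bob")))
```

Both questions use one edge set traversed in opposite directions, which is the bidirectional property the talk attributes to the graph model.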

>> Yeah. So the question is, ultimately the end goal is to have that graph DB as the policy decision point, while you have as many policy enforcement points across your infrastructure as possible, so that the graph eventually becomes your source of truth for policy decisions, while every critical asset is protected by a policy enforcement point that would query that policy decision point for up-to-date data. Is that... >> So, very good question. To be honest with you, my personal opinion is the graph DB today cannot be the only source of truth. The reason is that the bottom pipeline gets

the data from the upstream IAM data; it's not real-time data from the data source. It helps a lot in the risk analysis. But if you think about it, there's a small circle that sets the boundaries for a lot of assets, to say, hey, you cannot touch this space. For this part, I would say we still stick to static policy. But beyond that there's a larger circle, to say, hey, I think I need to grant you permission quick and easy, and also sometimes challenge you based on risk. Then use the graph. Thank you.
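
The two circles described in this answer can be sketched as one decision function: a small set of assets governed only by static policy, and everything else decided from group membership plus a graph-derived risk score. This is a minimal Python sketch under stated assumptions. The asset names, the allowlist, both thresholds, and the extra penalty for AI agents are hypothetical; the talk gives no concrete values.

```python
from dataclasses import dataclass

# Hypothetical values; the talk gives no concrete thresholds or asset names.
STATIC_PROTECTED = {"secrets:prod-signing-key"}            # inner circle
STATIC_ALLOWLIST = {("carol", "secrets:prod-signing-key")}
CHALLENGE_AT = 0.5   # at or above this score, ask for a JIT/PAM approval
DENY_AT = 0.9        # at or above this score, block outright
AI_AGENT_PENALTY = 0.2   # assumed bump so AI agents get "a harder time"


@dataclass
class Request:
    principal: str
    asset: str
    in_granting_group: bool   # the classic "does the user belong to the group" check
    risk_score: float         # 0.0 (low) to 1.0 (high), assumed precomputed from the graph
    is_ai_agent: bool = False


def decide(req: Request) -> str:
    """Return "allow", "challenge", or "deny" for one access request."""
    # Inner circle: static policy only, the graph is not consulted.
    if req.asset in STATIC_PROTECTED:
        allowed = (req.principal, req.asset) in STATIC_ALLOWLIST
        return "allow" if allowed else "deny"
    # Outer circle: membership is necessary but no longer sufficient.
    if not req.in_granting_group:
        return "deny"
    score = min(1.0, req.risk_score + (AI_AGENT_PENALTY if req.is_ai_agent else 0.0))
    if score >= DENY_AT:
        return "deny"
    if score >= CHALLENGE_AT:
        return "challenge"
    return "allow"


if __name__ == "__main__":
    print(decide(Request("dave", "repo:payments", True, 0.4, is_ai_agent=True)))
```

Keeping the static check first means stale graph data can never widen access to the protected assets, which matches the concern that the graph is not fed in real time.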
