AiIAM: Transforming the Democratized AWS IAM Architecture with LLMs

BSidesSF · 2024 · 28:31 · 181 views · Published 2024-07
Category: Technical
Style: Talk
About this talk
Anthony Scheller, Jorge L Gomez
Watch how AiIM (AI Identity Management) simplifies the principle of least privilege by using a developer-first service leveraging LLMs to automate AWS IAM policy generation. By empowering developers and following a democratized AWS IAM strategy, you too can say goodbye to manual security reviews. https://bsidessf2024.sched.com/event/fb1e2c406f358a51ddfcad03aedbba7b
Transcript [en]

Right now we want to talk about Vapor Lock. We've got Anthony Scheller and Jorge Gomez. Anthony, you're head of security engineering with StubHub, and Jorge, you're with Twilio. What is your role with Twilio? I'm a security engineer at Twilio. Perfect, excellent. Folks, if you have any questions for the fellas as they're talking, go ahead and drop them into Slido. I trust that now, on the second day, you folks have already gotten pretty accustomed to the app and you've got the website up and running. If not, just go over to bsidessf.org/qna (that's Q, N as in Nancy, A) and you can drop your question in there. If we've got

time, we'll have the fellas answer it; I'll ask the question on your behalf. Following this, we'll get you folks on to the next session, but more importantly get you on to lunch. So with that, fellas, why don't you take it away for us? Cool. Hello. All right, so this is Vapor Lock, but it was formerly known as AiIAM, so you're in the right place if you came for AiIAM. Today we'll be discussing how we've leveraged large language models and control plane log analysis to address the principle of least privilege problem with service-to-service identity and

access management policy creation. Going over the agenda: we'll start out with some quick introductions, and then we'll move to an identity and access management recap, what it is and why it's important. From there we'll talk about how we go about navigating the challenges of governing identity and access management at scale. Then we'll talk about the cloud security maturity progression in large enterprises and what that looks like, and that'll lead us into Vapor Lock: what it is, how it works, and its architecture, followed by a quick demo. Then we'll go into some improvements and Q&A. So,

a little bit about me. My name is Anthony Scheller. I've been in some form of security engineering for a little over a decade now. I started off my career in offensive security testing, doing consulting at PwC. From there I went over to Hulu and built out the cloud and network security engineering function over there, and then more recently worked with Jorge over here at Twilio as a staff engineer on the cloud security team. I'm currently leading security engineering over at StubHub. Now I'll pass it over to my colleague Jorge for his intro, and we'll get into the meat of why we're

here. Great, thank you. Good afternoon, my name is Jorge Gomez. I'm a staff security engineer and tech lead in Twilio's cloud security engineering department. I've operated as a security engineer or architect for the last 15 years in different industries, but primarily in energy tech and cloud infrastructure. I met Tony about a year ago, and we had really interesting conversations about how overly complex IAM policies can be, specifically in cloud service providers like AWS and GCP. I recall performing an audit on a production account where I saw an IAM policy that had a star for actions, permissions, and resources, and I questioned

the lead engineer about it, and he simply said it was just easier to get it done this way and he couldn't figure it out any other way. So it's no wonder misconfigurations still remain the top vector for data breaches in cloud service providers, and these complexities are really the reason why we've seen a lot of those high-profile data breaches, like Home Depot, Deo, eBay, etc. So Tony and I brainstormed different ideas on how we could solve this problem and make it easier for engineers to create their own IAM policies following the principle of least privilege. And this is really how we built Vapor

Lock. Before we jump into solutioning, let's take a step back and look at some of the challenges behind IAM policies, and let's start by defining what IAM is. IAM, identity and access management, really focuses on three core areas: authentication, authorization, and auditing. Authentication deals with the identity, the principal, and tries to validate that they are who they say they are. Authorization checks permissions for that principal to ensure that they have the right access, the right permission sets. And auditing keeps track of those events to ensure data integrity. Now, if we start focusing on IAM policies in particular, a policy really just creates that

ledger, the description of what that resource or identity should have permissions to do. In simplest terms, it describes who can access what. Complexity starts to arise when you add multiple permissions along with the growing number of available services in AWS and GCP, then add the conditional statements available for each one of those, and you're also trying to meet compliance requirements to restrict blast radius. You end up with a lot of permutations, a lot of combinations of different permission sets that you could apply on any given

resource, so it does leave a lot of room for error. Strategies to help with this really fall into two camps: centralized and democratized. In a centralized approach, there are a few teams, and usually just one, responsible for reviewing and approving these policies to ensure the principle of least privilege is being followed. Generally this burden falls on the security team, but we've also seen platform engineering teams take on this role; it really just depends on the industry. The tools they generally use to help simplify this end up being things like Terraform or some other infrastructure-as-code module.
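
To make the "star for actions, star for resources" audit finding from earlier in the talk concrete, here is a minimal Python sketch that contrasts that kind of policy with a scoped one and flags the wildcard statements. It is an illustration only, not part of the speakers' tool: the bucket name and the `find_wildcard_statements` helper are hypothetical, and the check only looks for a bare `*`, not partial wildcards such as `s3:*`.

```python
# The kind of policy described in the audit story: any action on any resource.
WILDCARD_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}

# A scoped alternative (the bucket name is hypothetical).
SCOPED_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}


def _as_list(value):
    """IAM allows a single string or a list for Action/Resource/Statement."""
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(policy):
    """Return indexes of Allow statements whose Action or Resource is a bare '*'."""
    flagged = []
    for index, stmt in enumerate(_as_list(policy["Statement"])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        if "*" in actions or "*" in resources:
            flagged.append(index)
    return flagged


if __name__ == "__main__":
    print(find_wildcard_statements(WILDCARD_POLICY))
    print(find_wildcard_statements(SCOPED_POLICY))
```

A check like this only catches the most blatant case; the permutations of services, actions, and conditions described above are what make real reviews hard.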

But ultimately you still have that single point of failure, because you have that central team that has to review this and ensure it's right. On the plus side, you do have standardization; there's a benefit to having a standardized process and approach, but it comes at the expense of developer velocity. The second strategy is democratized. In this approach, engineers are empowered to build and really own their own services, which should include IAM policies. This model does increase developer velocity, but it requires discipline from teams to ensure that they're following

the principle of least privilege. General strategies here involve creating permission boundaries or reusable infrastructure-as-code modules. The problem here is that not all engineering teams are created equal. You may have some strong engineering teams with really strong security practices, but then you have weaker teams that may not have that same maturity. So from a security baseline perspective it's just not consistent, and you don't have that standardization. It also requires constant care, with the ever-changing landscape of new services being offered by the different cloud service providers. I wonder, Tony, in your experience as a

consultant, what would you say organizations typically adopt? Yeah, so in my experience, without fail, organizations that are poised for expansion and always looking to move really tend to move towards a democratized approach. Let's take a look at why that is. Organizations that are scaling are hiring more teams to build more features quickly. Unless you're scaling those core platform and security teams at the same rate as your developer teams, centralized IAM governance and infra governance just won't work, and as we know, that's pretty much the case. So, as

security practitioners, we really need to architect the way we deploy our controls so that it covers this decentralized model. With that said, organizations with a mature cloud security posture have generally implemented the processes and technologies below to do just this. We start with the foundations: infrastructure-as-code pipelines. That's the principle of moving all of your infrastructure management, creation, and modification through infrastructure-as-code pipelines and putting it all into code. There are a couple of different benefits from a security perspective, one

of which is that you have repeatable, maintainable infrastructure that's all checked in to your version control system. This also allows security teams to embed security-by-default controls into those reusable templates. Once you do that, it paves the way for security teams to start implementing policy-as-code checks. If you're using Terraform, it's things like HashiCorp Sentinel, or cfn-guard, Conftest, and OPA. From there, that allows security teams to check that

infrastructure as it's moving through these infrastructure deployment pipelines: does it meet certain criteria? As an example, let's say you have an IAM role module in Terraform, and you put in some explicit permission boundaries and some explicit denies, but you still want developers to be able to create their own identity and access management policies. You can put that through the infrastructure-as-code pipeline and have a policy-as-code check. You might still do a review, but you might not care as much as long as they're using this specific

infrastructure-as-code or Terraform module. Building on different layers like this allows security teams to do that. The next item here is organization- or tenant-layer IAM protection. This is most commonly aligned to AWS's service control policy model, where at the tenant or organization level you can define certain deny policies, for example, and say that no matter what permissions different principals have in these child accounts, they still can't do X, whatever that may be.
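
The organization-level deny idea can be sketched in a few lines of Python. This is a deliberately simplified model, not AWS's actual evaluation logic: it assumes an allow-everything SCP is also attached at the organization level, ignores resources and conditions, and uses a hypothetical deny list and role policy. The `is_allowed` helper is an illustrative name.

```python
from fnmatch import fnmatchcase

# Hypothetical organization-level deny policy, written in SCP style.
SCP = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyCloudTrailTampering",
            "Effect": "Deny",
            "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
            "Resource": "*",
        }
    ],
}

# Hypothetical identity policy attached to a role in a child account.
ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "cloudtrail:*", "Resource": "*"}
    ],
}


def _as_list(value):
    return value if isinstance(value, list) else [value]


def _statement_matches(stmt, action):
    """IAM action names are case-insensitive and may contain '*' wildcards."""
    return any(
        fnmatchcase(action.lower(), pattern.lower())
        for pattern in _as_list(stmt.get("Action", []))
    )


def is_allowed(action, identity_policy, scp):
    """Simplified check: an SCP deny always wins over an identity allow."""
    for stmt in _as_list(scp["Statement"]):
        if stmt["Effect"] == "Deny" and _statement_matches(stmt, action):
            return False
    return any(
        stmt["Effect"] == "Allow" and _statement_matches(stmt, action)
        for stmt in _as_list(identity_policy["Statement"])
    )


if __name__ == "__main__":
    for action in ("cloudtrail:StopLogging", "cloudtrail:LookupEvents"):
        print(action, is_allowed(action, ROLE_POLICY, SCP))
```

Even though the role grants `cloudtrail:*`, the deny at the organization layer blocks the tampering actions, which is the "you still can't do X" guarantee described above.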

Security teams, going further through the list, will also do routine permission audits with cloud service provider control plane analysis. Then there's developer education. As we know, in this decentralized model where you're moving infrastructure management and creation out to the developers, having them understand the gravity of the configurations and infrastructure they're creating, and making sure it's secure, is critical. And lastly, SaaS cloud security tooling: that's your Wiz, your Orca, your Prisma Cloud, and all the other flavors. You're using those as

cloud security posture management tools, so looking through your environment: do I have an externally exposed S3 bucket? Do I have an EC2 instance with a security group wide open to the world? They're really good at checking those things. The whole industry in this space has also added functionality that looks at identities: IAM permissions that are overly permissive based on what the principal has been using in the last couple of months, and even IAM roles and policies that haven't been used at all. So, Tony, I have a

question. It sounds like some SaaS cloud security tools like CSPMs are already doing this, looking at overly permissive IAM policies. So why do we need Vapor Lock? Yeah, great question. You should use these SaaS products if you can afford them and can integrate them into your environment. They're great for the reactive approach, after the principals have already been deployed and the infrastructure is already working and leveraging those roles and IAM users (hopefully not users). It helps out a lot with that

reactive approach. But we feel there's one opportunity that is not being taken advantage of, and that's the shift-left approach: allowing developers to create better IAM policies to begin with. That's how Vapor Lock came about. There are a lot of good reactive approaches out there, like your SaaS security models and even manual reviews of your CloudTrail logs and other control plane logs. Why don't we provide something that allows developers, in a

self-service manner, to actually generate better policies before deployment? So this is more of a complement to a CSPM rather than a replacement, would you say? Exactly, exactly. So let's take a look at Vapor Lock. What is it? Vapor Lock is a standalone open source project that you can deploy via Helm on your Kubernetes cluster, or on your favorite container orchestration platform using Docker Compose. You can expose this service, the UI and the API, to developers in order to generate new IAM policies and right-size existing ones based on natural language descriptors, and we'll get into

why this is important later on. Like I mentioned, some of the core features: it's a self-service platform, enabling the developer to stay in the driver's seat as they're creating their infrastructure, and platform engineers can use the API to integrate these processes into their CI pipelines. It allows for net-new policy generation, which is the proactive approach we're after; this leverages our large language model permission generation workflow. And we still keep policy right-sizing, which is the reactive

approach; it's still very important. There's also no magic bullet when it comes to generating policies to begin with, unless you're looking at exactly all the documentation, which we know is not always the case. From there, as we're returning these IAM policies, you can choose different formats. JSON is the main default one; we also support HCL (Terraform), and if you want to take a more AWS-centric approach, we also return these policies as CloudFormation templates. And so, kind of the last item

here: as we were building it, we wanted to make sure that it was extensible, so that as we get community involvement, it takes a relatively low level of effort to extend this and add additional services to this IAM service catalog. And as we know, good technology won't be adopted unless it's easy to use, so I'll hand it over to Jorge and he'll tell you how to use it. Yeah, that's a really good point, Tony. So it's really

important that we make a solution that's extremely developer friendly. The way we approached this was to create a cloud-native solution that has both a front end and a back end. The front end is written in Next.js and is really easy to use. As you see here, the dashboard just has two services that we've exposed right now, but the idea is that this dashboard would become a service catalog of different IAM services to help you with IAM in general, whether it's right-sizing a policy or creating a brand new one from scratch. Now, if you click into that first one, that create

policy, you end up with this web form here. It's a really easy-to-use web form: you provide a name, and you select the AWS service that you need, the source service, from a drop-down of available services. You're also presented with a drop-down of destination services; you can add as many as you need. You can select the permissions that you need as well, and those are easily understandable natural language permissions: read, describe, write, admin. It's fairly obvious, and it determines behind the scenes what permissions you're actually going to need. You can add as many as you

need with the plus button that you see on the left side, or you can toggle the Amazon Resource Name option to provide an actual ARN to be more precise and hone in on your exact AWS resource. From there, you just hit the generate policy button; it's the blue one there, can't miss it. It then makes calls to OpenAI and starts creating that least-privilege policy based on the parameters you've provided. By default it's going to provide this in JSON, so you can grab that JSON, copy it, and apply it to your AWS resource. We also support

Terraform; we're big fans of infrastructure as code, so you can output in Terraform style and apply it with Terraform Cloud if you're using that. Or, if you prefer something more AWS native, we also support CloudFormation. The other form is a little simpler: you're only providing an Amazon Resource Name, just the ARN, and you hit the generate policy button. In doing so, it starts to query Access Analyzer and actually looks at CloudTrail logs for the last 90 days to see which permissions that resource has actually used and the ones that it

hasn't. As it identifies the permissions that haven't been used, it eliminates them; they get stripped away. So what gets returned ends up being the permissions that have effectively been used, following that principle of least privilege that we're all after. From there you can copy that JSON policy and apply it to your AWS resource, and like before, we support Terraform and CloudFormation. Like I said before, we wanted something that was really developer friendly, so we do have the front end that you saw there. Engineers who feel more comfortable with that front end can use the

form; it's really easy to use, and we've made it as dummy-proof as possible. But if you're a more advanced engineer and you want to use APIs, or do something more programmatic using CI, for example, we do offer APIs, and they're all documented through Swagger docs. What you're looking at here is the first endpoint, create policy, which aligns to that first form you saw. The schema is really simple: you provide the source AWS service, the destination AWS service or the ARN depending on your need, and then the actions that you need. From there you indicate the output format, and then

you just hit submit; it's a POST request. What gets returned should be an IAM policy in the format you requested: JSON, Terraform, or CloudFormation. That's pretty much it for that endpoint. We also have the other endpoint, request right-size permissions. It's even simpler, just the ARN and the output format, but instead of getting the IAM policy back, it returns a request ID. Like I mentioned before, we're making calls to CloudTrail and looking at the last 90 days to see which permissions were effectively used, so that process can take somewhere between

30 seconds and a couple of minutes, depending on how large those logs are. So you get that request ID, which you can use with the third endpoint, a GET for right-size permissions, to query whether the job is complete and then get back the actual IAM policy in the format you specified. So that's a look at the front end and the back end, but let's dig a little deeper into the architecture and the sequence diagram, and I'll let Tony speak to that. Awesome, thank you. So yeah, Vapor Lock is comprised of a couple

different components. If you look at the top here, you'll see Vapor Lock, which is the Next.js front end and API router. From there we have the Vapor Lock backend API, which is used to orchestrate the respective processes going on in the back end. The third component exists because of the asynchronous jobs we need to run during some of these processes: we're using Celery in the back end with a worker node, and then a Redis cache as a kind of state manager back end.
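
The request-ID-then-poll pattern described for the right-sizing endpoints can be sketched with the standard library alone. This is not the project's code: a thread pool stands in for the Celery worker, a plain dict stands in for the Redis state store, and every function name here (`submit_right_size`, `get_right_size`, `poll_until_complete`) is hypothetical. The analysis step is a placeholder that returns an empty policy.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=2)  # stand-in for the Celery worker
_results = {}                                  # stand-in for the Redis state store


def _analyze_usage(principal_arn):
    """Placeholder for the slow log analysis; returns an empty policy."""
    time.sleep(0.1)
    return {"Version": "2012-10-17", "Statement": []}


def _run_job(request_id, principal_arn):
    policy = _analyze_usage(principal_arn)
    _results[request_id] = {"status": "complete", "policy": policy}


def submit_right_size(principal_arn):
    """Register a job and immediately hand back a request ID."""
    request_id = str(uuid.uuid4())
    _results[request_id] = {"status": "pending", "policy": None}
    _executor.submit(_run_job, request_id, principal_arn)
    return request_id


def get_right_size(request_id):
    """Look up the job state, as the polling endpoint would."""
    return _results.get(request_id, {"status": "unknown", "policy": None})


def poll_until_complete(request_id, interval=0.05, timeout=5.0):
    """Client-side loop: poll until the job finishes or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = get_right_size(request_id)
        if result["status"] == "complete":
            return result["policy"]
        time.sleep(interval)
    raise TimeoutError(f"job {request_id} did not finish in {timeout}s")


if __name__ == "__main__":
    rid = submit_right_size("arn:aws:iam::123456789012:role/example-role")
    print(poll_until_complete(rid))
```

Returning an ID right away keeps the HTTP request short even when the analysis takes minutes, which is the reason given in the talk for making this flow asynchronous.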

So if we look at the top flow here, under create policy, this is generating right-sized policies using large language models. As a client, you can see all the way on the top left, if you're using the UI you'll make a GET request directly to the front end; otherwise you'll use the schema Jorge was talking about earlier and make a POST request directly to the API. What we're doing under the hood is, once we get the parameters you've sent us, we interpolate a prompt that we send over to OpenAI.
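
The talk does not show the actual prompt, so the following is only a guess at what "interpolating a prompt" from the form parameters could look like. The template wording, the `build_prompt` name, and the parameter shapes are all assumptions made for illustration; no call to OpenAI is made here.

```python
from string import Template

# Hypothetical prompt template; the real wording used by the tool is not shown in the talk.
PROMPT_TEMPLATE = Template(
    "Generate a least-privilege AWS IAM policy as JSON.\n"
    "Source service: $source\n"
    "Required access:\n"
    "$access\n"
    "Return only the policy document."
)


def build_prompt(source_service, access_requests):
    """Fill the template from the form fields.

    access_requests is a list of (destination_service, permission_level)
    pairs, e.g. [("sns", "list"), ("sns", "subscribe")].
    """
    lines = [f"- {level} on {service}" for service, level in access_requests]
    return PROMPT_TEMPLATE.substitute(source=source_service, access="\n".join(lines))


if __name__ == "__main__":
    print(build_prompt("lambda", [("sns", "list"), ("sns", "subscribe")]))
```

Keeping the prompt as a fixed template with a small set of substituted fields means the user never writes free-form text to the model, only picks services and permission levels.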

We get back a generated identity and access management policy. Because we wanted to take care of some of the shortcomings of large language models, specifically around hallucinations, we take that IAM policy document and send it over to AWS IAM Access Analyzer to check whether those permissions are indeed valid. If they're not, when we get it back we scrub those particular actions out of the policy. We do some additional processing, put it in the format that

you desire, and then send that back to the client. If you look at the second part here, the right-sizing flow, this is the more deterministic way of creating a principle-of-least-privilege policy. The same principle applies depending on how you're accessing Vapor Lock, whether through the UI or the Vapor Lock API directly. From there, though, it gets interesting, because we take that principal ARN and send it to our Celery worker node. That worker node registers the job; once it registers the job, we create a unique identifier for that job and we send

it back to the client so that you can continue polling on that third API Jorge was talking about. In the front end we take care of that for you, for a better user experience. That worker node then makes the request to IAM Access Analyzer to generate the new IAM policy via CloudTrail log analysis. Once it's done, it puts that policy into Redis, and as the polling goes on, the Vapor Lock API reaches back into the Redis back end to grab that IAM

policy, put it into the format that you want, and send it back to the client. That was a lot to go through; I'm sure we'll get some questions. Now let's take a look at the demo. This is basically everything Jorge showed you earlier, and we'll talk through it. This is the homepage: you have create policy and right-size policy. In this first part of the demo, we're going to go to the create policy

section, which uses that large language model workflow that we have. From there you're presented with the same form that Jorge presented. You put in all the information you want, such as your favorite policy name and the source service; in this case Lambda, EC2, what have you, anything that can have a principal, really. Then, using those natural language descriptors, you pick what you want; in this case I believe it's list and subscribe. I think we're going to do it against an SNS topic. And in this

particular case, if you notice, right underneath you'll see the ARN toggle. If you don't supply an ARN, we just use the service name, and on the right-hand side, as you'll see in a second, we put in an ARN placeholder so that you can put your ARN back in if you don't want to provide it. If you do provide it, we extract the service names from it and then conduct the same processes. You can see that here: this is in JSON, then you can see an output going into Terraform, and then the same thing going

through the CloudFormation templates; this one returns a YAML file. After that we'll go back home and go to the right-size policy section, which is just providing that simple principal ARN. I know I'm going to get a question about it: that AWS account no longer exists, so don't get any ideas. Then the background processes occur, and we have that policy returned directly from there. We also have light and dark themes, which we thought was extremely critical for an MVP, for obvious

reasons. And then here we have the Swagger docs that were generated as part of the FastAPI framework we're using for the back end, so we have all those schemas there. But yeah, that's the demo, and we'll move on from there. OK, some future improvements. One of the things we already know we want to do, to make the policy creation even more effective, is start implementing reinforcement learning from human feedback. What that's going to allow us to do is basically, now we have

that data: we have the original prompt that was created by the application, and we have the response, the IAM policy that got returned to the user. Once we get that reinforcement learning from human feedback, we can determine if it was good, not good, or indifferent, depending on the user feedback. Then we can start storing all that data and, specifically on the OpenAI GPT-3.5 side of things, start fine-tuning that model based on that data. That's one of the things we wanted to get into; it's a little more on the data science side of

things, but we think it's going to make it even better than it already is, because it actually already works, surprisingly. Then we're going to add some authentication and authorization mechanisms so you can deploy it within your environment and hook it up to your identity provider of choice. We're also looking at adding additional models, so instead of using GPT-3.5 Turbo you could use a different model, maybe one of the open source models, and train it more specifically on the IAM policy documents and all the documentation. Last bit

here: this is really AWS-specific to begin with, but we want to start branching out and add support for GCP and Azure IAM policy rendering, as well as Kubernetes RBAC and some additional SaaS products. And yeah, that's Vapor Lock. I hope this was helpful for understanding how you can use OpenAI, and LLMs in general, to solve security problems, in this case trying to right-size your IAM policy and get close to that nirvana of least privilege that we're all after. So we'll open it up for questions now. All right, let's give it a round of

applause, folks. Fellas, we do actually have a couple of questions here. The first one is: what is the benefit of using an LLM for policy generation and right-sizing versus trying to do this using only deterministic methods? Yeah, I think you can take that. Sure. One thing to note is that for the right-sizing portion, we're actually using a deterministic method: we're using IAM Access Analyzer to go through the CloudTrail logs and get the specific IAM actions that are being used by a given principal. For the LLM side of things,

our goal here was, I mean, a deterministic approach: if there is a way, and I'm sure there is, somebody will figure it out eventually. But considering LLMs are available and ready for use, it seemed like a pretty decent way to start going down that road of generating new policies from just documentation. Of course, if you want to do it super deterministically, you can just read the documentation, but as we know, that's difficult to do sometimes depending on your cloud service provider, and it takes a lot of time. So this is kind of

like that bridge into going more in that direction. And I'd say the idea with the whole dashboard is that it's multiple services; we're trying to provide a holistic, comprehensive strategy for how you address IAM in general. So we have the LLMs for create policy, and the deterministic approach for right-sizing; we're trying to use the benefits from both sides. Thank you for that. You've also got some props; one person is saying, great tool. What type of anonymization are you doing on the data before sending it to the LLM to create

the policy? Well, it's simple: we just don't send it. We're sending a dummy version of the ARN, crafted in the same format for the same kind of resource, but not the actual ARN in question. OK. And you mentioned Vapor Lock is open source; where can we find it, and how can we contribute? Yeah, so it will be open source. We've got to do a few more adjustments before we let people look at the code, just pretty it up a little bit so we don't embarrass ourselves. But yeah, it will be available. We did put

our LinkedIn here. Is the deck going to be shared out? Yeah, folks, for all of these sessions you're seeing in 12, 13, 14, and 15, about two to four weeks from now the recordings will be out on our YouTube channel. OK. Well, these are clickable links, so if you end up getting a copy of the deck you can just click on them; otherwise I'll figure out a way to get our LinkedIn out. Well, gentlemen, thank you for your service here. Anthony, some gifts here, care of Socket Security. Give Socket

Security a shout-out.