
Container vuln management with (hopefully) minimal burnout

BSidesSF · 2023 · 29:06 · 743 views · Published 2023-05
Speaker: Alex Chantavy
Category: Technical
Style: Talk
About this talk
In a microservice architecture, it's difficult to tell if a service's vulnerability was inherited from a base image (most cases) or introduced by the service itself. This talk shows how we used a graph approach to know precisely how to fix our vulns across 1000+ services at Lyft. https://bsidessf2023.sched.com/event/1I0HP/container-vuln-management-with-hopefully-minimal-burnout
Transcript [en]

Greetings, everyone in Theater 15, the best audience at BSidesSF. Give it up for Alex; he'll be our presenter and our master of ceremonies for the next 30 minutes. Awesome, thank you very much. Hello everybody, I am all kinds of excited to be here; this is the first talk I've given in person since the before times. Today I'm going to tell you a story about our journey at Lyft, through the ups and downs of container vulnerability management. It was a disaster, we tried to make it a little bit less of a disaster, and I'll tell you all about it.

Back in 2020 we stood up a container vulnerability management program, and at the time this was frankly a little embarrassing for us, because we felt that for a company of our technical maturity this was table stakes; we should have had it done already. Prior to the formal program, vulnerability management was done on an ad hoc basis, and it was a mess. We also had another big motivator for standing up a centralized program: we wanted to win a contract to provide rideshare services for the federal government, and the federal government needs to ensure that its data is protected.

So we needed to give them assurances that we operated a program that would find and fix security vulnerabilities on all of the platforms involved in that flow. For those of us on the security team (oh, I just realized I totally didn't introduce myself: I'm a software engineer on the security team, by the way), this was exciting. When does this ever happen, where you work in security and can say you're no longer a cost center, no longer the thing that costs the company money for some dubious benefit? This was literally tied to a business outcome.

I personally was extremely thrilled to be part of a business-generating story. This talk is going to focus on the container part, because there is just so much to cover there; it has a lot of technical depth. My hope in sharing this with you is that you can avoid our mistakes. I'm going to go into the good, the bad, and the ugly. Containers are hard, so let's share notes. I'll go over the things that worked for us, like how we used a graph approach to get to the bottom of actionability.

One thing I'm loving at BSides this year is the theme of taking a pragmatic approach to vulnerability management: deciding and prioritizing what specific things your program can do, and as a security team not trying to do everything, but being a little smarter about it. Standing up a vulnerability management program from scratch, especially with a specific customer in mind, meant we needed to be very deliberate about what we covered first, and I'll take you through our thought process.

The customer we had in mind for this program is interested in rideshare services. When a Lyft customer opens the rideshare app on their phone and requests a ride, it talks to a number of backend services hosted in the cloud, and we run about a thousand of these services at Lyft; quite a few of them. All of those services are deployed on Kubernetes, and from all the talks I've been attending here, I imagine everyone is very familiar with cloud native technologies.

So the services run on Kubernetes, and there are other assets involved in this flow, some exposed to the internet and some not. That's how we built up the list of components involved in our program: we've got containers, we've got K8s nodes, we've got other assets hooked up to the internet. Again, I'm focusing solely on containers, because there is a lot of technical depth in this section. What are we going to do with containers, though? Generally speaking, a vulnerability management program has steps like this: you inventory your assets, you scan them, you triage the results, and you remediate.
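
To make those four steps concrete, here is a minimal sketch in Python; every helper in it is an illustrative stub, not Lyft's actual implementation.

```python
# Illustrative stubs for the four-step loop: inventory, scan, triage, remediate.

def inventory() -> list[str]:
    # In practice: enumerate assets from your cloud inventory.
    return ["service-a:abc123", "service-b:def456"]

def scan(asset: str) -> list[dict]:
    # In practice: run an image scanner and parse its findings.
    return [{"asset": asset, "cve": "CVE-2023-0001",
             "severity": "HIGH", "fixable": True}]

def triage(finding: dict) -> bool:
    # Keep only findings a team can actually act on.
    return finding["fixable"] and finding["severity"] in ("CRITICAL", "HIGH")

def remediate(finding: dict) -> None:
    # In practice: file a ticket or open a PR for the owning team.
    print(f"assign {finding['cve']} on {finding['asset']} to its owner")

for asset in inventory():
    for finding in scan(asset):
        if triage(finding):
            remediate(finding)
```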

Triage means deciding, for a given issue, whether it's critical enough to need an all-hands-on-deck response, or significantly less critical so it can wait a little longer, or not valid at all, in which case I can avoid randomizing my security team. And finally, you need to actually get something done: once you have your issues, make sure you're assigning them to the correct team and acting on them. That's the key part: as the security team, if we're just spamming tickets out to teams and no one knows what to do with them, we're not doing our job.

We set out to build this program, and we did it pretty bursty: the goal was to get from zero to good enough as fast as possible, so we set aside about six weeks to build a minimum viable product. We found out there were a couple of things we didn't quite anticipate; we did our best here. One of those difficulties was scale, and I'll talk through a little of the uniqueness of Lyft's infra; I think many of you may be empathetic, hopefully. At Lyft, when a developer pushes a commit to GitHub, a container image is built and pushed to an ECR (Elastic Container Registry) repo.

This happens regardless of whether that container is actually deployed to production, and that's an important detail. If we took the naive approach to the scan process, trying to find all the vulns across the entire company by scanning every single image, at maybe 15 to 30 seconds per image scan, it would take months to get through everything. Even if we could massively parallelize it, that's a lot of money wasted on compute, and who wants that?
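
As a back-of-envelope illustration (the image count here is an assumption; the talk only gives roughly a thousand services and 15 to 30 seconds per scan):

```python
# Rough cost of the naive "scan every image ever built" approach.
services = 1_000              # from the talk
builds_per_service = 500      # assumed: one image per commit adds up fast
seconds_per_scan = 20         # midpoint of the 15-30 second range
total_seconds = services * builds_per_service * seconds_per_scan
print(f"{total_seconds / 86_400:.0f} days of serial scanning")  # ~116 days
```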

And by the time you're done with all of that, the developers will have pushed more code, and your results are worthless. So we had to take a more pragmatic approach and ask: which of these images actually presents a risk to us? Which are exposed to or running in production? Which commit hashes, which revisions, are going to be out in the wild serving rider and driver traffic? That kind of thinking. The next part was actionability, and this is something really interesting with container images.

We initially started this project using things that were easily available to us: we scanned every container image with AWS's built-in ECR scanner. If you look at the API documentation, a finding has attributes, a description, a name, a severity, but it doesn't tell you what to do about it. Going through this, I don't know what to do. Sure, I know whether it's critical, but it doesn't tell me what version to upgrade a given package to. So what am I supposed to do? If I take all these findings and hand them to my service teams, they'll tell me to go pound sand; they'll tell me to go away.
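
A sketch of what that looks like with boto3, assuming a placeholder repository and tag; the field names follow the ECR DescribeImageScanFindings response documented by AWS:

```python
import boto3

ecr = boto3.client("ecr")
resp = ecr.describe_image_scan_findings(
    repositoryName="my-service",        # placeholder repo name
    imageId={"imageTag": "abc123"},     # placeholder commit-hash tag
)
for finding in resp["imageScanFindings"]["findings"]:
    attrs = {a["key"]: a["value"] for a in finding.get("attributes", [])}
    # You get a CVE name, a severity, and package attributes, but no
    # "upgrade to version X" field: that's the actionability gap.
    print(finding["name"], finding["severity"], attrs.get("package_name"))
```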

Yes, ridiculous. Another thing about actionability with images: our infra team maintains the parent base images. I think a number of companies do it this way, where your application teams cannot, and probably should not, maintain the full stack of the image, from the OS up to all the stuff they build on top of it. So typically an infra team, or somebody really well versed in this, builds the parent base image, and then the application team adds their own material, their own code, and that gets built as their image. It forms a sort of family tree.

So there are two classes of vulns in image lineage. There are vulns the service introduces itself, as when a service imports some package that it needs. And there are vulns introduced by the parent image, and as you'd expect, a vuln introduced by the parent image cascades down to all of the children. How do you fix this stuff? The first case is really straightforward: to fix a vuln introduced by the service, you just tell the service which package to fix. Done.

How do you fix the other case, where the vulnerability is introduced by a base image and has cascaded down through everybody else? This is a little more challenging, because if I scan one of the service images down at the bottom, cut tickets, and tell that team to go solve an issue with OpenSSL or whatever, they will say, "cool story, bro"; there is absolutely nothing they can do to fix it. What needs to happen is that the infra team pulls in the changes, updates the parent image, and then cascades those fixes down to the rest of the fleet.

So the infra team updates that image and cascades it down. We're lucky at Lyft to have a really awesome infra team with a process that automatically cuts pull requests to every single child service down at the bottom layer, so services effectively consume the fix from those auto-generated PRs. Image lineage is a graph problem, and how do we keep track of all of this? We luckily have a graph: I maintain the open source tool called Cartography; you can check it out. Lyft has maintained it for a while, and it's really awesome. For full details on how this cascade process works, check out the blog post I wrote, because this gets really interesting: it can be several layers deep, a dependency of a dependency of a dependency.

Essentially, the power of having something represented in a graph like this is that you can correlate the context with anything else: who owns a service, which on-call is in charge of fixing it, ownership attribution, even internet connectivity. We saw the value there.
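
For example, here is the kind of question a graph lets you ask, sketched with the Neo4j Python driver; the node labels and relationship names are illustrative rather than Cartography's exact schema:

```python
from neo4j import GraphDatabase

# Illustrative schema: Image-[:BUILT_FROM]->Image lineage edges, plus
# service deployment and team ownership edges. See the Cartography docs
# for its real node labels.
QUERY = """
MATCH (base:Image)-[:AFFECTED_BY]->(v:Vulnerability {id: $cve})
MATCH (child:Image)-[:BUILT_FROM*]->(base)
MATCH (svc:Service)-[:DEPLOYS]->(child), (team:Team)-[:OWNS]->(svc)
RETURN svc.name AS service, team.oncall AS oncall
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Every downstream service affected by a base-image vuln, with its on-call.
    for record in session.run(QUERY, cve="CVE-2023-0001"):
        print(record["service"], "->", record["oncall"])
```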

So we took all of our various scan data and chose a sort of naive architecture: essentially, everything gets sent to Cartography, and then we output Jira tickets and reports and call it a day. We built some data models; I'm not going to go over what they are, just to show that you can model this as a graph. And yay, we won the contract: $810 million split between Uber and Lyft. That is awesome. But we weren't done yet. We had a data model and we had the contract, but this was really bad: we did not have any automation for this. Remember that plan we had?

We were going to scan for vulns, triage them, remediate them, and live in this beautiful world where robots did it for us, because as a security team we're understaffed and tiny compared to the rest of the company. In actuality, at the end of the first six weeks it looked like this: we had focused only on the discover piece. We had a scanner and a data model that, frankly, I was very proud of ("oh, look how smart this is"), but it didn't actually do anything.

The parts that involved triage and remediation were still very much manual, and there were resourcing problems. At the end of the six weeks, most of the team focused on this went back to our main projects, and it was left in the hands of one TPM and one very capable engineer, and they both ended up extremely burnt out. These were dark times; this was really bad. Couple that with the fact that this was 2020, during the pandemic, and no one was having a good time. Something had to be done.

For the next step, we asked: what can we do to build, buy, or adopt our way out of this problem, and what can we augment in the structure we had already built for ourselves? One of those things was a tool called Trivy. Trivy is really neat, and what I like about it is that it gives us fixed versions: you scan the image, and Trivy maintains a data feed that tells you exactly what version to update a vulnerable package to. It also lets you say things like: if this vulnerability is not fixable, don't even tell me about it.

If there is nothing I can do about a vulnerability, why would I spam all of my service teams about it? Unless, of course, it's a log4j-type situation that requires all hands on deck, but the majority of the time that's not the case. The other cool thing with Trivy is that it lets you define Open Policy Agent policies, so you can say things like: if this vulnerability requires in-person, physical access to a server, I don't care about it. You describe that, and Trivy filters it out for you. We loved that.
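
As a sketch of driving Trivy this way (the --ignore-unfixed flag and the JSON field names below match recent Trivy versions, and --ignore-policy accepts a Rego file for the OPA-style filters; the image name is a placeholder):

```python
import json
import subprocess

# Scan an image, drop findings with no available fix, and emit JSON.
out = subprocess.run(
    ["trivy", "image", "--format", "json", "--ignore-unfixed",
     "my-service:abc123"],                       # placeholder image
    capture_output=True, text=True, check=True,
).stdout

for result in json.loads(out).get("Results", []):
    for v in result.get("Vulnerabilities") or []:
        # FixedVersion is the detail the ECR scanner never gave us.
        print(v["VulnerabilityID"], v["PkgName"],
              v["InstalledVersion"], "->", v["FixedVersion"])
```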

The one final point I'll make on build/buy/adopt is that we found our infra was so custom that a lot of the things we evaluated took a very opinionated view of how infra should be set up, so they didn't generalize well, and the work to orchestrate everything, or to adopt their product, would have been the same as or more than doing it ourselves. So that's what ended up happening: we orchestrated the automation ourselves. It ended up looking something like this: again, you have the scan process, you're identifying what is in production, and you're building the image lineage tree.

Then you're splitting the vulns into things inherited from the base image versus things introduced by the service itself, and tying that to what the team can specifically do to fix it themselves. That's the remediate piece: we say, your project is affected by this many security issues, and here is the specific PR you need to merge; once you merge it in, all of these security vulnerabilities go away.
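
The split itself can be thought of as simple set arithmetic over CVE IDs; a minimal sketch:

```python
# Split a service image's vulns into "inherited from the base image"
# (routed to infra) versus "introduced by the service" (routed to the team).

def classify(service_vulns: set[str], base_vulns: set[str]) -> tuple[set[str], set[str]]:
    inherited = service_vulns & base_vulns   # fix lands in the parent image
    introduced = service_vulns - base_vulns  # fix is on the service team
    return inherited, introduced

inherited, introduced = classify(
    service_vulns={"CVE-2023-0001", "CVE-2023-0002"},
    base_vulns={"CVE-2023-0001"},
)
print("route to infra:", inherited)      # {'CVE-2023-0001'}
print("route to service:", introduced)   # {'CVE-2023-0002'}
```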

I skipped over the triage phase really quickly: there is sometimes some hand-tuning there, deciding what types of vulns you do or don't care about, and we had a state management system for it that was really sophisticated. I'm not going to go into that. Suffice it to say, this was a very ambitious project, and it failed spectacularly. We did our best, and I'm going to tell you exactly how it failed. I wanted to share the original code for this, but I'll just show you the screenshot: "If you are on call, the security team's vuln scanner may have assigned a Jira epic to you with no task items attached to it."

"When the scanner finishes a scrub, all tasks and subtasks will be attached to the epic. We expect the scanner to be done by the end of the day." So what happened here: we sent all these teams all these tickets, unveiling our brand new vulnerability management program, and we told them they had all these things to do, but there was no detail about what they had to do. On these weird state management details: we should have waited for all of the scans to finish before assigning anything to a service owner, so they would not get spammed with an email they could do nothing about.

Okay, this one was worse: "security vuln tickets were closed erroneously between 6:00 and 7:41 a.m." This was really bad: service owners were sent erroneous information that vulns were closed when they should not have been. This is a bad situation to be in. Your state machine logic runs, and due to a data quality error it believes that all the vulnerabilities you sent out to all the teams have been resolved, and therefore all the tickets you sent them should be closed. So you have zero open tickets, and at first it's like, oh, we can pack it up,

it's a great day for security, we won, inbox zero. No; definitely not the case. We sent out an apology, and in certain cases we actually had to send an apology about our apology, because: oh wait, actually, we're going to backtrack on that. This is a bad place to be; you do not want to be here. Eventually it worked, though, so what did we do to make it all work? Before I get into that, here's a key detail: this was really ambitious. The part I really want to emphasize is that we decided that sending tickets to teams was not an acceptable experience unless, especially at the volume of data we had, those tickets would be automatically closed.

So we had state machine logic that would detect when a vulnerability was gone: if the vulnerability disappeared from the environment, we would close the ticket, do our own bookkeeping on the backend, and keep everything nice and tidy. In my opinion, if we didn't have that, we weren't really doing our job. So the ambitious plan didn't work out at first, but eventually it did. Why did it work? Through the work of many other members of my team, we built engineering safeguards.
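
The auto-close idea reduces to a reconcile loop: compare open tickets against what the scanner currently sees and close anything whose vuln is gone. A minimal sketch, with a hypothetical ticket helper:

```python
# Reconcile open tickets against the latest scan: if a tracked finding no
# longer appears in the environment, close its ticket and tidy the books.

def close_ticket(ticket_id: str) -> None:
    # Hypothetical stand-in for a Jira API call.
    print(f"closing {ticket_id}: vulnerability no longer observed")

def reconcile(open_tickets: dict[str, str], current_findings: set[str]) -> None:
    # open_tickets maps ticket_id -> finding key, e.g. "svc-a/CVE-2023-0001".
    for ticket_id, finding_key in list(open_tickets.items()):
        if finding_key not in current_findings:
            close_ticket(ticket_id)
            del open_tickets[ticket_id]

reconcile(
    open_tickets={"SEC-1": "svc-a/CVE-2023-0001", "SEC-2": "svc-b/CVE-2023-0002"},
    current_findings={"svc-b/CVE-2023-0002"},   # CVE-2023-0001 was fixed
)
```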

We did things like kill switches: if we get into a really bad situation and need everything to just stop, we can flip that switch and skip the rest of the state machine's closing logic. How do we keep our robot from running away from us? We have a circuit breaker: if the number of tasks we're about to close is greater than some threshold, some sanity check, we stop right there and page our own on-call. We also have a set of dashboards and alarms: we built systems that count the number of assets in our Cartography-powered graph, and if you can chart these things, you can set alarms and thresholds, so that if a number goes below or above a given threshold, you page the on-call.
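
A minimal sketch of those first two safeguards, with illustrative names and thresholds rather than Lyft's implementation:

```python
# Kill switch: an operator-controlled flag that halts state-machine actions.
# Circuit breaker: refuse suspicious bulk closes and page the on-call instead.

KILL_SWITCH = False          # flipped by an operator in an emergency
MAX_CLOSES_PER_RUN = 50      # sanity threshold on bulk closes

def page_oncall(message: str) -> None:
    print(f"PAGE: {message}")            # in practice: PagerDuty or similar

def close_tickets(ticket_ids: list[str]) -> None:
    if KILL_SWITCH:
        return                           # skip all closing logic entirely
    if len(ticket_ids) > MAX_CLOSES_PER_RUN:
        page_oncall(f"refusing to close {len(ticket_ids)} tickets at once")
        return                           # circuit breaker trips
    for ticket_id in ticket_ids:
        print(f"closing {ticket_id}")
```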

The last thing we did was staff the program properly, and I think this was probably the biggest (I don't want to say mistake) area of improvement we could have made from the very beginning. Like I said, we started off with a surge effort, and afterwards the ownership story was unclear when we turned it over.

There were only maybe two people working on it, and then other people came back as volunteers to help get things into a good state again. The final thing: we stood up a dedicated program, a dedicated team, for vulnerability management itself. It handled things like building out the platform, the care and feeding of these software systems, answering questions on-call when people got confused, and maintaining all of it. This was absolutely critical for us. Now I want to talk about the things we learned.

Going back to the beginning: we did things really well in certain areas, and there were definitely some places we missed, such as those engineering safeguards and staffing things properly. Actionability is everything. I'm pretty happy we took the approach that if there is nothing a team can actually do about a vulnerability, we don't bother them, unless it's an all-hands-on-deck situation that needs special care. We have to balance all of these things; there are competing priorities. Prefer fixing over telling: we needed to be part of the solution, so to get teams to work with us, we took a partnership approach.

We leveraged the existing work our infra teams had provided for us: the system that automatically generated pull requests to update the rest of the fleet and pull those fixes down to all of the downstream services. Automate for user experience and for team health; this is a very important point. For us, it's an unacceptable user experience to take all the scan findings, shoot them out to all of the on-calls, and say good luck; especially at the volume of tickets we had, that is not acceptable. So we built state management

that handled all of that on the backend and automatically closed those tickets. Sure, we had some problems with rolling our own and owning it ourselves. The next part is about your team: have automation for your team's sake too. If there's any manual step in your vulnerability management program, I would question how sustainable it is, because when we went through this it was really rough, and we definitely burned out a lot of people. With state management, use engineering safeguards and think about these things ahead of time. Do a phased rollout: we could have done things

like first onboarding only the security team to the process, dogfooding it ourselves, then doing a 20% rollout, a 60% rollout, seeing how the teams respond, and iterating from there. In our case, we had a lot of deadlines staring us in the face and a very motivated customer telling us, "you need these reports right now," so we had to improve the engine while driving it at the same time, and that was very difficult. If more of this had been thought through in

advance, I think we could have avoided some of those problems. Finally, the last part: take care of each other. This was especially true for the time when we put this all together, during the pandemic, probably some of the worst times in a lot of our lives, when we were all confined at home and many of us had only met each other over video calls, never really building team rapport by seeing each other in person. So what does this actually mean? What do you do to take care of each other as a team?

What I would say is that it comes in many different forms, and the first is pushing back against unreasonable deadlines. If you get a very urgent request, be critical of it: is this really that urgent? Is this really that important? Make your own matrix of what that looks like, because above all else, if you have a team that's burnt out and not willing to work, that's going to be very unhappy for everybody. On the other hand, if you have a sustainable set of

team practices, it will pay dividends in the long run. Thank you very much for having me here. Check out Cartography, and you can reach me on the various socials. We built out this approach, and I'm very interested to see how others have solved this problem: scanning container images, figuring out where a finding came from and whether it's actionable, focusing on that, making sure your team doesn't have toil, owning the state management, and running this as a lifecycle process with the care and feeding your program needs.

I'm definitely extremely excited and happy to be here; thank you very much. [Applause] Fantastic, thank you, Alex. Folks, we have time for one quick question; for most Q&A we'd encourage you to head to the hallway to make room for the next speaker. I see a hand up; I'm bringing the mic, and I'm not going to fall this time. "Hey, thanks for the talk, very interesting. A quick question on your MVP: you said you did it in six months; how many people did you use?" Six weeks. "Six weeks, sorry. How many people did you have, and eventually how many team members did it take as you made improvements over time and got it operationalized?"

Yeah, so with the original MVP, I think there were 10 to 15 people originally across the entire company, including infra, and that was not just for containers; that also covered the other work streams I mentioned, including the K8s nodes, the internet-exposed assets, and all those other things. Eventually, when we built out the full team, it ended up being closer to three to five people. Three to five still sounds like a very, very small team, but what I will say is that it is amazing what you can do and how much faster you

can move when the lines of responsibility are crystal clear: when everybody knows, "I'm working on this vuln management platform, this is what I'm going to be evaluated on, this is the thing I own." That's a real powerful thing: when people are able to take ownership of a project and make their own contributions, it multiplies.