
Hey everyone, I'm Chris, this is Vivian. Hi. We work for Lyft. We're the other transportation company open sourcing some certificate stuff this week, so we're excited to talk to you about what we've been doing, and we'll tell you about our project and what we've been doing with SSH. All right, so today we're going to be talking about better SSH management with ephemeral keys, a.k.a. how we made SSH access a lot better at Lyft.
All right, so we're going to start off talking about how we used to do SSH management at Lyft, and then we'll talk a little bit about Netflix's BLESS and how Netflix uses BLESS for SSH access in their infrastructure. Then we're going to talk about how we use an adapted version of Netflix's BLESS for our infrastructure and our purposes, followed by how we're running the Lambda and the client: how we run the Lambda in AWS infrastructure and how we run the client on the engineers' laptops. Then we'll briefly talk about the shortcomings that we came up against and how we plan to fix them. All right, so what we
did at Lyft before: during onboarding, IT would provision laptops. They would set them up before the engineer starts at Lyft. One of the things they had to do in the process was generate a private and public SSH key pair for the engineer. So this was done by IT, not the engineer. They would generate the key, store the private key in the destination folder as usual, and then they'd put the public key in a file that was Salt managed, and this would bring the public key into the authorized keys of the user on every host in our fleet using Salt management. So this file looks a little bit like this: it's a YAML file and it has a few
properties such as my name, my email, and most importantly my public key. This file is stored in a Git repository and it's used in Salt management: when Salt is run, the data in it is used to put authorized keys onto all of the instances in our fleet.
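To make the old workflow concrete, here is a minimal Python sketch of what such a Salt-managed record and the resulting authorized_keys entry might look like. The field names (`name`, `email`, `public_key`), the example values, and the `authorized_keys_line` helper are all hypothetical; the talk only says the file held a name, an email, and a public key.

```python
# Hypothetical per-engineer record, mirroring the three properties named in
# the talk. The real file was YAML; a dict stands in for the parsed result.
user_record = {
    "name": "Example Engineer",
    "email": "engineer@example.com",
    "public_key": "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIExampleKeyMaterialOnly",
}


def authorized_keys_line(record: dict) -> str:
    """Render one authorized_keys entry: '<key type> <key data> <comment>'."""
    parts = record["public_key"].split()
    if len(parts) < 2:
        raise ValueError("public key must contain a key type and key data")
    key_type, key_data = parts[0], parts[1]
    # Use the email as the comment so each entry can be traced to a person.
    return f"{key_type} {key_data} {record['email']}"


print(authorized_keys_line(user_record))
```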
Oh sorry. Okay, so that worked well for a while. We had new engineers whose laptops already had the SSH key on them; they would be able to SSH right from day one, using the private key that was provisioned to them, into any of the instances in our fleet. So this was good, it was functional, but there were a few problems with it. Firstly, like I said, the engineers themselves didn't make their own key; this was done by IT. This isn't great because someone else potentially knows your private key, and we had to make sure that IT followed the right process to generate a secure enough private key for the SSH key pair, and make
sure that this process was secure enough, which we couldn't, because we couldn't have certainty that the key wasn't stored somewhere else during the process. So yeah, that's that. The process was also prone to human error: like I said, the public key has to be copied into the user's YAML file every time someone is onboarded, and if there was a typo in it, then there'd be a typo in the authorized keys, and we'd have to redeploy this Salt management to our whole fleet every time there was an error. This deploy process could take anywhere between 20 and 60 minutes because it had to run on all our instances. So this wasn't the
greatest process for rotating a key. Furthermore, because of this slow rotation process, if a key was compromised, for example if the engineer's laptop was stolen, this would result in having to rotate the key, which is the same deploy process, which means that a malicious user could have 20 minutes, or up to an hour, with the key and have access to our instances for that time. So that's not great. Finally, this process doesn't have two-factor authentication, which means that once someone has the private key, that's it: they have SSH access into all instances. So that's not the greatest
process. Okay, so, introducing Netflix's BLESS. Netflix recently, maybe not so recently anymore, open sourced their AWS Lambda, BLESS. BLESS is a Lambda; it stands for Bastion's Lambda Ephemeral SSH Service, and basically it's an AWS Lambda that issues SSH certificates to its bastion. In case you don't know, AWS Lambda is basically serverless computation. What we would do is upload Python code to AWS Lambda, which would have some sort of handler, like the lambda handler function. When the Lambda is invoked, it can be invoked with a payload; this payload is passed to the lambda handler function, the lambda handler does computation, and then it can output
a JSON response. So in the case of BLESS, the input, the payload, would be a public key; the computation would be a CA signing this key; and the output would be a certificate that could be used for SSH. A cool thing that we get for free with Lambda is that we can use AWS IAM permissions to control who can invoke the Lambda. So in the case of Netflix, the bastion can invoke the Lambda, and we can make it so that only the bastion can invoke the Lambda and nobody else can. Okay, so the output of the Lambda, like I said, is an SSH certificate. If you're not sure what an SSH certificate is,
it's basically a certificate, signed by a CA, over an SSH public key. An SSH server can be configured to trust a list of CA fingerprints. So for example, in Netflix's infrastructure they probably have a list of the BLESS Lambda CAs' fingerprints on every machine, and anyone who presents a certificate with a fingerprint that matches a CA fingerprint that they trust will be granted access. The cool thing about certificates, their advantage over public keys, is that you can specify a couple of things. For example, we can specify the principals, which is who is using the certificate, so you can, for example,
narrow that down to just Vivian Ho, so only Vivian Ho can use the certificate, or only the bastion can use this certificate. You can specify the valid-to and valid-from dates, and that gives you the option to issue short-lived certificates. In Netflix's case, they only issue certificates that last for two minutes; after that you'd have to call the Lambda again. We could also specify the IP addresses from which you can SSH using the certificate, so you can specify where you can SSH from, and you can also specify a bunch of other options and features, so it's pretty cool. So briefly, how the Lambda works: the
Lambda is in the middle of this diagram, and inside the Lambda it has the CA. The CA is a private key; it's password protected, and the password is stored with the Lambda. But the thing that makes it so that only the Lambda can sign with the CA is that the password is encrypted using KMS, and the KMS key is protected using IAM permissions such that only the BLESS Lambda can decrypt with the KMS key. So that means that only the BLESS Lambda has access to decrypt the password and use that password to sign. So as you can see, the user on the bastion can request a certificate; it will give its public key to the
Lambda. The Lambda then decrypts the password using KMS, uses this password and the private key to sign the public key, and returns a cert back to the bastion, and then the bastion can use this to SSH into any instance in the infrastructure.
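The flow just described (public key in, CA password decrypted, key signed, certificate out) can be sketched as a Lambda-style handler in Python. This is only an illustration of the shape of the flow, not the BLESS source: the payload field `public_key`, the `make_lambda_handler` helper, and both fake functions are hypothetical, and the real decryption and signing steps are replaced by stand-ins so the sketch runs without AWS or a real CA.

```python
from typing import Callable, Dict


def make_lambda_handler(
    decrypt_ca_password: Callable[[], str],
    sign_public_key: Callable[[str, str], str],
) -> Callable[..., Dict[str, str]]:
    """Wire the two sensitive steps into a Lambda-style handler.

    Both callables are hypothetical stand-ins: in a real deployment the first
    would ask KMS to decrypt the stored CA password, and the second would use
    the password-protected CA private key to sign the caller's public key.
    """

    def lambda_handler(event: Dict[str, str], context: object = None) -> Dict[str, str]:
        public_key = event.get("public_key")  # hypothetical payload field
        if not public_key:
            return {"error": "payload must include a public key"}
        # Only the Lambda's own role is meant to be able to perform this step.
        password = decrypt_ca_password()
        certificate = sign_public_key(public_key, password)
        # A plain dict keeps the response JSON-serialisable.
        return {"certificate": certificate}

    return lambda_handler


# Fake implementations so the sketch runs anywhere.
def fake_decrypt() -> str:
    return "not-a-real-password"


def fake_sign(public_key: str, password: str) -> str:
    return f"FAKE-CERT-FOR({public_key})"


handler = make_lambda_handler(fake_decrypt, fake_sign)
print(handler({"public_key": "ssh-ed25519 AAAAexample"}))
```

Passing the two steps in as functions keeps the control flow visible while leaving the security-critical parts (who may decrypt, how the key is signed) to IAM policy and the real signing code.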
All right, so like I said, there are many good properties to using SSH certificates, such as the ability to specify the principals, the source IP addresses, the valid-to and valid-from dates, and a bunch of other options. Those all come for free; those are all options you can have with the SSH certificate. We could also use IAM policy to prevent anyone from invoking the Lambda except for the bastion, and again, IAM policy is used to ensure that only the BLESS Lambda can decrypt with the KMS key to get the password. The BLESS Lambda is running in a
separate security account, which is a standard security practice in AWS, because if you run in another account, it's much harder to escalate privilege from another account up to that account. So running sensitive things such as the Lambda in a separate account is something that Netflix does, and we do the same thing with our Lambda. And this process is a win for us because all we need to do is provide a list of trusted CA fingerprints to each of the instances in our fleet, which is just a one-line modification to the sshd config, so we don't need to install any
PAM modules or anything, which is something that we might have had to do if we considered other options, such as using Duo or other MFA. Okay, so that was Netflix's BLESS. Lyft's BLESS is similar to Netflix's BLESS; we just adapted it for our purposes. With our BLESS, we made it bastion-less, such that every engineer can invoke it. So like I said, with Lyft's BLESS there is no bastion: any engineer has the permission to invoke the Lambda, and we use kmsauth tokens for the engineers, which can authenticate the user. All right, so briefly, how Lyft's BLESS
and client BLESS work: it's a little bit more involved than Netflix's BLESS. The user, not the bastion, invokes the Lambda, but before it invokes the Lambda it generates a kmsauth token. I'll go into a little bit more about what that involves later, but it generates this kmsauth token, which can validate the user's identity. So it has this kmsauth token, and it passes the kmsauth token along with its public key to the Lambda. The Lambda checks the kmsauth token, validates that the user is really who they say they are, and only then will it do the same thing as Netflix's Lambda, where it
decrypts the password using KMS and then signs the public key and generates a cert. It gives it back to the user; the user now has a cert and is able to SSH into our instances. We do have a bastion, but it's more just for ease of proxying. So the bastion accepts this certificate, and all the other fleet instances also accept the certificate, so the user is the one with the certificate rather than the bastion. All right, so what is kmsauth? It's a token that magically identifies the user and proves that they are really them. kmsauth is a library that we open sourced a while back, and
basically it uses IAM permissions to ensure the verification of the user who generated a token. With KMS we can specify the encryption context, and in IAM permissions for KMS we can restrict what the values for the encryption context can be depending on the user. So in the case of kmsauth, to set up a KMS key that works for this, you would restrict the from of the encryption context to only be the username of the person. So when I generate a kmsauth token, the only from in the encryption context I can set is my own, so the only person I can generate a token on behalf of is myself. The from context has to be me;
it can't be Chris, for example. Because of that assurance, I would get a token where the to context would be the BLESS Lambda, the service that I'm giving a token to, and the from context would be me, Vivian Ho, and then this token would be sent to the BLESS Lambda as part of the payload. On the BLESS Lambda side, the BLESS Lambda would be able to decrypt the token and check the encryption context, and it'll see it's from Vivian Ho, which means that it's definitely from me, and it can have that assurance because, given the IAM policy, only I would have been able to
mint this token. Like I said, in Lyft's BLESS you have to give the kmsauth token as part of the payload; if you don't give it, then it just doesn't work. Lyft's BLESS validates this kmsauth token to ensure that you are really who you say you are, and the process of invoking kmsauth and the process of invoking the Lambda both require multi-factor authentication. So when the user calls the Lambda on their laptop, they need to type in the six-digit code they get from their phone, otherwise they wouldn't be able to invoke the Lambda in the first place. With these measures we can be assured that the BLESS Lambda is actually
being invoked by the person they say they are. All right, handing over to Chris to talk about the BLESS client. All right, I'm a little taller. All right, so obviously this is a little bit more complex. When we were looking at options for adding two-factor to our SSH authentication, we looked at Duo, we looked at a couple of things. Like she said, we wanted to do something that didn't require PAM modules, and we liked the properties of certificates, but one thing that we had to do was actually have an agent running on the users' laptops to handle this whole process. So that is something that we've open sourced now; as of
Friday our BLESS client is open source, yay, so you can go there and you can read lots about it. What does this client do? So, you know, essentially (sorry, I have to look behind me a lot) we have the agent running on all the developers' laptops. It invokes the Lambda, so it does that whole process on the diagram: gets the kmsauth token, talks to the Lambda, gets a certificate, and then injects that certificate into the user's SSH agent as well as having it on the hard drive. But one of the hard requirements we had, and we'll talk a little bit more about our operational requirements, one of the things we wanted to do is make
sure that unless the user was actually putting in their MFA token, everything needed to act exactly like normal SSH. Lyft is very much a DevOps shop, so all of our developers need to access their hosts just like normal SSH; if we got in the way, they would get very upset with us. So, thank you so much. So that was a hard requirement for us, and a lot of the trade-offs that we're talking through here were made to make it easy for our users, even though from a security perspective we might have tried to do something different. Cool. It's open source, so please, there's a lot of work to be done and we
would still love to recruit volunteers for that, but it's also free, it's open source, so you can go and check it out yourself. All right, so the things that we really liked about certificates: again, users have to put in their MFA before they actually SSH, which was a huge win for us. We do it every 18 hours, so we require a kind of verification of aliveness every 18 hours. The certificate is valid for 30 minutes, and it's basically the window after which we have to talk to the user again and revalidate that the user's AWS account is still active. So every 30 minutes the user talks back to the Lambda and gets a new certificate, and we
save that certificate for 30 minutes. This is all logged in CloudTrail, which is awesome. And yeah, there's a slight modification from Netflix's BLESS, which again is open source: if you look at our fork of Netflix's BLESS, there are a couple of patches there, just some very minor things, with kmsauth being the biggest piece. All right, so being a security person, I came to Lyft, and like I said, we practice DevOps a lot, which meant that we as a security team wanted to set this up, but it also means we have to run this, and we're in the critical path for engineers SSHing into their instances. So that was a
challenge, and so as a security person it was an experience, and so I thought we would share some of that with you in case you want to run this too. So number one, just running the Lambda. It's really easy to say, hey, we'll get this thing from Netflix and we'll just run this Lambda. It was actually a bit of a process. Lambdas are serverless, so you have to give AWS everything it needs. You have to pull down and compile all the Python libraries; it uses the
cryptography library, which is a native library, so you get that all compiled, zipped up into a tarball, and then upload it to Lambda each time, which will create a new version. We had to build a container to actually go and do that, so every time we have to update our libraries, we can just, in our dev environment, build a new set of libraries, tar it up, and upload it to Lambda. Right now we're actually using the AWS CLI to do the uploads, because the Lambda management modules in Salt were kind of recent and we're having to deal with some old Salt versions, so
we have to be able to do that. Versioning and aliasing: Lambda has some cool things. It will create a new version each time you upload a new code base, and then you can point an alias at a particular version. So this lets us, one, change our Lambda without impacting users; we know that they're going to keep using the same good version of our Lambda. And then if we make version compatibility changes, we can have just two pointers and then point clients to the new version. Actually deploying the client app to users' laptops: we have a dev environment that all (thank you) all of our employees
get set up with basically on day one when they start at Lyft, and part of that is installing the BLESS client, getting their AWS credentials sitting on their laptop, and being able to SSH in. It sets up in a virtual environment, because the Python scripts run on all of our users' laptops, and then it's just a script that gets generated through a pip install. We have to ensure the client is actually called: since the certs are only good for 30 minutes, we have to actually make sure that the BLESS client is invoked whenever the user needs to SSH. We do that in two ways. We alias SSH, so if a
user just types in SSH to their host, we actually invoke a wrapper script that invokes BLESS first, optionally asks the user for their MFA if it needs to, and then goes ahead and runs BLESS to get that recent cert. We also put in a Match exec call, which is a more recent feature (I think it's OpenSSH 6.1-ish, or maybe it was 5.9, I forget when they actually put this in), and that lets SSH itself go and invoke a command when SSH is run, and that picks up things like rsync and all the other commands we didn't want to alias, so that was nice. We push out a
client, so again, speaking as a security engineer, auto-updating this is kind of scary. But operationally, when we pushed this out and got all of our users to go update it, we really needed a way to be able to push out changes, security updates, those types of things, to our users. So we did build in auto-updating: every seven days, basically, the agent checks back in and pulls down a fresh copy. This is not in our open source version, but in the version we're running we actually check a GPG signature and check out the most recent validly signed tag. All right, so again,
operationally, we want this to be fast. We were really concerned that if we push this out and it takes seconds every time the user invokes SSH, this is not going to be tenable. If you're responding to a page at 2:00 a.m., waiting for five seconds is not an option. So we need everything to work really fast, and we ended up caching a lot of things. We cache the kmsauth token for 60 minutes, the user credentials are cached for 18 hours, so basically the user has to put in their MFA once every 18 hours, and our certificates are good for 30 minutes.
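The three cache lifetimes above can be illustrated with a small time-to-live cache in Python. This is a sketch of the general technique, not the BLESS client's code: the `TtlCache` and `FakeClock` classes and the key names are hypothetical, and only the three durations come from the talk.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# Lifetimes quoted in the talk, in seconds.
KMSAUTH_TOKEN_TTL = 60 * 60          # 60 minutes
USER_CREDENTIALS_TTL = 18 * 60 * 60  # 18 hours (one MFA prompt per window)
CERTIFICATE_TTL = 30 * 60            # 30 minutes


class TtlCache:
    """Keep values until their expiry time; the clock is injectable for tests."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any, ttl: float) -> None:
        self._items[key] = (self._clock() + ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop it so the caller fetches a fresh value.
            del self._items[key]
            return None
        return value


class FakeClock:
    """Manually advanced clock so the demo does not have to sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


demo_clock = FakeClock()
demo_cache = TtlCache(demo_clock)
demo_cache.put("certificate", "cert-1", CERTIFICATE_TTL)
demo_cache.put("credentials", "creds-1", USER_CREDENTIALS_TTL)

demo_clock.now = 45 * 60  # 45 minutes later
print(demo_cache.get("certificate"))  # expired, so a new cert would be requested
print(demo_cache.get("credentials"))  # still cached, so no MFA prompt yet
```

With this arrangement the common case (a cached, unexpired certificate) costs a dictionary lookup, and only an expired entry triggers a round trip to the Lambda or an MFA prompt.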
All right, so just talking about a few of the limitations and challenges that we ran into as we were rolling this out. So again, one: Lambda. We were actually one of the first teams at Lyft to be using Lambda heavily; there were some toy projects, but we were basically the first ones to really do it. So we didn't know how Lambda was going to fail for us, and we really didn't want it to fail in the middle of another outage and have teams locked out of their boxes. So we knew we needed to run in two regions. We set up in us-east-1, which is where we run a lot of things, and us-west-1, but AWS doesn't give us Lambda there, so we had to run in us-west-2. Not a big deal, but that was just kind of a weird thing.
We did know that if Lambda totally fails across all regions, then we needed some way to get into our boxes, so we push two CA fingerprints to all our boxes, and the second one is kept by the security team, so if we need to, we could go in and manually issue a certificate to our users if for some reason all Lambdas everywhere have gone down. That has not happened yet. We've actually been very happy with the performance of Lambda; it's also pretty
cheap, so it's actually been a really good use of our time and resources. And when the great DNS thing happened last September and the east region basically went down, we didn't automatically fail over, but users were still able to hit the west region and get their SSH certificates, so that was good. Other issues we've hit: KMS rate limits. AWS recently changed how they do the rate limiting, so you now actually have a flat cap of 100 encrypts or decrypts per account, which was really disappointing to us. It used to be rate limited by key, and so we knew that if we failed because of a rate
limit, we could always go to another KMS key in a different region. Now we're actually capped on that, and that's something we've been very disappointed by, but we're trying to figure out ways around it. AWS MFA: we thought all of our users had their MFA token set up with their AWS account. When we actually rolled this out, we realized a lot of users had set this up when they onboarded and never actually used two-factor for anything. So as a security team we suddenly had to set up 2FA for basically our entire data science team and a couple of other engineering teams that
didn't have MFA. Speaking of our data scientists: apparently, we thought Python was standard across platforms. At Lyft most engineers use Macs, we have some Linux, not much Windows, but we thought Python is standard. Turns out it's not. Data scientists have Anaconda and other weird things that say they're Python; they take over Python on the user's laptop, but it's not Python, and it causes all sorts of issues. So as a result we're actually considering possibly porting all this to Go, just because Python has been more of the struggles that we've had. Unconventional shells: we try to alias things, but then fish comes along and aliases don't
really work the same way, so it's been a challenge. Boto libraries: when you're running on Lambda, AWS can go and upgrade your boto libraries whenever they feel like it, so we've been hit by that a couple of times. Oh yeah, and we cache the MFA token; not a huge fan of that. So what does this look like now? Basically, a user gets their AWS credentials, they download the BLESS client as part of their development environment, and they're immediately able to SSH into all of our servers, which is awesome. If we need to deprovision, then we can just disable their IAM account and they're really locked out. So lots of problems solved,
yay, we felt good about that, and that's the end. [Applause] All right, thank you. So here's your gift from Fitbit. Thank you.