
Hi everyone, my name is Dean. I've been part of the security team at Lyft for five years, as a software engineer focusing on infrastructure security. I'm really excited to be talking about this project that we worked on at Lyft. Many people contributed to this project, and I'm just fortunate to be presenting on a lot of people's behalf here, so I wanted to acknowledge them and also thank them.

While ingress filtering from the internet to your services is very common using firewalls, it is far less common for people to observe the traffic that's leaving their services' network. It's one of the most impactful things you could do for your company. If you don't observe and control this traffic, it's a significant risk: if a service becomes compromised, the threat actor is often going to exfiltrate data, download arbitrary payloads at runtime, or call back to command and control.

There are a couple of recent prominent hacks that I'm certain people have heard about, or at this point are really sick of hearing about. The SolarWinds supply chain attack used a backdoor that regularly communicated with command and control through DNS and HTTP. Log4Shell, the Log4j vulnerability, was another zero-day, all-hands-on-deck type of event where everyone wished they had some sort of egress filtering at the time. When exploiting these types of vulnerabilities, like Log4j, attackers often still need to make an egress call to download additional payloads or to exfiltrate data, so being able to control this egress traffic can directly help mitigate zero-days like this.

So why is this difficult? Why aren't more people doing this today? The first callout is people: it's organizational. You need to convince the decision makers that you are now going to funnel all your traffic through a proxy, a single point of failure, and that the security benefits of doing so will outweigh the risk of this single point
of failure. You also now need a team to operate this mission-critical proxy. It's high stakes, right? At Lyft this was a bit of a culture shift for our security team, because we weren't known historically for operating these mission-critical services, and we were the ones building and driving this initiative. We have a culture where if you build it, you have to operate it, so you're really incentivized to make sure it's running well. Our engineers were actually really excited to be taking this on, and we leaned into this opportunity.

Now let's talk about why this is technically difficult. It really boils down to two things: TLS-encrypted traffic and DNS. How do you control DNS exfiltration attempts? The vast majority of your egress traffic is going to be TLS-encrypted web traffic, HTTPS, so let's talk about that first.

As a reminder, these are the steps that occur when you establish a TLS connection and make an HTTP request. Let's say we were to funnel this traffic through a proxy, just transparently, and we're not going to be man-in-the-middling anything. There are really only a few attributes we could use to control this traffic. We have the IP address available to us, we have the ports, and maybe we have SNI. I have this highlighted with a question mark because not all clients relay this information, and there's the potential down the line that SNI might even be encrypted.

Let's also talk about the options we explored at Lyft but ultimately didn't go down. One of the first is man-in-the-middle: man-in-the-middle all TLS traffic. This directly addresses the challenges faced with TLS. This is where we decrypt everything, and now we can control traffic based on the hostname, the URLs, and the payloads if we wanted to. But we felt this really introduced too much risk for us to adopt this option. You're essentially breaking encryption, and
there's going to be a single point where all traffic is being decrypted. Now you also have to worry about managing your own certificate authority. You need all services to trust a root certificate, and then you have to worry about how to rotate and manage the life cycles of these certificates.

So we thought, all right, let's try to keep things a little simpler. Let's just control DNS. All requests make DNS requests, right? We'll maintain an allow list of domains that services are allowed to resolve. But this doesn't quite work out, because if a threat actor compromises a service, they could just circumvent DNS altogether by using the IP address directly.

So then we thought, okay, let's pre-resolve all the hosts. We'll maintain two allow lists: an allow list of hostnames and an allow list of IP addresses. We'll constantly pre-resolve all these hosts in the background and sync them to a layer 3 proxy, and if a service egresses out to an IP address that hasn't been pre-resolved, we're just going to block it. I think this actually might work in theory; I'm just not sure it's really practical at scale. We were actually really close to going down this route but ultimately decided not to. It just felt pretty yucky, really messy.

Now let's talk about what we did ultimately decide to do. This was our ultimate design goal at Lyft. We really wanted to delineate services into public and private subnets, and we wanted to force all outbound traffic through a proxy. At Lyft we use Envoy because, well, we built Envoy at Lyft; Matt Klein did, in particular. I don't want to go into too many of the details of what Envoy is. There are entire conferences dedicated to Envoy, and the well is really deep there. For the purpose of this talk, just know it's a proxy, and a very highly extensible proxy at that.

Private subnets are defined by
two nice characteristics. The first characteristic is that internet egress traffic cannot go directly out to the internet; it must first traverse a gateway or proxy that lives on the public subnet. The second nice characteristic is that there are no public IP addresses assigned to the service. There are no public interfaces, so you're not a single security group misconfiguration or accidental firewall configuration away from directly exposing your services to the internet. Whatever you do decide to do to improve the network security posture of your services, I would argue it should get you one step closer to this model.

At Lyft we use AWS, and it's very difficult to just start with this model, because a lot of AWS's service offerings will traverse the public internet by default. S3, DynamoDB, STS: these all traverse the public internet by default.

Lyft also runs a microservices architecture with over a thousand services, and we ultimately wanted to know the individual services that are egressing out to the internet. Because we're not man-in-the-middling traffic, we felt that the majority of the value we get is just from controlling the domains that we're allowed to egress to. So we wanted to be really granular: every individual downstream service is going to have its own unique allow list of upstream domains. We could represent this as the dictionary mapping that we see here: a thousand services and a thousand unique allow lists. If a service attempts to reach a host that's not on its allow list, we just block it.

A quick primer on proxy types. There are ultimately two types of proxies that Lyft uses as the foundation for its egress restriction. The first is the client-aware connect proxy, which we want services to primarily use. We've been calling this the happy path at Lyft, denoted by this sun emoji here. We want all services to use this. As a secondary path, we have what we call transparent proxies. This is going to be
a catch-all: it's going to catch all the leaky traffic that's not obeying the connect proxy. As we'll dive into a little later, we also have additional levers to control the traffic that we do see in the transparent proxy.

The client-aware connect proxy is self-descriptive. You may also have heard the term explicit proxy. It's where the service, or the client, knows that there's a proxy sitting between it and the upstream server. CONNECT is an HTTP method. The majority of you have probably heard of GET and POST requests; well, HTTP CONNECT is just another HTTP verb. It's an instruction for a proxy to open a TCP connection to an upstream host. As soon as that TCP connection is open, the proxy just forwards bytes back and forth.

What this diagram depicts is the steps that a connect proxy call makes, and there are really two important highlights within it that make it really attractive for controlling egress traffic. The first important highlight is DNS. You've now offloaded the responsibility of DNS resolution to the proxy, which makes it a lot easier to control DNS for your services, because there's no longer any reason for external resolutions to happen from the service. The second attractive aspect is that you get this nice tuple: you get to see the upstream host that the service is attempting to connect to, and you also get the downstream host that's making the call. We can inject the downstream host as an HTTP header.

There are multiple ways you could configure a connect proxy. The first way, which we use at Lyft, is that we inject these HTTP environment variables into all our services. If an HTTP client sees these environment variables, it knows it must first make an HTTP CONNECT tunnel. There are also other ways; for example, you could define it in code. This is a short Python snippet that shows how to use a connect proxy. But the first option is definitely a lot more attractive because
it's language agnostic.

Just to drive the point home, here's a tcpdump of a connect call. This is the really nice tuple you're interested in: you have the upstream host, highlighted in red, apairlift.com, and then you have the downstream hostname, this Base64-encoded value, as the host.

Some of y'all might be thinking, or scratching your head: how do you authenticate to the proxy? It's an orthogonal topic that I honestly could spin into a separate talk, so for this particular talk it's out of scope, but find me afterwards if you have more questions on that.

Let's talk about the other path, transparent proxies. They work by rerouting traffic without the client knowing, and this can be done through iptables: iptables will redirect all internet-bound traffic to a proxy. It's vastly more complicated compared to the client-aware proxy, again primarily due to TLS-encrypted traffic and DNS exfiltration. Once the proxy receives the redirected traffic, it has many ways to route it. iptables operates on layers 3 and 4. Once the proxy sees this redirected traffic, if the traffic is TLS, you have the option to allow or block on SNI, but there's no guarantee we have SNI. If it's plaintext HTTP, we can block or route based on HTTP Host headers. For everything else, you could block or allow based on IP. And finally, because the client doesn't know there's a proxy sitting between it and the upstream host it's attempting to reach, you now have to worry about DNS.

So, cool, now we've covered the foundations of what we use, the connect proxy and the transparent proxy. How do we actually use these in our restriction? Here's the high-level strategy; this is the 1,000-foot view. Using Envoy, we created an edge internet gateway that contains both a connect and a transparent proxy. Again, the client-aware connect proxy is the primary path that we want all
services to use. The design is far simpler, and the big plus is that we can control DNS for services. The transparent proxy is going to catch all the leaky traffic, so it gives us the confidence that all traffic is obeying the connect proxy. If it isn't, we know about it, and then we can spend some time figuring out how to get that traffic shifted over to the connect proxy. But again, there are additional levers we have on the transparent proxy side to control traffic, which we'll cover in more detail.

So that's the 1,000-foot view; now let's really dive into the nitty-gritty details. Oh, sorry, one quick thing too: if we don't see any traffic at the transparent proxy, we could just block everything in the transparent proxy at that point. Then we're confident everything is going through the connect proxy, and we just eliminate that path completely.

This diagram is lifted directly from our internal tech specs. It's effectively a blueprint of how the internet gateway works, and we wanted all services to use this as their main way of egress. Again, we use Envoy at Lyft, but I don't want everyone to get too hung up on Envoy, because the steps here are ultimately what a network proxy needs to do regardless of whether you're using Envoy or not.

Let's walk through what happens when a web request occurs. Starting from the left side over here, a service wants to make a request out to the internet. It's going to see these injected environment variables, so it knows it must make a connect tunnel out to the internet gateway. The connect gateway is going to receive this call, terminate it there, and see this tuple: the downstream host that's making the call, in this case python example 2, and the upstream host that it wants to connect to. We pass this information into a layer 7 HTTP RBAC filter. This filter has a dictionary
mapping of all our allow lists. It's going to look up python example 2 and grab its allow list. If the upstream host is in there, we allow that traffic; if it's not, we return a 403 and deny that traffic. In this case it is allowed, so we forward it to a forward cluster. That's an Envoy term; you can just think of it as an HTTP forward proxy. It makes the DNS request and then forwards the traffic onward to the internet.

Next I'm going to talk about the transparent proxy. This path is how we capture leaky traffic, and we also have a bunch of levers here to control traffic. Comparatively, it is far, far more complicated, and this is the heaviest slide of the presentation, so please bear with me and hang in there. We'll go step by step through how a web request goes through. This is again lifted directly from our internal technical documents. The reason it's far more complicated is that we need something to extract and preserve the data we need from the calling service and make it available as transport properties for the internet gateway to allow or deny on.

Let's walk through the request. We make a web request, and for some reason this traffic isn't obeying our connect proxy, so we're going to capture it. We iptables-redirect this traffic, and you'll see 1.2.3.4 gets changed into localhost. This is a locally running proxy living in the same network namespace as the service. We need this additional proxy to live local to the service because it needs to be able to preserve this transport property: it needs to be able to restore the destination IP address. When you iptables-redirect something, the destination IP gets preserved on the local socket as what's called the original destination. Envoy is going to read it from the local socket, which is in the same network namespace as the service, and it's going to use the PROXY protocol to preserve that IP, so it gets added as a header in the
TCP packet. We then upgrade this connection into a connect call and inject these additional properties at layer 7: the upstream hostname, which is sourced from SNI in this case since it's TLS-encrypted traffic, and the downstream python example 2 service. So now at layer 4 we have the IP address, and at layer 7 we have SNI and the service. Then we forward this to the internet gateway, and at this point it's very similar to the previous slide. We terminate the connect, we have the downstream service name and the upstream host, and we pass them into our layer 7 HTTP RBAC filter to check against the allow list again. If it's allowed, we unwrap the PROXY protocol, terminating it to restore the destination IP address, and then we forward it to a TCP forward cluster.

I'm out of breath just talking about this. It's okay if you fogged out there; it may take a bit to digest. The key takeaway is that the transparent proxy is much more complicated because you need something to preserve the transport properties of the calling service, and we do this with a proxy running local to the service before we forward the traffic to the internet gateway.

Let's walk through what happens now if a service gets compromised and attempts to reach back out to a malicious domain. If it's using things that obey the connect proxy, we're going to see it, check it against the allow list, and block it if the domain isn't on it. If it tries to circumvent the connect proxy, we trap it in the transparent proxy via iptables and block it there. And finally, if it tries DNS exfiltration, that won't work either, because we've blocked external resolution.

So, rollout. The core work on the project was really done by two engineers, one of them being me, and
configuring the gateway was actually the fastest part. There was no custom development in our solution. Everything was out of the box from Envoy, which is open source. I'm not a C++ Envoy developer; I'm still filing public GitHub issues, reading the open source docs, and searching public Slack channels for help. So standing up the infrastructure was the quickest part. What really took the most effort was figuring out how to shift traffic through the internet gateway without disrupting production.

Fortunately, at Lyft there are many levers we have. We can onboard by staging versus production, by canaries, by language types, and finally by tiers, where tier 3 means it's not that critical if it's disrupted, no one really cares, all the way up to tier 0, where if you disrupt that traffic you will be preventing our riders and drivers from going online.

As we onboarded services, we just passively observed the traffic and used that as the opportunity to hydrate the allow list: what domains are services reaching out to? We were just shadow-denying traffic that wasn't on the allow list. Once we saw the shadow denies essentially flatline to zero, that's when we switched from shadow to enforcement mode. This is when we're actually enforcing those HTTP RBAC filters that we saw earlier.

Finally, once we flipped that switch, we made sure we communicated broadly with engineering, and we made sure our infrastructure documentation and our runbooks were really crisp, because now when we see traffic that is denied, service owners receive a PagerDuty alert. That PagerDuty alert shows: this is the traffic your service is attempting to send out, it was blocked for this reason, if this is normal this is how you correct it, and if it's not, this is how you reach out to security.

Some of the concerns were potential latency incurred
by the pr