
Well, it's great to see so many faces curious about a topic like load balancers. I absolutely love it. When it comes to infrastructure, let's be honest, it only gets discussed when things break. No, seriously, it only gets discussed when things break. Raise your hand if you relate to that. Wow, quite a few. Almost everyone. And raise your hand if you have supported customers late at night or during the early hours of the day. Okay, I really feel your pain. Let me share a personal story. It's 3:00 a.m. and I get a call from my manager and he goes, "Arjun, we've got a problem." He's on my do-not-disturb exception list, a decision I second-guess every
time my sleep gets disturbed. He sounds rattled and he tells me, "Arjun, the attacker has found the backend application IP. He has circumvented the entire edge stack and he's hammering our application. We are under a DoS attack." The call that day taught me a brutal lesson. Your edge is just a theater, just like this. If your application allows traffic from everywhere, you can spend half a million dollars on enterprise-grade security, but if the attacker knows your origin IP, it's like buying a really expensive front door for a house which has no back walls. Hi everyone, I'm Arjun and I lead IBM Cloud's global load balancer service. In 13 years, I have built and secured this infrastructure, which serves
billions of requests each month for our enterprise customers across the globe. In my day-to-day, I work on architectural reviews, working with security teams and ensuring good security controls, threat modeling, and, most importantly, preventing infrastructure bypasses. In that time, I have seen load balancers go from simple distribution devices to one of the most security-critical components in your enterprise stack. So why are we here? My goal today is to show you four rounds of how these attacks look in practice and how we defend against them. Real techniques and real defenses. The format is going to be super simple. We will attack the infrastructure, a demo showcasing the attack, how we defend against it, and a
quick win to follow. Through each round, I'll systematically escalate the attack. In the first round, we will try to find an exposed origin IP through a leaked certificate. And by the fourth round, I will circumvent the entire edge stack with a simple DNS query. But here's my promise to you all. By Monday, I'll give you a 20-minute checklist with which you can go back, audit, and fix any similar infrastructure issues in your enterprise stack. Which brings us to this paradox. We assume that the edge protects everything downstream. It's a beautiful dream where the request is nicely scrubbed, sanitized, and given a spa treatment, goes through the edge and onto your backends.
But the reality is that origin IPs leak through misconfigured DNS and certificate logs. Look at the dashed line at the bottom. Your load balancer is sitting right there, ready to perform world-class mTLS, but it's only as good as a security guard in an empty lobby while the attacker is already inside the building. The perimeter works perfectly as long as the attacker is polite enough to go through the CDN. Spoiler: they aren't doing that. When the attacker penetrates your infrastructure, the consequences are immense. Let's let the real data from 2025 sink in with a deep breath. But here's the good news. The defenses I'm going to show you cost you literally nothing and they are really easy to
implement. All right, let's get into the first attack, which arises from certificate transparency logs. Here's something most teams do not realize. Every time a certificate is issued by Let's Encrypt, DigiCert, or literally anyone, it gets logged into a public, immutable, searchable database. And this is by design. Certificate transparency logs were created after the DigiNotar and Google incident to provide accountability for all certificates. And now it has become a nightmare for your backend origin protection. So when your load balancer is configured with that certificate, maybe with a subject alternative name pointing to your main domain and maybe a subdomain, it's now publicly searchable, and that's an attacker's dream. Let's take a look at this diagram
and I'll show you in action.
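As a rough illustration of why CT logs are so easy to mine, here is a minimal Python sketch that collects hostnames from CT search results. It assumes the results were exported as a JSON list in which each record carries a `name_value` field holding one or more newline-separated names; that field name, the function name, and the sample hostnames are assumptions for illustration, not something shown in the talk.

```python
import json


def hostnames_from_ct_records(raw_json: str) -> list[str]:
    """Collect the unique hostnames named in a JSON list of CT log records."""
    names = set()
    for record in json.loads(raw_json):
        # Assumed field: newline-separated names covered by one certificate.
        for name in record.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name:
                names.add(name)
    return sorted(names)


# Hypothetical sample data standing in for a real CT search export.
sample = json.dumps([
    {"name_value": "example.com\nwww.example.com"},
    {"name_value": "origin-lb.example.com"},
    {"name_value": "WWW.example.com"},
])
found = hostnames_from_ct_records(sample)
print(found)
```

Running the same kind of query against your own domains is a cheap way to see what an attacker sees before they do.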
Okay, let's take a look at a typical enterprise setup. This company has invested heavily. We're talking 100k a year in Cloudflare Enterprise. When I curl the main domain, I get exactly what I expect: a 403 Forbidden. And this is enterprise-grade security in action. Your WAF, DDoS protection, rate limiting, everything you can expect. It's all there. And on the surface, the fortress looks impenetrable. But the attackers don't just knock on the front door. They're looking at CT logs. When we query these logs, we see the main domain, which is protected, but we also see the subdomain, and that's the crack in the armor. When we resolve that subdomain with a simple dig or nslookup, that
reveals the truth: the subdomain resolves directly to your load balancer. The IP is completely exposed and there is literally no protection. I'm going to make a direct call to your load balancer right away and show you how your 100k investment got wasted right there. When we make the call, it returns a 200 OK. And with that, the attacker is in. There are literally no Cloudflare headers, no rate limit, no WAF, and I have direct, unfiltered access to your load balancer. In less than a minute, costing me literally nothing, I completely circumvented the edge.
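The check performed in this demo can be turned around and used defensively. The sketch below, in Python, takes hostnames that have already been resolved (for example with dig) and flags any whose IP falls outside the CDN's address ranges. The range list, hostnames, and addresses are hypothetical documentation values; in practice you would substitute the ranges your CDN publishes.

```python
import ipaddress

# Hypothetical CDN range for illustration; use your CDN's published list.
CDN_RANGES = [ipaddress.ip_network("198.51.100.0/24")]


def exposed_hosts(resolved: dict[str, str]) -> list[str]:
    """Return hostnames whose resolved IP is outside every CDN range."""
    exposed = []
    for host, ip in resolved.items():
        addr = ipaddress.ip_address(ip)
        if not any(addr in net for net in CDN_RANGES):
            exposed.append(host)
    return sorted(exposed)


# Hypothetical resolution results: the main domain sits behind the CDN,
# the subdomain points straight at the load balancer.
resolved = {
    "example.com": "198.51.100.10",
    "origin-lb.example.com": "203.0.113.25",
}
bypass_candidates = exposed_hosts(resolved)
print(bypass_candidates)
```

Any hostname this reports is a candidate bypass path of the kind shown in the demo.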
So how do we defend against it? There are two critical layers. We start off with network isolation. We only allow inbound traffic to your application from the load balancer which sits right in front of it. And on the load balancer itself, we want to enforce origin authentication. It only allows requests to come in via the CDN. It validates a request through an X-Origin-Verify header and mutual TLS. This prevents an attacker from bypassing Cloudflare by connecting directly to your load balancer. Without both layers, we are vulnerable. And I've seen many such examples in real life. And here's a five-minute fix. We list all the security group rules. We look for the vulnerable any-any security
group rules. We add a new rule to allow ingress only from the load balancer. And now the most important thing: we test the data path and we triple-check the security group rules. These tend to be complicated. And we ensure that your production workloads are still working. I don't want you to chase me down on LinkedIn, because my do-not-disturb exception list is already quite traumatized. Only then do we go ahead and delete the any-any rule. In five commands and in five minutes, you've made it a lot better. Attack two builds directly on attack one. Now that we have your load balancer IPs, we're going to see what else is
running behind it. Now, a load balancer could be front-ending many different services. These could be Redis with no authentication by default, your HAProxy front-ending your management interfaces, health checks, your database ports, everything represented by a listener running on the load balancer. The attacker runs a port scan on the load balancer, connects directly to these services, and extracts all the data they need. We go from "I know your IP" to "now I have your admin session tokens" in a matter of seconds. And this is what happens when we think that everything behind a firewall is protected. Well, if your services are accepting connections from everywhere, there's literally no firewall.
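To see your own load balancer the way a port scanner does, a small TCP connect check is enough. This is a minimal Python sketch using only the standard library; the function name, the timeout, and the example host in the usage comment are assumptions for illustration, and it should only be pointed at infrastructure you own.

```python
import socket


def open_tcp_ports(host: str, ports: list[int], timeout: float = 0.5) -> list[int]:
    """Return the ports on host that accept a TCP connection."""
    reachable = []
    for port in ports:
        try:
            # A completed handshake means something is listening and reachable.
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(port)
        except OSError:
            # Refused, filtered, or timed out: treat as not reachable.
            pass
    return reachable


# Example usage against a hypothetical host you own:
# print(open_tcp_ports("origin-lb.example.com", [80, 443, 6379, 8080]))
```

Run from outside your network, anything this returns beyond the ports you intend to publish deserves a second look.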
Okay. So now that we have bypassed your edge security, let's see what we can find on that exposed load balancer. We'll start by resolving the origin domain. Once the target is identified, we'll scan for open ports. Port 6379, that's your Redis, your in-memory database. Port 8080, that's the HAProxy stats page. These services should never have been exposed to the internet. Let's see what else we can get from them. We'll start with Redis exploitation. Redis is used to store your session data, API keys, and credentials. We'll try to connect to it with redis-cli, and we see there was literally no password required. From here, let's see what is being cached. We run the
KEYS * command and we see API keys, database passwords, session tokens, everything the attacker needs, sitting right there in plain text, accessible to anyone. Further, the HAProxy stats page shows a complete map of your infrastructure, and that's an attacker's dream. Two minutes of scanning and just misconfigured services accepting connections from the internet: that's how most breaches happen. Just basic misconfigurations. The defense is twofold. We bind everything to the internal interface and we enforce authentication. For example, here we are talking about Redis hardening. The bind directive restricts Redis to localhost, and requirepass forces authentication. In addition to that, we enable TLS. So even if the attacker knows your
password, they have to present a client certificate. As a broader principle, this can be applied to all the services in your architecture. Assume that your network will be compromised, because I just showed you how in a matter of 60 seconds. Let's look at a 10-minute fix. You log into your server over SSH. Run netstat with the -tlnp flags, and that shows everything bound to a TCP socket. You filter the output and look for 0.0.0.0 wildcard bindings. Those are the ones which are accessible from the internet, and the attacker can reach them the moment they find your IP. Rebind all of them to localhost. For
Redis, it's just a one-line change. For most other services, it's just reconfiguring and restarting them. Enable authentication with a password generated by openssl rand -hex 32, which can be done directly on the command line. Confirm that everything is working, and we move on from there. Which brings us to attack three, which is subtle but really powerful. Most applications return 403 for restricted paths and 404 for non-existent paths. But from an attacker's point of view, each 403 is a confirmed endpoint. The difference is pure information leakage. They fuzz all the common paths, separate the 403s from the 404s, then deep-fuzz the confirmed paths and try to figure out your entire API surface in a matter of
minutes. The worst thing is that this works through the CDN and WAFs, because these error responses are being returned directly from your application. So even if the services were locked down, the error messages would leak information, and I'm going to show you. We start the error enumeration attack. The attacker is going to probe different endpoints without authentication, and we'll watch the response codes. We'll start by running an nslookup on the leaked subdomain and then try different endpoints. Look for the pattern of inconsistent errors: a 404 Not Found, a 403 Forbidden. One means the path does not exist, and the other confirms that it does. These two codes just
showed me what these endpoints really look like, and there was literally no authentication required. With endpoint discovery, the attacker starts enumerating all the common admin paths, and within a matter of minutes they can map your entire API structure: admin endpoints, internal APIs, debugging tools, all discovered through these error codes. With a word list, the attacker sends thousands of requests, and in 30 seconds they get to know what your developers took months to build. I can feel it. These error messages are documentation for attackers. Please be mindful. The fix is elegantly simple. We need uniformity. The attacker treats 403, 404, 500, all of these as a map. We're going
to change the rules. Every unauthenticated request gets a generic 404 Not Found. Whether they hit admin or any non-existent path, the response needs to look identical. There's going to be zero information, and that completely neutralizes error-based enumeration. A quick win would be to implement a default-deny policy at the root location. We return a 404 for every request. That's going to be your default. Nothing exists unless explicitly allowed. For the rest of the locations, your service-specific locations, those are the ones allowed on the public path. Whatever needs to be externally accessible needs to be defined as such, and everything else sees a 404. The validation is simple. We run the
same fuzzing script, and we eliminate an entire class of reconnaissance in under 20 minutes. And that brings us to the fourth vector, which showcases how a simple misconfiguration can be a very, very expensive mistake. In Cloudflare's terminology, an orange cloud represents a domain which is proxied and which enjoys all the enterprise-grade security like rate limiting, DoS protection, bot protection, WAF rules, anything you would expect. But a grey cloud represents a domain which is unproxied and configured for DNS resolution only. With a leaked subdomain configured as grey cloud and pointing to the same production application, the attacker can bypass the proxied domain and make you really susceptible to DoS attacks. Many times, solution engineering teams implement these techniques to do stage
testing before they do production rollouts, and that's where some residual effects of these configurations lead to such attacks. So we start by looking at both domains. The main domain, as we discussed, enjoys enterprise-grade security because it's behind the orange cloud: the enterprise-grade perimeter with DoS protection, WAF, you name it. The subdomain, however, was configured with the grey cloud, non-proxied and DNS-only, and that record was unfortunately leaked, and that's your bypass. Watch what happens. When I curl the main domain, I get a 403 Forbidden. The WAF is doing its job, I'm flagged as an unauthorized user, and I'm blocked. And when I shift the
attack to the grey cloud subdomain, the backend load balancer, the same request returns a 200 OK. And we have completely sidestepped your enterprise stack. From here, the attacker has a clear shot. We can now test rate limiting. Against the main domain, I get 429 Too Many Requests, as expected. But against the unprotected domain, that's where the problem happens. There's no limit. We can flood the origin until the application breaks down. The customer sees a total outage. Your security dashboard, unfortunately, would show everything is green. And you are left wondering how your 100k investment completely failed in a matter of seconds, just because of this misconfiguration. So, to defend against these,
it depends on what kind of architecture you're using. The gold standard is a VPC with private DNS. Your origin has no public IP at all and it lives inside a private subnet. The DNS name for the origin resolves only through the private DNS inside the VPC. The traffic from the CDN goes through the private path and directly into the VPC. There are no grey cloud records because there's no public IP to point them to. And this is what we run for enterprise-grade deployments. If you cannot go fully private, maybe you need a public IP for legacy reasons or for migrations, the alternative is to configure your security groups to allow inbound only
from the CDN's published IP ranges. For Cloudflare, it's a pretty popular list. We enforce validation through mTLS between your CDN edge and your load balancer, so even if someone reaches the IP, they cannot establish a connection without presenting the CDN's client certificate. Validate the origin-pull header that your CDN sets, and reject any request that comes directly to the IP address rather than through the expected hostname. Each approach achieves the same goal: even if the attacker discovers the origin IP, they cannot establish any meaningful connection. A 10-minute fix would be to scan for all those grey cloud records and toggle them to the proxied orange cloud
configuration. We delete all the staging and residual configurations on Cloudflare or any other CDN provider. We verify with dig, and it should return no origin IPs. Any domain which returns your origin IP without going through the CDN is problematic. This is the single most impactful thing you can do today, because one grey cloud record undoes everything else. Now let's talk about comprehensive rate-limiting techniques. The edge handles your noisy volumetric attacks, and there we enforce IP-based limits and bot detection. Target around a thousand requests a minute. Your load balancer catches distributed attacks by capping concurrent connections and validating headers. There you can do half of that, which is 500 requests per minute.
Further down, at the application, we move from IPs to identities and we enforce something like token-weighted quotas per authenticated user. Target around 100 requests. And behind that, for the backend services like your databases, you can apply circuit breakers and back-pressure mechanisms to ensure that your service degrades gracefully rather than catastrophically. Each layer catches what the previous layer missed. Okay, how can I be here without talking about AI? So let's see where all of this is heading. In the cloud-native era, load balancers and proxies were mostly handling stateless connections. Your client would make a request to the load balancer, which would terminate and relay
it to the server and respond back to the client. These were fairly quick transactions. But every time you use Claude, GPT, or any agentic framework, your agent is now connecting to tools with MCP. These tools could be Slack, GitHub, your databases, APIs, but these aren't quick stateless API calls. MCP uses JSON-RPC, and these are long-lived stateful connections with really, really bursty patterns: 50 or more tools firing in a matter of seconds. This creates a completely new attack surface. We need a centralized component which will oversee each connection and enforce identity to prevent session hijacking and credential exposure. And you guessed it right. It's our friend the load balancer once again to the
rescue. As I see it, the load balancer is evolving into an agent gateway. It'll be required to establish a centralized tool registry, which ensures that your agents have a single routable canonical endpoint to discover all the available tools. It's going to assist with tool slicing, so that we manage sessions and the gateway aggregates the responses and returns them to the right agents and clients which requested those tools. Modern rate limiting has shifted from counting connections to actual token-weighted consumption, which allows us to throttle rogue AI agents before they drain all the resources. It's going to provide a uniform security and identity posture to your clients, no matter what's there
downstream on the server side. And last but not least, it's going to provide consistent observability and traceability through integration with OpenTelemetry and log analysis. The bottom line is, as these AI agents proliferate, the load balancer becomes even more critical. And this is a slide to photograph: your Monday morning checklist. In the first 5 minutes, we audit the security group rules and we restrict ingress to the load balancer subnet. In the next 10, we find the services bound to wildcard interfaces, rebind them to localhost, and enable authentication. We normalize our errors and return a consistent 404 for any unauthorized requests. Last but not least, scrub your DNS records and
toggle any grey cloud configurations to proxied configurations. In 20 to 40 minutes, each item addresses one of the four attacks we discussed today. You walk in Monday morning, you grab your coffee, and by lunch you're all done. Let me leave you with three things. First, your load balancer is your perimeter. It oversees everything. Treat it like a security tool and not just a router. The second one is that defense comes in layers. Network isolation keeps the attacker out. Combine that with service hardening and proper DNS hygiene. At scale, make it organizational. Use cloud policies to prevent bad configurations at an organization level. Give developers secure Terraform templates, make it easy for them to make these
deployments, and automate the scanning, because someone will always go rogue. Let me leave you with this. In the AI-native era, your perimeter is only as strong as your deepest configuration. Real security isn't just a firewall. It's pervasive governance of every identity, every agent, and every interface. Thank you, guys. >> Gentlemen, please give it up for Arjun Sharma. We have one question and then we've really got to go. So the question says: did you consider that an attacker could bypass filtering with a volumetric L3 DoS using IP-spoofed Cloudflare source IPs? >> That's one question. >> Yeah, it's not
>> Yeah, so the question is around spoofing those Cloudflare IPs. I think, in addition to that, what I highlighted is that you need to enforce mTLS in addition to filtering on those IPs, because those are published IPs. That really ensures that everything is authenticated between Cloudflare and your load balancer, and both the client and server sides are validated against the common CA which Cloudflare provided to your load balancer. So that really ensures that we have security intact. >> Thank you, Arjun. We really appreciate it. Thank you, guys. That was great. That was awesome. Yeah.