
Securing GraphQL from Design to Production

BSides PDX 2025 · 24:34 · 28 views · Published 2025-12
Category: Technical
Style: Talk
About this talk
Corey Le shares lessons from securing a dozen GraphQL services at a tech unicorn, covering introspection risks, error handling vulnerabilities, logging challenges, and rate-limiting strategies. The talk dissects real attack scenarios—including a denial-of-service via unbounded queries and an API key leak via verbose errors—and demonstrates practical defensive patterns for GraphQL design and deployment.
Original YouTube description
Securing GraphQL from Design to Production - Corey Le

Learn to secure GraphQL interfaces by looking at design decisions and actual attacks. This talk dives into a half dozen GraphQL services that were deployed at a tech unicorn. You'll learn practical defensive strategies, discover where common security controls fall short, and see the fallout from attack scenarios that were missed.

Corey has been in the information security space for over 20 years and building software applications even longer. He spent years on the East Coast as a principal security consultant with the Intrepidus Group before joining the in-house security teams at places like Etsy and Simple. He spent 6 years at a unicorn tech company, becoming their Director of Product Security. Currently living on the Oregon Coast, he enjoys tinkering with PCB designs in KiCad, singing off-key punk songs with his son, and trying to convince people that video games can be art. Corey has previously presented at BlackHat, CanSecWest, Yandex, and BSidesRoc.

---

BSides Portland is a tax-exempt charitable 501(c)(3) organization founded with the mission to cultivate the Pacific Northwest information security and hacking community by creating local inclusive opportunities for learning, networking, collaboration, and teaching. bsidespdx.org
Transcript [en]

[music]

Welcome to Securing GraphQL from Design to Production. It's a real pleasure for me. I've been able to come to BSides here the last couple of years, so it's really nice to get the opportunity to present here for all of you. By the end of this talk, I'm hoping you'll understand how these small requests that came into our environment once every five minutes caused a denial-of-service attack on our back-end systems. Hopefully those of you who are pen testers and coming across GraphQL will have a few more ideas of things to try in testing, and for those of us helping to build secure applications with

GraphQL, some more things to keep in mind at the development stages. So, a quick little bit about me. This is how my 3-year-old thinks daddy works, which is not completely wrong, but I do have slightly better typing skills. For this particular talk, I was formerly director of product security at one of those tech unicorn companies that made a lot of use of GraphQL. So the stories that I have here for you today are from my experience with them. I've recreated a lot of the services, the requests, and the responses. So these aren't the actual services anymore, but they capture the idea of things that we actually had

reported to us through our bug bounty program, through our pentests, and through our other reviews. I'm going to assume you're all a little bit familiar with REST APIs. And if you got to Corey Ball's talk earlier in track 2, that was a great foundation for what my team and I understood about security for REST APIs and endpoints. But when I started with this company in 2019, they were already starting to use GraphQL. My whole security team had to come up to speed and understand the differences between securing REST endpoints and securing GraphQL. Over my time there, there were about a dozen or

so different GraphQL services that they had in place. But they pretty much fell into three categories. We had internal services. We had GraphQL gateways; we ended up developing a GraphQL gateway for all of our mobile apps to use. And finally, at the end of my time there, there was a partner API that we developed to help our external partners integrate with our systems. It was really our development team that was driving this adoption of GraphQL. It created a useful abstraction for them, with their microservices being able to continue to develop and change without breaking things for the clients that relied on them. So

they're really the ones that helped drive the adoption of GraphQL. We weren't using MCP or any LLMs at the time, so we didn't have GraphQL integrated with that, but it's kind of the new chocolate and peanut butter: technologies that work really well together. So if you haven't hit GraphQL yet, I've got a feeling you might be experiencing it more and more with the current hotness. And if you've heard anything about GraphQL versus REST API security, you've probably heard about introspection. Introspection is a set of special queries that can be sent to your GraphQL server, which returns information about the whole schema for the GraphQL service in the response. So any queries

and mutations it supports, any of the objects, any of the field names, are all contained within that schema: anything your developers need to know to understand how to make a request to that service and get back the data they want. Obviously, for an attacker, that's great information to have as well. It's like somebody publishing their API documentation online for you. If you've ever come across this sort of UI before, this is GraphQL Playground. It's really popular. Off to the side there's usually the schema and data section that you can expand and start poking around in to see that information. And all of that is coming from parsing the

response from one of those introspection queries. Now, if you ever get your GraphQL service pen tested and you have introspection enabled, expect to get that as a finding. So we disabled introspection in production, but when we were doing our own internal reviews, we'd either test against a staging system with it enabled or get the schema from our development team and use that as part of our security reviews. And what we found after a little while was that a really helpful way for us to understand that schema was to actually visualize it. So this is GraphQL Voyager. It's not necessarily meant to be a security tool, but our team found it

extremely helpful for understanding the relationships within the graph and seeing things quickly and visually. Similar to GraphQL Playground, it gets all this data by sending an introspection query and parsing the response. But in this case, it's parsing that response into an interactive web page that you can use to drill down into the different objects and relationships within your graph. If you're using Burp Suite as your testing tool, there's the InQL plugin. Thank you to Xavier for pointing that out to me. It's been a great tool and resource. It makes formulating your queries really handy if you're actually testing a GraphQL interface. But also,

it now has GraphQL Voyager built right in. So if you want a really simple way to pop that open and take a look visually, that's a really handy tool. Like most things in Burp, if it's a really large schema that takes up a lot of memory, you might have to walk away for a little while, let it load, and come back after a day, but it's still really handy. So let me show you how these relationships ended up impacting us at the place where I previously worked. I realize this might be a little hard to see, so I'll talk a little bit about it here.

On the left-hand side, we have the GraphQL query that our mobile app was sending to the service. And over on the right-hand side is the response data that was coming back. I apologize; a lot more of these down the line will hopefully be clearer to read in the back. But basically our mobile application was searching our locations over a certain time period and asking for any bookings that were already taking place at that location. And that was the data you're getting back on the right-hand side: UUIDs, timestamps, publicly available information. Our security team was perfectly fine with this information being returned to the clients. But if

we took our graph and actually visualized a simplified version of the schema around how we were handling those bookings, this is what that graph would have looked like. We have our booking object there in the middle. We've got our booking UUID, our start dates, and a relationship over to the location object that had all that building location information. But probably jumping out at you already is that there's a relationship to a user object as well, which made logical sense: it was the users of our system who were making these bookings, so they'd have to own these bookings in some sense. Our client side wasn't requesting this, but since we do see that

it's here in the graph, it is something that we could form valid GraphQL syntax for and see if we can return that data along with it. And GraphQL makes that really easy to do. You just say, hey, can I have that user object? I think I saw there was some PII around there: first name, last name, email address. Great, return that to me. And oh, wasn't there also a payment object related to this? There was, and all that data did come back. So this was an eye-opener for us in understanding these graph relationships. While I talked about this in the context of introspection

and having that schema and being able to visualize it and see those relationships to know how to ask for the data in GraphQL, really the issue here is that we didn't have the proper auth checks around when access to that user object is allowed. So our fix was to build those in here. And in fact, we had introspection disabled on this service. I mentioned it came from our mobile app. So the person who reported this to our bug bounty actually did it by decompiling our mobile app, looking at all the different GraphQL requests that we were sending, building their own schema, and understanding that this relationship existed.
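
A minimal sketch of the kind of check this fix implies, assuming the authorization decision is made where the booking-to-user relationship is resolved rather than only at the top-level query. The names here (`Booking`, `User`, `resolve_booking_user`, `viewer_id`) are hypothetical and are not taken from the talk or from any particular GraphQL library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    id: str
    first_name: str
    email: str


@dataclass
class Booking:
    uuid: str
    start: str
    owner: User


class NotAuthorized(Exception):
    """Raised when the viewer may not read the requested field."""


def resolve_booking_user(booking: Booking, viewer_id: Optional[str]) -> User:
    """Hypothetical resolver for the booking -> user edge.

    The check sits on the relationship itself: any query that can reach
    a Booking can also ask for its user, so the public booking search
    must not be the only place where access is decided.
    """
    if viewer_id is None or viewer_id != booking.owner.id:
        raise NotAuthorized("user is not visible to this viewer")
    return booking.owner


owner = User(id="u1", first_name="Ada", email="ada@example.com")
booking = Booking(uuid="b1", start="2025-01-01T09:00", owner=owner)
```

With this in place, `resolve_booking_user(booking, "u1")` returns the owner, while an anonymous caller or a different user gets `NotAuthorized`, even though the booking's UUID and start time remain publicly readable.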

So this was a wake-up call for us on the security team to use Voyager more, understand those relationships in the graph, and start looking for them. And in fact this almost happened to us again when we developed our partner API service. Our partner graph was quickly becoming this everything GraphQL API. It had low-risk operations, like what hours a building was open, but also really high-risk queries and mutations, like getting invoices and doing billing and payments. And this caused a lot of those relationships to exist between high-risk and low-risk objects. To some degree we tried to separate that out. I think at one

point we worked with having maybe a public user object and then a "my user" object for more of those internal things that somebody would set for their account. It ended up causing a lot of confusion and being a lot more complex than what our developers wanted. So, like I was saying, this partner API had turned into this everything app, and just because you can add everything together into a single graph doesn't mean it's the best idea. We definitely experienced our graphs growing larger and larger, but we did look to Shopify as a good example, where they split out their admin GraphQL API from their user API, their Storefront API as they

call it. So this was something we referred back to again and again as we were figuring out how to separate these services and not lump everything into one single graph. One last introspection-related item that I want to talk about. As I mentioned, we would disable introspection on our graphs, but we had a number of times where the graph was still trying to be a lot more helpful when it came to errors or typos. So if you didn't know, as in this example, that it was a user object and you just guessed, maybe it's called username, do you have one of those? The error message over

here was trying to be helpful. It's like, oh, I don't have username, but we do have user. Disabling introspection and disabling these hints or suggestions are often two separate flags in most GraphQL implementations, so you'll want to check for that. Now, if you're like me and you're coming from using your web server request logs to understand how your users were interacting with your different REST API endpoints, those logs are probably going to be a lot less useful for you when you come into GraphQL. A big reason for this is that all your GraphQL requests, whether introspection, queries, or mutations, are POST requests to the same

endpoint. So if you want to understand what your users are actually asking for and doing with your graph, you're going to need a lot more logging on your application side to get that data. And if you're starting to do that, one of the things we learned is that operation names are optional, and users can alter them and choose whatever they like in most cases. So if you are going to tag or log those operation names, make sure you do that with something you have tagged on your server side. Also, we had instrumented our logging system to alert based on a lot of 500 error

messages. But in a lot of cases with GraphQL, if there are errors, it's actually going to log a 200 OK message and package those errors in its response to the client. I'll show an example of how this got us a little bit later down the line. But that's something to be aware of as well: you'll probably see a lot fewer of those 500 errors in your logs once you move to GraphQL. The other logging feature that we found we needed to implement was a consistent request ID associated with each of the requests that came out of our GraphQL service. Typically, we'd have one query that

would come in to our GraphQL service and then spawn maybe a dozen or so different requests to our backend services. Having a request ID that would tie all those different backend requests to the same client-side request was something we found we actually had to build into our GraphQL system ourselves. So, with those error messages, I mentioned that it's typically going to log a 200 OK message for those and then package the errors in the JSON response, as long as you have valid GraphQL syntax. Originally, our GraphQL developers really liked this approach because their team was responsible for the graph and not most of the backend

services. So the graph was running fine; here's the error message from the backend service; go and talk to this other team, that's where your issue lies. And on the security side, we got our GraphQL system pentested. We were looking for verbose error messages. We checked the configuration for that, and it wasn't enabled. But more and more backend services were being connected to the graph. Most of those had verbose error messages disabled, but one day something changed; one of those connections changed. And it was a pretty subtle error message. We don't get a full stack trace here. It was just a URL that was part of that

response. And tacked on to that URL was the API key that we were using to communicate with that third-party service. So basically anybody who saw that error in detail could completely bypass our front end, skip our WAF and all of our other security controls, and connect directly to that third-party service. Obviously we had issues around how we were authenticating with that service that we could improve. But this event was our wake-up call that we wanted a better way of handling error messages with GraphQL. And fortunately, most GraphQL implementations actually do offer a very simple sort of catch and formatting for

error messages. This was a great way that we could say, hey, we know these error messages are helpful to our clients. They're coming from systems that we've worked with before. We have good validation around that. But we're going to have a default generic error message that's returned in all other cases. That ended up being really useful for us, but it was also something we needed to help our developers understand to build in early in the GraphQL development process, and not have it be a last-minute thing that came back to them from the pentest. One last big topic I wanted to touch on

was rate limiting for GraphQL. As you can already imagine from what we just talked about regarding the differences between GraphQL and your REST endpoints, if you have rate limiting rules set up in your WAF or gateway firewall around REST requests, those aren't really going to translate well to protecting you from GraphQL attacks. In fact, it's normally a feature of GraphQL to take what used to be multiple REST requests and package them into one GraphQL request, so your client is actually interacting with you less. But this can lead to two types of attacks that we frequently see with GraphQL. They're kind of query stuffing attacks. There's sort of this

idea of a depth attack and a breadth attack. A depth attack is a recursive GraphQL query that's really compact and usually simple for clients to figure out and send to your service, but it can cost your server a lot of resources and time to process and handle. In this very simple generic example, you can imagine you have a web storefront that has products in it. Those products belong to a store that has products in it that belong to a store that has products in it, and so on and so forth. And this is all very valid GraphQL that you could ask for. Most GraphQL implementations do have a simple

configuration setting to say this is the maximum level of depth that I want. It may take a little trial and error to figure out the right setting for your environment, but that's usually a pretty easy problem to solve. Where it gets more complex is this idea of breadth, where somebody can just send a very large query asking for lots of different objects. Some of those may be very simple for your system to gather and return, and others might require a lot of resources in your backend to piece that data together. And that's basically what led to this issue with our services here. Here are these Splunk logs. Again,

this is a single request. It was spaced out every 5 minutes. And when this hit in the evening, our internal service would be DoSed for 15 or 20 minutes or so trying to handle these. At first, we didn't even think the cause of the problem was coming through our graph service. There wasn't a whole lot going on in these logs. There wasn't a whole lot of traffic or other activity. We had some of those logging issues I mentioned earlier. But when we dug in, we realized that some of these request and response sizes were really huge. And that was our tip-off that, okay, something weird is

happening with these queries and the service. As we dug more into it, we realized that this was actually one of our partner teams externally sending a request that was basically scraping our locations for all that booking information, like we showed earlier. They actually thought they were being very efficient with how they were formatting their query, which to a certain degree they were. But they had no visibility into what that was doing to our backend systems. So in our case, the DoS was coming from inside the house. We could just reach out to that team and say, hey, we've got to restructure this query and

how you're accessing the data. We added some more data loading to make these systems more performant. We also changed the size of the overall POST request buffer that we were allowing in, because it didn't need to be as large as what they were sending. So we had some pretty quick fixes for our issue. But really it started us down the path to a long-term fix, which is being able to calculate the cost of the GraphQL queries and requests coming into our system. Basically, you need to calculate the cost at two different points. One: how much does this query or mutation actually cost you and your systems? But then also a way

to look at a GraphQL statement before we actually run it and figure out how much it's going to cost our system and what a reasonable limit is. As with most rate limiting tools, a generic one-size-fits-all, out-of-the-box solution is probably not going to give you the security you need for your systems. There are a few tools that let you get as deep with this as you want. Typically it means adding annotations to your schema saying how much each of these operations is going to cost you. But you'll need a way to figure that out and decide what makes sense. We really looked to

GitHub and Shopify again. They have some great blog posts about how they've set up their rate limiting rules. And if you're using Apollo, which is really popular, they actually point to an IBM study about how they went through GraphQL and started figuring out what costs should be in most environments. I did want to talk a little bit about what we learned from doing pentests against all these GraphQL services: what worked well for us and what didn't. What didn't work well is just saying, here's the website, introspection isn't enabled, go for it. We definitely found that, to get better results from our pentests, we would try to

include as much documentation for our pentesters as possible. While we couldn't share the actual code for our projects, we would have them test in a staging environment with introspection enabled. We'd also share with them any of the saved queries and mutations we were using in our CI/CD pipeline for developers to test things, as well as any queries that we had saved on the security team. A lot of times we had a Postman package that we would let the pentesters use. Seeing what some of that default data looked like and what a valid response should be was a huge time saver and really made the pentest a lot more efficient. Also,

during the kickoff calls, we would demo GraphQL Playground and Voyager for them. We would talk about which queries and mutations were higher-risk and more costly for us. And if we did have any rate limits in place, we'd explain to them what those rate limits were and whether they could bypass them. I know I couldn't cover everything, all the security features of GraphQL. Obviously you want to keep the whole OWASP Top 10 in mind. There's a great OWASP cheat sheet specifically for GraphQL nowadays, so that's really handy. And so with that, thank you all for coming. I'm local, from around here. I'll be in the hallways. I'd love to chat with you about this or anything

else afterwards. Thanks very much. The website has the slides on it. And also, if you're in the CTFs, there might be a little something extra on there for you as long as it stays online. [applause]
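
The depth limit and the pre-execution cost check described in the talk can be sketched roughly as follows. This is an illustration under stated assumptions, not the speaker's implementation: the query is assumed to be already parsed into a nested dict of field names, and the per-field costs in `FIELD_COST`, the limits, and all function names are hypothetical. Real cost values would come from measuring what each field costs the backends.

```python
from typing import Dict

# Hypothetical per-field costs; in practice these would be schema
# annotations derived from measured backend cost.
FIELD_COST: Dict[str, int] = {"locations": 5, "bookings": 10, "user": 20}
DEFAULT_FIELD_COST = 1


class QueryRejected(Exception):
    """Raised when a query exceeds the configured depth or cost budget."""


def query_depth(selection: dict) -> int:
    """Number of nested field levels in a selection tree."""
    if not selection:
        return 0
    return 1 + max(query_depth(child) for child in selection.values())


def query_cost(selection: dict) -> int:
    """Sum of per-field costs over the whole selection tree."""
    return sum(
        FIELD_COST.get(name, DEFAULT_FIELD_COST) + query_cost(child)
        for name, child in selection.items()
    )


def check_query(selection: dict, max_depth: int = 5, max_cost: int = 100) -> int:
    """Reject the query before execution if it is too deep or too costly."""
    depth = query_depth(selection)
    if depth > max_depth:
        raise QueryRejected(f"depth {depth} exceeds limit {max_depth}")
    cost = query_cost(selection)
    if cost > max_cost:
        raise QueryRejected(f"cost {cost} exceeds limit {max_cost}")
    return cost


# Selection tree for a query like: locations { bookings { uuid start } }
booking_query = {"locations": {"bookings": {"uuid": {}, "start": {}}}}
```

Here `check_query(booking_query)` passes with a cost of 17, while a store/products query nested eight levels deep is rejected on depth alone. The sketch ignores list sizes and pagination arguments, which a realistic cost model would multiply into the estimate.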

All right. So the question is, and again I understand this is likely a business decision, so very case-specific, but from somebody who doesn't do a lot with GraphQL: where do you draw the line between implementing controls at the GraphQL endpoint, like rate limiting and cost, versus data structure design and SQL queries and whatever is doing the stuff on the back end? Because it seems like they're kind of in contention, right? You always want to be more efficient so your GraphQL endpoint can do more, but how did you draw that line in the sand? >> Yeah. No, definitely that was difficult, and I think something our

development teams went back and forth on. In some cases it was easy, like when we were designing the mobile client that was talking to the GraphQL gateway. But for others that were supposed to be publicly accessible, where we didn't really have a good sense of what queries might be coming to our system, it was really a back and forth. And as is typical on the security side, we were fighting for tighter control around the rate limits and the cost, and our developers would want it more towards the looser end. So in a lot of cases, that was really where sort

of tracking performance metrics for the GraphQL service and the backends was something that we needed to look into. Honestly, only some of that decision belonged to the security team. A lot more of it had to be made by the engineering team. >> Fair. Okay. So do you use OTel data or something to tell you what's happening on the back end for all that? I suppose money talks, right? Whatever's cheaper to implement. >> Well, whatever's cheap to implement. I mean, we did pay more to have Splunk at certain points, and that's what we were using for a lot of our metrics before we

moved to APM. But yeah, I wouldn't say that you could have all your controls just in the front end. There were definitely a lot of times where it was like, we need to rely on the backend API to have the proper rate limit controls around how many requests or how much time we'd allow this to take. >> Okay, fair enough. Thank you. >> Yeah. >> Hi, thanks for the talk. I missed the first part of your talk. So you're on a security team at a business or >> I was the director of product security at a company using GraphQL. >> Have you had

much success, or made any effort, in controlling developer API allow lists, like endpoints? It sounded like the rate limiting was kind of along the same lines, because we've looked at it with Swagger docs through the WAF, like: this isn't in the Swagger docs, so the request isn't going to work. But we always backed off on it because it seemed really sticky. So I was wondering if you had any experience with that. >> So sorry, what part exactly with >> Oh, with API endpoint allow listing, or GraphQL rate limiting, with kind of passing

that down to the dev teams, if you've had any experience or success with that, and if it's as sticky as it sounds or if there's something that you >> Yeah, I guess to a certain degree, like I was mentioning with the rate limiting, it's going to work differently with GraphQL. You really are going to need to calculate these costs to go along with it, besides just setting how many requests I'm allowing into my service, or how many requests from our GraphQL gateway to the endpoints. We basically had to, yeah, make an allow list for the GraphQL service to

bypass any of the regular REST API rate limits we had in place, because it was our gateway and was funneling traffic for multiple users in a lot of cases. So we did that. And then you were asking a little bit about an allow list, and it was really our GraphQL implementation itself that determined which API endpoints were getting queried from GraphQL. So it did work as a sort of gateway for us that would limit access to only the API endpoints that we built into the graph, if that makes sense. >> Yeah, that's really impressive. >> Well, I think that's time. Thank you everybody. [applause]

>> [music]