
BSidesSF 2025 - Resilience in the Uncharted AI Landscape (Ranita Bhattacharyya)

BSidesSF · 2025 · 31:24 · 62 views · Published 2025-06 · Watch on YouTube ↗
About this talk
Resilience in the Uncharted AI Landscape
Ranita Bhattacharyya

So you've just battled a dragon: how quickly and effectively can you fight the next one? We dive into Resilience by Design for an AI chat/search product — based on considerations like disaster recovery, availability, foundational security, etc., while meeting audit/compliance & privacy regulations.

https://bsidessf2025.sched.com/event/7be1cd52d8522886bcacdb2f92622db2
Transcript [en]

All right, we're going to go ahead and get started here. This next talk is Resilience in the Uncharted AI Landscape by Ranita. All right, everyone. So, you've just had an epic battle. How quickly can you get yourself back up, dust yourself off, and be ready for the next fight? Let's talk about resilience. Hi, my name is Ranita Bhattacharyya. I've been in the security GRC and PM space for many years, and I've had the pleasure of leading teams doing pivotal work in this space, some of which is listed up here. So today on the agenda, we talk about resilience in the current context and the basics of resilience. We talk about how to plan for it, how to be resilient

by design. We take an example and go over it: how and where to invoke recovery processes come crunch time. And lastly, if you are resilient, how do you prove to your customers and to regulators that you are in fact resilient? It's always very interesting to me how foundational terms get enlivened with new context every time there's a massive technological advancement. And so I've recently enjoyed this definition of resilience. It comes from a research paper; the source is linked there. These researchers break it down into three key pillars. First is robustness: how well are you able to withstand impactful events? In our vernacular, that is good, solid, foundational cybersecurity. Next is

resilience: being part of an impactful event, how well are you able to recover and get back to stasis, to steady state? In our vernacular, that would commonly be business continuity or disaster recovery. And lastly, plasticity: being part of an impactful event and then having recovered from it, how well are you able to incorporate your lessons learned back into your steady state? In our vernacular, that would be process improvements. Now, as I keep doing this over and over again, the one glaring experience that stands out to me is just the vast number of granular decisions that you have to take that ultimately converge together to build something that's

resilient by design. These little, tiny, in-the-weeds decisions, right from the start, from when your product is but a doodle on the back of a napkin. Now, if at this point you're thinking, well, this seems tedious, why would I bother? I just want to build my product and go. I'm going to remind all of us of, in our recent past, the July 2024 CrowdStrike issue, where unrelated companies across various industries were decimated operationally. And at least in the aviation space, would you rather not be like United, or even Allegiant Air, which were impacted but got back up fairly quickly, versus something like Delta

which had what, like 1,200 flights grounded that Friday and more from there on out. So resilience is important, and it is supported by all of these tiny decisions that you're making. Now, decision-making, like many other things, is an art. And so I take some inspiration from this artist quote. Hans Hofmann, I think, said: eliminate the unnecessary so the necessary may shine. Right? Remove the noise, find the pivotal parts of your process, and really focus your resources on them. But then there's a scientist quote to conflict with it. I believe Einstein said: everything should be made as simple as possible, but not simpler. And it is this yin and yang, this balance between these two

approaches, that you gain experience with as you do this a few times. And it is this balance ultimately that will correspond to something that is built well and resilient by design. All right, I'm going to take a moment here and shoot off five fundamental tenets of what it means to be robust, of having good, solid cybersecurity. So, in my opinion, five great tenets of foundational security. One: be secure by design. Two: defense in depth, with more depth where it is needed. We've eliminated the unnecessary, we've not oversimplified, we've found the critical parts, and we've added defense in depth. Three: policy adherence. It's good if there is a central body doing this, but at a

minimum, we should be able to track deviations. Four: a culture of accountability. Everyone from dev to ops to even customer success: if a change was made, we know who did it and why. And lastly, zero trust. Now, zero trust can mean micro-segmentation of the network, it can mean really stringent access controls, and it can also mean segregation of duties and peer reviews. Similarly, we'll quickly go over five resilience fundamentals, in my opinion. The first: redundancies and failover systems. How faithful do your contingency environments even need to be? Two: multi-region, multi-zone availability. How much investment really needs to go into this sphere? Three: uptime considerations. Are you beholden to customer SLAs? Does uptime control your revenue? You

know, in 2021, when Facebook went down for what, like six hours? I think Forbes estimated that they lost around $65 million, and I think Fortune grouped Facebook and WhatsApp together and said they lost $100 million for six hours of downtime. So no matter what the actual number is, it's an eye-watering sum of money, and clearly uptime is important for them. Number four, moving on: automated backups. What should the frequency be, etc.? And last: data recovery and prioritization. Everything Everywhere All at Once is just a movie; it doesn't happen in real life, right? You've got to figure out what comes first and what comes next come crunch time. So, quickly shooting

off a few fundamental principles at us. Now, any adventure that's worth having is best had with your friends and your allies. And so the key players in this adventure are: one, in red, we have security people like myself, people like you: GRC, incident response, security operations, but also AppSec, infra, everybody who helps us provide the security context. Next, in blue, we have the product devs, product leaders, the AI platform owners, AI engineering, etc. In pink, we have infrastructure engineering, so SREs. And last, in black, we have the auditors and the regulators. I will be reusing these icons later; I'm just trying to establish context here. All right. So now that we've done that,

let's actually take a tangible example. Let's touch it, let's feel it, let's take actual decisions in this context. So we'll take a GenAI feature, let's call it a GenAI-based product, and then we'll nest it inside a larger agentic process. So on the screen here is a very, very basic GenAI chat/search product workflow. Let's actually build a business around it. I am a sitcom enthusiast, as it said in the first slide, so let's say mine is a grocery chain called Food and Stuff, for all your salt-of-the-earth meat and potato needs. I am serving up this chat function to my loyal customers. I want to drive business, I

want to bring them value, and at the same time I want to get insights. So let's say loyal customer Ron Swanson comes to my chat function and he asks: what would be the way for me to plan a dinner party for 10 of my Parks and Rec office colleagues? What would a menu look like? What would recipes look like? What do I need to buy? How do I need to decorate? How do I even account for a variety of dietary restrictions and allergies? So, Ron feeds his question to the user interface, it goes to the traffic server, hits the AI engine. His question is broken down via NLP; his question is tokenized and contextualized. The LLM is

RAG-enabled. We are scouring the web, we are looking at blogs, we are figuring out how best Ron can leverage this information. But our goal here is to highlight our business. Food and Stuff has their own inventory, their own knowledge base, and so we're maybe making our own ML model, training it with our data, and also integrating that into this product. But that's not all. This product is, let's say, one of the features in my red agentic process. It's one of the agents, right? And potentially after this stage is done, after this chat interaction, maybe the customer is handed off to the next agent, which is an AI assistant that

helps them make the shopping list. And from then on, the customer goes to the next agent, which helps them with purchasing; maybe it even has APIs off to Shopify. And from then on, it goes to the next agent that helps them do order fulfillment and delivery; perhaps you're working with Instacarters and DoorDashers, etc. But then again, that's not all, because hundreds and thousands of Ron Swansons are coming together to generate insights for me, which I might be using in another semi-autonomous process to optimize my stocking and procurement. And then, of course, all of these insights together also help me serve targeted ads to acquire yet more customers. Really, as you know by

now, the sky is the limit for how these processes can be used. But you and I will pause here, because the goal for today is to build something that's resilient by design. And so we will conceptually think of the product that we just thought of building, and we will take a lot of decisions. We said we take hundreds of decisions, but maybe not a hundred; we take at least a few in these eight categories here. So, starting off with product architecture. Presumably this is done by your people in blue, product leadership. You're working with your AI engine, maybe you're working with an AI platform. And so very much, really, at this stage we're deciding

what the business proposition is, what agents there should even be, how much autonomy is okay, and how much autonomy is too much autonomy. And this factor right here, this tenet, is what helps you define which parts of your process need to be given special focus, and which will maybe be targets for exploits and need to have more resilience activities. Another key thing to think about at an early stage is build versus buy. I am grocery chain Food and Stuff. Do I really have the resources to invest in my developers? Do I really want to build things from scratch, or is it better to just purchase something off the shelf to

meet my needs for today? But what happens when my business grows tomorrow? Do I go for managed services, or do I rather try to go for more strategic partnerships with a Google or a Microsoft and invest deeply in one ecosystem? So these decisions at this stage help us define roles and responsibilities. Come crunch time, who is responsible? Am I responsible to plan for it, or are the vendors responsible to plan for it? Are we making sure that all of these things are taken care of, from a resilience standpoint, at least contractually? All right, moving on, coming to infrastructure design. Let's actually make some choices here. So let's say I've decided to go with a GCP

backend, and very early on I'm deciding, okay, I'm not going to go monolithic, I'm going to go microservices. And so, keeping resilience as my north star, I'm going to say that each of my agents is packaged in its own container. But what about the workflow, then, that controls the orchestration between these agents? Why not make the workflow its own microservice in its own container? Very good. Now, to orchestrate between all of these containers, maybe I decide to invest in Kubernetes, so I deploy GKE. But then this is a lot of infrastructure to control, so then I go with IaC. Maybe I

invest in something like Terraform. So all of these decisions that we've just made, or rather that your infrastructure engineers in pink have just made, these are all with a viewpoint of enhanced visibility and great isolation, so potential for strong cybersecurity. We will also be able to have excellent reproducibility, which helps you with resilience. Moving on, the next category of decisions: this is the building of your LLM product, of your AI agents. And honestly, the only thing that we should worry about here is to make the best product that our budget can buy us. Everything else very much controls resilience around it. So in this context, maybe you

know, you're maybe doing it with Dialogflow, you might be using Gemini in the back end, maybe you decide, nope, GPT is a better fit along with Gemini. Coming from Google Next, maybe you're using Vertex AI. But remember, we had our own database: Food and Stuff wanted to highlight our own inventory, and so we trained our own ML model, and we want to make sure that the workbenches take that into account and appropriate orchestration is happening. The sky is the limit; focus on the product. But in contrast, the deployment pipeline is very much focused on security and resilience. This is not "do what is the easiest and the smoothest thing." This is very much do

what is the safe and the more resilient thing at this stage. So, pretty standard: you might go with Git for your source control. You might be doing Jenkins, Cloud Build, even GitHub Actions for CI/CD. On the creation of your AI agents, you're running unit tests and integration tests, and when you're ready to deploy, remember we said we're packaging each agent in its own container. So we want to make sure that we are loading these Docker images onto Docker Hub; again, excellent reproducibility. And finally, when we're ready, we deploy to the Kubernetes cluster. So at this stage, we're, let's say, halfway through this cycle. The product decisions have

been had, the product is built. Maybe we want to release a beta, maybe emails are going out to your Ron Swansons, and this is when we start seeing traffic. People start using the product, and that is where SREs come in. Presumably you've deployed a number of observability tools. So let's say anomaly detection is a focus. Maybe you've gone with the ELK stack, maybe Elastic; maybe you've indexed with Logstash; maybe you're using Kibana for visualization. Prometheus is a popular choice, and you could be using Grafana. The security space has many, many vendors; it's your pick. But the point is to really contextualize and fine-tune what alerts you're looking for

from an observability perspective, because that ultimately is of great interest to these folks, the security folks doing security event monitoring. They're doing your logging and your alerting. We're taking in the audit logs, VPC flow logs, application logs; you're probably also getting a bunch of logs from the AI platform itself. Everything goes into your SIEM, and all the alerts are being fine-tuned. But it's important here to not lose focus on basic vulnerability scanning tools: your SAST and DAST, your SCA, making sure CSPM and SSPM tools are appropriately configured. Vuln management, even though it is not the thing you speak of in the middle of an incident response, helps you establish context. Remember, we're

preparing to be resilient, and more information is good in terms of preparation. Now, taking a small pivot: security, privacy, compliance. If your product, if you as a company are going to be beholden to a particular compliance regime, now would be the time to figure that out. So in our example, this Food and Stuff feature is presumably collecting a lot of very sensitive information: demographic information, geographical information, even health information, you know, dietary preferences and allergies. Now, if they're doing business in California, presumably CCPA is something of interest to them. So they need to figure out early how they're going to comply with that. This is very difficult to do at the end. And

last, but certainly not the least, backup and storage architecture. Again, what faithfulness of backup do you need? For the chat feature, for example, do I need to have a full-on reproduction of a fully GenAI-enabled product again? Or, for crunch time, for a very limited period of time, am I satisfied with just routing customers to a very heavily populated FAQ database? Maybe that's something to think about. All right, now, we're all security personnel. We'd all like to make the most top-tier, platinum-level decisions across the board for security controls. But you guys, purse strings are always tight. We are beholden to our budgets. So as cost optimization becomes an

important discussion, here are some things to think about. Number one: quantify your ROI and resilience benefits. For instance, if I were some kind of content streaming platform, it would be very important for me to be available no matter what; any user coming to my platform should be able to access content no matter what. But that's not the case here. Food and Stuff, for example, is quite likely a geographically constrained grocery chain, much of whose business is probably still driven by foot traffic. So maybe I don't need to go multi-region, multi-zone. Maybe I'm okay with it being single-region, multi-zone. That way I save some dollars and also get some availability benefits.
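The single-region, multi-zone trade-off she describes can be sanity-checked with back-of-the-envelope availability math. A minimal sketch, assuming independent zone failures and an illustrative 99.9% per-zone availability figure (both assumptions are mine, not from the talk):

```python
# Back-of-the-envelope math for the single-region, multi-zone decision.
# The 99.9% per-zone figure is illustrative, not a cloud provider SLA,
# and real zones share region-level failure modes (not fully independent).

def parallel_availability(per_replica: float, replicas: int) -> float:
    """Availability of N independent replicas where any one suffices."""
    return 1 - (1 - per_replica) ** replicas

def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours per year at a given availability."""
    return (1 - availability) * 365 * 24

single_zone = 0.999
multi_zone = parallel_availability(single_zone, 3)  # 3 zones, 1 region

print(f"single zone: {single_zone:.6f} "
      f"(~{downtime_hours_per_year(single_zone):.1f} h/yr down)")
print(f"3 zones    : {multi_zone:.9f} "
      f"(far under a minute/yr, ignoring region-wide outages)")
```

The independence assumption is the catch: a region-wide event takes out all three zones at once, which is exactly the residual risk Food and Stuff decides it can live with.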

Next on the docket, we have scenario planning, in red. Now, this takes a few flavors. One: you figure out what is a valid scenario for your business, for your use case. Food and Stuff's use case is going to be very, very different from, let's say, a company operating a pipeline or operating trains; there's no critical infrastructure in that context for them. But at the same time, no matter how strong your security is, attacks are going to happen. So if you've done the part, if, like we said, you've eliminated the unnecessary, if, in the first stage of our decision-making, we figured out the key

parts of our process and we've identified where resilience efforts need to focus, have we created the disaster recovery plans for them? Something to think about. Moving on, in yellow, we have the balancing of immediate savings with long-term value. Do we want to pinch pennies, which we could have had we decided to go monolithic or pick something off the shelf, or are we deciding to invest in the future? Are we deciding that our business is only going to grow from here, and this is the path of growth? And so we went with microservices, we went with Terraform, we went with Kubernetes. All right. And lastly, KPIs. These are the KPIs that make sense today

with the knowledge we have, with the technology that exists now. So certainly there need to be metrics around these KPIs today, but these KPIs themselves will also change with time, and it's important to have regular checkpoints to change what those KPIs are going to be. It's a two-fold process. This debate can go on till the end of time, but the one key set of people who will help you focus a lot of this decision-making are going to be these behemoths in the black box: our auditors and our regulators, if you remember my icons from earlier in this talk. These regulators are defining your rules for the road, your price of doing business.
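The point about KPIs needing regular checkpoints can be sketched as a tiny review-tracking structure. The KPI names, targets, and quarterly cadence below are illustrative assumptions for the Food and Stuff example, not anything the talk prescribes:

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Illustrative resilience KPIs with built-in review checkpoints.
# Names, targets, and the 90-day cadence are assumptions for this sketch.

@dataclass
class Kpi:
    name: str
    target: str
    last_reviewed: date
    review_every: timedelta = timedelta(days=90)  # quarterly checkpoint

    def due_for_review(self, today: date) -> bool:
        """True once the review interval has elapsed."""
        return today - self.last_reviewed >= self.review_every

kpis = [
    Kpi("RTO (chat feature)", "< 1 hour", date(2025, 1, 15)),
    Kpi("RPO (inventory KB)", "< 15 minutes", date(2025, 1, 15)),
    Kpi("MTTR (sev-1 incidents)", "< 4 hours", date(2025, 4, 1)),
]

today = date(2025, 6, 1)
for k in kpis:
    status = "REVIEW" if k.due_for_review(today) else "ok"
    print(f"{k.name:26s} target {k.target:14s} [{status}]")
```

The value here is the checkpoint mechanism itself: the metric definitions are expected to be replaced as the technology and threat landscape change.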

Effectively, if something is a mandatory requirement, if you've got to invest, there's no point pinching pennies. Just do it right, and do it right the first time. All right. So at this stage, you've designed your product, you've deployed it, you've configured security controls around it, you've debated cost to death, and now, once it's live, it's time to check how resilient we actually are, with something like the predictive risk management that's listed here. So again, I'm fond of tenets of five, clearly. Five steps here. Number one: have an asset inventory. Identify all your key assets. Two: process inventory. If you've identified all your key assets, infra, tools, applications, figure out how they

interact with each other. Three: risk inventory. Figure out what the key risks are. Conduct workshops; go and talk to people in your company. Four: mitigation and resilience. Are we sure that all the risks in our register have mitigations in place? Or is somebody out there in the company saying, okay, we accept the risk, and if so, who is that person? And number five: communication plans. Do we know how to talk outside the company? Do we have PR playbooks, do we have runbooks, do we have canned responses? And do we know how to talk inside the company? Do we have internal runbooks when it's time

to manage the incident? Do we even know what the Slack template is going to look like? These are the basics of your continuity planning. All right, so the warriors are trained, they're armed. Let's give them a dragon to fight. This is a pretty generic incident response process. It goes from detection to alerting to response to breach assessment to restore and recover. Let's actually walk through some of the process, and in the interest of time, I'll go maybe a little bit faster. So let's say our same professional for Food and Stuff is seeing a lot of traffic at a particular endpoint, or maybe he's seeing really interesting traffic: the language

is incorrect, the syntax is off, the geography is off. There's no Food and Stuff in North Korea; what's happening? They're working with their AI platform folks, either vendors or in-house, who they rely on for abuse monitoring and content safety. And when an alert is actually raised, it is these security folks who are assembling the tiger team. They're verifying and triaging. They've called in legal, they've called in comms. Response is now starting. Now, what would be a good way to do that? This is a DDoS-esque, prompt-injection type of attack, clearly. So, check your rate limiters, check your traffic quality, disable the impacted endpoint, maybe rotate keys, etc. Next, when you have some control

over the actual response, we go into the breach assessment. We figure out what the blast radius was. You're looking at all the logs; you're tracing the exploit. Remember, we had our own ML model, right? And we prepared for this. We did disaster recovery, we did scenario planning. So now we know what type of data we need to focus on. Remember data categorization: now we know what we're thinking of first, what to look at first. And we are helped by our friends in blue to extract forensics. Now, restore and recover. What I find is that most often, the restorative activity is generally rolling things back to a last known good point. We might try a variety of

different mitigations and fixes, but very often the very first thing, and the most logical thing to do, is going to be to roll back: roll back the model, roll back the user sessions, roll things back to a time when you knew things were good. And from a security standpoint, we then invoke recovery processes. Now, what might that look like? This is a bit of a checklist, and that is by design. It's like coming out of surgery and your doctor having a checklist: running a vitals check on you, running certain labs. They know what to look for so they can adjust your medication. Same thing here. So, in my checklist, there are five key categories of things you check

for. First, you validate each system's integrity. Since we're talking agents, you validate every single agent. You validate the model, you validate the data. We said we have APIs and a workflow, so we want to make sure those APIs are working appropriately; test them. Next, when satisfied, we reintroduce traffic gradually. Maybe you configure your load balancer to only send a certain percentage of traffic, and you monitor that traffic, you monitor the behavior very, very stringently. You have your SRE and security folks glued to their dashboards and ready to fall back to the contingency environment as soon as something goes wrong. And maybe your load balancer is also configured to increase traffic based on

metrics or time. All right, so we validated every single unit, but how about validating the whole process next? So, number three is: you synchronize your AI agents and their components. You validate the workflow, you check the health of the dependencies, you check the decision engine. We make sure that we are stress testing, we're edge-case testing, we run things back through the Vertex workbenches and pipelines. Now remember, the choices we made from an infrastructure standpoint have already taken care of capacity and scalability, so this should be a breeze. And last but not the least, security and compliance. Like I said, very often the thing to do is to roll back, roll back

to a last known good point. And as you're doing that, you're going to lose a lot of security work. You've lost patches, you've lost scan-related data. So scan again, do the vuln management processes again, repatch. At this point, a wide variety of access elevation has happened to fight this incident; those elevations need to be rolled back. And lastly, compliance reporting. Is Food and Stuff a public company? Was a financial reporting mechanism impacted? Maybe SOX or the SEC requires that you actually report on this. Once all this is done, we come to the plasticity section of this programming. So we've done our after-action review, in black.
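The "reintroduce traffic gradually" step from the recovery checklist above, a percentage ramp gated on monitored health with fallback to the contingency environment, can be sketched roughly like this. The ramp steps, error budget, and probe interface are hypothetical, not from the talk:

```python
# A minimal sketch of health-gated traffic reintroduction after recovery.
# RAMP percentages and the 2% error budget are illustrative assumptions.

RAMP = [1, 5, 25, 50, 100]   # percent of traffic to the restored stack
ERROR_BUDGET = 0.02          # abort if more than 2% of sampled requests fail

def error_rate(sample: list) -> float:
    """Fraction of failed requests (False entries) in a monitoring sample."""
    return sample.count(False) / len(sample)

def ramp_traffic(probe) -> int:
    """Ramp traffic step by step; return the percentage reached (0 = fell back).

    `probe(pct)` is a hypothetical monitoring hook returning a list of
    per-request success booleans observed at that traffic level.
    """
    for pct in RAMP:
        sample = probe(pct)
        if error_rate(sample) > ERROR_BUDGET:
            print(f"error budget blown at {pct}% - falling back to contingency")
            return 0
        print(f"{pct}% traffic healthy, increasing")
    return 100

# A healthy restored stack: every sampled request succeeds, full ramp completes.
ramp_traffic(lambda pct: [True] * 200)
```

In practice the probe would be the same dashboards the SRE and security folks are already glued to; the point is that the fallback decision is pre-agreed, not improvised mid-incident.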

Any manual workarounds? We decommission them; that's in red. We restore automated processes: batch jobs, scans, etc. And lastly, we reach out to our customers. We tell them: hey, we were under attack, but you know what? We tamed the dragon. All right, so let's say this worked so well that we are now able to maybe sell this out, license it to something like a Central Perk, who wants to use the same workflow. How can they be sure that this won't go bust mid somebody's latte order? Or another case: what if corporate giant Dunder Mifflin decided that for their Scranton office, Food and Stuff is going to be the snack vendor of choice? And so Kevin in finance goes through the whole

workflow, comes to the payment page, but then is stopped. Angela makes him think: is this PCI compliant? Is it safe to put your credit card in here? At this stage, you guys, if you're going to go and try to figure out all your compliance needs, your resilience needs, it's a bit late. It's very arduous. You could, though, track all the effort we've put in. You could take this example, make all these decisions, and just align them to any one of these certification platforms. And barring small twinges in documentation, you are ready to be certified. At this stage, the only thing really that you have to define are your boundaries, the narrative, the scope. What are you exactly responsible for if

you're going to be selling your product? And scope definition is based on a handful of decisions. Does your organization only provide the tooling and the environment? And/or does it also provide the data and the storage of the data? And/or does it also provide query development and the AI/ML models? Which of these segments does your service organization fall in? And once you've determined what section it is, the only thing that's really left to do is determine who else owns the rest. Is your customer responsible, or are your vendors responsible? Once you've done that, my friends, you're ready to be certified. You're ready to send this out into the world. So, I've been speaking

at length, and we are at the end. So if I can leave you this talk with only a handful of takeaways, they are these. One: resilient by design starts at the start, when your product was, like I said, a doodle on the back of a napkin. Hundreds of decisions that begin early. Next: compliance is your friend. Regulations can be friendly; use that to prioritize this effort. And last: look, with AI and all the new upcoming technology, the tools change, the challenges change, but ultimately the approach, the goals, the ethos remain the same. Good resilience is based on thorough risk assessment and practiced plans. Remember, we're not trying to reinvent the wheel here. We're just making sure that it is fit for this new

rugged terrain. Thank

you. And if there are any questions: I didn't see anything on the Slido, but I'm happy to run around. Yep.

After the talk? Sure thing. I don't think I put it as part of the slide, but I'm happy to pass it on. Hello. Yeah, thanks for the talk. So my question is that previously we were securing our infrastructure and APIs with firewalls and standard tools. Now, with an LLM-based chatbot, how do you protect, or guarantee, that people cannot exfiltrate data by changing the prompt, or asking things of the LLM which ultimately would have your backend leak some data? Because I said, okay, pretend I'm this person, give me information about that person, because I'm just writing things in there. So is there a way to protect against such things? How would you go about that? I

think that's excellent, and the answer here, at least in my context, is you can't be sure. You can never be sure, right? You can try, and that's what micro-segmentation is for, that's what containerization is for, and that's what zero trust is for. You can try, but ultimately, I think the point is, no matter how high your walls are, somebody's going to find a way to scale them, which is why this whole talk matters. We are at time. Thanks a lot, Ranita. Please, everyone, give it up for her. Thank you so much. So, contact information, I realized, was not here, but I will be around. Please come say hello. Thank you.