
[music]
[music] It's great to be here. So, a little bit about myself: I'm a security engineer at Databricks. Previously I've worked at companies like Stripe and Datadog, and I have over 15 years of experience dealing with security toil. Let me just see a quick show of hands: who here has been on call in some capacity for their work? Right, okay. Well, this talk is for you. I have been on call a lot. I've been on call for vulnerability management teams and for detection engineering teams, and I've also taken the pager for service outages, things of that nature. And being on call really sucks sometimes, so this talk is about how I have
tried to make life a little better for me and my co-workers. [snorts] In this talk we're going to cover how, at Databricks, we reduced the incident investigation time for service issues from around 15 minutes to under two minutes in most cases. This talk is really about the practical implementation of MCP servers for security operations. So if you've toyed around with LLMs or MCP servers in the past and you're interested in how to potentially use them in a corporate environment, we'll touch on all of that. We're also going to talk a little bit about some of the challenges of using MCP in multi-cloud environments.
So I thought it'd be good to level set by talking about operational toil. I asked my good friend Claude here if he could define operational toil for me. [snorts] And I actually think this is a pretty solid definition. The thing that really stands out to me about it is the emphasis on manual and automatable work. If we're doing something that requires us to put our hands on the keyboard and stop what we're doing, and it's something that could be automated, I've found that to be frustrating. And I think that, as an industry, we want to work towards a state where the computers are working for us, not the other way around. I've also found that during on-call rotations we rarely have time to implement the changes or the automations that are needed to make life better, and ultimately this becomes tech debt. We all know that tech debt is notoriously difficult to pay down. On-call rotations are, by their nature, interrupt-driven: when you're working on a project and you receive a page, you're pulled out of that project and suddenly you're looking into an alert or an incident. It's exceptionally disruptive to our productivity.
There's a workplace researcher at UC Irvine named Gloria Mark who did a bunch of research about a decade ago into workplace interruptions, and she found that it takes about 23 minutes for an employee to regain focus on the task at hand after they're interrupted. I'm sure many of you, if you work in an office, have had that experience where it's Monday morning and your co-worker comes by to ask how your fantasy football team did when really you're just trying to focus on whatever project you're currently working on. It's very hard for our brains to pause one task, quickly switch to something else, and then switch back. Dr. Mark also found in her research that these impacts on our cognitive context tend to compound: the more we get interrupted, the harder it is to refocus on whatever task we were originally working on. And I think this is really exacerbated in the cybersecurity industry. As a detection engineer or a security engineer, you have to build complex mental models around threat actor TTPs; you have to understand service architectures and network topologies. So we're constantly getting pulled into other types of activities when we're on call, and it's exceptionally disruptive.
At the end of the day, a lot of this leads to burnout, which we know is a huge challenge in our industry. So what are some things we can do to address that? Well, the first thing is to look at what we're doing today. And if you look at security teams today, I think we'd all agree that a lot of them scale unevenly. When I say unevenly, I mean that regardless of your expertise, your background, and your tenure, what I've found in my personal experience is that someone who's been at a company for two or three years is going to be more effective at responding to security incidents than someone who's been there for three months. In the past, we've tried to address this through playbooks. But playbooks are really an 80/20 solution. They require that we tabletop or imagine all of the different ways we'll need to investigate an incident or respond to a scenario in our environment, and it's that remaining 20% where we really struggle: the 20% we haven't anticipated and haven't accounted for. They also require continuous upkeep. Rarely have I used a playbook that didn't need some kind of update or tuning. And this all becomes tech debt that, again, is really challenging to maintain in the heat of the moment.
Finally, I think it's important to point out that when we're working with a playbook, we still have to fundamentally understand the problem at hand. If you get an error message or you're working an incident in a technology that you have no understanding of, even if you have a great playbook, you may not really know how to use it. These are all challenges we have to deal with when we're thinking about cybersecurity incident response. To me, the person that has the pager is really the first person to the scene of the crime. To make an analogy to an EMT: the EMT gets to the scene of the accident as quickly as possible, and their job is to stabilize the patient and ensure they get any critical care that's needed right away. But it's not their responsibility to take that patient from the moment of the accident through the long arc of their recovery. In incident response, we're functioning the same way. When we take the page, it's our job to triage the incident as quickly as possible, to try to understand the what and why of what happened, and ideally to reduce the cognitive penalty we're paying for being pulled out of the other work we're trying to do.
So, what does an ideal incident triage solution look like? To me, in my experience, an ideal solution can quickly get to the what and the why of a security incident, regardless of whether you're troubleshooting a failed service or some kind of alert that's been triggered in your SIEM. Getting to the what and the why is very important, because ultimately, as the first responder, you need that context to understand how to escalate the situation. I think it's also really important that a solution shows its work, particularly in the age of large language models where we have concerns about hallucinations. It is really important that when you're reviewing what's happened in an incident, you can understand exactly how a triage solution reached the conclusions it did. And finally, I think a good triage solution needs to tell you what to try next. It doesn't necessarily have to solve the problem for you, but being able to point to the next steps you should take to investigate an issue, the next set of questions you should ask of the system, ultimately makes your incident response process much easier. So, what does this have to do with Databricks? Let me tell you a little more about our detection and response environment at Databricks and how we've been able to use some of these tools to help us out.
So, Databricks is a multi-cloud company. As a security team, we are performing continuous monitoring across AWS, Azure, and GCP, plus a few other custom environments. We are monitoring over 50 cloud environments, and in each one of those regional deployments we have hundreds of batch and streaming jobs that are supporting thousands of individual detections: everything from detections monitoring CloudTrail logs to endpoint-based detections. And things go wrong, right? Anytime you're operating at that scale, there's just a host of issues that are going to come up. The first and probably most common issue is delayed log delivery: we have an upstream provider problem, particularly with SaaS products, where there's either a gap in logs or a delay in log delivery. We also have cloud provider outages; I think we all got a reminder about that on Monday of this week. And then we also have artificial issues, right? We have quotas and rate limits. So if you find that your workloads do really well on one instance type in most AWS regions, that instance type may not be available everywhere. Every region is a little different, and we have to be able to account for that. And finally, people make mistakes, right?
How many of you have ever shipped code that's broken something in production? Right, okay, there are probably more of you than raised your hands, but yeah, I've broken production. People break production all the time, and this is despite the fact that we have unit tests and integration tests and CI/CD. These are just things that happen. So, like I said, we have this regional deployment model. If you can imagine 50 of these regional workspaces, all funneling data into a common set of security operations tools, and then ultimately, when something breaks, it goes to my phone, and then I'm really sad because I have to stop what I'm working on and pivot to a new project. So here's a very common case study of just a failed Spark job in a single region, to give you a sense of what it takes to dig in and investigate one of these problems. First of all, I have to pull out my phone and acknowledge the page: click the button on my Apple Watch, click the button on my phone. Then I have to log into PagerDuty and pull up the incident. I've got to find the right incident, and then I have to read the error message. I've got to figure out which region is involved and potentially which job is implicated, or which data source.
Then I've got to open up a whole new set of browser windows and log into that region. I've got to find that job, and I need to find the most recent execution of it, or where it errored out. At that point, I'm five or ten minutes into this process and I've completely forgotten what I was working on before I got the page. I'm reading an error message about delayed log delivery or us-east-1 having a problem, and after I grok all of that, I have to actually decide what to do next, right? Is this a job that I know a co-worker was recently working on when they were shipping new detections, so I should just ping them and see if they can help out, or do I need to escalate this to a different team? And it's just really frustrating. Like I said earlier, I think the computer should be working for us, not the other way around. Earlier this summer, I had a really terrible on-call rotation. There was one single day where I got paged something like 60 times, and I just got to the point where I said, you know, I'm not doing this anymore; there's got to be a way to improve this process.
So I committed to myself that during my next on-call rotation, I was going to do everything I could to use AI-assisted tooling to try to make this less painful and more consistent across the entire team. When I started down this road, my first thought was, I'll just use an LLM to solve all of these problems, right? And an LLM makes sense for a lot of reasons. LLMs have the ability to consolidate a lot of disparate pieces of information into a cohesive story that tells us a lot during our triage process. They have the ability to traverse complex codebases: at Databricks we use detections-as-code, and we also have infrastructure-as-code, so I'm able to expose all of that content directly to the LLM. When it's reading an error message, it can actually go in and look at the source code itself to try to identify the specific problem. And finally, LLMs have a lot of subject matter expertise. They know a lot about computers, and even though they get things wrong occasionally, they can generally point us in the right direction, particularly if we're dealing with something we don't have a lot of experience with ourselves. So my back-of-the-napkin drawing of how this would work was some kind of operational agent, some kind of solution that had the ability to reach into all of our regional workspaces, to reach into Slack, to reach into Jira, and to really be able to give us the information we needed to triage incidents faster. So, about a year ago, Anthropic released an open specification called the Model Context Protocol, designed to be a data interchange between large language models and services and tools. Essentially, what this allows you to do is define functionality. To make an analogy, you can essentially define a Python function, where the function accepts input, does something, and then provides output, and we're giving the models the ability to discover these functions and then invoke them on our behalf. A minimal sketch of what such a tool can look like is below.
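To make that concrete, here is a minimal sketch of a single MCP tool, assuming the official MCP Python SDK's FastMCP helper. The server name, tool name, and canned return value are hypothetical placeholders, not our actual tooling; the point is just that an ordinary Python function, with type hints and a docstring, becomes something the model can discover and call on its own.

```python
# Minimal MCP tool sketch, assuming the official MCP Python SDK's FastMCP
# helper. Names and the canned return value are hypothetical placeholders.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ops-triage")

@mcp.tool()
def get_job_status(region: str, job_name: str) -> str:
    """Return the latest run status for a named job in one region (read-only)."""
    # A real server would call your scheduler or cloud API here; a canned
    # string keeps this sketch self-contained and runnable.
    return f"{job_name} in {region}: SUCCESS, last run finished 12 minutes ago"

if __name__ == "__main__":
    # stdio transport is the default, which is what desktop MCP clients expect.
    mcp.run()
```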
MCP was kind of a huge hit as soon as it came out. These are some screenshots from the Model Context Protocol GitHub page from a few weeks ago, showing just a few of the MCP servers that have been released. As you can see, it's everything from MongoDB to Microsoft Teams to all kinds of AWS and GCP services. Like I said, we're using Jira and PagerDuty, and they all have great support. These are essentially tools and functionality that enable these LLMs to reach into these services and interact with them much in the same way that we would if we were interacting via an API. This hasn't been without its problems, though. The Model Context Protocol space, specifically at the intersection of AI and cybersecurity, has been very active. There have been a lot of findings, and I would say that as a community we're still learning how to safely use these tools. We have had multiple issues with supply chain attacks, where MCP servers get compromised in some way and are suddenly exposing information to outside sources, or data is being leaked in a way that we didn't intend.
So if these are solutions that you're going to investigate, I would just encourage you to tread lightly, and certainly, when in doubt, use them in a read-only context. I'm not recommending that you allow these tools to run rampant in your codebase without supervision, but they are really effective. Here's an example of one of the integrations we've set up with PagerDuty and Databricks. This was the first version of the incident response agent that I wrote. As you can see here, in this specific instance I have Claude pulled up, and I just say, "Hey, can you investigate this PagerDuty incident?" and I give it the ID number. What we see is that Claude goes in and logs into PagerDuty on my behalf. It pulls a list of incidents that are assigned to me, and it's then able to figure out that the specific incident I've referenced is an issue in us-west-2. I've given it the ability to log into us-west-2 and take very specific actions related to troubleshooting: the ability to see what's running in that environment, the status of the jobs, the status of the workloads that are there. It goes in and takes all of these actions autonomously. The only information I have given it is the PagerDuty incident number. A rough sketch of what read-only tools like these can look like is below.
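As an illustration only, here is what a pair of read-only tools behind that flow could look like. The request shapes follow the public PagerDuty REST API and the Databricks Jobs 2.1 API, but the tool names, environment variable names, and overall wiring are assumptions for this sketch, not our production implementation.

```python
# Hedged sketch of two read-only MCP tools: one fetches a PagerDuty incident,
# the other lists recent runs for a Databricks job in the implicated workspace.
# Tool names and env var names are illustrative assumptions.
import os
import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("incident-triage")

@mcp.tool()
def get_incident(incident_id: str) -> dict:
    """Fetch a PagerDuty incident so the model can read its title, urgency, and details."""
    resp = requests.get(
        f"https://api.pagerduty.com/incidents/{incident_id}",
        headers={
            "Authorization": f"Token token={os.environ['PAGERDUTY_API_KEY']}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["incident"]

@mcp.tool()
def list_job_runs(workspace_url: str, job_id: int, limit: int = 25) -> list:
    """List recent runs for a Databricks job (read-only) so the model can spot failures or slowdowns."""
    resp = requests.get(
        f"{workspace_url}/api/2.1/jobs/runs/list",
        headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
        params={"job_id": job_id, "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("runs", [])

if __name__ == "__main__":
    mcp.run()
```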
The agent then gives me a really solid summary of what happened. In this case, we have a Spark job that is running a lot longer than it's supposed to. It's supposed to execute within a 15-minute window: we bring data in and analyze it for security purposes, and when you have a sudden surge in data, sometimes jobs take longer than that and it results in a problem. So it's gone in and done an analysis of all of the historical executions of this job and given me, frankly, some really useful information in a very short time, information that would have taken me a lot of time to collect myself. A rough sketch of that kind of analysis is below.
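Just to illustrate the sort of historical analysis I mean, here is a small sketch that compares recent run durations against the job's time window, assuming run records shaped like the Databricks Jobs 2.1 runs/list response (epoch-millisecond start and end times, newest first). The threshold and summary wording are mine, not our production logic.

```python
# Rough sketch of a historical run-duration analysis over Databricks job runs.
# Assumes runs/list-style records with start_time/end_time in epoch milliseconds.
from statistics import median

def summarize_run_durations(runs: list[dict], sla_minutes: float = 15.0) -> str:
    """Compare recent run durations against the job's expected window and its own history."""
    durations = [
        (r["end_time"] - r["start_time"]) / 60_000  # milliseconds -> minutes
        for r in runs
        if r.get("end_time") and r.get("start_time")
    ]
    if not durations:
        return "No completed runs to analyze."
    typical = median(durations)
    latest = durations[0]  # assumes the API returned runs newest-first
    over_sla = sum(1 for d in durations if d > sla_minutes)
    return (
        f"Latest run took {latest:.1f} min (typical {typical:.1f} min); "
        f"{over_sla}/{len(durations)} recent runs exceeded the {sla_minutes:.0f}-minute window."
    )
```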
The agent also gives me some next steps and some recommendations. Based on the incident, the error message, and the context it's collected, it says things like, "Hey, maybe you should troubleshoot next by looking at this specific service," or "Consider rescheduling the job so it runs on a different schedule," and things of that nature. Now, why does this matter for productivity? Like I mentioned, if I'm getting paged 50 times, I don't have 10 minutes to go through and respond to every single page and figure out what's going on. So, at a minimum, a solution like this allows you to speed up your response time. If you integrate this with your SIEM, your SOAR, any of the internal security tools that you're using, you have the ability to respond more quickly. The other thing that I like about this is that it has allowed us to completely change the way that we manage our PagerDuty schedules. A lot of the SLAs that we are dealing with internally are typically around an hour for something like this, and so we've been able to reschedule the way that alerts get delivered so that I can have 30 minutes where I'm not being paged and not being bothered,
knowing that if we do get slammed with pages, I'm going to be so effective in the next 30 minutes that I've devoted to triaging alerts and pages that it's not going to be a problem. This essentially allows us to focus more on our day-to-day work, our project work, the things we're responsible for over quarters or months, as opposed to constantly fighting that battle. So, just a few more ideas that I want to seed with you around how we're using MCP in our environment and how it may be helpful for you. There are a few other ways that we are leveraging MCP for detection and response. First, we've found that large language models, if they have the ability to access your detection codebase, can do a great job of coverage analysis. They're also really good at taking threat intelligence reporting, reprocessing it, and then figuring out where you potentially have gaps in coverage. And they're great at answering really lazy questions. If your manager pings you and says, "Hey, do we have coverage for that vulnerability in our Azure environment?", you can literally copy and paste that message into an AI-assisted tool and get a very quick answer. They're also great at answering vague questions like "Is us-east-1 broken today?"
And they can get you an answer very quickly. Finally, we've also been using large language models to tune and improve our detection rules. We can take false positive feedback from our incident response team and our SOC, and we've been able to use MCP and some vibe coding tools to essentially automate the process of tuning out false positives. We have also found that LLMs are really effective at creating additional true positive and false positive test cases, so if we've envisioned certain scenarios where a false positive may occur, they're able to go in and provide additional context. That's been very helpful. So anyway, I hope this gives you some ideas for how you might be able to use these tools in your environment, and I really appreciate your attention. Thank you. [applause]
The question was, it seems like there's a use case for LLMs to be able to answer questions from auditors. I would completely agree. In fact, I would argue that you could take most of the content you're receiving from auditors, as well as internal documentation, and have the LLMs largely automate that process, and potentially anticipate things that they haven't asked you about yet but may ask you about in the future.
>> [laughter]
>> Yes.
>> Other than "seems like a legit company," how do you vet which MCPs are safe and which aren't?
>> Yeah, great question. So the question was, how do you vet which MCP servers are safe to use? I think this is largely a software supply chain problem. As an individual practitioner, I would encourage you to look at where the MCP server is coming from. Is this coming from someone's personal GitHub page where they have implemented their own PagerDuty MCP server, or is it coming from PagerDuty themselves? Also, really common software hygiene things: how many bug reports are open on the GitHub repository? Do they have CI/CD tests that validate the code when it's published?
Ultimately, at the end of the day, though, I think the best approach is, one, ensure your legal team has signed off on you taking the data that you're working with and using it in this context, and two, only use these things in a read-only capacity. So we aren't allowing these tools to take any actions on our behalf yet; we're simply using them to speed up our response process. [snorts]
>> I had a question kind of along those lines. You mentioned having it log into PagerDuty and some of the other services. Are you creating a new read-only account for that sort of service, or is this usually your own actual account?
>> Yeah, that's a great question. So, Databricks is a little unique because we're actually a model serving provider; we allow companies to host large language models on our infrastructure. So when we're working with internal security data, we're actually using models that we host. In this case, though, with the example that I gave, that's actually running in Claude under the context of my personal account, so it's essentially piggybacking on the credentials that I have already set up in order to act that way. So I think the answer is both, but it's going to depend on your environment.
>> Hello. Thanks for a very great presentation. I was just wondering, from an audit point of view, when you have an agent or an MCP server logging in on your behalf, does it identify that request as being not from you personally?
>> Yeah, that's a great question. I think it's going to depend on the logging that you have with your SaaS provider. There are a lot of challenges with SaaS provider logs. For instance, when I'm logging into PagerDuty, you'll see my Chrome user agent; you'll see my user having gone through the process of tapping my YubiKey several times and going through the OAuth flow. When these services are interacting, it's coming from source code, right? So it's either a Python user agent or a user agent associated with the MCP server. So I think it'll largely depend on your environment. But again, if you're going to go to production with this, I would encourage you to use service accounts, in which case it becomes very auditable.
>> All right, I think that's all the time we have. Thank you so much. [applause]
>> Thank you guys. Give it up for Will.
>> [music]