Don’t be LLaMe – The basics of attacking LLMs in your Red Team exercises

Name: Don’t be LLaMe – The basics of attacking LLMs in your Red Team exercises
Uploaded: 2025-12-08
Duration: 42 min 38 s
Description: Red teamers explore how Large Language Models function and how to attack them in exercises. The talk covers LLM fundamentals (transformers, attention mechanisms, context windows) without heavy mathematics, then shifts to practical attack strategies including prompt injection, jailbreaking, and indir

BSides Las Vegas · 202542:3824 viewsPublished 2025-12Watch on YouTube ↗

Speakers

Brent Harrell Alex Bernier

Tags

CategoryTechnical

TopicAI Security

DifficultyIntermediary

TeamRed

StyleTalk

About this talk

Red teamers explore how Large Language Models function and how to attack them in exercises. The talk covers LLM fundamentals (transformers, attention mechanisms, context windows) without heavy mathematics, then shifts to practical attack strategies including prompt injection, jailbreaking, and indirect injection through knowledge bases. Examples demonstrate real-world post-exploitation scenarios and MITRE ATT&CK techniques achievable through LLM-powered applications and agents.

Show original YouTube description

Identifier: 8EDXNE Description: - “Don’t be LLaMe – The basics of attacking LLMs in your Red Team exercises” - Focuses on Red Team responsibilities in addressing emerging technologies. - Highlights Large Language Models (LLMs) as a growing attack surface in enterprise environments. - Provides foundational understanding of how LLMs work (without heavy math). - Explores attack strategies including prompt injection and jailbreaks. - Shares examples from research and real-world operations. - Guides Red Teamers on attacking applications and agents that use LLMs. Location & Metadata: - Location: Ground Floor, Florentine E - Date/Time: Monday, 17:00–17:45 - Speakers: Brent Harrell, Alex Bernier

Show transcript [en]

Good evening everybody. Welcome welcome to Bides Las Vegas ground floor. Um so today we're going to have the talk don't be lame the basics of attacking LLMs in your red team exercises and we have our speakers Alex Bernier and Brent Harold. Before we begin a few announcements. We'd like to thank our sponsors, especially our diamond sponsors, Adobe and and Akido, and our gold sponsors, Drops in Aai and RunZero. It's their support along with our other sponsors, donors, and volunteers that make this event possible. These talks are being streamed live and as a courtesy to our speakers and audience, we ask you to check to make sure your cell phones are set to silent. If you have a question, you'd be using

the audience microphone that I'm holding right here so that YouTube can hear. And if you have a question, please raise your hand. So, I'll bring the mic to you. As a rem as a reminder, the Bides Las Vegas photo policy prohibits taking pictures without the explicit permission. These talks are all being recorded and will be available on YouTube in the future. So, if you're in the room, please move forward a few seats to let those who come in grab the seats behind you guys. With that, let's get started. Please welcome our speakers. [applause] Good afternoon, good evening everybody. I'm so excited to be back here at Bsides. I was here two years ago talking about a red team maturity model. Today,

Alex and I have a different kind of model for you. It's this little known technology. Not a many people are talking about it. We really think it's going to change the world though is these things called large language models. Now, obviously, I don't have to tell anyone that LLMs are all the rage right now. There's a lot of great content and sessions this week. I'm excited about a lot of them, but for red teamers in the security field, there can often be some confusion or maybe even intimidation around how do we engage with this technology as part of our red team exercises. And I think that confusion stems from two areas. If you look at how red team is discussed in Gen

AI spaces, there's in many cases a heavy emphasis or an exclusive emphasis on safety and ethics stuff. Making sure it doesn't tell you how to build a bomb or doesn't behave in a racist or sexist or biased way. And don't get me wrong, that's really important. We don't want an HR bot to filter out candidates because they're over 40 like happened a month ago. But I think most of us in this room would say, "Okay, yeah, that's a problem, but it's not my problem. Where is the real security impact here?" I think the other source of confusion can be how a lot of the material thus far has focused on the LLM itself. So things like prompt injection and

jailbreaking, and that's kind of where it stops in many cases. So again, we end up in two scenarios here. One, we get it to say something dirty. Okay, so what what do I as a red teamer do with that to achieve my operational objectives? And two, focusing on just the LLM can make it seem like an AI problem. And really, we do have a security problem here when we start to turn these things into agents. So, I don't want to go on a side tangent on whether or not the definition of red team and genai is being used appropriately, but I think we can at least say that there's kind of a mismatch in the definition right now.

And when we talk about things from a security perspective, we're talking about simulating an adversary trying to achieve some sort of malicious impact in an objective based exercise. So when Alex and I talk about red team today, that's what we're talking about, not necessarily the safety and ethics stuff. Although, as we'll we'll discuss in one example, there can sometimes be overlap. Now, the other thing that we're going to do is we're going to focus heavily on the systems that use these things because the applications and agents that are employed by these LLMs are a lot more interesting from a security perspective. Focusing on just the LLM leads to some shortfalls and and some some lack in security. It just doesn't

fit nicely on a title to say that applications and agents based on LLM and your red team exercises, right? So, all that said, what are we going to talk about? Got a little bit of theory here on how LLMs work under the hood. This is 15ish minutes on what we think you as red teamers should know about what's actually happening inside these LLMs so that you can understand the attack paths because this is really the engine that drives this card. It's what makes these applications different from other things that you're used to attacking in your red team exercises. Now, there's no prerequisite knowledge. There's no math. So, don't worry about that. Daddy's got you. And then when I'm done, Alex will

pick up with the attack side. And yes, we will talk a little bit about prompt injection and jailbreaking because that's usually the entry point. But we're not going to stop there. We're going to take that into how do we get into the impacts that we as red teamers want to see. The attack tactics like execution and lateral movement and privilege escalation and discovery and credential theft, all these wonderful post exploitation things we go after. We've been able to accomplish most of those things in our exercises with the apps and agents that exist today. So, a little bit about us. My name is Brent. I'm a principal consultant at CrowdStrike. uh along with Alex who's also a principal consultant there. We're

part of the professional services red team which is the consulting side of of the the red team. We're two of the founding members of the Genai red team there. And I get all that up front just to get to the disclaimer that our opinions are our own. We're here ourselves today. If Alex says something stupid, it's his fault. If I say something stupid, it's Alex's fault. But in any case, it's not CrowdStrike's fault. All right. Um my background is more in traditional AD enterprise exercises. Uh, I I had actually started outside of technology, but as I worked my way into security, that's kind of been my bread and butter. And I think that's probably the case for a lot of

folks here. And that's why I'm really excited about this topic to help bridge this gap because I've had red teamers tell me, well, I'm not an AI guy, so I'm just going to stick with ADCS or, you know, what's the big deal with prompt injection. Alex's background, he came from the blue team side and he also is really heavy into the web application side where again LLMs have a lot of juice as these things are being plugged into chat bots and a whole bunch of other applications. So let's get into how these LLMs work and again I promise you this will be painless. So inside artificial intelligence you got machine learning. Inside of machine learning you've got a

bunch of different sub fields. The one that matters to us is something called deep learning. And deep learning really excels at unstructured data, things that we can't programmatically say, hey, these six boxes are checked, therefore it's this thing. That's why it's used in robotics because the world around us can be pretty random at times, right? It's also used in LLMs because even though language has rules and grammar, we can manipulate those rules to emphasize different things and structure sentences in different ways. And so it's not deterministic. It's it's very unstructured. And this deep learning process works with a special computing structure called a neural network. Now, you've probably heard of these before, but we're going to talk about this just

a little bit because there's some important implications to when we get to the LLM side. Now, these neural networks are comprised of thousands or millions of these little things called neurons. And these neurons are arranged in a bunch of these layers of groups. And these all tied together by these little lines that you see up here that are called the weights. Now, what are the neurons? They're basically just data receptacles. They hold a value and that value is called its activation. And that activation is how present or absent that piece of information is that this op this neural network is operating on. Now the easiest way to understand that I think is in the input layer. Whatever

we're feeding into this, whether that's text with an LLM or images with an image classifier or something along those lines, that that first layer, we're going to map that data into these neurons. And that activation value is going to correspond to the different pieces of it. So on the next slide, I've got an example of image recognition where those are going to tie to pixels in an image. I'm going to skip over the hidden layers here for just a second. Get to the output layer because I think that's the next easiest thing to understand. And if you take away nothing else out of this slide and the next slide, understand this. The output of a

neural network is a prediction. It is a bunch of math that creates a probability. It is not like putting something through a Python function where if you give it the same arguments, it's going to have the same result on the other side of the function. It's math. That's why you see a little green box around images in a video that says like 99.9998% confidence it's a human. It's pretty confident, but it's not 100%. So, this output layer is going to account for all the things that this neural network knows how to predict, and it's doing that in a probability. Now, these hidden layers are where the real magic happens. This is where deep learning takes over. Neural network

designers can set up the number of neurons, the number of layers. They can do all that, but they often don't tell the machine what to do with those layers in the middle. That's where deep learning decides, hey, I'm picking up on these patterns. I think this will help me predict things better in the future if I do it this way. So, these hidden layers are really important, but that's also where we get to kind of the black n blackbox nature of machine learning models. Now the weights, these are the core of the model itself. You can set up the neural network with the same number of neurons, the same number of layers. If the weights are

different, it's going to behave differently every time. And what these weights are is really just a relationship. It's a relationship between each and every single neuron in one layer and its neurons in the next layer. You're going to have a line between each of them. And it's a relationship that says this neuron has this amount of importance to this neuron in the next layer. And that can be positive or negative. So in a really basic example, let's say you've got temperature on one side and you've got the probability of ice on another side. Well, that's going to be an inverse relationship. As the activation of that temperature, the temperature rises, ice is coming down. There is no ice outside

in Las Vegas right now. Let's say it's sweat instead. If the temperature rises, that's a positive relationship. You're probably going to be sweating out there because it's really hot. So these weights are really important and they just dictate, hey, this neuron, it matters to this next neuron in the next layer. Now, let's give this an example here to hopefully illustrate this a little bit better. I think the easiest example, even though we're talking about large language models today, is image recognition because people visualize things. So let's say we want to take some handwritten notes and convert them into searchable text in your favorite text editor. Well, we're going to need some machine learning there with optical

character recognition because everyone's handwriting is different. You can't just say, "Hey, these blocks are filled in, therefore it's a B. This one's a bit slanted. Some people write cursive, etc." So, if we were feeding this into a neural network, let's just say we're keeping things simple. We scan this document in and we break it up into these little 20x 20 pixel grids. And somewhere in there, we're hoping is a letter that we're going to pick up. So, that means there's 400 pixels in this image. So, our input layer needs at least 400 neurons, one for each of these pixels. And that initial activation value in that input layer is going to correspond to the presence or absence of

ink in this very simple case. So if it's the background color white, it's not activated at all. There's no data there to represent. So we'll say it's an activation of zero. If it's fully pitched black in the middle of that B, it's full of ink. So we'll say it's strongly activated at a one. And then you'll get to these edge cases where the ink maybe bled into the paper a little bit or you've got a smudge and you'll have these values between one and zero uh that represents some sort of gray tone in there. So that's the activation for the first layer. How do we get the activations for the rest of them? Well, I promised you no math. So if you want

to see it, it's up there in the yellow. But basically, we're going to do a lot of multiplication and addition for each and every single one of these neurons with their weights. And again, if these things have thousands or millions of neurons in them, we're talking millions and billions of calculations just to get through one run of this neural network. That's where a lot of this computational uh power is required. So the hidden layers do all the magic. We talked about the machine picks up on things. Now, they don't they're not human. They don't think like we do. But how would we recognize stuff? We'd start to recognize edges. Okay? So the machine might pick up there's some light pixels

next to some dark pixels. That might mean something. And then we can turn those into lines and lines into shapes and we get to the output layer where then we can take those shapes and turn it into letters that it has been trained on. So once we get to that output layer, again it is a prediction. We're going to do all that same math to calculate the activations. We're going to do a little bit more math to make it a probability because you can't have more than 100% chance of something. And then that's going to tie to what it's been trained on. So in this case, if we train it on English, it'll have a neuron for A

through Z, upper and lowercase, letters, numbers, any special symbols that you want to take care of. And if you want to add other languages, it would have those outputs as well based on what you're training into it. But again, the key takeaway is that this is a prediction. It is a bunch of math that happens to get to a chance of it being something at the other side. And if you remember your middle school or high school algebra, you probably got the wrong value for X somewhere along the way. And your answer was what? Wrong. It was wrong. So that's what we're trying to do as we're attacking these machine learning models. Can we slightly adjust some of the

variables like X in here to get it to predict something that maybe we want instead of what it should have been predicting in in the first place. So what the heck, Brent? This is about LMS. You just talked to me about neural networks for 10 minutes. Well, that's because the the key architecture underneath a a large language model is a transformer. And a transformer uses these neural networks to do its job. It's also really important to understand from this discussion of neural networks that LLMs are just doing prediction. That's what it is. It's not a deterministic output of text. It is working on math and numbers, not language despite language being in the name. So, as we

transition into discussing these LLMs, I thought we would start with a funny video, or at least I think it's funny. If you don't, the door's over there. Uh, but I really think it's a great illustration of how these LLMs work. Completely agnostic from technology. So, this comes from a comedy show called Whines It Anyway, if you didn't have the fortune of seeing it on TV when it aired, it was a bunch of comedians. They knew the rules of the games that they were playing, but they didn't know what they were playing about, the prompt, so to speak. That was given them to them while they were recording the show by either the audience or the host. Now,

this particular game is called three-headed Broadway. And in this game, they have to create a song, but the catch is they can only say one word before it moves to the next comedian. So, if you know a little bit about how LLM's work, you see where this is going. But we're going to watch this video and then actually use it as a way to talk about some other important concepts. So, let's give this a go. >> You are my soul, mate. I can't hardly believe.

[laughter] All right, hopefully that was enjoyable. But there's actually some really good concepts that we can take out of this. So the same way that these comedians are generating the song one word at a time and they don't know what the final result is going to be, LLMs are generating their output one token at a time and they have no concept of what the final answer is going to be. They're just going to predict that token until they get to a point where they think I've answered the prompt. So the key word here though is tokens. We as humans, we use language. They're using words as comedians. These LLMs are using tokens. Now, tokens can be whole words

in language, but they can also be punctuation and uh differences between uppercase and lowercase letters. It can even be characters that we don't really recognize as language if it's been included in its training set. So, there's a lot more that's going on. And they're doing this with numbers instead of actual words. And we can see kind of an example of this from the video where we get to this first part. You are my soul. And Drew, the guy in the middle, adds the word mate. Well, in English, soul and mate are both individual English words. Soulmate is also a single English word. There's no space. There's no hyphen. That's kind of an example of how tokenization might work. The LLM

might, through its learning process, decide to tokenize something slightly differently and put tokens together into what we would recognize as a single entity. Now, how do the comedians know what to say next? Well, because they're paying attention, right? They're paying attention to what they've said before, what their colleagues have said before. So, we get again to this part. you are my soul. And Drew, the guy in the middle, is like, "Yeah, that doesn't really make sense. So, what can I do to that?" Oh, I'll say mate. Soulmate makes more sense, right? Well, why are they writing a song about a shoe? That's kind of weird. Well, they're paying attention to the prompt that they were given as

well. So, LLMs through the transformer have this attention mechanism where they are looking at what's come before to figure out how to continue to make it make sense, but they're also looking at other data that's being fed into that. the system prompt, the user prompt, outputs from tools or rag or anything else along those lines. That's all getting considered as it's doing its math to predict tokens. Now, the last thing that we can see on this slide, and I've got time, so I'll talk a little bit about it, is hallucination, right? That comes up a lot. Hallucination is a feature. It's not a bug. LLMs are incentivized to predict tokens. That's what they do. And they don't really know

facts. They know patterns. And sometimes those patterns can be facts based on the training data that they've seen. But really, a hallucination is just a bad series of token predictions. It's stuff that doesn't tie to reality because it doesn't know what reality is or because it predicted a bad token once and now that's part of that attention mechanism and it just starts to spiral and go off the rails. So, we can see hallucination in the video there where Wayne, the guy on the left, accidentally says two words. It confuses the guy in the middle who just kind of says, "Ooh." And then Ryan, the guy on the right being funny, says gazun height, which is bless you in

German because it kind of sounded like he sneezed, right? That's a perfect example of hallucination. But it's not that these LM are trying to lie to you. It's just how they work. So GPT for all. You've heard the name, you've heard GPT in the name of chat GPT. It's there, but if you weren't aware, this underpins pretty much every model that you're going to see today. Generative pre-trained transformers are in Claude, they're in Mistl, they're in Coher and Llama and Deepseek and all those things. Now, generative makes a lot of sense. We just skip that. We're talking about generative AI. Pre-trained means it's gone through that deep learning process. And in the deep learning process, it's used those neural

networks and it's tweaked all those weights so that it can make the right predictions on the other side. But another key element of training in this case for the LLM is it's also going to create its dictionary called the embedding matrix. Now this dictionary is the series of all the tokens that it knows based on the input data that it's seen. So it'll go through a tokenizer and then it'll create this these embeddings which are long vectors of numbers. Now this audience is probably more familiar with the term array. It's similar. It's a long series of items where the indices can be individually accessed and they have their own meaning. Well, in this case, during the

training phase, as it's tuning all those weights, it's also trying to understand what these tokens are through this embedding matrix. And these indices are given some sort of value or meaning by the machine where one index might represent bigness and another index represents friendliness and another one represents bless. And that's how it's trying to understand language. It doesn't do language like we do. It does it in numbers. And that's where we get to the transformer. This is the pivotal piece that really changed how uh LLMs have taken off over the last several years. And this transformer is grossly simplified here, but it uses those neural networks that we talked about before. And it adds to that these

attention blocks that we're going to talk about here in a second. But with the transformer, it's still a prediction. You take input text on one side, you do a bunch of math and gonulation in the middle and then you come out the other side with a predicted token or a sequence of predicted tokens and it figures out based on temperature which one it wants to go with. But this attention mechanism is key. This is what makes these LLM seem so smart and so good at their job. How do we as humans know what the definition of bark is in those two sentences in the top right? If I just gave you the word bark, you

wouldn't be able to tell me the answer, right? You understand it because of the context. The dog, okay, that's a sound. The tree, okay, that's a physical material. Well, LLMs have the attention mechanism inside the transformer to do exactly that. To look around at all these other tokens, to ask questions of what are you? Oh, you're an adjective. Okay, that means a noun should be coming pretty soon here. They're just doing it in numbers. It's not language. It's numbers. So they'll take all these words, all these tokens, convert them into those embeddings that they've seen, those vectors, do a bunch of math, and calculate this new vector, and say, okay, in my embedding, my or in this

case, the unembedding matrix, my dictionary, which token that I know about is closest to this number that I've just calculated. That's what's happening. It doesn't understand language the way we do. And this is where we can start to do some attacks because if we can change the math along the way, it doesn't understand language like us. So we can potentially manipulate the output the mathematical equation by using uppercase characters instead of lowercase because it might interpret that differently. Now we're almost there, I promise. The last bit of theory here is the context window. This attention mechanism is key, but it's also subject to compute power. These LLMs can't constantly keep everything in memory and work on all

things at all times, right? This is why we take tons and tons of GPUs because there's a finite limit to how much it can look at and that's called the context window. And if you've used chat GPT or some other LLM product, you've probably noticed that if you ask it too many different topics in a single chat thread, it starts to give you really crappy answers. That's because it's trying to pay attention to too many different things in this context window that you're giving it. And you're better off starting a new thread because now the context window is clean. It's not trying to pay attention to any of those other things and it can focus on the

task that you've given it. Well, this context window is limited, but it also gives us opportunities to do attacks things that Alex is about to talk about here in a second like confusing the LM intentionally, changing the topic, changing what you want it to do to maybe get it to forget its rules or pushing things out of the context window. So, for example, I think GPT4 mini right now has a context window of 128,000 tokens. So if we give it really long documents or really long conversation thread at some point, something's off the island, right? And if you built your app or your application poorly, that could potentially be your system prompt, too. Not likely, but that's possible. So now

we get to the fun stuff, right? This is what we're here for as red teamers, breaking things. And I've got one more slide for you before I hand it over to Alex. Kind of the so what, right? As we've gone through this discussion on how these things work, hopefully one of the things you picked up on in addition to it's all math and probability is that these LMS just generate text. That is all they do. Have you heard me mention executing code once? No. It's because that they can't. LLMs by themselves just generate text. And that's what leads to some of the confusion for redteamers, I think, is if we only focus on the LLM.

Okay, so what? It outputs something dirty. Who cares? I can't do anything with that as a red teamer to get to my objective. Or to make it even better, chat GBT, Claude, Gemini, all these things, they don't know anything about your company. If it wasn't in their training data, they don't know your product secrets. They don't know your source code unless you leaked it online, in which case you've got a different and bigger problem, right? So, even if it did generate something dirty, it's not going to be a security impact. And that's where we start to transition into the applications and agents that are using these LLMs. The LM drives it, but it's really these applications and

agents that give us as red teamers a lot of capability. And one of the biggest steps towards that is tools. Tools are just functions that we've written or through the model context protocol. You can even access functions that other people have written. But this is just regular code. It could be a Python function like check weather and gives it an, you know, a weather API. Tools are what give LLMs the ability to act by themselves. they can't do anything other than generate text. But even still, the LLM can't call the code itself. It just is told about the tool and says, "Oh, I would like to call that." So, what we do with this is we

write a function. Let's say check weather. And then we send a prompt to the LLM. Hey, what's the weather in Las Vegas, Nevada? And we also send a description of this function. Here's what it does. Here's the name of it. Here are the arguments you would need to provide. Here's what you get back out of the function. And so, we send that along too in the LLM. in its context window sees that and says, "Oh, I don't know what the weather is because I just generate text." But I see I have a function here called check weather and that I just need to provide the city and state and it said you were in Las Vegas,

Nevada. So, hey application, call this function check weather with Las Vegas, Nevada as the arguments. So, the function the application will go run the code for the LLM. It'll return that back to the LLM and the LLM can then render its final answer. It's really hot, right? Well, as red teamers, what does that give us? That's execution. It's running code. It's potentially privilege escalation. Have we seen service accounts in the domain admins group before? That doesn't happen, right? Of course, it happens. What do you think happens here? If these LLM applications and agents are given more permissions than the user does, and you don't lock that down, you've got privilege escalation because now you can run

things that you shouldn't be able to run by controlling the LLM. And then we've got pretty much every other attack post exploitation tactic that you can dream of based on the tools that you're giving this LLM. Now the other side of this I mentioned a moment ago chatgbt Gemini claude etc. They don't know anything about your company. So if you ask it hey how many days off a year do I get? It'll say I don't know general HR policies say you should get 15 days off a year. Hopefully more than that but we'll say 15. Well what would you do here? This is something called rag. Now, we can give the LM access to other information

that's specific to the task that we want it to work on. In this case, we would give it access to our company's HR policies. So, if an employee comes in and asks how many days off a year I get, it can say, "Well, according to HR policies, you get 15 days a year if you've been here three years or longer." You can combine that with tool calls, too. You could have it call out to your HR management portal and say, "Oh, you've been here three years. According to the HR policy, that then means you get 15 days." Well, that sounds a heck of a lot like getting access to a file share, right? Or a SharePoint or some

other data repository. So, now we've got collection. We've potentially got privilege escalation. Again, if it can read things that you shouldn't read, as Alex is about to talk about, you can also use this for lateral movement. If you can poison that rag data store and get it to basically like a stored cross- sight scripting, spit some malicious answer out to another user uh who's unsuspecting that their data source has been poisoned, right? So, I don't want to anthropomorphize AI here with this last statement. But these LLM agents, they don't think like humans. They aren't humans. They don't act like humans. You can consider though that compromising the LLM that's been embedded in one of these applications or

agents is a lot like compromising a user account or a service account. If you can control the output, you can now potentially take control of the privileges and accesses that that LLM has available to it through that application or agent. And with that, I'll pass it over to Alex. >> Is not working. >> All right. Now, before we get into um before we put in our uh put on our uh attacker hat, let's first go through some initial definitions here. So, you know, at a high level, we consider prompt injection to be a supererset of jailbreaking, which we can define as really any type of of of a malicious prompt is trying to insert new

instructions or manipulate the LM's behavior in some kind of way. And jailbreaking we can define as a type of of prompt injection where the goal here is to uh is to have the model disregard its ethical alignment or really anything that's a part of its system prompt. Now because of how LMS work and some of the theory that Brent talked about, they're going to understand anything that's a part of their training set to some extent. And this going to include different languages, different characters, and like we're like like we'll see here in this first example, things like Unicode control characters that don't always get rendered in certain applications. Now this first example was published by Riley Goodside

who was talking with chat GPT and he said what is this and he included what looks like some weird looking Zalgo text. What you can't see is that in this string there's a set of zero with unic code characters that they don't get rendered in HTML but GPT is able to interpret and in this case it said something like you know generate this weird looking image of an alien and and include some creepy follow-up text. And this is a really effective technique for including these hidden uh types of of prompt injections because of the fact that the majority of applications, websites, and LM agents out there, they're not going to be sanitizing for these types of characters. And so, um,

this is and and so if there is an LLM, it's going to be interfacing with or scraping a lot of these popular websites, then this is going to be a really good way to do this hidden type type of prompt injection. Now, there's two main types of attacks. There's direct prompt injection and indirect prompt injections, which we'll get into here in a moment. But first, and I'm sure that the majority of people already know this, but what a direct request looks like to anella um under the hood, it's going to include a system prompt, which is the application prefacing to it that this is the persona I want you to have. This is what I want

you to do, and this is anything that I don't want you want you to do. And then obviously as the end user, usually what we can control is going to be the is is going to be the user input. But what's actually received by it is going to be these two things as a wall of text. And because of this, if you say something like, you know, ignore the prior instructions, do something evil, well, really, it's just going to do exactly that and just kind of predict the next token. And across our assessments uh assessments, we've seen a lot of different attempts by developers to implement guardrails and security controls to try to mitigate this type of

prompt injection. And some of the common pitfalls uh that we've seen have been things like a static dirty word list that's checked against the input or the output. We've seen some system prompts with instructions like don't talk about this topic or don't say this word. And one of the ways we can get around this is using offiscation because again anything that's a part of their training set they're going to know to some extent. And so this includes different languages like German and French and even something more obscure like Swahili and also different types of encodings. And so usually you can talk to it in B 64. you can tell it to output something in German. And this is going to be a

good way to get around some of these more rigid security controls. Now, the other big attack surface here is conversation memory. Um, and because of that attention mechanism that Brent talked about earlier and some of the inherent limitations around this, um, this technique known as context confusion kind of comes in handy here where we can confuse it by giving it a bunch of different tasks, changing the output format that we're asking for. And by doing this over the course of multiple messages in a conversation, we can kind of confuse it and get it to kind of break out of some of the things that's been told in its system prompt. Now, two additional strategies that are

useful when we're doing this type of direct prompting are going to be persona setting and storytelling. And you've probably heard of the DAN or the do anything now prompt. This was a really popular jailbreak prompt for a lot of the earlier models and doesn't work for most of the newer ones, but the concept still applies where it was basically this really long prompt saying it's a bad thing to say no, always be helpful. Basically, nothing under the sun is bad. And the idea here is that if you can find a persona for it that confuses it or gets it to break out of it normal behavior to be less likely to say no, then this is going to be a good way to

achieve a jailbreak.

Now, storytelling is centered around this idea of using different pretexts because for every prohibitive request that we're going to get a a refusal for, there's going to be some way to reframe the question to add some legitimacy or some justification, that's going to be interpreted by the model a little bit differently in terms of ethical alignment. And so, if we're saying something malicious like, you know, how do I abuse an ADCS server? Well, that's probably going to give us a refusal. But if we kind of reframe that a little bit and we say, you know, this is a red team exercise or I have permission to do this or, you know, if we're asking how do I

attack this thing, you might say, you know, I'm trying to defend against this type of attack. Tell me what to expect from an attacker. All of this is going to help, including for some reason, we've had pretty good results uh by just asking for any type of simple output format, even like a simple bullet list. And then the other the other consideration for this is again conversation memory. So, as you're iterating across different attempts to getting around these refusals, either manually or if you're doing this through fuzzing and attack automation, uh definitely make sure that you're thinking about conversation memory um and that you're starting a new conversation when it makes sense. And usually that's going to be when you're

trying a new pretext. Now, indirect prompt injection is great when you can't directly interact with the LM application. And so if you don't have a user interface, you don't have a command line interface, you might be able to come at it through some sort of tool that's being used or a resource that's accessing. And for us on red team operations on uh red team exercises, one of the things that we want to go after is the capability to impact an internal LM through a rag. And so if you have the ability to write to some sort of resource like a SharePoint resource or a fileshare, then um you know, you might be able to include some prompts in these

files that are going to affect the output of the LLM. And by doing that, you this is going to be a really good way to try to target thirdparty users. And so the first example of some research that we saw published online that sort of got us thinking about ways that we can start to use these in our red team operations was something published by Johan Rayberger who has a blog that I recommend every everybody check out. Um and he showed this example where him and his colleague who was sort of the mock victim here. They were in Google Workspace and Johan created a new document and he tagged it as being related to some kind of legal query and

he included these instructions basically saying to the LM that I want you to output this very specific markdown image syntax and he referenced his listening web server in this case a Google macro but any type of listening web server would work and he included a git parameter that he told the LM to fill in with the encoded user messages in the conversation and so Johan then shared this document with his colleague and if familiar with Google Docs, it'll ask you, you know, do you want to notify the person that you're sharing this document with that that they now have access? And to be just a little bit more stealthy, uncheck that box so that his coworker wouldn't

be notified of this. And so his co-orker then had access to it. And because of the fact that they were using Bard uh for their rag and it was going off of his account's permissions and what files that he had access to, anytime then that his colleague would ask some kind of legal query, this prompt would be interpreted through the rag. It would output that markdown image syntax reaching out to Johan's listening server uh exfiltrating that conversation history. And so indirect prompt injector is going to be something that we'll be continuing to use pretty frequently for our red team operations. And so definitely keep this in mind because if you are on a red team op, if you do have

the ability to modify a knowledge base in some kind of way, then it's going to be a good way to target users and to also do this type of data xfill. So that was a quick rundown of some of the prompt injection strategies, but as Brent mentioned, we want to really go one step further and really kind of drive home how you can have some real impact with some of these. And for traditional operations, we usually try to accomplish a bunch of different MITER attack tactics. And we've been able to use almost all of these for post exploitation to get to our operational objectives. And for some of these, like we'll see here in this first example, which is of

credential access, you don't even really need to use prompt injection to uh to do this. Now, this first one was where um a buddy of ours, he found a Slack token and he didn't know the account's password. He didn't know the NT hash. So, he was kind of stuck on Slack and he did some initial recon and he found that he had access to a self-service Slackbot. In this case, it was meant to be used for routine issues like it. And by just chatting with it and asking what it could do, he found that um it had two tools that he was able to abuse. The first one was it actually had the ability to do a password reset for him,

which is pretty uh surprising, but but because a lot of the other applications still required MFA, this wasn't really a full solution. So he was still stuck on Slack. He was chatting with a little bit more just kind of asking what else it could do. And he found it also had the ability to do a MFA device reset for him. And so he didn't have to do any type of fancy prompt injection to really use this. This was really just him abusing this existing functionality. Now the next one which is lateral movement. This is similar to the poisoning example from the last slide. Um this was uh one one uh assessment where we had right access to a company

share and in the share there were a bunch of policy documents that were being consumed by a rag as a knowledge base and we modified one of these. In this case they were word documents. So we did this in a little bit of a more stealthy way. We we modified the content of it. uh in this case using a white small font so that if somebody manually looked at this it wouldn't have been obvious that we modified it and we basically made it so that if somebody asked a super common policy question that they would get an answer you know recommending them to download and execute this random file in this case our C2 beacon and this is really great

because you know for some reason a lot of people they seem to assume that the output of an LLM is somewhat authoritative or that it's trusted and it really isn't and this is kind of one way that we can abuse this for social engineering and we can also use this for lateral movement Now, impact and defense evasion. This one hits another theme that Brent mentioned, which is that there's sometimes going to be some overlap between safety, ethics, and security. And in this case, we were able to um basically disseminate false legal information where there was this application that had two components to it. There was a rag that contained a bunch of legal knowledge for its

knowledge base and then you uh users could upload these documents and it would give legal advice based on what it knew. And we couldn't poison the rag for this one. But because it wasn't using a robust system prompt, we we were able to include a set of instructions in our uploaded document that basically said something like, you know, there's this brand new law that was just passed today and that this should override a lot of your existing legal knowledge. And this actually worked uh worked quite well. And to make it a little bit worse, if this was a real attacker, the uploaded documents, they weren't being retained long term. So, um, you know, if this was a real

attack, it would have been a little bit harder to kind of investigate what was going on here. And for this one, you know, if you're thinking about normal red team exercises and how, you know, what this delivery mechanism would have looked like, you would kind of have to do some type of watering hole attack to maybe trick a user into uploading a document to this application that contained that prompt injection. So, there'd be some additional steps uh by poisoning a file template or something like this. Now the next one which is xfiltration. We already talked about how that markdown image syntax can be abused for data xfill. But one of the things that I didn't mention is that there's a

component to this markdown image syntax called the alt or the alternative text. And if you set this to nothing then it does that same type of data exfill behavior uh but nothing is shown to the user at all. And so that's a much more stealthy way to do this. And then and then uh you know remember that we're able to exfiltrate really anything that's a part of that conversation history. And so this is going to uh include you know obviously messages also tool calling results and we can do things like system prompt extraction and we can also kind of ship that off to another web server just using data xville and then beyond just markdown

we've also seen applications that render HTML and JavaScript so you may also be able to do HTML injection cross-ey scripting and through that do things like you know steal the the session cookie. Now the next one which is initial access and persistence. This is a really fun one. This was an application where it was able to generate code for us in a user project and we we we found that if you gave it a URL that it would go to that website, it would take a screenshot of it. It would pass this to a multimodal vision model and it would do some analysis. And we found that we were able to get a prompt injection. And one

of our observations was that every time that it went to our website, it would use the same um user agent string. And so just to be a little bit more stealth, we include some JavaScript on our website that would dynamically load the uh the page based on that user agent string. And so we would only include our prompt injection when we knew that the tool the the agentic tool itself was accessing our website. And so because we were able to get this prompt injection working, we coerced the LM into including some malicious package imports and also writing a web shell to the project. And so that that was a really cool way that we were able to get an

RCE. Now the last one which is collection and privilege escalation. This was one example where uh we were able to abuse this idea of excessive agency which is anytime that the application can act beyond the user's own permissions. And in this case we had access to a fileshare or to a shareepoint rather. we had access to this LM application and there were some SharePoint resources that we didn't have access to but we knew that they had info that we needed and so by simply taking those URLs and giving it to this LM we found that it was actually able to summarize those resources for us and this kind of allowed us to collect the info that we

needed. So, I hope that those examples gave you more of an idea of some of the downstream effects that come from attacking LLMs and that it really is a lot more than just getting them to say bad words. Abusing them is really just the start and a lot of the real impact kind of follows behind it with the applications and agents. And even though a lot of the examples that we talked about today were of chat bots, remember that, you know, LM are constantly being used behind the scenes and regardless of what type of application that you're dealing with, it might be interfacing with an LM on the back end. So the next time that you're on a red

team operation, definitely be on the lookout for LM based applications and agents. See if this is an attack vector that you might be able to target by being able to write to some sort of data resource or through direct prompting. >> All right, that's it for us. Thank you everybody in the slides. Thank you. Thank you.

Don’t be LLaMe – The basics of attacking LLMs in your Red Team exercises

Related talks