
Keynote, Day 1: "Secure AI" is 20 years old

BSides Las Vegas · 46:35 · Published 2024-09
About this talk
MiddleGround · Tue, Aug 6, 11:30 - 12:15 CDT · Keynotes

Machine Learning (ML) security is far older than most people think. The first documented "vulnerability" in an ML model dates back to 2004, and there are several well-oiled teams that have been managing AI risk for over a decade. Now a new wave of "AI red teamers" who don't know the history or the purpose is here. Some are doing brand-safety work by making it harder for LLMs to say bad things. Others are doing safety assessments, like bias testing. Neither of these is really "red teaming," as there isn't an adversary. The term is getting abused by many, including myself, as I organized the misnamed Generative Red Team at DEF CON 31. There are new aspects to the field of ML security, but it's not that different. We will go over the history and how you should learn about the field to be most effective.

People: Sven Cattell
Transcript [en]

Hi everyone, I'm Sven Cattell. I'll step a little closer. I've been running the AI Village at DEF CON for the last eight years, more or less. It was basically eight years ago that I texted a few people, a couple of whom are in this room, saying "hey, do you want to meet up for coffee," and pitched making the AI Village, and the following year, in 2018, we had our first one. It's been a bit of a ride learning how AI security works, because I was a mathematician at Johns Hopkins doing a postdoc in geometric machine learning, and then suddenly I was helping run an AI conference at DEF CON. Because of that I got to know a lot of people in the industry who have actually dealt with this stuff, and I've had some lovely, interesting conversations. As you can see from the tagline, there's the LLM craze, which I helped contribute to with some of the things we did in the Generative Red Team last year with the White House and a bunch of the vendors, and a lot of snake oil has come into AI, trying to sell you things that don't really help much.

Some of this has pissed me off, and I've ranted about it to people, and now unfortunately the rant has gotten away from me and here I am. So, to start: the first reported vulnerability in a production machine learning model that I know of is from 2004. There are older versions of this from 2003, but those were Usenet posts, just emails to each other, and I've lost those to the ether; the bookmark exists on my computer, so I wasn't imagining things, but it now 404s. What it was was Bayesian poisoning, which is the oldest trick there is. If you ever get into machine learning security, it's one of the first things they teach you. What you do is append a bunch of hammy words to the end of your spam. Sending a single link inside an email is a bit spammy, but if you add a bunch of innocent words, maybe in white text so the reader can't see them, the old email filters would just let you through. This is still effective against some machine learning models, as you will see.
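A minimal sketch of that good-word attack, with an invented toy corpus (the messages and padding words are made up for illustration): padding the same spammy payload with hammy words flips a naive Bayes filter's verdict.

```python
# Toy "good word" / Bayesian poisoning attack against a naive Bayes filter.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

ham = ["meeting notes attached see agenda",
       "lunch tomorrow with the team",
       "project deadline moved to friday"]
spam = ["click this link to claim your prize",
        "cheap pills click here now",
        "you won money click this link"]

vec = CountVectorizer()
X = vec.fit_transform(ham + spam)
clf = MultinomialNB().fit(X, [0, 0, 0, 1, 1, 1])  # 0 = ham, 1 = spam

msg = "click this link"
print(clf.predict(vec.transform([msg])))             # [1] -> flagged as spam

# Pad the same payload with words the filter associates with ham
# (in the 2004-era attacks this text was hidden, e.g. rendered in white).
padded = msg + " meeting agenda lunch team project deadline friday"
print(clf.predict(vec.transform([padded])))          # [0] -> slips through
```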

That was the first time we went: oh God, we got popped, the machine learning model has a vulnerability, and it's kind of unpatchable; you had to do something else. The actual documentation of this came from John Graham-Cumming, who is now the CTO of Cloudflare. In his presentation on this he had a bunch of other bypasses; I think there were six different ways in total that people had come up with in the previous year to get past his machine learning model. That's one of the rules you have to learn with AI security: there are always other bypasses, and machine learning is just not secure. You need to shift your mindset about security when you come to machine learning.

The main thing is that models make mistakes. If you have a model that is 100% accurate, something went wrong. Production models are 99.99% accurate, four 9s, maybe five 9s, and the mistakes they make still come back and bite you. There's no way around the fact that they just make mistakes, and for malware or spam models each mistake could cost one of your customers a lot of money. We build our ecosystems around the fact that these models are just there to stem the tide: they aren't the critical component, they just help a lot. And attackers simply exploit the gaps. If they find a way to get past your machine learning model, they're going to do that same old thing over and over again, and there's no way to prevent it, because if you never make a mistake, you're not doing machine learning. This isn't cryptography. The other big thing you have to know about AI systems is that they don't do well out of distribution.

The classic example from academia: you take MNIST, the handwritten postal digits, and train a model on the 50,000 images in its training set. Then you start feeding it things from Fashion-MNIST, which are in the correct format for the model but which you can't expect it to handle well, and it does all sorts of funky stuff.
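A rough sketch of that failure mode, assuming torch and torchvision are available (the tiny network and the single training pass are my own stand-ins): a digit classifier happily emits confident digit labels for clothing images, because softmax always picks something.

```python
import torch, torch.nn as nn
from torchvision import datasets, transforms

t = transforms.ToTensor()
mnist = datasets.MNIST("data", train=True, download=True, transform=t)
fashion = datasets.FashionMNIST("data", train=False, download=True, transform=t)

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 60), nn.ReLU(), nn.Linear(60, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = torch.utils.data.DataLoader(mnist, batch_size=256, shuffle=True)

for x, y in loader:                          # one quick pass over MNIST
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

x, _ = next(iter(torch.utils.data.DataLoader(fashion, batch_size=8)))
probs = torch.softmax(model(x), dim=1)       # shirts and shoes, not digits
conf, pred = probs.max(dim=1)
print(pred.tolist())                         # confident "digit" labels anyway
print([round(c, 2) for c in conf.tolist()])  # often well above chance (0.1)
```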

The theory some people have with, say, PE data is that if you collect enough of it and train a model well enough, it will generalize to the new stuff. If you went to the RSA or Black Hat floor back in 2018 and 2019, when this was really exploding, vendors would tell you: our AI generalizes to new threats, so we can keep you secure against new things. That mostly works, but one of the problems is that AI doesn't actually generalize that well. Every time a new model comes out and everyone says it's so creative and can do new things, we find out it's actually a remix of things it has already done. Remixing is very helpful; a lot of academic work, and music, is built on remixes. But when genuinely new things enter the system, which happens all the time with PE data and especially with malicious stuff, remixing doesn't get you there. And we know from security that attackers do weird things all the time to get past your defenses, so you're always going to get out-of-distribution data, and you're always going to make mistakes.

To make a comparison that drives this home: imagine you have a single, really simple API. It takes in some bytes and returns some bytes, something you could implement in under 5,000 lines of C, with no nginx, no Apache, just pull the packets off the kernel and reply, the simplest thing you can actually get working. A competent engineer could get that to the point where you put it in a box, ignore it in the corner, and be pretty certain that box isn't going to get popped. But that's a constrained thing. People have gotten systems like that secure. When have you heard about the actual financial-transaction part of a bank getting popped? That doesn't really happen anymore; there were major changes in the '80s, plus regulations, and those transactions are handled on mainframes. Things around the banks get popped, but the actual financial transactions are handled well. Now think about securing an enterprise with tens of thousands of endpoints, no idea what they own, and users clicking on stuff all the time. That security problem is not the same as the first one. You're not trying to secure one system; you're assuming a user will screw up and you're going to get popped, so you want good EDR, endpoint detection and response, a nice SOC in there, and people for incident remediation.

You direct fewer of your resources toward securing your environment and maintaining a wall between you and the world, and more toward making sure that when something happens, you have a good response time and the damage is minimized. You should think about AI as the second case, not the first. And it's even worse than that, because defending machine learning models is harder than defending a large enterprise with unknown unknowns, shadow IT, and a bunch of other stuff. This is one of my favorite examples of that.

This is a two-dimensional slice of a 784-dimensional space, and each color is, roughly, a unique decision made by a neural network. And this is just a two-dimensional slice: for this very simple neural network there were 2^60 regions in total. It's a very simple network, 60 hidden nodes, which means a very small number of parameters; this thing had thousands of parameters, and now we're talking about ChatGPT with trillions. Each of those regions is a unique decision, and we know from studying machine learning models that if you're in one region and you move over to another, the model's decision can change, and you don't know that it's going to be correct. That's the thing called adversarial examples, and to me this picture is a good representation of why you should be concerned. You don't know what's going on in most of these regions. You have 2^60 of them, you've only sampled and know your performance in maybe 60,000 of them, and 2^60 is much bigger than 60,000, so you can't control all the regions you haven't seen.
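As an aside, here is a back-of-the-envelope for where a number like 2^60 comes from, assuming the slide's network is a single hidden layer of 60 ReLU units: each unit defines a hyperplane, and the standard hyperplane-arrangement bound (Zaslavsky's theorem) says n hyperplanes in d dimensions carve out at most a certain number of linear regions, which collapses to 2^n whenever d >= n.

```latex
r(n,d) \;=\; \sum_{i=0}^{\min(n,d)} \binom{n}{i},
\qquad
r(60,\,784) \;=\; \sum_{i=0}^{60} \binom{60}{i} \;=\; 2^{60} \;\approx\; 1.15 \times 10^{18}
```

Against that, a 60,000-sample training set can visit at most 60,000 of those regions, which is the mismatch he's pointing at.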

So the first thing you should always do is assume someone is going to pop your model. No matter what you do, there's nothing that can stop it. All you can do is delay the pop, hopefully see it when it happens, and fix it as quickly as possible. That is AI security in a nutshell: delay it so you deal with it less, and when it happens, see it and respond as fast as you can. That's it. There's no real "defending the model" with robust security and a big firewall; a lot of the AI firewalls don't buy you time, and buying time is all you care about. If you are introducing new systems, your questions are: will this buy me time, and will this let me respond faster? Those are the two questions you ask and answer with AI systems. Now, this next slide is more for the people who already know what I'm talking about: what about my defenses? People bring up adversarial training, but it doesn't really help. For malware models, I found that it seems to hurt the model more than it helps; it didn't buy me time. Adversarial detectors: it's very easy to build an adversarial detector that's overtrained on something like IBM ART.

But my attackers aren't going to do that. Adversarial layers, same story. If you want a better understanding of how AI defenses work in reality, go see all of the work of Nicholas Carlini. He is a lovely man who loves finding poor academics who have come up with an AI defense they're very proud of, and then wrecking it. That's his favorite thing to do, and he's very good at it; so far he has won every match and defeated every defense he has put his mind to. So what have we done? How do we actually manage AI risk? Here's a quote from one of the founders of Facebook's site integrity team: "We measure ourselves in the speed we can detect and respond to new types of attacks and mitigate the potential damage caused." Facebook knows this: their spammers always find a mistake in the spam detection service and ruthlessly exploit it until it's patched. That was their posture back in 2007. For malware, here's one of those graphs where I was forecasting how bad things were. I don't need to explain the whole graph; it basically says we get popped in about three months. This model was deployed on January 1st, 2019, and about three months later we knew there was a bypass, because the blue line went up, and so we would forecast accordingly.

So, some historical strategies, because all is not lost: people have been doing this for 20 years and have come up with ways of dealing with it. The first is robust features. Not robust models, robust features. What I mean is that you don't let the model see things the user can control; you use features that are hard for users to manipulate. If you come from the LLM world and want the newer word people use for this, "embeddings" are the new term, which is mathematically more correct, but I'm used to calling them features.

For spam, you basically use the behavior of the spammer. For malware, you use features that change the execution behavior, which is very hard to do correctly. For LLMs, you want to use API key behavior: when you register a session and an API key with OpenAI, they monitor how you use it, and they look for abuse not just in the content but in what you're actually doing. Now I want to give you an example of how to screw this up very badly: Cylance ("silence, I kill you"). Remember the very first disclosed vulnerability, the one from 2004?

Well, this one is from 2019. Cylance created a PE classifier that tries to decide whether the portable executable you've just downloaded from the internet is malicious or not. You take the PE file, look at the header data, do static analysis, and Cylance pulled about 7,000 features out of that to create an embedding, a feature vector. They fed it to a model trained on hundreds of millions of PE samples, roughly 50% malicious and 50% benign, trained to say good or bad, and if it says bad, the file doesn't get to execute. But they had a problem: Rocket League and Fortnite do shady stuff on your computer. They do all sorts of kernel tricks to make sure you're not cheating, and Epic Games does all sorts of kernel shenanigans to make sure the DRM works and the cheating isn't happening. So the classifier had a lot of false positives on Fortnite and Rocket League, and what Cylance needed was a way of allow-listing all of Rocket League's binaries.

But Rocket League patches its binaries every so often, and Cylance didn't want to keep updating the allow list for each version: a new version would come out, all their customers' Rocket League installs would get flagged, customers would complain, Cylance would add the new version to the allow list, and around it would go. So to make this work they used centroids. These feature vectors are points in space, so they took Rocket League's point, put a sphere around it, and said: every time a new binary comes in, if its embedding lands within the sphere, let it run. Which seems like a good idea: put a tidy bow on it, and with the right static analysis you only allow Rocket League-like things to run. Except the features include string data, and the string data is packable: you can put whatever you want in the string-data section. So what Skylight Cyber did was take the strings from Rocket League, append them to some malware, just in the string-data section, and hand it to Cylance, and Cylance said: cool, this is Rocket League, I'll let it run. You could take any piece of malware, append a bunch of Rocket League strings to the end of the executable, and Cylance would let it run. Which is not good.
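A minimal sketch of that failure mode (invented feature values and strings, not Cylance's real pipeline): an allow list implemented as a sphere around a centroid in feature space, where some feature dimensions come from strings the attacker can append without changing execution.

```python
import numpy as np

# Toy static featurizer: two header-derived features the attacker can't
# easily change, plus counts of known strings they can freely append.
VOCAB = ["UnrealEngine", "Psyonix", "TAGame", "RocketLeague"]

def featurize(header_feats, strings):
    counts = np.array([float(strings.count(w)) for w in VOCAB])
    return np.concatenate([header_feats, counts])

rl_strings = list(VOCAB)
centroid = featurize(np.array([0.9, 0.1]), rl_strings)  # "Rocket League-like"
radius = 1.5

def verdict(sample):
    # The allow-list check runs before the model ever sees the file.
    if np.linalg.norm(sample - centroid) < radius:
        return "allow: matches trusted centroid"
    return "block: send to the model"

malware_header = np.array([0.2, 0.8])
print(verdict(featurize(malware_header, [])))          # block: send to the model
print(verdict(featurize(malware_header, rl_strings)))  # allow: strings pulled the
                                                       # embedding inside the sphere
```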

So good features are features that are hard to modify. The strings don't affect execution, so if you're good at your featurization you shouldn't use them; they're too easy for users to change. What do you use instead? For spam, IP address behavior: there are IPs that are simply blocked from Gmail and will never get an email delivered, because they send too much spam, and that's a huge part of how Gmail deals with spam now. In 2023 and 2024 they even started making you pre-register your bulk email to Gmail, locking senders down behaviorally so you can't spam. Then there's domain behavior, and there is a long list of similar key indicators of spam on social media sites like Quora that I'm not going to get into. For malware, the very fact that it has to execute constrains the attacker: you can't just do something arbitrary, you have to produce something that actually executes on the user's machine, and that helps constrain people.

One caveat: you can just pack your binary, and that gets past the crappy detectors. Dynamic behavior is really what you want, but it's really hard to do, and I can talk about that some other time. For LLMs, count on API behavior. If you keep asking for violent or malicious things that OpenAI doesn't want you to do, they will see it. Users ask for a violent thing maybe once a week, and that's normal and fine. But if you ask for violent things repeatedly, over and over and over again, you get a warning. That's how OpenAI deals with it.
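A minimal sketch of that kind of behavioral monitoring (the window and threshold are my own toy numbers, not OpenAI's policy): the key is judged by its rate of flagged requests over time, not by any single request.

```python
import time
from collections import defaultdict, deque

WINDOW_SECS = 7 * 24 * 3600   # look back one week
MAX_FLAGGED = 3               # a couple of hits is normal; a stream is abuse

flagged_at = defaultdict(deque)  # api_key -> timestamps of flagged requests

def record_request(api_key, content_flagged, now=None):
    now = now or time.time()
    q = flagged_at[api_key]
    if content_flagged:
        q.append(now)
    while q and q[0] < now - WINDOW_SECS:  # drop events outside the window
        q.popleft()
    if len(q) > MAX_FLAGGED:
        return "warn/suspend key"          # behavior, not content, trips this
    return "serve"

print(record_request("key1", True))        # serve: a one-off is tolerated
for _ in range(4):
    status = record_request("key1", True)
print(status)                              # warn/suspend key
```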

The next thing for AI security: since you know you're going to get popped, you don't want to talk about it. Obscurity is security here. Not telling your users how the system works means attackers don't know what they're doing when they try to mess with you, so it takes them longer to get past you; you just wouldn't want to say anything. A bad example: Proof Pudding. Proofpoint used to reply with scoring data whenever you got an email, and that data was basically the output of their machine learning model, just handed to you. Will Pearce and Nick Landers used it to steal the model weights, and then they could do all sorts of fun stuff, because Proofpoint had given them way too much information. If you steal the model the right way, you can craft custom emails on your home computer, make sure they get past Proofpoint, and start sending them without Proofpoint ever learning that you're making nice emails that slip past. Proofpoint removed that data from the response. It wasn't needed, and nobody cared about it except people trying to mess with Proofpoint, so it didn't need to be there.

Now, good obscurity. This comes from a social media website, not the biggest one in the world, and it's how they handled spam. They had a bunch of user data in a data set and a bunch of small featurizers. One featurizer might be: which country is this IP address from? Another: how often do these people log in? All sorts of different featurizers, way too many of them. If they trained a model on all of them it would be a very good model, but they didn't want the best possible model for next week; they wanted to maintain the service. So they selected a random subset of 50% of those features, just 50%, and trained a model on only those, so the model couldn't see all the possible signals of how malicious users use the website. And because each model could only see a random subset that was different each time, each spam model behaved very differently from the previous one or the next one.
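A minimal sketch of the trick, with stand-in random data (the shapes and the LogisticRegression choice are mine): each retrain draws a fresh random half of the feature columns, so whatever the spammers learned about last week's model goes stale.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng()
X, y = np.random.rand(5000, 200), np.random.randint(0, 2, 5000)  # stand-in data

def weekly_model(X, y):
    cols = rng.choice(X.shape[1], size=X.shape[1] // 2, replace=False)
    clf = LogisticRegression(max_iter=1000).fit(X[:, cols], y)
    return cols, clf                      # deploy clf; score with X[:, cols]

cols_w1, model_w1 = weekly_model(X, y)    # this week's model
cols_w2, model_w2 = weekly_model(X, y)    # next week's sees different signals
print(len(set(cols_w1) & set(cols_w2)))   # overlap is only ~50 of 100 columns
```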

So the spammers, frantically trying to get past this thing, had to relearn a new model every week. One of the best AI hacking groups in the world is just the search engine optimization people: they get to know Google's search engine, the way YouTubers get to know the algorithm, and the people trying to spam this website got to know the spam algorithm. To prevent them from learning it, the site made the spam model change its behavior quite significantly week to week. The performance each week was less than the best they could ever have gotten, but because it changed week to week, the spammers were constantly having to learn new things, were kept on their toes, and could never really get their footing, so the overall spam was lower. Just keeping things obscure and hard to learn is very, very useful, and it's part of why people don't know the history of this stuff: one of the principles is don't talk about it. The last one: speed is security. This is the pronouncement moment, speed is security. I'm sure many of you have dealt with incident response, and you know speed is security in incident response.

For AI systems you do a few things. The initial response to a new attack is a blocklist, and a fragile blocklist is all you need. In the Cylance example, if that centroid allow list had been a temporary measure they ran for three days while they fixed the problem and redeployed a new model, it would have been great; because they used it perpetually, it was not a good idea. There are all sorts of ways to do blocklists. They don't have to be robust and they don't have to last; they just need to get you past the incident. Then there's the layered security principle: different types of blocklists, different types of models you can redeploy. If you have the machinery to redeploy quickly, use it on the cheapest layer: if there's a new phishing strategy going around with ransomware in it, retraining your malware model is probably more expensive than retraining your spam model, and retraining and redeploying spam models is faster. Put a quick block on the thing, then lean on the different types of models. Being faster matters more, and layered security allows you to be fast.
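A minimal sketch of that layering (hypothetical names throughout): a fragile blocklist you can update in minutes sits in front of the model you can only retrain on a days-long cycle.

```python
import hashlib

BLOCKLIST_HASHES = set()          # fragile, updated the moment a bypass is spotted

def classify(sample_bytes, model_score):
    h = hashlib.sha256(sample_bytes).hexdigest()
    if h in BLOCKLIST_HASHES:                 # layer 1: instant blocklist
        return "block"
    if model_score(sample_bytes) > 0.5:       # layer 2: model, retrained on a
        return "block"                        # days-long cycle
    return "allow"

# A bypass is spotted: one line buys time until the model is retrained.
BLOCKLIST_HASHES.add(hashlib.sha256(b"<bypassing sample>").hexdigest())
print(classify(b"<bypassing sample>", lambda b: 0.0))  # blocked despite model miss
```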

Retraining time: do your best to make it as short as possible. If it takes you a week to retrain your spam model, you're screwed; you may have to redeploy it every three days, so you've got to get faster. And the last one, which is really hard, is detecting a breach. One of the problems with AI systems is scale: if you are CrowdStrike, you see billions of PE queries a day, a new strain of malware could be hiding in those billions, and there won't be many samples of it, yet one of your customers could get really hurt by it. So how do you find that new strain among the billions? A single SOC sees only a small fraction of that traffic and might be able to spot a new thing and raise an alert, but once you're at CrowdStrike's level, where you actually have to maintain the model, it's much more difficult to find the needle in the haystack. There are ways to do it, and the advanced players in industry are well ahead of academia, government, and everyone else.

I interviewed with Facebook in 2018 and saw people describing canary systems they were using then; the same ideas showed up in papers in 2022. Academia and the outside world were four years behind what Facebook was already using. And they spend way more on the detection side of detection and response for machine learning models, because the response side, retrain, redeploy, patch, is easy to automate; detection is the hard part, and they spent way more on it than anything else. The last strategy: learn from your attackers. Don't defend against threat models that don't exist. The search engine optimization people are attackers; they have conferences, they read weird papers, they have theories, and you can talk to them and see how they work. If you understand how they work, you'll see how a lot of the threat models for AI risk management at the major companies actually play out. The main thing is that attackers try stuff until it works, and then they teach and sell it to each other; the Facebook spammers do that too. Now here's one of my favorite examples. A lot of the snake oil salesmen will tell you: we've got a way to prevent adversarial examples from affecting your model.

Cool. First, I don't believe you. Second, have you ever seen an adversarial example actually attacking your model? There is delicate, tricky math to get those working, and they don't really work for spam and malware and tabular data, which is a lot of security data, for structural reasons. You need a very good understanding of machine learning, which is expensive, or you can use the wisdom of the crowd, which is cheap and already happening: there are already thousands of people writing spam and malware, it goes out of distribution, and it's cheap. If you were trying to attack a machine learning model, which would you choose? That's part of why we don't see adversarial examples; we see people doing weird stuff instead. After EMBER was released in 2018 and people found out models were sensitive to imports, malware authors started shoving random imports into their malicious software, which was not a normal thing before that release. They learned it and just started trying it to see if it worked, not because they were doing anything formally adversarial. Now, model poisoning is another thing people bring up.

You can inject a very small amount of data into a model's training set and drastically affect the model's behavior at a target point. If you want your model to misclassify something malicious as benign, you can prepare the area by putting in a bunch of benign binaries that have the same static analysis profile as your malware. You keep uploading benign data for a year, prepare the area, and when you finally release the malicious thing it can really screw with those models, because they're not going to detect it.
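A minimal sketch of that "prepare the area" idea on a toy nearest-neighbor detector with invented 2-D static features: benign-labeled points planted where the future malware will land flip its verdict.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
benign = rng.normal([0, 0], 0.5, (200, 2))
malware = rng.normal([5, 5], 0.5, (200, 2))
X = np.vstack([benign, malware])
y = np.array([0] * 200 + [1] * 200)       # 0 = benign, 1 = malware

future_malware = np.array([[5.0, 5.0]])   # the sample you plan to release
clf = KNeighborsClassifier(5).fit(X, y)
print(clf.predict(future_malware))        # [1] -> would be detected

# The attacker spends a year uploading benign files whose static features
# land exactly where the future malware will sit.
poison = rng.normal([5, 5], 0.2, (50, 2))
X2 = np.vstack([X, poison])
y2 = np.concatenate([y, np.zeros(50, int)])
clf2 = KNeighborsClassifier(5).fit(X2, y2)
print(clf2.predict(future_malware))       # [0] -> slips past the model
```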

But how is anyone actually doing that? Are they really going to spend that much effort? It's very difficult. People talk about this for LLMs, but what's the point? Dilution is the solution to pollution: when you have torrents of data, the firehose of stuff going into LLMs these days, the aggressive deduplication and data handling mostly cleans up your poisoning. I know there are going to be people who argue with me on this, and I'm happy to have that argument, but I don't see poisoning being that big a deal for the large production models.

If you have a small model curated for a single purpose, then poisoning could be a real case, but then you have to ask why you'd bother with poisoning instead of something cheaper and simpler. Anyway, that's the history; people have been talking about poisoning VirusTotal for a while. If you had worked in machine learning security, building a malware model or a spam model or something like that, over the last 20 years, you would have learned all of this on the job, and then you just wouldn't have talked about it in public. We have people talking about it at AI Village a lot; that's how I learned it, it's very common knowledge there, and we'd have shop talk about these things. But there aren't any books that tell you how to do this, and the papers on AI security are mostly written by academics who haven't worked in industry and don't think about the threat model the way the professionals do. So it is different.

One of the biggest problems with the industry is that the people who really know how to do this security work aren't talking about it. And now, since LLMs, we have a bunch of people who've never done it talking about it, and they think they know what they're talking about because they read a paper from someone at MIT who also thought he knew what he was talking about, and who did excellent academic research but not very good work at actually securing machine learning models. So, to give a recap and help you understand how things fit together, I've come up with a Gartner-style diagram; I'm going to throw darts at a board.

These darts won't mean much, but I'm hoping they explain some of these ideas. The first axis of AI security is how many attackers you have. You can lower it with verified accounts, API limits, and being less popular. I'm guessing your CEO doesn't want to be less popular, but you want to be low on this axis, with fewer attackers coming at you, and about the only lever you really have is to respond only to queries from verified users; some of this is impossible. The second axis is how much control the attacker has over your model. You manage that by picking harder-to-manipulate features, exposing less public information, and being a moving target; you reduce attacker control by moving inward. Then you have contour lines, the lines of equal pain: a bad spot is where you have to redeploy every week, a good one is where you only have to redeploy every year, with the medium ones in between. You really want to think about where you belong on this graph: how much control do my attackers have over my model, and how many attackers do I have to deal with?

So here are some of my Gartner darts. Modern spam is up here: on large social media sites they have to redeploy every three to seven days, aggressively, because so many people are coming at them. But because they lean on behavioral data, it's really hard for spammers to change their behavior: spamming less means making less money, and they want to push as much content onto your site as possible. Building on behavior makes the model genuinely hard for them to control, but they have ways around it, and they do find new stuff, so you're redeploying every few days. Then we have malware models: far fewer people can write good malware than can write spam for a social media site, so you're much lower on the attacker-count axis, but the attackers have way more control over your system, because they get to decide how the executable actually works. And then you have self-driving cars. It's actually really hard to set up an attack on a self-driving car, if you really think about it.

Those stickers don't work all that well. They do work, but if you actually think about it, what's the point of the stickers? So I'm happy to argue with you about the self-driving car's location on this chart. Going back to the Cylance example: they screwed up and introduced a machine learning vulnerability by making themselves too easy to manipulate; they basically moved left on this chart. It was a vulnerability in how much control they gave people, control they didn't need to give. I would also add prompt injectors to this chart; they probably go here. You have way more control over what the input to the LLM is. There are people researching this and trying to change it, but for now you have more control than before. Honestly, though, how many attack surfaces are there where a prompt injection will actually achieve a goal and get past a security control? I'll have more slides at the end, but how many actual prompt injections will cause a system to act, send an email, or do something bad? We have theorized about this, but I hope not many people have actually deployed control systems where an LLM can do bad things.

So: mature teams can't control how popular their services are, and better features are too expensive, or impossible, when you first start. When you deploy your first machine learning model in the wild, you're going to be out here, probably over here, because you're giving away too much control and you don't have a good security posture yet. As you get more popular you move up; as you mature, you should be figuring out better features and better controls and moving inward. There are some machine learning systems where you're never really going to be able to move inward, but hopefully you get as close to the origin of this graph as possible. Mature teams sit as close to the origin as they can, and they can't get closer without spending too much money or telling their CEO that the service has to grow less, which is not going to go over well. Eventually they've gotten the model to the point where it's as good as it's going to get, the bypasses keep coming in anyway, and they stop working on making the model more secure and start working on responding faster. That brings us back to the Facebook quote: we measure ourselves in the speed we can respond to new kinds of attacks and minimize the damage.

That's really it. So what changed with Transformers? We kind of already had them in AI systems. Three-dimensional Gartner graphs work worse than two-dimensional ones, because they're very hard to visualize on a slide, and the point of this kind of slide is to give your consultants something to sell you. But there is a new axis: there are new threats. GPT-4's paper has a long list of them: harmful content, harms of representation, disinformation and influence operations, privacy, cybersecurity, and in the paper they showed a bunch of examples of how they mitigated some of these. Anthropic released a paper shortly after, in November of 2022, with a graph of the different ways red teamers were or were not successful against their models; these are the different attempts their red team of 111 people made, all just trying to get the model to do bad things. This is the early days of red teaming a model: if you were making a malware model, you would get a bunch of reverse engineers, people who understand how malware works, to red team it just to figure out where the holes are.

This type of red teaming is the first thing you do, to figure out where you need to patch and what you need to change. At Generative Red Team 1 we did the same thing, also one-off examples. We had eight different models and 21 challenges, including a credit card leak and economic misinformation. The credit card one was: get the model to tell you the hidden credit card number. This relates to a CWE that was released a couple of weeks ago. In a real system the LLM would not have access to any credit card other than your own, so getting the LLM to leak your own credit card information to yourself is kind of pointless; if a bank were actually to deploy this, they would not train the LLM on credit cards, and the pre-prompt would not give the LLM access to anything other than your own data. That's what the recommendations from Gavin Klondike, who's on the OWASP Top 10 for LLM Applications team, and the actual recommendations in the CWE say. Calling this prompt injection is like calling a SQL injection a database vulnerability: it's mostly a vulnerability in the surrounding system.

The other challenge was to get the model to produce false information about an economic event or fact, where the false information has the potential to change things politically. This is one of the big threat models OpenAI really freaked out about when they said they couldn't release GPT-2. Back in 2019 they were drumming up a lot of press saying GPT-2 was too dangerous to release because of the damage it could cause to society, and the amount of disinformation the model could produce was one of the things they kept bringing up. But, cool, so you can make a modern system hallucinate; what's the harm? Meanwhile, large disinformation operations are hiring people on the cheap to run these campaigns, and I can download Mistral 7B and run it on my local hardware without touching your LLM or your service, and without violating your terms of service. So for the first bunch of those harmful-content threats: I can also make harmful content with cheap labor online. ChatGPT producing harmful content is more of a brand problem for OpenAI than an actual security problem. It's bad, and we shouldn't do it, but there are easier routes.

If you actually wanted to do this, there are simpler options. One thing people bring up is that it will help you write spam and phishing emails, but you can just take an LSTM, like my friend John Seymour did in 2017, train it on your Twitter data, and that makes excellent phishing spam that far too many people clicked on. And that was a dumb model; you don't need the latest LLMs for it. There are two things people are really worried about, though: cybersecurity, since we're at a security conference, and the potential for emergent risky behaviors. On the security side: LLMs are actually kind of bad at writing malware, as far as my friends and I can tell. That might change; the fact that they don't really learn new things, just do good remixing, holds them back, and it might change with the next version of GPT. They do make you more productive, so you might get more malware out in the ecosystem, but I'm not that concerned personally, and I'm happy to be argued with. Text-to-image models are where one of those emergent behaviors showed up, a behavior the creators didn't think about, and for an understandable reason: most humans don't think about this as a threat.

But this is really the problem we have with these large systems: suddenly they can do something weird that we didn't think about, and that's a problem. Image generation could have been generating CSAM. With Stable Diffusion, I know a lot of effort went into cleaning things out, but one of the data sets many of these models were trained on, LAION-5B, had around 1,600 CSAM images in it, which means those models were probably capable of generating CSAM: a model that is capable of generating porn and also understands the concept of children probably can.

This is one of those emergent behaviors I didn't have to think about when I was doing malware work. But if you want to sit down and solve this problem, you can turn it into a classification problem, which is exactly what people have done with prompt injections, and then you start playing the game we've been playing for the last 20 years. You have a semantic classifier that checks whether the person is requesting the bad stuff you don't want them to request, with layers of security like keyword matching, and you start playing that game.
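A minimal sketch of that game (toy keyword list and a stand-in classifier; a real deployment would train the classifier on labeled attack prompts and redeploy it as bypasses appear):

```python
KEYWORD_LAYER = ["ignore previous instructions", "system prompt", "developer mode"]

def looks_like_injection(prompt, semantic_classifier):
    lowered = prompt.lower()
    if any(k in lowered for k in KEYWORD_LAYER):   # layer 1: cheap keyword match
        return True
    return semantic_classifier(prompt) > 0.8       # layer 2: trained model

# Stand-in classifier; imagine a fine-tuned text model returning a 0..1 score.
score = lambda p: 0.9 if "pretend you are" in p.lower() else 0.1
print(looks_like_injection("Ignore previous instructions and ...", score))  # True
print(looks_like_injection("What's the weather like?", score))              # False
```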

This is one of those problems that is emergent, and we see it happening, but the solution isn't to invent a whole bunch of new stuff; it's to play the game you've been playing for the last 20 years: build a classifier, get into the groove of deploying and maintaining it, and don't freak out. We've been dealing with this sort of thing for a while. Anyway, thank you very much. [Applause]