
Now please join me in welcoming Rob and Ashik as our next speakers.

Excellent, thanks so much. Fantastic to see BSides SF back in full force this year — really impressed, as always, by this conference. My name is Rob Reagan, and this is Ashik Rachan. Ashik and I have actually been collaborating on research, running experiments together, and honestly just messing around and seeing what we could learn, for about six years — but the first time we met in person was yesterday. We were often meeting on Saturday mornings, coming up with ideas and deciding to go build some prototypes and see what we could learn from them, and what you're about to hear really stems from some of those sessions, which we started about six months ago. We were hearing and reading so much about how people are going to be fine-tuning their large language models, and how this is going to be a big risk for data leaks, but we couldn't find any details on exactly how that would happen, how it works, what can go wrong, and what you can do to fix it. So this really stemmed from our effort to figure that out. A big observation that came out of it, though,
was that the generative pre-trained transformer algorithm is very simple: it has no concept of security built into it. All efforts to secure it, to make it safe to use, will have to come from data scientists or security engineers focusing on making that happen. There's a lot that can be done before you're preparing to fine-tune a model, and a lot that can be done after, so we're going to cover that broader spectrum. And this is really the paradox: we expect all of these large language models to be trained on massive, diverse data sets — but how do we know how much sensitive information is in there? That's really something that we label and decide, and it's a hard question to answer. These models are going to generate and return whatever they have available to them; they don't have a concept of what is sensitive, so that's something we have to figure out in our usage of them. As I've been learning more and more about how to leverage LLMs, I've tried to remind myself — and I think this is something everyone can take away — that if it's not doing what I expect, or I'm not getting it to work the way I want, I'm probably using it wrong. It's just a tool; it's not its fault that it's not doing what it's supposed to do. It's us, the people applying these tools, who have to take responsibility for learning how to use them more securely. We'll dive more into that. So the new AI stack, I think, really looks like this: there's the underlying infrastructure, the MLOps you're building these models on, and then there's model development, and this is really where we're going to be focusing. A lot of what
we did in this experiment was focused on data set engineering: what can you do to prepare your data, and what can you do when fine-tuning an existing model? There are only a handful of companies out there that we think are really going to be building the massive foundational large language models, so this is more applicable if you're a company trying to use one of those, or an open-source model, to actually power your applications. On top of that, you can do prompt optimization and engineering, you can use retrieval-augmented generation, and then you can build other applications and user experiences on top. But what we're focusing on in this conversation is really that model fine-tuning and development layer, and what you can do there. So, just to level-set on the problem: data leakage is the ability of the model to simply reproduce or memorize sections of the data it was trained on. This occurs because all it's really doing is building statistical patterns on its training data — what to predict next based on what came before, and whatever else it's prompted to reference in its context window. So if what it was fine-tuned and trained on is personal information, trade secrets,
or copyrighted material, then that content being returned is ultimately the problem that puts us at risk. And I wanted to refresh how the model sees things: this is all it sees, all it knows, in the text it's fine-tuned and trained on. The text is converted into embeddings, which are represented as tokens; the tokens, as you often hear, are either a word or a partial word, and those highlights really give you a sense of what it is treating as a token. Ultimately, the text is just represented as these numbers.
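(As an aside to make that concrete, here is a minimal sketch — not from the talk — using the Hugging Face transformers GPT-2 tokenizer to show how text, sensitive or not, becomes nothing but token IDs; the sample strings are fake.)

```python
# Minimal sketch (assumes the Hugging Face "transformers" package is installed).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["Frodo carried the Ring toward Mordor.",
             "SSN 078-05-1120, card 4111 1111 1111 1111"]:
    tokens = tokenizer.tokenize(text)            # sub-word pieces ("partial words")
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(text)
    print(tokens)                                # a number splits into pieces like any word
    print(ids)                                   # this is all the model ever sees
```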
The patterns of what comes next, based on whatever it has seen the most of, are how it produces output. And that text looks the same as sensitive data. To us, this jumps out as alarming — if we saw plaintext Social Security numbers, credit card numbers, or other PII in a data set — but to the model it is exactly the same as any other text. Some of the root causes — we'll dive deeper, but just to set the stage — are improper preparation of your training data, repetition of sensitive data, and content that was input through a reinforcement-learning-from-human-feedback loop and didn't get properly sanitized. Something we learned that was really interesting, and we'll dive into what this means, is that the structure of the data matters. If you're taking your data and using it to fine-tune an open-source model, for example: these models aren't good at structured data, they're really good at unstructured data, so if you're not removing things like delimiters or XML or JSON formats from what you're putting into the fine-tuning process, that can create problems. The padding — the preparation of the size and length of the data you're fine-tuning on — matters too, and we'll see a little more about that. The impacts: this can affect the reputation of the organization, people lose trust in it, whatever data is retrieved might lead to identity theft, and I think the regulatory side is what scares folks the most. One I'm hearing more and more about, especially recently, is: am I at risk, when using a code-gen model, of getting code that is actually fully memorized and is going to land me in copyright infringement? Something I've dug into a little more: Microsoft said back in September 2023 that if you use their models provided through the Azure OpenAI Studio, they will cover you if you end up in a lawsuit or a case over this — you can submit a claim and they'll cover the costs of your infringement. Then in December they added some fine print: you also have to have used their system prompts (they're calling them metaprompts), and you have to show that you tested the implementation and evaluated it for copyright infringement.
They provide no other guidance on what that means, other than that you should go run a red team on the model, and there's a high-level description of that — we'll dive a little more into what that means as well. So we decided we wanted to actually see what causes these leaks. We wanted to see what the anti-patterns are — what mistakes you can make — so we tried to make mistakes on purpose, and then figure out what's most useful to actually prevent this. We set out to build a model that actually leaks, and then to try to fix it. I know a lot of folks, and a lot of what we're seeing used these days, are applying retrieval-augmented generation as the primary way to go to a data set, use it for some contextual search, and bring back facts — "I want to avoid hallucination," or "I want the most up-to-date data in my response." We're not seeing as many people use fine-tuning yet, but I think we will. Fine-tuning is a really good use case for whenever you can't express in a prompt how you want the model to behave, and you want to just show it instead: here's data I want you to emulate; set this as the tone or the format of the responses, or the quality of output I want you to resemble; here are failures in what the base model is doing that I want to put into my data as examples to follow (this can help eliminate problems with edge cases); or here's an entirely new skill — study this body of text, and that's what I want you to be able to generate responses about.
So I think right now, where we're at, we're seeing a lot of people build these Q&A chatbots, and they're using RAG for that. Fine-tuning can help a lot more with summarization, but I think as time goes on we're going to see a hybrid of the two, and as people run into the limitations of RAG, they're going to be fine-tuning as well. Yeah — so do you want to talk a little about the steps? Yep. More often than not, something we've realized is that people think fine-tuning a model is really complex, but fine-tuning is actually the easy part; the more difficult part is everything that surrounds it. So, quickly running over the steps for fine-tuning a large language model: first off, as Rob was mentioning, we don't really go out and build the bigger foundational model; rather, you choose a base model first — arguably one of the many models out there, your GPT-2s, your BERTs (we'll dive into that a bit). You choose the pre-trained model that works best for you. Then most of the time is arguably going to go into preparing your data set, where you choose the data, clean it, and test smaller chunks of it to make sure it's what you want. Once you're happy with the data set, you go to training. Training is arguably the easiest part here — it's probably a script, sub-100, sub-200 lines of code, nothing more than that. Once you've trained it, you evaluate it: you make sure the model you've trained is actually doing the job you wanted, and that it's better than the untrained base model. Once you're happy with the evaluation, you go to deploy. Deploy is a little more complex: once you deploy, you go back, collect data, prepare it, and it starts going into a feedback loop where you keep training the model as you go. But for a large portion of this talk, we're going to focus on preparing the data set. OK — so, the experiment.
Rob, do you want to talk about that for a second? Yeah — this was our idea of how we could make a model that's broken on purpose. We decided to combine the full text of The Lord of the Rings with 5,000 records of PII — we were trying to make this thing leak — and then we wanted to see whether you can go back and retrieve the PII from Gollum. We basically built this with a variety of different models, just to observe what would happen. Yep. So here's what we did: we had GPT generate a bunch of fake data for us, which we then took and put into Gretel. Gretel is one of the more popular tools for data augmentation: if you give it a small sample set, it generates a lot of fake data. So we took a couple of records GPT generated, gave them to Gretel, got variations of PII, and then smushed it all together with the Lord of the Rings book. Yeah, I think this has a lot of utility and is very useful — I'm in no way affiliated with Gretel, but it really helped in the process. Basically, we had ChatGPT generate about ten lines of PII and were able to give that to Gretel, and Gretel is really designed to help you generate synthetic data, so none of the PII you're seeing here is real — nothing in this presentation is violating anything. This was super easy to do, and I think there's a lot of utility in it. I've seen cases where a customer says, "we're worried about giving our data science team access to our production sensitive data" for their data science experiments, or even for practicing fine-tuning and training models. This is a really prime use case, and an easy way to make that happen: generate versions that have the same format, data type, and length range, but are just fake versions of that data.
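(The speakers used ChatGPT-seeded records plus Gretel; as an illustrative alternative sketch — not what they ran — the open-source Faker library can produce fake PII with the same shape. The field names here are assumptions.)

```python
# Minimal sketch of generating synthetic PII locally (pip install faker).
from faker import Faker

fake = Faker()
records = [
    {
        "name": fake.name(),
        "email": fake.email(),
        "phone": fake.phone_number(),
        "ssn": fake.ssn(),
        "credit_card": fake.credit_card_number(),
        "birthdate": fake.date_of_birth().isoformat(),
    }
    for _ in range(5000)
]
print(records[0])   # same format, type, and length range as real data, but fake
```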
So that — ChatGPT plus Gretel — is what we used to build it, and then the plan was to integrate it with the Lord of the Rings text. This is what the finished sample looks like, or at least a smaller portion of it that fits on the screen: most of it is a mix of Lord of the Rings text with PII splashed in. We did run a couple of variations; we'll walk you through the demo really quickly, and we're super excited for that, but at a very high level it's just bits and pieces of fake PII mixed in throughout actual blobs of Lord of the Rings text. Yeah, and I think this actually matters: you can see we had to prepare the text into strings that are all the same length. That's part of the padding process, and it's very important. If you have strings of all different lengths, or different versions of this, you might end up with tokens detecting patterns that run over that length, or behaving on groupings of tokens that fall within whatever context size of the model you're using, and the model can end up blurting out parts of another training line that you didn't want it to. OK — so now let's step into the actual experiment.
Before we get into the actual tech bits, a quick overview of what we did and how we prepared the larger experiments. On a larger scale we took two classes of models, with a couple of commonalities: the training set was the Lord of the Rings book — we didn't really change that — and we took 5,000 bits of PII and splashed them across the same book. Here's the difference: in the first class of models, each record was duplicated 100x, so every person's PII repeated in a pattern; in class B, every PII bit occurred exactly once. For class A we took a GPT-2 model — yes, it's old, but arguably a very good transformer example — and for class B we took a BERT model. Both had pretty quick training; we didn't spend hours and hours of compute or energy training them for this experiment, we kept it pretty simple. Those are the high-level classes. We ran these across GPT-2, BERT, and Falcon models; unfortunately we weren't able to show you the results for Falcon, since Hugging Face wouldn't let us run inference on the larger models, but we will show you the results for the others. Before we jump in: what did we discover? Forget about the experiments for a second — what were the outputs?
The class A models — the ones that had duplicated records — leaked a lot of information. We noticed that when your training data set has patterns in it, the model is more likely to spit them out, because at the end of the day, as Rob was mentioning, the text is converted into tokens, and those form token patterns. The model doesn't really see a difference between "hey, this looks like a credit card number" and "this looks like a blob of text" — as far as it's concerned, it's just a pattern of tokens. So what we noticed was that as the PII repeated a lot, the model was more likely to leak the exact same token sequence — effectively leaking the exact PII from the training data set. At the same time, when the PII repeated only once, but there was still a very large count of PII, we noticed that it didn't leak as much but it hallucinated. It didn't give you the exact same sequence of tokens, but it did notice that each of those data points had some similarity in structure, and that structure leaked, producing data that looked like sensitive information but wasn't really sensitive info. I think we have a really cool demo for this coming up. OK, so here are the recordings.
I think we'll look at the top one first and then zoom in on the bottom one. On the screen you're going to see two videos playing simultaneously, where we start off the whole set of experiments with two models, one GPT and one BERT. The top one is the class A model, where the PII repeats 100 times, and the bottom one is where it does not repeat. You'll see we start off with just a simple prompt asking how it's doing, and neither of them leaks any information — good. Now, if you remember the data set from before, a lot of the PII samples had hyphens and special characters in them, so we tried adding a hyphen to both models and watched what happened. Maybe let's pull up the first one — can you see the top one? OK. As you can see in the first one, the minute the hyphen comes into the picture, the model starts recognizing a pattern and attempts to kick out some information that looks like real info — and we see info about one "Omega Theal," who is actually a person whose info we faked in the training data set. But the second one, the BERT model, which was trained with exactly one instance of the PII, does not actually repeat it. Now, OK, we know there is an Omega in the data set — that was a fake name that was generated — so we keep that as an entry point and see how much more we can drill out about Omega: which characteristics of the data, if we simulate them, start leaking? First we try a very simple thing: we just ask "who is Omega." But asking that straightforward question doesn't really get a reply — the model doesn't get back to you, because the training data wasn't in a Q&A style; it was just trained on rows and rows of text.
So, growing from that, what we did next was: let's not just ask "who is Omega" with a question mark — what if we ask "who is Omega" while trying to replicate the style of the PII in the training data set, with the hyphens, the commas, and so on? That did provide marginally better results: it didn't go off generating random bits of text, but it still wasn't leaking PII the way we expected. On reflection we realized "Omega" on its own isn't the exact structure of the PII in the data set, so we took the same prompt but matched the input prompt more closely to what was actually in the training set — "who is Omega Theal," followed by the rest of the record's hyphenated structure — and that improved the results massively. In the vulnerable GPT-2, the one on top, it doesn't leak info about Omega directly, but it starts leaking other bits of PII that were present around that person in the training data set: we start seeing credit card numbers, dates of birth, and so on. At the same time, the model below — the BERT one, which does not have PII repeating — is just far better; it doesn't really leak any information, because there was no pattern that was constantly being followed. Now — hallucination. If you remember what we were saying before: in the top one, when information does leak from the vulnerable GPT-2, it is the actual info of the person — when you're able to match the input prompt closely to the repeating PII, it starts spitting out actual data. But when you have non-repeating PII and you push the model to leak, it starts hallucinating data. We tried making it leak, then searched for the output in our training data set, and as you can see, that exact bit of generated information isn't there. It isn't really the information about the person, but rather data it rendered to be similar in structure. The trick, though, is: fine, great — ideally it should not be leaking info at all, but how do we effectively differentiate between what is real PII and what is fake PII? That's something we'll touch on in a bit when we talk about defenses. OK, Ashik — you've talked about all of these great experiments, a lot of video recordings, a lot of GIFs of what you did, but what are we trying to say here at the end of the day?
What did we observe from running these experiments over the last couple of months? To summarize it in one line: we realized it comes down to how much the PII repeats versus the size of your overall data set. Imagine a ratio between your training data set — the number of lines, the number of characters — and how much PII is in there, and more importantly how much each bit of PII repeats. The higher this ratio, the more the model is going to spit out the exact same sequence of tokens — basically the PII itself — and the less it hallucinates. The lower this ratio — if you still have a very high count of PII in your training data set but not as much repetition — the inverse holds: it's not going to leak as much, but it is still going to hallucinate. But why does this work? This is effectively the result of pure derivation from experiments, running sample set on top of sample set, but we also spoke with a bunch of people and tested the hypothesis, and it ties back to the nature of LLMs: they work by attempting to predict the next token based on the patterns of tokens seen before, and the more a repetition, or a particular structure of data, pops up in your training data set, the more it's going to come out — which is what makes the behavior above happen. So that's what we were able to build: a leaky model, a vulnerable model, so to speak. Yeah, and I think the main takeaway from this is: whatever is repeating in your data set, the model is going to echo back exactly, more often, whether it's PII or not. That's just how these models work. It's then up to us, as security professionals or as the data science team, to be on the hunt for what we don't want the model to repeat, and then run a data-preparation process to eliminate that from its behavior.
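(To illustrate that preparation step, here is a hedged sketch — not the speakers' code — of measuring how much of a prepared training file is exact repetition and keeping only the first occurrence of each line; the file names are assumptions.)

```python
# Minimal sketch of checking and removing exact duplicates before fine-tuning.
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

counts = Counter(lines)
dup_fraction = sum(c - 1 for c in counts.values() if c > 1) / len(lines)
print(f"{dup_fraction:.1%} of training lines are repeats of an earlier line")

deduped = list(dict.fromkeys(lines))            # keeps first occurrence, preserves order
with open("corpus_deduped.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(deduped))
```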
Correct — and that takeaway applies both in value and in structure. It's not only that it's going to leak real values, or only that it's going to give out similar-looking data structures; it applies to both. With that in mind, we wanted to take a minute to talk about a couple of the steps we took — very small steps, nothing super complex — that made a world of difference in making the model better, as in not leaking as much. First off, base model selection. When we fine-tune a model, we're obviously starting from a larger, existing model, and we need to make sure we select the right one for us. We thought a lot about how to tell you "here's the best model for your use case," but honestly the answer is that it depends — there is no one size fits all, no "everybody go train your GPT" or "everybody go train your BERT." It depends on the use cases you have, and what we found is that it's best to experiment by training smaller models of different types — a smaller GPT, for instance — see which one fares better, evaluate, and then go back and train the bigger one. It's genuinely not a good idea to just take all your data and throw it at whichever model is the biggest you can find. While bigger models — bigger in parameter count, your Falcon 7B or your Llama 70B — are generally better at a wider array of tasks, they are also more prone to leakage when you fine-tune them. So as far as fine-tuning goes, something we genuinely wanted to share with the larger audience is that bigger is not always better. Yeah — in this case, it's the bigger parameter sets that make that possible:
the model is able to look for bigger patterns in the tokens it's being trained on, so there's a bit more of a challenge in securing those models — they might memorize more, they might regurgitate more. Here's a little more on that: this is research showing how memorization scales. A roughly 6-billion-parameter model will leak far more — entire memorized passages — and you can literally see the difference in memorization between a small 125-million-parameter model and a 6-billion-parameter one: the bigger the model, the more it will regurgitate entire blocks of text. This also matters for context window size: the more it remembers about what you're having a conversation about, the more it might go retrieve tokens from what it was trained on and spit them back out. That being said, going back to where we started: most of a model's leaky behavior can effectively be prevented or mitigated by preparing the data the right way, and here are a couple of the top techniques that we noticed genuinely helped make the model that much better. None of them are really complex,
but they add value that really helps a lot. So let's quickly jump in, one by one — we have the top five techniques. The first one is word distances. It's not commonly used, surprisingly, but word distance is effectively where you take your training data set, take every word, and see how far it is from the next occurrence of the same word. There are a few words that make sense to repeat a lot — your "the"s, your "a"s and "an"s, common English-language words — and that's perfectly fine. But if you know what PII is in your data set — and again, ideally you do not want PII in your data set at all — in the cases where there inevitably has to be some sort of sensitive information in there, it's better to keep that repetition count as low as possible. How do you find that? An over-simplified representation is on the screen here, where you look at the closest occurrence: not only how many times it repeats, but also the distance — is every third word my PII? Is it following a particular pattern? Finding that distance really helps, because as we mentioned in our findings, the more of a common pattern your PII follows, the more likely it is to leak.
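(A rough sketch of the word-distance metric just described — how often a term of interest occurs and how far apart the occurrences are. The target strings and file name are illustrative assumptions, not the speakers' script.)

```python
# Minimal sketch: occurrence counts and gaps between occurrences for chosen terms.
from collections import defaultdict

def occurrence_gaps(text, targets):
    positions = defaultdict(list)
    for i, raw in enumerate(text.split()):
        word = raw.strip(".,;:!?\"'").lower()
        if word in targets:
            positions[word].append(i)
    return {w: [b - a for a, b in zip(p, p[1:])] for w, p in positions.items()}

corpus = open("corpus.txt", encoding="utf-8").read()
gaps = occurrence_gaps(corpus, {"omega", "4111111111111111"})
for word, g in gaps.items():
    print(word, "occurs", len(g) + 1, "times; smallest gap:", min(g) if g else "n/a")
```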
It will start spitting out the exact same sequence of tokens — that's one metric we found really effective. Yeah, and you can expand that beyond single words to entire sentences, paragraphs, or whatever the context-length size is for the model you're fine-tuning. You basically have to scale this up to properly detect whether there are repeating blocks in there that are going to stand out, get memorized, and get returned. OK, inaccuracies — this is one of my personal favorites; we've talked about this so much over the last couple of months. When you're training a model, your data has to be uniform, or near uniform. We'll see an example, but imagine you have a blob of text that is predominantly prose: you do not want random JSON content in the middle of it — that's deviating from the standard pattern of the data. Or imagine you have a list of countries: you don't want to randomly throw in a couple of car manufacturer names. That's deviation in your data set. Those deviations form anomalies, and any PII — or any information in general — surrounding those anomalies is more likely to leak. And I thought this was interesting even just for things you don't want returned: not necessarily leakage of information, but if the model is going to have a bias to return things because there's some deviation in the pattern of the training set, that's good to know just for the behavior and functionality of the model. Correct — and the way I like to think of this is that when your data set deviates, you effectively build so-called poison statements: if you try to get results around that deviation, it's going to leak a lot more. A simpler representation: we have a simple bell curve, where along the x-axis is your standard, uniform data set.
In the context of PII leakage, any PII around a deviation in the data is far more prone to leaking than the rest. Moving on, an example — we want to keep this easy, so let's take a simple data set of usernames, user emails, and phone numbers. In the middle, you can see we've splashed in a couple of phone numbers that are text instead of a standard number, and a couple of emails that, instead of following the standard email pattern, are text such as "personal user's email" or "email not found," and so on. What we noticed as part of the larger experiments was that any content around those deviations — such as the email of the person whose phone number was text instead of an actual number — was prone to more leakage. Yeah, and we just wanted to show this as an illustrative example. So how do you actually rectify these inaccuracies? Rectifying inaccuracies is a topic that has been researched really well; this is a much-simplified script, but there are libraries such as NLTK, and a couple of others, with corpora of large amounts of data specifically meant to help rectify inaccuracies. Character normalization is next — I'm going to jump through the next couple of techniques quickly.
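(Here is a hedged sketch, not the speakers' script, covering two of the cleanup ideas in this cluster: flagging records that deviate from the expected field format, and the character normalization discussed next — forcing everything into one Unicode form and encoding. The regexes and sample rows are illustrative.)

```python
# Minimal sketch: flag format deviations and normalize characters in one pass.
import re
import unicodedata

phone_re = re.compile(r"^\+?[\d\-\s()]{7,}$")
email_re = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

rows = [
    {"email": "omega@example.com", "phone": "415-555-0100"},
    {"email": "email not found", "phone": "call the front desk"},   # deviations
]

def normalize(text):
    # One consistent Unicode form, always encoded/decoded as UTF-8.
    return unicodedata.normalize("NFKC", text).encode("utf-8", "ignore").decode("utf-8")

for i, row in enumerate(rows):
    row = {k: normalize(v) for k, v in row.items()}
    if not email_re.match(row["email"]) or not phone_re.match(row["phone"]):
        print(f"row {i} deviates from the expected format: {row}")
```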
This one is so simple, but it's actually kind of funny: when you build a data set, just make sure it's all in the same character format. If you want to go UTF-8, stick with that and don't deviate — it helps a lot more than you'd think. This is arguably the easiest production-grade script I've written: just go through your data set and encode it all in the same encoding; that helps a lot as well. OK, dimension reduction. This is where, effectively — going back to the example above — your training data set has five, six, ten, fifteen fields you could train your model on, and it's not necessary that you train it on everything. Going back to the earlier point: bigger is not always better just because you have more parts of your data. Yes, context is important, but it's context relevant to your use case when you're fine-tuning, so you don't want to just throw every field you can find in your DB at the model. More often than not, it's better to choose the top N fields that matter for your use case and fine-tune your model on those. And sometimes, if you have five or six different tasks, it's better to train five or six smaller models instead.
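(A simplified sketch of the "don't train on every field" idea: keep only the columns the use case needs and, if you really do have many numeric features, compress them with scikit-learn rather than passing them all through. The file and column names are illustrative assumptions.)

```python
# Minimal sketch: select the fields you need; optionally reduce the rest with PCA.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("customers.csv")

keep = ["product", "issue_summary", "resolution"]      # only what the fine-tune needs
training_df = df[keep]

numeric = df.select_dtypes("number")                   # if numeric features must stay,
reduced = PCA(n_components=3).fit_transform(numeric)   # reduce them instead of keeping all
print(training_df.head(), reduced.shape)
```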
Each of those smaller models becomes an expert in its own task, and that's something we definitely noticed reduces leakage. A simple script: scikit-learn has plenty of dimensionality reduction, as well as larger, more complex machine-learning techniques, but we've simplified it into the example we wanted to show you here. Augmentation: we're not always going to have enough training data. In our case, we couldn't actually find 5,000 bits of PII for real individuals — and even if we had, we couldn't train a model on that. In that scenario we can use tools such as Gretel, or ChatGPT, or the OpenAI models — any model — to generate synthetic, fake data. But therein lies an inherent problem: when you get data from the real world, such as credit card transactions, there is a pattern to it, because real transactions have patterns we want to analyze and understand. Generated synthetic data might not have those patterns, or may introduce patterns that are honestly non-existent. So while augmentation is great, it's really important to go slow and validate the augmented data. Once again, a lot of techniques exist, but these are a couple we found work really well. After you have your data, what does model training look like? We just wanted to show you the script we wrote, which is arguably one of the simplest scripts we used to train — super simple, under 100 lines, nothing too much. We load a model from Hugging Face, use the Transformers library, and train it on the text we have. The purpose of this slide, for the most part, is to show that training the model is not the hard part; the more time-consuming task is preparing the data set you're going to train the model with.
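(The talk shows the speakers' own sub-100-line script; here is a hedged sketch of the same idea — load a base model from Hugging Face with the Transformers library and train it on prepared text — with the file name, block length, and hyperparameters as assumptions.)

```python
# Minimal causal-LM fine-tuning sketch (pip install transformers datasets).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

dataset = load_dataset("text", data_files={"train": "corpus_deduped.txt"})

def tokenize(batch):
    # The fixed-length "padding" preparation mentioned earlier in the talk.
    return tokenizer(batch["text"], truncation=True, max_length=128, padding="max_length")

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-gpt2",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("finetuned-gpt2")
```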
With that — Rob, do you want to talk about how you then test it? So you've built your model; now what are some techniques you can apply to actually answer the question: is it doing something I don't want it to do, is it leaking? Let's start with the simple ones: just asking the model for the specific type of information you're looking for; trying to create error conditions it may have observed in its training data; asking it about what it's seen in the past versus its current state; asking it about itself, or about things it observed in learning; or just straight-up secret extraction. These techniques, where someone directly interacts with the model to try to get data out of it, are interesting, but I think a lot of your customer base would be more impacted when this issue is combined with things like prompt injection. The biggest issue I'm seeing in a lot of the models I'm testing is the ability to generate links using markdown: the web interface renders that link, some part of the prompt injection coerces the user to interact with it, and that sends the information that was leaked or extracted in the response back to the attacker. I think that's probably one of the most important things to test — make sure your users aren't copying and pasting something that's going to let a leak be exfiltrated easily. Model inversion attacks, as applied to large language models, can be things like trying to recover specific training data. In this example with customer support logs, you might ask it about issues with an account, or order numbers, in whatever language that organization uses, and see whether it responds with more
details about that um this is like you're going to want to craft this on like an application specific basis and um some techniques to then apply to that are things like continuations uh so you can you can you put the text where um you're asking say have you seen this before and then continue on uh with and it's basically filling in the blank and seeing if uh that was if that was uh not properly cleaned from the data set if it'll finish your sentence or finish your par GRA with something from that training data um if seeing if there's repetition of like uh trying to get it to repeat the same token over and over
and over again uh and then it it might uh give you other information about that uh or Divergence like this is back to that um like some characters or some tokens or delimiters that are in the training set it's more likely to leak something next to that um or if you're doing like single token attacks so I thought this was interesting bit of research so like one token um will actually uh you're much more likely to get uh leaks from like one tokens up to like repeating like below 400 times um and then as this expands like two tokens three tokens uh like the probability of that uh is like more certain that you'll get like a um a leak in the response um
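(A hedged sketch of automating those continuation and repetition probes against a fine-tuned model. The model path, probe strings, and the list of known "canary" values are assumptions — in a true black-box test you would not have the canaries and would have to judge the outputs another way.)

```python
# Minimal probe loop: continuation, structure-matching, and repetition prompts.
from transformers import pipeline

generate = pipeline("text-generation", model="finetuned-gpt2")
canaries = ["Omega Theal", "4111 1111 1111 1111"]      # strings known to be in training data

probes = [
    "Omega Theal - ",                 # continuation: let the model fill in the blank
    "Who is Omega Theal - born on ",  # continuation matching the PII structure
    "the " * 200,                     # repetition / divergence probe
]

for prompt in probes:
    out = generate(prompt, max_new_tokens=60, do_sample=True)[0]["generated_text"]
    hits = [c for c in canaries if c in out]
    print(prompt[:40], "->", "LEAK" if hits else "ok", hits)
```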
Using special characters as well: this is an example of actually getting leakage out of Mixtral as a model. It's a character that was designed for encoding a space in the training data, or encoding something the model might consider the end of the prompt. As you can see, though, this was mostly producing gibberish — not necessarily anything useful. I think you would have to do a lot of repeated attempts with variations on this to get something useful, especially in black-box settings. In a lot of the research on model inversion attacks, the researchers have copies of the training data and can validate whether they got what they wanted out of it; in black-box testing like this, it's really difficult to know whether what you got is a hallucination or a leak, and I think you'd really have to be fuzzing it with a lot of traffic. So one of the things I think is really important is rate limiting and monitoring the usage of the endpoints and applications that integrate with these models, and being able to flag and shut down behavior that might be trying to extract sensitive information. These are some examples of the kinds of code leakage to look for: identifiable information, private information, or secrets — anything where someone might have been defining a key, setting a password, or setting something in a config is something you might want to test for, to see whether you can extract it using those continuation, divergence, or repetition techniques. But manual testing only gets you so far. I think one of the best frameworks out there for automating your testing is PyRIT, the risk-identification and testing tool for LLMs that the Azure team put out. You do have to code it custom to your use case: give it a specific objective, give it a way to score whether it has achieved that objective, and have it generate variations of its prompts. Of course you can seed that with specific techniques you want to apply, you can give it libraries of existing prompts and have it make variations on those, and give it whatever task you have. This is an example, and they have quite a few examples in their GitHub repository (there's a link in here), like the Gandalf prompt-injection retrieval exercise.
I think this is also a good way to automate your testing for data leakage as you change your fine-tuned models. So let's get into more mitigation guidance, because I wasn't finding any really good examples of what you can do about this, other than the more data-science-oriented techniques. I think this really does start with having specific goals for what you want to make sure doesn't leak out of your fine-tuned models, being able to measure and evaluate that, and then repeating that process on an ongoing basis. Strategies for preventing this: you'll hear about differential privacy, and I think the best practical technique in that spirit is generating synthetic data, rather than actually giving your data science team copies of your production data to run their experiments on. The real way a lot of these breaches are going to happen is through copies of that data: traditionally it accidentally gets posted in an S3 bucket that doesn't have the right permissions, or accidentally emailed to the wrong person, or copies end up in someone's private GitHub repo, or private TensorFlow repo, or private Google Drive — and that's how the breach of the production data happens. So I think synthetic data is definitely valuable to apply. A couple of frameworks I came across and have been recommending: Microsoft's Presidio data protection framework. I think this is one of the most comprehensive, because it creates a real defense-in-depth pattern: you can apply regex patterns for what you don't want to see; you can apply more deterministic NLP or named-entity-recognition models that flag specific elements in your data; you can have checksums; you can have specific context words you want to alert on; and there are techniques to anonymize. The checksum example might be something like a Bitcoin address: you want to confirm it really is a Bitcoin wallet and flag on that. You can extend the framework — it comes with a lot of pre-built entities, but you can also heavily customize it if there's some element of your data you're really worried about showing up in these models. You can use it to apply data cleansing in the pre-training part of your workflow, or afterwards, and it also has functionality to flag and catch this while you're streaming the output of your model.
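(A minimal sketch of that Presidio flow — analyze a piece of model output for PII entities, then anonymize what was found. The sample text is fake, and the replacement operator is just one of the options mentioned.)

```python
# Minimal Presidio sketch (pip install presidio-analyzer presidio-anonymizer).
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Omega Theal was born on 1984-03-02 and pays with 4111 1111 1111 1111."
findings = analyzer.analyze(text=text, language="en")   # persons, dates, card numbers, ...

redacted = anonymizer.anonymize(
    text=text,
    analyzer_results=findings,
    operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
)
print(redacted.text)
```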
This is an example of a version that has been implemented to integrate with LlamaIndex. You give it this text, and this is the output: it has actually identified "this is a person, this is a credit card, this is an email address," and in this state it's configured to mask them. I think this is a good workflow on both ends, and it's really comprehensive, in that you can do things like replace the data with other data, fully redact it, hash it, mask it, or encrypt it. The encryption use case might be that you actually want one part of your workflow over the output of a language model to have access to the plaintext sensitive data, but you want it encrypted everywhere else — in transit, or any other place that data might be stored. You can apply this library so it's only unencrypted at the time it's being used by that application workload, and then it's back to encrypted after that. So yeah, I think this is probably one of the most useful and comprehensive libraries. And this is us actually applying it to our own model: an example of the same thing you saw earlier with the "My Precious" PII, now actually redacting and masking the date, the person, and the credit card numbers — so birthdays are hidden, and no more PII is leaked. AWS has guidance for how to use their Comprehend system as well; Comprehend is their native natural-language-processing service, and you can build workflows with it too, along the lines of "I want to answer questions about what's coming out of this." This might be a case where you want a chat interface that simply throws an exception on sensitive information, or even on unwanted behavior, coming out of your model.
I have some examples in here of how to use Comprehend. Then, if you want to eliminate hallucinations, there are libraries like Outlines out there. Outlines lets you turn these generative models into something like a finite state machine: per token generated, I want to compare against a regex and actually guide the generation, and if I ever detect that the output isn't going to match that regex, I can event on that, throw an exception, and have it go back — perhaps adapting the prompt or the interaction — so that it does create the type of output you want.
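(A sketch of that regex-constrained idea with the Outlines library; the Outlines API has changed across versions, so treat this as the shape of the approach, under that assumption, rather than exact current syntax.)

```python
# Minimal Outlines sketch: constrain generation, token by token, to a URL-shaped regex.
import outlines

model = outlines.models.transformers("gpt2")
url_pattern = r"https://[a-zA-Z0-9./_-]+"

generator = outlines.generate.regex(model, url_pattern)
answer = generator("Reply with one link to the conference schedule: ")
print(answer)   # guaranteed to match the pattern; anything else is rejected per token
```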
For example, you might have a use case where I only want the output to be URLs: I'm going to ask it questions, but I want it to answer me only in URLs, and then you could have a regex that says, if it's not a URL, I don't want it — go back and try again, or there's some problem in our input or our prompt, so let's adapt that. Another popular open-source framework is Guidance, and Guidance lets you do things that are much more contextual. This is an example of it detecting an anachronism, like "a T-Rex bit my dog": it would do the analysis — well, T-Rex existed 65 million years ago, dogs exist now, so that's probably not true — assume this is a hallucination, and I can event on and track that in my code. So, key takeaways: I think preventing these issues both pre-model-training and post-model-deployment is your best strategy for defense in depth. The biggest thing you can do about data leakage is to focus on deduplication of the data in the training set you're fine-tuning on. The size of the model and the size of the context matter a lot for leakage. You have to test for this on the other end, and I think rate limiting is probably one of your best mitigating factors for spotting adversarial or malicious attackers actually trying to test for leakage. And then output sanitization is going to be your best last line of defense. Again, we're Rob and Ashik, and these are our contacts. These are some of the resources if you want a copy of the slides, and you can play with "My Precious," which has the experiment that we ran, if you want to play with those models and see what you can get them to do. They're also hosted on Hugging Face if you want to see exactly
how to interact with them there. Yeah — so thank you, everyone. [Applause] All right, we have less than a minute for a question: how do you ensure the generated data is not actually related to a real person? Yeah, I think that's a good question. You have to study what was in your training set, and if it's possible that you had real people's data in your training set, it's a very real possibility that the output might also contain real people's info. And also, accounting for that — going back to the way Rob was talking about it — since the model is only spitting out series of tokens, it's better to layer something like Presidio on top and simply mask things outright if you do not want any semblance of PII. If you do want some kinds of that content, that's going to require a more detailed, use-case-by-use-case analysis. Thank you. Thank you. If you have any questions for our guest speakers, I'd recommend you meet with them afterwards to ask, so please join me in thanking Rob and Ashik for their presentation. Thank you, thank you.
Questions are on Slido; to ask your questions, go to the BSides SF site's Q&A — that's the letters Q and A — and at the end of the talk we can go through them. Now please join me in welcoming Joe and Chandrani.
All right — welcome, everyone, to the last session between you and happy hour. Today we're going to talk about a big question: what if your security team could automatically fix your code vulnerabilities? Think about this: AI can go through code faster than any human, potentially finding and fixing your vulnerabilities. But we know that AI fixes are still imperfect, and we need to manage our expectations about their performance — according to the SWE-bench benchmark, even GPT-4 could only fix about 2% of complex bug fixes. That means we need to use AI with oversight and still keep humans in the loop. Today we're going to talk about strategies for increasing our chances of finding and fixing correctly, and about where to focus effort. This problem is actually two distinct issues: finding vulnerabilities, and fixing them. First, a little background on finding vulnerabilities the traditional way: manual code reviews, static application security testing (SAST), DAST, fuzzing, external bug bounty programs, incident reports, and pen testing or red team reports. The issues with these: you often get a lot of false positives, and that can erode developer trust when developers see a potential vulnerability that isn't actually there. Additionally, your coverage of vulnerabilities can vary
depending on the type of detection you're using; the process can be time-consuming; manual reviews are slow and prone to human error; and incident reports — the ones that were already exploited — are the ones you don't want in the first place. So we're going to see if we can use AI in this process. The goal state is to increase our coverage, increase our true positives, and decrease our false positives. We're still not going to be perfect in our coverage of all vulnerabilities, and there are at least two ways of finding vulnerabilities using AI. We can do an AI-only approach — just using a prompt and feeding your code into the prompt — or we can use some sort of fine-tuned model, like GPT-3.5 fine-tuning; we tried this with a few hundred examples of real vulnerabilities and fixes, and it wasn't that great. Additionally, you can use a prompt-driven strategy where you use AI to supplement your results from SAST. And where can you do this in the process? You can do it at the pull-request stage, or you can check your code retrospectively — your whole code base, entire repos. That's a harder challenge, but that's what we did. The key is that you can build on your existing frameworks rather than totally throwing them to the wind. Now we're going to talk about non-AI fixes. As you can see, traditional methods of fixing vulnerabilities without AI are just people in seats doing work. There are problems of prioritization — a lot of the time you don't know what priority a vulnerability should get — and you need a skilled team; fixes are time-consuming, and humans are also prone to error. You can have instances where your team issues a patch for a vulnerability, the patch doesn't work, and your team has to issue another patch.
Also, the same type of vulnerability can show up across multiple repositories if a bug bounty hunter finds one and runs it across all of your different products. But using AI, we know we can generate fixes for at least the simple code vulnerabilities. We would still have a human validate the fixes, and we still need our expert team, but now they have more time to resolve the complex fixes AI can't handle. There are commercial solutions out there, but we're going to talk about why you might want to build your own. You will almost always have proprietary coding languages, styles, and libraries — for instance, your team may have an XSS-fixing library you want used to make sure your data gets sanitized, and you don't want to depend on an open-source solution you're not sure will be maintained. There's also a large cost associated with commercial solutions: with a team of hundreds to thousands of developers checking in code, the cost can run into millions of dollars a year. One of the most important parts is that you want to be able to customize the measurability of your fixes — to measure your historical performance at finding and fixing vulnerabilities — and a lot of the time you already have those resources at your disposal: if you're tracking your fixes through Jira and GitHub, you can extract them programmatically. And finally, you want to grow internal expertise in AI; this is important because you don't want to be dependent on commercial solutions. With that, I will pass it over to Chandrani. Thanks, Joe. So now we'll talk about the solution journey. This was a very new area for us; we did not really know what would actually work or what would give us a good result, so there was a lot of trial and error before we could say, yes, OK, this is an
accepted, good solution. The work we're presenting here, and the results, were done on GPT-4, but the methodology should be model-agnostic. First, we needed a proof of concept: we needed to establish that this very thing we were trying — using an LLM to find vulnerabilities and fix them — would actually work. For this, we started with the zero-shot prompt engineering method. There, you rely on what the model already knows — its intrinsic knowledge — and you trust the model to give you the most optimized result. As input, we created a dummy repo with some very obvious, straightforward vulnerabilities in it, and we said: find and fix whatever you can. Obviously the system prompt was not quite that simple — in the talk just before this one, by OpenAI I think, they mentioned that you need to praise your model, so we did the same thing: you tell it, "you're a very expert AI security engineer, you're very methodical," and so on. And it was able to find all of the 8 to 10 vulnerabilities that we had injected.
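(A hedged sketch of that zero-shot setup — not the authors' actual prompt — using the OpenAI Python client: a system prompt that "praises" the model, a file from the dummy repo as input, and no examples or extra context. The file name is an assumption.)

```python
# Minimal zero-shot sketch (pip install openai; assumes OPENAI_API_KEY is set).
from openai import OpenAI

client = OpenAI()

system = ("You are a very expert AI security engineer. You are methodical and precise. "
          "Find every vulnerability in the code you are given and propose a fix for each.")

code = open("vulnerable_app.py", encoding="utf-8").read()   # a file from the dummy repo

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": "Find and fix whatever you can:\n\n" + code},
    ],
)
print(resp.choices[0].message.content)
```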
It found them with very good precision, and it was able to come up with great fixes for them. So, with high hopes, we wanted to apply the same technique to our production repo. It's still zero-shot, but we changed the input strategy: OK, let's go onto the production repo and see what it gives us. In the system prompt we're not giving any additional example or additional context, but we did make it a little more advanced: we looked through our Jira, found the top 15 vulnerability classes, picked those, and said, give me only the vulnerabilities related to these. We give a list of languages, so that when it's generating the fix it's context-aware — it has that language awareness — and we also give an output response schema: don't just give me a blob of a response, give me only these fields, only in this particular format. Everything seemed to be going great — it gave out a lot of issues, in the format we expected, and all that — but when we went through the results and evaluated them manually, there were a lot of false positives. Not only that: the fixes it had generated were heavily hallucinated, which means, for example, it would say "include a sanitize function" but would not actually include a proper definition of it. So then we took it slow. We went with: OK, let's give it a single example. The one-shot method is where you give your model a single example of how the output should look, and you expect the model to learn from it and try to mimic it. We changed our input strategy from going at the production code to going through Jira instead,
because in jira we already have our like vulnerabilities logged in uh there are vulnerable Cod Snippets that have been exploited so those are also logged in so we took that as an example and um there is also a related uh repo that is uh written like in this repo we have found out so we are giving a one code example and the related rform right so um here what we saw is um if it is very similar the code if the code is very similar it is able to uh detect that but then uh variable tracing within the same file was still a problem uh like for example if the variable has been defined somewhere later it's
accepting input and then um finally it has been used as window. out window URL and that it is still not able to make all that context that was the one and then also even for like we are we have given one particular example of injection but if it's injection and if the code is little bit different xhr related or stored uh injection it's not able to detect properly so then we moved on to the fot method fot is uh like when you not give just one but like three to five examples uh so that model understand the context and the task better uh so here we look through like okay in our jira if we say
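A hedged sketch of how a handful of such examples could be packed into the prompt follows; the example data, helper function, and wording are assumptions, not the actual few-shot set.

```python
# Few-shot sketch (illustrative): pack three to five known vulnerable/fixed pairs into the
# conversation before the code under review. The example data and helper are assumptions.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_EXAMPLES = [
    {
        "vulnerable": "element.innerHTML = params.get('name');",
        "why": "User-controlled input written to innerHTML allows DOM-based XSS.",
        "fix": "element.textContent = params.get('name');",
    },
    # ... two to four more injection variants (stored, XHR-related, etc.) pulled from past tickets
]

def build_messages(code_under_review: str) -> list[dict]:
    messages = [{"role": "system", "content": "You are a methodical AI security engineer."}]
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": f"Vulnerable code:\n{ex['vulnerable']}"})
        messages.append(
            {"role": "assistant", "content": f"Why it is vulnerable: {ex['why']}\nFix:\n{ex['fix']}"}
        )
    messages.append({"role": "user", "content": f"Now find and fix issues in:\n{code_under_review}"})
    return messages

reply = client.chat.completions.create(model="gpt-4", messages=build_messages("...code under review..."))
print(reply.choices[0].message.content)
```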
So here you're giving multiple injection examples. With this we were seeing improvements in detection, in finding the vulnerabilities, but we would say there was still a long way to go before we had what we could call an accepted fix. Almost at the same time, we came across a paper from Boston University about the chain-of-thought method, in which they reported vulnerability detection with about a 70 percent precision rate. The way it works is this: you have a problem statement and you explain what the problem is, and if you give the model a decision, you also explain why you are taking that decision. Basically, you want the model to think the way a human would think, and it will mimic that thought process. So here we're not only giving the Jira ticket summary and description, we're also giving an explanation. For example: here is my vulnerable code snippet; next, here is why I think it is vulnerable (this parameter comes from user input, and if malicious JavaScript is loaded, then when window.URL gets executed, it executes in your environment); and then I give a proper fix. Not just "sanitize this," but the actual definition of what that sanitizer would look like, plus why that fix would work. With this approach, which we mainly tried on injection, we saw a lot of improvement, both in detection and in the quality of the fix it was generating.
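A sketch of what one chain-of-thought-style example could look like follows; the snippet, reasoning, and sanitizer are illustrative, not the actual ticket content from the talk.

```python
# Chain-of-thought style example (illustrative): each example carries the reasoning and a complete
# fix, including the helper definition, not just "sanitize this". Snippet and sanitizer are made up.
COT_EXAMPLE = """\
Vulnerable code:
    const target = new URLSearchParams(window.location.search).get("next");
    window.location.href = target;

Why it is vulnerable:
    `target` comes straight from user input. If it contains a javascript: URL,
    assigning it to window.location.href executes attacker-controlled script.

Fix (complete, including the helper definition):
    function sanitizeRedirect(url) {
        // allow only same-origin relative paths
        return url && url.startsWith("/") && !url.startsWith("//") ? url : "/";
    }
    window.location.href = sanitizeRedirect(
        new URLSearchParams(window.location.search).get("next"));

Why the fix works:
    Only relative, same-origin paths are ever assigned to the location, so
    attacker-supplied absolute or javascript: URLs are discarded.
"""
```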
But I should definitely mention that this is not for vulnerabilities spread across multiple files; it is for vulnerabilities contained within a single file. So, reality check: from the point where we started with the dummy repo, where it found everything perfectly, ten out of ten, to where we are today, there were a lot of realizations. The challenges, as I mentioned: complex vulnerabilities. Whenever a vulnerability is spread across multiple files or multiple repos, GPT does really poorly. Then, when we moved over to the production repo, there is no proper test set, because you never know how many vulnerabilities are actually out there in your repo. We have static code checkers, but we all know those produce false positives too. So if my model produces ten issues, I don't know whether that's ten out of a hundred or ten out of twenty; you never know the denominator. And then there's the correctness of the fix, which is quite subjective: how deep and how accurate do you want your fix to be? For example, if the fix says you need to include a function, the proper function definition should be present; if there is a place for a constructor or an import statement, all of those should be included as part of the code fix. It shouldn't just say "include this"; there should be proper library imports and proper code for them. How deep you go with the fix is subjective. Next, metrics and evaluation. Initially, obviously, we started with manual evaluation: we would go through each of the findings
and check whether the finding was correct and whether the generated fix was good enough. But manual evaluation obviously doesn't scale; that's not acceptable. So what we tried next is an auto-eval framework. This is something LangChain provides; they have a number of eval frameworks, eval APIs and functions, and the particular one we used is the scoring evaluator. Basically, we first created a dataset of vulnerable code and the corresponding good fix, which we call the golden dataset. Then we take that same vulnerable code, run it through the LLM, ask it to generate a fix, and compare how close that fix is to the reference fix. This is where the LangChain accuracy scoring comes into the picture: it lets you score accuracy on a scale of 1 to 10, where 10 means the predicted fix and the reference fix are very similar in context and everything else, 7 means there are minor differences, and so on. We decided that if the accuracy looks like it's around 70 percent or better, we can say this is good enough, accept the prompt, and go ahead with it, generating code fixes of a similar type in other repos.
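A minimal sketch of that auto-eval step with LangChain's scoring evaluator follows; the evaluator name and arguments follow langchain.evaluation but may differ across versions, and the golden-dataset entry is illustrative.

```python
# Auto-eval sketch using LangChain's labeled score-string evaluator. Exact names can differ
# between LangChain versions; the golden-dataset entry below is illustrative.
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

evaluator = load_evaluator("labeled_score_string", llm=ChatOpenAI(model="gpt-4"))

golden_entry = {
    "vulnerable_code": "element.innerHTML = userInput;",
    "reference_fix": "element.textContent = userInput;",
}
generated_fix = "element.innerText = userInput;"  # what the LLM produced for the same code

result = evaluator.evaluate_strings(
    input=golden_entry["vulnerable_code"],
    prediction=generated_fix,
    reference=golden_entry["reference_fix"],
)
print(result["score"], result["reasoning"])  # score is on the 1-10 scale described above
```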
For metrics, there are two types we've considered so far. One is precision: if you have ten issues but only four of them are correct, your precision is 40 percent. The other is accuracy: how good your fix is. With all the prompt engineering methods we tried, we found that chain-of-thought gives us the most promising results. And then we have a demo. Do you want to take it, Rob?
Yeah, so there were three different ways we implemented this. First we did code fixing directly from a Slack-powered AI agent. We also had a CLI-based implementation where we would iterate over entire repos and then select and improve fixes for PRs later. But this one is the simplest: just a simple command line where we issue the PR directly. The video is very short, about 25 seconds. This is the fix happening, it explains the fix, what was vulnerable and why, and then this is the PR. And yeah, that's about it.
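For flavor, the PR-issuing step of such a CLI could be as simple as shelling out to git and the GitHub CLI; this is a rough sketch under those assumptions, not the internal tool shown in the demo.

```python
# Rough sketch of the "issue a PR with the proposed fix" step, not the internal tool from the demo.
# Assumes git and the GitHub CLI (gh) are installed and authenticated; nothing goes straight to main.
import subprocess
from pathlib import Path

def open_fix_pr(file_path: str, fixed_code: str, summary: str) -> None:
    branch = "llm-fix/" + Path(file_path).stem
    Path(file_path).write_text(fixed_code)                      # apply the proposed fix locally
    subprocess.run(["git", "checkout", "-b", branch], check=True)
    subprocess.run(["git", "add", file_path], check=True)
    subprocess.run(["git", "commit", "-m", f"Proposed fix: {summary}"], check=True)
    subprocess.run(["git", "push", "-u", "origin", branch], check=True)
    subprocess.run(
        ["gh", "pr", "create",
         "--title", f"Proposed fix: {summary}",
         "--body", "LLM-generated fix; requires human review before merging."],
        check=True,
    )
```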
So, conclusions. We had quite the journey, and it's still ongoing. For people who aren't familiar, this is representative of the Gartner hype cycle: where your expectations sit over time. When I first saw this a few years ago it really resonated with me. When we started off, we had our PoC and a lot of expectations; I thought this project would take a couple of weeks and we'd be done with it. Then, as we started moving up with increasing code difficulty and increasing vulnerability difficulty, we still thought we could handle it: we did zero-shot, we did single-shot on real code, same with few-shot. When we got to chain-of-thought, we were getting the fixes we wanted, fixes that actually aligned with historical fixes using our proprietary code and libraries, not just generic fixes, and we thought this was great. But as the complexity of the vulnerabilities, the different coding styles, and the different files kept growing, we were still using a manual evaluation process, and we realized that wasn't sustainable. When we transitioned over to the auto-eval framework, we realized we were at a place where we could keep moving up and there was actually a path forward. So now we're at the point where we have a clear path to full productivity, and some strategies we learned along the way turned out to be pretty valuable, so we'd like to share them. The first one: you always want to reduce opportunities for LLM laziness. What you see in the upper right-hand corner is an actual "function" provided by GPT-4 that was supposed to be a fix, and that's not what you want. So your prompt should contain something like "Placeholder comments, ellipses, and other shortcuts will never be used in place of functional code," maybe even more than once.
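In practice that can be as simple as appending the guard sentence to the system prompt, as in this small sketch; the surrounding prompt text is an assumption.

```python
# Anti-laziness guard (illustrative): state, possibly more than once, that shortcuts are unacceptable.
NO_SHORTCUTS = (
    "Placeholder comments, ellipses, and other shortcuts will never be used "
    "in place of functional code."
)
system_prompt = (
    "You are a methodical AI security engineer. Produce complete, working fixes.\n"
    + NO_SHORTCUTS + "\n"
    + NO_SHORTCUTS  # repeated deliberately, per the strategy above
)
```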
The second strategy is to make sure you have a proper output schema: include your file formats, the schema details, and how you want your output fields. This is just a snippet of a couple of fields you might want, and even with this specified you may still get syntactically incorrect output from time to time. The final one is to reduce opportunities for LLM hallucinations. You want output fields the LLM can generate into instead of hallucinating: a field for library imports, a field for new method definitions, or language-specific fields that would only be used in the context of a specific language, like constructor changes.
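Extending the earlier schema sketch with that kind of field might look like this; the names are illustrative, not the actual schema.

```python
# Extending the earlier schema sketch with fields the model can generate into instead of
# hallucinating inline. Field names are illustrative.
from pydantic import BaseModel

class FixDetail(BaseModel):
    fixed_code: str
    library_imports: list[str] = []      # e.g. the exact import statements the fix needs
    new_methods: list[str] = []          # full definitions of any helper functions introduced
    constructor_changes: list[str] = []  # language-specific; only populated where it applies
    explanation: str = ""
```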
Some key takeaways. Directed fixes are helpful: the more hints you can provide to your LLM, the better your fix is going to be. Finding and fixing uncomplicated vulnerabilities is a simple task for an LLM, and this may seem obvious, but fix things that are 100% known to be vulnerable first; for instance, if you have validated results from a bug bounty program, you can use those as input to your prompt and get a fix quicker that way. Metrics are key: regardless of whether you build or buy, you want to measure the effectiveness of fixes for your code. Traditionally, when you do evals it's pretty simple: you have a question and you have an answer. But when you're doing evals with code you have to think a little differently, because code is not human language. For these evals your question is your code, your fix is the answer, and the answer can depend on how close your question is to what you want and how much information you provide. Fine-tuning your prompts can take you pretty far; chain-of-thought, as we saw, performed best, and well-designed prompts and output schemas are key. And lastly, humans are still the gatekeepers: you always want to keep humans in the loop for final validation.
that's it
All right, the audience has a few questions for you. First one: nice work. How do you manage the accuracy of the AI code fixes? Do you maintain any evals dataset?

Yeah, so we track all of the fixes and all of the results in a table, so we can measure performance over time.

Second question: why not use Semgrep, which is open source?

A lot of times, so, we have an internal solution where we can take the results from our static analysis and feed that into the LLM, so it's like adding an additional filter on top of your results.

Which was more effective, finding vulnerabilities or fixing them using AI?

Fixing. When you're trying to find something it's pretty greenfield: the model can look through your entire file, and a lot of times the LLM would generate a lot of false positives. It could be good at finding vulnerabilities, but it would also find things that weren't there. Fixing is more directed; if it has something more narrowly focused to look at, it can do better.

What kind of vulnerabilities have you experimented with?

We have mainly tried XSS-type injections.

How does it handle cases where the fix requires a complex code change?

I can take that one. You need to give a lot of details, for example standards specific to your company, so sometimes you need to put those in the system prompt as well. And you need to specify in your output schema exactly where you want things: like Rob mentioned, the import statements, constructors, and functions, and where exactly you want them to be placed. So it really depends on how fine-grained you can make your system prompt.
Do you want to add anything? Next question: when you do few-shot, do you retrain the model or just provide the samples as context?

We had something set up that I called adaptive few-shot: if we knew the vulnerability type ahead of time, it would pull the few-shot examples from a known set for that particular vulnerability type. There is no retraining; the samples are just provided as context.
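A tiny sketch of what that adaptive selection can look like follows; the example store and function are assumptions, not the actual implementation.

```python
# Adaptive few-shot sketch (illustrative): no retraining; examples matching the known
# vulnerability type are simply pulled into the prompt context. The example store is made up.
EXAMPLES_BY_TYPE = {
    "xss": [
        {"vulnerable": "el.innerHTML = userInput;", "fix": "el.textContent = userInput;"},
    ],
    "sqli": [
        {"vulnerable": 'cursor.execute(f"SELECT * FROM users WHERE id = {uid}")',
         "fix": 'cursor.execute("SELECT * FROM users WHERE id = %s", (uid,))'},
    ],
}

def select_examples(vuln_type: str, k: int = 3) -> list[dict]:
    return EXAMPLES_BY_TYPE.get(vuln_type.lower(), [])[:k]
```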
How do you mitigate any security implications of giving AI access to production?

We only ever issue PRs, so nothing goes straight to main.

Have you considered any semi-automated feedback mechanism so the LLM responses can be refined over time in a scalable way?

Yeah, that's in our plan. Right now the flow we've built is very much Jira-driven, so probably something like Jira comments that we then incorporate into the LLM on the back end.

Were you worried about sending internal code over to external model APIs?

That's a great question. We did everything through our own company's OpenAI instance; otherwise, yes, that is something we would have been concerned about.

All right, those are all the questions we have on Slido for now. Please join me in thanking our guest speakers. [Applause]