
[Music] An AI-generated title, but it has a double pun, so pretty good. AI is quite smart, or is it? For those of you who don't know me, I work at a private company, and the goal of this talk is to talk a little about code generation with large language models. If you have not been living in a cave, over the past year we have started hearing more and more about LLMs. If we could do a TL;DR: basically, they are AI models that can understand human-like text and can generate human-like text, and human-like code. That creates a huge potential for automation. The focus of this talk is going to be software automation, or software development automation, but you can think of LLMs as able to generate any kind of thing that a human could generate in text or code. If we think about software development, we already have very nice tools out there, such as GitHub Copilot, which a lot of you might have used, and ChatGPT. You can go to ChatGPT and ask it to generate a snippet of code, or to generate a full application, and ChatGPT will try to do that, and usually succeeds. So as I mentioned, large language models are a class of artificial intelligence models that has gained popularity recently. These models are pre-trained
on massive data sets. If you think about Copilot: Copilot was trained on all of GitHub's codebase at the time it was released, and it has gained more knowledge as people started using it and making decisions. Same thing with ChatGPT: ChatGPT was trained on a huge corpus of human text and code. Once you have pre-trained a model, you can fine-tune that model to produce the kind of answers, the kind of content, that you want it to generate, so it becomes a little more specific. GitHub Copilot is an example of a very specific LLM geared towards code generation. GPT-3, GPT-3.5, GPT-4, GPT-3.5 Turbo, they are all examples of LLMs, and I'm pretty sure most of you have already touched them or used them for something. Now, when we talk about code generation by large language models, we usually have five topics that we see as improvements over humans writing code. The first one is productivity: you can just tell the language model what you want it to do, and it is going to generate it for you, so you don't have to code, or the amount of code
that you need to write is way less than if you had to code it manually. The next one is that, because it's machine generated, you are reducing human error. Instead of copy-pasting code and changing the variables yourself, you can have the LLM do that for you, so in theory that reduces the number of mistakes per line of code, and that is one of the promises of code generation. Another one is that, if you've trained your model with enough high-quality data, you are able to produce high-quality results, or in the case of code generation, high-quality code following coding standards. You don't have to worry that a human is going to have to manually go in and, I don't know, name the variables properly, or indent your code properly, or use the proper patterns, because the large language model is going to do that for you. The other thing is consistency: the LLM is not going to be as random as a human being. You will see that the code generated by the LLM usually follows the patterns it was trained on, and that is excellent for rapid prototyping. Basically,
you can develop big systems really quickly with the help of automated code generation. But not everything is roses. When you are asking an AI to write code for you, you might have a lot of problems. One of the problems for us as security people is vulnerabilities in the generated code. Again, the machine is not perfect: there might be code with vulnerabilities in the training data, and the large language model might be repeating those vulnerabilities, repeating those unsafe patterns. This is what we're going to focus on for the rest of the talk. But there are other issues when you're using machines to generate code. One of them is that if you've managed to taint the training data for the LLM, you can have the LLM generate malicious code, malicious in the sense that it might hide a back door in the code it generates, and because you're trusting that LLM, you might not review it, you might not catch that back door. So if you train, or fine-tune, one of these LLMs to generate code with back doors, that can become a huge problem. The next one is privacy concerns: who else has access to the code that has been generated by the LLM? If I'm running the model locally, then I'm probably the only one generating it, but maybe another person can replicate that code somewhere else. So is it my intellectual property? Is it the intellectual property of the people who created the LLM? Who has access to the code that has been generated? This is another issue that we have with automated code generation with LLMs. Finally, the last one is fairness, bias and fairness to be honest. If you've read papers on algorithms being biased, or any articles on that, you might know that depending on the training data, or on the algorithms used to train the LLM, you might be generating algorithms or applications that are biased towards something, and that is
also a problem, but that's not something we're going to tackle today. So, what is happening in the real world? Well, people are using LLMs to generate code. If you go to YouTube and search for something developed with ChatGPT, you're going to find full tutorials of people who built lots of applications, games, websites, with zero coding ability. They just go to ChatGPT and ask it to generate everything. Same thing with Copilot: you just write your comments and Copilot is going to fill in the code. So, pretty good. There's also a lot of security research happening on this topic, and there's somewhat limited coverage in the media about whether the code being generated is secure or not. I'll share my slides later, but here are three links where general media, mainstream media, was saying: yeah, the code generated by these tools is probably insecure, but nobody cares about that, you're getting productivity. There are a lot of papers on the subject, and if you go to arXiv you can search for them; there's research in Canada on this, for example. You'll see that in most of these papers, the result is that the code generated by these LLMs, Copilot, ChatGPT, is usually riddled with vulnerabilities, and sometimes they are very obvious vulnerabilities. There's research
on how we can automate the process of checking for security in the code generated by an LLM, all links that I've shared in the slides, so you can go there and read the research. That takes us to the four challenges, or three challenges, I'm sorry. Lack of control: we can't control what the LLM is generating. We can fine-tune it, but we still don't have full control over the output. Second, there's no easy way to test and verify whether the code generated by an LLM is actually secure or not; we have to use different tools, we have to do manual review. And then there's the point of responsibility: if someone generates the code, who owns the code? If the code has a problem, am I suing the company that generated the code? Am I suing the developer who used the code? Who knows. So these are some challenges that need to be addressed in this area. Now, you ask me: is it really that bad? Yes, it's actually quite bad. The idea of this talk is that we can take a look at a few samples that were generated by ChatGPT, in this case it's ChatGPT, and see if we can spot some of the possible vulnerabilities in the code. The idea
is that I've chosen three programming languages at random, PHP, JavaScript with Node, and Python with Flask, so programming languages plus frameworks, and I also chose three tasks. Again, this is not very scientific, it's just for the sake of the presentation; you can read the research and they have far better methods, but I just wanted to make the point. The three tasks that I chose were: first, an application that prints an argument sent by the user; second, an application that checks if a particular keyword is contained within another website, so we're making requests to the internet, or making requests somewhere; and third, to check if an argument passed by the user is actually contained in a database, so we're talking to a database. Three tasks that a developer would do on a regular basis. For each one of the tasks I have a prompt, a very simple prompt, that got sent to, in this case, ChatGPT, and then I just changed whatever the main argument of the prompt was: generate this in particular language A, or particular language B, or particular language C. So the first prompt: generate a web application, in whatever language, that takes an argument named arg1 sent by the user and prints it to the browser. Pretty much the description you would have when you were going to develop something. Then we get the first results. For PHP, we get really nice PHP code: it checks to see if a variable arg1 was sent by the user; if it was, it echoes that variable in the response; otherwise it gives you an error message saying please provide a value for arg1. Can anybody spot a possible problem here? You have cross-site scripting, very, very simple. You can see that the arg1 variable gets printed, there's no escaping, there's nothing in there, so if you pass an alert payload as the argument for arg1, that script is going to execute when the page renders.
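The PHP snippet itself isn't reproduced in this transcript, so as a hedged illustration, here is a minimal Python sketch of the same class of flaw and its fix; `html.escape` plays the role of PHP's `htmlspecialchars`, and the function names and payload are invented for the example.

```python
import html

def render_arg_unsafe(arg1: str) -> str:
    # Mirrors the generated PHP: the user-supplied value is echoed
    # into the response with no escaping at all.
    return f"<p>You sent: {arg1}</p>"

def render_arg_safe(arg1: str) -> str:
    # Escaping HTML metacharacters turns injected markup into inert text.
    return f"<p>You sent: {html.escape(arg1)}</p>"

payload = "<script>alert(1)</script>"
print(render_arg_unsafe(payload))  # the script tag survives: reflected XSS
print(render_arg_safe(payload))    # &lt;script&gt;... renders as plain text
```

The same rule applies in any of the three frameworks discussed: escape at the point of output, or use a template engine that auto-escapes by default.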
Okay, so it has somewhat of a verifying if, but there's nothing protecting you from a cross-site request forgery attack; there are no CSRF checks in any of the samples. Finally we go to the last prompt, prompt number three: generate an application that checks if a name given by the user is present in the database. Now we get a larger piece of code. It pretty much sets up the database at the start, defining the hostname, password, and user, and making the connection, in this case to a MySQL database. Then we have a check_name function, and you can see there's a SELECT * FROM names WHERE name equals the name variable. That name variable is populated over here with the value coming from the POST request to that particular URL, and then we're calling the check_name function with that variable. Then we're running the query here, using that constructed SQL, and in the end we just return whether the name was present in the database or not, whether there were any results that matched that particular name. This one is a little trickier, because we have the mysqli_real_escape_string over there, so this is the first time that ChatGPT actually tried inserting something that would protect you from something. So even though we're using concatenation, or replacement, in there, we should not be vulnerable to SQL injection. But it's not the best way of doing it; we should probably be using prepared statements. Another thing: since it gave us a hardcoded list of username, password, database name, and connection string, if someone gets access to our source code, they also have access to the credentials for our database. In real production code we would expect those to come from environment variables, or to be set at runtime.
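The prepared-statement approach recommended here can be sketched with Python's stdlib `sqlite3` standing in for the generated MySQL code (which isn't reproduced in the transcript); the table contents and the `check_name` signature are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT)")
conn.execute("INSERT INTO names VALUES ('alice')")

def check_name(name: str) -> bool:
    # The ? placeholder keeps the user value out of the SQL text entirely,
    # so there is nothing to escape and nothing to inject into.
    cur = conn.execute("SELECT 1 FROM names WHERE name = ?", (name,))
    return cur.fetchone() is not None

print(check_name("alice"))             # True
print(check_name("alice' OR '1'='1"))  # False: the payload is just a literal
```

With a placeholder, the classic `' OR '1'='1` payload is compared as a literal string rather than parsed as SQL, which is why prepared statements beat escaping functions.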
Again, not a horrible result, but still not the safest code you could generate. In the next sample, even though the variable that goes into the SQL query is being checked, you are still vulnerable, not to cross-site request forgery, to cross-site scripting over there, with the name_exists variable: it's coming from the user and it just gets printed, so very vulnerable. It still has hardcoded credentials. Otherwise that's okay: we don't appear to have a SQL injection vulnerability, since we're using the query and passing the parameter, so it's performing a prepared statement in there, and that's pretty good. Then we respond with true or false, so we're not rendering the result, not rendering whatever the user passed to us, we're just returning true or false, which is okay. So, not a lot of security concerns in that one, but we still have the hardcoded credentials. For Flask, for the first time, it decided to give us three files instead of one. That's okay; next time I'll just ask it to put everything in a single file. But basically we have the same things. We have a SQLite database, well, SQLite is a real database, but it's used instead of a bigger database like the MySQL that the other two were using. We get the name from the form, we use the name in a prepared statement, we execute that, we fetch the results, and we send the results to a template. Now, that template is very tiny; over here, the result just prints the name as it comes from the name variable, so we are vulnerable to cross-site scripting again. There's also no protection against CSRF. We don't have hardcoded credentials, or at least I'm not counting the database name as a hardcoded credential. And we still have debug set to true, which is also a vulnerability.
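The hardcoded-credentials and debug-mode issues share the fix hinted at above: read configuration at runtime. A small sketch, with invented variable names, might look like this:

```python
import os

def db_config() -> dict:
    # Secrets come from the environment at runtime, never from source code.
    # Failing fast on a missing secret beats silently using a default.
    password = os.environ.get("DB_PASSWORD")
    if password is None:
        raise RuntimeError("DB_PASSWORD is not set")
    return {
        "host": os.environ.get("DB_HOST", "localhost"),
        "user": os.environ.get("DB_USER", "app"),
        "password": password,
        # Debug mode is opt-in via the environment, off by default.
        "debug": os.environ.get("APP_DEBUG", "0") == "1",
    }

os.environ["DB_PASSWORD"] = "example-only"  # demo value only
print(db_config()["host"])
```

With this shape, leaking the source code leaks no credentials, and a forgotten `APP_DEBUG` simply leaves debug mode off.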
So, summing up: cross-site scripting; a potentially unsafe function, real_escape_string, since there are some scenarios where you can exploit that, so again not perfect but better than nothing; debug mode enabled for Flask; no CSRF; and use of hardcoded credentials. They might not be as obvious as the earlier ones, because now the code is doing more stuff, but if you look closely at all the snippets you're going to find things that might not be as secure as they could have been. Does that mean that if you're using code generated by an LLM, someone can compromise your system? Yeah, of course. Will it happen? Probably not, because you as a developer are going to take a look and say, oh, this looks unsafe, or you're going to do like me and use some kind of static analysis security testing tool, just to make your life easier. If you want to download all the samples and run them through your favorite SAST tool, feel free to do that; they are on my GitHub. I've run them through multiple SAST tools, Veracode, Snyk, Semgrep, AppScan, and I've decided to show the Snyk results because that's the tool that actually caught the most findings in this scenario: Snyk identified 20 issues in the code.
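Real SAST engines do data-flow analysis, but as a toy illustration of what pattern-based checks report, a naive scanner could look like the following; the rules and the sample snippet are invented for the example.

```python
import re

# A deliberately tiny rule set: real tools track tainted data across
# functions; this just flags suspicious-looking lines.
RULES = [
    (r"echo\s+\$_(GET|POST)", "possible reflected XSS (unescaped input)"),
    (r"debug\s*=\s*True", "debug mode enabled"),
    (r"password\s*=\s*['\"]", "hardcoded credential"),
]

def scan(source: str) -> list:
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, message in RULES:
            if re.search(pattern, line):
                findings.append((lineno, message))
    return findings

sample = 'password = "hunter2"\napp.run(debug=True)\n'
for lineno, message in scan(sample):
    print(f"line {lineno}: {message}")
```

Even this toy version shows the reporting shape the talk describes: a count of findings with a location and a category, which is what the commercial tools surface at scale.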
Again, if you don't have experience doing code reviews, you can also use an automated tool, and the automated tool is going to provide you with decent results, or at least it's going to show you: oh, you have nine high findings, you have nine cross-site scriptings, for example. In this case no injections were detected, but if that were the case, it would have flagged those too. Now, to answer the question asked earlier: what will prompt engineering get us? We modify the prompt just by adding "make sure the code is secure" and regenerate everything. For the sake of this experiment I started a new ChatGPT thread for each one of the conversations, so hopefully there was no prompt interaction. The moment you add that "make sure the code is secure", you start getting specialized functions in each one of the languages, or each one of the frameworks, you were using, and just by doing that you reduce your attack surface, because you no longer have that SQL injection, or you no longer have that cross-site scripting. And it came from the same LLM that generated the unsafe, insecure code. So a very, very simple prompt change like this returns code that is sanitized. One of the questions is: well, why is the LLM generating code that is not safe by default? If you take a deeper look, you're going to realize that most code and most examples out there don't focus on security. Security is added later. A huge chunk of the material is like: hey, how do I print this variable? Here's the tutorial on how to print a variable. Oh, by the way, if you do that it's not going to be super safe, so use this particular function to protect it. And you can see that this is a general problem with LLMs, because they're going to look at the corpus of data, and they're not going to think
that the next token is the secure one, because security only appears later, or only appears as a consideration, instead of being in the code from the start. So how do you mitigate the risk? Well, you use better prompts, you use better training data, and if you can't do that, you need to review the code that you generate before you put it in production. So no going from ChatGPT directly into whatever you're shipping. Make sure that your developers are trained; just a suggestion, always make sure that your developers are trained in security, that's pretty good. If you have the capability of actually fine-tuning your model, use that capability: use secure code examples, use security best practices in the process of fine-tuning the model. If you add that to prompt engineering, you can absolutely get code that, I won't say is production ready, but is very close to production ready. The last one is to integrate automated security checks. This is something you should be doing regardless of whether you're using code from LLMs or developing your own code: make sure that your code is scanned, make sure that you're using a SAST tool, and if you're building a web application, make sure that you're using dynamic application security testing as well. So, to finish here: using LLMs to generate code is awesome. I particularly like using them to generate pictures, images; all the images in this presentation were generated, not by LLMs, but by generative AI, in this case DALL-E. But whenever you generate code, make sure that you take a closer look at the code that is being generated, and then do another pass just to make sure you didn't miss anything, because it might be very subtle. If you have SAST tools, run your SAST tools, and if you are going to use ChatGPT or Copilot, make sure to ask them to generate secure code, because otherwise they're not going to. Okay, that's it. Thank you for watching this talk. Feel free to reach out to me; I'm usually active on Twitter, and I'm active on LinkedIn as well, so reach out if you have any questions. And well, if you have any questions right now, I can take them, please.
So, that's a very good question. I don't have an answer for that, especially because of the way these LLMs are trained: we don't have access to that data, so I don't know how long it would take for secure code to be inserted in there. But if we, let's say, generate unsafe code and put it in production, the first pentest is going to catch, I wouldn't say most of these vulnerabilities, but it's definitely going to catch the cross-site scripting. If you're running some kind of DAST, it's going to catch the SQL injection, and it's going to catch the cross-site request forgery, or the lack of CSRF tokens, as well. You're going to see that if you ask it to generate something with cookies, it's not going to default to HttpOnly cookies, so again, your cookies could be intercepted. So the first pentest is going to find most of the findings, but there might be something hidden in there that doesn't get caught, so it's hard to say.
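The HttpOnly point can be made concrete with Python's stdlib cookie API; a sketch of setting the flags explicitly (the cookie name and value are invented):

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["session"] = "abc123"
cookie["session"]["httponly"] = True  # JavaScript cannot read the cookie
cookie["session"]["secure"] = True    # only sent over HTTPS

# Renders the full Set-Cookie header line, flags included.
print(cookie.output())
```

Generated code tends to stop at `cookie["session"] = ...`; the two flag lines are exactly the defaults the talk says the models don't add on their own.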
So, I've read about C; I haven't read anything about Rust. With C and C++, it has a hard time figuring out when to free memory, or not free memory, that is something I've read, and it happens consistently. It's not one of the languages I use, so I don't have more empirical experience with it, but in general, yes, you'd assume that languages that are not scripting languages, which are the ones I picked here, would have a harder time. But then you have the compiler, which might do a lot of heavy lifting for you and say: hey, this is not secure, you're using a function that should not be used. So I think it kind of balances out, but again, the class of vulnerability you might find in a compiled application is different from a web application one. So yeah, I don't have a definite answer on that, but I know that you will find problems regardless of the language. Yes? Yes, yes, yes, everything: for all of the samples it reduced the findings to zero. Even the ones that are more subtle, like the hardcoded credentials: if you ask it to make the code safe, it turns the hardcoded credentials into environment variables. So, pretty good, and again, it's just a matter of saying "make it secure", which should be the default. They were ChatGPT 4. I don't know, I paid for the subscription, but I don't know if it's available to
[Applause]
everybody. So, that's a good question. If you're using a commercially available model, let's say ChatGPT or Copilot, they might be using your data to retrain the model. So if you generate code in there and you accept it as a valid answer, it might come back and become the solution for the next user who requests something similar to what you requested. Is it guaranteed to be there? No. Are there ways to opt out? If you're using ChatGPT, you can actually go to settings and say: don't use the information I provide to feed your tool. But there's always the chance that someone is going to replicate your prompt, or get very close to generating a solution that is almost like the one you generated. There's some randomness in how LLMs work, but in general, if it has been generated once, there's always a chance it can be generated again. So if intellectual property is something that you care about, that you value, that your solution relies on, using an LLM for it, especially one that is commercially accessible, might be a bad idea. Yes? And [Applause] then, absolutely, absolutely, and this is something that you can actually do, and it kind of works. You ask it to generate code, and in the same context window you ask it: oh, are you sure this is secure? And then ChatGPT is going to go: ah, actually there's a SQL injection here, and oh, there's a cross-site scripting opportunity here. The LLM is designed to generate the next probable token; it's just that the security considerations are not the most likely thing to be generated next. But that's a very good idea. You can even take it further:
you provide code and ask it to criticize it. GPT-3 had a hard time with that, but 4 works really, really much better. Unfortunately, it will be very hard for the LLM to find zero days, stuff it has never seen before, because that would make no sense, but I assume you could use it as some kind of SAST, an entry-level static analysis: hey, look at this code, are there vulnerabilities? That could absolutely work. No, I haven't tried this with the custom GPTs, but yeah, I would assume that with that ability of doing the pre-training and giving the pre-prompt message, you could add that as the pre-prompt message, or if you're using a custom model, you could add it to the pre-training, and it would definitely get better results, absolutely. [Applause] Okay, yes. So, we know that AI has been pushed hard by security vendors. There are some vendors that are using LLMs to generate suggestions on how to fix code, which is basically the idea that was mentioned over there. It's something
that a lot of companies have restrictions around. One is that your code is getting sent to an LLM for analysis, which could be bad, and, tying back to that earlier question: well, if I send something to the LLM, will the company see it, will the company reuse it? If you have a commercial provider that will guarantee they won't use your data for training or retraining, then I can see some companies adopting it. But in general there are all these challenges, the privacy challenges, the ownership challenges, that might be in the way. As a personal developer, or in a company where you might not care about the code you have, other than the security of the code, that might work, and you might see vendors pushing it. But for larger corporations, I think these other challenges need to be addressed first. Still, I definitely see that being used in the very near
future. What is the best way to address a business audience? That's a very good question. One thing is making sure that everybody understands how the LLM is working, how it is generating the code, or how it is generating the solution that you need, because we might not be talking about code; we might be talking about some other kind of automation being done with the LLM. Make sure that your audience understands the risks, what happens when you submit data into one of these commercially available LLMs. Make sure they know it's not something that has a consciousness; it's an AI model that has been trained to do something. It will hallucinate a lot of the time, because it doesn't know the context, or it just makes things up as it goes; that's less of a problem with code and more of a problem with other kinds of automation, especially if you're generating text. Address the risks: if we use this tool, we might get productivity, we might get fewer errors, but there's a chance our competitor is going to use the same thing. If our prompt leaks, or if we're using something that requires a specific prompt and it leaks, then someone can create the same tool we're using, or do the same thing we're doing, so that's always a risk. The other thing is
about the legality of it. For example, not talking about LLMs, let's talk about generative images. If we're talking about DALL-E, or any other tool that generates images, what was the data set used to train it? Because the images it generates all have some kind of ground truth, some kind of base image that was used to generate them. So who owns the copyright on that? If I'm using it to generate art, maybe an artist is going to look at that picture and say: oh, this is based off my art. And then you also have that risk, because again, the data has to come from somewhere. I know that in the US there are some lawsuits happening right now as to who owns that, or whether the original creators should have some kind of financial compensation for the data that's been used to train the models. In general, I would assume it's still a very important risk if you take business considerations into account. The last thing would be: if you're relying on one of these LLMs for your business, you might as well have some kind of path out of there. If OpenAI decides to charge ten times more tomorrow, what will you do? Your costs have gone ten times higher and there's nothing you can do, because there's no other tool you can migrate to. Or they decide to shut down the model you're using: you're using a cheap model, and they say, oh, in three weeks we're going to retire it. This happened a couple of months ago, and people were outraged, because: oh, we built our entire solution on top of this, and now you're retiring this model because you have a different model that's more expensive. So, all checks and balances: wherever you see an opportunity, there's certainly a risk associated with that opportunity. Okay, thank you very much for
attending this talk. My slides will be available; I think they're going to be on the conference website. If you need anything, feel free to reach out to me, and [Music] yeah.