
So hello everyone. Hopefully this works. I'm going to give a short talk, only 20 minutes, and it's not going to be too technical, so it's suited for everyone: buffer overflows and how they work nowadays, when you have GenAI everywhere. This is my personal research for my third-year bachelor project, so I have to shout out my supervisor — thank you very much, if you're here. For this project I did a lot of testing on the different models that are publicly available, and very surprisingly the results are quite scary. So let's go straight into it. Why does the buffer overflow still matter nowadays? It's an attack
that has been around since 1972. At the time it wasn't named a buffer overflow, but it was already identified as such. The CWE rankings haven't existed since 1972, of course, but buffer overflow has been in the top 25 since 2011, if I'm not mistaken, and it's been fluctuating: last year, 2024, it was in second place, and now it's back to fifth place — still quite scary. So it's still relevant today. And as the previous talk mentioned, Go is a memory-safe language; C and C++ are not, and they do get buffer overflows when developers aren't careful enough and don't check
their boundaries. Existing compilation-time mechanisms — ASLR, DEP, stack protections — can be bypassed with what we call ROP chains, return-oriented programming. I'm not going to go too much in depth, but just so you know: we have defenses, and these defenses can be bypassed. That's quite scary. Legacy code bases, especially in IoT devices, most of the time don't have these compilation mechanisms, so they're more likely to be attacked. And what the research showed is that the static analyzers you might use in industry, such as Semgrep or SonarQube, actually did not perform very well — and the code snippets that
I used to do these tests are quite easy, so it's very scary. GenAI's semantic code understanding actually does seem, from this research, much better than the syntactic analysis the static analyzers do. So let's go straight into it. Quick recap — I know I said it wouldn't be too technical, but you do need to understand how buffer overflows work. I won't go too much in depth: the stack grows down, the heap grows up. This is the memory layout of a compiled C program. It varies depending on your operating system and on your architecture — whether you compile for 32 bits or 64 bits — but more or less it
looks like that. If you go deeper into the stack, you'll see there is a layout, and it still grows down. You have your function arguments, if there are any; then your return address, your saved frame pointer, your local variables; and if someone declares an array, it goes right here. You're going to ask: why is that dangerous? Because if someone doesn't check exactly what's going on and doesn't check their bounds, the values stored in that array might overwrite what sits above. If I told you the stack grows down, you'll ask how I'm able to write upwards. Because an array stored in
memory occupies continuously increasing addresses. So if I write past this 32-byte array that I defined, I start writing over my local variables — that can be an issue. If I keep writing, I write over the saved base pointer — a bigger issue. And if I keep going, I overwrite the return address, hijacking the whole control flow. That's a big issue. If that's clear, I can now proceed and tell you about my research questions: what was I trying to prove, what was I trying to analyze? Firstly: can AI detect buffer overflows in C code? An easy question.
It's probably going to be answered with yes or no. Second: how does AI compare with static analyzers? This is a question where there was a bit more doubt; I didn't know how it would go. Maybe the static analyzers would be very good, GenAI bad, and my research would stop there — a tidy third-year project. Third question: does the prompt engineering methodology matter — single versus multi-stage prompts? A lot of research suggests that multi-stage prompting gives better performance and optimized results. Surprising results here. And finally, the fourth research question is a bit particular, because it wasn't initially planned; it came later, since I wrote
the literature review first, and between my literature review and my dissertation, OffSec came out with the OSAI certification, and I discovered a repo called PSSW100AVB — a guide that bypasses antiviruses, and with a single inline comment it manages to bypass AI antiviruses. That showed me the importance of evasion techniques, and I definitely had to add them to my project to see how things would perform here. So let's see how that went, and what exactly my pipeline is. First, I designed the vulnerabilities: two stack vulnerabilities, two heap vulnerabilities, and one that is an
integer overflow that then maps to a heap vulnerability. You might have seen CVE-2025-6660: this is a vulnerability in the PDF-X parser software, in the way it parses GIF images, which leads to a nice integer overflow. It's quite recent, so it was very relevant for our research. To do this, we stripped the comments the programs initially had, and we made build and solve scripts to make sure that what we exploited was replicable in all environments. Then we tested seven LLMs and SonarQube on this. We designed two prompts; first, prompt A: a single prompt, very basic, with no prior knowledge.
We test this on the seven models, and we do two runs per model so we're sure we don't get spurious results, and then we select the top three performing models and test multi-stage prompts on them. So we did that — I'll talk about the results a bit later. Then we do the evasion testing once all this is done, so we have good insight into which cases are going to be the hardest, which the easiest, and which are the least commonly seen. We tried prompt injection — simple misleading comments in the file; obfuscation — bitwise macros, things like that — and
decoy security, where we added fake functions that appeared to check bounds, maybe before the overflow happens or after, to try to fool the AI that way. The models we decided to evaluate were Gemini 3, DeepSeek DT — DT stands for DeepThink; I don't know why they decided not to just call it Think, but anyway — then Claude Opus, Claude Sonnet, regular DeepSeek, and GPT 5.1 and 5.2. It's worth noting that GPT 5.1 has since been decommissioned by OpenAI, but that's fine. One thing to note also is that I tried to use thinking models as much as I could. Thinking, for an AI model, increases the reflection time, so it might provide somewhat better
results. When thinking variants weren't available, I just used different models. So how did I grade these models? What was most important for me were definitely the detection and the fix. If the model doesn't detect the vulnerability correctly, the rest of the pipeline is broken, and if the fix is not correct, practicality is probably going to be impacted as well. That's how I scored it. Response time was important as well. The analysis was quite important too, but I think that once an AI model detects a vulnerability, the analysis is probably going to be quite good afterwards. And practicality was also important for possible integration into enterprise environments: does it change
your whole code, or does it just change a few things? So what did prompt A give us? By the end of this talk you're going to think I'm paid by Google, but I assure you I'm not. Prompt A showed us that the GPT models were not that great. The DeepSeek and Claude Sonnet models scored in the middle. Claude Opus and DeepSeek DeepThink were good. But the real winner here was Gemini 3, in terms of timing and in terms of output quality — on both of those trade-offs it was the best. So the spread is quite large, but at the same time not as
bad as we could have thought. There were no critical failures, nothing that made us change our minds. So it was a good first analysis for a single prompt with no prior context; it was quite substantial. One thing to note is the standard deviation. You won't see it in most of the graphs, but when you actually take a deeper look, DeepSeek DeepThink has a big standard deviation: sometimes it gives you the best results and sometimes the worst ones. In an enterprise integration you probably don't want that; you want something reliable that provides good-quality outputs most of the time. So it
was a good first result. This is more or less the same thing as the previous slide; it just gives you a better visual view. It's worth noting that the later GPT model performed worse than 5.1 — sorry, OpenAI, but it's just the start. Gemini was absolutely the GOAT here, the other models scored in the middle, and the GPT models took a lot of time and didn't score very well. That's something to keep in mind. But overall, a 100% detection rate across all the GenAI models — that's nice. SonarQube: 40%, so only two out of five detected, and Semgrep only one out of five.
That can be complicated to digest for some companies, but at the same time we have to keep in mind that GenAI has this semantic analysis. Oh, I touched the HDMI cable. >> Yeah, give it a moment. >> Oh, okay. We don't touch anything. So this semantic analysis from GenAI — the actual understanding of what's going on, of how the functions work together — versus the purely syntactic analysis from the static scanners, which are rule-based or pattern-based, does make a difference at the end of the day. So I'm not telling you to integrate GenAI into your company pipeline, because it's still a black box and it's still vulnerable to
a few attacks. But it's worth noting that there is an actual difference from what is currently used in industry. Then we tested prompt B, the multi-stage pipeline, and the results were also quite surprising: it didn't provide as much success as we initially thought. Research and literature push towards using multi-stage prompts for better results. Two out of three models were better at the analysis with multi-stage prompting, which makes sense, because you give them a bit more time to give you details. But practicality was worse for all three models — when you only ask for one specific task, it's going to focus, and maybe focus a bit too much, on getting something
better, and the time increase was just crazy on all of them. So was it worth it? Probably not: for a negligible increase in performance compared to a single prompt, and a time increase of 162%, I wouldn't say it was worthwhile. Then we went to evasion testing. We tried prompt injection, obfuscation, and decoy security, and the AI maintained a 100% detection rate here. That might be attributed to not having large code bases or more complex code, but in our case a 100% detection rate means it passed all the tests. The most time-consuming technique was obfuscation, which does make sense: when we say the AI uses semantic understanding — when you have function names that are quite explicit, it helps the AI understand what's going on. But when you strip out those names and use bitwise macros, things like that, you give the AI a harder time. At the end of the day detection was still good, but the time increased considerably. So overall, what does that tell us? It tells us that Gemini 3, in my research, is the one I would recommend for enterprise integration on buffer overflows, to test memory-vulnerable programs and find patches.
Claude Opus was the most reliable: it didn't change much, this guy was chill, didn't move. DeepSeek DeepThink might be the one you'd want to stay away from: it scored the best result in some cases, but in some cases the worst one. Would you really want something you cannot rely on? Probably not. PA and PB here stand for prompt A and prompt B. A bit more insight on how the performance went per case: the first vulnerability, the very simple stack overflow, was, quite logically, the easiest one for the AI to score on. The ret2win case — meaning we overwrote a return address and
jumped somewhere else in the program — was scored approximately correctly. Heap was the hardest case for all models, which tells us that GenAI models don't quite master heap overflows yet. The CVE case surprisingly wasn't the worst one, but it performed mediocrely, and the integer overflow was very variable depending on prompt A or prompt B. So how do we answer the research questions now? Did GenAI detect the buffer overflows? Yes, it did. Did it perform better than the static scanners? Yes, it did. Model selection does matter here, because there is a big gap between the models, and you would probably want to use Gemini 3 if you're trying to integrate something like that in
an enterprise environment — though again, this landscape changes so much and so quickly that I'll say that and tomorrow there will be a new model or something else. Multi-stage prompting provided negligible-to-harmful results in our case specifically — so don't apply this to everything, but for memory management issues it provided harmful or negligible results. And the evasion techniques failed: we maintained 100% resilience on what is probably the main point of interest in AI right now. So it's a study that did provide some useful results. But of course, this was a project done as part of my studies, so I didn't have as much time
as I would have for a full research project. We were limited to small benchmarks — only five programs, and they were not as big as we would have liked. GUI access only, no API, because some models were harder to access through an API, but we would probably reduce response time if we used an API directly. Manual scoring — I was there with my Excel spreadsheets taking notes, but if you automate this pipeline, it will give you less subjective results. The scope was limited to buffer overflows; we could widen it. And models change, so this is applicable right now — and maybe not
even right now, maybe two weeks ago. So keep that in mind. Future work: probably test on real, larger code bases and see if it performs as well; automate the scoring pipeline; as said, widen the scope to more vulnerability classes — try testing format strings, for example; try more sophisticated evasion techniques, such as polymorphic payloads; and later on, maybe AI versus AI — an AI would try to design vulnerable code, and we'd see if it can bypass AI detection. That would make quite a nice project. So what can we actually take from this? As said, GenAI achieved 100% buffer overflow detection against 20 to 40% for industry-
standard scanners. Gemini 3 is the optimal model — as I said, I am not paid by Google, but it was clearly the result that stands out. Multi-stage prompt engineering did not improve our outcomes, and all three evasion techniques failed on our examples here. One thing I didn't write here: when testing payloads in the multi-stage pipeline, I asked the models to provide example payloads. The models that provided the most detailed and the best payloads were the Anthropic models. Would that relate to them talking about offensive models and releasing something in offensive security? Most probably — but that's just something I noticed while doing this
research. The Claude models — Claude Sonnet and Claude Opus — did provide the best offensive security results. So that's all for me. Hopefully I'm still on time. If you have any questions, go for it. Yes? >> Two questions. Did you track token utilization? >> I did not track token utilization. I tracked it in my head but didn't write it down. >> It would probably be good to do that. >> I think also — did you explore reinforcing the prompting through things like skills, persistent prompting? >> That's a good question. I did not. At the time, when I started this research, models had just started releasing thinking modes. So I mostly
compared the thinking modes with the non-thinking ones to see what the difference was. And one thing to take away is that the extra reflection time does not make for better results. One thing I forgot to mention as well is the GPT models under evasion. As you saw at the beginning, for prompt A the GPT models took the longest and scored the lowest. Under evasion, funnily enough, they were the fastest. So they just abandoned ship — they shot themselves in the foot and said, "Okay, I'm going to give bad results anyway, so I'm just going to do it quickly." And yeah, this is quite scary, because it shows that the most publicly available and
the most publicized models are probably not going to be the best ones. Any other questions? Yes? >> Probably same as token utilization, but did you track cost? >> Cost, as in currency? >> Yeah. >> So to do this experiment I did have to get subscriptions to ChatGPT Pro and Claude Pro to get access to Claude Opus, but apart from that I never reached max token usage. I did this across different time slots, you know — every day a bit — so I never hit the limit, and I did not track token usage. That's a very good point. >> Have you looked at local models at all — sorry, have you looked at running models
locally at all? >> So I did think about running Llama and things like that locally, but I think this study was targeted at a wider public — you know, as an everyday user, what does this provide to you? If I ran it locally, I would probably use models that are less known or less used, and that use case would really depend on my architecture, GPU, things like that. So no. >> Yes? >> One of the challenges with traditional static analysis tools is false positives. Did you look at the false positive rate for the AI? >> I did not write it down, but I can tell you Semgrep was the one that
provided the most false positives. >> But did the AI tools produce any false positives? >> Actually, no. Again, this was done on very small code snippets. And I mean, if I had time, I could show you — why did I not link the GitHub repo with everything? Because I'm currently being graded on this project, and I'm considering publication. By a raise of hands: who thinks this work is worth publishing? Wonderful — then it's settled. So yeah, if you're good at searching you can definitely find the GitHub repo, it's public, but I will not link it here, purposefully. But yeah, no false positives.
>> Yeah, I saw that — is that this year or last year? >> This year. >> This year, so presented next week. >> Yes, exactly. >> Okay, good question. I saw you have five crafted programs. Are they from open source, or crafted by yourself? >> Crafted by myself. There are a few websites that host CTF challenges with programs like this; most of the time they allow you to keep the permissions you execute with. I did modify these challenges a bit, but they're taken from well-known CTF platforms, things like that, except
for the CVE, because there's no publicly available proof of concept. That one was made purely from the description given in the NVD database. >> Yeah, I'm asking because I'm kind of worried about whether those larger models have prior knowledge of those existing challenges — whether they saw them during training, which would kind of poison the testing. Any measures to prevent that? >> What I did is make sure to delete the context and the memory of the models every time, to make sure nothing was stored across repeated test runs. These challenges are not available if you just
search for them on the internet; you have to log in, have an account, things like that. So I'm hopeful they weren't already present in the models' memory. But it is a good point, and this was also a reason for including the CVE, which has no public exploit: if the CVE case had performed much worse than the other tested snippets, we could probably have assumed there was some kind of memorization or stored context. But in this case, they all performed very well. It also shows that, for example, static scanners probably wouldn't flag the CVE, because it's not something pattern- or rule-based;
the GenAI models actually understood the code and saw there was a vulnerability in there. >> I hope so. >> Do you have an example of your prompt? >> I can — if I can, I'll come back to the previous question. >> I'll talk to you later.
>> Let's see if it's already open or not.
So for example, this was a very simple—
Yeah, let's go. So this was the very simple single prompt. I just give it context. This was based on research as well: it's called the zero-shot policy, meaning you don't provide any examples or context to the AI — you simply tell it what you want it to do. "You're a security tool; analyze this code for possible security vulnerabilities." It's worth noting that I didn't say there are vulnerabilities here; I said possible security vulnerabilities. You could get higher hallucination rates if you said there was a vulnerability and there wasn't one in the source code. So that was very important. And I just gave it the
categories I was evaluating, without many details, and evaluated the output to compare. And this was prompt B. Every time, I put it in a specific context: for the detection, "you're a security analyzer"; later, for the analysis, "you're a researcher"; and for the fix, "you're a coding engineer". So every time, give it good context and ask more specifically for what you want, which does provide way more detail — but at the end of the day the output mostly stays the same. >> Sorry, it was more just about the publishing concept. There's a mandate in academia to publish, but the benefit of publishing is also that you get to interact
with the community, they get to interact with you, and it really builds a timeline as you begin your career or your interests. So there are a lot of good reasons to share — open source, the whole ecosystem. >> Yeah. >> Really quick: do you think your multi-prompt didn't make that much of a difference because you were using those thinking models, which in many ways are kind of reflecting more as they go along anyway? It might have made more of a difference if it was all standard models, or older styles. >> I feel the biggest difference would definitely be in timing speed, which makes sense, because the thinking modes are there to force the models to
think longer. The complexity of the response was probably impacted too, so you would maybe get a better analysis with thinking mode than without. >> Yeah, I was thinking that maybe the multi-prompting didn't have as much of an impact because you were using thinking models, basically. >> That's a good point — I didn't try that. >> Okay. >> Just a quick question: it works with deliberately vulnerable code? >> Yes. >> So have you thought about extending it to other attack types? >> As I said, maybe I want to stay in this dimension, because I think the static scanners would perform better on SQL injections or things like that, and this is quite a specific
subject. So I really want to evaluate this subject further — probably format strings could be worthwhile. Other than that, reading data: trying to see if it can grab the contents of a file on the file system. But then I would have to build an architecture, not just snippets. To extend this research, that could be worthwhile. >> Brilliant. Thank you very much, and thank YOU.