
Hi everybody, thank you for coming to my talk. My name is Dumisani Masimi, from Bent People. My talk today, as you can see, is about AI in offensive security: auto-pwn or auto-fail? As the picture on the right-hand side suggests, the AI pen tester usually scans, finds 247 findings, then validates and confirms 40 of them, and hands you the PDF. But do you understand all of that? What did it actually do? That's where the gap is, and that's why I wanted to give this talk. Why this talk? AI is everywhere. Go on LinkedIn, go on social media: AI this,
AI that. Everybody's using it, for whatever reasons. But the big question for us as pentesters, as offensive security professionals, is: can it do the job for me? Can it run the script, execute the malicious code, and just take over what I'm doing? The real risk isn't AI replacing us. It's using AI incorrectly, without understanding the tool.

The core question: are you using AI as a tool, or letting it think for you? Either you give it input based on what you, as a person, would actually think of, or you depend on it, take every output, and copy-paste it. That second option is your quickest fail. What I'll cover: the hype, where AI fails, using AI without losing credibility, where AI actually helps in real penetration-testing workflows, and what makes a strong tester. The question is: will AI replace pen testers? Can you just prompt your way to exploitation? Fully automated pen testing? My
personal take is that a lot still needs to go into AI before it can replace what the human brain can consider. There are a lot of judgment calls that AI simply does not make the way a human would when thinking through these aspects.
So what is AI actually good at? Payload ideas: I've watched it generate payloads, whether it's your XSS or your SQL injection. Recon and enumeration: give it an nmap scan of a /24 and it will summarize it and give you possibilities for where the attack surface could be. Writing scripts fast: as a pen tester who isn't strong at coding or scripting, this really helps me, but only with verification. And report writing: when your prompts are right and you know what you're reporting on, it helps there too. Reports become so generic and so repetitive that this can shrink a three-hour reporting task into a much shorter span of time. Again: verification, verification.

I don't have a demo, sorry, so here's a quick sketch off the top of my head. You prompt it for SQL injection and it gives you all the payloads, instantly, and everything looks like it's supposed to work. The issue is: does it validate anything? Is it actually giving you correct payloads for the specific case you're feeding it?

Where AI starts to break. Hallucinating vulnerabilities: finding CVEs that simply aren't there, or giving you an attack path that doesn't work for the specific software or version you're attacking. Incorrect and broken payloads: I've had to reread generated scripts and payloads a few times, and half the time you'd find things that aren't even supposed to be there. Outdated techniques: depending on where it looks, the techniques it summarizes for specific versions might be outdated and give you no workable way of exploiting the target. Misread responses: AI can't actually see HTTP responses, so it basically just assumes everything is a 200, and in some cases that isn't the case.
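One way to guard against that last failure, sketched here as a minimal illustration (the error signatures below are hypothetical examples, not from the talk), is to classify the actual response your payload got instead of assuming success:

```python
def classify_response(status: int, body: str) -> str:
    """Classify a real HTTP response to an injected payload.

    A 200 status alone proves nothing. Look for concrete error
    signatures before treating a payload as interesting, and never
    report anything without manual proof.
    """
    # Illustrative database error strings; not an exhaustive list.
    error_signatures = (
        "you have an error in your sql syntax",
        "unclosed quotation mark",
        "sqlite3.operationalerror",
    )
    body_lower = body.lower()
    if any(sig in body_lower for sig in error_signatures):
        return "possible-sqli"           # a real signal: investigate manually
    if status >= 400:
        return f"error-status-{status}"  # the app pushed back; not a hit
    return "no-signal"                   # do NOT report this

print(classify_response(200, "Welcome back!"))                         # no-signal
print(classify_response(200, "You have an error in your SQL syntax"))  # possible-sqli
print(classify_response(500, "Internal Server Error"))                 # error-status-500
```

The point is not the heuristics themselves but the habit: the tester, not the model, decides what counts as evidence.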
The danger shift. AI lowers the barrier to entry, but it doesn't lower it well. It gives you great-looking output, and for anybody who isn't a seasoned pen tester that can be overwhelming. With less understanding, you end up using it blindly, trusting the AI more than anything else. Basically, you're flying an airplane without knowing what you're doing and hoping for the best.

The auto-pwn mindset, and what it looks like in practice. Copy-pasting: generating a payload and pasting it in without any verification. Trusting AI explanations of vulnerabilities without verification, including hallucinated vulnerabilities that aren't there or that it hasn't looked up anything current on. Skipping the fundamentals because AI can explain things for you. That's the core message here: you need to know what you're looking for before you can ask AI for the explanation, the attack path, or the payload. This is where we chase speed over accuracy. More findings does not equal more value; sometimes fewer findings of quality work are what matters, not the quantity. When you chase speed and find as much as you can, you reduce the report quality as well. Can you explain how an exploit works, or just that it should work? Once you stop understanding the work you're doing, it's hard to explain any of the exploits you'll be documenting in the report.
False positives: the reality of failure. I've had quite a few findings that were reported as present, or at a higher severity, and upon verification that wasn't true. Missed real vulnerabilities: AI focuses on pattern matching, and if it isn't thinking logically about next steps the way a human would, it will miss things. Broken exploit chains: AI does not think like a human here. You would chain exploits, taking a low-severity vulnerability and chaining it with a medium to get a critical. It doesn't reason like that; it basically gives a generic response for most exploits. And then there's the loss of client trust. If I give you a report with one finding that you know is incorrect, you'll start questioning the whole report. Is this trustworthy? Is everything here actually true?
Sorry, again: it comes down to the same problem. One wrong finding is an issue, but several wrong findings will build distrust with your clients. With blind AI usage, your reputation is at stake, not only yours as a pen tester but your company's as a whole. It becomes a credibility issue. I think this slide is done.
AI and reporting. What AI can do: write clean finding descriptions, though it will keep them as generic as possible and won't give you the exact specifics of the finding. It can reword technical content accessibly; that reminds me of reports with two separate sections, one for the executive summary and one for the technical folks, where you can reword or summarize the technical findings for the executive audience. It can structure findings consistently; sometimes it does, let's just say that, because it's not always accurate, and it can rate a finding high or critical when it's probably an info at best. And it can speed up the drafting process; the drafting gets better, again, when you know what you want in that report and what to ask for.

The other side of it: it can remove important context from findings. Not every finding is generic from the client's point of view; there are specifics you need to explain in terms of the risk posed to that client. It misses business impact: it doesn't understand, like I said, what the client is after or what they're protecting. No two clients are the same, and generic answers can't work for both. And it sounds authoritative, as if it knows what it's talking about, while being wrong at the same time.
This is an example of an AI-generated finding, for SQL injection. When you read it, it has no exploitation evidence, no business context, and only a generic recommendation. Technically it looks correct, but it lacks what actually speaks to the business directly.
And this is what it looks like when it's written and customized for a specific client: it explains that we found SQL injection, gives the actual path of where the injection is, what exactly could be extracted, and how it affects you as a business. It's clear, it's contextualized, it's tied back to business risk, and it gives an actionable, immediate remediation timeline.
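Structure like that can be enforced in the reporting pipeline itself. Here is a minimal sketch (the field names and sample values are my own illustration, not the speaker's template) where evidence, business impact, and a timeline are required fields, so a generic AI draft cannot silently drop them:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    title: str
    severity: str
    evidence: str         # actual exploitation proof, not just a scanner line
    business_impact: str  # what this means for THIS client
    remediation: str
    timeline: str         # actionable remediation target

    def render(self) -> str:
        """Render one report section; every required field must be present."""
        return (
            f"## {self.title} ({self.severity})\n"
            f"**Evidence:** {self.evidence}\n"
            f"**Business impact:** {self.business_impact}\n"
            f"**Remediation:** {self.remediation} (target: {self.timeline})\n"
        )

finding = Finding(
    title="SQL injection in /login",
    severity="Critical",
    evidence="UNION-based extraction of the users table via the 'email' parameter",
    business_impact="Customer PII exposure; likely regulatory notification duty",
    remediation="Parameterised queries, plus input validation as defence in depth",
    timeline="7 days",
)
print(finding.render())
```

Because the fields are mandatory, a draft that lacks exploitation evidence or business context fails at construction time rather than in front of the client.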
Where AI fails in real-time engagements. Scope nuance: it does not know the scope it's operating in, what's in scope and what's out of scope. Chained attacks: it won't understand that you want to chain a low finding with a medium finding; that takes a logical workflow. Legacy systems, like SCADA networks or mainframes: it gives generic answers there as well. I have actually knocked over a mainframe using some of these techniques; that was not a fun day. And client conversations, explaining the risk itself: it can surely help with the wording, but when the client starts pushing back, it's your own understanding and your ability to explain, as a consultant, that carries you.
A key insight. AI can help with findings: pattern-matched findings, generating payloads, drafting report content, and enumeration summaries. That's what I normally use it for. What I have to do myself: validate the exploitation, do the risk analysis on the finding, and prioritize how severe the risk is. As a human, you also need to be able to write up what the business impact would be, and then there's chained reasoning: chaining all these vulnerabilities together to show business impact.
How to use AI properly. Brainstorming: I use it a lot just to throw it a few ideas and see what it comes up with, whether for work or personal things, just to get a different point of view; don't trust it fully. Drafting initial scripts and automating repetitive tasks: once you're in a workflow where you're testing a specific thing, infrastructure or mobile, you know exactly what the workflow is, so you can say, let me try to automate this and save a little time.
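As a sketch of that kind of repetitive-task automation (the two checks below are trivial, hypothetical examples), the idea is to write each standard check once and loop it over targets, leaving validation of anything flagged to the human:

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    passed: bool
    note: str

def run_checklist(target, checks):
    """Run each standard check once per target; a human validates anything flagged."""
    results = []
    for name, check in checks:
        try:
            ok, note = check(target)
        except Exception as exc:  # a broken check must not kill the whole run
            ok, note = False, f"check errored: {exc}"
        results.append(CheckResult(name, ok, note))
    return results

# Hypothetical, trivial checks purely for illustration.
def has_https_scheme(target):
    return target.startswith("https://"), "transport scheme check"

def no_default_port(target):
    return ":8080" not in target, "flags a commonly exposed default port"

results = run_checklist(
    "https://example.test:8080",
    [("https", has_https_scheme), ("default-port", no_default_port)],
)
for r in results:
    print(f"{r.name}: {'OK' if r.passed else 'FLAGGED'} ({r.note})")
```

A harness like this is exactly the kind of script you might have AI draft for you, provided you read and verify every check before it touches a client environment.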
Rephrasing findings for clarity for the client audience: you take a technical finding and reword it so you can pass it to a non-technical person and they'll be able to understand it. And summarizing long scan output quickly: a lot of nmap output can be overwhelming to look at, and AI can help you summarize it.

How not to use it. Don't let it make decisions on exploitability; check any exploit or script it generates and validate the vulnerabilities yourself. Don't let it replace your understanding of techniques you already know; it can give you ideas, but once you're in the environment you have to be able to maneuver around details AI won't be primed for. And don't write your report without a human review. It makes mistakes, typos; it has even put other languages into a report of mine, languages that weren't English, and without a human review that report would have been sent.
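For the scan-summarizing use case, a small script often beats pasting raw output into a chat window, and it's exactly the kind of script you might ask AI to draft and then verify yourself. A minimal sketch for nmap's greppable (`-oG`) format, assuming the standard "Host: ... Ports: ..." line layout:

```python
from collections import defaultdict

def summarize_nmap_grepable(text):
    """Summarize nmap -oG output: open ports and services per host."""
    hosts = defaultdict(list)
    for line in text.splitlines():
        if "Ports:" not in line:
            continue
        host = line.split()[1]  # line shape: "Host: <ip> (<name>) ..."
        for entry in line.split("Ports:")[1].split(","):
            # Each entry looks like "22/open/tcp//ssh///"
            parts = entry.strip().split("/")
            if len(parts) >= 5 and parts[1] == "open":
                port, proto, service = parts[0], parts[2], parts[4]
                hosts[host].append(f"{port}/{proto} ({service or 'unknown'})")
    return dict(hosts)

sample = "Host: 10.0.0.5 ()\tPorts: 22/open/tcp//ssh///, 80/open/tcp//http///"
print(summarize_nmap_grepable(sample))
# {'10.0.0.5': ['22/tcp (ssh)', '80/tcp (http)']}
```

Unlike a chat summary, this gives the same answer every time and never invents a port that isn't in the file.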
A practical workflow. At the start of every engagement, enumeration is key: own it. Do the legwork, the manual ground work, with a bit of "assisted" automated scanning from known, reputable tools if you're running scans. Then use AI to expand on ideas: look at what you found and see what AI can help you with to progress further, maybe on things you've never seen before. But again, validate manually. If one thing comes out of this talk, it's validate, validate, validate. Don't trust it. Any scripts you want to use, validate that they actually work before you put them into a testing environment or your testing workflow. Rerun variations, check what AI missed or oversimplified, and test again for anything specific to the engagement.
The real fear is over-relying on AI: when the output looks right, we say it is right, instead of verifying it. Copy-pasting AI text directly into a client report: clients are starting to recognize AI-written reports, and once they read something generic, something not specific to their business, they can tell it's a copy-paste. Again, what does that do? You lose another point of reputation as a credible tester. And then you're unable to explain the findings themselves. There's nothing worse than being on a client call, being asked about a finding, and genuinely not knowing how to explain it. It's one thing to say "I don't know" when you're asked about something else, but if you can't explain your own finding, that's pretty bad. Then there's skipping the manual verification process: for everything it outputs to you, you should be able to say, "this looks right, but let me just test it one more time." And finally, your understanding of techniques isn't growing at all. You're not learning anything new; you're just copying and pasting, and at the end you're not growing yourself in any way.

So what makes you valuable as a tester? Your critical thinking: at a specific point in your test, seeing where you need to pivot to the next host. Your understanding, your critical thinking at that moment, sets you apart from anything AI does. It can give you steps or suggestions, but you still have to identify the path that needs to be taken.
The habit of verifying everything is a good one: don't trust anything blindly without validating it, because AI will happily give you something generic. Your curiosity, digging deeper, sets you apart from everybody else. And clear communication, translating technical findings into business risk, is still a human skill that I think everybody in consulting should learn.
The future of pen testing: AI is not going anywhere. As much as it seems like a bubble at the moment, AI is going to still be here, and it's going to improve, and part of that improvement will be people like us in the community making it better. Not because we want it to take over, but because we understand some things better than it does; we're teaching it. Automation will increase in all phases; as a lazy pentester, I'll admit automation is sometimes quite a good thing. The landscape will change rapidly. Everybody can produce a tool now, within limits; everybody can do, what do they call it now, vibe coding. So everybody can write code and produce a script, and there are going to be a lot of those coming up. How to stay ahead: keep building manual skills, because there's still a lot of manual work that needs doing; understand your tools; and stay curious. The final message: AI is here to help us, but don't be too reliant on it. When the pwning gets a bit too easy, you need to ask whether it's really auto-pwn, or auto-fail at the same time.
The practical takeaways from the talk. I think I've said the word "validate" more today than anything else: always validate AI output. Never outsource your thinking; use AI to assist you, not to decide. Focus on accuracy over speed: ten findings are better than fifty on a report, and a tight report is what gets actioned quickly. And protect your credibility: it takes years to build and is very easily destroyed. I think that should be it. Any questions? Yes, sir.

>> When you say AI, what kind of AI is that, or AI in general?

For this specific talk, I would say general AI.

>> So is it agentic AI you're referring to? Because from my experience, agentic AI with additional modules, like coding or memorization modules, could mitigate some of the problems you mentioned to some extent.

Yes. This talk was really meant for pen testers who are coming in, who are new to AI, to guide them not to go and use AI blindly. Like you're saying, there are variations where you can add a couple of things that will make AI a dream to use in your product.

>> Also, do you have any reflection on the tool that found a 27-year-old vulnerability? It's super powerful.

So once you start paying for something to use and it finds vulnerabilities at that scale, I think that would be viewed as a business offering, so I didn't look at that particular case as an individual. But as far as my reflection is concerned, it's great. I've seen quite a few AI agents doing tremendous jobs. I actually want to build one next, just to experience and feel it in the real world for myself. So they work, people swear by them, but I have not practically tested one.

>> I'm afraid that's all we've got time for. If you have any more questions, please approach him afterwards. Thank you very much for speaking.