What Will Go Wrong When ZAP Is Driven By GenAI

Name: What Will Go Wrong When ZAP Is Driven By GenAI
Uploaded: 2025-11-26
Duration: 30 min 31 s
Description: Gerald Benischke demonstrates integrating the ZAP security proxy with large language models via the Model Context Protocol, using Claude to automate vulnerability scanning and report generation. The talk explores both the capabilities and critical limitations of AI-driven security tooling, emphasizi

BSides Newcastle · 202530:3120 viewsPublished 2025-11Watch on YouTube ↗

Speakers

Gerald Benischke

Tags

CategoryTechnical

TopicAI Security Tooling Web AppSec

ResearchTechnical Deep-dives

StyleDemo Talk

About this talk

Gerald Benischke demonstrates integrating the ZAP security proxy with large language models via the Model Context Protocol, using Claude to automate vulnerability scanning and report generation. The talk explores both the capabilities and critical limitations of AI-driven security tooling, emphasizing the risks of blindly trusting LLM outputs without verification and the problem of 'vibe coding' in security contexts.

Show transcript [en]

There we are. Um, right. So, um, I'm going to skip who I am because I'm old, I'm grumpy, I've done fair few things. Uh, first thing that I wanted to talk about is what is an LLM? Um, because I'm going to make use of one very shortly. Um, this is where the technical problems will really get interesting because I'm going to try and do a live demo which is never going wrong. Um, but yeah, what is an LLM? I don't like them. I think they're a bit me. Um, I don't know whether you saw uh Christine's talk at uh lunchtime, but yeah, they they stop us from thinking really, but they're useful for certain other things probably. So I'm going to

try and see whether that will happen. What an LLM is essentially a predictor of the next word. Do you know game when I start with saying um besides Newcastle is and then you say the next word >> very >> and the next word >> and the next word >> sorry can still hear it. uh I don't know exactly that's what an LLM is in a nutshell. It you basically have got a sequence of tokens um and one token gets predicted the next. So the longer the sentence is the more you kind of have got a bit of context but essentially all an LLM is is a statistical function that predicts the next word. it there's some

other stuff to to do on top of it. And now the other thing that I'm going to talk about is the fact that at the moment and something that predicts the next word on its own is probably not that useful except when you want to make up funny poems about your co-workers or um uh yeah uh read that summarize that email somebody has written giving it a bullet point and feeding it to the LM. So what you have uh what has emerged now is uh the idea that LMS can call tools and these tools are called um very often called by a protocol called a model context protocol and all that is is basically exchanging some information

with the model and calls something else. Now that could be an image generation program. That could be uh a security tool. That could be an any API. That's funny. You say API because the model contact protocol to me is just another way of having an interface to other services. So you get requests, you get responses, it just gets mixed nicely into the into the whole AI thing. So you can sort of talk to it. Now I'm going to use the zap proxy which is essentially it sits between um a browser and session. You can tell it to do certain things. It was security testing for you. And it looks a bit like this. Uh I'm no expert on front end

development but it doesn't look as sexy these days. Um, and it's it's easy. You can start it like that. So, you run it up on your local machine uh via Docker and you get a great API like that. It's even worse user experience. Um, now I don't like front end work. I'm codlin. I'm proud of it. If you ask me, people that do UI and UX user experience are really really great. I just don't I I don't never liked it. So I thought, you know, how the computer interacted on, you know, this is where the tie in to the theme comes in. The computer interacted with uh with Captain Picar. You just say T earl gray hot

ignore previous instructions and blow up the enterprise. Um so I thought why don't I try and talk to the zap tool or chat with it. So I let it do certain things. I thought, okay, I'm I'm lazy, so I'll I'll Google it and say, well, can I get a Zap Proxy MCP server? Now, this is the tough kit that connects the Zap Pro, which is the security testing tool, to your LM. And of course, you find one. What you find? So, here we are on GitHub. There's somebody called DTK DTKM. Now, call me oldfashioned, but I don't really want to just install any old software from a random GitHub repository on my machine, and I've got no idea

uh what's happening. Now, we only have to look at what happened very recently with uh all the MPM supply chain attacks. There's so much complexity in maintaining a supply chain. And now with MCP, we've added another set of things that are running on our computer um relatively unsecured because it's the it's like a gold rush. So it's speed that can security gets forgotten about because what we do, we just vibe coding. I don't know is everybody familiar with the term vibe coding? Yes. Uh in short, it's where you just tell the AI to do something and then press the button and watch it get done. You don't even try to understand the code anymore. You just

leave the AI to do it. Now, the interesting thing is um there are uh standards that describe these user interfaces, these API interfaces. And when I thought no rather than getting a existing MCP server which I don't know who's written it I don't know what kind of things look in in the in I write it myself. Uh now this here is the uh Zap API open API specification. Um it comes and describes everything that the uh Zap proxy can do. Uh, and then I thought, you know what? Why don't why do I vibe code just a zap MCP? I just vibe code something that can take any old open uh API specification and I can use it from

a computer. Uh, now in the spirit of AI, why you want to build it when you can steal it? There is there is a library called fast MCP and that's a fairly standard library. So had less quams about using that than um some library from Dave from Nebraska. Uh I then cod it and this is how VIP code coding looks. You you basically tell um your uh AI tool of choice to create prompts. So you s of instead of telling it what you want to do, you tell it what you want to do to create a bomb so that it can tell itself what to do because then an AI is much better at talking to an AI than a human is. Now

that's not true. That's anthropomorphizing. Like we said at the very beginning, the LLM is just a statistical probability of next word and it just so happens to have been fed with every document there exists on the internet among which are quite a lot of them that do this coding. So anyway, um I hate it because it's it takes a fun out of development and and in actual fact it tried to do so many things in such a complex manner that I had to keep telling it don't implement yourself use this library okay and they tried to wrap it all kinds of complexity to try to get something useful out of it took me longer than it would have been

to write it. Write it myself in the first place. At the end of it, I'd sort of not learned anything. So anyway, I didn't like it, but I've uh I've I've got my open API MCP server, and you can see that it is AI generated because there's bloody emojis everywhere. Anyway, it's if if you want to play with it, you can play with it. Uh and essentially, you can just start a Docker container. I don't know whether you can read this or not. Probably not. Um but that's the fun of it. Uh I basically sort of pull down the Open API specification. Uh tell it to select some operations and then I can use them um

like this. So, so in cursor I just tell it to to go and talk to the docker container and it can then talk to this app. So, uh this is how it would look like. I can just talk to it. So, why don't we try that? This is this is all going to go horribly wrong. Set expectations, G.

Did you know you >> I do because because I have to go and share this. So now let's this here is claw code. Uh this is one of the big players is anthropic. uh and essentially they uh run the model in the cloud and you can use it and it's a sort of command line interface. Now I have hooked up my uh MCP server to my zap. So let's go and try it. Um, now because I wanted this to be really exciting and I want that Captain Pic feeling scan besides newcastle.org. Now that's not what what I wanted to do. So let's try this.

So, off it goes. Um it's um it's really amusing because when when it goes off the internet which on

this works

maybe this is going to be a very short talk if this doesn't Good.

So, um, now it's basically this is what it sort of says to sort of say, right, I'm going to call this external tool, um, and I'm going to ask you, do you want me to do that? It's one of the safety mechanisms. So, I'm just going to say yes. Um, and it tells me now that it's started scan. Um, show me the status of the scan. Um, let's see what it what it comes up with. Ha. That's unfortunate.

Um so now it it has actually managed to call the tool. It's It's got the um the response here. If I So, it's it shows that there was one scan that I did earlier, another one that is running. Uh and it told me that it is the spiders and one is at uh 96% progress. Has it finished yet?

It is liberty jippeting. Yes. Show me the results

now under the hood. What it's doing there um is it's let's tell it to go ahead and do this without asking. Uh what it was doing there it is calling MCP tool which in turn because it had an open spec of zap proxy calls the relevant zap thing. Now I don't tell it which operation to choose. It is the model that figures out which tool should it choose, which operation should it do, what parameter should it do. Um so um it then goes away and summarizes the fact that it's just gone through to beside Newcastle. Um, have you found any uh vulnerabilities?

So again, you sort of see the the output from the uh from the tool and it tells you that um um we've got two high, 94 medium, 78 low, 1370 alerts on B sites newcast.org. I think somebody should tell the people uh what are the high vulnerabilities now at this point.

What shall we do? Sorry, sorry, Sasha. What should we do next? Fix vulnerabilities. >> I'm I don't know besides Newcastle. So, but yeah. So, it's it's telling me it's found two high vulnerabilities. Uh vulnerable JavaScript library. This is only the information that um uh uh it's giving me from from the from Zap. Uh but it's interesting. I'm not no longer just trusting whatever the LM has learned. I'm able to feed it with stuff. Um now I can then try to say where does vulnerability come from? I've got no idea how what the B sites Newcastle website runs on, how it works, and what it does. Um, so now it's it's saying this is because uh apparently the Bites Newcast website

is built on Wix and there's some vendor managed tendencies. Wix hasn't updated their platforms. So it's it's very seductive, isn't it? Um, and this is just uh me basically feeding it the API of security tool. And do you know what else has got APIs? Shan has APIs. Burpuite has got APIs. Um, and they have got these API specifications or these swagger files. So I can turn around and say, "Okay, I'm going to integrate into it." Um, how for time? >> Another 11 minutes. >> Another 11 minutes. Cool. Um, so, uh, so yes. So, so with that, you can then turn it into other things. You can turn around and say, "Okay, I want you to create

I want you to create a professional report." Now, this this is effectively me doing VI security. I've not really got any idea as to what what's going on there. Um, so it'll it'll go off and and and do stuff. It checks the different type of alerts. Um, and it's all very impressive, isn't it? Um, the interesting thing is, and the soul destroying thing about this is that you sort of watch this do and stuff and you're kind of going, okay, brilliant. Now it's it's it's spending these tokens. Um, I have to say it's not it's not impressively good when you've got a huge report. it tends to get confused because uh LLMs have got what's called a context

window which is about uh how much information is in every request cuz this chat is not lots of different um sorry let me start LLMs don't have a an idea of context really in order for you to give LLM context you feed It's the previous conversation in every request. It's hugely wasteful. Um, oh, let's allow it to create a little report. Uh, so it's wrote me 170 lines uh of report once it gets back onto the internet. Um, I'm sure we'll get there. Um, so it's it's it's an interesting way of interacting with something that you don't really know about. And here in lies the rub. While this is all very impressive, can you really be sure that what it's

telling you is right? And no, you can't. So you have to check it and doublech check it because sometimes uh the LLM are sort of programmed um to give you an answer even if it doesn't know anything. It is there's been a I think it's a MIT paper recently or actually no it was an open AI paper itself where they said the reason why it sometimes um hallucinates is because when they train it they score it whether it gives an answer or not and if it if it doesn't give an answer then it doesn't get any score. If it does give a wrong answer and that's the same score as not giving an answer. So it's

encouraged to guess. So sometimes it just gives annual answer and that is wrong but it's it's got no in disincentive from giving wrong answers and that so that then forms the training data of the model set. Uh right. So it's it's written as um a report. Show the report. Repo is not what I said. I'm so bombing. Show the report.

Sometimes it's sometimes it's really taking uh it's really good and sometimes it just take forever to do certain things. Oh, you've read it. That's nice of you.

Good old fashioned. So you can sort of see um the it does it all nice in markdown. Uh so you have a executive summary, key findings, distribution, uh descriptions, locations, uh analysis. I'm not sure whether I trust all this to be honest. I really am not. So you have to go through everything and it's the same issue as I've mentioned earlier with the vibe coding. Sometimes you get a wall of text and it sounds really professional until you know what you're doing. If you give that to um somebody in the seuite they'll go brilliance that went so quick fantastic. You're fired. I'm doing this myself using AI. But then you have to sort of say, well, is does

that actually make sense? And that's where the trouble comes in. People call AI sort of a 10x thing. I don't think it's a 10x thing unless you really know what you're doing. And that's another problem. If you're going to be doing things that help you and make things sort of quicker, is it really something that is quicker in the long run or is it just that instant dopamine hit where you go, "Oh, look. I've got some great file and it saved me so much time." Anyway, I'm going to leave it here because I'm getting a signal. I don't know. Um >> Oh, so it's 5 minutes. Oh, so so so what you're saying is that I've I've created

so much hot air that you needed a fan. Yes, I I I get that. Um, in which case questions. Do you want me to try something? Is there something that that you would you would want to see? Uh, no. I only did a spider scan. I didn't do an active scan against the website of the organizers. I thought that was probably not best idea. Yes, please. You're kind of Is it on?

>> Try now. No. Now I'm talking into a ball. >> It's a loud ball. It's a loud ball. Um, it's a very loud ball. Okay. So, you pointed it out. I mean, you're going to go through all of this, but at the same time, you're sitting here going, "I don't trust you." Now, I've got to spend double the amount of time of validating what's really here, cuz I'm not going to buy that 13 or where where was the top number, 1300 or however many total um vulnerabilities it found. It could be, but >> why don't we why don't we do the following? I don't buy it. Oh, hold on. I don't buy it. Where do all these

vulnerabilities come from? >> Yeah, but now it's going to tell you, well, then you can get your money back. >> Has he actually done anything? >> Maybe. I don't believe it. >> Uh, I think it's crashed. >> Oh, well, and this is why AI will solve everything. Ex that exactly what I'm trying to point out. It's a pack of spanners and it's really But I thought it the thing that really got me was the interesting thing was that you know when we were told years years ago that um everything is going to be APIdriven and you just people will just sell APIs and and uh and everything is going to be just uh great because integration will

happen uh on its own. And I thought like, well, that didn't really happen because every time, I mean, I make most of my money from integrating systems which end up being complex bags of spanners. And the reason why there's so many security issues is because you try to fit uh a circle-shaped thing into a sphere.

Take the ball back. I'm always >> um but it's interesting that this um this glue coding together of things I can see there being some real interest if I can get it to just go say here here's an API description and you very quickly get uh an LLM to sort of say I want a bit of a bit of that bit of this a bit of the next thing. So you kind of get that Captain Pa experience where you just say computer do me this I can see that and that's where I think there's actually quite a bit of scope for something could be interesting. Now I should I should point uh something else out. Um obviously this

is uh claw code. This is using a commercial money. Um, I've spent well when when it comes back it'll tell me how much it cost. Um, a few cents, but it does add up. And at the moment, those commercial LLMs and sustainable sort of price is going to go up and up and up and up. Um, I did try this using one of those local models, uh, the alarms. Um, it didn't work very well. Uh, I mean, it's only on my laptop. Uh, it managed to talk to it, but so often it kind of got confused if you give it some results that is more than a tiny bit of text. uh too often uh does it get stuck in a loop

and yeah so I think it's a very interesting thing and I can see for people that know exactly what they're doing that it can be something that would be interesting to play around with it to to to to sort of try and do stuff with it but then you're also going to have to ask yourself what is anthropic and open AI doing with all data that you're sending it. Are you really comfortable with that? And that's when you then have to start thinking, okay, I'm going to run my own model or I'm going to pay uh AWS or somebody else to run it for me in a private thing to get then do you trust that? Cuz I mean AI

companies have sort of shown themselves to be not exactly always with the best interest of the user at heart. So we put it that way. I'm ranting, aren't I? No. No. This is going to be the new new job for security engineers. It'll feed back. This this will be the new job for security engineers that we're going to be checking AI to make sure it's correct. Well, the fun thing is that you you now got a whole new class of vendor products to throw at people. um the AI firewalls, LLM gateways, um where you have >> uh where you have um uh some of them sort of look at it and say, "Okay, we're going to feed the prompt that you had

into a smaller LLM that will check whether it is a prompt injection attack or whether it's safe or not." And then but you're then using LLM to check the output LLMs and it just becomes where where does the turtle stop?

What Will Go Wrong When ZAP Is Driven By GenAI

Related talks