BSidesWarsaw track 2 dzień 1 cz 2

BSides Warsaw1:56:15194 viewsPublished 2024-07Watch on YouTube ↗

Show transcript [en]

The first thing is that there are good tools used strictly for assessment of these capabilities, but in fact I am answering the first part of the question. There are good tools used for assessment of devices. The client imagines that this is how the infrastructure tests should look like. I'm a bit experienced here, just like when we were talking with Alex, where they just put these tests on, I know that it will be a click, run, and that's it. We just get such orders, where the client imagines that this is how the infrastructure tests look like. I can say from experience that I had many times such meetings with customers where they were testing automatically and they imagined that it would be

a click, run in Nasus and test. Yes. I think the best answer is: I think it's a good topic - It will come and I will do everything for you. It will change soon, because everyone starts with the KDK. It will come and I will do everything for you. Okay. and the entrance to the... ...and the entrance to the... I only have one and I have so many in it now. I'm going to have to change it. You can check the power, what will go through it. It's not working. Oh, yeah. I'm going to have to change the audio. I'm going to have to change the audio. - I'm very interested in reading about the class. I'm very happy to

share all the mistakes I've made. And we shouldn't do that either? No, because of the glass. There's a glass made of glass and it's in a nice shape, so the water gathers on the roof and the volcanoes are very beautiful. Tomorrow will be...

I need a moment to sort it out. Does something not work? I need a moment. If someone doesn't have an ID, I invite you.

Oh, mine too. Exactly Lubin. - No, in the background it's... - Ścinawa here. Ścinawa? No. Well, I'm from Ścinawa too. Well, I didn't expect that. I mean, I've only been in Ścinawa for two years, but... Well, I haven't been in Ścinawa for ten years, but I grew up in Ścinawa. Yeah, okay. Okay. - You'll know it here. - Okay.

Ready? But something with HDMI I think. I'll take it right away. But okay, okay. Okay, I hope that if I don't touch it, it will be okay. Is this microphone enough? Or more? Okay. I won't be warming it up anymore.

Okay, can we start? You're flying! Yes? Okay? Well, if I can try quickly... Wait, this microphone probably doesn't work either. I have something like this. Wait. There it is. Wait for me. If you want, I'll give you a microphone. We have it. We have it, let's fly. Okay, good. So, we're going to talk about the automated vulnerability management and Alphabet. So first, a few words about myself. I joined Google in 2018 as an intern. And I joined the web security team. If you're into web security, web standards, you probably know Michele and Lucas and Artur, who did a lot of work on content security policy back in the days and a lot of other web standards. So I worked with them during the internship. Back then

it was called SecMetadata. Right now it's called the SecFetch security headers. So we did some experiments. And it was so cool that I decided to stay at Google. I joined a different team, though, still in the space of web security, but doing scanning. So I'm part of the larger team doing the automated vulnerability management. But myself, I spend most of my time in the five years developing the web security scanner. So what are we going to talk about today? We're going to talk about things that seem easy but are actually very hard. And this is a broccoli man internally used for complaining about simple things that are actually quite hard when you're at Google. Like, I just want to run a simple web server. It turns out

if you're at Google, it's not that easy. And there's a lot of processes. Can you ask me? Yeah, a lot of things to do to make sure it runs at scale if it ever happens to be productionized. So yeah, a lot of things we do are really hard. I joined Google in 2018 as an intern. And I did a lot of processes. Yeah, a lot of things to do to make sure it runs at scale. We're going to talk about things that are actually very hard. But in all seriousness, I will talk about what vulnerability management is, why is it hard at Alphabet, so not only at Google, but also all the sister companies like Waymo, Waze, Mediant, and many others.

And how are we doing it? So all the approaches we need to take to make sure it runs at Google scale. And I will also try to make it a bit entertaining by telling you a few funny stories. So in an ideal world, from a security perspective, every machine reboots and installs all the updates the moment they are available. All the code that we run is compiled with the latest version of every library. And every library has all the relevant security fixes deployed. Nobody uses password as their password or forgets to change the default one or forgets to set one. All the developers follow best security practices and don't say, OK, we'll do it as a follow up. And admins know what's running

in their networks. We've had cases of teams finding out about assets in their networks just because we started scanning them. And there's no pineapple or bananas or Pompizza. And this is somewhat personal opinion coming from the top. So tribute to all my Italian managers. But yeah, unfortunately, we live in the real world where none of the things I've just mentioned is true. And-- This is a screen from our internal bike tracking system. And as you can see in the bottom left, Somebody decided to use asshole as their password to an FTP service. Legacy apps are served on sensitive domains. And again, you can see an example of our web security scanner, which found a cross-site scripting at google.com. Luckily, it was some very very

targeted application which wasn't used and we managed to catch it before anybody else did. You know, things like that still happen in 2023 and I think this is a screenshot from last year. People mark security box as working as intended because it's not a critical service. And as already mentioned, people find out about their servers, VMs, just because we scan them. And we report some things to them. They're like, ooh, actually, we didn't know we have still this running. So vulnerability management. tries to bridge the gap between the ideal world and the real one. And we mostly focus on the so-called one-day vulnerabilities, which are the known security bugs that have been released. And most of the time, they have a CVE assigned and are

very easy to find by automated scanners. Zero days are out of scope in most of the cases for our team. For zero days, we have project zero and red team. But we still do it in some cases. And I'll talk about it later. So basically, what we're trying to do is we're trying to raise the cost for an attacker to gain privilege to our networks. Because-- you know, finding these one-day vulnerabilities, it's quite easy. And once somebody gets into the network, they need some other bugs for lateral movement. If we manage to find all the easy bugs first, then they need to keep burning their zero days to keep moving around the network. So what happens if there's no VM or if vulnerability management

fails? This is what happens. You might have heard about some of these things, like all the big data breaches, vulnerabilities that cause serious security incidents. And many people think it's all about groups targeting specific entities with their groups of hackers. And it is true in some cases. But most of the cases, it's very simple bugs. Many people think it's-- ALEXANDER BUZGALIN: Things like exposed interfaces, default passwords, outdated libraries. This is really what it takes for companies to be breached. It is true in some cases, but most of the cases, it's very similar to things like exposed interfaces, default passwords. ALEXANDER BUZGALIN: So I will quickly talk about our approach to vulnerability management at Alphabet. And

we divide it in three pillars-- inventory, scanning, and remediation. And yeah, let's start with the inventory. And you might wonder, OK, what's so hard about inventory? Because we have a bunch of workstations, servers, and Cloud VMs, right? And you're right. This is a solved problem. And it's solved so well that it's been solved too many times by multiple different teams having the same asset in different systems. Every asset has dedicated obstructions to represent it. You might have a workstation as a physical workstation, you might have a workstation as an IP, you might have a workstation as a MAC address. Everything is in different systems. They are completely isolated from each other. They run on different tech stacks. And in some cases, people don't even

have a proper inventory because they use a naming convention for their devices. So TLDR, it's very hard to get a proper inventory. But one thing that we can know for sure is that coverage of assets beats scanning capabilities. Because we might have the best scanners that find every single-- So it's very hard to get a better inventory. If we don't know about things that exist in our networks, what can we even tell about security of it? We can't even scan it. So our solution that we worked through years is basically hand over the responsibility for creating the inventory to the respective teams. And so in the past, even when I joined the team, we tried to maintain our own inventory. But

we found that this is just not sustainable and we decided to work with other teams who own the inventory to send it to our inventory system because they know their environments the best. So what we do is we go to the team which owns, for example, Cloud VMs and we tell them, hey, please, who send us the most up-to-date inventory of whatever you think a sane representation of a VM is every day or multiple times per day. And so we work with all the different teams. They send the inventory to us. We expose hopefully same APIs. And we're ready for scanning. Step number two, let's find the bugs. And just to give you an idea of what we're dealing with here,

this is the first challenge of scanning. We have millions of Cloud VMs, hundreds of thousands of workstations, hundreds of thousands of third party libraries in our code repository, and tens of thousands of web applications. And approximately 20 different asset types, like cloud projects, container images, web domains, Git repositories, VM instances, and so on and so forth. And then we have to scan all of these things. So we roughly do millions of cloud VM IPs daily with one of the network scanners that Kristen mentioned. Hundreds of thousands acquisition IPs daily and hundreds of thousands of workstations daily. And you can already see it's not only a challenge from the security perspective, but also from the software engineering perspective. And I can actually tell you that most of our team are

software engineers building the whole infrastructure for running these scans. And you can already see it's not only a challenge from the security perspective, but also from the software engineering perspective. Moving closer to the scanning itself, again, it sounds easy because we probe the system over a network. We try to fingerprint what's running there. We look at the vulnerability database, like the NVD, which contains all the publicly known vulnerabilities, hopefully with a CV assigned and with all the different scores. Hopefully, it also contains an exploit so that we can verify that the vulnerability actually exists. Or we just try to brute force in case of weak passwords. And so I think Kristen already explained what CVEs are. But we work

with CVEs quite a lot because all the automated scanners have detectors for the CVEs mostly. And a lot of them are questionable, let's say, quality and not really relevant for us. So what we might find is this thing. And this is the second challenge of scanning, where let's say we probe a server. We find there's an HTTP server running. And it sends this HTTP header in the response saying, hey, I'm Apache 133. We match it with the Vulnerability Database. We see, OK, there is a CVE that is assigned to this version of the server. And we're like, OK, I think we found something. But then when you look closely, it actually says, yes, you need to have the mod proxy module enabled. And only

then you might trigger the vulnerability. And so there's no way of verifying-- well, usually there's no way of verifying that this is an actual security issue from a black box scan perspective. And in some cases, we might infer things based on some behaviors, like if the mod proxy module is enabled, then you will see an extra HTTP header. If there is an extra HTTP header, then there is a very high chance that this is an actual security bug. So what we do is we pay a lot of money to the vendors. One of them that was mentioned by Christian. who have an army of people who create detectors for the CVEs that can be detected automatically. And then we run

these detectors and try to dig through the haystack and find the needle. And we also create our own detectors and our own scanners. But then we are much more careful about what we find and what we report. And we always focus on creating the false positive free detectors. So whatever gets spit out of the detector, it most likely is. a security bug. So challenge number three of scanning. This is a book cover of site reliability engineering at Google, I think, which talks about how to run your system reliably and how not to be broken by other people on the internet. And security scanning is a threat to stability. It shouldn't be, but it is, as we

find out. And if you want to see why, there is a quick story from a couple of years ago, where one of the engineers reached out to our team and they say, "Hey, I think our devices start misbehaving and we think this is because of the security scans." And we're like, "Okay, that happens. Let's do something about it. Maybe we can stop scanning and we can exclude them from scanning in the future." But can you just quickly tell us how you identify these devices? And the Google engineer says, yeah, just use this regex for identifying the devices that you shouldn't be scanning. And you can see there's some concerning words in the regex, like acid pump leaks. So yeah, that's one

way we find out about weird things that are running on our network. And you can see there's some concerning-- Yeah, so things like that happen. And you only find out about them when you break people. And another funny story is when people in one of the Mountain View offices couldn't enter the building because the batch readers stopped working. It turns out our scanner just-- Another funny story is when people in one of the Mountain View offices couldn't enter the building because the bad readers stopped working. And that's the bad readers in the building. So basically, when we start scanning, we suddenly get a bunch of different requirements from different teams. So we want to scan a device, but

at the same time we don't want to scan the redundancy. Just in case we break the machine, we still want to have a redundancy to be available. And maybe not in the same blast radius, so if we break one machine, maybe we shouldn't be breaking all the machines in the same radius. And also, if the SREs, the site reliability engineers, could have a big red button, that would be also great. Just in case anything happens, they can just kill the scan. And it will also be nice to log all the scans, just in case we want to investigate things retrospectively. And also opt out the devices that we know are brittle. And usually we find out about the devices that are

brittle by scanning them and breaking them. That's when we know they are brittle. And by the way, if we could check that we haven't broken the device after the scan and make sure it still runs properly. And maybe not scan all the time, but choose a bit more friendly time schedule. And by the way, we should also work with the vendor to make sure that these devices never break and are secure. So these are the challenges of network scanning. And this is a topic that comes up recently very often, which is dependency scanning. especially in the context of recent US government regulations where you have to publish your-- it's called SBOM, the Software Bill of Materials. What is your software composed of? What are the versions?

Are the versions vulnerable to any CVEs? So again, it sounds easy. We just need to extract the OS packages or app libraries, all the versions of the names So again, in terms of evening, I'm interested in doing some custom packages or vendors, and all the vendors, and all the vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors, and all the vendors, and all the vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors, and all the vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors, and all the vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors, and

all the vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors, and all the vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors, and all the vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors, and all the vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors, and all the vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors, and all the vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors, and all the vendors. So again, in

terms of evening, I'm interested in doing some custom packages or vendors, and all the vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors, and all the vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors, and all the vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors, and all the vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors. So again, in terms of evening, I'm interested in doing some custom packages or vendors. So again,

in terms of evening, I'm interested in doing some custom packages or in your software that are probably impossible to exploit. And this is probably one of the things that we still haven't solved properly yet because we can do the scanning part but how do we actually verify that the bugs that we found in the software dependencies are actually exploitable and relevant. And we should bother probably hundreds of engineers to update, upgrade their software version. So then let's talk about how easy it is to create a vulnerable server or workstation. And this is a screenshot of a Jupyter notebook. And there's this magical command down there, which, by the way, was taken from the top Stack Overflow

post, which is like, how do I expose my Jupyter notebook to the internet? Here it is, RCE as a service. And yeah, people do it without thinking much. what are the implications or another case from one of the machine learning platforms where again like you just run this platform and you suddenly get the RCE as a service. So you already created all these vulnerable workstations, now how fast are you gonna get pwned? This is a graph from one of the research papers that our team did with collaboration with some German university and what it shows is the popular software on the left hand side and all the stars on the graph show you the scan attempts and trying to

break into the server. So as you can see if you have a Hadoop running with a default password it's quite possible that you will be scanned all the time. So the vulnerability window is very short And that's because IPv4 space can be scanned within a few minutes. I think I read an article where somebody did it in like less than five minutes. There are automated scanners running all the time. And I think from the research paper the 50th percentile of the servers were taken over in 10 hours, less than 10 hours. And the fastest attempts were within like 15 minutes. So you expose your Jupyter notebook, within 15 minutes you get pumped. Yeah, so what we need is we need to find vulnerabilities in

very short periods of time. We really want to find the critical vulnerabilities first. Hopefully false positive three. I keep mentioning false positive three and I'll tell you why afterwards. We need something that can, where we can develop detectors fast, like before anybody, before like one of the vendors like Qualys or Tenable, before they release their plugin, like one week after the Volumetri gets published, we ideally want to be able to do it within minutes or hours. And that implies it's probably gonna be open sourced. So that's why we created Tsunami. Has anybody heard of the Tsunami scanner? No? It's the better alternative to Qualys, AntennaVille and OpenVAS. So it's been created by us. It's one of the most mature network scanners that

we have. It has the highest quality detectors, which are false positive free or hopefully false positive free. We have test beds, so like test vulnerable containers for every vulnerability that we detect, which is also open source. And I think I'm going to be the first one today saying the magical word AI. We have AI-related detectors which are very prevalent these days, especially at Google. Yeah, and we have the Patch Reward program where you can get paid for contributions. So how it works usually is, let's say a CVE gets published. We think it's critical and it's worth having a detector for it and it can be detected. We open an issue on our GitHub page and then you can go to our

GitHub page, assign the issue to you, send a detector and then you can get paid from one to three thousand dollars per plugin. It usually works on a first-come first-served basis but you can also, if you know about a CVE, maybe before it gets published ideally, but if you're one of the first people who know about it, you can reach out to us and then we can create an issue specifically for you and you reserve the right to create a detector and get paid for it. And in some other cases we also pay $500 for web fingerprints. So web fingerprints are the way we can identify that a certain software is running. So let's say

there is a new CVE for Jenkins version blah blah blah. We not only want to detect this vulnerability but we also want to detect that okay we know that there is Jenkins running on this port or this server.

And I already mentioned we have released a lot of the AI infrastructure plugins for all the platforms like MLflow, PyTorch, Jupyter Notebooks. We also have the callback servers for vulns such as RCE or server-side request forgery. It's also available open source, so you can test for these vulnerabilities as well. And we also have support for Python plugins because Tsunami is originally been written in Java. and we know people don't really like Java, so we created a Python support. And yeah, so with Tsunami we were able to scan all of our cloud assets within hours since we get to know about the vulnerability. We create a plugin, we release it, and most of the time it's actually

the release process, and then we can scan all of our assets within minutes. And yeah, so we basically run the scanner in Kubernetes clusters. And this is one of the best things that we have for scanning for critical vulnerabilities. The other scanner, it might be interesting if your company is in scope for compliance programs. It's called LocalToast, also open source. And if your company is in scope for the compliance programs, like FedRAMP, for example, it is usually required to scan for and let me quote the compliance program, "You need to scan for configuration according to industry best practices." So we worked with the regulator and we came up with the idea, okay, we can look at

the CIS benchmarks and the CIS benchmarks are, again, a quote, "a prescriptive configuration recommendation for more than 25+ vendor product families." In other words, how do I configure my Apache HTTP server securely or how do I configure my my rocky Linux securely. So they give you a list of checks that you have to do like set this flag you know set proper permissions on this file and so on so forth and we have all the the most important CIS benchmarks already in local TOS so you can just run local TOS on your machine on your server and you can check that this is actually compliant with the industry best practices and And one of the best things is that it's actually very

low resource consumption. We spend a lot of time optimizing it, making sure it doesn't use too much memory, just because we have to run it in very high performing servers and people actually care about what's running in there. And then another thing, which is probably one of the most exciting, but unfortunately I can't share the name yet, It will be released in the next couple of months. We'll make it available and there will be a blog post about it. And this is going to be a generic file system scanner with which you can scan your containers, workstations, code repositories, servers, GCS buckets or S3 buckets by mounting to your local host. So basically what it does is traverses the file system, looks at the files

that you tell it to look at, and then extract some things from them. So it's fully extensible. You can write your own extractors. We mostly use it for dependency scanning where we scan for, for example, the POM file if you're on Maven or requirements file if you're on Python or OS packages. So anything that supports like DPKG, RPM, APK. We extract the software that is running on the machine or the packages that are used to compile your project. We extract the versions and then we have all the software inventory to look at and then we can scan by comparing the versions to the vulnerability database. And then you can also do non-standard things like for example if you want to test for weak SSH credentials, if you do

it with Tsunami for example or one of the industry scanners, chances are you will be throttled very quickly. I think we tried scanning for weak SSH passwords and after like four or five attempts we get throttled by the server. So since we're running on host, with this scanner we can simply implement an extractor which looks at the shadow file and tries to break the hashes or like compare the most commonly used passwords with hashes on host without being throttled. So it's very extensible. It will be released very soon, I promise. It runs on Linux. It's recently been added support to Windows. not only running on Windows but also extracting and testing for some Windows specific

things and Mac OS support is work in progress. And the testbeds, as I've already mentioned, whatever we test for we want to have a testbed because if it's not tested properly then we assume it just doesn't work. So the first one, probably the most popular firing range which has become industry standard for web security scanners. All the vendors claim, "Hey, we find all the vulnerabilities from firing range." The security crawl maze, which is a testbed for web security crawlers. And the last one, which is like the very generic security testbeds, which contains most of the vulnerable containers that are used for Tsunami. And then just a single word about zero days. So I mentioned we mostly focus

on one day vulnerabilities, but also in some cases we do zero days and I have to say a few words because that's what I've been working for the last five years. This is the web security scanner and zero days are for example cross-site scriptings. So we have a, I think some people even filed for a patent in the US, the US patent office for detecting Angular XSS, I think. So it's again very high quality. And it's actually a product in GCP. So if you're running on GCP and you have a web server, you can be scanned by the web security scanner. So we have the inventory. We have scanned all these things. And now we do the remediation. And the

real business value of our team is fixing bugs. It's not only finding the bugs. but fixing them. And at Google scale we can't fix the bugs, we have to rely on other people fixing bugs in their own infrastructure. And so again, Broccoli Man comes in and says "Hey, please fix your bugs, it's easy, right?" Well, yeah, turns out it's not that easy, because if you have a company of 100,000 people, or close to 200,000, you really need to make sure that whatever you send to the people is clear and people can react upon the bug. And what we found over the years is that the clear story gets the bug fixed, so the clear description,

reproduction instructions and remediation, how you can fix it, makes sure that people are going to fix the bugs without requiring human support.

So I talked about the false positive multiple times and this is why it's very important, especially when you work at Google scale. Let's take an example of a very, very, very precise detector which has a 0.001% false positive rate, which is very low by the way. We have around 40,000 detectors and I think that's roughly the same what Christian showed on the charts. We don't run all of them because most of them are just irrelevant or like informational findings that we don't care about. Let's take the ones that are interesting, so around 7,000 that we use Alphabet and let's assume we have half a million assets of one type scanned per day, which is not so unrealistic given the

company of our size. We get 35,000 false positives per day that somebody needs to look at. Sorry, there is a typo on the slide. Where? Like at the bottom. You assume that you have 500,000 and then... Ah, yes, 600,000. Yes. Oh yeah, it's tens of thousands of false positives per day. So how are we going to do the backfiling? Manual review. Yay! No, it doesn't work. So... Sugar. They give you a great boost in the short term, but ruin your teeth and body in the long term. And it's very tempting to do the initial triage of all the false positives manually. Turns out it just doesn't scale. So we have a team that does manual reviews of the vulnerability trends. Let's say we found a thousand machines with

this vulnerability. What is the CV even? Is it exploitable? So they do a little bit of research. And then they do cannering where they send, let's say, five bugs to random teams and they see how do people react to this bug. If five teams say it's false positive, then we don't file any of them. If people react, "Okay, this is actually a bug," then we keep filing and keep filing. So yeah, this is why it's really important to have the false positive rate as low as possible. And this is why we invest so much in the high quality scanners. that basically don't have any false positives. And we even have, for example, the web security scanner. Somebody comes with an idea of, "Okay, let's scan for

broad light pollution," or whatever. And we ask them, "Okay, can we do it false positive free?" If they say no, then we say, "Okay, we're just not going to scan for it." So this is the trade-off we have to make. So, closing talk, the hacking stuff requires complex attacks because lane bugs are fixed. This is absolutely not true. Most, as I already mentioned, most of the attacks happen because of lane bugs that are out there. So-called low-hanging fruits. People scan for things all the time, exposed interfaces, default passwords. This is where most of the security incidents happen. And vulnerability management is a solved problem. And yeah, maybe it is for small companies. For large companies it's definitely not.

As we progress, as we move forward, there are so many more and more different areas for scanning. There was blockchain, now we have AI. Every trend brings new possibilities for scanning. So we do secret scanning in code repositories. And by the way, when I joined the team, we only did network scanning. That was like, yeah, let's just do the network scanning and a little bit of web scanning. And now we have like 20 different programs for active dependency scanning, for subdomain takeover, we have custom scanners for custom environments, and we scan for secrets and code repositories and a bunch of other things. Yeah, so I hope it was interesting. Thank you very much. If you have any questions, you can either ask

now or later, or send me an email or a message on x/twitter. And yeah, thank you very much.

Yes? So with the steps, you didn't mention anything about prioritization. So I'm curious to see how that fits into some of the steps there. Because when dealing with a lot of development teams, they've got a lot on their plate. They just want to have new features coming out, spending as little possible time on fixing stuff. So in order to influence them to remediate things, you have to show them, hey, this actually matters. How do you guys approach the prioritization there? Yes. So very often, it's a political game. happening on a higher level between VPs and directors. So normally when we report bugs to teams which are not trivial to fix, they are willing to fix it. They are willing to postpone the future development because

security is important. In some cases it doesn't work. When it doesn't work, we involve our secret weapon which is the red team. and we tell them "hey, we found this bug, the team doesn't want to fix it" just show them that you can exploit it and the risk is actually pretty high and this happens sometimes and then once we have the red team report saying "hey, we've actually used this vulnerability to do this and that" internally then everybody wants to fix I guess my question is more like the sense of the automation right if you're having multiple vulnerability sources coming in because I used to lead a vulnerability management team at Fastly a CDM company and we had like several vulnerability sources automated scanners even like they

have various types of vulnerabilities going in a single point of view that we're looking at And how are you able to prioritize different types of vulnerabilities automatically so that then they're showing up prioritizing a single API? Is that something that you're working on? So it's a different sub-part of our team. What I can tell you is that we have a team that owns the security programs. So let's say we have a team that does network scanning of cloud VMs. And they are the people who know their inventory best, who know what scanners we have and what kind of vulnerabilities they detect. And then they try to work out the bug-filing rules. So they know, okay, if this scanner finds this vulnerability, then

I'm going to make sure it's filed. I don't think we have too many scanners that have overlapping detection capabilities. If they do, then there's probably a cooperation between the teams to make sure, okay, we only file bugs from this scanner of this type and just not to overwhelm people with the same thing. But at least I don't think I've seen this case where we have multiple scanners filing the same vulnerability. But this is a very, very hard problem to solve, especially if you have hundreds of findings coming in per day. Hundreds of thousands. Yeah, 95% are just irrelevant. Some of them are relevant, but then you have to look through them and what they are, are they really relevant, what is the risk? I've

got another question if anyone else, I don't want to take somebody else's time. So you mentioned that you handed off that individual teams are responsible for letting you know that, hey, we've got these assets. How is that working out for you? Because especially if there's, let's say, a couple teams that maybe co-own some sort of, let's say like code set, whatever, let's call it an asset. If one team reports it, like, hey, we've got this, and then another team reports it, hey, we've got this, or they directly enter into the ITSM, whatever, asset management inventory, you potentially get overlaps. And then, you know, you don't have, you have redundancy, or you shouldn't. So, or duplicates, right? How do you avoid that? So I think again this is

a field that is owned by the security programs team, where they know what assets we have and what we want to scan for. So basically we have the inventory with a specific asset type and one asset type can only be owned by one team. At the end of the day they do their research and they're like okay workstation can be represented in three different systems So let's make sure we have the full coverage by ingesting one of them. And then we have like one asset type and we know, okay, so this way we can identify all the workstations and we only scan these things, right? So it's us who approach the teams and we say, hey, please give us your inventory and not like, hey, distribute. And everybody suddenly

gets their inventory into our system. Yeah. So you're still the one that's entering it in? They're not the ones who are just like entering in their own data into like some asset management system? I mean they do but it's still like they work with us and we tell them what to do exactly to make sure that we have the right inventory. Okay so I'm just wondering how does this Tsunami security scanner cope with containerized environments because for example you can have a vulnerable java running on some server and on the same host you have multiple containers who have multiple different versions of this software for example java and yeah how does this cope with it i mean so first of all

tsunami is a network scanner So it looks at your machine as in like a server, like probably an IP address. And what it does is like the usual network scanning thing where we run nmap for port discovery and it really doesn't matter what's running on the host, like you might have different containers running on different ports serving like Python, Java applications and a bunch of other things, or like SSH and... So we do like the black box approach where we say, we do the fingerprint, we see, okay, this is running Jenkins version blah blah blah, and this is running WordPress version blah blah blah, and we run specific detectors on these ports. - Okay, so it doesn't try to log into the-- - No,

no. - Like, Nestle store? - No, no, no. This is a complete black box. - Got one more question if nobody else wants to go. An interesting thing that we started looking at a year ago was something called OS Query as a part of FleetDM, which basically, for those of you who don't know it, you can look at the conditions of vulnerability and figure out whether or not you're vulnerable. It's a Boolean sort of check in real time, are you vulnerable or not. Which is nice because sometimes CVEs don't come out, NVD is behind on even issuing whatever the CVEs. And so we're trying to look at the conditions of exploitability actually there. Is that

something that you guys are looking at too for the automation? I've seen other organizations that do it and it seems to be really impactful in a positive way. Is it the OS query from Meta? Yeah. Or you can query your system as you like to ask? Yeah, so you can basically-- so the difficulty we had was, for example, vulnerabilities that were kernel level that affected anything that was a Linux asset. We could look based on just certain versioning of something that was running specifically on that specific Linux asset. I don't know. It was service-based vulnerabilities. we were able to find that we were vulnerable to certain vulnerabilities before there was even the CVE that was added

to the database, right? Especially since a lot of organizations, like how does Google get ahead of some of these vulnerabilities before they get publicly released? Well, they get early access via like VINs or whatever. So we have to take a look at these vulnerabilities before they have a CVE applied to them and see if we're vulnerable. So that's the OS Query that helps with that. But I'm curious if that's also something that... - Yeah, I mean, so first of all, I don't know much about OS Query. I just know the high level idea. But what I think, I mean, I'm sure we are running agents on our machines. These agents are able to do exactly

the same thing as OS Query. and one of the things that I mentioned that I can't say the name yet, it's the file system scanner which can probably check for some of these conditions like are you using this software package, do you have this config flag? And then yes, we have the private vulnerability feeds which we get from our vendors or maybe from internal research teams and they say "please check for this thing" and then we write a detector internally and then that's how we can check. And also when we do the regular scan we also in many cases create an inventory. So let's say there is a vulnerability in software X we can very quickly query "do we have software X running anywhere in

our fleet?" If it is running, then what are the versions that are running? What are the hosts that are running this software? So I think that sounds like something similar just like with OS query. One question for the Google Cloud, because recently in the Google Cloud for Kubernetes there is the security posture tool, which gives insight into your own clusters, but the tool, it's called, I think, Image Extractor for the containers running there is pretty limited. Can we expect that this kind of tool will be included in the security posture soon? So first of all, I don't know much about Google Cloud because that's like a completely different part of the company. But you run your Kubernetes clusters for your scanners, right? Yes, yes,

but these are like the internal Kubernetes clusters, yes. You probably refer to GKE. So what I can tell you is that one of the previous scanners, LocalToast, it's shipped by default with all the containers which are running the cost container optimized OS. I think... The scanner, which we will release very soon, is run as part of container scans on GCP. So if your containers or VMs are running on GCP and are enabling the container scans, or I don't even know how it's called in GCP, then yes, we will scan for it. I know there's an internal name for this whole project of scanning VMs and container images and containers. But I don't think it will tell you much. But yes, we work very closely,

especially with the new file system extract, file system scanners with Google Cloud. And we try to-- first of all, we run it for internal security. But as soon as we have it ready for us, we also try to I give it to GCP, to Google Hub, so that they can use it for their customers. The security posture as a tool for the customer is pretty big inside. If you see that you have 20,000 bugs in your containers, that you can do something and next week you can see, oh, now I have only 10,000, so I did some progress. Yes. And I can just tell you, I'm currently working on a project which... Yeah, we work very closely with Google Cloud to make sure that you're running, that you're

using the right code repositories. I think there is this artifact registry that assure open source, which is like a mirror of all the repositories where you can monitor what repositories you're using for building your systems. So yes, we are now working a lot with Google Cloud, so you can probably expect some improvements. Okay, I think we're well over time. Are we? Two minutes? Two minutes, okay, that's great. Is Kamil Plankowicz in the room? I am, I am. Good morning. Good morning. I'm just going to pick up the phone. Wait a moment. We're going.

- The microphone Ryzen Kio Pro. Is it the one that should be? - No, not this one. - Wireless microphone? - I don't know how to choose the one that you have. - Oh, okay, it was from the camera then. - It's also about the one that is here. - This one is here. - I'll put it here. - Look, now you see it. - Wait, because... wireless you know what, he didn't detect it for me yeah, maybe look, I think he's on the couch I can possibly control the device no, no, no, he's just staring, but it's not a problem you should just detect him It doesn't detect it for some reason. Show me your microphone.

Here it is. Wireless, the one with the raise and pull. Let's take it off. Look now. Okay, so this is the wireless one. Okay, I'll just cut you off. Oh, I cut myself because I thought it was not the one and I cut myself on camera. That's why the sound is so loud. Okay. Okay. It has to be magnetically connected. I won't press you, you'll see. I don't want him to take it away. Perfect. Already? Can you hear me? Great, great, I don't have to hold it. Great. So I'm catching it. Great. Did you hear? Hold on to the chair. Good, but it's much better than that. I hate microphones like that. I think we can

wait a minute. Unless the question is whether this will still turn on. Why? No, I already have a microphone. And how about the stream? I have a loud voice, so I think I can handle it with this voice. This is also a joke at the beginning. I don't remember which one, but I had a microphone on one of the speakers. Unfortunately, I can't get it because I hold the clicker, I hold the microphone and the whole presentation is only in my voice, and the room was for 300 people. And no one said they couldn't hear me, so I think I can handle it. I can handle it today. I hope everyone has arrived. My presentation

will not be interesting, it is boring. Basically, this noda is divided into two parts, just like the axis of this presentation. First part concerns how to use automated tests in conjunction with the widely understood artificial intelligence. Now, I will use artificial intelligence, machine learning, AI interchangeably. This presentation is not very detailed, it is rather cross-cutting and aims to show some paths, directions in the industry, rather than go into some topic in depth. Of course, with the note that we are entering the subject of testing automatically, but I will not talk much about any projects here, I will not show K-Study. The presentation is also very richly rooted, so there are about 45-50 sources in the lower

right corner. You can click. And as I said before with my colleague, probably discussing all this and what is changing took me three years of continuous new presentations, so I don't want to do it. Slides. Slides are available under this address. They are on Google, so you can take pictures, they will be available, you can download them. This link will be at the end of the presentation, so if someone doesn't have time now, because it will come later, they are also able to to watch and remind yourself whether the link to you ... asks for the password? Then you have to enter the password at best twice. And the credit card too. And the card too. If

it asks for a password, it's bad, I'll correct it, but it shouldn't. It worked for me. I tested it on incognito mode, so it should enter Google without a password. Good. So the topic of AI, security, AI, machine learning, security done by the help of... I'll stand a little to the side because it's shining. It is very, very, very, very fashionable. Everyone now makes AI, uses Co-Pilot, Chat GPT and other models. The topic is quite old. I remember that I was talking about the threats to the overall artificial intelligence of models also at B-Sides in 2018. So six years ago I was already showing practical examples of attacks on such models. I remember, I really liked it. I really liked such an example.

The Swiss scientists, few people know that Switzerland is, so to speak, a very rich producer of aircraft, including disposable jet fighters. The Swiss scientists have talked to one company that we will make you a model that will automatically land on your plane. And it worked out, there was a movie about how cool this plane landed. It was an avionetka, of course, the language of the purists in aviation is Jesus, because it should be an ultra-light plane. However, it also shows that what we see in cars, in cars, in planes, it's not a new thing. And all the hype that is now, in my opinion, is a hype delayed by 6 years. I'm not saying that I was, so to

speak, a leap forward with my interests, let's call it research, but in fact it is not a new topic. And if you go deeper and deeper into it, a bigger and bigger rabbit hole is created, which can be seen in the meme included in this presentation. I will not throw memes with a nose-hatch in my presentation anymore, but it's a pity. Who am I? After this long introduction, two things that are worth knowing about me. I think about offensive security, I look for liability. I am the author of the book "Hacking of artificial intelligence" published by PWN, where I talk about a part of the concept, which is also different in this presentation, but in the second part. The first part

will deal with assisting and attacking using artificial intelligence. The second part will deal with attacking artificial intelligence using artificial intelligence and language models. To get to the point, this presentation originally comes from the presentation on Confidence, which was a year ago. I also encourage you to take a picture of this, because it's like a part of the slides is repeated, but one comes out of the other. And some concepts look nice when they change, different approaches, statistics, things like that. It's also worth getting to know each other, so it's like two in one. Please take a picture and we're off to the next one. with the main flow of this presentation. If we look at how LLMs are used now, or

other components of artificial intelligence in combination with security solutions, then we can divide it into these types of I don't know what the name of the chart is, maybe not the chart, the tree, yes, the tree is good and in this presentation we will cover mainly, mainly focus on the first one, on testing, but testing using models and then testing models and a little bit on I highly recommend this source document. It is very comprehensive, it has about 100 pages, where scientists, probably Japanese, somehow far Asia from our perspective, just look at the market and look at how it is done now. This is also for such a context, for how today Sellers who come with security solutions and want to sell you a

revolutionary product that will make your company compliant with everything else, with artificial intelligence. This is not new and it already exists and someone has already described it. Why connect with automated tests, specifically with phasing? Because testing in an automated way is very effective. The best team in the world, probably, Google Project Zero, although when I talked to Mateusz, a few people escaped, but they still present a very high standard in the field of research of various systems. About 40% of the findings, and these are old statistics, are done using automatic methods. It's about security limitations, so it's like, firstly, wow, that so much, secondly, 40% of the working time, if you look at it in the way that we have two days of work, we have free, it's

really like a three-day week of work, it's really a very interesting thing. More or less, this is how the statistics come out. To secure as a Google system, in general, in this presentation, too, you have to forgive me for talking a lot about Google, but they are like They do a lot of state-of-art things in terms of automation and testing, because they have a lot of it. They took an initiative called OS FAS and are testing the most popular open source projects, which they use in their codebase, in Corpo. Of course, everything is automatic, they only need a very large number of servers, although you can manipulate it too. What's next? As we can see, the statistics

are connected, unfortunately, it is impossible to generate full statistics from their systems. Most reports are public, you can read them. However, from their statistics, it turns out that until yesterday, until the day before yesterday, I'm sorry, until the day before yesterday, I managed to find automatically 11,220 security charges, because they also catch common errors, such as when the program does not work, but generates bad output. These are the taxes that had an impact on security. The project has been going on since the end of 2016, so it is not new either. However, such a large number, which was only found, of course, people had a lot to do with it, but found using machines is really wow. And this is of course the era before all LLMs, before,

let's call it, tuning. tuning it in a way, so to speak, machine. But what is it for? Because it's like testing, LLM, automation. What is it for? These are also old statistics. Unfortunately, very few people do it very often and very large cover statistics. But you can estimate yourself based on the fact that About 84% of the software currently in use in the world has open source components somewhere in it. It's a lot. Another thing is that nobody has tested open source software until Google took care of it. Maintenance also, if you are a developer, rather about security, you do not have much knowledge or you can't place it in a wider context. I'm not saying

everyone, but it seems to me that developers are paid for writing code, for writing good code that functions functionally. In the last few years, it has been a better emphasis on the security of this code. However, let's say I started testing it around October, September 2016. It was wild west. There were projects where starting the phaser gave five errors in five minutes. And without even going into the quality of this testing, something just turned on, it worked for 5 minutes and suddenly there were 5-7 crashes. Now it's much better, fortunately. However, there are still places where the security is a black hole. And what is important in this statistic, that for 10 years So, from 2010 to 2020, the

number of found safety and safety taxes and approved CVs has increased fourfold. My computer has blocked itself, I talked too much. It will work. Yes, I forgot to turn on the amphetamine. Okay, let's move on. The volume of errors found in open source projects has increased fourfold. And of course, the trend is that we use more open source than less. For example, I can't imagine that someone writes in their own TensorFlow or some deep learning libraries. These are very, very small, very rare cases and I keep it, of course. However, in fact, they happen sometimes. However, most businesses in the world and developers will use the ready-made open source code. It is worth remembering this.

The testing with AI was even interesting for DARPA. DARPA in 2016 also did the Grand Cyber Challenge, i.e. machines attacked machines and machines exploited themselves, fixed errors in the code. Of course, it was not fully 32 bits, only the space was limited. It was probably 8-bit programs, but it's still wow. that something like this is financed from the money of an American taxpayer. However, now a competition has been announced, I think there were previous results, using the same thing, i.e. attacking, looting programs, using AI methods. A lot of things from this are public, also code, so it's worth looking at how it's done, really, the heads are really above it. And you can't forget about

developers during testing. Unfortunately, you have to remember that we have common goals, but we speak different languages. And it shows very well why we can't get along. Of course, it's much better now than it was, for example, four years ago or six years ago. It rarely happens that someone does not accept this approach, that you report someone's mistake, someone says that I will call you because you are burying me in the code. There is one such big company, a big company that makes trains, that does something like that, but this is not my presentation, I have nothing to do with it. Yes, it is worth talking to developers, because in general, the approach, at least in Google,

is tested a lot, is that developers write themselves. Of course, security guards help these people, but developers write code themselves, which tests their code. So, so to speak, it is a basic job, it is worth doing it. We also had a little problem with people. Before May, GitHub announced that it would expand its copilot, something called Copilot Workspace, and you can enter a prompt that will generate the entire project for us. We enter a prompt and it wants such and such functionalities, make me such and such API. and it is worth getting familiar with this progress. It generates files with code in turn, displays the functionalities in this code and clicks that it wants this functionality, this functionality is correct, this is OK, I

accept. And at this point, how to test it? Who will take responsibility for this? These are very interesting problems that no one is yet leaning against or very few people are leaning against. Now with Copilot, the code created using Copilot, Of course, there are no very large expanses to this code, that someone says that this code is stolen from me, GitHub trained on my code. However, I think that The larger the scale, the smarter the artificial intelligence and various assistants, the more difficult it will be. Here it is also shown, somewhere here it is with mice and shows which code is responsible for what, for which functionality that is described there. So it's really, really thick. It is not yet available

even for paying users of Co-Pilot. Even for Enterprise, it is not available, but it is worth looking at and seeing how it works, because I think it is an interesting topic. After this very long introduction, we go further. Code generation. A simple matter, we talked about these copilots, I also talked about the fact that Google encourages developers to write tests that test their code. But why write when we can put it on the machine? It does quite well. Here are the statistics of errors that were made during this writing. It is interesting that there were many fewer errors than errors made by people who write tests for OS FAS. It's worth taking a look at. I don't know if this project is developed or

not, but even academic implementations that usually do not use a large codebase or that use some limited ones, as if, well, few companies in the world can compare with Google, Microsoft or other companies in terms of their data and the possibility of training them, but that, so to speak, like Dawid with Goliath, you can train better and write code better than Google, it's very cool. I think we will come to such, this is my hypothesis, we will see if it will work with such very large models as GPT or those that are general and that are enough for an average user to very specialized models that are used to write a specific code, for example,

to write tests only, because they really have a very nice performance. Of course, there may be Google in the next slide, and the fact that they were jealous of OS Faza, so they added their component that generates their code. and it has very nice results because the code coverage increases from 1.5% to 31% so we managed to find such a dead point, you can call it in the code, which was not covered by a human on the basis of the XML secret project, this is exactly 31% and again there was a regression test, the CV 2022-3602 was found So it's nice that we're already touching problems that are, first of all, real, that have existed in the code. Over time, we are able to find them in a different

way, in a different way. And they are not old, because if we found Buffer Overflow in 1999, it would be a bit weak. But here it is quite a real problem. It works quite well. Here is the coverage that has been achieved, the growth of coverage that has been achieved with the help of this Google model. It looks nice, from 11% to 30%. Although something has gone wrong here, because there was 31%, here 29%, 84%. But this is 1% to survive. Other projects. different approaches. So, suppose we use two LLMs. We put such an LLM, which first draws us interesting features from the code, and then generates test cases that touch this code. It works interestingly. I recommend this

project in general, if someone would like to start playing with what I'm talking about, this project has absolutely everything publicly available, so you can download a model, download test data, download errors that were found in these projects that you can see on the slide. It's really cool because you can do something yourself, of course, you need to have the right equipment. Another approach is to improve the cases of test cases instead of generating existing ones. It also looks nice, because you can see clearly that there are a lot of such projects. Here I just had to show six. You can clearly see that some of them are very nicely cut off. This line generated by AI, machine learning, and

what is found in State of Art, Fuzzer, or AFL. Some projects are a bit weak, but here are the first tests, this is the first version. This is also available on GitHub, I recommend checking it out. There can be no shortage of static code and LLM analysis. It was decided to study OWASP Top10. You can see the differences. I mean, you can see, you can't see, it's hard to see on this slide, but the differences are quite significant. Between how it looks like using SonarCube, and, for example, using GPT-4 Turbo, even 30% difference. Cool, cool. And then we analyze the code statically. LLM doesn't touch it, it doesn't do it in any way, so it's really cool. This context is very much expanded in

relation to the ordinary code analyzer. A lot of fun, a lot of cool things that will revolutionize our future super cool. We won't do anything at all. We'll be working three days a week. Well, it's not so good. Benchmarks are also made, what is the expected code increase, what are the expected errors found. However, it is not so rosy if it starts in real life. Here we have the expected code increase, and here we have what we got. You can clearly see that these heatmaps are very different, unfortunately, because we have practically all these stripes not covered. which unfortunately shows that despite the nice initial effects, we still have to put a lot of work into it to manage to do it in the

way we expect, i.e. up here. However, it seems that the direction is good. Additionally, tests were carried out outside the code cover, errors were found, and it doesn't look good either. In practically every test, researchers found that phasers with an LLM component available in a given time on the market, open source, did not do much better than standard. I agree with my thesis that we need specialized models, small specialized models in specific functionalities, which will allow us to work even better than general, heavy, general ones that serve other purposes. Another help from LLM, also from Google, Project Naptime. I posted this slide last night because the topic is very fresh. with the help of the project, with Buffer Overflow and with

the use of various LLM models. I often say this in my presentations, but I will also recognize real fans. How many years do we know that there is something like Buffer Overflow? Well, such estimates. This question to wake up those who... No, no. Cold, cold, cold. Cold. More precisely, in 1972, the description of an engineer, a scientist working for the American Air Force, who warned that in the software of the fighters, which are used in combat, the code is not fully written in a way, so to speak, according to the standards and errors such as buffer overflow may occur, which can, so to speak, ground the hunter. However, back then the hunter was a little more analogous

than it is today, so it was possible to control it, control it analogously. Unfortunately, not now, so we know 52 years about buffer overflow errors, so far we haven't dealt with them. This publication, because this is a document that is public, It's worth reading what was the problem there and what problems we have now with buffer overflow. It's crazy, crazy, it's probably current. Moving on. Advanced memory corruption that was supported by ASAN. ASAN for people who are not engaged in software engineering hardcore, it is a framework, of course, from Google, which allows you to search for memory corruption errors in programs. It compiles with this program and if it gets to a memory that doesn't even make a program,

it is detected because there are guards at each buffer. You can clearly see that it does a lot better. However, the jolt of fate, which if we look here and look at Sorry. Let's look at the models from Google here. They are much worse than the ones from Microsoft. And I will also show this on future slides. But if you have a choice between GPT and Gemini, don't worry about GPT. In the case of software engineering, back hunting, there is absolutely no way to Gemini doesn't have a GPT drive, even in these Pro, Ultra or Enterprise versions, there is a lot of it. Here is a standard drawing, you can clearly see how GPT-4 cuts off. Very

nice, these are Advanced Memory Corruption, so the difference is really significant. There are also other approaches that are interesting, but require a lot of work, including taking a database of the vulnerabilities, analyzing it with LLM and then sticking it to the code we have and a little static, it's a little bit like Sonar Cube on steroids. We're looking to see if we have something like that. I don't mind success, but if we actually see, for example, in you this one, where, as I say, a specialized tiny model, it can work. In the case of such large programs as, for example, TensorFlow or some deep learning frameworks, not necessarily. Removing blockers, nothing interesting, we will just

sometimes get stuck with our work, it happens to everyone and in a way we can benchmark our program and see what is wrong to improve and make it work better. Okay, I've already said. a little earlier about this LLM, which one to choose. However, here, to confirm my words, the altman code itself was just released before GPT-4.0, so there is no here, here it should be the best, this is just GPT-4.0. In terms of reasoning of the prompts he received, I recommend the source, because I can't explain some of the things yet, what's going on there, but a lot is going on and a lot of interesting things are happening. You just have to know

that GPT is doing better than Gemini and in fact it's nothing surprising, looking at how it looks. Similarly, in different prompts or in some mathematical calculations, it can be clearly seen that there are differences, not so significant, but there are differences in the cost of GPT. So it is also worth remembering in the case when we would like to use it commercially. However, I don't know if it's corporate policy or something, in the case of the project used by OSFAS, The errors were found mainly by Vertex AI, i.e. there is a cover on Gemini, which is in their cloud, and that's how it was found. Probably yes, maybe it's because they just used more Vertex than GPT. The OS Fuzz project can also

talk with GPT. I haven't benchmarked how it looks in the context of Gemini, but I think that in the case of what I showed, the case is rather obvious. Well, let's move on to the second part. Maybe it will be more interesting. Why attack LLM at all? Some reason we have. I showed on the slide, but what use case do you find for attacking LLM? Why would you want to attack such a co-pilot, for example? Exactly. And now, in the context of what I said earlier about the Overflow buffer, which has been around for 50 years, assuming that the problem space of all the errors that we can find in the code, which is generated manually by developers, closes in

this red circle, then in the case of the code that is generated by machine and in general, as if the so-called software 2.0, i.e. these are all large models and as if they are customizing themselves, the problem area is much larger and new challenges arise from it. And now, if we take that, as if 52 years we cannot fight buffer overflow, which is quite easy to fight. We have tools, we have knowledge, it is not like that. We know, as if to a wide Auditorium in the 90s on the backtrack, when it appeared Aleph One, he lived 96 years, if I remember correctly, so almost 30 years In the case of errors that we can

find, everyone has mentioned well, but there are still a lot of errors that we do not know and a lot of standard behavior that can appear. In addition, it should be added that the model is just a cover, i.e. the top of the pyramid, and under it we have the entire TensorFlow programming platform, for example, libraries that are used, i.e. Numpy and other things, an operating system that also has errors, and specialized equipment in machine learning. I think no one has tested, for example, micro-codes of NVIDIA processors for machine learning. But it could be a very interesting research. Maybe someone has already tested it. I don't know. How much can you buy at all, because now there is such a shit on this device that you

can't order it. But assuming that we have such equipment, the list of problems we have on the way, safety is significant. Now, considering this model, which is very large, but not considering the rest of the pyramid, which is also quite a lot, we have a big problem, despite the fact that the model is only a cover, and the rest we already know well from software 1.0, which was presented here. What's more, as I mentioned earlier, as in the case of Buffer Overflow, not all In the case of memory corruption problems, not all problems cause an emergency end of the program. The same is true for models. Some models are soft models that will not fail us, but will generate bad results, which is

much more dangerous, because if the model fails, we know that there is actually a problem and something is not working, something has happened that has failed, we have to set it, we have to find a reason. In the case of soft errors, Unfortunately, I don't know if during my studies or during any formal education, we were playing with numerical methods, for example, how to round up the number, variable and decimal numbers. This is a fascinating topic in general. Until I had numerical methods in my studies, I didn't know that it could be such a problem. However, in the context of highly precise calculations, for the needs of AI machine learning, it makes a lot of difference. And I will tell you about it. It also makes

me very sad that in the case of software platforms, very few people do research on TensorFlow, although it has improved and now it looks like this. We have 43 pages of errors found in TensorFlow. In the past, it looked like this, that we had only one table In addition, errors are most often found in research by AI. China and Asia invest in it. And the majority of the errors found in security so far, until recently, were Chinese. Now it's much better. It has been globalized. and I find a lot of errors. However, there are a lot of problems. There are 43 pages on GitHub, each with 20, let's say 10. A lot of these errors are made. There are still

a lot of these errors, because it is a huge codebase. One of the largest on GitHub, I think. It is worth remembering if you are going to take this surfer to your projects. And coming to the merit. Launching various implementation algorithms for CPU or GPU can sometimes give different results. We may even be completely unaware of this. However, it is really worth checking your model. There are projects that use LLMs, test LLMs or machine learning frameworks. which catch such inaccuracies in the range of calculations, too little precision of calculations, or any checks that are thrown away, about which we do not know that they were thrown away. 41 errors, that is, in general 65 errors found, but 53 effectively confirmed. A very

interesting project. Few people will pay attention to it yet, to test it like this, because rather, as you said, data extraction, model poisoning and influence on their decisions. However, such groundless errors as small precision of calculations that can break this model are completely, are rather omitted. I have not heard at least about some large model research, just in terms of which, which, or frameworks, where, where calculations can be dug up. Moving on, these are already the soft mistakes, these are the mistakes that are automation, for example, prompts and j-breaking. Very, very cool things, interesting. These questions are quite, quite, quite tendentious. For example, here, how to build a bomb, where we make a bomb, for example, in ASCII, ASCII Artem. Interesting, some models

are not resistant to it, some are already. We'll see how it will be. But every day I see some new prompts that show that GPT is catching some stupidity there and just shows how to sell cocaine there or show how to commit a crime in the White Coal District. I still don't hear about the VAT being deducted using the GPT rate, but maybe they will appear. a lot of projects, a lot of interesting projects with cool things. However, here are very tendentious questions, very such that should be caught at the very beginning. It is worth it if you build a model that will be used by users, it is worth even this list of questions, not necessarily automatically,

to put it in yourself, check it, or maybe our model actually gives advice on how to distribute banned materials and not only, this is also in terms of developing our own models. Let's move on. Classic tests, i.e. checking these ground errors, my precision errors, of empty values, not only, but clearly everything 135 errors in different frameworks. This is very interesting and I encourage in general, if it is not yet the end of the presentation, I encourage you to test these frameworks in general, if you use them, if you are even going to use them someday, but not now. It is worth testing them, because there are still a lot of low-hanging fruits. in this kind of programming and you can

find nice errors, basically not exerting too much. What's next? It went quickly, I still have some time, so maybe we'll talk a little at the end. What's next? How will the future look like? It's so bad, it doesn't work, errors are found automatically, three days of work is enough. So what are we going to do? First of all, better integration of tools with LLMs. So far, as far as the largest implementations are, such as Copilot, JetBrains AI or other models, these security tools are not yet fully integrated or almost at all. in such a cool way that we don't have to write prompts ourselves, but it's happening somewhere underground, we can do our job. Specialized models

are what I also hear all the time in my presentations, which are only concerned with what they are created for, they are not general, they are not very well developed. The greater the coverage of the class of dependencies found, this is obvious, we don't want to look for buffer overflow, although it is worth it, we also want to find both my-precision dependencies, both dependencies that give us business logic, but not quite our program. And lowering the entry threshold, I think that First of all, to make developers interested in the topic, but also security guards. Unfortunately, if these tools are difficult, they will be difficult to deploy, they will require a lot of effort from us, despite the fact that

they can give cool effects, then however, few will decide to implement this type of solution. It would be best if it was a Docker file or Docker Compose and it works. When it comes to attacking and defending LLMs, we have a whole list here of what is worth and what is not worth. What to do in terms of attack, what to do in terms of defense. However, it is more or less covered with what I wrote. that is, expanding the methods, attacking methods, new methods, covering new classes of vulnerability, dynamic attack methods with different targets. So it's more or less about it. Defense is interesting. Nevertheless, I am a man of offensive and security and I have not eaten my teeth in defense. More on the pass.

Nevertheless, it's nice that he looks at it from both sides, inviting both sides and his mother. this type of development for what will be next, what is worth doing next and at what stage it is worth getting interested in, because maybe in the future it will be an interesting research area. We are getting close to the end. The most important slide. If one slide is worth remembering, it is primarily the automation of your work. This is waiting for us. It is not worth getting offended by it. AI will not take our work away. It is very efficient and efficient. I like to use Copilot, I often use other tools, GPT, Gemini, so it's worth testing the code.

As I said, a very nice area is libraries, machine learning frameworks, testing phasers as well. because very few people do it. Open Source is growing, the number of found liabilities is growing, and as you can see from the statistics of the CVE from MITRE every year, there are more and more of these errors. We are already exceeding 40,000 or 50,000 registered, so there is a lot of it. And using GPT, not Gemini, because until Google puts more money, more capital, intellectual capital, its employees, this model, you can use it, but what is this use? Paraphrasing the classic. I will show you the slides again, if someone hasn't done it before or is late for my presentation.

I saw two or three people late, so it's worth taking a picture of the address for the slides. Additional context, how it changed over the year, which concepts have evolved, Yes, because this year there was a presentation. This presentation has two parts, one from Confidence, which was a year ago, and the second, which was on Confidence this year and on Besides. So this is additional context. Are there any questions? I asked you during the presentation, now you can ask me. And contact me. It is also worth returning if the questions were born after the presentation, because unfortunately I cannot be at the next part of the conference, so So I will gladly answer by e-mail or on social media or in

another way, as you catch me there. I am rather on the Internet now, so I think that targeting me is not a problem. Are there any questions for me? They can be general, not necessarily on the subject of the presentation. I'm listening. As you said, there are many places where you can use models.

um It's about a bug door on the previous stage, right? For example, something like that? I understand well? Modification is done in the model itself, not around the model. Okay. I don't know such an attack, but I think that if someone had carried out such an attack, we probably wouldn't hear about it in public. So it's more like an APT state sponsor, these are really very thick topics. I haven't heard anything practical. In general, there is little to hear about practical attacks. Rather, the hype that is still there regarding AI causes us to fly on this wave of optimism. There are few news about something being wrong or something not working out, most often in boring academic papers.

So I didn't hear it. Maybe something will appear. But as I say, for me it is already a level of such really very thick topics. Basically, it was just doing the rules to steal something or to bugger something. For example, recently I laughed because there was something on Twitter in the style that some account, the credits of the game have ended. and the prompt was: "Hvala Trumpa" in English and put it on Twitter or something. And it went as a tweet, right? Debug log from the GPT chat. I think that someone who thinks that NSA or DIA or other three-letter American agencies do not have feed from Russian-language chat GPT. I think that such very thick attacks, we will not find out about them, or we

will find out how to set up a GPT chat. And then the company will come out to save its Let's say, it's Microsoft, so effectively, so I don't know if he would be able to hit it somewhere in some American three-letter agency. However, it's just too big of an attack. And practical implementation of this requires a lot of resources, a lot of knowledge, access to people who have such knowledge, such people in the world are probably three people. So, unfortunately, we will not find out, but maybe in the future. But the question is very interesting. In this context, I never thought about it in this way, rather at this last stage, i.e. testing, training, and not

modifying models, not academic. So that's good. I'm listening. I have a question. Let's say we have a programmer who wants to increase the coverage of his code with this type of tools, but he doesn't want to write any special test cases for it, because it's not worth it. But he would like to do it so that the tools you mentioned, that they were able to There is no such universal one. The use case that I associate with is the Google project from OS Fuzz. but it also needs to be integrated. And above all, it is worth mentioning that there are such requests to API or Gemini or OpenAI below, so we also need to take this risk into account, that we are revealing some part of our code

somewhere. And now the question is, can you do something like that or not? Probably not. No, I already have an open source code, right? I don't know if I can change my... I mean, you would have to modify the tool a little, but it's doable. I mean, you wouldn't even have to modify the tool. There's something called OSFAS-gen, And I think that this slightly modified code, to make it work, there is a whole integration instruction, there is even a link in which the source... It's not about modifying the tools that parse it, it's about modifying the code that the author... Listen, you know what, write it a little differently, but not very differently, write it a little differently, and

thanks to that, these tools are smaller, I don't think there is such a thing yet. I would see it more in Co-Pilot, but I don't know if this is the direction. Such a tool is still This is what I was talking about, lowering the entry threshold for DevSec. Unfortunately, so far it is still like this, or if we have something specialized, we have to create it ourselves or integrate it. Okay. I don't know how to answer this question. I don't know if there are such writing styles that are more friendly to tools. Looking at the scale of the Pilot, which was trained on a million writing styles, I think it's a very generic tool and it's like dealing with everything. I

don't think you need to tune your writing. and I don't know any other tools that could be from this side. But it's also a very cool topic. Are there any more questions? I don't think so. It gives me three minutes of life. So I'm really... Not only with a slip, I managed to finish it earlier, so thank you again very much. And I encourage you to contact me if you have any interesting questions, because usually I don't have any questions for my presentations, so two came up, very interesting, so I'm really pleasantly surprised. And if he shows up after the conference, I will be happy to answer by e-mail or on social media or in any other way. I will show my intentions once more and

I will run away and return two minutes of my life. Okay, good.

BSidesWarsaw track 2 dzień 1 cz 2

Related talks