
Hey everyone, thank you for joining us, and good morning. I'm Daniel Zarv, an AI security researcher at Wiz. Hello everyone, I'm Era Zawush, and I lead the AI research at Wiz. Today we're going to talk about how we used a small language model to enhance our secret detection engine.

Let's start with some motivation. Stolen credentials are still one of the most common attack vectors; we see them in roughly a third of all breaches, and we find stolen credentials in many places. For example, public repositories where you mistakenly commit a secret. Second, misconfigured or vulnerable cloud resources, for example a public bucket where you store sensitive data, or a credential that wasn't supposed to be public. And third, compromised workstations: an attacker who gets access to a workstation will probably find credentials stored in files there. According to our research, 86% of organizations have at least one repository with secrets inside. Even more concerning is how many organizations have secrets in their open-source and public repositories. When looking at cloud keys, we are particularly interested in the secrets attackers can use to perform lateral movement in the cloud. These numbers are likely even higher for the general population, since our data is drawn from organizations that are already protected by us.
So, what's on the agenda today? We're going to start with the current solutions for secret detection and ask whether they're enough. Then we'll talk about generating data using LLM agents, dive into fine-tuning small language models, and cover evaluation and metrics. We'll finish with some key takeaways.

Let's start by understanding the challenge of detecting secrets in files. We live in a world with more and more third-party integrations, which means more and more API keys, tokens, and service accounts that we need in order to integrate with them. Even AI itself generates more attack surface for us. Look at the recent hype around MCP servers, for example: we want to connect our Cursor or our LLM to a lot of services, and we need credentials to talk to those services. So we take those credentials and put them into Cursor or the LLM, not necessarily in a secure way, because we want to do things quickly and not necessarily in the best way in terms of security. We are generating more attack surface. On the right, also from our research, you can see some popular services whose keys we found in our customers' private repositories. For example, the Anthropic API key: it's a fairly new service, but it's already really popular to have its API key sitting inside code repositories. It's also important to mention that the companies on the right are not responsible for these keys or for securing them; these are just common services whose credentials people usually store in files in repositories.
When we think about traditional methods to detect secrets, the first thing that comes to mind is regexes, or patterns. When we rely on regexes, they're really hard to maintain, and they have a few limitations. First, they can suffer from high false-positive rates: we can write a regex that catches a real secret, but it will also catch a placeholder, or a value someone filled in with a word like "test", because it matches the same pattern. The second thing is maintenance: when a new kind of secret appears, we need to write a new regex. Many secrets have a generic shape, say 32 characters of digits and letters, so the regex also relies on keywords, for example the name of the service we want to match on. If there's a new service we want to catch, we need to write a specific regex for it, and it's really hard to write one that both catches the secrets we want and doesn't generate a lot of false positives. The third thing is the lack of context. Many files contain placeholders or test secrets; if you understand the context of the file, you can tell it's not a real secret, or that it is at best a low-confidence secret.
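To make that false-positive problem concrete, here is a tiny, hypothetical example of the kind of "generic" rule we mean — an illustration, not an actual production regex:

```python
import re

# A "generic" secret pattern: a keyword near a 32-character value.
# Hypothetical illustration only, not a real detection rule.
GENERIC_KEY = re.compile(
    r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"]([A-Za-z0-9]{32})['\"]"
)

real_key    = 'api_key = "9f8e7d6c5b4a39281706f5e4d3c2b1a0"'    # looks real
placeholder = 'api_key = "00000000000000000000000000000000"'    # obvious placeholder
test_value  = 'secret = "testtesttesttesttesttesttesttest"'     # 32 chars of "test"

for line in (real_key, placeholder, test_value):
    print(bool(GENERIC_KEY.search(line)), line)
# All three match: the regex has no notion of context,
# so placeholders and test values become false positives.
```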
So, as Era said, the limitations of regex-based detection can be a real headache. Both false positives and false negatives create a lot of noise, and sometimes even embarrassment. Today, the closest thing we have to the critical eye of a security researcher is a large language model. On this slide you can see two examples. In the first, I don't know if you can read it, the variable name is "llave privada", which is Spanish for "private key". This is an actual example from a public repository. In the example below it, there's a very generic secret that is probably there for test purposes. I don't think a regex could catch the first example; a regex would catch the second one, but it's actually a false positive.

So if we're talking about large language models, what are the advantages of using them? It's fairly obvious, but let's spell it out. They work out of the box; the amount of knowledge they hold is often unbelievable, and even with basic prompt engineering they can perform almost any task very well. They also support very long context lengths, which matters for our problem, because I'm sure all of you have seen code files with more than 1,000 lines. And if there are things they don't know or aren't familiar with, prompt engineering lets you fill in the gaps and teach them on the go.

But while they have many benefits, they also have challenges, and those challenges are specific to our problem domain. One is scale: we need to scan millions of files a day, while the rate limits of external large language model vendors are usually around 100 requests per minute — that's only about 144,000 requests per day, nowhere near enough. Another is cost: relying on external APIs for heavy workloads gets expensive very quickly, especially when we need high availability and performance. And of course privacy: companies are not eager to share their secret sauce with external services, and even if they wanted to, legal regulations often prevent them from doing so.

So do we have to use large language models? How about small language models? If you haven't heard of them, small language models are models with a significantly lower number of parameters. How much lower? Around 8 billion parameters and below — that's the working definition. For comparison, large language models hold around 300 billion parameters or more. Some examples of such model families are Phi from Microsoft, Qwen from Alibaba, and Llama from Meta. The challenges we've just seen can actually be handled by the advantages of small language models. Runtime and performance: we can run them with little memory and little compute compared to their bigger brothers; in fact, we can use CPUs for inference instead of GPUs. That leads to cost: we can run small language models on standard machines instead of GPU machines, which reduces cost significantly. And of course privacy: the model is ours, it runs in our environment, and no data is sent out, so it solves all of the compliance issues.

So when we thought about how to solve this problem and learned about small language models, we understood we wanted to try it, and the best way was to fine-tune a small language model for our specific task of detecting secrets.
We needed to answer a few questions. First, which data do we use, for training and for evaluation? Second, what is the ground truth, and what are the criteria for success? And third, can we run efficiently enough on a CPU machine? So let's dive in and answer those questions.

Defining the task is pretty easy: we want to take a language model and fine-tune it to our specific task of detecting secrets. That means the model won't necessarily be good at general knowledge, but it will be amazing at secret detection. We also needed to set a target: ours was to process a code file in less than 10 seconds on a single CPU thread. And we wanted to take advantage of a language model to perform multiple tasks in a single pass — not just extracting the secret, but also categorizing it and giving a confidence. The confidence helps us understand whether the secret is real, or maybe a test secret or a placeholder, and that's really important for our secret detection engine.
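To make that single-pass idea concrete, here's a rough sketch of what such an instruction could look like — an illustration only, not our production prompt, and the example completion is invented:

```python
# Illustrative only: one inference pass that extracts, categorizes,
# and scores every credential in a file. Not the actual production prompt.
PROMPT_TEMPLATE = """You are a secret-detection model.
For every credential in the file below, output one line:
(variable_name, secret_value, category, confidence)
confidence is high/medium/low; placeholders and test values are low.

File:
{file_content}
"""

# A hypothetical completion for a file holding one real-looking key:
#   ("stripe_api_key", "sk_live_...", "Stripe API Key", "high")
```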
So first we needed data, and we didn't have any. If I think about this problem three years ago, before LLMs, it was really hard to generate good-quality data for it: you needed a lot of analysts going over a lot of code files and labeling the data, which takes hours or days, costs a lot of money, and requires a lot of people. Today we have LLMs, so we decided to automate the process as much as we could and use a multi-agent approach. We started from a really large, completely uncategorized dataset: GitHub Archive, a project that archives all the public data from GitHub. Then we applied basic keyword filtering, and we kept only the data whose code license allows use for training and evaluation. Then we used two LLM agents, one based on Gemini and the other on a second model, with a mechanism that takes the agreement between the agents and merges their results. We also used a third LLM to validate the results and the confidence, as an LLM-as-a-judge. We handled some edge cases and did some manual review. We did all of this on a pretty small amount of data until we got the results we wanted, refining the prompts, the confidence definitions, and everything around them.
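The agreement step could look roughly like this — a minimal sketch, with the agents and the judge abstracted away as plain inputs, since the talk doesn't show the real pipeline's interfaces:

```python
# Minimal sketch of the agreement step between the two labeling agents.
# Findings come in as dicts like
# {"secret_value": ..., "category": ..., "confidence": ...}.

def merge_labels(agent_a: list[dict], agent_b: list[dict], judge) -> list[dict]:
    by_value = {f["secret_value"]: f for f in agent_b}
    merged = []
    for finding in agent_a:
        other = by_value.get(finding["secret_value"])
        if other is not None:
            # Both agents flagged the same value: keep it, preferring a
            # non-empty category from either agent.
            merged.append({**finding,
                           "category": finding["category"] or other["category"]})
        elif judge(finding):
            # Disagreement: a third LLM-as-a-judge arbitrates before we keep it.
            merged.append(finding)
    return merged

# Usage with a trivial stand-in judge that rejects disputed findings:
a = [{"secret_value": "abc123", "category": "API Key", "confidence": "high"}]
b = [{"secret_value": "abc123", "category": "", "confidence": "high"}]
print(merge_labels(a, b, judge=lambda f: False))
# -> [{'secret_value': 'abc123', 'category': 'API Key', 'confidence': 'high'}]
```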
Then we ran this process on 100,000 files, and we ended up with data of a quality we were happy with. It cost us around $5,000. I know that's not cheap, but we were actually a bit generous with the models and the experiments we ran, and it's fairly inexpensive relative to the dataset we got: 100,000 files drawn from all the public code on GitHub, containing many, many secrets with good variation.

Here you can see an example of our labeling. On the left is a secret that is probably real. It also has a category — it's a Facebook app secret — and the model extracted the variable name and the secret value, and the confidence is high. On the right is a secret that is probably a placeholder, judging by the secret value, and the confidence is low, as expected. It's related to a TikTok access token. So the model extracted the description, the category, and also the confidence, which is the most important part for us.
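As labeled records, those two slide examples would look roughly like this (the variable names and the real-looking value are invented stand-ins; the placeholder value is the one mentioned later in the Q&A):

```python
# Roughly how the two slide findings look as labeled records.
labels = [
    {
        "variable_name": "FACEBOOK_APP_SECRET",   # left slide example
        "secret_value": "9f8e7d6c...",            # real-looking value (invented)
        "category": "Facebook App Secret",
        "confidence": "high",
    },
    {
        "variable_name": "tiktok_access_token",   # right slide example
        "secret_value": "access_token_xxxxx",     # obvious placeholder
        "category": "TikTok Access Token",
        "confidence": "low",
    },
]
```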
For fine-tuning a language model, the task we define is: take a pre-trained, domain-general model and specialize it to our new task. There are two main ways to do that. The first is full fine-tuning, which updates all the model's weights. It's powerful, but it requires a lot of resources and is really costly. Even small language models have half a billion, one billion, or three billion parameters; that's a lot of weights to change, and you need a lot of GPU time to do it. The second option is parameter-efficient fine-tuning, for example LoRA, which is the method we used: you train a small adapter instead of the full model, which saves a lot of compute. For a small language model you can do it quickly on a single GPU machine, and it can take just a few hours even for a large dataset like ours.

So, as Era said, there are two ways of fine-tuning models, and we chose LoRA, which is low-rank adaptation. In simple terms, LoRA fine-tunes models by adding small low-rank matrices to the model, which dramatically reduces memory usage and compute during the fine-tuning process, making it more efficient and flexible. By flexible, I mean we cut the time needed to fine-tune a model, so we can perform many iterations: change the data, change the parameters, and so on. You can picture the model as a system of lenses, and LoRA adds filters on top of those lenses. In the fine-tuning process, we're figuring out what those filters should be; we don't touch the lenses of the original model.
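As an illustration of how little you touch, a minimal LoRA setup with the Hugging Face PEFT library might look like this — the base model and hyperparameters here are examples, not our exact configuration:

```python
# Minimal LoRA setup with Hugging Face PEFT (a sketch, not our exact config).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-0.5B")  # example base

lora = LoraConfig(
    r=16,                                  # rank of the adapter matrices
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
# Typically well under 1% of the weights are trainable: only the
# adapters (the "filters") learn; the original "lenses" stay frozen.
```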
We also applied quantization. Quantization is a trade-off between accuracy and performance: basically, it represents numbers with fewer bits. We go from representing numbers as 32-bit floating point to 8-bit integers. Why is that important? Because a model is a series of mathematical operations, and using smaller numbers makes those operations much faster. The main things we care about are memory footprint, processing power, and inference speed; after quantization those numbers get much better — the 32-bit-to-8-bit move alone cuts the memory footprint by 4x — and we don't lose much precision.

Also, as we said before, we run at very large scale, around 20 million files a day, and every token the fine-tuned model outputs costs time and compute. So we want the model to output only the necessary data, and only afterwards do we post-process it into our JSON structure. That's how we implemented it: the model outputs a tuple, and afterwards we post-process it into structured JSON.

Once we got initial results, we defined KPIs around file-level and secret-level matching. We also had to deal with hallucinations: sometimes the model outputs results that are not exactly in the input. We provide it with a code file, and it outputs variable names that aren't there, but are close. So we reviewed the results and found patterns in those hallucinations — it's a fairly closed set of hallucinations — which enabled us to build a custom fuzzy-matching mechanism so we could match the secrets and recover the true results.
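A minimal sketch of that fuzzy mechanism — the threshold and matching strategy here are illustrative, not the exact production logic:

```python
# Sketch of fuzzy matching for hallucinated variable names: the model
# sometimes outputs a name "close" to one in the file, so we snap it
# back to the nearest real candidate. The cutoff is illustrative.
import difflib
import re

def snap_to_source(predicted_name: str, file_content: str,
                   cutoff: float = 0.85) -> str | None:
    candidates = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", file_content))
    best = difflib.get_close_matches(predicted_name, candidates,
                                     n=1, cutoff=cutoff)
    return best[0] if best else None  # None -> treat as a true hallucination

print(snap_to_source("facebok_app_secret",
                     "facebook_app_secret = 'abc123'"))
# -> 'facebook_app_secret'
```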
So let's talk about the performance metrics. First, as a baseline, let's look at regex detection for generic secrets. We define generic secrets as secrets we didn't have a good regex for — it can be a username and password, or some API key we don't know or that has no strong pattern. We tried to create generic regexes to detect them, and the results were really, really bad; that's actually how this research started. For regex detection we got 56% recall and 32% precision, which is awful. For the small language model we trained — for example, the 1-billion-parameter one — we got 82% recall and 85% precision, and it took only about 10 seconds per file to process. We also tried an even smaller model, Qwen Coder in its half-billion-parameter version; the recall was a bit lower, around 71%, but the precision was still high at 87%, and it took only 2 seconds per file. Those results are on a single CPU thread. Models like these can run even on your laptop — and your laptop will probably be faster, if you use a Mac, for example. And that's amazing, because if you can run it on your laptop, you can deploy this model and use it, say, as a pre-commit hook for your Git. That's awesome.
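For example — purely as a sketch, with a hypothetical GGUF file name and prompt format — a pre-commit check with llama-cpp-python could look like this:

```python
# What "run it on your laptop as a pre-commit hook" could look like with
# llama-cpp-python. The GGUF path and prompt format are hypothetical.
from llama_cpp import Llama

llm = Llama(model_path="secret-detector-q8_0.gguf",   # quantized fine-tune
            n_ctx=4096, n_threads=1, verbose=False)   # single CPU thread

def scan(file_content: str) -> str:
    out = llm(f"Extract secrets from this file:\n{file_content}\n",
              max_tokens=256, temperature=0.0)
    return out["choices"][0]["text"]

if __name__ == "__main__":
    import subprocess, sys
    staged = subprocess.run(["git", "diff", "--cached", "--name-only"],
                            capture_output=True, text=True).stdout.split()
    for path in staged:
        findings = scan(open(path, errors="ignore").read())
        if findings.strip():
            print(f"possible secret in {path}: {findings}")
            sys.exit(1)  # non-zero exit blocks the commit
```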
Thank you. We're going to finish with some key takeaways, and let's talk about the frameworks we used, which you might find useful if you tackle this kind of problem and want to fine-tune a small language model. The first framework we used is Unsloth, which lets you run fine-tuning for a small language model very efficiently. It also provides step-by-step notebooks, so you can run it even if you're not a data scientist. The second framework is llama.cpp, which is really good for inference, for when you actually want to go to production. llama.cpp can help you quantize the model with various quantization methods, and it produces a single file that you can load with the llama.cpp library and run on a CPU — you can even optimize it for specific CPUs if you want. We found it really good for inference.
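For a feel of it, here's the rough shape of an Unsloth LoRA run, adapted from the style of their notebooks — the model name, dataset, and hyperparameters are placeholders, not our actual run, and the exact APIs vary by library version:

```python
# Rough shape of an Unsloth LoRA fine-tune (placeholders, not our run).
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-0.5B-Instruct",  # example base model
    max_seq_length=4096,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)

# Hypothetical labeled dataset: prompt + expected tuple, pre-rendered as "text".
dataset = load_dataset("json", data_files="labeled_secrets.jsonl")["train"]

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=8,
        num_train_epochs=2,
        learning_rate=2e-4,
    ),
)
trainer.train()
# llama.cpp's convert + quantize tooling then turns the merged model
# into a single GGUF file for CPU inference.
```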
Now some general takeaways and a future outlook. First, small language models: I know you hear a lot about large language models and you use them a lot, but it's not only about large language models. If you have problems you need to run at scale, or you want to run in an environment that doesn't necessarily have a GPU, they can be a good solution for you. You can take a large language model, use it to label data, and then produce a small language model that's specific to your task. Another important takeaway is that AI and security teams should actively use large language models to generate and label data. They should integrate this into their research, and they can use the data to train new models, to validate results, and to create more datasets. And a very important takeaway is the advancement of the frameworks. If you looked at this field of fine-tuning and model training four, five, six years ago, it was reserved for data scientists; now, with the current ease of use, software engineers, analysts, and security researchers can do these things too. There are even cloud providers offering one-click fine-tuning: just provide the data, the parameters, and a cap on the cost, and you get a fine-tuned model. Thank you for listening. Do you have any questions from the audience?
What were the precision and recall if you actually test the data on a large language model itself — how close does it get to 100%?

So, the benchmark was built by large language models. The numbers you've seen are our regex detectors and our fine-tuned models tested on that same dataset. We get to around 86% precision against the big model that generated the data — so we're not at the big model's level, but we're very close, at a fraction of the cost and the runtime. It's also important to add that we manually reviewed the large language model's labels: we didn't review all 100,000 files, but we reviewed many of them and estimated the accuracy of those results.
Hi — sorry, what do you mean by recall and precision? How are those numbers different from accuracy and false positives?

Recall is basically the coverage, and precision is about the detections: when the model says something is a secret, what percentage of those actually are secrets. Okay, thanks.
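(In standard terms, with TP, FP, and FN as true positives, false positives, and missed secrets: precision = TP / (TP + FP), the fraction of flagged secrets that are real; recall = TP / (TP + FN), the fraction of real secrets that get flagged. Accuracy also counts true negatives, which dominate here since most lines are not secrets, so it's less informative for this task.)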
Was the model more accurate for structured secrets, with well-known prefixes, versus unstructured ones?

For the structured — let's call them obvious — secrets, it was pretty dead-on. Most of the misses were on the non-obvious secrets: for instance, you define a variable with a name unrelated to a secret, but later in the code you use it for authentication. The interesting part was that the model actually caught some of those secrets that don't even look like secrets but are used for authentication afterwards — but most of the secrets it missed were of that kind. Yes, and our research focused on those kinds of secrets, because our secret detection engine already has a lot of regexes that recognize strong-pattern secrets.

Thank you for sharing all this. On one of your slides you showed the confidence score that was produced: one was rated high, one was rated low. When I looked at it, as a human I couldn't really tell — the one on the right that was rated low could very well have been a secret. Can you speak a little to how the model differentiates, and in that particular case, why one was low and one was high?

In the example I showed, it was just a placeholder — the value was access_token_xxxxx — so it's pretty clearly low. And we actually defined instructions for the confidence: we instruct the large language model and explain what low and high mean, and we also added some post-processing logic to re-rank the confidence in cases we believed should be low. Most of the time the labeling was good enough, but we added that post-processing to adjust the confidence a bit, and then we used it for the training.
Hi, good work. My question is: how long did it take you to fine-tune the model from start to finish? And also, how do we access your model, for example for our AppSec workflows?

The total research process took around two to three months, because it had a lot of parts. There was a part for generating the data and refining it until it was good enough; then the fine-tuning stage and the review stage; then we went backwards, regenerated the data, and fine-tuned again. A single fine-tuning cycle — just fine-tuning the model — is between 8 and 12 hours on a machine with two GPUs, for instance. But the whole process took almost two months, because it's long and there's a lot of manual review and going back and forth. It was classic data-science model training, but secret detection is also just a hard problem. I can also share that we did this about six months ago, and we struggled a bit with the frameworks the first time. I think these days it's a bit easier, and as we said, there are now cloud provider services you can use to fine-tune once you have the data. After the first time you do it, it's much easier the next time: the first time took us two to three months, but the next time we have to do it, it would probably take us, I don't know, two weeks.
A very quick follow-up on confidence. It sounds like, when you trained your model for confidence, you considered how the secret is structured — for example, "xxx" or "123" probably isn't a real one, right? But did you also consider where you found those secrets? For example, you could have three configuration files, one for production and another for dev — would you do something with that as well?

Yes, that's also part of the confidence. You saw the low and the high; there's also a medium confidence. We try to instruct the model so that if the secret looks real but sits in, say, a test file or an environment file that isn't for production, it sets the confidence to medium. So high is probably the production environment and, let's say, the most sensitive; medium means the secret value looks good but the location is less sensitive; and low means the secret value looks like a placeholder or not a secret at all.

And that is time, folks. So give it up for Danny and Era. Thanks, guys.