
Code C.A.I.N – Keeping Your Source Code Under Control

BSides TLV · 2022 · 22:10 · 232 views · Published 2022-07
About this talk
Developers routinely leak source code, credentials, and API keys across public repositories, container registries, and package managers. This talk presents a framework for detecting code leaks as quickly as possible using canary tokens, automated inspection, and neutralization—demonstrating the Leaktopus open-source tool for identifying and responding to exposed repositories.
Transcript [en]

[Music] [Applause] [Music] Okay, so who am I? I'm an application security team lead at Playtika. I shifted from software development and DevOps to application security a few years ago. I'm also an open source fan; I contribute where and when I can, and I have contributed to projects like Grafana, Metasploit, Juice Shop, and others like that. I'm also a bug hunter in my spare time; I'm responsible for a few CVEs, mostly on Grafana and Elasticsearch. I have also helped secure some well-known companies and products like Microsoft, Amazon, Elasticsearch, Grafana, and Medium. I also co-organized the first Israeli bug bounty meetup. And a few fun facts about me before we continue: I can't refuse a good beer,

mostly I prefer IPA or wheat. I'm a parent of two humans and four cats (yeah, it's a lot), and my GIF selection might be confusing; it's not you, it's me. Okay, so I want to start by telling you a short story. My wife loves to cook, and she actually cooks pretty well. In order to cook she uses recipes, like most of us, and she probably takes some recipes from her friends or sends recipes to her friends. She also has some recipes that she got from her mother and even her grandmother. In the past, when someone wanted to share a recipe with another person, they

usually wrote it on paper, or they found it in a magazine or bought a book in order to see the recipes. Nowadays it's much easier, because you can find a lot of good recipes on Facebook, Pinterest, and other places like that. So why am I even telling you all that? Because developers are pretty much the same as someone who wants to cook: they have recipes, and they share the recipes across the internet. When they share them, it could be on places like GitHub, GitLab, Bitbucket, things like that, and it could be on places like Docker Hub, Maven, npm, PyPI, and others,

and, well, there it's already artifacts or things like that. And it could also be on places like GitHub Gist or Pastebin, where it's mostly snippets or scripts and things like that. So when people talk about data breaches, they're usually only talking about leakage of personal information, like PII or credit cards. But data breaches also include other things, like code leakage. And what is the impact of code being leaked? It could impact three different aspects. One, the intellectual property, because some companies have very sensitive things in the code, for example the secret sauce of how they do what they do, their product, or even AI algorithms and things like that, and if

it falls into the wrong hands, then it can be abused or even used as an advantage by competitors to create a better product. It could also be used for lateral movement. Let's imagine that you have secrets in your code, like AWS keys; then someone can take the AWS keys that he found and hack into the AWS account, and from there the way to actually stealing some personal information or credit cards is very short. It could also be used this way: once you are seeing the code, you can do a white-box test on the code itself, and then you can find other vulnerabilities like SQL injection or XSS and other things

that can give you some leverage when you're trying to hack a company. It could also impact the reputation. Let's imagine this is a public company; that means the stock will go down, not to mention the fines that could follow and things like that. So let's see some cases where companies failed to protect their source code. One of the biggest examples is Twitch. In Twitch's case it wasn't only personal information and the earnings of the streamers that leaked, but also their code; they had a lot of code repositories that were leaked. Other examples are Microsoft and Intel. In Intel's case it was more than 20 gigabytes of data, of code, that was

dumped online. So I understood that developers are sharing their code all across the internet, and it sounds, in theory, like there is something to act on here and to find. I wanted to break free from theory and see what I could do, so I conducted research on three big players in this area: one is GitHub, the second is npm, and the third is Docker Hub. I'm going to tell you what I found in each one of those, just a few examples and not everything, of course, because we have only 25 minutes. On npm, for example, I found a lot of

private packages that were supposed to stay internal but weren't; they were published to npmjs. In those packages I found, for example, user credentials in test files, and I was able to log into production environments with those credentials. On Docker Hub I could find things like credentials, SSH keys, and API keys. The worst one was for PyPI; they used it for actually publishing their packages to PyPI. I also found license keys, which you can see here; this is a real example, a screenshot of something that I found. And of course some private code that wasn't supposed to be there. I even got acknowledged by Check Point for one of my findings in one of the Docker images.
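The kind of findings described here (credentials, SSH keys, and API keys sitting in published packages and images) can be illustrated with a very small pattern scan. This is only a sketch and not the method used in the research: the function name `find_secrets` and the two patterns are assumptions chosen for illustration, and real scanners use much larger rule sets.

```python
import re

# Illustrative patterns only (assumed for this sketch, not taken from the talk).
SECRET_PATTERNS = {
    # AWS access key IDs start with AKIA (long-term) or ASIA (temporary).
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    # PEM header of an unencrypted private key, e.g. an SSH key.
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def find_secrets(text):
    """Return (kind, matched_text) pairs for every pattern hit in text."""
    hits = []
    for kind, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = 'key = "AKIAABCDEFGHIJKLMNOP"\n-----BEGIN RSA PRIVATE KEY-----\n'
    for kind, value in find_secrets(sample):
        print(kind, value)
```

Running such a scan over the files of a package tarball or the layers of an image is one way to surface what was published by mistake; it says nothing about whether a matched credential is still valid.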

On GitHub it was the most interesting part, because I found a lot of things there and I keep finding things. I even found one when working on my demo for this presentation, and I'm going to show it to you later. So I found super user credentials to a CDN account of one of the largest Israeli newspapers. You can see here the thing that I found; it was a Fastly account, and I was able to log in as a super user. I responsibly disclosed it to them, and it was fixed within a few hours. I found a lot of credit cards and credentials to a clearing company's customer account. I haven't tested the credit cards, of

course, but whether they work or not doesn't matter. I even found Gmail credentials of an external vendor to the IDF, and if you wonder, I was able to log into the account without any two-factor authentication or anything; again, it was reported. I also found a lot of API keys, like SendGrid, AWS keys, Azure keys, and SMS providers like Twilio and all that. And of course I found a lot of source code that wasn't meant to be public, for things like branding websites, e-commerce websites, and things that should have remained private. So why does it even happen? Mostly it's a lack of security education: no one knows that they need to take the

repository, create a new repository, and make it private and not public. So that's mostly it; I don't have a lot to say about that. It's usually some dumb misconfiguration, nothing intentional. Before we continue, I want to state one core assumption: your code was leaked, is leaking, or will leak. In this talk we will only focus on finding it as soon as possible, to reduce the mean time to detect; I'm not going to talk about prevention at all. I want to introduce you to Code C.A.I.N. Code C.A.I.N is an acronym for canary tokens, automated inspection, and neutralization. This is a framework that you can use, and I will show you how soon. What is a canary token? A canary token is

something that I can put in a Word file, for example, or in code in our case, and that alerts me once it's touched. For example, I put an AWS key in my code, and once someone uses this AWS key I get alerted, and then I know that it was hacked. So before we talk about where to place them and what types of canary tokens we have, let's talk about the various levels at which code could leak. It could be on the source code management level; one example of that is Codecov, where part of the CI was infected and a lot of the repositories of the affected companies

were breached. It could be on the level of a single repository or multiple repositories. And it could also be on a file level, for example a script or a class or something like that. So let's talk about the first type of canary tokens: secrets. We can place AWS keys or Active Directory accounts in code, and once those are touched we know for sure that our code has been leaked. Let's talk about the pros and cons. The pro is that it looks authentic: as an attacker, I will most probably try the credentials, because I found something and I want to test it. But the cons are that it requires medium setup effort; it requires me both to place it in code in

multiple repositories and, on the other end, to monitor the usage of those credentials. It could also be triggered by mistake: one of the developers could see the AWS key and think, "Ah, what is it? Let's try that," and trigger an alert by mistake. And the biggest con is that the code has already been found; someone used the credentials. The second option that we have (and by the way, you can combine all the options, okay? I'm not limiting us to one or another, but we will focus on the third one, which we will see in a sec) is application traps. We can place fake API endpoints or even some files like error.log or something like that, which was

intentionally put there for someone to find. We need to use names that can't be fuzzed easily, okay? We don't want it to be triggered as a false positive by fuzzers. Once we did that, we have again a few pros and cons. The pro, again, is that it looks authentic; someone would try to access routes like /admin, back office, or something like that. But the cons are that it requires high setup effort, even higher, I think, than the previous one, because it varies between languages and frameworks; each language has its own way to define the path. It also requires multiple variations per code repository to identify where the leakage came from. And it could be,

again, triggered by mistake, and again the code has already been found, which is not good. So that leads me to the third option: unique strings. With unique strings, what I can do is just place a unique string, something unguessable like a UUID (I will show it to you later in the demo), and then I can put it in all the code files that I have, for example in a comment. I'm not going to go into the details of how to actually do that, but let's see the pros and cons. The pro is that it doesn't require the attacker to actually access it, so we have the best mean time to detect; we can just search

for it and find it. It requires low setup effort, because placing a comment in code is pretty much the same across all languages, and it has, again, better cross-language and framework support. And by the way, I skipped the second one: it costs low setup effort because I can put it in a CI job, for example, that does it across all my repositories. The con is that we need a way to actually find it once it leaks, but don't worry about that, because we have the solution. So a few highlights before we continue. Don't place it in publicly accessible code, of course, things like open source or

APKs or mobile clients in general, or client-side code, because someone will notice it, and we don't want it to be noticed unless the code has actually leaked. Don't be too obvious: don't use something like "unique security canary token" that an attacker won't actually use. And start with your most sensitive code repository first. Here is an example of how not to do that. Let's continue to the second one, automated inspection. Automated inspection is a pretty easy methodology. First of all, we want to hunt for leaks on public sources. Second, we want to filter the results with some kind of heuristics engine or techniques. And the third one is to enhance the results with what I

call indicators of leak, IOLs. The last step is neutralization. In this step we want to eliminate the risk. We want to have automated alerts like any other security tool, for example the tools that we have in all organizations for alerting on malware and viruses. We want to investigate using the extracted indicators of leak, and we want to automate the remediation process as much as possible. For example, if it was an active employee who leaked the code, we want to contact them through some chatbot, or if it was an ex-employee, we want our legal department to contact whoever it may be to request a takedown. That all led to the development of a

tool called Leaktopus. This is an open source solution based on the Code C.A.I.N framework, and you can find it at the link here. So what does Leaktopus do? It has a few key features. The main one, on the automated inspection part, is to hunt for leaks. It also filters the results, and then it enhances the results with indicators of leak, for example company emails that it found, the canary token that we planted earlier, and secrets, so it can give your investigation team the right context to understand the severity and to find a real leak among all the junk. And the fourth thing that it provides is indexing the code into

Elasticsearch. Why is this useful? Let's imagine that your code has already been leaked, and you know that someone used one of your Active Directory or AWS accounts to log into your company, but you don't know how. What you can do is go into Elasticsearch, where we index the results, and search for the leaked secret, and then you will be able to find it. You might be asking yourself, "Why can't I just do it with GitHub?" The answer is that GitHub's code search API only lets you find things on the default branch, like main or master, so you won't be able to search in the history and in the

commits, etc. On the neutralization part, what Leaktopus provides is a Microsoft Teams webhook notification on new leaks, and also Cortex XSOAR integration. And because most of us here are nerds and we want to look under the hood, a bit about the technology stack: it's fully dockerized, it's API-first with a Flask backend, it has a decoupled Vue 3 frontend, and it uses an SQLite database and tasks with Celery and Redis. It also has an automatic retry mechanism for overcoming GitHub's rate limiting, and it has some built-in heuristics, as I said. Some of them go by the number of forks or stars, because we can assume that if a repository has, for example, more than three stars,

then it's probably meant to be public and it's not something that was leaked. It also ignores random emails or leak collections; a lot of people are using GitHub to store datasets, for example the top 1,000 companies in Israel, so it ignores such kinds of things, again by some kind of heuristics. So, demo time. Just kidding. Let's check that you can actually see my screen. Let's see.
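The star and fork heuristic mentioned above can be sketched as a simple filter. This is a minimal illustration and not the tool's actual code: the `RepoResult` type, the function names, and the fork threshold are assumptions, and the star threshold of three is only the example figure given in the talk.

```python
from dataclasses import dataclass


@dataclass
class RepoResult:
    """Hypothetical search hit; field names are assumed for this sketch."""
    full_name: str
    stars: int
    forks: int


MAX_STARS = 3  # example figure from the talk: more stars suggests a deliberately public repo
MAX_FORKS = 3  # assumed value; the talk does not state a fork threshold


def looks_like_leak(repo, max_stars=MAX_STARS, max_forks=MAX_FORKS):
    """Keep only repositories that are unpopular enough to be accidental."""
    return repo.stars <= max_stars and repo.forks <= max_forks


def filter_results(results):
    """Drop hits that the popularity heuristic marks as intentionally public."""
    return [repo for repo in results if looks_like_leak(repo)]


if __name__ == "__main__":
    hits = [
        RepoResult("someone/internal-billing", stars=0, forks=0),
        RepoResult("popular/awesome-list", stars=1200, forks=300),
    ]
    print([repo.full_name for repo in filter_results(hits)])
```

A popularity cut-off like this trades recall for noise reduction: a leaked repository that happens to collect stars would be filtered out, so the thresholds are worth tuning per organization.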

okay

Okay, so you can see here how Leaktopus looks. In general, when I want to search for something, I just put the search query here. I'm not going to scan something now; I already prepared it in advance, because it takes some time to actually scan and do all the things that I described. But you would search for something like "bsidestlv", and in the organization domains you would put your organization's domain, like bsidestlv.com; this is used for the enrichment. And there you would put your canary token; for example, the canary token that I used for leaking something for the demo was this one. And then you can see here the scan

status; this one is done. Then you will see the results here as JSON. Let me scroll up, and you can see that I have 57 results. If you wonder how many results there were before all those heuristics that I mentioned, you can see the results here on GitHub: it had more than 800 results. Once you look at all the 57 results that you have here, you can also wonder: are those all real, do I need all of those? The answer is no. If you look at BSides, for example, because we have the CTF that happens once a year, and we also have the organization on GitHub that's called BSidesTLV slash something, then

we know that we can filter those out. Leaktopus also provides a solution for that: it comes out of the box with a few ignore-repository patterns, and you can also add your own custom regexes. I prepared those in advance, and you can see that I'm ignoring, for example, the BSidesTLV organization. I'm also ignoring everything that has the word CTF in the organization name or in the repository name, okay? In both cases we want to ignore those. And I'm also ignoring everything with Juice Shop, because I had a lot of results with Juice Shop. Let's see how many results we have now. We can just refresh the page, and we got 30 results with just the basic

filtering that I did, without any crazy knowledge or anything like that. But I still need to go over all 30 results, so what should I do, and how can I prioritize things? As I said, there is the enhancement part. What I can easily do is just a request that filters by "is organization domain" equals one. Let's see what it means, and let's see if I have results. You can see that, for example, I'm extracting the authors of all the commits. In this case you can see that the indicator of leak is this file, the README file that contained the word "bsidestlv", and you can see all the information

about that: "a top secret BSides application". You can see that the author was 2RS3C, which is my nickname, at bsidestlv.com, and you can see here that it's marked "is organization domain" one, because again it's bsidestlv.com. What you can also do is search by the canary token; this is called the sensitive keywords. By the way, you can use a lot of sensitive keywords, like internal domains that you might have, like acme-ltd.internal or something like that, which you know wasn't supposed to be outside. You can see here that I also found a result, and you can see where I found it. It was found, of course, on

GitHub, and you can see "I'm no one, just ignore me" with the canary token that I placed. It doesn't look suspicious; it's like some developer put some UUID here. Now you can also see the secrets: it's extracting the secrets from the code, and you can see that there are two secrets here. One is an AWS access key, and the second one is a username and password, admin and bsides123, at PyPI. But when I'm looking at the repository, I don't see those credentials. Why is that? This is because it was in, well, it should be in one of the previous commits. One sec, I'm not in the right place. Okay, now it will work. So it's not here, but you can see that

it's in one of the previous commits. You can see it, I think, here: "adding my deployment script", and you can see the credentials. By the way, it also ignored the AWS example one. The last thing that I'm going to show you in the demo is how it looks in Elasticsearch. For example, I put here some nice dashboard for you to see, but the interesting part is this one: it's all indexed, all the commit data is indexed, and I can just place (no, not this one) the canary token that I want to find and see where it leaked from. And you can see that it was leaked here, on this repository, on this commit.
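The reason for indexing commit data, as shown in the demo, is that a secret or canary token may exist only in an older commit and not in the current files. A minimal sketch of that lookup follows, using an in-memory list as a stand-in for the index. The `Commit` type, the function names, the repository name, and the token value are all hypothetical and are not the tool's schema.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    """Hypothetical indexed commit; field names are assumed for this sketch."""
    repo: str
    sha: str
    message: str
    patch: str  # unified diff text of the commit


def added_lines(patch):
    """Return the lines a diff adds, skipping the '+++' file header."""
    return [
        line[1:]
        for line in patch.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]


def find_token(commits, token):
    """Return (repo, sha) for every commit that introduces the token."""
    return [
        (commit.repo, commit.sha)
        for commit in commits
        if any(token in line for line in added_lines(commit.patch))
    ]


if __name__ == "__main__":
    token = "3f2b8c1e-7d4a-4e6f-9a1b-5c0d2e8f7a61"  # made-up canary value
    history = [
        Commit("acme/app", "a1", "adding my deployment script",
               f"+++ b/deploy.sh\n+# {token}\n+echo deploy\n"),
        Commit("acme/app", "b2", "cleanup",
               f"--- a/deploy.sh\n-# {token}\n"),
    ]
    print(find_token(history, token))
```

Here the second commit removes the token, so a search limited to the latest state of the default branch would miss it, while the commit-level search still points at the commit that introduced it.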

Going back to my presentation.
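The unique-string canary from earlier in the talk was described without implementation details. One way it might be planted, for instance from a CI job, is sketched below. Everything here is an assumption made for illustration: the function names, the suffix-to-comment mapping, and the choice to put the token on its own comment line. The comment carries only the UUID, in line with the advice not to make the token obvious.

```python
import uuid

# Assumed mapping from file suffix to line-comment prefix; extend as needed.
COMMENT_PREFIXES = {
    ".py": "#", ".sh": "#",
    ".js": "//", ".ts": "//", ".go": "//", ".java": "//",
}


def new_canary():
    """Generate an unguessable token (a random UUID, as in the talk)."""
    return str(uuid.uuid4())


def stamp(source, suffix, token):
    """Return source with one comment line carrying the token.

    The text is returned unchanged for unknown suffixes or when the
    token is already present, so the function is safe to re-run.
    """
    prefix = COMMENT_PREFIXES.get(suffix)
    if prefix is None or token in source:
        return source
    lines = source.splitlines(keepends=True)
    # Keep a shebang as the very first line of scripts.
    insert_at = 1 if lines and lines[0].startswith("#!") else 0
    if insert_at == 1 and not lines[0].endswith("\n"):
        lines[0] += "\n"
    lines.insert(insert_at, f"{prefix} {token}\n")
    return "".join(lines)


if __name__ == "__main__":
    token = new_canary()
    print(stamp("#!/bin/sh\necho deploy\n", ".sh", token))
```

Using a different token per repository would make it possible to tell which repository a leaked file came from, at the cost of keeping a token-to-repository record somewhere private.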

So what are the next steps? One, to extend Leaktopus, and your contributions are very welcome; I appreciate any contribution. To support some leak scoring; it's basically mostly supported already, but just to add some kind of property of severity or a score. To support more public sources; at the moment I'm only supporting GitHub, but I can easily support Pastebin and Bitbucket, since it's very extensible. And to improve the secrets detection. Any questions? [Applause] [Music] [Applause]