
Performance Hacking - how to hack your tools to go faster - singe | BSides Cape Town 2023

BSides Cape Town · 46:06 · 393 views · Published 2023-12
About this talk
Rust, hacking, and password cracking: how I built a password cracker faster than hashcat on CPU, and also a large-file reading approach faster than ripgrep. Rust, hacking and password cracking. All things I love dearly. But when an obsessive hacker wants to crack passwords efficiently, sometimes you can reach speeds faster than hashcat can produce on CPU! Efficient hash cracking is about performance, and performance is about understanding computers. The most common approach for more speed is to just throw more hardware at the problem, but in this talk, motivated by getting a new laptop with a new architecture, I will show how similar, if not better, password cracking speeds can be achieved with just a little computer science. I'll discuss the many problems I hit along the way and the many interesting performance-related lessons I learnt, which ultimately resulted in me discovering a new way to dramatically speed up file reads on multicore systems.
Transcript [en]

Hello everyone, hello, hello. So I was just explaining to some friends of mine: never try to give two talks at a conference, it's just the worst idea. I did something really dumb, I didn't back myself, so I submitted two talks, and then the bloody organizers accepted both of them. The problem is that one of them I had to put more effort into because it's the keynote, so this one is the runt of the litter, and I'm completely fine if you need to leave. OK, so what this talk is about: it's a technical talk, it's not like the keynote, and it's not even really a security talk, so I'm extra OK if you want to leave. No problem.

All right, so one of the things I'm really interested in is optimizing stuff, making it go faster, and I'll explain some of the reasons why. This talk is about optimizing hacking tools, and I really like Rust, so that's why there's a crab on there. People who know Rust know why there's a crab, but to explain it, because there's a crab on every slide and you might not get the joke otherwise: Rust people call themselves Rustaceans, like crustacean, and that's why the mascot is a crab. So that's why I've got crabs everywhere, not because I have some weird thing for crabs.

OK, so one of the greatest hackers in the world is this guy named Halvar Flake. He kind of invented large swathes of the reverse engineering field; he was one of the original founders of Google's Project Zero after Google bought his company, zynamics, and he's just done amazing stuff. Have you heard of the Rowhammer attack? Yeah, that was him, obviously along with other team members. His real name is Thomas Dullien. I've only ever known him as Halvar Flake. He once explained why he's called that and I don't remember. Do you remember? It's a Viking comic, yes, thank you.

Yeah, there was a Viking comic with a character called Halvar, but was he called Flake? I think the Flake is... yeah, anyway, doesn't matter, least important part of this. What's interesting is that, apart from the fact that he's this really excellent hacker, he left the information security industry to get into the performance optimization industry. He built a company called optimyze, with a z, and it just got bought by Elastic. It was really interesting: he gave a talk at QCon earlier this year where he spoke about his reasons, and he talked about three reasons that resonated with me. So, to be clear, those reasons are why you would engage in performance work and leave security work to do it.

The first is what he called the death of the economics of Moore's law. If you're familiar with Moore's law, it's the idea that the number of transistors doubles roughly every year, and the economic impact of that is that computing power gets cheaper. Now realistically, your Amazon AWS bill is not going to be $200 next year if you ran the exact same workload on the exact same systems; it might be a bit cheaper, but it's not doubling. So while we're still kind of in a place where Moore's law is holding true, we're not in a place where the economic benefit is coming out.

What that means is workloads need to be optimized, or, phrased another way, optimizing workloads has a meaningful economic benefit, because you can't just rely on things getting cheaper. The second reason is that hacking provides this kind of full-stack view of computing. You're not a front-end developer or a back-end developer, you're not limited to one part of the stack, you get to hack across everything: if you can move from the front end through to the back end, you'll do it. Haroon, who's one of the early people at SensePost, once said security is just a career of constantly going deep: you find something, you go deep all the way, and then you move on to the next thing, because you're just moving through the stack the whole time.

Performance gives you the same opportunity, because the optimizations could be at any layer, so you get that same kind of hacker joy: you get to deal with large code bases and move across full stacks. And then the last reason, which he experienced a lot more than I do (nobody asks me to write cyber weapons), is that a lot of the work he was doing on the security side could have significant implications for the health, safety and security of people. Sometimes you can discover vulnerabilities or write exploits that

end up with somebody being dismembered in an embassy somewhere, which is a really unpleasant outcome from some of this work, the spy-versus-spy aspect of it, whereas with performance you don't have that problem. Plus, with security, no one will love you for the work you did. At best they'll be like, OK, thanks for calling my baby ugly, I guess I can go fix it; at worst they'll be like, I'm going to sue you, pew pew. But with performance everybody loves you: oh sweet, this is faster and cheaper, great, we'll do that. And it's got other benefits, like the environmental ethics. So yeah, it was interesting to me that there was this very influential person, certainly in my life, saying these things at a time when I was getting quite excited by it.

My reasons for focusing on performance are slightly different but somewhat related. The first one is that I used to work with a guy named Marcus; he's now at Thinkst, and he once told me good hackers always do two things at once. When you're hacking into something, you've got some long-running thing going on in the background while you do something else. The simplest example is running Nessus or some kind of web application scanner in the background while you're manually poking at things.

But there are lots of opportunities where you can write something long-running, even if it's shitty, just to be gathering information that might help you later on. And in security, pretty much every vulnerability can be discovered or exploited through a brute force. Whether that's the right thing to do is a different story, and it's definitely never the most efficient, but this time-versus-effort tradeoff exists all over the place in security, so writing brute forcers and being able to make them go fast can be very beneficial for discovering and exploiting security problems. That's kind of the whole premise behind fuzzing, and pretty much all vulnerabilities fall into the "you could find them through fuzzing" category.

Then, the more you learn about this, the better your code gets, which is quite useful: it's not something you do once off in one place and then it's useless, it's something you can keep reusing in lots of other places. And for research purposes, if you're looking for excuses to go deep, this gives you great ways to go deep. It's slow; why? Can it be faster? Why? And then you end up in computer science and physics problems, and that's interesting, and it gives me the same kind of hacker joy that I get when I'm exploiting bugs. So maybe it'll be interesting to you. OK, so now I want to get into the optimization part.

There are two parts to this talk (maybe you're noticing a theme). First I'm going to talk a little about how I optimize, and the optimization I'm doing is on command-line applications for hackers. Some of you are probably developers who are very good at optimizing large workloads across Kubernetes clusters; I don't know how to do that. Some of you are probably very good at writing programs. I'm sort of sitting at the intersection of doing a bad job at a lot of things, but maybe there's enough of an overlap of those things to be interesting to you.

The second part is that I want to talk about some specific problems I faced while writing an NT password cracker, which I thought were kind of interesting; maybe you will too. OK, let's get into an optimization primer. That is Donald Knuth. Donald Knuth is a famous computer scientist; maybe you've heard of him, maybe you haven't, but he's built a ton of stuff that computer science relies on, and way back in 1974 he had this famous saying: premature optimization is the root of all evil. So when you're optimizing something, the first trick is to not optimize and just make the thing work, because if you try too hard to optimize it in the beginning you end up in all sorts of difficult situations, and you also can't then improve cleanly from the worst case. So just write it, it'll be fine, and then you can develop a benchmark and improve on that.

The next question is what kind of benchmark you want to develop. The large majority of the time, real computer scientists are looking at optimizing CPU load, optimizing in-kernel time, optimizing memory, all sorts of things. I'm just interested in optimizing runtime: I want this thing to go as fast as possible in what they call wall-clock time, the time I experience in the real world. That's what I'm optimizing for. That's not really something a lot of real optimization people seem to focus on, but things like CPU and memory are a way to get there.

So the benchmark I'm going to develop most of the time is: how long did it take? Then you can use profilers to try and figure out why things are going slow, but you kind of need to do some work beforehand to understand what the hell the profiler is saying, and then you need to do some work afterwards to understand why the hell the profiler is saying that. And then you need to do a bunch of fiddling. Much like in hacking, you might have a hypothesis, but the actual hacking

is when you're sitting there iterating and fiddling, not sure why, trying to figure out why that happened when you made this change. You do a lot of that to work out what kind of changes are happening in the profiler and why. So let's go into a simple example. This is the problem we want to solve: if I give you some string, we want to create every permutation of that string. You can kind of see the benefit from a password cracking perspective here. So if we want to write a program to generate that, I'm going to write the simplest possible program.

It's in Rust, but I've tried to keep it simple and you don't really need to understand too much of it. You see the thing in brackets, the 0..10: I make a range, and because it's Rust that's actually 0 to 9, and then it's creating permutations of length 10, so it'll say 0 1 2 3 4 5 6 7 8 9, then 1 0 2 3 5 4 6 7 8 9, etc. It's creating a whole bunch of permutations there. I'm not using ABC because it needs to run long enough for my computer to actually do something meaty. And then I've just got a very simple for loop that goes through and prints them out, using the default Debug pretty-printer to do that.
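For reference, a minimal sketch of what that vanilla version might look like (assuming the itertools crate for the permutations call; the real code from the talk may differ slightly):

```rust
use itertools::Itertools;

fn main() {
    // 0..10 is the range 0 to 9; permutations(10) yields every ordering of it
    for p in (0..10).permutations(10) {
        // {:?} is the default Debug formatter, printed straight to stdout
        println!("{:?}", p);
    }
}
```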

OK, so this is going to be the baseline; I'm not optimizing this thing, there's no premature optimization, and I want to see what it looks like. I'm using a tool called hyperfine; you're going to see this a lot, so let me explain what you're seeing. hyperfine (written in Rust, irrelevant to this point) is a way of timing how long something takes when it runs, and it's pretty good at comparing that speed against other things. You'll see some switches there. The first one is -w, which says warm up: it does two runs where it doesn't do timing, it's

just running them so that your computer warms up its caches, puts things in memory, all that kind of stuff; otherwise you have the problem where the first run is always slower than later runs. Then -r is the number of runs to do; because you have to sit and watch how long this takes, I've made it three, but you can probably make it higher if you're doing this in real life. The -L switch is complicated, but we need it for later: it allows me to run lots of versions of this and compare their speeds. I'm only running one thing now, so it doesn't matter yet, but later it will. So I define a

variable called variant and I only give it one value, a-vanilla. I've got a directory called a-vanilla which has the first version of the code I just showed you, so it runs that. Then it gives a couple of lines of output. The green number, 3.164, is the mean time, the average across the runs, so that's a pretty good one to look at to understand how long this thing actually took. Underneath you've got the minimum, the maximum, how much time was spent in kernel and how much in user space, but mostly we just care about that one number. Is my cursor... oh good, my cursor works. So that's the number we're going to look at for most of these.

All right, so we wrote the simplest possible version of this code to generate these permutations, and it runs in 3.1 seconds to generate all of them. So how do we make it faster? You can try fiddling with a whole bunch of different things, but generally the first thing I do is add a buffer. Buffers are a really good way to make sure that two things whose producing and consuming rates don't match can go a little faster; that's not actually why this buffer works here, I'll get into that in a moment.

What we've done with this code is: it's the same code as before, we've just added a string imaginatively called buffer. It's generating permutations in the same way, it's looping through them in the same way, but it writes each permutation to the string rather than to the screen, and once the string is bigger than a certain size it prints it to the screen and clears the buffer. So we're just adding a buffer. The reason this works here is that the cost of writing to the screen is quite high, so writing a lot to the screen in one go, rather than lots of little writes, ends up being a bit of a performance speedup, and I'll show you in a bit.
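A minimal sketch of that buffered version, under the same itertools assumption (the 64 KiB threshold is illustrative, not the talk's exact number):

```rust
use itertools::Itertools;
use std::fmt::Write as _;
use std::io::{self, Write as _};

fn main() {
    let stdout = io::stdout();
    let mut out = stdout.lock();
    let mut buffer = String::new();
    for p in (0..10).permutations(10) {
        // write the Debug representation into the string instead of straight to the screen
        writeln!(buffer, "{:?}", p).unwrap();
        // only pay the (expensive) screen write once the buffer is reasonably large
        if buffer.len() > 64 * 1024 {
            out.write_all(buffer.as_bytes()).unwrap();
            buffer.clear();
        }
    }
    out.write_all(buffer.as_bytes()).unwrap();
}
```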

OK, I should have gone to that slide first, because I'd helpfully highlighted all the bits I wanted to show you. All right, so now when we run this we want to compare it against the first run, so I'm running the original one again, and frustratingly you're going to have to feel those nine seconds it takes to run; well, now there's the warmup too, so 12 seconds, 15 seconds; adding threes, who knew it was that hard. So now that feels really slow. Then we run version two, and you can see it's

much faster: it's sub one second, 884 milliseconds is the mean time. So we managed to dramatically increase the speed, 3.58 times faster, by just adding a buffer. Wonderful. OK, so we did some fiddling, some premature fiddling, and we got some speedups, that's great. I'm not saying the solution to all optimization is to add a buffer, but a lot of the time it is. And maybe there are programmers in here going, oh, that's really obvious, but for the rest of us you go, well, am I just dumb, do I not know these things? I wanted to point out that it's not that obvious to everyone: this is atom, the guy who

created hashcat, arguably the most prolific speed-up optimizer on modern computers when it comes to hacking tools, adding a buffer to his version of permute because it made it go faster, after we had an argument on Twitter. You can see the first version was committed in 2015 and his buffer update came in 2019. OK, so that was poking around and getting lucky, but hopefully it gives you a taste that you can make things go much faster, and you can live much more of your life not staring at a computer screen, up to 3.5 times faster with just one buffer. I hope you're buying what I'm selling. Next we're going to use a profiler.

Flame graphs are a type of graph; I'll explain what's going on here in a bit, but you can enjoy the ChatGPT-generated crab on fire. So, flame graphs: what they show matters. A flame graph is a whole bunch of things that are children of parents, sitting on top of the parent... I'm explaining that badly. It's got the parents at the bottom and the children stacked upwards, and the children generally take less time than the parent. Flame graphs can be used for memory, they can be used for on-CPU time, so what you're displaying matters; a flame graph doesn't always show the same thing. In this case

I'm using cargo install flamegraph, if you're a Rust person, and the default flame graph, which shows on-CPU time: the number of samples that found the program in a particular stack frame. What you can see here is that for our buffered version, the vast majority of time is spent in core::fmt's Debug formatting, which is a child of the format call. So even though we added a buffer and it went a lot faster, we've still got the majority of our program time spent writing to the screen, and when we look at specifically what it's doing, it's calling the Debug formatter to take a bunch of integers and turn

them into a string and put them on screen. That gives us an idea of where we should focus to see if we can make things go faster. So yes, this is the part where it's using the Debug format, and what I did is change it so that instead of calling the Debug formatter, we take each integer and turn it into a character directly and stick that in the string. Now we're not asking for more complex things to happen, we're doing the one thing that's needed: take a number, turn it into a character, stick it on the string. That's what it's doing there; it's otherwise exactly the same thing, and at the end we put a newline in, so it goes through each character.
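A sketch of that change, keeping the rest of the buffered version the same (again assuming itertools; the exact code in the talk may differ):

```rust
use itertools::Itertools;
use std::io::{self, Write};

fn main() {
    let stdout = io::stdout();
    let mut out = stdout.lock();
    let mut buffer = String::new();
    for p in (0..10).permutations(10) {
        // convert each digit to its ASCII character directly, skipping the Debug formatter
        for d in &p {
            buffer.push((b'0' + *d as u8) as char);
        }
        buffer.push('\n');
        if buffer.len() > 64 * 1024 {
            out.write_all(buffer.as_bytes()).unwrap();
            buffer.clear();
        }
    }
    out.write_all(buffer.as_bytes()).unwrap();
}
```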

OK, so let's look at how this compares. Was it 15 seconds? I should press start on these things sooner so we don't have to physically wait for 15 seconds, but I want you to feel the visceral pain of how long and how irritating a few unnecessary extra seconds are when running this thing. I also definitely should have clicked next sooner, so let's just wait awkwardly. OK, we've passed that one. So this is our second one, which we know ran in just sub 900 milliseconds, and with our last one we're down to 169 milliseconds,

just by getting rid of the format. That's nearly 19 times faster than version one, and at this point I think we're at about six extra lines of code to make it go faster. OK, so let's go back to our profiler, and now the crab is smoking instead of being on fire, just high quality. When we look now, there's not a lot happening; it's getting a little tighter to look for optimization opportunities, but it's spending most of its time doing iterator next and things like that, rather than writing to the screen. That tells me the actual iteration, the permutations rather, are what we need to optimize next.

So, I mentioned that argument with atom on Twitter. I first put something out there when I was fiddling with this, and he's like, oh, you might want to look at this, and then it was this passive-aggressive Pastebin thing about my code and how amazing his is, and I was like, oh thank you, 10% faster, that's really cool, but inside I was seething, and I'm a very competitive person by nature. So I went to look at his code, and he uses an algorithm by somebody with the best name in the world. I'm going to pronounce his name and not say anything rude: his name is Phillip

Paul Fuchs. That's how you say it, right? Yeah. So Mr Fuchs is a PhD, and the website he created about this algorithm looks exactly how you would expect a website created by a computer science PhD about an algorithm to look. Don't let that distract you, though, from the fact that it's super interesting. This is the pseudocode from his website; the detail doesn't matter. What matters is that somebody wrote in with this question as part of a practical for one of the courses he was giving, and he sat down and thought about it and realised you can generate permutations much faster without recursion and all sorts of other things, and he created this fantastic algorithm called QuickPerm.
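For a flavour of it, here is a rough Rust rendering of the counting variant of QuickPerm, sketched from memory of the published pseudocode; it is not the code atom or I actually shipped, and details may differ:

```rust
// Counting QuickPerm: an index array p drives repeated swaps of the head of the list.
fn quick_perm(mut a: Vec<u8>, mut emit: impl FnMut(&[u8])) {
    let n = a.len();
    let mut p: Vec<usize> = (0..=n).collect();
    emit(&a); // the original ordering counts as the first permutation
    let mut i = 1;
    while i < n {
        p[i] -= 1;
        // for odd i swap with position p[i], otherwise with position 0
        let j = if i % 2 == 1 { p[i] } else { 0 };
        a.swap(i, j);
        emit(&a);
        i = 1;
        while p[i] == 0 {
            p[i] = i;
            i += 1;
        }
    }
}

fn main() {
    quick_perm(b"0123456789".to_vec(), |p| {
        // in the real tool this output would be buffered; println! keeps the sketch short
        println!("{}", String::from_utf8_lossy(p));
    });
}
```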

So atom was using QuickPerm for his version of permute, so I thought, OK, let me use it, and I implemented it into this thing we're creating. So that's 15 seconds, then five times 900 milliseconds, and how long did the other one take, 169 milliseconds, five times; so how long is this whole process going to take, how long do I have to fill the air? 15 minutes? No. OK, you're banned from answering any future questions. So we implemented QuickPerm and we're now down to 13.7 milliseconds, which is

231 times faster than our original version. So that's great, that was fun. Was that kind of exciting? I thought it was; this gives me a little bit of a thrill. And the worst part, and I told you I'm competitive by nature, the worst part is that when you compare this against hashcat's permute, his is still faster: it's 9.9 milliseconds, reliably, and mine ranges from about 10 to 14 milliseconds, and I honestly don't know what to do next. The code is on the internet if you want to make it better. [Audience suggestion] I don't know what that is, thank you though, and I really look

forward to the PR on the repository. Yeah, github.com... what is the repository... github.com/singe, fast permute, it's there. Yes? [Audience comment about the standard deviation] Yeah, mine is less consistent because my... your standard deviation is much lower? Oh, you're right, you're right. Huh, take that, atom. No, because this runs really quickly you should actually run it like a hundred times, and, you can see what time of night I was doing this, and I was on battery, so it's not ideal test conditions, but you're right. OK, so having played a lot with these various optimization things, this is kind of the hit list I put together.

So when you're optimizing stuff, what should you look for? The first is the algorithm. You saw what a dramatic speedup that gave us. It doesn't really matter what order you work through these in, but the algorithm can make a fairly big difference, so think about what you're doing: all that Big O notation stuff that you ignored in computer science (maybe that's just me) turns out to make a big difference. The next one is network communications. Network communications are kind of the slowest thing, much slower than your local I/O, so being able to optimize that makes a big difference. Hackers do lots of things over HTTP, so using chunking,

using HTTP/2 streams, all sorts of things like that can get you quite significant speedups. Then your local read and write I/O: understand where things are blocking, where things are going slowly, and we're going to spend a bit of time on that. Then the language you're using. I did a lot of things in Python because that was easy, and then Python is slower than a compiled language, so you go to a compiled language, and then the compiled language is slower than your Python because you just wrote the code wrong; anyway, it's a cycle, but spend some time thinking about what language you're using and how it works for what you're doing. Maybe there's a domain-

specific language that could be useful; maybe there are some really fantastic compiler optimizations in LLVM for the specific problem space you're in. It's totally worth looking at those things. And then, people talk about this a lot, but it's also worth looking at memory allocation: the less time you spend needing to clear large swathes of the heap to put things on it, the faster things will go. For example, with the buffered permute thing, before the change we needed to create a vector, put a whole bunch of stuff in it and call the formatter, and those memory allocations are slow; by just sticking characters straight into

the string, we didn't need to allocate any more memory on the heap, we're just filling already-allocated memory. A lot of the time people think threading will make stuff go faster, and a lot of the time threading doesn't make stuff go faster, or you end up with concurrency problems that make it not work properly. So use threading right at the end, and part of that is because you need to understand the problem well enough to know where your actual blockers are, and then figure out whether multi-processing or concurrency is actually going to be beneficial to you. Sometimes it's not: if you're I/O-bound, then having more

processing power isn't necessarily the thing that's going to save you; if you're memory-bound, then more processing power isn't going to help either. And then, unless you're much smarter than me (so pretty much all of you), concurrency is really hard, really really hard, and from a performance point of view all of the synchronization needed across concurrent programs can make things really slow, which is why pretty much every time, the first version of a threaded thing I write is slower than the single-threaded one, because I've introduced some kind of state blocking problem. The reason I say this is because pretty much everyone

does threading first, and then you have a lot more complex code that's going slower and you don't know why. OK, so this is now a theme: this is what this talk's actually about, and much like the one earlier today, it took me half the talk to get there. I got this laptop, you see the crab got that laptop, and it's one of these cool Apple silicon M1 things, and every time I get a new laptop I run hashcat's benchmark to see how much faster it is than the old laptop. Sure, other people run games or something, but hashcat is what us hackers look at. And it was much faster than my old

one, like, sweet. So then I ran my password cracking scripts against my old corpus of things, hoping I could find new things that would drop out, and it ran much, much slower, and I was like, boo. So then I went onto the hashcat GitHub and there's this issue there, not my issue, somebody else's, and if you look it's got 70 comments. This goes way back, November 13: there were a whole bunch of people going, when will Apple silicon work, why is Apple silicon slower, things like that. I genuinely tried to contribute to figuring out what the problem was, I did some hard computer science things to figure it out, and they had no idea

what was going on, so they eventually just rewrote a whole new Metal backend, and I'll show you how well that worked. But basically this was all born from the fact that I got a new laptop and hashcat didn't work well on it, so I thought, let's go play. The other thing was I was having an argument with Leon about something and I said, I bet you can make a purpose-built, domain-specific tool really quickly and easily that'll be faster than the big, general, every-use-case tool. I think I was half right. OK, so NTLM passwords: if we want to crack them, what does that look like? The way NTLM works, and by the way I

really hate that it's called NTLM, because there are LM hashes, and technically these should be called NT hashes. You get an LM hash and an NT hash if any of you extract things from a domain, and it would make sense if "NTLM hash" meant those two things together. So why don't we just call the NT part an NT hash? I just started calling it an NT hash, and then Microsoft people on Twitter were like, no, it's not called an NT hash, here are the documents, it's called an NTLM hash, for no explainable reason. OK, so the way an NTLM hash works is you take the cleartext

password, in this case Password1 (anyone want to own up to that one?), and that is bytes. If we turn it into bytes, I mean, we don't turn it into bytes because it already is bytes, but the byte representation, the ASCII number... not the ASCII number, this is a hex number, it is the hex number, yes, OK, performance anxiety... the hex number is there. Then for NT hashes you have two operations, which is why they're so efficient to crack. The first is you UTF-16 little-endian encode it, and what hashcat did for ages is just bodge in a null after each byte, so there: 50, add a null, 61, add a null, 73... It's UTF-16, so

there are other code points which aren't a single unit wide, so technically you could have weird Cyrillic characters or kanji or things like that, but in this case it's just vanilla ASCII, so these are all nulls and it just makes it wider. And then you do an MD4 operation on it. So that's how you create an NT hash from a password.
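A hedged sketch of that construction, assuming the RustCrypto md4 crate (the talk doesn't say which MD4 implementation was used):

```rust
use md4::{Digest, Md4};

// UTF-16 little-endian encode the password, then take a single MD4 over the bytes
fn nt_hash(password: &str) -> Vec<u8> {
    let utf16le: Vec<u8> = password
        .encode_utf16()
        .flat_map(|u| u.to_le_bytes())
        .collect();
    Md4::digest(&utf16le).to_vec()
}

fn main() {
    let hash = nt_hash("Password1");
    println!("{}", hash.iter().map(|b| format!("{:02x}", b)).collect::<String>());
}
```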

Then, if we're cracking, you take a small hash list; generally you don't have a giant hash list, your password list is always going to be bigger: you want to try a bajillion passwords against a million hashes, and you're probably never going to have a bajillion hashes against a million passwords, that would be weird. You go through each hash in the hash list, you go through each word in the word list, you do this operation, and if they match then it's successful. So this isn't a difficult computer science problem: do a thing, check that it matches another thing, and when they match, great. So I did that, and it was about eight lines of code, and much like the initial permutation thing it was really slow, much slower than everything else. So I packed my crab backpack and went on an optimization journey. These

are some of the more interesting optimizations, but I don't want to spend too much time on them; there's a blog post where I talk about them, linked at the bottom there. One was: if you look at smart crypto people, which I'm definitely not, I don't understand what they're doing half the time, it's all bit shifts and registers and things like that, they've obviously written a fast implementation. There are these RustCrypto people who have formal verification methods, but it turns out it actually wasn't too difficult to make their stuff run faster, because I don't care about safety and all sorts of other things. You can

kind of ignorantly dive into well-defined implementations, make them less well defined and more single-purpose for your thing, and get some speedups; don't be scared of trying to optimize other people's libraries. Then the hash lookup problem. You're generating hashes by going through your word list, and then you want to check whether any of the existing hashes you have match, rather than taking a hash and going through the entire word list, and then taking another hash and going through the entire word list again. Initially I thought we could use B-trees, because that's what databases use and that's quite efficient, so I implemented B-trees and those worked quite well, and I

thought I was smart, because it sounds smart, doesn't it? But then a friend pointed out to me that if you use a hash table, that's actually much faster, as you can just look the thing up. And then I worked out that, wait, these are already hashed, they're hashes, so you don't even need to hash them to put them in a hash table: you can use a no-hash hasher, and then you get a hash table with only a very minor cost of building it. That was pretty cool, and got quite a lot of extra performance.
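A hedged sketch of that lookup idea, assuming the nohash-hasher crate (the real implementation may well differ):

```rust
use nohash_hasher::BuildNoHashHasher;
use std::collections::HashSet;

fn main() {
    // NT hashes loaded from the hash list (hypothetical placeholder data)
    let targets: Vec<[u8; 16]> = vec![[0u8; 16]];

    // Key the set on the first 8 bytes of each digest: it is already uniformly
    // distributed, so the table needs no further hashing of its keys.
    let mut set: HashSet<u64, BuildNoHashHasher<u64>> =
        HashSet::with_hasher(BuildNoHashHasher::default());
    for t in &targets {
        set.insert(u64::from_le_bytes(t[..8].try_into().unwrap()));
    }

    // For each candidate password, compute its NT hash and probe the set.
    let candidate_hash: [u8; 16] = [0u8; 16]; // stand-in for a freshly computed NT hash
    if set.contains(&u64::from_le_bytes(candidate_hash[..8].try_into().unwrap())) {
        // full 16-byte comparison against the matching target(s) would go here
    }
}
```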

Then I implemented a shitty Bloom filter; if anyone knows what a Bloom filter is, you could explain it to me and then I could make it less shitty, but basically it just checks, particularly for small hash lists, whether the beginning byte even exists in the hash list that I have. If not, exit early rather than going through the full operation.
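As a toy illustration of that early-exit check (this is just a sketch of the idea as described, not the actual tool's code):

```rust
fn main() {
    // NT hashes from the hash list (hypothetical placeholder data)
    let targets: Vec<[u8; 16]> = vec![[0xAA; 16]];

    // Record which first bytes occur anywhere in the target list
    let mut seen = [false; 256];
    for t in &targets {
        seen[t[0] as usize] = true;
    }

    let candidate_hash: [u8; 16] = [0u8; 16]; // stand-in for a freshly computed NT hash
    if !seen[candidate_hash[0] as usize] {
        // first byte never appears in the hash list, so skip the full lookup
        return;
    }
    // otherwise fall through to the hash-table lookup
}
```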

But at this point everything I was doing was still slower than hashcat. Then I thought: threading. This was a terrible idea; every standard technique I used was slower. I had a very helpful conversation with one of the authors of Rust, who pointed out very kindly how the mistake I was making is a really common mistake for beginners in Rust, and, I'm not saying it's my doing, but they introduced something called scoped threads, probably just because of how irritating I was at pointing out the problems with their standard threading techniques. But in the end, what worked out really well, and I'm still not 100% sure why this works: most threading implementations will use channels (if you're familiar with Go, you know all about channels), and there will be a single receiver and lots of senders, so each of the threads is a sender and they send to the single receiver in the main thread. But I didn't want to do that. I wanted to have something reading passwords from a

file and then sending them off to hashers, which would then do the comparison, so I built it the other way around. The free benefit you get from using channels is that you kind of get a queue, so the threads can wait around if there's no work, and if there's lots of work they can pick stuff off the queue, which means you can monitor how efficiently you're using that queue and it gives you a tuning parameter. All of that I thought was kind of cool and interesting, and in the end, with 150 lines of code, I was able to be much faster than hashcat at a lot of things.
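A hedged sketch of that reversed layout, assuming the crossbeam-channel crate for a multi-consumer queue and a hypothetical wordlist.txt path; the real tool's structure may differ:

```rust
use crossbeam_channel::bounded;
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::thread;

fn main() {
    // One sender (the file reader), many receivers (the hashing workers).
    // The bounded channel doubles as the queue the workers drain.
    let (tx, rx) = bounded::<Vec<String>>(1024);

    thread::scope(|s| {
        for _ in 0..8 {
            let rx = rx.clone();
            s.spawn(move || {
                while let Ok(batch) = rx.recv() {
                    for candidate in batch {
                        // compute the NT hash of candidate and probe the target set here
                        let _ = candidate;
                    }
                }
            });
        }

        // The main thread reads the word list and feeds batches of candidates onto the queue
        let reader = BufReader::new(File::open("wordlist.txt").expect("open"));
        let mut batch = Vec::with_capacity(1000);
        for line in reader.lines().flatten() {
            batch.push(line);
            if batch.len() == 1000 {
                tx.send(std::mem::take(&mut batch)).unwrap();
            }
        }
        if !batch.is_empty() {
            tx.send(batch).unwrap();
        }
        drop(tx); // close the channel so the workers exit once the queue drains
    });
}
```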

The detail doesn't matter here, but what I did is: the first row is ntcrack and the second is hashcat, then ntcrack, hashcat, ntcrack, hashcat, then just ntcrack because hashcat took 10 minutes to do that one, and then ntcrack, hashcat. And in all of those, if you're looking at that number there, ntcrack was faster than all of them. So I did the victory dance, I wrote a snotty tweet at atom. He's a really great guy and he does fantastic work, I don't really have beef with him, and hashcat is still an infinitely better tool because it works on every piece of hardware in the world and it can do every hash; I wrote 150

lines of bad Rust, so let's not overcook the real achievement here, but still, I was kind of chuffed. Anyway, I put it to bed, I was done. Then matrix, the guy who actually implemented all of the Apple silicon stuff in hashcat, really nice guy despite the fact that he's Italian, no, I'm kidding, he really is a cool guy, and like all of these developers he does amazing free work to contribute to this project and I'm infinitely grateful to them, he sent me this message: I've done some tests with much larger word lists and I didn't get the maths

you blogged about. I was like, oh, he's probably running it wrong or something. Anyway, I checked and he was absolutely right: the second we got to over 10 gigs of word list, suddenly hashcat was so much faster, and I'm like, why? OK: how to read a file. If you go to the Rust handbook and you type in how to read a file, it tells you to do this, this is the exact thing that's on their web page: you get the file path and then you put the entire contents of the file in memory, and now you have the file. Technically that's the one line there, fs::read_to_string(file_path). Now that doesn't

make sense if you've got a 100 gig file; you don't have 100 gigs of RAM, so you can't do that, that's not going to work, you're going to have a bad time. So if you Google how to read a file fast, in almost every language there'll be some kind of buffered reader. So how do we make things go faster? We add a buffer; thank you, I'm going to start a sports team, you guys are great. The only code we've really changed here is we've added a block size; you've got to choose how big your buffer needs to be, and the trick is just to test for that:

run it across a bunch of ranges and see which is fastest. We create a buffer of the size of the block, and then we go through the file, dividing it by the block size, and read just those blocks. When you do that it goes faster, because we have a buffer, much like the other example. But then... oh, there was an animation, look at that, that was a pointless animation, I made a whole animation just to put an 'e' on the end.
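The block-read loop described there looks roughly like this (a minimal sketch; the block size and the wordlist.txt path are illustrative):

```rust
use std::fs::File;
use std::io::Read;

fn main() -> std::io::Result<()> {
    const BLOCK_SIZE: usize = 1024 * 1024; // worth benchmarking across a range of sizes
    let mut file = File::open("wordlist.txt")?;
    let mut buffer = vec![0u8; BLOCK_SIZE];
    loop {
        let n = file.read(&mut buffer)?;
        if n == 0 {
            break;
        }
        // process buffer[..n] here (split into candidate passwords, etc.)
    }
    Ok(())
}
```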

OK, so this is a slide stolen from Halvar, from that QCon talk of his, and he pointed out that a lot of our assumptions about how we read from files are based on old, outdated ways of reading from files. (How much time do I have left? Yeah, OK.) What he's saying is modern NVMe SSDs have lots of internal parallelism, whereas old spinning rust doesn't. Those of you who have a little more grey hair on the sides, or a little less hair on top, will remember the days of hard drives, which were spinning rust: you've got this single read head, maybe two read heads if you were lucky, and what that meant is if it needs to read from a lot of places on the disk, it's got to spin around to do that. So the way you read fast from disk is

making sure the data is next to each other and that you read it sequentially, so how the data was organized on the disk really mattered. But modern drives don't work that way, according to Halvar. What does he know? So I thought, let me test, because you can't take his word for anything. No, I'm kidding. Is this recorded? Oh, I'm definitely getting in trouble. So I thought, let me try threaded file reads. I mean, the universal wisdom is that this is a bad idea: in the old days of having a read head, it would have to read here and then jump over there, and everything would just go

slower. It was also a little difficult to do; it wasn't obvious. At first I had one file handle and tried to have lots of threads reading from it, but I eventually worked out that you need a separate file handle opened for each thread. But we implemented it, so let's check.
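A hedged sketch of that threaded-read idea, using std scoped threads; the thread count, block size and path are illustrative, and a real tool would also align chunks to line boundaries:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::thread;

fn main() -> std::io::Result<()> {
    let path = "wordlist.txt";
    let len = std::fs::metadata(path)?.len();
    let threads = 8u64;
    let chunk = len / threads + 1;

    thread::scope(|s| {
        for t in 0..threads {
            s.spawn(move || {
                // a separate file handle per thread; sharing one handle serialises the reads
                let mut f = File::open(path).expect("open");
                f.seek(SeekFrom::Start(t * chunk)).expect("seek");
                let mut remaining = chunk as usize;
                let mut buf = vec![0u8; 1024 * 1024];
                while remaining > 0 {
                    let want = remaining.min(buf.len());
                    let n = f.read(&mut buf[..want]).expect("read");
                    if n == 0 {
                        break;
                    }
                    // process buf[..n] here
                    remaining -= n;
                }
            });
        }
    });
    Ok(())
}
```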

OK, so now we're going to run all of the different things we spoke about: we're going to just read from the file vanilla, we're going to add the buffer, and then we're going to do threaded file reads. We're at 1.6 seconds for the vanilla, with the buffer we get to 358 milliseconds, and threaded file reads are about half of that at 143 milliseconds. So it turns out that using threads goes faster for file reads; who knew, like Halvar, for example. But this problem is kind of related to the problem I was having trying to make my shitty Rust code run faster than hashcat, and what it came down to was not hashing but caching. Now I've gone too slow... So, dd: we're reading from a file, we're writing it to /dev/null at a certain block size; this is going to repeat, we'll see it again. When we ran the first dd, there we go, you can see the speed over there,

116. I'm using something called vmtouch, which shows me the entire file now exists in the disk cache. I then evict it from the disk cache, run it again, and you can see that's much slower (I don't want it to repeat again): it's 57% slower. So the speed was coming from the fact that it was in the disk cache. And when we rerun the benchmark, evicting stuff from the disk cache each time, threaded file reads are only just as fast as the buffered one. So it's not necessarily properties of the drive that are magic, or certainly they're not exposed at user space; it's happening in kernel space, it's a property of the file

cache. So, great, I'll just cache the word list and then my ntcrack will go faster, right, because it's cached. If I cached a 5 gig file it was fine, things would go fast, they would be good, but that was the same size I was using before anyway, so it didn't prove anything. So this is a 110 gig file, and here you can see I ran vmtouch and it took 245 seconds, which is why I'm not going to make you wait through all of that, despite the previous demos, and it shows that the entire thing is in the file cache. Great. Except it's a lie, because your file cache isn't

big enough to hold 110 gigs. When you run it again, it actually shows that only the last part is in the file cache. You can imagine from a password cracking perspective that's super inefficient: it's basically sticking stuff into the cache after you've dealt with it and you're never going to use it again, so it's caching in the wrong way. There are all sorts of things you can do to try and make that go faster; there's fadvise and madvise, maybe you know what those are, ways of trying to signal to the kernel how you're going to read the file so it can speed things up. None

of those did anything, and in the end what I did is something I'm going to show you. OK, so for a live demo you have to sacrifice a chicken, which is why that crab is chasing a chicken, and I didn't sacrifice a chicken. And this needs to be bigger, and I need another terminal, because I realized I closed the desktop where I had these demos set up, like an

idiot. OK, so I'm going to run ntcrack, and I'm going to do it against a file called med, which is 110... no, it's 10 gigs big. If we use vmtouch, let's visualize med, you can see that part of the file is in the cache at the moment. So what I'm going to do is evict it, and now if you look at it you can see nothing's in there. Now I want to keep an eye on what's happening there, so for med I want to watch, every second, what's sitting in the

file cache. Now when I run ntcrack, you can see there's like a worm: that thing's going along, caching chunks. What I did is I changed the behaviour to cache the part we're about to use, then evict it and cache the next part we're going to use, which made things much faster. But one of the problems I had in figuring that out is: how do you cache it? Because if you look at what I showed you here, if it's going to take 245 seconds to cache the file, then that's going to take a really long time, it's not fast. But remember this demo: when I ran dd, it

would put the whole thing in cache and it would go through really fast. So it turns out you can just do a read and do nothing with it, and it'll put the thing in the cache, and that goes really fast. You do have to do things to tell the compiler not to optimize the read out, because you're doing nothing with it, but doing that sticks things in the cache and then things go much faster.
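A hedged sketch of that cache-warming trick, mirroring what's described above; std::hint::black_box here stands in for "do something to stop the unused read being optimised away", and the real tool may do it differently:

```rust
use std::fs::File;
use std::hint::black_box;
use std::io::{Read, Seek, SeekFrom};

// Read the next region of the file and throw the data away, purely so the kernel
// pulls it into the page cache before the cracking loop gets there.
fn warm_cache(path: &str, offset: u64, len: usize) -> std::io::Result<()> {
    let mut f = File::open(path)?;
    f.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; 1024 * 1024];
    let mut left = len;
    while left > 0 {
        let n = f.read(&mut buf[..left.min(buf.len())])?;
        if n == 0 {
            break;
        }
        // mark the data as "used" so the do-nothing read isn't elided
        black_box(&buf[..n]);
        left -= n;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    warm_cache("wordlist.txt", 0, 1024 * 1024 * 1024) // hypothetical path and region
}
```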

So after that, for large files, this is a run against big, which is 110 gigs: it's 50% faster than hashcat. Here you can see ntcrack at 109 seconds versus hashcat's 258 seconds. The reason there are so many switches on the hashcat command is that those are the things it uses for its benchmarking, to make its stuff go faster. I also did warm-ups... did I? Are there warm-ups? I don't see warm-ups in the screenshot. Anyway, hashcat will pre-cache: the first time you run it on a large word list it will cache it, and then subsequent runs are faster. I had already run that, so you're not getting those problems; it wasn't compiling any of its kernels, those had already been compiled, so there are no artificial slowdowns in there. And this demo, which I didn't close, or did I randomly resize it, can give you an idea of what

that looks like. On the left we're going to run hashcat against med, and you can see it's not pre-caching anything, so it's going to take about 24 seconds. At the same time I'm going to run ntcrack, because we might as well make the CPU do more than it's supposed to at once, and you can see 41% of the file was already cached, but not the useful part, so it's going to evict that from the cache and then cache the bits we actually need. You can see it's finished at 11 seconds while hashcat is still going over here, going through the same word list, finding the same things.

And this is the latest hashcat, I pulled it earlier upstairs, so we're still faster. Now, this is a completely artificial test, because the second you add any modifiers to that hashing, so if you're familiar with hashcat, you can brute force things by adding digits on the end and so on, then that stuff gets really efficient and it makes the GPU burn. So I'm using hashcat in its least efficient way and comparing against that. Once again, hashcat's amazing, the people who develop it are amazing; I'm using this really to learn things about computers and how that stuff works, not trying to diss hashcat. But then I thought, OK, this is kind of cool: I now

seemingly have an ability to deal with large files faster, and that's not a problem unique to password cracking; it's also, for example, something you have with grep. For those of you who know what grep is, it's a way of searching files for particular strings, let's keep it simple. If you work in security, particularly on the blue side of things, you're often searching large files for strings, and that time can take a lot of your life. So what we all do is move to ripgrep, because it's much faster; also an amazing tool, built by this guy named, actually I don't know what his name is,

Andrew something, BurntSushi, I only know the nick, but ripgrep is kind of the gold standard for the fastest grep. So here, I ported this technique over to grep, which I called sgrep, because singe and grep. And here's, so, it says ntcrack, no, there we go, sgrep, running against ripgrep, and you can see ripgrep takes 7.8 seconds where sgrep takes 3.5 seconds against that same file. And, oh, this is with warmup and the cache, so I was giving ripgrep the opportunity to work on a cached file, so it had the same speed advantages sgrep had; it wasn't the file

caching necessarily that was helping. Anyway, long story short, it turns out you can read files using threads much faster than doing these other things, and you can apply it to password cracking. So really what I wanted to try and convince you of is that optimization can be fun, and it can be an interesting way of learning things about how computers work and how code works. There's no kind of stack limit here: you can deal in user space, you can deal in kernel space, you learn things about how your computer works; at some point I was reading kernel code to understand how macOS and Linux file caching work. It's a great way to learn

things. I'm also not suggesting you leave security to go into performance optimization; I think there's a world that exists for both. And I hope this was useful to you. Thank you. [Applause]