
the Georgia Institute of Technology. This is his first time at BSides here in Kman, so welcome, and we look forward to hearing your thoughts.

Just let me play with the remote here for a second... okay. Hey everyone, I'm Steve. I'm super excited to be here; it's an honor and a privilege to be speaking in front of you today. I'm a senior researcher on the Cobalt Strike team. Prior to that role I was a lead developer at a small business in Maryland, where we did some adversarial AI work, and until almost a year ago I was an active-duty Marine, doing a different kind of breaching than what I'm focused on now. So I consider myself a pretty average guy all around.
But I think I have a fairly unique perspective on something that I consider a bit of a problem for the future of red teaming, and I'm hoping to talk through some of that today. To me the problem — and I'm going to refer to defensive operators broadly as blue — is that blue has done a really, really good job of consolidating all this telemetry, building all these models, and getting really good at automated, intelligent detection mechanisms and methodologies. Meanwhile red — offensive operators, red teams — hasn't really done that. Offensive development in general is still very vibes-oriented: things seem like they should be evasive, so they probably will be. There are some technical aspects to it as well — you can do stack traces and self-evaluate your own detection surface — but for the most part we don't see that same granularity in terms of harvesting telemetry to make red teaming better, and that's what I want to talk about today. So to me, with a bit of my AI/ML background, the question is Skynet: when are we going to bring AI into offensive development? When are we going to have autonomous Beacons doing all kinds of bad things everywhere? As I started to look at this question, it felt a bit broad.
Some of the challenges with that question: there's a limited set of resources if you go out and try to research this today. There are a couple of places doing interesting things — there's a really interesting NetSPI article we'll come back to — but by and large there's a limited corpus of references you can use to move forward with your development. There's also an overemphasis on generative AI technology. Really since the advent of ChatGPT, a lot, if not the majority, of discussion on AI has been focused on generative technologies, and that's not necessarily viable when you're trying to push the envelope of what's possible in endpoint offensive tooling. Here's a quick screenshot: if you searched for artificial intelligence when I was making these slides, all the results on DuckDuckGo were generative-oriented — not necessarily relevant for edge-device tooling. And then there are attack-model limitations. Red teaming in general is very target-specific, so creating generalities across targets can be difficult, and there's a data scarcity that makes building models from given clients difficult. So I thought the question was a little generic.
So I made it a little more specific: what are we doing now that we could do a little better with a small dash of AI and machine learning? To me, if you're doing red teaming in 2024, the number one piece of malware you write is shellcode loaders. For the uninitiated: in a minimally viable attack architecture you have three pieces — the attacker machine, which can be a Kali box; your attacker server, your team server; and the actual initial access mechanism, which can be a malicious document, an EXE, or a DLL that you're sideloading on the target system itself. The shellcode loader loads, typically, C2 shellcode based on some programmatic logic. You might see that implemented as anti-VM checks, or as guardrails to ensure you're only detonating your shellcode in a specific set of target environments. So, one more time from the top: shellcode loaders typically initiate C2 sessions, and they can be designed to preemptively check for virtual machine environments. That's to limit the analysis being done on your C2 shellcode specifically — you don't want to give away the configuration of your team server, for example, by prematurely unmasking your Beacon in an analyst's virtual machine — and that improves your opsec.
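The gating flow just described can be sketched as plain code. This is Python purely for illustration (a real loader would be a compiled language), and every name here — the checks, the payload, the cleanup — is a hypothetical stand-in, not code from the talk:

```python
# Minimal sketch of loader gating: detonate only if every guardrail passes,
# otherwise run cleanup (e.g. self-delete). All callables are stand-ins.

def environment_checks_pass(checks):
    """Run every anti-VM / guardrail check; all must pass."""
    return all(check() for check in checks)

def run_loader(checks, detonate, cleanup):
    if environment_checks_pass(checks):
        detonate()   # e.g. map and execute the C2 shellcode
        return True
    cleanup()        # e.g. self-delete and exit quietly
    return False

# Usage with dummy callables standing in for real checks and payload:
fired = []
run_loader(
    checks=[lambda: True, lambda: True],
    detonate=lambda: fired.append("shellcode"),
    cleanup=lambda: fired.append("cleanup"),
)
```

The point of the structure is that the guardrails are pluggable: the intelligent checks discussed later slot in as just another callable.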
There are some other operational considerations you have to think about as you develop a shellcode loader. Generally you want it very small in size — the smaller the better. For these proofs of concept I gave myself a little leeway: in AI/ML land, megabytes is pretty small, but in malware development, megabytes is enormous, so I landed in the middle by settling on double-digit megabytes. You want a versatile implementation; you'll typically see loaders implemented as EXEs or DLLs. An EXE isn't really modern or relevant for modern environments, but you still occasionally see it. And you want few dependencies: you shouldn't rely on your target environment having a specific version of a specific library — you either bring it in yourself or strip it from your shellcode loader entirely. As we go into this discussion, it's important to think about how things are really done now. Check Point Research does an amazing job of aggregating various anti-VM mechanisms, so we can talk through one of them, for those of you less familiar with
C++. One of the ways you can check for a virtual machine is to take a snapshot of all the running processes and compare each against a string that indicates it's running in some sort of virtual machine — in this case, vmtoolsd.exe. If you see that process running, you immediately bail out and delete yourself, or perform some sort of cleanup actions. This specific method is interesting because signatures are difficult to generalize for it: you obviously have to know a priori what the EXE you're looking for is supposed to be, and you can find that with a string dump, which goes a little bit into the IOCs of a specific shellcode loader.
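The talk's example is C++ using a process snapshot; here is the same idea sketched in Python. The indicator list is illustrative, and process enumeration is passed in as a parameter so the check itself stays portable — on Windows you would fill it from CreateToolhelp32Snapshot or `tasklist`:

```python
# String-match anti-VM check: bail out if a known VM helper process runs.
# Indicator names are common examples, not an exhaustive or official list.
VM_INDICATORS = {"vmtoolsd.exe", "vboxservice.exe", "vboxtray.exe"}

def looks_like_vm(running_processes, indicators=VM_INDICATORS):
    """Return True if any known VM helper process is present."""
    running = {name.lower() for name in running_processes}
    return bool(running & indicators)

print(looks_like_vm(["explorer.exe", "vmtoolsd.exe"]))  # True
print(looks_like_vm(["explorer.exe", "chrome.exe"]))    # False
```

Note how the indicator strings sit right there in the binary — exactly the IOC problem the talk raises next.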
More interestingly, going back to that NetSPI article I mentioned, there are some interesting things proposed in it. One of the things they talk about is using the unique process count, the number of users, and the ratio of unique processes per user as features in a dataset to train a neural network to distinguish between VMs and bare-metal machines. From my perspective, that approach is a little limiting, a little constrained. Part of that comes from the requirement to have Python at runtime, which, as we covered in operational considerations, is not necessarily viable. There's also, potentially, the use of neural networks itself: neural networks are computationally expensive to train, so there are limitations there. And then there's preprocessing to be done — in the article they talk about having to normalize the data — but from my perspective, by changing the algorithm type you can potentially get around that requirement; we'll come back to that. If you're going to implement a shellcode loader in C or C++, or really any compiled language, as is the convention, it really constrains the data analysis and preprocessing you can do in your loader to a reasonable time. Shellcode loaders are the beginning of an operation — they're typically how you initiate a session — and you don't want to spend a ton of time reimplementing machine learning and AI primitives from scratch just to open a Beacon. So for my research I started with the existing dataset from the NetSPI article, for simplicity, and my goal was to build the model in Python but implement its learnings in C and C++. My second goal, obviously, was to successfully distinguish a VM
from a bare-metal machine with some degree of confidence. Quick and dirty on the machine learning side: the dataset in this case was stored into DataFrames — excuse me, I stored it into DataFrames — and the columns were the specific features outlined in the NetSPI article. Like any machine learning training process, I broke it down into three steps: training, validation, and testing, where testing in this case meant running the code on a previously unseen virtual machine and bare-metal machine. We want Python because it's the de facto standard for AI and machine learning: it gives us easy data manipulation and an expansive set of machine learning libraries, so we don't have to go through the intensive process of reimplementing these from scratch. One of the first decisions I made was to implement this as a decision tree classifier as opposed to a neural network, and that's because when you train a decision tree on this dataset, you actually get a model that can be expressed as naive conditionals in post-training code — we'll see what that means more clearly in a moment — and you can leverage those learnings in really any programming language. That's the benefit of having the model expressed as conditionals.
So, once again, the three features: the number of unique processes, the ratio of unique processes to users, and the number of users. The initial dataset was something like five different machines, so to augment it I requested volunteers who let me run my very benign Python script on their machines and their virtual machines — I'm very thankful to those of you who did. I got a total of 22 entries, nine of which were host (bare-metal) machines and 13 of which were virtual machines. I completely out-scoped bias and outlier analysis on the data; if you look at the data, there's one point that's very clearly an outlier — it creates an extra node in the tree — but for expedience I just ignored it. For training, I started with 75% of the data used for training and validation, about 16 entries, and I did some preprocessing data augmentation via stochastic resampling: all that means is I resampled from those 16 until I had 100 entries in the dataset I used to train the actual decision tree. Then, for cross-validation, I did 5,000 rounds of cross-validation for hyperparameter tuning.
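The stochastic resampling step just described is simple to sketch. The feature rows below are made up for illustration — (unique_processes, users, processes_per_user) — not the talk's real dataset:

```python
# Sketch of stochastic resampling: draw with replacement from the ~16
# training rows until we have 100. No new data is invented; the augmented
# set is just repeated draws of the originals.
import random

random.seed(7)  # fixed seed so the example is reproducible

training_rows = [(180, 2, 90.0), (95, 1, 95.0), (60, 1, 60.0), (210, 3, 70.0)]

def resample(rows, n=100):
    """Upsample by drawing n rows with replacement."""
    return [random.choice(rows) for _ in range(n)]

augmented = resample(training_rows)
print(len(augmented))                        # 100
print(set(augmented) <= set(training_rows))  # True — nothing new invented
```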
If you remember the decision tree slide with the scikit-learn definition, there's what looks like a hundred different hyperparameters, and you can tune those automatically with a single API call from the scikit-learn library; I did that via randomized search. For testing, I did a final round against the six unseen data points, and then I actually used the model on my own machine, which hadn't been part of the data at all. The code to do all of this is on the GitHub — the link is at the end of these slides. When you actually train and build this model and visualize it, what you see is
nodes, with conditionals at the top of each node, and you can actually just express those as one if statement, like this. That turns into — maybe we clean it up a little, make it part of a function — something like this. (Notepad++ is the best IDE of all time.) Then you can compile and test it, and you'll get something like this. I added some self-deletion to clean up the EXE on VM detection, and if you're running on a bare-metal machine, it executes your shellcode.
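As a sketch of what "the tree as one if statement" looks like — in Python here rather than the C of the slides, and with illustrative thresholds: the real boundary values come out of the trained tree (the Q&A mentions a processes-per-user ratio around 75, but the exact numbers below are hypothetical):

```python
# The trained decision tree collapses into naive conditionals like this.
# Threshold is an illustrative stand-in, not the real trained value.
def is_vm(unique_processes: int, users: int) -> bool:
    ratio = unique_processes / users
    # Hypothetical learned boundary: analysis VMs in the dataset tended to
    # show a low processes-per-user ratio compared to bare metal.
    if ratio <= 75.5:
        return True
    return False

print(is_vm(60, 1))   # True  — VM-like profile
print(is_vm(240, 2))  # False — bare-metal-like profile
```

Because the model is just arithmetic and comparisons, this same function ports to C, C++, or Rust verbatim — which is the whole point of choosing a tree over a neural network.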
This approach has some advantages. Extracting the conditional expressions from the decision tree gives you the ability to implement the model in any programming language. The opaque conditional is also harder to reverse engineer than a string would be — there are no strings to decrypt, no strings to decode; it's an opaque set of conditions. And it overcomes some types of VM masking: something like VBoxCloak — moderately common, maybe a little less common — actually hides some of the strings that VM detection might use, and we entirely circumvent that by examining the raw number of processes. Disadvantages: we have to do double the number of queries — we query the number of processes and the number of users — which is a little complicated in C and C++, so there's that limitation. Additionally, to truly operationalize this you'd have to account for those biases in the data and the outlier I handwaved away in the proof of concept. So that brings us back to our initial question: can we make loader design a little more intelligent? I think we did; we successfully saw that happen.
A little more interesting, maybe, is this idea of target user verification. Instead of tying our shellcode detonation to a user account, a user machine, or even a host, can we tie it to the individual behind the screen, at runtime? I thought that was pretty interesting, so that's the next thing we're going to talk about: using AI to find John Connor. I retained the requirement for this to be done in a compiled language, and we still have the requirement for minimal runtime dependencies — we can't expect the machine to have the libraries we need, which is limiting in AI. And ideally, we want to leverage the existing AI ecosystem. Earlier we talked about building our own model with our own data to do our own thing, but even better would be taking somebody else's model — from a Google or a Microsoft, somebody that's spent a lot of time and resources developing a very robust model — and using that at runtime to do whatever screening process we want. Once again we retain our operational considerations: ideally we want it small (I generously gave myself megabytes again), and we still retain versatile
implementations. I chose Rust to do this in. That might be a little controversial, but for me the benefits outweighed the risk, and being able to do it in a memory-safe way was very interesting to me. Also, Rust's standard library has sockets out of the box, which is huge if you're doing pre-staged models or pre-staged shellcode — you don't have to wrestle with the Windows API; you can just do it out of the box. On minimal runtime dependencies: what I consider the big three AI libraries are up there — PyTorch, TensorFlow, and OpenCV. Because we want to leverage existing models, we ideally want one of those three, which means we have to statically link that dependency into our executable. Technically all three support being built statically, but you have to do it yourself, building from source, which is a bit of a heartache. It turned out that OpenCV was actually the easiest to build: it was the smallest library, it came with webcam access APIs, it gives us access to AI and machine learning primitives, and it comes with a set of pretrained facial detection
classifiers straight out of the GitHub repository — really awesome for what we were looking to do, and really, really interesting. So we can definitely check the box on leveraging the existing AI ecosystem, using those existing detection classifiers we talked about. Additionally, the OpenCV bindings for Rust come with ONNX bindings. If you're not familiar, ONNX is an open-source framework that allows cross-compatibility of models across the big frameworks — you can think of it as the JSON of model weights. Basically, you can train a model in PyTorch, put it in ONNX, and then load that model in TensorFlow. For our purposes that's very valuable, because we're using OpenCV, which is not where a lot of models are initially trained — you typically see models trained in PyTorch or another more robust framework — but we get some degree of compatibility through that set of APIs. Also super awesome for malware development: the Rust bindings support loading a model straight from memory, so you don't have to touch disk — you can download your model, load it straight into your executable, and you're good to use it.
Therefore you can leverage models to accomplish any sort of automated task. We're sticking to a facial verification context here, but given the access you have, you can achieve whatever automation you want in your shellcode loader or standalone EXE pretty readily. For this presentation we'll be talking about MobileNetV2; I experimented with ResNet as well — they're both image-based models for image classification. This is what I wanted in my head — I obviously didn't get the cool GUI — but I broke the facial verification process down a little like this: we have a reference image, which I embedded in the executable, and a webcam frame capture, which we do at runtime, and both of those go into our initial facial detection algorithm. We use that facial detection algorithm to crop the faces, extracting just the face from the reference image and from the webcam capture. Of note here: I embedded the facial detection model directly using the include_str! macro in Rust — I basically took the string that is the JSON model, embedded it directly in the
executable, and loaded it at runtime. Pretty neat. So we crop and resize those faces and extract the embeddings. Embeddings — big AI word — really just means the mathematical representation of whatever the thing is; in this case it's the image, so every pixel of that image becomes some sort of floating-point number. We can then use those with a comparison metric — in this case I chose cosine similarity — to compare the reference image against the webcam frame capture and output a classification. The threshold similarity I used was 90%; you can obviously tune that to whatever is reasonable for your risk tolerance.
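The comparison step can be sketched in plain Python (the talk does this in Rust via OpenCV; the short vectors below are toy stand-ins for real image embeddings):

```python
# Cosine similarity between two embedding vectors, with the acceptance
# threshold the talk used (0.90). Vectors here are toy data, not real
# face embeddings.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_person(ref_embedding, live_embedding, threshold=0.90):
    return cosine_similarity(ref_embedding, live_embedding) >= threshold

print(same_person([0.2, 0.8, 0.1], [0.2, 0.8, 0.1]))  # True  — identical vectors
print(same_person([0.2, 0.8, 0.1], [0.9, 0.1, 0.4]))  # False — dissimilar vectors
```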
And then I think we have a demo of me looking goofy for a little bit.
Awesome, thank you. There were three scenarios in the demo, and we'll talk through them a little for clarity. The first was John Connor compared to myself: interestingly, I got an 80% similarity, which is a little high, likely because the comparison is done on the cropped images. There's a different way to do this comparison, using key facial features, which would probably have been a little better — that's important to note. Then John Connor compared to John Connor: I couldn't find John Connor at the time of the demo, so I just used the same picture I had of him. Then, to round out the demo, I compared my LinkedIn profile picture to myself, and that validated the intent of the design and really confirmed that we can use facial verification in a standalone executable to recognize a specific individual. So, back to our question: can we do target user verification with AI/ML in a shellcode loader? The answer is yes; we saw that. Now it's important to talk about some alternative approaches. I'm going to spend a little time here — one, because I'm running a little fast, but two, because I think it's important.
I didn't necessarily have to do the comparison the way I did. OpenCV has a very robust set of machine learning primitives you could use to build your own model to detect a specific individual — there's a list of classifiers on the screen, and it's super robust. You could also use a larger model. I used MobileNetV2 because it gave me an executable roughly 26 megabytes in size, which is already pretty big, but if you wanted a better, more accurate model you could use something like ResNet-50, which I did in some of my testing. The results were better — the comparisons were probably more accurate, I would say — but the executable size was around 100 megabytes, and that's not necessarily viable: you're not going to ingress a 100-megabyte executable just to open a Beacon. Debatably, maybe. And then, what's really interesting: you could remotely stage your models. I embedded them directly in the executable, but you could pretty readily stage them yourself on GitHub or Hugging Face — or, maybe even better, use one of the models that's already on Hugging Face and just download it at runtime. The chance that Hugging Face is blacklisted is pretty low, so you have ready access to those models.
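Remote staging could look something like this sketch — Python for illustration, where the URL is a hypothetical placeholder and the fetcher is injectable, so the idea can be exercised without a network:

```python
# Sketch of pulling model weights at runtime and keeping them in memory
# only — never touching disk. MODEL_URL is a placeholder, and `opener`
# is injectable (defaulting to urllib) so the flow can be tested offline.
import io
import urllib.request

MODEL_URL = "https://example.com/staged/model.onnx"  # hypothetical

def fetch_model_bytes(url=MODEL_URL, opener=urllib.request.urlopen):
    with opener(url) as resp:
        return resp.read()  # raw bytes, handed straight to the ML runtime

# Offline usage example with a fake opener standing in for the network:
def fake_opener(url):
    return io.BytesIO(b"fake-onnx-weights")

weights = fetch_model_bytes(opener=fake_opener)
print(weights == b"fake-onnx-weights")  # True
```

In the Rust version described in the talk, the returned bytes would feed the OpenCV ONNX loader's from-memory path rather than a file on disk.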
For future research: I think we've seen some of the potential of AI/ML in the offensive context — I think we can safely say that. One way to push this further is doing all of it in a more position-independent context. In offensive security you're often operating strictly in memory, and being able to do all of these things directly from memory is more difficult. Ironically, my decision to use Rust for memory safety made this harder, because Rust is infamous for how difficult it is to load reflectively; that's a challenge I haven't overcome quite yet. An extension of this research would be a similar method via BOFs. What I think is interesting is that the more modern BOF templates that use Visual Studio make this a little easier, so I think in the next couple of years we'll see advances in that realm. Some other interesting possibilities: OCR for data exfiltration — your Beacon could potentially scrape PDFs and exfil only the relevant text data, as opposed to exfilling every document that has images in it and doing that analysis manually in post. And then runtime modeling of the target system: with access to machine learning primitives, your Beacons could do some really interesting things. You could cluster telemetry you're able to gather in user mode directly and potentially model the best time to check in — you'd be able to do things more dynamically than what's now the standard. So I think there's some interesting space for research there as well. And then team server command automation — that's obviously the gold standard of what everyone would want: a Beacon that can autopilot itself. I don't necessarily know if that's viable right now.
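The runtime-modeling idea — deriving check-in timing from user-mode telemetry — could start as simply as this sketch. Python for illustration, with made-up timestamps; whether you'd prefer busy or quiet hours is an operational choice, and I'm assuming here that checking in during peak activity helps traffic blend in:

```python
# Toy sketch: bucket observed activity timestamps by hour and pick the
# busiest hour as a candidate check-in window (assumption: blending into
# peak activity; the opposite policy is equally arguable).
from collections import Counter
from datetime import datetime

observed = [
    datetime(2024, 5, 1, 9, 12), datetime(2024, 5, 1, 9, 48),
    datetime(2024, 5, 1, 10, 5), datetime(2024, 5, 1, 14, 30),
    datetime(2024, 5, 2, 9, 3),  datetime(2024, 5, 2, 9, 55),
]

def best_checkin_hour(timestamps):
    """Return the hour of day with the most observed activity."""
    counts = Counter(ts.hour for ts in timestamps)
    hour, _ = counts.most_common(1)[0]
    return hour

print(best_checkin_hour(observed))  # 9
```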
I can't speak to that, but I do think it would be interesting. The basic building blocks of doing something like that exist — we've seen some of them here today — and it would be really interesting to see how that can be built on going forward. I want to give a special shout-out to the developer of the Rust bindings, twistedfall; he was very helpful in getting me sorted out and started with that repo. Thanks also to everybody who reviewed these slides and gave me feedback, and to everyone who let me run my Python script on your computer — thank you so much; this would not have been possible without your bravery. Ladies and gentlemen, I know I'm a little fast, but I'm Steve Selenas, and I'm ready to take your questions. Thank you.
Q: What happens if there's no webcam connected? Does your shellcode just exit?

A: Right now all the webcam stuff happens on the loader side, before the shellcode executes, and the loader just bails out and doesn't do anything — in Rust it's called panicking. But yeah, that's what happens right now.

Q: So basically, don't connect a webcam to your computer?

A: Yeah, that might be an interesting mitigation. I think that's already kind of a recommendation now, to be honest. But yes, definitely.

Q: Second question: did you run your Python code on virtual machines that were part of a sandbox — like CAPE Sandbox or something — or was it just someone's analyst VM, the kind they use to analyze malware, that you baselined your detection on?

A: For myself, I personally have a bunch of VMs, and some of those are designed for analysis. Because I was already so ecstatic that people were running my Python code, I didn't screen for that in my aggregation of the data, but that is something you'd have to think about. On the boundary I found in that analysis: effectively, you take this data, you outsource the analysis to the decision tree, and the decision tree spits out the relevant boundaries — the ratio of unique processes to users came out around 75. But that's biased by the machines that were part of the dataset to begin with. So, to your point, if you wanted to actually operationalize this, you'd have to screen for all those criteria. To the best of my knowledge I didn't have any Windows servers, which might have a different ratio, so there's obviously some complexity you incur when doing that.

Q: Cool, thanks. Cool research.

A: Awesome. Thanks again, everyone.