
Hello everyone. My name is Brandon Kovac, and I'm really excited to be here today. I did wake up with a slight cold, and I'm waiting for the DayQuil to kick in, so bear with me, but I promise I'll do my best. So let's get started. As I said, my name is Brandon Kovac. I've been into cybersecurity basically my entire life. I started off when I was about 14 working on a blue team, doing real-time network defense and attack mitigation for really large financial institutions. Fast forward to today, and I work as a senior red team consultant at the offensive security firm Bishop Fox, where, as I like to phrase it, I get to do the really fun stuff.

So why are we here today? As I said, I work on a red team, and part of being a red teamer is the ability to emulate threat actors' tactics, techniques, and procedures. I've always believed that the best defense requires a great offense, and in this world that really requires you to understand how things work, right? That's the intention of today's session. We're going to start off with a very high-level overview of AI terminology and concepts. I've tried to keep the content balanced, technical yet accessible, so people who have never been introduced to this before can still grasp it without it going over their heads. Then we're going to talk about how deepfake models are created, both video and audio, using strictly open source tooling that's available on GitHub. Then we'll talk about operational use cases, providing real-life examples of various assessments we've run that incorporated elements of deepfake video and audio. And finally, time permitting, we'll do a live demonstration.

So let's begin. There's a quote that everyone's probably familiar with: you can be whatever you want when you grow up. That couldn't be more true today, especially with the rise of artificial intelligence.
It's really shaping the future of social engineering and allowing threat actors to pursue all sorts of attacks whose tactics, techniques, and procedures include elements of deepfake video, which is video trained on public information and then used in video calls to perform social engineering, and voice cloning, which takes public information about a subject or target and essentially copies their voice, so that as you talk into a microphone, the source subject's speech comes out with a remarkable degree of accuracy. And although it's not in the scope of today's talk, this technology is also enabling people to do things such as generating phishing emails, and even conducting phishing at scale using tactics such as RAG pipelines to look at large data breaches and compose hyper-personalized emails at scale. As we saw last year during political campaigns, voice cloning and vishing have been used at scale to do things like impersonating political candidates and, in conjunction with robodialers, spreading disinformation. So although this technology can be used for good, it has some very serious security implications and ethical considerations that we'll talk about today.

So let's talk about the evolution of deepfake attacks. Starting in 2019, we had the first publicly known voice cloning incident, in which the CEO of a bank in England had his voice cloned. Long story short: a quarter million dollars lost. Fast forward to 2020: a finance professional at a multinational firm in the United Arab Emirates had their voice cloned, and the firm lost $35 million. Fast forward to 2023 (I guess I can say it here because we're in the US): threat actors from Iran targeted television networks and streaming providers to create deepfake videos of news anchors and then disseminate them across those platforms to spread misinformation. And lastly, what got me really interested in this was last year's first real-time deepfake incident, at least the first that we know of. A finance worker at the multinational firm Arup, in Hong Kong, thought he was on a live Zoom video call with the company's chief financial officer. Long story short, he was instructed to send a $25 million wire transfer, which he did, because his boss was telling him what to do on a video call. It turns out it wasn't actually his boss; it was a real-time deepfake. When that happened, I had three reactions: that's crazy, that sucks, but most importantly, how did they do it? Because ever since then, we've had a lot of clients of the company I work for reaching out
and asking us, "Hey, can you clone our CEO, or can you clone so-and-so?" and to start running engagements and doing social engineering with it. Turns out we can.

So before we dive into the meat of everything, I think it's important to establish a baseline of terminology. At its core, a model is essentially an algorithm that's trained on a data set; through a process known as training, it learns to recognize patterns and make a series of predictions. Then, through a process called inferencing, we take that trained model, provide it new, unseen data, and it makes a decision and provides an output based on its training. It doesn't matter what type of model you're creating, whether it's a text model, a photo or vision model, or an audio model such as a voice clone: the type of model you create determines what type of data is used to compose the data set. Typically that data set has to be processed ahead of training. You have to clean it, format it, and label it, and those substeps will look different based on the type of model you're trying to create. So for example, let's say you're creating a vision model that can detect a hot dog versus not a hot dog. You provide a data set of images and label each one "hot dog" or "not a hot dog," so that during training, whatever architecture you're using can look at that data, make a series of predictions, and produce the trained model.

Finally, once you have that trained model, you do the inferencing, which is probably the step most people here are familiar with; it's where all the magic happens. For example, ChatGPT was trained on a large corpus of text. You ask it a question, and it uses the information it was trained upon to generate a response. The idea is that you provide new, unseen data to the model, and depending on the type of model and whatever task it was trained to perform, it provides an output. Say we have a natural language processing model that can look at a string of text and do sentiment analysis, positive or negative, or a classification model that can look at a bunch of text and say this is spam or not spam. The idea is that you have a model trained to do a specific function: you provide it input, and it gives you output.
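To make that train-then-inference flow concrete, here's a minimal sketch in Python using scikit-learn. The tiny spam/not-spam data set and the example message are made up for illustration; the point is just the two phases, fitting on labeled data and then predicting on new, unseen input.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data set (illustrative only): text samples and spam/not-spam labels.
texts = [
    "Win a free iPhone now, click here",
    "Your invoice for last month is attached",
    "Congratulations, you have been selected for a prize",
    "Can we move our meeting to 3pm tomorrow?",
]
labels = ["spam", "not spam", "spam", "not spam"]

# Training: the model learns patterns that map text features to labels.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Inferencing: provide new, unseen data and get a prediction back.
new_message = "You have won a free cruise, click to claim"
print(model.predict([new_message])[0])  # expected: "spam"
```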
Now for the really fun stuff, but first an important disclaimer. It shouldn't need to be said, but we have to say it: the act of cloning someone and using their likeness has very serious ethical considerations. You shouldn't clone anyone without their consent. This presentation and the model I'm going to demonstrate later on strictly adhere to that principle.

So let's talk about voice cloning. Voice cloning can be done using an open source tool called Retrieval-based Voice Conversion, also known as RVC. It's trained using audio as the data set, and inferencing then lets you supply it with audio to do audio-to-audio conversion. So, for example, you talk into a microphone, it takes that audio buffer, runs it against the model, and out comes the voice of the person the model was trained upon, with a remarkable degree of accuracy, as you're going to see in the demonstration. The process of training the voice model is pretty straightforward and looks very similar to what I just showed you, although the substeps vary, and we'll go into detail on what that looks like. Essentially, like I said, you're going to use audio files (WAV, MP3, etc.), then clean, splice, convert, transcribe, and label the data; we'll go through that in a moment. In the training process, we train a model and also something known as the feature index, which together give us that trained model.
So, everyone say hello to my friend Chris. For this example, we had him read from a script of 300 lines that are all phonetically diverse, just so we get a nice wide range of the vocabulary, in the English language at least. Each line was around 5 to 10 seconds long and was recorded with a high-quality microphone in WAV format. You don't strictly need to do it this way, though. If you're doing an engagement and you're targeting a public figure such as the CEO of a company, think about it: a publicly traded company has earnings calls every quarter, plus interviews, press appearances, podcasts, you name it. So you can either record the audio yourself or source it publicly.

The next step is the preparation of the data itself; this is where the cleaning and splicing come in. For example, if the speech has popping like this, any sort of background noise, or any large gaps in the audio, you'll want to clean that up and remove it so you have a nice clean data set of just spoken speech. This can be done with a paid tool such as Adobe Audition, which is my preference because you can automate the process of editing 300 clips at once, or with the free option, Audacity, which will take a lot longer, but it's there.
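If you'd rather script that cleanup than click through an editor, a rough sketch of the same idea with librosa and soundfile might look like this: load a recording, drop the long silent gaps, and write out a clean clip. The file names and thresholds here are illustrative and aren't part of the RVC tooling.

```python
import librosa
import numpy as np
import soundfile as sf

# Load the raw recording (file name is just an example), resampled to 40 kHz mono.
audio, sr = librosa.load("chris_line_001_raw.wav", sr=40000, mono=True)

# Find the non-silent intervals; anything quieter than 30 dB below peak
# is treated as a gap and dropped.
intervals = librosa.effects.split(audio, top_db=30)

# Stitch the spoken segments back together into one clean clip.
cleaned = np.concatenate([audio[start:end] for start, end in intervals])

sf.write("chris_line_001_clean.wav", cleaned, sr)
```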
Next, we undergo a process called labeling, which in the context of voice models means transcribing: we create a mapping between each file name and what was said. This isn't actually required for RVC, but it is required for other vocal models, so I wanted to throw it in here just so you're familiar with it. The idea is that the trainer looks at this mapping and at all of the files, and is then able to emulate the voice once the model is trained.
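A labeling pass can be as simple as a manifest mapping each clip's file name to what was said. Again, RVC doesn't need this, but models that do usually expect something along these lines; the layout below is a generic illustration, not any specific trainer's required format.

```python
import csv

# Hypothetical mapping of clip file names to their transcriptions.
transcripts = {
    "chris_line_001_clean.wav": "The quick brown fox jumps over the lazy dog.",
    "chris_line_002_clean.wav": "Please call Stella and ask her to bring these things.",
}

# Write a simple pipe-delimited manifest the trainer can read alongside the audio.
with open("dataset_manifest.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="|")
    for filename, text in transcripts.items():
        writer.writerow([filename, text])
```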
Training itself is done through the RVC dashboard, and I'll go through what the parameters mean. The first thing we do is set all the parameters. We set the sample rate to 40k. We don't enable pitch guidance; pitch guidance is only required when you're training a model for singing. We maximize the number of CPU processes so the trainer can process the data set quickly. We train for a thousand epochs; an epoch is one complete pass over the data set, so we're going to look at it a thousand times, and every 50 epochs we save a checkpoint of wherever the model is at that point in time. The idea is that at the end of training I'll have a bunch of checkpoints to go through so I can pick and choose which one sounds the best. I set the batch size to 40, which maximizes it. The batch size represents how much data the trainer looks at at a time during each pass over the data set, and your batch size is really determined by your compute and what hardware you have available. If you have something like an NVIDIA 4090 or 5090 with 24 to 32 GB of VRAM, you can maximize the batch size to its greatest value. If you don't have that much compute, you obviously have to adjust accordingly; otherwise, the GPU will run out of memory.

Then training on the data set itself is really straightforward; it's stupid easy. Once you have the parameters set, you push four buttons: process data, feature extraction, train feature index, and then train the model. Depending on how much data you provide and how much compute you have available, training a model can take anywhere from around two hours up to 24 hours. It really depends on what you're working with.
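For reference, here's roughly what that parameter set looks like captured in code. These are the values I just described, held in a plain Python dict; the key names are descriptive only and are not RVC's actual configuration keys, and the VRAM thresholds in the helper are illustrative.

```python
# Descriptive names only; not RVC's internal config keys, just the values
# set in the dashboard for this training run.
training_config = {
    "sample_rate": 40_000,      # target sample rate for the data set
    "pitch_guidance": False,    # only needed when training a singing model
    "cpu_processes": 16,        # illustrative; maximize so preprocessing finishes quickly
    "total_epochs": 1_000,      # one epoch = one complete pass over the data set
    "save_every_epochs": 50,    # checkpoint cadence, so we can compare candidates later
    "batch_size": 40,           # bounded by available GPU VRAM
}

# Rough, illustrative relationship between VRAM and a workable batch size.
def suggested_batch_size(vram_gb: float) -> int:
    if vram_gb >= 24:
        return 40
    if vram_gb >= 12:
        return 20
    return 8

print(suggested_batch_size(24))  # a 4090/5090-class card can run the full batch size of 40
```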
The final result of all this will be the trained vocal model, represented as a .pth file, and the feature index, represented as an .index file. From there, we can use something called the voice changer to inference our model against a WAV file or an audio buffer. That lets us do audio-to-audio conversion from a source speaker, such as myself, and have my voice come out as whoever the model was trained upon. Once we have that set up, it allows for some really unique use cases. For example, I can take my trained model and an audio buffer and inference it against my voice changer. Out of the voice changer comes the spoken speech in the cloned voice, and I feed that output into the input of a virtual audio cable. The output of the virtual audio cable is then set as the microphone input in teleconferencing apps: Zoom, Teams, Skype, the FaceTime web interface, and so on. You can then essentially use the voice changer to talk and have an actual conversation with someone. You could also hook this up to a softphone and connect it to a telephone network to place phone calls. So there are some really interesting use cases for this, especially when we're doing engagements, which we'll dive into in a moment.
I'll show you what the voice changer interface looks like. I made sure to get some screen recordings just in case things don't work. Here we're taking our trained model and uploading it to the voice changer, and everything here is running locally in Docker on consumer-grade hardware; I'm using a gaming laptop. You don't need a crazy $60,000 NVIDIA GPU to do this. I'm setting the chunk size to 1 second, meaning that as I talk, my speech is inferenced against the model in chunks of one second at a time. I've found that gets the best results and doesn't add as much latency. Then as I talk... okay, I guess the PowerPoint doesn't have audio. That's fine, I'll show you in the live demo. As I talk, out comes my friend Chris's voice.

Let's talk about deepfakes. Deepfakes can be created a number of ways. The most popular tool on the internet right now is something called DeepFaceLab; they're responsible for around 95% of deepfakes on the internet, at least according to the claim on their GitHub. There's another tool that came out last year called DeepFaceCam, or something like that, but what I found is that the frame rate isn't high enough to really use it operationally. It works well, but you only get around seven to ten frames per second, whereas when I run these models through DeepFaceLab I get around 30 frames per second at 1080p. That makes it much more viable operationally, because you want that real-time aspect. So we use DeepFaceLab to create the deepfake models. The deepfake models use photos as the training data, and then, through a process known as merging, we do face swapping from the source subject onto the destination: for example, from the source, my friend Chris, mapped onto the destination, my face. As you can see, the process to create the model looks pretty straightforward; it looks just like the voice models or any other model. However, the substeps obviously vary. For this we use photos or videos, and I'll show you what that looks like in a moment. To prepare the data, we do a frame extraction, then something called alignment, and then labeling. We also train what's called a segmentation mask, which can look at a face and recognize it, and the model is trained on top of that. In training, the model again learns from the data set to do pattern recognition, and finally we get that trained model out, which we can then inference.
So let's talk about what each phase looks like. For the collection phase, I met up with my friend Chris and we shot around 30 minutes of footage at his house, against a green screen, with a professional camera. We had him look at a variety of angles and make a variety of facial expressions (looking up, down, left, right), and we had him talking. He's a huge Miami Dolphins fan, so he loves to wear his Miami Dolphins hat. We captured footage of him doing all of these things both wearing the hat and not wearing the hat, and I'll explain why in a moment. And we captured footage not just of him, but of myself too.

From there, think about what a video actually is: it's just a set of images. If a video is playing at 30 frames per second, then every minute of footage gives us 1,800 frames, or 1,800 pictures. So the first step is a frame extraction on the source and destination videos: we pull out all of the images, as you can see here.
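DeepFaceLab ships its own extraction scripts, but conceptually frame extraction is just reading the video and dumping each frame as an image, along the lines of this OpenCV sketch (the file and folder names are made up for illustration).

```python
import os
import cv2

os.makedirs("data_src", exist_ok=True)

# Example source footage; DeepFaceLab has its own extractor, this just shows the idea.
cap = cv2.VideoCapture("chris_source_footage.mp4")
frame_index = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of video
    # 30 fps footage yields roughly 1,800 frames per minute of video.
    cv2.imwrite(f"data_src/frame_{frame_index:06d}.png", frame)
    frame_index += 1

cap.release()
print(f"Extracted {frame_index} frames")
```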
Next, we undergo a process called alignment on both the source and destination data sets. We look at each frame and first detect whether there's a face present, yes or no. Then we look at what are called facial landmarks: the positions of the eyes, eyebrows, nose, and mouth, plus the pitch and yaw of the face, meaning how the head is rotated. We run this against all of the thousands of images in both the source and destination sets.
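DeepFaceLab handles alignment with its own detectors, but the concept, find a face and then locate landmark points for the eyes, nose, mouth, and so on, looks roughly like this MediaPipe sketch. It's only an illustration of the step, not DeepFaceLab's actual code.

```python
import cv2
import mediapipe as mp

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1)

# One extracted frame (file name is an example).
image = cv2.imread("data_src/frame_000001.png")
results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.multi_face_landmarks:
    landmarks = results.multi_face_landmarks[0].landmark
    # Each landmark is a normalized (x, y, z) point on the face:
    # eyes, eyebrows, nose, mouth, jawline, and so on.
    print(f"Face detected with {len(landmarks)} landmark points")
else:
    print("No face in this frame; it would be discarded during alignment")
```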
Next, we label the data. In the context of deepfake video, labeling is called masking. What we're doing here is defining a mask that traces the facial contours of both the source and destination: we define the edges of the whole face, through the forehead, the hairline, the jawline, and so on, and we do this for a number of images, for both the source and destination, across a variety of facial expressions and angles.

Next, we train a segmentation model, which looks at the labeled mask data we supplied and learns to produce a mask that is good at pulling out the face while ignoring any facial obstructions. During training, I wore Chris's Miami Dolphins hat and masked the hat out. The reason is that during inferencing, I want the model to recognize the hat but not draw a forehead over it when I'm wearing it. The idea is that you can train for facial obstructions: here we trained for his hat, but if a subject wears glasses, you would mask out the glasses during this phase. That way, when the trainer looks at it, it only pulls out the face and ignores the glasses, and during inferencing you, as the destination, can wear a real pair of glasses.

Next, we undergo training: many thousands of rounds of training, actually in this case millions of rounds.
You can see what happens as the model iterates, as we go from 1,000 to 10,000 to 100,000 to 250,000 iterations, for the source (my friend Chris), the destination (myself), and the mapping from the source onto the destination at each point in time. As we iterate and the model gets better and better at making these predictions, the output becomes more lifelike for both the source and destination, and that final mapping is what matters most.

So what's going on under the hood? We're using what's called a generative adversarial network (GAN). It's a type of architecture composed of two models that compete with one another: a generator and a discriminator. The generator takes random noise as input and tries to create a deepfake of me, or of Chris. That fake image is then fed into the discriminator, which compares the fake image to a real image from the data set. The discriminator looks at the deepfake and decides: is this a fake image or a real image? Initially it will say it's fake and reject it, and we update what are called the weights. Then we repeat this process many, many times, thousands and eventually millions of times, until the generated deepfake images are so good that they trick the discriminator, which looks at the deepfake and thinks it's a real image. Essentially it's a zero-sum game: at the start the generator is not very accurate and the discriminator is very accurate, but as training goes on, the discriminator becomes less able to tell the difference and the generator becomes more accurate. Once we find that tipping point, or inflection point, that's the point at which the model is trained and ready to go.
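Here's a heavily simplified sketch of that generator-versus-discriminator loop in PyTorch, run on tiny made-up tensors rather than real face images. Real face-swap training adds encoders, decoders, and perceptual losses, so treat this purely as an illustration of the adversarial game, not as DeepFaceLab's training code.

```python
import torch
import torch.nn as nn

# Tiny toy dimensions; real deepfake training uses full image tensors.
NOISE_DIM, IMG_DIM = 16, 64

generator = nn.Sequential(nn.Linear(NOISE_DIM, 128), nn.ReLU(), nn.Linear(128, IMG_DIM), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(IMG_DIM, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()

real_images = torch.rand(256, IMG_DIM)  # stand-in for the real face data set

for step in range(1000):
    real = real_images[torch.randint(0, 256, (32,))]
    fake = generator(torch.randn(32, NOISE_DIM))

    # Discriminator: learn to call real images real and generated images fake.
    d_loss = (loss_fn(discriminator(real), torch.ones(32, 1))
              + loss_fn(discriminator(fake.detach()), torch.zeros(32, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: update weights so its fakes get scored as real.
    g_loss = loss_fn(discriminator(fake), torch.ones(32, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    if step % 200 == 0:
        print(f"step {step}: d_loss={d_loss.item():.3f} g_loss={g_loss.item():.3f}")
```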
You can see this in the time-lapse video I'm about to play. We have the source, then the model rebuilding the source; we have the destination, then a deepfake of the destination, the model trying to predict and recreate me based on what it knows at that point in time; and the last panel is the mapping from the source onto the destination, the model trying to draw a deepfake of Chris using me. As you can see, right now it's not accurate at all; it's just a gray box. This is around 5,000 to 10,000 iterations. We hit the fast-forward button for the sake of brevity, but what you can see is that as the model goes from zero steps to eventually two and a half million rounds of training, it becomes much more lifelike and realistic, and eventually it's able to trick that discriminator into thinking it's a real image, as you can see over here.

The final result will be a deepfake model, known as a DFM file. Once you have that DeepFaceLab DFM file, you then use something called DeepFaceLive to inference the DeepFaceLab model. The idea is that you have a webcam supplying a video buffer into the model, and it does the face swapping: it takes his face and puts it onto mine.

But how can we actually use it? This is probably the most important part. We have our video feed providing input to DeepFaceLive, with our model being inferenced. We then take the output, whatever that preview window shows, and use OBS, which is free, open source streaming software, to capture the DeepFaceLive window showing me turning into my friend Chris. From there, we use OBS to compose a series of scenes. Here I don't have a green screen, but in my home studio I do, so the idea is I can drop myself into any virtual environment I want to be in. I can be in an office, I can be in a boardroom; I can be anyone, anywhere. Once I've used OBS to compose those scenes, we enable OBS's virtual webcam functionality, which essentially turns the composition into a webcam, and we route that video as the camera input into the teleconferencing application: Zoom, Teams, Skype, FaceTime, whatever.
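That last hop, presenting the composed video as a webcam, is exactly what OBS's virtual camera does for you. If you wanted to script the same hop yourself, it would look something like this pyvirtualcam sketch; the swap_face function is a placeholder standing in for the DeepFaceLive output, not a real API.

```python
import cv2
import numpy as np
import pyvirtualcam

def swap_face(frame: np.ndarray) -> np.ndarray:
    """Placeholder for the face-swapped frame coming out of DeepFaceLive."""
    return frame

cap = cv2.VideoCapture(0)  # physical webcam

with pyvirtualcam.Camera(width=1280, height=720, fps=30) as cam:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (1280, 720))
        out = swap_face(frame)
        # pyvirtualcam expects RGB frames; OpenCV captures BGR.
        cam.send(cv2.cvtColor(out, cv2.COLOR_BGR2RGB))
        cam.sleep_until_next_frame()  # pace output to 30 fps
```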
So, I'll show you what that process looks like. This is DeepFaceLive, and I've already plugged in the model I'm using. Essentially, we specify the camera source, pull in my camera (the one you see over there), and start driving the model. It's a 384-resolution model. I'll move it over, and now, as I move, it's controlling the model: my head, my mouth, everything. This is cool; watch this.

Next, this is what it looks like to actually compose a scene using OBS. The first thing we do is give it a background, so you're all going to be introduced to my colleague Alethe's office; that's the background she does all of her calls in. We take the DeepFaceLive window and overlay it, then enable the chroma key filter to essentially delete the green background. The lighting wasn't the best that day (I was in a rush to create this before I went to another conference), so you do see some artifacts over here, just because the lighting in my studio was a little crappy. Usually that stuff isn't there, but it was for this demo. The idea is we then crop it to remove any remaining artifacts, resize it, and drop me into the correct spot so it looks like I'm sitting in her office.

Next, we wire the outputs from here into a teleconferencing app; for the sake of this demo, we'll use Zoom. We enable our virtual camera over here, which is the most important step, and then we can go to any teleconferencing app and configure it appropriately. So we go to our Zoom settings and change my camera to the OBS virtual webcam. Same thing for audio: we set the microphone input to the output coming from the virtual audio cable for the voice changer.
So before we get into the demo, let's talk about how we can actually use this operationally on offensive security engagements. There are a few ways. Let's pretend you're on a red team engagement doing an external breach: you have no knowledge of the client's network, and you have to get in from the outside. Instead of doing traditional vishing, where you call the help desk pretending to be someone, you now have the ability to sound like them too. For example, using public information about a public figure from that company, you create a vocal model of that person, spoof the corporate number while calling their help desk so it looks like you're calling from inside, and say, "Hey, this is so-and-so. I'm having trouble getting into my account; I need to do a password reset."

Then there's the assumed breach scenario. Say your starting point is having network access, or starting on a developer system or an employee workstation. We've done assessments, and we'll get to this in a moment, where in an assumed breach scenario we leverage both the audio clone and the video deepfake model to do internal phishing over Zoom, Teams, or Skype. And lastly, we can use deepfake video and audio models for cybersecurity training: creating deepfake video and voice models of publicly known figures within the organization and disseminating them across the organization, on the internal intranet or wherever, to help educate employees on this topic.

So let's talk about real-life examples. One thing I mentioned was the external breach scenario, where we leverage the voice clone to do voice phishing. Here we can do things like targeting an organization's IT help desk to perform account manipulation: for example, resetting the password and then claiming you just got a new phone that day and need to pair your iPhone as a new multi-factor authentication device. From there, as an attacker, you can authenticate to whatever identity provider they're using, do what you need to do, move laterally, and get to whatever trophy target was set for the engagement. If you're doing the assumed breach scenario, you can use the deepfake voice and video model to do internal phishing over whatever internal teleconferencing application the client is using (Slack, Zoom, whatever) and start calling people within the organization to get them to perform actions that align with the goals of the assessment, so you can move laterally and hit your trophy target. And I'm not making this up: these are real-life scenarios and real-life examples. I can only talk about the methodology, but these are things we've already done, and we've been doing them non-stop for the past year.

And lastly, cybersecurity training. Like I said, you can create deepfake audio of, say, the CEO talking, and videos as well. Distribute them across your organization, throw them on your internal dashboard with a video of the real CEO side by side with the fake one, and quiz employees: can you spot the deepfake? Same thing for audio. There are some really valid use cases here to help reinforce cybersecurity training.
security training. So, the live demo, give me one moment. Let me just get this set up.
I'm going to stop sharing my screen for a moment.
Hi, my name is Chris. Well, not really. I'm a deepfake model created specifically for the BSides conference here in Tampa, Florida, to demonstrate real-time deepfake video and voice cloning capabilities. It would be fun if we had someone from the audience try this out too. Is there anyone here that has a beard like this? Raise your hand. I have something just for you. Yeah, you have to come over here.

So, you get the best results when the source and destination closely resemble each other. The reason I wanted the guy with the beard (can you please stand in front of the camera? You may want to take your hat off, since it could be an obstruction) is that we can turn him into John Wick.

So, anyone else want to try? Where's a female volunteer? I've got one for you; I want to make this as inclusive as possible.

So for you, we're going to turn you into Meghan Markle. Just look straight ahead. Give it one moment. Here's a little preview. You want to see it?
Do you mind switching it back? Thank you.
Thank you.
Okay. So, I did have this screen recording as my backup, but I think we can show it anyway, just to highlight the risk, especially when you start doing the deepfake in real time, as you just saw, but hooked up to something like Zoom. Building on the example videos we saw earlier, this is what it looks like: me in a Zoom room, having a conversation with myself. Again, there's no audio on the PowerPoint, so the voice isn't going to play, but you just heard it over there. And that's it. This is all on my LinkedIn; I post a lot of stuff up there.
And I do want to take this time, if anyone has questions, I'm more than happy to answer and go into more detail as needed. Yes, sir.

I could, but it would take longer, and I was in a rush. It's actually funny: originally, the first model I created was of my colleague Alethe (a female, blonde hair). I was set to give this talk in Dubai last year, and three weeks before I went, I learned about an incident in which a guy from India was at the Sephora in the Dubai Mall, put on makeup, and was sentenced to two years in jail for impersonating a female. So I figured it was probably not a good idea to impersonate a female in the Middle East in front of 20,000 people, and I had to come up with a plan B. I called my friend Chris and said, "Hey, I need a favor." Literally the first thing he said to me was, "What? You need to clone me?" I said, "As a matter of fact, yes, and I've got to come over tonight and get this recorded, because it's going to take around three weeks to train the model."

As for the training itself, I use the cloud. I could train on the laptop, but it would take two to three weeks; when I train in the cloud, it takes roughly a week and a half, give or take, from start to finish. What determines the speed of training? A few things: how big the data set is, what compute I have available, and also the resolution of the data set. Higher-resolution models take longer to train, and there's an inverse relationship between the model's resolution and real-time performance. Even though higher-resolution models look better, you can't inference them in real time on consumer-grade hardware, and even if you could, you'd be getting six or seven frames per second. What I've found from experience is that a 384-resolution model takes around a week to a week and a half to train on an A100, and it allows me to inference in real time on a gaming laptop at sufficient quality and speed.
From experience, we've found it very successful and very impactful, because if you think about it, when you answer the phone, you think it's your coworker, especially when (a) it sounds like them and (b) it's used in conjunction with caller ID manipulation. If the caller ID shows your own corporate number and it sounds like the person, you're not going to question it. So we've found these very effective for things like external breaches, spoofing the caller ID to call the help desk and reset a password, or even impersonating people within the organization and calling employees. Then most recently, not too long ago, we hooked both of them up, the audio and the video, and I jumped on a Zoom call, and there were no questions asked. I was that person.
>> Yeah. So, what we're doing is inferencing two models at once, and it takes a little trial and error to properly calibrate the outputs and offset them so they both come out at the same time. When I'm talking and moving, it takes a good, solid second or so for the face swapping to occur, so you have to time the audio output to come out at the same time so they match. It does require a little bit of tweaking, but it can be done. What I've found is that in terms of having a conversation with someone, the total latency is around one second, which isn't really noticeable at all. If anything, you can always say you're traveling and the internet sucks. So, next question.
Interesting. I didn't think we'd have enough time for this, but I guess we do. So, I'll show you a video. Give me one second.
Okay, do you mind changing the video back to my computer, please? So, there are a few things that have stopped me over the past year from doing what I needed to do, and I've found them very effective. The beauty of it is that for organizations, they don't cost a single dollar to implement. You don't have to pay some third-party vendor a hundred grand a month to do deepfake detection and collect all your video. You can do a few simple things. For example, watch what happens when I hold my hand in front of my face: it looks like an exploding star. The idea is that a lot of models aren't trained on obstructions right off the bat. So if you're having a conversation with someone and you see a water bottle sitting in the background, ask them, "Hey, hold that object in front of your face." 99.9% of the time it will cause the model and the inferencing to break, so you can easily spot the deepfake. That was a really good question.

Another thing that's really effective: let's say you suspect a deepfake; you think it's an impersonation. Instead of doing whatever the request is, say, "Hey, I'll call you back." Look up the person's real number in the corporate directory and call them back, because although I could spoof your boss's number and call you, when you look up his real number and call back, I can't receive that call. Another thing that's really effective is "tell me something only Brandon knows": having a layer of multi-factor authentication in front of these sensitive, high-urgency requests. A few years ago, my grandma got a phone call from someone claiming to be me, saying, "Hey, grandma, I'm in Mexico right now. I just got into a car accident, the police are here, and they want a bunch of money. Can you send $4,000?" She did. Had she asked a simple question, like "Sure, I'll do whatever you want, but first tell me: where did I take you every weekend growing up?", the attackers wouldn't have been able to answer it, and that would have mitigated the attack. There was an example a few months ago now with the CEO of Ferrari. They had his voice cloned, and someone high up in finance was targeted. He said, "Yeah, sure, I'll do whatever you want. However, tell me this first: what is the name of the book that I gave you last week?" Because the caller couldn't answer that question, it stopped that attack.

Thank you, everyone.