Espio: Keylogging Only What Matters — And What Comes Next

Name: Espio: Keylogging Only What Matters — And What Comes Next
Uploaded: 2026-04-27
Duration: 22 min 3 s
Description: Espio is an OCR-triggered keylogger that enters active keystroke capture only when high-value context is detected on screen. The tool combines passive screenshot scanning with indirect syscalls to evade EDR hooks, and exfiltrates data via Discord webhooks.

BSides Charlotte · 2026 · 22:03 · Published 2026-04

Speakers

James E. Chapman

Tags

Category: Technical

Topic: Malware Analysis, Reverse Engineering

Difficulty: Advanced

Team: Red

Research: Technical Deep-dives

Style: Talk

Mentioned in this talk

Tools used

Windows.Media.Ocr (built-in Windows OCR engine)

Platforms

Discord

About this talk

Espio is an OCR-triggered keylogger that enters active keystroke capture only when high-value context is detected on screen. The tool combines passive screenshot scanning with indirect syscalls to evade EDR hooks, and exfiltrates data via Discord webhooks. Chapman discusses the architecture, implementation challenges, and detection evasion trade-offs of this red-team research prototype.

Original YouTube description

James E. Chapman, aka Metalsnake, presented his talk "Espio: Keylogging Only What Matters -- And What Comes Next" live at BSides Charlotte on March 28, 2026. https://bsidesclt.org/ ESPIO is a keylogger with a dual-mode architecture: passive OCR screening triggers active keystroke capture only when high-value context is detected, using indirect syscalls to evade EDR hooks. Part of a broader AI-assisted red team research platform.

Transcript [en]

All right. Hello everyone. My name is James. My middle name is Everett. That's what everyone calls me. And now people are starting to call me Metalsnake, which is my Discord handle. I'm not a professional in the cybersecurity industry. However, I do want to be one eventually. Anyway, I've built a program that is pretty interesting, and I think some of you will find it pretty cool. I call it Espio, which is short for espionage. Espio is an OCR-triggered keylogger, which means when a window pops up on your screen and it contains a keyword, a phrase of interest, you get a screenshot sent to your C2 server and

then a keylogger is activated. First I need to go over some background in order to provide context for the rest of this presentation. I'll start with a quick lesson about keyloggers, then I'll show you the program I built, show it working, and then elaborate on how it works. I'm going to run through this part very briefly, because I don't think the context and history of keyloggers is as interesting as the tool itself, and that's what y'all are really listening to this for. So anyway, I like to consider Espio an evolution on the timeline of keyloggers, but I don't think that means old keyloggers are bad or out of date.

This is an old IBM Selectric typewriter. In the 1970s, Soviets snuck hardware implants called Selectric bugs inside these machines. The bugs detected the mechanical movement of the type ball when a key was pressed, and every key press produced a distinct signature of mechanical motion. The bug captured that signature. It would then communicate keystroke information wirelessly via short radio bursts in the 30 to 90 megahertz range. This frequency range overlapped with television bands, which may have helped the Soviets avoid detection. It took the NSA roughly eight years to find these bugs in the typewriters at the US

embassy in Moscow. Yikes. Fast forward a few decades, and now we have the Ghost Keylogger. This keylogger emerged in the early 2000s, and you can actually still buy it. It records keystrokes, the clipboard, browsing, and messenger conversations, then dumps it all into an encrypted log and emails it to you. The main problem with Ghost Keylogger is that its stealth is not up to snuff for modern systems. Now, there is a problem with the average keylogger. You might end up getting a file that looks like this. And if you get this, you might end up feeling like this person. I hope not, but you might feel like you're

looking for a needle in a haystack. Here's just a quick diagram that I made to further prove my point. So, here's another nice image. What if your keylogger had eyes? All right, so I'm just very quickly going to go over the antivirus settings, and I just want to show you that there are no exclusions. You can also see that the firewall is fully enabled. Next, I would just like to make a quick note about something. Currently, when Espio is executed, a command prompt window does pop up, but that can be removed. I just don't want to remove it right now since it's attached to a debugging function I have within the program, since

I'm still working on it. Espio isn't a finished program. This is still a prototype, but you can see here that it's working. Before I dive deep into Espio and talk about some of the techniques that I used to create it, I'd like to go over the dual-mode architecture. Here we have passive mode, which is what Espio spends the majority of its time doing. It takes a screenshot every one to two seconds. OCR scans those screenshots. If no keyword is detected, it loops back and goes through passive mode again. If a keyword is detected, it goes into active mode. It sends the screenshot that triggered active mode to Discord. The keylogger is turned on. The buffer is

exfiltrated to Discord, and active mode stays active for a set amount of time.

Also, I just want to say that for the OCR functionality of Espio, there is a word list that OCR reads from. So, you can change any of these words to change the parameters of the OCR trigger for the keylogger. If there's something more specific that you want to add to it, you can do that. But right now, this is what I have it set to. Espio as a whole has been built to be as modular as possible. You've got different settings here: a random interval between screen captures, a random delay before sending captured data to Discord, and a few other things, like how often to check the clipboard, although the

clipboard function doesn't work right now. Anyway, you can limit the screenshots. There are a few things you can do. So, now that we've talked about the high-level stuff, I want to start getting into the technical details. Firstly, I want to talk about screen duplication. There are several native API options for Windows screen duplication. The primary ones that I am aware of are DXGI Desktop Duplication and GDI. For this program, I decided to use GDI. It's compatible with Windows all the way back to Windows XP. The API is simpler than DXGI. GDI also has a low dependency footprint. It's just gdi32.dll, which is already loaded in every process. However, there

are some cons. It is slower than DXGI, especially for full-screen capture. It still works fine in the video, as you saw, but on slower machines or maybe some virtual machines it might lag behind. That also applies to multi-monitor setups, though I've only really tested this in virtual machines and on an old laptop, so I still need to do further testing to get accurate measurements. Anyway, GDI also doesn't capture hardware-accelerated content well. Some DirectX and OpenGL apps might show up as black rectangles. This API is also CPU-bound. So I really use GDI for compatibility, stealth, and ease of use. I do want to talk about DXGI

because it's a pretty strong alternative and it deserves to be talked about. DXGI Desktop Duplication captures frames directly from the GPU, which means lower CPU usage. It can capture hardware-accelerated content that GDI misses. It's what OBS, Discord screen share, and most modern capture tools use. There are some drawbacks, though. It's only available on Windows 8 and onward. The API is also more complex. Now, DXGI duplication requires the process to hold a reference to the GPU output adapter, which means if an EDR is monitoring Direct3D device creation from non-graphical processes, that's an instant red flag. You don't want a background process or an injected DLL suddenly creating a

Direct3D 11 device and then calling AcquireNextFrame. That's a distinctive behavioral signature, especially when compared to what GDI does. GDI calls GetDC/BitBlt, and hundreds of legitimate applications do that constantly for rendering UI elements. There is a problem with both DXGI and GDI. Some applications use an API function called SetWindowDisplayAffinity that blocks GDI BitBlt and DXGI Desktop Duplication. You would get a black rectangle if you were to look at the captured frame. So the thought I've had is to create a fallback capture method which can bypass some SetWindowDisplayAffinity setups. That's a rabbit hole, though, and honestly could be a separate talk. I also haven't really implemented

that or gone down that rabbit hole yet, but I will look into it in the future. Now that we've talked about the screen capture method, I'd like to briefly talk about the OCR engine. Windows has a built-in OCR engine called the Windows.Media.Ocr engine. You feed it a bitmap that you get from the screen capture method, and then it returns recognized text. It is native to Windows, so every Windows computer from Windows 8.1 and onward can utilize the OCR engine. You know, thank you, Microsoft. I'm really hoping that for Windows 12 they add a passive OCR engine feature that's working 24/7 in the background to make the user experience even better. Um

anyway, I like the built-in Windows OCR engine because it blends in better. There are apps like the Snipping Tool, Microsoft Office apps, OneNote in particular, and also Windows Recall/Copilot. They all use the Windows OCR engine. You just have to think: anything that extracts information from the screen is going to have to use an OCR engine. Another thing that's very nice about the native OCR engine is that there are no additional dependencies. The OCR engine also automatically handles whatever language the target has configured, which is a really big deal. Tesseract, another OCR engine, would require you to bundle language data files for each language. I want to be adversarial with myself and to

the audience listening to this and learning, because it's always important to be honest. There's something called URL-triggered keylogging, which can be good because you can trigger keylogging based off of active URLs, and that's basically way quieter than an OCR engine running with screen capture and all of that. But the reason you would want to use an OCR-triggered keylogger like the one I have here is that sometimes you might not have solid recon information. You might be targeting non-browser apps, dynamic URLs, and scope changes. So that's why you might want an OCR credential harvester instead of something more traditional. So, I'd like to now get into how I

actually did the keylogging. This is the first time I've worked with syscalls, so it's pretty interesting for me. Most keyloggers use an API function called GetAsyncKeyState for logging keystrokes. The problem is that a lot of EDRs flag it. There are some workarounds, though. GetAsyncKeyState doesn't actually check the key state at a user-mode level. It's a wrapper. What it is really doing is calling NtUserGetAsyncKeyState in win32u.dll, which then executes a syscall into the kernel. The kernel is where the actual key state information lives. So GetAsyncKeyState is just a user-mode function which acts as the front door to get that

syscall into the kernel. To get around that front door, the first thing you have to do is find what the wrapper function is actually calling. Before I go any further, I want to quickly explain one reason that most EDR vendors usually hook functions at a high level, in this case GetAsyncKeyState, which is in user32.dll. It's because those functions have the most stable API. It doesn't change between Windows updates, so hooks don't break. If an EDR were to hook into win32u.dll, they would have to constantly deal with compatibility, because Microsoft can and will change the internals across builds. user32.dll is just more stable.

However, some EDR vendors do hook at the win32u.dll level, so this isn't a total workaround. I also need to quickly explain what a PEB is. PEB stands for Process Environment Block. It's a user-mode Windows structure that holds important metadata about a running process, such as loaded DLLs. Every process has its own PEB. Anyway, now that you understand contextually and conceptually where I'm coming from, you will understand why I needed to make a syscall. First, I needed to find where win32u.dll is actually loaded. I used PEB walking to do so. After doing that, I did something called PE export parsing. This tells you where in win32u.dll NtUserGetAsyncKeyState is

actually loaded. After that, you will finally have a syscall number, which can be used to call into NtUserGetAsyncKeyState and get around that pesky front door that some EDRs hook. And that's not all. I'm not just going to stop at syscalls. I want to upgrade that by turning it into an indirect syscall. Once you finally have a syscall number, you have a problem. The return address in the call stack points to your RWX memory allocation, not a Microsoft DLL. A modern EDR does stack analysis. So instead of executing the syscall instruction in your memory, you find a syscall; ret gadget that is already inside win32u.dll.

Then you jump to it. Now the return address of the syscall points to a legitimate Microsoft module instead of your own RWX memory allocation. Also, just as a side note, I have an elite hacker friend, and he told me that as I get better, I'll not need to use syscalls/indirect syscalls. He didn't elaborate, though. I think he's probably talking about exploit development. I'm not sure, but if anyone has any idea what he might be talking about, please let me know. For the C2 communication, Espio sends data to Discord webhooks using WinHTTP over HTTPS. The traffic goes out to discord.com on port 443, which looks like normal

Discord traffic at the network level. However, endpoint detection will see that an unknown process is talking to discord.com. That's a red flag. Also, some network monitoring tools will see discord.com traffic coming from a process that is not discord.exe. Discord actively scans for webhook abuse. If a webhook is reported, or Discord's automated systems detect unusual patterns, they kill the webhook URL, effectively making Espio useless. So Discord is definitely not the best or even a good C2 solution, in my opinion. And honestly, creating a good C2 server system is something that I'm looking forward to doing. There are still some problems with Espio, though. Firstly, my indirect syscall stub

is 21 bytes allocated with VirtualAlloc using PAGE_EXECUTE_READWRITE. That is well known to EDR heuristics. Legitimate applications almost never need memory that is simultaneously writable and executable. A lot of popular EDR products monitor for private RWX allocations. Some even flag on the VirtualAlloc call itself. One workaround I was thinking of is that if I could make it so Espio masquerades as or injects into a JIT process, its own RWX memory allocations would blend in with the process's expected behavior. By the way, a JIT process is any process that compiles code at runtime rather than ahead of time. A JIT process repeatedly allocates executable memory during normal operation because it is compiling code

on the fly as the code is needed. Here's another major point of failure. The polling in this program is very, very aggressive. It polls 80-plus keys in around 1 millisecond, which means there are around 80,000 syscalls per second during active mode. That's unusual behavior. Additionally, CPU usage spikes during the active window, which gives way to another detectable pattern. All right. So now for the major point of failure. It's about optimization. This mainly affects slower computers and virtual machines. I personally haven't experienced it on my main desktop. However, there will always be a latency gap during the transition from passive to active mode, regardless of how good the computer is. That's unavoidable. It's just more

noticeable on older and slower computers. When a keyword triggers active mode, Espio doesn't instantly start logging keystrokes. It has to do several things. First, when OCR detects a trigger word, it has to send that screenshot to Discord, but the image is just a raw bitmap. Discord doesn't accept those, so it has to be JPEG-encoded. Next, Espio's mode flag is set to active mode. Once the JPEG is sent to Discord, polling begins, and the buffer is exfiltrated to Discord after a set amount of time. So basically, because Espio is single-threaded, the keylogger can't start polling until the Discord upload is complete and the main loop cycles back to active mode. The reason it has to

cycle back to active mode is that when the JPEG is sent to Discord, the loop is still technically in passive mode. One thread can only do either passive or active mode work. The thread has to finish its current work, loop back, read the set flag, and then act on it. When the Windows operating system is utilizing Copilot Recall, it's already doing periodic screen capture and OCR, because Microsoft wants you to be able to search or recall what you were doing. So, speculation leads me to believe that Microsoft might be working on releasing a more consistent passive OCR feature for the Windows operating system in the near future. This would make the passive mode of Espio harder to detect because it would

begin to look more like a Windows feature rather than malware. However, Microsoft has restricted Recall in the sense that you have to opt in to use it. And I've also read that Microsoft is considering and/or working on reworking Recall, so it could potentially be removed in the near future. I have no idea, though. However, I don't think the overall trend towards operating-system-level machine learning/screen understanding is reversing.
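Two of the figures in the limitations part of the talk can be checked with simple arithmetic: the polling rate during active mode, and the latency gap caused by the single-threaded loop. The sketch below is only a back-of-the-envelope model, not code from Espio. The function names, the millisecond values in the usage note, and the idea of comparing the rate against a monitoring threshold are all assumptions made for illustration.

```python
def queries_per_second(keys_per_pass: int, pass_interval_ms: float) -> float:
    """Key-state queries per second if every key is checked once per pass.

    The talk quotes roughly 80 keys per 1 ms pass.
    """
    passes_per_second = 1000.0 / pass_interval_ms
    return keys_per_pass * passes_per_second


def transition_latency_ms(encode_ms: float, upload_ms: float,
                          loop_overhead_ms: float = 0.0) -> float:
    """Delay before polling can start in a single-threaded loop.

    With one thread, the image encode, the upload, and the trip back
    around the main loop run one after another, so their costs add up.
    All inputs are hypothetical measurements, not values from the talk.
    """
    return encode_ms + upload_ms + loop_overhead_ms


def exceeds_threshold(observed_per_second: float,
                      threshold_per_second: float) -> bool:
    """Toy monitoring rule: flag a process whose query rate is above a
    chosen threshold. The threshold is an assumption a defender would tune."""
    return observed_per_second > threshold_per_second
```

With the talk's numbers, `queries_per_second(80, 1.0)` gives 80,000 queries per second, matching the figure quoted above. With made-up timings such as `transition_latency_ms(40, 250, 5)`, the model shows why the upload dominates the gap: any keystrokes typed during those 295 ms would be missed, and the gap grows on slower machines or slower links.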
