← All talks

Keytap acoustic keyboard eavesdropping

BSides Sofia · 202334:59650 viewsPublished 2023-03Watch on YouTube ↗
Speakers
Tags
About this talk
Georgi Gerganov presents Keytap, an open-source project that reconstructs typed text from acoustic keyboard sounds using side-channel analysis. The talk demonstrates how keyboard audio can be captured and processed to identify individual keystrokes, covering both supervised learning approaches (requiring training data) and unsupervised methods that leverage language statistics to break keyboard audio without prior training—framing the latter as a substitution cipher problem.
Show original YouTube description
Georgi Gerganov
Show transcript [en]

Hello everyone. Now I will try to intrigue you with a slightly different topic. There is a question about a project that I am developing in my free time. It is called KTAP. Generally speaking, it is a form of data filtering. With this project I try to see what are the possible ways to apply this in real life. Two words for me. I work as a research scientist in an American company that makes medical devices for radiotherapy. This work that I will present to you is not related to these things in any way. All of this is on my GitHub, open source. If you are interested, you can look it up. The title is "Acoustic Keyboard Eavesdropping", which means something

in Bulgarian from the word "acoustic keyboard listening". Generally, imagine you are sitting in your office or you are writing some text, some email on the keyboard. And let's say that in Blizus there is a device that can capture audio and in this way to listen to the sounds of the code that your code is publishing. And now the question is: can this audio be used by an attacker to restore the text you are writing? It can be a sensitive text, you can imagine. And this is illustrated here in this picture. In general, for this code we have a phone that We noticed that the keyboard is not even turned on, because we just need the sound from the keyboards. In general, this

type of attacks are known as side channel attacks. The idea is to use a signal that We want to get information from it that interests us, but it is not intended to give us this information. But in some way it is modulated from the information that interests us and we can use it to get this information. There are different ways, in the sense that in case we will look at the attack using sound, but we can imagine that the electrical impulses of the cable are also running through this cable. which emit electromagnetic radiation, it can be used to try to restore the text. Also, the cables on the monitors receive some kind of current, it

also emits radiation, vibrations, temperature. In general, the field is wide and I think it's interesting because it makes you go out of the box and look for more interesting attacks. And now, if we have sound, We don't have sound. But you will hear a keyboard typing. Imagine that you hear someone typing on a keyboard and the task is: "Oh, we don't have sound because I turned off the sound of the... I didn't reach the columns. Okay, no problem. Yes, I got it, great. So, now you have this sound, you need to know what the person typing on the keyboard is. It sounds a bit strange, but I will show you that it is possible. The project I will be talking about is KTAP. As I said, it is open

source, it is a link in GitHub. I write it on C++ because I like to write on C++ and I have tried to do it this way. There are various tools included in the project, command line tools, GUI tools. I have also ported it to the web assembly browser. Here we see a demo that we will not do now. pay attention to what exactly is happening. This project started in 2018, but little by little, from time to time, as I get more enthusiastic and push it forward, I first made a proof concept, to see if there is any chance for this to be published. Immediately after that I released the first version, it went viral at that time. After

that I got an idea how to expand it, it became KTAP 2/2020 and last year I released the third version, which can be used in the real situation. What was the first proof of concept that I started with and I found that there is a chance for this thing to work? This was the simplest test I could think of. I called it the PQ test. You have the keyboard, you have two keys P and Q at the end, quite different. And now you train, for example, you make a program where you give them how the two keys sound and then you try to find out which of the two you are pressing. And the main thing is a simple program. The biggest

surprise when I did it was that it really worked almost 100% of the time. It could predict which of the two keys is pressed. Why exactly PEQ? Because in my opinion they are the most suitable for differentiation, because they are the furthest in the keyboard. And so I quickly made one There is a web page on my blog, if you are interested. You just have to press P a few times, then Q, the program learns and then you can start typing P and Q in some application and it will only recognize which of the two buttons you are pressing. Ok, I saw that it works. By the way, I tried it with a mechanical keyboard. I was glad to

have it. I tried it with non-mechanical and the results were very good, even very "earnier". So, from then on it became clear that it would probably work for some more noisy mechanical keyboards. From this proof of concept I quickly made the first demo, which I will show you in a short video, but the idea is the same. A web page in which you first train the algorithm with a few steps and then it tries to predict new pressures. I will explain what happens. On the left is the page. We trained it three times with P, three times with Q and now in the text in Notepad from the side. We write, there is no connection with the browser and Notepad, but the browser from the sound that

comes out manages to recognize which are the two. Now the same thing, we train it on the whole ASBOOK, three times we press every button. Why three times? Maybe more, maybe less, just some training data, right, to be able to learn the algorithm. We almost reached the end. So, Ok, now we are starting to write some text. As you can see, the page shows with high precision what is written in Notepad. Again, we just gave the page access to a microphone. I really like how this thing works. Now I will tell you the details of the of the algorithm, basically the main steps it includes. And there are four of them. First we collect training

data, then we make a prediction model from the training data, then we have to, when we have the model, we want to predict, first we have to be able to detect when the key is pressed when we analyze the sound. This is key press detection. And then, when we know when the keys are pressed, we do some sound analysis and with the prediction model we say "ok, this was the key". The collection of the training data, what does it represent? In this mode we need to know the victim when he writes a text, we need to know the text and to have a record of how the keys are pressed. a push of the key looks like this, like on the

figure, on the left it's a bigger scale. There's always a sharp peak, this is the waveform, I don't know how it's called in Bulgarian, PCM, whatever. There's a sharp peak that quickly goes quiet, and then there's usually a release peak, when we press the button, and then there's some clicking again, and most of the keyframes, maybe not all of them, It doesn't interest us, I mean the release peak doesn't participate in the game at the moment. We are looking at about 75 milliseconds signal from the peak itself. The training data collection in general represents for each drop a few away forms and now how to use them to make a prediction model. Here one can start thinking about some

complicated algorithms, neural networks, etc. I do something very simple. I take for every button that I have some waveforms on and I just average them. And I get an average waveform. Which represents how this button sounds. Because when you press it several times, and not every time it sounds the same, but when you average it, you get some average How do we average these waveforms? The first step is to align the peak, this sharp peak, to get some rough alignment. After that, a fine alignment of the waveform is done, which is there for the purpose of optimizing a similarity metric, cross correlation. This is not very complicated, you may have heard of it. even in

image processing. Generally speaking, this is a number of criteria and two things that are similar to each other and are often used in practice. Generally speaking, we minimize, maximize this cross correlation and in this way we apply the waveforms as well as possible and we average them. And the prediction model is just the consistency of these average waveforms and then when we predict, generally speaking, we will try to match to the best of them. The third step is the K-press detection. How do we detect when you press the buttons? Here you can do different techniques. I have chosen something very simple, just a thresholding algorithm. The idea is that you have some kind of background

and when you press the button, there is a peak, you pass a threshold and you can say that the button was pressed. This way, for example, in the picture, the white is the sound we recorded, and the detected peaks are the red lines. This way we know where the keys are and we can take only their sounds. And the prediction, we just go for each key, we take the way format, we go through all of them in the model, we calculate again the cross correlation, the same metric and the one with the highest cross correlation, we say "ok, this is the key that was pressed", we predict it that way. And so, this is what you see,

again, this is another demo, the program KeyTap works We write some text in NodePy or we write the password in Gmail. It manages to quite precisely to attack these waveforms and predict what we write on which you only hear from the sound. Now, what were the observations of this as it became generally ready? First, this is something that can be done free of charge. The process is not complicated, although I tried to optimize it, It's not a super complicated calculation. We need at least three push-ups for each of us to get some good training data. And the most interesting observation, which I was very interested in, is when he mixes the algorithm, that is, he predicts a little bit

that is not the right one, he usually predicts something that is before him. As shown in the picture, it predicted G, but the ones around it are a big variation, H, F and so on. I have two hypotheses. The first one is that the close dots are at similar distance to the microphone and in this way that they make them sound the same or just for some hardware reason, I don't know, the people who are close to the tour make a similar sound or something like that. But I'm sure that this is just observed and I find it interesting. So now, this was the first version of KeyTap. It went through Hacker News and Twitter. There were people who

planned it, but in general, it was a miracle for three days. So, I wondered what I could do to make this attack even more interesting. And then I wrote a blog post. And now, what is the main problem if I am an attacker? The main problem is that I have to somehow fill in the training data. This means that I have somehow infiltrated the victim. Can this step somehow fall off? Can I restore the text without having training data? At first glance, it sounds like an impossible task. However, if we add one extra condition to what we know, it turns out that it is not that impossible. The condition is very simple. For example, we know

that the text is written in a certain language, for example in English, and we know that it is a meaningful text, for example we write an email, some meaningful words, it is not just a random key to be typed on the keyboard. We know that it is coherent. If we know this, we can use some information from the language, some statistical information from the language, to combine with sound and try to make a guess for the most probable text, for example in English, which was written so that it matches both the statistics of the language and the sound data. And then I made the assumption in the post that this task is also This technique became equivalent to substitution cipher. I don't know what it is in

Bulgarian, but I'm sure you all know what it is. Because we have it in almost all CTF tasks. This was a very interesting observation and I found myself right. So, how does this substitution cipher work? The task is to break this cipher. Ok, so now we have this sound, we don't have the training data, we have the sound, we have detected where the keys are. We consider something I call similarity matrix, which is illustrated down in the left. This is just a matrix, one by one, one number of keys, one by one matrix and SIJ gives us key I with key J how much is it likely that they are produced from the same bucket. This comes to us

as information from the sound. If the sound is very similar, we will say that they are from the same bucket. If it is different, we will say that they are from two different buckets. If we imagine that we have a perfect way to say exactly this, whether two sounds are the same or not, we can say it 100% exactly, here it is not 100%, there is something like that, this task becomes exactly like Substitution Cypher, why? Because it does the following, we put a produced letter from the alphabet on every key, also I keep the constraint that the S-I-J matrix applies to me. For example, two keys, as I said, are from the same key, and I will put the

same letter from the alphabet. And I get the encrypted text in this way. It is actually made with some kind of alphabetization and we just have to break this substitution cipher to get plain text. And now What are the nuances? First, the substitution cipher, you probably know it, we know it from using it in ancient Rome. With modern technology it is relatively easy to break. And you need a hundred letters. If you have better, easier, there is a lower limit. But with a hundred letters, it is relatively easy to break. The simplest strategy is You start experimenting with free encryption languages and for each one you make plain text and you see how much is likely for the result to be English

text. How do you see if it is English text? What is the probability that what you have received is English text? Well, you estimate the probability the total probability of all the n grams you observe. This model is already starting to come from the language, that we know it is English. For example, the simplest is the one gram. We know that the most common meeting one gram is the letter E, from the English language. So most likely the cipher, the letter that is most common, will correspond to "i". It is not mandatory, but just as an idea. This extends not only for 1 gram, but for 2 grams and so on. I think that it is easier for a person to understand it

by reading some other code, so we will not go into details here. But it is a mechanism to say whether a given text is actually a meaningful English text. We do a brute force attack, we take the most likely English text we get and this is our assumption. The brute force attack is not optimal, it can be improved. I have improved it with a genetic approach, where we send mutations to the passbook. Then there is a beam search algorithm, which I think is one of the best things for breaking such types of codes. Now here is a deviation, for example, to do it you need this statistic for n grams. And in principle, if you search on the Internet, you can find distributions for 2 grams, 3 grams,

which are most common and you have the exact probability, but it is different for a person to sit and calculate them himself. And for the purpose I made a simple program. To calculate them you need a lot of text. Where will you get it? I found a project called Gutenberg, which is a 20,000 open source project, 20,000 books, enough text and I made a program to analyze and calculate all the probabilities for 1 gram to 6 grams and there are guides if someone needs to use them. And now, where is the subtlety in the whole thing? We discussed this perfect variant, where we say with 100% certainty that two keys are produced from one, two sounds are produced from the same kick, or they are not,

but in reality we can't say it, there is just some uncertainty. And now, in order to solve this, this matrix, my idea was to just start thresholding it, to put it another way. We will say that they are from the same coin and under that we will say that they are from different ones. This threshold is not clear how much it is, but a person can adjust it and change it. I tried to put some ideas in the cryptography and the stack exchange, but in general there is no good and clear approach how to solve this thing. So, in general, we stopped with what I told you about thresholding, I call it clustering. You do clustering, you

get the cipher, you break it and you have plaintext. Then you do another clustering with another threshold, you get the cipher again, you break it again and so on, iteratively you rotate it a lot of times. Now, the clustering has some details, I have improved it, I haven't gone into details, in general, different things can be done here. And we got the last tool I want to show you. Again with a short video. Now on the left we have a new web page RunWaKeyTabV This time there is a more complicated GUI. Now the web page is listening to sound. From the right I open a website and it generates a random paragraph of some English

text. Now I start to listen to the page and I start to write this text in Notepad. The web page is listening through the microphone. And this is fast. We see how it happens. About 150 letters are entered. The waveform is white. These are the detections that are done through the detection. It sounds like this. Of course you don't write this way, but to work with it you write with two fingers. After that we have the quantity, we calculate the similarity map, the SIJ matrix, which I will show you in the next part. This is how it looks like. And then this process I described to you, the letters are not visible, but this is what I told you, we do this in the

multi-thread, some of these The processing shows the most probable text and at the beginning everything is junk, but gradually some meaningful words start to appear. And there is also an option with this mouse, I'm just giving a hint, because as a kind of something that makes sense, it's marked and said "ok, fix this, don't change it" to help it to optimize, not to lose too much. And so, with this process, when you play, now there is some manual input, it's not just to leave it and to run the text, you manage to reproduce a small part of the unknown text. Let's not look at it to the end. And now, the critics will say "You know what the text is about, you will give the

right hints". Ok, the situation is quite idealized, I agree, but we are still making a proof concept. Despite that, I made some more real life applications of this thing and now I will show it to you. I got an email one day. who had a challenge for me. The challenge was this video. Now here is a guy who writes in Skype, he writes very slowly and writes on mechanical keyboard too. This is some kind of laptop keyboard, which is not suitable for this thing he does. Anyway, he sent me this thing, he writes: He wrote a short text which is "Hi Elvis, the file was uploaded to the Drive folder named something and the password is..." And at

the moment he started writing the password the picture was blacked out and we could only hear the sound of the password. Now it's almost there. Now it's the password. And now I just let him Kit App 2. There was a lot of manual work because as I said the code was not mechanical. But it was relatively simple. He imagined that you have to include the letters that are in the text that we know. Because if they don't participate, there's no way to do it. And the password was "Levy was here". This was the first challenge. The next one is from a CTF from 2012. It's called "Played CTF - Traitor Problem". I found it when

I was wondering where to put it. It's basically the same idea. We have audio sound that someone writes on a keyboard. This is the audio file that was given in the challenge and we need to understand exactly what it writes. In my opinion it is synthesized, but it is made realistically. There is noise. In general, in my opinion it is a very realistic example, although the pressing in my opinion is synthesized and not just recorded. And now, in the CTF section there are some hints that will help you understand this. I didn't need them at all, I just clicked on key tab 2 and it worked out of the box for fun. Now we won't

look at it, it's long, but I'm just demonstrating how I clone the repo, run it here and the result comes out after a while. Okay, and now The main drawback of KeyTab 2 was that there was some manual work. You had to give jokers. What I wanted to achieve was like in the first picture. You put a phone to listen to and then you just let it go and get the result. If I can ask, because now no one will start using Kit App 2, sit down, give it a hint, record, and then just throw it. Somehow they just want to make it more seamless and more people to test it. And that's how they made Kit App

3, which the main improvements are to make it more efficient, I improved the strategy for Shiffer breaking. Also this clustering, I made it more automated, I took all the things that you adjust in the original program, here somehow all of them are running parallel and you see the results and it is much more streamlined. Finally I took this thing, put it on a super simple page that you can run on your phone. And you can do the following. Sorry. This is the mechanical keyboard. We see that it is not turned on. This is a phone, an iPhone, which we open this page and we let it listen. It gets a lot of access to the

microphone. And we write some text. Now this green bar, it increases every time the keyboard is detected. The page is made to be collected so much higher to fill this bar and then it starts processing. Here we wrote something and it automatically starts to count and starts to show the results of this thing, of the processing. And now you probably don't see it, but it was very fast almost the original text was restored, there are some gaps, but when it writes something This is a short demonstration of using these tools to a mechanical keyboard. Thanks for watching. I didn't finish, don't think this is the end. This also happened up and down in IRL. What are the limitations? We

need a mechanical keyboard, we need to write in English, we can't use main books, for example, Shift, numbers, backspace, these things are not allowed. for proof of concept is good enough. At least that much higher. And yes, perfect typing. If this was interesting for you, you can see I've created a key tab challenge in my blog. There is an audio file with English text. It has 300 characters and the task that manages to restore them and sends me the solution, goes into the hall of fame. on my blog and for the moment there is only one person there. No, no, this is not me. And these are some of the reviews from Hacker News. Thank you for your

attention.