
Attacking LLM Detectors with Homoglyph-Based Attacks

BSides Groningen 2025 · 47:34 · 75 views · Published 2025-05 · Watch on YouTube ↗
Category: Technical · Style: Talk
About this talk
This session explores an attack vector, homoglyph-based attacks, that effectively bypasses state-of-the-art LLM detectors. We'll begin by explaining the idea behind homoglyphs: characters that look similar but are encoded differently. You'll learn how these can be used to manipulate tokenization and evade detection systems. We'll cover the mechanisms of how homoglyphs alter text representation, discuss their impact on existing LLM detectors, present a comprehensive evaluation of their effectiveness against various detection methods, and see how we can protect detectors against these attacks.
By: Aldan Creo (LinkedIn: https://www.linkedin.com/in/acmcmc/)
Event: BSides Groningen 2025 · Official website: https://bsidesgrunn.org/ · LinkedIn: https://www.linkedin.com/company/bsidesgrunn
Transcript [en]

I think I would like to invite Aldan to the stage for the next presentation. Do you want to use the mic for the recording? Yeah, thank you. Hello! Hi. How are you? Okay, we are going to talk about something super interesting today, I hope, but it's going to be something that's a little bit specific. So, you know, we have computer science. We're going to focus on something that's part of computer science, AI specifically. And inside AI, we have many different things. We're going to focus on a subset of AI: natural language processing. And inside of this thing, there's a

super small silo of people that work on how to detect when something has been written by AI. This is what we will be talking about today, and I hope that it will be interesting for you. But let me just say that everything I will say is experimental, in a way. I'm still working on it. I'll show some results and experiments, metrics, etc., and those change very often; I'm still working on everything I will say. This is work for a paper, so, you know, results change. That's the disclaimer. Now we're ready to start diving a little

bit into this thing. So how is the presentation going to work, specifically? I'll give you some notes, because I think it's important to know that there's a lot that could be said about this. I'm going to try to simplify things; we don't have a lot of time. So I'll say some things that are not 100% precise. If you want precision, you have references. They're down here, and you will get the links at the end of the presentation. So just take a note if you think that something's interesting and

then you can go and check it out. Another important thing is that we will have some sections throughout the presentation, and at the end of each section we'll have a takeaway, a thing that's important, and this is a pink box. I don't know if you can see the pink color. This means key message, something to remember. When we finish the presentation, when we wrap up, we'll go back to the pink things that we saw throughout the presentation and we'll get a general idea about what we talked about today. Okay. So, having said that, you've probably seen that you have some bingo cards. I'll explain what they are in a

second. Okay, we're ready to get started at this point, and I really don't want to waste a lot of time. So, let's talk a little bit about basic concepts first. We have a presentation that's talking about homoglyphs. So, what are homoglyphs? What is this thing? Well, let's just play a super short game. Imagine that you have some letters like this. They are the same font, so same font but different characters. And my question for you is: which one of these is the one that you get when you press A on your keyboard? So, you press A. What's the one that shows up? Third. Second from the

left. Second. I see a third here. Fourth. Okay, so, some variety of opinions here. Well, I'll tell you what it is: it's the second one. The other ones are other things, like the letter alpha, mathematical symbols, etc. The key thing here is that they have different numbers, different code points. That means they look sort of the same, but they are not the same. They are homoglyphs: they have different encodings, but they look the same. That's the idea behind homoglyphs. The things in your bingo cards are homoglyphs as well. You'll see that you have some letters where you can probably tell that you have this

thing or this thing. We will have homoglyphs in the slides of the presentation, sort of to keep you engaged. We will have them in some of the slides, and you just need to spot them. We'll have them in surprise places. And once you find the line of homoglyphs, you just raise your hand, and well, let's see who gets the prize. Okay. Yeah. No, no, we didn't. So, now we're going to start. Pay attention. Now, something you should probably be aware of is that there are some things that are not actually in the slides. It's sort of a thing to really keep you paying attention. So really,

let's see how this goes. Now, homoglyphs are not something new, right? They existed before, and you've probably seen things like URL attacks. Like, I get a message on my phone: click here to reset your password, and it has a weird character here, so it's not actually PayPal. This is patched now; it's a thing that's called Punycode. Well, this has been patched, but homoglyphs have also had some other applications in the past, such as something I won't talk about; I'll just give you the name if you want to look it up: steganography. That's when you hide information, right?
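To make the "different numbers" idea concrete, here is a quick illustrative check of the code points behind three lookalike letters (a Python sketch, not part of the original demo):

```python
# Latin 'a', Cyrillic 'а', and Greek 'α' render almost identically in many
# fonts, but each has its own Unicode code point and name.
import unicodedata

for ch in ["a", "\u0430", "\u03b1"]:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# U+0061  LATIN SMALL LETTER A
# U+0430  CYRILLIC SMALL LETTER A
# U+03B1  GREEK SMALL LETTER ALPHA
```

Pressing A on your keyboard gives you U+0061; the other two are the homoglyphs.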

Like, for example, I give you a secret rule and then a text, and if you find something secret in this text with a homoglyph, that's steganography: hiding information in plain sight. Now, we're going to talk about something else in this presentation; that was just to give you a little bit of historical context. The idea here, the takeaway, is that homoglyphs are just characters we can't discern. They look the same; they are not the same. Now let's talk a little bit about how we detect AI-generated text, which is a different thing. I'm just giving you background here. So how do we do this thing? How do we know if someone has used

ChatGPT to write something? Well, I'd say there are three main types of systems. The first of which are classifiers. A very simple idea: you just have a text, and you have a model, which doesn't necessarily have to be an LLM; it can be another type of model. I train a model so that, for example, it generates embeddings. Embeddings are literally vectors in space, lists of numbers. So, for example, I can get something like this and I can plot it here. What we can see here is that we have some things that are colored in red. Those are texts that had been written by an AI. And you can see that the model places

them here. And we also have some things that are blue; those are texts written by humans, and they are in a different region. So the model knows that something is AI and it places it here; it knows that something is human and it places it here. It's able to distinguish them. Very simple idea. That's the basic idea of classifiers. That's one of the things we can do. But there's another thing we can do, a second type of system, which is what we call in general a perplexity-based detector. A perplexity-based detector needs a little bit of background knowledge about how language models work. But we'll keep it super simple,

don't worry. So how do they work? Well, language models just play the game of filling the gap that comes after a sentence. So, for example, if I give you "As a hobby, I like to collect ___", your language model is going to give you a probability distribution over what comes next. So, it's going to tell you something like: 30% probability for "things", 22% probability for "books", etc. It's just telling you: what do I think should come after what you've written? So if we use the language model to write a text, we just choose one of the things that

are at the top of the distribution. So, for example, "books"; it's pretty likely. So I just write "As a hobby, I like to collect books", and this has been AI-generated. Now, when we want to detect something, we can use this idea. For example, let's say I have something else, something that was written by a human: "As a hobby, I like to collect bracelets." Now, if I go to my probability distribution, where is "bracelets"? Oh, it's down here. It's very unlikely that the model would have written this word, statistically speaking. So if we see that things have low probability, that means that the model is surprised when it sees those words. Or, in other words,

it's perplexed. It's surprised to see the text. That means that the text is human, and vice versa: if we have high probabilities, that means AI-generated. That's how we do detection based on perplexity, based on the surprise of the LLM. And of course, we repeat this process for the whole text and we see more or less how this goes. Now, there's a third type of thing that we can do, and this is super interesting: it's called a watermark. Watermarks are applied in many different contexts, but perhaps you've seen some of the news around ChatGPT and watermarks. It's been in the news

in the last months, about being able to detect what GPT has written, things like that. I'll explain what it is. What's a watermark? Well, we have our probability distribution. And what I did before was I just picked "books", right? That was what I picked; it was here. There was no rule or anything. But what if I give you a rule for how you choose the word? That's what I'm going to give you now. The rule is: you pick the nth word. And how do you choose n? Well, n is going to be the number of letters that you had in

the previous word. So we go to our sentence and we can see that the previous word has seven letters. So, in the probability distribution, we start counting: one, two, three, four, five, six, and seven. So the completion needs to be "plants", specifically. It can't be anything else. It needs to be "plants", because I have this secret rule. So I pick "plants" here and I write the text. This is AI-generated, but it's also watermarked. And the thing about watermarks is that we can of course repeat this idea: we just take a look at many of the words of the text, and if a lot of the words follow the rule, we say that it's

generated. And this has a very, very strong detection ability. It's super unlikely that every single time you write, you would follow a secret rule. So this is, let's say, the strongest mechanism, but there's something that we can do when it comes to homoglyphs, as you can probably expect. So this is the general idea: the detection of generated text can be done in many different ways. There are different techniques, different things we can try to do. There's a lot we could talk about here. But what happens when we start putting homoglyphs into the equation? How do they affect those systems? That's what we'll see now. But for that, I prepared a super

short demo, because I think it's more illustrative to see how this works in real life. What happens when I put homoglyphs in a text and give that to a detector? So let's just go to my browser. Where is this thing? Google Chrome. Here it is. Okay. What I have here is just a text that has been written by an artificial intelligence. I know it's a little bit hard to see. Let me see if I can make it a little bigger. That's a text written by ChatGPT, I think, or GPT-2. And if I go to a detector like this, this is just a website where I can paste the text and I can just click

here, and let's check what it says about the text. So it says it's AI right now. What happens if I go here, I replace some of the characters with homoglyphs, submit, and I copy the new version into the same website? You can probably see it coming,

right. Yes. And what happens here? Oh, suddenly it's human, but it didn't really change, right? It's the same thing. Essentially, it looks the same. Now, let's go back to the slides. We can see that the text sort of becomes unrecognizable for the detector, and it's not able to pick up the signal that tells it this has been written by AI anymore. But what happens from a technical point of view? What's going on inside the systems? Well, let's talk a little bit about the technical side of things. We talked about three different types of systems just a second ago. The first of which was classifiers. You remember we had a system that sort of

placed things in space. But what happens with homoglyphs? This is very interesting. This is what happens when I take the same text and I apply a homoglyph attack. What can you see here?
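An attack like the one applied here can be sketched as a simple character-replacement script. This is just a minimal illustration under my own assumptions (a small hand-picked Latin-to-Cyrillic lookalike map and a replacement probability), not the speaker's actual tooling:

```python
import random

# A few Latin -> Cyrillic lookalikes (hypothetical, hand-picked map).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}

def homoglyph_attack(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Replace a fraction of the replaceable characters with lookalikes."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in HOMOGLYPHS and rng.random() < rate:
            out.append(HOMOGLYPHS[ch])
        else:
            out.append(ch)
    return "".join(out)

attacked = homoglyph_attack("as a hobby i like to collect books", rate=1.0)
print(attacked == "as a hobby i like to collect books")  # False
```

At rate=1.0 every mapped character is swapped; the string still looks the same in most fonts, but byte-for-byte it is a different text, which is exactly what confuses the detectors.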

Sorry? Say again. It doesn't know. Exactly. It doesn't understand the characters, and it starts placing them in a completely different region of space. It's in a totally separate place, not even close to the things it knew how to classify before. So it gets completely confused. But on top of that, what you can see here is that the texts are more mixed. The system is placing purple dots here and also green dots here. Before, we had a very nice separation: blue things, red things, they were separate. We could classify them. When we apply homoglyphs, even if we said, "Oh, my threshold is here," well,

you have a lot of purple things here. If the threshold's here, a lot of green things on top of that. So, it's intrinsically harder to classify the text. That's what happens with the system: it just gets super confused. Now, what happens to perplexity-based detectors? The second type of system, you will remember: we had the probability distribution, etc. Well, that was our phrase. And if we go back to the same distribution, that's what we did before. We had this very nice distribution. But the problem is, if the phrase looks like this, that's different. That's very different, because what's the probability of something weird? That's super low. Like, I don't know, maybe 0%. It's

nothing I've ever seen before. So, of course, I would never write this thing. To me, this looks very human-written. Now, what happens to watermarks? You will remember this was the third type of system. Essentially, we were doing a little bit of the same idea, but with a special rule, which was this thing: I go to the previous word, and I choose my next word based on that. So, what happens here? Let's say, for example, that's my phrase. Well, okay, maybe, what is this? Is this a letter? No, it's not a letter, right? Oh, five characters, okay. So let's go to the probability distribution. I start counting: one, two, three, four, five.

"Postcards." Do I have "postcards" here? No, it was "plants", right? So, of course, no, it doesn't follow the rule. This looks very human-written to me, again. So that's a little bit of what goes on inside the systems. There are different mechanisms of action, but all of them exploit this idea of confusion. I'm trying to mess with the system; it's getting really confused about what's going on. And the question, perhaps, is: well, how effective is this thing? Does it work 20% of the time, 30%, 50%? What do you think? Let's play a small game here. I'm going to say percentages, starting at 0%. So 0% is: this technique

never works. So everyone will start with their hand raised. Okay, so we start with our hands raised, and I'm going to say percentages: 10, 20, and so on. I want you to lower your hand when you think that's more or less the percentage of the time this thing would work. So, 0%, it never works. 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100. I see a few 100s here. Okay. Well, so what's the answer? How effective is this thing? I'll just tell you what I did. I took 1,000 human texts and 1,000 AI texts, rewrote them with different percentages of replacement, and ran them through many different detectors. Okay, so let's see what

happens. But I'm just going to show you this bit by bit; I'm not going to reveal the whole thing. I'll just tell you what happens when I don't attack anything, with a specific detector. This is a specific detector on a specific dataset. And what you can see here is that it's actually very good, because it's telling me that most of the things that are human are classified as human, and most of the things that are generated are classified as generated. This is what I want in a detector. This is a good detector. Now, let's see what happens when I start applying homoglyph attacks.

I know it's a little bit hard to see. Just take a look here. We always want something like a diagonal here, right? This is not a diagonal anymore. Actually, it's telling me that everything is human, even when it was actually generated. Now, does this apply in general to other detectors? How good is this thing? Because this, as I was telling you, is just one detector on one dataset. Did I just cherry-pick the example, perhaps? So, those are the different detectors I tried, on the different datasets I tried. Those are the scores they were getting at first. You can see that some of them are actually not very good.

For example, OpenAI has a detector that's really not very good. In fact, it tended to get most things wrong. So some of them are not good, but some of them actually achieved good scores. You can see very green things here. So what happens when we take a look at the rest of the table? That's what happens. I don't know who had their hand up at 100%, right? Yeah, I'd say it's like 90 to 100, something around that. It's actually very concerning: the detectors just don't work anymore when you start doing this thing. And in general, well, yeah, this is very effective. It renders all detectors ineffective and has a lot of

implications. I won't talk about that, because you can probably see it coming; there's a lot to say about this. But perhaps what I would be most concerned about is that you can do this with a simple script. You don't need to use an LLM or a supercomputer or anything like that. This is super simple: you just replace your characters, and that's it. Very low access barrier. So can we do something about it? Are there perhaps some safeguards we can try to use here? Some things we can do to protect ourselves? Detect homoglyphs.

That is a very good answer that works in English. Uh but

an amount. Yes,

that is a great, great, great answer. It really is a great answer, and actually that's going to allow me to skip over the next slides a little bit faster, because there are some things we can try to do. Yes, we can try to just detect homoglyphs. What's the idea here? Well, we know that the AI-generated text detector is not going to work. We just saw this a second ago: they don't work when we have homoglyphs. So we can't trust them. So we need something to detect the attacks, as you were just saying. For example, if I have a human text, there are no homoglyphs. I go through the

system. If I have a text that has been written by ChatGPT, no homoglyphs here, it goes through the detector as usual. But if I have a text that contains homoglyphs, I can't go here, because I know it's going to fail; it's going to give me something that's wrong. So I go to my detector, and that tells me: oh, there's something wrong here, double-check what's going on. Now, you were just saying that we can try to detect homoglyphs here. But say this is a scientific article: we have some mathematical characters. We would get a false positive if we just checked "is there one homoglyph here". But you were very

wisely saying percentages. Is there an unreasonable amount of homoglyphs? That's the thing we should probably try to do. Percentages: how many do we have? That's actually very good. If we look at results in English, that's perfect. You can see it here; it really works. This is "I trigger my alarm if there's one homoglyph". This is "I trigger my alarm if there's an amount of homoglyphs that's unreasonable". So that's really good in English. The problem is there are other languages in the world. And what happens, for example, with Mandarin? With Mandarin, it's not like that anymore. It's actually pretty bad. So sorry, let me just go back.
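The two alarm variants compared here (trigger on a single homoglyph vs. trigger on an unreasonable percentage) could be sketched like this. This is my own illustration, not the speaker's implementation; it uses Unicode character names as a rough proxy for script membership, which is exactly the part that breaks down outside English:

```python
import unicodedata

def fraction_outside_script(text: str, script_prefix: str = "LATIN") -> float:
    """Fraction of letters whose Unicode name doesn't start with the expected script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    suspicious = sum(
        1 for ch in letters
        if not unicodedata.name(ch, "").startswith(script_prefix)
    )
    return suspicious / len(letters)

def alarm(text: str, threshold: float = 0.05) -> bool:
    # threshold=0.0 is the strict "one homoglyph triggers" variant;
    # a small positive threshold tolerates e.g. stray math symbols.
    return fraction_outside_script(text) > threshold

print(alarm("hello world"))            # False
print(alarm("h\u0435llo w\u043erld"))  # True: two Cyrillic lookalikes
```

For Chinese text the lookalike characters sit inside the same script, so a check like this sees nothing suspicious at all.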

Yeah, it's not so good, right? 61 here. So, can we perhaps think of a way to improve this, because maybe this is not enough? Actually, if you go and check how many characters you have in this script, let's just go here: there are 982 pages of different characters inside the same script. So it's not like, oh, I'm in an English text and suddenly I see a Greek alpha. No, no. This is: I'm writing Chinese, and I see a Chinese character. There's nothing weird about the text anymore. It's not so easy to catch this attack. There are so many things here that could look very

similar, but actually they are not exactly the same. It starts getting very complicated. So, take a look at this. Is it too small? Perhaps I'll put it here on the other website. I'll just show you two characters in Chinese: this one and this one. They are different. But if I go to check their information, let me go here. Main properties, right? Okay, main properties. And let me just show you what's different. Oh, nothing's different. They are inside the same script, inside the same block. You know, if you wanted to check "is there a homoglyph here", you'd usually look at these properties. You look at: oh yeah, this is Greek and

I'm in an English text, so that's a homoglyph for me. But it's really tricky to actually detect what's a homoglyph. There's no 100% clear list of what's a homoglyph and what's not. It looks like a very simple problem, but when you start to look into it, it's not so simple. I'll just give you perhaps a simple example in English. This is going to look super simple for you. Of course, one of them has homoglyphs; the other one doesn't. Do you know which it is? No, it's not the O. Actually, it's the No, no, no. It's capital I and L, which we always mix up when we're writing passwords, you know, the Wi-Fi

and everything. So, yeah. So this is H, E, I, I, O. This is H, E, L, L, O. So careful, it's not the same. So you could say, well, this is super obvious, right? You just check if you have capital letters inside your text. Well, I come from Dublin, and there's a thing that happens in Irish, which is that you can have capitals inside words. Oh, so that doesn't work all the time. It's not universal; it doesn't work in every single example. Oh, it's so confusing. Oh my god, this is so hard. The problem here is that if we have safeguards for specific languages, it sort of works. If we have a safeguard

for English, it works. If we have a safeguard for Chinese, I imagine it could probably work. But what if we want something universal, something that just works all the time, where I don't need to know what language this is, or more context about the thing? Can we do that? Can we abstract ourselves from specific languages and develop something? Oh, bingo! Who? So, who was the first one? Sorry, you. Okay. So, I'll go talk to you after the presentation and we'll talk about the prize. Congratulations. Very good. Nice. So, I'll just explain to you the

different ideas we can try here. And those are just things that came to my mind, really. There are probably other things we could try that I just haven't thought of yet. The first thing that came to my mind is: what happens if we look at the ratio of tokens in the text? What is this? Well, if you look at the text, this is an original text, and this is a text with homoglyphs; they are here. When we process this text with an LLM, the different words get converted into numbers. But, and this is a very important difference, whole words are assigned just one number. So, for example, "doctor" is this number. "Copy" is probably two

numbers, maybe. Tokenizers are systems that essentially assign numbers to words. But the problem is, if I start to have homoglyphs, tokenizers don't recognize the text anymore. So they start to break up the words into a much larger number of tokens. You end up with a lot more tokens. Now, can we use this to detect if a text has been attacked? Of course, we don't know what the original was; we only see the attacked version. But if we somehow came up with a process to normalize the text, and we checked the ratio of tokens from here to here, maybe

that's something we can try. Maybe we can set a threshold: everything that's more than this threshold has been attacked. Another idea, perhaps, is perplexity. Again, you probably remember this thing about the probability distributions. What we can see here is that we have a blue graph, a text that has not been attacked, and the thing about it is that it doesn't go as low as the other one. So maybe we can again take advantage of that: we have different distributions, so maybe we can use perplexity to do detection as well, as a ratio. Again, I actually don't have a lot of time for a demo, so I'll just let

you check it out yourselves: if you search on Hugging Face for my name, ACMC (you have it on the bingo card), there's a thing called homoglyphs alarm. I'll just ask you something that I think is very interesting: what do you think? What does your heart tell you would work better, tokenization or perplexity? Looking at the number of tokens, or looking at how surprised a language model is when it sees a text that contains homoglyphs? Who thinks tokenization is better? Who thinks perplexity is better, looking at the amount of surprise that a language model has? Well, actually, these are very early experimental

results, so they will probably change, because I'm still working on this. But it looks like tokenization tends to be better. Perplexity is not so good; I mean, it's better than other baselines, but it's not so good. Does this extend to other languages as well? What we can see here is: I took the C4 dataset. I don't know if you know what C4 is; it's what was used to train, for example, GPT-3 and things like that. It's a huge dataset of websites that have been crawled by Common Crawl, and you have a lot of texts in a lot of languages. Here I have 97

different languages, from a lot of different texts that just come from the internet. So I would expect that this is representative of what you would find in the wild west of the internet. And what we can see here is that not only is tokenization better, it actually has a smaller standard deviation. So in general, across the different languages, it also tends to be better. Perplexity sometimes can be very good, but sometimes it can be very bad; tokenization is reliable in most cases. So it looks like a very good thing to me. And if we look at time, that's also very nice, because of course an if

alarm, a simple "is there one homoglyph in my text" check, is super fast. Now, if I start looking at percentages or things like that, this is my implementation. I didn't do it in Rust or anything like that; it was Python. So, of course, you would probably expect that it could be lower, but it's not significantly faster than tokenization. Actually, tokenizers are very efficient systems, and you can see that they don't take so much time. Perplexity does take a lot of time. And this is a logarithmic scale, so it really does take a lot more time. But if we consider both things, tokenization seems to be more accurate and more scalable. So that's

nice. Of course, this is something I'm still looking at, so I don't know if the results will hold as I keep doing the research. But that's a little bit of my key message here: there's a lot of work we need to do. We need to keep innovating, keep working to break things and fix them, and really make AI safer for everyone. So, now is the moment to tell you the different slides where we had homoglyphs, and to check if you really were able to spot them. Some of them were easy. For example, the first one: where was it? It is here, in the word "plain". Did you spot

that? Yep. No. How many of you did? Okay, a decent amount. That's very good. Then we also had the PayPal thing. To be honest, I wasn't sure if many people would get it. Okay, that's very good; actually, I'm surprised, because I was thinking, oh, they will just think this is an example, so it's not part of the bingo. Well, actually, it was part of the bingo. Then we had this E with, you know, the dot on top. I think this was easy, right? Yeah, I see some hands. Very good. Okay, nice. What else did we have? Oh, yeah. Okay, I think this one was hard. I think this was

very good. Very good. Can you see it? I know it's very small here: do you see the D with the thing, like on your card? We had a quote that has a D with a double-quote mark. This was hard; that was a little bit on purpose. Then, you know, we had some homoglyphs when we were talking about safeguards as well, and perplexity. I think at this point you had already figured it out, so probably many people stopped paying attention, but we did have some of them throughout the presentation. Good job, good job. I see most of you were really

paying attention, so that's very nice. Let's recap. I promised you we would fetch the different pink boxes throughout the presentation and build a summary at the end. So now is the moment to do that. What did we learn today? We started talking about homoglyphs. We talked about how we can't really tell them apart: they look the same, different encodings, we can't discern them. Then we learned a little bit about how we can detect generated text, how those systems work. We also had a few words on how attacks on texts with homoglyphs make them unrecognizable. We really can't know what they say anymore, from the point of view of the systems,

right? So, we took a deep dive on that. We really looked into the technicalities. We saw there are different mechanisms of action; things work differently, but all of the systems are exploited through this idea of: let's confuse the system, let's make it predict in very random ways. It was highly effective. We took a look at the results, and it was very, very effective, a really strong result. And something that was actually quite concerning, for me at least, is that the access barrier for this thing is really low. Anyone can do it. So that's not very nice. But of course, there are some things we can try. There are some safeguards, some ideas we can work on, and that's the

point where we are now. The work to do is on us: we really need to keep working on this, keep innovating, keep improving. And that's where I leave you, with a super short feedback form. If you can just scan this, it takes 20 seconds, I promise; it's literally two questions: did you like the talk, and was it too technical or too simple? Quick feedback really helps me tailor the presentation and make it better every time. So it's a great help, but I'll also give you an extra

incentive to fill in the form: if you do, you get the bibliography. I promised you would have the references, and this is how you get them: scan the QR code, answer the two questions, and you've got the bibliography. I would really appreciate it if you could help me with a little bit of feedback. That's one way to help me. The other way is that, if this topic interests you, I'm open to collaborating. This is something I'm working on at the moment, so if you think this is an interesting research question for you, let's talk a little bit more

about it. And that's where I close the presentation. I hope you enjoyed it, and if you have any questions, I can take them now. Thank you. [Applause] I can take one or two questions. Okay, let's go with your question first. Is it possible to create some kind of feedback loop where you feed the results back into the learning algorithm and generate text that isn't detectable? You mean to make these attacks better? Yeah, exactly. Well, I think you can, but the attacks are very effective even at low replacement percentages, so you wouldn't really need to improve them much beyond what they already are; they are very effective already. Perhaps the feedback loop would actually be a very good idea for alarms: does my alarm actually get triggered when it sees homoglyph attacks? What are its false positive and false negative rates? I think that could help in the context of alarms and detecting texts that have been attacked. So that is a really good idea; feedback loops are really good in general. And I see we had another question. Thanks a lot. The detection mechanisms that you mentioned don't seem so hard to implement, especially for English. Why

are all the existing detectors failing on this? Because you don't develop your systems thinking about this. You just build your detector with normal texts, and if you added this extra step, the alarm or something like that, your detectors could cope; if you were considering English specifically, your detectors would work very well. But you just don't think about it when you build these systems. The research question when people are building detectors is "can I detect a text?", not "can I detect a text even if it has homoglyphs?". That's a somewhat different thing. But yes, I agree, in specific languages it's...
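To make the alarm idea concrete, here is a minimal sketch of the kind of pre-detection check being discussed, added here for illustration only (it is not the implementation from the talk). It crudely infers each character's script from its Unicode name and flags alphabetic characters outside the expected script:

```python
import unicodedata

def script_of(ch: str) -> str:
    """Crude script lookup: first word of the character's Unicode name
    (e.g. 'LATIN', 'CYRILLIC', 'GREEK')."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no name in the Unicode database
        return "UNKNOWN"

def homoglyph_alarm(text: str, expected: str = "LATIN") -> list:
    """Return (index, char, script) for alphabetic characters that do not
    belong to the expected script: a cheap, detector-agnostic pre-filter
    to run before handing text to any LLM detector."""
    return [(i, ch, script_of(ch))
            for i, ch in enumerate(text)
            if ch.isalpha() and script_of(ch) != expected]

print(homoglyph_alarm("paypal"))   # [] -- clean text, no alarm
print(homoglyph_alarm("pаypаl"))   # flags the two Cyrillic 'а' (U+0430)
```

In practice an alarm would also need to handle confusables within a single script and mixed-language text (Unicode's confusables data covers this), but the point stands: the check is cheap and works in front of any detector.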

So I'll just repeat the comment here for the recording: basically, if we develop an alarm, we would be able to detect those sorts of attacks in general, with any detector. And that's true: if we are able to have an alarm that works very well, that's perfect, we don't have this problem anymore. Now we need to work to get that in a universal way that works all the time. But I'm positive that can exist; we just need to do the work to get there. So, I hope that was interesting for you, and I'm still here for questions if you want. I'm

around. Thank you. [Applause] Thank you for your amazing talk. I have a small present from Coning for you, especially for you: a lot of the other speakers were coming from the Netherlands, but if you're coming from Dublin, it's a very nice way to see Conia. I saw quite a lot of hands for other questions for Aldan as well; make sure to stay around during the drinks, there will be more than enough time to ask him any other questions. We're trying to set up the mic so we can continue on to the last talk, so please stay in the room. Are we going to get the

last talk?