
DLP Without GOD Keys: A Journey in Maths

BSides London · 22:44 · 70 views · Published 2023-05 · Watch on YouTube ↗
About this talk
Data leakage protection systems traditionally rely on 'GOD keys' to decrypt traffic for inspection, breaking confidentiality and authentication protocols. This talk explores an alternative mathematical approach using information entropy, packet timing analysis, and linguistic patterns to detect data exfiltration without decryption—applying concepts from Shannon, Kullback-Leibler divergence, and Zipf's law to identify DNS tunneling and covert channels.
Transcript [en]

Okay, who am I? I'm not going to read it, other than to say I'm fascinated by the predictability of humans, and there's a bit of a story in that, aligned to mathematics. I've also been advised to give a health warning: there are two formulas in this one, and a lot of old stuff.

First things first, a bit of a recap on what data exfiltration is. It's where a document is being leaked from one place to another, possibly through a network edge device, and quite often these are documents that you may want to have some control over. Data exfiltration is the transmission of those same documents when you're actually not allowed to, so the no-entry sign over FTP says you can't transmit it. But there are other mechanisms, and one of them is DNS. There's iodine, which tunnels traffic over DNS, so you get a file transfer protocol over TCP over DNS. In its simplest form a DNS lookup is made: I've got a base-36 encoded message as a domain name that no one will recognize, to transmit a message to the outside, to a controlled DNS server, and the responses normally come back through TXT-encoded fields. That tends to be how it's done.

Data leakage protection is normally done through some kind of proxy. Data is deposited onto the inspection proxy, and quite often, if there's encryption involved, the inspection proxy will then re-present the individual that is transmitting the data, through what I refer to as a "god key". There are a number of problems with this. One of them is confidentiality: it breaks all kinds of protocols. The other is that it really messes with any form of authentication that uses certificates, because effectively the certificates aren't being passed through, they're worked around, and that causes a bit of a problem. So my question was: is there another way?
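As a rough illustration of the kind of encoding an iodine-style tunnel relies on (the base-36 alphabet, the 60-character chunking and the `evil.example.com` domain here are my own assumptions for the sketch, not iodine's actual wire format), a payload can be smuggled into DNS query names like this:

```python
B36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(data: bytes) -> str:
    # Treat the payload as one big integer and re-express it in base 36.
    n = int.from_bytes(data, "big")
    if n == 0:
        return B36[0]
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(B36[r])
    return "".join(reversed(out))

def to_dns_queries(data: bytes, domain: str = "evil.example.com") -> list[str]:
    # DNS labels are limited to 63 characters, so split the encoded
    # payload into label-sized chunks with a sequence-number label.
    encoded = to_base36(data)
    chunks = [encoded[i:i + 60] for i in range(0, len(encoded), 60)]
    return [f"{i}.{chunk}.{domain}" for i, chunk in enumerate(chunks)]

queries = to_dns_queries(b"secret document contents")
```

Each query looks like an ordinary (if meaningless) lookup to a resolver, which is exactly why the talk goes looking for statistical tells instead of content.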

So, first port of call. That is deliberately somewhat obscured: it's a Wireshark dump of a conversation between my PC and the BBC website, basically a background push notification. In this particular case I've set the timing to microseconds, the inter-packet arrival time, because it's a good one to work with. You've also got the source IP address, the destination IP address, and then you've got packet lengths. There's a lot more of it, and I assure you it starts to get into a bit of a drumbeat, so there is a rhythm there; but to all intents and purposes it doesn't appear, on the face of it, to be a lot of useful information.

I then introduce, and I have to stare at this because I can never remember this guy's name, Stefan Burschka, who produced an absolutely incredible talk (I did say there was some old stuff) back in 2011 at the Chaos Communication Congress in Berlin. I've always wanted to go, but I've got kids, and it occurs between Christmas and New Year, which is not a good time. Basically what he said was that the human ear is an incredible thing, because it effectively performs Fourier transforms; that's how we recognize things. It's a really cool approach. His talk starts off with the premise that you should be thinking somewhat outside the box, and what he's looking to do is detect a specific sequence, a particular message, even though it's encrypted in a Skype packet. He does it by looking at timing, effectively the inter-arrival time of packets, and he also includes the packet sizes. There are ways to defeat it, and there are lots of encryption systems where you pad the packets out, but for all intents and purposes Skype doesn't do that.

What's also quite interesting: one of his research students, and again I can never pronounce the name, did a piece of research. One of the problems identified with the approach was that you need context, you need the sentence. So what they did was apply it to file transfer protocols. In this particular case, to the keen eye, you'll notice that the graph on the left, which is effectively the percentage of packets plotted against packet sizes, is very similar to the one on the right, and there's roughly 100 bytes between them, which happens to be the per-packet encryption overhead. The point is that even though one is ESP-encrypted, the envelope looks the same, which is quite an interesting characteristic.

At this point I introduce, I'm not quite going to say a hero of mine, Claude Shannon, the father of information theory. He's particularly famous for juggling on a unicycle, but he was also a researcher and one of the chief engineers over at Bell Labs, and he's responsible for the noughts and ones. He effectively gave information a quantity: he said there is a quantitative measure of information, and here's a way of calculating it, in bits. So, if you consider that the characters in a file are typically a byte in size, there is the potential for eight bits of information per character. One of the things he was looking at was: how could I possibly transmit this efficiently? If I allow every character the full eight bits, that isn't going to take advantage of any efficiencies that could be achieved. So he created a calculation, entropy (there's the first formula), and entropy is effectively the average utilization of each of the characters within the file.

To make this slightly easier to get your head around, I've created a couple of bar charts. First and foremost, there's a text file which is literally the characters nought to nine and A to Z, in the order they appear on your keyboard. In this particular case there are 5.2 bits of entropy. What that means is that to represent those 37 characters (I say 37 because we've also got the carriage return on the end, 36 plus the carriage return) we need roughly 5.2 bits: 2 to the power 5.2 comes to roughly 37, not quite, but thereabouts. The other thing I want to draw attention to, which I'll come back to later in the talk, is the efficiency. In Shannon-speak, each of those characters is what's referred to as a symbol, and you'll hear me talk a little about the efficiency of symbol utilization, which is what we've got there: in this particular case, because of the way I have used them, we're seeing an efficiency of 65 percent.
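As a minimal sketch of that calculation (my own Python, not anything shown in the talk), Shannon entropy and the "efficiency" figure, entropy divided by the eight bits each byte could carry, can be computed like this:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    # H = -sum(p_i * log2(p_i)) over the observed byte values (symbols).
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def efficiency(data: bytes) -> float:
    # Fraction of the 8 bits per byte that is actually being used.
    return shannon_entropy(data) / 8.0

# 36 characters plus a trailing newline: 37 distinct symbols.
sample = b"0123456789abcdefghijklmnopqrstuvwxyz\n"
print(round(shannon_entropy(sample), 1))  # roughly 5.2 bits
print(round(100 * efficiency(sample)))    # roughly 65 percent
```

With 37 equally likely symbols the entropy is log2(37) ≈ 5.2 bits, matching the bar chart's figure, and 5.2/8 gives the 65 percent efficiency quoted.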

So we're not making efficient use of that symbol set. We can get slightly more efficient if I go all the way around my keyboard, literally all the printable characters, and in that particular case you'll see on the left-hand side of the bar chart that the entropy has been pushed up to 6.4 or thereabouts, but also the efficiency has gone to 80 percent, because we're making better use of the character set, the symbols, we've been given.

And again, something that will come up, I've got a couple of other graphs just to show it: by plotting the distribution of the symbols in a file, it can be possible to detect the type of file you're looking at. This one here is an audio file. Linking back to audio, the reason it looks this way is because that's what audio looks like: you get pluses and minuses. There's a little bit more data in there, it's slightly more efficient at 74.9 percent, and in this particular case there's nearly six bits of entropy, 5.99. This is a 24-bit graphics file, and again, because of how it's encoded, it is somewhat more efficient, 97.9 percent, and this is pushing the envelope, getting closer and closer to those eight bits: in this particular case it's 7.8. A compressed file is by definition efficient, you take out all of the duplication, so we can see efficiency running at 99.9 percent and the entropy getting really close, this time 7.993. And if I then look at an encrypted file, no surprise, it's slightly further along still, but not by much.

So how can we use this? How did Shannon use this? Morse code. Actually it's International Morse code, which is aligned to international English. It's an encoding system specifically designed to carry the English language. That's why E and T, the most frequently used characters, are given the smallest amount of data, the smallest encoding requirement, to deliver them, whereas Z, Y, X, and H I think is in there as well, are all four elements long, which means they take more effort to encode. So, going back to the question: if I were to put another language through it, one with a different structure, then effectively we'd be nowhere near as efficient, and we'd need a different Morse code to make it more efficient.
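To see why the code is tuned to English, here's a small sketch comparing the average number of Morse elements per letter under English letter frequencies against a uniform weighting. The frequency table is an approximate textbook one I've assumed, not something from the talk:

```python
# Number of dots/dashes per letter in International Morse code.
MORSE_LEN = {
    "E": 1, "T": 1,
    "A": 2, "I": 2, "M": 2, "N": 2,
    "D": 3, "G": 3, "K": 3, "O": 3, "R": 3, "S": 3, "U": 3, "W": 3,
    "B": 4, "C": 4, "F": 4, "H": 4, "J": 4, "L": 4, "P": 4, "Q": 4,
    "V": 4, "X": 4, "Y": 4, "Z": 4,
}

# Approximate relative letter frequencies of English text (assumed values).
ENGLISH_FREQ = {
    "E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "I": 7.0, "N": 6.7,
    "S": 6.3, "H": 6.1, "R": 6.0, "D": 4.3, "L": 4.0, "C": 2.8,
    "U": 2.8, "M": 2.4, "W": 2.4, "F": 2.2, "G": 2.0, "Y": 2.0,
    "P": 1.9, "B": 1.5, "V": 1.0, "K": 0.8, "J": 0.2, "X": 0.2,
    "Q": 0.1, "Z": 0.1,
}

def avg_len(freq: dict[str, float]) -> float:
    # Expected dots/dashes per letter under the given frequency table.
    total = sum(freq.values())
    return sum(freq[ch] / total * MORSE_LEN[ch] for ch in freq)

english = avg_len(ENGLISH_FREQ)
uniform = avg_len({ch: 1.0 for ch in MORSE_LEN})
print(round(english, 2), round(uniform, 2))  # English comes out noticeably shorter
```

A language with a different letter distribution would push the weighted average back up toward the uniform figure, which is exactly the speaker's point about needing a different Morse code.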

There were other mathematicians too; these guys are effectively the reason I thought I'd talk about this. Kullback and Leibler provided a mechanism to look at efficiency based on an encoding system. What they said was: if you have an encoding system that is optimized for one particular data stream, how much more or less efficiently would that encoding system perform on another stream of data, a data stream Q? What we write as the divergence of P from Q provides you a mechanism to compare the two. They were also, and I'm trying to remember what the paper was called, founders of and scientists at the NSA, where they were looking at alternative mechanisms for identifying crypto, looking for keys and other things within data streams.

To illustrate my point, and I hope this comes out, I've gone back to the first two files and asked for the Kullback-Leibler divergence, which is effectively how much more efficient the encoding system of one would be on the other. What we're seeing here, and it was no surprise, it became obvious from looking at the first two slides, is that the additional entropy, the inefficiency, is 1.185 bits. So if you look at the amount of entropy that was required to transmit the first file, and then use an encoding system optimized for that on the second, that's how much of an overhead, how inefficient, it would be; it effectively gives us a calculation to deliver that. One of the problems with Kullback-Leibler is that it's not symmetric: just because it would take an extra 1.185 bits to transmit the smaller file with a code built for the more complex one, that doesn't mean the reverse is minus 1.185, and I think if you reverse the calculation it comes out at about half a bit of entropy. This is all about the encoding, the efficiency of the character set within the encoding, so it's not quite the same in both directions.
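A minimal sketch of that comparison (the sample byte streams and the add-one smoothing, which keeps every symbol's probability non-zero, are my own assumptions for illustration):

```python
import math
from collections import Counter

def distribution(data: bytes) -> dict[int, float]:
    # Smoothed symbol probabilities: every byte value gets at least one count,
    # so the divergence below is always finite.
    counts = Counter(data)
    total = len(data) + 256
    return {s: (counts.get(s, 0) + 1) / total for s in range(256)}

def kl_divergence(p: dict[int, float], q: dict[int, float]) -> float:
    # D(P || Q) = sum p(x) * log2(p(x) / q(x)): the extra bits per symbol
    # paid when data drawn from P is sent with a code optimised for Q.
    return sum(p[s] * math.log2(p[s] / q[s]) for s in p)

p = distribution(b"the quick brown fox jumps over the lazy dog " * 20)
q = distribution(bytes(range(256)) * 4)  # a much "flatter" source
print(kl_divergence(p, q) >= 0)                    # True: never negative
print(kl_divergence(p, q) != kl_divergence(q, p))  # True: not symmetric
```

Running the two directions side by side is the quickest way to see the asymmetry the speaker describes: swapping P and Q gives a different number of extra bits.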

Okay, so going back to the paper I didn't mention at the beginning, oops, sorry: the fourth mathematician I want to introduce is a chap called George Zipf, who produced Zipf's law. Effectively he identified that if you were to analyze the words a human being has generated, you would find that the most common word appears roughly twice as often as the second most common. It doesn't make sense when you first think about it, but the thing is, it's true. To demonstrate this, some research was done in 2015 (apologies, 2015 is the most recent): someone took all the language versions of Wikipedia, crunched the first 10 million words from each of them, and plotted the results on a log-log graph, which basically means that instead of an exponential curve you get something nearly straight, which just makes it easier to read. And they showed that it just is the case, which basically means we are predictable in what we do; in that particular regard it's totally predictable. And in the paper I glossed over at the beginning (I apologize, my nerves), Shannon took hold of this particular aspect and provided a mechanism to predict what comes next in a sentence, which is sort of where I got hooked on this whole thing twenty-odd years ago. It literally allows you to predict what is coming next. So my question is: is there a way I can use that somehow?

Going back to the original problem: as it happens, for DNS tunnel detection, some research was done in '04, and what they showed is that if you look at normal DNS names and subdomains, the totally untamed ones, they form the blue graph, which follows the Zipfian curve. In comparison, if you look at TCP-over-DNS or other DNS tunneling mechanisms, you can see that the plot is almost flat. The reason is that encoding systems, and encryption likewise, try to be as efficient as possible: whereas us humans are inefficient with characters, encryption and encoding systems are very efficient. So I think, in essence, there is a strategy there. I had another slide, which I took out, showing that this technique has been used for detecting the likes of iodine and other kinds of covert exfiltration systems, which is quite neat.

Is it possible to implement some kind of watermark? My take on it is quite simple. If we were to apply a watermark of some description, let's call it a certificate, to a document, then in all probability (no pun intended) we could detect the fact that the watermark is there. The reason is that the watermark is machine-generated, and certificates tend to have very high entropy, so you're going to be able to see them. The analogy I would use is that you'll see entropy running along at a certain rate, and then you'll see a bump. The problem is that if the document has been put through any form of encryption, or any encoding system that is efficient, then by definition you will harmonize that bump out. I've come to the end; I got a bit carried away. Any questions? And apologies if that was a bit of a mess. [Applause]
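The DNS-tunnel detection strategy described above, that human-chosen names have a skewed, Zipf-like character distribution while encoded tunnel labels use their alphabet almost uniformly, can be sketched with per-label entropy. The sample labels and the 4-bit threshold are my own assumptions for illustration, not figures from the '04 research:

```python
import math
from collections import Counter

def label_entropy(label: str) -> float:
    # Shannon entropy of the characters in a single DNS label.
    counts = Counter(label.lower())
    n = len(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_like_tunnel(label: str, threshold: float = 4.0) -> bool:
    # Encoded payloads spread across their alphabet almost uniformly,
    # so their per-label entropy sits well above human-chosen names.
    return label_entropy(label) > threshold

human = ["google", "bbc", "newsimages", "mailserver"]
encoded = [
    "t1qv9zp3xk7m2c8wfjh5b0dy6rn4lges",   # hypothetical tunnel labels
    "0a1b2c3d4e5f6g7h8i9jklmnopqrstuv",
]
print([looks_like_tunnel(x) for x in human])    # expect all False
print([looks_like_tunnel(x) for x in encoded])  # expect all True
```

A production detector would obviously need rate limiting, allow-lists and per-domain baselines; this only shows the statistical tell the talk is built around.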

[In response to an audience question] I'd like to think so, by virtue of the fact that the alternative is that someone has to spy on absolutely everything. I have a firm belief that I trust mathematics; I don't trust configuration. So to that end, I don't want someone reading the data. Part of the problem I identified quite early on is that if the maximum entropy available is eight bits, and the envelope of the packet going through has an entropy of 7.99999, you don't really have a lot of space left in which to tell that something else is going on in there, and that's part of the problem. So I think all the tools are there, but it's one of those things where, at the present moment, you have such a small margin to play with that it becomes quite difficult, and I think that's my realization.
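That headroom point can be sketched numerically. The "encrypted" stream here is simulated with a SHA-256 counter keystream, my own deterministic stand-in for real ciphertext:

```python
import hashlib
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def pseudo_ciphertext(n: int) -> bytes:
    # Deterministic stand-in for encrypted traffic: hash an incrementing counter.
    out = b""
    i = 0
    while len(out) < n:
        out += hashlib.sha256(i.to_bytes(8, "big")).digest()
        i += 1
    return out[:n]

plain = b"the quick brown fox jumps over the lazy dog " * 100
print(round(shannon_entropy(plain), 2))                    # well below 8 bits
print(round(shannon_entropy(pseudo_ciphertext(4096)), 2))  # close to the 8-bit ceiling
```

Plain text leaves bits of slack per byte in which an anomaly could show up; a near-8-bit envelope leaves almost nothing, which is exactly the difficulty described in the answer above.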

[Inaudible audience question]

I think part of my concern on that particular front is accuracy. You want a high level of certainty that you are seeing something and that it's not an outlier. And I have a bit of a problem with Bayes, by virtue of the fact that with a lot of machine learning you're training on past behaviour, so ultimately you are going to make the same mistakes. But I'm not sure; it's an accuracy problem, it's not the fact that I haven't seen this before or that I've seen it a different way, it's actually about the accuracy. Another avenue of this, which I have started looking at, is to start looking at things like 16-bit and 32-bit entropy. One of the problems is that they're quite hard to calculate, because you've now got a really massive symbol space: if you're talking about byte entropy you're talking about 256 symbols, but as soon as you go to 32-bit entropy you're up in the billions, and by 64-bit entropy the numbers are massive. So to actually be able to calculate something and create yourself some definition becomes hard. The example I hypothesized was: if I come back after six months of processing and I can still only give you a 50-50 likelihood that this was there, you're not really getting anywhere; a coin toss is much more accurate, much quicker, if that makes sense.

[Audience member] I have a question. Really great talk, thank you very much, I enjoyed it, though I think I might be biased because I'm a mathematician. I was going to ask about speed and accuracy, and whether you've tried, say, the Fourier approach or other ideas to try and pull out other periodicities within the data, within the patterns that you find in inefficient, unencoded data.

The research that was done off the back of that was Burschka's, and again I can never remember his name. I'm more than happy to share the slides if anyone wants a chance to see that slide; the guy is brilliant. He quotes Sagan, you know, think outside the box, don't go for the obvious, pretty much what you're saying. Part of his issue is that a lot of these things are defeatable. For example, current encryption algorithms have weaknesses because their keys are of a certain size and so on, so you have periodicity you can rely on, since there's only so much you can morph the traffic. But as soon as you start having clever padding mechanisms (and padding mechanisms are themselves defeatable, if you know what the algorithm applying them is) the problem is that each time you're upping the ante, and you still have this issue with accuracy. In his particular case he was looking for speech, and he started off with the principle that: I'm looking for this particular sequence of speech; speech forms naturally; therefore even if you encrypt it, you are going to struggle to lose that particular perspective.

The only other thing I did look at, and again I've got pieces on it, was whether it was possible to take a different approach to defeating DNS tunneling, and one of them is to say: if you can identify the timing associated with FTP, is it possible to identify outliers, places where FTP is being misused for tunneling? For example, with FTP you'll typically see a lot going in one direction but not so much coming back in the other, and you'll see other scenarios which may show through. The issue with a lot of these things is that they're really good for modelling TCP-related traffic, but as soon as you go into UDP it becomes a real problem; it's very difficult to do because of the caching and all the other things that go on.

All right, cool. Sorry, was that okay? How was that pitched? I had a lot more maths in there because it's my passion, and I took a lot of it out because I really wanted to make sure it was accessible to people, because I am so passionate about this subject. Was it okay? Thank you. [Applause]