File Polyglottery or This Proof of Concept is Also a Picture of Cats

BSides Philly · 201842:4718 viewsPublished 2018-11Watch on YouTube ↗

Speakers

Evan Sultanik

Tags

CategoryTechnical

Mentioned in this talk

Tools used

7-Zip Ghidra Git Hex Fiend

Show transcript [en]

everybody thanks for still hanging with us it's a it's great to see so many people still here even late in the day so I just want to introduce some evidence well tonic so without further ado please give a warm welcome to Evan Saul tank thanks

and when it wasn't so great to be a software engineer anymore so I went to school I got a PhD in computer science and then I spent a few years at Johns Hopkins doing research for government agencies that don't really like me talking about them then I helped start a small cybersecurity R&D firm in the DC area called Digital operatives I was there for five years and recently I became a security engineer at trail limits and throughout this whole time I have occasionally been moonlighting as an adjunct professor here at Drexel also over the past few years I've been a frequent contributor to and editor of Parker gtfo so POC stands for proof of concept or

alternatively we accept pictures of cats unfortunately since the name was coined there's another interpretation for POC that has become increasingly popular at least here in the states it does not in this context mean person of color I say this because for non-technical people who don't understand that a double bar means or it can get a little bit awkward if you're reading one of our publications in public so what it is is a roughly quarterly journal in the tradition of frak and uninformed primarily centered around offensive security research and things like stunts hacking issue our 18th issue zero indexed will be released later this month that CCC we this past October released issue 16 here in Philly was the

only North American release paper release at PubCon and we always release for free digitally about a week or so after the paper release each digital release is also a polyglot so that's what I'm going to be talking about in this talk a polyglot is a file that can be interpreted multiple different ways so it's a single file that is valid to be parsed in multiple different formats so each one of our PDFs is also a valid zip file you can run unzip on our PDF so you can get things like well if you're old enough to know what feeley's are you get feeley's or additional files and content to go along with articles that are inside of

the journal we've done things like had PDFs that are valid Android programs or apks flash and so on and so forth one of the most frequent questions I get asked or we get asked is you know what it looks like your numbering the issues in hex what happened two issues A through F we just went from 9 to 10 well we number in binary coded decimal in honor of the HP Saturn architecture so I hope that answers anybody who's been curious about that this past summer we published a collection of the first number of issues in an over eight hundred page book which has been published by no starch who have graciously sponsored this conference so

I urge you to go check them out recently became available on Amazon but if you can try and get it through no starch especially since they're in the US and you don't have to pay exorbitant shipping fees so does anybody actually have a coffee with you no okay this is what it looks like oh yeah okay so I should fleet please rise and open your hymnals to page 4 Hunter I hold on you see we have a convenient ribbon to go in our gilded Bible paper which is really difficult to get to the page that I had set this to there we go yeah on our luxurious faux leather binding a file has no intrinsic

meaning the meaning of a file its type its validity and its contents can be different for each parts or interpreter thank you you may be seated so it's slide four and I haven't told you really what this talk is about yet so as I mentioned each issue of paco gtfo is a polyglot they're usually created by almost all Bertini philippe Tuen or myself or some combination of the three of us and occasionally we enlist other people to help and if you have an idea for a polyglot feel please feel free to contact us we're always looking for new ideas so this talk is going to be about the ones that I contributed to and some of the technical

aspects of them the goals of this talk are twofold one to convince you that these aren't just nifty parlor tricks but they're actually useful and important for everybody in the security community to understand and secondly for you to learn some tools to actually do these things yourself and to kind of pull away the curtains of some of the mystery about how they're done earlier this week just this week a a polyglot vulnerability was publicized so this was I believe reported to Google over the summer but Android programs apks are essentially a zip files they're jars which are zip files and zip files don't care don't have to start at offset zero they don't care about anything that

occurs before them so it was discovered that you could have a valid apk file in an Android program that was also a valid Dex file which is another form of Android executable so what this meant was you could create you could modify Android programs apks Oh you can modify Android programs without changing their signature so you could install new programs on Android devices without a signature violation happening so that we use the similar technique a year before to create a PDF that is also a valid Android binary for pocker gtfo I think that was issue 11 something like that so before we get to apakah gtfo one more neat trick that i want to drop on you so this seems like a

pretty innocuous command right does anybody notice anything odd about it or wrong about it maybe yes there is nosy but it turns out that most modern versions of tar automatically figure out that oh hey based on the magic bytes in this file this looks like it's Jesus so I'm going to be helpful and G unzip at first so you don't even need to give a Z if it is a tar.gz regardless of the whatever you name the file it doesn't look at the file name at all just looks at magic bytes if the file is at all as gzip tar ball it will G unzip at first without you telling it to so I saw once I

learned this it like tripled my productivity but it also got me thinking so what if we created a file that's both a valid tar and a valid hard ah GZ so if you if you had a file like this you can create the way it's interpreted is going to be dependent on whatever program you choose to extract it so if you send it to guitar it's going to produce one thing but another program that doesn't do that trick or that helpful trick of determining whether it's gzipped then you're going to get something completely different okay so why are PDFs in particular particularly polyglot Abel first of all because of Adobe you'll hear me say that a lot through the

remainder of this talk the PDF format has been around for a very long time and for the first 15 years or so of its existence it was a proprietary format so Adobe did basically whatever it wanted there has been a long history of viewers and interpreters creating PDFs that don't exactly conform to the standard even though there is a published standard so they are resilient to all sorts of inconsistencies and corruptions and errors within the format so you can create a pretty broken PDF and most PDF readers will view it without even giving you a warning PDF it allows you to have arbitrary length binary blobs almost anywhere within the file so one of things that won't even be rendered you

can just insert random stuff and that it within the PDF and put whatever you want in there and almost all parsers ignore everything before the PDF header so PDFs do not really need to start at offset zero in the file that's because for various reasons PDFs are usually parsed backwards from the end at the beginning so once they get to the PDF header they don't need to go backwards anymore because they got to the beginning even if they're not at file offset zero okay so here's a scenario let's say Alice sends Bob an email she's practicing the social engineering that she learned from the previous talk if you were in this room and she says you know I've been trying

to get my resume to so-and-so in Human Resources but you know I've been having problems with my email attachments can you please print out the attach cop copy and give it to them so Bob gets it and you know he opens it up and you know that looks legitimate it opens in his viewer on his computer so then he sends the file to the printer but little does he know that that file is a postscript with embedded printer job language code which has the MIPS firmware to overwrite the firmware on the printer and suddenly Alice has control of the printer and of course since the printer is on the corporate LAN then Alice gets control of

the land - this has been known for a number of years okay but this got me thinking what if you could create a people rarely use PostScript anymore like that might be a red flag if somebody's sending you a postscript file what have you created a PDF that was also valid PostScript so if you send the bites of that PDF directly to a printer it will print out something completely different than what the PDF was so and then something completely different comes out so we did this as the polyglot for POC or gtfo 13 so you can download the PDF freely available and if you open it in a postscript viewer like ghost view or if you use anything else that

interprets it as PostScript it has completely different content which I'll get to in a little bit so how did we do this there are a couple challenges for this specific case that most polyglots don't have to deal with one is that the PDF format is actually a subscript of it is a subset of the PostScript language Adobe created both of them so they just reuse the PostScript syntax for within Pete for the most part within PDF and the second challenge is that these days pretty much all modern PostScript readers can also parse PDF so we need to have some way of tricking their logic into not interpreting it as PDF and instead interpreting it as PostScript so

the way we achieve this might actually be simpler than you'd expect the trick is to have a multi-line PostScript string which is delimited by parentheses so PostScript will add that to its stack PostScript is a stack based language and everything inside will essentially be ignored by PostScript that allows us to encapsulate the PDF header and create a PDF object which is that binary blob that PDF essentially ignores that in which we can put the rest of our PostScript code we did have to have a little bit of extra code before the parentheses because Adobe being Adobe will randomly blacklist certain strings if they occur in the beginning of a file so it didn't like the fact that the file

started with an opening parenthesis so we just encapsulated that in a postscript function which is what that slash PDF header thing is just a random function name and then Adobe Acrobat loaded it just fine so then we have our PostScript content it it PostScript is interpreted it's a stack based programming language so how do we tell PostScript to stop interpreting well we put the special percent % EO f directive and then a stop and then PostScript will stop and then we can have the remainder of our PDF content so the last challenge is how do we trick ghost view which is a part of ghost script which is pretty much the ubiquitous goes there PostScript parsing library and any open

source tool these days chances are if it can process PostScript it's based off of this code since it's open source we can look at the code and it can actually support a number of different file types we saw in addition to PDF and PostScript encapsulated PostScript pjl and so on and so forth so it has this huge function with complex logic to try and figure out what a file type is based off of some hand jammed what amounts to regex and we can see in the bottom here this is where it's the limited determining if it's going to be PostScript or PDF so we can determine as long as this string percent bang PS adobe comes before % PDF 1.5 then go

script will interpret it as being PostScript so we do that and then we're set so if you download package gtfo 13 open it in ghost view this is what you'll see you'll see the remainder of the article that describes how we did this this is joint working through myself and to leave to it also we wanted to give a demonstration of the fact that unlike PDF which is it has javascript in it but it's pretty much static script is a fully-fledged turing-complete programming language so the second page will render a amaze that is randomly generated so every single time you view this document or send it to a printer your printer the little microprocessor in it will

randomly generate a new page that they're amazed that it prints out and then for good measure just to really demonstrate the power of PostScript and why you always want to open PostScript in the VM and the last page prints out your Etsy password file okay so on to the next polyglot this polyglot is an HTTP Quine so it is a PDF that is a ruby script that when you run the script it starts a webserver and if you navigate to that web server it's a web page that mimics the title page of the PDF it also automatically parses the feeley's from the zip because remember the PDF is still a zip file or we've made it so so it unzips itself in memory

and you can download each of the files inside of the zip and moreover if you click to download the PDF it will download a copy of itself so it's a nice way of propagating so in addition to that because we could if you rename the file to be an HTML file but change nothing else just rename it and then you open it in a web browser then it will also render itself now you won't have the this is a static HTML so you know it's not able to parse it zip or anything like that but it does render too so this is a PDF that is also a zip that is also a ruby script that is an

HTTP coin

so how can we do this once again I think this is going to be simpler than you might expect this is a ruby script that just opens a raw socket and listens on port 8080 and just returns an HTTP a manually generated HTTP header and a copy of itself so the part where it does file open file its opening itself reading itself in a memory and then every time anybody connects to its socket listening on port 8080 it just returns a copy of itself now the trick here is that at the very end you see the underscore underscore end I believe Ruby inherited that from Perl if so that's perhaps one of the few good things that

has come from Perl I hope I don't offend anybody but that tells Ruby don't parse anything after here you can ignore everything after here not all languages have this like Python for example everything has to is parsed everything has to be parsed into an AST but with Ruby you don't you can avoid that so since we know that PDF will ignore pretty much everything before the PDF header you can prepend this to any PDF and you will turn that PDF into a web server that will serve copies of itself but why stuff there so before in between that end and the PDF header you can put whatever you want you can put HTML and then in the logic in the web server you

can parse the content between the end and the PDF then it has HTML and it can say oh well did the web browser is it requesting a URL ending in PDF if so let's give it a copy of myself if not let's serve this this HTML instead so here's a breakdown of the four different file types that are encoded within here we encapsulate the PDF header in a multi-line Ruby comment which is delimited by the begin and end then we have everything that we saw on the last slide one so then we have the HTML the trick with the HTML is most web browsers will either ignore everything that occurs before the opening HTML block

some web browsers will add that to the Dom but then all you need to do is add a little bit of JavaScript code to clean up whatever junk gets put in the Dom if there's anything and then the user will never see it and then for the remainder of the PDF comment their content we just encapsulate that in a huge HTML comment and hope that - - greater then doesn't occur anywhere within the PDF but fortunately it didn't for us so this was the my favorite polyglot so far that I've worked on this is in POC or gtfo 14 I believe it is a PDF that is also a valid Nintendo Entertainment System ROM that you can emulate and run and it's

actually been run on real Hardware - I'll get to that and it's also an md5 line I'll explain a little bit what that means - but first we need to talk about the nes architecture so Nintendo had a pretty clever architecture in the actual system they have three processing units the picture processing unit the CPU and an audio processing unit and in each cartridge there were two or three different chips there are two sections of RAM and or ROM one contained the program code one contained to the chr RAM or ROM was for things like sprites and graphics and things like that now being this you know ancient by today's standards you know 8-bit system whatever it had

limits to how much memory it could address at any point in time which limited the size of things like the PRG rom so like how do you if you wanted to make a bigger and bigger game how do you do that well the way they solve that problem was really clever they added this chip optional chip called a mapper and back in you know 1985 or 83 or whenever they were developing the NES Nintendo had actually developed paging so you can for cartridges that have a map or you can write to a specific address in memory and it will flip out pages of memory in different banks on the PRG ROM so you can you actually have

some cartridges that were like megabytes in size when really only you could address like 16 K or something like that at any time and if any of you are old enough to have grown up with an NES you may remember a game genie that's actually how game genies work is they have another mapper chip inside of them so they intercept all of the memory access commands and so on and so forth between the cartridge and it perhaps if there's a real mapper in there or the PRG ROM and they can fiddle with the bits and bytes and this is also what makes emulating NES games so challenge of creating an emulator so challenging you might wonder why like oh there's

this emulator that's really popular and it works with all these games then there are a few games that doesn't work with well that's because they had these custom one-off mapper chips that were that were only created for that one cartridge for that one game that sometimes have complex game logic encoded into them and things like that and nobody has taken the time yet to reverse-engineer it or to implement it in their emulator because they're gonna get all the really popular games working first so since there isn't one contiguous region of memory in an NES cartridge you need to have some file format if your encoding or wrong - you know delineate between what's CHR what's PRG and so and

so forth pretty much the ubiquitous file format these days is I NES any mom you find these days 99% it's I NES so it has a some magic bytes it stores the sizes of the CHR and the PRG ROM has some flags which determine whether or not there's a mapper and then it it has an optional trainer section which I'll get to then the actually encodes the the memory what the trainer was was back in the day before emulators had implemented all of these different mapper chips you could provide custom patch code in the trainer to patch the PRG ROM and have it be more emulate able without actually having support for the mapper on it these days

most modern roms don't need this don't use it but the file format still allows you to have it there so that is the perfect place 16 bytes in for us to store our PDF header and to create a PDF object that will encapsulate the rest of the run so that's exactly what we do we have the ints header we fiddle the flags in order to say hey there's a trainer which gives us you know 512 bytes to work with 512 bytes is more than enough to put our PDF header we create a PDF object stream to encapsulate the remainder of the NES ROM then emulators don't care once they read to the end of the NES ROM they don't care about any

trailing bytes whatever they ignore that so then we can end the stream and the object have the remainder of our PDF so we created a custom ROM that mimics the title page or the table of contents of the issue you'll also note at the bottom here that it prints the md5 sum of the PDF so this is not the md5 sum of the nes rom this is the md5 sum of the entire file PDF zip in nes rom all included how does it do that so you might think oh you know if i wanted to write an md5 Quine like this well I could just read the file in to memory the way our Ruby script read the

whole file into memory and then just implement md5 of the md5 algorithm on that and do that calculation and print it to the screen yes that'll work but that that's cheating and that wouldn't work in this case because this nes rom doesn't actually have access to any of the bytes within the remainder of the file so the way we did this was by calculating 128 md5 collisions each one of which allows us to encode one bit without changing the md5 of the entire file so then after the fact we can twiddle those bits to encode what what the md5 actually was the only issue with that is that encoding at least if you do it naively requires around 32 K to give

you an idea but it just to store that md5 hash collision data 32k that was enough to store donkey kong and duck hunt on the same leg cartridge so we didn't really have that much room to spare we're able to get that down enough with enough room despair to encode the rest of the rom if you're interested in that that's a little bit outside the scope of this talk the paper is available in the issue in pockets ut'fo 14 also Albertini recently gave a talk I forget where but the video is available online on doing these types of hash coins and hash collisions so I urge you to go look at that if you're interested

so cyber campbell's as though the audio is gonna work here it's going to come out of my computer okay this is the audio coming from me [Music] this this is the audio from the ROM so cyber shambles got this to work on his NES mini classic and he from what I understand it's the entire PDF just on there and it works on real Nintendo hardware okay Sam I know what this is or where this is getting a little bit off-topic but I swear it's relevant so this site wasn't use by the US government from 1949 until 2014 may be hard to tell the scale from the picture but this building is enormous so it's over 3 million square feet and just to

give you some context that's about twice the size of the Tesla Giga factory this building has roads with stoplights inside and the roof is so huge that when they've a dedicated team of roofers and when they finish replacing the roof they have to immediately start again from the beginning because it takes so long all those roofers also need to have Q clearances which is the highest clearance issued by the Department of Energy which is happened

don't know what happened there so this is a Department of Energy site this is the Kansas City plant in Kansas City Missouri this is the site of the National Nuclear Security Administration and this is where the US government maintained and built our nuclear arsenal during those 65 years so in the early 2000s I had the opportunity to visit here and the reason why I was going there was they have this problem where you know 65 years they have these devices they need to maintain how do you know like how do you know most of the people who created these are at best retired if not dead and they need to be continuously maintained over time so

this is problem of engineering archival and passing information on to the next engineer so this is work that I did with dr. bill reg Lee who was a professor at Drexel here at the time now he's just finishing up his tenure at DARPA Hopey at UMD next and this isn't just limited to physical systems so you may have tried to open a word 97 document in a more modern version of word if you can even get that to work chances are that the formatting is going to be wrong or something's wrong and god forbid you use some other program like Lotus Word Pro how are you going to open that today how do you preserve the stuff and you know

this is just 15 or 20 years old how are you going to preserve the stuff 40 50 60 years down the road so that got me thinking what if you could create a PDF that also contains its own source code like latex source or whatever well what's the way that most people these days store source code it's a get repo have any of you heard of the git bundle command I hadn't until I looked into this the purpose of the git bundle command is to put an entire repository into a single file with the intention of being able to sneakernet it across air gaps so you can take an entire repository or a subset of

repository run git bundle on it and it'll spit out a file and you can take that somewhere else and you do get clone on that file and get everything out back out so what is the file format for a get bundle that's a good question we can just open it up and your favorite editor or hex editor and the first part is you know pretty intelligible it has you know a signature and then it has a digest and then it has just a git pack file okay so what in the world does a get pack file well we can since get is open source we can go to their git repo and fortunately enough they have a file called pack

format dot txt so yeah this is great now let's just read this well it turns out that it's not it has a bunch of gaps and it's not really sufficient to you know parse itself and it has a bunch of observations in here which suggests at least to me that it wasn't even written by the person who created the format it was written by some other poor programmer later on who was going through all this fetid spaghetti code and trying to figure out how everything worked so we had to do some reverse engineering so it turns out that the pack format has four magic bytes a version number and then an integer representing how many objects are in it

each object either being like a file in the git repo or what's called a delta which is like a diff essentially so if you do an incremental commit you modify a file that will probably be stored as a delta they won't store an entire new copy of the file so then it has one data chunk for each object and then at the end of the pack file you have a 20-byte sha-1 that hashes all the previous content in the pack each data chunk is it starts with the encoded uncompressed length of the data in that chunk followed by z lib compressed data they created their own encoding for arbitrary precision integers for whatever reason for that

uncompressed length so that's the format for it there's also a different arbitrary precision integer length format that is used also in deltas and stuff and I don't know why they just didn't reuse this one but whatever so git was implemented with almost religious adherence to the UNIX philosophy of having these atomic commands and instructions that then you can glue and pipe together to do more complex things so you might be familiar with the fact that git pull is really just equivalent to the lower-level commands called git fetch and get merge now git has a number of commands that it calls plumbing which aren't really intended to be seen or used regularly by by regular users but get itself uses

them so the get bundle command actually delegates to the git pack objects command when it's trying to write that form that part of the file to do get PAC objects so if you look in bundle dot C where it it's not just making a library call it's actually making a system call and like setting up like Arg V and everything like that so if we look at that line in the file all we need to do is add a single line to tell get pack objects to not compress anything so it'll still run Z Lib and deflate but nothing will be compressed they'll all be plain text so we can compile a custom version of git and I have one on my

github page that I've patched along with some other things to make this little bit easier we run our patched version of get to create a git repo and we add a PDF and suddenly that PDF when we create a bundle it'll be uncompressed it'll be plain text since PDF doesn't care about anything that happens before its header and for the most part anything that happens after it suddenly that get bundle is also a PDF and if you clone that get bundle it contains the PDF too now there is one issue with this the maximum size so that Z lib compressed data even though we told it not to be compressed it's still encoded using the

deflate encoding and the maximum size of the deflate block is 65 K so if your PDF is larger than that what are you gonna do so this is one of the reasons why this polyglot did not actually make it into the pocket GTFO because it would have just been too hard to to get a large large file like property CFO to do this so the issue here is the deflate will insert five byte deflate headers wherever it thinks it needs it so if your PDF is too long then in the center it will like rip and insert five byte header which will corrupt your PDF so we need to figure out a way to make room

for those five bytes in the place that we want so we can add a comment inside of the PDF which starts with into the percent sign we can do that between objects and then we can after the fact move those five bytes into the comment and as long as it doesn't have a newline in there the PDF interpreter is just going to be looking for that new line to just limit the end and then we're set and then you have to update that sha-1 hash at the end now this will break xrefs in the PDF as a little bit of a technical thing but their offsets at the end of the PDF that allow you the viewer

to open it really efficiently but fortunately most viewers don't really care if they're broken it just means it it may have loaded in an extra five seconds on a machine in 1995 but on today's machines you won't even notice and it won't complain so you have a an article that's also a polyglot git repo that you can clone and then get a copy of themselves so in conclusion I hope you've learned that or I hope I've convinced you that files have no intrinsic meaning and that polyglots aren't just a nifty parlor trick right they can do useful things and that maybe you can make them too they're not really that difficult to do always open

your postscript into the VM and PDF is broken now as homework if you know I've kind of tickled your fancy I heard you to check out our most recent issue pocket GTFO 16 it is a polyglot that is a Python web server not Ruby using a slightly different technique if you run that web server it opens the kite I struct web IDE which is an interactive hex viewer and we put in a number of like challenges in there so you can use that to reverse engineer itself and kind of teach yourself how to do these things so I'd like to give thanks to all of the editors and contributors apakah GTFO everybody who's helped me

and the team with different polyglots that we've done and thank you very much

all right before we conclude any questions I see what appears can you post a PDF of your slides yes however the animations will probably mess up stuff I'll probably put it on SlideShare or something like that yep great any other questions yeah and once going twice all right everybody's eyes were glazed over it's the end of the day I actually had no idea you were a contributor to POC gtfo that's pretty awesome I like I said I raised my hand I you in the book did you guys don't have the book or aren't aware of the publication definitely what's that oh you do have one too okay yeah highly the book is beautiful by the way

I mean it's beautiful there's a lot of really good very relevant information in there so I definitely recommend picking that up

File Polyglottery or This Proof of Concept is Also a Picture of Cats

Related talks