
[Music]
All right, all right. So we're going to get ready and start introducing our next speakers here. Just a show of hands: is everyone enjoying BSides Tampa so far? All right, let's get some good energy here for our next speakers, who are coming to discuss with us finding secrets in PDF files. So firstly we have Mr. Kenny Brown. Kenneth Brown is a staff technical account manager supporting federal customers at VMware USA, formerly holding a senior consultant and program manager role working with DoD customers. And then we also have his counterpart, Nikita. Nikita is a researcher focusing on privacy issues revolving around data archival, source anonymization, and counter-forensics. Nikita focuses on security training, digital document and
video forensics, and OSINT. Without further ado, I'm going to pass it on to the gentlemen here. Thank you so much.

Alrighty, folks. So I know you're all probably tired of looking at slide decks for the last five-ish hours, so what we decided to do today is instead do a series of live demos, actually jumping right in and going through them. I think we have three or four PDFs to look through. But keep in mind that this is a live demo, and things don't always go according to plan, even the best plan, so please bear with us. Let's go ahead and jump in. So what we'll be talking about today is how to find
various hidden stuff that PDF producers usually don't want you to find, or don't even expect you to be able to find. So right here we'll start with a little sample file we made, and then we'll move on to some real-world examples. Here we have our title slide. So we have a PDF. What is the first thing that you do when you see a redaction? Does anyone know? This doesn't require any kind of basic editing or any complex features. I see a hand. Exactly: try to highlight it and select the text. So here we go. We can go ahead and copy the text. We don't see anything here, so let's go ahead and paste it into a
text editor.
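The copy-and-paste check can also be scripted. The following is a minimal sketch, not the method used in the demo: it assumes a simple PDF whose text is stored as literal strings shown with the `Tj` operator, and the function name `extract_text_operands` is made up for illustration. Real-world files often use hex strings, `TJ` arrays, custom font encodings, or object streams, so a full PDF parser would be needed for anything beyond a quick look.

```python
import re
import zlib


def extract_text_operands(pdf_bytes: bytes) -> list[str]:
    """Return literal strings shown by Tj operators in a PDF's streams.

    Hypothetical helper for illustration; it only handles the simplest case.
    """
    found = []
    # Stream data sits between the 'stream' and 'endstream' keywords.
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        raw = match.group(1)
        try:
            # Most streams are Flate-compressed; decompressobj ignores the
            # end-of-line bytes that trail the compressed data.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # assumption: the stream was stored uncompressed
        # A literal string "(...)" directly followed by the Tj operator.
        for text in re.findall(rb"\(((?:\\.|[^\\()])*)\)\s*Tj", data):
            found.append(text.decode("latin-1"))
    return found


if __name__ == "__main__":
    sample = b"1 0 obj\n<< >>\nstream\nBT (Secrets) Tj ET\nendstream\nendobj\n"
    print(extract_text_operands(sample))
```

If a black bar was simply drawn on top of the text, the string still sits in the content stream and a scan like this will list it, which is the same thing the paste into a text editor reveals.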
We'll get there. Alrighty, and see, right when we paste, we see the text right away. Of course, other, more convoluted PDFs will not have this kind of thing enabled. Usually, with a more sophisticated redaction, what will happen if you try to copy and paste is that the text will not paste, either because the document doesn't have the permissions to do so or because there is no text underneath. The next thing to do in that case, and again, we're using whatever the latest version of Acrobat is, so things will be a little different, and we'll also explain in a little bit why
we're using Acrobat and not a slew of other available PDF editing tools, but in Acrobat we can go ahead and analyze the document's content structure. If you don't see this little icon on the right, make sure that you enable showing content, because by default Acrobat hides that from you; they obviously don't really expect you to go and start analyzing the document. What the content feature does is break down all the individual image blocks and text blocks in a PDF. This is kind of the preliminary area: if you do something basic like copy or try to highlight the text and
you don't find anything, the next thing to do is to start digging into the PDF structure. And so here we see, for example, that our text block is unredacted in the PDF structure, so the word "Secrets" is immediately apparent. We also see that there are a number of images in the PDF, and that's interesting, because at first glance we would think that there's only one image. But actually, the way that this poorly redacted document was done is that the black redaction bars were treated as images. This will be significant in just a second. We can highlight an image and see what is what, and after we pick the image, so it
looks like the central one is the one we want to look at, we could go ahead and edit the image. But before we do that, make sure that in Acrobat you pop over to Preferences, into the Content Editing panel, and select an image editor. It doesn't need to be Photoshop; it could be any number of editors, just as long as you have an editor set up. Let's go ahead and see what happens if we edit the image. Oh, the unredacted original conveniently pops up. So that's one way of approaching this kind of thing: open up the content structure. The other way of approaching
it is if we look at the PDF tools. PDFs traditionally are not meant to be edited, and we can use that to our advantage, because those who produce PDFs often don't expect us to be able to edit and manipulate the data inside the PDF. But in fact we do have rudimentary, but still practical for our purposes, editing functions in a PDF. So once we go to the editing tab, we can select the different elements and just start moving them around as we want. It's another very easy, visual way to instantly see past the redactions that have been done. That being said, this one that we made is intentionally kind of trivially poorly
done. More real-world examples use slightly more sophisticated methods, which require running the gamut and trying this whole round.

So let's look at this contract that came out a couple of years ago between the European Union and the big pharmaceutical company AstraZeneca. This went out on the official EU site; we just downloaded it here for you in case there are issues with the internet. This is a live contract that went out, I think, at the end of 2020. As we're scrolling through it, we see that there are a few redactions here and there, and then we get to this page, where everything is blanked out. So let's try running through the bag of
tricks that we just talked about. We'll click the content panel and start analyzing the PDF by page. We see that this is page number four, so let's pop over there. Here we see that there are no image fields, which means that the redactions have not been done using images, which is good for the redactors and bad for us. And we also see that the text fields don't actually have anything beyond the plain text. For instance, we see section 1.15, we see the text field there, and there's nothing else; everything is redacted. Okay, so looking at the content didn't help us in any way. We tried to find images that we could move
out of the way, we tried to find out if the text fields were actually still there, and they're not. Obviously, if we also try to copy and paste the content, nothing will show up; see, we can't even select it. So what do we do next? We could again try going to Edit and see if we can edit the actual bars. We can go ahead and try to resize one a little bit, rotate it out of the way, and see if there's anything underneath. There is not. So for all intents and purposes, we think that this is a well-redacted document, based on what we've just done. Does anyone have any other ideas for how we can start
poking into
this? That's true, we could pop it open in a hex editor and dive even deeper into it. But really, Adobe's content feature is more or less a glorified, prepackaged GUI hex editor, so when you're messing with the content feature there's really no need to use third-party hex editors. Or we could use the Poppler library to attempt to decompile the PDF into its bare elements. We won't be showing Poppler here, but it's a basic command-line utility which can decompile the PDF and produce all the independent images, text fields, etc., in a nice little output. So there are other tools we could
use, but I can tell you they will not help us with this right now. Any other guesses? So the fundamental lesson that we wanted to show here is that we started off doing an advanced analysis, going into the content and analyzing all the different fields of a PDF, but really there's a very basic feature that was overlooked here, overlooked by the people making the PDF, and based on the different PDF
Alrighty, nice. So just to repeat what I was saying: a complex PDF such as this has a convoluted structure. We have section one, headings, subheadings. PDF printers, when they produce PDFs, try to make this easier to browse by creating an automated bookmark output. Traditionally, the way that a PDF printer is supposed to work when it produces bookmarks is that it copies the headings and the subheadings to make a basic tree structure. So we should be able to go and see "Definitions" and 1.1 in here. But here we start noticing something interesting: this PDF printer didn't just produce the headings and the subheadings, it actually confused the whole subheading lines for subheading titles. So
instead of just "1.1 Accounting Standards" we see the entire sentence. That's interesting. Let's go back down to 1.15, "Cost of Goods". What do we have here? Once again the PDF software assumed that the subheading consisted of the entire phrase in that line, and, most interestingly, it then proceeded to repeat that same assumption, because the text is formatted exactly the same, for the remainder of the entire page. So actually all of the redacted page that was posted on the EU site is perfectly accessible, simply in the bookmarks panel. We don't even have to go digging into the contents of the file, we don't have to extract it in
hex, we don't have to use Poppler, we don't have to extract images, we don't have to do anything. Literally all you had to do was look at the bookmarks. Now, this is sometimes less obvious, because when you view a PDF, if the PDF is set to "page only" rather than "bookmarks panel and page", the bookmarks tab won't be open automatically, so you may forget to open it. But that's something to always go and check: maybe the redacted data was actually left over there. At the end of this we'll show you guys how to do proper redactions and proper document sanitization, but really what seems to have happened here is that whoever made the
PDF went ahead and ran redaction software, maybe even Adobe's own redaction, but then they didn't run the sanitization feature, which is a whole separate section in Acrobat. So the lesson that we'll show at the end of this is that when you're redacting a PDF, after you run the redaction tool you also really need to run the document sanitization tool, because there's all this ancillary or complementary data whose cleanup is tantamount to protecting your redactions. If you don't take care of that, it will show up, and that's exactly what happened
here.

Alrighty, now let's move on to another example. This is again a real-world example, a court document. It is the deposition of Ghislaine Maxwell in one of the Epstein cases from a couple of years ago. This is a huge document, so we made a little note of where the pages are. Let's start with page 135. So this was a deposition, and again it was posted on PACER, which, as I'm sure some of you are familiar, is a website where you can get court documents. So again, this is a publicly posted, official document, and here we see that there are a number of redactions
here. So let's go ahead and try our usual bag of tricks and see what happens. Again, we could go ahead and just jump to editing the PDF. We see that we can select the little element here, so let's go ahead and shift it out of the way, and you see there's nothing there once again. I'm not going to waste time going through the content pane again; I'll just tell you that once again the redactions have been done there as well. What about bookmarks? Can we get any luck with the bookmarks there? We can't copy and paste, and the bookmarks are all clean. They got us, right? They did everything that they were supposed to do:
they properly both redacted and sanitized the document. But now let's hop over to the index, 435, I think; go up a little bit. So here in the PDF, again, they used some other software to automatically generate an index, because when you have a 400-odd-page document it's good to have an index in there somewhere. And here we see that there's an entry for a word just below "clients" that seems to have been redacted, and it appears on all these pages. Can anyone think of a word that starts with "cl" that could be a person, place, or thing that would be implicated in the Epstein documents? Any ideas? Clinton. Okay, that's a good theory.
I wonder if we could test it. The index conveniently shows us that the term appears on page 134, which is where we just were. It also subsequently appears on the next page, 135, which will actually be 136 in the PDF because it's offset by one. And sure enough, here we see on page 135, line 7 and line 11, the word "Clinton" appears. So let's go up a page, to where our redacted text was. According to the index, the term also appears on page 134, lines 15 and 16: "Was [blank] ever at any of Jeffrey Epstein's homes?" And as we know from the page we just looked at, that same term appears on the
unredacted line as on the redacted line, so we can deduce that the redaction there actually says "President Clinton". Reporters have actually gone through this document and been able to find very many of the proper names that were redacted, simply because the indexing software that was used proceeded to index everything, but then the redaction was run afterwards, and the redaction was done manually. So someone went through this document, looked for "Clinton", redacted it, and then missed some pages, or left it on some pages where that information wasn't sensitive, forgetting that the index contained all instances of the term. So again, nothing technical was required here. All we had to do was analyze the
index to be able to find the redaction. So even though the redactions were done perfectly on a technical level, they didn't have any bookmarks, they didn't have any underlying text, they didn't use images as redaction bars, they completely sanitized the document, they simply forgot to check the index marginalia, which revealed the name through a process of elimination. So again, the purpose of these examples is to show you that sometimes this work is not technical at all. You could go through the various technical minutiae and still end up with a document that has sensitive information in it. Kenny is now going to talk about the next example, which doesn't have to do with redactions
at all, because there's really no way to redact the other data, but we'll talk about how to open it up later.

Sure. So how many of you guys attended last year and went to our session on whistleblowers? Anybody, by chance? A couple of you. Apologies, it's going to be a slight repeat of a few things in here, but it's worthwhile to bring up. So with PDFs, as Nikita was saying, you can redact things and take them out, but there are also other ways that you can find identifying information in PDF documents. I'm not going to open up this document online, since it doesn't really work that well on the internet here, but we browsed to it earlier. It's an NSA
document that was released a while back, and what's interesting about it is that if you take a look at it, everything looks to be redacted; it looks fine. But there's also some more specific information that you can glean from this, outside of metadata. Does anybody have any ideas what it might be? We won't go through the hassle of running the gamut on the redactions; they're done well. But there are other tells in this PDF. The first major tell is that this is not what we call a true PDF, which has vector images and font text. This is an image, right? This is a scanned image. Exactly, and that in
itself is a big tell when you're doing this kind of document analysis. You can tell things are a little bit crooked; it's a scanned image from the scanner. It'll be a little more evident in a moment: we'll zoom in real quick and come to the upper corner. Does anybody see anything in the upper corner? Maybe, maybe not. See the little discoloration from the scanned page? Hold on, let me take a quick screenshot of just this upper corner. How the hell do you take a screenshot on this thing? There we go. Okay, quick screenshot, and then I will open it up in Photoshop. Any photo editor works; I happen to have
Photoshop here. So we'll get it in here and make it a little bit bigger for everybody. To make it easier, let's try a few adjustments on it. We'll go over to the Adjustments menu. Some of you may see it already, but there are ways we can make this even more visible. So we'll invert it. Anybody see anything now? Probably not; it really gets hard on these projector screens. Another thing that you can do is mess with the contrast and brightness, so we're going to turn the brightness way up and then the contrast way down. Can anybody see the little dots? It's really hard; they look a little blue, again, the joys of a
projector light, but you can see all the little dots, the little yellow dots that are coming up. These are called microdots, and they are generated when the document is printed out. Each individual printer will print them, though it sometimes depends on the printer model, but they will be on the document, and they're not really something that you can easily see when you bring it up.
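The idea behind those image adjustments can be sketched in a few lines. This is not the Photoshop workflow from the demo but a swapped-in, simpler approach: it assumes the scan has already been loaded as rows of `(r, g, b)` tuples, and it relies on the fact that a pale yellow dot on white paper keeps red and green high while blue drops. The function name `enhance_yellow_dots` and the threshold of 24 are assumptions chosen for illustration.

```python
def enhance_yellow_dots(pixels, threshold=24):
    """Return a black-and-white mask marking faintly yellow pixels.

    pixels: list of rows, each row a list of (r, g, b) tuples in 0..255.
    threshold: hypothetical cut-off for how much blue must drop below
    the red/green average before a pixel counts as "yellow".
    """
    mask = []
    for row in pixels:
        mask_row = []
        for r, g, b in row:
            # White, grey and black pixels score near 0; yellow scores high.
            yellowness = (r + g) // 2 - b
            mask_row.append(255 if yellowness >= threshold else 0)
        mask.append(mask_row)
    return mask


if __name__ == "__main__":
    page = [
        [(255, 255, 255), (255, 255, 200)],  # white paper, faint yellow dot
        [(250, 250, 248), (20, 20, 20)],     # off-white paper, black text
    ]
    for mask_row in enhance_yellow_dots(page):
        print(mask_row)
```

A mask like this drops the page text and paper tone and leaves only the dot pattern, which is easier to read than an inverted screenshot on a projector.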
And so that little pattern of dots that we saw actually forms a unique serial number that can identify when the document was printed, by which username it was printed, and the precise time that it was printed at. And if we use the generator on the website, you can actually plug the sequence of dots in and get a readout of exactly when it was done. A little zoom. And so again, go back to the original PDF. Yep. So again, borderline imperceptible unless you know what to look for. So when we talk about secrets in PDF files, sometimes it's not actually germane to the structure of the PDF; it actually has to do with a
document itself that was imported into the PDF without thinking about anything that could happen. So then how do we prevent microdots from showing up? There are random GitHub scripts that you could find which can help you normalize the contrast in the document, remove all that background shadowing, and so on. They're tentatively effective, but really the best way to do this kind of work, if you're reproducing a sensitive document, is to recreate the document from scratch. That's one of the underlying themes, if you haven't noticed, running through the undercurrents of all the examples: PDF structure is insanely
convoluted. There are so many ways that you could trip up and accidentally expose metadata. So really the best way to reproduce something like this is to just retype it completely from scratch. It obviously requires a lot of work, because you're not just OCRing and copying and pasting text; we literally mean retyping from scratch. But that work is worth it if you want to make sure that you strip out any secrets in a PDF file. Back on the microdots: if you go to this website and take the time to go through, you can go and put all the little dots in as you see them in your
document. So if you take the time to go through and put all these little dots in here, matching the document, you can get information about the printer, such as serial number and model, all this good stuff, plus the date and time when it was printed. So with enough time and effort you can find out a lot of information about the different files, and we can send this link out afterwards too. There's an old-school NSA or CIA redaction manual from the '70s, before PDFs and everything, that talks about the best way of redacting, which is still very much true today: cut the piece of paper. Literally cut out
what you're trying to prevent people from seeing. Because before PDFs, the way that we would undo redactions, instead of messing with files, is you would hold the page up to the light, and if someone had just redacted something with a Sharpie on one side, you would still be able to read it through the other side of the paper. FOIA agents, the agents who work on releasing redacted Freedom of Information Act material, typically used to use white-out. Well, what happens to white-out when it dries? It can easily be scraped off, and that's how people used to recover old records before all this. And really the same kind of analog
approach, making sure that the data is not in the source, is the best way of making sure that nobody can undo digital redactions. What this means is that you should not wait to redact a file until it is already in PDF form. If you're redacting contents when you have already converted them to PDF, you've already failed. The redaction has to be done in the source content. So let's say you're working in a Word doc. What you want to do is replace the actual word in the Word doc with a set of X's to maintain the proper spacing, or you could actually pad it with additional X's to make it
seem like the word you're redacting is much bigger than it actually is. Let's say you're redacting a four-character name and you want it to appear like the name is actually much longer. Go back to your source document, fill it up with 10 X's, and then redact that field, so that now it looks like there's no way that it just said "Mike" or something like that; no, this was a big name. That's what you have to do in the original source document: make sure that the material is not in the actual document before you convert it to a PDF. That being said, lots of times folks do
convert documents to PDF and then need to redact them, so now we'll walk you guys through how to redact an actual document. We'll skip to that after. Let's go back to our now-mangled example here, which is the contract. If we go back to Tools and go ahead to Edit a PDF, here at the bottom we see the option to redact a document. Remember, earlier I mentioned that redaction and sanitization are two different functions in Acrobat. This document has already been redacted, but we can go ahead and just add another redaction over here, why not, and hit Apply. And here we have the second feature, "Sanitize and remove hidden
information". The reason we want to do that is that when you choose to sanitize and remove hidden information, you can remove all this other data that's available, and what's the first field here? Bookmarks. So what happened, in other words, is that whoever was redacting this document ran the redaction function but then did not go and subsequently sanitize the document, which we can go ahead and
do. And this happens too, because, again, live demo, and we just manipulated the whole thing to be out of
bounds. Did it save? It didn't ask you to. It didn't ask
to. In theory, once you've redacted, it should ask you to save it as a new file somewhere, and then it'll save the new redacted file in a different directory, or wherever you choose, keeping the original one. Yep. So this one had an error with flattening; that's because we messed around with the PDF. But really what would happen is that the bookmarks would be gone, and in fact they already seem to be gone. Let's double-check. Yep, the bookmarks are no longer in the document. So that's just something to keep in mind: you have to both redact and sanitize. And then again, we've been using Acrobat, and people are like, "But, you know, I use my favorite PDF editor XYZ to do this."
Maybe show them why that might not be a good idea. Sure. So there have been a lot of studies done. Y'all can hear me on this, right? When you do redacting with these other programs, there's data that could be leaked. Here's a study that was done; scroll down. It talks about the different PDF editors out there. I can't remember where it was. Here it is. I think there are over a dozen, and I'll make this a little bit bigger so everybody can see it: over a dozen different PDF editors, and, for each, the information that's actually leaked and available when you do a redaction. So just because it's, you
know, an editor of your choice doesn't mean it's necessarily the best one. We're using Acrobat; in fact, I got the free trial of Pro just to show you guys this. I don't particularly feel like paying for it, and it's on my company laptop. But Acrobat is the most comprehensive, the easiest one to use, and does the best job of ripping everything out and not sharing data such as what's mentioned over here. Exactly. So that's the thing: you could use any number of PDF editors to manipulate the data, but sometimes the third-party ones are a little unreliable. That being said, Acrobat itself can be unreliable, because how does Acrobat identify which fields
to remove? In PDF syntax the text fields are clearly delineated, the images are, the bookmarks are, and so on. The problem is that lots of PDF printers don't follow the PDF guidelines to a T. That means that if something produced by the PDF printer that you use to create the document is not clearly marked, for example as an author field, it will not get stripped when Acrobat starts looking for an author field. And so there have been cases in the past where people ran the sanitization tool and the redaction tool, but because the text was not clearly formatted, that text survived the manipulation. And that's again why we say that you should not rely on
software-based solutions to run the redaction once the document is a PDF. This should be done beforehand, to make sure that when you open up the PDF there's nothing there to remove in the first place. And then, in tandem with not using random third-party tools which may or may not work: everything nowadays is cloud-based. There are tons of editors online where you can upload a PDF; the service says it strips the metadata for you, says it checks the redactions for you, and then gives you a clean output. And let's say that the output is done properly and it is clean. The issue, of course, is that the cloud provider now has your unredacted PDF.
They may say whatever they want, "we promise we won't keep the file", "we promise that we delete the file within seven days", but really, whenever you use any online tool, just assume that your file now belongs to someone else; you're no longer the owner of that file. So that's another reason why we're using Acrobat and not some random standalone PDF-redactor-dot-com: there's no telling whether online services will actually keep the unredacted version or not. And that's more or less it. If folks have questions... Thank you, guys.

I think we'll start in the very back; someone had a question. The legality of doing this? So that's actually an interesting point,
because there's the legality of working with the redactions. Like I mentioned, we're working with all public documents. These were on PACER, the official court website, accessible as court records; the other one was accessible on the EU site. So for us there's no liability. But this brings up an interesting question of whether there's liability for whoever the redactors were. When you're handling, for example, a FOIA request, the government is legally obligated to redact particular kinds of PII, or personally identifiable information. If it does not, and again, just to make it clear, we're not lawyers and we're not advising on anything legal, that does open up the question
of whether, if any PII is revealed in the document, the individuals whose PII is revealed now have an avenue to go after whoever put up the document improperly. So that's where the legal question lies: not with the analysis, but with the ones who failed at their job of protecting people's information.
Yeah, great question. So the question was: if you don't have access to the original Word document but you have access to the PDF, can you go backwards? Can you convert that PDF to a Word document and then back to a PDF? And yeah, that's a great point. You can definitely do that, and that will for the most part strip out anything, because when you convert it back to Word, bookmarks don't get preserved and the original content structure doesn't get preserved; Word in a sense is flattening the document back out. So you could definitely do that, redact, and then convert back to PDF. There may just be practical issues, like the formatting
won't quite match up. In a complex document like this, when you have a table and you convert it back to Word, there might be issues. But in theory, yeah, that's a great approach: if you don't have the original, recreate the Word document itself. The other issue, though, with for example the microdots document, is whether, when you export it from PDF into Word, Word keeps the images, that background image, or whether Word just keeps the pure text. So that's just something to keep an eye out for: whether the conversion
kept the images or removed the images, because otherwise you would not see it. And this goes back to the broader point we raised earlier: if you want a completely clean document, you really have to recreate it from scratch. You can't start taking shortcuts like copying text or converting it to another format; you literally have to sit down and retype it, because there are other forms of watermarking that we didn't go through, kind of outside the scope of this talk, that would survive reproduction like that. One type is called NLW, or natural language watermarks. This is, for instance, when two versions of a text document have different wording. This version of a
PDF might say "Yesterday I went to the park." You download it, you copy it over, and you think there's nothing there, but actually your coworker may have received a different version of the document that says "I went to the park yesterday." So that word was transposed from the start of the line to the end of the line. If you make enough modifications like that, you can easily build a serial number that uniquely identifies a document and that's preserved by the word order itself. Any amount of manipulation, copying into Word, moving it back to a PDF, etc., will preserve the word order. So you have to be careful about situations like that.
But great question. Yes, someone in front? Yep.
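To make the word-order watermark described above concrete, here is a toy sketch. Only the first sentence pair comes from the talk; the other two pairs, the names `VARIANTS`, `embed_id`, and `recover_id`, and the one-bit-per-sentence scheme are all invented for illustration and are not a description of any real watermarking product.

```python
# Each tuple holds two phrasings of the same sentence: index 0 encodes a
# 0 bit, index 1 encodes a 1 bit. Only the first pair is from the talk.
VARIANTS = [
    ("Yesterday I went to the park.", "I went to the park yesterday."),
    ("We then had lunch.", "Then we had lunch."),
    ("The report is attached below.", "Below, the report is attached."),
]


def embed_id(recipient_id: int) -> str:
    """Build a text whose sentence phrasings spell out recipient_id in binary."""
    sentences = []
    for position, pair in enumerate(VARIANTS):
        bit = (recipient_id >> position) & 1
        sentences.append(pair[bit])
    return " ".join(sentences)


def recover_id(text: str) -> int:
    """Read the identifier back from whichever phrasings appear in text."""
    recipient_id = 0
    for position, (_, variant_one) in enumerate(VARIANTS):
        if variant_one in text:
            recipient_id |= 1 << position
    return recipient_id


if __name__ == "__main__":
    leaked = embed_id(5)
    print(leaked)
    print(recover_id(leaked))
```

Three sentence pairs distinguish eight recipients, and because the identifier lives in the wording itself, copying the text into Word or back into a fresh PDF leaves it intact.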
Yes, but at a cost: the lower the resolution you get, the grainier the text itself, the grainier the images.
The NSA, they have a lot of money. But yes, the point is you can do that, but when you're relying on that kind of damaging of the original source, number one, you're decreasing the quality of the material that you're later working with, and number two, you then have to go back, of course, and double-check whether what you're doing actually removes it or not. But yes, in general, if the robustness factor of a watermark, which is how resilient it is to these kinds of transformation attacks, is not that high, then yeah, you could always reduce something to a lower resolution. The same principle applies to audio files and so on and so
forth, but their survivability is a question that you would have to look at thoroughly. Yes?
Right. So again, in that sequence, after you're done printing you then scan, and you can scan at a low resolution, but again the dots may survive. Really the overarching point of this talk, delivered at a tech conference, is that you cannot rely on tech; you have to do it manually. We all love to have whatever the latest shiny GitHub script is. If you Google "microdots GitHub" there are at least five different scripts that claim to remove these. All a researcher would have to do is see what the threshold is for the scripts and go 0.1 below that threshold, and then the dots
will now survive right so we cannot rely on technological solutions when we're dealing with finding Secrets or removing Secrets it has to be an analog manual tedious process even though that's counterintuitive to so many of us here right who just want to automate this thing that's in the middle
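The threshold argument above can be illustrated with a toy model. This is a minimal sketch, assuming a page is a small grid of yellow-intensity values between 0 and 1 and that a removal script blanks anything at or above a hard-coded cutoff. The cutoff value, the grid, and the function names are hypothetical and do not describe any particular GitHub script.

```python
THRESHOLD = 0.5  # hypothetical cutoff a removal script might hard-code

def remove_dots(page, threshold=THRESHOLD):
    """Blank every pixel whose yellow intensity is at or above the cutoff."""
    return [[0.0 if px >= threshold else px for px in row] for row in page]

def count_dots(page, floor=0.05):
    """Count pixels that still carry yellow above background noise."""
    return sum(px > floor for row in page for px in row)

def make_page(dot_intensity):
    """Build a 4x4 blank page with three dots at fixed, illustrative positions."""
    page = [[0.0] * 4 for _ in range(4)]
    for r, c in [(0, 1), (2, 3), (3, 0)]:
        page[r][c] = dot_intensity
    return page

strong = remove_dots(make_page(0.6))  # dots above the cutoff get removed
faint = remove_dots(make_page(0.4))   # dots 0.1 below the cutoff pass through

print(count_dots(strong))  # 0
print(count_dots(faint))   # 3
```

Whoever embeds the dots only has to land just under whatever cutoff the public script uses, so a script reporting "clean" says nothing about marks it was never tuned to see.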
there we know for a fact that printers do it but that's kind of not an exclusive set right so definitely scanners basically any device that is covered by requirements to have microdots is under suspicion that it would introduce something else in the thing so photocopiers for example traditionally introduced imperfections that weren't kind of intentional like this but they would still let someone know which exact model of photocopier was used and this was done um to prevent currency counterfeiting and so it wouldn't again introduce dots but would introduce imperfections right so you would notice like let's say the line gets a little bit wavy on the top right corner or something like that and that
wave imperfection would be enough to identify not the exact time and date that document was printed but the exact model that document was printed on and so on and so forth so yeah scanners there's no kind of definitive proof like we have for printers that they embed microdots but it's certainly something that when we're doing a threat model right for scanning a sensitive document should definitely be under consideration great question yes uh yeah it could be in the metadata but kind of the point here also with watermarking and just with general secrecy in PDF files is lots of this stuff is redundant right just like we have redundant ways of looking at it and
removing it there's also redundant ways of embedding that information so you may think that yeah you know that information is in the metadata I can easily remove metadata with like a single command using a tool like ExifTool and then you think that you're good right but then you have the microdots and conversely if you're overly fixated on the microdots and then forget the metadata then you're cooked that way which again kind of goes to show you that there's this recursive structure of constantly hiding data um within a convoluted architecture like a PDF file and once again that's why we can't rely on simple technological solutions right we can't just say remove the metadata
sanitize and redact the PDF right we have to go beyond that and make sure that the source document doesn't have anything in the first place so what that means concretely in this example is that you would not be scanning it right you would not be even printing it out for that matter you would be rewriting what you see on a screen right that's kind of again the underlying point here just to emphasize yet again there are no technological backdoors or backends that can let us bypass all this there are no shortcuts if you look online nowadays a website where in fact this document itself was hosted called DocumentCloud or
MuckRock what they offer is an online redaction service where you can upload a hyper sensitive PDF do the redactions and then post the clean version online and they claim that you know they don't retain a copy of the unredacted document that the document is overwritten by the redacted one and it's like who knows right you're basically trusting their word and staking your security and the security of the source who gave you the document on the claims of this random website right so again we really want to emphasize the manual approach to all this stuff any other questions oh back there
so the question was when you receive a PDF and you want to check if it has already been sanitized is there any way that you could tell um the answer is no there's no kind of like in the metadata a sanitized equals yes field nothing like that but there are a couple flags that you could run through for example if you viewed the document properties in a sanitized document all the stuff down here will be blank um the producer right the application the modified date the application data obviously the author the title it will all be blank and if you subsequently go to additional metadata in advanced there will be nothing here like here we see all the
different fields the instance ID the user ID all this good stuff in a sanitized document if you were to open this up it would be completely blank so that's kind of one tell but there's no kind of set way there's no flag that says yes sanitized any other questions great right then thank you everyone for coming and we hope you picked something up [Music]
[Music]
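The "blank document properties" tell from the last answer can be roughly approximated without a PDF viewer. This is a minimal sketch using only the Python standard library and assuming the metadata is stored uncompressed and unencrypted in the file. The function name `metadata_tells`, the key list, and the sample byte strings are illustrative assumptions. An empty result is only a hint and does not prove a file was sanitized, since metadata can sit inside compressed object streams that this scan cannot see.

```python
import re

# Common keys of the PDF document information dictionary.
INFO_KEYS = [b"Producer", b"Creator", b"Author", b"Title", b"ModDate", b"CreationDate"]

def metadata_tells(pdf_bytes):
    """Return the names of metadata traces visible in the raw bytes."""
    found = []
    for key in INFO_KEYS:
        # Match "/Key (value" or "/Key <hex" where the value is non-empty.
        pattern = rb"/" + key + rb"\s*(\([^)]|<[0-9A-Fa-f])"
        if re.search(pattern, pdf_bytes):
            found.append(key.decode())
    # An XMP packet is where viewers read the additional metadata fields.
    if b"<x:xmpmeta" in pdf_bytes or b"<?xpacket" in pdf_bytes:
        found.append("XMP")
    return found

# Tiny made-up fragments standing in for real files.
sample_dirty = (b"1 0 obj << /Producer (Example Distiller) /Author (Jane Doe) >> endobj "
                b"<x:xmpmeta></x:xmpmeta>")
sample_blank = b"1 0 obj << /Type /Catalog /Producer () >> endobj"

print(metadata_tells(sample_dirty))  # ['Producer', 'Author', 'XMP']
print(metadata_tells(sample_blank))  # []
```

For a real file the bytes would come from something like `open(path, "rb").read()`. As the talk stresses, a check like this is one tell among several and should be followed by a manual look at the document.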