
Hey everybody, it's Guy McDoudfella, co-chair of the BSides Las Vegas Proving Ground track. Our next talk is "Revisiting the Analog Hole: Using OCR and Other Techniques to Exfiltrate Data" by Samuel Greenfeld, who was mentored by Lucas Morris.

"Analog hole" refers to the fact that, even though we use digital signals and encryption nowadays, data eventually has to be provided in an analog format that can be heard or seen by an end user. I'm Samuel Greenfeld. I've been a software QA engineer for quite some time, and I've been CISSP certified for 11 years now. I'm here to talk to you about how optical character recognition can be used to extract data from an environment. In this talk we
have an introduction to optical character recognition, discuss how OCR works, look at some results, consider some related attacks, and also consider some countermeasures. Optical character recognition in general refers to automated methods to identify printed or handwritten text in an image. It existed in a crude form as far back as the 1800s or early 1900s, when portions of characters were converted one bit at a time into audio for use by the blind to interpret text. It started with targeted characters and fonts, is now much more general, and it's a very mature technology. If you look at many optical character recognition programs from the 1970s, they look very much like their counterparts today. Why should we look at optical character
recognition? It's a widely available technology. It can be done outside of a system, so you have no clue that it may have occurred to your data. Users have to be able to interact with text; you can't obscure it without slowing down their progress. Text tends to be a limited number of colors and sizes, making it easy to extract from an environment. Image precision is generally not super important; we're not attempting to reproduce an object given a picture of it. You can remove non-text data which can be used to identify the source. And the focus tends to be on large data breaches, but smaller data breaches may be dealt with quietly or potentially lead to bigger ones. The
COVID-19 pandemic has led to a surge in remote work, and you do not necessarily see what your employees are doing. This makes it much easier for them to use a smartphone camera or other device to take pictures of data from a computer, and your company-issued laptop may not be able to log or otherwise identify that they have committed such a breach. What is the business impact of this? You could lose user data such as addresses or credit card information. Not all of the data may leave your company, but you could potentially lose a portion of it, similar to a credit card skimmer at a
gas station or restaurant. You could lose company data, such as the battle card information used for dealing with your competition, or potentially pricing information, which could then lead to the loss of sales. You could lose operations data, such as your access keys or information on your firewall rules, which could then be used to potentially lose data later, though not necessarily directly. And you could lose product development information, such as future plans that you may wish to work on. Cameras work for monitors, paper, screens, and many other mediums; they don't care what you point them at. Here you have a picture of Samsung's
camera application on their smartphones. The T in a circle means it has identified that there's text on the screen. As such, it wants to offer you the ability to scan the image where the text is located; it's conveniently provided a "tap to scan" button. So how does optical character recognition work? First you convert the image that you see into black and white or another two-color format: ones and zeros. Most optical character recognition engines work in black and white, or they may be able to reapply the color afterwards. You then rotate the image to a known orientation, normalize the height and width of the text so it matches what you expect, and determine the areas of interest, which may be characters,
words, paragraphs, or some combination of all the above. Then you process the results. The crudest and oldest technique would be to do an image comparison, where you compare a character with a reference and see if it matches what you expect. You can also do a glyph comparison, again at the character level, where you separate a character into its components and then identify the letter that way. More modern systems will use neural networks such as long short-term memory (LSTM) networks, which will attempt to take perhaps entire words and identify them, with a confidence value as output. But if you're doing this all with a 3D input source such as a camera from outside of
the system, you potentially have to correct the perspective. Note that some systems, such as Google's or Amazon Web Services' video engines, can correct the perspective for you, but in case you have to do it manually, you can either use the smartphone app that you pick up the information on, or you can use PaintShop Pro or another application on a computer, and the result kind of looks like this. Some optional techniques you may wish to consider: noise and line removal, to potentially remove anything on the screen or piece of paper which is not related to the text you want to extract. You can do script and language recognition, especially if you're using an engine which supports word
recognition. There are engines out there which support hundreds of languages. Alternatively you can do layout analysis, which in some ways is kind of mandatory, but there are systems which will do things like work with your invoices and other information and go from there. There are plenty of available applications for doing this. You have mobile applications such as Microsoft Lens or Google Stack, and original equipment manufacturer (OEM) software included with printers and scanners. The ABBYY company makes a wide range of software, ranging from cloud-based solutions, or stuff you can include with your own product, to the OEM solutions mentioned above. Adobe Acrobat has both cloud-based
solutions for their free software, which you can pay for as an add-on, as well as their Pro product to do scanning and recognition. And you have a variety of cloud-based solutions from all the major providers, both for images and video. There are open source software solutions which may do OCR as well; however, you may have to do a bit more pre-processing. And, as I said, there are many other solutions. When acquiring data, I've identified two main ways you get data from a computer. The first would be paged text. This would be things like database entries, as shown, or slides like from a PowerPoint presentation. Had this been an in-person talk, you may have seen people taking pictures with
their smartphone of the screen. Alternatively, you can have scrolling text, which can be approximated by treating it as paged text, hitting the Page Down button, and stitching the results together. A more advanced solution may be able to handle gradually scrolling text and stitch that together as well. Some results: you have Google Drive and Google Docs here. Here we've uploaded an image to Google Drive, and it offers to let us open the image in Google Docs, and here's the result of that. You can see that it's extracted the text that it found in the image and even colorized it a bit, and it even recognized that there was a table, because it went down the fiscal year columns.
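The "Page Down and stitch" idea mentioned above can be sketched in a few lines. This is only an illustration under strong assumptions: the function name `stitch_pages` is hypothetical, each capture is assumed to have already been run through OCR into a list of text lines, and the overlapping lines are assumed to match exactly. Real OCR output is noisy, so a practical version would likely need fuzzy line matching instead.

```python
def stitch_pages(pages):
    """Merge OCR'd captures (lists of text lines, in scroll order),
    dropping lines repeated at the seam between consecutive captures.

    Hypothetical helper: assumes overlapping lines were recognized identically.
    """
    merged = []
    for page in pages:
        lines = [line.rstrip() for line in page]
        # Look for the longest run of lines that ends `merged`
        # and also begins the new capture.
        overlap = 0
        for k in range(min(len(merged), len(lines)), 0, -1):
            if merged[-k:] == lines[:k]:
                overlap = k
                break
        # Append only the lines that were not already captured.
        merged.extend(lines[overlap:])
    return merged


if __name__ == "__main__":
    captures = [
        ["row 1", "row 2", "row 3"],
        ["row 2", "row 3", "row 4"],  # Page Down left two rows of overlap
        ["row 4", "row 5"],
    ]
    print("\n".join(stitch_pages(captures)))
```

Matching whole lines keeps the sketch simple; gradually scrolling video would need the same idea applied at a finer granularity.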
Microsoft Lens is an example of a smartphone app which can upload to the cloud in the form of a OneDrive-based Word document. Here it's attempted to figure out the same image; you can see it's kind of confused. I've noticed Microsoft solutions don't particularly care for noise. What you can do about this is essentially pre-filter the image or use another product. This doesn't mean that Microsoft solutions are bad; it just means that they are more prone to being distracted by certain types of noise. Here's another example where I used Microsoft Lens again, and despite the fact that the image was skewed, and the skew was picked up in the OCR output, it was better able to figure
out what was going on. Microsoft Word on its own can also do optical character recognition. Here's the prompt that appears if you give it a PDF file, and it's able to extract graphics-based images with text in them from the PDF file. Here's the result of that. While it did not convert the inverted text from the image we gave it, which was cleaned up to help it, it did convert the text for everything else that's on the screen. In addition to the text-based solution and the table-based solution, Amazon Textract, which is part of Amazon Web Services, can also process forms. While it did not spot the member ID, which was inverted, you have the ability in this mode to
detect things like the fact that the VIP member check box was selected, which is not something it does in other modes. Some related attacks: you can simply take a picture. There's no reason you need to run optical character recognition, especially if you're doing something simple. You can do image manipulation prior to OCR. As we said, you can do the two-color conversion, or just extract the text colors in advance, to help pre-processing if necessary; here we're doing that with PaintShop Pro. You can do screenshots or screen scraping. There's nothing special you have to do here, so long as you can easily get the image or other information out of the computer without being detected. This may be
harder depending on what your employer, or the computer you're trying to access, is doing to keep you from easily extracting data. You can do source video capture using an HDMI capture device or another video capture solution; here we have Open Broadcaster Software (OBS) being used to take a capture of the PowerPoint screen. And if you have access to the computer, you can do more advanced things, like potentially using the accessibility framework of the operating system, which is typically used either for test automation or by users who are disabled, to extract the information from an application and easily make the text and potentially other information available to read. Here's an example
where I pointed Microsoft's test utility called Accessibility Insights at Firefox, and it dumped all the text it found in the document we saw earlier. You can also potentially infiltrate data. This may be easier for things like invoices and business cards, provided you don't trip any alerts because the invoice is obviously something odd, like you spent five thousand dollars at a restaurant. But infiltrating programs is much more difficult. Here we have an example of one of the few programs which consists purely of text characters, an antivirus test file called EICAR, but it does not easily get processed by many optical character recognition engines. The reason for this is that many optical character
recognition engines expect there to be a known language of text, and so they attempt to take this string and insert spaces, figuring there must be spaces between the words. They may turn the parentheses into brackets, or potentially get confused, like the I might be read as a one. This in general is the problem you have to deal with if you're trying to do stuff like that. In general, if you're trying to transfer something like this, or potentially base64-encoded data, you're going to have to have very good control of the computer you're trying to implement this with. Otherwise you're not going to be able to control the company's computer enough to the
point where you can get systems in there, and you're probably going to trip a whole bunch of alerts, provided they have the proper configuration to detect these sorts of things as well. On the other hand, you could potentially look at something like an encryption certificate in base64-encoded data, and provided they didn't have an alert triggered on you doing that, you could at your whim control the optical character recognition on the remote side, and then decrypt and potentially base64-decode that data with ease. Some countermeasures to OCR: you could consider visual watermarking. The main thing here is that most watermarks in general have to be done in a way that is not
obtrusive, so if they're a different color, like they are here, you could easily just extract all the black text on the screen and get things done that way. Alternatively, if this had been an embossed watermark, you may have had a bit more difficulty and it may have required some pre-editing, but in my attempts I was able to OCR that. But if you took something like this page printed with a black and white printer, at least personally I'd have a lot more trouble using a black and white laser printer's copy of this image and attempting to extract the text. You may attempt to avoid direct recording, provided you have enough control over the computer
such as using High-bandwidth Digital Content Protection (HDCP), enumerating attached video devices to see if one is known to be a recorder, disabling the screen capture of the operating system, and other means of potentially blocking that. However, this does not stop someone from pointing a smartphone or other camera device at the computer and indirectly attempting to get at the information that you're attempting to block direct access to. You can limit the data breach. This can be done by only showing a limited amount of information to users. It could also be done by showing different data to different users. This does not necessarily mean just between a supervisor and a lower-level employee, but if you have something like
customer service representatives, you could have your test callers use different names when talking to different representatives, so if there is a data breach, you can potentially trace where it came from. And there are corporate measures: you can use corporate policies and training to discourage indirect attempts at extracting data. You can do employee monitoring to attempt to see what they're doing, but even if you monitor webcams, various online test providers have found that people find ways to just point a smartphone at the screen, even if they first point their webcam around the room and claim there's no smartphone or other camera device around. And you may also have endpoint data
protection or data loss prevention solutions. These may not help except in the direct case where someone may be trying to email a photo to themselves, but they may also give you an activity log in retrospect to figure out where a breach may have come from. That's my information in terms of what I feel about optical character recognition. In conclusion: if a human has to see it, then other devices can capture it. Optical character recognition may help perform a medium-sized data breach, but may be too complicated for potentially smaller and larger compromises. If all you're looking to do is grab someone's email address or one slide of information, there may be no point to doing optical character
recognition. But if all you have access to is something like Citrix, VMware Horizon, or another remote desktop solution in a graphical window, and there's no other way you can extract the data from the environment, such as a mass database export, then optical character recognition may be your only way of extracting all the data.

All right, well, we're here with Sam. That was an utterly terrifying talk. I always love data exfil talks because they inevitably cover ground that nobody else seems to think of or worry about. One question I did have is about discussing more mitigation strategies. I know it's difficult because cameras are everywhere, smartphone cameras are everywhere, but what are some
potential mitigation strategies that organizations could use to try and defend against this sort of thing?

Well, as I said, you could possibly limit the amount of information that anyone sees and keep a log of who saw what. You could also, if it's obviously a classified information facility, have a SCIF or something where you're restricting access, or an exam room. But if the user is remote and you don't have visibility into the room, even if you're like an exam provider and just having them wave their computer around the room, that might not be enough to catch it if they're holding a smartphone
under their laptop.

What about, um, situations where... well, I'm trying to figure out how to frame this question. So we've talked about users taking pictures and using those various OCR platforms to exfil data. What about situations in which we're pulling data directly from a user's device? Let's say the user's got malware on a phone, because we all know that's a potential problem that we're going to have to mitigate. What about cases where, basically any time a user is taking a picture, you're pulling whatever the photo is into something in AWS or Azure? What about setups like that?

Well, you
mean where the malware is intentionally taking photos of the screen, or as a disaster recovery technique?

Well, basically you've got malware set up to monitor the phone and grab a copy of whatever the user has taken a picture of. I take photos of receipts all the time.

True.

Right, I mean, what about situations like that?

You could definitely use OCR potentially in such a situation, or you may be using more intelligent software to figure out what the user is trying to do when they're taking pictures. They may just be randomly taking pictures of their dog or whatever. An interesting twist on that situation also, as I've noticed
at least in some cases, although I don't know if it's intentional or not: on at least one smartphone platform I know, if you're in a work partition on Android and you take a screenshot, it ends up in your non-work partition with the personal data. And I don't know if that's intentional on their part or not.

I suppose using things like MDM and other mitigations for BYOD doesn't necessarily defend against this type of attack, right?

Well, if the MDM is on the device, it potentially could say no screenshots, at least through the official channels. I mean, I believe a lot of the apps out there that try to do disappearing messages, like Snapchat and stuff, I
thought, try to avoid screenshots as best they can, so there's some mitigation like that as well. It's just that, in general, all you need is another device that's pointed at the first, and your mitigations kind of go out the door.

Right. Well, what I was getting at is, with MDM, turning off the camera on the smartphone disables essentially one third of the utility of the smartphone, right? Having a camera in your pocket is a game changer, and not being able to use that camera for things like taking photos of, you know, stuff like that...

I believe you can also have Apple officially remove the camera from an iPhone, at least for government use, as well.
Right. There's also, you know, putting a brightly colored sticker over the camera, but that's easy to... um, a tamper-resistant sticker. Now, I know that for a long time there have been counterfeiting countermeasures in Photoshop and in HP printers especially. Basically there are programs where the Secret Service and the Department of the Treasury work with manufacturers of printers and various graphics software to prevent you from doing things.

As well as, I believe it's the EURion constellation, or whatever it's formally called, which is the pattern of yellow dots where some of the relative distances all work out to some square root or something. I'd have to look it up; I don't remember
it live for this talk, but yes.

Sure. But mitigations like that, how practical are those to prevent this sort of attack?

Well, the currency stuff primarily just kicks in when you're trying to work with currency, turning your photocopy of it into just a black thing, and if you screw up enough times, eventually the copier locks up with a service code that gives away what you tried to do.

Mm-hmm.

But in general, if it can avoid the original picture, yes, potentially, though I'm not aware of any cameras that necessarily implement software that does that directly.
It would have to be something like... with your OCR solution, once you have the text, there's nothing any sort of watermark can do, unless you've explicitly put in watermarks like slight variations within the text.

Okay. What about, not necessarily in terms of mitigation, but adding sort of forensic trails? You see this especially in the screener films that the Academy of Motion Pictures will send out to Oscar previewers, right? They will put in not only the, you know, sort of "promotional consideration only" notice, but they'll put in certain frames.

In general, I would think with watermarks, yes, the original photo will always have the watermark in it, but
by the time you've done that text extraction, it's kind of harder. In fact, for example, I think for one of the leakers of classified information from the government earlier this decade, part of their prosecution commented that the press outlet that received their stuff actually copied it down to the little yellow printer dots on the page when they uploaded a copy of the scanned leaked information to their news website. Since then, I've noticed that site instead does things like provide, quote, reconstituted information. So yes, if you can trace the initial leak, you'll
get it, but once it's passed through the OCR filter, it's a lot more difficult to potentially figure out your watermark unless you've intentionally given them different text.

Well, there's also patterns. You know, people write in certain ways; people use the same turns of phrase and the same spellings and stuff like that. It's a common thing with my spouse, who's a university professor, to check for plagiarism by running it through various tools. I know that some of that has the potential to be added to DLP solutions.

Yes, yeah, for some of this. For the purpose of this talk I wasn't necessarily thinking of tracing individual emails. I was thinking primarily,
although yes, you could potentially leak individual emails, or keep track of stuff for potential claims against your employer later in case you are mad at them or something, or just your contacts if you were a sales guy, I thought of it more in terms of information that's potentially critical to the employer but is kind of mass-produced within the organization, or at least pretty widely spread or accessible within the organization for the most part.

Okay. Well, is there anything that you didn't get a chance to cover that you'd like to touch on in more depth?

I kind of thought about that a bit more. I didn't really cover the press around this in the
actual talk. One thing I'd point out is that while you don't necessarily see a lot of optical character recognition mentioned for offense, apart from breaking CAPTCHAs and potentially filling in login forms, which I've seen done as another example, you see it a lot more discussed on the side of defending against leaks, or AI, or whatever. The OCR solutions come into play for defense more than offense, or as just software you'd want to use around the house. But in general there are plenty of articles out there about leaked pictures, such as COVID-19 vaccination information from the Netherlands, where the news has shown redacted spreadsheets of the information that was leaked or
attempted to be leaked, and you could have easily used optical character recognition to process it afterwards.

Yeah, the COVID stuff, especially with regards to vaccination status, is something that scares me a little bit, because it's so easy to grab that data and, you know, forge it. It's not like there are countermeasures embedded in the cards themselves, like you would see with an ID or a passport or something like that. So I think one of the things I appreciate about the Excelsior Pass and some of the stuff that France is doing is that they're not relying on optical verification; they're actually
using proper, cryptographically sound methods of verifying or determining veracity.

And even if you go to Walmart in order to get into one of these systems, such as CommonPass, I think Walmart might actually provide you a cryptographically signed hash to start with.

Yeah. All right, well, thank you very much. I don't see any other questions in the channel, but this has been a great talk and it is a fascinating topic, and I really appreciate it. So thank you very much.

Thank you.
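The closing point, that a cryptographically signed record resists the copy-and-alter forgery that OCR makes easy, can be illustrated with a small sketch. Everything here is an assumption for illustration: the record fields and the names `sign_record` and `verify_record` are hypothetical, and a keyed hash (HMAC) from the Python standard library stands in for the public-key signatures that real pass systems would more plausibly use. It is not a description of how Excelsior Pass, CommonPass, or any retailer actually works.

```python
import hashlib
import hmac
import json


def sign_record(record: dict, key: bytes) -> str:
    """Return a hex tag over a canonical JSON encoding of the record.

    HMAC-SHA256 is a stand-in here; a real system would likely use
    an asymmetric signature so verifiers never hold the signing key.
    """
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(record: dict, tag: str, key: bytes) -> bool:
    """Check the tag in constant time; any edited field makes this fail."""
    return hmac.compare_digest(sign_record(record, key), tag)


if __name__ == "__main__":
    issuer_key = b"demo-key-not-for-real-use"   # hypothetical issuer secret
    card = {"name": "Example Person", "doses": 2}  # hypothetical fields
    tag = sign_record(card, issuer_key)

    forged = dict(card, name="Someone Else")     # text copied and altered
    print(verify_record(card, tag, issuer_key))
    print(verify_record(forged, tag, issuer_key))
```

Text lifted from a photo and edited still reads fine to a human, but the altered record no longer matches its tag, which is the property optical inspection of a paper card cannot offer.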