
My name is Nick Kalina, and this is Scott Goodwin. We're from OCD Tech, which is the IT audit and security division of O'Connor & Drew, a regional accounting firm. We're going to talk about paste scraping and its implications for everyone in this room and everyone in the greater world. I'll let Scott take it from here.

Thank you. So we're going to be talking about a web scraping platform that we've designed at OCD Tech to monitor and collect data related to information leaks and security breaches that have been posted to public sources like Pastebin and other paste-style websites. We'll start with an intro to open source intelligence and some of the sources that we're using, the tools that we've used up to this point to collect some of that information, and why we decided to build our own tool to collect a subset of information that's poorly captured by those tools. We'll talk about where that data ends up on the Internet, for smaller and even the large breaches, and how we can leverage that to capture it. We'll take a look at the scraping platform itself, how it's built and how we're using it, and finally we'll look at some examples of the data that we've captured over the past year. We've been scraping Pastebin and other Pastebin-style sites for about a year, so we have a pretty big data set to play with.

To start: open source intelligence. That's an umbrella term for the tools that we use to collect the data as well as the data that comes back. For people who aren't familiar, I want to distinguish open source intelligence from the "open source" that you hear with regard to software, where you have access to the source code itself. In this case it means that the data is being collected from the public domain, from sources that everybody in this room has access to. In the context of security research, our team is collecting data about a business, sometimes more granularly collecting information about the employees of that business, trying to get an idea of the organizational structure while we're planning a security assessment or a pen test. We're also collecting information about the technical infrastructure: using DNS to get host names and IP addresses, then WHOIS to connect some of the names from the last step to the DNS records and to who owns those websites, and then any Internet-facing systems, so your firewalls and routers and load balancers and anything that sits inside your DMZ. We'll be pinging and probing those systems, trying to get as much information as possible.

We use a number of tools most of you are probably familiar with, like recon-ng, theHarvester, and the Discover scripts. They're really great for a point-in-time assessment: they bring in as much information as they can from whatever sources they use, so you get a point-in-time picture of the information that's available on the Internet. What you get back includes some personnel information (you can generate employee lists, email address lists, and so on), but it's mostly technical data, enough to start assembling a picture, from the outside, of what the network looks like. The problem is that most of that data is known to be public already. Sometimes you'll find the Easter egg of something that somebody didn't know was made public. One thing that we tried to use these tools for, and realized they don't work very well in this capacity, is identifying new data that's been released for a target. If something new is released and you compare it back to the data you had before, you could identify that some new information had been made public. They don't work in that capacity; it's not what they're designed to do.

So what we designed was a tool to capture a subset of data on the Internet that's really not well analyzed by those types of tools. What we're really going after is the password: we want to capture email address and password together. That information is out there, and it's not something that you can reliably pull back with existing tools. In order to do that, we're leveraging paste-style websites. For those of you who aren't familiar with something like Pastebin, they get their name from your ability to copy any text you want and paste it to the website; once you click Submit, you're granted a URL dedicated to that text. You don't have to register a domain name, and you don't have to sign up with Pastebin or, really, most of these other sites (some of them you do); Pastebin allows anonymous pastes. Because of that anonymity, they're being abused by hackers and attackers to release information to the public. Either they're trying to make that data available for sale on the dark web, so they'll give you a link to where you can go buy it, or they're releasing the data just to take credit for the breach they committed or to increase the damages to the target organization. So lots of data is getting posted to Pastebin, and to other sites as well.

I want to be clear that that's not what these sites were designed for. They were designed originally for code sharing and collaboration, so developers on opposite sides of the planet can share code for debugging and peer review without having to use an extra layer like email or FTP. There's an awful lot of legitimate traffic on these websites too, and that's why they're still around. But what we're trying to do is grab all the illicit material that's being posted there as well. So that's what we've done: we're programmatically extracting data from Pastebin and other websites to pull down everything that's there, and then, secondarily, going through that information to pull out the pieces we're interested in, which is obviously not the majority of the traffic.

From a web scraping perspective, you're limited by the sources that you can scrape; you have to have access to the site. You can get an awful lot of information by parsing social media and the like, but once there's authentication in front of it you need a username and password, and even once you've authenticated you still only have access to your own network; you can't see everything that's actually there. And hackers wouldn't use that kind of forum to release information publicly anyway: their intent is to get as many eyes on that data as possible. Paste-style sites, like Slexy and Pastebin, really fit the bill for this type of information leakage.

You're also limited, to a certain extent, by acceptable use. When you're programmatically making requests over the Internet, it's really easy to leave the scraper untuned and send way too many requests, more than would be necessary, to a resource. If that happens, in most cases they'll just shut you off or throttle the connection. So you have to be careful; it comes down to tuning, getting everything that you do need without making extra requests.

The tool that we built runs on our standard LAMP stack: a dedicated Ubuntu VM, an Apache web server on the front end with phpMyAdmin (which we don't use that often, but we have it), a MySQL database on the back end, and the entire platform, the actual analysis, is written in Python.

Here's how it works, really quickly. This is for Pastebin, but it applies to any paste-style website that publishes an archive page. Step one is to reach out to pastebin.com/archive; by parsing the raw HTML content on that page, you can extract links to all of the recently created pastes. The posted time over there was zero seconds ago, so this is a constantly updated list of newly created pastes. Step one creates a list of pastes that you're interested in scraping. Then, separately, we take that list, go through each entry, reach out to Pastebin at that URL, and pull down the raw text, as long as we haven't seen it already. Now we have a paste locally, in memory on our VM, that came from Pastebin.

What we did originally, at the proof-of-concept stage, was take the paste and put it right into the database. We didn't perform any analysis on it, and as the dataset grew (and the dataset grows very quickly) it became harder and harder
to use that data for anything. An individual paste can have a thousand lines or more, and we have millions of pastes, so by not performing any analysis on the front end we were relegated to full-text searches only, which became incredibly expensive from a time and resource perspective. So what we do now, version two if you like, is that while we have the paste in memory, we extract the information types that we think we're interested in. One example up here is an email and password combination: an mit.edu email address, then a colon, followed by a string, and that string represents a password hash.
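Putting the steps above together, here's a minimal sketch of the pipeline: parse the archive page for links to recent pastes, pull down the raw text of each paste we haven't seen, and run the extraction pass while it's still in memory. This is an illustrative reconstruction, not OCD Tech's production code; the 8-character paste-ID format, the `/raw/` URL form, and the regular expressions are assumptions.

```python
import re
import time
import urllib.request

ARCHIVE_URL = "https://pastebin.com/archive"
# Paste links on the archive page look like href="/Ab12Cd34" (assumed format)
PASTE_LINK_RE = re.compile(r'href="/([A-Za-z0-9]{8})"')

# email, then ":" or "|" as the delimiter, then the password (or hash)
CRED_RE = re.compile(
    r'([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})\s*[:|]\s*(\S+)')
IP_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
ONION_RE = re.compile(r'\b[a-z2-7]{16,56}\.onion\b')

def extract_paste_ids(archive_html):
    """Step 1: parse the raw archive HTML into a de-duplicated ID list."""
    return list(dict.fromkeys(PASTE_LINK_RE.findall(archive_html)))

def analyze_paste(text):
    """Step 3: extract the information types we care about up front, so
    later searches hit small indexed tables instead of full-text scans."""
    return {
        "credentials": CRED_RE.findall(text),
        "ips": IP_RE.findall(text),
        "onions": ONION_RE.findall(text),
    }

def scrape_once(seen):
    """Steps 1-3 for one polling cycle; `seen` persists across cycles."""
    with urllib.request.urlopen(ARCHIVE_URL) as resp:
        archive_html = resp.read().decode("utf-8", errors="replace")
    results = {}
    for pid in extract_paste_ids(archive_html):
        if pid in seen:          # step 2: only fetch pastes we haven't seen
            continue
        seen.add(pid)
        with urllib.request.urlopen(f"https://pastebin.com/raw/{pid}") as r:
            text = r.read().decode("utf-8", errors="replace")
        results[pid] = analyze_paste(text)
        time.sleep(2)            # throttle, or the site will cut you off
    return results
```

The `time.sleep` call is the tuning point mentioned earlier: without it, an untuned scraper sends far more requests than necessary and gets throttled or banned.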
So we're using regular expressions to match username and password combinations that are separated by a delimiter, usually a colon or the pipe character, and we have others set up for other types of information as well. We extract that information and store it in a separate table in the database, which is indexed so we can actually get to the data. That gives us a lot more visibility, because the data set is growing very quickly.

At this point (these are rough numbers; it's changing all the time, and it's running as we speak) we have 8 million or so pastes over the last year, with some downtime in there, so it hasn't been a complete year. We've pulled four and a half million email addresses out of those pastes. If I were a spammer, we could stop right here: we have a tool that's generating email addresses in real time. Crucially, about one and a half million of those emails had a password associated with them in the standard format that we see, which is email followed by a colon or a pipe character and then the password. We're also extracting IP addresses and, finally, onion links. I'm sure most of you are aware that those are links to the dark web, for lack of a better term. We collect those because it's hard to find resources on the dark web if you don't already have the URL, so we're collecting those onion links now, and since there's such a wealth of information available on the dark web, the future state is that those onion links will populate their own scraper.

At this point I want to make clear what we're going to be showing you and why we're interested in this data. It's because there is a tendency for users to do two things. One is to reuse their corporate or business email address on external services, which is something that has to happen sometimes; it's not necessarily frowned upon, although sometimes it is. The other is to reuse passwords across multiple services. So we're targeting corporate or business credentials that have been released to Pastebin and other websites as a result of a breach that happened on someone else's server. When there's a breach of some random website on the Internet, that data doesn't necessarily have a lot of value; it's not like a Dropbox or Yahoo breach with millions and millions of records. It affects a smaller user base, and the attackers can't necessarily make any real money off the data, so they post it to Pastebin to make it public and take credit for what they've done. But in that data there are regularly corporate email addresses for agencies and businesses of all types, and when there's a password associated with that email address, there's always a chance that it's a valid password for the organization itself. That's what we're targeting, and that's what we're going to show you later in the examples.

How do we use this data internally? Obviously we like to go through it from a research perspective; that's what I spend a lot of time doing, trying to see what's been released. But this is also one of our first stops for vulnerability assessments and pen tests. We have a year's worth of Pastebin data, and ideally we'd be entering these environments already holding credentials. They don't have to be admin credentials or anything like that, but hopefully we're starting an engagement already having some credentials for that environment.

Besides those two things, what we were trying to do from the very beginning was come up with a way to create alerts based on our own information being released to Pastebin, and that of our clients, partners, and vendors: the whole supply chain. If any of that information makes its way to Pastebin, we want to alert on it. Initially we sent those alerts via email, just like every other service, and people stopped looking at them, as is to be expected. So now we're sending them directly to Slack, which is a chat program with laptop and mobile clients, so we're getting these alerts right on our phones. We've also set up the ability to search the database via Slack, so the database is pretty well integrated from a mobile-device standpoint: you don't need VPN access, you don't need PuTTY, none of that, and no MySQL syntax; basic searching can be done right from Slack.

Let's take a look at some of the data that we found over the last year. The NASA breach was semi-publicized; you might have heard that a lot of that information was released to Pastebin. We have FBI credentials: you can see that there's a hash up there, and I left the IP address unobscured as well, because it shows that we're using pattern matching to do this. We don't know what that piece of data after the delimiter is; in this case it was an IP address, and that might not be a password, but we can still associate that IP address with the FBI email address that goes with it. So we're capturing all that information regardless of what the string is. There's the Department of Homeland Security, and then, from an infrastructure standpoint, the United States Department of Energy. We've captured credentials for the Federal Reserve Bank over the last year. For the Securities and Exchange Commission we captured one password that's potentially valid in that environment, and for the FDIC we captured a number of emails, though no passwords. I left that in there because I think most people would still be interested to understand how their agency's or their business's email addresses, even email by itself, are making their way to Pastebin. If they're not aware of that information making its way out, there's a chance it came from inside their network.

I tried to take a breadth-over-depth approach with these examples, to show that large companies are affected by this simply as a result of the number of users they have. The more users you have, the more email addresses you have; in turn, the more users are spreading that information out over the Internet, and the more opportunities there are for password reuse. UnitedHealthcare, the largest health care provider in the United States: we have credentials that are potentially valid in that environment. In pharmaceuticals, McKesson, same story. From the defense sector we pulled out Raytheon, a couple of passwords for them, and in the retail space Walmart, a single password as well. Ford looks like they might have had an actual breach; it's kind of hard to tell, but that's quite a few passwords. Amazon has a pretty big presence here at Security BSides, so we pulled them out for e-commerce, and for communications, AT&T. It really does span all industry sectors and medium to large sized businesses. From our perspective this is more of an eventuality than anything else: for large businesses, you're more likely to find credentials than not in our Pastebin database.

Since Harvard is nice enough to partner with Security BSides this year, I figured it would be a good forum to bring up higher education, because we see more .edu credentials on Pastebin than any other top-level domain. That's again a function of that intuitive trend: you have more users, and in this case more users who aren't necessarily security focused, spreading that information out, and students are probably just as likely as everyone else to be reusing those passwords. This is not a Harvard University problem by any means. These are student credentials, and student bodies are constantly changing, so schools are always provisioning new email addresses and revoking old ones. We're pretty interested to find out, one, what percentage of these credentials are potentially valid for higher education institutions, not just Harvard but anybody, and two, what could we get to with those credentials? We're not going out and validating that any of these work, but these student portals don't usually have two-factor authentication or anything like that, because it's such a burden for the whole environment. Every university has a student portal, and they're all available online. And this is only a subset: in the last year we have a hundred distinct email and password combinations for Harvard University, so statistically speaking, I think we should be able to get access to some financial records.

"Web scraping for fun and profit": I've talked about the fun part. From the profit perspective, Pastebin is a clearinghouse for stolen data; it's an advertising platform for hackers. You're constantly finding links to purchase this information in Bitcoin on external sites, usually on the dark web. We do see credit card numbers very regularly posted to Pastebin and other paste-style websites, and we use regular expressions again for this pattern matching. What it comes down to is that we see so many that it's really highly unlikely they're all valid credit card numbers, even though they match the pattern. So we retain the paste ID in this table, which lets us always refer back to the raw paste, get the context in which that potential credit card number was released, and try to understand whether it really represents a credit card number. For instance, with these ones down here that have the space right in the middle, it's hard to tell. Generally it's system logs, tons of code, encrypted data, binary data; people post all sorts of things to Pastebin. So we have to take a holistic approach to pulling out these numbers, and if we were interested in finding out whether they're real or not, we'd have to go back to that original paste and try to get the context.

We're not really profiting off of this, obviously. We're not using those credit card numbers and we're not trying any of the credentials that we've found. For us it's more about a potential savings: we're going to try to identify those threats as soon as they're made public. Obviously we have no visibility into what happened with those credentials before they were made public, or who else had them prior to that. But as soon as that information is made public (and I assure you we're not the only people doing this), it becomes a risk to that environment: someone else is going to try to use those credentials against you. The faster we can respond to these types of things, and basically change that password if it happened to be a legitimate one, the more we limit the damages and the ability of an attacker to use public domain data against us. At the bottom is a pretty commonly cited stat, the cost of a data breach, and any single one of these credentials could instigate a breach like that. Even if they're not domain admin or anything like that, once you have access to an email inbox you can start pivoting and gaining trust; you have a foothold in that environment. So from our perspective it's not about making money as much as it is saving money.

[Audience question: is there any reason for Pastebin and sites like it not to limit this? Wouldn't doing so affect their ability to provide service to customers?] So the question is, is there any reason for them not to do it. Exactly; they haven't done it yet. [The questioner adds that scraping like this seems like one way to mitigate the risk, but others may not be doing it.] Absolutely. Pastebin is absolutely aware of this problem, and they are more or less profiting off of it, to be honest with you. Previously, up until maybe six months ago, we were doing raw HTML scraping, parsing the raw HTML content on the page and pulling out the raw text of the paste, and that's what we still have to do for all the other sources we're scraping. Pastebin released an API for scraping, and if you don't use their API, they're going to shut you off immediately; we've had issues with other services throttling the connection and whatnot. But Pastebin is not really doing anything to mitigate this. They will remove paste data that's been made public, but not before the data is made public. The paste gets posted to that archive page I showed you immediately, we're scraping all of it, and we raise an alert. Often, by the time the alert makes it to your cell phone and you click the link to go look at the paste, it has already been removed from Pastebin, but it's in our database, and it's in everybody else's Pastebin database already. The services we're not scraping are the ones that either (a) have authentication in front of the page, so you have to log in and can only see your own network of people you're allowed to talk with, or (b) don't make the list of pastes public, so you don't know what you're looking for. If they don't publish an archive page, I can't create a list of pastes to go out and scrape; I would basically have to guess paste IDs until I found them, and that's just not going to work. So the real way those sites mitigate this is by not making the URLs for public pastes public. Even though, if I navigated to such a URL, I'd still be able to access the paste, I just don't have the URL to start with.
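For reference, polling the scraping API mentioned above looks roughly like the sketch below. This is an assumption-laden illustration: the endpoint names and the `key` field follow Pastebin's published scraping API as I understand it, and that service only answers requests from an IP address you've whitelisted with a paid account, so treat the details as approximate.

```python
import json
import urllib.request

# Assumed endpoints from Pastebin's scraping API documentation;
# requests only succeed from an IP you've whitelisted with them.
LIST_URL = "https://scrape.pastebin.com/api_scraping.php?limit=100"
ITEM_URL = "https://scrape.pastebin.com/api_scrape_item.php?i={key}"

def keys_from_listing(listing):
    """The listing endpoint returns JSON metadata; pull out paste keys."""
    return [item["key"] for item in listing]

def poll_api():
    """One polling cycle: list recent pastes, then fetch each raw body."""
    with urllib.request.urlopen(LIST_URL) as resp:
        listing = json.loads(resp.read().decode("utf-8"))
    bodies = {}
    for key in keys_from_listing(listing):
        with urllib.request.urlopen(ITEM_URL.format(key=key)) as resp:
            bodies[key] = resp.read().decode("utf-8", errors="replace")
    return bodies
```

Compared with HTML scraping of the archive page, the API hands back structured JSON, which is why Pastebin can immediately cut off anyone scraping outside of it.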
[Audience question about verifying the credit card numbers.] Yes, that's why we keep the original paste, so we can always reference back. Usually what you find is either a portion of the credit card number, or the whole credit card number with no extra information and then a link to go purchase the full records on the dark web. I tried to create a regular expression that would also parse out the CVV number and everything else, and it just broke every time. We haven't really started digging into the credit card data yet, because there's nothing we can really do with it.
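The talk doesn't describe any mechanical triage beyond the regex match and going back to the paste for context, but a common first-pass sanity check (offered here purely as an illustrative sketch, not their tooling) is the Luhn checksum, which nearly all real card numbers satisfy and most random digit strings fail:

```python
def luhn_valid(candidate: str) -> bool:
    """Luhn checksum: filters out most random digit strings that merely
    match a card-number pattern. Ignores spaces and dashes. Passing the
    check does NOT mean the card is real or active."""
    digits = [int(c) for c in candidate if c.isdigit()]
    if not 13 <= len(digits) <= 19:   # typical card-number lengths
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:                # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

Even a Luhn pass only says the number is well-formed, which is exactly why retaining the paste ID for context matters.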
If some credit card data popped up, we could go back and verify whether it came out of some random junk data or a real dump. And we do see full credit card records; we have absolutely seen full records released. That doesn't mean they'll work (the card could be shut down or whatever), but there are absolutely full records available.

[Audience question: is there a size limit on pastes?] There is a maximum size limit on a paste; I think it's 500 kilobytes. You can make multiple pastes in a row, but there is a limit, based on bandwidth, that Pastebin puts on how big a paste you can create.

So I'm going to wrap up, and in order to give a demo I'm going to search right from Slack. We could log into the database and do it that way, but I'm going to pull up our Slack client instead. Is anyone brave enough to offer a domain? Okay, with one disclaimer: as you can see right here, if there was a plaintext password associated with that domain, it's going to show up right here.
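The Slack-side search can be sketched as two small functions like the ones below. This is illustrative, not the talk's actual bot: the table and column names (`creds`, `email`, `password`) are hypothetical, and the cursor-style query assumes a MySQL driver such as PyMySQL. Slack slash commands POST the user's text to a URL you configure, so a thin web handler just runs the indexed query and returns the formatted reply.

```python
def search_domain(db, domain):
    """Indexed lookup in the extracted-credentials table (hypothetical
    schema), rather than a full-text crawl over millions of raw pastes."""
    cur = db.cursor()
    cur.execute(
        "SELECT email, password FROM creds WHERE email LIKE %s LIMIT 25",
        ("%@" + domain,),
    )
    return cur.fetchall()

def format_reply(rows):
    """Build the text Slack will render back into the channel."""
    if not rows:
        return "Nothing found. You're clean (in this dataset, anyway)."
    return "\n".join(f"{email} : {password}" for email, password in rows)
```

Because the query only touches the indexed credentials table, a lookup from a phone comes back in seconds with no VPN, PuTTY, or SQL syntax involved.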
So I'll show you this: the shell command is called "pastebin". We have a robot set up to do this. We're telling the robot to execute a bash command called pastebin, which lives on a VM on our network, and then we're searching for a domain. I didn't put in a condition for the empty case, so if it says nothing, that means nothing was found: you're clean. It works really well for big domains. [Audience members suggest domains.] Proof it does work. [Another suggestion: Qualcomm, which is two M's, right?] The bigger the domain, the more likely this is to be a problem; it's just statistics, really. That's not to say we don't find small businesses in here as well; we certainly do. Now these: somebody posted the same thing over and over again, so we got no passwords there, but these are all from individual pastes. [Another domain.] We have one, but there's no password.

[Audience question: is the bot locked down?] Yes, it's locked down entirely to those commands. I can't run whatever I want; this is not an interactive bash shell. This is me calling out to a server that we've set up to run a specific script, and it doesn't accept anything else. There are a number of things we can do with this robot, but it doesn't accept anything outside of that, so you can't run rm -rf. Absolutely not. I'm not a developer; I hope nobody touches any of this stuff, because I'm sure it's very fragile, but as a proof of concept it does work.

[Audience question: what controls are in place for this data?] That's an excellent question, and we've been struggling with what controls really need to be in place for this type of data; everybody has their own opinion. Up until a little while ago, our answer was that it's public domain data and we're not the only people that have it, but that doesn't change the fact that it is PII. I'm the only one with direct access to the database; obviously, anybody with access to Slack can search it. We have access controls, and the virtual disk is encrypted, and that's all we have in place. I mean, I'm scraping data from Pastebin that everyone else, including more nefarious people, has as well.

[Audience question: are the passwords shown here actually hashes?] We're not differentiating between a password and a hash. Anything that comes after the delimiter next to an email address is what we're calling the password. Sometimes that's a hash, if the web application hashed its passwords, and sometimes it's a plaintext password; we're not differentiating there.

[Audience question: are you collecting other types of logins, like usernames that aren't email addresses?] We have everything that was posted to Pastebin in the last year. The reason we keep everything is so we can go back and answer questions like that. We can re-extract whatever data we want by designing a new regular expression (we're running out of time), going back over that entire database, a year's worth of data, a hundred gigabytes, extracting every piece of information we're interested in, and storing it in its own table. We're looking to do that with Social Security numbers right now.

[Audience question: what's the ratio of hashed passwords to plaintext?] We see many more plaintext passwords than hashed passwords, and I believe that's partly the result of some of those plaintext passwords being complete bunk. It's hard to say; we're not validating any of this. But when we see a hashed password, the people it belongs to should probably be nervous, because it's generally real, or at least real for whatever resource was compromised. We think so, anyway. For any other questions, you'll have to come up afterward. Thank you, everyone.
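The replay step just described (design a new regular expression, run it back over the archived pastes, store the matches in their own table) can be sketched like this. The SSN pattern and the shape of the `pastes` rows are illustrative; in practice the rows would be streamed out of the MySQL pastes table.

```python
import re

# Illustrative U.S. SSN pattern; a production pass would be stricter.
SSN_RE = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

def reextract(pastes, pattern):
    """Replay a new regular expression over the archived pastes.

    `pastes` yields (paste_id, raw_text) pairs. Each match keeps its
    paste ID, so you can always go back to the original paste for
    context, then store the results in their own indexed table.
    """
    for paste_id, text in pastes:
        for match in pattern.findall(text):
            yield paste_id, match
```

This is exactly why keeping the raw pastes pays off: a question nobody thought to ask at collection time can still be answered over the whole year of data.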