
Robots.txt - There's gold in them thar files

BSides Peru 2015 · 24:50 · 394 views · Published 2015-06 · Watch on YouTube ↗
About this talk
Web penetration testers often overlook robots.txt as a reconnaissance tool. This talk analyzes thousands of robots.txt files from live websites to uncover patterns in user-agent declarations, directory structures, and metadata. The speaker demonstrates how sensitive information disclosed in robots.txt comments—from CMS identities to developer names—can inform targeted attacks, and discusses why many developers misunderstand robots.txt as a security boundary rather than a voluntary crawling convention.
Original YouTube description
Abstract: Web penetration testing has benefited from certain sites providing a ready made list of sensitive areas that they don't want crawled, robots.txt. I pulled, and analyzed, the robots.txt file from numerous sites to determine most common user-agents and locations. From the results, I have derived a better listing of directories to use with tools like dirbuster and for better reconnaissance.
Bio: N/A
Table of Contents: 22:12 - Questions
Transcript [en]

So, hi, I'm Justin. I don't have a bio slide, I don't have anything out there, and this isn't going to be a super elite cool talk; I just got really bored one day and started playing around, so that's about it. Yeah, you all can read, I'm hoping it's good enough. Just a quick plug, since y'all let me up here: Secure West Virginia, formerly Hack3rCon, is occurring in November in Charleston, West Virginia. No, it's not that far, incest is not that common, and the moonshine is that good, so come on down. Robots.txt. All right, who here remembers being on the web in 1994? Come on, show of hands. Wow,

that's more than I expected; you guys are old. Thanks. Yeah, you're welcome. So the robots.txt standard was created by, I'm not going to try to pronounce this name, Mr. Koster, I'm assuming because he was goaded into developing something to prevent web crawlers from destroying websites. One of the initial DoS attacks was: you'd crawl somebody's site, and their site would go down. It was 1994, Linux had been around what, two or three years, so things going down wasn't that hard; not always the case now, you know. As I said, the purpose was to prevent web crawlers from DoSing sites. Who here knows what robots.txt is? Who here doesn't know what robots.txt is? Okay, well, we'll find out

here in a second. It's a text file in the root; you didn't have to raise your hand if you knew, man, it's okay, I figured everyone would. It's a text file in the root of the site, located at http://blahblahblah.com/robots.txt. The purpose is to be readily accessible, so that any time your web crawler is going to go crawl some site, it pulls up this text file and says, hey, what can I do? So, you want traffic to go to your site, because that equals money in most cases. You want web crawlers to put your stuff into search engines, because that equals more money from people looking for your stuff. However, you don't want some of

your sensitive things to be crawled. If you have a passwords.txt out there, you don't really want that showing up in a search engine. That's a double-edged sword, yeah, you get it. So you want money, you want people to go to your site, and you don't want your sensitive stuff out there. A lot of people still believe that robots.txt is actually an enforcing sort of protocol. It is not; it is completely voluntary. If you go to robotstxt.org, they have an entire section saying 'surely listing sensitive files is asking for trouble,' and they're right.
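As a quick aside, here is roughly what that voluntary compliance looks like from the crawler's side, using Python's standard-library robotparser. This is just a sketch; the example.com URL, the crawler name, and the path are placeholders, not anything from the talk.

```python
# Minimal sketch of voluntary robots.txt compliance on the crawler's side.
# Nothing stops a client from skipping this check entirely; it is a courtesy.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")  # placeholder site
parser.read()  # fetch and parse the file

# A polite crawler asks before fetching; an impolite one just fetches anyway.
if parser.can_fetch("MyCrawler/1.0", "https://example.com/secret/"):
    print("robots.txt says this path is fair game for MyCrawler")
else:
    print("robots.txt asks MyCrawler not to crawl this path")
```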

If you go a little further, they boil it down to the real answer, which is that robots.txt is not intended for access control, blah blah blah; think of it as a 'no entry' sign, not a locked door. Is it going to stop anybody who really wants to crawl your site? No. Is it going to stop people that follow robots.txt? Yes. So, as I said, and for those of you who've never seen a robots.txt file, it is not a defined standard, more of a gentleman's agreement. There are three things that are allowed. Comments, which, for all you developers out there, are information you put at the top of your code to let other people know what is going on; I know most of you developers don't

actually know that, because I've read them. User-agent, which I will rant about later, they are horrible, just completely, but it identifies the browser or robot, the web crawler, that the request is coming from. And then Disallow, which is pretty self-explanatory: don't look at this. There are some non-standard, not formally recognized extensions, which is funny coming from a completely not formally recognized standard to begin with. The asterisk, which strangely enough, we'll find out later, is the most common: the star, the wildcard, 'everybody, listen to this.' Crawl-delay, which tells a web crawler that it should crawl at a certain rate so as not to DoS your site; I don't think this was well

thought out, but hey, whatever. Host is interesting; it'll let you say this robots.txt kind of applies to everything. And then you have Allow, which just says, hey, this is cool to crawl; I'm not sure why you would use this entirely. And then Sitemap, which is probably the best one, because you can just say, here's an XML file telling you how to get to all my stuff, so you don't have to bother crawling, you can just look at this XML file. You can't see it, but on my screen there's a lovely robots.txt file from somebody, I don't remember who.
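Since the slide isn't legible here, below is a small made-up robots.txt showing the directives just described. Every path, host, and URL in it is invented for illustration; it is not the file from the slide.

```
# robots.txt - comments start with a hash
User-agent: *              # applies to every crawler
Disallow: /admin/          # please don't crawl this
Disallow: /passwords.txt   # ...and definitely not this
Allow: /public/            # non-standard: explicitly fine to crawl
Crawl-delay: 10            # non-standard: wait 10 seconds between requests
Sitemap: https://example.com/sitemap.xml

User-agent: Googlebot      # a section just for Google's crawler
Disallow: /no-google/
```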

So the process: 'Hello, Google, I'm robots.txt, don't take another step.' 'Okay, chill, I'll back off.' That's how it's supposed to work, when a spider that conveniently wears a crown is actually obeying robots.txt. Okay, that's nice, nobody cares. So what I plan to talk about is stuff that you can extract from robots.txt if you are being malicious, and then a small project I tried to do, well, I think I succeeded anyway, trying to pull statistics from a bunch of robots.txt files. This kind of sums it up: in addition to all the prohibited files and stuff, there's a lot of good information, not just what you shouldn't be seeing that's sensitive and listed in the robots.txt file.

As the cartoon says, get all the information you can, we'll think of a use for it later. So the next couple slides are what I call fun finds. I wrote a script, I pulled a whole bunch of robots.txt files, which I'll go into, and how I was a pain in the ass, later. Oh, sorry, I swear when I get nervous; if that offends you, sorry. Wash, rinse, repeat, take a screenshot, put it in the presentation, that was pretty much it. So, some neat fun finds that I found. Can you guys read that at all? Not a bit? That's awesome. Well, at the top it says content management systems and site builders, and you're just, yeah.

As I said, I pulled these from a lot of random sites: vTiger, Site Studio, they're very popular, listed right there in the comments. So that's good, and now you know some content management systems that they may or may not have used. The last one says 'this was updated,' and if you look at the date, if you could see it, on the bottom it says June 11th, or November 17th, 2006. A lot of these sites haven't been updated in a very, very, very long time. So what else? Infrastructure. Some of you can read some of that, hopefully, but the top one is complaining about how git repositories are boring. If I'm attacking web stuff, I don't think

your git repository is going to be boring to me, just a hint. The highlighted one on the right there, the little control-M, does anybody know what that's going to tell me? Nobody? That's a Windows control character. Yep, probably a Windows shop, and they edited that file on Windows, because that's what shows up in Linux; not on my Mac, because it doesn't, but it shows up in Linux when you try to edit a Windows file. And the bottom says, do not edit settings in this file manually, they are managed automatically and will be overwritten when AutoConfig runs; for more information about AutoConfig, refer to the Oracle E-Business Suite

setup guide. So I wonder what they're doing; I don't know. And this is an entire slide of Joomla, which you can't see either. Wow, yeah, I'm really knocking all these pictures out of the park. I can show them to you if y'all want to huddle around up here, I mean, it's probably not going to work too great. But all right, what else? Oh yes. I guess for privacy's sake it's good you can't see that, but the very top one I pulled from redcross.com; if anybody works for the Red Cross, you may want to check your robots.txt file, because it had 'JM requested' and then a date, and then 'JM

requested' and other initials, all these initials of people who I'm assuming work there and didn't just submit anonymous requests. And the bottom one says 'Changed by Len.' I don't know who Len was, and I googled it; it turns out to be a name and not so much a product, so I'm going to stick with that being a human. But they kind of did comments right; at least they put their name in there, so you know who to go pester later in the future. Some more fun stuff. It's not all bad: you get Isaac Asimov references, you know, the three laws of robotics. Does anybody know what I'm talking about? Okay, thank

you. And then there's a little robot ASCII-art guy that somebody stuck in the etsy.com robots.txt. My second personal favorite here is the bitching; there is a lot of complaining. The top one, which you can't read, says Yahoo's bot is evil. I have no opinions either way. The next one says 'this technically isn't valid, since for some godforsaken reason sitemap paths must be absolute and not relative.' Oh, okay, that guy's got an axe to grind; if he put his name next to it, maybe we could work him, social-engineering-wise, because he's obviously unhappy. The bottom one says, notice: if you would like to crawl LinkedIn (that's where I got this from, by

the way), please email whitelist-crawl@linkedin.com to apply for whitelisting. I did not email that; I still pulled the robots.txt file, and they have a very deep misunderstanding of robots.txt, which we will get into later. However, my personal favorite is from Pinterest, and even though you can't see this, Pinterest is hiring, specifically the SEO team is hiring, and they gave an email address, seo-dev@pinterest.com. Don't bother; I emailed them asking, okay, did you actually get anybody to do this, and it just bounced back, nothing. Guess there goes that career opportunity. Okay, so I was really bored sitting watching Netflix, actually I think my daughter was watching Netflix, and so I was super

uninterested, and I thought... and I was watching it during the day at work while I was totally working, and whoever says anything else is lying. There was a BSides Columbus talk on robots.txt and some of the stuff I just showed you; I didn't steal his material, I got my own, thank you. And then I said, well, if you pull one or two of these at a time, what happens if you pull a thousand? So I wrote a script, which, thanks to this lovely magic television, you're not going to be able to see, and if you really want it I can send it to you. It's just an nmap scan: just query port 80. nmap has a feature, -iR, dash i capital R, that says random IP addresses. I do not know why that feature is in there, but it will go out and query random IP addresses. I piped that into something greppable, grepped it to see if anything was open, and then used wget to go get the robots.txt file.
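The actual script isn't visible on the recording, so here is a rough sketch of that kind of collection pipeline, written as a single Python script instead of the speaker's nmap-plus-wget loop. The target count, timeout, and output directory are arbitrary choices, and scanning random IP addresses is something you should only do where you have permission.

```python
# Sketch of the collection pipeline: nmap picks random hosts, we keep the ones
# with port 80 open, then fetch /robots.txt from each. Assumptions: nmap is
# installed, and "robots_files" is just an arbitrary output directory.
import re
import subprocess
import urllib.request
from pathlib import Path

out_dir = Path("robots_files")
out_dir.mkdir(exist_ok=True)

# -iR 1000: scan 1000 random targets; -p 80: web only; -oG -: grepable output.
scan = subprocess.run(
    ["nmap", "-iR", "1000", "-p", "80", "--open", "-oG", "-"],
    capture_output=True, text=True, check=True,
)

# Pull out the hosts that showed port 80 open in the grepable output.
live = re.findall(r"Host:\s+(\S+).*?80/open", scan.stdout)

for ip in live:
    try:
        with urllib.request.urlopen(f"http://{ip}/robots.txt", timeout=5) as resp:
            if resp.status == 200:  # beware: some servers return 200 for everything
                (out_dir / f"robots_{ip}.txt").write_bytes(resp.read())
    except OSError:
        pass  # dead host, timeout, 404, and so on; just move on
```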

This, in my mind, sounded like a great time. I set it initially to a thousand, thinking, my gosh, the robots.txt files are going to come flooding in. Turns out, no. I ran into issues, lots of issues. A lot of blank files were pulled, because some of you people, and I'm blaming you specifically, set up websites that return a 200 even when there should be a 404.

Oh, I hate you people, and I'm going to get to some examples later. There were also a lot of 404s, but that's fine, because wget said, I don't need to pull that. And a lot of them, even though it was a 200, it was still an error page or something; some of you just returned whatever you had as your index, like, oh, he obviously meant this instead of that specific thing he requested, we'll give it to him. So I jumped it up from 1,000 to 10,000, thinking surely I'm going to get thousands of these things now. I don't know what's going on; maybe the -iR option just is not as random and

cool as I thought, maybe I should have expanded to 443 as well, whatever, I'm not that bright. So, fun fact: when you use wget, it actually maintains the file date that the file was created with. Fun fact: if you do anything like edit or touch those files, that original date just goes away and you can't get it back, so don't do that if you're an idiot, which I was. But just trust me, most of the dates were very old; a lot of these robots.txt files have not been updated in a very long time, which is good and bad. Good for them, because new stuff has moved on and they don't have it indexed; it's bad for

them because the old stuff is probably still out there, it's not gone away. I mean, most of the time, even if you don't change the file, you don't end up deleting a lot of the web infrastructure and stuff that you have out there. So, as I said, issues with pulling files: not everyone was running a web server, shame on you; those that were didn't return something useful; and even if they did return something useful, it didn't have a robots.txt file. For example, that guy. This is an AWS instance that got randomly pulled, IP-wise, and no matter what you ask this website, it returns that picture. I kind of like this guy a little,

because I was trying to figure out why he has an AWS instance just to serve up that picture. It's a good picture, it's, I mean, nice. Also that guy, oh, you can read that one: this returned a 200, a 'here you go,' that literally says 404 on it. You went the extra mile, I guess, I don't know. I forgot, this was from some sort of home router login. Yeah, there's some weird stuff out there, man. If you go to the root of this, it says, here, log in, and I'm like, no, we're leaving. So I wrote a Python script after getting about 2,000 of these things, and allow me to now rant about user agents. A,

there's a million of them; B, they're not standard at all; C, it's just a random conglomeration of strings that somebody threw together and said, oh, this totally identifies me uniquely, and that may be right, but it's still a pain in the ass. Some site developers went a little crazy with the user-agent listing; instead of using the star, I guess they took that non-standard to heart and just, you know, I heard about this bot the other day, might as well throw that on there. I didn't care about the Allows or the Sitemaps, and I didn't look at the comments. The Python scripting is horrible and tiny, and I'm glad you can't read it, because it's literally just a bunch of parsing, and it made me feel bad as a programmer.
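The parsing script isn't shown, but a toy version of the user-agent tallying could look something like this. The robots_files directory and filename pattern match the collection sketch above and are assumptions, not the speaker's actual layout.

```python
# Walk a directory of saved robots.txt files and count User-agent values.
from collections import Counter
from pathlib import Path

agents = Counter()

for path in Path("robots_files").glob("robots_*.txt"):
    for line in path.read_text(errors="ignore").splitlines():
        line = line.split("#", 1)[0].strip()          # drop comments
        if line.lower().startswith("user-agent:"):
            agents[line.split(":", 1)[1].strip()] += 1

# Print the ten most common user agents and how often each appeared.
for agent, count in agents.most_common(10):
    print(f"{count:5d}  {agent}")
```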

So, what did I come up with? And Todd, you'll be happy, it's only 14 minutes and 21 seconds in. The most popular user agent was the star, non-standard but still popular, followed by Google. There were nearly a thousand user agents identified in 2,000 files alone; that is a lot, people just went nuts. A lot of versioning issues, a lot of Mozilla for some strange reason, I don't know. As I said, these are very hard to work with, and I can show you examples, apparently not on the screen, but you know.

What you might want to do is take a look at your own robots.txt file, I don't know why I said that weird, but you might have some stuff out there that will surprise you, because some of this stuff really surprised me. And when you're just surfing the web, you might want to change your user agent, because, and I don't know if this is still true, a certain answers site used to vary the results depending on user agent, so that it was googleable, but then when you got there they wanted you to pay for it, and if you just changed your user agent to the one from that Google search... not telling you to do that,

I'm just saying that's what happened. So here's a lovely chart showing star, and then Googlebot, then Slurp, then Bingbot. I don't know who came up with some of these names, but there are a lot of them. But I didn't care about that; user agents sucked. Disallow: I thought I would have a gold mine, like, everyone's hiding this certain directory, we should totally be looking for this certain directory. Okay, everyone's hiding slash, which means disallow everything; that is not as fun as it looks. I took a look at the results two different ways. Case-sensitive, because the web is, at least in this regard, and wp-admin was the most

popular, which makes sense, because setting up a WordPress install isn't exactly hard; maybe not secure, but not hard. Followed by admin, and I can understand that totally. What I found weird was, when I took case sensitivity out of it, analytics followed by departments came out on top, and I have a chart, because a presentation without a chart is just completely worthless, showing slash, and then analytics, department, shopping, and this is, as you can see, all uppercase, case-insensitive; yes, I just put them all to upper, I wasn't fancy or anything. And then here's the exact same data set, case-sensitive, and this kind of makes more sense: wp-admin, admin, includes, images, modules, gallery, and then it starts getting weird after that. Those numbers to the left are instance occurrences, if you couldn't figure that out.
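A sketch of that second tally, with the crude everything-to-uppercase case folding the speaker describes, plus dumping the unique paths as a wordlist for a forced-browsing tool like DirBuster. The directory layout and output filename are assumptions.

```python
# Count Disallow targets across saved robots.txt files, folding case by
# uppercasing everything, then write the unique paths out as a wordlist.
from collections import Counter
from pathlib import Path

paths = Counter()

for robots in Path("robots_files").glob("robots_*.txt"):
    for line in robots.read_text(errors="ignore").splitlines():
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            target = line.split(":", 1)[1].strip()
            if target:                      # skip bare "Disallow:" (allows everything)
                paths[target.upper()] += 1  # crude case folding, as in the talk

for target, count in paths.most_common(10):
    print(f"{count:5d}  {target}")

# One path per line, most common first; feed this to DirBuster or similar.
Path("disallow_wordlist.txt").write_text(
    "\n".join(t for t, _ in paths.most_common()) + "\n"
)
```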

So then I said to myself, self, this is fun and all, but what if you just went out and, as unscientifically as possible, started pulling different types of sites? So I said, let's look at some social media sites, some news sites, commercial sites, search engines, banks, colleges, porn sites, entertainment. This was not scientific; I just said, hm, what banks do I know? I didn't even even out the number of sites, sorry, it was completely unscientific. Social networks turn out to hide a lot of stuff in their robots.txt files, followed by the news sites,

and they kind of trail off down there. The first time I gave this talk, which was at a little group meeting, I had porn right next to college, and everyone thought that was funny, and probably more accurate than it should have been. Okay, so I took the time to split them out, and I wasn't really surprised by this, but I was surprised by the amount of stuff being hidden in just one file by a lot of people. As I said, I didn't really get surprised like I thought I would, I didn't really find anything super cool, but it was a fun little project, which should tell everyone in this room that next year you

should submit a talk, because if a guy who wrote an nmap script and a Python script in about 30 minutes can get to stand up here and yammer on about robots.txt, you can too. Yeah, I did have the chart, thank you, I guess that one; I'm over, that added a few minutes, thanks. So, as I said before, I'm going to try and do more analysis. I did come up with a list to put into DirBuster, because that'll be useful later on in life. I do need to refine my technique a little bit, but the robots.txt file is not the file you want to start parsing data

out of; if this is your first parsing project, start with something simple and less maddening. All right, anybody have any questions, comments, things to throw? Yes, sir. Question: so say you have a robots file and you do want to disallow a certain area, what's the best way, how should you arrange it so you don't give an attacker some sort of information? The best answer I can give you is, don't put it on the web, and if it has to be there, I don't want to say bury it, but maybe give it a non-standard name, lock it down permissions-wise so that it's protected by something, I don't want to say like basic

auth, but even doing that is only going to slow them down a little bit. This wasn't really meant as a security tool; it was meant as a 'please, you're knocking over my web server' fix, which was a problem back then and is still a problem. I mean, there's an entire industry centered around knocking over people's web servers. So, yeah. Question: with robots.txt, the fact that it isn't always followed, and other people can obviously take advantage of it... I know they have that with advertising too, where you're supposed to be able to opt out of certain advertisements because one doesn't appeal to you, or it's offensive or whatever, and there's

supposed to be a little blurb at the corner where you can say this is inappropriate, that kind of thing, and I mean, it's a standard, but not everybody follows it. So I guess the question would be, with all these legitimate crawlers that will actually follow this sort of standard, non-standard, do you think it's something that should have been standardized, or do you think it's kind of antiquated at this point? Not really. Here's the way I think about it: it's completely voluntary, I mean, the entire system is completely voluntary, so standardizing on a text file that says

don't do this stuff... there's no way to prevent it. I mean, DirBuster is a tool to brute-force directory structures, and there's no way to prevent people from crawling these sites as long as they have enough time, enough bandwidth, and enough resources to just randomly go through them. So I don't think there's an answer for preventing people from crawling your site, period, and there sure isn't a good one, other than: if you don't want it on the web, don't put it on the web. But sometimes you have to; it's just the nature of the beast. Anybody else? All right, thanks, guys.