
Next up on this track, we've got Zack Smith. He's a dark web researcher and security engineer. So, pull up your covers, have your night lights ready, your flashlights ready, because we're going to see some monsters in your backyard.
>> All right, so my talk is on mapping the dark web with Python. It's kind of self-explanatory — the name says we're going to be exploring the dark web. We're also going to be talking about build versus buy: how can you do this with your own tool versus with a company's tool? I'm too tall. So, how can you build a dark web scanner with Python, and how can you extract intelligence from the findings you pull with that tool? A disclaimer: the dark web is a dangerous place — it's in the name — and it's always best to follow best practices. Some of those best practices would be never download anything,
disable JavaScript, update your Tor browser, and use a VPN. Obviously, I'm not responsible for any negative consequences you may experience. Be safe and always keep OPSEC in mind. So, what is the dark web? A portion of the internet that can only be accessed using certain browsers. Many of you have probably heard of the Tor browser. How many have used it before? All right, a lot of us probably used it when we thought we were cool teenagers and wanted to go to a very dangerous place. It's not as cool as we thought it was, but if you know what you're doing, you can find some interesting websites. So, the good: many companies and organizations use it to avoid
censorship and promote free speech. You can access Facebook on the dark web, and several other sites and news services like that, but it's also used for drug trafficking, trafficking other illegal goods, and selling illegal services. What we care about is monitoring it. We can monitor it for potential threats, detect signs of a breach, and gather general intelligence. What type of intelligence can we gather? Breach intelligence is one I like to focus on: email addresses, IP addresses — just finding your IP on the dark web, which generally you don't want to happen. Other stuff you can monitor for is threat intelligence,
like what the threat actors are currently doing, what vulnerabilities are being talked about, and what signs of targeted attacks are showing up. So, build versus buy. I'll only talk about this a little bit because I'm sure you've all heard talks about this before. But when you're looking at dark web intelligence, why would you build or why would you buy? If you build, it's a one-time upfront cost and you can control how much you spend: if you want something complicated, you can build a complicated tool; if you want something simple, just going out there looking for basic patterns, it's relatively cheap to build. You would
have more control over features and your roadmap if you build it yourself, and, like I said, it can be as simple or complex as you want. This does, however, require in-house development skills. That's become easier over the years as dev tools have gotten more advanced, but if you don't know how to code, it might be hard to build your own tool. Buy, on the other hand, may have a lower upfront cost than developing a comparable tool, but if you buy it, you have to keep paying for it over time — you don't own it yourself. You would also have less control over the functionality and
roadmap of the tool, and it likely wouldn't fit your exact needs. You wouldn't be able to fix it because you don't own the source code. The big pro, though, is you don't have to maintain it, and there are no in-house development skills required. Then at the bottom you can see I just included a few of the threat intelligence providers that I've worked with myself. So, how my tool works. These are the dependencies. The easiest way to connect to the Tor network from a Python script is really just to run the Tor browser in the background and connect to it — hijack its session, essentially. I store all of my findings in SQLite and view them with DB Browser for
SQLite, which I'm guessing you've used before. And then these are just the Python libraries; I'm not going to read through each one, but you can see how they are used. So, environment setup. Like I said, you're going to be hijacking the session of the Tor browser. First step: install the Tor browser, and make sure it's the most recent version — if it's not, it will be less secure and might also not work properly. Once you have it installed, there is a torrc file in Tor's configuration. That is where you can decide what SOCKS port you want Tor to listen on.
You can see in this screenshot that I picked 9051 as my port; I just added that one line and didn't edit anything else in the file. After saving that file, I can proxy all of my requests through that port and connect to any onion service I want. Also pip install any of the Python libraries you don't currently have on your system. An overview of the script: it starts with creating a session — all the logic for that is stored in an onion session class — and then that session is used by a plugin. The first plugin I built was the simple onion
plugin. It just generally scrapes information from an onion service page — nothing too crazy, it just pulls down the HTML content. Then I've built multiple scanner classes. One of the common ones I use I just called the simple omniscanner. It looks for onion services on the website, as well as any regex pattern you want to look for, and then you map it using the network map class, which I will show shortly. So, this is the code for creating the session. Nothing too crazy — you can see in the top line where we're setting the port, and that port would be the same one you set the SOCKS port to.
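Roughly, that session amounts to a requests.Session proxied through the SOCKS port added to torrc — a minimal sketch, where the class name and port are illustrative rather than the repository's exact code:

```python
# Minimal sketch of an "onion session": a requests.Session proxied through
# the Tor Browser's SOCKS port. Requires `pip install requests[socks]`.
# Port 9051 matches the SocksPort line added to torrc in this example.
import requests


class OnionSession:
    def __init__(self, socks_port: int = 9051):
        self.session = requests.Session()
        # socks5h (not socks5) so DNS resolution also happens over Tor,
        # which is required for .onion hostnames.
        proxy = f"socks5h://127.0.0.1:{socks_port}"
        self.session.proxies = {"http": proxy, "https": proxy}

    def get(self, url: str, timeout: int = 60) -> requests.Response:
        return self.session.get(url, timeout=timeout)
```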
So, the simple onion plugin. This is where you fetch the content from the page. It's nothing too insane: it pulls the page down and only really keeps the text if the response is a 200, because if it's not a 200, you likely don't care.
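As a rough reconstruction (the names here are mine, not necessarily the repo's), the plugin boils down to:

```python
# Sketch of a "simple onion plugin": fetch a page through the onion session
# and keep the body text only when the server answers 200.
from typing import Optional


class SimpleOnionPlugin:
    def __init__(self, session):
        self.session = session  # an OnionSession-like object

    def fetch(self, url: str) -> Optional[str]:
        try:
            resp = self.session.get(url)
        except Exception:
            return None  # connection failures are common on the dark web
        if resp.status_code == 200:
            return resp.text
        # Non-200 responses (403, 502, 504, ...) get logged elsewhere and skipped.
        return None
```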
The simple omniscanner, like I said, is based off of a linear scanner class, which is based off of a base scanner class. The reason it's built like that is I'm working on an asynchronous scanner, so you don't have to wait for each service to scan one at a time — you can run multiple scans at once. This scanner pulls down most of the content: it takes the content in as part of its parse content function, and there is a run scan function which actually uses the plugin and then passes the result down to parse content.
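A minimal sketch of that scanner idea, assuming the session and plugin sketches above — the real class hierarchy and regexes in the repository will differ:

```python
# Sketch of the "simple omniscanner" idea: pull a page with the plugin, then
# regex it for other onion services and for any keyword patterns. Method
# names mirror the talk (run_scan / parse_content) but are a reconstruction.
import re

ONION_RE = re.compile(r"[a-z2-7]{16,56}\.onion", re.IGNORECASE)


class SimpleOmniScanner:
    def __init__(self, plugin, keywords):
        self.plugin = plugin
        self.keyword_res = [re.compile(re.escape(k), re.IGNORECASE) for k in keywords]

    def run_scan(self, url: str):
        content = self.plugin.fetch(url)
        if content is None:
            return [], []
        return self.parse_content(content)

    def parse_content(self, content: str):
        onion_links = sorted(set(ONION_RE.findall(content)))
        findings = [r.pattern for r in self.keyword_res if r.search(content)]
        return onion_links, findings
```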
So, a demo. Imagine you work for a company called Suspicious Security — the reason for that name will become evident very quickly. You're a threat intel analyst, and your boss wants you to find a creative way to improve the company's dark web intel portfolio. But, as many of you have probably become familiar with, you don't get a budget — you've just got to make things work. So you find a dusty web server in the back room and hopefully you can figure something out. What do you do to start? You've got to sow the seed. There are several dark web search engines you can find links to on the internet. One is Excavator — that's a great place to start: if you have certain search words you're looking for, you can search for them there, use it as the first place to start your search, and find onion services you care about. There are also dark web directory services, but often when you look at them, they link to drug websites, and if you're looking for threat intel sites, I don't often go to those — they're selling something I'm not interested in. So, if you
use a search engine, be cautious, because they have sponsored content and you will see some content that you don't want to. That's why I always recommend scanning rather than looking at these yourself, and do not cache content you are not sure is legal. There will be very illegal pictures, very illegal content, that you do not want saved to your device — be very careful of that. One great source, if you're looking for ransomware services or anything like that: there are a few different shared intelligence platforms. Ransomwatch is a great one, as well as deepdarkCTI, which is on GitHub. These are just lists of onion services belonging to known
ransomware threat actors, and they're updated daily if not weekly. So, pick your keywords carefully. For this case study, these are the keywords I'm using, shown here in the SQLite browser. Some of these are my customers, like Sole Semiconductor, Robin Hood, and LNS Mechanical. One thing you can also do with something like this is scan for the vendors you care about — say you use Google or Azure, and you want to know if Azure has been hacked, so you start plugging in vendors. It doesn't have to be specifically your company. And then there are also general keywords like ransomware: I want to know if my company or anything like that shows up on a page that has the keyword ransomware, RaaS, leak — anything like that becomes a bit more of a red flag.
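As a hypothetical illustration of what seeding that database could look like — the table names and the database file here are my own invention, not the repository's schema:

```python
# Hypothetical seed setup for the case-study database; the real schema in
# the repository may differ. Run once before the first scan.
import sqlite3

conn = sqlite3.connect("suspicious_security.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS seeds    (url TEXT PRIMARY KEY, scanned INTEGER DEFAULT 0);
CREATE TABLE IF NOT EXISTS keywords (pattern TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS findings (url TEXT, keyword TEXT);
CREATE TABLE IF NOT EXISTS links    (origin TEXT, target TEXT);
""")

# Customers, vendors, and general red-flag terms from the case study.
keywords = ["Sole Semiconductor", "Robin Hood", "LNS Mechanical",
            "Azure", "ransomware", "RaaS", "leak"]
conn.executemany("INSERT OR IGNORE INTO keywords VALUES (?)",
                 [(k,) for k in keywords])

# Seed onion URLs gathered from Excavator, Ransomwatch, deepdarkCTI, etc.
# (placeholder address below -- not a real service).
seeds = ["http://exampleonionaddressxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion/"]
conn.executemany("INSERT OR IGNORE INTO seeds (url) VALUES (?)",
                 [(u,) for u in seeds])
conn.commit()
```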
The next step, once you have your scan — you can see the command line up top — is to just run it, give it your database name, and wait. Since this first version is a linear scanner, it takes some time. Now, if you keep your seed sharp — I think I started with a seed of about 300 URLs — it works and it just runs. I ran it over an hour or two, which wasn't great, but the asynchronous version, which should be done shortly, will work better. From there, each website that links to other onion services — you can see a few of them under this first boxed origin — has those links added to the database and scanned at a later time. And at the bottom, in the logs, it just prints out: hey, we found a pattern match for these different keywords.
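A minimal sketch of that linear crawl loop, reusing the hypothetical schema and scanner sketches above:

```python
# Sketch of the linear crawl loop just described: take the next unscanned
# URL from the database, scan it, queue any newly discovered onion services,
# and record keyword matches. A real run would add error handling, retries,
# and rate limiting.
def crawl(conn, scanner):
    while True:
        row = conn.execute(
            "SELECT url FROM seeds WHERE scanned = 0 LIMIT 1").fetchone()
        if row is None:
            break  # nothing left to scan
        url = row[0]
        target = url if url.startswith("http") else "http://" + url
        onion_links, findings = scanner.run_scan(target)
        for link in onion_links:
            conn.execute("INSERT OR IGNORE INTO seeds (url) VALUES (?)", (link,))
            conn.execute("INSERT INTO links (origin, target) VALUES (?, ?)", (url, link))
        for keyword in findings:
            print(f"[+] pattern match: {keyword!r} on {url}")
            conn.execute("INSERT INTO findings (url, keyword) VALUES (?, ?)", (url, keyword))
        conn.execute("UPDATE seeds SET scanned = 1 WHERE url = ?", (url,))
        conn.commit()
```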
So these are the findings — I'm going to test this out real quick; I might regret this. All right, so this is the map of the onion services we've discovered. You can see here there are several different
findings for a few of our customers. One of them is Grafton Technologies, and you can see ransomware is one of the keywords, which would indicate that it's a breach. The reason these are all clustered together is that this is the same website, but the ones on the edge are actually topic pages, so each discusses more about what data was leaked from that specific target. And then you can see just a few others. Then there are all of these hotter items on the edge — the yellow and red ones — and you'll find that each of them has six findings related to Hewlett Packard and a couple
related to Robin Hood. Either way, these are actually all mirrors of the same service. You'll run into that eventually: if you just keep scanning URLs, you'll find mirror sites, which is very common. I have not built out any functionality to detect that and deduplicate your findings or get rid of that noise, but it is something you will likely run into. This one on the edge has no other sites connected to it and nothing really came out of it, but there is one finding — so that would be one website. Going back to the presentation.
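The map itself could be drawn with something like networkx — a sketch under the same hypothetical schema; the repository's network map class may work quite differently:

```python
# Sketch of the network-map idea: one node per onion service, edges for links
# discovered between them, node size and color scaled by finding count.
# Assumes the hypothetical links(origin, target) and findings(url, keyword)
# tables from the seed sketch above.
import sqlite3
import networkx as nx
import matplotlib.pyplot as plt

conn = sqlite3.connect("suspicious_security.db")
graph = nx.Graph()

for origin, target in conn.execute("SELECT origin, target FROM links"):
    graph.add_edge(origin, target)

finding_counts = dict(
    conn.execute("SELECT url, COUNT(*) FROM findings GROUP BY url"))

sizes = [100 + 200 * finding_counts.get(node, 0) for node in graph.nodes]
colors = ["red" if finding_counts.get(node, 0) else "steelblue" for node in graph.nodes]

nx.draw(graph, node_size=sizes, node_color=colors, with_labels=False)
plt.show()
```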
All right. So, of the findings we just saw, one of them was Bash. It mentions one of the vendors our customers use, and that would be Hewlett Packard — or HPE CDS, which is one of their cloud products — and just monitoring for your vendors can help identify potential supply chain threats. As you can see, there are several different findings. This is not what the site looks like on the dark web; it's just a stripped-down version. It doesn't pull the styling, the CSS, or the JavaScript, because I don't want to execute their JavaScript — I do not trust them. But this is generally what you can look at. When I know that a site is
legitimate and I know it's a ransomware actor — like if I got it from a trusted threat intel provider — I will cache those results, which is how I got this display: I have the HTML stored in a cache folder and I can view it there. One is MetaEncryptor; that's where we saw the Sole Semiconductor finding. Again, you see it's stripped down. For some reason, ransomware actors want their sites to look pretty — they're just like us, they're just hiding it. This is a known ransomware actor that is tied to LostTrust ransomware. The interesting part here is that this was posted two years ago, so it may or may not
be relevant — they may already know about this, and if they don't, they're in for a shock. Play ransomware — this is a big one. As you can see, there are 59 pages of results with up to 12 findings each. When looking through this, I actually had to create a new scanner for this site because it is built using PHP. I had just been doing very basic pattern recognition looking for other onion services, but to follow the URLs and actually click the links, the site uses functions like goto_topic, feeding in the topic ID, so it effectively goes to topic.php and
passes in the ID. This is what one of the topic pages looks like. As you can see, it shows what kind of information was leaked as well as a download link. Those download links mean the data is already published, so if this is a company that cares that it's been hacked and didn't know about it, it's frightening that the data is already out there. So here is the custom scanner that I built for that. It looks for specific strings — in this case, goto_page or viewtopic — in the lower part of the parse content function, and from those matches it creates new onion links.
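In spirit, that extra parsing step could look something like this — the function names and URL layout are reconstructed from the description, not taken from the site or the repository:

```python
# Sketch of the Play-style site scanning: pull topic IDs out of
# goto_page(...)/viewtopic(...)-style calls in the HTML and turn them into
# internal topic.php links tied back to the origin page.
import re

TOPIC_CALL_RE = re.compile(r"(?:goto_page|viewtopic|goto_topic)\(['\"]?(\d+)['\"]?\)")


def parse_topic_links(content: str, origin_url: str) -> list:
    """Return internal topic links discovered on a page of the leak site."""
    base = origin_url.rstrip("/")
    return [f"{base}/topic.php?id={tid}" for tid in TOPIC_CALL_RE.findall(content)]
```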
This isn't ideal — it's a kind of ragtag way of doing it — but it works, and it adds each internal link to the database, referencing the original URL. So, what next? You'd set this up to run regularly. Realistically, you don't just care whether it happened once; it happens all the time, and you want to keep watching to see when it happens. I only used two seeds for my initial search, and there are several other resources out there, so I'd recommend finding other seeds to pull in, and I will keep adding seeds to the tool as I build it.
So, there will be other sources. And then, similar to how I built a scanner out for the Play ransomware site, you would want to build out scanners for sites that require additional logic. Maybe they're locked behind authentication and you've got to create a fake account just to get in; maybe they're locked behind a CAPTCHA and you want to add CAPTCHA-beating logic — something like that. You can build that into a new scanner so you can get even more content from these pages. And then there are other patterns you could search for. We talked briefly about vendors, but maybe you want to know which ransomware actors are tied to specific industries. If I see tons of
government keywords on a specific ransomware actor but nothing related to commercial targets, and I run an apparel company, maybe I don't care about that ransomware actor as much — still important to care about them, just not as important. And then there are also mentions of partners or competitors. If you run Adidas — this is just an example — and you hear that Nike got hacked by a specific ransomware actor, that actor looks at companies like yours, so you should start paying attention to what vulnerabilities they use and what exploits they're known for. And yeah, the possibilities are endless — you can find a lot of different information out there, you've just got to start looking. In summary, the dark web can
be dark — like I said, it's in the name — but it also provides a lot of useful intel. The data is out there; you've just got to go get it, and a little bit of Python can go a long way. So, any other questions? >> The third slide, right here — from the start. >> Oh, the third slide. Yeah. >> Is this available? >> Um, I can post it to LinkedIn or Twitter — my Twitter handle is on the slide right now. I will post it to LinkedIn as well. I can just give you my LinkedIn, but I'll post it to both Twitter and
LinkedIn in case anyone wants it. And then on this last page — I saw many of you taking pictures of the source code — there is the GitHub repository, so you don't have to try to type up all the code, because that would not be fun. If any of you like this type of project, I've been working on it by myself and I'd be happy to collaborate with anybody who wants to do something like this. It'd be great to see more people using their own tools instead of paying hundreds of thousands of dollars for a pattern recognition service that they could potentially build themselves — and especially if we all work together, we could build something and do it for
free. So, I always like open source. If you have any questions, feel free to ask now or catch me in the hall — I'm good either way. Thank you. >> How often have you encountered these leak sites that have CAPTCHAs? Because, you know, anyone who runs a site — these AI bots are crazy right now. Have the ransomware operators caught on? >> They do it all the time. Unfortunately, it makes my job a little bit difficult. There are CAPTCHA-solving tools you can use, but it's become pretty common for ransomware providers and even marketplaces to do that, just because they don't want bots to be
scanning their sites — and I want to be scanning their sites. So it's kind of like being on the other side, trying to protect your own site from getting scanned by bots, except this is a good reason to do it, so I don't feel as bad. >> Would you say it's like 50% that use CAPTCHAs? 20%? >> I'd say out of the ones I scanned, 20 to 30% is what I ran into. There were a few that dropped from the scan, and I can pull that up just to show what it would look like. So, you can see here there are a few that come up with something like a 502 code or
the like — everyone knows what these status codes are — and you can find a few different things. When you look at those status codes... let's see if I can just pull that up. Status code, that's what I want to type. All right. So, when you look at these status codes, you see things like 403, 504, 502. That means you connected to the site, and you can look into it a bit more and see why it got mad at me. A 403 makes me a bit suspicious that they don't want me there — so I want to be there. And that's when you would look at what custom functionality you have to build to scan their site and
make them mad. But yeah, stuff like that is what you can look out for. And then, if you see something like this SOCKS connection error, I might have just beaten up my computer too much and tried to scan too fast — that's why it's always good to run this on a server, not your workstation. Any other questions? >> Do you have any manual workarounds for those CAPTCHA issues or the authentication? >> Not yet. One thing that I've thought about for that is: once you solve the CAPTCHA, it'll give you a session that you can use, and after that, you establish the connection. I have thought about how I can pass that session
into the scanner, but the goal is to have this run on a server, so that would be a manual intervention. It could work, but I want to figure out how to do it automatically so I don't have to have a human signing in all the time. Realistically, if you're doing that, you know it's a ransomware site, and it might be easiest just to look at it yourself. I'm not sure it'd be worth the development cost, but it is a good idea. Anything else? >> Yeah, you had mentioned not wanting to run JavaScript... >> A lot of the ones that I've run into on the dark web are weirdly not: they pass an image and
they just ask what it is, because a lot of users on the dark web block JavaScript. It's a weird ecosystem where people have gotten so used to not running certain tools — they've adapted to not running JavaScript on so many sites that it's a completely different world. It's great. A completely different world where the developers are often not as good as the ones on legitimate sites, so it's not incredibly difficult to find workarounds. And yeah, good question. >> Any other questions? >> One more. Outside of the authentication and the CAPTCHA issues, are there any other gnarly problems that you're working on or that are on the horizon? >> Yeah. So, like I said, I'm working on
the asynchronous piece, and that will help quite a bit — at that point, I'm not going to be running it on my workstation, because I don't want to do that. Other than that, I've looked into the CAPTCHA piece and I am trying to figure that out; I'm actually talking to a friend at work about how we can do it. And then, beyond just scanner functionality — beyond what can I find — I'm trying to figure out how we can map CVEs to ransomware actors, the same way you can map ransomware actors to industries or potential targets. How can
you map those, and then subsequently map the CVEs into "you should really care about this CVE"? You often get that from threat intel providers, but if you can get it yourself and not have to wait for them to give it to you, it makes things a bit more proactive and you don't have to be as reactive. >> I wonder if MITRE publishes an API — I've only ever used their website. I'm wondering if they've got something. >> Yeah, so I have connected to the MITRE API as well as the NVD, and NVD provides a great API. MITRE stores their CVEs on GitHub, which is not great, but you can pull down the text files and parse them. NVD is way better: they have a search API where you can just provide strings — I want to search for these keywords, I want to search for XYZ.
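For reference, a keyword search against the NVD CVE API can be sketched roughly like this — parameter and field names are from the public v2.0 API as I recall it; check the NVD docs and rate limits before relying on it:

```python
# Rough sketch of a keyword search against the NVD CVE API (v2.0).
# An API key raises the request quota; respect NVD's rate limits either way.
import requests

NVD_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"


def search_nvd(keyword: str, limit: int = 20):
    resp = requests.get(NVD_URL,
                        params={"keywordSearch": keyword, "resultsPerPage": limit},
                        timeout=30)
    resp.raise_for_status()
    for item in resp.json().get("vulnerabilities", []):
        cve = item["cve"]
        print(cve["id"], "-", cve["descriptions"][0]["value"][:80])


search_nvd("Play ransomware")
```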
On my GitHub — the same GitHub where I have this project, which is called Marauder; you saw the map, and if any of you are Harry Potter nerds, maybe you'll put that together — I have an NVD wrapper that you can use to send a Python request just using the requests library. All right, we've got about a minute left. Any quick questions, or else we can end it now. All right. Thank you.