
Excellent. Well, I hope everybody's doing good today. I spent a long time getting this presentation done. I wanted to make sure not to front-load too much programming-heavy stuff and instead give you the open source intelligence gathering, the juicy bits, first, so that if you're like me with a short attention span you get the more important stuff up front. Then we'll move into the details of how to troubleshoot a little bit and look at some more core Python functionality. So, a little bit of
introduction about myself. Give me a second, a little introduction about myself. I'm Jeff, of course. I'm a web app pen tester and I play a major role in forensic investigations here at Alias. I have two certifications: one is the GSEC, the GIAC Security Essentials, and the other is the GCFE, the GIAC Certified Forensic Examiner. Those were pretty difficult to get, but I've used them quite often in my job role. I have 14 years of cumulative experience in information technology. I basically got started at Dell when I was about 21 years old doing break-fix, and I learned a lot about user behavior and got my start in programming doing some PHP and MySQL there. So I've been in the industry for a minute. Professionally, I just started working for Alias about a year ago, so we're coming up on a year. It's a great team of people, very skilled, very talented. I'm also about to be a dad to a baby boy named Callum. It's going to be an interesting experience being a new father, but hopefully we get along and I can teach him a little bit of hacking. Alright, so a little bit about us.
At Alias we do cybersecurity assessments, penetration testing, social engineering, security awareness training, incident response, and digital forensics. You guys heard from Dawn earlier about the red teaming side, finding hackers. We were founded in 2010 and we have over 30 years of experience in performing all these services. Let's see here... alright, sorry, I'm a little bit nervous, man. So, why Python? You hear a lot of talks, a lot of people talking about different security tools or different scripting tools for any kind of programming and for security work. Python is the most common, but you also hear people using Bash scripting straight from Bash, and you hear about people using JavaScript to collect data and do intelligence. In general, though, I've found that Python is the easiest to use, for me personally, because of a few things. One of those is automatic memory management: if anybody has ever tried to code in C, which requires you to declare how much memory you're going to use, use that memory, and then free it up, Python does all of that for you, so the coding is a little less complex when you don't have to manage your own memory. Another good feature of Python is that it's dynamically typed: instead of having to declare that something is a string or an integer before it's labeled, you can just pick a name and assign a string or integer value to it without declaring it first. That makes code a lot easier to understand at times, when you don't have to initialize each variable before you use it. And the most important thing, like I touched on, is the human-readable syntax. Python has a bunch of conditionals, and some of my favorites are the words not, or, and is. What that means is that in Python I can say: if an integer is not 100, kick me into an error message. Basically, these are human-readable words that let you perform logic operations in Python.
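As a quick illustration (my own minimal sketch, not code from the slides), a conditional like that reads almost like English:

```python
status_code = 404

# "not", "or", and "is" read like plain English
if status_code is None or not status_code == 200:
    print(f"Error: unexpected status {status_code}")
```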
Okay, so, interactive mode. Since Python is a scripting language, it lets you open up the interpreter, which starts you in interactive mode, and that lets you test out different code. Essentially you can try importing a library, you can send a command, and you can look live at the responses, at variables and what data they hold, without having to save a script and then run it through Python. It's an interactive live console, much like Node.js if you've ever used that as a console. In this image we start by importing requests, then we make a simple GET request to our website, aliasinfosec.com, and we want to figure out the status code. We're able to do that live in here, like I said, as well as search for strings inside the HTML response; here we're looking for the words "cyber security" in the response from the GET request to aliasinfosec.com, and of course it is in there. We can also look at the response headers, for example what content type was set by the server. So we have all these things that we can (a) access and (b) do live, without having to worry about code management as much.
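A rough reconstruction of that interactive session (the exact URL and output here are my approximation of what was on the slide):

```python
>>> import requests
>>> r = requests.get("https://www.aliasinfosec.com")
>>> r.status_code
200
>>> "cyber security" in r.text.lower()
True
>>> r.headers["Content-Type"]
'text/html; charset=UTF-8'
```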
Alright, so the title of my talk is "Async Intelligence Gathering with Python," and the reason I wanted to include that caveat is the crazy speed advantage that asynchronous programming gives you. In a typical synchronous client-server interaction, the client makes a request, the server receives it and starts processing, but the client waits for the server to respond before it proceeds with another request or a follow-up request. You'll see this kind of thing in Nmap, where, depending on how you've configured it, a standard configuration or scan will scan one host at a time, sequentially. Asynchronous execution basically allows us to issue many requests without waiting for the server's response before moving forward. On this graphic, and I don't know how easy it is to see, let me get a laser pointer for you guys, you can see the control flow: the client can make a request and continue working even though the server hasn't responded yet. A good practical example of asynchronous versus synchronous: if you're doing a scan and one of the ports is hanging, a synchronous program can't move on to scan the other ports; it's waiting for the response from that port. With asynchronous scanning, even though it doesn't have a response from that port yet, it continues with its port scan. That lets us make many requests essentially all at once, then wait for everybody to send a response back while we continue processing.
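To make that port example concrete, here's a minimal sketch of the idea using only the standard library (my own illustration, not code from the talk): each port check runs as its own task, so one hanging port doesn't block the rest.

```python
import asyncio

async def check_port(host: str, port: int, timeout: float = 2.0) -> None:
    try:
        # A hanging or filtered port only stalls its own task, not the whole scan
        _, writer = await asyncio.wait_for(asyncio.open_connection(host, port), timeout)
        print(f"{host}:{port} open")
        writer.close()
    except (asyncio.TimeoutError, OSError):
        print(f"{host}:{port} closed or filtered")

async def main() -> None:
    ports = [22, 80, 443, 8080]
    # All checks are kicked off at once instead of one at a time
    await asyncio.gather(*(check_port("scanme.nmap.org", p) for p in ports))

asyncio.run(main())
```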
Does anybody have any questions that I could clear up? I may be being too technical or talking too fast. Is everybody good, or any questions up to this point? Real quick, if you could just verify that your Discord is muted; give me one second here, I'll mute that.
Alright, there we go. So yes, asynchronous versus synchronous has major speed advantages. Let's look at two tools that are commonly used in our line of work and do a comparison. Nmap is a great tool for doing port scanning, and so is masscan, but like I said, Nmap does it synchronously. Masscan is very fast; it uses asynchronous scanning, and it can return a wide number of hits for open ports way quicker than Nmap. The other differences between them are that Nmap is a lot more full-featured: you can run different scripts within Nmap, and you can set different timing. They share similar syntax, but in general masscan is basically "a port is open or a port is not." It does have support for some banner grabbing, but in general it's just good for open ports, unless you harness or leverage the power of its results, which we'll do in a minute. Both of these scanning suites have the ability to output in JSON, which is JavaScript Object Notation, or to output in greppable format, a standardized format you can easily grep text out of.
So let's take a look at an Nmap scan, and we'll time how long it takes. I basically picked a random CIDR range in Oklahoma City and I'm scanning just the web ports, and we're timing that. As you can see, we scanned 256 addresses and it took about two minutes. If I run the same scan with masscan on the same CIDR range, I can scan those same 256 hosts in only 26 seconds. The difference in those two is that masscan came out 128% faster, and this is just with a single CIDR range. The speed difference between asynchronous and synchronous scanning compounds as you scale: even though it was 128% faster on that particular CIDR range, the wider your range is and the more hosts you scan, the more speed advantage you get out of masscan. Here are some Python packages that are very useful, especially in the web scraping and intelligence gathering realm. Requests makes HTTP requests; it's a full library where we can do GET, HEAD, POST, and OPTIONS requests. Beautiful Soup parses the HTML that comes back from a request.
And Selenium is a little more of an interesting beast: it requires a separate binary, but what it does is actually emulate user behavior in the browser. We'll take a look at the details of these packages here, but these are some of the tools and packages that I use the most within the Python environment. The requests package, like I said, lets us make GET requests and POST requests; we can set custom headers, set custom cookies, and define POST parameters, and it has support for sessions, so we can carry cookies and data across different requests. That will become very useful in one of our case studies. In this example, the first line imports requests, then we define a response and POST this data to this URL. You see RequestBin here, another tool I'll touch on a little: RequestBin is able to pick up this request, and it's a good way to debug data you're trying to send out. It's a good way to see how your structure looks, whether you're using the right headers, the right encoding, et cetera. So again, this is RequestBin; it's a great place to post data if you have some endpoint testing you need to do. I wouldn't use it for any sensitive or classified information.
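A minimal sketch of that kind of debug POST (the bin URL and payload here are hypothetical stand-ins for what was on the slide):

```python
import requests

# Hypothetical RequestBin-style endpoint; substitute the URL your bin gives you
url = "https://your-bin.example.com/endpoint"
payload = {"target": "example.com", "note": "structure check"}
headers = {"User-Agent": "recon-debug/0.1", "X-Debug": "true"}

r = requests.post(url, data=payload, headers=headers)
print(r.status_code)
```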
It's very easy to derive the actual dashboard URL from the posting URL up here, but it is a great tool, and there are a few others out there; they're basically bin-type places where you can send POST requests and see what data you're actually sending out. Here's an example of some code for Selenium, and I'll touch a little on the difference between Selenium and Beautiful Soup 4. They both involve HTML parsing, but the difference with Selenium is that I can actually do things like click on links.
I can look for every link in a website, click on those, and once I'm in there, look for content, so there's a lot of JavaScript interaction you can do with the browser through Selenium as well. In this code sample (I've included line numbers in these code slides, so if anybody has a question about a specific line you can reference it easily), what we do is start a new browser object with a Chrome WebDriver and load up yahoo.com. What you don't see in this code is that I've made a query for yahoo.com, and for each letter in that query I type it out onto the screen and then sleep randomly between one tenth and eight tenths of a second. What this does is make the keyboard input look like a user is typing. There are different APIs you can use to access Google searches, but I just don't trust them as much as I do scraping the actual Google site while emulating a user. If you do a lot of automated requests, say with the requests library instead of Selenium, you usually get flagged by Google and it makes you do a CAPTCHA. Well, I don't want to be hit by a CAPTCHA when I'm trying to do research, but I also don't want API-filtered results.
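Here's a rough sketch of that typing behavior with Selenium 4 (the search-box locator and the query are my assumptions; the slide code isn't reproduced here):

```python
import random
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

query = "cyber security"
driver = webdriver.Chrome()                      # needs Chrome plus a chromedriver binary
driver.get("https://www.yahoo.com")

search_box = driver.find_element(By.NAME, "p")   # "p" is Yahoo's search field; adjust if it changes
for letter in query:
    search_box.send_keys(letter)
    time.sleep(random.uniform(0.1, 0.8))         # random delay so the typing looks human
search_box.send_keys(Keys.ENTER)
```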
Like I said, and I'll show a demo of this, this basically allows us to emulate a user's mouse and keyboard within a browser, and that creates a lot more value when you're testing sites. It does, though, require you to download a separate binary, versus the other libraries and packages I'm showing. Selenium has support for all the major browsers, and it controls them by direct communication. Moving on to Beautiful Soup 4: like I said, this is for pulling content out of HTML and XML.
The advantage of Beautiful Soup is that, unlike Selenium, we don't have to load the entire website into memory. Selenium loads a browser and then has to load all of the content, resources, and assets, pictures and media, before it can start running, whereas Beautiful Soup 4 can just parse HTML directly. That's one of the advantages if you want to be quick. This is a scraper I made for the Oklahoma Secretary of State. It just so happened that they were using incrementally increasing IDs for every company, so I thought, why not take advantage of that and build a list of company names and email addresses for myself, and for possible leads. Right here we define our URL, and after we define the URL we make a GET request. If there's a 200 response, which means OK, we load Beautiful Soup and process that HTML. Once Beautiful Soup has been instantiated, this html object is what holds all that data. What you see on line 37 is that we have a table, and we're basically telling it to find, in the HTML, a div element with an ID of printDiv. In my research of the site, the same area always held the same data I needed, so because it was static I could target that specific element and then pull the company name from it. You see it again here: from html I'm finding the table by ID, then I'm saying the company name is whatever text is in that table under the h3 tag. That's the kind of power you have inside Beautiful Soup 4. Of course, I looped that and incremented the ID by one each time, and I got a decent amount of results, with no resets or blocking of the connection, so it was very fruitful.
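A condensed sketch of that loop (the URL pattern, ID range, and exact element casing here are my assumptions, not the actual endpoint from the slide):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical URL pattern with an incrementing filing ID
base_url = "https://www.sos.example.gov/corp/entity.aspx?id={}"

for filing_id in range(3512000, 3512010):          # small illustrative range
    resp = requests.get(base_url.format(filing_id))
    if resp.status_code == 200:                    # 200 means OK
        html = BeautifulSoup(resp.text, "html.parser")
        table = html.find("div", id="printDiv")    # the static element identified during research
        if table and table.find("h3"):
            company_name = table.find("h3").text.strip()
            print(filing_id, company_name)
```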
That's the way we take HTML content out of web pages. Do we have any questions up to this point? Okay. So one of the common issues you run into when you're gathering information or data is that the data you want is often buried among other sets of data that you don't want, and being able to pick out the data you want from a sea of data is very hard. Regular expressions are what we use to find that data. A regular expression is a sequence of characters that defines a pattern. As you can see right here,
we've got a string, which is our address, and then we say our ZIP code matches this pattern, if you can find it in the string. What this pattern says is: I need only the characters 0 through 9, and there have to be exactly five of them. What you can see from the regular expression is that we pulled the ZIP code out of a full address. Digging a little deeper into our realm, here's an HTML comment. Some developers, and I have been a developer and still am, include comments in their source code, which you never should, why would you, but they do. So in the event that we were parsing thousands of web pages, I might want to look for things like hashes, or encoded or hashed passwords. Down here I again define a pattern: it has to be 0 through 9 or the uppercase letters A through F, with a minimum of 16 characters, and you can see down here that we can pull that hex string out of the HTML where it's buried.
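Those two patterns look roughly like this (the sample strings are made up for illustration):

```python
import re

address = "1234 Example Ave, Oklahoma City, OK 73102"
zip_match = re.search(r"\b[0-9]{5}\b", address)        # exactly five digits
print(zip_match.group())                               # 73102

html = "<!-- TODO remove: password hash 9F86D081884C7D65 -->"
hash_match = re.search(r"[0-9A-F]{16,}", html)         # uppercase hex, at least 16 characters
if hash_match:
    print(hash_match.group())                          # 9F86D081884C7D65
```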
A great tool to use for testing out regular expressions is Pythex. Pythex lets you type your regular expression in at the top, set different options down here, multiline, dotall, verbose, and then you paste in a test string and it shows you live what data is being matched. With this expression right here I'm trying to find phone numbers that include the parentheses. It's very helpful; like I said, you can paste in an entire page of HTML if you just want to find something in a single element. Testing a regular expression like this before you use it in code or in the interpreter really pays off. So let's talk about a case study about how we can get data, or intelligence.
We have resources as engineers, like DNSdumpster, Qualys, and Shodan; there are tons of resources out there, but I hate paying for a lot of them, and I hate having to swim through a bunch of data just to find the actual piece that I need. So we're going to take a look at how we can get just the IP addresses out of DNSdumpster, and from that we'll look at how to use these tools to facilitate that process, and I have a few more tools to show you after that. You can see DNSdumpster returns a list of hosts, which is our domain names, of course the IP addresses, and who owns that block. If you've used it before you know that it limits the results, which is okay, but we're going to create a Python script that uses Beautiful Soup, requests, et cetera, to pull out this data. The first step, if we have an intelligence asset, we'll call it, and we want to pull data from it, is to open up the HTML. What I did was right-click on one of the IP addresses and hit Inspect Element.
That opens up inspector mode and lets you look at the HTML; it also lets you search within the HTML, which can be very useful, and when you right-click and inspect an element it takes you directly to where that element is in the page. In my research I noticed two things: one is that we have a cross-site request forgery token, and then we also have this input named synscan, and that looks like it holds the IP addresses we're wanting. Up here we have a row, and row cells, and basically multiple rows that contain these IP addresses. Knowing that synscan, from my previous research, is attached to nothing else but the inputs that hold these IP addresses, how can we get that out? So now we've identified a website that has data we want, we've inspected the source, and we've identified a unique way to pinpoint the data we want, but we still don't have a way to actually extract it. Here is a script that does just that. After my talk I'm going to be putting a lot of this code on my GitHub so everybody can see it; my GitHub is on the intro slide, but I can put it in the Discord as well.
Up at the top we start by creating a session, and then from that session, once it's established, we request the dnsdumpster.com URL and use Beautiful Soup to find that csrfmiddlewaretoken. Basically, this first request is like first hitting the site in a browser; that's when the middleware token is generated. So we make a plain request to the root HTML page to get that token, then we take that token and carry it through the session, but we also want to supply the target address. So, starting from the top: we start a session, we use Beautiful Soup to find the middleware token, then we take that middleware token and attach it to the target. The next thing we do is create a second, POST request and post that data. Once the data is returned, we identify all the table rows and loop through them looking for the synscan input. You can see the Beautiful Soup syntax right here, and it's fairly clear: if we want to find an element that's an input with the name of the middleware token, that's the syntax. Like I said, Python is very easy to read, and this dictionary style is a very easy way to pass parameters. Anyway, we loop through all the synscan inputs we identified after making that request, and if we do have an input with synscan in it, we append its value to a list. After we have that list we print a little bit of output, we choose a file name, which is built from the argument you pass, the domain name, plus a .txt extension, and we write everything in that list out to a file.
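Pieced together, the script looks roughly like this; the form field name ("targetip") and the per-row input lookup are my assumptions, since DNSdumpster's markup has changed over time:

```python
import sys
import requests
from bs4 import BeautifulSoup

domain = sys.argv[1]
session = requests.Session()

# The first GET generates the Django CSRF token, which we carry through the session
resp = session.get("https://dnsdumpster.com/")
soup = BeautifulSoup(resp.text, "html.parser")
token = soup.find("input", {"name": "csrfmiddlewaretoken"})["value"]

# "targetip" is an assumed field name; the CSRF check also expects a Referer header
data = {"csrfmiddlewaretoken": token, "targetip": domain}
resp = session.post("https://dnsdumpster.com/", data=data,
                    headers={"Referer": "https://dnsdumpster.com/"})

ips = []
soup = BeautifulSoup(resp.text, "html.parser")
for row in soup.find_all("tr"):
    ip_input = row.find("input")      # the "synscan" input; the exact name attribute isn't shown here
    if ip_input and ip_input.get("value"):
        ips.append(ip_input["value"])

with open(f"{domain}.txt", "w") as out:
    out.write("\n".join(ips))
print(f"Collected {len(ips)} IP addresses")
```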
Excuse me, I hope I'm not losing my voice, man. I've gone ahead and run that, and what you can see is that if I run the scrape program against our domain, it shows me it collected four IP addresses, and then if I type that file out (using the type command on Windows for this one), what we have is a list of IP addresses. Now we can use that with other tools. Obviously this isn't a huge list of IP addresses, but if you think about scripting it out, looping through and scraping IPs from many domains, you can collect a wide number of IP addresses. So I took it a little step further and thought: say I can scrape DNSdumpster and get a list of IPs; I still have to look them up. Another powerful feature in just about any language is sockets, writing raw data to a port. What I've done here is, up where I'm finding the synscan input, if it found one I try to get the host name by its address, which is gethostbyaddr, and I pass it the value from the synscan input. If there's an exception I set it to False, and I append that to my list. Again, since the list is written not to CSV but to a flat text file, I can type it out as well.
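That lookup is just the standard library's socket module; a minimal sketch (the IPs here are placeholders rather than the ones from the demo):

```python
import socket

ips = ["104.26.10.229", "172.67.70.123"]          # placeholder IPs; the real ones came from the scrape
results = []
for ip in ips:
    try:
        hostname = socket.gethostbyaddr(ip)[0]    # reverse DNS lookup
    except socket.herror:
        hostname = False                          # mirror the script: False when the lookup fails
    results.append(f"{ip}\t{hostname}")

with open("hosts.txt", "w") as out:
    out.write("\n".join(results))
```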
What you see now is that we've actually identified host names and IPs from just a single domain name. I'm sure there are other tools that can do this, and I know there are, but the whole idea behind using Python for this is the ability to chain these things together so easily. In one of my other slides I'll show how we chain Python to other programs, but essentially I can take a list of IPs and run a socket against them, or pass them on to something like DirBuster, or to something like Nikto, or I can even filter out things like NS or mail hosts by host name so I don't get any assets I'm not really interested in. Do we have any questions up to this point? Okay. So here's a demonstration of automating tools with Python. We know that masscan is fast, right? But if masscan is fast and asynchronous, and we write Python that's synchronous, it's going to be limited and there's going to be a bottleneck. What I realized was that a lot of our research takes a lot of up-front intelligence work.
If I have a wide number of hosts I want to check, I just want to see what the HTML content is, or maybe what the title tag is, and that's all I want to see, and I want to see it quickly, so I know whether I need to investigate further. The problem I was running into was that the synchronous scans were slow, so I went async; but even after going async it still took manual entry, or some further investigation. So this tool, and I'll show you some of the code behind it, basically takes a pre-made IP list or the CIDR ranges you've seen up here, loops through it, runs masscan, and converts masscan's output into greppable output. Once that output is produced, Python picks it up and starts making asynchronous HTTP requests to all the websites, pulling the first 256 characters from the HTML body. I'm going to demo this really quickly; I hope the video survives the compression, but let's go.
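Before the demo, here's roughly what the async half of that tool does; the talk doesn't name the HTTP library used for the asynchronous requests, so aiohttp here is my assumption:

```python
import asyncio
import aiohttp

async def preview(session: aiohttp.ClientSession, url: str) -> None:
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
            body = await resp.text(errors="ignore")
            print(url, body[:256].replace("\n", " "))   # first 256 characters of the HTML body
    except (aiohttp.ClientError, asyncio.TimeoutError):
        pass

async def main(hosts: list[str], port: int) -> None:
    async with aiohttp.ClientSession() as session:
        # Fire off every request at once instead of waiting on each response
        await asyncio.gather(*(preview(session, f"http://{h}:{port}") for h in hosts))

# hosts would come from the parsed masscan output shown later
asyncio.run(main(["198.51.100.10", "198.51.100.11"], 80))
```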
So here you can see I've got a masscan firing off; it's very quick. I don't know why it has this little built-in delay, but I'm sure we can turn it off somehow. Essentially, for everything that matched on port 80 within that range, what you see is me getting the little bits of HTML, the first 256 characters, and at the bottom you see 64 hits from 256 hosts in 20 seconds. So we were able to scan a 256-host CIDR range and, in 20 seconds, get a preview of all the HTML. Now, when I showed this to one of our other engineers, Andrew Lemon, he said, bro, EyeWitness does that. I looked at EyeWitness, and it does have some of this functionality and is very comparable, but it would require some modification, and like I said, we can chain this into other things, so while it is good, I like to make it myself if I can. This CIDR range, by the way, was one of Google's ranges, so the results are all pretty much similar because they're just load-balanced Google frontends. Here's another demonstration; this is one of the tools that I like the most. Like I said, it emulates user behavior in the browser, and it was created to find websites containing a specific term.
The longer-term goal would be historical: if I can find the top ten sites on Google for a certain term, and I can do that continually, and it alerts me if after ten days there's an anomaly, that may provide a little bit of information I can use for further investigation. There are multiple things you can do just by searching through sites in an automated way. Essentially, I start ChromeDriver, we find the search field in Google, I go inside of it, and then I start typing like a user would. What you see is "cyber security" being typed out, and you'll see the random delays, which I think are pretty slick. After it searches, it hits Next; I predefine a number of pages, say I only want it to search five pages, or ten, or unlimited, but say after five pages, look through each URL. Then it takes those URLs and passes them to requests, and after that it runs a regular expression looking for the text or term we supplied. After all of that is done, it gives us output of every URL that had that text inside of it, plus a little preview of the text. The reason it gives you a preview is that Beautiful Soup is finding the text within elements of the HTML; so if there was a paragraph block that said "my name is Jeff and I got attacked by ransomware," it would find that whole text instead of just the word ransomware. That's just a side effect of searching through Beautiful Soup 4, and it gives us a preview. Anyway, I've talked enough about it; let me go ahead and show you what it looks like.
So there it is. I want to play Cyberpunk 2077; I don't know if anybody else is interested in that, but anyway, we searched "cyber security," we typed it out slowly, and now the browser, as you see, says Chrome is being controlled by automated test software. We hit Next a bunch, and then what you see on the screen is that for every URL we got from the search, we're now parsing it for that word. The reason this is slow compared to the previous demo is that this part is actually done synchronously: I use Selenium in the browser, but then all the requests are processed synchronously. That's why async is so powerful; this would probably already have finished if we weren't blocking and waiting on other responses. You see the URLs we found up here, and you see the text, and like I said, the preview of that text. There are more interesting uses of this, and obviously there are nefarious uses too, but essentially, being able to find that list of URLs that match a search term without triggering Google into thinking you're a bot, and without risking getting filtered results through an API, is beneficial, and it feels good tricking the browser.
So like I said, we have tools that exist in Kali, in Linux, in whatever distro you use, but harnessing those tools, leveraging their power, can be difficult. Python makes it pretty easy; I'm sure most programming languages make it easy, but my example is Python. If you look at the first tool demonstration you saw, where we scan all those ports and then show just the HTML preview, we basically take a CIDR range and run a subprocess through Python. What you can see right here is that we're referencing the masscan binary; we're saying -p for a port and we put in the port the user provided; -oJ means output in JSON format; the dash means standard out; and then we give it a CIDR range. So this is running a masscan scan, and it pipes standard out and standard error into this variable. The next thing we do is look for the results, parse that whole JSON string, and loop through it; as we loop through it, we're basically prepending http:// to each host, with the port the user specified. That's essentially the way we can take masscan, which is asynchronous, and code that's asynchronous, and chain them together so that we don't have a bottleneck.
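A minimal sketch of that chaining step (note that masscan usually needs elevated privileges, and some older builds emit slightly malformed JSON, so a real script may have to clean up the output before parsing):

```python
import json
import subprocess

def masscan_urls(cidr: str, port: int) -> list[str]:
    # -oJ - writes JSON results to stdout; capture both stdout and stderr
    cmd = ["masscan", "-p", str(port), "-oJ", "-", cidr]
    proc = subprocess.run(cmd, capture_output=True, text=True)

    entries = json.loads(proc.stdout) if proc.stdout.strip() else []

    # Prepend http:// and the chosen port so the async fetcher can take over from here
    return [f"http://{entry['ip']}:{port}" for entry in entries if "ip" in entry]

print(masscan_urls("198.51.100.0/24", 80))
```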
That's why that first tool I demonstrated is so quick: it literally does not wait before it sends another request. Do we have any questions or feedback up to this point? Okay. So one of the big things in scraping, or in intelligence gathering, in my opinion, is avoiding detection. First is having quality data to scrape or gather, but then avoiding detection. In a lot of circumstances, activity that's repetitive is going to be blocked by a firewall or IPS; fail2ban is one of the Linux services that does this, and there are plenty of others. A lot of the time the source or originating IP and/or the User-Agent is used to identify a user, and if there are enough hits with the same User-Agent from the same IP, it's probably the same guy, versus, say, an office where a bunch of people sit behind one address. Regardless, any signal that can be used to determine whether something automated or scripted is running is something we want to try to defeat, because we don't want anybody in our business, essentially. Also, if we had a thousand URLs to scrape it would be different; scraping one URL or two, you can get by with that,
but scraping many URLs from the same block could be difficult, because you may get blocked. What you see here is that we define a list of user agents, a standard Python list, and you see some user agents you should be familiar with if you're in the industry. Then we define a proxy list; we define a local address on port 8080, and the reason I did that is that I actually routed this demo through Burp Suite for you to see. while True is basically an infinite loop: we select a random user agent, select a random proxy, and create a request; the next time we create a request, we select another random user agent, and we keep going and going. What we could have done is take a whole list of URLs and a whole list of proxies and make the requests through those proxies, and that would essentially look random, or it should look random, to a lot of firewalls and IDS/IPS devices. Heuristic analysis may still cause an issue, but if detection keys on something static like a source IP and User-Agent, then we may be able to get away with it.
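That rotation loop looks roughly like this (the user agents and target URL are placeholders; the only proxy is the local Burp listener, as in the demo):

```python
import random
import time
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]
proxies = [{"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}]

while True:
    ua = random.choice(user_agents)
    proxy = random.choice(proxies)
    r = requests.get("https://example.com",
                     headers={"User-Agent": ua},
                     proxies=proxy,
                     verify=False)              # Burp's self-signed certificate fails TLS checks
    print(proxy["http"], ua, r.status_code)
    time.sleep(random.uniform(1, 3))
```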
So let me go ahead and show you what I'm talking about. You'll see in this demo that I have Burp Suite loaded; if you haven't used Burp Suite and you're in web, or even in IT generally, it's something interesting for seeing how traffic is shaped and how traffic, HTTP traffic especially, moves through your network. We've started a Burp Suite proxy on port 8080, and just like our earlier example, we make random requests but try to switch up the proxy or the user agent; I'm not switching the proxy here because I only have one. So let's go ahead and run that code. I'm printing out which proxy was selected, the user agent selected, and the URL I'm trying to access. I killed the code, which is why you saw the error, and what you see here is me stepping through Burp Suite. If you'll notice the user agent, let me get my pointer here... anyway, the pointer is being laggy because of the video, but you can basically see it right here: the User-Agent down here keeps changing.
So that's the intelligence gathering section, that portion of it; like I said, I wanted to give the juicy bits up first, but these are the nuts and bolts of the work. Let's go ahead and talk about storing data, and a little bit about debugging and virtual environments. I won't get too far into the weeds for you guys; I hope I haven't already, but if I have, I apologize in advance. Storage modules: everybody's heard of CSV, and most people have heard of SQL.
With CSV, anybody can open the file in Excel, or edit and save CSVs in Notepad or any other text editor. SQLite has basic SQL functionality and syntax, but it has very limited data types: you have text or integer, but not the more advanced data types that MySQL has. The difference between MySQL and, say, SQLite is that MySQL has users and permissions. So I would say CSV to SQLite to MySQL is a progression, and a good one, depending on how large your environment is and how much data you need to collect. For the sqlite3 module, you'll notice that most of my code examples show the import statement, so this is almost a complete block of SQLite code. I import sqlite3 and define the file I'm going to load; in this example it was the results of the companies I scraped from the Secretary of State, and I basically want to find the last filing number so I can continue my search from there. So I provide my SQL query, get a cursor from the connection, and execute my query on the cursor; after that I can loop through the rows and view all the data that's in my SQLite database.
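Something like this, where the database file name and the table and column names are my stand-ins for what's on the slide:

```python
import sqlite3

conn = sqlite3.connect("companies.db")            # file produced by the Secretary of State scraper
cursor = conn.cursor()

# Find the last filing number so the next scrape can resume from there
cursor.execute("SELECT MAX(filing_number) FROM companies")
last_filing = cursor.fetchone()[0]
print("Resume scraping from:", last_filing)

# Loop through the rows to see everything stored so far
for row in cursor.execute("SELECT filing_number, name FROM companies"):
    print(row)
conn.close()
```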
SQLite, according to its website, is 35% faster than reading or writing individual files in the file system, so if you're using CSV and your application is writing many times at once, it would probably make sense to use SQLite instead of creating new files or hitting the file system repeatedly. The database file is also portable across systems: you can send a SQLite file to a friend and they can open it with a SQLite browser, whereas you can't do that with MySQL, though you can with CSV. So it really depends on your use case which one you'd pick.
Here's a good CSV example. I start by importing the csv module, plus requests and Beautiful Soup, why not use them all. I define a predefined list of URLs, some sites I frequent often, GitHub and Hacker News, plus our own website. I make a blank results list and loop through the URLs, requesting each one at a time; then, if the status code is 200, I find the title and append it to my results. Another thing I want to touch on real quick is f-strings in Python: the f you see before a string lets me place any variable directly inside the string without having to concatenate; I've fallen in love with them so far in Python. Digging further into the code, we open a CSV file with write permissions and specify that we want to quote non-numeric values, which essentially just means quotes around the strings, then we write the row with title and URL. That gets our first row written, and then we loop through our results and write the subsequent rows.
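A compact version of that script (the URL list is my guess at the sites mentioned, and the output file name is made up):

```python
import csv
import requests
from bs4 import BeautifulSoup

urls = ["https://github.com", "https://news.ycombinator.com", "https://www.aliasinfosec.com"]

results = []
for url in urls:
    r = requests.get(url)
    if r.status_code == 200:
        title = BeautifulSoup(r.text, "html.parser").title.text.strip()
        results.append((title, url))
        print(f"Grabbed title for {url}")          # the f-string templating mentioned above

with open("titles.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(["title", "URL"])              # header row
    writer.writerows(results)
```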
What you see here is that output: we said the headers should be title and URL, and so we have the title of our site and its URL; the title of GitHub (there was probably some UTF-8 issue going on right there, but that's okay); and then Hacker News, of course, with its title and URL. CSV is very decent for storing data quickly. Now, all these Python packages I'm talking about can get kind of tricky to manage. Kali has a bunch of different tools and scripts that rely on different Python library versions, so creating a virtual container for the different libraries is essential if you don't want to break your system, or if you want to isolate code.
Up at the top is a simple curl, assuming you already have Python installed; if you don't, python.org will get you there. We make a simple curl request to the get-pip.py file and save the output. Never directly run a downloaded file through bash or the interpreter without looking at it first, but if you do, it's your skin. Then we run python get-pip.py, and that installs the pip package manager. So say we had a brand-new Python install and we wanted to experiment with requests like we talked about earlier; how do we get requests on our instance? We just type pip install requests after pip is installed, and it's done. It's really fairly easy; npm is another package manager, and apt has similar syntax, so installing pip and installing Python are fairly easy compared to something like Visual Studio, where you have to statically link libraries into your executable. Some useful commands: pip freeze basically snapshots the entire set of libraries you've installed in your virtual environment, and that lets you create a requirements.txt file, which you'll see in a lot of Python projects on GitHub, so that if someone wants to download your repo and get it running, they can just install all the libraries you've listed in that file without having to do research or dig through error messages.
It's very helpful. Now that we've gone over package management: virtual environments are major. Like I said, they isolate your packages from conflicts and so on, and you can even install different versions in different environments, so the flexibility is there. To instantiate a virtual environment we type python3, specify the venv module with -m (you can install that through pip as well), and then give it the folder we want. After that, we source the activate script inside the virtual environment folder, under the bin directory. As soon as we run that command, you see over here that (venv) gets added to the prompt, dropping us into our virtual environment, and once we're in the virtual environment, that's where we want to start installing our libraries and keep them contained. Here you see the results of pip list: the packages I have installed, including requests. It's a pretty bare list, but our requests is in there, so it lets you see how many packages are installed in your virtual environment and what they are. To get out of the virtual environment you type deactivate.
That just drops you out of the virtual environment, and to test that we've actually dropped out, we can run a one-liner in Python trying to import that module, and you'll see "no module named...", so we know the virtual environment is contained. Now, debugging. Debugging is major; I use it so often, so we'll go a little bit into it. I won't get too far into the weeds, because it gets deep, but if you use Python and you're not familiar with pdb, you've got to get familiar, man, because it's the way to identify problems and look at things live. We start by importing pdb, which is the Python debugger; it's interactive, just like the interpreter, so when we drop into a pdb shell we can access variables, manipulate variables, and do a lot within the debugger. What I've done here is make a URL list: an Alias Infosec URL about turning myself into a pickle, Morty, which I bet is not going to exist and will give a 404 (well, I know it will), and then aliasinfosec.com, which we know will come up. We simply loop through the URLs making a request; if we get a 404, we drop down into a pdb shell so we can see what's going on and inspect the objects in real time, and otherwise we print a success message and, just like we did in the other tool, show the first 256 characters of the response. We can also change the program flow, and I'll show you a little bit of that here.
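The loop looks something like this (the joke URL is a placeholder for whatever path was actually on the slide):

```python
import pdb
import requests

urls = [
    "https://www.aliasinfosec.com/i-turned-myself-into-a-pickle-morty",  # expected 404
    "https://www.aliasinfosec.com/",                                     # expected 200
]

for url in urls:
    r = requests.get(url)
    if r.status_code == 404:
        pdb.set_trace()          # drop into an interactive shell; inspect r.headers, r.status_code, etc.
    else:
        print(f"Success: {url}")
        print(r.text[:256])      # first 256 characters of the response, like the earlier tool
```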
So let's run the debug test: we get a 404, and like we thought, we get dropped down into a pdb shell. You can see (man, I left my username in there, dang, I thought I cleared everything out) that we're running the debug test script, and line eight is where it breaks into the pdb shell.
Now that I've gotten a 404 on this URL, what's up with it? As you see, I'm in the pdb shell and I enter r.headers, which shows me the response headers from the request I made. It's very helpful that, while debugging a URL, you can see the headers and the status code, and you can also influence the flow of your program. What do I mean by influencing the flow? We started with the 404 first because we wanted to be dropped into the debug shell. After we've determined that the URL is bad, I've looked at the URL, looked at the headers, looked at the status code, yes, something's wrong with it, but what if you want to keep going through your code without getting out of your shell? You type c and hit Enter, and execution continues. The shell was paused; once we hit c it continued, looped back through, and found that, hey, that second site does exist, and now you see the first 256 bytes of Alias's website. So, I've covered a lot of stuff. I hope I didn't get too far into the weeds, but like I said, Python is great for open source intelligence.
We do a lot of it here, and I'd love to answer any questions if you guys have them. Again, I'll be available in Discord, and aside from that, I'll post the GitHub. Thank you very much, Jeff; we'll pop over to Discord and see if anyone has any questions. Okay, thank you guys so much.