Hacks, Leaks & Revelations: The Art of Analyzing Hacked & Leaked Data

BSides PDX · 2023 · 50:07 · 841 views · Published 2023-11
Style: Talk
About this talk
Hacks, Leaks, and Revelations: The Art of Analyzing Hacked and Leaked Data

Micah Lee (@micahflee@infosec.exchange on Mastodon, @micahflee.com on Bluesky, @micahflee on Twitter)

The world is awash with hacked and leaked datasets from governments, corporations, and extremist groups. In many cases they’re freely available online and waiting for anyone with an internet connection, a laptop, and enough curiosity to analyze them. Most journalists and researchers don’t have the technical skills to do this, so most of it never even gets looked at. You probably do though! In this talk I’ll show you how to use your hacking skills for good.

Micah Lee is an investigative journalist, computer security engineer, and an open-source software developer who is known for helping secure Edward Snowden’s communications while he leaked secret NSA documents. He’s the Director of Information Security at The Intercept and an advisor to the transparency collective Distributed Denial of Secrets. He’s a former staff technologist for the Electronic Frontier Foundation and a co-founder of the Freedom of the Press Foundation. He’s also a Tor Project core contributor, and he develops open source security and privacy tools like OnionShare and Dangerzone.

---

BSides Portland is a tax-exempt charitable 501(c)(3) organization founded with the mission to cultivate the Pacific Northwest information security and hacking community by creating local inclusive opportunities for learning, networking, collaboration, and teaching. bsidespdx.org
Transcript [en]

Hello everybody, my name is Micah Lee. I am an investigative journalist and the director of information security at The Intercept. I kind of accidentally turned into a journalist; I didn't mean to. In late 2012, while I was working at EFF as a technologist, I got an anonymous encrypted email from someone, and they asked if I could help some journalists learn how to use PGP email encryption. So I did, and then I learned a few months later that the person I was anonymously talking to was Edward Snowden, and that I was working on helping him leak NSA documents to journalists. So I've been doing investigative journalism for The Intercept ever

since. I spent many years reporting on Snowden documents, and I helped publish over 2,000 documents. But since then I've also analyzed and reported on many, many other leaked or hacked data sets, and now I've written a book to teach everyone else how to do it as well. So that's what I'm going to be talking about: what's in this book. But before I start telling you about the book, let me give you a very quick overview of just a few data sets from earlier this year, which you probably haven't heard of, that have been leaked online. In January, the smartphone

forensics companies Cellebrite and MSAB were both hacked, and their source code was leaked online. Then a police intelligence company called ODIN Intelligence was also hacked, and their data was leaked online. In March, the city of Oakland was hit with ransomware and their data was leaked online. This data set, which I've looked through a little bit, includes police disciplinary records for Oakland Police and information about lawsuits against the city. In May, someone found a public Google Drive folder with 20 gigabytes of data from the anti-trans hate group the American College of Pediatricians. This is a group that

played a big role in the overturning of Roe v. Wade. In August, two different stalkerware companies were hacked, and databases of their customers were dumped on the internet. These are companies that make Android spyware and sell it for cheap to people like domestic abusers so they can monitor their partners' text messages, browser history, and everything else on their phones. One of these companies is from Brazil and one is from Iran, but the targets of their hacks are all over the world. In August, someone hacked Russia's censorship office and dumped the source code for AI software that they use to scour the internet for stuff to block. In June, someone hacked

and leaked a million emails from Donetsk, the Russian puppet government that Russia just stole from Ukraine. So that was just a tiny subset of the data sets that have recently been leaked online. They might contain newsworthy stuff. I've barely had a chance to look at any of those, but neither has anybody else; no one is looking at these data sets. I think a big reason is that journalists and researchers just really don't have the skills, and I'm hoping to change that. So I spent the last two years writing a book about how to analyze hacked and leaked data sets, and here's

actually the only copy that I have. It's not out yet; it'll be out soon. But if anybody wants to just grab it and look through it, you're welcome to. I just want it back. The book is really hands-on: it uses real data sets as examples, and it has you download them and analyze them as you read the book. It's designed for total newbies, so it doesn't assume any prior technical experience or any technical skills at all. All you need is a laptop, an internet connection, and about a terabyte of disk space, and you have to be willing to dive deep into the nerdy details. And it's a cross-platform

book. I made sure that everything works from Windows, macOS, and Linux, so it uses a lot of Windows Subsystem for Linux. Briefly, here's what's in it. I start by talking about things like digital security, source protection, ethics, and document redaction, and also where you can download some data sets. Then I move on to a deep dive into more technical topics, like using the terminal, how to make data sets searchable, and working with email dumps in different formats. Then I go into the very beginning basics of Python programming, and then I spend time especially on how to

work with structured data like CSV files and JSON files, and then I go into SQL databases. At the end there are a few case studies. I'm also releasing the book under a Creative Commons license to remove any barriers to access. Once it's released in January, I'll put a PDF of it up on the website, hacksandleaks.com, and the beginning of the book is up there now if you want. Okay, so before you start working with hacked and leaked data, it's important to start with a baseline of digital security, so the first part of the book is the least technical, but it's still critically important. Some of the homework assignments include starting to use a password manager,

making sure your hard disk is encrypted, and showing you how to encrypt a USB hard disk. Also, you can't trust everything that you read online, and this includes hacked and leaked data sets that you download or that some stranger sends you, so it shows you some techniques for verifying that the data you get is authentic before you write about it. The book also goes into how to acquire data sets, both public data sets that anyone can download and private data sets that you get directly from confidential sources, and also considerations on how to protect those sources and how to communicate

with them and things like that. All the data sets that you actually download and work with throughout the book you get from Distributed Denial of Secrets. Distributed Denial of Secrets, or DDoSecrets, is a US nonprofit that was founded in 2018. It's basically the public library of hacked and leaked data sets. There are a lot more data sets out there than just what's on their site, but they basically try to look for data sets, and get submissions of data sets, that they think might be newsworthy, and then they hold on to them and help journalists access them. I've been working with DDoSecrets for a few years. I report on data sets

that they publish all the time. The website is ddosecrets.com. They also have a newsletter, ddosecrets.substack.com, covering the data sets that they release, so if you want to get an email when there's something new that you might be interested in, that's a good thing to subscribe to. One of the first homework assignments is to download a copy of the BlueLeaks data set from DDoSecrets to your encrypted disk. In the summer of 2020, in the middle of the Black Lives Matter uprising that was sparked after cops murdered George Floyd, someone hacked hundreds of law enforcement websites, most of them belonging to law enforcement fusion centers, and leaked 270 gigabytes of data

to DDoSecrets. This is the BlueLeaks data set, and it's the data set that I work with most throughout the book. It's large and complicated, but it's great for practice. Another thing about this data set is that I think only a small amount of it has really been looked through. I think there are still a lot of revelations in there that just haven't been published because nobody's really looked yet. Not everyone was as happy about BlueLeaks as I was. After DDoSecrets published it, German authorities, at the request of the FBI, seized a DDoSecrets server that had all of their

public data sets. It didn't actually do anything to suppress BlueLeaks journalism, though, because they used BitTorrent as well. Also in 2020, both Reddit and Twitter started censoring DDoSecrets. Twitter suspended the DDoSecrets account and also started preventing anyone from posting links to ddosecrets.com, or even DMing links or anything like that. I briefly had hope that when Elon Musk took over Twitter he might restore the DDoSecrets account and stop censoring the links, but that didn't happen. Then in 2020, shortly after Twitter started censoring DDoSecrets, they also used the same link-blocking mechanism to censor a

link to a New York Post article that was based on stolen documents from Hunter Biden's laptop. Republicans were so pissed about this that they held congressional hearings, and they were very angry. Twitter reversed the censorship of the New York Post after two days, but DDoSecrets has still been blocked for over three years. And while I'm talking about DDoSecrets getting censored: in February, DDoSecrets published data from Roskomnadzor, which is the Russian government agency in charge of monitoring, controlling, and censoring mass media. This is actually the second leak from Roskomnadzor that they've published. I think they've published three of them from the same

agency now. The first one was right after Russia invaded Ukraine, when there were dozens and dozens of different data sets. Activists were just going wild hacking Russian companies and government agencies, and there were many, many terabytes of data. But in the second leak, DDoSecrets is actually mentioned. This is a court document that is an attachment to an email in that data set, and it shows a request to add ddosecrets.com to Russia's domain name censorship list. An interesting thing, not really related to this: ChatGPT, the GPT-4 version,

supports uploading images. I uploaded a screenshot of this and asked it to describe what it says in English, and it actually did a really good job. It names the agency, it shows the Cyrillic and then gives the English, and it explains what the accusations are. So that was kind of interesting. Okay, so the technical part of the book starts with teaching how to use the command line. It teaches basic things like navigating the file system, running commands, and using sudo, and then it moves on to some techniques that are useful for quickly analyzing data sets. For example, when you first download BlueLeaks,

it's a folder of 167 zip files, and you could right-click on each zip file and unzip them one at a time, but that is slow and cumbersome, so it shows you how to do it all at once with a for loop. After you unzip them, you have hundreds of folders, many of them full of thousands and thousands of files. So you learn how to use commands like du to measure the disk space of different folders, and to do that in a for loop too, and how to use commands like sort to sort the output so you can see which folders are the smallest and which are the largest.
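As a rough illustration of the unzip-and-measure workflow described here, the sketch below does the same thing in Python rather than with the shell for loop, du, and sort that the talk mentions. It is not the book's code. The function names and the assumed layout (one directory holding many .zip files) are hypothetical, and sizes are summed from file lengths, which only approximates what du reports.

```python
import zipfile
from pathlib import Path


def unzip_all(dataset_dir: Path) -> list[Path]:
    """Extract every .zip in dataset_dir into a folder named after the archive."""
    extracted = []
    for zip_path in sorted(dataset_dir.glob("*.zip")):
        target = dataset_dir / zip_path.stem  # e.g. big.zip -> big/
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(target)
        extracted.append(target)
    return extracted


def folder_sizes(dataset_dir: Path) -> list[tuple[int, str]]:
    """Return (total bytes, folder name) pairs, smallest first.

    This sums file lengths (apparent size), so the numbers can differ
    slightly from the on-disk usage that du measures.
    """
    sizes = []
    for folder in dataset_dir.iterdir():
        if folder.is_dir():
            total = sum(f.stat().st_size for f in folder.rglob("*") if f.is_file())
            sizes.append((total, folder.name))
    return sorted(sizes)
```

Calling `unzip_all(Path("BlueLeaks"))` and then `folder_sizes(Path("BlueLeaks"))` would list the extracted folders from smallest to largest; the path is a placeholder.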

Those might be the ones with the more interesting stuff to start looking at. It shows you how to use find to make lists of file names, and how to use grep to search those lists of file names or to quickly search for strings in other text files, and things like that. grep is a great tool for searching in lists of file names, and it's also great for searching the content of some plain text files, like CSV and JSON files. But you can't use it to search most types of data, including PDFs, office documents, and even emails. Emails are plain text files, but MIME encoding gets in the way.
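To illustrate why a plain text search can miss email content, here is a minimal sketch that uses Python's standard `email` package to decode each message before searching it. This is a swapped-in technique for illustration, not the tool the talk recommends; the function names and the assumption that the dump is a folder of .eml files are hypothetical.

```python
from email import policy
from email.parser import BytesParser
from pathlib import Path


def email_text(eml_path: Path) -> str:
    """Return the subject plus every decoded text part of one .eml file."""
    with open(eml_path, "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)
    chunks = [str(msg.get("Subject", ""))]
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            try:
                # get_content() undoes Base64 / quoted-printable and the charset
                chunks.append(part.get_content())
            except (LookupError, UnicodeError):
                continue  # skip parts with an unknown or broken encoding
    return "\n".join(chunks)


def search_emails(folder: Path, term: str) -> list[Path]:
    """Return the .eml files under folder whose decoded text contains term."""
    needle = term.lower()
    return [p for p in sorted(folder.rglob("*.eml")) if needle in email_text(p).lower()]
```

A message whose body is Base64-encoded would not match a byte-level search for a word in that body, but `search_emails` would find it because the body is decoded first.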

A lot of email is Base64-encoded, so you just won't be able to search it with grep. So if you want to search the contents of all of these, you can use software called Aleph. Aleph is open-source investigation software developed by OCCRP, which is the Organized Crime and Corruption Reporting Project. They built this for investigative journalism, and OCCRP runs a big public Aleph server that anyone can go to at aleph.occrp.org. You can use this to index data sets and then search them, and it has a lot of other cool features too, like entity extraction, OCR, and being able to cross-reference things in multiple

data sets. It's also possible to run an Aleph server locally on your laptop using Docker, and that's one of the homework assignments. Here's a screenshot from the book of using Aleph. In this case I indexed just a little piece of the BlueLeaks data, from ICEFISHX, which is a partnership between law enforcement agencies in Minnesota, North Dakota, and South Dakota. This was from June 2020, just a few days after George Floyd was murdered. This is an unclassified, law-enforcement-sensitive document, and it warns of increased threats against police by

protesters. Okay, so email dumps are also a very common form of leaked data set, so there's a whole chapter on how to read other people's email. This goes over the email message format and a little bit of how email protocols work, but mostly it focuses on the different file formats that email dumps come in: EML files, which come as a bunch of individual emails, Mbox files, and PST Outlook files. It shows you how to convert the email dumps to the right format and then import them into Thunderbird using an add-on called ImportExportTools, and once they're in Thunderbird you can read

them kind of as if you have access to the inbox, and you can use Thunderbird's search tools and things like that. For example, here's a screenshot from the Nauru Police Force data set. Nauru is a tiny island in the Pacific that hosts abuse-ridden offshore detention centers that the Australian government uses to hold immigrants and asylum seekers. This screenshot shows an email from the president of Nauru at the time telling Nauru's police chief to not respond to a journalist who asked questions about two Nauruan men who had allegedly attacked a refugee worker, possibly ran him over with a car, and stole his motorbike. It's interesting: you

could actually filter this set for all the emails from the president of the country and read them. There are a few other tools you can use for email dumps. Aleph is one of them, but for PST files, which are Outlook files, you can also use Outlook itself and import or just open the PST files there. This is a screenshot of Outlook where I imported a 48-gigabyte PST file that was leaked from VGTRK, the largest state-owned media company in Russia. This was another one of the organizations that was hacked right after Russia invaded Ukraine. I had searched for the Cyrillic spelling of

Tucker Carlson, and so this says Tucker Carlson sync. This email is probably for subtitling or dubbing, like for showing a Tucker Carlson segment on Russian TV. It's basically Tucker Carlson saying that Ukraine isn't an independent country but rather is controlled by the Democratic Party in the US, and then there are conspiracy theories about Hunter Biden. Okay, so tools like Aleph and Thunderbird only work for some data sets, but for a lot of data sets you really have to write custom code to make any sense of them. So the book includes a crash course on Python programming for beginners, and it focuses on traversing the file

system and working with data, and on dictionaries and lists a bunch. Here's a quick example of that. On the day that Russia invaded Ukraine, the ransomware group Conti published this statement. Conti is a ransomware gang; they're known for extorting hundreds of millions of dollars from companies around the world, especially healthcare companies. A few days after they published this statement, someone, probably a Ukrainian security researcher, hacked Conti and dumped 30 gigs of internal documents online, including chat logs. The chat logs were from a Rocket.Chat server that they hosted themselves as an onion service, and they were in JSON format. Here's an example of a

single message posted in Conti's chat room on the day of the invasion. This message translates to "Some American senators suggest blocking Pornhub in Russia in addition to social networks." This was right when the US and Europe started imposing economic sanctions against Russia, and then Russia blocked access to Twitter and Facebook, and there were all these rumors that Pornhub would block access for Russian users, which never actually happened. But the next messages in the chat were "That's it, we're done" and "They will take away our last joys." Here's some code from that chapter that shows you how to work with this data. So the clip

at the top loads this JSON file and parses it, and then there's just a simple for loop. It shows how to actually make sense of the convoluted structure that this data happens to come in, and then how to display it in such a way that it's possible to copy and paste it into a translation app. Here's the reporting for The Intercept that I did on the Conti chat logs. So that's a big part of the book: it teaches you how to take incomprehensible data like this and make it possible for you to read it and to
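The steps just described (load and parse a JSON file, loop over the messages, print them in a form that can be pasted into a translation app) might look roughly like the sketch below. It is not the code shown in the talk. The field names (`messages`, `ts`, `u.username`, `msg`) are assumptions about the export format, and the sample messages are placeholders.

```python
import json


def format_chat(json_text: str) -> str:
    """Turn one chat-log JSON document into 'timestamp username: text' lines."""
    data = json.loads(json_text)
    lines = []
    # Assumed layout: {"messages": [{"ts": ..., "u": {"username": ...}, "msg": ...}]}
    for message in sorted(data.get("messages", []), key=lambda m: m.get("ts", "")):
        username = message.get("u", {}).get("username", "unknown")
        lines.append(f"{message.get('ts', '')} {username}: {message.get('msg', '')}")
    return "\n".join(lines)


# Placeholder data standing in for one exported chat-log file
sample = json.dumps({"messages": [
    {"ts": "2022-02-24T10:05:00Z", "u": {"username": "user_b"}, "msg": "second message"},
    {"ts": "2022-02-24T10:01:00Z", "u": {"username": "user_a"}, "msg": "first message"},
]})
print(format_chat(sample))
```

Sorting by timestamp before printing keeps the conversation in order even if the export is not, which matters when the text is read through a translation tool one block at a time.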

understand it and to report on it. Another common format for data leaks is spreadsheets, particularly CSV files, so I have a whole chapter on that. There are a lot of CSV files in BlueLeaks, so I'll show you a little bit of that. Here's a quick example of something that I found in a CSV. I grepped the contents of the CSV files in the NCRIC folder in BlueLeaks for the word antifa. NCRIC is the Northern California Regional Intelligence Center, which is actually my local fusion center, one that I didn't even know existed until BlueLeaks happened and I started looking around, and I was like, what is this

organization? It found this file, SARs.csv, which lists suspicious activity reports, which are basically rumors or leads posted to fusion centers about things that may or may not be illegal. Here are the relevant fields from that CSV in a format that's easier to read. A lot of the homework in the CSV chapters involves looping through the rows one at a time and making them easier to read, or searching them for things, because there's just so much data; a lot of these files have hundreds of columns. This shows that someone from the Marin County DA's office, which is just north of San Francisco, submitted
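A minimal sketch of the row-by-row CSV approach mentioned here: search every cell for a term, then print only a few chosen fields per matching row so that a file with hundreds of columns stays readable. The function names are hypothetical, and any column names passed in would have to match the real file's header, which the talk does not list.

```python
import csv


def matching_rows(csv_path, term, fields):
    """Yield the chosen fields of every row in which any cell contains term."""
    needle = term.lower()
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as f:
        for row in csv.DictReader(f):
            # str(...) guards against missing cells (None) and overflow lists
            if any(needle in str(value or "").lower() for value in row.values()):
                yield {name: row.get(name) or "" for name in fields}


def print_rows(rows):
    """Print one 'field: value' line per field, with a blank line between rows."""
    for row in rows:
        for name, value in row.items():
            print(f"{name}: {value}")
        print()
```

For example, `print_rows(matching_rows("SARs.csv", "antifa", ["Summary", "Category"]))` would show two fields for each matching report; the two column names here are placeholders, not the file's actual headers.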

this suspicious activity report on June 5th, 2020 at 2:20 p.m. They set the category to radicalization extremism, and the summary says that they received the attached letter from a lawyer who was contacted by a University of Oregon student, and that the student, quote, appears to be a member of the antifa group and is assisting in planning protesting efforts in the Bay Area despite living in Oregon. It also makes a reference to a file called letter.pdf. So here is the letter that it's referencing. It's from the lawyer, and it's written in all caps: please see the attached solicitation I received from an antifa terrorist wanting my help to bail her and her friends out

of jail if arrested for rioting. He says he's staying anonymous because he's worried about getting bar complaints filed against him, and he ends the letter with "Happy hunting." Then the PDF goes on to show the original letter from the Oregon student. It's very polite. She says she's compiling a list of lawyers and firms that would be willing to represent Black Lives Matter activists pro bono if they got arrested. So this letter triggered this unhinged lawyer so much that he mailed it to a district attorney's office, who then uploaded it as a suspicious activity report to the Northern California police intelligence agency. I never would have found this without first finding the CSV that

points to it and understanding how it all worked and how all this happened. Back in 2020, when I was investigating BlueLeaks, I discovered that most of the really interesting information is in the CSV files. The hacker had originally exported all of these CSV files from a SQL database, so I realized that they were all related: there were columns where one row holds the ID of a row in another table. So I made a custom web application called BlueLeaks Explorer that makes it way easier to look at BlueLeaks data. This is actually a row from a CSV. The

EmailBuilder CSV holds the bulk emails that the fusion centers sent out. In the case of NCRIC, they have a list of, I think, about 16,000 local cops across Northern California, and they send emails to all of them frequently. Right during the Black Lives Matter protests, they would send two emails a day, one in the morning and one in the late afternoon, listing all the local protests happening. So if you have a copy of BlueLeaks yourself, you could use BlueLeaks Explorer, and it'll make your job way easier. And this one is an advertisement for

a class on how cops can investigate stuff on cell phones. Okay, so another very common format that leaked data comes in is JSON. I spend a chapter going deep into JSON, and the examples I use are related to the January 6 insurrection. On January 6, 2021, anti-democracy activists stormed the US Capitol in Washington, DC to try to keep Trump in power after he lost the election. Using their phones, they took photos and videos, which a lot of times included GPS coordinates in the metadata, and they posted them online in real time. Many of them posted to Parler, which is a far-right social network. After the attack, Parler refused

to moderate content that incited violence, so Apple and Google kicked them off of their app stores, and AWS announced that it was going to kick Parler off that service as well in a few days. But before that happened, an archivist named donk_enby downloaded 32 terabytes of videos from Parler, over a million videos. That's a lot of data, and the ironic thing is that she just copied it from one S3 bucket into another S3 bucket, so it's still on Amazon even after Parler got kicked off. And 32 terabytes is massive: the hard drive on this laptop is one terabyte, so it'd be 32 of these hard drives.

Then she worked with DDoSecrets to make all of the Parler videos public. But 32 terabytes is way too much data for people to individually download all of it, so she used a command-line program called ExifTool, which is an excellent tool that you can use to extract metadata from files. It supports video files, but also office documents, PDFs, and other things. She used ExifTool to extract metadata from every one of these over a million videos and save it in JSON files. So part of the Parler data set is this metadata file: you can just download this few-hundred-megabyte zip

file and unzip it, and then you have a million JSON files. They're tiny JSON files, but overall it's 2 gigabytes of data. Here's example metadata from one of the videos. As you can see, it was created on January 6, 2021, and it includes GPS coordinates in Washington, DC. And here's an example of the type of Python script that you'd write if you followed along with the book. This is a script that loops through all 1 million pieces of video metadata from Parler, looking for videos that were filmed on January 6th and that have GPS coordinates in Washington, DC, and when it finds these videos, it saves
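A sketch of the kind of script described here, under several assumptions that are not stated in the talk: each metadata file is ExifTool-style JSON (a one-element list or a single object), the keys are named `CreateDate`, `GPSLatitude`, and `GPSLongitude`, the coordinates are already decimal numbers, and the output is KML, one format Google Earth can open. The bounding box is a rough approximation of Washington, DC chosen for this example.

```python
import json
from pathlib import Path
from xml.sax.saxutils import escape

# Rough bounding box around Washington, DC (an approximation for this sketch)
DC_LAT = (38.79, 39.00)
DC_LON = (-77.12, -76.91)


def load_metadata(path: Path) -> dict:
    """Read one metadata file; accept either a one-element list or an object."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return data[0] if isinstance(data, list) else data


def is_jan6_in_dc(meta: dict) -> bool:
    """True if the video was created on 2021-01-06 with coordinates inside the box."""
    created = str(meta.get("CreateDate", ""))  # assumed form: "2021:01:06 19:10:00"
    lat, lon = meta.get("GPSLatitude"), meta.get("GPSLongitude")
    if not created.startswith("2021:01:06"):
        return False
    if not isinstance(lat, (int, float)) or not isinstance(lon, (int, float)):
        return False  # missing or non-decimal coordinates are skipped
    return DC_LAT[0] <= lat <= DC_LAT[1] and DC_LON[0] <= lon <= DC_LON[1]


def build_kml(folder) -> str:
    """Return a KML document with one placemark per matching metadata file."""
    placemarks = []
    for path in sorted(Path(folder).glob("*.json")):
        meta = load_metadata(path)
        if is_jan6_in_dc(meta):
            # KML coordinates are written longitude first, then latitude
            coords = f"{meta['GPSLongitude']},{meta['GPSLatitude']}"
            placemarks.append(
                f"<Placemark><name>{escape(path.stem)}</name>"
                f"<Point><coordinates>{coords}</coordinates></Point></Placemark>"
            )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>\n'
        + "\n".join(placemarks)
        + "\n</Document></kml>\n"
    )
```

Writing the result of `build_kml("metadata")` to a `.kml` file would give something that can be opened as a map layer; the folder name is a placeholder. Metadata timestamps are often in UTC, so a stricter filter would also consider the time zone.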

this data into a format that could be loaded into Google Earth to visualize. If you've never written code before, the book walks you through the entire process of programming, step by step from the beginning, so it should be doable, and you should have enough programming skills to write relatively simple scripts like this. Here is a Google Earth map of that data. As you can see, Trump supporters were deep inside the Capitol building. Many of the videos in this data set were used as evidence in Trump's second impeachment inquiry. Another common format that leaked and hacked data sets come in

is SQL databases, so I have a chapter all about SQL, and as an example I use data from a hosting company called Epik. Epik is a company that's run by a Christian nationalist, and it provides domain registration and hosting services to hate groups and other far-right extremists. It's really known for this: when a mass shooter posts a manifesto and the website they posted it to, like 8chan, gets taken down, the site moves over to Epik, because Epik will protect their domain names. For example, Epik hosts the domain name oathkeepers.org, and if you look up WHOIS data for oathkeepers.org, you'll

see that it's behind Epik's domain privacy service. But the SQL databases in this data breach include the actual registration information for everyone behind the privacy service, so if you run a SQL query on the Epik data, you can find the real contact information for the person who owns the domain, or the person who registered the domain. In this case, the registrant is Stewart Rhodes, but the admin is Edward Durfee, who lives in New Jersey and runs a company called EGM Systems. I was wondering who this Edward Durfee guy is, and if you look at the separate
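To show the shape of such a lookup, here is a self-contained sketch using Python's built-in `sqlite3` module. Everything about the schema is made up for illustration: the table names (`domains`, `contacts`), the columns, and the placeholder rows do not come from the leaked databases, and an in-memory SQLite database stands in for whatever database server the real data would be loaded into.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a made-up two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE domains (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE contacts (domain_id INTEGER, role TEXT, full_name TEXT);
        INSERT INTO domains VALUES (1, 'example.org'), (2, 'example.net');
        INSERT INTO contacts VALUES
            (1, 'registrant', 'Placeholder Person A'),
            (1, 'admin', 'Placeholder Person B'),
            (2, 'registrant', 'Placeholder Person C');
    """)
    return conn


def contacts_for(conn: sqlite3.Connection, domain: str) -> list:
    """Return (role, full_name) rows for one domain, joined across both tables."""
    query = """
        SELECT c.role, c.full_name
        FROM domains AS d
        JOIN contacts AS c ON c.domain_id = d.id
        WHERE d.name = ?
        ORDER BY c.role
    """
    # The ? placeholder keeps the domain value out of the SQL text itself
    return conn.execute(query, (domain,)).fetchall()
```

With this demo data, `contacts_for(build_demo_db(), "example.org")` returns the admin and registrant rows for that one domain, which is the same pattern as asking who sits behind a privacy-protected registration.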

leak of Oath Keepers email, you can see that Edward Durfee is the Oath Keepers' IT support person; he responds to IT support emails. The Epik data has domain name registration information for a lot of other websites, including sites that are run by neo-Nazis, and it also has information about Jim Watkins, who used to run 8chan, now runs 8kun, and is probably the person behind QAnon. Finally, I describe some detailed case studies at the end of the book. I won't go into a lot of detail, but briefly, here's one of them. In September 2021, I was contacted by an anonymous source who said that they were

dropping some docs on Cadence Health, the horse paste peddlers, and that they were hilariously easy to hack. At the time, I had no idea what Cadence Health was or who anyone involved in this story was. I just had these two files that are relatively small, about 100 megabytes together. A few weeks later I published my story. I had found out that an anti-vax group called America's Frontline Doctors was making millions and millions of dollars convincing people that vaccines are harmful and then selling them fake COVID medicine instead, ivermectin and hydroxychloroquine. I basically discovered that over a two-month period they charged people at least $6.7 million just for $90 phone appointments. And I only

have two months of data, but if you assume they were making the same amount of money on average during the whole time that they were operating, then they probably charged people closer to an additional $18 million on top of that just for doctor appointments. And then for the fake medicine, patients spent another $8.7 million on that. So it's a lot of money; they made a lot of money with this. This reporting ultimately led to a congressional investigation into America's Frontline Doctors and also the scammy telehealth companies that work with them. But all of the groups basically stonewalled the congressional

committee and refused to cooperate, and then the Republicans took the House and shut down the committee, so nothing really happened. Also, Simone Gold is the founder of America's Frontline Doctors, and she's also a January 6 insurrectionist. While she was serving two months in prison for that, other people at this group tried to conduct a little coup of their own. An America's Frontline Doctors lawyer, Joey Gilbert, audited how she had been spending her money, and then he sued her to try to take control of the organization. There's so much hilarious drama with this group, actually; it's pretty interesting. But in addition to all of this money that

they had scammed out of people, they also had at least $10 million in donations, because they're a nonprofit and some of this information is public. So while more than a million Americans were dying from COVID during the pandemic, according to this lawsuit, this is how Simone Gold was spending America's Frontline Doctors' charity money. She lived rent-free in a $3.6 million mansion in Naples, Florida, purchased with America's Frontline Doctors money. She spent $122,000 a month on a bodyguard, $5,600 a month on a housekeeper, and $50,000 a month on credit card bills. She had purchased three cars, including a Mercedes-Benz. She took private jet flights, including a single trip that cost

$100,000. But just like on January 6th, this coup failed, and she got out of prison and basically regained control of the group. It's still active. They no longer sell ivermectin and hydroxychloroquine; Simone Gold has started a new, separate company, one that is not under investigation, that's doing that now. Oh, and I recently gave a detailed talk about this case study at DEF CON, at the Misinformation Village, and it goes into way more detail if you're interested, or you could read the book. Then another case study involves neo-Nazi chat logs that were

collected by anti-fascist infiltrators in 2017. In August 2017, hundreds of white supremacists spent a weekend in Charlottesville, Virginia for the Unite the Right rally. The protesters came from groups like Vanguard America, Identity Evropa, League of the South, and the Ku Klux Klan. One neo-Nazi drove his car into a crowd of counter-protesters, killing Heather Heyer and injuring 19 other people. This is when Trump, in response to the violence, famously said that there were very fine people on both sides. A nonprofit news collective called Unicorn Riot got chat logs from Discord servers that the fascists had used to organize the violence in Charlottesville, and they also obtained chat logs from a lot of

other Discord servers like dozens of them from the American far right um they were all infiltrated by anti-fascists and at this time the fascists were really using Discord they like loved it um and so journalists at Unicorn Riot uh sent me copies of the chat logs in JSON format and um I was starting to look into it and I wanted to report on it so I built this custom web app uh just so that I could like make sense of the chat logs and read through them and then write an article about them um here's the article that I ended up writing uh I was going to keep reporting on these chat logs but I kind of found

sitting there reading the inner thoughts of terrible people to be like really upsetting and I didn't really want to do that and so instead I just um uh decided to spend my time just improving this app so other people could be reading all of these um and this part of the book actually includes like a section on mental health when you're doing extremism research because it's uh it could be pretty terrible um so I shared the code with Unicorn Riot and uh then they eventually took over development and turned it into this public website called Discord Leaks and today it has hundreds of thousands of messages from uh dozens of different fascist chat rooms and it's an important tool used by

um extremism researchers uh survivors from Charlottesville sued the organizers of Unite the Right hoping to bankrupt the American fascist movement and they kind of succeeded um the lawyers used Discord Leaks to gather evidence for their initial complaint so uh the Unite the Right organizers and also fascist groups like the Ku Klux Klan and Identity Evropa were ordered to pay $25 million in damages and the lawyers said while our team eventually subpoenaed the full servers from Discord itself these initial leaks provided crucial early information that made the speed and breadth of the initial complaint possible um and yeah the chat logs from the Charlottesville server like explicitly included people talking about running over counter

protesters and then claiming self-defense which is exactly what happened um and so on that happy note um does anybody have any questions um I don't know I guess I went through this pretty fast I also um uh No Starch Press has a BSides discount if you wanted to pre-order it um yeah any questions uh should I oh there's a microphone yeah um can you apply some of these same techniques to analyzing uh like uh Freedom of Information Act information yes uh like so the techniques um it works with any type of data I mean there's a lot of stuff about like protecting sources and how to download stuff but yeah like if you get

a bunch of documents from uh FOIA you can absolutely like index them in Aleph and um search them or if you got a lot of emails you can import them into Thunderbird so yeah it's like it doesn't matter the source of the data um uh you can use these techniques for anything uh so I'm curious if you know if uh any of these fancy new um uh machine learning techniques and all of that uh has been used with like Aleph for like any extensions or anything to you know load in a bunch of text and ask like hey what's the crazy thing in here and then see what the robots what the

robots think of that that's a good question I um uh this book doesn't include any of that but I have experimented with some of it I mean the biggest problem really is like the easiest way to do this is using um OpenAI APIs like using GPT-4 but the problem is you have to share data with OpenAI and so that's okay with some data sets that are public but it's uh you know not okay with a lot of data sets like I don't want to be sitting there sending you know everyone's PII to OpenAI that's going to be used to train GPT-5 next um but I did actually start writing this thing

that was a uh like an app that lets you ask natural language questions about a SQL database and it'll just run SQL queries for you get the responses and then give you natural language answers which was actually pretty cool it seems to work well I used like a leak of uh the Texas GOP website actually like it was a WordPress site and I was like what type of uh articles are on this blog and it would like run some queries and then it would describe the types of articles on the blog it was cool um so what is your either findings or advice for people journalists or researchers who want to do this um because we're talking here

about terabytes and when you put it all together petabytes worth of data and you're right it doesn't fit on your laptop um is there anything open to journalists for where you can do large data set analysis be it AWS or Google or somewhere else or have those data sets out there in the raw that you can then do your analysis on without having to do the download and you know uh invest in your own local file storage infrastructure yeah that's a great question um uh what is it I forget what it's called Google actually now has a journalism product where if you have data in Google Drive they basically have a um competitor to Aleph uh

so if you do get a huge data set and have it in Google Drive you can um start indexing it in this Google tool and it does like a similar thing where it um looks through each document and OCRs it and you can search the whole data set I haven't spent a lot of time on it but it looks promising but that also you know will only work with data sets that are already kind of public that you're okay with putting on Google um uh and then I mean really I think that it's a hard problem I think that for smaller data sets um individual journalists can like download them and

then index them on Aleph on their own computer or you know like spin up a Postgres database and import a SQL database and then start running queries locally that's like pretty easy um but otherwise I think that like the best bet is to uh you know like so OCCRP runs a big public Aleph server with lots of data sets you can contact them and get them to index something that you're interested in and then get like a private account on it and um that's one way of doing it uh but also uh you know if you're a journalist and you work for a newsroom you get your newsroom to set

up an Aleph and have their tech people uh run it did anyone track down that lawyer from the BlueLeaks and get them disbarred you know I no one tracked down that lawyer I um I didn't spend very much time tracking down the lawyer but I didn't know who they were because uh in the suspicious activity report what was uploaded was basically like this letter where they didn't have their contact information or their name and a uh scan of the envelope that they had sent and they didn't put a return address um I think that they I don't know why they sent it to the Marin DA cause I think that they were actually in

like the South Bay which is a different part of the Bay Area um but no I don't actually know who it is um it'd be interesting to track down that lawyer could you use a um an LLM like Llama or something to run those um requests that you've run on ChatGPT locally yeah yeah uh you totally I mean I haven't spent too much time on this but yeah like there's um what is that software called I think it's local LLM yeah there's a bunch that do those there's a bunch of them um so yeah you could totally do that stuff like I mean I think that the big thing is that

it takes uh it takes a GPU and a bunch of like engineering work to get it set up and it's not as good as what you can pay for it's like it'll be more expensive and not as good as if you're just paying OpenAI um uh but yeah also I mean I don't know like I also think that there's limited value like I think that you can potentially get some really useful stuff but I think that like you can't just like run it through the LLM and be like oh nothing here like I think you need to like spend a lot of time on it and figure out exactly what

you want to do and then like maybe write some software that uses the GPT-4 API or something like that to like do specific tasks um more than just having it do its thing for you like potentially I could see maybe you know taking every single document splitting it into like PNGs uploading each PNG getting GPT-4 to like summarize them all and then summarize it together and like do all this stuff but that's like a lot of work and I don't even know if it'll you know yield anything was it BlueLeaks that was the source of most of the stuff about um uh the law enforcement stuff Black Lives Matter yeah

um well I guess that would be less but um did that reveal what did that reveal about sort of ongoing like risk to protesters or other like targeting things were there any other like especially egregious things that they were like especially grievous things in BlueLeaks yeah um I mean yeah it was interesting so I basically only looked at NCRIC the Northern California one um and actually before this I was trying to see what the like Oregon fusion center is and if uh it was in BlueLeaks and I saw on the DHS website that there's an Oregon fusion center I didn't find it in BlueLeaks there is

actually an Oregon and um oh it is what folder is it yeah yeah there's a retail alliance one so like there's ORCAs which are organized retail crime associations which are sort of like fusion centers except they're geared towards like retail crime and so like all of the people who have accounts like uh work for like Walmart and Target and stuff like that and it's like they share intelligence across big geographical areas with police about retail crime so there's that but there's not the um the one that's more about like law enforcement and anti-terrorism but anyway to answer your question um uh I mostly looked at NCRIC there's some other reporting from some

other fusion centers that uh like in Austin Texas in Maine those fusion centers I think there was like a lot of reporting from them there's a lot in BlueLeaks that no one's looked at I found um I don't know I mean I'm very unimpressed by the fusion centers like basically the ultimate thing that they do is they have a mailing list where they just distribute information to like local cops that's the whole idea it's like a way of information sharing between federal agencies and local agencies and so the FBI will like send a bulletin to NCRIC and NCRIC will forward this along to the you know 16,000 or whatever local cops

and that's basically it it's like a mailing list like there's some more like NCRIC also um offers services like um breaking into locked phones so if you're in Northern California you arrest someone and you get their phone but you don't know the password um you can go to NCRIC and they'll like help you break into the phone um so there's a few things like that and you can actually see like the list of requests that have been made to break into phones and to do other things like that or they also offer like social media uh like I think social media exploitation is what they called it services to like uh you know look up people online and stuff um

but yeah I mean I noticed that like the FBI and the DHS would send these uh bulletins to NCRIC that were kind of like sometimes they were like already uh debunked hoaxes but the analyst that was sending this thought that it was real and fell for it and then they would send this to all of the cops so there was like a thing about like um George Soros paying professional anarchists and like that was an example of one where like that was distributed to thousands and thousands of cops across Northern California and I think probably across the United States even though like I was looking at like the

date that they sent that email and like there was a Snopes article from a week earlier like explaining why it wasn't real so yeah hi um I've started reading the book it's really great thank you for writing it and I know that you mentioned that you know the motivation behind it is that there's tons of data that you know not many people are looking at maybe you know lack of resources etc how do you prioritize um the nuggets of stories you know I know this is more about how to tease the story out of the data how do you prioritize you know the tons and tons and tons of things that

you come across I mean that's really hard because I have to just ignore all of it I mean really it's like it depends on how busy I am when I hear about a data set and then it and then if it if if I think that there could be something juicy in there I don't know there isn't a good way it's very subjective and I I end up there's like all of this data and if I'm not too busy I'm like oh that looks interesting I'm going to see if I could like take some time to look at that and if it looks like really interesting to the point where it's like okay I think

that if I keep going I could like actually pull out a really good story then I keep going on it um but sometimes I just take cursory looks at data sets um and that's it and then I move on um but yeah I mean I think that really it's about what I'm interested in covering and like so I'm sure there's lots of data sets that um just are not really in my area of stuff that I write about or whatever but that probably would like be really big for certain communities that I just don't look at so I think that the more people that have these skills the better because then they can write about uh you

know what they're interested in especially like with BlueLeaks um there's stuff all across the United States and basically everyone's just writing about their local stuff and so if there isn't anyone local in like New York that's doing this or in Portland that's doing this then there's going to be no journalism from New York or Portland or whatever I was wondering like in the beginning of this you mentioned a bunch of different data sets that are really recent um that people haven't really been looking at how do you find out when these data sets come out like is there just searches you do or um so all of those

data sets were all from Distributed Denial of Secrets um and all of the screenshots at the beginning those were all actually just from their like newsletter so um that's one really easy way is uh ddosecrets.substack.com like you can go there and you can like look at the you know post history and every time they release a new thing they write a post about it um there but this is not at all uh everything this is only a small amount that they hear about or that people submit to them and that they like decide is worth uh holding on to and releasing and helping

journalists with um uh so there's a lot more um but yeah I think that that's probably the best way is to just look at DDoSecrets and see what they're releasing I know you have some affiliations with the Tor Project and uh I was really influenced by Jacob Appelbaum in the early years do you know if he's ever going to make a comeback I hope

not all right it looks like that's it awesome thank you