Keeping up with the Pwnses with Talkback.sh

Name: Keeping up with the Pwnses with Talkback.sh
Uploaded: 2024-10-29
Duration: 25 min 23 s
Description: Matt Jones presents Talkback.sh, an automated library and search platform for curating cybersecurity research across RSS feeds, social media, and conference archives dating back decades. The tool uses machine learning to classify resources by topic, extracts metadata like CVEs and reading time, and

BSides Canberra · 2024 · 25:23 · 273 views · Published 2024-10

Speakers

Matt Jones

Tags

Topic: OSINT, Threat Intel

Difficulty: Intro

Style: Talk

Mentioned in this talk

Tools used

Lucene, Talkback.sh

Service

Internet Archive, Thinkst Citation

Protocols

GraphQL

Vendors

OpenAI

About this talk

Matt Jones presents Talkback.sh, an automated library and search platform for curating cybersecurity research across RSS feeds, social media, and conference archives dating back decades. The tool uses machine learning to classify resources by topic, extracts metadata like CVEs and reading time, and surfaces trending content through a web and mobile interface designed to help security professionals navigate information overload.

Transcript [en]

Welcome, everybody. We have with us Matt Jones, who is going to present "Keeping up with the Pwnses with Talkback.sh". So over to Matt, and thank you. [Applause] [Music] Hey, how's it going, everyone? Thanks very much for coming. I submitted this as a 101 talk; it's not a technical talk, but I think it's a super relevant talk for people. I'm Matt, I work at elttam, and this is about a tool we've been working on at work for a while to support us. I was trying to work out how to introduce this presentation, and I thought of an interview on the BBC in

'99. It was an interview with David Bowie, and they talked about the rise of the internet. One of the quotes that came out of it was him talking about how, with the internet, there's going to be a weird interplay between mediums and between people who produce content and people who consume it. He also talked about how he thought the internet was from aliens and stuff, but this quote in particular I thought was really profound, because if anyone who was around on the internet in '99 or the late '90s thinks about how the way information gets to us has changed, it's so vastly

different. I was living in Melbourne in 2020, and I remember feeling super overwhelmed by the difference between my reality in the digital world and in the physical world, being at home every day. I was thinking about mediums in this industry: we receive mediums via our phones, our computers, our work and so on, but we can also choose mediums. We might choose them reluctantly, but we can choose newsletters and podcasts and all sorts of things to keep up with news. Everyone has their own mind map for how they think about information, and everyone here would have a different mind map in regards to how

you logically break down topics of interest and things that are relevant to you. Personally, at work we do technical assessments. Dan and Zan were just talking about some embedded work we're doing, but we see so many different things across hardware and software and all sorts of stuff. On the left is how I think I felt in 2020, just the bombardment of information, but on the right is my target ideal state, what I dream of. Some of you, maybe a few people here, were about in 2012 when I gave a presentation at Ruxcon about a tool called Talkback. I have no

creativity in naming tools, so I just gave this one the same name. Basically, I was writing systems for tracking vulnerabilities and supporting workflows, and around the late 2000s I prototyped something that monitored people on social media talking about vulnerabilities. It was really relevant because I noticed that, compared with a lot of the feeds we had, there was better quality data in social media. What I worked on was extracting metadata from social networks about languages, countries, and relationships, and then being able to plot discussions about vulnerabilities. I was classifying analysis, exploits,

post-mortems and so on, and I could do bubble charts and all sorts of stuff. But one of the things I prototyped came from thinking about this general problem. I had a community of users, about 30,000 or so people who had talked about vulnerabilities over a few years, and I wanted to find trends within those communities and social networks, and then train and supervise a trending resource feed in the system. The funny thing about it was that I developed this and there was a Twitter feed, oh sorry, there was an RSS feed, and I was using Twitter predominantly, and then

someone wrote a script to put that feed out onto Twitter, and then I thought, oh cool, I'll put that into a newsletter. So a newsletter was formed: it was scraping stuff from Twitter, going through this process, going back to Twitter, and then making a newspaper. But around this time I just didn't have time to work on it. I had all these ideas, but I was freelancing, helped start elttam, and then had no time. So I sat around for a while thinking someone's going to make the thing that I want, and that never happened, so here I am. During that lockdown period,

thinking about this problem, we decided at work to put a bit of effort into trying to make something. It's like a library: we want a library of resources that are cyber security related, we want it completely automated, and we want it completely self-maintaining in nature. I don't want to touch it; I don't want to waste time touching things. We started developing it properly around 2022, we made it public in early 2023, and since then we've just been building features as we see fit. We've made it for our team, so we integrated it with some of our work,

some of our tools, and I've got some examples of that. Seb Macky, who works with us, is a security engineer, so he helps build our tooling, and he has been one of the main people behind this. We've focused heavily on content analysis and on really trying to extract useful data about resources to support many use cases. Talkback is a web UI, it's mobile friendly, and there's an API which we've built as well. We prototyped it and did that public release, then we did a GraphQL API late last year, and today there are a couple of cool features that I want to talk about. The use cases that we wanted to

support, in really simple terms, just to get everyone on the same page: we want to view recent publications and news, so what blog posts, slides, videos and so on, what open source projects are coming out or are being talked about again, and news in the industry. We want full-text search, and this is probably where the desire came from; my yak shaving is giving this presentation. If we're doing an assessment, I want to ask what resources or news are relevant. Maybe there's some slide deck from 20 years ago and it talks about a

function that's part of a code base we're reviewing. I want to know that, and that's where this thing really started from. But I also want search operations to be quite powerful. Then there's the idea of being able to drill down into specific areas: if you work in forensics or IR, or you're doing exploit development or whatever it is you're doing, you can just filter on all that. Or also topics, so you might be interested in a really specific system component or in a set of tools, whatever it might be. And then the final

one is integrations, and that's maybe one of the most important things for keeping things seamless when people are working, so you can integrate it into your workflows. So that's the intro. For data processing: when we started building this, wanting it to be automated and self-maintaining, we had to start somewhere. The general idea was that we wanted an index of security-related resources that goes back a decent amount of time and that humans have vetted, and we wanted to be able to just grab them all and use that as our initial data seed. We chose Google BigQuery for r/netsec, and we basically

grabbed all this data. It's just like an RSS feed with a bunch of URLs and timestamps, but from this point we at least had a lot of URLs. Then we needed to work out how to continue building the system to make it autonomous and featureful. So first, we need to index and parse the content we receive, and second, we need more sources than that; we need to be smart about how we dynamically monitor sources. The way indexing works, in a really simple way, is that new resources come in to Talkback, and then if it's web related, let's say it's a blog post,

we have to write something to parse the web content and extract the content of the article. So on the Google Project Zero blog post in the screenshot, we want the stuff in green to be indexed and associated with the resource. We don't want all the footers and sidebars and so on, because those might talk about other things that have nothing to do with it and taint our data. We've done that and it works pretty well. If it's not a web page, we want to be able to index it as well, so we use Apache Tika for this. If it's a PDF or a PowerPoint or whatever, we grab whatever the

content is and throw it into Elasticsearch. From there we have to extend our sources, and there are predominantly two angles: one is social media and the other is RSS feeds. The goal is really for Talkback to be much more focused on tapping into all the RSS feeds, because that's the reliable source and you're going straight to the person who's creating the content. But we use social media in a few ways as well. The summary is that when we parse new resources, we look up whether there's an Atom feed or an RSS feed for where the resource is hosted, and we store it

in a library, and then we have a bunch of logic, with some considerations, for following and prioritizing RSS feeds later. We also tap into social media activity: we take all the features such as likes and follows and retweets, or whatever they're called, and we store that. We store all those accounts as mediums, so we're tracking all the social media accounts as well. That's one angle, but the second angle is really conference archives. This came about when we said, you know, Reddit and social media and all this other stuff isn't capturing conference papers and publications going back 20 or 30-plus years.
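The feed lookup mentioned above (checking whether the site hosting a newly parsed resource advertises an Atom or RSS feed) can be sketched with standard feed autodiscovery, which reads `<link rel="alternate">` elements from the page's HTML. This is only an illustration under assumptions: the talk does not say how Talkback implements the lookup, and the names `FeedLinkParser` and `discover_feeds` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that conventionally mark a feed in <link rel="alternate"> tags.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised by <link rel="alternate"> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        if "alternate" in rels and attr.get("type", "").lower() in FEED_TYPES:
            href = attr.get("href")
            if href:
                # Feed links are often relative, so resolve against the page URL.
                self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    """Return absolute feed URLs found in an HTML document (hypothetical helper)."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Calling `discover_feeds(page_html, page_url)` on each newly indexed page would give a list of candidate feeds to store and prioritize later; sites that do not advertise a feed in their markup would need a different approach.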

and we actually want that data because we want to index it. So we wrote another module. It's a bit of a pain to parse all this conference data, but basically we went out to Black Hat and USENIX and a handful of other conferences to index all their data, so we were grabbing all the PDFs and PowerPoints and stuff and throwing them in as well. We also used a website called Thinkst Citation. That's done by some folks out in South Africa; they make the Canary tokens, and they have this resource available as well. It's a big index of conference archives and

presentations, so when we're parsing a PDF we can give it some context about what conference it was at and so on. As of now, our data is obviously much heavier in recent years, for many reasons: it's the quality of our coverage of sources, but there's also a lot more content being produced. Still, a lot of our indexed data goes back all the way into the '90s, and we can pull up PDFs and stuff from the early 2000s or whatever. So there's relatively good coverage, but it could get better from here. Now, one of the things about extending resources is that this is all autonomous and it's all

computers doing things at the moment, but we actually need some form of supervised learning to assist us in flagging resources. We came up with a feature called curators. The idea is that in this industry there's a bunch of people who curate information and publish it regularly. Risky Business is a really good example, with a long track record of publishing "these are the articles that I'm talking about". ThinkstScapes do a quarterly PDF, which we grab and index as well. We do tl;dr sec and a bunch of popular newsletters, and CTO at NCSC, run by Ollie Whitehouse. They're different angles and the

different perspectives of people in different roles in the industry, and we wanted to have that breadth of coverage. We use all the resources that have been flagged by these curators to help train our data set a bit. When resources come in, we run a bunch of modules. There are quite a few, and I'm showing the ones that I think give you an idea of the different types of features we're going for. When new resources come in, we automatically archive them in the Wayback Machine. We do this just in case stuff gets taken down; we want it saved forever. And the second thing is

we calculate the reading time. We extract the RSS feeds, which I talked about before. We identify things like CVEs and references to CWEs as well, so we can cross-reference that information using the API. We run a bunch of word cloud generation and screenshots and a bunch of other stuff, and we also do cross-references, because we have a library of resources we collect. A resource might talk about other blog posts that inspired the work, or it might link to a GitHub repo, but there might also be links in the other direction, from news articles and stuff like that. So we do all

the cross-referencing. On the right you can see some of the data being enhanced from NVD, and then the references: this article references other resources, and other resources reference this resource. Now, this slide and this feature are something which has evolved since we started. We prototyped some classifiers, and we wanted to define a bunch of categories in this industry, such as, like I said, forensics or exploit development or ICS or whatever it is. We want to be able to give resources a one-to-many relationship with categories, and we want to bundle them into buckets automatically.
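Two of the per-resource modules described a moment ago, reading time and CVE/CWE identification, are simple enough to sketch. The following is a minimal illustration, not Talkback's actual code: the 200 words-per-minute rate, the regular expressions, and the function names `reading_time_minutes` and `extract_identifiers` are all assumptions made for the example.

```python
import math
import re

# CVE IDs look like CVE-YYYY-NNNN with four or more digits in the sequence part.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
# CWE IDs look like CWE-NNN.
CWE_RE = re.compile(r"\bCWE-\d+\b", re.IGNORECASE)

# Assumed average reading speed; the talk does not state the rate used.
WORDS_PER_MINUTE = 200


def reading_time_minutes(text, wpm=WORDS_PER_MINUTE):
    """Estimate reading time in whole minutes, with a floor of one minute."""
    words = len(text.split())
    return max(1, math.ceil(words / wpm))


def extract_identifiers(text):
    """Return the de-duplicated, upper-cased CVE and CWE IDs mentioned in text."""
    cves = sorted({match.upper() for match in CVE_RE.findall(text)})
    cwes = sorted({match.upper() for match in CWE_RE.findall(text)})
    return {"cves": cves, "cwes": cwes}
```

The extracted identifiers could then serve as keys for enrichment from a source such as NVD, which is the kind of cross-referencing the talk describes.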

The first prototype was terrible, then OpenAI came along and helped us out a lot. What happens is we have the candidate list of categories, which are defined in the system, and then we run this classifier to say, you know, this is predominantly about exploit development, but it also might talk about networking concepts or cryptography or whatever. So it's one-to-many, and it gives the rationale per resource for why it's being categorized this way. It's actually in a really good place at the moment. The second thing is we wanted to extract topics, and topics are: what exactly is this

resource talking about? Categories are one thing, but topics are completely different: is it talking about specific tools, or specific system components, or specific operating systems or frameworks or whatever? So we extract that as well, and this classifier has run back a good 20-plus years across all the resources. This screenshot shows that this resource is categorized as exploit development and it talks about Chrome and V8, and because we have that data set, we can now view everything about Chrome or everything about V8. Summaries and ranking: this is about two features. The first feature is

like a TL;DR for every resource. It's built on the notion of creating five bullet points, because five bullet points seems like the sweet spot for summarizing content. It does the intro, the general gist of things, and then the conclusion or takeaway. This is done automatically for every resource we parse. As an example on the right, you can see this random article has one, two, three, four, five sentences that summarize the content, but it's actually a five-minute read. That's just for skim reading and for a few features we have in the UI. We also made a

custom ranking formula for resources. It's basically trying to give each one a 1 to 100 score saying how legit the resource is. For legit, we think about popularity on social media, but admittedly that has a lower weighting. We also look at whether it has been featured by curators, whether the home where it's hosted has been featured by curators before, and how frequently they post; maybe they're spamming stuff a bit too much, so we might reduce the weighting. We also think about the cross-references and how legit those cross-references are. When trying to make a system like this, we want to

remove our personal biases and have a general consideration across the data model that we capture. That's where it currently sits, and I think it works pretty well, but we've refined it quite a few times: when there's been a flood of something coming in, it's, oh, we need to factor in some logic to reduce the score. So, UI and UX work. It's web and mobile friendly. We put quite a bit of time into making it mobile friendly, because a lot of people admittedly use their phones for scrolling stuff. This is just a screenshot of the landing page. It shows trending resources for the past week. You can look at trending

vulnerabilities, trending topics and so on. Browsing and filtering is really important for the use cases I talked about before, and the search box allows us to do Lucene-based queries, so we can use operators like AND, OR and so on. It's actually pretty powerful. You can use this completely without registering an account. It will save your history and your favorites, so you can save things for reading later; that's all done in local storage if you don't have an account, and if you do have an account it will persist that. And the GraphQL feeds: we developed the API using GraphQL, so you can

go and create an account, then get an API token, and then you can see the schema and create your own queries and stuff. Zan and Dan just presented on MCU stuff, and one of the things that we prototyped as part of that research, but haven't released yet, was a VS Code extension that allows us to integrate Talkback into VS Code. So we can say, oh, we're looking at these specific functions, or these specific technologies or libraries, and then we can bring in related research that's coming up. So yeah, that's where things are

at. I hope I have internet. So there are a few features. I talked about the homepage, so if I just scroll, it shows the screenshot and whatnot, but you get this preview pane. The preview pane gives the summary of the content, and it shows what topics it's talking about; this one relates to ICS and networking, so it's been classified this way. If I view the full details, I get the resource detail view. I can see the OpenAI summary, I can see the word cloud, and we show where it's hosted; we capture where everything is hosted and how many resources we've seen,

the ISP and all that sort of stuff. You can see the categories, the classifier results, and then you can see the cross-references, and you can read the resource as well, if it loads. Then there's another feature called Chronicles. Going across, Resources is really where you can go to get everything. There are filters on the right, behind this little filter button: you can do tags, you can do topics, you can see if it's featured by specific people, you can do URL searches, keyword searches and so on. You can change your sorting, so we can go all the way back however many years the data has been captured. You can

create feeds from here, you can do [unclear], you can save resources for reading later, so it's fairly featureful. You can punch in your search terms here. Because the Wi-Fi is being a bit slow I'm probably not going to try, but basically it could be anything, right? CreateProcess was my example, so I want to find Windows API articles, or maybe I'm interested in some specific Windows subsystem or whatever I'm searching for; I can do that now. That's Resources, but Chronicles is a different feature, and it's basically snapshots of time. It's by week, by

month or by year. You can browse, and you can even filter by categories and say I only want to browse these categories by week, by month, or by year. You can also browse topics. It's just right there; hope it

loads. Yeah, so this interface is really ugly, but basically,

if I want to see everything about Metasploit, this is all sorted, I think, by date, so you can do everything chronologically or by our ranking score. You can do this with everything, and it shows whatever we're capturing, whether it's V8 or Java or whatever the hell it is; it could be malware, it could be a toolkit. It shows the URL and the type, and we want to enhance this part a lot, because you could do really cool trend analysis and stuff with it. One of the features that we're publishing today, and I'm mindful of time, is, you know how I was talking about how I

don't, we want it to be completely automated, and we leverage curators to help our system learn from our data set. So one of the things is newsletters. I find them really interesting because a newsletter is a specific angle of someone curating things and publishing them. Some people, like Pat Gray and that, do an amazing job with Risky Business; they get such a range of content. But a lot of other ones are quite myopic in nature and quite focused on a specific category. So we have a new feature which has been pushed, which is newsletters. You can create your account, go to Newsletters, and then select what categories

you're interested in. We want to extend that so you can do topics and all sorts of stuff as well. Then it will send you a weekly digest with the TL;DRs of all the resources, and you can then browse on Talkback. You can get that on your phone. This is completely free, by the way; this is just a tool that we wrote selfishly for ourselves, and it's actually useful. You can access it at talkback.sh. Our blog has a couple of articles on it as well, and we'll do one just on newsletters. You can email us if you have bugs. We get emails more often

nowadays saying, hey, this feature's broken or whatever, or can you swipe articles, and all sorts of stuff. A lot of the suggestions have actually been really good, and we're trying to just make it better and better. The system is running and we're really happy with where it's at: it runs autonomously, we don't have to touch it anymore, and it fulfills those use cases I started off with. It seems easy at face value, but it's been a bit of a pain to write, so we're happy we've overcome the hurdles to get it to where it is. OpenAI saved us heaps of time.

We only use it in very specific ways; there's no reliance on it. Is this an AI tool? It is and it isn't. And yeah, we'll just keep pushing updates, and if you have feature requests, let us know. I hope you find it useful. If you do, tell people about it, and let us know what you think. [Music] [Applause]
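The 1 to 100 ranking score described earlier in the talk combines social media popularity (at a lower weighting), curator features for the resource and its host, cross-reference quality, and a reduction for hosts that post very frequently. The talk gives no formula, so the sketch below is purely illustrative: every weight, the posting threshold, and the names `Signals` and `rank_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical per-resource inputs, each already normalized by the caller."""
    social_popularity: float     # 0.0 to 1.0
    curator_featured: bool       # resource itself was featured by a curator
    host_featured_before: bool   # host has been featured by curators before
    cross_ref_quality: float     # 0.0 to 1.0, quality of linked resources
    host_posts_per_week: float   # posting frequency of the host


# Assumed weights; social popularity is deliberately the smallest term.
W_SOCIAL, W_CURATOR, W_HOST, W_XREF = 0.15, 0.35, 0.20, 0.30
# Assumed threshold above which a host is treated as posting too often.
MAX_POSTS_PER_WEEK = 20.0


def rank_score(s):
    """Combine the signals into an integer score clamped to the range 1..100."""
    raw = (
        W_SOCIAL * s.social_popularity
        + W_CURATOR * (1.0 if s.curator_featured else 0.0)
        + W_HOST * (1.0 if s.host_featured_before else 0.0)
        + W_XREF * s.cross_ref_quality
    )
    # Scale the score down proportionally for hosts that flood the index.
    if s.host_posts_per_week > MAX_POSTS_PER_WEEK:
        raw *= MAX_POSTS_PER_WEEK / s.host_posts_per_week
    return max(1, min(100, round(raw * 100)))
```

Keeping the weights as named constants reflects the point made in the talk that the formula was refined several times: when a flood of low-value content arrives, one term or threshold can be adjusted without touching the rest.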
