Matt Jones - Trumping Musky Infosec Noise with Talkback sh

BSides Perth · 2025 · 33:03 · 92 views · Published 2025-10

Speakers

Matt Jones

Tags

Style: Talk

Mentioned in this talk

Tools used

Apache FOP, Elasticsearch, Lucene

Platforms

Google BigQuery

Service

Internet Archive

Show transcript [en]

Okay, cool. There was a really great video last week. I would butcher the channel's name trying to pronounce it correctly, but I really love this channel. They did one asking whether AI is humanity's final invention. One of the quite interesting things they talk about is that when AI systems were first coming out, they were learning from human data on the internet. And now there's this cycle of AI crap going back to the internet, and then these systems are rereading and referencing that information again. So it's a really interesting time for content and where things are at. But then moving to cyber security. So that's the universal apps and technology in our digital

lives. But what about cyber security? It's a very broad field, and it's also an incredibly deep field. So there's a lot of information at different levels, in different topics and domains that we all might be interested in, and there are many ways you can keep up with it. You might be a member of Discord, Slack channels, Signal groups, but you also might subscribe to newsletters and podcasts, whatever you use. And it's all disjointed, with information in different locations. Over the last 15 years or so of Reddit, once a month, like clockwork, a post like this comes up and it's basically, hey, how do you guys keep up with information in cyber

security? And then all the replies are listing RSS feeds, and they recommend things like Hacker News and BleepingComputer and a bunch of other stuff to use. People are also saying, I'm actually looking at building an app to aggregate this. So this comes up all the time, and then you never see anything about it again. One of the more recent replies is like, this has been covered to death on this subreddit, you should search and you're just going to find, you know, 100 instances of the same question. But one of them here says there is no single source. So, what is Talkback and what is Talkback trying to do?

So in a simple nutshell, it's designed to be a fully autonomous infosec resource aggregator. We first released it back in 2023, very early on, and it's had steady development since then. I presented on it when we first released it at local SecTalks, and after about a year gave a talk at BSides Adelaide about it with some features, and then BSides Canberra with some more features and where things are right now. It's actually a pretty stable system. It doesn't have too many changes. We're pretty happy with where it's at. So today I want to talk about all the features and show them off, and hopefully you can see how you can use the system. Um

one of the things that's different versus presenting it in the past is that we've had feedback from lots of users about how they use the system now and their use cases. So that's kind of useful. We built this for our team to use, so actually having an organized set of data about public resources is super useful. We do assessments on software and hardware and a bunch of other stuff, so having access to public information and being able to access it quickly is really important. But we also made a decision to just make it a free community tool as well. So that's what I'm showing today. The person who

develops it primarily is Sebastian Mackey in our team, and he's done a pretty awesome job at building all these cool features. There are a few principles which we have run with. The first one was to keep it as simple as possible. But we also wanted it to not be dependent on human curation, and that was a really important thing, just to remove our own biases. So the next point here is about reducing editorial biases, and that kind of bias is really easy to introduce, because people who are making curated lists of resources have their filter bubble, basically, of what types of things they're interested in, and will selectively choose what content they want to include. So we're

trying to avoid doing that. Free, no ads. We also wanted it to have a really clear design, UI-wise, but also be consumable in many ways, and hopefully it's fairly snappy performance-wise. So the way it works is when resources come in, we just want to index them, and we want to get the full text of whatever the resource is. So whether it's a blog post or a PDF document or a slide deck, we want all of the text, and we want to then be able to store that. So this particular screenshot is showing that for a blog post we want to have

the body of the resource. Web pages are kind of funny because they have the headers, the sidebars, the footers. So we just want the thing which the article is, and then we want to be able to parse PDFs, or any document. We use Apache Tika for extracting out that information, and then we stuff all the data into Elasticsearch. But where does the data come from? One of the questions on Reddit is, you know, what RSS feeds do people recommend? So TalkBack subscribes automatically to thousands of RSS feeds, and it also subscribes to many different types of social media accounts, things like Reddit and Twitter. And um, what it's basically

doing is maintaining a list of RSS feeds. From everything that it's seen throughout time, it has its own inventory of RSS feeds, and then we automatically poll them for new updates. We do the same for social media accounts as well. So people who are sharing information about technical resources or whatever it might be, we're capturing those users so we can pull from them as well. We also reference conference archives. And the reason we tapped into this was because, uh, Louisie had his talk this morning saying there's lots of research and things are coming back in cycles. So we wanted to be able to, if you're looking up a

topic, you can then say, "Oh, this was actually talked about at a Black Hat presentation, or some other presentation from 2001," for example, and that sort of thing comes up quite regularly. So we index Black Hat, USENIX and a bunch of other conferences. All of that's automated. And I said that we didn't want to interfere by manually tuning stuff. So we also tap into this idea called curators, which is people whose job is usually to curate cyber security resources and make them available. They will be talking about maybe something that's been trending from the past week or the past quarter. So Risky Business does

their weekly newsletters. ThinkstScapes do a quarterly PDF. So we grab that, we parse it, we extract all the resources they talk about, and then we highlight that this has been talked about by a curator. Tood is quite popular. The CTO at NCSC has a weekly blog as well. And then Louie curates a bunch of things that he's found popular from the past week. So we see when humans have talked about this stuff, and we use that as a way of just flagging that these particular resources have been talked about by one or many curators. When resources come in and we have the full text, we want to be able

to add as much information as we can and extract useful information about a blog post or a paper. We save every resource that comes in into the Wayback Machine, so it's archived and available via the TalkBack UI. We calculate reading time. We grab CVEs and CWEs, so vulnerability references like CVE-whatever, and CWE, which is vulnerability types. We grab MITRE ATT&CK campaigns, software and techniques, and cross-reference them to the resources as well. We generate word clouds and screenshots, and we do cross-references between what this resource is talking about versus other resources that reference it as well. So you might have a blog post that is talking about something, but it's maybe inspired by

some previous blog post from 3 years ago, and maybe something has come out since that references the blog post you're reading as well. You can see that on the right. So you can see the vulnerabilities that are referenced by this particular resource, with the CWE and the CVSS score, and then the references, where this is pointing to something from one month ago or two months ago, and then it's referenced by something in the last day. One of the things that's been pretty fun to work on has been categorizing things. We wanted to throw all of our resources into buckets and categories, and it can be many. So you might have a

resource that is a blog post pulling apart some malware for some sort of device. So maybe that's categorized as malware, and maybe it's reverse engineering, maybe it's forensics, but we want to be able to put those things into buckets. We do that using an LLM, and we give a confidence score for every category that we think it's about. So this is 90% confident that it's about exploit development, and it gives the rationale for why it thinks it's about exploit development. Then, in addition to that, we also want to grab what it is talking about. So this blog post talks about exploit development, about Chrome and V8, and it's extracted that out

and given the context of it, but these are now entities in our data model which we can then pull up and query and run all sorts of stuff on. The next thing is summaries and ranking. I said we use an LLM for the categorization and classification of resources, but we also use it to generate a TL;DR for every resource as well. So you might have a 40-page document that takes you a while to read. We break it down to five bullet points to summarize what the content's about, and that's just so that you don't have to read the whole thing. You can just quickly skim to decide if you want to read the

whole thing or not. Make your own choice. And then the next thing is making a ranking score. The ranking score is something that we've been refining pretty constantly. The idea is that we have a weighted formula over a number of attributes and features that are in TalkBack. So when we're calculating a score, it's basically 1 to 100, trying to say how good something is, how interesting it is. We look at things like, has it been featured by curators? What are the cross-references looking like? So has it been built on, or does it reference other things that are super reputable? We look at the social media score and

the weight, and then we factor in all these things and come up with a 1 to 100 score, and you'll see shortly how that works. So, using TalkBack. I guess before I go into each one, we built the UI to be mobile and normal desktop friendly, so you can use both pretty seamlessly. And it's evolved quite a few times. It was initially just a dump of resources. And now when you use it you land on this landing page, and this is just showing key resources that are trending for the

past 7 days. And then you have a preview pane as well that can always pop up. This preview pane shows a screenshot, where it's hosted, what the reading time is, a summary, and then the LLM summary of what this content's about as well. You can see that little icon at the top next to the date, and that just says you can save it for reading later. It shows you how it was categorized and why, so application security related, what topics it talks about, and then you can view and see more information about this if you want. I'll come back to that shortly. So that's how the preview works. We also have like trending

vulnerabilities from the past 7 days and trending topics from the past 7 days. So this is one way you can enter the data set. Library, inbox and chronicles are different features, and then you have feeds and newsletters. Search, I'm trying to keep up with the GIF, search is basically using Lucene syntax, which is pretty powerful. We have an API which is free, you can get an account for it, and we have newsletters. So I'll run through all those examples now, but that's how it looks. The main thing is about resources, and a resource could be just anything, like a blog post or a paper

whatever, but it's a consistent view in regards to what that resource is about. So I have two examples. The first one is this Fortinet post from very recently. You see the title, the summary, what categories, what topics, and we show where it's hosted, so it integrates with Shodan as well. We show how long the reading time is, the screenshot, the AI summary, the word cloud, and then how it was categorized. So this is mostly about malware analysis and network security. And then you can see the cross-references to another article from Fortinet, which you can then preview as well. And then this has the same things.

You can just go down the rabbit hole if you want. And then this talks about MITRE ATT&CK techniques, so it's about spear phishing and scheduled tasks for code execution. I have a second one, which was after seeing the GPU talk yesterday. It was a great talk, and it was about a lot of other presentations and blog posts. So this is showing a blog post from STAR Labs about two CVEs. You get the summary, it's exploit development and reverse engineering, it's about the GPU driver here, and then the two CVEs. But one of the things here is you can see all the references, including a paper at USENIX

from like 11 years ago. So it's really useful for being able to see things that came before and other things that reference this. And this is showing, if I'm looking at the CVE, what other posts in the TalkBack database have information about that. So hopefully people find that useful for getting to information quickly, but also for seeing past work, which is really useful to be able to get to quickly. This has been maybe one of the biggest features for us at work, because using Google or an LLM to find past work is really hard. So we want to be able to make that a bit

easier for ourselves, and we hope other people find it useful as well. So browsing the library is next. This is the resource view in the TalkBack library. It has the same preview pane, and you have filters, so you can sort chronologically or by risk rating, you can select date ranges, you can do full-text search with Lucene queries, you can select categories and topics that you're interested in, CVEs, CWEs, you can change your sorting order by our rank or by date, you can change how this interface is shown, and you can do quite a lot with it. But there's also a drop-down here where you can then hone

in on vulnerabilities and do the same. You can also drill in on categories, so you can enter the library this way as well and hone in on a specific category. And the next one is topics. Topics are super useful to browse. So if I'm interested in this particular authentication protocol, I can see all the different papers and blog posts and stuff that have talked about that. So yeah, that's the library section. That's basically what we started with, and then we started getting feedback from people who were using the tool saying, oh, I use it every day, or I'm away for a week and I want to

be able to catch up on information better. So we worked on two separate views to help people like that. There are two main UI features. One is called inbox and the other one is called chronicles. The idea with inbox is it's kind of like a reader, where you can filter by technical resources, news resources, by type and by category, and you get this simple summary of what it is with a screenshot, and you can say what you are interested in or what you're not interested in, and it will save it so you can read it later if you want. So this is kind of

like just a quick way of getting through all the stuff. So when you save it like that and you select that button, that will remember it. You can either have it saved locally in your browser or persisted to your user account, and then you can come back and read it further later if you want. So this is one view, which is quite recent. And then chronicles is the idea of capturing information by week or month or year. So this is looking at it from a weekly perspective, but you could change it to monthly, and you could then sort it chronologically or by how it was rated in the system. So you can

see like the hottest stuff according to TalkBack by month, week, or year. And you can also filter by categories. So if you're interested in certain topics, such as industrial control systems or something else, you can just say, I want to go back in time looking at that stuff. So that's what those two features are. And then GraphQL. We made an API available, and what you can do is log in and then just get a token, and then you can integrate that into your own code to do stuff. We chose to use GraphQL so that it's quite

flexible for people to be able to understand what the schema is and then make up their own queries. But we also have RSS feeds that we created as well. We publish the RSS feeds pretty regularly. I think it's every hour or so they'll get updated based on what TalkBack has seen. The RSS feeds are just there in the More drop-down, and then you have technical, news, or by category. And then you can go to the API, and we have this how-to page, and this is just a way to quickly test and prototype your queries. So I'm using GPU and Mali from yesterday, and it's just selecting the ID

and title. But then I can go through the schema here on the left and add what additional attributes I want to query for. So in this example I think I'm going to select the topics, the TL;DR, and I think the CVEs which are relevant. And then I should just be able to rerun this and get all that data. So then I have all that as JSON, which I can work with. And the other example here is if you have a unique identifier for a resource, you can then get all that information as well. So this is like the OpenAI summary, a bunch of other stuff. So that's how you can use the GraphQL

API. With the feeds, we've heard of people who are just throwing them into Slack channels or Teams channels if they're interested in certain topics, or integrating them into their Feedly or something like that. So going back to what I was saying before about people on Reddit asking what RSS feeds you recommend, technically TalkBack should be following all of those ones plus a lot more, plus getting all this extra data. So then you should be able just to use these feeds if you want. And then we've had some people who have made their own apps for TalkBack kind of recently as well. So someone made this

Slackbot, and I think they're extending it to Teams as well. It uses the GraphQL API. It doesn't use RSS feeds. So if you're interested in watching TalkBack for specific references, or for certain keywords or certain combinations of queries, you can do that and then get a feed, and it will just be pumping them into this channel. So this is available on this guy's, I think he's in the UK, paper mountain talkback messenger. He asked about a bunch of things, and it was good that someone was integrating with it, because we fixed a bunch of bugs and we made it easier for him to integrate

with. But now you can see that all of that data we're extracting is available in the feed as well. So it's better than just an RSS feed, which is like URL, time, date. You have all this additional data as well. And yeah, the final example is newsletters. This is basically the TalkBack weekly chronicles, where you have that weekly view. Obviously there's tl;dr sec and a bunch of other email newsletters around, but I was talking about that filter bubble which can happen when people are manually curating information. So this is automatically going through every category and then showing, I think it's like a certain amount of top

resources for that week, summarized. So we have the AI summary plus the score from the system, and then all of these links go to TalkBack, so you can then jump to that resource in the system and save it to read later. This view which I just showed is obviously in the UI, but that same content is emailed every Monday morning. We've had that running since last year, and you can go back through the history of all the newsletters as well. We've had people saying that they subscribe to it, they filter on the categories that they're interested in, and then they get

a weekly digest emailed to them, and it just helps them with their routines. So yeah, those are the features that I wanted to show. I think I'm quite early to finish up, but I guess to summarize where things are at, it's been a pretty steady amount of development effort, but nothing too crazy. It's all pretty achievable. It's a pretty simple system, but we hope that it's useful to people and that you can actually save time and be more productive. And there are now enough ways you can interface with TalkBack, so whether it's the UI or the API or RSS feeds, that you

can make it work for you. We've been finding it really useful for our assessments and what we do at work. And there's potential for things like leveraging the data more to look at trends and things that are happening for specific types of attributes that are in the system. It's available at talkback.sh. You can email us if you have any requests for features or bugs or whatever. You can just shoot us an email and let us know. We hope you find it useful, and if you do, please tell friends, colleagues and so on about it. We find that most of the users are coming from like Europe and the US and

like Australia has like a small amount of people using it. But if you do find it useful, tell people who might find it useful. But that's it. Happy to answer any questions.
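
As a rough illustration of the enrichment step described in the talk (grabbing CVE and CWE references from the full text and calculating a reading time), here is a minimal Python sketch. It is not TalkBack's code: the function name `enrich`, the output keys, and the 200 words-per-minute reading speed are assumptions made for this example. Only the `CVE-YYYY-NNNN` and `CWE-NNN` identifier shapes are standard.

```python
import math
import re

# Standard identifier shapes: CVE-<year>-<4 or more digits>, CWE-<number>.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
CWE_RE = re.compile(r"\bCWE-\d+\b", re.IGNORECASE)

# Assumed average reading speed; the talk does not say what TalkBack uses.
WORDS_PER_MINUTE = 200


def enrich(text: str) -> dict:
    """Extract vulnerability references and estimate reading time (hypothetical helper)."""
    # Normalize case and de-duplicate so each identifier is reported once.
    cves = sorted({match.upper() for match in CVE_RE.findall(text)})
    cwes = sorted({match.upper() for match in CWE_RE.findall(text)})
    word_count = len(text.split())
    # Round up, and never report less than one minute.
    minutes = max(1, math.ceil(word_count / WORDS_PER_MINUTE))
    return {"cves": cves, "cwes": cwes, "reading_minutes": minutes}
```

Calling `enrich(full_text)` on the extracted body of a resource would give a small record that could be stored next to the document, which is roughly the kind of metadata the preview pane in the talk displays.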

>> Yep. Your categories, excuse me, they're very offsec focused. Um any ide

they might be there but maybe the categories are more offensive security focused. So we were thinking of changing the categorization to be more similar to how Black Hat define their presentation tracks, with like GRC or human factors of security and stuff like that. So I feel like maybe the resources might be there and you might be able to find stuff that's useful, but maybe the way it's presented and categorized isn't quite right for you at the moment. I think that's something we definitely need to improve for sure. With the categorization, that classifier, when we update it, when we change it, we have to run it back through the

history of everything. And the history of all the data goes back like 30-plus years, so it costs us a bit of money. So we're being a bit reluctant to change it too crazily, but I think it does make sense for us to update the categories for people of different backgrounds. So yeah, definitely I think it's a good idea to do that. >> Yep. >> Um, you sort of answered the first part of my question in that previous answer, which is what's the retention of data like? >> Yeah. So we want everything. We initially seeded it with data from Reddit. So um

Google BigQuery has like Reddit databases and stuff. So we used that for our seed, which goes back to like 2008, 2009, and then we looked up all the RSS feeds and scraped all that data, and then we did the conference archives. So the data itself goes back to like the '90s. But there are coverage gaps, particularly around other conferences, and there are other archives of cyber security stuff online that we should index. And then we just store it all forever, and we archive it with the Wayback Machine as well, just in case it gets taken offline, because that's always a problem. So yeah, we just want to have like

information and data is really important now and it's getting more and more important. So we're just trying to save everything we can. So it should be infinite retention. >> Oh yeah. >> Second question. You mentioned one of the things as part of your mission, I suppose, was to remove the human bias. >> Yeah. >> In addition, you spoke about how there's a lot of AI-regenerated content out there, which is just, you know, >> the same things again. How did you engineer that out with your rating algorithm? What were the things you did? >> Yeah. Just trying to think. I think the bias is like if I look at domains or

where it's hosted or what company it is and go, now I'm giving preference to that, I want to increase their weight, or some factor like that. So instead we tried to look at the data model we capture, and the more attributes and features we have in our data model, the more we can refine it and tune it. One example of how we've refined that is when you see that a person has done some research, let's say several years ago, and over time many people have referenced that, and then those have been popular, then that initial resource is going to be weighted with a bit more credibility due to its

knock-on effect later. But you also have all these challenges, like companies get bought and then their sales team takes control of their blog and spams a bunch of crap to it. So there are all these realities that happen. We have seen spam coming in from places that used to be really reputable, and so then we have to consider things like the frequency, are there sales pitches in this? All this additional stupid stuff I didn't have to worry about. But I think the cross-references were probably the most powerful. We calculate something called the home. And the home is like um let's

say like github.com/username. That path is the home for where someone's publishing stuff, but then let's say you have a company website that's food.com. It's different. So we calculate where the home is, and then we look at the history and trends for every home, and that goes into the scoring as well. So if a company has a track record of smashing home runs with what they do, then, um, the other thing which I mentioned was the curators. When curators talk about stuff and about homes repeatedly, then that's building up the reputation of that company or that user. So it's been, but yeah, the flooding, that thing about sales teams taking over or

whatever it might be, quite a few times it's been, oh, we have to think of something, and it might be changing. Social media is a really bad example, because something kind of silly or simple might get like a million upvotes whereas something super novel gets like 10, and so that gets a very low weighting in our scoring system. It's more about past reputation and where reputable curators have jumped in to say something. So that's been quite fun to work on and abstract out. But we wanted to avoid, like, we could have it so that when we publish a blog post ours is front and center, but we don't do that for anyone. Um so it's

been a fun problem. I think it's relatively reasonable now. So if you look at Chronicles and you go back weekly and monthly and look at what's top 10, top 20, top 30, you'd probably go, "Oh, it's pretty reasonable. Yes, that's a really good quality article." But if we see something that's wrong, we then have to go in and think about how we might tune it. Any other questions? Cool. Well, thanks very much.
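
The "home" idea and the weighted 1 to 100 score from this answer can be sketched roughly as follows. This is an illustration under stated assumptions, not TalkBack's actual algorithm: the list of multi-user hosts, the signal names, and every weight below are invented for the example. The only things taken from the talk are that a path like github.com/username counts as a home, that curators, cross-references and home history feed the score, and that social media gets a very low weighting.

```python
from urllib.parse import urlparse

# Assumed list of hosts where the first path segment identifies the publisher.
MULTI_USER_HOSTS = {"github.com", "medium.com"}


def home_of(url: str) -> str:
    """Return the 'home' of a URL: host/user on shared platforms, else the host."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in MULTI_USER_HOSTS:
        segments = [part for part in parsed.path.split("/") if part]
        if segments:
            return f"{host}/{segments[0]}"
    return host


# Hypothetical weights that sum to 1.0; social media is deliberately small.
WEIGHTS = {
    "curators": 0.35,
    "cross_references": 0.35,
    "home_reputation": 0.25,
    "social": 0.05,
}


def rank_score(signals: dict) -> int:
    """Combine signals (each expected in the range 0.0 to 1.0) into a 1 to 100 score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(max(signals.get(name, 0.0), 0.0), 1.0)  # clamp to [0, 1]
        total += weight * value
    return max(1, round(total * 100))
```

With weights like these, a post with a million upvotes but no curator mentions or cross-references stays near the bottom, while a resource from a home with a strong track record that curators keep citing rises toward the top, which matches the behaviour described in the answer.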
