
Duck Safari: Hunting CVEs in the Shadows with ShinyLive and DuckDB-WASM

BSides Joburg · 37:42 · 111 views · Published 2025-09
Category: Technical
About this talk
In a world where proprietary pipelines and opaque risk scores drive threat feeds and dashboards, what does it mean to see the vulnerabilities for yourself? This talk exposes how vulnerability data, while open in theory, is often filtered through black-box interfaces. In Duck Safari, we flip the script: using browser-native tools like ShinyLive, DuckDB-WASM, and duckplyr, we create a transparent, interactive CVE explorer that puts raw data and clear logic back in the hands of cybersecurity teams. You'll walk away with a no-installation tool to explore vulnerabilities by vendor, severity, or time, rewriting your team's relationship with the shadowy systems that mediate risk and visibility.

🌘 Rewriting Reality in CVE Surveillance
Vulnerability data shapes security posture, yet the way we access it is often constrained by third-party dashboards and scoring systems: algorithmic black boxes that define what we see and prioritise. In this session, we use ShinyLive and DuckDB-WASM to develop a browser-native CVE explorer that operates independently of algorithmic gatekeepers. Powered by open-source tools and client-side computing, this application empowers you to reclaim control: you can ask your own questions, sort your own risks, and view what the dashboards won't show. This is visibility without vendor bias. Analysis without shadow logic.

🔍 Key Takeaways

🧠 Innovative Integration
Discover how ShinyLive and DuckDB-WASM combine to form a fully offline, install-free tool. duckplyr enables intuitive filtering of CVE data with readable logic; no SQL expertise required.

🔓 Breaking the Black Box
This talk embodies the conference theme by creating a transparent alternative to black-box dashboards. You control the filters, the scoring, the narrative.

🛠 Customisation Without Complexity
Tailor the tool to your environment by searching by vendor, CVSS score, date range, or keyword. Extend the app with tags, bookmarks, or internal playbooks, all in the browser.
🧰 What You'll See
* Where to get NVD CVE data
* How to query and clean it using DuckDB-WASM and duckplyr
* How to build a browser-native CVE explorer with ShinyLive
* Live filtering and visualisation by risk, vendor, or time
* Deployment on GitHub Pages or any internal static site
* Optional extensions: analyst notes, tagging, offline use

🎯 Who Should Attend
This talk is crafted for cybersecurity analysts and defenders who:
* Want more control over how vulnerability data is explored
* Are tired of platform lock-in and risk score obfuscation
* Need a lightweight, offline-friendly tool for CVE review
* Want to bring clarity and transparency to their threat analysis workflows

No R background is needed; just curiosity and a healthy scepticism of shadow systems.

🎁 You'll Walk Away With
* A working browser-native CVE Explorer app
* All source code on GitHub (forkable and extensible)
* A quick-start guide for deploying and customising it
* A new lens on how vulnerability data can be controlled, explored, and reclaimed

About Luis de Sousa:
Luis de Sousa is a seasoned data and analytics consultant based in Johannesburg, South Africa, with a passion for enabling organisations through data-driven technologies. With over 17 years of consulting experience, he specialises in designing and implementing customised data solutions, both on-premises and in the cloud. Throughout his career, Luis has delivered impactful results across a diverse range of industries, including Financial Services, Insurance, Manufacturing, Education, and Media. His areas of expertise span data warehousing, data integration, business intelligence, generative AI, and advanced analytics. In addition to his professional work, Luis is an active leader in the local data community. He organises the Johannesburg R User Group and PyData Johannesburg, creating platforms for knowledge-sharing and collaboration among data professionals.
Through these initiatives, he continues to contribute to the growth and vibrancy of the data science ecosystem in South Africa.

About BSides Joburg:
Website: https://www.bsidesjoburg.co.za
Twitter: https://www.x.com/bsidesjoburg
Instagram: https://www.instagram.com/bsidesjoburg
Mastodon: https://infosec.exchange/@bsidesjoburg
LinkedIn: https://www.linkedin.com/company/bsides-joburg
Transcript [en]

Good afternoon, everyone. Thanks for attending my talk. I'm standing in between you and lunch, so I am aware of that, and I will keep it punctual, but I am asking for collaborators; I'm asking for curious people. So hi, my name is Luis de Sousa, and I am speaking about Duck Safari. Essentially this was born out of play. I wanted to play with two technologies, so I made an excuse to use them. The subtitle is Hunting CVEs in the Shadows with ShinyLive and DuckDB-WASM. And I guess I was inspired by the theme of the conference. So I said, how do I

keep it relevant? How do we speak about the underlying algorithms? Today I'm representing two different communities, so this is my shameless plug. First, I'm representing the Johannesburg R User Group. R is essentially a statistical modelling language made up of an ecosystem of packages; there are more than 10,000 at the moment. We meet once a month, every second Tuesday, at the Microsoft offices, and the meetups are posted on meetup.com. The second community I'm representing is PyData. Essentially this is for the Pythonistas: people who use Python and data. We meet at CityRock, the rock climbing gym. Yes, that one. And we

do some rock climbing, actually, every second Monday at half past six in the evening. So if you want to do some data and climb, we're always happy to have you. Okay. So why am I here? I had a telephone conversation yesterday and someone asked, "Why are you doing this? Is this your job? Is this what you do?" That's where this slide comes from. My boss didn't tell me to do it. I'm not here for attention and clout and all those other things. I think this is about play, and I think it is actually about curiosity, and

fundamentally, the real thing is we need to be curious in life. I think all of us need to be; not manufactured curiosity, but genuinely asking, hey, what's going on here? Because the actual zero day is losing interest or curiosity. Okay, so onto the topic: what have we been told to see? In a lot of cases we have proprietary dashboards, or we're working within large corporates and we're very much removed from the underlying data sources. I'm largely a data guy, so we like playing with lots of data and understanding all these different complexities. And a lot of times we would look at

dashboards, we run KPIs and threat metrics, and we don't actually know the underlying logic. We just see red, amber, green, certain numbers, and we start chasing things. And if all these data sources that we're consuming are open, then why does everything feel a little bit hidden? When you ask people to start defining acronyms, all of a sudden everyone says, I don't actually know what that acronym means. Yeah, good question, I don't actually know what it is. And you notice it when you start entering different fields. So for me this is very much the iceberg analogy,

where, at the top, most dashboards you see have been pre-tuned. There's lots of logic and thinking that's gone behind them, but you're not necessarily aware of it, especially when you're looking at scoring algorithms, and certain data gets lost. So it turns out with CVEs, I found out that there are actually certain ones that can be rejected; it states in the documentation that they can remove entries. So think of this as traditional dumpster diving: you're just looking and asking, hey, what's going on here? Okay. Are you seeing risk, or are you just seeing what you're allowed to see?

So, as I mentioned, this picture is really great, because things just fall off the edge; I guess there's only so much time and only so much complexity you can understand. And a lot of times, if you're able to understand the raw data sources, and you have the ability to manipulate them, that's where the real power comes in. Now, in a lot of cases it takes a lot of time to process the data and actually understand it. But once you've got that understanding, you can pick up some really interesting trends and actually understand what's happening.

So what would we see if we could choose the filters on the data that we have? Okay. So CVEs are publicly published on a nice GitHub repository, and there are a few thousand JSON files. Every night there's 400 megs of JSON files; probably five, six, seven thousand, depending on how far back you go. And essentially, we're able to use these JSON feeds to see all the current CVEs that have been created, and some that have been deleted. And this is the raw data we can now start playing with. So

now, what's interesting about the CVE data source is that it contains JSON information. When we speak about data in general, think of it like this: most relational databases are like Excel spreadsheets; you have a column, you have a row, and you're able to extract data out of them. Now, what's interesting about JSON is we can start nesting data inside it. And that's what makes it really interesting, because you actually need to understand the structure of the JSON that you're getting in order to start extracting interesting information out of it. And that's largely determined by the schema. So on the CVE

GitHub repository there's also a schema that says this JSON should always look like this, or follow this structure. So it's highly structured, but obviously it can contain nested information, which is interesting. So in my case, just for this example, I've chosen the withdrawal field. I was like, okay, so things can be withdrawn; let's see what happens. So let me show you a raw CVE record.


The one thing I didn't open here... okay, so in my case I downloaded all the CVEs this morning, and they are currently...

[Laughter] Exactly. Okay. There are more than 100,000 CVEs that my poor machine has had to extract. And okay, so a CVE is JSON that looks like this.
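The record shown on screen isn't reproduced in the transcript, but a trimmed, hypothetical record in the shape of the public CVE JSON 5.x schema, flattened in plain Python, sketches the idea he walks through next. The field names follow that schema; every value here is invented for illustration.

```python
import json

# A trimmed, hypothetical CVE record in the shape of the CVE JSON 5.x
# schema (real records carry many more fields under "containers.cna").
record = json.loads("""
{
  "dataType": "CVE_RECORD",
  "dataVersion": "5.1",
  "cveMetadata": {
    "cveId": "CVE-2025-00000",
    "assignerShortName": "ExampleCNA",
    "state": "PUBLISHED",
    "datePublished": "2025-06-01T00:00:00Z"
  },
  "containers": {
    "cna": {
      "descriptions": [
        {"lang": "en", "value": "SQL injection in an example PHP component."}
      ]
    }
  }
}
""")

# "Flattening" means lifting the nested fields we care about into one row.
def flatten(rec):
    meta = rec["cveMetadata"]
    desc = rec["containers"]["cna"].get("descriptions", [])
    return {
        "cve_id": meta["cveId"],
        "assigner": meta.get("assignerShortName"),
        "state": meta["state"],
        "published": meta.get("datePublished"),
        "description": desc[0]["value"] if desc else None,
    }

row = flatten(record)
print(row["cve_id"], row["state"])  # CVE-2025-00000 PUBLISHED
```

Flattening like this is exactly what turns the nested feed into the relational, dashboard-friendly table discussed later in the talk.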

And essentially it's nested data: each of the vulnerabilities, when it was reported, who reported it. So, for example, in this case it was a SQL injection, and it shows who the vendor was. It's PHP; we still use PHP, anyway, it's really popular. And obviously a bunch of metrics: these are the CVSS metrics that have been created, that they determine. And so in our case, what we're going to do is essentially flatten this data. That's essentially the essence of what this talk is about. Okay, so CVEs are open data, so we can all go download them,

but they are messy. When you realise that nested structures turn out to be quite large tables, it gets quite complicated getting to the essence of it. So I'm going to explain a bit of a pattern for how, from a data perspective, we start processing these things. And generally dashboards are very much relational; you're not going to get your fancy NoSQL database, because it turns out dashboards are just us summarising information and surfacing it, which is more of a relational thing. And with that clarity, we don't need gatekeepers, we don't need vendors; we can make our

own narrative around what's happening. Okay, so let's build our own tool. I've chosen three things that I wanted to play with. So, DuckDB: DuckDB is essentially an in-process, highly performant analytics database. Think of it as a pivot table engine; that's probably the easiest way for me to explain it. Where SQLite, for example, is a very lightweight embedded SQL database, DuckDB is the analytics equivalent of that. So what's interesting about that is, if you want to average things, if you want to do more complex things like windowing functions, like

understanding different periods, it's highly performant, because it's optimised; it's built from the ground up. It started out of the Netherlands, and these are database researchers. So this isn't a big-data Spark engine that needs a thousand clusters or whatever; this is essentially a small, highly performant in-process database. Now, what's interesting about this is that we're able to handle billions of records, billions with a big B, inside a little in-memory process. I don't need to spin up a cluster and stuff like that; I can actually summarise billions of records inside this small little engine, which makes DuckDB very powerful.
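The "pivot table engine" summarisation he's describing can be sketched without DuckDB at all. The toy below does, in plain Python, the kind of GROUP BY aggregation DuckDB would run over billions of rows; the SQL in the comment and the table and column names are invented for illustration.

```python
# The kind of query DuckDB chews through (hypothetical table name):
#   SELECT assigner, COUNT(*) AS n FROM raw_cve GROUP BY assigner;
#
# The same aggregation in plain Python, over a handful of toy rows.
from collections import Counter

rows = [
    {"cve_id": "CVE-2025-00001", "assigner": "mitre"},
    {"cve_id": "CVE-2025-00002", "assigner": "mitre"},
    {"cve_id": "CVE-2025-00003", "assigner": "github"},
]

# Count CVEs per assigner, the way a GROUP BY would.
per_assigner = Counter(r["assigner"] for r in rows)
print(per_assigner["mitre"])  # 2
```

DuckDB's value is that it runs this exact shape of query, plus window functions, at columnar speed inside one process; the logic is no more exotic than the loop above.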

So, ShinyLive is essentially an in-browser framework that allows us to create frontends or dashboards fairly easily. The difference is that we don't need a server to host it, like your React application or your PHP application; this is all running in the browser. Now, that just means it has a fairly large JavaScript payload, but essentially you're running all of this inside your browser, where you're able to actually play with the data. And the last one was duckplyr. Good old duckplyr. Within the R ecosystem there are a bunch of packages, and this one was meant to make my life easier, but unfortunately it did not. So I did

not use it. ShinyLive, I must mention, is for Python and for R, and essentially it's just a way of creating user interfaces and then creating an interactive website. So you can have a static website, and ShinyLive is the interactive part: you can interact with R or Python code, or C++ if you're that way inclined. Okay. So: no install, 100% client side. In our case, from a super high level, we have the CVE JSON that we can download from the official repository. We're essentially running DuckDB inside

WebAssembly, so inside your browser. Now, that comes with a caveat: you're only allowed 4 gigs of memory, which is more than enough if you're just running a little dashboard. And the typical modern-day browser is very hungry anyway; I'm sure whatever browser you're using is using a lot of memory regardless, so I don't think it's a deal breaker. And in a lot of cases this works on most modern-day browsers. You can run it on any provider; there are no limitations or

anything. Just you and your data, man. So in our case, the guys who created DuckDB are, dare I say it, database connoisseurs. They've essentially taken every single possible gripe that you would have with extracting data, understanding it, and trying to clean it, and they've created functions to help you. So in our case, with one line I'm able to import an entire year's worth of JSON files for the CVEs: it's SELECT * FROM read_json_auto, so it automatically determines the schema, and that's it. I was like, damn, that's pretty cool. I'll show you the code. So I think what's really nice about this is it has

more advanced features like schema evolution. So if the schema ever changes, it will automatically start adding columns and flattening data. Although I have to admit, JSON parsing is a very hungry process, so you would be running this either server side or on a large machine. My poor little machine can manage, but you would obviously need a bit more horsepower sometimes if you do more complex JSON parsing. Let me quickly show you how that looks. I'm using a good old trusty database client. So, DuckDB is an in-memory database; however, you can

also persist it to disk, meaning I now have a little DuckDB instance in memory, and I can save that actual database to disk. Now, it follows PostgreSQL-style SQL. So if any of you have ever used PostgreSQL, it's the same sort of syntax; a dialect we know and understand. And now we have all these analytic functions which help us. I'm just able to create a table, in our case raw_cve, and I'm saying: please read all the JSON files from this directory. So I'm saying

give me everything. I downloaded all the CVEs for 2025 this morning, I've just chosen a specific directory, and there are about 100 JSON files in here. I don't want to be too enthusiastic, but essentially I'm also saying include the file name, and I can show you how it looks; a simpler version.
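The one-liner he's describing is DuckDB's read_json_auto. The pure-Python sketch below shows roughly what that call does conceptually: walk a directory, parse every JSON file, and tag each record with its source file. The SQL in the comment paraphrases the talk (the path is hypothetical), and unlike DuckDB this sketch does no schema inference.

```python
# The DuckDB one-liner from the talk (path hypothetical):
#   CREATE TABLE raw_cve AS
#   SELECT * FROM read_json_auto('cves/2025/*.json', filename = true);
#
# A pure-Python sketch of roughly what that does.
import json
import tempfile
from pathlib import Path

def load_cve_dir(directory):
    rows = []
    for path in sorted(Path(directory).glob("*.json")):
        rec = json.loads(path.read_text())
        rec["filename"] = str(path)  # like DuckDB's filename = true
        rows.append(rec)
    return rows

# Tiny demo with two throwaway files.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.json").write_text('{"cveMetadata": {"cveId": "CVE-2025-00001"}}')
(tmp / "b.json").write_text('{"cveMetadata": {"cveId": "CVE-2025-00002"}}')

rows = load_cve_dir(tmp)
print(len(rows))  # 2
```

DuckDB does all of this, plus type and schema inference across thousands of files, in one SQL statement; that is the gripe-removal he's praising.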

So in less than two seconds I've imported all those JSON files, all 106 of them, and I've flattened them. And I have a nifty option where I included the file name, so I can tell you exactly where each record came from; I just said filename = true. So this essentially went into the directory, parsed all hundred-odd files, imported all of them, and then flattened them into a database table. Now, if we want to get a little more interesting, or a bit spicier, there

is a cveMetadata field within the JSON, and you can start pulling out specific fields. So within that hierarchy, that JSON I was showing you, there's a hierarchy of information; you can embed information in information, and you can now start pulling data out of it. In my case, I quickly pulled out who the assigner was, the publish date, and the state. Is it published? Or, in our case, there's this interesting thing: rejected. Okay, that's interesting. So within that specific folder there were 11 CVEs that were rejected. Someone reported it, and then someone rejected it. So what's interesting is

they then just mark it as rejected, leave the number, and clear the data: okay, well, it was rejected. And in our case, just to add a little extra thing, and I'll explain why, I want to export this table that I've created. I now have a database table, and if anyone is familiar with normal databases, I can just select star from it: give me all the columns and rows from the specific table. Now, this is living inside DuckDB, and it turns out DuckDB can also read

Parquet files, so I've now taken this table and I'm exporting it to Parquet in one line, saying: take this table, the raw CVE table, and create a Parquet file. Okay, so why am I doing this? It turns out that Parquet is highly optimised for the HTTP protocol, and if you have large datasets you can start partitioning them. I haven't done all the code for it, but I'll gladly tell you why I've done it. That's the actual trick: you can now put this in an S3 bucket or an Azure bucket, and all of a sudden you're able to query it, and

you can filter it based on a certain directory, which is essentially a partition. Now, all of a sudden, you're able to partition and get highly performant queries, because it turns out that DuckDB can query the file sitting on a public cloud, and it can fast-forward and rewind and understand the structure of the data and only fetch the data it actually needs. Pretty handy. Okay, so when you understand the actual underlying data, you can start filtering it and creating your own dashboard. And that's where I've gone with this project. So when you understand the data, you can

start creating your own dashboard. Okay. So in this case I've used a ShinyLive application. Can everyone see? Cool. It is 180 lines, and think of this as a way of articulating a user interface and a graph that we're now actually playing with. But before we go there, let me quickly show you the Parquet file. The Parquet file that I just exported looks like this. I'm using VS Code with a Parquet add-in. And this Parquet file looks pretty much like that DuckDB database, but it turns out

it's in an open format that most data systems can start ingesting. Now, in my case, within ShinyLive I'm using DuckDB to import the Parquet files. Why am I doing that? Because if I want to summarise billions of rows, or even hundreds, in memory, I'm able to use DuckDB for that. The reason I use Parquet is so that we can partition it, which makes it highly scalable.
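The partition trick he keeps alluding to is usually hive-style layout: DuckDB's COPY can write Parquet partitioned into key=value directories, and a reader can skip whole directories that a filter rules out. The sketch below builds such paths and prunes them in plain Python; the SQL comment, directory names, and column choices are all assumptions for illustration.

```python
# DuckDB can write hive-style partitions in one line (names hypothetical):
#   COPY raw_cve TO 'cve_parquet' (FORMAT PARQUET, PARTITION_BY (year, month));
#
# That produces key=value directories. A sketch of the layout and of the
# pruning that makes remote queries fast:
def partition_path(root, year, month):
    # e.g. cve_parquet/year=2025/month=02/part-0.parquet
    return f"{root}/year={year}/month={month:02d}/part-0.parquet"

paths = [partition_path("cve_parquet", 2025, m) for m in (1, 2, 3)]

# A query filtered to month = 2 only ever has to touch one partition;
# the other directories are never downloaded at all.
wanted = [p for p in paths if "/month=02/" in p]
print(wanted)
```

Combined with Parquet's column layout and HTTP range requests, this pruning is why DuckDB can query a bucket-hosted dataset and pull back only the bytes it needs.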

So in my case, let me pull up the code.

I'm obviously showing it to you within VS Code, but I will give you the source code to my presentation and to this, and you will be able to view it in your browser as is.


Okay, so essentially what I've done is create a super basic dashboard, just saying: hey, these are the number of CVEs; we can play with the date, we can change the CVE assigner, we now have multiple tabs, and we can start playing with the underlying data. Now, this is driven by just a small little file that I've created; think of it as a cleansed source. If I want to make any sort of change, it's literally one line of code. Within your GitHub repository you can literally just rename something quickly, and the interface will then change automatically, because it's

happening all in the browser: as your browser downloads the latest version, it will always be the latest iteration. So, I originally started using a different data source for this talk. I actually used Google's vulnerability data source, which was essentially an amalgamation of all the open-source security alerts, and I only realised beforehand that those are open source only. And what's really great is that you can take all these different data sources: Google has their view of the world, CVE has their view of the world; you can find lots of data, and it's actually quite

extendable. In my next slide I'll show you, but essentially you can expand this: what does it look like for SBOMs? Within your software supply chain, what does this actually mean? So I'm showing you, in a basic way, that you can start importing all these data sources and creating your own dashboards and narratives out of them. You don't need a high-powered server, you don't need any of the vendors; you can iterate by yourself quite quickly. In most cases, vendors or people have a specific point of view, and I think a lot of this is about being very

deliberate about creating your own point of view on the datasets that you're getting. Even with really popular datasets, you don't realise that things are actually falling off that you could bring to the front. So essentially: build your own lens on what's going on. And I think that's a lot of what the theme of this conference is about. If you're not curious, you don't ask: okay, but why are we doing it this way, why is this a standard, or why can you withdraw a CVE and have it just disappear? It turns out the data is there. You can go

fetch it; it might be interesting, it might not be. But are you curious, or are you just going to say, well, the dashboard said seven, so that's the number, guys. Seven. Who chose seven? It might be 42, but we don't know. Okay, so who would use stuff like this? Red teams, for a bit of recon and prioritisation; policy audits and control mappings; dashboards without external dependencies. There are lots of different ways, and I've added to it as well: you can do Exploit-DB integration, pulling that data into your own dashboard, and you can do SBOM-aware

filtering, so all of a sudden, within your software supply chain, you can start filtering all that information. You have this large database, and it turns out, if you ever want to productionise this, it's essentially a normal database, so it very quickly lends itself to something you can play with and then evolve into something a lot larger. But for iterating, playing, and understanding what the actual problem is that we're dealing with, I think this is a fun approach, and you'll be well served using it. I will publish the CVE Explorer, essentially the source

code of the dashboard, on my GitHub repository in the next few minutes, and you can play around with it. If anyone feels that it can be better, I'm always happy to collaborate. And in a lot of cases this is just within your browser; just have some fun, guys. I think that's what this is about. Fork, extend, and share. So I've maybe learned the hard way that visibility isn't always something given, or sometimes appreciated. It's very difficult, and sometimes, with visibility, people ask questions, get too anxious about it, want to run away from it. But

if you're willing to hold it, I think it's something very powerful. And as was said earlier, even within the social media sphere, or the news, or the things that we consume: you can define your lens on things, and you can make sure that the lens you're choosing is based on the things that you value, not made by some weird third party. Walk your own path through the data, and probably through life as well, guys. Yeah, so I'm always happy to collaborate, and the mic will be passed around. Does

anyone have any questions? There were lots of other fun side paths that I did not have enough time to mention, but this was lots of fun to play with. Cool. Do I have any questions? >> Cool. Can you get the mic? Yes. >> Thanks. >> Check one, two. Sorry. >> Thanks. I suppose regarding the sanctity of that GitHub database of CVEs: once a vendor declares something as rejected, it kind of just fades off into the background, but it is still there to go looking for. Are there any legal avenues that a vendor could take to scrub it for good and basically

kill the data entirely, or does it rely on enough people downloading it often enough that someone can always come back and say, "Hey, actually, it is still here"? >> So I think, as I said earlier, this is kind of like dumpster diving, right? I've seen one interesting research paper where the researcher looked at some redacted CVEs and tried to understand the trends behind them. I think in a lot of cases data appears and then disappears for whatever reason. And from a legal point of view, they would obviously say it was a false alarm;

it didn't necessarily affect us. What's interesting is there's actually a DuckDB CVE. It turns out a vendor, Grafana, uses DuckDB inside their technology; they were like, this is a great tool, let's use it. The problem is they used the CLI version of DuckDB, and the problem with the CLI, guys, is it has access to your machine. It was actually a hackathon; they quickly hacked something together, and the next minute it was: okay, you just made this giant gaping hole in

your application, because it was a quick thing that you did. So, to answer your question: it depends, guys. It really does, because in a lot of cases some vendors will swear that certain things don't happen, right? It's perception and stuff; it's not popular. Anyone else? So yeah, top tip: don't use the CLI, guys. There's a Python library, there are lots of other libraries; don't use the CLI, please, please, please. Thoughts? Anyone disagree? No. Stunned silence. Yes. Okay, we've got one other question here. Let's go. Can we get the mic, please?

Yeah. >> So, with the CVEs, you showed that you collected the rejected ones as well. I might have missed it, but was there any data in those CVEs, or were they all scrubbed? >> So what happens is, within the nightly CVE extract, they scrub the CVE when it's been rejected. But if you're downloading the data, and if I had more processing power: there's a delta that's created every day with the change to every single file. So if you're consuming the delta, you'll be able to pick up the actual text that it was before. >> So if you are

watching the feed, or actually looking through the feed as it's happening, you'll see what the actual text was. But if you take a point-in-time snapshot of today, it's been scrubbed, so you actually need to go back to that delta. >> Okay, that's cool. Any thoughts to maybe include that historic data in those rejected CVEs? >> So that's essentially what I'm saying, in a roundabout way: with DuckDB you probably have the best chance of doing that, because it's so performant and it's got all these functions that let you really quickly start importing all these things. You could probably go back, with a little bit of time,

and probably in under a week you could get all the data, clean it, and have all that historic data. So it's a bit of fun; there are a few sticky points that you need to get over. >> Okay, cool. I just wanted to know what we're doing with the rejected data that seems blank. >> Yeah. So at the moment they just scrub it. Inside the schema it says it's determined by CVE: they've determined it's rejected, but someone at some time thought it was a problem and they logged the CVE. They went through the process of creating it.

I don't know. >> Because it's a globally unique number. It's almost like a database ID, right? Because it's a unique number, they have to keep incrementing the numbers, and so they have to keep the number. Cool. Otherwise, find me, and if you want to come play with it, give me a shout. Cool. Thanks a lot. Okay.