
Good Models Gone Bad: Visualizing Data Poisoning with Networks

BSides Seattle · Published 2025-06
Speakers

Maria Khodak, Penetration Tester

About this talk

Maria Khodak is a penetration tester focusing on web application, API, and network testing. In her spare time, she researches machine learning vulnerabilities and participates in CTFs. Most recently, she won 3rd place at the HackRedCon CTF. She holds a GWAPT and a BS in Computer Science from RPI.
Transcript

My name is Maria, and today we will look at the intersection of network science and cybersecurity. In this talk I wanted to ask: how can we visualize the way training data gets poisoned using networks? Machine learning models have become increasingly integral to our software applications, and they also face increased risks from adversarial attacks, notably data poisoning. After experimenting with data poisoning, I noticed that the training data fed into machine learning models resembled graphical information that could be defined with a visually represented network. My presentation will go further into that concept of data representation and how attacks can modify, or poison, networks. So, here's a quick overview of what I

want to talk about today. I wanted to give you some background on data poisoning and network science before we go further into how we can use network science for threat detection, specifically for data poisoning. Then we'll look at two case studies that go further into how networks can be poisoned. After that, I'll go into some future work and refinement. As defined by OWASP, data poisoning is an attack that occurs when an attacker manipulates the training data to cause the machine learning model to behave in an undesirable way. Data poisoning occurs by altering data through three methods: addition, modification, or the deletion

of data. Interestingly, addition can also mean the injection of data, but injection attacks, as we know, do not always mean adding data; injection attacks can often delete or modify data as well. To demonstrate the addition of data, I wanted to share a story of how a privacy-focused colleague once explained to me how he associated his name with alpacas by changing his pictures across social networks to pictures of alpacas. My colleague started to tag images of alpacas with his name or username rather than his own face or likeness. So, for example, he would tag himself in a LinkedIn post with a picture of an alpaca instead of a

human, any human. By associating his name or username with alpacas rather than his face, my colleague caused an unexpected outcome in the data, essentially manipulating his likeness. This story, along with some experimentation, led me to ask: how can I visualize the change between untouched data and manipulated data? That question led me to another: how can we visualize data poisoning, and how do we do that using network science? Before applying network science to data poisoning, we should first define some key terms of network science and its significance as a tool. Various systems can be defined by networks, such as the

World Wide Web, the nervous system, and even the flavor components of food. This is a cool network: all of these ingredients map to various flavor compounds, and you can see how that is statistically represented in the network. You can see that garlic is used in a lot of different foods, and its compatibility with other foods is represented. That one on the top right is starch, which I feel like would be compatible with a lot of foods, so I don't know why it's up there. These examples show how behind every complex system is a network that defines the interactions between its components, and

mathematically, graphs are representations of such networks. Networks are powerful tools for prediction, but they're only as accurate and as stable as their data source. Wait, sorry, I think I skipped a

slide. Okay. I just wanted to mention that nodes are the representations of the objects, the vertices, and edges are what connect them. Edges are also known as links in networks.
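As a rough illustration of these terms (this is not from the talk, just a toy sketch in plain Python), a small graph with nodes, parallel edges, and a path might look like this:

```python
from collections import Counter, deque

# Nodes are the objects; edges (links) are the connections between them.
nodes = {"A", "B", "C", "D"}
edge_list = [("A", "B"), ("A", "B"), ("B", "C"), ("C", "D")]  # two parallel A-B edges

# Edge multiplicity: the count of parallel edges between the same pair of
# nodes, drawn as a thicker line in a network visualization.
multiplicity = Counter(frozenset(e) for e in edge_list)
print(multiplicity[frozenset(("A", "B"))])  # → 2

# A path is a sequence of nodes connected by edges; BFS finds one.
def find_path(start, goal):
    adjacency = {n: set() for n in nodes}
    for u, v in edge_list:
        adjacency[u].add(v)
        adjacency[v].add(u)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adjacency[path[-1]] - seen:
            seen.add(nxt)
            queue.append(path + [nxt])

print(find_path("A", "D"))  # → ['A', 'B', 'C', 'D']
```

The same structure (a node table plus an edge table with multiplicities) is what network tools typically consume.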

So networks are, as I said, powerful tools for prediction, but they're only as accurate and as stable as their data source. Paths are sequences of nodes that are connected to each other with edges; think multiple edges in a sequence. Analysis of networks leads to interesting statistics such as edge multiplicity, that is, multiple edges that all connect the same pair of nodes. Edge multiplicity represents the intensity of certain connections within the network, which can be visually represented with thicker lines stretching between the nodes. Conversely, sparser lines mean fewer edges between the nodes. So, for example, we could represent the flight paths

on a map as a network, and if you think of the nodes as airports, more popular nodes such as New York and London would have thicker lines between each other than smaller airports, say, somewhere in Alabama; there would be fewer lines going there, so the edges would be thinner. I chose networks as a visualization technique not only because they have predictive power, but because graphs are simply easier to intuitively digest and look at. Here's a demonstration of the predictive power of networks: paths in networks can reveal information about the network itself. US military operations gathered intelligence on Saddam Hussein's

clandestine network using paths in networks. As Saddam Hussein relied on extended family and social institutions for protection, the US military was able to compromise the network by mapping out his closest associates and progressing that way into the center of the network. In this case, the nodes would be the individuals in the network, the edges represent the connections, and the thickness of the lines represents edge multiplicity. Another example of edge multiplicity is this picture of Chinese companies with foreign listings. In this picture, Chinese companies are plotted on the map as nodes, the edges represent the foreign listings and connections back to China, and the thickness of the edges represents the number of

connections between these various cities. This is another example of the power that networks have in prediction, and it is very similar to the airport example in terms of analyzing the environment in these networks. We can use this graph to infer the impact of the various cities on each other; for instance, New York to Beijing has the thickest line, implying that more connections are shared between these large hubs. To recap, networks help us understand complex systems such as the brain, neural networks, and global financial networks. It's also important to reiterate that the power to create predictions using networks is only as good as the accuracy and

stability of the underlying data. If the data is inaccurate, it leads to false predictions and other cascading results, and could lead to further consequences down the line. All of this vital information should be protected, as tampered networks and data can have cascading results. Training data can be obtained through various attack vectors. For example, private data sets can be infiltrated through insider threats or sophisticated attacks, and public data sets or repositories are prone to data injection attacks. With public data sets, poisoned data can proliferate into other models and applications that rely on them. So, in order to make poisoned data easier to detect, I decided to

visually compare healthy networks and poisoned networks using a tool called Gephi. But before I get into that platform, I wanted to ask: does anyone remember who this is? For those who don't know, this is Tay, a Microsoft chatbot that was released on Twitter in 2016. Tay was designed to pick up our language and syntax via interactions with real people on Twitter. Long story short, within a few hours of launching, it started repeating vulgar and hateful language. The important thing to note here is the intention of the chatbot, which was originally just to be a conversational agent; but over the course of this massive data poisoning attack, it became

adversarial. It started repeating hateful language, which was the opposite of its original intent, as malicious language was injected into the model. This is an example of a large-scale data poisoning vulnerability, perhaps the original data poisoning vulnerability in some sense. It is another example of what led me to think: how can we simulate this transformation of good and bad data via networks, to show how all of that data relates to itself and to its malicious parts? So, I looked into some tools that could help me map out malicious networks. Back to Gephi. Gephi is a network visualization and analysis tool used as a desktop application. To use it, one can just

upload a spreadsheet with nodes and a corresponding sheet with edges. It's also possible to build out your own network from scratch and customize its appearance. It's also open source, so a lot of people add various algorithms and tools to it to show data in different ways. Since you can also run multiple tabs in a single workspace, Gephi provides an easy way to compare healthy and tampered data in the same workspace. With this tool, I wanted to go through two quick case studies to show the difference between healthy and tampered data. First, to explore data poisoning within networks, I borrowed the Les Misérables network created by Donald Knuth. The novel was written by Victor

Hugo, by the way, not Donald Knuth, who is a famous computer scientist. In this network, nodes are characters in the story and the links represent co-occurrence between those characters. There are 77 nodes, and in the default network there are 254 links. To poison this network, I wrote a network poisoning script that adds or deletes edges between nodes, in this case characters. This would either create additional interactions in the story that shouldn't exist, or delete several significant interactions that would theoretically change the outcome of the story. So, here's the initial graph. You can see that all of these characters are connected to each other in some way because they've

all had some kind of interaction, even just one, possibly with Javert or someone like that. I'll let you look at that for a second. The thickness of the lines represents the characters' interactions, and the size of each node represents how many interactions that character has had. With my script, I poisoned a lot more interactions into the novel, some that shouldn't even exist; I basically made a fanfiction of Victor Hugo's book. You can see that there are interactions that shouldn't happen. Then, here, I deleted a bunch of important interactions in the data; some characters are left without any interaction or dialogue at all. I just wanted to use this case study as

something that can tangibly happen when you poison a network. Going into my next case study, I wanted to explore node irregularities in a more technical network: Java dependencies. This network contains 1,538 nodes, all of which are Java dependencies. By adding nodes, we can add packages that don't exist. Adding nodes is really simple in this case; however, adding connections between the additional nodes and the existing nodes is better, because from an attacker's point of view, connecting nodes with more edges makes detection more difficult. My network poisoning script would first add a single node to a node with multiple connections, then add further nodes that were all attached to the originally

injected node, in order to create a dependency cluster. The script also modified existing nodes with a simple labeling poison, by modifying the label of the nodes within Gephi to null. As a result, an attacker could use this modification technique as an opportunity to include dependencies with malicious packages, for example. Initially, I made a really, really small modification. Remember, this network has 1,538 nodes, so how could you even see the difference? That's like three nodes; you're not really supposed to see that. This is a closer look. I used Gephi to highlight these nodes with a contrasting color in order to show the attacker's presence. However, these

poisoned nodes were added rather than modified. So what if we modified the edges between the nodes by adding connections that were not supposed to be there? Here, I colored this entire already-existing cluster of nodes green to demonstrate modification. The attacker poisoned these nodes by altering their data; for Java dependency groups, that could mean assigning their values to null or inserting false lines of code. In this case, I labeled these modified nodes with null to demonstrate the poison more clearly. Through my case studies, I wanted to demonstrate the differences between healthy data and poisoned data. By visualizing mathematical differences via graphs, we can prove that data has

been altered in some way. In addition, we can detect data poisoning by labeling the data that we use. As part of future work, given data provenance materials, I would like to implement visualizing the change between healthy and poisoned networks, similar to a diff function. So, behind every complex system is a network, and behind every network are the nodes and edges that define it mathematically. In order to have the accurate and powerful networks that we rely on in our day-to-day lives, we must protect that data from data poisoning. And by visualizing data poisoning, we can more easily detect tampered data and highlight an attacker's presence within a graphical network. Thank you.
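The edge add/delete poisoning described in the case studies, plus the diff-style comparison mentioned as future work, could be sketched roughly like this. This is an illustration, not the speaker's actual script: it assumes a toy edge list of node pairs standing in for the Les Misérables co-occurrence data.

```python
import random

# Healthy network as a set of undirected edges (pairs of character nodes).
healthy = {frozenset(e) for e in [("Valjean", "Javert"),
                                  ("Valjean", "Cosette"),
                                  ("Cosette", "Marius")]}
nodes = {n for e in healthy for n in e}

def poison(edges, n_add=1, n_delete=1, seed=0):
    """Add edges (interactions) that never existed and delete real ones."""
    rng = random.Random(seed)
    poisoned = set(edges)
    # Every possible pair of distinct nodes; candidates are those not present.
    possible = {frozenset((u, v)) for u in nodes for v in nodes if u != v}
    poisoned |= set(rng.sample(sorted(possible - edges, key=sorted), n_add))
    poisoned -= set(rng.sample(sorted(edges, key=sorted), n_delete))
    return poisoned

poisoned = poison(healthy)

# Diff-style comparison between healthy and poisoned networks:
injected = poisoned - healthy   # edges an attacker added
removed = healthy - poisoned    # edges an attacker deleted
print("injected:", [tuple(sorted(e)) for e in injected])
print("removed:", [tuple(sorted(e)) for e in removed])
```

The two set differences at the end are the diff: in a tool like Gephi, those edge sets could be highlighted in a contrasting color, the same way the modified dependency cluster was highlighted in the second case study.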

[Applause] Hi. Sorry, it's not necessarily related to that; I was just curious about the diagram you showed. Yeah, that

came from a research study. I should cite my sources more often.

You mentioned accuracy as a requirement, and stability. What do you mean by stability? So, I feel that if the data is poisoned, it's unstable; that's what I mean. If it's modified in any way, it's no longer consistent. Consistent data, yes. And if the data is changed, it should be intentionally changed rather than unintentionally; that's what I want to show. If you intended to make those changes, then that's fine. But if the data has been poisoned, then one should be made aware of any poisoned data.