
Yes, my name is Prean. I'm at Anin, and also at the Simula research lab here in Oslo, and I'm going to be talking about ways you can break AI systems using something called adversarial attacks, plus some ideas for solutions you might want to implement, which is what I'm researching in my PhD.

First, a bit of housekeeping. I'll start with a brief introduction to how neural networks learn, how they're trained. I'll be talking about deep neural nets, which, for anyone who doesn't know, is the main AI architecture you find in ChatGPT and in the algorithms on TikTok. After that brief summary I'll go through what adversarial attacks are, how they work, and why they're a problem (hopefully I'll convince you of that). Then, in order to introduce a solution, we first need to talk a bit about how neural nets represent information; you could say how neural nets "think", if you want to anthropomorphize a bit. I'll then introduce what we call the causal neural network model, which is my area of research, and end with some thoughts on the state of the art.

All right, first things first.
For the purposes of this talk, I'd like you to think of deep neural nets as very powerful pattern recognition machines. They're very good at finding statistical patterns in data, and these can be very complicated patterns: for example, the patterns relating the pixel values in an image file to the label you'd like to apply to that image, if you're building an image classifier. I'll be talking about images in this talk because they're nice and visual, but everything I say applies equally to AIs that deal with text, speech, or, you know, stock prices.

So we're trying to identify patterns between some input X and some output Y, and we have a very powerful machine for doing it. But because this machine is so powerful, it can spot patterns that are false, patterns that don't generalize. I have an illustrative example here. Imagine training an image classifier on the cows in the center there. If you train your AI to say "these are pictures of cows", it will likely pick up on the fact that all of these images have some green, grassy stuff in the background, and conclude that this is probably related to what it means to be a cow. You train this, you get good accuracy, you deploy it, and then in deployment you encounter the image on the right. It's very likely that your AI will fail, because that pattern no longer holds; it was a false pattern.
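To make that concrete, here is a minimal sketch of the same failure mode on synthetic data; everything in it is hypothetical. One feature stands in for actual "cow-ness" and one for a grassy background, and the background is only correlated with the label in the training set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)                   # label: cow or not

cow_signal = y + rng.normal(0, 2.0, n)      # the real feature, but noisy
grass_train = y + rng.normal(0, 0.1, n)     # background: spuriously tied to the label
grass_test = rng.normal(0.5, 0.1, n)        # at deployment the correlation is gone

X_train = np.column_stack([cow_signal, grass_train])
X_test = np.column_stack([cow_signal, grass_test])

clf = LogisticRegression().fit(X_train, y)
print("train accuracy:", clf.score(X_train, y))  # near-perfect: it leans on grass
print("test accuracy:", clf.score(X_test, y))    # collapses once the grass is gone
```

The classifier looks great in training because the grass feature is almost a perfect proxy for the label, and it falls apart the moment that proxy stops holding.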
All right, that's all well and good, and it's a cute example, but exploiting these false patterns also lets you do something more malicious: what we call adversarial attacks.

An adversarial attack on images works like this. You start with the image there on the top right, which is of a panda, and you have an AI system that correctly classifies it as a panda. Then you add some carefully crafted noise to it, just a tiny bit, to produce the image on the bottom right. From the distance you're sitting, you probably can't even tell that anything has happened; it certainly looks very similar, just a tiny bit of noise. But the AI system you trained now fails completely to classify this as a panda; in this example it thinks it's a cat, with very high confidence. We think the reason this happens is that, much like the grass-and-cow example, there are more complicated patterns in your data as well, patterns that aren't as obvious and explainable. They can have to do with fine structures in your pixels, all very complicated stuff, and it means you can add noise in just the right way to trip up the AI.
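The slide doesn't name a specific algorithm, but one well-known way to craft this kind of noise is the fast gradient sign method (FGSM): nudge every pixel a tiny step in whichever direction increases the model's loss. A minimal PyTorch sketch, assuming `model` is any differentiable image classifier:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """One-step FGSM: add a tiny, carefully crafted perturbation to x."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # how wrong the model is on the true label
    loss.backward()                       # gradient of the loss w.r.t. the pixels
    x_adv = x + eps * x.grad.sign()       # step each pixel against the model
    return x_adv.clamp(0, 1).detach()     # keep pixel values valid
```

With `eps` around 0.03 on images scaled to [0, 1], the change is essentially invisible to a human, yet it is often enough to flip the prediction of an undefended model.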
So that's what an adversarial attack is. It's obviously not that big a deal if you're just classifying pandas, but if this were your self-driving car, or a system that's supposed to diagnose cancer tumors, then it's a real problem. It's also very interesting to note that although AIs are very susceptible to this type of attack, humans are basically immune: hopefully no one in this room looks at the bottom picture and thinks it's a cat.
All humans are naturally able to defend against these attacks, and the hypothesis is that the reason we're so good at this is that we naturally extract what we call the causal information in, in this case, an image. That means we extract the important information, the information that actually affects the classification we're trying to make, and we're not fooled by random noise in the background the way these statistical pattern machines are.

There is a theoretical solution to this vulnerability: if you're trying to recognize pandas, you just collect every possible image of a panda. You go out with your camera and take pictures with all types of backgrounds, lighting conditions, camera angles, lenses, and so on. That's obviously impractical, so we'd like a solution we can implement without having to leave the office.

To introduce how we might go about solving this, we first need to dig a little deeper into how neural nets "think", how they represent the information you give them. I've drawn a very simplified schematic here of a neural network, in this case one processing images; it could be our panda classifier.
We see that a neural net consists of a bunch of layers, and they process the information you give them sequentially, starting with the first layer and moving up to the second and so forth. The first layer here looks literally at the RGB channel values of every pixel. That's a lot of information: a 200 by 200 pixel image with three color channels is 120,000 values. But it's very low-level information; it's raw pixel data.

Then a bunch of processing happens, and the information is fed to the next layer, which is smaller. This next layer has a smaller capacity for storing information, so it needs to disregard the low-level details and extract something more high-level. It might learn to identify basic lines and corners in your image, doing some very basic processing. Then you move down the chain to deeper and deeper layers, and at every stage the information is compressed into a more abstract, high-level representation. At the very end you might have quite a small layer that encodes information about, say, faces and fur and snouts for your animal classifier, and the final prediction is made from that compressed information.

So you can basically think of this as a lossy compression algorithm aimed at extracting a particular type of information. It reduces all the possible combinations of pixels down to a much smaller set of high-level codes, compressed representations of the same information.
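As a rough sketch of that compression pipeline (the layer sizes here are made up for illustration), a small convolutional net in PyTorch could look like this, with each stage discarding detail and keeping something more abstract:

```python
import torch.nn as nn

# Each stage halves the spatial resolution, forcing the network to drop raw
# pixel detail and keep progressively more abstract information.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1),   # 200x200 RGB -> 100x100: lines, corners
    nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),  # -> 50x50: simple shapes and textures
    nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1),  # -> 25x25: parts (fur, snouts, faces)
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                    # -> one compact code per image
    nn.Flatten(),
    nn.Linear(64, 2),                           # prediction from the compressed code
)
```

The 120,000 raw input values end up squeezed into a 64-dimensional code before the prediction is made, which is exactly the lossy compression picture described above.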
We now have all the pieces we need to think about what we call the causal neural network architecture. Instead of a single processing pipeline like the one in the previous example, the causal neural net takes in the image and produces two independent information streams. One is labeled C here, for content: the content of the image, such as the subject and its shape. The other is labeled S, for all the other information: the lighting, the camera angle, call it the style information. So this is an architecture that aims to separate out all the important information into the C stream, and leave all the unimportant stuff, all the style information, all the "grass" information, in the S stream.

If you're able to do this correctly, or sufficiently accurately, you can do something quite clever. There's a little box here that says "perturb signal". What that means is that you take your style signal and jiggle it around: you add some noise to it, flip a few bits here and there, just to corrupt it a bit and make some variations on the original style signal. You might produce, say, ten different variations on the actual style signal that was in the image. Then, one by one, you combine these with your content stream and have your neural net make a prediction at the end. And, crucially, you tell your neural net: regardless of what I do to the style signal, you should always predict the same label.

This lets you approximate that gathering of all possible images in a much more manageable format, and under certain quite lax mathematical conditions that I'm not going to go into, it converges to a system that is able to do what humans do: extract the causal, important information.
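The talk describes this mechanism only qualitatively, so here is one hypothetical way the idea could be sketched in PyTorch. All names and sizes are made up, and the actual separation mechanism (the green box) is the hard part; it's stubbed out here as two linear heads:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalNet(nn.Module):
    """Split features into a content stream C and a style stream S."""
    def __init__(self, backbone, feat_dim=128, c_dim=64, s_dim=64, n_classes=2):
        super().__init__()
        self.backbone = backbone                  # any feature extractor
        self.to_c = nn.Linear(feat_dim, c_dim)    # content: subject, shape
        self.to_s = nn.Linear(feat_dim, s_dim)    # style: lighting, angle, "grass"
        self.head = nn.Linear(c_dim + s_dim, n_classes)

    def forward(self, x, perturb_style=False):
        h = self.backbone(x)
        c, s = self.to_c(h), self.to_s(h)
        if perturb_style:                         # jiggle the style signal a bit
            s = s + 0.1 * torch.randn_like(s)
        return self.head(torch.cat([c, s], dim=-1))

def causal_loss(model, x, y, n_variants=10):
    """The prediction must not change, no matter what we do to the style stream."""
    loss = F.cross_entropy(model(x), y)
    for _ in range(n_variants):                   # ten variations on the style signal
        loss = loss + F.cross_entropy(model(x, perturb_style=True), y)
    return loss / (n_variants + 1)
```

Each perturbed style variant acts like an extra training image, which is how one photo becomes ten or a hundred without anyone leaving the office.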
This causal neural net architecture is quite a new thing; I think the first papers I saw discussing it are from about five years ago. They've gained a lot of popularity in the years since, because, as we've talked about, they're not fooled as easily by adversarial attacks. They're not fooled by adding noise to images, and if you're working with text, they're not as easily fooled by swapping words and sentences. But we still have a long way to go, even though they have all these desirable properties.

So that's what I'm doing in my PhD; I have three years left, and I'm trying to make some progress on these things. There's a lot of interesting work that goes into that green box, the separation mechanism that tries to separate these information streams; it's highly non-trivial to design. There's also the question of how you know that the C and S signal streams actually contain the information you expect them to contain, and, since we've only talked in very qualitative terms here, how you make all of this mathematically rigorous. So there are a bunch of open questions, but these architectures have shown a great deal of promise. They're good at resisting adversarial attacks, and they're good at generalizing, at tasks like training on one dataset and testing on another. I'm quite a fan of them, as you might have guessed; otherwise I wouldn't have spent four years of my life researching them. Hopefully that has piqued your interest in how these can be used to make more secure AI systems. Thank you.
[Applause]

All right, this was a short talk and we've finished early, so does anybody have any questions? All right, let's go.

AUDIENCE: Hi. I remember reading a tweet from some big name in neural networks some time ago; I think it was Yann LeCun, but I'm not sure.

SPEAKER: He tweets a lot, huh?

AUDIENCE: He tweets a lot, yeah, so probably. Basically, what he said is that any attempt to make your neural network smarter by trying to explain to it how to think is bound to fail; eventually the bigger neural network with more data will win, so it's a losing strategy. What do you think about that? And a second question: when you showed that first adversarial attack with the crafted noise, my first thought was that maybe it's about how we humans access this data. For us it's a little bit blurry; we don't get access to the individual values of those pixels. So why not just make the image a little blurrier with some random noise? In other words, just put your perturbation step as the first step, and that's it. Why doesn't that work? Thanks.

SPEAKER: Yeah, okay. To answer the first question first, which is: can't you just solve this with more data?
The answer is yes, eventually: with something approaching infinite images, you're guaranteed to get a robust, secure neural net. But I think the real question is how quickly you can get there, because it's very possible that it would require more images than humanity will ever produce in its lifetime, and then it's effectively unattainable. The causal neural architecture enforces some guidelines on how the information should be processed, so the data gets used more efficiently: by perturbing the style signal you can basically turn one image into ten or a hundred, so it lets you stretch the data you have. I'll also note that Yann LeCun has published a whole manifesto about how you should make AI think, so, you know, who is he to talk?

To answer your second question, about why we don't just put the perturbation step first: that's a good idea, and it's an established strategy, which we call adversarial training. It basically means you apply these perturbations to your training set and then train on that. The issue is that it makes you good at defending against one specific type of attack, the attack you trained against. But there are many different attack algorithms, you need to patch against each one, and if someone uses one you haven't trained on, you're a bit screwed. So it's a practical solution for patching problems, but it doesn't solve the underlying issue.
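As a sketch of what that strategy looks like (reusing the hypothetical `fgsm_attack` from earlier; any attack algorithm could be substituted), adversarial training simply perturbs each batch before the usual update:

```python
import torch.nn.functional as F

def adversarial_train_step(model, optimizer, x, y, eps=0.03):
    # Harden the model against one specific attack by training on its output.
    x_adv = fgsm_attack(model, x, y, eps)   # the perturbation step, applied first
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

A model trained this way tends to resist FGSM-style noise but can still fall to a different attack algorithm, which is exactly the patching problem described above.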