Expanding Security Horizons: SIMD-Based Threats

Name: Expanding Security Horizons: SIMD-Based Threats
Uploaded: 2024-04-15
Duration: 22 min 47 s

BSides Sofia, 22:47, 74 views, published 2024-04

Speakers

Andrii Mytroshyn

Tags

Category: Technical

Style: Talk

About this talk

By Andrii Mytroshyn. As cybersecurity continues to evolve, it is imperative to anticipate novel threats that exploit cutting-edge technologies. This talk focuses on a lesser-explored avenue of attack, CPU-exhaustion techniques, showcasing their potential through the lens of NEON/SSE instructions. These SIMD instruction sets, prevalent in ARM and x86 architectures, offer attackers a unique opportunity to manipulate parallel processing capabilities for nefarious purposes. By intricately designing operations that exploit these instructions, adversaries can push CPUs to their limits, causing resource exhaustion and severe performance degradation.

Transcript [en]

[Music]

Hello everyone, I am Andrii Mytroshyn, and now I will present quite an exotic topic: a way of attacking the CPU that is based on the GPU. In the more general case, it is a SIMD-based attack. A little bit about myself, a small prehistory: I worked at Samsung Electronics, mostly with GPU stuff. After that I worked for almost five years at Visteon, where I worked on improving GPU performance, and now for almost a year and a half I have worked at Carbon Black, where I also try to push this type of attack. And about 10 years ago I was trying to work in the cyber police; I passed the theoretical and practical parts but didn't work there because of some other

reason. So, first question: who is familiar with what SIMD is? Okay. SIMD stands for single instruction, multiple data. It means that we will process our data in a parallel way. So where is SIMD used? First of all (okay, it's not working), the most familiar is the GPU: any GPU, external or internal, works in a SIMD way. After that, all ARM processors have a special extension that provides vector data calculation, which is called NEON. Another CPU type is AMD and Intel, in general x86: they have SSE instructions, Streaming SIMD Extensions. And what is kind of popular now is this square "RV", that is RISC-V. For now they do

not have a SIMD extension, but they are working on providing it. So the question is, who doesn't know what this is? Almost all of this room. What the hell is SIMD? On this diagram (I don't know why it's not working, okay) this is how our general CPU works. We have one piece of data, in this case 0 1 0, and we make some mathematical operation. This is a good way, because for example in a modern CPU we have 16 cores and we can make 16 independent calculations. So it's not only plus or minus: it is any calculations that are not related to one another. So here, in a normal instruction set, we

have blue, which is one instruction (adding, multiplication, copy, it doesn't matter); yellow is one piece of data, some value; and we have one piece of result. Everyone knows how this works. Now we start working with SIMD. SIMD provides a similar way of working, but we have several pieces of data and we apply one and the same instruction to all of them, as drawn here. So we have one instruction, for example multiplying by four, and several pixels where we need to do this, and we will have several pixels as a result. The best thing is that this is done at once, in parallel.
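As a rough illustration of the "one instruction, several pixels" idea described above, here is a minimal Python model. It is not code from the talk and it does not execute real NEON or SSE instructions; it only counts how many instructions would be issued. The lane count `LANES` and both function names are assumptions made for this sketch.

```python
from typing import List, Tuple

LANES = 4  # assumed vector width, chosen only for illustration


def scalar_multiply(pixels: List[int], factor: int) -> Tuple[List[int], int]:
    """Scalar model: one instruction handles one value."""
    result = []
    instructions = 0
    for value in pixels:
        result.append(value * factor)
        instructions += 1
    return result, instructions


def simd_multiply(pixels: List[int], factor: int,
                  lanes: int = LANES) -> Tuple[List[int], int]:
    """SIMD model: one instruction handles up to `lanes` values at once."""
    result = []
    instructions = 0
    for start in range(0, len(pixels), lanes):
        group = pixels[start:start + lanes]
        # Conceptually a single vector instruction applied to the whole group.
        result.extend(value * factor for value in group)
        instructions += 1
    return result, instructions


if __name__ == "__main__":
    pixels = list(range(16))
    print(scalar_multiply(pixels, 4)[1])  # number of scalar instructions
    print(simd_multiply(pixels, 4)[1])    # number of vector instructions
```

Both functions return the same pixel values; only the instruction count differs, which is the point the speaker makes with the multiply-by-four example.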

So if we compare: one execution on a standard CPU, for example adding two digits, will take, for example, one nanosecond. And here, when we make one addition for four pieces of data, it will also take one nanosecond. So it takes one piece of time no matter how many values we add: 1, 2, 3, 4, 16, it doesn't matter. Yes, it should be divided by 16 in general. So why was it used originally? Originally it was used for image processing. When you have your display, you display some stuff. Here you can see that this left image is a little bit blurred.

The image on the right side is a little bit sharper. So how is it made? We just apply to each pixel here one convolution matrix that processes it in some way. For example, here we have 16 pixels, so in one, I don't know, nanosecond it will calculate the first 16 pixels, then the second 16 pixels, and so on. So where is it used now? This here is a very old way of using it. Now GPU calculation is used, for example, for neural network training, for AI processing, for blockchain: in general, anything that can be parallelized and needs to process a lot of data.
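The per-pixel convolution matrix mentioned above can be sketched in plain Python. The talk does not say which matrix was used on the slide, so the 3x3 sharpening kernel below is an assumption (a commonly used one), as are the function name and the choice to repeat edge pixels at the border.

```python
from typing import List

# Assumed sharpening kernel; its weights sum to 1, so flat areas stay unchanged.
SHARPEN = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]


def apply_kernel(image: List[List[int]],
                 kernel: List[List[int]] = SHARPEN) -> List[List[int]]:
    """Apply a 3x3 kernel to every pixel of a grayscale image (values 0..255)."""
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    # Clamp coordinates so border pixels reuse their nearest neighbour.
                    sy = min(max(y + ky - 1, 0), height - 1)
                    sx = min(max(x + kx - 1, 0), width - 1)
                    acc += image[sy][sx] * kernel[ky][kx]
            # Keep the result inside the valid 8-bit range.
            output[y][x] = min(max(acc, 0), 255)
    return output


if __name__ == "__main__":
    image = [[100, 100, 100], [100, 120, 100], [100, 100, 100]]
    for row in apply_kernel(image):
        print(row)
```

Every output pixel depends only on the input image, never on another output pixel, which is why hardware can compute 16 of them in one step as described in the talk.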

So how does it look under the hood? This is an example of what ARM CPUs of different generations look like. In general we have, okay, this blue part: you see 1, 2, 3, 4, which means we have, for example, four cores, and each core has an ARMv7 32-bit CPU for calculation, a NEON data engine and floating point, and some local cache. Below that is a memory bus for communication with external memory and external resources. The newer generation, Cortex-A720, has almost the same architecture with a few extensions: now we work not only with

32-bit but also with 64-bit, we have bigger memory, we have additional caching, additional instructions, and so on. But again we have the same bus, like an asynchronous bridge with some extensions. For x86 the situation is the same: we have this image and you can see that it is almost the same, and these layers mean that we have 16 or 32 cores. Okay, here it looks kind of like a space shuttle or space station. So we have our CPU, we have our bridge, we have memory, and some external GPU if we have one. The main point of attack is this part. When you are processing some data, your CPU cannot contain a lot of data;

it has a limited amount of cache, for example 32 megabits of level-three cache, or for level one it would be, I don't know, one megabyte. But you need to process bigger images or bigger pieces of data, like 100 MB, so each time you will copy a piece of the data to your processing unit. That is the situation with the CPU, but with the GPU it is the same. Modern GPUs are kind of computers: they have their own memory, own bus, driver, own power supply system, and so on. So when we write GPU here, it means that inside the GPU we have a lot of this kind of similar architecture.
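To make the copying argument above concrete, the sketch below counts how many transfers are needed when the data is larger than the local memory of the processing unit. The sizes are the speaker's rough figures (about 1 MB of local cache, about 100 MB of data); the function names and the byte-summing workload are hypothetical stand-ins, not anything shown in the talk.

```python
from typing import Tuple

MB = 1024 * 1024


def transfers_needed(data_bytes: int, local_bytes: int) -> int:
    """Number of copies needed to stream the data through a small local memory."""
    if local_bytes <= 0:
        raise ValueError("local_bytes must be positive")
    # Integer ceiling division: a partial last chunk still costs one transfer.
    return -(-data_bytes // local_bytes)


def process_in_chunks(data: bytes, local_bytes: int) -> Tuple[int, int]:
    """Sum all bytes while only ever holding `local_bytes` of them locally."""
    total = 0
    copies = 0
    for start in range(0, len(data), local_bytes):
        chunk = data[start:start + local_bytes]  # models the copy over the bus
        copies += 1
        total += sum(chunk)
    return total, copies


if __name__ == "__main__":
    print(transfers_needed(100 * MB, 1 * MB))
```

The work per byte is tiny here, so the number of copies dominates; that traffic over the bus is the part of the system the talk identifies as the main point of attack.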

And now the very interesting stuff. For example (okay, I don't have my mobile phone), when you take your mobile phone, if it is of the latest generation, it will probably have 32 or 64 NEON cores that can process in parallel, which means it can process really, really fast. But another point: when you try to calculate some multimedia data, you can feel your phone start heating, because it starts doing a lot of this data copying. So the first time I started working with GPUs, I thought: okay, if it is kind of a separate PC that has its own data, probably we could apply some kind of attack. And the most interesting

attack that I have seen using the GPU is the exhaustion attack. The idea of this attack is that we try to overload some part and expect that some other application will start working in a strange way. These are examples of such attacks that I saw in my experience. The first type of attack happened on an Nvidia GPU. When you do these SIMD calculations on a GPU, you need to use a special language. It is based on C++, but with some extensions that they call vector extensions. The Nvidia GPU supports two such languages, OpenCL and CUDA. These are languages for computation,

not for processing images. Yes, you can transfer the result to OpenGL afterwards, but in general they are made just for calculation. The idea was that we have a very big image, tens of megabytes, and process it in some way. We saw that when we sent this image to the GPU and started calculating on this data, every 20 to 30 seconds the computer crashed and went to the blue screen of death. The reason was the following. The operating system has a lot of drivers (in Linux they are daemons, whoever prefers what), but in general the GPU has a slightly different level of communication. There is a

special API for refreshing your display: the operating system sends a signal to the GPU and says, okay, are you active, could you refresh this screen to display something? Because CUDA occupied all resources on the GPU, the GPU could not respond. The operating system sent one notification, a second, a third, then: okay, this device is frozen, I need to restart it. That was about 12 years ago. Another type of attack that I observed happened at Visteon. At Visteon, in a lot of clusters we used Renesas chips, R-Car H2 and M2 generation. M2 is the low range and H2 is the high range. You could compare it with Intel, like Pentium: kind of the same but with some

limitations. So we just executed a simple shader for, I don't know, changing pixels from blue to pink, very simple stuff. And we saw that in some cases this low-level, low-budget chip started producing artifacts. For example, we say this pixel should be pink, but we see that it turns out yellow. We could not understand the reason, because the same shader works perfectly on another chip. The reason was a poor implementation at the driver level. When they introduced the new driver, they thought: okay, we are

lazy, we do not want to spend a lot of time, so we will make it in some way. In general we needed to use some tricks to avoid these corner cases. Another situation happened with NEON calculation. NEON is when we have a CPU and do not bother the GPU for some calculation: we send some instructions, for example for copying data. By the way, if you use the simple C++ function for copying data, memcpy, it works slower than doing it with NEON instructions. The difference would be about 15 to 20%, so simply copying data with NEON will work about 15% faster. But on the other hand it will cause additional heat, because it will

use another part of the SoC chip and you will need to do a lot of data transfers. Another interesting thing happened with OpenCL. Who knows what OpenCL is? Okay. OpenCL is a kind of standard language for calculating data using the GPU. You have a GPU, you write your application for calculating on the GPU, you transfer some data from your RAM to the GPU RAM (I am talking now about an external GPU), you run this application, and it makes some calculation. It is the same thing with CUDA, and Nvidia is the only vendor using CUDA because it is

their own property. But returning to OpenCL, the situation was the following. We wrote a simple application according to the specification. In the specification you have around 70 or 72 different APIs for compiling a program, transferring data, executing, and such stuff. We saw that this program behaves differently on different GPUs. Even worse, on the same GPU with different driver versions it behaves differently. The reason was the conformance test. This is a collection of tests that says, okay, your device is compatible with our standard. Of all these 72 APIs they check only 65, so 7 APIs are not checked at all.
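The coverage gap the speaker describes is simple arithmetic, sketched below. The counts 72 and 65 are the speaker's figures; the entry-point names `api_00` to `api_71` are placeholders invented for this example and are not real OpenCL function names.

```python
from typing import Iterable, List, Tuple


def coverage_gap(spec_apis: Iterable[str],
                 tested_apis: Iterable[str]) -> Tuple[List[str], float]:
    """Return the untested entry points and the fraction of the spec that is tested."""
    spec = set(spec_apis)
    tested = set(tested_apis) & spec  # ignore tests for anything outside the spec
    untested = sorted(spec - tested)
    return untested, len(tested) / len(spec)


# Placeholder names standing in for the roughly 72 entry points in the specification.
SPEC_APIS = [f"api_{i:02d}" for i in range(72)]
# Assumption for the sketch: the first 65 are the ones the conformance suite checks.
TESTED_APIS = SPEC_APIS[:65]

if __name__ == "__main__":
    untested, ratio = coverage_gap(SPEC_APIS, TESTED_APIS)
    print(len(untested), f"{ratio:.1%}")
```

Under these figures roughly one entry point in ten can differ between drivers without the conformance suite noticing.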

And of the 65 APIs that are checked, some are checked in a very strange way: for example, just call it once with correct data, it didn't produce any error, so it works. So almost all OpenCL drivers for CPUs, GPUs, FPGAs, and others are not compatible with each other, and you cannot guarantee that one application will work on another device. And it was even worse with CUDA, because what Nvidia thought was: okay, we have CUDA and we should support OpenCL, so why reimplement OpenCL in another way if we can just have our own CUDA, and OpenCL will be based on CUDA? And in this case, when we compare

metrics, we will say: okay, our CUDA works faster than OpenCL. And because of different requirements for these languages we saw (it was about 12 years ago, I even wrote a ticket to Nvidia) that the standard way of creating a kind of link to the application was failing on OpenCL. It was failing on OpenCL because CUDA at that time did not support this functionality. So the OpenCL part was working fine, but because under the hood it used CUDA, and CUDA didn't support it, it crashed. That was my observation. And here are examples of very interesting attacks from the last year. So first

of all, the first attack is LeftoverLocals. The idea is: you have your GPU, and when you are training your LLM, or in general any neural network (I also saw some observations on some iPhones when rendering something in your application), if you run another application in parallel and try to read GPU memory, you can get access to memory from the other application. So local memory is shared between different applications, and I think you can imagine what that could cause. This attack is still possible on some iPhone devices. Another type of attack is the pixel-stealing attack. This attack is based on a wrong

implementation of iframe in the HTML web driver (I don't know how to say it better). The idea is that when we use an iframe from HTML (someone calls it a language), they store this data in the GPU, and, similarly to the LLM attack, we can access this data from another application, from another memory. On cve.mitre.org you can also find a lot of GPU-related attacks. The last attack is the Downfall bug. This attack is present on Intel and AMD CPUs: all generations of Intel CPUs between the fifth and the 11th, all of their devices have this

attack. You can just type "Downfall Intel" and you will see it. I will not describe it, because I do not have time for such an explanation. So in general the main point of this presentation is just to show you that even a device that you think is just for rendering and displaying stuff can cause a potential problem. I remember one thing I forgot to mention on the previous slide with OpenCL. About seven or eight years ago I worked at Samsung, and we were writing a special application for displaying YouTube content from secure storage. So you have a Qualcomm extension

(we used Qualcomm chips in Samsung phones at that time), and we had a special extension for TrustZone. Qualcomm says: okay, to access this data you need this special certificate; after that you have a special OpenCL API for accessing this memory and it will work fine. As usual, we had a strict deadline and no time to read hundreds of pages of the manual, so I thought: okay, could I just try to access this data using the standard OpenCL API? After three days of trying I was able to bypass all this Qualcomm TrustZone certificate checking and just access it as plain

memory. So any application could just open Qualcomm TrustZone memory and read it without any security certificate. We reported it, and I suppose that during these years they were able to fix it. If not, I didn't say anything. Okay, in general that's all. Do you have any questions? [Music]

