
Hello, I'm Kev Sheldrake, and my talk is "What is eBPF and why should you care?" This talk was originally given in October 2022 at OWASP Bristol, and they've kindly said we can reuse the recording for BSides Athens, so please enjoy it, and if you have questions, please get in touch. Thank you very much — I'm here to talk about eBPF. This slide is just a summary of where I've been in my career, and my Twitter handles. Kev Security is my work handle, which is where I talk about eBPF and things like that — not very much, very low traffic — and Kev Sheldrake is my personal one, where I talk about hypnosis and politics and other ranty-type things, so choose the handle you want to follow, if you want to follow them at all.

Before I worked for Isovalent, I worked on Sysinternals at Microsoft, which I'm sure everyone is aware of as a brand: they make and maintain about 80 tools, most of them freeware, on Windows. I ported the biggest and most complex of those — Sysmon — from Windows to Linux, using eBPF to replace the Windows driver. If you don't know it, Sysmon has a Windows driver which collects information and sends it to a user-space part of the tool, which runs as a service on Windows and turns that information into events that go into the event log. On Linux I ported the user-space part across — in fact the Windows and Linux versions now share most of that code, and it's open source — and I replaced the Windows driver with eBPF programs to collect the information. That was really exciting, really fun, and shortly after releasing it I ended up talking with Isovalent, things changed, and I got a job with Isovalent, where I build a similar but much more advanced thing. So: I'm going to talk about eBPF.
I'll talk about the good, the tricky and the vulnerable — not quite The Good, the Bad and the Ugly, but it is the good, the tricky and the vulnerable. Mostly the good, quite a lot on the tricky, and a little bit about vulnerabilities towards the end.

If you've used tshark, or Wireshark, or tcpdump, or anything along those lines, then you've used classic BPF, which stands for Berkeley Packet Filter. Essentially, imagine writing the packet-capture part of Linux before this existed. You could build filtering logic and configuration logic so that user space can tell the kernel what kind of packets it wants to capture — you don't want to capture all the packets and send them all to user space, because that would be wasteful of resources; you'd need a really big, fat pipe to get all that data to user space, so you want to do your filtering in the kernel. But if you wrote a filtering engine based on a configuration, you'd be limiting yourself to only the types of things you'd dreamed up in advance that you might want to filter on, and if you then changed your mind and wanted to add new ways of filtering, you'd have to change the kernel — and changing the kernel is a slow process, and the change still wouldn't be available on the kernels that you, or your customers, or the machines you're sat on, are probably running. So instead they came up with the Berkeley Packet Filter, which is essentially a virtual machine that runs inside the kernel. Your filter — in this case `-i any not port 22`, meaning any interface, and every packet that isn't sent or received on port 22 — is compiled into a BPF program and inserted into the kernel at the
point where it sees the packets. For every packet it sees, it runs the program, and at the end the program says yes or no — whether or not user space wants the packet. If user space wants it, the packet is sent up to user space, where it can be displayed or whatever. It's quite powerful; you can do lots of things with that.

eBPF — which started out life as "extended BPF", but no more: it's now just the letters eBPF, with a nice bee logo — takes that idea and says: why don't we attach those programs anywhere in the kernel? Any function, any trace point, any network operation, etc. And let's expand what we can actually do with those programs: rather than being a filter that says yes or no, maybe we can access some internals of the kernel, maybe we can send arbitrary data back to user space — and those programs run in a VM inside the kernel. It's seen as the biggest change to Unix-like systems in 50 years, so let's talk about it. As I say here, it's revolutionary technology, it's massive — which is why I'm talking about it.

The image here shows what happens when your user-space program
wants to get access to some resource — write a file, send something over the network, fork, whatever. Typically that's done using a syscall, a system call. Those system calls exist as functions in libc; libc invokes the syscall architecture, which switches the task into the kernel; the kernel does the work to get you that resource; and then it returns back to user space via libc, which returns the information — the answer, the return code, whatever — to the application. So when you run your programs, every time you access a file or the network or whatever, you're causing stuff to happen in the kernel. Well, we can attach to those points. We can easily attach to syscall entry and exit points — either the start and end of the entire syscall architecture, or individual syscalls — but we can also attach to any other function in the kernel. When execution in the kernel hits the point where we're attached, the kernel process stops and starts running our code instead, and when our code finishes, it returns and the kernel continues. So we can observe all sorts of things. As I said, it runs in a virtual machine, so we're not running
natively like a kernel module — we're running inside a constrained environment; the claim is that it's like JavaScript inside a browser. We can share data structures between BPF programs — you can load multiple BPF programs at once and share data structures between them — and we can also share those with user space. And we have a ring buffer (more than one kind of ring buffer now, but for now: a ring buffer), which we can use to send data from the kernel BPF programs to user space: we stick stuff in and it pops out in user space, which lets us
get access to things. So what might you want to do with it? If you've ever used strace — which literally traces the syscalls: you strace a program, and for every syscall it makes, strace prints out the syscall, the arguments and the return value — well, sometimes the arguments are quite big, like massive buffers, and strace truncates them and only gives you certain information. You could rewrite strace really simply in BPF and make it print out exactly what you wanted. You could target certain things: you could have logic whereby it doesn't print out all the information until a certain event happens — it sees a certain syscall, or a certain piece of data, or a certain period of time has passed, or whatever — and then it prints out more. You could make it richer: you could dump information from the kernel, which strace doesn't do, and get additional state. For example, we're not restricted to kernel functions — we can attach to userland functions as well, so we can snoop on readline in libc, or wherever it lives — libreadline — or snoop on the OpenSSL read and write functions and see the data being passed to or retrieved by them, which with OpenSSL obviously means we're defeating the encryption just by sitting on top of it. These are the sorts of things that in the past I would have used ptrace for — the debugging infrastructure underneath things like GDB — but instead of using ptrace to attach to a process and break on certain instructions, we clamp ourselves on top of it from outside the process. The same kinds of things happen, but we're detectable in different ways: with ptrace, a piece of malware could detect you ptracing it and stop you; that's much harder with BPF.

Looking bigger, into the enterprise: secure container networking. So,
basically, if you use Kubernetes, for example, you'll know there are various ways of networking the pods. It started out with kube-proxy, which is based around iptables; it moved to IPVS; it moved to sidecars, where you had extra containers alongside your containers to proxy the traffic, and things like that. Well, you can replace all of that with Cilium, where we do all of that routing in the kernel, in BPF programs. The benefit is that it's fast — really fast. If you want fast container networking, go find Cilium: it's open source, you can get it on GitHub, and if you want to spend money to get all the extra bits and pieces we make, we also sell an enterprise version.

Also auditing: Tetragon is the thing I work on — again open source, go find it on GitHub, and again there's an enterprise version if you want to spend money and get extra bits. We can do the sorts of things you could do with kaudit or auditd: we can see things happening and report on them. Sysmon for Linux, which as I say I ported over at Sysinternals, is in a similar sort of space, but that
project has stalled a bit since I left. But it's open source — anyone can go and add to Sysmon for Linux, the same way they can add to Tetragon and Cilium. One of the reasons you might want to replace kaudit or auditd is that they are not container aware, and they're probably never going to be container aware, which means the information you get out of them is from the base operating system's perspective. What you probably want are the process IDs and user IDs as seen from within the containers, as opposed to the outside view. You might also want to know which container, which pod, something happened in — which kaudit is never going to be able to give you. So there are lots of good reasons for wanting to move to something that is container aware. Sysmon for Linux isn't container aware yet — it was always planned — but Tetragon is, by design, so Tetragon is a great solution there. The other thing you might want to do is build or detect rootkits, and I'd just recommend you go and look at linuxthor on GitHub — he spoke at our DEF CON group in Gloucester during lockdown, virtually; nice guy, been doing some interesting work — so I definitely recommend going and having a
look. Right — before I get to the technical details, I'll just talk a bit more about speed, because it's one thing knowing how to use BPF and how to program it, which is mostly what I'll talk about; it's another thing to understand how fast it is. If you were using iptables — kube-proxy — for container networking, the latency increases as you increase the number of services in your cluster, whereas with eBPF, with Cilium, it's constant. So if you want to scale massively, Cilium is a much better solution than iptables, for example. With IPVS — another backend kube-proxy can use — the CPU overhead increases with the rate at which packets arrive, whereas we can get more packets through with Cilium, with eBPF, than you can with IPVS, and the CPU overhead, as you can see in the second graph, is absolutely minimal — compared with IPVS, which just grows with the number of packets. So BPF is powerful — really, really powerful — and it lets you do lots of really cool things. Just to put it into perspective, Tetragon can do all the things I've talked
about on the other slide: we can connect almost anywhere, we can observe almost anything, and we can send those things to user space as events — and we can do all of that configured with YAML. The rest of this talk is mostly going to be about programming BPF, which is a challenging thing, but what we've done with Tetragon is build a flexible engine whereby you write some YAML to say where you want to connect, what arguments you're expecting to see at that particular function, how to match or do some logic on those arguments, and then what to do with the information — send it to user space as an event.
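To give a flavour of what that YAML looks like, here is a hypothetical Tetragon TracingPolicy sketch, modelled on the upstream examples — the policy name and values are made up, and field details may differ between Tetragon versions:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: file-open-watch            # hypothetical policy name
spec:
  kprobes:
  - call: "fd_install"             # kernel function to hook
    syscall: false
    args:                          # arguments we expect at this function
    - index: 0
      type: "int"
    - index: 1
      type: "file"
    selectors:                     # matching logic on those arguments
    - matchArgs:
      - index: 1
        operator: "Prefix"
        values:
        - "/etc/"                  # only report files under /etc/
```

Loading a policy like this makes Tetragon generate and attach the BPF programs for you — no C required.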
We can also kill the process, and various other things. So if you want to make use of BPF for observability, Tetragon is really going to help there. Oh, and I forgot: if you want to play with any of this technology, we have labs that are free — you just sign up, get into an already-set-up lab, and start playing to see how it works and what it does.

Right, let's talk about some technical stuff. As I said before, it runs in a virtual machine, and that virtual machine is basically a RISC kind of architecture. There's a load of registers — some have preferred functions, but they're mostly general-purpose — and we have all the normal machine-code instructions you'd expect in a RISC architecture: you can read and write memory, do arithmetic, do conditional jumps and the like. The idea of the instruction set is that it maps easily onto any modern processor — there's practically a one-to-one mapping to x64, for example, and similarly to Arm and probably other processors — which means you can JIT-compile BPF programs really quickly, really easily, on the fly, and you get native instruction speeds out of BPF. It's not like JavaScript in a browser — to keep using that example — where your JavaScript is interpreted and has to be JIT-compiled down to machine code; this is much more of a one-to-one read-across, more of a translation than a compilation, so it gives you native speed in the kernel, as opposed to something that's interpreted.

The memory model is really interesting. You only have 512 bytes of stack, and that seems really small — and it is — but because you don't really have function calls (depending on which kernels you're supporting, obviously, which I'll come to), then
almost all you've got on the stack is local variables. There's no heap either, and there are no globals, so it's a bit odd. But what we do have are maps, and these are various different types of memory you can get access to. Arrays are the obvious one: you tell it how big your object is and how many entries you want the array to have, and then you can access those entries — and these can be shared between BPF programs and with user space. We have native hashes, so if you were storing IP addresses you'd seen, you could hash an IP address and count against it, for example. You can have these per-CPU or system-wide. One of the things you might want is some heap memory: if you want to build an event that's bigger than 512 bytes, you obviously can't just put the struct on the stack, because it's too big. What you can do instead is store that struct as the zeroth entry of a per-CPU array, and when you need heap, just ask for entry zero of that array. If you've got multiple CPUs, and each one happens to be running the same BPF program, doing the same thing at the same time, each one gets its own version of that zeroth entry, so they won't clobber each other — there's no racing against each other — and you can use it as heap. You could make that entry, say, 64K in size, and when you ask for it you get a pointer to a piece of memory 64K in size, which you can then use arbitrarily, as if it were heap memory. That's kind of cool. There are also system-wide versions of these maps.
If you were storing, say, information about IP addresses you'd seen, you'd probably want a system-wide map, so that whichever CPU sees the packet you're going to record, they're all writing into the same central database. There are lots of other types of maps that I won't go into, but one that's important is the ring buffer, which you push data into; that data sits in the buffer until it's read out. You can read it back out in BPF, but usually you're reading it back out in user space, and that's how you get events. Imagine you're attached to some function in the kernel where something happens: you record a load of information about what just happened into some heap space — inside an array, like I described — and then you sling that entry into the ring buffer. Your user-space program is polling the buffer, sees it arrive, and now user space has the information about what just happened in the kernel. That's the model.

So where can you attach? Loads of places. We'll start with trace points. If you don't know it, /sys/kernel/debug/tracing/events has virtual directories for a whole load of classes of events. (Normally I'd be running these slides as a slideshow, which is tricky while I'm on Zoom, so things may end up on top of each other — we'll see how we do.) Some of these classes are quite interesting: net is obviously about networking — there are loads of trace points within the net subsystem; syscalls has trace points for the entry and exit of every syscall; raw_syscalls has entry and exit points for the entire syscall architecture, which sees every syscall; sched is all about process scheduling — processes starting up, task switching, etc.; signal is about sending and receiving signals to processes. So these could be some cool places to interact with. If you go into one of those virtual directories — say syscalls — in there will be, for example, sys_enter_execve, and in there a virtual file called format. If you cat that file, it gives you this information, which tells you what to expect when your program runs if you attach to this trace point — and what you get is a struct containing those entries.
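For example, catting the format file for the execve entry trace point gives something like this (abridged from one kernel; the ID and exact spacing vary):

```
# cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_execve/format
name: sys_enter_execve
ID: 716
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:int __syscall_nr;	offset:8;	size:4;	signed:1;
	field:const char * filename;	offset:16;	size:8;	signed:0;
	field:const char *const * argv;	offset:24;	size:8;	signed:0;
	field:const char *const * envp;	offset:32;	size:8;	signed:0;
```

The common_* fields are the opaque first eight bytes; everything after them is the syscall number and the syscall's parameters.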
The first eight bytes — the first block of fields — are opaque: even though we know what's in there, we can't access it, and if you try to read anything from those first eight bytes you'll get a violation. The second block is all the parameters that syscall was given from user space, plus the syscall number. The syscall number is quite useful, because you might want to attach the same program to multiple syscalls: execve, which launches a new executable, and execveat, which also launches a new executable, are very, very similar but have slightly different parameters, so you could attach the same program to both trace points and then check the syscall number to see how to interpret the information you're given — how to read the parameters. You can see the parameters here: a const char * filename, a const char *const * argv and a const char *const * envp, which are literally what you'd pass if you were calling execve from user space. Bear in mind these are user-space pointers to user-space memory, which has its own issues attached — I'll come to those in a bit. What we do is create a struct based on those parameters, and the one argument our program receives is a pointer to that struct, from which we can just read the members.
Actually, I'll talk about it now. One thing you'll notice if you get into the world of BPF is that user-space memory isn't always available — it isn't always paged in — so you might go to read that memory and just get back nothing, and an error. A common structure for your program — a solution to this — is to have one program attached to the entry point of the syscall and another attached to the exit. On entry, you store the pointers, just as pointers, in a system-wide hash, keyed on the process ID and thread ID, which you can get. Then the syscall runs, and then your exit program runs — in the same process ID and the same thread ID — so it uses that as the key into the hash, retrieves the pointers, and reads the memory they point at. Bearing in mind the kernel has just read that memory itself, there's a really good chance it's now paged in and you'll be able to access it.
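A sketch of that enter/exit pattern in libbpf-style eBPF C — this needs the usual toolchain (clang with a BPF target, a generated vmlinux.h, libbpf headers) to build, so treat it as illustrative; the map and function names are mine:

```c
/* Two-program pattern: the entry program saves the user-space pointer
 * in a hash keyed by pid_tgid; the exit program retrieves it and reads
 * the (now very likely paged-in) memory. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u64);                  /* pid_tgid identifies the task */
    __type(value, __u64);                /* saved user-space pointer     */
} inflight SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int enter_execve(struct trace_event_raw_sys_enter *ctx)
{
    __u64 key = bpf_get_current_pid_tgid();
    __u64 filename = ctx->args[0];       /* user-space pointer, unread */

    bpf_map_update_elem(&inflight, &key, &filename, BPF_ANY);
    return 0;
}

SEC("tracepoint/syscalls/sys_exit_execve")
int exit_execve(struct trace_event_raw_sys_exit *ctx)
{
    __u64 key = bpf_get_current_pid_tgid();
    __u64 *filename = bpf_map_lookup_elem(&inflight, &key);
    char buf[256];

    if (filename) {
        /* the kernel has just read this buffer, so it is very likely
         * paged in now; the read can still fail, so check the result */
        bpf_probe_read_user_str(buf, sizeof(buf), (const void *)*filename);
        bpf_map_delete_elem(&inflight, &key);
    }
    return 0;
}
```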
It's not guaranteed, but it's much more of a guarantee than just hoping the memory is available. I'll talk about that a bit more later on.

Now a different trace point — not a syscall one. This is sched/sched_process_exec, a trace point within the sched subsystem, and it gets hit when a process execs — when it runs a new program. Why would you connect here? Well, both execve and execveat end up at this trace point, so you could write one program and attach it here instead of attaching it to both syscall trace points, and therefore do slightly less work — which matters, because your program is going to run every time these trace points get hit, so you want to be efficient if you can. It's very similar to before: you create a struct in pretty much exactly the same way I described. The only difference is that the filename is no longer a char *; it's now a __data_loc, and its size is four bytes — 32 bits — which is too small for a memory pointer. What it actually is is two 16-bit numbers: the lower one is the offset from the start of the struct to where to find the buffer — the data — and the upper 16 bits are the length of that buffer. So you take the start of the struct, add the offset, and wherever that lands is where the data will be.
One of the problems you have is that BPF is very careful about only letting you access memory you're allowed to access, so if your struct doesn't include anything beyond what's presented in the format, then when you add the offset to the start of the struct to get to your buffer, BPF can't tell that you own that memory. What I've done here is add an unsigned char array of 4K on the end, just assuming a file name will be smaller than 4K, and that means BPF will now let me access not just the members of the struct but also the 4K appended to it. Usually the offset points literally to the start of that buffer — but obviously you do the maths anyway to make sure — and now you can access it. So if you want to access non-syscall trace points, you kind of need this. The other thing to note is that the filename has been copied from user-space memory into kernel-space memory, so it's much more likely to be available, and user space can't change the buffer while you're reading it. I don't know if you saw it, but at DEF CON 29 people presented Phantom Attacks — a talk worth seeing — where, with BPF-based (or even auditd-based) tools that report events as they happen, like syscall events, the point at which the kernel sees a buffer isn't the point at which your program runs and sees the buffer. So one thing gets executed, user space changes the buffer, and a different thing is seen by your tracing program — which is really bad. Whereas here, the filename was already copied into kernel memory, so what the kernel sees is literally what you see. That's another benefit of attaching to a non-syscall trace point, if there is one that's suitable.

When there isn't one that's suitable, you can actually attach a kprobe to pretty much any function in the kernel — as long as it's not inlined, as long as it's a real function, you can pretty much attach to it. What you'll need to do is go to elixir.bootlin.com, which gives you the source code for the Linux kernels. It's a really, really great website: it lets you search for things, browse code, compare versions, all sorts of cool stuff. Find a function that does the thing you're interested in — in this case, fork. I was interested in seeing what happens when a process forks, and there's a function called __sched_fork, and I can attach a kprobe to that. You can see the function itself takes two arguments: clone_flags, and a pointer to a task_struct, which will be the task of the newly forked process. So you can get the current task — the parent — and the task of the newly forked process. What you get as the parameter to your kprobe program, though, is a pointer to a pt_regs structure, which contains all the registers for your architecture. This is x64, but if you're on Arm, for example, they'd be Arm registers. And we know the calling ABI — application binary interface — for Linux, so we know the first argument is passed in register rdi and the second in rsi, then it's rdx, then rcx for the fourth, then r8 and r9.
So if you take your pointer to the pt_regs struct, you can pull out those registers, and they will contain the parameters: in this case rdi will be an unsigned long clone_flags, and rsi will be a pointer to a task_struct. That's really cool. What you might find is that you don't really want to attach to a syscall — because you want the data buffer to have been copied before your program runs, so you can trust the buffer — but you might not be able to find a trace point that's suitable for that purpose. So instead, you can go through the kernel source and find the point at which everything stacks up: the data buffer has already been copied, and the information you want is easily available as parameters at the start of a function. Connect your kprobe program there, pull out the parameters, and do the thing you want to do. There's also a kretprobe, which attaches to the return of a function. That can be useful too, because sometimes a parameter going in — like the task p in this case — probably hasn't been populated yet; it probably gets populated by the function. So actually you want to look at the task_struct on the exit, and a kretprobe fires at the exit, giving you access to that task_struct after the function has ended.

So how do we go about inserting our code? Well, you pretty much write your function in C. You can use Rust these days, apparently — I haven't — and you could even write it in BPF machine code if you really wanted to, but for anything maintainable, mostly you're writing these things in C. You compile it with clang — I believe GCC has a BPF target now as well, but most people use clang.
Excuse me. That compiles to an ELF object containing the eBPF code and all the maps and other sections you need, and then you can insert it into the kernel using a library. libbpf is the traditional C-based library; if you like Go, use cilium/ebpf, which you can get on GitHub for free — and I recommend Go as a good way of doing this. You can even use sysinternalsEBPF, which I wrote at Sysinternals: a C-based library that lives on top of libbpf and simplifies a lot of things, because there's a lot of boilerplate, a lot of things you do over and over again, which I've simplified with that library. And there are other libraries available — Rust has one, for example — so go and find whatever suits you. That gets your programs into the kernel and attached to the trace point or kprobe or wherever, and then whenever that thing happens, your code runs and writes stuff out.

Now, for debugging purposes there's a trace pipe in that same sort of directory — /sys/kernel/debug/tracing/trace_pipe is a virtual file you can cat — and you can write to it from BPF, and whatever you write pops out there. Really handy for debugging, but for production you really need a ring buffer, like I was talking about before, and then you monitor the ring buffer from the user-space code that loaded your program.

Now, you can't run just any function in the kernel from your BPF program. It's not like a kernel module, where you can run pretty much any function — you can run no kernel functions. What you do have is access to the BPF helper functions, which do a lot of the things the kernel functions do; they're abstracted for you and made safe. We can access maps: update elements, look up elements, delete elements.
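Putting the pieces together, here is a minimal kprobe program in the style just described — again, it needs the eBPF toolchain (clang with a BPF target, a generated vmlinux.h, libbpf headers) to build, so it's a sketch rather than drop-in code, and whether __sched_fork is attachable depends on your kernel build:

```c
/* Minimal kprobe sketch: attach to __sched_fork and write a line to
 * the trace pipe for every fork.  bpf_printk() output pops out of
 * /sys/kernel/debug/tracing/trace_pipe — debugging only, not for
 * production, where you would use a ring buffer instead. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("kprobe/__sched_fork")
int BPF_KPROBE(on_sched_fork, unsigned long clone_flags,
               struct task_struct *p)
{
    /* BPF_KPROBE unpacks the pt_regs argument registers for us */
    bpf_printk("fork: clone_flags=%lx", clone_flags);
    return 0;
}
```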
We can access memory using the probe read and write helpers — but only memory we're allowed to access, and only memory that's actually paged in. We can write to the ring buffer with perf event output, or write to the trace pipe with trace_printk. Some more helpers that are useful: you can get the current kernel time in nanoseconds since boot — it doesn't include time spent asleep, as it's the monotonic clock rather than the boot-time clock, though with later kernels you can get the boot-time clock instead if you want. You can get the CPU you're running on. You can get the current PID and TGID — bear in mind that what user space calls a process ID appears in the kernel as a tgid, a task group ID, and what user space calls a thread ID, a tid, is in kernel space a pid, a process ID, because the kernel treats all threads as processes and groups them into thread groups. It's worth bearing in mind that the terminologies are slightly different so you don't get confused — it will only confuse you for the first five minutes, and then you'll realise what's going on. We can also get the current user ID and group ID of the process we've interrupted.
the process that we've interrupted and we can get a pointer to the current task and the current task is the task truck for the process that we that we've interrupted and that contains all manner of stuff like uh appointed to the memory map where you can mine the memory owned by the process um the pointer to its FD table so for the file descriptors that's got open um you can go and look those up and uh and follow those and do things like that um uh credit struct or the credentials of the process Etc so there's all sorts of really nice stuff and you can find out most of that thing by looking at the
elixir.bootlin.com and searching for struct task_struct — have a look at the entries and from there you can work out what you can access. There are lots of other helpers, and you can find those in man bpf-helpers. A top tip that I use every day: if you type 'man' and whatever you want a manual page for into Google, Google's got all the man pages — you don't even have to type it on your command line. But what you'll find is that BPF hasn't been particularly well documented in terms of learning. It moved really quickly over the last
four years, so every time someone wrote something down it went out of date in six months. Which means that you'll find documentation, but a lot of it will be quite difficult to understand because it won't quite map to the reality you're experiencing, and it's very hard to take a tutorial from a year before another tutorial and read across, because the technology has changed — it's moved on. So one of the things you could do is go to the Sysinternals GitHub: inside the Sysmon for Linux repo is a doc directory where I wrote up a whole load of stuff about how I wrote Sysmon, to help people in the community add to it.
That's mostly tracepoint based — I don't talk about kprobes at all — but it would get you off the ground to go and have a look in there. Also in there you'll find an examples directory with three example programs, which are completely self-contained — each one's got a makefile, it will make — and you can modify those in order to make them do different things and get a feel for what's possible and what you might be able to do with it. They're described in that documentation, so if you want to get started, that's I think as good a place
as any in order to quickly get up to speed. Read the code for libbpf — that's where you'll find the functions that BPF offers — and equally cilium/ebpf if you're using the Go library to load your code: have a look at the GitHub repos and see what functions there are to do things, and that will give you another flavour of how to get started. Another thing you might do is read the code of Tetragon, or even read the code of Sysmon for Linux — they're both huge, I might add, but it's worth reading, and you'll see how we've gone about building these kinds of things
to monitor a variety of things. And we do have docs.cilium.io, which will tell you all about Cilium and how that's been built. So if you wanted to do specifically network-based stuff — moving packets around, load balancers, firewalls, that sort of thing — in BPF, then the docs at docs.cilium.io will really help. Right, so that's the sell for how brilliant BPF is; let's talk about how tricky it actually is to use. And this is also the reason for using higher-level tools like bpftrace or Tetragon, where you can make it do stuff just by configuring it. Right: when you write your BPF program and you load it into the kernel using the
library of your choice, the kernel will then verify your program before it will allow it to be attached, and the kernel verifier is renowned for being a difficult thing to satisfy. If it's not satisfied it will just drop your code on the floor — it will not run it — which is really annoying. So one of the things is size: before kernel 5.2 you were limited to 4096 instructions in your program, and these are machine-code instructions, so a single line of C could turn into four or five instructions, perhaps. They lifted that limit to a million instructions in kernel 5.2, which is really helpful. However, the jump instruction, for the vast majority of kernels,
only takes a 16-bit operand, so you can only jump forwards or backwards about 32,000 instructions. Now, if you're writing your program in machine code that might not be a problem, because you can structure your code accordingly so you don't have to jump further than that, or you could add trampolines — jump, and jump again. But if you write your code in C then you don't have control over that, and so what it kind of means is that you're limited to about 32,000 instructions even though you're allowed a million. Another thing is that it checks that your program halts within a million instructions processed: the verifier will simulate
execution of your program and make sure that it stops. It will not allow you to have infinite loops, or to just run forever. You are allowed loops from kernel 5.3 onwards, but it will want to make sure that those loops end, by simulating execution, and if it can't, it will complain about complexity and it will kill your program. Now, you can have function calls — but not if you're also using tail calls, before 5.8. So what's a tail call? Well, given the small program size, you might have something that's more complicated than will fit, so what you can do is chain BPF
programs to BPF programs. They don't return, so it's not like a function call — it literally replaces this program with the next program, and that one with the next program; these are called tail calls, and you can chain tail calls up to a limit of around 32. But if you use tail calls before 5.8 you can't use function calls as well, so you're kind of limited. I've found function calls seem to be pretty much banned — maybe it's just the way clang compiles things — but whenever I try to use a function call I get told that I've got a back-edge, which is verifier slang for 'you've done something horribly wrong': you're
looping, or calling something it can't follow, and it won't let you do that. So generally you don't have function calls; what you end up doing is inlining all your functions, which bloats your code. So yeah, that's a kind of annoying constraint. And the other thing is there's no sleeping before 5.10 — I'll talk about that in a moment. It also won't let you have any direct memory access: as I said before, you have to do all your memory access through bpf_probe_read and bpf_probe_write, and there are various flavours of those depending on which kernel you're on, sometimes divided between user space and kernel memory. But whatever —
even then it will only let you access memory that you're allowed to access, which has certain constraints and controls around it. It won't let you have programs that might use a null pointer, so you have to check that your pointer is not null — every pointer that you obtain. It won't let you use certain helpers depending on the type of program: there are lots of different program types depending on where you're attaching. So a kprobe program attaches to kprobes and you're allowed the set of helpers that are allowed for kprobes, whereas SKB — a socket buffer program, which is to do with networking — has a
different set of helpers which overlaps, and then there are cgroup programs, which attach to some of the cgroup functions, and again their helper set overlaps, but each one doesn't have all of them. So if you try to call a helper that you're not actually allowed access to, again the verifier will complain and will not let you run your code. One of the things the verifier does when it's checking whether you're allowed access to memory is check how tightly bounded your variable is. So if you had an array of six entries, for example, and you had an index i, and you
assign something to index i of array a, and your i is unbounded — as in, it could be any number, even negative — then the verifier will spot that, and it will say you could access outside of the array. So what you could do instead is put an if statement in front of it which says: if i is greater than or equal to zero, and i is less than six, then do the array access. That would work in a lot of cases, but it doesn't always work, for reasons to do with the verifier — sometimes the verifier is not so smart at understanding
quite that kind of logic. What it is quite good at is masking. So if you increase the size of your array to a power of two, like eight, and then you mask your index with one less than the size of the array — which will be all ones on the right-hand side — then it will limit i to between nought and seven. You might only be using nought to five, because you only wanted six entries, perhaps, but the verifier will be happy that your index cannot be outside the range nought to seven, which is perfectly good for an array of size eight. And that construction is also a lot more
efficient: it's one instruction to mask a register, rather than two or three to do a compound if statement. The problem is that the compiler is quite clever as well. The compiler has to be run with optimization turned on, otherwise it generates code that the verifier won't like — but when it's optimizing your code, one of the things it might do is optimize out your constraints, because it might look at it and go, 'well, I know your index is never going to be greater than seven or lower than zero, so I just won't bother doing the masking, because it makes no difference'. So it might be
smarter than the verifier, and then what you end up with is code that looks sensible in C but doesn't actually pass the verifier — and when you look at the machine code using llvm-objdump you can see that your constraint has been removed. Now, the trick to solve that is inline assembler: you add your constraint back in with a bit of inline assembler, and you make your assembler volatile, and that forces it to be at that exact point in the code. Another thing that happens that's also really cool, from an annoyance perspective, is register spilling. You've only got about ten registers, and you're holding all these variables in registers, pretty much, for efficiency.
Now, what might happen is that clang reorganizes your code for optimization purposes: it does the constraint — where it masks off the variable — and then decides to do other things before it uses that index. And while it's doing those other things it might need the register that your index is in, so it chucks it on the stack, does its stuff, and then gets it back off the stack — at which point it's forgotten that you had masked that index. So now it thinks the index is completely unbounded. So another good reason for using the volatile assembler is that wherever
you put that volatile assembler is exactly where it will occur in the program, so you can get the constraint as close to the array access as possible — which is something worth learning how to do. I've not had that bite me recently, but if you look through Tetragon, and you look through Sysmon for Linux, you'll see that it has bitten us numerous times: everywhere you find inline assembler, it's because this sort of thing has occurred. As I say, you need to check every pointer is not null before you use it. Typically, if a pointer is null and you're not expecting it to be null, that's a good reason to just exit out of your BPF
program, because something's gone wrong — and that's a really simple construction that the verifier can definitely understand: if the program hasn't returned, it knows the pointer must be valid. It might not be pointing to valid memory, but it knows it's not null. Loops: before 5.3 you couldn't have loops — you had to unroll all your loops, and #pragma unroll will do that in clang, and clang's pretty good at unrolling loops; it can take quite complicated loop constructions and unroll them. From 5.3 onwards you are allowed loops, but the verifier has to check that they terminate, and so if you have a complex loop construction with a
complex test case, the verifier will probably fail to verify it, because it won't be able to hold enough things in memory, without using the stack, to keep track of the state. So if you're going to use loops, you're pretty much better off sticking to the really straightforward for loop that you see there, and then using break statements to get out of the loop if you do have conditions that should exit the loop early. Or you can still unroll your loops on later kernels if that makes more sense — but once your programs start to get complicated, complex, you'll start running into the
instruction count, so it might make sense to use simple loops. Page faults — this is the sleeping thing from earlier. What normally happens with a program — and this can be a kernel module or a user space program — is that you attempt to access memory that's not paged in. I'm sure everyone's aware of memory being paged in or not: the virtual memory space is bigger than the amount of RAM you've got in your machine, and some pages will get paged out into swap space, for example. If you try to access some memory that's not paged in, because it hasn't been used for a while and has been paged out, then what would typically happen is that
a page fault would occur, which takes the context to a particular place in the kernel; the kernel will then page the memory in — go and find it wherever it is, page it into the slot — and then it will return from the interrupt to the start of the instruction that tried to access the memory. That instruction then runs a second time and tries to access the same memory; at this point the memory is paged in, and you'll be given the information that's in there. Before 5.10, if BPF attempted to access memory that wasn't paged in, it just said no, gave you an error code, and the
buffer you were trying to read into would just be zeroed. That is kind of annoying, and this is what I was talking about before: if you're trying to read user space memory, for example, it's often not paged in when you want it to be. So if you have to use syscalls, it's much better to store the pointer on entry to the syscall and read the memory on exit from the syscall — but it's probably better still to use a tracepoint where the buffer has already been copied into kernel memory, which is much more likely to be paged in, or to attach to a kprobe where
the memory is literally, actively being copied — that's a great place to go. There are a lot of helpers that aren't there: helpers are written by people using BPF who are writing code for the kernel, and they built the helpers they needed, so there are loads of helpers that aren't available — specifically a lot of the string-based ones. There's no realpath either — though I think there might be something like that very recently — where you could take a path and get the kernel to resolve it to the actual path: resolve symbolic links, all the ../../
type stuff. And there's no easy way of mapping from a dentry — a directory entry — or an inode, which is a place in the file system where you store a file, back to the actual path name. You can do it manually by writing your own functions, and if you look in Sysmon you'll find those functions — you might find them in Tetragon as well. You take a directory entry, which contains the actual file name of the file within the directory; then you can get the parent entry from that entry and read its file name, which
will be the name of the directory that contains the file; then you go to the parent of that one and get the next directory name, and you keep going back up until you get to the top of the tree — and there are various different ways of detecting that, depending on which distribution of Linux you're on. As you go, you chain all of those bits of string together with slashes in between them to make a path name. Obviously, when you get to the top of the tree, you're only at the top of that file system, and that file system might be mounted
somewhere — so then you have to go and look in the mount tables and see if you're mounted, and if you are, get the dentry for the mount point, add that to the front of your path, then traverse up those dentries to get to the top of that tree, and repeat, and recurse, until you've got the whole path name. Which is a pain to do — it'd be much nicer if you could just give a dentry to a helper and it gave you back the path, but unfortunately that doesn't exist. So some things are very tricky. The perf ring buffer is interesting:
depending where you read the prototype for the helper function that puts stuff into the ring buffer, the size of the element you're putting in is sometimes referred to as a u64 — which is bigger than anything you're ever going to be using — and in other places it's a u32, which seems a lot more likely. I mean, four gig is big, but you might want to put quite a bit of data in — a command line could be 128K, for example. But whatever you put in there, when it gets through to the perf subsystem, which is where the ring buffer actually
resides, it gets masked down to a u16 — a 16-bit number — so down to 64K. And actually it's not quite 64K, because that includes some headers, and it's very difficult to work out programmatically how big those headers are: it depends on the compile options for your kernel, your distribution, and various other things. If you assume 64 bytes of headers, that would be a reasonable assumption; you probably won't find many more than that. So — and if you don't know about this, as I didn't, you find it through trial and error — what happens is the ring buffer takes your entire chunk of data and puts it into the ring
buffer, but then it masks off the size and advances the pointer by that 16-bit masked size. So if you put in exactly 64K, including the headers, it gets masked to zero and the pointer doesn't move at all — in fact it doesn't even put it in at that point, because it goes 'the size is zero, why am I bothering to put anything in'. But if it was 64K and 8 bytes, it gets masked to eight bytes: the whole 64K and eight bytes goes into the ring buffer, and if you read it out in user space it'll all be there, but the pointer
will only be advanced by eight bytes — which means the next thing you put in the ring buffer writes on top of the stuff you put in there, and corrupts your data. I actually reported this to the BPF mailing list, and had to include code to demonstrate it because people hadn't seen it before — but it turns out it's not a BPF bug, it's a perf bug. And finding the size of the headers, as I say, isn't completely easy. So it still exists, it hasn't been fixed, and you have to be aware of it — as I say, you might want to put 128K in there; you might want a whole command line
in there, for example. So the answer, apparently, is to break things into chunks. In actual fact KP Singh, who works at Google, gave me the advice that if you've got a big event that you want to send to user space, it makes sense to chunk it up into, say, 4K chunks. Because if the ring buffer is full — which can happen if you're sending a lot of events in and user space is slow at getting them out — then your event just gets dropped on the floor and it never makes it to user space. And so if you only deal with full-size
events and an event gets dropped, then in user space you don't even know that that event was dropped — I mean, there's a counter that will tell you how many events have been dropped, but you don't know anything about that event. If you've broken it into 4K chunks and you throw those into the ring buffer, some of those chunks are going to get through to user space even if you fill up the ring buffer; it's a much more efficient use of the ring buffer. And you could number those chunks, reconstruct them in user space, put them all back together, and you could detect
that a chunk is missing. But the thing is, you still know the event happened: you might still know it was an execution event, you might still know the file that was executed, you might know the user ID that executed it — you might just not have half of the command line. Well, that's much more useful than the event never getting there in its entirety. So it actually makes sense to chunk things, and therefore this arbitrary 64K-ish limit isn't really a problem in practical terms, if you take that advice. Now, finding things in the kernel is funny, right — depending on the configuration options for your kernel, and
depending on which kernel, and which distribution, and what patches they've applied to it, the kernel structs are different sizes and shapes and are laid out in different ways. That makes it very hard to read an element from a struct unless you have the kernel headers for the system on which you're running. Now, if you're running on your own system — because you're building tools to help your reverse engineering — then you have all those kernel headers: just use them, it'll be fine. But if you're building production software to go and run on other people's machines, then you don't have a clue. So instead what you typically need to use
is BTF — that's BPF Type Format. It appeared in kernel 4.18, and it was enabled by default in Ubuntu 20.10, which I think was the first distribution to enable it by default. Before that it existed, but you'd have to recompile the kernel to enable it, which is kind of annoying. What you could do instead is use cilium/ebpf, which has another way of accessing BTF data, or you could use BTF Hub from Aqua, which, given the kernel version, will go to the internet and find the BTF for that kernel version — assuming it's a known, published kernel from a
distribution. That BTF information is then used at the point your program is loaded into the kernel, and what it does is remap all of your struct accesses so that they go to the right place in the struct for the kernel you're running on, rather than the place in the struct it would have been at on the machine you developed it on. The other thing you could do is use the approach in SysinternalsEBPF — and I mention that partly because I wrote it, but also because it's kind of cool. What that will do is: if there's a file of offsets
that describes how to access certain members of certain structs, then it will use those offsets in the BPF program. It doesn't reorder the code or anything — it uses them in a function, so it knows how to use that information in BPF to find the right places in the struct to get the information. If that doesn't exist, then it looks in a massive database of published kernels that I ship with SysinternalsEBPF — a similar thing to BTF Hub, except it's only the offsets that I needed, although you could extend it to other offsets if you needed to.
I actually got those offsets from Project Freta at Microsoft, who have offsets for all published kernels — a similar thing to BTF Hub, but for a different project. But if that doesn't exist, or your kernel's not in there, then what it does is use BPF to do memory forensics on the kernel memory in order to work out where things are. So imagine the program that's loading your BPF program: you know everything you need to know about that program — you know what it's called, you know its comm, for example. Well, you know its pid, right — that's the
first thing: you definitely know its pid and its tid. So what you can do is get the task struct, using the BPF helper that gets you a pointer to the current task_struct, and dump 4K of memory from the start of the task struct to user space. Then user space can search through that memory looking for that pid and tid, and when it finds them, you now know the offset of the pid and tid from the start of the task_struct — store that offset, because you might want it later. Say you know the comm — the short version of the program name — so you can
search through it for that. You can recognize pointers, because pointers into kernel memory are very recognizable, and you know roughly the layout — the layout doesn't change that much — so you can move forwards and backwards looking at the pointers, expecting there to be a pointer to, say, a cred struct. Well, you could fork your user space program and drop its creds to a specific uid and gid that are quite random numbers — like 12345 and 54321, for example — and then you could dump memory from each of the pointers that the task struct is pointing at and search through them until you find your uid and gid, at which point you
can be quite confident you've found the cred struct — and therefore the pointer to the cred struct, and therefore the location of the uid and gid within the cred struct — and so on and so on. There are lots of things that you know about your own process that you can use to find your own information in the memory. The way I trigger the BPF program is I just call a syscall and attach to that syscall: I used uname, which gets you the system name, just as a lesser-used syscall really, and attached my BPF program to that. Then when it runs, I can check the process ID
by sharing it through a configuration map, shared between user space and the BPF program. The BPF program can read that configuration map, pull out the process ID, check it against the current process ID — which it gets from a helper — and make sure it's actually my program that's been interrupted; then it can do all these clever things. And using that you can build up a set of offsets to the things that you care about. So yeah, it's not massively scalable, and it does have some bugs on Red Hat, but it works pretty well on Ubuntu — or it did; I don't know if it still does.
But it was another way of finding stuff out when you haven't got BTF available. Of course, what you could do instead is just demand that your customers run a BTF-enabled kernel, and that solves a lot of these problems. But yeah — finding the size and shape of structures is quite tricky. Licensing is really interesting. I'll start by saying I'm not a lawyer, so do go and get your own legal advice. One of the things that's really interesting is that a lot of the BPF helpers are GPL2-licensed — others aren't, but most of them are, and most of the good ones, the ones that you want
to use, are. So ultimately, because you'll be calling these GPL2 BPF helpers, it will make your code — the code that runs inside the kernel, the BPF code — GPL2. And the way it does that, and I think this is unique, is that your BPF program has to, by virtue of the verifier checking it, state in the ELF binary what license your code is licensed under. You can dual-license things, but if your license doesn't include GPL2 and your code tries to access a GPL2 helper, the verifier will prevent your code from running. So normally in the world we write what license something is under in a
license file, or in the readme, or in some documentation somewhere; in BPF you have to state in your BPF code what license it's licensed under, otherwise it won't be allowed to run — and if you don't state that it's at least GPL2, you don't get access to the helpers that you need. So your BPF code is almost certainly going to be GPL2. Now, libbpf, and SysinternalsEBPF which was built on it, are both LGPL 2.1, which is allowed to load a GPL2 program into the kernel, and because they're libraries you can link any user space program to them dynamically, under any license you like.
cilium/ebpf is MIT — but your BPF code will be GPL2, which means that if you sell it or provide it to somebody as binary code, they can ask for the source of your GPL2 code, your BPF code — which is kind of an interesting place to be. I spent a lot of time talking to Microsoft lawyers in order to understand this, but yeah: get your own legal advice. And then finally — we're almost at the end — I'm going to very quickly talk about vulnerabilities to do with BPF, and I'm not going to go into any more detail than what's on this slide. Basically, if you go to any CVE
database and put in the keyword BPF, you'll find a load of CVEs. Most of them will be in libbpf and you don't really have to worry about them, to be perfectly honest — yes, some things weren't bounded particularly well, but they're not very interesting bugs. But there are some interesting bugs in the verifier, and I urge you to go and search for them and read the write-ups, because they are some of the most interesting vulnerabilities and exploits I've ever come across in my career. I haven't discovered these — I'm literally reading the work of geniuses here — but imagine this: the verifier
verifies your code by simulating execution and following the code paths. If you have an if statement based on a variable — if this variable is one, go this way; if this variable is zero, go this other way — and the verifier works out that the variable can only ever be zero, it will only check the path that would be followed if it's zero, because what would be the point in checking the other path if it's inaccessible code? But there was a way of getting a 1 into a variable when the verifier thinks it's a zero, and that means it verifies the zero path — which is all very nice, friendly code that doesn't do anything
dangerous — while all of your dirty, nasty, illegal memory accesses and illegal BPF helper accesses are in the one path. And in practice, that's the path that gets taken, because the variable actually holds a one even though the verifier thinks it holds a zero. You can also use that one to continually increment yourself into places where the verifier thinks you're not incrementing: if it thinks you're adding a zero over and over again, it maintains that the pointer is still where it was, but if you're actually adding a one over and over again, obviously your pointer is actually increasing.
And you can do the same thing by left-shifting that one bit through the variable to make it a much bigger number and then adding it to a pointer — again, the verifier still thinks you're adding zero, but you're actually adding a very big number to your pointer, and therefore it will allow the memory access when it should probably prevent it. So as I said, it's a really interesting bug; a few different people have done work in that area and written blog posts about it — go and have a read, it's quite interesting. It doesn't make BPF itself vulnerable as such: the verifier was written mainly with safety in mind — it's protecting you from yourself, from writing code that you
shouldn't be writing because you've accidentally done something silly that could crash the kernel. It's not really built for security, to prevent nefarious programs from running — you pretty much have to be root to load code in. The actual vulnerabilities that were exploited relied on the one program type where you don't have to be root, or have CAP_SYS_ADMIN, in order to load the code — and therefore you could elevate your privileges as a lowly user: exploit this to get a program into the kernel, it runs, it exploits the verifier bug, it uses that to overwrite memory in your task_struct to change your uid to zero, and then
your process is now root. So, as I say, a really, really cool vulnerability — but BPF is either available or not available in your kernel, and it doesn't become vulnerable just because you're running your own BPF programs to look for things and audit things and store things; it's either vulnerable or it isn't. The only solution is to upgrade your kernel, or apply the patches to your kernel. So it's not a good reason not to run BPF — and in fact, if you turned off BPF you'd lose so much functionality that I don't think anybody is, and I
don't think anybody's recommending that. So it's more from an interest perspective that you might go and have a look at that vulnerability. And with that, we're at the end. There's a complimentary slide of a whole load of organizations that trust Cilium to do their Kubernetes networking at scale. Usually the first question is 'can you put those links back up', and yes I can — including a link to where you can get these slides, if you want to go over them yourself. And with that, I am done.