Finding Consensus Bugs in Novel EVM Implementations with LibAFL

Name: Finding Consensus Bugs in Novel EVM Implementations with LibAFL
Uploaded: 2023-02-01
Duration: 29 min 26 s
Description: Consensus bugs in Ethereum Virtual Machine implementations can cause network splits and financial loss. This talk demonstrates using differential fuzzing and the LibAFL framework to detect consensus bugs by comparing multiple EVM implementations (Geth and EVM1) and monitoring for divergent behavior.

BSides Berlin · 202229:26160 viewsPublished 2023-02Watch on YouTube ↗

Speakers

Daniel Schmidt

Tags

CategoryResearch Technical

StyleTalk

Mentioned in this talk

Tools used

American Fuzzy Lop

Frameworks

LibAFL

About this talk

Consensus bugs in Ethereum Virtual Machine implementations can cause network splits and financial loss. This talk demonstrates using differential fuzzing and the LibAFL framework to detect consensus bugs by comparing multiple EVM implementations (Geth and EVM1) and monitoring for divergent behavior.

Show original YouTube description

Finding consensus bugs in novel Ethereum Virtual Machine (EVM) implementations can be challenging, but using the American Fuzzy Lop (AFL) fuzzer can be an effective method. AFL is a popular open-source fuzzer that uses genetic algorithms to generate inputs and identify bugs in software. When applied to novel EVM implementations, AFL can help find consensus bugs that might otherwise go unnoticed. About the speaker: Daniel studies Computer Science at the Technical University Berlin, currently writing his thesis at the Security in Telecommunications chair. His study is focused on distributed ledgers, privacy and security. He spends his free time researching and exploring new topics related to Ethereum and security. https://twitter.com/ogechno

Show transcript [en]

thanks for having me pleasure page is all yours please welcome please Round of Applause

yeah so I'm gonna show you today um what consensus bugs are and how to find them using differential fuzzing by utilizing lip Alpha so first off the short version of what is a consensus bug essentially if we have two or more nodes that should normally come to a consensus about a state and they failed to do so because of a bug in one of the clients this is what we call a consensus book so such a buck was triggered on the ethereum mainnet in 2020 which resulted in the loss of about 30 blocks where and which transferred over 6 million US dollar worth of eth and ether is just the native currency of ethereum so nor I already mentioned ethereum but

what actually is it it's a decentralized network of nodes that come together um to a consensus about the global state of ethereum and that's what we call a blockchain and it has this uh also this digital value in it which is um which is transferred which is called eth and but it also has a touring complete computer inside of it which is able to execute so-called smart contracts to build decentralized applications on top of it and each of these nodes consists of two clients which are mainly the consensus client and the execution client since the merge which happened this year but in this talk we will focus only on the execution clients so this is what you

can see here on the picture we have the guest client which is the little Goofer with the mining headset on and we have the other client which is nethermind so as you can see we have these different clients and they each need to implement a execution spec which is the spec for all execution clients so we essentially have different implementations for these specs and these are the main clients for the execution clients which is Geth and nethermind and we also have Basu and Aragon now why do we have these um why do you have these many implementations of these execution specs it's because of client diversity so for example if we only had one client

if this one client had a Dos vulnerability in it and this would be exploited the whole network would be down so the Integrity to keep the Integrity of the network we need to have multiple implementations of these specs and to ensure the stability of the network and inside of this execution client there's the ethereum virtual machine which I already mentioned um which I already mentioned that there are smart contracts which can be run on ethereum so this is the computer that actually executes the smart contracts and it's a really complex part of the software stack so it's perfect for vulnerability research and it's a simple stack based machine so instead of storing these temporary short

Pass based values in registers we would push them onto a stack and pop them off the stack if we need them but this evm that's the short form also contains a memory and a storage and the storage which I will go into in a second which all will push the values that are stored in storage onto the global state of ethereum so these smart contracts which I mentioned are run on this ethereum virtual machine are usually written in a higher level language like solidity or Viper and they then get compiled to these evm up codes so you can kind of imagine it as a Assembly Language for these smart contracts so this is an example as you

can see on the slide um of this evm up codes and what we do here is essentially push an address onto the stack and push an index onto the stack and then we call the Astro op code and by that we will store the address onto the storage slot with the index 0. and if we retrieve this want to retrieve this value back we would essentially push the index back onto the stack and get the and call the s-load opcode and get the address and retrieve the address so now since these nodes are operated by people and they essentially since they need to run these smart contracts which requires computational resources which somehow need to charge people for

executing these smart contracts and this is done by gas so gas is a unit by itself and everything that is can be executed is charged with gas so every operation every stack every memory and storage access and but also calls to outside to other smart contracts and this is this used to be calculated by so the total fees which the person who executes the smart contract needs to pay are calculated or used to be calculated by um getting the amount of gas usage that was used at the End by the smart contract and multiplying it by the East value per girth per gas sorry and so then you have the total amount of fee that is that needs to be paid

and now we have this consensus box and essentially um yeah this consent sparks happen if one of these clients since I told you we have this execution spec and if one of these clients does not Implement them correctly and would this would lead to um and this would lead to a different state a different local state of the node and the other nodes would have the correct State they would they wouldn't come to a consensus with one another about the global state so for for example in this um in this slide you can see the account zero has 10 East for the nethermine client and the guest client thinks account zero has 13 each so this

is both their local state and since they're differ they wouldn't come to a consensus about the global state of the network and this could for example happen if there's a an error in the gas calculation so as I said the gas is used to charge fees to the person who executed the smart contract and if there's an error inside of it then this could lead to such a difference

so resulting resulting of um this consensus bugs we have a so-called Network splits because when they don't can when they don't come to a consensus about the global State they will just split off into two networks which each have their state and which now each have their own Global state so as you can see here now all Nether mines are now in one network and all gas clients are now on one network so as I said in the beginning with the motivation that such rebuk was triggered on mainnet and it's really complicated to emerge these two networks back together but also there's a different problem which is double spending so we have these exchanges exchanges where you can

basically exchange your eth for real money and these exchanges need to run some kind of client in their back end and if one exchange now has another mine client in the and their back end and things account zero has 10 East and let's say I'm malicious and I uh maliciously exploited this this consensus bug and I control account zero then I could go to the first exchange and say Hey I want to get euros dollars from my eth and I would get the US dollar worth of ease and then normally this value would be updated for the whole network but since we now have two networks the other network doesn't know about this transaction so I could also go then to

net exchange 2 which uses guess in in their back end to get kind of the state of ethereum and I could go there and say Hey I want um I want U.S dollars for my 13 is and they would give it to them and so you know kind of double Q money with it so since this is a since it's really cumbersome to find these consensus bugs manually we are going to use a technique called a fuzzing and the tldr fuzzing is that you throw random input at the Target and observe it for undesired behavior for example crushes crashes or consensus bugs on the slide you can see a a smart browser which just means that we also

have a monitor and a mutator and the monitor just gathers information about the Target and then feeds it back into the mutator and the mutator derives new inputs from learning about this information so and then the father would learn over time which um test inputs are more interesting and fuzzing has been proven to be really great in practice for example in this quote you can see as of July 2022 it was as fast found over 40 000 bucks in 650 open source projects um so yeah that's pretty good

so now as I told you we are trying to find these consensus back through fuzzing and essentially um so these consensus box as I said appear when there's a difference of two or more clients implementations and they differ in some kind of logic so what we could do what what we could utilize is differential fuzzing so we now have these two Targets or EVMS in our tests in our use case and we would feed them the same input and we would observe them for the result so if the results differ we know we have a bug in one of the two clients or the two Targets and which could be in a yeah could be a consensus bug

potentially and but this technique is not perfect so for example if we have this vulnerability in both of the targets then our differential father wouldn't be able to Define it because the result is the same so next one we need as I said two or more targets to do this differential fuzzing and we chose guest and evm1 um because guess as you saw in the client diversity slide had a huge market share of execution clients I think it was like almost 80 percent and it has been a lot has been around for a really long time so it's really well tested and then we have evm1 which is a novel very promising evm implementation and

yeah so the major difference between those two is that gas is a full client and it has an evm inside of it but it also does other things and the evm1 is just an evm implementation and we also chose both because we will call um each evm functions from a c plus file so evm run evm1 is written in C plus plus so it will be really easy for us to call it from C plus plus and get is written in go which is also doable to call from C plus plus which I will explain later um like for example on the other hand we have nethermind which is written in C sharp and we have basil which is written

in Java which is a bit more harder to call from C plus but also evm1 has a differential fuzzing approach but it only fuzzes against one of its own targets so it has multiple implementations and it's code space code base and it fuzzles them against one another so it never causes differential fuzzles against EVMS that are outside of their code base so we also thought that would be good opportunity for us to do that um what I forgot so what you essentially could do you could now take these evm binaries and call them from the command lines and passed the fuzzing input through the arguments and collect information about how the execution went through qimo or

um or the standard out but we chose a different approach approach which is called code coverage guided fuzzing a coverage guided in process fuzzing and yeah what is coverage guided fuzzing first so essentially coverage guided fuzzing is a tracking code path in the Target to leverage that information to mutate or to derive new inputs new interesting inputs so this means if we have now a test case and feed it into our Target we would track the code path as took inside of our Target and then if it um if it goes through a code path that wasn't already seen or hasn't seen as often then we would say to our father hey this test case is interesting

to do that we need to instrument our Target so we need to insert additional code into our Target which is done by the compiler and as you can see here it inserts the sanitizer cough Trace PCS which are inserted at every code block so before the if statement inside the if statement and after the if statement and these are callbacks which report back to our fuzzer that we hit this certain code path now we're also going to use oh and this technique kind of slows down the execution of the target so but it has proven to be really to yield good results and practice and nonetheless so uh code coverage so what we do now is we use an

optimization technique called in-process fuzzing which essentially allows us to have a higher execution speed and this is done by not launching a new process for every execution but doing the fuzzing in one process and also mutating the input in memory and but this is kind of prone to this disruption by unintentional memory leaks or dos bugs and you could Loosely say the slowness of code coverage fuzzing and the um the higher throughput we gain through in-process fuzzing kind of loosely you could say that cancel this other or balance each other out um now what approach did we take to to to utilize these Concepts so basically there were two main approaches to do that we had have these pre-bundled

fuzzers which are AFL or AFL plus lip puzzer and Hong fos and we have these written these fuzzles which are written from scratch which are often um yeah academic but of course there are also other approaches in Academia it's just one example but we didn't like both because uh pre-bonded fathers are normally kind of unflexible and they're huge code bases and we also didn't want to write a further from scratch so we chose a modular fuzzing framework which is called librafl and I heard they just presented their paper so congrats to them and now essentially what a modular fuzzing framework is is that you can pick and choose the components you want to build your fuzzer

so when we get back to this example if I just wanted to change the mutator I don't need to change the whole fuzzer so I could I should be able to switch the exchange just the mutator since each of the components just communicate over an interface with one another so the benefits of lift are file are the reusability of code the readability it's really good at scaling um straight to for comparability and it's also good for quick experimentation so what features do we require from librafl we want to as I said use in-process pausing we want to collect coverage and for to collect coverage we need to instrument our Target somehow

so first off how did we instrument our Target we used a decline wrapper from librafl where clang is just a C and C plus plus compiler mainly um and we use this to kind of insert these these additional callbacks into our Target and but this of course didn't work for and this is mainly for evm1 and our file our c plus file where we will call our EVMS from but this of course didn't work for Geth since I said it's written in go and we first tried to use go fast build which which is a library to instrument go code so it would insert these callbacks in go code and and would be able to also call

go from C plus for example but this unfortunately didn't work for us and I only figured out like last week why this probably was because we used a little puzzle support and they um if they're if they're finished you know if Gulf hospital is finished it exposes a lip puzzle harness which is just a function you could call and we also had this function in another place in our code so they're probably just overlapped and so that's why it didn't work and we pivoted to a seagull so sigo just enables us to call go code from C plus and this is the example from our c plus file the actual code so it's really easy

but unfortunately we because it only enables us to call and go from C plus we weren't able to instrument go now um now we have our actual fuzzing setup so all the components we as I said live off else they kind of pick and choose your components you want to build your perfect fuzzer and so this is all all the components but don't be worried I will go over them so don't be overwhelmed I will go over them one by one and I split them up in two so you can see them better so first off we have the Corpus and we have essentially two types of Corpus um first off we have and a corpus is just a

fancy word for saying we store interesting test cases um and yeah we have an in-memory Corpus which stores interesting values in memory which are derived through the fuzzing process and we have the on disk Corpus which essentially stores our objective which in our case will be the consensus box and we have an initial Corpus which we manually added so kind of as a starting point for our input and next off we have a scheduler which is the index's land time minimizer scheduler which essentially selects the quickest and smallest test case from our Corpus and this just means since I sorry I forgot that but our test cases are essentially smart contracts and so for example if we have now two

smart contracts and they reach the same state but one takes longer to execute than the other one it would always prefer the faster one next off we have the stage and mutator component and the stage is just a component where you can where you can actually do some actions on the test input before it passes them on to the executor and for this we only have a mutator and we use a very simple mutator called the Havoc mutator which just does random bytes and bit flips and inserts so it's a very simple one and a mutator um I already explained it earlier is a component which takes an input and derives new input from it

now we have the in-process executor as I said in the requirements that we need we want to use in-process fuzzing and sorry this is the component that enables us to do so and yeah it executes our Target essentially within with the input the test case input and the test case input which are our smart contracts are represented in a in bytes so the AVM up codes are essentially in bytes format and we will person onto the harness and our harness is the C plus plus file I mentioned a couple of times and it exposes the loop puzzle harness and we yeah our Target calls this harness and the harness is just an entry point for our tar execution Target

um and we call both our evm implementations from this harness which is the get evm and the evm1 and we do call get by sigo and evm1 directly and both after so we will call these functions directly and both of them would give us the gas back that they use for executing the smart contract and next up we have the Observer The Observers just say component that kind of observes the target for some interesting values and we use two um which are at the Observer for our code coverage so it observes the target if new code coverage was uh tracked and we have the time Observer which is important for our scheduler to track how

long it took for a Target to run next up we have the feedback component which is just classifies our test cases if they're interesting or not so at map Max we have the max map feedback component which is for our coverage and it's essentially tells the father did this test input maximize our coverage map next we have our time feedback which is a knob feedback so it essentially says always it's not interesting and because we need this component to actually track the time for test cases for our scheduler to select the quickest test cases and then we have a crash feedback so this might be confusing because I said we're gonna find consensus bugs and not

crashes but we're at the end of our harness we are comparing both the gas we received from our EVMS and compare them and if they are not the same so we have a um State difference so potentially a consensus bug we will tell librafl that we crashed uh I know this is kind of a hacky way but it works um yeah and there are also other ways to identify the state inferences we use gas and but there's also the state route which is just a hash of the current state of ethereum and we have evm traces which are just structured logs of metadata like the evm at the op codes executed or the current state of the

memory and we used and essentially the state route and the evm traces both would cover more State differences than the gas but we chose gas because it's implemented anyway and it's faster so it gives us a higher throughput and the execution so um yeah but and also evm1 doesn't have the state route out of the box implemented so we kind of defaulted to just the gas so now the feedback components essentially put these test cases if they thought they're interesting into Corpus so if we collected new coverage we would put these test cases into the in-memory Corpus and if we have an objective which is our consensus bug or crash feedback and we would put him onto this and now this whole

Loops kind of yeah repeats so first what are our results so we now have a functioning setup to find these consensus bugs via coverage guided fuzzing and differential fuzzing with librafl where we can easily expand experiment with components so if we for example wanted to change the mutating strategy as I said we use a very simple mutator but if you want to use a more sophisticated mutator like a grammar mutator you could easily switch them out using librafl and for our metrics so normally you compare executions per second it's really hard to compare them here because we have this gas and essentially we gave them a lot the evm a lot of gas to use the

maximum amount actually and it could potentially run so the evm could run for a really long time so we've seen every value between like just eight executions per second up until like 260 executions per second and we also tested with a constant test case which did about 500 000 executions uh in five minutes on this laptop and we also found a potential first bucket and just a test structure and we are also looking at the moment at a bug that um we are yeah still investigating if it's if it's even a buck yeah and now to our conclusion uh liberal is an excellent tool to quickly and to have a quick and flexible fuzzing setup

and as you hopefully know now know that the the ethereum is at risk of consensus bugs and that that can be caused through the ethereum virtual machine which is uh By Nature really complex and differential fuzzing can tame this complexity of the AVM by identifying these consensus bugs at scale so first thanks off to Vincent for inspiring this idea and kind of overlooking this project and thanks to Dominic from the librafl team for answering all my questions and thank you guys for listening thank you

Finding Consensus Bugs in Novel EVM Implementations with LibAFL

Related talks