"Program analysis for reverse engineers: from ⊤ to ⊥" - Adrian Herrera

Name: "Program analysis for reverse engineers: from ⊤ to ⊥" - Adrian Herrera
Uploaded: 2018-07-10
Duration: 54 min 13 s
Description: Adrian Herrera explores how formal methods and program analysis techniques from academia—including SMT solvers, symbolic execution, and abstract interpretation—apply to reverse engineering and binary analysis. The talk bridges research from the compiler and formal methods communities with practical

BSides Canberra · 201854:13178 viewsPublished 2018-07Watch on YouTube ↗

Speakers

Adrian Herrera

Tags

CategoryTechnical

TopicReverse Engineering Vulnerability Research

DifficultyAdvanced

ResearchTechnical Deep-dives

StyleTalk

About this talk

Adrian Herrera explores how formal methods and program analysis techniques from academia—including SMT solvers, symbolic execution, and abstract interpretation—apply to reverse engineering and binary analysis. The talk bridges research from the compiler and formal methods communities with practical security tools.

Show original YouTube description

BSides Canberra 2018 Slide deck: https://drive.google.com/file/d/1j9rfMt14pubi6G9PKK3akddyeet5bf0x/view?usp=sharing

Show transcript [en]

right now we have a really great talk by Adrienne Herrera on program analysis for reverse engineer engineers from top to bottom so let's welcome Adrienne to the stage thank you can you all hear me all good all right so yes so that last stop is really great and hopefully we can sort of continue on that reverse engineering theme but I know depending on how you feel about this my aim is kind of add to add a bit more mass to it so hopefully you're like some maths like I do and so yeah so the the focus of this talk will be how do we use some some research from the formal methods mathematical world and how we

apply that to reverse engineering all right so Who am I my name is Adrienne I'm a researcher with the defence Science and Technology Group here in Australia I'm also a visiting researcher and I knew and I guess my day job is building these tools looking at this this idea of how do we take this academic work on program analysis and apply it to reverse engineering problems so that's what I yeah that's what I really like doing all right so I'll cover a bunch of different things in this talk lots of different so program analysis is a massive field I guess it's been a research field in academia for forever so I can't talk about all of it

but I've picked sort of three what I think are really useful areas that can be applied to reverse engineering and that's looking at SMT solvers symbolic execution and abstract interpretation all right so what is program analysis pregnant ass is pretty broad so the definition is kind of a bit wishy-washy and probably doesn't mean anything but it's basically how do we build tools to reason about other programs behavior as I sort of just said it's been a research field active research field for decades primarily in compiler the compiler field but I think quite recently with DARPA is cyber ground challenge that they hold maybe one or two years ago this a lot of these concepts from the compiler space are

making their way into the security space and that's kind of what I'll talk about today so yeah so program analysis is automatically reasoning about computer programs and what they do so what kind of things do we want to reason about you well I've listed a few appeared security correctness and yeah compilers are some of the common things that we want to we want to reason about in our programs alright so in program analysis there's kind of like two main flavors if you're reverse engineer you're probably already familiar with these terms anyway and we kind of brushed up on on this in the last talk but you can either analyze something statically or you can analyze

something dynamically and the difference is basically when you do doughnut dynamic analysis you execute the program and you reason about a particular execution trace whereas with static analysis you reason about the program without actually executing it all right so what are the pros and cons of this well so with static analysis we get a broader overview of what our program is doing so we can reason about all well in theory we can reason about all execution traces through the program but as we'll come up a bit later on when we talk about abstract interpretation this is not actually possible due to some theoretical reasons in computer science and so in contrast dynamic analysis we can only reason about what we see right

so in the previous talk we talked about which they talked about instrumenting the the Ruby VM and so we can only sort of generate and reason about the traces that we actually get for a given input so an example of this in fuzzing is we have to generate lots and lots and lots of random inputs and feed them to our program to try and uncover many different code paths through the binary but the upside of this is that because we get a complete execution trace we can be more precise about what we're reasoning about and so yeah so if your reverse engineer you probably already use tools that do some form of program analysis so yes if you do is take

analysis with either Pro you know you load your binary in either it gets disassembled a control flow graph gets reconstructed you might get some like type inference information that either can maybe or maybe not figure out depending on how of for skated your binary years and and all of this jump table recovery as well all of this is I think falls under the umbrella of program analysis in contrast with dynamic analysis if you're a malware person and you use a tool like cuckoo sandbox you you know you inject a DLL into your malware you run it through cuckoo you generate traces and then you can sort of reason about the behavior of those traces and start with with malware

maybe you're interested in API monitoring and with fuzzing maybe you're interested in in code coverage and again I would argue that that all falls under program analysis all right so program analysis in academia is kind of a disgusting thing I think so this is kind of like the paper that introduced expect interpretation and it's completely incomprehensible I have no idea what any of it kind of means and so I'm hoping with this talk we can sort of avoid this garbage and well I shouldn't call it garbage this stuff and and then talk about it you know more practical sense for for reverse engineers all right so what are my aims here I kind of want to

talk about new maybe possibly unfamiliar program analysis techniques so really focus on you know control flow control flow graph generation and that kind of thing gear it towards reverse engineers have a ton of examples try and limit the maths like we saw in that previous slide I'm sort of not going to dive into code or anything like that I sort of just want to talk about the ideas behind these different techniques and we weren't focused on any specific instruction set architecture and that's mainly because it's that's hard right so something like x86 has hundreds and hundreds of instructions and trying to write programs that reason about those hundreds and hundreds of instructions is really really difficult and so instead

what we'll talk about is will sort of talk about program alysus on top of intermediate representation which is quite common anyway and the one I'll focused on is this one rail that was developed by now Google people and it's a very simple intermediate language for assembly code and so you know all the examples that I talked about will focus on this i ah all right so we bit about rail before we sort of dive into some analysis yep it was developed by dynamics who are now at Google if you've ever used been Navi before you can export your code to this rail intermediate language and it's very simple and it was designed to be very

simple so there's only 17 instructions in the entire set all instructions take three operands some of them may be unused and I think the most important thing for rail is that there are no implicit side-effects so when you execute an instruction and say on x86 maybe you're doing an ADD as well as getting a result of that addition you also set flags you know maybe the result is zero so you set the zero flag maybe get an overflow and there are all these these implicit side effects that happen kind of behind the scenes and this complicates your analysis right because you have to reason about all these possible side effects that can so the nice thing about Roe is for an

addition it just adds two numbers right it doesn't set any flags you have to explicitly generate instructions for it to do that so this is an example of right now looks like yeah it just looks like any old assembly language I get yes we've got EXO's addition this astir instruction is nothing to do with strings it's a store to register so writes the value 22 register r2 and STM is store to memory so it stores the value in r1 into this memory address yeah and so that's kind of what random looks like and so I think it's important that we kind of differentiate between the syntax of rail and the semantics of rail so we kind of heard the worst

semantics a bit in the previous talk which is really good but I'll just kind of dive a bit deeper I think into what we actually mean by these two terms so the syntax if we're talking about languages is the symbols that make up that that language while the semantics are the meaning behind behind those symbols so what do they actually mean and so when we do something in Ida Pro we're kind of looking at the binary at a syntactic level right and we sort of have to reason about the semantics in our head when we're going over the the C of G or whatever it is that we're doing and so in this case our syntax are our

various instructions or X or additions whatever upper incisors registers integer literals that sort of thing so when we you know write a parser we pass syntax right and we we generate these these things here on the on the left so in contrast when we talk about semantics we're talking about well what does it actually mean to do an ad and so as I said the nice thing about rail is that it's very it's very obvious where these instructions do so an ad in this case we look at the value from one we add 10 to it and we store the result in not one and and that's it that's the meaning of add and yeah so if

we were talking about an x86 ad then we would also have well if the result is zero then we set the zero flag sign bits and and all of that sort of thing all right cool so let's dive into the first sort of area and that is SMT solvers all right so I'll kind of switch to see phone for this example but we'll get back to the assembly stuff later so say we have this if statement in C so it's based on three different variables XY and Z and depending on some combination of x y&z the if condition will be true and we'll execute whatever code is inside there so what we can do with this

in the program analysis world is is model this kind of code as as math and and when we do that we can then use tools such as there's some T solvers to to to reason about these these formulas for us and improve particular properties that we may be interested in so yes so this is what the C code looks like in logic land and you can kind of see that it's pretty much a one-to-one mapping except for we've got a couple of weird symbols in there so we've got you know this upside-down V is an end operation this is just a not and this is a this is how we represent all and so once once we

have this formula we can do various things with it so we can do just basic things just assign values to x y&z and then get a result so if it's true if it's false we can give the formula to our SMT solver and say an Oscar will actually is there a set of assignments for x y&z that will make this formula true and yep so that's checking satisfiability and the other thing we may be interested in is if the formula is valid so the validity of the formula and what this means is that no matter what values we give to x y&z the result will always be be true and so we'll see later on where

this is kind of useful information all right so we'll just quickly do a simple assignment it's yeah it's pretty straightforward so we just pick some values the x y&z and then plug them in and then we can sort of see 5 is greater than equal to 3 that'll get it true you know explained this explained all this stuff so yeah so we we rewrite our formula we evaluate part of it now we've got some more funny symbols from the title and so this T is top and here we'll it means true and this other one is bottom and it means false so all we've done is just you know solves some of these comparisons here like this 1/5 is

greater than 3 that's true and and so on and so we can sort of keep doing that until we get an end result so false and true is just false so is the formula valid well clearly not because we just gave a set of assignments for x y&z that didn't return true so the fourth this formula is not valid but we can ask that SMT solver is the formula is satisfiable and it will say yes it is and then you can say well give me values for x y&z that will make this true and when I ran this with a SMT solver it gave me these values 5 minus 2 and 5 and so yeah so in

general if we want to sort of model code as these logical formula formulas this is kind of the approach that I would take so the first thing we have to do is convert code to what is known as SSA form static single assignment form and basically what this means is that every variable in our code is Excite is assigned only once and if we reassign that variable we create a new variable and I'll have an example that sort of walks all through this once we have everything in SSA form we can convert all our instructions into these formulas and then we can end all these formulas together and then give the resulting formula to the SMT solver and and find

out whatever information that were interested in so here we've got some rail code it's a bit more complex than the the if statement that we just looked at before so we can see we've got X or instructions so clearing our 1 and we're storing the result in r1 so things will have to change when we convert this to SSA form we've got some additions multiplications and then this one here in rail is a conditional jump instruction and sorry it works by if the first operand is not equal to 0 then we'll take the true path we'll do the jump and if it is 0 then we'll go it the other way and so inside all these

registers we've got these import registers that we I guess we controlled and we can give values for and what we want to work out is what value of in do we have to provide to go down these particular branches in our program so the first thing we do is convert to SSA and so how I've done that here we've just appended these characters to the end of each register name so so remember the rules for SSA form that you can only assign to a variable once so that's why you know on the previous slide when we we cleared our 1 and assign the result back to our 1 we to create a new variable now and so

likewise when we now use when we read from our one stay down here we have to update that to read from the newly created our one a and there are some tools that will do this for you so if you're if you've used our binary ninja before it has intermediate representation that has an SSA form so you can kind of get this transformation for free if you use that tool alright so then once we've converted to SSA we can sort of rewrite all our instructions as math and so this is kind of what it looks like so I won a it's just X or r2 you know we're just storing the negative one little literal in r2 and we sort of

do this for all our instructions here and because we're interested in going down this path so this is dress here this variable here must be equal to zero so we add that you know you know constraint set and so again I guess this is where semantics becomes important we can do this conversion from the rail code to these formula really easily because the semantics of rail are very simple and so likewise if we want to take this path JCCC means that we go down this path when this is not equal to zero so we use not equal to see right here instead so as I said before now we take the conjunction of all these

formulas so we end them all together this is the other one for the green path and then we pass our formula to our SMT solver and SMT solver come backs and says well if you want to go down this path then the in the input that we control must be equal to zero and we can say well what are the values for the input can we use to reach 14 and so to do this we update our formula with the inverse of our previous result so we say well to get to 14 we want all these to be true and we don't want the input to be 0 because we've already got that solution right so we're not

interested in getting it again and we can keep doing that and in general and iterated iteratively generate new values for n that will take us down this path and likewise for this path we give our formula to the SMT solver and we say is there a solution to this formula for the value for M that will get us down here and we do that and it actually tells us no there is no solution for there is no value for M that will take us down this path and so what does this mean so if you're a reverse engineer or malware analyst I should say maybe you've come across this term I take predicates before and basically

what it is is this an obfuscation technique to full reverse engineers into Irene pointless code so I can come up with some complicated code like this that looks like a really complicated branching condition but in fact it will always go down one particular path so in this case will always go down the to address 14 and then I can put whatever code I want here and know that it will actually never be executed but a reverse engineer may waste their time sort of I ring all this code down here and so if we use a technique like this then we can sort of prove that actually there's no use reversing this code here because execution can never possibly get there

so that's kind of nice so yeah so some applications of this opaque predicate detection in the compiler space if we're optimizing code and we find some code paths that actually there's no possible way we can execute them we don't have to add those code piles into the binary that's another use and and there's a whole pile of research on automatic exploit generation and using these techniques to essentially represent your shellcode as these formulas and then finding paths to exploitable code and then seeing if it's actually possible to use your shellcode in in that exploit so this example is pretty straightforward we didn't have any loops or anything anything like that and if we do get loops this is kind of

where this breaks down so if we know the loop bound so we know the number of times that the loop iterates we can unroll the loop and just generate larger formulas but if we can't work out the loop bounds maybe it's obfuscated in such a way that we can't or maybe it's just an infinite loop then we can't really use this technique in that in that case and so it's some of the other issues with this with SMT solvers is even though they've improved a lot over the years they still may not be able to solve really complex formulas and so you know they may just time out or give up and not give you a solution

as I said if you're dealing with infinite loops you can't really convert your code into into a formula this way because you don't know how many times it will execute yeah so that's it for SMT solvers and I think this leads nicely into symbolic execution which is also SMT solver base so previously we kind of looked at modeling code as these logic formulas and so now what we'll do in symbolic execution is to run an interpreter over our code generate these formulas dynamically and then solve them as we're interpreting our code so as far as all these kind of program analysis IDs go I think a symbolic execution is probably one of the more popular ones

there are tons of symbolic execution engines out there on github that all go free and open source anger this one here is probably the most popular one I say but there's others Manticore and clay history and there's a whole bunch of them that you can download and have a play with so this is kind of a spiel on what symbolic exclusion is I'm not really going to read through it I think it's best just to go through an example and then we can see how it works through that but basically instead of operating on concrete values we operate on what we call symbolic values and then when we operate on these symbolic values we generate the formulas that I was just

talking about all right so let's look at an example slightly a bit bit more complicated than the last one so we now we've got a to sort of different points that we can branch from and what we'll do is we'll kind of just run this through our interpreter or a symbolic execution engine and and generate the constraints required to take each particular programming path through this CFG so you know down this path here down this path here and down that path there all right so we'll start so again we have our input that we control and we'll make it symbolic so we don't give it a particular value we just we'll just give it a symbol for now and we and

everything else sort of stays as is so we've still got like zeros here when we XOR these together zeros here and then we reached our first branch point and so this instruction here is saying well if n is not equal to 0 we'll go this way it is equal to 0 that way so we'll update our formula to say that so maybe we want to explore this path first and so to go down this way this this path in must be equal to 0 and we add that to our set of path constraints and then we sort of keep going we update all this stuff this is not operating on any symbolic data so it doesn't change much

and then we get to here and then we sort of exit our program and then we go to our SMT solver and well what are the values for our input that will take us down this particular path and this case is pretty straightforward it must be equal to zero so then we back up and we look down our other path that we can take and to go down this path it's the opposite of going down this path right so now we're now in must not be equal to zero but now we're operating now we've got some instructions that are operating on this symbolic data so all we do is maintain this formula representation of what the

results will be at the end when we pass it to our SMT solver so yeah so temp here we're ending you know our formula here with this and then we're shifting it and then we reached another fork point and again it's the same conditions if this variable here is zero we'll go down this way so we update our path constraints for this branch point view which was in the input is not zero and this branch point here which is this complicated ugly looking thing and we keep going and then we get to the end and we give our path constraints down here to the SMT solver and it says well if n is equal to one it'll go down this

path and then maybe we want to generate other solutions for this so it'll give us two three etc and that's because what we're doing here is basically checking the sign so these are all 32-bit integers and we're checking the sign and if the sign is zero then we're going down this path so it's a positive number and then we back up to our last branch point and we repeat the process but now our constraint is that this value for temp is not equal to zero and we keep going and then we reach the end and we say to our constraint solver what do I need to go down this path and it gives us these results so that's

kind of cool so where is this stuff useful so symbolic execution was originally designed for test case generation and software testing but it's kind of now really heavily used in security tools and binary analysis stuff so the DARPA challenge that they ran out of all the finalists or I think there was seven finalists in the competition they all used a symbolic execution engine to do their automatic exploit generation and they they combine their symbolic execution engine with with a fuzzing tool and so this is useful because if you do fuzzing you'll know that when you files you're kind of just doing things blindly right and you're just mutating different parts of the import in the hope of uncovering a new

program path but what we can do instead is we can be a bit smarter and generate these formulas and then invert them to get a solution that'll take us down a new path that we haven't seen and then sort of use that result as a seed for our fuzzer and and switch back to fuzzing and sort of repeat this process so it's it's good for improving improving that and it has uses in malware analysis as well so again there was no loops in that example what happens if we have loops well bad things generally happen actually so we sort of when you have loops each time you iterate a loop you generate a new path

to explore and if you have lots of if-else conditions in your loop then you quickly get an intractable number of states to explore and and this is kind of I think the main downfall of symbolic execution and there's a lot of research at the moment into how we can reduce this state space explosion problem as it's known and yeah so this is kind of what stops I think stopped simple execution from being really useful on large complex binaries so with the DARPA challenge a lot of the programs that they were finding bugs in where these toy binaries that weren't overly complex and so this this technique kind of worked really well for that and so yeah so that's kind

of the main challenge here the other one that I've listed is this path selection and what I mean by that is is if you've generated all these Forks in your program you want to sort of pick the one to explore next but you want to just not pick blindly but sort of pick smartly so you can work out you know what you're interested in finding if you're for example doing vulnerability research you want to find a bog so you want to sort of pick the next state to explore that you think will lead to somebody corridor or something like that all right cool final topic is abstract interpretation and so this is kind of different to the

other two so the other two we sort of use SMT solvers to reason about our code and this is kind of completely different to that so why do we want to use this abstract interpretation thing it all comes down to Rice's theorem which kind of says that any non-trivial property of a program is undecidable to prove and reason about and so well you might think well what's the point then because you know if all the interesting things we want to reason about in our programs undecidable well then we might as well just go home right and just stick to either Pro but this is where abstract interpretation comes in and so the way that it works is that it kind of works

on an abstract layer there is now decidable so we can now if we abstract our property program properties we can now prove things about them that we couldn't prove that before the downside of this is that we kind of lose some information as we see as we'll see in the examples all right so yeah so what we'll do without strict interpretation it is we'll take our semantics so or our meaning of all our different instructions and we'll abstract them to work on I guess a fire a fire level to make the analysis possible and so I kind of like to think of breaking down abstract interpretation into these three things that we need we need a domain we

need an obstruction function and we need some semantics and then it sort of we combine all this stuff together then we get an abstract interpreter all right so I'll go through each of those three things one by one and then we'll look at an example so with our domain when we when we sort of do concrete executions where we're reasoning about an infinitely large set of values right so you know all the possible sets of integers or whatever and so the key idea here is that we reduce what we're analyzing we reduce the size of these set so that we can approximate them in such a way that we can sort of still infer some meaningful information about

our program and so some common examples when you sort of read the literature and this stuff are sign domains and interval domains and so what this means like so for the sign example instead of looking at actual values of integers 10 - 10 0 a thousand whatever we only look at their sign information so you know maybe we only care if if a value is positive or negative or zero and the same with intervals so maybe we don't care about the precise value of an integer we only care about the range of possible values that can hold and when we when we do this abstraction we can yeah it makes it easier for us to reason about and sorry

if you ever looked at abstract interpretation stuff you always see lattice theory and all this disgusting nurse and so I guess one of the rules I don't know is that our abstract domains have to form a lattice and so for our sign domain this is what it looks like so we have our abstract values that we we operate over so we have the minus zero and the positive and we've also got these two new symbols that sort of actually turn our domain into a lattice and as we've seen before they're called top and bottom hence the name of the talk I guess and they kind of mean so before we said that they meant true

or false now when we talk about abstract interpretation top means that we don't really know what the value is and bottom is kind of representative of the uninitialized set and so I guess this is where the power of abstract interpretation comes in is because when we're operating and interpreting our code we can always fall back to saying the value is top in which case we don't actually know what it is anymore and so you know depending on how we design our domain at the end of our interpretation we could just say that all the values that were interested in a top and then so is that really useful probably not so this is kind of like the trade-off

between designing a really simple domain like the same domain and coming up with more complex domains all right so this is a lattice for the interval domain and so yeah so it's it's kind of a bit different now because before with our sign domain it was only three levels deep now we've got an infinitely high lattice that we have to work on and this sort of causes issues later on when we get to loops and things like that and so how this lattice is used is when we're executing and we reach the programming fork and we go down multiple possible paths and then those paths meet again we use this lattice to sort of combine the

result from those different paths when execution joins up again so for example down one path maybe the the value of a particular variable falls into this interval and down another path the value of a particular variable falls in this interval and then so we use the ladders and we perform some operation on this ladders to say well actually if the very sorry the variable on one path is this and the variable on the other path is that then it must fall in this range right and so that's kind of how we use our lettuce to abstract as we as we execute and again like we can always just fall back to saying well you know

when we join when the two piles join again well we just don't know what the final value is so we go to top all right so that's the domain the other thing we need is a function that takes us from concrete values into values in our domain so that's this abstraction function and so for the sign domain you know if we have the empty set we get bottom if we have the value 10 when we convert it to assign domain we just we lose any information about what the actual value is and we just know that it's positive and similarly here ten and five they're both positive so we can represent these values as a positive

value in our you know sign domain when they're all negative or negative and when we have some combination of the two we we can't really say with any precision what those values are because our domain only allows for positive minus or zero so we just have to say I don't know and similarly with our intervals the value 10 goes into the this very simple interval ten and five you know now we know that I interview it sorry I value is somewhere in this range likewise we there are negatives but now when we have a combination of positive and negative values we can be a bit more precise because domaine allows us to be more precise

because of the way that we've constructed our lattice and so instead of saying well we don't know anything about this in our abstract domain for sine we can say well the value will be between five and ten so there's lots of values in that range and we don't really know which one it is but at least we have more information than just I don't know all right and so the final thing that we need is some some meaning to our program and you know when it operates on our abstract values so as an example when we have the add instruction we're no longer operating on you know integer values we're only operating on whether the

value is minus zero or plus and so we can sort of work out from our basic rules of arithmetic what the final result of this addition will be so if we have two negative values if we add them together we'll still get a negative value if we have a negative and a zero value we'll get a negative but if we add a zero and some positive value we don't really know what the final result of that would be because it depends on the actual values right so the best we can do is say top and I don't know and so you know intervals it's a bit simpler I guess what the table looks a bit simpler

and when we add two values but in these intervals and it just becomes our lower bound is just the addition of the two lower bounds and the upper bound is that likewise all right so this is our program from before that we did symbolic execution on and now we'll do some abstract interpretation on it I guess so we start we set everything we initialize everything to bottom so that's the empty set uninitialized and again we'll treat this input variable as input that we control and maybe you know we've done some other analysis of some other part of the code prior to reaching here and it's already been set to I don't know what the input value 14

is and so this is gonna as you can Polly guess is going to call cause problems because we're working on the really simple sine domain so when we XOR all this together we get zero because that's one of the elements in our abstract domain similarly with our two and now when we reach the branching point we don't know anything about the path that will take and so we'll explore both so we'll split States and explore go down both paths again we know that to go down this way in has to be equal to zero and to go down this way in we still don't know anything about it so we can't be any more precise than top and so we keep

sort of running our interpreter over over our two paths this path doesn't really we can't really get much more information out of this because you know anything that we do on I don't know if the answer is still I don't know so we can't really be any more precisely but down this path we can sort of you'll see that you know when we do this instruction we're storing the value 10 to r2 and so actually becomes the positive value and again we don't know anything about its its actual value we sort of throw this information away as we go and so we keep going and then we hit a point again here where we have to

sort of take two different paths once again we don't know anything about temp because it was based on the input value which we didn't know anything about and so now we have to explore both paths because we don't know anything that time so to go down this path we know we do know that temp must be 0 so we can sort of reassign that as 0 and to go down this path again is still just top so we keep executing and we keep interpreting you know domain and then we reached here and now we've reached a point where we have to merge all these values in our table together so this is where the lattice comes in and we do this join

operation on the lattice and so what we do is we take each row in the lattice I'm sorry in the in this table and we sort of map it back onto the lattice and we take the a point above the lattice that accurately describes all these other values are below it so for the sign domain I'll just go back so for the same domain you know the top row of our tables we've had some 0 and zeros and some pluses the only thing we can do is move up to top when we do our join operation if we were operating on a different domain maybe this one maybe we could be a bit more specific and move up

to here or here or whatever depending on whatever user but unfortunately we're stuck with a really simple abstraction and so we can't really do much with that yeah so this this one here are two we can sort of do something with because we know they're all positive so the result here will also always be positive because no matter what path we take our two will be positive but unfortunately we've in and temp because we have this sort of diverse range of possible values when we do our join on our lettuce we just end up with top and then so here when we add top in our one with top in temp we still get I

don't know so what was the point in all that did we really learn much from this hmm I would argue probably not we know that r2 is all positive at all points in the program is that useful and then I probably not and this sort of all comes down to how we design our domain that we're going to operate on and this is really hard to do on binary code so unlike symbolic engines we can sort of go out and download a bunch of them from github if you were to look at an abstract interpretation engine on github that dear I don't know x86 or something like that you wouldn't find much or anything I

would assume because yeah I guess when you're operating on binary code you you spend a lot of time ending up at top basically and this all comes down to accuracy versus cost of of our lattice that we we choose to use here we can we can do something a bit more complex then assign lettuce but the downside of that is that now when we talk about the semantics of an instruction set on these more complicated lettuces then we have to we have to really think carefully about what each instruction does you know domain so again this example had no loops when we get to loops we have to do some other stuff that I won't go into

here but if people are interested you can sort of Google for this stuff you have to use what's called a widening operator and yeah I don't really want to say anything more about that so the hard things with abstract interpretation is designing doing that initial design of your domain and this really comes down to well what properties do you want to reason about in your code if you care about you know whether values of registers are positive or negative sure go and use the sign domain if you want to you know learn something maybe about buffer sizes maybe something like an interval an interval domain is more useful because you can say well I have a

buffer here and it's its size this in this sort of interval maybe this is useful if you're doing vulnerability research stuff because then you can sort of reason about overflowing buffers and that sort of thing but yeah it really comes down to what do you actually want to look for when you analyze your program all right so that's kind of it for abstract interpretation and so I guess just to sum up the talk you know I've kind of covered these three topics really superficially and you know I saw you saw that I didn't really talk about loops and more complex code constructs but I guess my plan was just to give you an idea of what these

techniques are and sort of point people away from this really disgusting academic literature that sort of describes things like abstract interpretation I think that binary program analysis is sort of becoming popular again and I think that's largely due to the DARPA challenge and you know all these symbolic execution tools that exist out there but there's still a lot of work to go especially at scaling up these sorts of tools and techniques onto real-world binaries as opposed to you know the toy programs that that they looked at in this challenge so some of the important research questions I think is how do we handle this state space explosion you know when we're when we're reasoning about something like Adobe PDF

Reader there's lots and lots of possible states that our program can be in so how do we reason about all of these and how do we prioritize which states that were actually interested in how do we scale this these techniques to look at these large binaries so programs like anger written purely in Python and sort of one of those so I kind of work on the anger slack and one of the things that pops up a bit these people you know pose well I try to load Firefox executable in anger to do symbolic execution in you know and my Python interpreter crashed is a pretty common thing so how do we scale these techniques to actually work with

real programs like Firefox or Adobe PDF and other than symbolic execution engines yeah there's not really a great deal out there in terms of open source and yeah so I guess that's it thanks for listening [Applause]

incidentally a dream talked about things like flee and anger and ste Adrian is actually one of the developers of estimating the symbolic executioner engine say so that's great as well so are there any questions

so you just give them a great talk covering things like lattices and few other concepts from abstract algebra but you're kind of dismissive of the academic literature which kind of introduces these so I'm wondering if not from there right okay so yeah maybe I was overly dismissive I don't mean to I mean I'm an academic nerd myself right so yeah maybe my face of language was not the greatest however I do think in particular for abstract interpretation the academic which is you're behind it is really difficult to understand for for things like symbolic execution and that sort of stuff I think it's yeah the barrier to entry is a bit easier but for abstract interpretation yeah I think a

lot of the work that happens in academia is and the papers that get published are very hard to digest so where would you go instead mm-hmm yeah I don't know I guess I guess if you if you really want to learn about it then you sort of have to dive into all this lattice theory and and and all this kind of stuff that that sort of helps you gives gives you the background to understand these papers

from a practical applied standpoint what is what is the right size of application to analyze with the current tools the right side there may be like thousand lines of code or something right so yeah that's a good question I guess it depends on the tool itself so for things like anger and Manticore or those kinds of symbolic execution engines when I read stuff about them and people writing blogs about them they're mainly using them to solve CTF challenges and that kind of thing so I guess looking at binaries of that type of size and well I guess I'm kind of biased because I'm the developer or one of the developers of s3 but I would argue that as two we can

handle much more larger and more complex programs because unlike all the other symbolic execution tools s2e runs on top of TMU so the emulator and so it runs a full virtual machine and so you can sort of reason about how your program interact with its environment and the operating system and you can't really do that with tools like anger and and Manticore and things like that because they're there emulators written in Python and and when they you know make a sis call or call something in Caleb C then it's harder to you essentially have to write a model that that emulates that functionality and so if you've got code that does a lot of that kind of stuff then yeah I

don't think you would get very far with tools like hangar than that any more questions well let's thank Adrian one more time you [Applause]

"Program analysis for reverse engineers: from ⊤ to ⊥" - Adrian Herrera

Related talks