Reverse Engineering Websites

Name: Reverse Engineering Websites
Uploaded: 2020-11-18
Duration: 51 min 13 s
Description: In the ideal world, every engagement would grant you source code access and a copy of the application/environment. Having 100% visibility into the static and dynamic environment of an application is incredibly powerful. By its nature, it eliminates the need for guessing and will make attacks signifi

Bsides CT · 2020 · 51:13 · 343 views · Published 2020-11

Speakers

Andrew Wilson

Tags

Category: Technical

Topic: Web AppSec

Team: Red

Style: Talk

About this talk

In the ideal world, every engagement would grant you source code access and a copy of the application/environment. Having 100% visibility into the static and dynamic environment of an application is incredibly powerful. By its nature, it eliminates the need for guessing and will make attacks significantly more informed and reliable. Simply put, a better job can be done because this is a position of advantage. In all situations less than that ideal, we can use reverse engineering to get into that position. This talk outlines the concepts, strategies, and specific methods I have used to learn the inner workings of websites for exploitation. We will specifically cover:

* pattern matching to quickly identify technologies
* deductive and inductive reasoning as ways to dial in our understanding
* how to ask informed questions to prove out those assertions
* a walkthrough of how code structures look, and what the rendered website will show
* a demonstration of decomposition techniques

Andrew has spent the past two decades working with technology. The first half of his career was as a professional software engineer with an emphasis on agile, cloud, and secure development. The second half of his career was as a penetration tester with specialization in application security and training. Andrew has performed hundreds of penetration tests throughout the last decade and led even more. Andrew is a co-founder and ex-main organizer of CactusCon, an ex-Microsoft MVP, the lead of the Sen Security project, and the Vice President of security consulting services for Bishop Fox.

Transcript [en]

andrew wilson kazushi who's going to go over reverse engineering websites so let me uh bring andrew in now hi andrew hey how's it going can you guys hear me yeah sounds good awesome thank you all right uh have a great talk and take it away cool let me just uh share my screen here

okay just as a sanity check you guys can all see that i'm hoping you could say yes i wouldn't see it anyways that's okay all right well um here let me get let's get started so first off uh thanks everybody for having me uh this is actually really exciting for me personally just because it's the first time i've given a talk in about the last five years uh and uh i have kind of stepped away a little bit from sort of the community and the scene for other reasons but i'm really excited to be able to come back and kind of share a little bit about what i've learned over the last few years and kind of share with you all

in general i guess the second part is i'm from connecticut i think it's an awesome state i love basically everything about it except for uh property taxes so if you live there uh good luck but um born and raised in middletown so uh it's pretty awesome that the first place i get to come back to and kind of share a talk with is is kind of my home state if you will and i'm pretty pretty happy to be able to be a part of that um kind of a little bit about myself so i work for bishop fox i'm the vice president of consulting services so basically all the pen testing that the company does kind of revolves through

my department you know bishop fox is one of the largest privately held security firms uh doing nothing but offensive security that's kind of our primary focus um i also started a thing just recently called the sen security project so if you like application security you want to learn more about it you know not just from this like how do i get a job and pass like an exam sort of thing but really learn a little bit more about sort of the ins and outs of pen testing of applications uh we kind of explore that a bunch there um i'm a co-founder of cactus con so uh also it's gonna be virtual this year um

please join us it's in february of next year i should say excuse me um so cool uh to have you join us we started as a b-sides and we've kind of grown uh pretty big since then but i'd love to kind of see some of my friends from connecticut join um for the sake of this talk in particular though there's kind of two facts i just wanted to kind of call out which is one i was a software engineer for 10 years right i spent a lot of time in code building enterprise systems for financials uh vertical for academia i've worked in startups i've worked for big companies i work for small companies i've solved a lot

of interesting problems kind of over that time i was one of the first microsoft windows azure mvps and i spent a lot of time kind of in code um the other half of my career really the first sort of six of the last decade was as a specialty tester focusing on application security predominantly web applications in general and so spend a lot of time thinking about the problem of how do you do this right and how do you do this with sort of an engineering mindset because for some reason there appears to have been a bit of a divestiture from web application testing to uh engineering as if they're like two separate things and really

they're they kind of go hand in hand they're two parts of the same stone um this particular talk actually came about through a conversation that i had with a guy named thomas ptacek if you're familiar he was one of the co-founders of and partners of matasano and uh we talked about the fact that security testing is first and foremost a problem of visibility and i've always taken that to mean that if you had come to me and said like i've got a million bucks right i've got all the time in the world i want to go break into this application i want to really figure out you know what the problems are what would i do um

i would say give me access to the source code give me access to the server you know let me sort of look at something how user traffic and real data is moving through it right i want to see what's going on because now i'm not guessing right i'm immediately in a position of greater advantage than i would be on the outside because i don't i don't have to like sort of fill in fill in the gaps and i'm not playing sort of 20 questions in a very expensive you know pen test if you will right um so if the idea is that that is the ideal state right if that's what we agree is like yes that

makes sense we should we should be targeting that then the question is what do we do in situations where we don't necessarily have that access or have that same level of um sort of information available to us um reverse engineering is just sort of one way to do that right and i would say that it's maybe not even the most preferable because as you get into some of the source code and you get into some of these systems when you start kind of going down this methodology you'll find that you can just go download the source code right it's publicly available frameworks and you know you know what version someone's using and you can take it down and you can

sort of shorten that loop and you can pull it in and in other situations you might find that you could target specific vulnerabilities to sort of obtain source code off of the web server through like directory traversal or breaching the server itself and then grabbing source code like all of those things are just as valid ways to sort of solve the visibility problem but reverse engineering is is the one that i think we're going to talk about today obviously right in practical terms this is something that is necessary for a variety of reasons first is that if you don't understand the systems that you're looking at it's very difficult to understand weaknesses in them right i don't know i don't really care

what you know about vulnerabilities per se but if you can't look at a system and understand like the composition it's gonna be very difficult to take vulnerabilities and then marry them up to what's actually happening if you have no kind of concept of what's happening behind the scenes and vice versa the more you understand of it the more you can leverage vulnerabilities that you you do know about because you can look at them in kind of unique ways um the second part of it is that you would do it because uh if you want to write specific exploits for uh this application or for a web server website you basically need information you need to understand how it works in order to

create reliability in those attacks right so the more sort of information you can kind of gather in terms of how it's composed and what it's doing the less guesswork that's necessary in it which is again the value of having you know a workable lab or workable thing on your own sort of uh premise right and then you can go test with it and then use that as a way to validate against something else that you're going after but i want to be kind of clear as we're talking about this that when we say reverse engineering websites we're not talking about reverse engineering like a binary that process is going to be mechanically quite different for a

variety of reasons right when you have a binary that you download and you're going to go down this process of reverse engineering it you basically for the folks who aren't familiar with i guess the term you have a decompiler which basically takes the the state that you're looking at it and it recomposes it into a reasonable representation of code from what it's doing right because you have this intermediary layer that you can sort of work backwards against but we don't we don't necessarily have that right we don't actually have a guarantee of a home court advantage that's the first thing that goes into it and what i mean by that is like when you pull a binary down

it has to operate in your environment on your computer and so it's your memory it's talking to it's your operating system it's talking to and with that you have a lot of control around the outside perimeter that you can actually very much watch and understand and see sort of how an application is behaving and then use that to kind of understand it and so unless you're downloading source code like we said in the first place like you don't you don't have that right you're looking out on the outside and trying to figure out what it's doing um the second thing is it's not always a full picture right um when you download an application right

in order for it to operate and function it has to have source code to do that right meaning that any feature you'd be looking at at any given time that is functional right you have to have the full scope of that feature set available to you to look at whereas in a web application there's no guarantee that there aren't a bunch of other forces sort of invisibly affecting the way that this is behaving there are ways to detect that but you don't always get the whole story either because you you lack that visibility or you just lack the fact that it's not on your computer right it's not on your computer and then the last thing i'll say is that

there's no real set standards there are standards but they're kind of more like the air quote standards of like we said this was a good idea and then like some people followed it but i wouldn't necessarily call it like a standard standard a binary file when you pull it down has to meet sort of a minimum contract a minimum criteria of execution right and it might be very difficult to discern how it operates or what it's doing because they have a bunch of counter measures and things like that deployed but the reality is it has to follow those or the operating system can't execute it right you know code has to meet these minimum constructs but when we look at a website

um we do have some rules and we're going to talk about those rules but we don't necessarily have them in a set like this is exactly the way it has to operate or it doesn't work in the same way as we would talk about a binary file so our process right of testing is going to revolve around three basic ideas and the first of this is a thing called recognition-primed decision making so back in sort of the 80s or early 90s a gentleman named gary klein bid for a contract with the us military because they were having a problem teaching field commanders how to go solve difficult problems that were time gapped meaning they had to solve them

relatively quickly and they didn't necessarily have information complete information to make those decisions and they had just gone down this route of like making like a flip chart that said like if this go look over here and if that go look over here and it failed like it was it was awful and so gary klein and his team wrote about this in a book called sources of power about this this exercise they underwent to try to help the military learn how to make decisions better basically right and so knowing that the problem was time gapped and incomplete information they took a look at people who had a lot of experience with this already and so they went to firemen they went to

nurses they went to chess players and they looked at how they solved problems to understand what was going on in front of them and across the board they all basically came back to gary klein and his team and said i don't know i'm not making any decision right i just look at it and i know right which is fundamentally awesome but also super useful right useless i'd say because uh that's not that's not a testable method but when gary klein's team sort of got into it what they realized was that they aren't in fact making decisions they just know what's going on because they've taken the time to understand the sort of lever points that actually

affect the outcomes that matter right so firemen are very versed in how fires work they look at how smoke behaves they look at how fires start they understand a lot about buildings so that when they're actually trying to solve a problem of going to a fire and identifying what's going on from an outside they can take all of the different factors that affect how this is going to behave when the sort of combination happens and they can look at and go well you know this is probably going up an elevator shaft it appears to stop on the fourth floor maybe there's an obstruction you can see how the smoke behaves different and you can do this almost instantly

right this is almost like an instantaneous pattern recognition model and for anybody who's done something for a long time i think that's going to feel very natural which is why they call this style of learning naturalistic decision making right it's very similar to folks who are familiar with the ooda loop which has the same premise of you observe what's going on you orient what it means in your head you decide and act and you repeat but the model the actual model itself in our case is following kind of this pattern over here which is is it a familiar situation yes or no if it's no then you do a lot of seeking information reassessment and there's

sort of a loop over here if it is familiar then you watch to see how it's executing and you determine like is it is it meeting my expectancy is it behaving the way i wanted to does it have the relevant cues that i'm looking for in order to confirm that is in fact what i think it is and then you decide right and that decision part also has a loop which might say either a i already know what this is so i can go test it and i can go do something with it whatever that is um or i don't know i know what it is but i don't know how to solve it and you

might go through this like mental game of like scenario play in your head to go like here's some ways i could solve that um this actually turns out to be one of the core pieces of this and i want to just call out this is really difficult to sort of learn over time but it's super useful and uh if you don't have these expectancies if you don't understand the situation being familiar it can really kind of lead you astray as you go down the path of understanding how code works where you won't be able to know the difference between custom code or code that they've developed or architecture or even just other interactions that are

happening in it because you have no expectancies you have no cues and you don't know how to sort of measure the mismatch of what's happening in it and the best way to explain that is like a couple of years ago you know back before google is sort of ubiquitous and and like cellular service was not so great i went to go visit my friend who had just moved up to the pack west and he'd given me these directions on paper i printed them out and it was perfect directions right it's like get off the freeway come over here and you get to like the last step and it was you know follow this until you hit

hidden something road right and hidden road how far away it was i don't know right it's dark it was night time when i got there and i'm all i know is it's a road right it says hidden so maybe i'm thinking it's like a small side street or something like that i kid you not it probably took me an hour to an hour and a half to go two miles down the road because i had no idea what this like far what it was i had no way to measure the delta i'm literally every road you see whether it's left or on the right you're like is this it is that it and it was pretty

disappointing because when i finally got there there was a huge sign that said like oh this is the hidden area that you finally arrived at i was like oh you jerks like i said for an hour i've been looking for this thing but that's the problem when you can't actually use patterns and you can't use this sort of expectancies to solve problems is it makes you have to go really what would be this kind of loop up here repeating over and over and over again until you get there right the second part of our reverse engineering is going to rely on sort of the scientific method which is sort of part one and part two of this

which is the inductive method and the deductive methods right and just scientific method would be a better way to kind of look at it the inductive method is best probably described by edmond locard and what he called the locard's exchange principle which basically says you can't do anything in this world without leaving evidence of how the action was performed even if what you're doing is trying to hide evidence of what you've done the lack of evidence and how you hide the evidence leaves evidence and it's very likely this transference sort of principle that you can with enough clues figure out backwards what happened inside of this is like the sherlock holmes like epitome of a sort of statement and

from a forensic science perspective the deductive method is more of like a reductionist method this was from rene descartes and he talks about uh this idea of like a general theory and then an application and if we know like x is true and y is true then then the other one must be true so like in this example it's like all men are mortal socrates is a man therefore socrates is mortal and you can kind of use a general theory apply to a specific person and then make assertions about who that is and how they have to operate based on those sorts of things right so with that kind of in mind right with

that kind of premise let's talk about patterns right because we don't even though we don't have standards meaning you're not obligated to compose an application on a web server a certain way there are some fundamental problems you still are kind of up against um and this is probably the best explanation i've ever seen on this which is applications fundamentally have to solve three basic problems right the first is that anything that that they are doing they have to present back to you so there's some interaction with the user which is typically your presentation tier it could be a command that you're sending into the application that says hey go do a thing for me or go show me this information or maybe

i send data because i need to store it but ultimately some sort of interface and that could be as simple as like a terminal or as complicated as a web application or a desktop application or a very distributed application but that would be sort of conceptually what we all call this presentation tier the second part of it is that you have to have some sort of underlying logic so like if you make the get total sales total there's like a function that would correspond to it and go okay like what are the permissions this person has what are the sales that they're trying to reference and it follows like a routing workflow system to kind of

check and aggregate the stuff that would solve it and then it's going to go request that from your persistent tier your data tier then some storage thing right this is sort of the fundamental problems that everybody sort of would have inside of building an application this could be as simple by the way as like five lines of code or ten lines of code like in one little application right so they can kind of get really blurry where your logical tier is shoved into your presentation tier which then directly accesses the database or what have you you can have cases where they pull all the logic tier and shove it into the database and you're just talking about a
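Editor's sketch: the three tiers described here (presentation, logic, data) can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not code from the talk: the `SALES` and `PERMISSIONS` tables, the user names, and all three function names are hypothetical, and the "get total sales" operation is borrowed loosely from the speaker's example.

```python
# Data tier: hypothetical in-memory stand-ins for database tables.
SALES = [
    {"owner": "alice", "amount": 120.0},
    {"owner": "alice", "amount": 30.5},
    {"owner": "bob", "amount": 75.0},
]
PERMISSIONS = {"alice": {"sales:read"}, "bob": set()}


def fetch_sales(owner):
    """Data tier: return the stored sale amounts for one owner."""
    return [row["amount"] for row in SALES if row["owner"] == owner]


def get_sales_total(user):
    """Logic tier: check permissions, then aggregate what the data tier returns."""
    if "sales:read" not in PERMISSIONS.get(user, set()):
        raise PermissionError(f"{user} may not read sales")
    return sum(fetch_sales(user))


def handle_command(user, command):
    """Presentation tier: turn a user command into text shown back to the user."""
    if command != "total":
        return "unknown command"
    try:
        return f"total sales: {get_sales_total(user):.2f}"
    except PermissionError as exc:
        return f"denied: {exc}"
```

In a small application all three functions might collapse into one, which is the blurring between tiers the speaker mentions; the responsibilities still exist even when the boundaries do not.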

sort of thin veneer on the database but these are common patterns and they're all still taking into consideration sort of logical problems that have to happen from like a modern perspective websites kind of sort of behave like this especially like high availability highly scalable web applications this is sort of something that you would see right you're going to have a load balancer that's going to distribute kind of how things come in you might have a hosted zone that's specific to like a geography or a particular sub domain the web servers will scale up and down and so they'll distribute horizontally in order to correspond and talk to an app tier and then your database kind of

at the bottom the couple of things that are here that weren't in the other one was this sort of caching tier that's that's this is a cache and in theory you can kind of think of this as a cache which you basically distribute static information or information that doesn't need to necessarily make what we call a round trip to the database and then back all the way up through the chain because it hasn't changed right if it hasn't changed there's no need for it to go all the way down and all the way back up and then the other thing that's here that's not in the other one is sort of this monitoring network which
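Editor's sketch: the "round trip" the caching tier avoids can be shown with a small cache-aside example. This is an illustration under assumptions, not how any particular site implements its cache: the `CachedStore` class, its dictionaries, and the `round_trips` counter are all hypothetical.

```python
class CachedStore:
    """Hypothetical cache tier sitting in front of a slower data tier."""

    def __init__(self, database):
        self.database = database  # dict standing in for the data tier
        self.cache = {}           # dict standing in for the cache tier
        self.round_trips = 0      # number of reads that reached the database

    def read(self, key):
        """Serve from the cache when possible; otherwise make the round trip."""
        if key in self.cache:
            return self.cache[key]
        self.round_trips += 1
        value = self.database[key]
        self.cache[key] = value
        return value

    def write(self, key, value):
        """Update the data tier and drop the stale cached copy."""
        self.database[key] = value
        self.cache.pop(key, None)
```

Reading the same unchanged key twice costs one round trip; a write invalidates the entry, so the next read goes back to the database. That behavior is what makes "content that just changed" a useful probe when trying to reach uncached data.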

says if i build a thing like this to make my cool awesome website i kind of want to know what's going on with it so i'm going to have alarms and have notifications tables etc and they're going to give me insights inside of it so even though you can't see them even though they may not be accessible this could be internally hosted this could be in a separate like whole zone you know that presumably it exists and that you can actually anticipate that those things exist right so uh that's it that's all i wanted to say kind of about some theory and then i figured the best way to kind of do this was maybe to pick on something and

take a look at it right um first i'll just like clarify your normal warnings is that this is not a pen test against wikipedia um we're just looking at source code we're gonna make assertions right we're not doing any testing of any sort no kittens were hurt in the making of this presentation right we're just we're just playing with it right we're just taking a look um the second is because it's not a pen test because i'm not actually breaking into it um this is in fact something that's like uh just we're picking our target for our own sake right theoretically when you're doing this for an engagement knowing like i said security is a

visibility problem i might have a set of priorities that i'd go after in this case i'm just decomposing it for the sake of exercise and learning so normally would be a little bit more directive than what we're going to do kind of here so with all that said right with all that said we already know a lot about wikipedia before we ever even look at anything else inside of it you know first there's the landing page but just in general problems we know that wikipedia has millions of articles right just millions of articles written in multiple languages distributed to multiple different languages meaning that the french version of wikipedia and the english version of wikipedia

are independently growing because there's independent articles on it so it's actually pretty safe to assume from a general theory working our way down that these are separate entities meaning like the english instance doesn't also contain the french instance or the dutch instance but they probably independently grow even though they might really legitimately all be on a similar if not the exact same code base for each one of these sort of units right that's the first thing we know the second thing we know is it's super highly trafficked right wikipedia is i think the ninth most popular website in the world um you could do basically anything on it except cite it academically while you're writing research papers

but it's got information uh you know out there and we know that it's it's got to solve some real problems like scale so we can take this general idea of like its size its frequency of use the geographic problems that it has and i'm just gonna make posits right and again i didn't cheat and like look at source code and look at all this stuff because again it's open source i didn't do any of that until sort of way later but you can actually assert very basic things like they probably have a huge investment in a caching infrastructure they probably geographically home people as they go through the website to content networks that actually make

their life a little bit easier which are going to determine how the server scale and what you're talking to you can start testing for that right and that's my point is that we started with knowing nothing about wikipedia other than that it exists and there's some concepts and we can just start with general problems that it has to solve right from the get go right from the beginning and then we can test it right and so uh we did we pulled the first page this is actually the response body for the previous page or the response headers at any rate that's interesting and what it proves basically right off the bat is that it's cached right it's served out

of this apache traffic server the 8.0.8 version it has this other caching mechanism this g509 where it hits here and then it basically knows that this is serving out of that the second thing it actually kind of reveals in the header is that it also has this sort of external logging endpoint that it actually pushes data out to in terms of network events or network errors how that comes into play we don't know but immediately we know it exists and we can start even just on this to build a mental map of how this thing works the other thing that it does is it throws a cookie here that tracks sort of when i came last and it's not

actually on this response simply because um it was actually the 304 not modified so it didn't reissue the cookie but i had already come to it and it gives you a geographic home so it shows my ip and a header and then it shows my ip and it captures it in a cookie and then it uses that cookie to make decisions right which is a lot right we learned on one request that my assertion about the cache is true we identified a technology of the cache we figured out how they're likely tracking it we don't know how yet to get to real servers right so when we think about caching as a problem we go okay well i want to get to data
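Editor's sketch: the kind of reading done on that single response can be written down as a small header summarizer. This is a sketch under assumptions: `Server` and `Age` are standard HTTP headers, `X-Cache` is a common but non-standard cache header, and `NEL` / `Report-To` are assumed here to be the sort of external error-reporting hint the speaker noticed. The function name and the sample values in the usage note are hypothetical and are not captured traffic.

```python
def summarize_cache_headers(headers):
    """Collect cache and logging clues from a dict of response headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    findings = {}
    if "server" in lowered:
        findings["server"] = lowered["server"]
    age = lowered.get("age")
    if age is not None and age.isdigit():
        # Age reports how long the response has been held by a cache.
        findings["seconds_in_cache"] = int(age)
    # Many caches report hit or miss in X-Cache; treat any "hit" as a cache hit.
    findings["cache_hit"] = "hit" in lowered.get("x-cache", "").lower()
    if "nel" in lowered or "report-to" in lowered:
        findings["external_error_reporting"] = True
    return findings
```

Called with invented headers such as `{"Server": "ATS/8.0.8", "Age": "1532", "X-Cache": "cache-a hit, cache-b hit/4", "NEL": "{}"}`, it reports the server banner, the time spent in cache, a cache hit, and the presence of error reporting. Each finding is a hypothesis to confirm with further requests rather than a conclusion.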

that's not cached i want to get to real data you can actually still take that exact same principle and you can kind of make questions like um you know what if i create new content can you cache that before it serves what if i'm requesting things that are unique to me how does it serve dynamic data and you could take those as sort of theorems if you will and say like these may not be served out of cache it's possible and you can start using them to create a testing plan or sort of an exercise to understand how it's actually making those compositions inside of it right so um you know we don't know if there's

other servers we don't know how to get to the real servers we don't know how geography affects where we land so if we click kind of on the english page again like i said assuming that this is probably an isolated instance we can kind of walk into what is is this sort of landing page here and this is interesting right because it's not an article per se right it's not it's not like a specific article which we would assert may not be changed super frequently um it gives me recent news it gives me today's articles that are featured stuff that's happening right now and so we could kind of guess right practically speaking that this

would be served from more of a dynamic fashion and we would be basically wrong right i'll say that right what is interesting about that request is that when you go to the en.wikipedia.org page this particular page it redirects you out and it serves you from a whole other server set that's not the apache traffic server we've gone somewhere else in this case it doesn't cache the request because it's just a 301 moved right which is most likely happening at the server level and then it adds on this request id which is tracked for logging right so if we're thinking about our first request that we know they have a logging infrastructure

we actually also kind of already know how they're doing the logging or at least some mechanisms of what they're tracking because that request id is probably a correlation id that they're going to be looking at errors against and they'd go back to if you have problems but one of the things that's interesting on this is that when you land on the page you might have noticed that i requested this main page here right this was the url and instead when you look at the view source it actually retrieved it from a whole other page this en.wikipedia.org with the w/index.php control and when you look at the two pages kind of side by side this

is not the same main pages two pages that serve the same way you can see that it's exactly the same traffic it's totally the same there's nothing sort of different here and we got to start asking questions about what's kind of going on so kind of back to theory um a guy named martin fowler created a book i want to say this came out in like the 2003 time frame called patterns of enterprise application architecture it's very related to some of the original pattern books that came out the design patterns by the gang of four group that says when you write code there's only a handful of ways in which code gets distributed and you solve problems whether it's

intentional or not intentional there's sort of these patterns that people follow as developers and even trends if you are a developer today you'll see that there's trends that modern developers do that maybe older developers or people who've been around a longer may not follow or have some variation against well so what he calls out here is that this index.php page is most likely being served as a thing called a front controller so a front controller basically means that when you go to that page this index.php this is going to operate as a routing mechanism to the actual functions which probably exist beneath it as a do edit or a do history type of call and so you're going to have

page right the action that you're going to call one of these folks and then the parameters that get passed into it basically the way that you would control variables of it we also know that because it's supporting this index.php piece that the second call is most likely some sort of regular expression or alias map to that call meaning that whether at the service layer excuse me the server layer or whether on an app tier layer there's some sort of like routing mechanism in between that says hey when you request something that looks like this what i really want you to do is i really want you to map it back to this index.php page and given the frequency in which it

comes up understanding a little bit about the history of php and its development and things of that sort you could presumably assert that the index.php page with these sub-actions are probably the core framework of mediawiki right this is most likely how the rest of the page is put together but i will call out that it's interesting because when you make the difference between my url rewriting and then my direct call through the sort of functional query method you now have expanded your scope of attack actually in two ways right because uh for folks who are not familiar there you know there's a standard here i think this is um rfc 1738 maybe that breaks

out how this kind of goes out you have scheme like what you're calling you can pass username the host port etc but kind of towards this tail end is where it gets interesting is that you have a path and you have a key value and a query and this path and this query do not behave the same way meaning if i wanted to call a function and pass into it a parameter that was path slash data i can't do that in the path right i can't say hey go get path slash data slash data because the uri is going to route entirely differently but in an index.php page i could actually pass like for instance the

title name of it back modified in ways that it wasn't anticipating simply because it now exists as a part of a query string and it's not a part of the path variable in the path parameter and that becomes even more complex as you start thinking about the problems that it has in sort of that deviation right if you're a person building a web application firewall and you have to map these variables out in order to check it for problems you now actually have two very different sort of ways to access a site that might represent two unique problems all just in what is fundamentally a really basic design sort of consideration right um but it

gets more interesting right as we kind of go through our page here we look at wikipedia is one of the things that pops up in the burp history here is you see another page called special download as pdf page equals page name so i saw that my main page and then this user page um actually shows up here the downloaded as a server side and when you go to that request and you you play it out you'll notice that it will give you both of the exact same pages that we had before so if you kind of go back a little bit and we look at this page you can see that this is now served another two different

ways now four ways to access the exact same page both as a pdf and as an html and when we map this out if we start playing again we've looked at the general theory we've looked at the clues we can actually put together almost a functional map that would say something like this is that when you go to this uri here there's some sort of rewriter here that has to have a regex function or an alias function in order to map it and correlate the request somewhere back here right and if we go directly to it we sort of bypass it meaning that this doesn't exist if we go here something even more

interesting happens which is number one the server is now making the request on my behalf not me i'm not making that second request to this page this happens as a part of this piece here so if you're testing for things like server side request forgery and you're looking for opportunities to go after it like you know i actually have identified a very reasonable route to start testing for confirmation to see how it would work and what it would do but more specifically kind of to it it means that there's probably also a pdf converter right and some way to convert the content that would come out of this to go through it and all of that now

becomes very testable right behind the scenes even though i have no idea what their implementation looks like i don't have to know what their implementation looks like i know the problem of i clicked here and it went and got this and somehow i got a pdf right and if that's all true then the existence of a pdf converter has to exist somewhere in that context right and we can start playing with it and you can start testing it and there've been some real vulnerabilities we've seen in web applications in the past you could use this for like server side request forgery does this call a web request does it bypass this function and maybe call it out of a database
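The functional map and the path-versus-query point made earlier can be sketched with the standard library. This is an illustrative sketch only: the host `wiki.example` is hypothetical, and the three URL shapes (including the spelling `Special:DownloadAsPdf`) are assumptions based on how the pages are named in the talk.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE = "https://wiki.example"  # hypothetical host, not a real target


def access_routes(title: str) -> dict:
    """List URL shapes that appear to lead to the same page content."""
    return {
        # Pretty URL that a rewrite or alias rule maps to the controller.
        "rewritten path": f"{BASE}/wiki/{quote(title)}",
        # Direct call to the front controller, title carried in the query.
        "front controller": f"{BASE}/w/index.php?{urlencode({'title': title})}",
        # Server-side export: the server fetches the page on our behalf.
        "pdf export": f"{BASE}/wiki/Special:DownloadAsPdf?{urlencode({'page': title})}",
    }


def how_server_sees(url: str):
    """Split a URL into its path segments and decoded query parameters."""
    parts = urlsplit(url)
    return parts.path.split("/")[1:], parse_qs(parts.query)


# A title containing a slash becomes extra path segments in the rewritten
# form, but stays a single value when it travels in the query string.
for name, url in access_routes("Path/Data").items():
    segments, query = how_server_sees(url)
    print(f"{name:18} segments={segments} query={query}")
```

The same input therefore reaches the application through different parsing rules depending on the route, which is why each route is worth testing separately.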

right again we don't know but those are things that you could start testing as you kind of go further and further into the decomposition of this and start kind of playing with it and going from there which is a lot right we've basically looked at what amounts to three or four requests and i can tell you the language of the website i could tell you its core request patterns i could tell you the names of some special functions i know that the index.php page is likely how everything serves for articles and so you know if you go to one article it's basically the same like going to all of the others unless they have special functions on them

which they were so convenient and nice to call special function do things special log here and so it kind of saves you a little bit of time in terms of figuring out what's what's unique um we've identified a super user account that guy blocked me because i was pretending to be from japan but whatever and you know we don't know things like what are all the other pages we don't know what other actions each page supports we don't know how the edge nodes match or understand the request pattern differences because there's now three or four ways to get it we don't know what their coding protections are but all of that is stuff we can actually test for and we can

create more questions and more questions and more questions based on this to start building out further our kind of mental map if you will um maybe specifically picking on the fact that we still haven't figured out how to serve cached content um what i'd say is i went backwards and i looked at anywhere the cache missed and when i went through i found another front controller called load.php which seems to be targeted specifically around the idea of dynamically loading javascript on the page which again makes sense right some of that could be cached it could be that this request in particular maybe isn't something people frequently call maybe it's something that has custom content in it

for me for the thing i was going to do and so it didn't serve it out of cache it served it from a miss here and you could see a miss and it says hey we bypassed the cache but because i'm here right on this oh sorry i went wrong button mw the 1364 what have you i know that that has the ability to serve content even if it's not being served out of cache so i actually know a fair bit now about their hosting environment more than they were sort of anticipating um kind of going back to our front page when we looked at this guy before one of the things that i come

sort of jump ahead a little bit and skip in in the context of this is that it actually gets served on a whole other framework that's different than every other part of the site so it actually comes out of this api rest v1 page summary component it's a change in technology right rest is representational state transfer for folks who aren't super familiar with it it's a very popular way or at least it was a very popular way to serve like an api data right back out using things like post and get and headers to control how are you getting the data out of this sort and so it means that you know i know frankly they probably

requested this through javascript i know that the down here at least according to them what functions it supports i know the spec that it's being rolled against and i also know down here that it's served on a whole different server this is not even on the same framework there's a whole other infrastructure component that it's going to give to me as we go through this right the other thing that came out that was uh fairly interesting and sort of very dynamic a little bit as we went through it is that they have this thing for a beacon right and the beacon in and of itself is not actually terribly interesting right again this is probably just a

one-way push that says hey you went and updated it and here's some data but it's yet another change in technology so it's not a front controller it's not a rest controller this is something that we actually would call a page controller which we'll talk about in a second but i'll call out just before we move on again um there's now another framework there's like a whole sort of other framework here that doesn't that didn't previously exist inside of it so now i know now i know another caching entity right um a page controller is uh not a front controller in the sense that typically the page is the action so if you're familiar with legacy

programming languages or you've seen in other places you might actually have the page named as an action like a do edit or a do update or do delete or do create and inside of these sort of statements of actions what typically happens in a page controller is that in and of itself it is its own sort of function its own ecosystem and meaning in this particular context that means there's really probably no other way to access what's going on in the beacon without actually going to the beacon page meaning it's not a part of a bigger scheme and you can actually imply that it's maybe an additional extension to wikipedia and you can assert that it's probably added after

the fact and it's something different right because it breaks the convention it breaks our patterns um they are likely to have some very specific behaviors meaning that the beacon is is not something that the rest of the site needs to care in some cases it'll inherit a base code and it shares it out but that's typically what you would see if someone designed nothing but page controllers so like if you had a whole bunch of uh common code that needed to be pushed up into the presentation here you'd create a base for it but in this case because it sort of exists in a very like almost aberrant sort of way to the overall picture um

we don't we don't know right it's probably just in and of itself its own piece so we've looked at it basically on five or six requests and in the five or six requests that we've taken a look at wikipedia i actually know a whole lot about the potential architecture i know that they have a ton of edge content that's served through the ats 8.0.8 system right we know that there's probably this caching tier underneath that that serves and it enumerates up and down numerically which if you looked at it you would see that it would grow and we know that there's a backing server behind it with the mw servers that go up and grow numerically as well
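The front controller and page controller patterns contrasted earlier can be sketched side by side. This is a toy model under stated assumptions: the handler names, the `/beacon` path, and the `action` parameter values are hypothetical and only illustrate the shape of each pattern, not how the real site is implemented.

```python
from typing import Callable, Dict

Params = Dict[str, str]


def do_view(params: Params) -> str:
    return f"viewing {params.get('title', '?')}"


def do_edit(params: Params) -> str:
    return f"editing {params.get('title', '?')}"


def do_history(params: Params) -> str:
    return f"history of {params.get('title', '?')}"


# Front controller: one entry point that routes to many actions.
ACTIONS: Dict[str, Callable[[Params], str]] = {
    "view": do_view,
    "edit": do_edit,
    "history": do_history,
}


def front_controller(params: Params) -> str:
    """Single page (think index.php) that picks a handler from 'action'."""
    handler = ACTIONS.get(params.get("action", "view"))
    return handler(params) if handler else "unknown action"


def beacon_page(params: Params) -> str:
    """Page controller: the page itself is the action, nothing routes into it."""
    return "recorded"


# Hypothetical URL-to-page table for the toy server.
PAGES: Dict[str, Callable[[Params], str]] = {
    "/w/index.php": front_controller,
    "/beacon": beacon_page,
}


def serve(path: str, params: Params) -> str:
    page = PAGES.get(path)
    return page(params) if page else "404"


print(serve("/w/index.php", {"title": "Main_Page", "action": "history"}))
print(serve("/beacon", {"action": "edit"}))
```

In the toy model the `action` parameter changes behavior only behind the front controller; the page controller ignores it, which is the kind of difference that shows up when probing a real site.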

we know that that's probably our primary code body which is the front controllers and then we also know that kind of off to the side right there's an interaction between these pages here in the front controllers and the sort of restful service this restful api which also enumerates you know up and down numerically and then we also know that there's a varnish server or set of varnish servers for this beacon and with all of that you can actually do do a lot right now i'm sorry and the last thing we know is this sort of logging framework over here we didn't actually get into how it logs we didn't get in all the conventions and all the pieces of every

kind of thing inside of it but we know it exists we know that it's a possible target we know that it's capturing information and transferring it there but we know we know of its existence and so we could actually like i said continue this process of decomposition until we have a fairly well composed picture of how their infrastructure looks just based off of really a handful of requests and a little bit of process inside of that for folks who are maybe not as sort of key versed if you will in the context of how to do some of this yourself i would call out and just say like look um there are some tools that support

that you have a thing called wappalyzer it's not necessarily the best thing on the planet in the sense that we went to the same page and all it identified was uh that it is mediawiki which we you know we already knew has a cart functionality simply because they accept donations and then we know sort of the jquery and we know how the ui sort of works in the context of this right we know some of these decisions here in our cases those things don't necessarily have a super strong implication in terms of how the website works but it gives you that right off of it and that's off of one request whereas we took a look at three or four

and we were able to generate almost like half of a network map in just a question of a couple of minutes right it's just it wasn't necessarily even that sort of big of a stretch on it so that's really the kind of the gist of the reverse engineering process in the context of wikipedia if we were going to go down the route of turning this into something from a web exploitation perspective or something that we would do to go kind of create attacks you could actually take this and go even further very topically or very specifically in it right so server side request forgery in the pdf download functionality it's making a request on our behalf

which is very interesting we don't know whether it's an http request it could be served out of a database meaning that in some models what you'll see is that the source code actually gets shoved into the database and then it's interpreted on the server in real time and then you know pushed back up into it but we don't know that we could test for it we could kind of continue to decompose it but we actually have a pretty good starting point for where we might want to go for attacks that look like it we also know that if you're kind of looking at cache poisoning and other http like splitting based attacks um there's

probably a lot of ways to bypass it you could actually explore inside of it i know the server infrastructure inside of it to make the requests i know different aspects of how the server infrastructure behaves we were able to see relationships between all of the servers and how they kind of communicate in terms of where topically they live and what problems they're solving we know that other functions are interesting so if you go back kind of to this this slide here you'll see that if we take the assertion that this is my functional map here meaning these are my pages the title equals what have you everything else kind of shows up here as a parameter so if you

like a good pen tester or a bug bounty person what you might want to do is look notice that like this return to actually exists on a lot of different pages and so does some of these other ones well how much of them i don't know but you could actually make a parameter map based on the way that it exposes it and then re-request and iterate with a very high likelihood that some of these are cross-functional and a variety of the other pieces that are going to affect the behavior of your overall site and with it you could actually create almost like an exports list of the functions and how the parameters look and what it does
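The parameter map described here can be sketched by grouping query parameters by the page they were observed on. This is a minimal sketch: the URLs below are hypothetical stand-ins for entries in a proxy history, and `parameter_map` and `cross_functional` are illustrative helper names, not part of any tool mentioned in the talk.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit


def parameter_map(urls):
    """Map each query parameter name to the set of page titles it appeared with."""
    seen = defaultdict(set)
    for url in urls:
        query = parse_qs(urlsplit(url).query)
        title = query.get("title", ["(no title)"])[0]
        for name in query:
            if name != "title":
                seen[name].add(title)
    return dict(seen)


def cross_functional(pmap, min_pages=2):
    """Parameters seen on several pages are candidates for shared behavior."""
    return sorted(name for name, pages in pmap.items() if len(pages) >= min_pages)


# Hypothetical observations, as they might appear in a proxy history.
observed = [
    "https://wiki.example/w/index.php?title=Special:UserLogin&returnto=Main_Page",
    "https://wiki.example/w/index.php?title=Special:CreateAccount&returnto=Main_Page",
    "https://wiki.example/w/index.php?title=Main_Page&action=history",
]

pmap = parameter_map(observed)
for name, pages in sorted(pmap.items()):
    print(name, "->", sorted(pages))
print("worth replaying on other pages:", cross_functional(pmap))
```

Parameters that recur across pages are the ones worth re-requesting on pages where they were not observed, to see whether the behavior carries over.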

simply by taking what it's telling you and then continuing to kind of dig and dig and dig and dig and dig and dig right in the last thing you can kind of take away from this is that you know there's this logging service right and this logging service is possibly right if you're going to go after second order attacks you have to think about where the data is going and who would be used and with that is it possible to sort of like poison the log or poison like live chats is where you might see it in other cases but in this case we know that there's a logging service we know sort of kind of

how to talk to it in some regards and you could actually decompose even that by looking through your request processes trying to cause errors trying to cause network errors seeing how those are going to go back into it and how how it requests back out and what it does and then now we know where you might be we might know how to go after it and what we can do with it um and that's it that's the whole sort of concept of reverse engineering web applications kind of in a nutshell but we we still sort of learned a lot and i wanted to say like we've really only kind of scratched the surface in terms of things like

patterns because patterns are complex and we're talking about basically server distribution and how that goes out to the internet and how that works but in all reality you still need to learn about things like how does a browser work right how does a browser engine work in sort of rendering a dom in the context of the browser you have to learn the interplay between the render agent and the browser itself to get into some really advanced like cross-site scripting attacks and taking advantage of dom differences in terms of how they execute either at the rendering agent or in combination you have to learn about new design patterns right so martin fowler talks specifically about a

thing called a front controller page controller and mvc those were kind of like the three big ones at the time but since then flux has come out since then we have react which is basically a single page that operates as its own sort of functional operating sort of thing inside of itself and while it looks sort of like representationally if you look at burp suite in terms of network requests it's it's own page in and of itself and so like taking the time to go learning those patterns is in fact the only way to take advantage of a methodology like this right you can't you can't learn how to make decisions like an expert unless

you're already an expert that's the sort of statement but i can show you what sort of things you need to value and what sort of leverage points you have to kind of learn in order to take advantage of that as you continue to go down this road so that as you're decomposing a web application you can figure out kind of how it operates and what it does the scientific method still rules right you can't hide what you've done on a website even though in some cases you might want to right i was sort of joking that maybe wikipedia has a lot of baggage because it looks like a bunch of legacy code interspersed with a bunch of newer code

and with that creates unique opportunities but they can't they can't hide that it exists because it's in order for it to function it sort of tells you a lot about what what's kind of happening in that the other thing you could take away from it is that you've got standards and when you violate those standards you create ambiguity right so if you want to learn more about standard bodies and what they actually say that you're supposed to do and the sort of browser decisions and server decisions that go into it there's a book called the tangled web which i highly recommend everybody to kind of read that breaks out the different protocols and standards that kind of compose a browser

what a browser is trying to do and the ambiguity inside of it and how it's used and how that creates opportunity a phenomenal read which is kind of a lot of how this this sort of lets us do that testing and then last of course but not least is that when you increase the code patterns inside of an application or set of applications you actually increase the amount of opportunities a bad guy has to go after it and do attacks and so as we saw in that we have literally four ways to access the same content we don't know things like how is authority derived you know can i access potentially private pages because my authority and its impact of

authority is not passed along to the method body and so with that you can kind of continue to go layer after layer after layer and then make more assertions about the underlying piece because they've created new opportunities simply in some of the design decisions that they've made and how they how they exist and what they're doing and with that that's really actually uh yeah i kind of got through it a little faster than i was and just but i kind of that's that's the process that's how we do it and that's how you can use it to go now um i know kind of done with that but i'll offer one thing is like when you're reverse engineering for

attack you're typically not necessarily trying to understand the universe you're just trying to understand enough to operate inside of that world in order to make better decisions right so i know we kind of like did this sort of journey together where we looked at servers and request processes but like i said my approach to pen testing this sort of stuff has always been very opportunistic so like i would go after a site like wikipedia or any site really with the goal of watching the web server as my like primary target and so typically my decomposition focus is going to be very specific to those activities so i'm going to look for directory traversal the areas where directory

traversal might look any way i can get access to exposed source code or secrets i'm going to look for file upload based vulnerabilities specifically to see if i can get things onto the server or take over the server because again my goal is not to try to like read all the source code you know backwards through this reverse engineering exercise i'm going to reverse engineer only within the pocket of nature to understand what it is i'm attacking and get enough information necessary to leverage that against the server to use it for very advanced and very custom unique exploits to to that situation so um i guess with that uh you know again like i said i got through a little

faster than i was expecting a little four minutes um it gives us some time for q a so if you have questions i'll stick around and we could we could chat um if this is something that's interesting to you um i'm gonna be putting together like an actual training course to do a lot of these exercises you know chat with me and let's uh let's see if we get you in on that i'm trying to do like a pilot class where i'll probably do it for free the first time and get some feedback from everybody but again you know thanks for having me really excited to be back uh hello uh 47 for middletown uh exciting to

to see more people from connecticut and uh you know thanks for having me so that's it

thank you andrew that was great anybody have any questions at this point i'll leave the uh opportunity for questions open for another minute if anyone is an ask andrew looks like there is a question here that says what's next for applications even thinking we find mobile apps but for this question how do we these concepts today's apply to other current applications and future use of tech what is also for web services and next for testing you can use all of this for any of that right this is this is a process that is generally applicable for figuring out how a thing operates and sort of works inside of it and we use this often in decomposing mobile

applications in the same way where we don't necessarily have source code but we need but we do know how it operates and right with that you can kind of take the same idea of like you know relationships between code bodies relationships between data how the data comes in and you could still create this sort of pseudo-functional map and then work it backwards to do it whether it's a mobile application or a desktop application in fact what i'll say more specifically on it is there's a great book called practical malware analysis and in that they talk about complexity of malware analysis and some of the things you might do and one of those is to just not even try

to reverse engineer but to put it into an environment watch it behave and then look at all the outputs of how it behaves as a means to determine what it's probably doing right and depending on how far down that rabbit hole you want to go you can take that same approach in other applications and use that as a way to determine what it's behaving whether it's malware or whether it's a desktop application or a web application even you can actually use that to figure out a lot of the composition techniques inside of it too so same same same it's the same general applicable process mechanics are different but it's the same process and rest did change in popularity we

started in 2000 it was around for almost a decade it became really popular in 2008 um it was sort of the way to create services from like 2008 to 2010 people dropped soap um but since then i think people have now moved to like graphql and other services as maybe being a little bit more popular than um rest is today in terms of state behavior and state modification so cool sorry any other questions i just saw that one from you andre

awesome well i'll be hanging around and feel free to reach out to me either on twitter or after and i'll be i'll be hanging around for a little bit
