
M. Scott Ford - Building a Bridge to a Legacy Application - How Hard Can that Be?

BSides Knoxville | 38:46 | 24 views | Published 2019-06
About this talk
Recorded at Knoxville's 5th annual BSides on May 3rd, 2019.

My team loves working on legacy code projects. It's all that we do. That's why a friend of mine reached out to us for some help. His startup was building out a universal API across a very fragmented industry with little to no interoperability or standards. Up until now, integrating with the systems in that industry had been pretty easy, because the companies that built them were willing to help. But now he'd found one that wasn't willing to help. There was no obvious API for getting data out of the legacy application so that it could be exposed via his company's API. A big client for his company was riding on his ability to be able to pull this off. He remembered how much I loved a challenge and how much my team loved legacy code, so he figured we were his best shot. The goal was to be able to read from the application's database.

In this talk, I'll cover:
* the different approaches that we took
* the one we really wanted to try because we thought it would be fun
* the approaches that we needed to try before we could attempt the fun one
* the excitement that we felt while working on it
* the grind toward completion once the big technical hurdle was crossed
* the sense of achievement when we got a read-only solution built
* the hope that we'd get the green light to start working on a read-write solution
* the disappointment when the plug got pulled and we weren't authorized to proceed any further

It was a fun journey.
Transcript

How's everybody doing? Good? Enjoying the conference so far? I know right about now, this time of day at a conference, is usually when I get really sleepy: two o'clock, after lunch, just saw a talk, sitting still too long. So if you nod off, that's okay, or if you need to get up and walk around a little bit to stay awake, I get it. There are a couple of questions I like to start every talk I give with. How many people in the room identify as being a software developer? All right, awesome. So of those of you, how many of you like working on something that you've inherited from somebody else?

I wasn't specific about whose it is. Okay, hmm, all right. So, a lot of caveats about how that work would go. I, for one, absolutely love working on those kinds of projects. My name is Scott Ford, and picking up where someone else left off is something I find a lot of fun, so much so that I decided to try to build a company around that idea. What we've built is a team of people who absolutely love working on older projects. We like to call this work mending, to contrast it with the idea of making. So where a lot of people

get really jazzed and excited about taking something that didn't exist and bringing it into life, we like to take things that already exist and make them even better, or take something that's old and breathe new life into it, or take two things that are old and smash them together to make something new, or take something new and something old and smash those together to make something new. And yeah, we don't build any systems from scratch. We add new features to things, we improve the way things work; that's what gets us excited. So today I'm going to tell a story about a time that my team was tasked

with building a bridge to a legacy application, and I'm going to talk about the challenges that we encountered along the way and some of the things we tried. Spoiler: it doesn't have the best ending, but I think it's still a story that we learned a lot from, and I think it can teach others a fair bit too. So the story starts with me getting a call from a friend of mine, and for the purpose of this story I'm just going to call him Jeff. His name is not Jeff, and in fact I'm going to be intentionally vague about a lot of details. I'm going to be vague about the industry that we're working in, and I'm not going to mention the name of Jeff's

company. This is in part because I'm not exactly sure of the details of the NDA that I signed and whether or not I can disclose those things, so just to be on the safe side I'll be very, very nonspecific. So Jeff's company was building an API across a very fragmented industry, and the idea was to take a landscape where there were tons and tons of vendors that had very inconsistent ways of providing communication with their platforms, and provide a very consistent way to communicate across all these vendors, so that anyone who was interested in building an extension on top of these vendor platforms could build an integration with Jeff's API, and then

that gives them the flexibility later to perhaps even switch vendors if they wanted to. So it's kind of a route for organizations who are locked into one particular vendor to potentially break that vendor lock-in eventually. We had done some work with Jeff's company in the past: we had helped him build out a test suite to verify that some of the integrations that he'd built were implemented correctly. There were tons and tons of vendors that they integrated with, and some of the vendors had different systems with different capabilities and qualifications, so there were lots of small little details for these integrations, and so

having one single test suite that they could run and say, okay, this integration is behaving the way the rest of the system will expect it to, was really valuable to him. So he knew from working with me in the past, and he knew from working with us on that project, that we really loved a challenge, and he was convinced that he had one for us. So before we get into some more details, a little more background on how his integrations typically worked. His API would talk to what his team called a connector, and one of these connector services would usually run on the same system where the vendor system is installed. Some of these vendor

systems would be Linux boxes or UNIX boxes, some of them would be Windows desktops, sometimes Windows servers: a very, very wide variety of technologies that they needed to talk with. So this connector would communicate in a known protocol and build the bridge between the API and the vendor's SDK. Most vendors had some form of SDK, and the vendors that they were looking to integrate with were usually pretty happy to provide that SDK, because Jeff's company's API was providing capabilities to their customers that they had been wanting to provide, and so they saw this as a benefit to both

companies. The vendor's customers were getting extra capabilities at their locations, Jeff's company was getting a little bit of revenue, and an API was being built for the vendor's platform without them really having to do anything. So most companies were pretty excited about it. But then Jeff encountered a really, really big client who was using a vendor that he hadn't integrated with before, and this vendor was really reluctant to share any details about the inner workings of their platform. They claimed there was no SDK, they claimed there was no way to integrate with it, they claimed that they were building their own API, and they were trying to stonewall

the client and say that they shouldn't reach out to Jeff's company and work with him. So this potential client that was courting Jeff's company wanted to see a proof of concept from both organizations. They told the vendor, you build a proof of concept, and Jeff's company, you build a proof of concept. And so Jeff hired us to try to tackle the problem of somehow getting the data out of this vendor system in order to be able to integrate with it. Through Jeff's team's investigations, they discovered some details about the vendor's platform. There were several components to it. There was an on-premises client

UI that was in a customer-facing setting, so in a retail location, or where somebody would come to the door and provide some kind of input on the system. That talked to a centralized service that stored its data somewhere, we weren't entirely sure where. And then there was an admin UI that would allow administrators to set up the look and feel of the client UI, set up different configuration options, things of that sort. It ran on a Windows 7 desktop, which I thought was an interesting choice, but it was evidently unable to run on newer versions of Windows, which

kind of gave us a hint at how old the technology was. And again, there were some details in the documentation that there was a database, but there was no information on who the database vendor was. Was there even a database vendor? Was it flat files? What was the storage mechanism? I really had no idea. So we came up with a couple of ideas for how we could pretty much attack the system, because we were trying to get at the information that it's hiding below its user interface. So one idea we had is, well, perhaps there's

an undocumented SDK that's already installed on the system. The client UI is communicating with a server somehow; perhaps we can find that mechanism, take advantage of it, and become a client ourselves. Another idea was to try to find libraries that are doing the database communication and go at the data in the database directly, so bypass any normal application communication and just target the database directly. Another idea we had was to actually screen scrape the UI, and I saw some gags in the audience. Yeah, that was kind of our reaction. But if we needed to, we could certainly try that.

So that was one option. The other one, the one that we were really excited about and really wanted to try, was to reverse engineer the database file format. We were getting really excited about it; people on my team were fighting over who was going to work on the project, because so many of us really wanted to dig in and tackle this challenge. So we did a little bit of research. We wanted to figure out ways to quickly eliminate options, and so while we were doing our research we dug in and took a peek at the screen scraping possibility, and we noticed that there were some severe limitations in the way the UI was built that were going

to really cause us some problems. The admin UI was built on top of Win32, so we could have used tools like Win32 API messages to get at all the details we would need that are on the screen. We could have remotely controlled the application, gotten values out of text boxes, gotten values out of labels on the screen. It could have worked, but it likely would have been pretty difficult. When we looked at the client UI, what we noticed is that a lot of the information on the screen was actually drawn there directly, so what we would have had to do instead is

somehow capture bitmaps and then OCR the contents, and that's where we decided that's not going to work. So we scrapped that idea pretty quickly. That left us with three options, one of which we really wanted to do and thought we would be able to do, and one that would work if we could actually crack it. But it's also the one that's probably the hardest, and we'd feel really, really stupid and really silly if, along the way of reverse engineering the database, we discovered that there was an easier way to get at the data. So we decided to attack these problems in order. So first

we would hunt for an undocumented SDK, then hunt for libraries that would let us communicate with the database, and then if those two failed we would actually dive in and try to reverse-engineer the database file format. They told us there was no SDK, documented or otherwise, and so we thought perhaps it's not something that they've documented for external consumption, but perhaps it's something they use internally, and if we could find breadcrumbs that it's there, maybe we could figure out a way to access it. So that was actually the first thing we tackled: well, maybe there is an SDK hidden somewhere

here. So we did a quick hunt for it. We skimmed through the directory structure where the application was installed and just looked at file names, and tried to see if there's anything in here that looks like it might be named like what we would expect an SDK to be named. We did find evidence that this is a really, really old application, and granted, this story is only about two years old. Does anybody recognize that icon? Yeah. If I remember correctly, that's from forever ago. That's Visual C++; around version 2 or 3, that was the default executable icon. Most of the executables in that

directory had this icon, so we were working with something that was late 90s, early 2000s, at least in terms of when the project was started. So definitely an indication of how old the system was. We used a utility on Windows called dumpbin, which used to come with the Win32 SDK; I think it ships with the Windows Platform SDK if you were to go try to find it now. What this does is it will take a DLL or an executable and it will just dump a whole bunch of information: any function names, any other symbols that it happens to find, any strings, type names, things of

that sort. There were a lot of DLLs, so there was a lot to sift through. It produced really long text documents, probably upwards of a thousand lines per DLL, that we had to skim through, and a lot of it was gibberish, because it looked like there was some obfuscation going on that they were taking advantage of. So a lot of the function names that we could see were just gibberish. We could tell it was a function, we could tell it took three parameters, and that's about all we could tell. So not a whole lot of help there. So then our next

question was, well, that's okay, let's go ahead and try to get at the data directly, and maybe we'll get lucky and they're using an off-the-shelf database that we can find external libraries for. If that was the case, we thought we would find some database access DLLs, or some indication of a known file format, or something like that. So we looked for evidence of a commercially available database. You name it, we looked: dBase, FoxPro, Access, Postgres, MySQL, SQLite, everything we could think of. We did a really exhaustive search for any evidence of any of those file formats, and we saw nothing.
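The talk does not say how that search was carried out. One way to look for evidence of a file-based off-the-shelf database is to compare the first bytes of every file in the install directory against well-known header signatures. The sketch below is an illustration only: the `identify` and `scan` helpers and the signature table are assumptions, not the team's tooling. It covers SQLite and Microsoft Access, which have fixed header strings; dBase and FoxPro files have no such string, and server databases such as Postgres and MySQL would show up as client libraries rather than data files.

```python
from pathlib import Path

# Header signatures as (offset, magic bytes). Only formats with a fixed,
# documented header string are listed here.
SIGNATURES = {
    "SQLite 3": (0, b"SQLite format 3\x00"),
    "Microsoft Access (Jet)": (4, b"Standard Jet DB"),
    "Microsoft Access (ACE)": (4, b"Standard ACE DB"),
}


def identify(header: bytes):
    """Return the name of the first matching format, or None."""
    for name, (offset, magic) in SIGNATURES.items():
        if header[offset:offset + len(magic)] == magic:
            return name
    return None


def scan(root):
    """Map each file under root whose header matches a known format."""
    hits = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            with path.open("rb") as handle:
                fmt = identify(handle.read(64))
            if fmt is not None:
                hits[str(path)] = fmt
    return hits
```

Calling `scan` on an install directory returns an empty dictionary when nothing matches, which is the outcome the talk describes.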

So then we went back through the dumpbin output and we looked again, this time not looking for functions that would give us some indication of how to get the particular data elements that we were looking to get, but instead functions that would let us read from a table, or execute a SQL query, or see a data value that's in a column, anything that might be data related. We didn't find anything. We did find a few .NET DLLs that were there. We decompiled them, and they were also obfuscated. They

were obfuscated to the point that they crashed the reverse engineering program that we were using, and we tried several of those as well, and it crashed most of them. So it was a very good obfuscator; this company definitely invested well in trying to make sure that people couldn't do what we were trying to do. So, nothing obvious found. So we got really excited, because we got to do what we really hoped we'd get to do, which is crack open some binary files and start reverse engineering the data that's inside them. But first we had to find those files. So the tactic that

we chose for this was to open up the admin user interface, make some edits, close the admin user interface, and then just look for something in the file system with a date/time stamp that was very recent. And we got lucky: we found one file. I don't remember the exact extension, but it was something really generic, like .bin. I was worried there were going to be thousands of files, but it turned out it was only one, which is good. So then we turned to my favorite hex editor, which is called Synalyze It! It's Mac only, but there is

a Windows and Linux version called Hexinator. If you've ever worked with it, it is really neat. One of the things you can do with it is investigate unknown file formats, and as you're investigating them you can build a grammar that defines the file format, and it's able to generate code for reading that grammar in some languages. I think it can generate C code for you, and I think there are plugins that will generate reading code for some other languages as well. You can map out the structure of the actual raw binary file format, and it has color highlighting, which is a little hard to see in this picture, but up here

there's some pink. The grammar kind of defines the color highlighting. This screenshot is one that I took; this is just a random dBase database file that I found on the Internet, and dBase is one of the built-in grammars that Synalyze It! has support for. You can open up a JPEG file and there's a grammar defined for that. So it's a really handy tool, built especially for this purpose. Another thing we're able to do is we could select any blocks inside of the... An AirPlay passcode? No, that is not part of the presentation. You have a really good straight face, by the way. You had me

there. How do I switch back to... okay.

That's definitely not the one; it says Apple TV, right? So whoever was trying to stream earlier, you're disappointed. Okay, we're back. I don't know, maybe I stood still for too long. Thank you. Another thing that's really nice about Synalyze It! is you can highlight a block of data and tell it to attempt to parse that data in tons of different formats. You could say, interpret this as a float for me, interpret this as an integer for me, interpret this as a UTF-8 string or a UCS-2 string. So there are lots of different ways that you can look at different pieces of data and try to

get a sense of what it is. It also has good support for different byte and word ordering, so you can see big endian versus little endian for byte ordering. So definitely a really good, well-thought-out tool for reverse engineering a file format. And Hexinator is built by the same company, both tools, and Hexinator is billed as a freemium version of Synalyze It!, so the Mac users get to pay for it, but the Windows and Linux people get it for free. So we have the database open in a hex editor, we hunt around for strings, and we found a list of what looks like table

names. We found that one of the tables was named B-tree. Anybody here with a computer science background? Okay. So a B-tree is a way to build a self-balancing search-tree index; it's what most modern databases use to build an index. So that told me that not only are we looking at a custom database, but we're looking at a team that built their own index for a custom database engine. And this database system was likely built sometime in the 90s, and between then and 2017 the team never once decided to actually

replace it with something off the shelf and more modern. So it was definitely fascinating from that perspective, almost like an archaeological perspective. While we were looking at the data in Synalyze It!, we were also looking for patterns. When we found the area with the database table names, we found that the next six or so bytes were either blank or had some numbers in them, until we got to the next name. And when we looked at those bytes, we discovered that some of them were offsets to other locations in the file where we'd find other information about that table. So one particular value was an

offset to where we would find column names, and next to each column was some metadata about the type that the column contained and how many bytes wide it was, so if it was a string or a char, how much space it was going to take up. Then also, back in the table metadata, there was another value that we discovered was also a location in the file, and that's where the actual row data started. So by combining the metadata about the table, the columns, and the data types, we were able to actually start to piece together what data is stored here: are these integers, or are these floats? And then to validate our assumptions, we

kind of ground through it, and for every piece of information that we wanted to get out, we went to the UI, we made a change, and we did a before-and-after diff. Synalyze It! has a diff feature where you can get a before-and-after view. So we were able to save a copy of the database, make our change, do a before-and-after diff, and validate: yes, those four bytes really are what we think they are, and yes, they did change from 15 to 16, we're good. And we could do other tests to validate that the byte ordering was what we expected, by picking values that are big enough to actually make the byte

ordering matter. So there were lots of assumptions we had to validate, and we marked all our assumptions in Synalyze It!, and we repeated that over and over again for every field that we needed to get out. This took weeks. This was the big bulk of our investigative effort, but we really felt like we were making progress. And to simplify things, we really only focused on the data we needed. We could have tried to map out this entire binary file and produced this really awesome documentation for everything that was in there, but at the end of the day we didn't really know if everything in

there was being used by the application. There's a good chance, given how old this application was, that a lot of the information in this database file is just dead, right? Perhaps it will always be 0, or even if we did find a column, perhaps it's never updated, things of that sort. These are things I see all the time in more modern applications with more modern databases: you'll have people leave tables and columns around just because it's easier to do that than it is to delete them. So I had to assume that something similar was taking place in this application. So the next task was, now that we know a lot

about the database, and we've collected enough information through reverse engineering, changing the field values, finding out where things are, we now have enough information to actually start building out a connector. And one of the things we were instructed to do for this demo was that it only needed to be read-only, so we didn't have to worry about writing back to the database at the same time that the system is trying to write to it. We did start to come up with some back-of-the-napkin ideas for how to go about doing that, like moving the file off to

another location, making our writes to it there, looking to see if it had changed, and swapping it back really quickly, all risky stuff, or trying to kill it, or trying to put an OS-level lock on the file so that we were the only ones who had write access to it. There were a lot of different ideas that we had played around with for trying to make sure that we wouldn't corrupt the data if we did try to write to it, because that was definitely a concern that we had. So what we ended up with is now we have a connector, and we've basically built our own custom SDK, and

so now we're ready to go demo to the client. The demo went something like this. I wasn't at the demo, so what I'm hearing was all second hand, but the way it was told to me is that the CTO of the vendor was present at the demo, and they were really alarmed that we were successful, because it turns out there was an SDK, and when they found out that we had plans for read-write support, they decided to hand it over. So my team was like, man, because we were really excited about this. We were having a lot of fun, and we were looking forward to the read-write version.

Initially there was a lot of disappointment, so I tried to spin things around and say, okay, not so fast, we learned a lot through this process. For one, we thought we could do something like this, but now we know we can do something like this. It was a lot of fun; hopefully how much fun I had is being conveyed. I know a lot of you are cringing at the thought of some of what we put ourselves through, but we really enjoy this kind of work. And one thing that I think is a really good positive is that because we did what we

did, it ended up with the best solution for everybody, right? The safest way to integrate with this system is through an SDK that's approved by the vendor; that's the absolute safest way to go about doing it. So even if there was perhaps some really ugly business-to-business politics taking place, or maybe even interdepartmental politics at the vendor level, who knows, for some reason the CTO was withholding an SDK and then provided one, and that's for the best, because the read-write implementation would have been incredibly risky for us to implement. We would have tried it, we would have found lots of bugs, we

would have done our best to make it work, but it would have been expensive for Jeff's company, it would have started to get into the realm of frustrating and discouraging for us, and it likely would have drawn the attention of the vendor in a very negative way, because we would have been corrupting their data. That's to say, the probability was incredibly high that we would actually corrupt data once we tried to do the read-write implementation. So, as I almost always do when I give a talk, no matter how hard I try, I tend to go very quickly, but I do love facilitating

discussion, so if any of you have any questions, I'd love to hear them. Yeah. So Jeff's company was, I think, something like 15 to 20 engineers. Most of them were focused on building and maintaining the API proper, with a few people who were focused on owning each individual vendor integration. So they'd have small two-, maybe three-person teams focused on a particular vendor, but the bulk of the team was focused on making the API platform work, selling access to the platform, things of that sort.

Exactly. So Jeff's company has customers, right, and their customers have this vendor software installed at their location, and they're looking for a capability out of it that they're not able to get. And so that's where Jeff's company comes in and says, hey, we can do this for you. You want to build out some cool feature that involves cell phones, and your vendor system was built before cell phones were in existence, or before the idea of a smartphone was even thought of, and so, hey, we can be a shim and provide you that capability. So that was

kind of the sales pitch. Yeah, I don't know. I wish I knew why they wouldn't give us the SDK. I can conjecture that maybe it was a case where the marketing or sales department at this vendor was really excited about being able to gain this new capability without necessarily having to spend any money to get it, but then maybe the IT organization felt threatened by that, right? But again, that's just speculation; I don't know. I've seen similar things play out at other companies, where one

department kind of feels under threat from an outside organization, and when your livelihood is threatened you sometimes make some hard decisions. But yeah, not sure. Looking back, I'm not sure what we could have done differently to find the SDK. Yeah, yes, we could certainly say we've done this once before: if your IT organization is withholding something, we'll get at the information, we can get at it. So

you mean, still like 10% of the database tables, columns, things like that? Yeah, it was probably around 20% that we ended up actually using. A lot of what was in the database was configuration information for the way the client UI was displayed and things of that sort, or different minimum and maximum values to enforce different business rules. So those are all details that were definitely important to the vendor, but for our integration they weren't as big an issue.
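Mapping only the fields that were needed relied on the before-and-after diff described earlier in the talk: snapshot the file, change one value in the UI, diff, then decode the changed bytes. The sketch below illustrates that loop in Python under stated assumptions. The snapshots, the field position (`FIELD_OFFSET`), the header bytes, and the helper names are all made up for illustration; the vendor's real layout is not given in the talk, and the team used the diff feature in Synalyze It! rather than code like this.

```python
import struct


def changed_ranges(before: bytes, after: bytes):
    """Return (offset, old_bytes, new_bytes) for each contiguous differing run."""
    if len(before) != len(after):
        raise ValueError("snapshots differ in length; a plain byte diff does not apply")
    runs, start = [], None
    for i, (a, b) in enumerate(zip(before, after)):
        if a != b and start is None:
            start = i
        elif a == b and start is not None:
            runs.append((start, before[start:i], after[start:i]))
            start = None
    if start is not None:
        runs.append((start, before[start:], after[start:]))
    return runs


def read_u32(data: bytes, offset: int, little_endian: bool = True) -> int:
    """Decode four bytes at offset as an unsigned integer in the given byte order."""
    fmt = "<I" if little_endian else ">I"
    return struct.unpack_from(fmt, data, offset)[0]


# Hypothetical snapshots: an 8-byte header, one 4-byte field, then a tail.
FIELD_OFFSET = 8
HEADER = b"HDR\x00\x00\x00\x00\x00"
before = HEADER + struct.pack("<I", 15) + b"tail"
after = HEADER + struct.pack("<I", 16) + b"tail"          # value edited 15 -> 16
after_big = HEADER + struct.pack("<I", 70000) + b"tail"   # value large enough to span bytes

for offset, old, new in changed_ranges(before, after_big):
    print(offset, old.hex(), new.hex())
print(read_u32(after_big, FIELD_OFFSET, little_endian=True))
print(read_u32(after_big, FIELD_OFFSET, little_endian=False))
```

A small edit such as 15 to 16 touches a single byte, so it locates the field but says little about its width. Setting a value that spans several bytes makes only one byte order decode back to the number typed into the UI, which is the point of picking values big enough for the ordering to matter.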

Yeah, so the landscape for these vendors, and this is the reason why I'm being very nonspecific about the industry, but in this specific industry it's hardware and software that's installed at a physical location, and it's often for large chains that have thousands or sometimes hundreds of thousands of locations. So switching vendors is incredibly costly, and the vendors know that. Yeah, exactly. But if it was a coffee shop, right, the switching cost for them is a lot less than it is at, say, the scale of a

Walmart or something like that. So, yeah.

I think one of the lessons is that it's going to be a lot of work, and it's often worth it. I think even if we had only had to stop at read-only, that would've been a read-only capability that didn't exist for that system. So I think it's one of those things where, if the system is still providing value, and if it's been around for 50 years and nobody's decided to turn it off, it's still providing value. So it's one of those things where I think instead of trying to just

outright replace it, embrace it, and figure out a way to modernize it at the edges. Yeah, right. So I think the proliferation of lots of small devices is going to mean that some of these older systems with... yeah, exactly. Any other questions or discussion points? Yeah? Oh, I love this audience, because I didn't notice that. Yeah, no, I type Dvorak, and that's not Dvorak. Yeah, I don't know. I leave that as an exercise for the audience. Maybe this is a different typewriter vendor who really likes Caesar, or maybe Caesar is the name of the typewriter, who knows. Yeah, well, we'll

post this and see if anybody on the internet knows the origins of this layout. Somebody's already found the layout? It's the layout of... Caesar, like the Turkish... yeah? Portuguese? Okay, there you go. Right here, it's got a little page that says... oh, okay, there you go. Okay, I see. This is an awesome audience. If I give this talk again I'm leaving it in there, and if people don't notice I'm going to call them out on it. Yes, great. So if stories of this sort fascinate you at all, I encourage you to check out a community that we manage called Legacy Code Rocks. We have a podcast of

the same name, Legacy Code Rocks. The other engineer on my team who worked on this project with me, we interviewed her on one of our early episodes, so if you want to hear how excited and happy she was to work on it... I think she went through like 90% of the grind, or 100% of the grind, so she's the one who really deserves all the credit for this. So .rocks is a top-level domain, yeah. So if you go there, we have a Slack team that you can join to chat with other people who enjoy working on legacy code. I guess people here could go heckle them.

I don't know, I didn't see a lot of takers. We also have a weekly mastermind where we talk and commiserate about working with older systems. So yeah, I appreciate everybody's time, and thanks for the great questions and for noticing the keyboard layout. That's awesome. Thank you. [Applause]
