
How Quality Engineering Transforms Application Security by Alexander Beaver at BSides Toronto 2023

BSides Toronto · 21:35 · 67 views · Published 2023-11
About this talk
Presented Oct 21 2023. Are you tired of constantly patching vulnerabilities in production systems? Would you prefer proactive security mitigation to reactive response? This presentation explores the lessons I learned building a culture of quality software engineering, and how that culture can mitigate vulnerabilities before they are ever written. The lessons discussed can help your organization break the inertia of endless patching, and instead benefit from consistent, meaningful improvement.
We will begin by exploring what quality means, and how it can be applied to a software world. We define quality as a physical representation of a culture that takes pride in its product. With software, if a team focuses on quality first, it can lead to a sustainable culture of secure software development. When a team writes code, they should invest in doing it once and doing it right.
We will then explore the lessons I learned building a culture of quality, and how they can be applied to application security. The approach is based on the “Toyota Way” management philosophy, but has been adapted to be more relevant to the audience. We will discuss what a culture of quality means, and how it looks in the world of software. This includes encouraging individual developers to take the time they need to build high-quality code, and driving them to want to develop high-quality software. We will also discuss how investing in quality can lead to efficiency and productivity gains in the long term.
We will then discuss how security teams can standardize technologies and make them easy for developers to use. When the easiest path for implementation is also secure, engineers are more likely to use that path rather than finding alternatives. Teams should use libraries to standardize and abstract high-risk operations to reduce the number of low-quality implementations. These may include ORMs, AAA libraries, input validators, and more.
Lastly, we will discuss continual improvement, and how time can be leveraged to make lasting change. This includes security sprints, which focus on making small changes to developer behavior. As the developers adapt to the workflow changes, more can be introduced. Continual improvement takes SDLC policies and slowly integrates them into the developer workflow. This leads to more lasting change than a quick transition to SDLC policies, which may overwhelm the developers and lead to misunderstandings. We will also discuss the importance of maintaining developer relations and not being too heavy-handed, which could destroy trust and lead to hidden risks.
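
As a rough illustration of the abstract's point about making the easiest path the secure one, here is a minimal Python sketch of a shared lookup helper. It is not from the talk: the `find_user` helper, the `users` table, and the validation rule are all hypothetical, and the standard-library `sqlite3` module stands in for whatever ORM or database layer a team has standardized on.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, username: str):
    """Hypothetical team-standard helper: validate input, then query with a bound parameter."""
    # Input validation lives in one place instead of in every caller.
    if not username.isalnum() or len(username) > 32:
        raise ValueError("username must be 1-32 alphanumeric characters")
    # The "?" placeholder lets the driver bind the value, so it is never
    # spliced into the SQL text.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (username,))
    return cur.fetchone()


# Illustrative in-memory database with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

print(find_user(conn, "alice"))  # (1, 'alice')
```

A developer calling `find_user` gets validation and parameter binding without having to think about either, which is the sense in which the easy path is also the secure one.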

Hello everyone, my name is Alex Beaver. I'm a computing security student at the Rochester Institute of Technology and a San Francisco native. Today I'm going to talk about how quality engineering can be used to transform your application security. I want to begin by talking about my freshman year of high school. Somehow I ended up as the software lead for my robotics team. I don't know how it happened, but oh well. My team somehow managed to qualify for the World Championships. Wonderful, no pressure. But when we got to the World Championships, in our first five matches we won three and lost two. On paper this should be great for a team that barely managed to qualify, but there was a real problem: every single

match, our robot was acting unpredictably. We had to disable core functionality because we could not understand how it was going to work on the field, and it would not work in a way that we could predict. This was a factor of the fact that we were moving fast. See, I'm from Menlo Park, the home of Facebook, famous for moving fast and breaking things. In robotics you have to move fast: you have six weeks from when you get your challenge until you have to put your robot in a bag and you can't touch it again. That means you're measuring your development time not in days or weeks but instead in hours, and

especially as software people, we get the very last chunk. Usually we got 30 to 45 minutes on the robot before we couldn't touch it again. So if you're moving fast, you're breaking things, but the problem is that we got stuck in a cycle of breaking things. This was not just a problem match to match, or even competition to competition: every single season for almost 20 years, we had been stuck in this cycle. We would be fixing our code, we'd spend so much time doing it that any new features we had would be rushed, and then we'd spend even more time trying to fix them. This led me to ask the question: do you

have to move fast in order to break things? Well, we started by looking at it as a software problem. See, we are running software, so why not look for software solutions? The way that we deal with this in the software world and the security world is with policies, standards, and best practices, more commonly known as a secure development life cycle, or an SDLC. If you've had the unfortunate event of having to look at one of these before, these are documents a mile long that tell you exactly how you are going to write code. The problem is these documents don't stand up in the real world. You see, they were designed for an ideal scenario, but when you are working

in the real world, you have external constraints. You have deadlines, you have cost budgets, you have a whole bunch of things, and there's a big disconnect between what you have in your policy and what's happening in the real world. If you're lucky, you may have that one file that you developed all your policies around, and then everything else is just completely unrelated. This is a real problem across the board, not just in high school robotics. Of my friends who work at major companies, almost every single one of them has said that what the SDLC says and how software actually runs are completely separate. So instead, rather than looking at it as a software problem, we looked at it as

an engineering problem. See, when you're dealing with hard products, the stakes are a lot higher. For as much fun as we have when we have to issue a patch, we should be really glad that we aren't working in recalls at Hyundai or Kia right now. We looked across industries, aerospace, medical, but of particular interest to us was automotive engineering. See, Toyota is by far the most reliable company for making automobiles. They have a long history doing it and they have decades of experience. Toyota also has a lot of really good documentation. And so, after realizing that things needed to change, I spent weeks and weeks researching

Japanese manufacturing methodology and trying to figure out how we could apply it to software. My ultimate conclusion was that investing in quality engineering was the key to delivering better software faster, and by better I mean more reliable and more secure. There are three rules to quality engineering: first is investing with intention, second is engaging with your engineering teams, and third is transforming with time. Let's start with investing with intention. I use this word quality, but what does it actually mean? You may think of something like a car, a phone, a watch, something that is high quality, but what are you actually saying? Well, it's not just a physical characteristic. If you ask a mechanical engineer, they'll talk

about the manufacturing lines. If you talk to a supply chain engineer, they'll talk about how efficient their lean supply chain is. But all of those are symptoms of a cultural component. Specifically, it is a culture that understands what matters and is willing to invest in it. Now, this is a vague statement, and it leads to the question: what actually matters? Well, a lot of things matter. Delivering a product matters, efficiency matters, profitability matters, and yes, confidentiality, integrity, and availability are all things that matter. But when everything matters, nothing matters. And so we were put in this position, and we decided that reliability was by far the thing that mattered most. We were not going to lose a

single match due to reliability issues with our robot. What does this look like in the real world? Well, during our following season, when we started on this reliability trend, we realized three weeks into our six weeks that we were not going to meet our objectives. There were simply too many bugs, and we knew that there was a risk that we would lose a match because of our software. And so I made the decision to throw away our entire code base: tens of thousands of lines of code, years and years of development work, gone in an instant, because we knew that we were not going to achieve our goals if we did not try to take the

risk and make more reliable code. We completely rearchitected, building our new workflows and our policies as we were going along. As you can imagine, doing all of this in three weeks was quite an experience, but it ended up being hugely successful. The next step, after we decided that reliability was important, was engaging with our engineering teams. This was easy because I was both an engineer and responsible for this. But the thing is that adversaries don't care about your policies. You may have an SDLC that might lead to perfect software, but it doesn't matter what was in the policy; it matters what's in the real world. People get frustrated really easily, and the fact is there are so many

environmental constraints that are impossible to account for. Just think about if you're writing code and you have a deadline the next day. It's 2 a.m. and you just want to go home and see your kids. Are you really going to be asking about what page 73 of your SDLC says? Probably not. And this means that we took a supportive role. Rather than using our policies to dictate what our engineers were supposed to do, our policies were there to support our engineers in achieving their goals. I think of it a lot like a doctor. When you look at a primary care physician, they have a goal: they want you to be healthier. But they know that unless

you buy into it, they're not actually going to be successful. And so rather than creating the perfect plan that makes you as healthy as possible, they create a more realistic plan, one that may not get you to perfect health but one that people will go along with and that has a higher chance of success. The other component is that we made sure that everyone was willing to speak up. See, it's very easy for people to move problems on to the next person: "it's not about me." Especially when you're dealing with engineers and telling them that this matters, they may not feel welcome to speak up. We instead decided

that we needed to deal with issues when they were early and small. The later you find any issues, be they reliability or security related, the more expensive and the more time-consuming they become to fix. If you can identify them early, then you're going to spend a lot less energy and you're going to get back on track faster. There are two approaches, and we adopted both of them. The first one was inspired by Ford. Ford has a system called the stoplight system. During all of their meetings, people have three cards: red, yellow, or green. We adopted this. If our engineers were happy with the state of affairs, they'd hold up a

green card: no concerns. If they were concerned or they had a future risk, they'd hold up a yellow card, and we'd take a subset of our engineering resources and redirect it towards triaging and attacking that problem. And if there was a major concern, like the situation where we realized we weren't going to hit our goals, you would hold up a red card. In that situation all development would immediately halt, and all development resources would be redirected towards that main goal. This may seem like it's hugely inefficient, but when you start taking people and redirecting them towards a specific goal, you quickly realize that some people aren't needed. Those people can then go back and continue working on their existing work

while the rest of the people remain engaged in the triage process. But by taking people and redirecting their energy, you're able to triage a lot faster. The red string is another approach that we also took. This is based off of the Toyota production line. Every single person on the Toyota production floor has a red string that they can pull. When they pull that string, the whole production line shuts down instantly. This is because they want to identify production issues and quality issues before they grow out of control, and they trust that their factory workers will see if there is an issue and be willing to speak up. Now, it's scary for people to speak up,

and especially when they're worried about personal retribution, and so I started by personally raising concerns. When a person in authority raises concerns, people are more likely to follow through later, and I saw a lot more people willing to speak up once I had personally spoken up. The other thing is to deal with it in an honest and blameless way. We took a page out of Google's blameless postmortems to figure out how we were going to attack this. You don't blame individual people for actions that happened, but you also don't sugarcoat things. If you try to hide things that are systemic issues for the sake of protecting egos, you're not going to be

successful. And so by dealing with issues when they were early and small, and in a way that was honest and blameless, we were able to identify issues much faster and prevent them from spiraling out of control. Third is transforming with time. See, change is hard. If you ask people to make a major change to their behavior, it's not going to be successful. This is my biggest problem with how an SDLC is often adopted in the enterprise. You have a guide a mile long of what you're supposed to do, and if you're lucky you might have several hours of training to go along with it. But the problem is, when

people are given all the information all at once, they don't know where to focus. It also becomes very intimidating, and people are likely to not follow through. So how do we deal with this problem? We begin by taking smaller steps. When you have a big behavioral change that you want to have happen, it doesn't need to happen all at once. You can instead have people take small steps, and over time you'll get to that end goal. And so you can leverage time to your benefit to make sure that these changes are sustainable. The other thing that we did was shortening our focus. When people are thinking five years in the future,

they're not going to be as focused on the present. And so I introduced a system where engineers were just focused on the present, and I was thinking about the long term, guiding them towards a more successful future. I think of it a lot like a GPS. As I said, I'm from Rochester, and so that means that I took over three hours to drive here from Rochester. I did not sit down with a map and immediately plot out how I was going to navigate here. There are just simply too many turns, too many freeways. I can't handle that. And so what did I do? Well, I took out my phone and I had my phone calculate that for

me. Then my phone would say, "Hey, take a right turn here, join this freeway here," and I didn't have to think about the long term. I just trusted that my phone would get me there. We took a similar approach for how we approached our reliability. Every single time there was an issue that required a recompile, that data would be put in a database. I would then go through and analyze all the failures that had happened, and figure out where we were falling short and what we could do to change it. This would identify one or two small workflow changes that would then be implemented for a small period of time. They would be very heavily marketed, and

over a sprint our developers would make sure that they were adopting those workflow changes as part of their overall workflow. At the end I would collect the data again, analyze it, and show how those workflow changes impacted the work that they were doing. We would then repeat this cycle, and over time we were able to whittle away many of our systemic development issues. But the important thing here is impact. It's easy to think about things in security terms or in reliability terms or whatever. After all, that's the world that most of us are familiar with. And while your developers likely care about security, they have a lot more on their plate than just security, and so it's

important to articulate the results in accordance with what your developers care about. Rather than just talking about vulnerabilities mitigated, talk about how much less time was spent mitigating those vulnerabilities. Talk about what new features were able to be introduced because of the time that was spent on new feature development rather than fixing existing issues. Talk about how they were able to work less overtime because they were following these development procedures. When you can articulate results in ways that matter to them, they are a lot more likely to follow through with them in the future. So what did we do? We invested with intention. We understood what matters, and we made really difficult,

risky decisions because we knew that reliability was the most important thing. We then engaged with the engineering team, making sure that we were in a supportive role, helping them achieve their goals, and making sure that everyone had the freedom to speak up. And then we transformed with time. Rather than taking on big long-term changes, we shortened them down and allowed our developers to focus on the present. Now, where does that take us in the long term? Just as a disclaimer, this is based off of our specific organization's needs. Please don't just take these and implement them yourselves, because this first part is what led us to the second part. Our basic design philosophy was

"do it once, do it right." Anytime we wrote code, we made sure that it was production ready. See, it's easy when you're in a stressful situation to say that you're going to deal with the problems later: "Oh, it'll be fixed in version two." But whether it actually gets fixed in version two is a different question. And so no matter the constraints, no matter the environment, our developers were told that the most important thing was to write the code once and write it right. We did this by investing heavily in libraries, and by investing heavily in design and documentation to make sure that everything was done properly. We also focused heavily on data collection. Remember those sprints that I

talked about earlier? Well, our goal of course was to not disrupt our developers, and so we were in a situation where we needed to collect data, but I didn't want anyone having to spend any more time than was absolutely necessary. I knew that it took a minimum of three minutes to compile a blank program with nothing added to it. So what does that mean? Well, it means that I made the form 30 seconds long. That was more than enough time for developers to perform a root cause analysis, fill out the form, and still have time to relax. By shortening the form and not collecting all the data we needed, but parsing it out later, it gave our

developers the freedom to continue developing at the speed that they always had, with no disruption to their workflow. We also had non-compliance. We had both natural and accountable policies. Natural policies accounted for over 90% of our policies and procedures, especially those focusing on our development workflows. We looked at non-compliance as a policy issue rather than a personnel issue. Rather than asking "why didn't you follow this policy?", we asked the question "why was this policy not such that it would integrate into your workflow naturally?" And this means that we'd go through and refine our policies over time. Sure, they might not be the perfect policies at the end, but they were policies that were

able to withstand the external environments. What this meant specifically is that even in our most stressful situations, we had nearly full compliance with our policies within a year. We also had accountable policies. These are the things that can't change. These were the last 10%, where we had dictated workflows. In this situation we focused on preventative accountability. We wanted to make sure that people were forced to go through the workflows themselves rather than skimming over them, and also that there was a double check. Things as simple as having people sign their name at the bottom of a form, or requiring physical check-offs, or even having two people check off, meant that

we would have a lot fewer issues caused by people just skimming over a checklist and not actually paying attention. So where does this take us? As I said at the beginning, our goal was to have zero matches lost due to software failures, and I'm happy to say that in the following season we achieved that goal. There were no matches lost due to software issues. But not only that, we had zero major issues in the field, period, that were caused by software. All the issues that we detected were caused either by mechanical issues or by electronics issues, and we ended up taking this approach and expanding it to

our electronics team in the upcoming season. This was hugely successful for us, and I'm a strong believer that this is a model not just for robots or for high school, but for security in the real world. I encourage you to adopt a similar strategy, and I'm happy to answer any questions or talk to you more about

this. Right, so are there any questions for Alex? Yes. Thank you for the presentation. How large was your team, and how often did you use the red card? So, we only had to use the red card probably three times in that first season. There was one that was a severe issue that required a complete redevelopment, and then there were two other times that were smaller, not "completely throw away your code," but issues that needed to be addressed urgently. Our team was probably six or seven programmers. Nearly all of them were high school freshmen and sophomores, and so as you can imagine, trying to manage a team of freshmen and sophomores was quite the challenge. But I

think it stands for how successful this was, because people who would not be inclined to work in that way were able to do so. The first time it was me. I think the third time it was also me. The second time was someone else. Yeah.

We were agile. Given our short development times, we were agile with usually three-day sprints. And how did you like give you some more how you incorpor... Yeah, agile. So within agile, what would happen is every week we would collect the data, and then on Monday we would have our new workflow changes. On Monday there would be heavy marketing. In our team meeting I would say, "These are your new workflow changes that you're going to go through." We'd print out a whole bunch of stuff, and we'd make sure that everyone was aware of these workflow changes. It was on a separate sprint just because we're

moving so quickly. So we had a separate sprint cycle that would start on Monday, collect data through Sunday, and then on Monday it would be the new workflow changes as

well. Yep. Over the previous year, probably in the 60s or 70s. Almost every single match had a software issue on the field, so at least 60. I wouldn't be surprised if it was 100 plus, because it was just so unreliable, but it's hard to tell because I don't have that data anymore. Yes. So, for accountable policies, one of the things that we had was setting our robot up on the field. There's a specific way that you need to do it every single time. Admittedly, I was the biggest cause of non-compliance on the team, even though I was also writing the policies. What would happen is, I knew

what the checklist was, since I developed it, and so when I got on the field I'd say, "Oh, I know what the checklist is. I'm just going to run through it in my head. I'm not going to bother checking it off." Myself and our head of AI development were the biggest causes of non-compliance. So what it meant was checking off every individual checklist item. We had a separate form printed for every single match, and then our driver would have to go through and sign off that they saw that the checkboxes were checked off. Requiring that the checkboxes be checked off was a

way to make sure that it was actually followed rather than just skimmed over. Yeah, that's accountable. A lot of our natural policies were things like... a lot of it was code structure, so structuring our code in such a way that it'd be a lot faster. Things like figuring out how you diagram code, how you turn your code into pseudocode into actual code; requiring that, rather than just jumping into an IDE, you would start out by writing pseudocode and by designing your code on a whiteboard. Things like that. It's hard to say, because a lot of it was very flexible

depending on the developer, but for example we would require a written-out diagram, followed by pseudocode, followed by production code, for every single function written. Yeah. Okay, that's great, thank you. Wonderful, thank you. Thank you.
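
The recompile log and sprint review described in the talk could be sketched roughly as follows. This is only an illustration under assumptions: the talk does not describe the database or the form fields, so the `RecompileEntry` record, the root-cause categories, and the sample entries below are all made up. The idea shown is tallying the short-form entries for one sprint and surfacing the one or two most common causes as candidates for the next workflow change.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RecompileEntry:
    """One short-form submission (hypothetical fields, not from the talk)."""
    sprint: int
    root_cause: str  # category the developer picked while the code recompiled


def top_causes(entries, sprint, n=2):
    """Return the n most frequent root causes logged during one sprint."""
    counts = Counter(e.root_cause for e in entries if e.sprint == sprint)
    return counts.most_common(n)


# Made-up sample data for illustration only.
log = [
    RecompileEntry(1, "typo"),
    RecompileEntry(1, "wrong units"),
    RecompileEntry(1, "typo"),
    RecompileEntry(1, "untested branch"),
    RecompileEntry(1, "wrong units"),
    RecompileEntry(1, "typo"),
    RecompileEntry(2, "typo"),
    RecompileEntry(2, "untested branch"),
]

print(top_causes(log, sprint=1))  # [('typo', 3), ('wrong units', 2)]
```

Comparing the same tally across sprints is one way to show developers whether a workflow change actually reduced the failures it targeted.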