BSidesWLG 2017 - Kerri Miller - Crescent Wrenches & Debuggers: Building Toolkit For Rational Inquiry

Name: BSidesWLG 2017 - Kerri Miller - Crescent Wrenches & Debuggers: Building Toolkit For Rational Inquiry
Uploaded: 2018-02-01
Duration: 29 min 35 s
Description: Software exists in a constant state of failure, facing pressure on many fronts - malicious intruders, hapless users, accidental features, and our own limits of imagination all conspire to bring our system to a screeching halt. Untangle even the most tangled of Gordian Knots by building your own tool

BSides Wellington, 29:35, 17 views, published 2018-02

About this talk

Software exists in a constant state of failure, facing pressure on many fronts - malicious intruders, hapless users, accidental features, and our own limits of imagination all conspire to bring our system to a screeching halt. Untangle even the most tangled of Gordian Knots by building your own toolkit for inquiry, by relying on the simplest technique of all: asking "why?"


do this kind of stuff. Tomorrow, I'm flying down to Christchurch, I'm renting a bike and riding around the South Island. Next summer, I'm leading a women's adventure motorcycle tour around the wilds of Alaska for eight days and then coming back across Canada, back to the States. And I felt confident to do that because my dad taught me some very simple ways to address problems with engines because there's really only four things that an engine needs. Anyone want to take some guesses at what those four things are? Like the bare minimum. No, actually an engine doesn't need oil. I mean eventually, right? Yes. Yep, we need a spark. Fuel. Air. What's the fourth one? Compression? Did I hear compression?

Yes. Sorry, I got lights in my eyes and I'm half blind anyway. Yes, an internal combustion engine only needs these four things. And so anytime something goes wrong with your auto, any sort of engine, it's one of these four things. Because internal combustion engines, despite all the different varieties of them that we have, whether they be cars, trucks, diesel engines, two-stroke scooters, where you do actually have to add oil to the gasoline, they all have these four major subsystems that make up that engine. And so when something goes wrong, you can very quickly drill down by using these very simple search trees. Do I have a spark? Is the battery good? Oh, no, it's not. Problem solved. Do I have gasoline? OK, I do. Is the air filter clogged? Did I forget to screw in the spark plug so I have no compression? These are the sorts of very simple searches we can do across these major subsystems.
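The subsystem-by-subsystem search described above can be sketched as a short ordered checklist. This is only an illustration: `EngineState` and `diagnose` are hypothetical names, and the order of the checks is an assumption, not something the talk prescribes.

```python
from dataclasses import dataclass


@dataclass
class EngineState:
    """Hypothetical observations, one per subsystem named in the talk."""
    has_spark: bool
    has_fuel: bool
    has_air: bool
    has_compression: bool


def diagnose(state: EngineState) -> str:
    """Walk the four subsystems in order and report the first one that fails."""
    checks = [
        ("spark", state.has_spark),
        ("fuel", state.has_fuel),
        ("air", state.has_air),
        ("compression", state.has_compression),
    ]
    for name, ok in checks:
        if not ok:
            return f"check the {name} subsystem"
    return "all four subsystems look fine; widen the search"


# Example: a clogged air filter shows up at the third check.
clogged = EngineState(has_spark=True, has_fuel=True, has_air=False, has_compression=True)
print(diagnose(clogged))
```

The point of the structure is that each check rules out a whole subsystem at once, so the search stays short no matter what kind of engine it is.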

So in some ways it can be used also within the context of the scientific method. Now, if you're, I don't like to assume, but the scientific method has been around for a long time. I assume most people know it, but as a quick reminder, scientific method is a way of structuring our inquiry into a topic where we have no prior or very little prior knowledge, where we have a question. We have a problem or an idea or an area where we need to acquire more information. So we pose that question. We do a little background research. Like, what do I know about this topic? How could I phrase this question better? Who do I know? What are my local resources for answering this? You construct

a hypothesis to say, well, if this is true, I will see these effects over here. I use this a lot when I'm debugging, where we say, well, if the server isn't responding, How would I prove that?
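One way to picture "if this is true, I will see these effects" in a debugging session is to pair each claim with an experiment that could refute it. The sketch below is an assumption-laden illustration: `Hypothesis`, `run`, and `fake_health_check` are invented names, and the fake probe stands in for whatever real check you would use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    """A claim plus an experiment that can support or refute it."""
    claim: str
    experiment: Callable[[], bool]  # returns True if the predicted effect is observed


def run(hypothesis: Hypothesis) -> str:
    """Run the experiment and phrase the result so it can be reported back."""
    observed = hypothesis.experiment()
    verdict = "supported" if observed else "not supported"
    return f"{hypothesis.claim}: {verdict}"


def fake_health_check() -> int:
    """Stand-in for a real probe; assumed to return an HTTP-style status code."""
    return 503


# Claim: the server isn't responding. Predicted effect: the probe does not return 200.
server_down = Hypothesis(
    claim="the server isn't responding",
    experiment=lambda: fake_health_check() != 200,
)
print(run(server_down))
```

Writing the prediction down before running the check makes it harder to reinterpret the result afterwards.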

You test your hypothesis by doing an experiment. You analyze the data and draw some sort of conclusion. And then you communicate the results. Now communication, in the scientific world, is usually publishing. For most of us app developers, it's like writing a blog post or answering somebody's question on Stack Exchange. In a war room situation or incident, it's reporting back into the war room: hey, I went off and checked the logs, or I did this other piece of diagnostic. I come back and I report that information. And this becomes a feedback loop because it informs our next question, because it becomes part of that background research. And our next question, and our next question, as our knowledge expands and our ability to effect change continues. Now, if you know

anything about modern science though, you'll know that there are two in fact steps here that are missing. Those two of course are write a grant proposal and then the second one being change your results to match the grant. But why do we do this? Why have a structured way, whether you're asking the three Cs, whether you're considering major subsystems, whether you're doing more open-ended research like this, why would you do this? It's to convince, it's to test your knowledge of the world around you and make sure that you're not fooling yourself into thinking something is correct. It is exceptionally common in the app development world for someone to write some code, ship it, deliver it out to production, and then all hell breaks loose. This happens about once

a week, unfortunately. And we roll it back and we very quickly say, ah, well, so-and-so pushed a change. The error must be in the change. And if we look at what are the odds of that, that's probably true, but it's not always true. And so we have to be very careful when we make the assumptions about where we think defects exist within our code. Nine times out of ten, the fix for a bug that we find is going to be six characters or less. Just going in and analyzing six months worth of pull requests that were related to errors that we found in our code, there was only two or three that were more than eight characters. Those were very exceptionally large

ones where we had a real structural problem where we didn't understand the problem we were coding for. But most of the time, a bug is very simple. It's subtle: a little race condition or a forgotten comma or somebody hard-coded AWS keys somewhere, right? Those are very simple, simple errors to make.

Because that's where we're interacting with the software. The computer's a dumb thing. Computer is a five-year-old child who is completely logical. It has no context. It has no ability to understand the world. It can only touch what we've allowed it to touch, and it will do exactly what we tell it to do. We are the chaos. We are the ones who are bringing weirdness and interpretation into its interactions. And so it's really important that we understand the messiness of humanity, that we understand who we are and where we are in this process of interacting with the software. I mentioned that I used to be a full-time professional poker player. I took a year off from tech because I was like, F Amazon, man, I hated that. So

I took a year off, never going to work in tech again, became a professional poker player, studied, loved it like mad. And one of the books that I read during that time was by this luminary in the poker world named Mike Caro. And he wrote this book. He loves naming things after himself. So you know he's got to be a smart luminary, right? Mike Caro's Book of Poker Tells. And it's full of these cartoonish drawings of him like, doing this and doing that, like winning hands. It's kind of ridiculous, but it's got some interesting information about the poker world. But one of the things he posed that stuck with me, he named it after himself, Mike Caro's threshold of misery. And basically it's this idea that there comes a

point in pain, whatever that pain is, there comes this point where you simply can't feel any more pain. And you can no longer feel joy either. Now in the context of poker, Caro talks about it in terms of somebody is fully prepared to lose $500, but on this particular evening now they've lost $200. So screw it. Winning $50 isn't gonna get them out of that hole. It's not gonna get the money back in their pocket. But losing another 500 is meaningless at that point because the numbers have lost any meaning. They've lost any context.

This manifests in my life in other ways. See, I had to be on a call once at 8 a.m. with a bank in London, and I overslept. So it was 9:30. I woke up, it's 9:30, like, ah, guess I'm going to the beach. Or, you know, I've had three drinks on this boat. What's a fourth? One of my front-end devs expressed this the other day when he said, hey, I've got this problem with JavaScript. So I want to add this library. It's going to save me a couple hours. It's only 200K. The front page of our website is four megabytes already, so it'll be fine.

It's kind of an unprofessional thing. It's an unprofessional choice to have made. But it was a sign to me. It's a signal. It's a canary in the coal mine to say somebody is at this point of threshold where they're making unprofessional decisions. They're getting lax. Things are going to start to slide around here. An engineer who is at or past the threshold of misery, they're very important to identify because they're looking for any kind of evidence that what they're doing matters. They're the ones that are going to continue to be engaged, to find good solutions, to create and make good choices along the way. Somebody who has clearly passed the threshold of misery wrote this code. I worked for a hosting company like five or six years ago.

It was my second or third Ruby job. We did domain name registration. Every five minutes, we had this cron job that kicked off. For every client, we look up all their domain names. For each one of those domain names, we would queue up an asynchronous worker task that would first bill the client on our internal accounting system. Then it would use our company's credit card to go ahead and renew the domain name with a registrar. And then if there was any kind of error, we would retry. And if that didn't work, we would just re-queue the job with the same information and try again. How could this go bad? How many ways could this go bad?

Well, it never caused a bug. It never caused an exception. It never raised any alarms. What happened, how we found the bug in this was that our major client, who was normally a $100,000 a month client, they got a new CFO who immediately went out and did an audit of all of their external vendors. And he said, hey, I noticed about six months ago, instead of charging us $100,000 a month, you charge us $130,000 a month. What's going on? So a little scrambling forensic accounting on my part later, our credit card had expired. So every five minutes, we would bill the client $15. We would attempt to use our credit card. The credit card would fail. We would re-queue the job. Five minutes

later, we would bill that customer $15. We would attempt to, you know... on and on and on and on.
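The failure mode described above comes from doing a non-repeatable step (billing) before a step that can fail and be retried (renewal). The sketch below reproduces that ordering and contrasts it with one possible fix. Everything here is hypothetical: the function names, the in-memory ledger, and the choice of fix are assumptions for illustration, and the talk does not say how the real code was repaired. Only the $15 charge and the five-minute retry come from the talk.

```python
class RegistrarError(Exception):
    """Raised by the stand-in registrar when the company card is declined."""


def renew_with_registrar(domain: str, card_valid: bool) -> None:
    """Pretend registrar call; fails whenever the company card is invalid."""
    if not card_valid:
        raise RegistrarError(f"card declined while renewing {domain}")


def naive_job(domain: str, ledger: list, card_valid: bool) -> bool:
    """Order described in the talk: bill the client, then try to renew."""
    ledger.append((domain, 15))  # the client is charged before anything else
    try:
        renew_with_registrar(domain, card_valid)
        return True
    except RegistrarError:
        return False  # the caller re-queues the job; the charge stays


def safer_job(domain: str, ledger: list, card_valid: bool) -> bool:
    """One possible fix: only bill once the renewal has actually succeeded."""
    try:
        renew_with_registrar(domain, card_valid)
    except RegistrarError:
        return False  # nothing was charged, so a retry is harmless
    ledger.append((domain, 15))
    return True


def run_with_retries(job, attempts: int) -> int:
    """Re-run a failing job and return the total billed to the client."""
    ledger = []
    for _ in range(attempts):
        if job("example.com", ledger, card_valid=False):
            break
    return sum(amount for _, amount in ledger)


# Twelve attempts is one hour of five-minute retries with an expired card.
print(run_with_retries(naive_job, attempts=12))
print(run_with_retries(safer_job, attempts=12))
```

Note that neither version raises an alarm on its own, which matches the talk: the naive job fails quietly while the charges accumulate.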

I want to be her someday. That's super distracting.

But why did that happen? How does that actually happen? So of course I went to git blame. Git blame. What do you think I found?

No, not my own name. I know that's what you're thinking. That's usually what I find when I look up a bug. Like, who the did this goddamn code? That's my inside voice. My outside voice is much more generous and compassionate. So I look up the git blame. And the commit on that line from 18 months prior was update the credit card number again.

Yes, because the credit card number was actually hard-coded on our, like, semi-private repo on GitHub? Oh, God.
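A common alternative to a hard-coded secret is to read it from the process environment at startup. This is a minimal sketch under stated assumptions: `PAYMENT_CARD_NUMBER` and `load_card_number` are made-up names, and the talk does not say what the team did instead.

```python
import os


def load_card_number(env=os.environ) -> str:
    """Read the card number from the environment instead of source code.

    PAYMENT_CARD_NUMBER is a made-up variable name for this sketch.
    """
    value = env.get("PAYMENT_CARD_NUMBER")
    if not value:
        # Fail loudly at startup rather than silently at billing time.
        raise RuntimeError("PAYMENT_CARD_NUMBER is not set")
    return value


# Usage with an explicit mapping, so the example does not depend on the real environment.
# The value is a placeholder, not a real card number.
print(load_card_number({"PAYMENT_CARD_NUMBER": "0000-placeholder"}))
```

Keeping the value out of the repository also means that rotating it no longer needs a commit like the one git blame turned up.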

So why are we doing this? Okay, so our company made $3 for every domain name. It took myself and another developer approximately 60 hours total. We'll just spitball just for funsies and say that we cost the company $100 an hour. We lost money. We lost three years of profit because of this incident. So what do you do then? Why are we doing this? Why are we registering domain names? Well, you go back into the company's history, and five years prior, it had been like three guys, three Perl hackers writing this app, and one of their customers had let their domain name expire and blamed them, and they said, oh, well, never gonna let this happen

again. So they whipped something together, and it worked great for like a year and a half, and then the credit card number expired. Fine, we'll fix it, we're busy. They fixed it, 18 months go by, and now it's my problem. Did we really need to offer this service? At the time, yes, we did. At the time, they absolutely had to. They could not lose this client. I love errors like this. I love them because they tell me the story of what's going on. But these errors came about because somebody made a good faith, honest attempt with the best tools that they had, the best resources, the best knowledge they had in their brain and their heart on that particular day. They wrote the best code

they could. Now they were Perl developers. They could have been writing obfuscated code. But that's probably not really what they were doing. They probably weren't just doing it for a challenge or something. They were attempting to solve a problem. All software exists in a state of failure that we just haven't discovered yet. When the context of that software changes, that's when a bug will occur. That's when the really interesting stuff happens. Legacy code creeps up from out of the swamp and bites us in the ass. And it's almost always something that like, oh, we wrote this thing because of this one client. One of the classics that I love is the Bob from accounting problem. Does everybody have like a Bob in accounting

or Bob in sales? There's almost always a Bob in sales. And Bob in sales comes to me and says, I need a weekly report of everybody who purchased our software, where it's more than six digits and they live in the UK. I have to whip up some crazy report that gets bundled up, turned into a PDF and emailed to him every Monday, and I have no idea what he uses it for. It's probably going in a spreadsheet somewhere. Well, Bob leaves the company three weeks later, but I don't ever think about that report that I generate, right? The mail subsystem doesn't generate any exceptions because all those exceptions for the undeliverable mail are swallowed by the mail daemon. But I'm still generating this report every Monday. And that's fine, that

hums along for two years. And then suddenly one day we're getting alerts and alarms because the website is like starting to brown out in performance. And the problem is that report, that report that I'm generating constantly because now the dataset has grown to the point that the code that I wrote for a quick one-off when I was sad and kind of mad at Bob for making me do this silly thing, it's now browning out the site. It was fine as long as Bob was there. Bob's not there anymore, but there's no reason for me to go back and change that code to improve it. Because the context was such that it didn't cause a problem. Just like evolutionary traits in creatures, we

only get rid of them when it's a problem.
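The report for Bob kept running because delivery failures were swallowed before anyone could see them. One way to make that kind of dead job visible is to count bounces and retire the job after a few in a row. This is a sketch, not the speaker's fix: `ReportJob`, `send_report`, `run_weekly`, and the threshold of three are all assumptions.

```python
from dataclasses import dataclass


class DeliveryError(Exception):
    """Stand-in for a bounce reported by the mail system."""


@dataclass
class ReportJob:
    recipient: str
    max_failures: int = 3
    failures: int = 0
    enabled: bool = True


def send_report(recipient: str, active_users: set) -> None:
    """Hypothetical sender that bounces mail for people who have left."""
    if recipient not in active_users:
        raise DeliveryError(f"undeliverable: {recipient}")


def run_weekly(job: ReportJob, active_users: set) -> str:
    """Count bounces instead of swallowing them; retire the job after repeated failures."""
    if not job.enabled:
        return "skipped"
    try:
        send_report(job.recipient, active_users)
    except DeliveryError:
        job.failures += 1
        if job.failures >= job.max_failures:
            job.enabled = False
            return "disabled"
        return "bounced"
    job.failures = 0
    return "sent"


# Bob has left, so every weekly run bounces until the job retires itself.
job = ReportJob(recipient="bob@example.com")
results = [run_weekly(job, active_users={"alice@example.com"}) for _ in range(4)]
print(results)
```

The design choice is simply to let the changed context (Bob leaving) produce a signal, instead of waiting for the data set to grow until the site browns out.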

So successfully understanding code means that we have to understand not just the code itself, the set of instructions that we're asking the computer to do. We have to understand the humans that wrote it at the time and who currently maintain it, what their incentives are, what do they own, what drives them. Is it performance? Is it delivering new features to customers? What are they measured on? What are the metrics that drive their success? We have to look at the context that code lives in. Who uses it? Who interacts with it? I managed to take down all of GitHub for 12 minutes. That was one of my favorite errors. But the best part was I only

took it down for Japanese users.

Because of a bug I made in how we were decoding the Unicode around Japanese character sets. It was only 12 minutes and I was freaked out in the middle of it. But it was actually kind of cool. Because now I knew how it could fail. I had discovered the point at which the hard, cold technology rubbed up against the really messy organic world and created a dissonance between them. That's where the bugs come in.
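The talk does not say what the actual decoding bug was. As a generic illustration of how an encoding assumption can break for one group of users only, the sketch below decodes bytes as UTF-8: ASCII input works, while the same code fails on Japanese text that arrived in Shift JIS. The function names and the choice of Shift JIS as the fallback are assumptions for this example.

```python
def decode_utf8_strict(raw: bytes) -> str:
    """Decode assuming UTF-8; raises UnicodeDecodeError on anything else."""
    return raw.decode("utf-8")


def decode_with_fallback(raw: bytes, fallback: str = "shift_jis") -> str:
    """Try UTF-8 first, then a fallback encoding (an assumption for this sketch)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


ascii_bytes = "hello".encode("shift_jis")      # ASCII bytes are also valid UTF-8
japanese_bytes = "日本語".encode("shift_jis")   # these bytes are not valid UTF-8

print(decode_utf8_strict(ascii_bytes))
try:
    decode_utf8_strict(japanese_bytes)
except UnicodeDecodeError as err:
    print("strict decode failed:", err.reason)
print(decode_with_fallback(japanese_bytes))
```

Tests written only with ASCII input would pass against the strict version, which is how a bug like this can reach production unnoticed.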

I'm sorry I'm not talking about debugging tools per se, because I think this is the best tool we have, especially when we're trying to deal with bugs, because bugs are this unusual, weird thing. They force us to think about our software. Our software's doing something we didn't know it could do, or we never assumed it would be asked to do. This is an adjustable wrench. I don't know what they're called in the rest of the English-speaking world. This is absolutely hands-down my favorite tool in my toolkit. It's adjustable. It can fit any sort of wrench or nut head or whatever bolt I want. It's got this sharp pointy bit, which makes me feel a little bit safer when I'm walking home at night. I've used it to pound

in nails. One of my careers, I spent about five years as a stage hand. So I have in fact used these to like prop open like loading bay doors, jam them in the back doors of a large delivery truck. That's really handy. They make a really good pry bar. I changed a tire once with one of these. I did not once use one in place of a 50 amp service fuse. It's the theater, the show must go on. And I mean, it's made of steel, right? It's conductive, it's not gonna melt. It was fine. It would have been fine if I had done that.

Those are a lot of different uses for this one little piece of metal. If you go to the hardware store, they're going to only tell you about using it to work with bolts. In cognitive psychology, we talk about this as functional fixedness, this idea that we get into our brains that something has a meaning or has a use, and that's all we can ever see for it. And a lot of puzzles and riddles, especially my favorite puns, are all about shifting your perspective on what a thing means and how it fits. Isaac Asimov, he wrote jillions of books on all these different topics. He wrote one on humor. It's really good if you're kind of a fan of Isaac Asimov or you like jokes, look up Isaac

Asimov's big book of humor, or big book of jokes, I think. But it's this long, he's got like 60 or 70 pages just on like, what makes 1950s Jewish comedian jokes funny? What we've heard is the Borscht Belt or the Catskills, the Hamptons in the States. It's a particular kind of comedy. What makes it funny? It's really fascinating. But he talks about this, this idea that like, what's funny is when something changes our perspective and we're a little bit uncomfortable with the dissonance of seeing a thing in two places at once. There's a juxtaposition.

There's a common example in, it's not a funny example necessarily, of cognitive, excuse me, functional fixedness that talks about the Titanic. So the Titanic went down, real tragedy, more than a thousand people died because they didn't have enough lifeboats. There was an iceberg right there. Why didn't people get on the iceberg? That's not a perfect example, but it's that idea of we see things having a role, we see things having a purpose, and it's very difficult for us as humans, the way our minds work, to shift and see them from a different perspective. But the more that we can do that, the more that we're able to understand the shifting environment around our software, when somebody asks it to do something different than what is intended. I worked at GitHub and about once every two weeks, gists would go down, gist.github.com. And it was always,

always, always, always that a JavaScript developer realized that there's an API for updating gists. And they started using gists as a data store. And then all of a sudden, their little site got on Hacker News. And before you know it, the website's down because they're saturating our API. Because we never thought somebody would do that. Why would you do that? Why would you use gists as a data store? Who knows? Once we figured it out, though, it's pretty easy to address. But it surprised us. It surprised us that somebody would be that creative. Our brains are wired this way to understand things, having a role and having a purpose for a reason. Because 99 times out of 100, a thing is what it appears to be. It

is a tree, it is a rock, it is a fire, it is a bug from the code that you just pushed to production. These patterns exist because they're standard ways of solving frequently encountered problems in our environment.
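For the gist story above, where one client's unexpected use saturated an API, a standard defence is per-client rate limiting. The talk says only that the problem was easy to address once understood, not how it was addressed, so the token bucket below is a generic sketch. `TokenBucket`, its parameters, and the numbers used are illustrative assumptions.

```python
class TokenBucket:
    """Per-client token bucket; capacity and refill rate are illustrative values."""

    def __init__(self, capacity: int, refill_per_second: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last_seen = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (in seconds) may proceed."""
        elapsed = max(0.0, now - self.last_seen)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_seen = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(capacity=5, refill_per_second=1.0)
burst = [bucket.allow(now=0.0) for _ in range(8)]  # eight requests in the same instant
print(burst)
print(bucket.allow(now=2.0))  # two seconds later, some tokens have refilled
```

Passing the clock in as `now` keeps the example deterministic; a real service would read a monotonic clock instead.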

If we never question those idioms or question how well we understand a particular problem, though, our abilities don't grow. We're very, I imagine, based on all the talks that I've seen, the conversations I've had, and the general ethos of everything I've seen at this conference, I'm going to make the assumption that we as a group here are exceptionally curious individuals.

We should be looking more into how things work and not make those assumptions. We do. We do, though. But the more that we do that, the less we become blinded to opportunities, the more we're able to extend our reach of how we understand, again, this way that hard software rubs up against chaotic, messy, organic world around it.

My dad gave me this advice as I walked out of the bar. We went to a bar. It was actually the first time I had a beer with my dad. Just this last summer. At the age of 40. It was amazing. And he gave me this advice. Take it apart and put it back together again. How many times have I fixed a piece of code simply because I'm like, I don't know what's going on. And I rewrite it. Line for line. Character for character. And it works. This happens less now that line endings are less of a problem and I don't use Notepad.

But so often when we rewrite something, a subsystem where we examine it, we look at the problem, we take it apart and put it back together again, our understanding of it grows. Know our comfort zone. Figure out what you're comfortable doing. I love working on my motorcycle. I will touch everything on it. I've broken apart the engine, I've cracked open cases before. The only thing I will not do is work on the brakes. I'm scared to death of that. There is a nice, very nice gentleman down at the garage who will do that for me for very, very cheap. He fixes my brakes. Move at deliberate speed. When we're in a rush, especially in a stressful incident, if we have an outage

of some sort and we're addressing that, right, we're panicked. We have to fix it. We're losing money. There's a client on the phone. The CEO pops into Slack and says, hey, do you guys know the website's down? But we still have to move with deliberate speed to give ourselves time to go through these structured processes of approach, whether it's the scientific method or the three Cs or something else. We have to move with that speed so we know that we're not making mistakes. We're questioning our assumptions. And then we're able to understand the information that we find out as we move along. Consulting experts. I always want to call in an expert. Unfortunately, I'm a lead developer, so I am the expert. So I consult myself sometimes.

Leave yourself breadcrumbs. Whenever I take apart an engine, I photograph the entire process. Every two or three bolts, I'll take a picture. And I'll have little bins and little cups for all the different nuts and bolts. And when I was a kid, my dad used to bring home radios and toasters and things and give them to me because I was a very precocious child. And he'd be like, take it apart. And I took the first one apart, this big jumble of pieces, and he's like, great, now put it back together again. He tricked me. So leaving yourself breadcrumbs to find your way back out when you're shaving the yak of a problem is really important. So I take copious notes about who's working on what,

especially when I'm an incident commander. And finally, knowing when to bail. There's just certain points when, like, I think that bolt can go back in, but if I just give it another couple pounds of... But every time I do that, every time I force something, I'm going to strip the bolt or break it off or something.

And sometimes when I think I'm absolutely convinced, what we need to do is just nuke and pave the server. It's almost never that. It's never that. It's just like, oh, we just have to empty the log file. The log's filled up. The disk is 100%. We don't need to nuke it. Although nuking it would work in this situation. So I didn't talk about any debugging tools, and I'm really sorry, you guys. I'm sorry. But everything that we work on, everything we touch as technologists, was invented by a human. Now, I'm only tangentially involved through my job with information security practices. But I love them. I do everything they tell me to. I don't click on any links. I change my

password. All of that sort of stuff. I almost considered getting a disposable laptop. People said, you're going to an InfoSec conference? Why don't you get a little disposable laptop and use that? No, that's not gonna happen.

But if I wanted to, and I do, I can learn anything I want. I just have to go out and ask the right questions in a structured way that allows me to learn new things, that allows me to understand and integrate that information. And to remember that when we're working with the technology, we are the single constant, and it's that seam between humans and technology where failures occur. The world moves forward, it changes the context for that piece of technology. And that piece of technology is now something we laugh at. Aron reminded me of that this morning in her talk talking about PHP and how we really shouldn't look down on PHP. I was a PHP developer for a long time. I was a Perl dev

and then I got a whole bunch of jobs asking for PHP and PHP is just a subset of Perl really. That's what it looked like to me at the time. That was really great. And there's things that are great about PHP. I just went to a PHP conference last week. It was fabulous to see a different technology, people arguing and debating their problems. But all of their problems came down to humans using technology. Because software doesn't exist for the sake of software. It doesn't exist to just simply talk to another robot. We're not there yet. Even though we have computers that can program themselves, they're still driven by a human desire. They are designed by human minds, and you can understand and reason about it.

And if you can't, you can learn to do that. You can learn to put yourself in the place of the people who wrote it or the people who use it to understand the seam between them and the technology and how that interface is breaking or changing. I'm a really good developer. I'm an exceptional developer. But not because I got lucky, although I did pick a really good language with Ruby. Not because I got strapped on some rocket and got high fives all the way up to where I am, but because I realized I could learn anything and I retain that curiosity every day to push forward and make good, solid decisions about the technology,

to understand it and to make it better for other people. And in doing so, I become better because I learn how this technology lives, works, and ultimately breathes in a very organic world. Thank you very much.
