← All talks

SAST and the Bad Human Code Project

BSides PDX 2018 · 27:42 · 193 views · Published 2019-02 · Watch on YouTube ↗
Speakers: John L. Whiteman
Tags
Category: Technical
Style: Talk
About this talk
Static application security testing (SAST) is the automated analysis of source code in both its text and compiled forms. Lint is considered to be one of the first tools to analyze source code, and this year marks its 40th anniversary. Even though it wasn't explicitly searching for security vulnerabilities back then, it did flag suspicious constructs. Today there is a myriad of tools to choose from, both open source and commercial. We'll talk about things to consider when evaluating web application scanners, then turn our attention to finding additional ways to aggregate and correlate data from other sources, such as git logs, code complexity analyzers, and even rosters of students who completed secure coding training, in an attempt to build a predictive vulnerability model for any new application that comes along. We're also looking for people to contribute to a new open source initiative called "The Bad Human Code Project." The goal is to create a one-stop corpus of intentionally vulnerable code snippets in as many languages as possible.

John L. Whiteman is a web application security engineer at Oregon Health and Science University. He builds security tools and teaches a hands-on secure coding class to developers, researchers, and anyone else interested in protecting data at the institution. He previously worked as a security researcher for Intel's Open Source Technology Center. John recently completed a Master of Computer Science at Georgia Institute of Technology, specializing in Interactive Intelligence. He loves talking with like-minded people who are interested in building the next generation of security controls using technologies such as machine learning and AI.
Transcript [en]

[John L. Whiteman:] Okay, as you heard my name is John L. Whiteman. I always put the 'L' in there. And I'm a web application security engineer at Oregon Health and Science University and today I'm going to be talking about Static Application Security Testing, or SAST, and the Bad Human Code Project. And today, I think it's National Cyber Awareness Month? So, today I'm gonna call this SASTer day. [ audience laughter ] Okay, and there's still room on 'A' if you don't want to continue here. [ audience laughter ] So what is Bad Human? Well, this started for me 26 years ago; 1992 in Japan. I lived in a very small apartment there, which goes without saying.

And I played this game called Aces of the Pacific. Anybody heard of that one before? Yeah? Cool. Awesome. And I found out... Well, it's a DOS-based World War II flight simulator, and I found out that the creator is this company called Dynamix, which is out of Eugene. Which is awesome. So, this is a great game. You could fly on either side of the war. You can fly as the Americans and you can fly as the Japanese. So, one day I was on a mission. I'm flying around, coming close to the ship. Came in too close. I crashed and exploded. The software crashed and exploded and these words appeared out of nowhere. And I'm not kidding.

'Bad Human.' Now this is the early 90s, right? And I'm like... it's like 'Bad Human' and I cried. I thought I was being punished by the game developers or something and I couldn't remember. Was I flying into a friendly? Or something of that nature? Something with my behavior caused a developer to write that message. And the thing is, I couldn't reproduce it, however hard I tried. In fact, I probably spent more time trying to be a Bad Human than an Ace. So, that's where Bad Human comes from. And every time I mention this from now on, think of it as being about behavior. Particularly when we're talking about coding. So what is SAST? Has anybody never heard of SAST in here?

At all, never? Cool. Now I'm gonna talk to you. [ laughs ] But it's basically the analysis of source code. So, we have programming code. And what you're doing is you're trying to find security flaws, bugs, whatever. In our case we're talking about security vulnerabilities. And it could be automated. There's open source tools. And it could be manual. Which I call, 'people made it.' And we're going to talk about automation tools particularly here. And why is it called SAST? Why static? Because the source code itself is never executed. So, there's a DAST version where they're doing dynamic execution. But I'm here to tell you that source code is not static, but it's a living thing based

on human behavior. It could be good human behavior or bad human behavior. And, this is a lesson that I learned at OHSU. You know, kind of coming back and looking in context of what I'm trying to talk about ... OHSU is a university, it's a hospital, and it's a research center. And it's a target. We've got a lot of highly valuable data and a lot of people want to get that data. And so one of the attack surfaces, of course, is web applications. So, when you think about web applications, there's OWASP's Top 10; OWASP is an organization that does a great job of kind of spelling out what these vulnerabilities are. Could be XSS, SQL injection and even vulnerable third-party software.

Those are the ones with the CVEs, the non-custom code, stuff that we look at. So, our security team... We've got two people. And we're focused on web application security. And we started about two years ago; in fact, one month from now it will be two years. And we started everything from scratch there. But, we're really part of a much larger effort to secure data at the institution. There's obviously a bigger data team. So, what makes us different? Well, a typical shop might have ... Maybe it's a company, they might have one or two web applications. And one of them might be the corporate website, the other might be something for microservices. We have hundreds of web apps at OHSU.

They, the typical company, may have dedicated dev teams. We define dedicated with a question mark. Usually when you walk in there's that team. And we have some of them but not always. And there's an expected skillset. Particularly, if you are a commercial company, you're gonna hire people. Maybe they're a Drupal developer, but there's some sort of level of expertise in the hiring process that you have when they're developing these web apps. For us, it's an eclectic mix of developers, hobbyists, and PhDs. And if you look at the code from a hobbyist to a PhD it's about the same. The only difference is that the PhD person is trying to cure cancer. So, this is a human issue.

Because I would come in and maybe they're at the cusp of finding a cure for cancer and they're gonna announce it on their web app and I tell them they have to shut it down because they have a DOM-based XSS error. They're not gonna like that. So, we have to deal with human interactions here as well. The only thing that's the same between us is that both of us have one to two dedicated people to handle these web apps. So, some of the early challenges... When I first started, we knew where a lot of these apps were, but we didn't know all of them. We just didn't know. That's been worked out. Also orphan projects that were still running.

Now this is common, especially for research places. They may get some funding and that funding is, like "Okay, here's a project." "Here's the deliverable and you're done." And they move on. Like some grant money, like that. And here's this project running since the Civil War. And we don't know. And it looks like 'people,' according to the Apache logs, are using it, but we're not sure. We don't know how critical it is, and all these people are gone. Oh, and, by the way, there's a huge vulnerability that was discovered on it. So, that was a common thing, as well. But the thing that really challenged us, and this is sort of the bulk of today's presentation,

is we really had limited knowledge of the people who built the projects. We just didn't know. We knew some but we didn't have that insight like you would have at a normal company. So, what we had to become were web app whisperers. And what we needed is to get as much information about the app as possible. We listened, we collected the data. We use open source tools to do this. We put the data into a SIEM, and then from there the SIEM will have its dashboard. This is the goal. And we could actually profile the web app security from there in addition to all this other information that we got. Because, just imagine us getting just a URL of the repo, and that's it.

It's cool. So, here's our security tool chain. Nothing spectacular, but probably additional scanning things are taking place. And our projects are in Bitbucket. We're using Git, and we're using Jenkins. So, we have these customers come in, they do their commits, and push. Jenkins can fire off a scan on demand through this whole chain. And we also do offline scans as well, like, on the weekend. And you'll see that it's divided into three components: One is the non-SAST side, which is in green here. And this is like, Git, Cloc, Lizard, custom. I'll talk a little bit more about what those are, specifically. But these are what we call 'features,' and we're collecting these data points, and we

have about 80-some data points now that we collect. The third-party stuff is the third-party vulnerable components. This is the non-custom code: OWASP dependency checker, Retire, and Drush for Drupal applications. We also use those tools. And then finally, the SAST itself. And again the SAST was the tools that we use to find the vulnerabilities. We call these 'labels.' And why do we have multiple ones? One, we have SonarQube there, which we used early on. But, we also have a couple of commercial SAST tools. And why do we use all of these? It's because there's not one tool to handle everything. There's gaps. I wish there was. You'd be a rich person. And then we dump it into the SIEM.
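
(To make that chain a little more concrete, here is a minimal sketch, not the actual OHSU pipeline, of what one non-SAST collection step could look like: run Cloc and Lizard over a checked-out repo and emit one JSON document per project, ready to ship toward a SIEM. Tool choices, paths, and field names are assumptions.)

```python
#!/usr/bin/env python3
# Hypothetical non-SAST collection step; field names and the SIEM hand-off
# are assumptions, not the pipeline described in the talk.
import json
import subprocess
import sys

def collect_non_sast(repo_dir):
    features = {}

    # cloc can emit its per-language line counts directly as JSON.
    cloc = subprocess.run(["cloc", "--json", "--quiet", repo_dir],
                          capture_output=True, text=True, check=True)
    features["cloc"] = json.loads(cloc.stdout)

    # Lizard's CSV output has one row per function with complexity metrics.
    lizard = subprocess.run(["lizard", "--csv", repo_dir],
                            capture_output=True, text=True, check=True)
    features["lizard_csv"] = lizard.stdout

    return features

if __name__ == "__main__":
    # One JSON document per scanned project, e.g. forwarded to the SIEM.
    print(json.dumps(collect_non_sast(sys.argv[1]), indent=2))
```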

And also, you notice below: students. We teach a one-day course that we built, and we track the students who attended this class. This class is for web application coding; secure coding for web applications. Which is awesome, and we track that. I'll show you what we do with that data later on. Not all SASTs are standardized, if you will. Some will look at a vulnerability and they may call it a severity, they may call it a risk. Some may say high, medium, low. Some other ones might say, 'Really bad, Really Really Bad,' whatever they say. But, the classification system that we use is: Critical, high, medium, low, and none. And there's a numeric that's given with it.

But, typically, what we've sort of adopted is at least the same tiering as CVSS version 3.0. If you look at it here, it's usually a float, especially when we're starting to take averages here. And that's what we're using. So, at the end of the scan for a given project, we sum up the scores, and then we divide them by the number of vulnerabilities to get the average severity for each project. So, you may be wondering why we do that. It's time for machine learning. I'm getting a drink. I get excited when this comes on. So, in machine learning it always starts with a question. And our question today is: Can we predict a project's average severity score based upon

data from non-SAST tools? Can we make that prediction? So, if you remember that tool chain that we had? If I took out all of those SAST tools on the right side that were giving me those average severities, can I do it with all the stuff on the left hand side? That's it. We're not going to know what the names of the specific vulnerabilities are. We're just looking at those averages. And just a caveat here: This is something we didn't start out to do. This isn't, "Let's make a machine learning model for this stuff!" We were just desperate to get the information. But then later would say, "Hey, this looks like something that could be fun."
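
(A hedged sketch of that severity bookkeeping: the tier boundaries below follow the published CVSS v3.0 qualitative scale, while the per-finding numeric scores are whatever the SAST tools emit, assumed here to be on a 0.0-10.0 range.)

```python
# Average the per-finding severity scores for one project and map the result
# onto the CVSS v3.0 qualitative tiers (none / low / medium / high / critical).
def average_severity(scores):
    """scores: one numeric severity (0.0-10.0) per finding from the SAST tools."""
    return sum(scores) / len(scores) if scores else 0.0

def cvss_v3_tier(score):
    if score == 0.0:
        return "none"
    if score < 4.0:
        return "low"
    if score < 7.0:
        return "medium"
    if score < 9.0:
        return "high"
    return "critical"

# Example: findings scored 9.8, 5.3 and 4.0 average to about 6.4 -> "medium".
print(cvss_v3_tier(average_severity([9.8, 5.3, 4.0])))
```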

So, can we do this? Maybe. We decided to use supervised machine learning to find out. And the process for that, real quick, is, you take your whole data set and you divide it into what they call 'training' and 'testing' segments from there. And you use 'labels' and 'features.' 'Labels,' as I mentioned before, is going to be our SAST data. That's going to be the answer to our question. That's the average severity scores and the 'features' are going to consist of our non-SAST data. So, here's an example of a 'training' data set: We call it 'the truth,' and each row represents an instance of a web application that we just ran a scan with. And you'll see that the red, that's the SAST 'label'; That's the question we're trying

to answer. And then everything here is our non-SAST, our 'features.' And we're taking all this data and we're shoving it into machine learning algorithms, and they're trying to figure out, "How can I make this prediction?" That's it. That's all it is. And this is where we create our machine learning model. But, we have to test to see if it's good, as well. So, this is our 'testing' data set. And we call this the 'unseen data.' Now remember, we divided this existing corpus of information and we take this other piece, which is our test data, usually a smaller piece, and we take the actual scores out. And we just put these things in and see how well they score.
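
(A hedged sketch of what a couple of those 'truth' rows might look like; the column names are illustrative stand-ins, not the real 80-odd features.)

```python
import pandas as pd

# One row per scanned web application: non-SAST 'features' plus the
# SAST-derived average-severity 'label' we are trying to predict.
truth = pd.DataFrame([
    {"lines_of_code": 12000, "language_count": 4, "avg_ccn": 6.1,
     "trained_committer_pct": 100, "avg_severity_tier": "low"},
    {"lines_of_code": 85000, "language_count": 11, "avg_ccn": 14.7,
     "trained_committer_pct": 0, "avg_severity_tier": "high"},
])

X = truth.drop(columns="avg_severity_tier")   # features (non-SAST data)
y = truth["avg_severity_tier"]                # label (SAST average severity)
```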

And we kind of, you know, compare, because we actually know the scores. And if that does very well, we're doing very well. We do something called K-Fold Cross-Validation. And what it is, we take that data set and we set a value for 'K.' We just set it to 10. We tried different ones but 10 seems to be the one that people recommend. And we do this 'train' 'test' thing that I just mentioned. But what we do is take that whole data set, divide it in 10, and go ten iterations until we've got everything. And the reason why we do that ... Oh, and by the way, when we load the data set up, we shuffle it; we randomize it.

That makes sure that we're being honest; making sure that there are no clusters of data that might give us some false readings, whether good or bad. And I'll also show you why we do that, in a bit, to determine the health, if you will, how good your accuracy is. So, we had to pick some algorithms. And we did things like Linear Support Vector Machines (SVM), Naive Bayes, Neural Networks, Random Forests, Stochastic Gradient. These are great kids' names. [ audience laughter ] And we used Scikit-learn from Python, which is so easy to do, and even if you don't know exactly how the algorithms are working behind the scenes, there's great literature, basically, to show how they run.
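
(Here is a hedged sketch of that setup with scikit-learn: the five classifier families named above, evaluated with shuffled 10-fold cross-validation. The toy feature matrix stands in for the real non-SAST data, and all hyperparameters are defaults rather than whatever the team actually tuned.)

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# Toy stand-in data: ~50 non-SAST features per web app, 5 severity tiers.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 5, size=200)   # none / low / medium / high / critical

classifiers = {
    "Linear SVM": LinearSVC(),
    "Naive Bayes": GaussianNB(),
    "Neural Network": MLPClassifier(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Stochastic Gradient": SGDClassifier(),
}

# 10-fold cross-validation, shuffled so no cluster of data skews a fold.
folds = KFold(n_splits=10, shuffle=True, random_state=42)

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=folds)
    print(f"{name}: mean accuracy {scores.mean():.1%}, spread {scores.std():.1%}")
```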

But the reason why I picked these was that they're very different in how they make that model. Some, like Naive Bayes, are probabilistic. Others, like Neural Networks, are Perceptrons and so forth; Random Forests are decision trees; other things use Euclidean distances from some point. They're all very different and that's why they were chosen. So, how did we score? Well, first, after this break, we'll talk about the non-SAST data. Because, I think this is interesting. This is how you have to choose your 'features.' And we have 80 of them, so, I'll go through all 80 in 20 minutes -- Just kidding! Cloc, I think, maybe most people have heard of Cloc.

It's a simple tool that counts lines of code. And you'll see the output here: it gives you the language, the files, blank lines, comments, code per file, that type of thing. But, it's pretty cool too because, again, we're looking for Bad Human behavior or some kind of smoke here. Most websites probably have less than 10 different languages associated with them. It could be HTML (counting even non-programming languages), but maybe PHP, there might be some XML in there, JSON, that kind of stuff. But if we're seeing a website, or a project, that has 300 different language types, this is probably indicative of somebody's development environment sitting in a production environment. Something -- We've seen that.
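
(For instance, a hedged sketch of turning cloc's JSON output into a couple of the features just mentioned; the feature names are illustrative.)

```python
import json
import subprocess

def cloc_features(repo_dir):
    out = subprocess.run(["cloc", "--json", "--quiet", repo_dir],
                         capture_output=True, text=True, check=True)
    report = json.loads(out.stdout)

    # Everything except the "header" and "SUM" entries is a detected language.
    languages = [name for name in report if name not in ("header", "SUM")]
    totals = report.get("SUM", {})
    code = totals.get("code", 0)

    return {
        "language_count": len(languages),   # 300 languages would be smoke
        "lines_of_code": code,
        "comments_per_code": totals.get("comment", 0) / code if code else 0.0,
    }
```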

'Lines of code comments' and 'lines of code per comment.' That big argument: Should you comment your code? Should you not comment your code? The question we're asking is: do more comments mean fewer vulnerabilities, or vice-versa? We might be able to answer that question. And my favorite: the unnatural server-side bedfellows. We have seen a mix of .NET and PHP and sometimes ColdFusion in some applications. You say, "What the heck?" A lot of times, that's about transitioning from maybe one language to the other. And it's, "Well we better keep this up there." I'm not saying it's more vulnerable or not, but that's, kind of, something we're monitoring. And remember, we're looking at this data without any sort of bias.

We're just putting these data points in there. Anybody heard of Lizard? Yeah? Cool! Nobody? Oh one? Oh, maybe not. Well, that's the camera guy. Lizard's about code complexity and I love this tool because of what it looks at. And when I'm talking about complexity, I'm not talking about it doing nuclear physics or something of that nature. It's looking at things like -- a nice word here -- the Cyclomatic Complexity Number. To explain what that is: that's linearly independent paths. Think about if/else statements. If you're doing something like 15 or greater, it starts to throw out a flag. Function token and parameter counts: If you've ever been in, like, a compiler class and you

had to make a parser, for example, you know how you'd parse the actual language, right? Well if you have a larger function token count, like a thousand, that's really just a really big function. A really long one. Smaller functions might be looked on more favorably. Again, we're not making a decision here. But, that's usually what we hear. Parameter counts: if you have a function that takes over a hundred parameters, you might want to take a data structures class. [ audience laughter ] And finally, duplicate code: even 0%. You say, "Well, maybe that's not bad, maybe it's not great," but what we think about here is more the cut-and-paste code. Sometimes, people cut and paste and there's a lot of problems with that.
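
(A hedged sketch, assuming Lizard's Python API, of pulling those per-function metrics out of a source file; the thresholds mentioned above would be left for the model to learn, not hard-coded.)

```python
import lizard

def complexity_features(source_file):
    info = lizard.analyze_file(source_file)
    functions = info.function_list
    if not functions:
        return {"max_ccn": 0, "max_token_count": 0, "max_param_count": 0}
    return {
        # Cyclomatic Complexity Number: linearly independent paths per function.
        "max_ccn": max(f.cyclomatic_complexity for f in functions),
        # A thousand-token function is just a very long function.
        "max_token_count": max(f.token_count for f in functions),
        # A hundred-parameter function suggests a missing data structure.
        "max_param_count": max(len(f.parameters) for f in functions),
    }
```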

So, the question is, again: Do lower scores suggest fewer vulnerabilities? I'm not going to answer that yet. Git 'features': This one's my favorite. So, Git we need, of course. But we look at things like project lifetime, number of commits, the last time the project had activity. One thing we do too is: We check the emails of the Git committers and we determine whether or not that committer had taken the class. Right? Now we have this class. So, imagine a metric where you have 10 developers and all 10 took the class, that's 100%. Or one that has nobody taking the class and it's a really bad app. That could be a good indicator. Also, signed-off count: Sometimes, have you ever used Gerrit?

There's some other ones where they have a signed-off in there. That usually indicates that somebody else, instead of just one person, looked at the code. So, like a manual code review. That might be good as well. And my favorite -- and my boss looked at me kind of strange about this one -- Natural Language Processing: I look at the commit messages themselves. And there are great libraries out there. And they run within seconds, I mean they're fast. It's not taking a lot of time. Of course, we look at blanks. We look at misspelled words, but more importantly, there's an indicator called polarity. Anybody know what polarity is? If the developers were depressed, are they making more vulnerable things?

From the language, it looks like that. And Subjectivity: Are they subjective or objective? I don't know. Let's track it and see. So, now it's time for our results. Anybody want to guess the best number, talking from 0 to 100%? Give me a number. [Audience:] 60 [Audience:] 82 [John L. Whiteman:] 82? [Audience:] 40 [John L. Whiteman:] 40? [Audience:] Are these Price is Right jokes? [ Laughter ] So what else? [ Inaudible audience comment ] [Audience:] One dollar? Ooh, and you're the one that didn't know SAST? Well, you're the closest. Yeah. 86. Good job! Okay, so, again, what we've done is we took hundreds of samples ... We didn't use all the 'features.' We did about 50 of them.
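
(Before the results, a quick hedged sketch of how those Git and commit-message features might be extracted. The class-roster set is a stand-in for the real attendance records, and TextBlob is used here for polarity and subjectivity, which may not be the library the team actually used.)

```python
import subprocess
from textblob import TextBlob  # sentence polarity and subjectivity

def git_features(repo_dir, trained_emails):
    # One line per commit: author email, a tab, then the commit subject.
    log = subprocess.run(
        ["git", "-C", repo_dir, "log", "--pretty=format:%ae%x09%s"],
        capture_output=True, text=True, check=True).stdout.splitlines()

    emails = {line.split("\t")[0] for line in log if line}
    messages = [line.split("\t", 1)[1] for line in log if "\t" in line]
    sentiments = [TextBlob(m).sentiment for m in messages]

    return {
        "commit_count": len(log),
        "trained_committer_pct": 100 * len(emails & trained_emails) / len(emails)
                                 if emails else 0.0,
        "avg_polarity": sum(s.polarity for s in sentiments) / len(sentiments)
                        if sentiments else 0.0,
        "avg_subjectivity": sum(s.subjectivity for s in sentiments) / len(sentiments)
                            if sentiments else 0.0,
    }
```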

The K-Fold that I've already described, we picked ten. And I did it a hundred times just to make myself honest. And remember, every time, the data that goes through there is randomized. So, Random Forest came out first. And that was around 86%. And then, the next three: the linear SVM, Stochastic Gradient, and Neural Nets. All of them around 80%, which is still pretty good. And the Naive Bayes was down there at 46%. So, another way of looking at this is that if you, as a person, were given this web application, and it's the first time you saw it, you have a one in five chance of determining whether or not it was 'critical,' 'high,' 'medium,' 'low,' or 'none.'

Right? You have one in five chance. But if we're using Random Forests, for example, you have an 86% chance of doing it. So, that's not bad. And here's a caveat for the Naive Bayes: These other four classifiers are really good at classification. They're all supervised learning algorithms, but they're really good for these multi-label things. Or what I just described: The 'critical,' 'high,' 'medium,' 'low.' This Naive Bayes is not really for that. It's a little bit more probabilistic and not really suited for it. It did, you know, less than a coin flip, I guess, and I was expecting that. I kind of wanted to see if there was some sort of difference between these algorithms. So, yeah, 80%, it's time to quit my job.

I can build an AI security company, go on Joe Rogan, and laugh, and get rich. But, not yet. There were some issues. There WERE some issues. So, let's look at this. How many ... Okay, you already know how many there are ... Here's a thousand points. Remember my K values? And I multiplied by 100 times 10. That's why I like this K-Fold stuff. Because I can keep track of each one of these. And down below in red is, of course, our Naive Bayes. But look at the spread! I'm talking about things as low as 20% accuracy to, maybe, up to 70%? And I'm being generous there. That's a 50% spread. You're going to want to reject that one.

Or let's go the other side: We'll take our Random Forests up here. Yeah, I got a thing here, yeah! Here we go. Yeah, we get some hundred percents up here. That's awesome. But we still see a spread. Maybe 15-20 percent, right? So the question is, from here: is that acceptable? Well, if you have an application that determines whether someone lives or dies on these labels? No, absolutely not. But in our case, remember in the beginning ... Well, maybe I didn't say, but in the beginning we didn't have really good commercial tools in place, yet. We had all these websites and we wanted a way if we had to go in and do some sort of

triaging, without knowing what the websites were, and maybe that's okay. It might be right on. Because we already have that data. It's already there. But, we're not giving up. So, what's happening is something called overfitting, probably, and that occurs when we don't have enough data. I said hundreds of samples. That may sound like a lot, but we probably need thousands of samples. Or tens of thousands, or millions. So, we're telling the web app teams across the organization to build more web apps. On the underfitting side -- this is what I call the buzzkiller. It could be that our algorithms are not right. But we did try a good set of algorithms and they all showed reasonably consistent results.

Except Naive Bayes. So, they may be wrong. Or it may be that we have the wrong 'features'; we have these 80-some. Maybe we have too many in there. I mean, you could go in and do this permutation, combination, whatever, of all these 'features' and maybe that's not right. But the buzzkill is, maybe this is not going to work no matter what. And you have to walk away. But, we don't walk away. Because we're looking for the 'Goldilocks Fit.' And at the time when I first ran the first set, I thought I was going to walk away. But what we need is lots of data and the right set of 'features' to get this curve.

So, how are we going to get more data? I think of code as a bunch of babies in a crib. I took this picture ten years ago, I can finally use it now. Imagine the crib is BitBucket. And each baby is a repo. I like that. Now, they all grow up. Each commit, every time there is a push, it's a new instance of the web app. Right? The code has changed. Something has changed in that web app. Something has evolved. So we treat it as a new sample. We're collecting it anyway, so let's treat it as such. Another thing is, the SAST tools themselves change. So you may get a new version of a SAST tool and hopefully it's going to be better than

the last. Right? That's another thing. They get updated as well. The big one here, where people get involved, is that the false positives will go away. This data that you saw, there was absolutely no triaging whatsoever. Nothing. We didn't talk to developers. We didn't go through and say, "Yeah these are all false positives." I'm sure there are. And there's probably false negatives too. Which are harder to find. But maybe as a tool progresses they will be found as well. So these are the three things that are always changing. So we treat it as a new web app every time and we just keep collecting that accordingly. So, for us we wanna keep collecting the data and we want to be 'Good Humans' as well.

And before I stop ... Way early in this project I was trying to ... We were characterizing and comparing different commercial tools, commercial SAST scanners out there. And there's a lot of stuff on GitHub that has, like, vulnerable web apps and stuff. But one of the things I found challenging was just finding snippets of code that were vulnerable for a given language. So, if you really have a corpus of crappy code somewhere, and you want to put it in here, go for it. What I'm looking for, though, is if people do want to run their scanner, try their scanner, or maybe they're building something on their own, they can download this project, run it

and if you are so kind, maybe you want to upload your SAST results as well. Because we're trying to ... All these commercial SAST scanners are very expensive. But this would be a good thing to do. Or at least a good thing to try. There's some code up there now but just feel free to add it. Anyway, thank you very much. And, any questions?
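
(For illustration only, here is the kind of intentionally vulnerable snippet such a corpus might collect: a hedged Python example of classic SQL injection. The actual project contents and layout may differ.)

```python
import sqlite3

def find_user(conn, username):
    # BAD: attacker-controlled input is concatenated straight into the query,
    # so a username like "' OR '1'='1" returns every row.
    query = "SELECT * FROM users WHERE name = '%s'" % username
    return conn.execute(query).fetchall()

# The safe form a SAST tool would steer you toward instead:
#   conn.execute("SELECT * FROM users WHERE name = ?", (username,))
```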

[Audience:] Did you attempt to list the prediction, like the continuous, were you predicting the slope or is it predicting like a whole new class problem where you predict the float? [John L. Whiteman:] Yeah, so, you get the float from the average and then you put it into that tiering system. It's like that CVSS v.3, if it falls into there, it's a '4' it's a '2' ... So it becomes an integer. Yeah. Yes. [Audience:] So you mentioned you had some issues with falsing. Do you have the link to your [inaudible] analysis or any descriptors of the statistic of your model? [John L. Whiteman:] I have the code I can push up. So there's the one that I had before, uh let's see ... There's another one called '/tools.' Right afterwards.

I have one more commit to put up there but you'll see what I did. [Audience:] Cool. [John L. Whiteman:] So everything is up there. The data is not up there. I can't but I can show you what the results were. But the data has to be redacted, obviously. Yeah, take a look. I'm not a machine learning expert. But, I call myself a doctor, so that's good, right? Anybody else? Yeah. [Audience:] So, some of the numbers you mentioned during the presentation >100 parameters, greater than, you know, ten languages in a single project, I've never seen those in the real world. Am I just missing something? Are you seeing any machine generated code whatsoever? [John L. Whiteman:] No, so, the code complexity tool, the Lizard tool, this is how they do

it out of the box. So, they look at code complexity as those types of things. And for them ... And it's settable. Right? You can set these values, but by default, they're saying if you have ... It's going through the code, it's reviewing that source code and saying, "Hey, you have a 'function' here with a hundred parameters, we're going to set a 'flag' to that." [Audience:] Is that literally a hundred? I mean ... [John L. Whiteman:] No no, it could be whatever value. Yeah, so that's the threshold. It's a threshold. [Audience:] So, is that threshold set? Did you guys set it manually? How does it? [John L. Whiteman:] No, so that was its out-of-the-box default.

In fact, we don't even look at the threshold. We look at warnings, but we also just look at the number. Like I said before, I don't make any decision before and I just said, "Okay, that's a point and I just want to check." Let the machine learning figure that out. Yeah. Anybody else? Yes sir. [Audience:] What SAST tools are you using? [John L. Whiteman:] I can't tell you that. But they're very expensive. [ audience laughter ] Huh? [Audience:] Were you using two tools? [John L. Whiteman:] We had two commercial tools, and SonarQube. And SonarQube, it's not great. There's some good things about it, but when you start looking at the, like, PHP in the security rules, it's hideous.

It just looks at .inc files and stuff. But there were other things that we looked at with SonarQube. And that was in the early days before we could get those good tools. And again, it was mostly because one tool didn't have a language and the other did. That's what we're looking for. Language coverage. [Audience:] Open Source alternative? [John L. Whiteman:] There's tons and it depends on what your language is, but if you want to make a lot of money, build your own. Yeah, you know, there's like, I think, a group out of Germany that did some PHP. It's not bad but they started Open-Source and then eventually ... I think even SonarQube started that way.

[ audience comment inaudible ] That's right. That was one issue: collecting the lines of code with Cloc and all of that. They say, "Well, how much data are you going to do?" So sometimes, the SAST tool will ask how many lines of code, or what the file sizes are, and all this stuff. Some go by number of users, some go by number of projects. They were pretty surprised by, "Well, how many projects are you going to have? Ten?" No, we've got hundreds. So they had to change their model for that. Anybody else? Okay, thank you! [ audience applause ]