
GT - Profiling User Risk: Borrowing from Business Intelligence to Understand the Security of Your Us

BSides Las Vegas · 33:01 · 74 views · Published 2019-10
About this talk
GT - Profiling User Risk: Borrowing from Business Intelligence to Understand the Security of Your Userbase - Emily Austin Ground Truth BSidesLV 2019 - Tuscany Hotel - Aug 06, 2019

Hey so as the title might suggest today I'm gonna be talking about using techniques from business intelligence marketing analytics and user behavior research and I'm gonna talk about understanding what users are doing in your environment or your infrastructure or your application as kind of an approach to end user security so taking this framework of thinking about user behavior and applying it to our domain here we talked a lot in security I think recently anyway about adversarial machine learning and model hacking and and graph stuff that's super cool and all of this stuff is really really interesting but I also think that there's like another branch if you will of data science of analytics that really

due to technical difficulties a portion of this presentation was not recorded we are joining the program already in progress I probably want to send coupons for the restaurant closest to each of those people to get the most like return on investment so I can group them based on geography I can measure the distance that each of these people have from each of these restaurants and take the shortest distance and send them a coupon for that restaurant this is really simple we'll come back to this later so remember this example so now I'm gonna get into some machine learning terminology and like I said I think this will probably be pretty basic for a lot of you if you've if you've done any

sorts of machine learning but if not hopefully it will be helpful and kind of demystify some of it if you've heard terms thrown around so I'll try to make this as not boring as possible okay so there's two main categories of machine learning algorithms typically we talk about we have supervised learning and we have unsupervised learning so supervised learning is when we provide input to the model and we also tell the model what we expect the output data to look like not what we expect it to be but just some kind of sense of what it will look like so for example take an example of a Bayesian spam filter so here along the top I'm

providing so I have some spam I have terrible email then I've ham or not so terrible just normal good email and I can label these as such and then I can feed these to my classifier and they now know my definition for spam and my definition for ham cool so then I can take this other text down here this other like set of emails and I can feed it to my classifier and based on the definitions I've provided my classifier will assign a label to these emails right so it's kind of like if I tell you to draw an apple and I put an apple on the table in front of you or give you a picture of an apple so I'm

telling you both what I want from you but also giving you an idea of kind of what that should look like so contrast that with unsupervised learning so this is when we provide input and we just let the model figure it out we don't define the categories as we did in the spam classifier example but we let the model do it for us and a really popular example of this is using an algorithm called k-means to cluster handwritten numeric digits together so we don't provide any source of truth we just say you know sort all of these into groups that look the most similar to each of the others if that makes sense and k-means is actually one of the

algorithms we're talking about in a moment okay data types so categorical these are categories these are things that don't have numeric properties and then numeric I think that's pretty self-explanatory they're numbers so before we get into talking more about the results I want to talk about imbalanced classes and so this essentially is when the number of observations or data points in each of your classes or groups is not evenly distributed this is a really really important concept if you're wanting to do any sort of anomaly detection or detecting things that don't typically happen a lot of the time so to explain how this would work with MailChimp so here's here's our user base

and these are normal users the majority of all of our users these are people who are just marketing their small business they're creating landing pages for their like community group whatever they're doing normal stuff and then we have these users so these are the users that we're talking about these are the ones we're interested in they're doing bad stuff they've been attacked or they're sending malicious content these are the ones that we really want to focus on so if I were to randomly sample this data and let's say just pull you know 10,000 users out of the entire MailChimp user base the likelihood that the distribution of okay versus not so okay users would match this is really high
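As a rough illustration of the sampling point above, the following sketch draws a simple random sample of 10,000 accounts from a synthetic user base. The population size and the 0.5% share of risky accounts are invented for the example and are not figures from the talk; only the sample size of 10,000 comes from the speaker.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical user base: 1,000,000 accounts, 5,000 of them "risky".
# Both numbers are made up for illustration.
POPULATION_SIZE = 1_000_000
N_RISKY = 5_000
population = ["risky"] * N_RISKY + ["normal"] * (POPULATION_SIZE - N_RISKY)

# Simple random sample of 10,000 accounts, the sample size used in the talk.
sample = random.sample(population, 10_000)
counts = Counter(sample)

population_share = N_RISKY / POPULATION_SIZE
sample_share = counts["risky"] / len(sample)

print(f"risky share in population: {population_share:.4f}")
print(f"risky share in sample:     {sample_share:.4f}")
print(f"risky accounts in sample:  {counts['risky']}")
```

Under these assumed numbers the sample mirrors the population's class balance closely, which also means it contains only a few dozen examples of the rare class.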

like I'm probably gonna get the same distribution in my sample data set which is great from a sampling perspective but it's not so great from the perspective of training our model to recognize these because there aren't enough examples of these if I just sample randomly so that's something that we're gonna have to address when we get into feeding data to the model okay so now we're gonna talk about the models and I will say this is I'm gonna talk through this this has been a trial and error process and it's also still in progress this work isn't finished so it's not gonna be like a super straight shot of just I did this and I

did this and it worked no it's a little bit more real than that so I hope that that's useful okay so first we have to figure out what data we're gonna feed to the model and there's a lot of options this is obviously not all of them but it's just a sample of them there are tons of different ways I could look at this and I eventually ended up filtering things down to something that looked like this and you'll notice that these are mostly security-related or what I would consider security adjacent features so I want to call attention to this specifically because this is where in particular if you are you've been in InfoSec for a long time you've been in

security space for a long time and you want stuff like this this is where your expertise is so critical and this is a place where you can really shine even if you feel like you're not sure about the modeling piece of it because you already have a sense of what might be important like I chose these features because I thought I feel like these are going to have some bearing on how users might be grouped and they might have some bearing on whether a user is deemed more or less risky or at risk of being abused or attacked and so I just want to point that out like domain expertise here is really really really important and can

save you a lot of time so yes I start with security only or security adjacent features and it seems like a good place to start so this is kind of what my data looks like this is a simplified version of course but you can see that I have two-factor status represented and then the two other features that we talked about earlier times found in breaches and then the number of logins and you can see how these accounts differ one thing I want to point out here is that these accounts or excuse me this variable the two-factor presence or absence this is the categorical variable I've encoded it numerically because I want it to work with a model that takes numbers but

these are categories so just keep that in mind so I started with k-means I started here because this was really like I was just sort of sitting at my desk one day and I was like this would be a cool thing to do okay okay k-means is a thing to cluster okay how many is k-means that was literally how I selected this to begin with okay so I mean come on like the one of the cool parts about this type of stuff is that it's all sort of like try stuff and see what happens like there's experimentation to this and so that's that's really fun this part wasn't so fun but anyway so k-means essentially

it's it's a distance based algorithm and it I will point out it only works with numeric data so that will be important the way it works is just like the geographic segmentation that we talked about earlier so it basically will take a distance measure of each of the data points so these will be your cluster centers the restaurants and then it will measure the distance from each of those data points to each cluster center and then assign each of those data points one of those clusters cool okay so I said that you know k-means requires numeric data well most of the lot of the features that I had were categorical in nature so I had to

transform them and there are a number of ways to do this but I essentially encoded them each each of these things as numeric values because I was like you know that seems like that'll work so I ended up with what essentially looks like a sparse matrix of ones and zeros for these mostly binary categorical features so k-means uses distances right so what could end up happening here is k-means is gonna consider really close two objects that are actually very distant from distant from one another just because they've been assigned close numbers so just because two things have the value of one doesn't mean that they're close it just means that was the only other option
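To make the encoding concern concrete, here is a minimal sketch of how a distance measure reacts to arbitrary integer codes. The role category and its codes are hypothetical, chosen only for illustration; the talk's actual features were mostly binary and are not reproduced here.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance, the kind of measure k-means relies on."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def encode(role, codes):
    # Each account is (two_factor_enabled, role_code); two-factor is held
    # at 0 so only the role code affects the distance.
    return (0, codes[role])

# Two equally arbitrary integer encodings of the same unordered category.
codes_a = {"viewer": 0, "author": 1, "admin": 2}
codes_b = {"viewer": 0, "admin": 1, "author": 2}

for name, codes in (("encoding A", codes_a), ("encoding B", codes_b)):
    d_author = euclidean(encode("viewer", codes), encode("author", codes))
    d_admin = euclidean(encode("viewer", codes), encode("admin", codes))
    print(f"{name}: viewer-author={d_author}, viewer-admin={d_admin}")
```

Which pair looks "closer" flips depending on the labels chosen, even though nothing about the accounts changed. That is the sense in which a distance over encoded categories can reflect the encoding rather than the data.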

because it's a binary variable so I stopped right here I didn't actually go further with this I was like no this is a waste of my time I'm not gonna do this but it wasn't really a waste because it kind of made me think more about how this would work so I was like okay I have categorical data let me find something that's gonna work with that instead of trying to fit my data to this model that was a weird way to say that instead of trying to make my data work with this model let me try and find a model that will work with this data so I found K modes and K modes is pretty

similar in concept to k-means but instead of distances it uses dissimilarities and so essentially it quantifies the total mismatches between two objects or two data points and the smaller that number the more similar the two objects are the larger the less similar they are sort of k-means that uses modes so here's an example of the data that I fed to K modes again you can see a lot of a lot of this is binary categorical there's a lot of other stuff that's over off the screen that you can't see and so here's what I ended up with alright so after trying a number of different numbers of clusters trying different ways of doing this I ended up using

eight clusters just for no real reason let me be very clear no real reason again I'm just kind of throwing stuff together and seeing how well it works and I ended up with this distribution so each of these here to orient you these are these numbers over here on the left are the classes or the clusters and over here are the counts of users who fall into each of those clusters so you might think as I did oh my gosh okay these are the anomalies it's five and four and seven are the classes this is what I wanted to find that was not actually true it ended up being that there were a couple of accounts that looked really

similar they just had different role based like access controls like one was a viewer one was an author so that wasn't quite as exciting as I hoped and then I started reading everything and so I went back to the drawing board and I was like okay let's add some features maybe instead of focusing on security attributes only maybe what I need to do because the whole point of this is to generate kind of a holistic picture of a user and understand all of the attributes that end up made that go into that whether they're more or less secure and it's not just security related attributes they can't be that just doesn't make sense and so I ended up

adding a couple of other features things like account size so like how many email addresses or do they have saved in their account how much do they pay us do they pay us and a couple of other things in addition to these security features but then I wound up with both categorical and numeric data so that's a problem because now I've got to find another algorithm and I actually came across this I don't want to call it obscure but there's not a whole lot of stuff out there about it I found this algorithm called K prototypes and it was introduced I think in a 1997 or 98 paper that was essentially for this type of problem not necessarily security related

but clustering with mixed data types and I was like okay you know what this seems like this might work pretty well and so I have my data both you can see we've got like a nice mix of different types of attributes and okay let's run it through K prototypes K prototypes by the way works kind of the same it's the same idea as k-means and k modes but it uses dissimilarity instead of just pure distance so we now have this and again these are the clusters and these are the counts over here on the right are the counts of users that fall into those clusters and so okay this looks a little more balanced than the results I got

before but it's still kind of I'm not really sure what to make of it so like I'm curious specifically about clusters one and seven just because they're small and they seem like they're outliers and so I was just very curious my like the security minded me was like okay two-factor authentication maybe there's something different about them here and it was so these were the charts that I showed you earlier and so cluster 1 and cluster 7 again just two separate groups of users and they these these two clusters had the highest rates of adoption of two-factor authentication across all of those clusters so there's something and then these clusters four and five were the clusters that had the

lowest across-the-board adoption of two factor and cluster four and five let me just go back for a second so they're not the largest but they're somewhat similarly sized so that's kind of interesting as well this this is a really I'm so sorry for the quality of this this is really embarrassing but there is also an extremely extremely extremely weak but positive correlation between whether someone pays us monthly so has a nonzero average monthly payment and whether they have some type of two-factor you actually can't even see it on the projector because it's so slight but it's there so I feel like I'm on the right track okay but ultimately these results are still kind of imbalanced so
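The mixed-data dissimilarity behind K prototypes, as described earlier in the talk, can be sketched roughly as follows: squared differences on the numeric attributes plus a weighted count of mismatches on the categorical ones. This is a simplified illustration, not the speaker's code or any library's implementation. The attribute names, the sample values, and the weight `gamma` are all assumptions, and the numeric attributes are assumed to be scaled to a comparable range beforehand.

```python
def mixed_dissimilarity(a, b, numeric_keys, categorical_keys, gamma=1.0):
    """Squared numeric difference plus gamma times categorical mismatches."""
    numeric_part = sum((a[k] - b[k]) ** 2 for k in numeric_keys)
    mismatches = sum(1 for k in categorical_keys if a[k] != b[k])
    return numeric_part + gamma * mismatches

# Hypothetical account and cluster prototype (all values invented).
account = {"two_factor": "on", "role": "author", "logins_scaled": 0.2}
prototype = {"two_factor": "off", "role": "author", "logins_scaled": 0.5}

NUMERIC = ("logins_scaled",)
CATEGORICAL = ("two_factor", "role")

# Mixed case: one numeric attribute and two categorical attributes.
mixed = mixed_dissimilarity(account, prototype, NUMERIC, CATEGORICAL, gamma=0.5)

# With no numeric attributes this reduces to a plain mismatch count,
# which is the K modes style of dissimilarity.
mismatch_only = mixed_dissimilarity(account, prototype, (), CATEGORICAL)

print(mixed, mismatch_only)
```

The weight `gamma` controls how much a categorical mismatch counts relative to numeric differences, so its value can noticeably change which prototype an account ends up closest to.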

what am I gonna do and I so to be clear from this randomly sampled data these were 10,000 user accounts it shook out to be about I think 13,000 like accounts that could log into these accounts there's a chance that the actual target class wasn't even represented in this data so there's a couple of ways to deal with this with the imbalanced class issue so one is taking so is oversampling so this would be taking all like normal users and original excuse me and not so normal users and basically creating copies of this not so normal like these bad users and inserting copies into a duplicate data set so I would have the same or almost the same number of

observations of the the class that I'm really curious about and also just the normal user base I could also do something where I undersample so I could just cut this large normal user set and make it equivalent to be the smaller more interesting target set that I want to look into I will say that I have a little bit of a hunch that this is probably going to work better that oversampling will probably work better just because I think giving my giving whatever model a chance to see this this much will probably help out a lot as far as giving extra visibility into kind of what these accounts look like and being able to recognize them but I'm probably

gonna try this too just because this whole thing is you know in the spirit of just let's randomly try stuff so I'll probably do that and then I kind of want to I've touched on the this isn't finished this is still like where I stopped just there is essentially how far I've gotten in this research so I want to talk about a couple of questions I presented this to the data science folks at my organization a week or so ago they got some really good questions and I want to talk about those so one of the the big big questions that I've gotten is how are you gonna evaluate the results of this particularly given that you're

using unsupervised learning this is a really hard problem just like in any in general across any domain across any problem where you're using unsupervised or excuse me using clustering clustering is unsupervised but a lot of it really depends on why you're using unsupervised clustering techniques to begin with I keep saying unsupervised clustering clustering is unsupervised there's not really a type that's supervised so it really depends on why you're using this method in the first place and unfortunately most techniques that that allow you to evaluate like formally evaluate the success of a clustering model require access to some sort of you know ground truth some sort of labeled data and so that's kind of tough I could

generate some labeled data myself because I going back to the whole idea of domain expertise I think I have a good idea of what constitutes a fairly secure versus a not so secure account but again that kind of almost defeats the purpose of feeding it to a clustering model to begin with so I'm not really sure I think this is probably gonna take some manual review I think I'll probably have to go through and just look to see you know do the clusters make sense which doesn't really sound super cool and like advanced but that's the reality of what I'm probably gonna do and then another common question I get is you know what other

models have you considered so all the models that I talked about here are centroid based clustering and so I'm probably gonna try DBSCAN or some sort of density based clustering model at some point again I'm probably gonna throw a bunch of things at the wall and see what sticks just because I'm really curious how this will shake out and what I'd really like to do is eventually long-term put this data into a supervised model use this to train a supervised classifier and you know I'll probably do something like look into a random forest or some other type of ensemble model so and I think this is really the crux of this whole thing is

what is the point of this and how does this actually fit into the defensive security program at MailChimp wait how does this fit into what you already have so ultimately my goal for this is to at the simplest layer if you are a high risk account if we deem you high risk or you're a security risk maybe we don't let you fail at login so many times or maybe there are certain things around the perimeter we don't let you do quite as much as we would an account that we deem not so much of a security risk and that's super broad I realize that but part of it's gonna be figuring out what that looks like but really what I want

to do is provide a heuristic for my team and for our anti-abuse team anyone else who touches this type of problem to focus monitoring efforts on accounts that really need it because we have we have millions and millions of users and it's impossible to catch everything so if I can help carve out these groups so one of them would be the like these kind of the accounts that I showed on the left side of the spectrum earlier if I can help provide attention on those and give a spotlight to those then I feel like I'm doing something to help my team and that's really what I want to do and no you know this isn't perfect at all

this is just one tiny piece of the puzzle that you know security is a defensive security requires a layered approach and this is just one more layer on top of other things that we are doing that we will be doing and that I hope will help both my team and other teams get get to the important stuff quicker Thanks I think we definitely have time for questions so I am up for so you looked at 2FA to see if there was anything special about it did you look at the other variables as well to see if there's special things about them or not yet yes so I didn't I can't really share all of it but yes there definitely and

so it looks like you're using Jupyter notebooks and that's been something that I think several of the speakers today have actually kind of demonstrated so could you tell us a little bit about how Jupyter notebooks have facilitated doing this kind of analysis for you yeah for sure in fact let me go back to one of the screenshots just as an example so Jupyter notebooks if you've attended any of the talks in here today probably seen some examples of this but essentially it's an interactive Python environment that allows you to break up your code into these neat little cells and it provides you probably provides kind of like a living document you can reuse it so I can plug in any type of other data

set like I could just change what actually what data this is and still run this analysis again so it makes it really easy to continue to like hand analyses off it also has the nice benefit of like I can I can create notebooks that have these types of visuals and then along with markdown and other types of text annotation so that I can pass this off to my boss and there but like I can pass this off essentially to an executive with with this kind of content in it so yeah interactive environment but also one thing that I really like about it is that it makes it easy to share that kind of stuff so did

looking at either the medians or the prototypes or the means when you did your clustering give any insight as to what the clusters actually meant yes so unfortunately the way that the way they kind of took shape was based on account attributes that could be found really evenly spread across different different people which doesn't I realize that doesn't make sense I'm not explaining that well yes but it wasn't super helpful for discerning why an account would be in one bucket versus another can you talk about security education for the users and I mean is there an outcome of the research or retraining users are you communicating with them changing the variables yeah so this is actually

something I've been discussing with some folks very very recently we historically have been really careful about reaching out to users or suggesting proactive things partly because you know let's say we decide we want to email everyone who doesn't have two-factor authentication on be like hey lock your stuff up like what are you doing that can also kind of have a different interpretation people can think they will see that and go oh no MailChimp sent a security thing we're really concerned this looks like it could they might have been hacked oh no so there's a really there's a fine line and we're still trying to decide where that line is we're talking about

it though and I am fighting very much for us to be able to somehow talk to these users all of these and say please do yourself a favor and turn on these types of security features that we offer so yeah it's it's a really hard problem and it's you know there's the whole like policy versus what we want so yeah first of all great talk I love seeing machine learning applied to like data science problems in the security world and it kind of brought up for me like this question of like what is the dependent variable in this model like it seems like you have a lot of independent variables about users and not a ton that

the model would even be able to pick up on in terms of outcomes and so maybe just to build on the previous question like do you have metadata about the kinds of outcomes you're interested in or is that just not available or like what are you hoping to see at the end of this model aside from identifying user accounts that are anomalies yes so ideally what I would do and I didn't really get into this in the talk just because it was if there was a lot to cover we have a known data set of accounts that have been attacked and taken over I know what this looked like and so those are things that I want to

go back and be able to eventually feed to some of the future models and say can you discern whether these are one interesting four I'm sorry I'm only going on about this but one interesting problem that we have is when an account is attacked or taken over or abused in some way almost immediately we have a team that reaches out to that user to say hey something happened you need to change your password you need to turn on two-factor you need to set up these notifications and so it's it's great for the user but it kind of is unfortunate for this research because what happens is we don't get a picture of what that attacked user looks like at the time of

attack so that's actually an initiative I'm working on with some of our data engineers right now to get those account snapshots so that we'll be able to have that source of okay this is what they look like so I hope that answers your question so a follow-on to that no you're good so it seems as though you would have two types of malicious users so the first type is an account that wasn't malicious where their account got hacked taken over and now is being used for nefarious activity and then you'd have the other type of malicious user account that was set up with the premeditated idea to be to use that account maliciously did you

find anything from the clustering that would help identify or correlate with the first those premeditated ones I'd call yes so this is actually another thing that I am working on kind of in conjunction with this this is this is completely different so the way those accounts look those accounts that are created to be bad they're just garbage they look different they there are specific things about them that unfortunately I can't go into here but there are definitely things about them that make it there they can be very easy to detect and so so I don't wanna say that's an easier problem but in some ways it can be because particularly if it's something automated if we see bot sign ups or things like

that we have some methods in place to deal with that but this is a little bit harder because or at least it's harder and it it's hard in a different way and we haven't really spent a lot of time on this particular piece of it which is why I'm focusing on that that's also another really a really big vector though for sure yeah

so it sounds like this is something that you kind of you had the idea and then you kind of have learned and evolved your knowledge of it as you've researched it what kind of tips would you have for other people that are in the organization and might want to try to apply that kind of to learn how to apply data science to their own internal problems starting from kind of the same position you're starting at and trying to kind of follow in your footsteps yeah so that's a really good question so one thing I will say is that I think I've been very lucky because MailChimp encourages a culture of experimentation and so if I do all of this and it turns

out to be you know we can't really use this in production we can't really do anything with this that's still okay because we've learned something so I will say I will put that out as like as a caveat but I would say you know think about your user base or think about you know what questions keep you up at night and is there a way to somehow start thinking analytically about this and I realize this is a lot easier said than done but thinking through the lens of an analyst or a data scientist what types of data do you have available what can you query easily what can you get through get your hands on data yet even

if it's I mean if it's in a Google sheet if it's in a spreadsheet doesn't matter start getting familiar with it because the only reason I knew to go to some of these things was because I lived in an analyst role for so long I knew what what the user account landscape looked like so I would say get really familiar with that and then just start playing with things like start getting your hands dirty like don't be afraid like I got so much of this failed but I learned a lot from it so

I'm sorry but I don't know if this is covered earlier before I was able to get in but I did you have if you you're doing this work did you discover that there was a gap in data so you had to go back and add new data sources in order to be able to get it the correct sort of information yes sorry I wasn't sure if you had a second part to your question um yes absolutely so one of those situations was in going through the different models realizing that I needed not just account security related data I needed just normal what does your account look like how much do you pay us how much how many users do you have like

other types of just account attributes but what there's one piece so I touched on this just a minute or two ago we don't have data around what an account looks like at the time of attack and that's that so those those values aren't they're not time-stamped and stored essentially when the user turns on two-factor that gets updated but we don't necessarily like timestamp that and say this is the history this is the change log of when a user enabled these things and so that's something that I'm working on like right now with our data team to get that to get that in because that's been huge that's actually created a whole lot of extra work for this because

that's something that I don't have that source of truth so does that answer your question okay
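The kind of time-stamped change log described in this answer could support reconstructing what an account looked like at the time of an attack. The sketch below is one hypothetical shape for that idea. The record layout, setting names, account ID, and dates are all invented for illustration and do not describe MailChimp's actual data model.

```python
from datetime import datetime

# Hypothetical change log rows: (account_id, setting, new_value, changed_at).
CHANGE_LOG = [
    ("acct-1", "two_factor", False, datetime(2019, 1, 5)),
    ("acct-1", "two_factor", True, datetime(2019, 6, 20)),
    ("acct-1", "login_notifications", True, datetime(2019, 6, 20)),
]

def settings_as_of(change_log, account_id, moment):
    """Replay changes in time order and keep the latest value per setting."""
    state = {}
    for acct, setting, value, changed_at in sorted(change_log, key=lambda row: row[3]):
        if acct == account_id and changed_at <= moment:
            state[setting] = value
    return state

# Assumed attack time, two days before the user turned on two-factor.
attack_time = datetime(2019, 6, 18)
snapshot = settings_as_of(CHANGE_LOG, "acct-1", attack_time)
current = settings_as_of(CHANGE_LOG, "acct-1", datetime(2019, 7, 1))

print("at time of attack:", snapshot)
print("after remediation:", current)
```

Without the timestamps, only the post-remediation state would be visible, which is the gap the speaker describes.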

[Applause]

[ feedback ]