GT - Security Data Science Teams: A Guide to Prestige Classes

Name: GT - Security Data Science Teams: A Guide to Prestige Classes
Uploaded: 2023-10-25
Duration: 51 min
Description: Ground Truth, 17:00 Tuesday. As more of security becomes driven by data, a menagerie of job titles has cropped up across the industry. Data Scientist, ML Engineer, Data Engineer, AI Researcher, and more have become de rigueur job titles, but the lines between each role remain blurry, especially for

BSides Las Vegas · 51:00 · 62 views · Published 2023-10

About this talk

Ground Truth, 17:00 Tuesday. As more of security becomes driven by data, a menagerie of job titles has cropped up across the industry. Data Scientist, ML Engineer, Data Engineer, AI Researcher, and more have become de rigueur job titles, but the lines between each role remain blurry, especially for early career and non-data folks. In this talk, we talk about where the skills of these roles overlap, how to pursue a security data career, and, crucially, offer some hot takes on why maybe we need some clearer lines.

Erick Galinkin


Good afternoon, everybody, and welcome to BSides Las Vegas Ground Truth. This talk is "Security Data Science Teams: A Guide to Prestige Classes," given by Erick. Erick is a hacker and computer scientist working as principal researcher in Rapid7's Office of the CTO. At present, Erick leads R&D supporting Rapid7's Managed Detection and Response service. An alumnus of Johns Hopkins University, he has published a number of academic papers and given talks on security decision theory and artificial intelligence applications for security at conferences from AAAI and GameSec to DEF CON's AI Village. He has spent his entire life in different parts of information security, ranging from threat intelligence and malware analysis to cloud security and security architecture.

Before we begin, I have a few announcements to make. We would like to thank our sponsors, especially our Diamond sponsor Adobe and our Gold sponsors Prisma Cloud, Semgrep, BlueCat, PlexTrac, Toyota, and ConductorOne. It's their support, along with our other sponsors, donors, and volunteers, that makes this event possible. We have a few policies that we want everybody to pay attention to. These talks are being streamed live, except in Underground, and as a courtesy to our speakers and audience we ask that you check your phone and make sure it is in silent mode. We also have a photo policy: the BSides Las Vegas photo policy prohibits taking pictures without the explicit permission of everyone in the frame. So if you want to take a photo, make sure you have the explicit permission of everyone in the frame. That being said, we would like to welcome Mr. Erick to the stage. [Applause]

Thank you so much for that beautiful introduction. It is a pleasure for all of you to be here. I am surprised at how many people turned out, given that there was a nice little break between the two talks, so thank you all for being here. I wouldn't be excited to speak to an empty room, but I am excited to speak to a room that has at least seven or eight
people in it. So with that, my name is Erick Galinkin. As was mentioned, I lead AI research at Rapid7, and I'm going to talk a little bit about security data science teams and what that means.

Just to begin: what is security data science? I think the clear definition is the study of security data to extract meaningful insights. And if you disagree, that's fine; I have a microphone and you don't.

A little bit about what security data means, because that feels like it can mean a lot of things. This usually means the analysis of things like logs, whether that's system, firewall, or load balancer logs. I have spent so much time on load balancer logs. God, please, I don't ever want to look at load balancer logs again. Files, so this can be executables, documents, scripts (read: malware), or other artifacts, like packet captures, which don't quite fall into logs or files. I'm sure that some of you are coming up with things I haven't mentioned yet, and there are lots of things. Use your imagination: if it relates to security and you can extract data from it, you can probably do security data science on it.

So security data science is, of course, done by security data scientists. What does it mean to be a security data scientist? Well, it means that you're someone who does security data science. You're welcome. Most security data scientists come from two backgrounds. That's either data scientists who are interested in security, so typically this is somebody who started a PhD in physics and decided they wanted to make actual money, or security analysts who are interested in data, which is my background, so I have a little bit of a bias here and I acknowledge that up front.

Now, when we think about security data scientists, and especially these data scientists who are interested in security, one of the points that I like to make to aspiring, to young,
new-hire security data scientists is that it's kind of a prestige class. For those of you who somehow are not nerds but are listening to this talk, prestige classes are a concept from role-playing games. That is to say, there are prerequisites to reach a prestige class. If you want to acquire one, you have to be a certain level, you have to have certain attributes, you have to have certain traits, you have to be an existing class, and then you kind of prestige into the prestige class. There's a certain level cap before you can get to your prestige class; it's not an entry-level thing. When I say that, I get a lot of reactions like "is this gatekeeping?" And yeah, sorry, yes it is.

I think I have a fun anecdote that will help you understand. I'll tell you a little bit about a malware classifier that was built by data scientists. They started with this big corpus of malware, literally millions and millions of malware samples, and they did all their analysis and picked it apart and identified the features and how they were going to featurize it and how they were going to build the classifier, and then they trained a whole classifier on this. This is a true story from a former employer.

So how did it do? Well, it got above 90% accuracy on the test set. It did incredibly well: excellent F1 score, excellent AUC. If I remember correctly, it was like a 0.96 AUC. For those of you who don't know what AUC is, that's the area under the curve. One is literally perfect. The AUC basically measures the trade-off between false positives and false negatives; the higher it is, the better. So that's incredible. It's an unreal classifier.

So what were the two most important features for the classifier? The number one most important feature for determining whether or not an executable was malware was the system language. Number two was the compiler. For those of you
who have ever thought about malware for a moment in your life, you may realize that these are not features that are particularly important in determining whether or not an executable is malicious. So these data scientists went off on their own, built a classifier, and said, "Here you go, it's awesome, it's so good." And we were like, "Hell yeah! What does it do? Explain it to us." And they were like, "Yeah, it just checks the system language. If it's Chinese or Russian, it's pretty much always malicious. If it was compiled with Borland Delphi, it's pretty much always malicious." And it's like, nope. Absolutely no. Wrong, wrong.

Which is to say, security data science requires security skill and data skill. If you're a low-level character, that is, you've just graduated college, you may not have the right balance of skills to be a good security data scientist to start. That's not to say that you can't get there. Of course you can get there. As you start off in your data science journey, in your security journey, and you aspire to become a security data scientist, you'll acquire more experience, you'll acquire ability points, and you can put those ability points in different parts of your skill tree.

In role-playing games, skill trees are a way that, as you build up your levels, you unlock new skills. Some skills are prerequisites to other skills; sometimes you need to have both skills in the line to get to that third skill. You need to have your spheres or your ability points, whatever analogy makes sense for you. But it's tough to move directly to, say, assessing the security of large language models if you've never trained a logistic regression classifier. You need to grasp what's happening under the hood before you can really get to the point where you're making well-reasoned, valid assertions about what is happening where.

There are a lot of skills that can go into being a security data scientist. I've put a bunch up here. I'm not going to read
them, but one of the things is, especially if you're thinking about security data, it's tough to reason about "I built something that tells me whether or not an HTTP stream contains malicious network traffic" if you've never analyzed malicious network traffic. You can build that classifier, but when you get a false positive, when you get a false negative, it's going to be really difficult for you to understand why that happened, explain it, and fix it.

A lot of times data scientists, data people in general, get stuck on this notion that all we need is more data: we just get more data, and then we train it some more, and then it works. That's not always the case, because you have these weird, ambiguous corner cases, especially in network traffic, which is a nightmare to do analysis on. For example, we were training a classifier for anomalous data transfer, and one fun thing is that printers sometimes get a lot of data. You send a lot of data to a printer. Some printers, depending on the make and the model and the protocol, don't actually receive that much data. So does it look like exfil, or does it not look like exfil? Well, I guess that depends on whether it's a Lexmark or a Xerox. And of course, if you don't know how to look at that pcap and say, "Oh, okay, yeah, this is weird, it's using this printer protocol that wasn't in our training set," you're not going to get that. It can be really tough.

So as we're looking at the skills and thinking about the different skills, whether that's good old-fashioned AI, deep learning, data visualization, containerization and deployment, MLOps, etc., that brings us to job titles. Job titles are something that drive me uniquely insane, because, well, we'll get into it. Some common titles you see: machine learning engineer, data scientist, data engineer, data analyst, MLOps engineer, etc. You can kind of break up the responsibilities of the role. I'm not, I
don't need to read this list to you. You don't have to read it; you can take a picture of it, it's fine, or a screen capture if you're watching remotely. What's up! But essentially, there is some overlap in the roles, and there are some really defined things. MLOps is almost completely disjoint from a data scientist. There's overlap between an ML engineer and MLOps, and overlap between the ML engineer and the data scientist. My job title is AI researcher, and that's not on here because it is silly.

The problem is that this is my idealized version, because most orgs end up structured like this, where everybody has the job title data scientist and we don't distinguish at all between whether you are doing the deployment, whether you're doing the maintenance, or whether you are just doing data visualization. You work with data, you science the data, and therefore you are a data scientist. So my hot take is: maybe we should just stop using that title. No more data scientists. I think that by putting that restriction on ourselves, we kind of force ourselves to think about how those titles might matter and how we can delineate those roles and responsibilities. I'll talk a little bit more about that shortly.

When we're thinking about the roles and responsibilities of security data organizations, that's presenting security findings to leadership in digestible ways. That usually means, hopefully, something other than a pie chart, but sometimes they want a pie chart even though it doesn't actually tell you anything meaningful. Please stop using pie charts. Then there's presenting security data in stakeholder-relevant ways. If your stakeholder is a SOC analyst, well, a chart is not going to be nearly as helpful to a SOC analyst as something they can read and take action on. A lot of times all a SOC analyst wants is red or green: is it bad, or do I not care about it? And that really matters: how you present
those findings does matter, and it's where that data visualization skill comes in, a skill that I am sorely lacking. Then there's developing task-specific data models and machine learning models. If you don't have a data model that makes sense, it is going to be very difficult for you to train machine learning models. Using the wrong data structure can be a total nightmare, especially if you're dealing with text data and you've turned it into JSON, and then you need to return that JSON as a string, and then your model chokes and dies on it and you can't figure it out for three weeks. Not that that happened to me like a month ago. And then, of course, enhancing the ability of analysts to deal with at-scale data, by which I really do mean taking the deluge of data that SOC analysts are faced with and turning it into something that they can cope with, as people who don't find using a Jupyter notebook exciting, people who don't want to train models, just people who want to find evil and get rid of it.

A key line that was missing from the earlier chart is that understanding of security processes. It's really important for analysts, for data scientists (if you're going to use that term), and for ML engineers to understand those security processes. That way you don't write a classifier that depends on the system language and the compiler.

So how do we think about understanding security processes for data scientists, for people who are coming from a physics PhD into working in a security organization? I don't want to imply that you need to be an expert. You don't need to be a super competent reverse engineer to know how to write a malware classifier. It helps, but you don't have to be. What's important is that you can work with those subject matter experts, and that you have enough of a background to understand what matters to them and how they do their jobs. If you spend a day with a malware analyst,
you're going to very quickly learn what matters. They're going to say, "Oh, it's making this API call, it's importing these libraries, we've got packing in here." All of these things are hints that something might be malicious, and you learn how to deal with them and reason about them together. So when you get a classifier that does weird things, you can say "that's not right," and you don't have to wait until two days before it goes to production and customers freak out. You can catch it early on in the process.

I've mentioned this a couple of times at various get-togethers, at meetups, and even to my own organization, and one point that I always get is, "But security data scientists, they're all so busy." And, like, so what? I think that that's an excuse. We are busy, but this matters. It's important that you have the appropriate skills and that you invest in the right parts of your skill tree to do the job that you're assigned.

So what is the job that you're assigned, and how does it matter? How do you structure your team? It's important, when you're building your party, when you're building your security data science team, that you collect the different skills, the different strengths and weaknesses, so that you can support whatever your organizational mission is.

I'll give a lightly fictionalized real-world example of my party, my team. We have me. I'm kind of a min-maxed rogue: I've invested a lot in my Dex and Charisma. I've got very high security skill, high machine learning skill, but I cannot write Terraform. Gun to my head, I could not write Terraform. I love everybody who does; my brain doesn't process Terraform. It doesn't make sense. I can't do it. I've tried. Golang and Terraform, those are the two I can't do. If you love Golang, I actually don't apologize. Data visualization is
just not a place I've spent a lot of time. I can build some basic charts, like if I can do it with plt.plot. I'm a filthy Python user. I'm sure there's somebody in here who loves R. I'm sure Gabe is listening somewhere, and to him and to Bob Rudis, I apologize. I do know that ggplot is better; I'm just never going to learn how to use it. So I'm very, very poorly skilled in data visualization.

So when I'm trying to build out my party, I want to bring on Jamie, who's our tank. By the way, I do have the permission of these people to show their faces; these are real people. Jamie comes from a real dev background. She's wonderful, she's brilliant, kind of familiar with security but newish to it. She's competent at data processing, ML, data visualization (more competent than me), but she brings up all of the infrastructure and ops stuff that I can't do. If it involves a .tf file, if it involves a vars file, I go, "Jamie, can you please help me? I need you for this. I can't do it." Somebody said ECR to me: that's gibberish, that's nonsense. EKS? Never heard of her, don't know her, we're not friends. AWS doesn't make sense to me, but it makes sense to Jamie. So I am happy to do my backstabs and whatever, and she is happy to tank the AWS damage for me.

And then we've got another member of our party, Robbie. Robbie is just a wonderful druid, kind of mid-range, like 15s on every stat. He's fine, he's good at everything. Not min-maxed, he's got no dump stats. He's really built a balanced character, and he's wonderful. He's really, really good.

So we have this party with these complementary skills. We've got me working on the deep security stuff and being able to mentor them on the security side of things. I have a lot of background and deep knowledge in the machine learning side of things and the large language model side of
things, so I can cover for them there. Jamie is happy to bring up all of our infrastructure and manage it for us, and then Robbie's kind of just an all-rounder, whatever you need. But there's something missing: we don't have any casters. My party, even though I have tried to build it very carefully, doesn't have anybody who's really, really, really good at data visualization. For what we're doing now, that's okay, because we're mostly supporting these internal operations, SOC people. But if somebody asks us, "Hey, can you write an executive report and put it out to the world?" I say no. No, I can't. I don't know how to do that. I'm going to build you really ugly charts, and I'm going to have to go to our BI team and say, "Hey, can you help? Because you all build beautiful charts all day, and I don't know how to do that."

So it's really important to build that balanced security data science org, and it's really important to have a deep well of security knowledge to pull from, especially when you have data scientists who are coming in from a non-security background. And with that (I kept this incredibly short) I am all set and happy to release you all to ask me questions, and then to go eat dinner. So thank you for your time.

Thank you, Erick. Wonderful talk, really interesting. If you have any questions, you can use this mic and ask Erick your question.

Thank you. I've asked a question at everything that's going on in this room. I'm curious to know how you deal with LLMs, because they seem to be so, just, who knows what's going on under the hood. When I push regenerate, regenerate, I get all this stuff back. There's all these ways you can sneak in; the previous presenter talked about how you can reposition something. Because I deal in governance, and I'm trying to get my own head around that, and I'm just wondering if you have any thoughts on that.

I'm so glad you asked that question. I love, no, I
really do th
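
The malware classifier anecdote from the talk can be sketched in a few lines of Python. This is a minimal, hypothetical illustration, not the classifier from the story: the sample counts, the "language flag" feature, the "field" data set, and the helper names `roc_auc` and `make_samples` are all invented for the example, and AUC is assumed to mean area under the ROC curve. It shows how a feature that mostly reflects how a corpus was collected can score a high AUC on that corpus and still be no better than a coin flip elsewhere.

```python
def roc_auc(scores, labels):
    """ROC AUC as the chance that a random malicious sample outscores a random benign one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_samples(mal_flagged, mal_unflagged, ben_flagged, ben_unflagged):
    """Build (flags, labels); flag=1 means the hypothetical 'suspicious system language' feature fired."""
    flags = [1] * mal_flagged + [0] * mal_unflagged + [1] * ben_flagged + [0] * ben_unflagged
    labels = [1] * (mal_flagged + mal_unflagged) + [0] * (ben_flagged + ben_unflagged)
    return flags, labels


# Hypothetical training corpus: the malware mostly came from sources where the flag fires.
corpus_flags, corpus_labels = make_samples(90, 10, 5, 95)
# Hypothetical field data: the flag fires equally often on malicious and benign files.
field_flags, field_labels = make_samples(50, 50, 50, 50)

corpus_auc = roc_auc(corpus_flags, corpus_labels)
field_auc = roc_auc(field_flags, field_labels)
print(f"AUC of the language flag on the corpus: {corpus_auc:.3f}")
print(f"AUC of the language flag in the field:  {field_auc:.3f}")
```

On these made-up numbers the flag alone scores 0.925 on the corpus and 0.500 in the field, which is the kind of gap a malware analyst would predict before the model ever shipped.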
