
Hello, everyone. I'd like to open this presentation by inviting everyone to a simple exercise. First, I want you to close your eyes for a second. Are you with me? Now imagine that all of your personal information and secrets are safe and secure. Now open your eyes. Could you imagine it? No? Me neither. Unfortunately, we just don't live in that world right now, and that's why I'm here to talk to you about machine learning for detecting sensitive documents on SharePoint.
All right. Before I jump in, I want to give a quick introduction of myself. I'm Wilson Tang, I go by he/they pronouns, and I'm a second-generation Asian American. I mention that because I wouldn't be where I am today, standing on this podium in front of you, if it weren't for my parents immigrating here for a better life for our family. In my current role at Adobe, I'm a machine learning engineer in threat hunting. Before that, I earned my master's degree in computer science from the University of Washington in 2022, where I specialized in machine learning. In my role now, I'm passionate about finding new ML applications in security. I know there's been a lot of hype about ChatGPT; you might have heard about it. But I think now, with all this hype building, is the moment to take the opportunity to think about everything in machine learning, not just ChatGPT, and see how we can apply it to solve some of our security problems. Outside of security, some of my hobbies include cooking new foods; I'm a foodie and like to try new restaurants. I also like traveling to different countries and exploring different cultures.
In my off time, I like to play video games; the one I'm pretty into right now is Valorant. So that's a little bit about me. I hope that gives you a good picture of who I am for the rest of this talk.

Here's a basic overview of what I'm going to cover. First, I'll talk about document classification and how it fits into the realm of sensitive documents. Then I'll talk about how we define sensitive documents in this project, and then how, using that definition, we build machine learning models to detect these documents. After that, I'll cover how we deploy these models as an automated pipeline for real-time detection. Finally, I'll wrap up by noting some key challenges we faced during this project and some future projects we want to tackle based on it.

Now let's jump into the meat of this talk. First, I want to define machine learning. There are a lot of definitions of machine learning; this is the one I align with and agree with: machine learning is a category of artificial intelligence, or AI, where we use lots of data to train a model that can predict on some task.
The diagram on the top right is a common depiction of how machine learning fits in: the outer bubble is AI, machine learning is a subset of AI, and deep learning, which uses neural networks, is a subset of machine learning. What we really want to apply this to, sensitive documents, falls under the problem of document classification within machine learning: assigning a document one or more categories. The most common example you've probably encountered in everyday life is spam filtering. How does your Outlook or Gmail know whether an email is spam or not? That's a classic example of document classification. There are a lot of common techniques for document classification, such as naive Bayes, expectation maximization, and TF-IDF. I won't talk your ear off on all of these; you can go research them after this talk. The one we're really going to focus on today is random forests.

So how do we define sensitive documents in this project? We define them as files containing any information that should not be shared broadly across the company, most commonly when access permissions are set incorrectly on those documents. The first example that comes to mind is a job offer. A job offer should really only be contained within the HR department, the hiring manager, and maybe the candidate. You wouldn't want that document exposed to everyone in the company, because you don't want everyone to know how much you're making or what you were offered. A quote I really like that highlights the importance of this is from Dan O'Day in the Unit 42 incident response report: "Unauthorized acquisition of a single spreadsheet containing a list of individuals' PII could result in a large data breach, even though the file size itself may be very small." I think this really highlights that these documents are small in nature, but it's of the utmost importance to protect them so they don't get leaked and exposed to people who shouldn't be seeing these files.
So now we have this problem. At Adobe, and at any large organization these days, we're relying on cloud documents more and more for our daily operations. As the scale of cloud documents increases, how do we scale our detection of the sensitive documents that shouldn't be exposed? The answer we arrived at in this project is machine learning.

Before we drop into the model, it's really important in any ML task to define your data very precisely. In this project, we define sensitive documents in two key ways. The first is documents that are sensitive in nature as a whole; for example, a job offer is a whole document that is sensitive and shouldn't be exposed. The second is documents with sensitive information: documents that aren't necessarily sensitive in nature but can contain sensitive information. For example, API documentation isn't sensitive by itself, but it could contain an API key, and that makes it sensitive.

Now I'm going to jump into the first type, whole sensitive documents, and talk through how we go step by step from the plain-text raw document to the final prediction.
The first step, shown in the left gray box, is the plain-text document, which just has words on it: a document with some informative and non-informative words. The first thing we do is an optimization step where we filter out the non-informative words. You might be asking what's informative and what's not. We define informative words as words that help us categorize whether a document is sensitive or not. For example, a document containing the words "job offer" might be sensitive, so those words tell us something. But words like "this" and "is" are present in both sensitive and non-sensitive documents, so we want to filter them out, filtering out the noise to make the model easier to train. That's the middle diagram, where we're left with just a subset of the words. Finally, with those words, we create a vector with the counts of each word. This framework is commonly known as bag of words. If you look at the final vector we output, it reads 1, 2, 2, 1, 1, 0, 0. You can interpret this as: there's one "document", two "some"s, two "informative"s, one "job", and one "offer", and the zeros are informative words from the vocabulary that aren't present in this particular document. Cool, now we have a numerical representation of a document.
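As a rough illustration, here is a minimal bag-of-words sketch in Python using scikit-learn, with its built-in English stop-word list standing in for the informative-word filter described above (the talk doesn't specify the actual implementation, so the details here are assumptions):

```python
# Minimal bag-of-words sketch; the stop-word list is a stand-in for
# the project's "non-informative word" filter.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "this is a document with some informative and some informative words",
    "we are pleased to extend this job offer with the following salary",
]

vectorizer = CountVectorizer(stop_words="english")  # drops words like "this", "is"
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary of remaining words
print(X.toarray())  # one count vector per document, like the 1,2,2,1,1,0,0 example
```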
Now, how do we take that vector, run it through our model, and get a classification out of it? At a very basic level, what we're using in this project is decision trees. A decision tree is a tree-like model that makes a prediction based on certain criteria and boundaries on those criteria. Look at this decision tree example. I'm not a medical professional, but it helps illustrate predicting risk of heart disease. The first question we ask is your age. If you're under 18, we then ask your weight, and whether it's under or over 60 kg determines low or high risk; then there are branches for 18 to 30 and over 30. Just by looking at it, you can see this decision tree might suffer from overfitting. Overfitting is when you tune a model so well to the training data that it doesn't perform well on real-world data. For example, look at the middle branch: is there really no one aged 18 to 30 at high risk of heart disease? Probably not. So why did the decision tree come up with this? Maybe because in the training data there was no one aged 18 to 30 with high-risk heart disease, even though other factors surely go into heart disease. This is simply how this tree learned low risk and high risk, and why it chose to split on age first, then weight, then smoking. The decision boundaries themselves raise questions: why 18 to 30 and not a different range? Why choose age first? These choices can lead to overfitting, because the tree is tuned tightly to the training data but doesn't generalize well to real-world data.
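To make that concrete, here is a toy decision tree in Python; the features and the handful of training rows are invented for illustration, not from the talk. With no depth limit, scikit-learn's tree keeps splitting until it classifies the training data perfectly, which is exactly the overfitting behavior described above:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up training data: [age, weight_kg, smoker]; label 1 = high risk
X = [[16, 55, 0], [17, 70, 0], [25, 80, 0], [45, 90, 1], [50, 60, 0], [60, 85, 1]]
y = [0, 1, 0, 1, 0, 1]

# No max_depth: the tree memorizes these six training rows perfectly
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=["age", "weight_kg", "smoker"]))
print(tree.score(X, y))  # 1.0 on training data -- likely overfit
```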
Decision trees are a very classical approach in ML, and they can be very powerful; given enough data, a tree might not overfit, but it is still very prone to overfitting. So how do we combat this overfitting and make the approach even stronger? One answer is random forests. As the name suggests, we're going from one tree to a forest comprising many trees. A random forest classifier is one where we train multiple decision trees and then average the decision across all trees. This helps reduce overfitting because we make sure each decision tree randomly chooses different decision boundaries and criteria at each level of the tree.
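Here is a minimal random forest sketch, again assuming scikit-learn and reusing the toy `X` and `y` from the decision-tree example above:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,     # average over 100 randomized trees
    max_features="sqrt",  # each split considers a random subset of features
    random_state=0,
)
forest.fit(X, y)  # X, y from the toy decision-tree example

# predict_proba averages the per-tree votes into a single confidence score
print(forest.predict_proba([[25, 80, 0]]))
```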
Going back to documents (in the heart-disease tree, randomization might mean choosing age later in the tree instead of at the top), here's what we do: we take the document at the top and feed it through, say, five different trees. Tree one might say it's 95% confident this is a sensitive document, tree two might say 90%, and so on; we average the confidences across all the trees and get the final score, here 0.95. We can feel better about this number than about a single tree's output, because multiple trees have learned the data in different ways, so we can trust it just a little more.

Finally, we add one more layer above the random forest: we create one-versus-all classifiers, where we train multiple models and each model specializes in classifying one type of sensitive document. In the diagram here, random forest one is really good at classifying API documentation, the second random forest is really good at job offers, and the third handles another sensitive document type. In this example, the job-offer forest is 95% confident this document is a job offer, while the other classifiers report 30% and 5%. So we just take the model with the highest confidence and say the document is a job offer. We chose this ensemble method, multiple binary classifiers rather than one multi-class classifier, because we found it easier to train an accurate model this way. And this picture is our final modeling technique for detecting whole sensitive documents: you've got the document, the one-versus-all random forests, and the trees inside each forest.
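A sketch of that one-versus-all setup in Python, assuming scikit-learn; the category names and helper functions are placeholders for illustration, not the project's actual code, and it assumes every category has at least one positive training example:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

CATEGORIES = ["api_documentation", "job_offer", "other_sensitive"]

def train_one_vs_all(texts, labels):
    """Train one binary random forest per category (that category vs. rest)."""
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(texts)
    models = {}
    for cat in CATEGORIES:
        y = [1 if label == cat else 0 for label in labels]
        models[cat] = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    return vectorizer, models

def classify(vectorizer, models, text):
    """Score the text with every model and keep the most confident category."""
    x = vectorizer.transform([text])
    scores = {cat: model.predict_proba(x)[0][1] for cat, model in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```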
Cool. Now I'm going to talk about the second type of sensitive documents, documents with sensitive information; to remind you, that's, for example, API documentation that might actually contain API keys. The way we approach this is through regexes. If you don't know regexes, they're regular expressions: a language for matching patterns in text. We built regex detection for both secrets and PII, and in our implementation we use a combination of open-source and internally curated regexes. Below is an example of an open-source regex for an AWS API key, and below that an example API key. You can see the correspondence highlighted in red: the "AKIA" part matches that segment of the regex, and the 16 characters that follow match the rest of the pattern on the right.
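As an illustrative sketch in Python: the AWS access key ID pattern (literal "AKIA" followed by 16 uppercase letters or digits) is the widely published open-source form, and the SSN pattern is just one hypothetical example of a PII regex; these aren't necessarily the exact patterns the project uses:

```python
import re

PATTERNS = {
    # AWS access key ID: "AKIA" plus 16 uppercase letters/digits
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # US Social Security number -- a common (and noisy) PII pattern
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_for_secrets(text):
    """Return (pattern_name, match) pairs found in the text."""
    return [(name, m.group()) for name, rx in PATTERNS.items()
            for m in rx.finditer(text)]

# AKIAIOSFODNN7EXAMPLE is AWS's documented example access key ID
print(scan_for_secrets("aws_key = AKIAIOSFODNN7EXAMPLE"))
```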
Cool. Now we have two modeling techniques to detect sensitive documents: machine learning and regexes. So how do we deploy these models, and how do we create an automated pipeline for real-time detection with them? First, we start with the SharePoint API, where we query a number of files for a certain time period. We run those files through our model, and if the model detects that a document is sensitive, we directly alert our users through our user notifier app. The notifier app prompts the user to respond: either one, this document is actually sensitive, go lock it down, or two, this document is not sensitive, go improve your model. We store those responses back to our SIEM and feed them back into the model, establishing a reliable feedback loop so we can continuously improve model performance. I highlight this because you can have really sophisticated algorithms in your machine learning models, but it's just as important to have innovative automation around them so you can make full use of them. And that's our whole pipeline.
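In rough sketch form, the loop might look like the Python below; `sharepoint`, `model`, `notifier`, and `siem` are hypothetical stand-ins, since the talk doesn't name the specific APIs involved:

```python
import time

def run_pipeline(sharepoint, model, notifier, siem, interval_seconds=3600):
    while True:
        # 1. Query files touched in the last time window (hypothetical API)
        files = sharepoint.list_recent_files(since_seconds=interval_seconds)
        for f in files:
            # 2. Score each document with the trained model
            category, confidence = model.classify(f.text)
            if category is not None:
                # 3. Ask the owner: sensitive (lock it down) or not
                #    sensitive (use the answer to improve the model)
                response = notifier.ask_owner(f, category, confidence)
                # 4. Store the response for the feedback loop
                siem.record(file=f.path, category=category,
                            confidence=confidence, owner_response=response)
        time.sleep(interval_seconds)
```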
Cool. Now I'll talk about some key challenges we faced during this project, and I believe these challenges apply to any machine learning project you try to do in security. The first is minimizing false positives. We care about this because we definitely don't want to flood our analysts with alerts; too many false positives lead to alert fatigue. The way we address this is with random forests: random forests are less prone to false positives because of the way their decision boundaries have to fit the data. What you might get from a random forest instead is false negatives, because, as we talked about, it might not generalize as well. But for this particular use case, sensitive documents within one company, you can imagine a job offer doesn't have much variation across the entire company, so in this project our random forest didn't have a high false negative rate either. False positives, false negatives: all good. The second key challenge we faced is building a reliable feedback loop.
In security, it can be hard to get malicious and benign samples, and to get real user responses into your training data. We addressed this with the user notifier app, because it lets us automate recording responses from our users.

Finally, some future projects we hope to extend this with. We want to implement more sophisticated ML models: we have random forests now, and we also want to try gradient-boosted trees and open-source BERT models, and maybe even try ChatGPT on it. We also want to expand the document categories; we have a set of sensitive document types now, and we can definitely make that broader.
And we want to scan other cloud storages, such as AWS S3 buckets, Azure storage blobs, and GCP as well. Cool. If you missed anything in this presentation, or you just want to read it instead of listening to me talk, we also published this as a blog post on the Adobe Tech Blog. It's called "Using Machine Learning to Help Detect Sensitive Information." Go check it out and give it some engagement; that would be really cool. Finally, I want to thank the team members who helped me on this project; it wouldn't have been possible without them: Vero Borosh, Andre Stan, Kumar Vikramji, and Joseph Davidson. I also want to thank BSides for hosting me today. It's been really fun getting to know everyone here; this is my first live conference, so it's been really cool to see everyone and hear everyone's stories. And finally, I have my LinkedIn QR code up here if you want to connect. I'd love to talk about machine learning and machine learning in security, hear what you think about it, and if you have any ideas, I'd love to brainstorm with you. With that, I'll open it up to Q&A.

Moderator: Thank you, Wilson. [Applause] All right, if you have a question, please raise your hand and I will come to you. I see one in the back up there. Awesome.
Audience member: Hello. Do you think this technology could ever potentially be used as a weapon to seek out sensitive documents on SharePoint and report them back?

Wilson: I think, when you put it like that, any detection tool could be used maliciously. But our SharePoint should really only be internally accessible; there should be no way for an outsider to scan our own SharePoint environment. So for us, the use case is using this as an internal tool to scan our own environment. Any other questions?

Moderator: The next question is from the middle, on your left.

Wilson: Oh, and when I say there's "no way," there could always be a way, you know, with attackers; it's just not the scenario in play here.

Audience member: Yeah, so how do you continuously refresh the data you train on? In theory, if the company gets better, there's less sensitive information in SharePoint. How do you know whether your model isn't detecting something because there's less sensitive data versus the model just not detecting things properly?

Wilson: That's a good question. I would think that even if the data is old, if we see that same job offer pop up on SharePoint, the model wouldn't perform any worse, because it's already able to detect it. But I get your point: if we're constantly improving, there would be fewer sensitive documents on SharePoint. You also bring up another note: maybe the document has changed, and that's just a common machine learning problem of refreshing your data once in a while and making sure you have enough good samples to train on. But if it's a document type we already know is sensitive, we should be able to detect it no matter how much time has passed. Other questions?

Moderator: I'm coming.
Audience member: So, in regards to random forests, you didn't discuss ensembles, the ability to combine multiple configurations. Do you have anything you can say about that? I think that's a very important aspect of it.

Wilson: Sorry, are you referring to this part of the ensemble?

Audience member: Just being able to combine multiple models, you know, a collection of models used together rather than single models. I see the trees, but specifically, if you have that model and then there's another one, combining the two models together.

Wilson: Ah, I see. That particular use case we haven't necessarily explored, combining one and two together in their results, if I got that right.

Audience member: So you have an ensemble method, and you're using it where you'd have one multi-class classifier, but you're doing it within one model. I'm talking about ensemble models, putting models together. It has that capability, and I just wasn't sure if you were going to discuss it.

Wilson: Yeah, so with an ensemble method you would want to combine the results of multiple models, if I'm hearing that correctly. I guess the way we approached it is probably simpler than that: instead of combining the results together, we do a max operation on them. So that's how I would interpret that.
Moderator: Last question, up here on your left. Make it a quick one.

Audience member: How well does the bag-of-words method generalize to source code documents?

Wilson: To source code documents? We haven't necessarily explored that, because that's not really what we see on SharePoint, but it would be interesting to see how bag of words would perform there. We're totally open to exploring that.

Moderator: Awesome. Let's give one more round of applause for our presenter. [Applause]