
Welcome, I'm Don Mallerie. This is Adventures in Data Labeling. This is the short version, so I'm probably going to speak very quickly, because the original talk was like 48 slides and we can't do that in 25 minutes.

So who am I? I've worked over 30 years in IT, primarily in critical infrastructure, mostly with a security focus. Today I'm working as a healthcare security professional in a large hospital service that you have probably heard of but that I won't name in this particular setting. Talk to me afterwards if you'd like to know. If certs are your thing, I have some to my credit.
I've helped build, mentor, and run a variety of infosec programs over the years, Hack for Kids Toronto, which was done in 2019, for example, and hopefully we'll bring that back sometime in the future. But I also teach black-and-white darkroom techniques, because sometimes we need things that aren't the same as the technology we use during the day.

These are my experiences; these are my adventures. If you've had different experiences with some of this tooling or some of these things, please share them. Come up and talk to me afterwards, talk to other people around you afterwards. This is how we learn. So I've
done this talk two times before. The talk was done in a very small group, and the very first thing that I learned was something I had not even considered. The second time, somebody brought up something else that nobody in the room had considered, and in the process we all learned something. This is where I've come to; this is what I've touched. You'll touch different things. We all need to learn this stuff together, because what the vendors out there want is for you to pay for professional services to do this stuff, and frankly we can all do it ourselves. We just actually have to have those conversations.

So, standard disclaimer:
the thoughts and opinions as we go through this presentation are mine and mine alone. They are not those of my past, present, or future employers.

Let's go through an agenda. There's no need to take photos as we go through this presentation; all of the references will be shared at the end. The GitHub is actually not populated quite yet, but the QR code will work. The QR code at the end is the right QR code to use. Feel free to play along with the game throughout the presentation; there will be awards for the winners. And as
we go through this, we're going to go through a general overview. We're going to cover a data asset inventory. We will go through some labeling models. We're going to tie this back to the implementation of the services and how you would implement data labeling models within our organizations, and then we're going to tie that back to technology. We're going to get into sensitivity labels, retention labels, and information asset protection scanning, and then we're also going to tie that in with data loss prevention at the end. That's part of the reason why I'm talking quickly, because there's a lot here. As you can see, the path
ahead is not quite clear, but we will get there.

So, some assumptions. There's no such... excellent, you are the winner, you get the duck treat. This duck is tapped. This is the csse
duck. Thank you very much for playing.

So, anyways, on this: there is no one-size-fits-all. What happens in your organization and what works for your organization is not what's worked in the organizations that I've worked in, or the spaces that I've built or been involved in deploying this in over my career. This doesn't have to be perfect, because it's an iterative process. It's not a one-and-done; this is something that you'll do over a long period of time. You don't have to cover everything, because you are always going to have to embrace the rate of change. Things are going to change as you go
through this. The interfaces will change as you go, and they'll change so quickly that you won't actually see it coming. Microsoft puts somewhere between 300 and 600 changes a year into just their security interface; that's in addition to their compliance interfaces and all of the other interfaces as you walk through these tool sets. Pricing is between you and your sales reps. This is going to take as long as it takes to implement, and that's going to be a discussion that you have with your stakeholders. It's not just your change advisory board; this is bigger than just your change advisory board and the IT
initiatives.

So what is this not? Data labeling is not ransomware protection. Scanning your data, identifying your data, labeling your data, putting on encryption: this is not going to solve ransomware protection. That's a completely different talk.

But then we need to talk about AI. There's a talk coming up soon about AI, and I've seen that talk, or at least the longer version of it, and it's phenomenal, so I look forward to seeing it again. The thing to note in particular is that there are no magical AI unicorns that are going to solve these problems for you. When you're talking especially about
the use of AI and machine learning models to tie it back to data, humans label data in exactly the opposite way from how AI does. So this is good: these are tools that can help you identify where your data is in ways that you weren't thinking of. But that doesn't necessarily mean we can just replace all the humans with AI, because it is doing things the opposite way from how we are.

The other thing to keep in mind is that this stuff isn't free. Whether it's the software licensing or your time, which is actually the thing you need to be more cognizant of, there's always a cost to implementing this stuff.

So what is data labeling? My kids love
journals. They love to write everything that's important to them in those journals, and then they mark them as top secret so that everyone knows how important those journals are and that nobody should be looking at them. And then they grew up and they got Instagram, and they thought, okay, we're just going to share everything that we think, as public, with the whole world around us.

More recently, we started doing our taxes. The fact is, there are other people who decide how long I have to retain things when I'm doing my taxes. For example, say you have a contract with an organization that says you're going to have a third party do your
taxes, and that organization messes up your taxes, and then you get audited for the next five years as a result, which happened to me (and I'll name that company when I'm not being recorded). Then you end up having to have a good reason for why you need to keep those records, because someone else made that decision for you, and there are consequences to not keeping that data. Now, where is that data now? Well, that data isn't necessary anymore, because it's well beyond that 7-year retention requirement, and I've deleted it, because I don't want to keep any data that may be a waste, or something that I don't want to keep around or don't want
to be responsible for.

So we classify things throughout our lives, as we go through our days, every single day. When someone says to you, "Why do you want to put data classification in?", ask them which folder they put their mail into, because they just classified their data. They're doing it every day.

So why do we label our data? You need to ask yourself questions about your data. Why is it important to you? Is there going to be impact or injury from having the data lost or stolen at any point in time? There are three things that help people make
decisions: you lose your job, you lose your house, or you end up in an orange jumpsuit. These things are covered by legal and regulatory requirements. So when we're thinking about this data, you need to think about how you are going to store the data. What type of auditing requirements need to be in place for that data? Do you have any special tools or retention of those logs? How long do you have to keep it for? Where should you be keeping it? Should you be keeping it at all? Is there a freedom of information requirement or an eDiscovery risk to having that data? Do you need to have a reason to disclose
it? If it's lost or stolen, who do you tell?

If you're looking at the CIS Top 20 or any of the other frameworks, you'll notice that they all have a requirement for some form of inventorying tools. They always focus on software and hardware, but frankly they're starting to move towards data, primarily because we've all moved our data into these cloud infrastructures. That makes perfect sense at this point in time, but it also means that we need to move away from expecting that our data is going to be protected by the firewalls, by identity protection, by all of these other things in front of it, and we need to start moving those
controls closer to the data. The fact is, all of those other things, the network, the firewall, the IDS, your MFA methods, all of that other stuff, that's either an access method or it's the user, not the data.

So what is a data asset inventory? It's a literal inventory. Someone has to go out and talk to your users and find out what they're actually doing. What are they using their data for? What are they storing? Why are they storing it? How are they storing it? How are they working with it? What does that mean? Where does it go? What are they going to do with it? Someone physically has to actually talk
to these people and figure out what those business processes are and what they're tied to. That's going to drive the type of information that you're going to gather about your data, and it's also going to tie it back to those regulatory requirements, which you can look up; they're all available on various government websites. But really this comes back to data lifecycle management. You're going to use that as an input.

So when you start looking at data from a sensitivity perspective: governments have been labeling data for sensitivity for a very long time, and most of the sensitivity models that we use for labeling data are based on things that
we've tied to either a government or a regulatory body in some fashion or other. Each of these models has different purposes, different impacts, and different risks, and when you consider how data is labeled, you have to think about them in terms of an injury test: how much damage will be done to your organization if the data is lost or stolen or whatever it happens to be. They use terms like grave damage, critical damage, serious damage, mortal damage when they're discussing these things, but when it comes right down to it, it's really about how you should be treating this damage... this data, sorry, not
the damage.

So what about some alternative models? Well, there are some models that tie back to data subject rights, or to specific regulatory requirements, that might make sense for you. But if you looked at the previous model that we had, even the one for healthcare (I work in healthcare), there are five labels in the standard model in healthcare. If I add those five to these, now I'm at 14. That's a lot of labels. If you have to have a flowchart for how you're going to look at data labeling, you're doing it wrong.

So, oops, labeling for retention. Most people believe they should keep everything. It comes
down to risk, right? Sorry, it comes down to the risk of whether you need to keep something versus whether you should keep something. Some people feel like they have to keep something forever. Frankly, they don't; most people don't really need to keep stuff forever. Retention is driven primarily by legal and regulatory requirements. One helpful resource is linked here at the bottom and in the slides, and it comes from the Ontario Hospital Association. It's a giant spreadsheet of every single reason you could possibly think of for why
you would want to store or retain data. Now, if you're looking at things in terms of hospitals, most people think, oh well, patients: we've got these services, you have to keep stuff related to patients. Well, we also run these great big huge government buildings, and in my org we do power cogeneration and all sorts of other stuff, so NERC CIP applies, and all sorts of other things. There are lots of different reasons that you might want to retain data, and this is a great guide for getting you there and figuring out what those are.

But just a show of hands: who thinks that retaining the data is actually the problem? You know, making sure
that you have the right amount of the data for the right time? The problem isn't retaining the data. It's how the heck you are going to be able to read it, and what you are going to do with it, in 150 years. Because that 150 years, that's real.

So we need to talk about stakeholders. The first time that I was introduced to the concept of data labeling, the CEO of the organization I was with came down and did a presentation on why we are doing this, what we are doing, how we are doing it, and most importantly, how you are going to do it and where you get information about it. It made sure that we understood that
the CEO was behind the initiative from the very beginning. When you have a CEO standing behind an initiative from day one, you know that it's got legs, that it's important, that it actually means something to the organization. That's your stakeholder; that's the one that you want involved. If you can't get that stakeholder, find someone in as high an executive position as possible. If you do not have a stakeholder who is in a position to drive it forward across the organization, you will not succeed.

So when we talk about implementing data labeling: now that you've got a sponsor, you need a policy. eHealth Ontario did a whole bunch of really
great work to put together free draft policies that everybody is allowed to have a copy of. You can Google them: look for "eHealth Ontario" and "standard" and you'll find all of their policies that are available for public access. It's a great draft to start with, but your policy needs to cover all of your data, in all locations and all forms. It also has to cover how you're going to handle the data throughout its lifecycle, what types of controls you're going to put in place, and how you're going to declassify it when you're done.

But what do you do next? Well, you need a comms plan. You need your PR team
involved; they have to be involved throughout that process. You need to ensure that there's mandatory e-learning in place as you go through and expand it out to users. Nothing beats a quick reference card that's been deployed to every single user's desk as something to refer to. And then you need to go in and start looking at tech.

Now, as you look at the tech, one of the things you're going to have to think about is: what about printers? There's an easy solution: get stamps. You think I'm kidding, but I'm really not. The best tool and most effective engagement method that I've ever seen was that they had stamps at every single
printer that literally said Official Use Only, Protected A, Protected B, Secret, Top Secret, whatever it happened to be. The extended piece of that was that if you found a document on a printer that wasn't labeled and you got it to a director, they'd trade it for a Tim Hortons gift card. You have no idea how many Tim Hortons gift cards I won. It was so much fun, but it worked.

The reason that you have to worry about unlabeled documents is really important. When you are labeling all of your data, any unlabeled document is public or unclassified. So when
your HR department decides to print out a copy of the entire pension repository, and they leave it on the printer, and then it accidentally gets dropped in the parking lot and somebody picks it up, that's fair game for the front page of the news. Most news agencies will just look at it and go, yeah, cool. Whereas if they see "privileged and confidential" or something, many news agencies might be like, whoa, hey, we're going to get a really, really big bonus for this one, but they might also actually make a phone call and make sure that it's not going to harm someone before they do that. It depends on the news
agency, though.

So, some implementation challenges. Nothing you ever do is simple and challenge-free. If you forgot your asset inventory, well, you're going to have a hard time figuring out where your stuff is so that you can actually get moving. If you've missed your executive sponsor, well, good luck with that. Your policy got stuck in review cycles and never got published. Bad or missing examples in that policy: well, if nobody knows how they're supposed to label something or what they're supposed to label, it's going to be hard for them to do that. If your education materials are not mandatory,
then nobody's going to read them, and nobody will actually be paying attention whatsoever, and then nobody will know what they're doing, and you'll be back to all of these things being in the wrong spots.

Do not exceed five labels. I'm quite serious about this. Remember that flow diagram I showed you earlier, with the great big chart and which directions things are going? If you need a flow diagram to tell people how to label data, they're going to be confused and you're doing it wrong.

Everybody hates watermarking. They all hate watermarking,
everyone does. They don't like it at the top, they don't like it at the bottom, they don't like it diagonal, they don't like it in the middle, they don't like the color, they don't like the font. Whatever it happens to be, they don't like watermarking, and the reason is that they see it as unsightly. So you can be forced to turn that off, and sometimes you are, but remember that as soon as you do that, the document that's unlabeled is public. The other thing is that most people actually just get used to it. It just becomes part of
the infrastructure. And those unlabeled documents, that's where DLP starts to come in.

So implementing the business side was easy, right? Everybody with me so far? Because we've got 10 minutes left and we're going to race through the rest of it.

Okay, so, Microsoft Purview: the all-seeing eye of Purview, because that's not the creepiest logo that you've ever seen. Like everything in Microsoft 365, there's always a new portal. The portal is brand spanking new. It covers the data governance, risk and compliance, and data security pillars. It's got lots of tools in there, it's really hard to find things, it's really sluggish, and it changes almost daily. We're
going to dig in specifically to the information protection, data lifecycle management, and data loss prevention sections.

Sensitivity labels. Before you start doing anything with sensitivity labels, if you're planning on applying them to SharePoint and M365 groups, there's a PowerShell command that you need to run that ensures the labels are pushed into Entra. If you do not do this, they will not show up. It takes days for it to happen, and then you'll have to try and figure out how you're going to do that after the fact, which is kind of annoying. If you don't enable co-authoring at the beginning for
documents with sensitivity labels on them, then you will find that all of the data and the documents have to be relabeled, and all of the users have to be reconnected to be able to do that, and it's not fun. Just don't. Labeling in Teams requires premium licensing, so consider that.

So you remember we worked on a plan with our stakeholders; we're going to turn that into reality. You need to make sure that you plan carefully. You can't delete a label, so if you name it wrong... sorry, other way around: you can't rename a label. You can delete
a label; you can't rename them. If you create a label incorrectly, like you spelled "Health" wrong, you put the T and the L backwards, because that happens, then you'll find that if you go and delete that label, all of the data that was labeled with that previous label is no longer accessible until you apply a new label to it. Please, whatever you do, do not delete a label. 15 characters is generally the limit for a name: while you can go longer, most of the interfaces don't actually display longer names very well.

What else have we got? Order them
from least restrictive to most restrictive, and ensure you scope everything that you can. Content marking is global; however, I did just learn as part of a previous discussion that it is possible to narrow down the positioning of those markings with a series of really oddly deployed PowerShell scripts, so that was an interesting learning. But think about how you read email, for example. If you put the label at the bottom, the label really only applies to the things after the label, right? So if you're reading it from the top and it says "protected, sensitive" up at the top, and then you get to
the bottom and it says "top secret", you're like, oh well, only the part at the bottom is top secret, right? Generally you want to ensure that the most restrictive label always applies to all the content. So if there's anything that can inherit labels, you want to make sure that that's applied, and I think that's actually part of a different slide.

What else do we have? Content marking, access controls; I think we're in there with the access controls. You want to make sure that you allow the users to choose the right labels that they want to apply, and the controls that they're going to apply to the
labels need to be user-selectable. When you're talking about external users, you can also control some of those functions as well. Okay, we don't need that. Generally that's what it looks like once you've created some labels.

Now you've got them; now you need to publish them. And we're skipping that. Okay, so once you publish your labels, you'll publish them to all of your users. Make sure that you require users to apply a label, and regardless, set a default label. Set that default label to Internal. If you set it too high, then you've over-classified data that you don't need to.
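
As a rough illustration of the two rules just described (order labels from least to most restrictive so that the most restrictive one wins, and fall back to a default of Internal for anything unlabeled), here is a minimal Python sketch. The label names, `DEFAULT_LABEL`, and `effective_label` are hypothetical and are not part of Purview or any Microsoft API; the sketch only models the policy logic, under the assumption of a small four-label taxonomy.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Label(IntEnum):
    """Hypothetical taxonomy, ordered least restrictive to most restrictive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Assumed default: anything a user has not labeled is treated as Internal.
DEFAULT_LABEL = Label.INTERNAL


def effective_label(parts: Iterable[Optional[Label]]) -> Label:
    """Return the label that should cover a whole item.

    Unlabeled parts (None) take the default, and the most restrictive
    label found anywhere in the item applies to all of its content.
    """
    resolved = [DEFAULT_LABEL if part is None else part for part in parts]
    return max(resolved, default=DEFAULT_LABEL)
```

For example, an email whose body is unlabeled but whose attachment is Confidential would resolve to Confidential for the whole message, while a completely unlabeled item resolves to Internal rather than Public.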
If you set it too low, like Public, then you've under-classified everything. The problem is that you don't want to have too many restrictive controls and friction on things that you don't want over-labeled, but you also don't want to expose things to easy access or risk. Users should be allowed to reduce the label at any point in time; however, they need to record why they're doing that. When you do that, you'll find that there's an audit log entry that ends up in your audit records, so you'll actually
know when somebody decides to drop the label to Public from Internal, and what the file name was, and where it was located, and all those other things. By the way, that's entirely doctored; that's actually not real. Five minutes? Okay.

Okay, so, retention. We talked about all the sensitivity stuff. You start with retention policies. These are your defaults; you'll apply these to everything. You want to make sure that these apply as the minimum requirements for what you're going to do for your organization. The retention policies need to tie back to your regulatory requirements. But you didn't know you were running a Yammer instance,
and you probably want to delete those. You also want to think about what you want to wipe out on purpose. When you start looking at retention labels, those are actually not visible to your users except through the OneDrive interface, and it's really, really frustrating, but you want to make sure that those are available so that you can use them for auto-labeling and auto-classification.

Which gets into sensitive info types, trainable classifiers, and exact data match. Sensitive info types are basically regexes, and you're applying them to stuff. Sometimes they're okay; mostly they're not, but you can tune them to be better matched. Trainable
classifiers are important because you can actually train them on your data.

Auto-labeling for retention and data sensitivity: you use those other things, the sensitive info types, the trainable classifiers (which are the machine learning models), and exact data match, to apply either a retention label, which is the best way to do it, or a sensitivity label. The sensitivity label is really helpful when you tie that back to your information asset protection scanner, which will scan through all your on-premises stuff in SharePoint and SMB shares (noting that you need a SQL Server and an account with access to everything). It gives you an ugly-ass spreadsheet that you can go through to see what the
content looks like and where it is all through your environment, but that is going to help you at least identify where your stuff is.

You'll tie that back to data loss prevention. So you see how these have all rolled together: your sensitive information types, your trainable classifiers, your sensitivity labels, your retention labels, these all tie back together into data loss prevention. This is where you're going to apply controls to make sure that people are doing the things that you expect them to do, in the ways that you're expecting them to, and where they're expected to. Some things, credential matches in particular, are
really, really accurate sensitive info types, but the generic passwords and any of the generic keys, they don't work at all. Whatever you do, don't use those.

Which takes us to this: when you are using data loss prevention (don't bother with a picture, because it's in the link in the GitHub), if you do use data loss prevention rules, make sure that you scope things properly. You'll notice that this is all Exchange Online and this is all the other stuff. So if you decide you want to include Exchange Online, Teams, and OneDrive, you've now got it down to like three things that it's
actually going to select on, and the rest of it is completely a waste of time, because you've now just dropped all of those selectors off your list of things you can use. That's pretty much what it looks like once you put some policies together, and we're going to skip right past that.

So what do you do when you're talking about the actual policies themselves? As we start looking at the prevention policies, we want to make sure that we're tweaking for the data that is known sensitive, with labels. We're going to allow certain things, like T4 slips to go home, or the employee
discount book, because for some reason Canada's Wonderland is so important that they have to have a username and password to get discounts. You're then going to nudge users by giving them a reminder that they're about to do something that's not good for them or for the organization. And then you're going to block things that are tied back to those regulatory controls, or to the three magic things that you don't want to have happen to your organization.

So we're going to rip through the limitations and frustrations really quickly, because there are a lot of them. Basically, a lot of this stuff doesn't work the way
that you expect, but there are ways for you to work through them. Feel free to refer back to the slides. What could go wrong? Well,
everything. That's what it looks like on the bottom right when you do it wrong: there are 11 labels in there, and they're buried in multiple layers. That's what it looks like when you do it right: it's straightforward, people can see where it is, there's an info link that ties back to your educational materials, and it's right in people's faces. Licensing: E3 pretty much covers all of it, F5 covers the rest of it... oh, sorry, E5 covers the rest of it. And there's always more, right?

So, some quick conclusions. When I started pitching this concept, everybody thought it was an intractable problem, that we couldn't do it. The fact is, they were right. But that's the thing about
security: it's not the easy things that have the most impact in our organizations. It's the things that we take the time to build that have the most impact, as we educate our stakeholders and discuss those risks and impacts to the organization. Deploying the technology was easy, but you have to have a plan to make sure that you actually deploy this out to your organization in a timely fashion, and the education materials are the most important part. Understanding the requirements before you apply the technology is wonderful and very, very satisfying, but it's also really awesome
when you finally get that first DLP alert, and somebody calls you back and says, "Whoa, hey, wait a minute, I didn't know that file actually contained a SIN number that I accidentally sent out, because I thought it was a blank template." And then you're like, wow, a security thing actually got a positive phone call from somebody on a team out in the field. Being able to simulate things has a really large impact on success, and making sure that you're nudging your users has a really positive impact on changing behaviors, and it helps you out in the event that you actually
get to the point where you're in a breach situation.

So, lots of resources and links. You don't have to take a picture; that's all going to be shared. I'm going to click right through it. We don't have time for questions. This is the real QR code, and thank you again for playing. [Applause]

That was great, thank you, Don. And in fact, if you don't mind staying for one or two questions while we get Ian down here and set up... No worries, come on. Oh, okay, you have a question? Go for it. Oh, perfect. Yeah, you're going to have to speak
up.

Yes. So the question is: is the expectation that the end users are going to classify their data? Yes, the end users are going to classify their data. You select a default so that everything gets labeled, because anything that's not labeled is public, so you want to make sure that there's a label assigned to everything. Then you want to use auto-labeling policies that tie back to anything that's not labeled properly. But keep in mind that as you're doing this, you want the users to be able to correct that label. You don't want to force it on them, with the exception that if you have regulated content that absolutely needs to be a
certain way, make sure that it is treated that way. So SIN numbers, personal health information: if you have a classifier that works for that, make sure that that stuff is actually labeled as PHI or PII or Confidential. And credentials: Restricted. Don't let people share those credentials through