How Canva's IR Team Avoids Alert Fatigue and Burnout

BSides Perth · 2023 · 27:49 · 224 views · Published 2023-08
Style: Talk

Thank you all very much. After me is the party, so we'll, uh... [inaudible] Technology, gotta love it. So, huge thanks to the organizers for an amazing conference, as always. After me is the bar, so we'll try and get through this quickly for everyone.

I'm Raymond Schippers, the director of security engineering for detection and response at Canva. Before joining Canva about three years ago, I worked for a vendor doing IR and CTI consulting for nearly a decade, along with a bit of a photography and aviation gig. For those of you who don't know who Canva is, we're a visual communication platform founded here in Australia; HQ is still here in Australia, in Sydney. To give you an idea of the scale our team protects: we see over 200 new designs created on our platform every second of the day, and we have over 135 million monthly active users and three and a half thousand staff.

Today we'll briefly talk about why I'm talking about alert fatigue and burnout, why it's important to me, what we did differently at Canva, and some of the lessons we've learned, and then we'll do some Q&A.

So why talk about burnout? Mental health and burnout are something I'm incredibly passionate about. Throughout my time in the industry I've seen the impact on many, many different engineers, both personally and on colleagues and people we work with. From our own experience, I know that organizational structure, process, procedure, and the culture within your organization and team have a huge impact on reducing burnout and the impact of the stress of our jobs. I'm fortunate that I'm in a position at a company where we get the full support to try something a bit different to deal with these issues.

So what is burnout? Burnout is actually identified by the World Health Organization as a syndrome, and it's listed in their International Classification of Diseases, where it is recognized as resulting from unmanaged chronic workplace stress. So
it sounds familiar. It leads to feelings of exhaustion, negativity, and other symptoms which I'm sure we've all seen in the workplace, and they say on their website that it results in reduced professional efficacy. But for me personally, I've seen the huge impact it has not just on people's professional efficacy but also on their home lives, their kids, their partners, everything else. So for me, leading a large security team, burnout is a massive issue because it impacts not just our work but also home. From a business perspective, the cost of burnout is far, far greater than the cost of prevention. On the way in today I was actually listening to a fantastic podcast on how to improve your SOC, and they made a fantastic point: if your staff are burnt out, if you're constantly relying on your IR folks to do the same thing day in, day out, then when they're burnt out they're going to make bad decisions. And that's not just bad decisions that could impact them or create some performance management issues; it could impact the security of the company.

So have we solved burnout at Canva? Can we all celebrate, have we found the magic formula? Unfortunately not. What we have done is create a culture where it's okay to talk about it. We've worked very hard, and continue to work hard, to improve processes to prevent it wherever possible, to deal with it in a positive manner when it occurs, and to drive change when it does happen. We have a very open communication style and system within our organization to talk about burnout and how our staff are going, and also a strong team culture of reporting it. Not everyone's going to say, "Hey, I'm burnt out." Some people will go, "You know what, I'm just going to suffer through it," or they may not even realize. So we've started to build a very strong culture of people calling out, "Hey, my colleague, my team lead, this other person I'm working with, they're starting to look a bit burnt out. What help can we give them?"

So why talk about alert
fatigue? It's a very common, probably overly used phrase in the industry, used to talk about selling solutions and things like that. For this talk today I'll be talking about alert fatigue simply in the context of seeing the same alert day in, day out, and how that really grates on you. It also tends to create a lot of staff retention issues: if you have high alert fatigue, or constantly the same thing day in, day out, your staff are going to get pretty bored pretty quick. It also significantly reduces your organizational efficacy. If you have SOC staff sitting there day in, day out responding to the same thing, going, "Yep, false positive; yep, false positive," they're not driving actual change inside your organization.

So what have we done at Canva? Well, we started to regularly report on our alerts. This is not just within our team, just to myself or to the IR leads; this goes all the way up to the CISO. We've also started to foster, as I mentioned, a strong culture of openness and health and well-being. We're incredibly fortunate that it's a strong culture within Canva itself, but we continue to foster this culture and build upon it within the security detection group. We've also really focused on automation of alerts: we automate as much of our alerts as possible to make our own lives as simple as possible. We have a rotation system, which I'll dive into, and we have a very strong feedback cycle on alerting as a result. And we regularly review our ways of working: every six months or so we'll gather the team in Sydney and actually sit down and understand what's working, what's not, what processes we need to improve, and what we need to change.

The business reporting that we do, all the way up to the CISO and up to the executive, covers how many hours of out-of-hours paging there are and who is actually on call. So within our organization, if somebody's on call for more than X percent of the time, there's actually a red flag raised and highlighted to their manager. We also look at the
quantity and scale of incidents: how many incidents are we getting out of hours, what kind of support do we need, what is the effect. This reporting goes all the way up to the CISO and beyond, but it is also openly discussed within our team to make sure that we have the right resources and we're looking after our staff. It's very easy to get those incident responders who want to be the hero. A lot of IR people want to be seen as the person you can depend upon 24 hours a day, seven days a week, but after a very short period of time they get burnt out very quickly. So we need to make sure that people let go of their Legos and share the load.

As we're an incredibly remote team, we also use Slack heavily, so we have an optional daily check-in where people can share how they're feeling and what their workload is. It's a very simple red, yellow, green system, where people might select, "I feel green today: energized, caffeinated, let's go, workload's all good, I'm happy." Or personally, if I've had a rough night because one of my kids is sick, I might say, "Hey, I'm yellow today, a bit exhausted." Or, "Hey, I've spent the last 36 hours with barely any sleep on a major incident, I'm red." Everyone's incredibly transparent, and over 80% of our team does this every day, and it's been fantastic.

So what is this rotation? We have a unique organizational structure in detection and response. I look after the group, and we have an incident response team, which has a team lead and three pods, which I'll dive into. Then we have a threat detection and hunting team, which has a team lead and some engineers, and a CTI team, which has a team lead and engineers. Just to clarify, at Canva everyone basically is an engineer. Most organizations would call them analysts, but we're all security engineers. And then you'll see the Green Team has no one, but we'll get to that and why that is.

You might have noticed that the IR team didn't have engineers; it had pods. Well, pods are small teams of engineers where
they have a pod lead, who is a senior security engineer, and they are their team lead, supervisor, manager, whatever term you want to use. We call it a coach. They're the coach who goes with them for all their rotations. They do the day-to-day supervision, ensure all the appropriate work is being delivered, and also take care of the leave, the performance management, all of those things. The pod members are our security engineers, who range in seniority from associate, which is very early career, all the way through to senior, principal, and staff engineers. Those are the people who do the day-to-day work: the alert triage and the actual IR forensics.

So how do we rotate? Typically on a normal day, which is 99% of our work, luckily, the first pod will be on IR. Pod B will be doing detection, hunting, and CTI, so one of the people from the pod will go and do CTI work, and the other three people in the pod will go and do detection and hunting. And then Pod C is assigned to Green Team. So whilst we don't have a permanent Green Team, everyone in the incident response group (team, I should say) does this Green Team work as well, and so Green Team reports back to IR.

When our folks are on incident response duties, they're doing triage of alerts: they receive all the alert notifications, receive all the incident reports from the organization, from third parties, from customers, and they do the actual forensics and coordination and everything else that's required. But when they're on the detection and hunting rotation, they'll actually manage our SIEM infrastructure, do detection engineering, perform threat hunts, and get fully engaged with the detection and hunting team to do all the work that they normally do. So as a result our team is very, very well versed in exactly what's happening inside that space. Similarly, when someone is assigned to the CTI team, they'll triage the CTI alerts on our adversaries, produce intelligence products, and assist with
CTI automation and tooling, just like all of the security engineers in the CTI team. Then the favorite rotation for most people is Green Team. This is a dedicated rotation where they focus on process and procedure improvement, automation of all of our alerts, and training. They'll be doing this for a month. So they'll spend one month doing incident response, one month doing threat detection and hunting, and then one month on Green Team. So out of three months, they have to triage alerts and respond to incidents for one month. Typically, within the pod, they'll also assign a person to be the primary, as we call it, who will be doing the initial triage for one week. So out of a three-month period, any one engineer should hopefully only be triaging alerts on a daily basis for five days. Then they rotate off, receive the confirmed incidents, and start doing the IR work. That way we reduce how often they're seeing the alerts and hopefully start reducing a lot of the alert fatigue, but they also then take that knowledge of the alerts with them.

This has resulted in a huge amount of professional growth. We have early-career engineers who've come in and joined us with an IR focus, and they're now exposed to a deep understanding of CTI. When they go into the CTI rotation, we've built a training program: what is CTI, how do you assess intelligence, and so they go through that. When they go to threat detection, they then understand how to do detection engineering work, how we onboard logs, and how we do it at Canva scale. As I mentioned earlier, we have a very large scale, so we definitely have some unique challenges. It also significantly reduces the alert fatigue, as I mentioned, because they're seeing the alerts for maybe one week out of three months. So rather than seeing the same alert day in, day out, they only see it for a little bit, and the alerts that they do see they then take back to the threat detection team and go, "Hey, we should fix that." It also significantly
reduces silos within the organization. I've seen many very large organizations where the threat detection and hunting team does their thing, they find something, they throw it over the fence to IR and say, "Good luck." Same with CTI. But because of this constant rotation of staff and resources and capability, everyone is fully cross-trained and everyone understands what is going on in the various areas. Also, if CTI produces a report, and they produce a lot for us, they understand the process and the capability of CTI and what the limitations are, and that significantly reduces friction as well.

So the huge results for us are that we have much happier Canvanauts, we've significantly reduced alert fatigue, and we have a highly cross-trained group. Every one of our engineers now has an enhanced understanding of the capability of each area and the challenges they face. So when they see an alert that's a false positive, they understand the challenges the threat detection team faces in creating alerts, and they provide much better feedback on how we should fix it, because they may be the ones who actually created that alert in the first place. It also incentivizes them to create better alerts in the first place, because they're the ones who are going to get paged.

We definitely have some challenges and improvements to make. Not everyone likes every rotation; IR definitely seems to be the least favorite rotation. And the IR muscle, we actually noticed, atrophies. IR is a repetitive process and you kind of build up muscle memory, so if you're doing it day in, day out, you get quite fast at it, you understand where things are, and it's easy to respond. We noticed this atrophies quite quickly, so typically by the time people get back into the IR rotation they need some exercises and training. So that is an area we'll focus on. Interestingly enough, we've already made some changes in some of the other rotations. We used to send people to the CTI team for two weeks and then to threat detection and
hunting when we split up the pods. But we found that for people going onto CTI, their CTI muscle had kind of atrophied as well. It took them about two weeks to get back up to full speed, and by that time they'd be rotating out of CTI. So the feedback from all the engineers and the CTI team was that the engineers were not providing any valuable CTI; they were just kind of getting up to speed slowly and then rotating out. So we've now moved it up to that full month, and that's borne a lot of fruit.

We're also about to move to a follow-the-sun model with one pod overseas, so that's going to get interesting. Rotating these roles within one pod and passing work across in the Green Team gets quite challenging. The Green Team, because it doesn't have permanent resources, can't have engineers with that kind of knowledge of the background and what's happening, so we're working to improve that process.

Some of the other key things that we found: do a lot of automation, as much as possible. We have a very large automation platform where we try to get the alerts in and automate the triage and the response as much as possible. We also, wherever possible, push alerts to our end users. Say, for example, there's a non-PII-containing piece of data that's being shared publicly on Google Drive. It's still kind of against company policy, so we'll actually just notify them on Slack. We'll say, "Hey user, this is against company policy. Did you mean to do this?" If the user ignores us, we escalate to their manager. If the user says, "Yes, I meant to do it, there's a reason," fine, we'll close it out. If the user goes, "I have no idea what you're talking about," that's probably a very good signal that the user's account has been compromised, and that then pages us. So that is one alert that generates a lot of noise and that our team never sees unless the user fails to interact or it's the "I don't know what you're talking about" response. So we try to page or
notify our team about alerts as little as possible. We've built out this kind of hierarchy of ideal situations, where we have a lot of logs that generate some events, and they generate slightly fewer alerts, but then we automate those as much as possible to either discount them, make them informative, or just feed back into the alerting. We're doing a lot of ML-based alerting these days to look for unusual patterns, and only then do we present those alerts to the responders.

Huge thanks to Paul, our CISO, for allowing us to try this. When I proposed this to Paul about 18 months ago, he went, "I've heard of a lot of people trying it; it never works. Good luck." And he enabled us to do it, and it has worked. Huge thank you to everyone in the detection and response group as well. It's the feedback from our team that has enabled us to keep doing this, and we have an amazing group of people as a result. Questions? Thank you.

[Audience question] In the event a user is compromised, in Slack it will be the actor who's responding to it. How would you...?

Yeah, absolutely. So if the user is compromised, we have other signals than just Google Drive; if we're relying on that signal, that's our very, very last hope. Absolutely, it could well be an adversary that responds and says, "Yes, I meant to do it." But if a user is sharing a lot of Google Drive documents, that would feed back into the other alerting mechanisms that we have, to say, "Hey, this one user has shared everything in their Google Drive." That's probably a strong indication [inaudible], so we have different numbers for that. Yes, so we have out-of-band communication as well.

[Audience question] What tool are you using for automation?

Yes, so we use something called Tines.io.

[Audience question] Just a question around how you manage that when you rotate. If you're a principal engineer in the team, you know, at your principal level, then how does that... [inaudible]?

Yeah, so because the pod is a standalone team, they have a senior engineer with
them, and that senior engineer goes with them. We spread the experience across all of the pods, so each pod will typically have one senior engineer, one associate engineer, and one mid-level, kind of just normal, engineer. They are meant for IR; our kind of mantra within the team is "we respond to incidents above all else." So whilst, yes, somebody might not be the strongest threat detection engineer, their team lead is their coach; that's more for performance management, leave, those kinds of things. And the reason they rotate with them as well is because our head of IR can't really have a good indication of what's happening when they're doing threat detection and hunting. So when it comes to performance review time, it'd be incredibly hard for her to go, "Yes, you've done an amazing job in creating all these detections," because that's not her area. But the person who manages their day-to-day work is with them and as a result can tell us that. But also, yeah, we've spread out the experience amongst the pods as well.

[Audience question] Sorry, has there ever been a case where someone is called back [inaudible]?

We don't typically do that for a particular person. We have dedicated specialists for, say, malware reverse engineering or hard disk forensics; we have a separate dedicated team for that. But absolutely, if the caseload of IR increases, we will just ask for people to come back to IR. Our team is incredibly flexible, so we'll just jump on Slack and say, "Hey, IR is getting a little bit overwhelmed, who's available to jump in?" and everyone typically volunteers.

[Audience question] You mentioned that the Green Team... [Music] ...if we're in the middle of an IR when we rotate, do we just keep doing the IR?

Nope, everyone hands over their cases. This is one of the things that was challenging initially, to force the case handover at the end of your IR rotation, but we found it works incredibly well. What was
happening was exactly that people woul