
Automated Triage Collection at Scale in the AWS Cloud

BSides Dallas/Fort Worth · 55:43 · Published November 2021
About this talk
BSidesDFW 2021, Track 2, Session 4 – 06 Nov 2021. During a cybersecurity incident, answers are needed quickly. This generally starts with an incident responder performing a triage collection to pull back targeted host-based artifacts for analysis. Manual workflows to perform this triage collection are not only time-consuming but also inefficient and prone to human error. This talk discusses an event-driven workflow to perform these triage collections at scale in the AWS cloud. It leverages AWS Systems Manager (SSM) to perform triage collections from Windows and Linux EC2 instances and can accommodate EC2-backed containers as well. The solution is easily customizable and outputs triage collection packages to S3 that can be tailored to fit a company's IR standard operating procedures. Ryan Tick is a Manager at KPMG, based out of Dallas, where he helps clients in the areas of DFIR and cloud IR automation in AWS. Prior to KPMG, he was a senior cloud engineer at Goldman Sachs, responsible for securing their footprint in the AWS cloud. Ryan has previously presented at a variety of conferences, including AWS re:Invent and fwd:cloudsec. He is 4x SANS certified and 5x AWS certified.
Transcript

All right, well, good afternoon everybody, and welcome to my talk, Automated Triage Collection at Scale in AWS. I'm Ryan Tick and I'm your presenter today. Real quick before we get started, a huge shout-out to all the BSides DFW organizers. I can only imagine how difficult it is to plan a conference in general, and it's that much more difficult during a global pandemic, so for everything they're doing behind the scenes to make sure this conference runs seamlessly, really, a huge shout-out to them. With that, I'm super excited to attend a lot of your talks today, especially this afternoon, and I'm really excited to share with you some information about the triage collection solution that we developed here.

So let's go over our agenda. First we'll talk about the objectives of the solution, what it looks to accomplish, and we'll frame those against the problem statement, the problems we looked to tackle with this solution. Next we'll talk about the benefits of a triage collection workflow and compare those against a full disk collection workflow. Then we'll deep dive into the solution; for the more technical folks in the talk today, I think this is what you're really going to enjoy. We'll step through the different Lambdas and the different parts of the step function, and talk about some key services that we use.

And then we'll end with what I like to call pro tips, which are basically my lessons learned from the struggles of designing the solution and getting it to where it's at today, and I'll give you some further learning resources at the end. These touch on a variety of different DFIR topics in AWS that I find valuable; I'm biased because I've presented a few of these before, but I think they're helpful supplements to this talk as well. So before we get started, let me tell you a bit more about myself. I'm a manager at KPMG, based out of the Dallas office here.

I specialize in traditional incident response, and I am one of our more technical resources in AWS. Within AWS I focus on security engineering and architecture, specifically around digital forensics and incident response, so I try to make the incident responder's job that much easier in AWS by automating a lot of the tasks. I do have quite a few credentials next to my name: (ISC)², SANS, and AWS have all given me some. I recently passed my Solutions Architect Professional; that was no joke, and I'm super proud to have that one. I'm working towards the networking specialty now, and I've heard that's even more difficult than the SA Pro, so I definitely have my work cut out for me there.

In a past life I was a senior cloud security engineer and architect for Goldman Sachs. That was really interesting; I worked on our Marcus team, and if you're not familiar with Marcus, they're the team that worked jointly with Apple to develop the Apple credit card. About halfway through my career with Goldman I shifted into a more firm-wide capacity, responsible for security orchestration and automated response. By that time we had nearly 3,000 AWS accounts, so that was operating at a planetary scale too.

I've presented at a lot of different conferences; a couple I'm really proud of: obviously I've presented at BSides before, and I'm super excited about this one in particular given that I live in Dallas, so I'm super excited to help out the community a little bit. I've also presented at AWS re:Invent last year, and I partnered with AWS to give a Q3 tech talk as well, so that was really interesting. And lastly, if we've got any Fighting Irish on the call, I am a University of Notre Dame grad, so go Irish, and I encourage all of you, whether or not you went to Notre Dame, to connect with me after the talk and let's chat. So with that, let's talk through some of the objectives of the solution.

These date from its early design stages. The first one may seem pretty obvious given that this is a triage collection workflow, but we needed to design an automated workflow to allow for the triage collection of EC2 instances at scale. It needed to cater to Windows, Linux, and Mac instances, since you can spin up all three of those on EC2 instances, and it needed to work in same-account, cross-account, and cross-region scenarios. What I mean by that: for same-account, depending on where the solution is deployed, let's say I deploy the solution in account A, it needs to be able to perform a triage collection against an instance in account A. For cross-account, if I deploy the solution in account A, it needs to be able to collect an instance in account B. And lastly, for the cross-region scenario, let's say this solution is deployed in us-east-1; it also needs to be able to collect an instance in us-east-2. So that is the first objective: basically, get the triage collection to work in all these different scenarios.

The second one is interesting: not all EC2 instances that I'm going to want to run a triage collection package against will have the necessary permissions

in AWS that they need. For example, if I have an EC2 instance and I want to perform a triage collection on it, I need to figure out how to get my triage collection scripts onto that instance. For our solution, we're storing those scripts in an S3 bucket, so who's to say that target EC2 instance has access to that triage collection script bucket? And on the flip side, it needs to be able to write the output somewhere. Our solution writes the output of a triage collection script as a zip file and then uploads that file to an evidence S3 bucket. Again, who's to say the target EC2 instance has access to write to that evidence bucket? There'd be a lot of overhead if I needed every single instance in my environment to be able to read from one bucket and write to another via an instance profile, so that was an objective. And lastly: keeping the principle of least privilege in mind, following best practices when it comes to a highly available solution, and leveraging an external ID when performing cross-account actions. Those were the three objectives when we were in the early design stages of the solution; a minimal sketch of that cross-account pattern follows.
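As a rough sketch of that cross-account pattern with an external ID, here's what the assume-role call might look like in boto3. The role ARN, session name, and ExternalId are illustrative placeholders, not the talk's actual values; the role in the target account would need to trust the security account and require this exact ExternalId in its trust policy.

```python
import boto3

sts = boto3.client("sts")

# Assume a hypothetical cross-account role, presenting the agreed external ID.
resp = sts.assume_role(
    RoleArn="arn:aws:iam::222222222222:role/TriageCollectionRole",
    RoleSessionName="triage-collection",
    ExternalId="example-external-id",
)
creds = resp["Credentials"]

# Temporary, scoped clients for the target account and region.
ssm = boto3.client(
    "ssm",
    region_name="us-east-2",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```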

Before I deep dive into the solution, let's talk about some of the problems it looked to tackle. The first here: FDC, full disk collections. Full disk collections are expensive and time-consuming. Not only are they expensive in terms of the workforce hours required, they're very resource-expensive too: you're getting an abundance of data, you need to store that data somewhere, and they're costly in dollar terms, since it costs money to store all that data and perform the collection. And lastly, they're not quick, especially if you're doing this in AWS. What you generally do is describe an instance, look at all the volumes attached to that instance, and then create snapshots of each volume; that process can take time. Maybe you encrypt those snapshots, another stage that takes time, and then you attach those snapshots to a collector EC2 instance and finally run a tool like dd to create a raw output of each volume. So in a challenging, time-sensitive investigation where answers are needed quickly, a full disk collection may not be your first choice; you may tend towards a triage collection solution like this workflow looks to accomplish.

The second point here: I've seen a lot of clients leverage existing workflows when they move to the cloud, and by that I mean a specific client may have a solution that they like for performing triage collections against on-prem instances, and they're able to port that over to the cloud. Nothing wrong with that at all; let's be clear, if it works, it works, and if you're happy with the metrics and the time to collect, who am I to tell you that there's a better option out there? But what I do think some of those solutions miss out on are the capabilities you can get by leveraging cloud-native services within AWS. We'll talk about how this triage collection workflow leverages some cloud-native services, namely SSM, Systems Manager; we'll get to that a little further on.

And lastly, there's this idea of rapidly changing cloud environments. As I alluded to earlier, when I worked at Goldman Sachs we had almost 3,000 AWS accounts. Let's say each one of these little people down here represents an account: companies tend not to have only one AWS account, they have many, and your solution needs to be flexible enough to allow for triage collections to happen in each one of those accounts. Not only that, but these environments are changing rapidly. If you look at the bottom of the screen, let's say I just onboarded a bunch of AWS accounts, maybe I acquired a new business, and now all of a sudden I have these new accounts as part of my environment.

Not only that, but let's say each developer at my company gets their own AWS account as a sandbox, and I had a few developers leave; well, now we have a couple of accounts that are offboarded, they disappeared. And again, since these environments are changing so much, if we're monitoring our compliance or any sort of configuration drift, accounts tend to fall out of compliance with the standards we've set forth. So this solution needs to be flexible enough to accept new accounts and to handle any accounts that are offboarded.

It also needs to alert early if we don't think the solution will work, maybe because an account is out of compliance; it needs to be a very flexible solution. With that, let's talk about some benefits of a triage collection workflow. These are benefits in general, which we can compare against a full disk collection, or FDC, workflow. Number one: triage collections work so well because they're collecting a lot less data, and by doing so the collection is that much quicker, so you can hopefully get answers that much faster, since you're able to pull back the data and start analysis that much sooner compared to a full disk collection.

The second benefit: since we've automated everything and it's all written as code, the process to perform the triage collection workflow is going to be very standardized; the same commands are going to run against every EC2 instance. It's going to be auditable: wherever possible we've enabled logging, we have custom logs, and everything is obviously written to CloudTrail as well. And our cloud-native approach, since we're using all cloud-native technology here, lends itself to reduced cost, quicker response time, quicker collection time, and a lot of other benefits as well.

Now, for those of you out there who are willing to die on the hill that you need a full disk collection, I don't blame you; having worked in the finance sector, it's very rigid there with certain requirements. But this can be done in parallel with a full disk collection: just because you're doing a triage collection doesn't mean you can't do a full disk collection too, so you can do both and really gain the benefits of both. I'll talk about some limitations of this solution at the end, but namely, one of the prereqs of this triage collection solution is that you need to have an SSM agent present on the target EC2 instance where you want to perform the triage collection. With a full disk collection, if you leverage cloud-native services, you don't need any agent present, so there are some benefits to a full disk collection, but don't let me mislead you; I'll speak to those at the end and provide some great additional resources there too.

Now, with that, let's get to what we've all been waiting for: the solution deep dive. You've heard me mention SSM a few times, so let's clarify. AWS Systems Manager, or SSM, is at the absolute heart of our triage collection workflow. This first bullet here is basically the definition of SSM per Amazon, and I made sure to link where I took it from:

SSM allows for the automation of common and repetitive IT operations and management tasks. With it, you can execute commands or scripts as part of a response action on EC2 instances, and that's how we're going to be leveraging it for our triage collection. With SSM I can pipe certain commands to an instance to run on that instance, and I can check the status of those commands, so SSM is going to be responsible for all of the triage collection commands that we run. Now, as I alluded to earlier with this third bullet, Systems Manager requires an SSM agent to be present on the target EC2 instance or instances that you want to collect from.

Not only that, but those target EC2 instances need to allow the communication with the SSM service to take place, and when I say allow, I mean at the account level: just because an EC2 instance has the SSM agent installed on it doesn't necessarily mean that instance has the necessary permissions to communicate with SSM. That's sort of a second requirement here, but I'll talk about why it's not a requirement in our solution. So really, the main prerequisite to this solution is that the target EC2 instance needs to have the SSM agent, the Systems Manager agent, installed. And you may look at me and say, "Ryan, that's asking a lot," and I agree; it's not realistic for a company to have 100% coverage of the SSM agent across all EC2 instances. But our saving grace here is that for each operating system I've outlined, the majority of AMIs corresponding to those operating systems actually have the Systems Manager agent pre-installed; you may just not know it if you haven't given your instance the appropriate permissions to communicate with SSM, which is why it isn't showing up.

For Windows, for example, any Amazon-managed AMI created post November 2016 is going to have the Systems Manager agent already pre-installed, and for the sake of this talk, 2016 was almost five years ago; I hope we don't have too many EC2 instances created from AMIs that are five years old. Windows Server 2008, 2012, 2016, 2019: these will all have the Systems Manager agent installed. Same with Linux; most Linux flavors will have the SSM agent installed too. I've listed a bunch of different OSes here, but namely Amazon Linux 2 and Ubuntu are going to have it installed by default, which is really nice. And lastly, you may not know this, but EC2 instances now support macOS, which is a really cool feature. They tend to be expensive; if I'm not wrong, I think they're based on bare-metal instance types, so watch the cost there, but the SSM agent is pre-installed by default on all of the current macOS AMIs that AWS offers: Catalina, Mojave, and Big Sur. So you can see that while it's a prerequisite to have the SSM agent installed, it does come standard, I would argue, on most AMIs and OSes you're going to use in your environment.

Now, I wanted to introduce Systems Manager and talk about the agent before I got into what the solution does, because it is central to our solution. But before I actually walk through exactly what our solution does,

I wanted to set the scene. Let's say I log in Monday morning, 8 a.m., to my workstation, and I'm presented with this GuardDuty finding: unauthorized access to an EC2 instance, and it's being used as a Tor relay node. For the sake of this you don't need to know much about Tor relay nodes or Tor in general, but let's say you got not one alert but a bunch of different alerts, and let's also imagine that each alert corresponded to a different EC2 instance in your environment, maybe in different accounts, different regions, et cetera.

My next step here is to start an investigation. Maybe I'd want to perform a triage collection against each one of these EC2 instances; that could be a good next step. But with the manual process for doing that, I need to ensure that I have access to each account where each instance lives, I need to make sure I have the appropriate permissions to perform a triage collection, and not only that, I have to figure out how to get hands-on-keyboard on these EC2 instances, whether that's with an SSH key or RDP. How do I actually get onto the instance? Does that instance have the necessary permissions to download the triage collection script, or to upload my evidence package to an S3 evidence bucket? A lot of considerations here, and that's only the technical aspect; we're not even accounting for human error. Let's say there are seven alerts here, or let's say whatever malware was in the environment spread to 15, 20, 30 hosts. Now all of a sudden I have to apply my standardized triage collection practice to each one of these instances, and that can be a challenge. It can lead to long hours.

It's also a serial approach: I can only do one instance at a time, and maybe I'm getting tired towards the end and starting to make mistakes. So this is where our automated triage collection workflow really starts to shine. Here's a very basic, high-level diagram of the workflow. I'm going to walk you through the scenario, framing it around the GuardDuty finding we just saw. Essentially, in a monitored account, GuardDuty looks at a variety of sources: DNS logs, VPC Flow Logs, and CloudTrail, and using those evidence sources it's able to use AI and machine learning to present findings. So let's say it presented us with that finding.

In this demonstration, GuardDuty is feeding into Security Hub from a member account, and then we have member-to-central replication, where all of my findings from each member account are replicated into a central account. So in my security account, where I'm deploying this triage collection workflow, I can see all of the Security Hub findings for all of my member accounts; they just feed in passively. Now we're on the right side of this diagram, and there are two ways you can trigger this solution. The first is that it can be triggered automatically using Amazon EventBridge: you can have rules set up where this solution will fire for specific findings in Security Hub. So let's say there's a finding involving an EC2 instance that originated from GuardDuty; I can have logic in place, sketched below, to automatically kick off this triage collection workflow. That's why EventBridge is there. The second option is to start at the upper right-hand corner of this diagram as opposed to the upper left: a team member of mine on the CERT team, if they have reason to believe an instance was compromised, can kick off the step function manually by providing a few things; we'll get into that in a second.
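As an illustration of that automatic trigger path, here's a minimal sketch of an EventBridge rule matching Security Hub findings whose upstream product is GuardDuty. The rule name, event-pattern fields, and target ARNs are assumptions for the example, not the talk's actual configuration.

```python
import json
import boto3

events = boto3.client("events")

# Fire on imported Security Hub findings that originated from GuardDuty.
events.put_rule(
    Name="trigger-triage-collection",
    State="ENABLED",
    EventPattern=json.dumps({
        "source": ["aws.securityhub"],
        "detail-type": ["Security Hub Findings - Imported"],
        "detail": {
            "findings": {
                "ProductFields": {"aws/securityhub/ProductName": ["GuardDuty"]}
            }
        },
    }),
)

# Point the rule at the step function; the role must allow states:StartExecution.
events.put_targets(
    Rule="trigger-triage-collection",
    Targets=[{
        "Id": "triage-step-function",
        "Arn": "arn:aws:states:us-east-1:111111111111:stateMachine:TriageCollection",
        "RoleArn": "arn:aws:iam::111111111111:role/EventBridgeInvokeStepFunction",
    }],
)
```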

The step function, behind the scenes, is what's actually doing a lot of the hardcore triage collection work. Namely, in a cross-account scenario it would assume a cross-account IAM role, do a number of things to the target EC2 instance, pull down a triage collection script, run it on the instance, and then upload the output of that triage collection script to our evidence S3 bucket. So with that, let's look at the step function in detail, the upper right-hand corner of this diagram, because that is what's responsible for doing the triage collection.

Here is our step function in all of its glory. You'll notice a few things right off the bat, but before I deep dive, let's talk about the inputs. If you want to run this step function manually, you give it a few things. The first thing that's always required is a description of the target instance, and you can do that in a couple of ways: you can either give me the ARN, the Amazon Resource Name, of the target instance that you want to perform a triage collection against, or you can tell me the account, the region, and the instance ID of the target instance. The reason for that flexibility is that the ARN isn't clearly displayed in the console as of now, so you kind of have to construct it yourself, but it's pretty easy to construct; it follows a standard naming convention. So based on your SOPs, whatever you do in terms of documenting and collecting, this can take either as input. The second and last piece of input it needs is a case ID. Whenever you give me a target instance, or a bunch of instances to collect, when I output all of that evidence to S3, I'm going to prefix it with the case ID for you, so that you can see all of your instances tied to a specific case in the output.

With that, let's look at the very top here, the define input state. This is the first state of our step function. As I mentioned earlier, this workflow can be triggered both automatically via EventBridge and manually by a CERT team member, so because of the manual option, we do some sanity checks here, just to make sure there are no typos. Is the account ID you gave me, where the instance lives, a valid length? Does the instance ID appear to be valid? Is the region valid? Maybe you give me us-east-3, which doesn't exist. And is the case ID unique? It's so much easier to check these inputs from the start and cancel the workflow than to be confused later about why the workflow isn't running. So that's the first state, define input; really just sanity checks, as sketched below.
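A minimal sketch of what those define-input sanity checks might look like; the field names, region list, and function name are illustrative assumptions, not the talk's actual Lambda code.

```python
import re

# Illustrative set of regions the solution is deployed to; extend as needed.
VALID_REGIONS = {"us-east-1", "us-east-2", "us-west-1", "us-west-2"}

def validate_input(event: dict) -> dict:
    """Fail fast on obvious typos so the workflow cancels up front."""
    errors = []
    if not re.fullmatch(r"\d{12}", event.get("account_id", "")):
        errors.append("account ID must be exactly 12 digits")
    # Instance IDs are 'i-' plus 8 or 17 hex characters.
    if not re.fullmatch(r"i-[0-9a-f]{8}([0-9a-f]{9})?", event.get("instance_id", "")):
        errors.append("instance ID does not appear valid")
    if event.get("region") not in VALID_REGIONS:
        errors.append(f"unknown region: {event.get('region')!r}")  # e.g. 'us-east-3'
    if not event.get("case_id"):
        errors.append("case ID is required")
    if errors:
        raise ValueError("; ".join(errors))
    return event
```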

The next thing you'll notice is these dotted lines in this part of the step function; that's what's called a map state, and all the rest of these states run within it. If you're not familiar with map states, definitely read up on them, but essentially, let's say I gave you multiple EC2 instances as input: each EC2 instance, if I set up my map state correctly, gets its own branch of the step function. One EC2 instance would get this exact step function, another EC2 instance would get a copy of it, and so on and so forth. That way, no EC2 instance is slowed down by another. Think of it like this: if I have one EC2 instance that for whatever reason is taking a really long time to describe, then without a map state that might hold up the other EC2 instances; with one describe call, we'd have to wait until every EC2 instance responds before we could move on to the next state in the step function. With a map state, each instance gets its own branch, so if the describe target instance step was really quick for one instance, that instance progresses down the step function, and likewise, if another instance was struggling for whatever reason, it could wait at that state until it's ready without slowing down my first instance.
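For reference, a minimal sketch of what a map state looks like in Amazon States Language, expressed here as a Python dict; the state names, items path, and Lambda ARN are placeholders, not the talk's actual state machine.

```python
import json

# Each element of "$.instances" gets its own iteration of the inner workflow,
# so one slow instance never blocks the others. MaxConcurrency of 0 means
# no limit on parallel iterations.
map_state = {
    "Type": "Map",
    "ItemsPath": "$.instances",
    "MaxConcurrency": 0,
    "Iterator": {
        "StartAt": "DescribeTargetInstance",
        "States": {
            "DescribeTargetInstance": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:111111111111:function:DescribeTargetInstance",
                "End": True,
            }
        },
    },
    "End": True,
}
print(json.dumps(map_state, indent=2))
```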

With that, let's talk about describe target instance. This isn't a strictly necessary part of a triage collection, but it's good to get some metadata about the EC2 instance you're targeting. Maybe you have SOPs or playbooks that require you to collect certain things about an instance every time you do a collection, and this is where the stage can be tailored. By default we're getting a lot of the metadata that's available in the EC2 console about the instance. I've listed a couple of things here: public and private IP addresses, public and private DNS hostnames, how many volumes are attached to the instance, when the instance was created (the date and time), and the OS type of the instance. There are a lot of other things I didn't list that you may be interested in: maybe whether the instance was launched in a public or a private subnet, or the name of the instance profile associated with the instance, so you can better understand the blast radius, things like that. Again, this step is about getting all of the metadata about your target instance, as sketched below.
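A sketch of pulling that metadata with boto3; which fields you keep is up to your SOPs, and the region and instance ID here are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_instances(InstanceIds=["i-0123456789abcdef0"])
inst = resp["Reservations"][0]["Instances"][0]

metadata = {
    "public_ip": inst.get("PublicIpAddress"),
    "private_ip": inst.get("PrivateIpAddress"),
    "public_dns": inst.get("PublicDnsName"),
    "private_dns": inst.get("PrivateDnsName"),
    "volume_count": len(inst.get("BlockDeviceMappings", [])),
    "launch_time": inst["LaunchTime"].isoformat(),
    # 'Platform' is only set for Windows; its absence implies Linux/macOS.
    "platform": inst.get("Platform", "linux/mac"),
    "subnet_id": inst.get("SubnetId"),
    "instance_profile": (inst.get("IamInstanceProfile") or {}).get("Arn"),
}
```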

In the end, we're going to provide all this information to the incident responder so they don't have to go back at the very end and gather it themselves; it's done for them to save time and let them focus on the important analysis tasks. So after we describe the target instance, remember the prerequisite to this solution: we need to check the SSM agent status here. We can do that with a pretty easy API call, sketched below, and see whether the EC2 instance is reporting in to the SSM service.
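That check can be a single call to SSM; a PingStatus of "Online" is the green light. The region and instance ID are again placeholders.

```python
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

# An instance only appears in this list once its agent has registered with SSM.
resp = ssm.describe_instance_information(
    Filters=[{"Key": "InstanceIds", "Values": ["i-0123456789abcdef0"]}]
)
info = resp["InstanceInformationList"]
agent_reporting = bool(info) and info[0]["PingStatus"] == "Online"
```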

Now, we're going to assume that by deploying the solution you followed the prerequisite, so if we get the green light and the instance is reporting in to the SSM console, we're good to move down the left side of the step function: generate pre-signed URL, send SSM command, check SSM command status, and tag evidence in S3. If it is not reporting in to SSM, that isn't necessarily an issue; we'll talk about it on this slide. If the SSM agent isn't reporting in, again, we're assuming you've done everything correctly and the SSM agent is installed, but even if you've installed the SSM agent, like I talked about earlier, the EC2 instance on which it's installed may not have the necessary permissions in IAM to allow it to communicate with the SSM service.

Initially, when we designed this, that was a second prerequisite the solution needed; you can see it under point two there: you'd need an existing connection, some way for the instance to communicate. Before we got more advanced with the solution, we would just say that any EC2 instance you want to perform a triage collection on needs the appropriate permissions for SSM, and if it doesn't, what we've seen clients do before is detect when new EC2 instances are created, alert on instances created without the necessary IAM permissions associated with the instance profile, and then even auto-remediate those EC2 instances, basically adding an inline policy to allow that SSM communication to occur. That is the more legacy approach.

As we got more advanced with the solution, we started leveraging what's called just-in-time access, which is really cool: no longer is the existing connection a prerequisite. Just-in-time access basically says that an instance, or an identity, should only be able to do something when it needs to, not before and not after; it's temporary access. In our case, we're modifying the instance profile of the target EC2 instance on the fly to temporarily allow it to communicate with the SSM service, and once we're done, we remove that ability again. That's how we get around it here. I'll go back one slide real quick so you can see the just-in-time credential provisioning on the bottom right there too. So basically: try to change the instance profile associated with the instance, and then assume the SSM agent will be reporting in after a wait; a sketch of that step follows.
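One way to sketch that just-in-time step, assuming the instance already has an instance profile with a role attached; the no-profile case, where you'd associate a profile and later disassociate it, is omitted here for brevity.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
iam = boto3.client("iam")

# AWS-managed policy granting the core permissions the agent needs
# to talk to the SSM service.
SSM_POLICY = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"

def grant_ssm(instance_id: str) -> str:
    """Temporarily attach the SSM policy to the instance's existing role."""
    inst = ec2.describe_instances(InstanceIds=[instance_id])[
        "Reservations"][0]["Instances"][0]
    profile_name = inst["IamInstanceProfile"]["Arn"].split("/")[-1]
    role_name = iam.get_instance_profile(InstanceProfileName=profile_name)[
        "InstanceProfile"]["Roles"][0]["RoleName"]
    iam.attach_role_policy(RoleName=role_name, PolicyArn=SSM_POLICY)
    return role_name

def revoke_ssm(role_name: str) -> None:
    """Remove the temporary access once the collection is done."""
    iam.detach_role_policy(RoleName=role_name, PolicyArn=SSM_POLICY)
```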

Now, if we do a check again and the agent still isn't reporting in, then we break at this point: the workflow pauses and lets you know that the SSM agent isn't reporting in, please go investigate. Otherwise, we assume at this point that we have a target EC2 instance, we've described it, and it has an SSM agent status of reporting in. So the next design decision: we have all of our necessary pieces to this point to be able to do the triage collection on the EC2 instance, but as I talked about a little earlier, if we have an EC2 instance that doesn't have access to pull down our triage collection scripts or write to our evidence S3 bucket, what can we do?

This was an early design challenge we had to account for. In other words, how do I allow an EC2 instance without any AWS permissions to write to or read from a private S3 bucket? There are a couple of ways you could tackle this. You could use just-in-time provisioning, but that's a lot of overhead: I'd have to go and modify every single instance that I want to collect. You could do that, and there's no problem with it generally, but if it was a production instance, maybe people would be more hesitant to let you change their instance, specifically around S3 access. So let's talk through the options. The first option that arose was a pretty simple one, maybe not very secure, but the worst case isn't that bad, and it does allow for a lot of flexibility: you could even use this workflow on on-prem instances if they have the SSM agent installed on them, because, fun fact about SSM, you can install the agent on on-prem instances, have them report in to the console, and do this triage collection workflow as well.

Option one: we could allow the triage script bucket, the bucket where we're hosting the triage collection scripts that SSM executes on the target instance to perform the collection, to be publicly readable, so any identity could read from it. Now an EC2 instance that I want to collect doesn't need the ability to read from any private S3 bucket; who cares, it's a public bucket, so as long as it has internet access it can read from it. Second, we could make the evidence bucket publicly writable. Same thing here: the instance doesn't need to be able to write to a private S3 bucket; it can write to this public one. Again, we're not allowing public reads of the evidence bucket, only public writes, and if you're concerned, we can monitor file uploads to the evidence bucket to catch files uploaded by mistake or maliciously.

And what is our worst case? The worst case is that a threat actor reads your triage collection scripts, because they're publicly readable, and if those triage collection scripts were intellectual property of your company, that could present challenges. A threat actor could also write files to your evidence bucket; in the grand scheme of things, that's not too concerning, maybe it leads to increased cost. Versioning would be enabled on this bucket, so even if they wrote a file with the same name, you would still have a copy of the original. So option one: maybe not the most secure, but a very flexible solution.

Option two: I could create a programmatic IAM user, get an access key and a secret access key, and that programmatic IAM user would only have permissions to read from our specific triage script bucket, not from any bucket, only the triage script bucket, to pull down the triage collection scripts, and it could also only write to the evidence S3 bucket; again, not every bucket, just the evidence bucket. Then, when executing commands over SSM, I could hard-code those access keys and secret access keys into the commands that I'm running, so that I gain access to these buckets and can read from and write to them. I could even set up an automatic build pipeline to periodically rotate the keys: it makes the keys inactive, disables and deletes them, spins up new keys, and updates my triage collection script, or my Lambda code, to leverage the new keys. Again, not a bad option; it

allows for quite a bit of flexibility, but there is a more secure option, and I would argue that's option three. With option three, this is really when the light bulb moment went off for us. I was familiar with pre-signed URLs for temporary access to download files, but I didn't know that you can also use pre-signed URLs to upload files. We'll talk about what pre-signed URLs are in a second; I have a whole slide dedicated to what a pre-signed URL is, what it does, and how to create one, but essentially, the executive summary is that they allow temporary access to write or read a specific file in a private S3 bucket.

It may seem like a clear winner, but there are some strict design considerations you need to account for when creating and using these pre-signed URLs; I'll get to those in a second. As you can see, here we are: we're moving forward with the solution and we've chosen option three to allow basically anonymous access to private S3 buckets, so we generate pre-signed URLs at this stage. So, as promised, let's more officially define what a

pre-signed URL is. It allows temporary read or temporary write access to a file within an S3 bucket, so the get or put methods. Whenever you create the pre-signed URL you specify the method, and you can either temporarily read a file or temporarily write a file to a bucket. Super important: you can read a single file or write a single file; it doesn't accept any wildcards. I think I show a slide on this later, because it's a very important point to mention. A pre-signed URL is generated by an AWS identity; in this case I'd use my Lambda in the security account that has access to both the evidence bucket and the triage collection bucket. So I use the identity that has access to the target objects and buckets, and I use it to generate the pre-signed URL. Think of the pre-signed URL as a long token or string that provides that access, a long password if you will. Once I have that pre-signed URL, I give it to the identity that doesn't have access, the unauthorized user; in this case I give the pre-signed URL to our target EC2 instance, and it basically authenticates to AWS using that URL.

Now, as I mentioned, when you create the pre-signed URL, and we'll see example code in a second, you need to know the exact name of the target file that you want to read or write; that's one part of it. You also need to specify the S3 bucket that you're going to read from or write to. So: target file name, S3 bucket, and the method we specified above, a get request or a put request. And lastly, at the bottom, you need to set an expiry time. A pre-signed URL remains valid for a limited amount of time; again, it's temporary access, and the lifetime you give it can be anywhere from seconds up to seven days.

Here's some example code taken directly from the boto3 documentation; I want to call that out, and I've provided a link at the bottom there, as this is not my own code, but I thought it was important to include because, if you're decently familiar with boto3 and Python, you can read this function pretty quickly. They did a great job writing it, and I would almost argue there are more comments in there than code. If I draw your attention to the line under the try block in the middle of the screen, response = s3_client.generate_presigned_url, you can see this is where we specify the items from the previous slide. The first is the method: are you generating a pre-signed URL to read an object, a get_object request, or to write an object, a put_object request? You also need to specify the bucket and the key name, the entire key name with the prefix, where the file exists within that bucket. And then you need the expiry time: how long you want this pre-signed URL to be good for.
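Distilled down, and with illustrative bucket names, keys, and expiry, generating the two URLs our workflow needs looks roughly like this:

```python
import boto3

# Run by an identity that HAS access to both buckets (our Lambda in the
# security account).
s3 = boto3.client("s3")

# URL the target instance uses to READ the collection script.
download_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "triage-script-bucket", "Key": "scripts/collect_linux.sh"},
    ExpiresIn=3600,  # seconds; anywhere from seconds up to seven days
)

# URL the target instance uses to WRITE the single output zip.
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "evidence-bucket", "Key": "CASE-1234/i-0123456789abcdef0.zip"},
    ExpiresIn=3600,
)

# The instance then needs nothing but HTTPS, e.g.:
#   curl -o collect_linux.sh "$download_url"
#   curl -X PUT --upload-file output.zip "$upload_url"
```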

If you did all that correctly, you'll get a response, and the response will look something like this. At first glance it may be kind of intimidating, but we'll talk about the anatomy of a pre-signed URL, and hopefully, by using some coloring, it'll be a little clearer. The items in green are the items you specify when you create the pre-signed URL, and they're also output in the generated URL. The first is the S3 bucket name where the file lives that you want to read from or write to: if I want to read a triage collection script, I give it the triage script bucket name; if I want to write an output evidence file, I give it the S3 bucket where the evidence file will exist. Same with the file name: this would be the name of your triage collection script, or the name of the evidence package you want to upload. And lastly, at the bottom, is the expiry time, in epoch time, basically how long this pre-signed URL is good for. The request is compared against that, and if the time has already expired, even if everything else is right, it isn't going to work anymore.

Provided you gave all of that as input, parts of the URL are output, shown in blue: we get an access key ID, a signature, and an x-amz-security-token, and all of these are used to authenticate the request. If you change any of this at all, and it's not exactly what Amazon is expecting, you're going to get an unauthorized response. Now, in our specific triage collection workflow, I want to be very clear about how we're using pre-signed URLs: for each target instance we're going to need two of them. I touched on this briefly earlier, but that's because when you create a pre-signed URL you need to specify a single

method: a get request to read a file, or a put request to write a file. You can't do both, and they don't accept wildcards; I'll get to that in a second. Basically, we need to generate two pre-signed URLs: one to allow us to read a triage collection script, and another we'll use to authenticate when writing our triage output evidence to the S3 evidence bucket. So for each target instance we get two pre-signed URLs; if we have 10 EC2 instances to collect, we generate 20 pre-signed URLs.

It's important to note again that pre-signed URLs need everything to be exact: one file name you're going to read from, one file name you're going to upload to. They do not accept wildcards. This is something we tried to make work when we first did it; I would have loved to just allow a wildcard upload so I could upload all my different evidence pieces individually instead of having to zip them, and then leverage Athena to query those raw files in S3. But with the pre-signed URLs we could only specify one file to read or one file to write, so that's why we need to zip the output, and you need to specify the exact file names; if there's a mismatch on any of that, it will not work. So you can see how pre-signed URLs are very rigid in that regard, and that's mainly because, per AWS, there are few use cases where you should have anonymous users or entities writing to or reading from a private S3 bucket. This is one of them, I would argue, but AWS makes it a little more difficult. So at this point we've completed the generate pre-signed URL phase, and we pipe those pre-signed URLs through to the send SSM command phase.

That phase is responsible for performing our triage collection, and we pass those pre-signed URLs, one for upload and one for download, into the state. So let's get to the meat of this section. In this send-command-through-SSM state, this is where we execute our commands, and there are a couple of ways you can do this with SSM. SSM calls them documents: you use some sort of document to execute commands on a target EC2 instance, and when we delved into this further, you really have two options here. The first is to leverage one of these top two documents, either AWS-RunPowerShellScript or AWS-RunShellScript. PowerShell obviously only works for Windows, and the run shell script document works for Linux and macOS. This is where, in your boto3 Python code, you would specify each specific command inline that you want to run against the target EC2 instance. That presents some challenges, I'd argue, and it's why I like the second option: AWS-RunRemoteScript.

This document can be run on any instance type; granted, the remote script that you want to run needs logic to be able to execute on Windows, Linux, or Mac, because the artifacts are different, but what this document does is execute scripts stored in a remote location on the target instance. That remote location can be a public or private GitHub repository or an S3 bucket. In this case we're storing all of our triage collection scripts for Linux, Mac, and Windows in an S3 bucket, and we're using this AWS-RunRemoteScript document to pull down those triage collection scripts and execute them live on an instance; there's a sketch of the call below.

The reason I like this second document, AWS-RunRemoteScript, over the first two is that run PowerShell script and run shell script commands are defined inline in your Python code; you basically pass in the commands you want to run. What I like about the second one is that I can let my DFIR professionals update a script that's stored in S3 without their having to know much about Python, boto3, or Lambdas; they can have minimal cloud experience, focus on the core aspects of incident response and DFIR, and just update that remote script stored in the S3 bucket. So it allows for flexibility there.
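A minimal sketch of sending AWS-RunRemoteScript through boto3. The S3 path, script name, region, and instance ID are placeholders; in the real workflow the script location would come from the pre-signed URL generated in the previous state.

```python
import json
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

resp = ssm.send_command(
    InstanceIds=["i-0123456789abcdef0"],
    DocumentName="AWS-RunRemoteScript",
    Parameters={
        "sourceType": ["S3"],
        # JSON string pointing at where the script lives.
        "sourceInfo": [json.dumps({
            "path": "https://triage-script-bucket.s3.amazonaws.com/scripts/collect_linux.sh"
        })],
        "commandLine": ["bash collect_linux.sh"],
    },
)

# Keep the command ID; the next state polls it for status.
command_id = resp["Command"]["CommandId"]
```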

Once I've actually sent a series of commands through SSM, telling SSM to execute this script on a target EC2 instance, I need to check the status of that SSM command. When you run an SSM command, you're returned an SSM command ID; that's the way to track each command you've sent through SSM, and you can query the status of that command ID. When you first send a command it may be pending: it's waiting for the agent to accept the command and run it on the system. When the agent starts to run the command, it goes from pending to in progress, so the command is in progress, your triage collection is happening, and then hopefully it will eventually return a completed or success status.

That's what this state is doing: describing that SSM command ID and checking the status associated with the command, and if it gets a pending or in-progress status, it leverages an exponential backoff and retry to keep querying until it's successful, as sketched below. If it's successful, it moves on to the next step in the step function, tag evidence in S3, and if you get a status like an error or a timeout, it breaks and reports a failure: for some reason the target instance wasn't able to run your triage collection script, so it might be an issue with the script you provided.
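A sketch of that check state as a polling loop with exponential backoff; the delays and retry count are illustrative, not the talk's actual values.

```python
import time
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

def wait_for_command(command_id: str, instance_id: str, max_attempts: int = 8) -> dict:
    delay = 2  # seconds
    for _ in range(max_attempts):
        inv = ssm.get_command_invocation(CommandId=command_id, InstanceId=instance_id)
        status = inv["Status"]
        if status == "Success":
            return inv
        if status in ("Cancelled", "TimedOut", "Failed"):
            raise RuntimeError(f"triage collection ended with status {status}")
        # Pending / InProgress / Delayed: back off and try again.
        time.sleep(delay)
        delay *= 2
    raise TimeoutError("command did not complete within the retry budget")
```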

But if everything checks out and that SSM command returns a status of complete success, then for each instance you'll get an output zip file in your S3 evidence bucket. Again, you get a zip file because, with the pre-signed URLs we're using to authenticate our reads and writes, you can only read one file and write one file, so we're writing one file per instance to the evidence S3 bucket. If I run our triage collection workflow against 10 EC2 instances, I'll get 10 different zip files, each one leveraging a unique pre-signed URL to write to S3.

And with this state, what's really interesting is that from our first state in the map state, describe target instance, where we gathered a lot of valuable metadata, we're now tagging all of our output zip files with that metadata. So if someone wasn't familiar with the collection, they could quickly look at that output object in the evidence S3 bucket and see a bunch of metadata tags: maybe one for the case ID, and ones for the original instance's account and region, where the instance lived. The output file is tagged with a lot of appropriate metadata, as sketched below, and you can tailor that to your reporting needs.
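A sketch of that tagging step; the bucket, key, tag keys, and values are examples you'd tailor to your own reporting needs.

```python
import boto3

s3 = boto3.client("s3")

# Tag the output zip with the metadata gathered in the describe state.
s3.put_object_tagging(
    Bucket="evidence-bucket",
    Key="CASE-1234/i-0123456789abcdef0.zip",
    Tagging={"TagSet": [
        {"Key": "case-id", "Value": "CASE-1234"},
        {"Key": "source-account", "Value": "222222222222"},
        {"Key": "source-region", "Value": "us-east-2"},
        {"Key": "instance-id", "Value": "i-0123456789abcdef0"},
    ]},
)
```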

Here's some example output. Again, this really depends on the triage scripts you run against the instance and what you return, but these are highly customizable: from your playbooks you probably have some good collection scripts out there that work for you and that you could easily pipe into this process. The top one is example output from a Linux or Mac instance, and the bottom one is example output you could get from a Windows instance as part of this triage collection workflow. Now, again, you're going to get an output zip file, so for the sake of showing you more than just output.zip, I've actually gone ahead and unzipped them so you can see the different files there. And as you can see, what may have looked like a complex step function when we first started, when we go step by step you can quickly understand each step; we've now tackled every step within the step function, beginning to end.

So, some pro tips. You made it this far, thank you for sticking with me; I wanted to reward you with some pro tips for implementing aspects of this solution

in your own workflow. The first is around deployment and testing considerations. Always try to make your solution as highly available as you can; maybe your MVP doesn't have to be as highly available as a production solution, but when you first start testing, maybe deploy to a single region, and then when you start to shift and depend more on the solution, move towards a multi-region deployment. That has to do with the scenario where AWS has an outage in us-east-1 for Systems Manager: if I'd only deployed the solution in us-east-1, well, I can't use Systems Manager anymore, so I can't use the solution. So consider deploying this solution in a secondary region, like us-west-1, so that if us-east-1 goes down, you can quickly pivot and use the solution in us-west-1.

And I can't stress testing enough: if you don't test enough, you may break things and people get angry with you. It's all about acceptance; you want the solution to be well received by your organization, so roll out the solution in phases and test extensively. Start with a single test account, where you can exercise the same-account and maybe a cross-region workflow, and then slowly onboard additional test accounts, still test accounts at this point, so that you're able to practice your cross-account collections. Then, when you begin to get comfortable, when you have your whole workflow worked out, start to productionize the solution: onboard additional low-priority production accounts, and once you get super mature, you can start to leverage Amazon EventBridge, as I mentioned in the earlier diagram, with native and custom integrations to trigger this workflow, so you become that much more automated a solution. So that was deployment and testing considerations. With security considerations, I need to stress this as well.

As I've shown, Systems Manager is a powerful tool: it allows you to execute commands in a very high-privilege context on target EC2 instances, and that's great for the use case we need, but it can easily be misused and end up doing more harm than good. Let's say a threat actor were to get access to Systems Manager to run commands; they could spread very quickly, so you can see how much harm that could do. With that, consider locking down which users or roles have permissions to use SSM, to update SSM documents, and to run commands, and you can do this in a few different ways. If your accounts are part of an organization, you can use what's called a service control policy, an SCP, as sketched below; if they're not, you can still use permission boundaries in each account. So even if my user tried to give itself access to do something with Systems Manager, if I have a permission boundary set up that doesn't allow SSM, it wouldn't be allowed to do that. So think about that. I'd also be very proactive about everything: if you're concerned about the use of SSM, create custom detections and alerting through CloudTrail, so when a user or role tries to use SSM, you can quickly detect and alert on that, even when it's successful; maybe you want to know the successful times SSM is used too, for all of your reporting.
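As one possible shape for that control, here's a sketch of an SCP, expressed as a Python dict, that denies SSM command execution to everything except a dedicated triage role; the role name is illustrative, and you'd attach this through AWS Organizations.

```python
# Deny SSM command and document changes to every principal except a
# dedicated triage role.
scp_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "LockDownSSM",
        "Effect": "Deny",
        "Action": [
            "ssm:SendCommand",
            "ssm:CreateDocument",
            "ssm:UpdateDocument",
        ],
        "Resource": "*",
        "Condition": {
            "StringNotLike": {
                "aws:PrincipalArn": "arn:aws:iam::*:role/TriageCollectionRole"
            }
        },
    }],
}
```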

If you can really implement these bottom two bullets here, I think you're in a good place when it comes to security. And lastly, for my super technical folks on the call, I have some additional considerations too, some more pro tips. I talked about the map state very briefly; look into it in Step Functions, as it basically allows for parallel execution without one arm of input slowing down another. Second point: AWS recently came out with Arm-based Lambda types; traditionally you'd have to use an x86 Lambda type, but if Arm Lambda types fit your use case, definitely give them a try. I've used them in a lot of my workflows, because AWS quotes about 20% better performance at a 20% reduced cost, and why would I not want to perform 20% better and spend 20% less money? That's awesome. So definitely look into Arm Lambda types over traditional x86 Lambda types if they fit your use case. The third point, which I talked about: we use exponential backoffs and retries in our check states to hopefully recover from certain errors within the step function; pretty self-explanatory.

Next, a pro of triage collection that's a con of full disk collection: traditional EBS snapshots, which I've used to perform full disk collections, do not capture instance store data. So this could be a use case for a triage collection: instance store data is lost upon termination of an EC2 instance, and if you have a lot of it present, you could collect that instance store data through this triage collection solution by updating your triage collection scripts to look for it and output it to S3. And lastly: test, test, test, and definitely train your professionals on how to use this solution, what's input and what's output.

I've seen a lot of potential clients who spend a lot of really good time and money, and they're doing all the right things too, but just make sure that your staff is also trained up on how to use these solutions; if a solution exists in an account and nobody knows how to use it, it's not doing much good. Now, some limitations; I've touched on a few of these already, but essentially, as stated earlier with the prereqs, it may not be realistic for an organization or a company to have 100% coverage of EC2 instances with the SSM agent pre-installed, and we get that. This is where a full disk collection solution, if you're able to design something like that as well, would really cover this limitation, because the full disk collection solution doesn't rely on any agent being present on your target EC2 instance; we're doing it through API calls at the AWS account level. So it's really helpful to pair this triage collection solution with a full disk collection solution to ensure as much coverage as you can.

And with that, I did want to leave you with some resources for further learning. For the first bullet and the third bullet I'm very biased, because those are talks that I have given. The first one is the Q3 tech talk that I mentioned during my introduction, the one I partnered with AWS to deliver, and that covers full disk collections and memory collections of EC2 instances in AWS and how we automated that workflow. There's a link to the talk as well as to the presentation slides I used, so please give those a look if you're looking to implement anything related to full disk collections or memory collections of EC2 instances. Let me skip over the second one for now; the third one is the talk I gave at re:Invent last year, SEC306, and this is at the account level, so it shows you how to enable full disk collections and triage collections through automation.

It also shows you how to create custom CloudTrail detections, how to enrich CloudTrail data, for example enriching an IP address with geolocation, and then how to auto-remediate some findings, or how to trigger automated collection workflows such as this triage collection workflow or the full disk collection workflow. So that is SEC306, the re:Invent talk; again, the slides and the talk are there as well. I think I had my quarantine mustache in that one, so please don't judge me. The second one is a really good talk that a lot of my co-workers at KPMG gave; they partnered with Toyota to deliver it. Toyota is doing a lot of really interesting work: once they get the evidence back from a collection, their triage collection or full disk collection, they're starting to automate some of the pre-processing, the processing of evidence, and some of the analysis, running a lot of valuable tools against the evidence they pull back, such that the incident responder doesn't have to run the tools themselves and can focus on the important analysis part of the investigation. Again, the talk and the slides are there as well.

So with that, I wanted to close and say thank you to all of you for attending not only BSides DFW but also my talk; I'm very lucky to have you as attendees, and thank you again to all the organizers of the BSides DFW conference. There are a few ways you can get in touch with me afterwards if you have any questions. First, and probably my preference, is LinkedIn; I check LinkedIn all the time and I'm pretty active there. I'll post the slides there, as well as the link to this talk when it eventually gets published, and you can expect a pretty quick response there too. I've also provided my personal email address; I'm not allowed to share my KPMG email address with just anybody, so my personal email address is there as well, and then Twitter and Discord. I'll be attending the talks for the rest of the day today and I'll be around on Discord, so if there's anything you think of as you start to digest all of this information, if you want to share your experiences automating triage or full disk collections, or if you just have questions in general, please don't hesitate to reach out. With that, thank you again, and I'll open it up now to any questions that you may have.