
Strangeways, Here We Come

BSidesSF · 2019 · 30:22 · 92 views · Published 2019-03 · Watch on YouTube ↗
About this talk
A cloud migration guide for startups transitioning from on-premises infrastructure to AWS. Clark covers organizational strategy (hiring, platform selection, technology stack), core AWS services (VPC, IAM, API Gateway, Fargate, RDS), and practical hardening techniques including network segmentation, least-privilege access control, centralized logging via CloudWatch, and secrets management.
Original YouTube description
The underlying desire with any technology is to push beyond its limits. In the 80s, we had the PC turbo button. In the 00s, everyone got all saas-y with software as a service. In the 2010s, we have the cloud (or as some of us know it, just someone else's computer). Jokes aside, leveraging the cloud allows teams to deliver content more rapidly compared to a local/on-prem solution. This sounds great until you remember nothing in life is free—cloud security is no exception. While this talk is technical, we will begin by discussing the benefits motivating a small startup's decision to transition from on-prem to the cloud along with the inherent risk. A wide range of factors were considered: hiring, platform selection, technology stack, user management. We will talk about Amazon Web Services (AWS), the moving parts of our cloud, and what was required to get a minimum viable product off the ground. We will share our own ProTips for going cloud first; by the end, hopefully you’ll walk away with a few cheat codes of your own, whether it’s getting a peek at going cloud first or a verification of your own cloud security best practices.
Transcript (en)

[Host] Good afternoon, everyone. Welcome to the last session of today. Who's ready for happy hour? Awesome. Let's welcome Victor Clark — he's going to be presenting "Strangeways, Here We Come," a journey from on-prem to cloud first with AWS. Welcome, Victor.

[Victor] All right, hello BSides, very excited to be here. First, let me apologize for being the single thing standing between you and happy hour, but hopefully you'll enjoy the session as much as I will. So, the theme for this year's BSides is the 1980s — let's take a journey back to the year 1987. It was a good year, I remember it well: one of my favorite bands of all time, The Smiths, released their last studio album, Strangeways, Here We Come. And that's the perfect title for this talk, because when you're going from on-prem to the cloud, it's a different way of doing things, a different way of thinking — it can be a little strange. Also, a very popular technology back in the 80s was the PC turbo button. Now, my VIC-20 at the time didn't have a turbo button, but that's okay — for this presentation we'll have our own turbo buttons, in the way of lessons learned and some cool tricks for the different moving parts of AWS that we'll be working with.

Okay, cool. So for all of you social engineering folks out there: I'm Victor Clark, the cloud security engineer at Insight Engines. This summer marks my third year with the company. When I joined, I was the third or fourth engineer, and now we're about 15 people, so it's been a very exciting journey so far. If you're interested in learning more, that's us on the web, and if you're interested in doing a disinformation campaign, you can find us on Twitter. In addition to that, I'm also a husband and a father, and I can't wait until my kid starts asking me questions about the cloud, because I will definitely have answers for him. Okay, tough crowd.

Cool, so that's enough about me — how about you? Why would you be here, why would you be interested in this talk — aka, why am I here and not at the EFF event? If you're already cloud-first, I think that's great: this could be a good sanity check of some of the things you and your team do in AWS, and if you already do them, hey, maybe it'll be a good laugh for you. If you're considering going cloud first, congratulations, you are in the correct spot. If you're leadership, we're going to talk strategy — the people and the processes you need to have in place. And if you're an individual contributor, we'll also talk tactics to get you going cloud first.

So let's talk strategy: hiring. I'm going to go ahead and assume that you already have a lead architect in place — somebody who knows the implications of taking your application from on-prem to the cloud. Maybe it's you, maybe it's somebody on your team. Second, you're going to need an individual contributor, someone who can develop and manage this infrastructure hands-on, because it's definitely a full-time job; on my current team, that's one of the roles that I fill. Third, and I can't say this enough, you definitely need a QA testing lead, because the code you develop is only as good as the tests you run against it, and it is a full-time job developing these testing pipelines.

There are different cloud platforms you can go with. I'm going to talk about AWS, but there are a couple of others as well, and I'm just going to go ahead and answer the question because someone's going to ask me: they're all great. I've used all three of the major cloud platforms, but they all have their limitations too. Each cloud platform is going to have a credits program, so I would definitely suggest you leverage that if you can — it shouldn't be the determining factor, but use it to your advantage if it's there. It's also a good idea to talk to your stakeholders — folks you work with, investors, your board — and see what their preferences are when it comes to the cloud. They may be different from your preferences, and if any friction comes up, it's good to have those discussions ahead of time. And regardless of which platform you develop on, I guarantee you're going to run into undocumented requirements — that's just kind of how it goes. If we weren't developing on the cloud, I'm sure we would be reading manuals about the Linux kernel.

For your technology stack, there are different solutions you could go with: for pretty much any moving part you need, you're going to find a hosted solution, or you could potentially roll your own. If you prefer to roll your own, I would definitely point you to the Cloud Native Computing Foundation — there have been a lot of great projects out in the past year or so: Kubernetes, Prometheus, Envoy. These are all really great products, and I'm sure there will be more in the coming year. Finally, data and retention policies: if you're going to be cloud first, you're going to need to collect people's information and hold it. Unfortunately, there is no one-size-fits-all answer for this — I wish there were — it really depends on your product and what you're trying to do.

Then I would suggest that you plan for success and try to be GDPR compliant from the jump.

Cool, so let's talk architecture. Our flagship product at Insight Engines is Investigator. This is a natural language solution that sits on top of Splunk — Splunk is a really great product, I love using it, and they're one of the sponsors of this event this year. So Investigator sits on top of Splunk: you ask a plain-English question, it's parsed by the engine and then compiled into SPL. So instead of writing a huge SPL query, you can ask a question like "show me endpoints with malware traffic in the past 30 days," and after it talks with our API, you're issued a large SPL query. If you're interested in knowing more about how we turned this from an on-prem product into a multi-tenant microservice, definitely check out the talk by my colleague, Insight Engines founding team member and lead architect Naveen Kumar — he gave a really great talk at an event this past September.

There are obviously some limitations when you develop for on-prem. If you're dealing with a customer who has a very short change window and you're not able to install the application, things get very difficult if you miss your window. That makes it very hard to deliver new features, squash bugs, and that sort of thing. So, considering the limitations of on-prem, you're faced with the biggest question in Silicon Valley — no, not that question; I'm so glad somebody laughed — but: will it scale? And the answer is a resounding no. So that's why we went cloud first.

So what does that look like when you go cloud first from being on-prem? Instead of installing the natural language engine on-site, the engine gets migrated to the cloud, and you install an API client that talks to an API in order to communicate with it. So then what does it look like when you have infrastructure in the cloud? You're just making a request, you're getting a response. I wish I could tell you it was actually that simple — it's a little more complex than that. This is what the moving parts in our cloud look like. It seems pretty simple, just a handful of parts, but there's also supporting infrastructure beyond this as well; this represents roughly a thousand lines of infrastructure as code, and I'll talk more about that in just a moment. But first I want to talk about the lifetime of a request before we get to the actual moving parts in the cloud.

The end user, at their endpoint, submits a query. The API client wraps that query up as an API request and sends some headers along for authentication and authorization. The request is received by API Gateway over an encrypted channel — API Gateway is the AWS service for routing requests — and authentication is handled by Cognito. If everything goes well, you're routed back to API Gateway, which then puts you in touch with a VPC link. The VPC link is very powerful because it routes you into a private cloud, which means none of your infrastructure within this region has to face the public internet — very useful. From there your traffic is routed to a network load balancer, which routes it to a container cluster running in Fargate. This is where we actually moved our natural language engine: it has been refactored to work as a cluster of containers within Fargate. And finally, if you need to access any databases, those are in the cloud as well — you access them via Secrets Manager.

Cool. So for the rest of the time, let's talk about the different moving parts of AWS.

First, you definitely want to automate as much of this deployment as possible. For AWS you can use CloudFormation for that — this is what's referred to as infrastructure as code. It's a declarative file format that you push up to AWS, and AWS takes care of building out all the moving parts for you. You can think of this as a way to do security configuration, because all the hardening you do for your various pieces of infrastructure is stored within CloudFormation as your source of truth. It's also very useful for disaster recovery: let's say one of your pieces of infrastructure just goes down for whatever reason, and it's outside of your control — you can just run CloudFormation again and have it up and running in a few minutes.

I talked a little bit about having different turbo buttons. The turbo buttons for CloudFormation are combining it with configuration management, and internal references within CloudFormation. By configuration management I mean products like Salt, Ansible, Chef, Puppet — those sorts of things. The way it works is this: you have a runbook with a set of instructions that works with CloudFormation and parses a file full of variables — things like your region, what availability zone you're going to be working in, the name of an image you're deploying to Fargate, those sorts of variables. Those variables populate a CloudFormation template, and when you're done you have a full CloudFormation file that you push up to AWS, which then spins up all the moving parts of your cloud.
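The runbook-plus-variables pattern described above can be sketched as a small Ansible playbook. This is a hedged, hypothetical sketch — the file names, stack name, and variable names are invented for illustration, not taken from the talk:

```yaml
# Hypothetical runbook: a vars file fills in a CloudFormation template,
# and the rendered file is pushed up to AWS.
- hosts: localhost
  connection: local
  vars_files:
    - vars/prod.yml            # region, availability zones, Fargate image name, ...
  tasks:
    - name: Render the CloudFormation template from the variables file
      template:
        src: templates/cloud.yml.j2
        dest: build/cloud.yml
    - name: Push the rendered stack up to AWS
      cloudformation:
        stack_name: nlp-cloud
        region: "{{ aws_region }}"
        template: build/cloud.yml
        state: present
```

The same shape works with Salt, Chef, or Puppet; the rendered CloudFormation file is what ends up as the source of truth.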

As for internal references: these really help to make CloudFormation its own standalone file, so you can do cool things like check it into version control. This is a small section of a CloudFormation template that creates a VPC link — again, that's what puts you into the private cloud. The two directives you want to pay attention to here are DependsOn — the VPC link needs a network load balancer to actually point to before it can be up and running — and the target ARN. An ARN is an Amazon Resource Name, usually a really long, complicated string, so instead of doing something weird like copying and pasting it, you can just make a reference to the object you've called NetworkLoadBalancer within your CloudFormation template. It really helps it to be a self-contained file.
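In CloudFormation YAML, the DependsOn-and-reference pattern just described looks roughly like this. This is a minimal sketch; the resource names and subnets are hypothetical:

```yaml
# Hypothetical sketch of the DependsOn / !Ref pattern for a VPC link.
Resources:
  NetworkLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Type: network
      Scheme: internal
      Subnets:
        - !Ref PrivateSubnetA
        - !Ref PrivateSubnetB
  ApiVpcLink:
    Type: AWS::ApiGateway::VpcLink
    DependsOn: NetworkLoadBalancer    # the link needs the NLB to exist first
    Properties:
      Name: nlp-engine-vpc-link
      TargetArns:
        - !Ref NetworkLoadBalancer    # resolves to the NLB's ARN -- no copy/paste
```

For an ELBv2 load balancer, `!Ref` returns the ARN, which is exactly what `TargetArns` wants — so the template stays self-contained.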

So let's talk a little bit about CloudWatch — this touches all parts of the cloud infrastructure as well. This is the centralized logging service from AWS, and it's something you could fill an entire BSides session with — and I should know, because I attended one last year. So shout-out to Jonathan: go check out that talk on YouTube, it's really great, very informative, I definitely got a lot out of it. You want to monitor every moving part that you possibly can — you know the old saying, that which is measured is managed; this is no different. So for your turbo button here, there are a couple of things you can do. You can create alarms based on the logging you're doing: let's say you're getting a lot of 400 errors out of API Gateway, or 503s — you can create alarms based on those metrics and take action on them. Maybe something is weird downstream, maybe something's wrong with your network load balancer, or something's going on with your container cluster. You can also integrate this with incident management — something like PagerDuty, for example — where your team can actually take action on the different alarms that pop up. And finally, you can do filter and pattern matching, which is really useful. If you do a search against all of your API Gateway logs, you can search for, say, all the requests that returned status 200. This becomes very powerful when you search for statuses that are not equal to 200: you start seeing all the errors being returned, and if you see something that looks out of the ordinary, you can take action on it. There was at least one time where I was able to detect an AWS outage before I got the notice from AWS, so it was pretty cool to have that sort of insight.
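One way to wire up the alarm-on-errors idea above is a metric filter feeding a CloudWatch alarm. A hedged CloudFormation sketch — the log group name, namespace, and threshold are hypothetical, and the filter syntax assumes JSON-formatted access logs:

```yaml
# Hypothetical sketch: count non-200 responses in an API Gateway
# access-log group and alarm when they spike.
Resources:
  Non200Filter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: /aws/api-gateway/nlp-api-access-logs
      FilterPattern: '{ $.status != 200 }'     # JSON filter-and-pattern syntax
      MetricTransformations:
        - MetricName: Non200Count
          MetricNamespace: NlpApi
          MetricValue: "1"
  Non200Alarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: nlp-api-non-200-spike
      Namespace: NlpApi
      MetricName: Non200Count
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 50
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
```

The alarm's action could then point at an SNS topic wired into an incident-management tool like PagerDuty.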

Cool, so moving right along: IAM, which is Identity and Access Management. This is how you delegate access to your internal users and also to your infrastructure. That probably sounds pretty weird — how does my infrastructure have different roles, different access? AWS has this concept of an execution role. The most granular level you have in AWS is a permission; a bunch of permissions — basically a JSON blob — can be put into a policy, and a bunch of policies can be put into a role. This becomes very useful if, say, your container cluster needs to talk to several things, like Secrets Manager and other services as well. For those of you who are familiar, this probably sounds a lot like role-based access control.
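The permissions → policy → role layering might look like this in CloudFormation. A sketch only — the policy scope and secret ARN are invented for illustration:

```yaml
# Hypothetical task execution role: permissions (a JSON-ish blob) grouped
# into a policy, policies attached to a role the container cluster assumes.
Resources:
  TaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: read-app-secrets
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action: secretsmanager:GetSecretValue
                Resource: arn:aws:secretsmanager:us-west-2:123456789012:secret:app/db-*
```

Scoping the `Resource` to a specific secret prefix, rather than `*`, is the least-privilege part.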

Okay, cool. So let's start talking about the individual parts of AWS, specifically networking. We have the virtual private cloud (VPC), which is the entire set of network IP addresses that can be used by the infrastructure within your cloud; subnets, which are just a subset of those addresses; and route tables, so that your subnets can talk to each other. Then there are security groups — someone mentioned security groups in the question session of the last talk — which are a set of addresses, protocols, and ports that allow communication to your various pieces of infrastructure, things like a database or a virtual machine. Network access control lists are exactly the same idea, but applied to subnets. You have a couple of different turbo buttons here. I strongly encourage you to have both security groups and access control lists, but also make sure that they agree. For example, if you have a MySQL database running on port 3306 and you have the correct security group for it, but you don't allow 3306 in the access control list of the subnet the database belongs to, you're gonna have a bad time. And that's ingress, but there's also egress: by default, whenever you create security groups and access control lists, they are quad zero (0.0.0.0/0), so definitely take the time to lock those down. If you only need to talk to your virtual private cloud, make sure you set that address range appropriately.
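A sketch of a security group and a network ACL entry that agree with each other, per the MySQL example above. The CIDRs and names are hypothetical:

```yaml
# Hypothetical sketch: a MySQL security group and a network ACL entry that
# agree -- if the NACL blocks 3306, the security group rule alone won't help.
Resources:
  MySqlSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow MySQL from inside the VPC only
      VpcId: !Ref Vpc
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 3306
          ToPort: 3306
          CidrIp: 10.0.0.0/16       # the VPC CIDR, not 0.0.0.0/0
  DbSubnetAclEntry:
    Type: AWS::EC2::NetworkAclEntry
    Properties:
      NetworkAclId: !Ref DbSubnetAcl
      RuleNumber: 100
      Protocol: 6                   # TCP
      RuleAction: allow
      CidrBlock: 10.0.0.0/16
      PortRange:
        From: 3306
        To: 3306
```

The same "make them agree" check applies to egress rules, which both resource types also support.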

All right, so let's start talking about the actual parts that the request touches. API Gateway is downstream of the API request — this handles request routing within AWS. It can point to things like a VPC link, if you need to send requests along to other pieces of the infrastructure, or it can point to a Lambda function if you just need to render some page or return some JSON response, something like that. There are a couple of different security mechanisms you can use here. If someone is being really abusive — just giving you a hard time — you can rate-limit their API key, and if they still keep giving you a hard time, you can outright deactivate their API key. You can also leverage integration timeouts: if you have an integration that you know should take X amount of time, you can set that integration time limit — up to a maximum of 29 seconds — so you don't have long-lived requests hanging around in your API Gateway.

Cool, so there's a turbo button you can leverage within API Gateway. It can be a little complex the way I'm going to show it, so please forgive the heavily redacted screenshot here. I'm sure we all have DNS registrars that we use for our various roles at our companies, but I would strongly recommend that if you're going to use API Gateway, you have the DNS managed by AWS. You can have a certificate for whatever domain you're interested in issued by AWS Certificate Manager, and from there — this is where it becomes really powerful — you can set different endpoints within your API to point to different parts of your infrastructure, like different VPC links. So let's say you're deploying an upgrade: you could have both versions of your code base running behind the same API Gateway, and instead of having downtime, or updating the DNS record with your registrar, you just flip a bit within API Gateway. This is very powerful — it can equate to nearly zero downtime when you're doing an upgrade.
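The ACM-certificate-plus-custom-domain trick might be declared like this. Names are hypothetical, and `ApiCertificateArn` is assumed to be a stack parameter; the `Stage` value is the bit you flip during an upgrade:

```yaml
# Hypothetical sketch: an ACM-issued cert on an API Gateway domain name,
# with a base path mapping you repoint for near-zero-downtime upgrades.
Resources:
  ApiDomain:
    Type: AWS::ApiGateway::DomainName
    Properties:
      DomainName: api.example.com
      RegionalCertificateArn: !Ref ApiCertificateArn   # cert issued by ACM
      EndpointConfiguration:
        Types:
          - REGIONAL
  LiveMapping:
    Type: AWS::ApiGateway::BasePathMapping
    Properties:
      DomainName: !Ref ApiDomain
      RestApiId: !Ref NlpApi
      Stage: v2        # flip this from v1 to v2 once the new code is ready
```

Because the domain and certificate stay put, clients never see a DNS change — only the mapping behind the domain moves.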

All right, moving along, let's talk about Cognito. This is the part that handles authentication for users with API Gateway — it's for user management. The really nice thing is that it offers you an app client for user sign-up and log-in, and it also offers a user database. I have this on the public part of the diagram, but I just want to be clear that your user database is not public. There are public-facing parts of Cognito, like sign-up and log-in, but your database is definitely not accessible by the public.

All right, your turbo button — let's talk about my favorite thing in the world: OAuth. An authorization code grant flow looks a little something like this. At some point, Cognito is going to expect you to return a key-value pair: Location and some URL. That URL has a bunch of query string parameters, and it can be kind of difficult or weird to find where those different bits of information live. You can get the Amazon Cognito domain name that's provided to you under app integration, and similarly, the client ID and the redirect URL can be found in the app client settings. One really frustrating thing for me was that the callback URL is what's put in for the redirect URI — it would be nice if the verbiage were the same — but it's pretty straightforward to figure out once you get a handle on things.
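A user pool and app client set up for the authorization code grant could be sketched like this in CloudFormation. Names and the callback URL are made up; note that the `CallbackURLs` property is what OAuth calls the redirect URI:

```yaml
# Hypothetical sketch of a Cognito user pool client wired for the
# authorization code grant flow.
Resources:
  UserPool:
    Type: AWS::Cognito::UserPool
    Properties:
      UserPoolName: nlp-api-users
  AppClient:
    Type: AWS::Cognito::UserPoolClient
    Properties:
      UserPoolId: !Ref UserPool
      ClientName: nlp-api-client
      GenerateSecret: false
      AllowedOAuthFlowsUserPoolClient: true
      AllowedOAuthFlows:
        - code                      # authorization code grant
      AllowedOAuthScopes:
        - openid
      CallbackURLs:
        - https://app.example.com/oauth/callback   # aka the redirect URI
      SupportedIdentityProviders:
        - COGNITO
```

Keeping the client ID, Cognito domain, and callback URL in the template (rather than clicked into the console) makes the "where do those query string parameters come from" hunt much shorter.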

Downstream of API Gateway and Cognito are the VPC link and the network load balancer. The VPC link, again, is an integration endpoint; you have a maximum of five per account, so when you're designing for cloud first, just be aware that you don't have an unlimited number of VPC links. The network load balancer, as the name would imply, routes traffic at layer four and uses a flow hash algorithm for load balancing. Your turbo button here: make sure you have cross-zone load balancing enabled, such that if you have a bunch of containers distributed unevenly across different subnets, it will route traffic based on the number of containers you have, not on which subnet each one is in. And — I had to learn this the hard way — please do not perform your health checks from the network load balancer, unless you're interested in DDoSing yourself. That's just an example of what the CloudFormation template looks like for the VPC link and network load balancer.
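Cross-zone load balancing is a load balancer attribute; a hedged CloudFormation sketch, with hypothetical subnet names:

```yaml
# Hypothetical sketch: enabling cross-zone load balancing on the NLB so
# traffic is spread by target count, not by subnet.
Resources:
  NetworkLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Type: network
      Scheme: internal
      Subnets:
        - !Ref PrivateSubnetA
        - !Ref PrivateSubnetB
      LoadBalancerAttributes:
        - Key: load_balancing.cross_zone.enabled
          Value: "true"
```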

All right, into the home stretch: Fargate and ECS. Fargate is downstream of the network load balancer — this is the magic container service from AWS. To reiterate, if you're interested in rolling your own technology here, definitely look into Kubernetes; that is an option you could use. If you're migrating from on-prem to the cloud, this is probably where you're going to put most of your logic, and if you can containerize it, I think that's great. Your turbo button here: definitely no public access for this container cluster, and you can actually declare that. A shout-out to Omar — I hope I'm pronouncing that name correctly — who wrote a very good blog post called "Deploying Fargate services using CloudFormation." If you're interested in using Fargate, search for that blog post and check it out; it definitely solved a lot of problems for me.

Okay, so if you'll forgive the ceiling-to-floor screenshot here — I think Stanley Kubrick would be proud — this gives you an example of what that looks like with CloudFormation and Fargate. This is what the network configuration looks like, and you can see there that the public IP has been disabled, so you can rest assured it's not exposed to the public internet. And this is the task definition, which is basically how you define your containers within the cluster. Some tips from Omar's blog post: you will need to declare an execution role ARN and also a task role ARN; your network mode will need to be awsvpc; and for the logging configuration, you're definitely going to have to specify an awslogs stream prefix value. In addition to that, as I mentioned, don't do your health checks from your network load balancer — do them from your container instead, and this is what that looks like. If you're interested in building out something a little more complex than this, check out the Docker API; that's how I was able to configure it. Also, if you're familiar with containers, you know that you only get access to the ports you expose — so even if you did have this exposed to the public internet and someone was trying to SSH in, unless you've exposed port 22, you should be good.

All right, the last moving part: RDS — the databases that live within the virtual private cloud in AWS. There are several different ways you can harden this, and I'm sure that if you're using managed databases on AWS, you'll be very interested, because this is where your user data and other data is persisted. First, please put it in a virtual private cloud, in a private subnet, and assign it to a hardened security group — again, make sure this has no public access. You're probably seeing a theme here of security in depth. Your turbo button: if you need to access the databases, you can hold those secrets within Secrets Manager. Secrets Manager is a public-facing service from AWS that's only accessed via an API, and only if you have the correct IAM permissions. That's been very useful, because you don't have to hold your credentials in a code repo or anything like that.

Cool. All right, so this is what it looks like if you're creating an RDS instance using CloudFormation: you can see that it is not publicly accessible, and that it belongs to this subnet and this security group. I talked a little bit about security groups, so here's what those look like — this is how you would declare a security group along with the ingress and egress rules. The thing to note here is that this is protocol 6, which is TCP, and you can only have ingress into the subnet on the Postgres port — which, if you're familiar with Postgres, is 5432. And for egress, the interesting thing is that you can only have egress to the CIDR block of the local VPC.
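A sketch of the RDS-plus-Secrets-Manager pattern above. Instance sizing and names are hypothetical; the dynamic references pull the generated credentials out of Secrets Manager so nothing lands in the repo:

```yaml
# Hypothetical sketch: a Postgres RDS instance kept off the public internet,
# with its credentials generated and held in Secrets Manager.
Resources:
  DbSecret:
    Type: AWS::SecretsManager::Secret
    Properties:
      Name: app/db-credentials
      GenerateSecretString:
        SecretStringTemplate: '{"username": "app"}'
        GenerateStringKey: password
        ExcludeCharacters: '"@/\'
  Database:
    Type: AWS::RDS::DBInstance
    Properties:
      Engine: postgres
      DBInstanceClass: db.t3.medium
      AllocatedStorage: "20"
      PubliclyAccessible: false      # never expose the database directly
      DBSubnetGroupName: !Ref DbSubnetGroup
      VPCSecurityGroups:
        - !Ref PostgresSecurityGroup
      MasterUsername: !Sub '{{resolve:secretsmanager:${DbSecret}:SecretString:username}}'
      MasterUserPassword: !Sub '{{resolve:secretsmanager:${DbSecret}:SecretString:password}}'
```

Application containers can then fetch the same secret at runtime via the Secrets Manager API, gated by their IAM role.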

So that wraps it up for me. Before I finish, I just want to say thank you to all the volunteers who put on this event every year — I really look forward to it, it's one of the highlights of the year for me. So thank you very much, and with that, I guess we all go to happy hour. Thank you.

[Host] We have about four minutes for questions, Victor. Does anyone have any questions? No questions — everyone wants to go to happy hour?

[Audience] What recommendations do you have for when resources that are spun up with CloudFormation go down? Not everything has a health check, or they can be tampered with — do you have detections, and then automated responses, for those cases?

[Victor] Yes. Something that's very useful: rather than one CloudFormation template that deploys your entire cloud, you can have smaller subsets of it, so that if something goes down for whatever reason, you don't have to redeploy the entire thing — you just deploy the stack of infrastructure that failed. So you can monitor the various parts of the infrastructure, and that might be part of your incident response: oh, I saw this part of the infrastructure went down, so that means I have to run this playbook with this vars file. Does that answer your question?

[Audience] Yes. So when you respond to that and run a playbook — like an Ansible playbook — is that part manual?

[Victor] Yeah, it's manual in that you just have to run a command in your shell.

[Host] There's a question online: Victor, do you have any recommendations on how to bring your developers up to speed when migrating from on-prem to AWS?

[Victor] Yes. When we started this journey, we talked a lot about what it would take to go from on-prem to the cloud — I would have meetings with my infrastructure team, QA, and the lead architect, and whenever we needed to fold in the other developers, we would keep them up to date, maybe once a month or something like that. And if there are roadblocks you anticipate along the way, it's good to share that with everyone at the beginning of the project — that way you can manage expectations.

[Audience] I'm assuming this is all an existing production application, so what were some of the things you had to deal with to make sure you didn't have any customer downtime and make it completely transparent — especially migrating credentials?

[Victor] Right. For transparency, we use a status page service, and anyone who registers with our product is registered with that service, so any time we need to have downtime we can say, hey, it's going to be at about 2:00 p.m. next Tuesday, and it's going to be for three hours — that sort of thing. Luckily, we didn't have to migrate any credentials, because beforehand the credentials were living on-premises for each user, and then we had them sign up with the API service. So luckily we were able to avoid that potential headache.

[Host] One last question online: with regards to hiring, what are some of the areas of expertise needed for that team?

[Victor] It's funny — I actually got this question at the last event as well. This kind of engineer is actually pretty hot right now, and that's because it's a full-time job keeping this stuff up and running. Any time we have an outage or something like that, it takes time away from developing new features and things like that. So if you're comfortable with, or have experience in, managing infrastructure like this — keeping a service up and running, those sorts of things — that's really great. Cloud-specific skills aren't super important, because if you're familiar with distributed computing on one platform, you can probably pick it up on another platform as well.

[Host] Great. On behalf of BSides, Victor, we thank you for your presentation.

[Victor] Awesome, thank you.