
Hi everybody, and welcome to "How to CTF: Infra Beyond the Challenges and Flags." Oh, sorry, I appreciate the feedback; a little bit of first-time jitters. Thank you, friend. Learning things every day.

Hi, I'm Max, my pronouns are they/them, and I've had the fortune of working with Cloud Village and doing infrastructure work for them since 2020. A little bit about me: I'm from New York, I used to run a comic book store, and whiskey, fruit, veggies, love, peace, and coffee are a few of my favorite things.

We're here mostly because we've been running CTFs for DEF CON since 2019 and we wanted to share with the community. We've learned a lot: a lot about what we do that's successful, and a lot of things to watch out for. We want to share them so that everybody else who wants to run a CTF can come in forewarned and forearmed.

Level setting: what is a CTF? It's a contest where a group of one or more people get together, break some things, solve some puzzles, learn a little bit, and grow as security practitioners. And if you're doing it right, you get to grow as a human being as well. The more CTFs there are, the better it is, not just for your small group but for your community. You get to share, build a little bit of empathy, and keep working on removing that stereotype of security as the department of "no."
You have an opportunity for education and endorsement, and for including more folks, with a little more empathy, in what you do every day.

So when you're thinking about running a capture-the-flag contest, you're interested in a platform that allows users to register, to submit flags, and to actually see the challenges you're presenting to them, plus a little bit of back end of your own: making sure any errors are cleaned up, making sure any malicious users can be banned, and general behind-the-scenes administrative work. So far, the tool of the day is CTFd. It is a really easy tool to deploy, the documentation is really solid, and the dependencies are really well mapped out. There was a competitor back in 2018, Facebook's FBCTF, but I say "back in 2018" because the project has since been archived. CTFd is, for now at least, pretty much the only game in town.

So CTFd is this framework that gives you the capability to post challenges, have users register, and accept flag submissions. To quote a little bit from GitHub: "a capture the flag framework focusing on ease of use and customizability." It is open source (CTFd/CTFd on GitHub), created by Kevin Chung, who's ColdHeat on GitHub. If you're interested in running it on your own infrastructure, similar to what we do, that is absolutely a thing you can do.
Alternatively, if you're interested in running a CTF but infrastructure engineering is not your passion or something you want to explore right now, they have commercial services at ctfd.io, if managed is more your jam.

The first time that we deployed this (excuse me, the first time that I deployed this) in 2020, here's a little bit of what it looked like. We've got our Elastic Load Balancer as our network boundary, and everything else inside a VPC in AWS: virtual machines running CTFd, a remote email server, remote asset storage in S3, a remote MySQL database running in RDS, and a remote session cache running Redis in ElastiCache. Everything we do is written as infrastructure as code, one module per component; our particular tool is Terraform, built by HashiCorp. Since this particular deployment used virtual machines to host the application, we also used another tool called Packer, which gave us the ability to stamp out repeatable, known machine images that we could then run in an auto scaling group. CTFd can, for a very good reason, attract a lot of traffic, and you want to make sure that you have the capability to absorb that as you go. Auto scaling groups, in particular CPU-driven ones, which is what we've tuned this for, give you the capability to have your fleet dynamically scale out to absorb the traffic and then slowly scale back in as your traffic spikes decrease. That way you're not shooting machines in the head that are actively serving traffic, and you're also not paying for machines that are hanging out, not really doing a whole lot of work.
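To make that concrete, here's a minimal sketch of what a CPU-driven scaling policy can look like in Terraform. The resource names and the 60% target are hypothetical, not our production values:

```hcl
# Hypothetical sketch: scale the CTFd fleet on average CPU.
# The ASG itself (aws_autoscaling_group.ctfd) is assumed to exist elsewhere.
resource "aws_autoscaling_policy" "ctfd_cpu" {
  name                   = "ctfd-cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.ctfd.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    # Add instances when average CPU sits above this, remove them below it.
    target_value = 60
  }
}
```

With target tracking, AWS handles both the scale-out and the slow scale-in for you, which is exactly the "absorb the spike, then drain down" behavior described above.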
Now, what's not pictured on the architecture diagram: one behind-the-scenes component that I do want to call out is AWS Secrets Manager. Surprising nobody, we had to store secrets. I call it out because you need secret storage to provide database credentials, and you need secret storage to provide your object storage access keys; at least as of when I put this together, CTFd didn't support using IAM roles for that. Feelings. Also, and this one got me the first couple of times I was deploying it: you need a secret key that every single instance shares, so that all the cookies are signed correctly and sessions can be accurately and consistently retrieved from the cache. Otherwise things start getting weird, your user requests drop on the floor, and people have to start their sessions again and again.
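Here's a minimal sketch of what provisioning that shared secret key can look like; the names are hypothetical, and the fleet would read the value back at boot and hand it to the app as SECRET_KEY:

```hcl
# Hypothetical sketch: generate one SECRET_KEY and store it centrally,
# so every instance in the fleet signs cookies with the same value.
resource "random_password" "ctfd_secret_key" {
  length  = 64
  special = false
}

resource "aws_secretsmanager_secret" "ctfd_secret_key" {
  name = "ctfd/secret-key"
}

resource "aws_secretsmanager_secret_version" "ctfd_secret_key" {
  secret_id     = aws_secretsmanager_secret.ctfd_secret_key.id
  secret_string = random_password.ctfd_secret_key.result
}
```

If each instance generated its own key instead, a user's session cookie would only validate on the instance that issued it, which is exactly the weirdness described above.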
A very, very brief aside about infrastructure as code. I have feelings; I will contain them as best I can. Infrastructure as code gives you the capability to document, and to make consistent and known, how you're deploying your infrastructure. You do it in code instead of words because code is generally the way that you talk to your infrastructure providers, and because it's code, you can read it, you can share it, and you can have it code reviewed, instead of just hoping somebody reads your bash script. Mine is pretty janky, so anything that makes it easier to read is always an asset. Terraform in particular I enjoy because it is a declarative tool: you tell Terraform what you want to build, not how to build it. There are other tools that take other approaches, but those are not so much my jam. Something else I do want to call out about Terraform in particular: it has the capability to pre-validate some of the configuration changes that you're deploying. That is really useful when you're in development and you want to test whether AWS, for example, will accept a configuration. It is quick feedback, it is a really powerful capability that Terraform provides you, and it's mostly why I use it as much as I do.
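As a tiny declarative example (bucket name hypothetical): you describe the thing you want, and Terraform figures out the API calls:

```hcl
# You declare the bucket you want; Terraform works out how to make it so.
resource "aws_s3_bucket" "ctf_assets" {
  bucket = "example-ctf-assets"
}
```

Running `terraform validate` checks a configuration like this against the provider's schema, and `terraform plan` previews the exact changes, all before anything actually touches AWS.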
So, the next year, 2021: again we have an architecture with a public-facing load balancer, but this time CTFd is running in a different service, Elastic Container Service (ECS), a managed container offering from AWS. There's still a remote email server, remote asset storage in S3, a remote database in MySQL, and a remote cache in ElastiCache with Redis. If you're wondering whether that looks almost identical, that's because it almost is.
The one component that we swapped out was the previous year's virtual machines, for containers. That was a big deal for me, because I wanted to make a stronger move toward accountability. If I provision VMs on the fly, I can SSH into them, change them, and then forget exactly what I did to solve a problem. I've done that a lot, and I'd like to do less of it. Using containers forces me, or any other engineer you're working with, to encode all of their configuration changes in the repo that you're using in order to deploy them; it cannot be done by hand. Let me say that differently: you could, but I would ask you not to. ClickOps is still real, but this forcing function helps with better traceability and better knowledge sharing across your team.

Everything else is still stateless, so that portion of the configuration doesn't change. And instead of using auto scaling groups for virtual machines, the Elastic Container Service has the same mechanics and the same APIs, so we could reuse that scaling logic. In particular, this time around I didn't want to write it all from scratch again. Anybody doing infrastructure-as-code work can reach for the Terraform Registry: you have a library of existing modules and reusable knowledge, so you can either learn from code that's already been written, or pop it off the shelf the same way we do pip install requests, and keep it moving.
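For example, here's roughly what pulling a community module off the shelf looks like. The module shown (terraform-aws-modules/vpc/aws) is a real registry module, but the version pin and values here are hypothetical:

```hcl
# Consume a community VPC module from the Terraform Registry
# instead of hand-writing every subnet and route table.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "ctfd"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b"]
  public_subnets  = ["10.0.1.0/24", "10.0.2.0/24"]
  private_subnets = ["10.0.101.0/24", "10.0.102.0/24"]
}
```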
Also worth noting, at least for me, in terms of cost savings: for this deployment we switched from Secrets Manager to Parameter Store. It's cheaper, I like the UX, it's a lot easier to reason about, and it's a lot easier to work with. ECS has good support for both; this is my personal preference, mostly from a usability perspective and mostly from a cost perspective.
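A minimal sketch of that swap, with hypothetical names; the ECS task definition can then reference the parameter's ARN as a secret instead of baking the value into an image:

```hcl
# Hypothetical sketch: store the database password as a SecureString
# in Parameter Store rather than Secrets Manager.
resource "aws_ssm_parameter" "ctfd_db_password" {
  name  = "/ctfd/db-password"
  type  = "SecureString"
  value = var.db_password
}
```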
So, how did it go? The first year we deployed this, with virtual machines in 2020, we choked pretty hard during registration, actually. It was a pretty stressful time. The web page was crawling, users couldn't register, and sometimes the page itself would just fail to load. And I'm thinking: oh cool, this is DEF CON, this is a really good problem to have, we're scaling, we're absorbing the traffic, the auto scaling group will do its thing. And then the auto scaling group didn't do its thing, which led to a little bit more panic. We started looking at how the auto scaling group was configured: CPU usage, which is, again, our scaling factor, was really low. Okay. Take a look at the database: the database is not doing a lot. Maybe the cache is choking, maybe it's working really hard? Also doing basically nothing.
After a lot of panic and a lot of very high-speed documentation reading, it turns out the culprit was Gunicorn. Gunicorn is, well, I hesitate to call it an app server because that makes people think of Java, but it is a web server that fronts Flask applications, and in particular it is what we were using to deploy CTFd. I want to call it out because the default configuration is one thread and one worker. Really good and clean for local development, good and clean for testing, and actually really terrible for a production deployment. Once you've identified this, you can tweak the parameters as necessary and actually make use of the hardware you're provisioning.
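For a VM fleet, the fix can be encoded in the launch configuration so it never regresses. Here's a hedged sketch, not our actual code: the launch template, AMI variable, instance type, and install path are hypothetical, and it assumes the Packer image ships CTFd with gunicorn on the PATH:

```hcl
# Hypothetical sketch: bake the worker count into the instances' boot script
# so the single-worker default can never sneak back into production.
resource "aws_launch_template" "ctfd" {
  name_prefix   = "ctfd-"
  image_id      = var.ctfd_ami_id # image stamped out with Packer
  instance_type = "t3.large"

  user_data = base64encode(<<-EOF
    #!/bin/bash
    cd /opt/CTFd   # hypothetical install path baked into the AMI
    # A common gunicorn rule of thumb: (2 x vCPUs) + 1 workers.
    WORKERS=$(( $(nproc) * 2 + 1 ))
    gunicorn 'CTFd:create_app()' \
      --bind '0.0.0.0:8000' \
      --workers "$WORKERS" \
      --threads 2
  EOF
  )
}
```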
Once we figured this out and were able to relaunch our fleet, everything was saved: users were able to register, people could start reading challenges, flags could start to be submitted, and we were out of the woods for 2020.

2021, our first year running the Elastic Container Service deployment, actually went really well; no issues to speak of. But we were not so lucky during 2022. 2022 brought us an issue where database connections weren't being successfully closed by the application, inadvertently running the risk of denial-of-servicing your own database from your own application and knocking yourself to the floor. Super fun. Something to be aware of: watch your database connections as you're running CTFd.
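One cheap mitigation is an alarm on the RDS connection count, so you hear about it before the database falls over; a hedged sketch, with the threshold and names as placeholder values:

```hcl
# Hypothetical sketch: alert when the RDS connection count stays elevated.
resource "aws_cloudwatch_metric_alarm" "ctfd_db_connections" {
  alarm_name          = "ctfd-rds-connection-spike"
  namespace           = "AWS/RDS"
  metric_name         = "DatabaseConnections"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 3
  threshold           = 80 # tune to your instance class's max_connections
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.ctfd.identifier
  }

  alarm_actions = [var.alerts_sns_topic_arn]
}
```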
CTFd needs love in terms of being containerized; it needs love in terms of handling signals appropriately. Something else that we noticed, and, fun fact, it happened again this year, so it wasn't just a fluke: CTFd has what looks like a memory leak. It will run and run and run, progressively use up all available memory, and then choke, even when it's doing nothing. Run it for a week in development, just kind of hanging out, and it will start to try and die. So, something to be aware of. You can work around this by rolling your fleet semi-regularly. It is a workaround; it is not a patch that we've submitted upstream.
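For ECS, one way to make that roll-the-fleet workaround a one-command operation is to have every `terraform apply` force a fresh deployment. This pattern comes from the AWS provider documentation, though the service shown here is a hypothetical fragment and `plantimestamp()` needs Terraform 1.5 or newer:

```hcl
# Hypothetical fragment: force ECS to replace the CTFd tasks on each apply,
# which papers over the slow memory growth between rolls.
resource "aws_ecs_service" "ctfd" {
  name            = "ctfd"
  cluster         = aws_ecs_cluster.ctfd.id
  task_definition = aws_ecs_task_definition.ctfd.arn
  desired_count   = 2

  force_new_deployment = true
  triggers = {
    redeployment = plantimestamp()
  }
}
```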
running these is live um everything that all the issues that we had to fix this year when we hit similarities is live everything that we hit for Defcon in 2020 and 2021 also live we don't have the luxury during a live event to take down throw in pdb and get really really nitty-gritty with the specific errors that we're dealing with which is unfortunate because as interested as we are uh we are significantly more interested in making sure that our contestants have a platform and can have access to the challenges that we've worked so hard to bring them foreign so I've got things that I want for the future um given the incidents that I just
talked about I'm really really interested in better logging um AWS cloudwatch is what we're using by default with ECS and previous to that we're wrapping a bunch of log files this isn't scalable kind of surprising nobody having a better log searching and having a better interface really for just how we ingest that information in parse it is something I'm really interested in a separate thing about logging with like structured data but you know we'll get there eventually so that's one area of interest I'm pretty interested in looking at uh grafana Loki it's a different way to ingest logs particularly from containerize apps hopefully that could be a good use a good use case time will tell the other
The other big area of improvement that I'm looking at is refactoring the infrastructure-as-code codebase itself. One option that I've been thinking about is a tool called Terragrunt. It gives you the capability to provide really strict dependency management: instead of guessing which areas of code will impact others, you have very clearly drawn-out scoping for each module and each component. That's good for reasoning about the code, good for code review, and it also potentially gives us the capability to provide better access control. If I have an engineer working on caching infrastructure, I don't necessarily need to give them access to the load balancer or the rest of the compute infrastructure.
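A minimal sketch of what that scoping might look like, with a hypothetical directory layout; each component gets its own terragrunt.hcl, and dependencies between components are declared explicitly:

```hcl
# Hypothetical live/cache/terragrunt.hcl: the cache component declares
# exactly which other component it depends on, and nothing else.
terraform {
  source = "../../modules//elasticache"
}

dependency "network" {
  config_path = "../network"
}

inputs = {
  subnet_ids = dependency.network.outputs.private_subnet_ids
}
```

Because the dependency graph is explicit, you can hand an engineer just the cache directory, and code review for that component stays scoped to that component.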
This is very much wishlist, to be fair. An alternative approach to the codebase that I've been looking at is the Cloud Development Kit from HashiCorp (CDK for Terraform): effectively switching to a general-purpose programming language to create the infrastructure, instead of using Terraform's declarative way of interacting with it. Instead of telling it what you want, you would be describing each component and how it fits together in your general-purpose programming language. For me, I work a lot in Python, so spending a little more time in an IDE is never a bad thing; I don't spend too much of my life in them as it is. But that's pretty much that; we'll see how it goes. I haven't made a decision yet.
For anybody who's interested in following along with what we've built, it's all online at GitHub: we are at Cloud Village, and the project is the Cloud Village CTFd infrastructure. Again, it is purpose-built to run in AWS, though a lot of the primitives, I think, are reusable as far as the components that are available. This is the shape it's in now; we'll see what shape it takes in the future. And now, to talk about what we've got going on for this year, I want to bring out Jazz. Oh, cool, I apologize; we do have a little bit of time for Q&A, if anybody is curious about things. Right here.
"How's it going so far this year?" So the question is, how's it going so far. For 2020... oh god, what is time, I was about to say 2022 again. 2023 is okay. The memory leak that I talked about and the database issue that I talked about both struck us this year; I was up at four o'clock this morning fixing things. So on one hand, while our codebase is solid, I think the application, and potentially our handling of it, might need some love in order to address those things. I like to avoid just shooting applications in the head if I can. So my hope is that sometime between now and post-DEF CON I'll be able to spend some time doing local development and banging on it to figure out why we've got this memory leak; the database thing is a particular thorn in my side, so I'm very much hoping for that. But all of that notwithstanding, this year is pretty good: no new terrible things, just existing frustrations that, thankfully, we were already equipped to handle.

"Are you implementing Terratest at all, for testing?" Sorry, so the question is, have we implemented Terratest. For folks unfamiliar, Terratest is an infrastructure-as-code testing framework, primarily based in Go, from the same folks who built Terragrunt, if anybody's curious. One of the reasons, actually, why I want to explore the Terragrunt refactor is to make it that much easier to eventually move to Terratest. Terratest requires your infrastructure-as-code components to be built in a particular way, so that creating and destroying them on the fly is relatively easy. This is very reasonable, and also not frequently how you will find infrastructure as code deployed in anyone's actual production environment. If you can do it, more power to you; I know it's always a work in progress for a lot of folks.
"What database are you using, and where is the database?" Yeah, fair question. So the question is, what's the database and where is it. CTFd supports MySQL. Theoretically it supports Postgres; I say theoretically because it exists in the documentation, but the documentation also says it's unofficial, so one's mileage may vary if you want to swap that out. We run our database in particular in RDS, to manage a lot of the availability, to manage configuration, and to manage failover if necessary.

"Have you considered rewriting in Go?" [Laughter] I would not be so bold as to presume to rewrite a maintainer's existing project in a programming language I am unfamiliar with.

"Okay, two questions. First of all, some types of web exploitation attacks will sort of destroy the state for all other users, specifically cache poisoning, for example. Have you ever thought of, like, a one-container-per-team model, or anything similar, and how would that work with auto scaling?" So the question is, given that cache poisoning is a thing, how do you work around that, and what can you provide to still enforce team-level isolation for challenges? In particular, that's not something we deal with, and that has less to do with the architecture and more to do with the specifics of how we host challenges, which is to say: we don't. CTFd is an application; you can put stuff on the same infrastructure.
One way that a lot of challenges are deployed with CTFd is, if I have a VM running it, I can put all my individual challenges on the same box for ease of use. Very intentionally, we don't do that. Every single challenge that we have is hosted on a different piece of infrastructure, entirely isolated, not just from the infrastructure-as-code perspective but also from a networking perspective, specifically so that if somebody does take out a component, it's not going to cascade and take out the rest of them.
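As a purely hypothetical sketch of that isolation model, each challenge is its own module instantiation with its own network, so nothing is shared between them:

```hcl
# Hypothetical: one module call per challenge, each with its own CIDR,
# so compromising one challenge's box can't cascade into the others.
module "challenge_alpha" {
  source = "./modules/challenge"
  name   = "challenge-alpha"
  cidr   = "10.10.1.0/24"
}

module "challenge_bravo" {
  source = "./modules/challenge"
  name   = "challenge-bravo"
  cidr   = "10.10.2.0/24"
}
```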
I apologize, so that was one question; you had a second one? "Yeah, second question: I was wondering if you've investigated CTFd alternatives like rCDS or rCTF before, and what your opinions were." Not yet, honestly. I happen to be a little salty today, so I'm very interested in looking at an alternative, but right now we have experience with CTFd, and we have experience with its failure domains, which matters: I'm an infrastructure engineering gal, I care a lot about my failure domains. So, now that we're familiar with those, I'm not hesitant to change it, but I would have to do a really strict evaluation of what I'm gaining if I decide I want to shift. Yeah, I think that's about it. Cool.