
Firewalling dynamic infrastructures (the cloud) with Chef and Netfilter

BSides Delaware · 2012 · 43:08 · 1.3K views · Published 2012-11
Category: Technical
Style: Talk
About this talk
Title: Firewalling dynamic infrastructures (the cloud) with Chef and Netfilter
Speaker: Julien Vehent (@jvehent)
Event: Security BSides Delaware, 11/9/2012, 12pm
Transcript [en]

Alright, let's get started. Like I said, this talk is going to be about managing your firewalls, essentially. One of the problems I repeatedly run into is having firewalls that get completely out of date, with rules that do not represent anything that is currently running in the infrastructure. If you have a little bit of turnover in your company, one sysadmin did something one way five years ago, the new ones are doing it differently, and nobody wants to touch the old stuff because it might break something. So that's a problem, and I'm going to talk about that problem and how we are solving it at my company, which is

AWeber Communications, and I'm going to get into that. So who am I? I'm a systems and security engineer at AWeber. I'm mostly a Linux guy: I've been working primarily on Linux for the last 10 years, mostly on web infrastructure, everything from web hosting to email hosting. That's great, because I work for companies that actually work on the web only. AWeber is an email marketing company, and we build a SaaS, software as a service, application. We've been doing that for 13 years and we have thousands of customers, I think something like 120 thousand right now, most of whom use our web application every day. So we have a pretty big web infrastructure, and we have a lot of

moving pieces in there, and managing it is not always easy. So here's the problem, the main problem that we're facing at AWeber. We have essentially monolithic firewalls at the entrance of our infrastructure. Pretty much everybody does: you have a big firewall that receives all of your web traffic, all of your internet traffic, and filters everything that goes out. The problem is that they work very, very well at the edge, but inside your infrastructure they don't work well at all. If you have two services that want to talk to each other, then you want to say server A needs to connect to server B, so I'm going to create a rule in my firewall that says server A is allowed to

talk to server B. What that means is that for server A to connect to server B, everything needs to be routed through the firewall. That's the first thing. The second thing is that once your infrastructure gets a little bit big, you very quickly have thousands of rules that you have to manage. That doesn't scale at all. If you're the guy who has to figure out why a packet is not going through, and all you get is a web interface with hundreds and hundreds of rules in there, you're going to go crazy. You're going to hate that. The solution: I personally don't like routing everything through a handful of appliances, and I've been, as I said, working with Linux for a while

now, and I am really, really happy with the way the Linux firewall, netfilter, works. This is the direction that we took at AWeber. It's actually the direction they had taken before my time there, and we just continued that approach and improved it. We still have monolithic firewalls, but not everywhere, so we're going to get into that. A little bit of history first. The 70s, network design: hey, did you get a ping? Yeah, I got a ping. Everybody's happy. We have four machines, well, actually two workstations and two massive servers, and we can ping each other. That's great, it's

great. Let's move on, let's add more servers to that. Nobody really thought, hey, security. But thankfully they did something really nice, which is that they designed the protocols in a very simple way. TCP/IP are not complex protocols. There is no fanciness happening, so with a little logic you can figure out what's happening. That's one thing they did well. One thing they didn't do well is adding any sort of security to the infrastructure: no authentication, no access control, nothing. But we were fine in the 70s and 80s, because nobody was really using that stuff. I'm from France, and in France in the 80s we had something called the Minitel, which was really just a terminal connected to your

phone line, and you were connecting directly to the servers of the French public ISP at the time, France Télécom. It was just a direct connection: there was no authentication, you could just listen to the signal and inject packets into it, and it was really easy. So those were happy times, and for hackers too, very happy times. In the 90s, this is really the most classic design. When I was in college, this is probably the diagram I saw the most. You have your DMZ here, your production network, and your office, and you put firewalls in between, and you make sure that no traffic can go from the DMZ to your office, but the other way it works. That's classic, and it works.

It's actually better than what we had before: you can put web servers in this part of the infrastructure, and if they get compromised they don't compromise the entire network. That works well. You need, well, three firewalls, and very quickly people realized that three firewalls is great, but when one of them dies you lose everything. So you put six, and you put VRRP between them so they can do active-passive setups, and that kind of fixes it. That works so well that the vendors started increasing the prices of firewalls, saying, hey, you need more, you need more, you need more, and you end up having to pay 150 thousand dollars for firewalls.

Like, wait, why is it more expensive than the rest of the infrastructure? But whatever, it's needed, and that's the traditional approach. Then came the 2000s, and this is when things started to get a little bit hairy for us. We're a web company, which means our production network is a DMZ. Everything goes in there. We don't care much about this part, well, actually we do because we have customer support, but for the most part the biggest part of the infrastructure, where the money is, is here. So the approach of saying, hey, let's put something in the demilitarized zone and that's going to be safe doesn't work for us, because everything goes in there. What happens is that you get hacked on

one server. Well, that's great, but what's firewalling the inside of the DMZ? You can firewall the edge of the DMZ, but inside the DMZ you can't really do that. So that started to be a problem for us. Another problem: well, if we can't have multiple zones in there, we can start putting firewalls in there and route everything through them, like I was saying earlier. Which is great, but it makes the route opening workflow really, really complex. I worked for a financial institution back in France, and they literally had a whole web application dedicated to requesting route openings. You want application A to talk to application B? Great, here's your web form, fill it

up. It's going to go through a workflow that takes approximately four weeks, and eventually a sysadmin is going to get in there and click the three buttons needed to open the port on the destination server. Hopefully he does it right the first time. If he doesn't, then you have a second workflow, the diagnosis workflow, that makes you go back through the web interface to diagnose the problem. Horrible, horrible. And there's no rule deletion workflow. Those are really fun. Then suddenly firewalls don't work anymore and we need more. The problem here is that once everything is HTTP-based, opening port 80 is not

exactly useful, because you just accept everything through port 80 and you don't look into the traffic. So vendors said: a problem, we can sell you new boxes that are going to solve that for you. I love it. So they came up with web application firewalls. Sure, why not: parse your entire HTTP traffic in an appliance before serving it to the destination server, and verify that it fits a canvas. I have a formal security background, so I actually learned that stuff, and I loved it when I learned it. Then I started looking at the developer side of it, and I started wondering: why am I putting a security policy into an appliance when I could

just tell the developers to do those checks in their application, and remove the appliance that I have to maintain next to it? This is when I realized that these kinds of patches on the infrastructure are really, really expensive, really hard to manage and keep up to date, and they add latency to the network. They do solve problems. If you have a legacy application in there, you have no way to touch that application, and you really want to block viruses or anything else from reaching it, then yes, you need a web application firewall. If you're coding your own website, if you're coding your own application, you have control over the input validation

and things like that. Really, you're better off putting those checks in the application and not putting them into some sort of appliance. Then you get detection tools: IDS, IPS, HIDS, honeypots. Those are great, but the primary purpose of an infrastructure is to serve traffic, to actually serve production traffic for customers. If you start slowing things down with detection appliances, then, well, you kind of defeat the purpose of the infrastructure in the first place. So all of those additions to the infrastructure just made things worse, and a lot more expensive. Really, a lot more expensive. We looked at upgrading some of our firewalls not too long ago,

and the bid was around, I believe, 170 to 180 thousand dollars for a couple of boxes. At that point we sat down and wondered: why are we not doing it ourselves with Linux? And we did it ourselves with Linux. For four servers that are serving the same amount of traffic, and potentially doing gigabit traffic on each interface, that cost us twenty thousand dollars. The other advantage is that in the team, nobody is really familiar with vendor technologies, and when we have to debug one of these firewalls we end up googling commands to figure out how to do a tcpdump on one of those firewalls or routers or whatnot. If it's Linux,

it's faster for us to manage, we understand it better, we work on it every day. It's just more convenient. Another thing that happened around 2005 is that developers literally stopped trying to connect application A to application B. They offloaded that to security teams and sysadmin teams. They stopped saying, hey, I have my web app here and I want to connect to this database so I can retrieve data. They started building teams saying: I'm on the developer side of things, you take care of opening the route, I don't want to hear about it, because it's a mess and it's not exactly convenient. I don't know, I don't like that approach. 2010: congratulations, you are now routing your

entire traffic through probably three, four, five boxes. I don't want to pick on any vendor in particular, so I won't give names, but I was at a user group not too long ago, and one of the famous load balancer vendors there was explaining how they can sit in between two data centers and inspect all of the traffic going from data center A to data center B. It allows you to duplicate your data centers: you can even use the same IP ranges on both sides, and those appliances are going to understand that they are mirrors. So you can have the same IP in data center A and in data center B, and your appliances at

the top are going to take care of the routing. And I'm sitting there thinking: really, do you want to do that? How do you audit your infrastructure? You want to trace an action on your web application at some point: how do you know if this IP was in data center A or data center B? And once you end up routing everything through those few appliances, this is usually when you realize that you're hitting the 127 megabits per second limit that's written in the fine print of your contract. It's your license limit, and you're supposed to buy a new license to get 128 megabits per

second. Really. So that kind of approach makes vendors and their shareholders very, very happy. Your financial officer, not so much. There is a different approach: a service-oriented architecture. In this approach, parts of the infrastructure are really isolated from each other. You don't build one infrastructure, you build services. Those services can be really, really small or really large, it doesn't matter, but from a logical point of view they are independent services. So this is kind of the idea for a web hosting infrastructure, where you have your caching service at the top, your front-end service, and then three other API services that contain data. One of them could be accounting, another one could be password management, the third

one could be, I don't know, statistics or whatever. Those services live independently, and your architecture is based on interactions between those services, on dependencies. We know that service X needs service Y to function correctly, and service Z is accessed by the front-end service and by service Y. That actually makes the architecture a lot easier to manage for everybody, including management, because they know what is needed to actually build, say, a yearly financial report. They know that they need to connect to the accounting service and to the customer support data, that kind of stuff. Having that represented from a logical point of view, really high up, makes it easier to

implement at the infrastructure level, and we're going to see that now. Other requirements, and this is very important for us because we are a web hosting company so we really, really care about our uptime: we don't want any single point of failure. We want to optimize resource utilization: density, virtualization. We want to make sure that if we buy a server that's going to cost us 30 grand, that server is using at least fifty, sixty, seventy percent of its CPU all the time, and not just sitting there waiting for traffic. This is another reason why we don't like active-passive setups: you end up buying two boxes at 50 grand each, and the second one doesn't do

anything all year long. So you just wasted fifty thousand dollars for nothing. When you have control over your infrastructure and you can do proper load balancing, you can use both boxes at the same time, double your capacity, and you're still redundant. We also want to augment and reduce capacity rapidly. That's usually cloud-related stuff. I'm not going to get into buzzwords, but the idea is that some services are really, really small when they start. When developers come up with a new service, we're going to create a VM for them that typically has two CPUs and one gig of memory, and that's it, that's all they're going to get. Eventually, if that service ends up being used by more

services around it, we're going to grow that service, and we want to be able to do that. Every week we create new VMs for new services, and we destroy VMs for services that we thought were going to be used and are not. We grow disks as well. When you use virtualization on Linux specifically, you can store everything on LVM volumes, and you can take those volumes and grow them, multiply their size by three or four. It's really easy. So those are really the key points that are important for our infrastructure. If you've played with Amazon Web Services, they have a first concept of service-oriented security. The way they do it is that when you create your

infrastructure in EC2, you can create security groups. Security groups are basically firewall rules: open port TCP 80 or whatnot to that source address. But security groups can also open connections to each other. You can say security group Y will open access to security group X, so they can really reference each other. If you add 5, 10, 50 VMs to security group X, all of those VMs are going to inherit the security policy. So instead of managing IPs or hosts directly, you manage groups, which are a lot easier to scale up and down because you don't have to worry about

the number of instances in that group. Back to our architecture diagram. The way we can represent that architecture as a very high-level security policy is just like that: accept caching to front end on TCP 80. That just accepts all of these connections here, which means this server is going to accept everything coming from this service, and these servers here, if they have a host-based firewall, are going to be allowed to connect here. Now, you have two types of policies, essentially: inter-service policies, which go between one service and another, and intra-service policies, which go inside a service. Here we see, for example, that the API is allowed to connect to its database, and the worker is allowed to

connect to its database, but the worker here is not allowed to connect to this database here. I'll show how we can actually manage that with Chef, but this is how we describe the policy at a high level. You'll also notice there is no intermediary between the services. There is no firewall, there's nothing. The services, really the servers, the systems, are talking to each other directly, and they have their own firewall and they manage their own firewall. Well, they don't manage their own firewall, we tell them what to use. For scalability, this is what we need. One of the main problems we were trying to solve when I started this project at AWeber was

to be able to manage those firewall rules dynamically. If we add two API nodes here, then we don't want to have to go into the configuration and say, hey, open the firewall for those two as well. But we still want point-to-point security. So the first thing that's going to happen is that those two new API nodes will be allowed to connect to the database, which means the database is going to detect those two nodes and say: I open my firewall for those two. And then the API nodes here, when they configure themselves, will say that they need to connect to this database, so they will open their outbound firewall as well.
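
A minimal sketch of that regeneration step, in plain Python rather than Chef's Ruby DSL. The `Node` type, the role names, the IP addresses and port 5432 are all assumptions made for illustration; the talk does not show the real data model. The point is only that rules are derived from the current inventory, so adding nodes changes the output without anyone editing a firewall configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical inventory entry; the real data would come from the Chef server."""
    name: str
    ip: str
    role: str


def inbound_rules(inventory, source_role, port):
    """On the destination host: accept each source node individually, by IP."""
    sources = sorted(n.ip for n in inventory if n.role == source_role)
    return [f"-A INPUT -s {ip}/32 -p tcp --dport {port} -j ACCEPT" for ip in sources]


def outbound_rules(inventory, dest_role, port):
    """On the source host: allow outbound connections to each destination node."""
    dests = sorted(n.ip for n in inventory if n.role == dest_role)
    return [f"-A OUTPUT -d {ip}/32 -p tcp --dport {port} -j ACCEPT" for ip in dests]


# One database and one API node to begin with (made-up addresses).
inventory = [
    Node("db1", "10.0.0.10", "database"),
    Node("api1", "10.0.0.21", "api"),
]
before = inbound_rules(inventory, "api", 5432)

# Two API nodes are added; the database's rules are simply recomputed.
inventory += [Node("api2", "10.0.0.22", "api"), Node("api3", "10.0.0.23", "api")]
after = inbound_rules(inventory, "api", 5432)
```

Here `before` holds one rule and `after` holds three, one per API node, while `outbound_rules(inventory, "database", 5432)` gives each API node its matching outbound rule.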

The web front ends and the API nodes will also detect each other, and I'll show that Chef is really handy for that. The web front end will see that where there was only one API node before, there are now three, so it needs to open its outbound firewall to connect to those API nodes here. Same for the API nodes: they will detect the front-end nodes and open the firewall for them. All of that is really basic iptables syntax. The magic here is in how we create these iptables rules. So, the tool: Chef. If you're not already familiar with Chef and Puppet, the way those work is that Chef is a client-server architecture, and you have

a chef-client binary running on the VM, the node, the server, and you have a Chef server on the other side that is accessed by the entire infrastructure. Instead of writing Perl scripts, bash scripts or whatnot to configure a new service, a sysadmin will write a Ruby script that uses the Chef capabilities to create a new service. That means, for example, that the Ruby script would contain: install the package Apache, install the package php5, and put that configuration file in this location with those variables. This is what Chef does. So the sysadmin writes those recipes in the Chef language, and then tells the Chef server to apply those recipes to a group of servers. So

you will have a set of nodes. Say you have a bare-bones VM and you want it to be added to the Chef infrastructure. You go on that VM and say: you are now going to connect to the Chef server, request a key, and be added to the pool. And you tell the Chef server: hey, I just added that new guy here, configure it. Then they talk to each other and figure out what they need. Actually, the node is going to contact the Chef server and ask it, hey, what's my configuration? The Chef server is going to send it everything it needs, and the node will configure itself. The huge

benefit of Chef is that the Chef server has a database that contains all of the nodes in the infrastructure. All of them. And you actually have all the details of each node: from the Chef server you can know the CPU usage, the disk usage, anything. This is really, really useful for managing an infrastructure where you have hundreds of servers and you just want to run one query to get the list of servers and know what they're running, what the CPU usage is, what the disk usage is, that kind of stuff. This is an example of a search. Knife is a client for Chef. It's just an API client, because Chef is just an API, and what knife does is it

gives you access to Chef capabilities. The one we're seeing here is the search capability. Chef has its own search syntax, pretty much similar to what SQL would be, where you can search for all of the nodes that are running the role web front end and are in the environment staging. You run that search, and the Chef server gives you back an answer that contains all of the matching nodes. Here it found three: front end one, front end two and front end three. What you see here is that front end one is running the base role, which usually contains the basic stuff that you want

all the nodes to be running, and the web front-end role. So that's the basic idea: you can talk to the Chef server to manage your infrastructure. I talked about this a little bit already. The way it works is that you have a server here, server 1, 2, 3, that runs the chef-client daemon, connects to the Chef server API endpoint, authenticates itself, downloads what they call a run list, which is just a list of scripts that this client needs to run, and executes the scripts locally. The execution is actually a little bit tricky to get at first, because there are two or three different runs. Chef is not

an easy tool to use at first. It really takes a couple of months to get used to the syntax and the logic and start writing cookbooks. So the chef-client runs all of those recipes, and when it's done it tells the Chef server: alright, I'm done, I succeeded, everything worked fine. If it didn't, for whatever reason, like you're trying to install Apache but nginx is already installed, so some files are conflicting, or the port is already in use and you cannot start the daemon, then the chef-client run fails and tells the Chef server: I couldn't do it. So from a management standpoint you can look at the

status of your infrastructure and see which nodes are configured properly and which ones are not. Another huge advantage is that your servers never lose their configuration, because Chef reruns every 30 minutes on each node, and if a file has changed, the chef-client is going to recreate that file, reprovision it with the values that you want in there. You cannot modify files locally, you need to modify them in the Chef infrastructure. So how does that help us build firewalls? It helps us because we have a way to search for anything we want in the infrastructure, and we can use that to build policies that say service A needs to connect to service B, this

is how we want the servers to communicate. Now we are going to extract the firewall rules from that. So the concept is: automated generation, and deletion of course, that's important too. If a node disappears, then the next time Chef reruns on another server, it's going to see that the node is not there anymore, so it's going to remove the firewall rule. One-to-one rules only: we don't open ranges, we open the IP of server A on server B. We don't say, hey, this /16 can connect to the database. No, we say: here's a list of 50 IPs that can connect to the database. User-specific outbound firewall: that's a capability of netfilter where, when

you do outbound firewalling on the host, you can check which user is actually trying to create that connection, and depending on the UID you allow or block the connection. Generic rules: that's important. We have services that are homogeneous, and we just replicate them for different purposes, but under the hood it really is the same thing. So instead of writing custom rules for every service, we build a set of generic rules and apply them to different services. The technology: it's all netfilter, which goes directly into the Linux kernel, so it's reassuring and secure. Reload at every run: every time Chef runs, even if nothing has changed, it will reload the whole set of firewall rules.
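
The "regenerate everything, one rule per IP" idea can be sketched as follows. This is an illustration in plain Python, not the cookbook's actual template: the function name, the addresses, the port and the default policies are assumptions. It builds a complete rule set in the text format that `iptables-restore` reads, so a node that has disappeared from the input simply has no rule in the next run.

```python
def render_ruleset(allowed_sources, port):
    """Render a full filter table with one /32 accept rule per allowed source IP."""
    lines = [
        "*filter",
        ":INPUT DROP [0:0]",
        ":FORWARD DROP [0:0]",
        # Outbound is left open here only to keep the sketch short.
        ":OUTPUT ACCEPT [0:0]",
        "-A INPUT -i lo -j ACCEPT",
        # Stateful part: replies to established connections are accepted.
        "-A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    for ip in sorted(set(allowed_sources)):
        lines.append(f"-A INPUT -s {ip}/32 -p tcp --dport {port} -j ACCEPT")
    lines.append("COMMIT")
    return "\n".join(lines) + "\n"


# First run: three API nodes are known (made-up addresses).
run1 = render_ruleset(["10.0.0.21", "10.0.0.22", "10.0.0.23"], 5432)

# Later run: one node is gone, so its rule is never written. No deletion logic needed.
run2 = render_ruleset(["10.0.0.21", "10.0.0.22"], 5432)
```

Because each run produces the whole rule set from scratch, deletion is a side effect of generation rather than a separate workflow.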

Now you're probably going to say, well, that's too bad, you're just using resources for nothing. But if you've played with iptables-restore, you've probably noticed that you can restore an entire rule set in not even half a second, even if you have 20, 30, 40 thousand rules in your rule set. It's really efficient. The way it does that is it loads the new rule set into memory and verifies that the syntax is clean, and if it is clean and loaded properly, it moves the pointer from the old rule set to the new one directly in the kernel. Some of the netfilter features we use: the fast reload, I just talked about that; the owner match, that's for the outbound firewall; conntrack, that's the

stateful firewall. Some stuff we're working on right now is support for NAT; support for time, so if you want to block access to your admin page at 2am, you can do that with netfilter; and support for ipset, which is useful not if you have 20 or 30 IPs, but if you have 200 or 300 IPs and you don't want to be checking those IPs one by one. You can instead use an ipset, which is a hash that contains your IPs, and you check your rule against that hash. It's a pretty nifty feature of netfilter. The syntax. Alright, this is a basic syntax that doesn't do any search, so it's not really dynamic, but

only part of our infrastructure is in Chef right now, so we still have servers that are not in Chef, and those servers cannot be searched for, so we need to list them manually. The syntax shown is really a basic block that will run on the RabbitMQ server and allow connections on port 5672, the RabbitMQ port, from those three servers. Very basic, no fanciness yet. Same thing here, except we want to open access to the DB server from Jenkins, and we want to open that access only in the staging environment, not in production. We can do that by creating a generic rule that technically gets applied to both the staging and production servers, but with

a parameter like this one: environment staging. Then when Chef runs on that node, it detects whether it's running in production or in staging, and if it's running in production it discards the rule. Now, with searches. This works really, really well when you have a client-server infrastructure. If you use something like OSSEC, or Nagios, anything where you have a lot of clients connecting to a server, you probably end up writing a general rule on the server that says all of the nodes coming from that IP range are allowed to connect to me. And in that range you probably have 30 to 40 percent of IPs

that are not Nagios clients or not OSSEC clients. You don't have a way to differentiate them, except doing it manually, which is a huge pain in the butt. The way we can do that here with Chef is to say: this is the rule that's running on the OSSEC server, and this rule opens port 1514 to the nodes that answer the search for role OSSEC agent. When Chef runs on the OSSEC server, it executes and resolves that search, obtains the list of IPs of all of the OSSEC agents in the infrastructure, and adds these IPs to the firewall. Only those IPs, not

the other nodes. And it's dynamic: if a node disappears, the IP is removed. It's the same for the outbound firewall. The OSSEC server is allowed to connect to the agents, so we have the same search reversed: for role OSSEC agent, the server lists all of the OSSEC agents and opens the outbound firewall to connect to these agents, and only the OSSEC user is allowed to do that. Another way of doing this: this is an application to back-end database setup, and here the sources are Node.js application nodes, worker nodes and API nodes. That search is going to resolve to pretty much anything that's running an application, and allows those nodes to connect to the DB server or

whatever database that is. The notion of service tag. OK, so say we're in a service-oriented infrastructure and we really want to open rules based on services. The way we're going to do that is that Chef has a way to tag nodes, to tag VMs. The configuration of the node is going to say: your tag is service X, your tag is service Y. Then in the search you can search for the tags. You can say: my database here in service X will accept connections from any worker node or any API node that is in the same service. So it's going to search here for all of the nodes that share the same tag, and

that same-tag keyword is going to be expanded in the Chef run. The DB server, or whatever database that is here, is going to look up its own tag and see: I am tagged with tag X, I want to list all of the other nodes that have tag X as well, and I also want those nodes to be workers or API nodes. It's going to obtain a list of IPs, and only those IPs are going to be allowed to connect to it. Same thing on service Y here: it's going to search for the nodes that have service tag Y, and only open the firewall for those nodes. So we

have a notion of service that is logical. There's no physical separation between the services. They could be running on the same hypervisors, they could be running in the same network, but their host firewall is going to constrain them to their own service.
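
The same-tag lookup can be sketched in a few lines. This is plain Python over an in-memory list standing in for the Chef server's node index; the dictionary layout, node names, tags and addresses are hypothetical, and the real lookup would be a Chef search rather than a list comprehension.

```python
def same_service_sources(me, nodes, roles):
    """IPs of other nodes that share one of my service tags and have one of the given roles."""
    return sorted(
        n["ip"]
        for n in nodes
        if n["name"] != me["name"]
        and set(n["tags"]) & set(me["tags"])    # same service tag
        and set(n["roles"]) & set(roles)        # worker or API node
    )


# Two logical services sharing one network (made-up data).
nodes = [
    {"name": "db-x", "ip": "10.1.0.10", "roles": ["database"], "tags": ["service_x"]},
    {"name": "api-x", "ip": "10.1.0.11", "roles": ["api"], "tags": ["service_x"]},
    {"name": "worker-x", "ip": "10.1.0.12", "roles": ["worker"], "tags": ["service_x"]},
    {"name": "db-y", "ip": "10.2.0.10", "roles": ["database"], "tags": ["service_y"]},
    {"name": "api-y", "ip": "10.2.0.11", "roles": ["api"], "tags": ["service_y"]},
]

db_x_sources = same_service_sources(nodes[0], nodes, ["api", "worker"])
db_y_sources = same_service_sources(nodes[3], nodes, ["api", "worker"])
```

With this data, the database in service X only learns the addresses of the service X API and worker nodes, and the database in service Y only learns the service Y API node, even though all of them could sit in the same subnet.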

User-specific outbound rules. I mentioned that already. This is a feature of netfilter that I always wanted to use. Outbound firewalls: I've never been able to use them, because usually it's just impossible to list all of the outbound rules that you need to set on a server. It just doesn't work. Once you think you're done, you get a developer saying, hey, I need to connect to that server over there but I'm blocked. So you end up trying to create the rule, and you have to do it for every single server manually. It never happens. With Chef, since everything is dynamic, you can just define generic rules and let them populate themselves, and then the

firewall will check the syntax and say: hey, you want the root user to be able to connect to the rsyslog server? That's great. Alright, so I'm going to create a rule that says: if the owner of the socket is UID 0, root, then send it to the root chain. Inside the root chain you have a list of rules that are allowed for root only. Same thing with Nagios: the Nagios user is going to be sent to the Nagios chain, and in the Nagios chain we're going to say, hey, you're allowed to connect to the Nagios server, but nowhere else. With that kind of rule you can really lock down an environment.
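
A sketch of how such per-user chains could be laid out, again in plain Python producing `iptables-restore` style lines. The chain naming scheme (`out_<user>`), the addresses and the ports are assumptions for illustration; only the netfilter owner match itself (`-m owner --uid-owner`) is the feature described in the talk.

```python
def owner_chain_rules(policy):
    """policy maps a local user name to the (dest_ip, port) pairs that user may reach."""
    chains, jumps, rules = [], [], []
    for user in sorted(policy):
        chain = f"out_{user}"
        # Declare the per-user chain.
        chains.append(f":{chain} - [0:0]")
        # Outbound packets whose socket is owned by this user go to that chain.
        jumps.append(f"-A OUTPUT -m owner --uid-owner {user} -j {chain}")
        for ip, port in policy[user]:
            rules.append(f"-A {chain} -d {ip}/32 -p tcp --dport {port} -j ACCEPT")
        # Anything else this user tries to reach is dropped.
        rules.append(f"-A {chain} -j DROP")
    return chains + jumps + rules


# Made-up destinations: a syslog server for root, a monitoring server for nagios.
lines = owner_chain_rules({
    "root": [("10.0.0.5", 514)],
    "nagios": [("10.0.0.6", 5667)],
})
```

Each user ends up with a short allow list followed by a drop, so a process running as `nagios` can reach the monitoring server and nothing else.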

If somebody gets on the system, chances are that hacker is going to be running as the application user, and is going to be really, really limited in where they can connect to. You're really blocking the outbound channels here. I'm not going to get into this too much: there are other ways of creating rules dynamically, and of letting a cookbook create its own rules. Once you get into Chef you realize there are a lot of possibilities, and exploiting all of them is really interesting. Crossing environments. Typically, when we provision a service, we provision two versions of the service, one in staging and one in production, and they are isolated. The database in staging for service X cannot connect to the database in staging for service Y. But sometimes

you want to open those rules. In our security model, staging is way less secure than production, so we definitely don't want systems in staging to connect to production. But in some special cases we want the workers in production to be able to connect to the database in staging. We accept that in cases where developers want to be able to run tests against staging, so they need to have some sort of production data in staging. The way we do that is there's a worker running in production here that pulls data from its own database, analyzes it, and pushes it into the database in staging. And the firewall

rule is once again generated for that: it tells the worker in production that it's allowed to connect to staging, and it tells the database in staging that it's allowed to receive connections from the worker in production. Some advanced features; I'm going to skip over that. Short name: that's once again some more fanciness. When you have a MongoDB cluster, for example, you can detect the members of a cluster by looking at their short name, and the firewall, because it knows Chef and all of the attributes, can list all of the nodes that share the same short name and automatically authorize the replication link between them. Some weird rules: we left the support open for custom rules because we can't cover all of the use

cases, so you can predefine rules and just put them in there, and they're going to be copied verbatim inside the ruleset. I have about four minutes left, so some of the limitations of the system. Of course, when you use a provisioning system you strongly depend on the security of that server. Whether you use Puppet or Chef, you depend on the security of those servers, and in our case the security of the Chef server is really something we're taking care of very seriously. The protocol that connects Chef clients to the Chef server is already relatively secure, so it's really difficult for just anyone to talk to the Chef server, but it's a limitation of the system, and

nodes can modify their own Chef attributes, so you need to be careful when opening rules based on what another node might be claiming to have. AFW works really well with homogeneous infrastructure. If you start having very different systems all over the place, then you will need to have custom rules all over the place, and it still makes your life easier, but it's just not as fancy. And we rely heavily on the searches, and that puts some stress on your infrastructure. Those are some of the limitations. All right, so that's a quick overview of how we manage our firewall infrastructure. Questions? Question here. If you're looking to ensure consistent configuration, produced for an audit or something like

that, does it have the ability to do that? We have the ability to, at any given time, look at what firewall rules are running on what server, download those, and parse them. We know that a server is running just those rules because they are reloaded every 30 minutes. So we can say, here is a configuration we have in Chef, and we are one hundred percent, well, 99.9 percent sure that all of the servers are running this within a window of 30 minutes. It makes auditing a lot easier, because you can search for anything from the Chef server instead of having to go to every single server. Yep? How do you manage

application-layer authentication for MongoDB or the other services you mentioned, like databases and their passwords? We have two ways of doing that. At the firewall level we consider that there is no authentication, so we kind of abstract everything that sits above layer four, and we assume that whatever authentication is going to be put in place is going to be a bonus. The firewall needs to be strict enough to protect the database if there is no authentication. Truth is, MongoDB has a very interesting authentication scheme, and when we started using it for some applications and poking at it, we ran into the problem of really not trusting its authentication scheme. So

the firewall itself doesn't worry too much about the authentication. It just assumes that there's nothing else in place; it's the last resort. Then inside Chef we have other cookbooks, other capabilities in Chef, that let us configure each service differently based on the authentication it's going to support, and if a system needs to authenticate with another system, then we're going to use Chef to provision the credentials on system A and system B. Now of course the problem is that Chef is just a code repository, so you can't have those credentials in clear text in your code repository while they're being pushed to the destination server. So we have another cookbook that I wrote, that's called key master, that does

cryptography to protect those, and there's a key distribution server and all that stuff, but it's a whole different environment. They're not linked together; the firewall is really independent. Any other questions? All right, well, right on time, I believe. So thanks a lot, and check it out on GitHub.
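
Editor's sketch of the user-specific outbound rules described in the talk: packets are matched on the UID that owns the socket, sent to a per-user chain, and that chain lists the only destinations the user may reach. This is not the cookbook's actual code. The chain naming (`out_<user>`), the `OutboundRule` type, the IP addresses, the ports, and the Nagios UID are all assumptions made for illustration, and in the real system the rule data would come from Chef searches rather than a hard-coded list. The only Netfilter feature relied on is the `owner` match (`-m owner --uid-owner`), which applies to locally generated traffic in the OUTPUT chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutboundRule:
    user: str          # account name, used here only to name the chain
    uid: int           # numeric UID matched by the owner module
    dest: str          # destination IP address
    dport: int         # destination port
    proto: str = "tcp"


def render_outbound(rules):
    """Turn per-user rules into iptables argument lines, one command per line."""
    by_user = {}
    for rule in rules:
        by_user.setdefault((rule.user, rule.uid), []).append(rule)

    lines = []
    for (user, uid), user_rules in sorted(by_user.items()):
        chain = f"out_{user}"  # hypothetical naming scheme
        lines.append(f"-N {chain}")
        # Packets from sockets owned by this UID jump to the user's chain.
        lines.append(f"-A OUTPUT -m owner --uid-owner {uid} -j {chain}")
        for r in user_rules:
            lines.append(
                f"-A {chain} -p {r.proto} -d {r.dest} --dport {r.dport} -j ACCEPT"
            )
        # Assumption: everything else from this user is dropped ("nowhere else").
        lines.append(f"-A {chain} -j DROP")
    return lines


rules = [
    OutboundRule("root", 0, "10.0.0.5", 514),        # hypothetical rsyslog server
    OutboundRule("nagios", 108, "10.0.0.9", 5667),   # hypothetical Nagios server
]
for line in render_outbound(rules):
    print(line)
```

Each printed line is the argument list of one `iptables` command. Ending every user chain with a drop is what limits an attacker running as the application user to the few destinations that user legitimately needs.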