
Scalability: Not as Easy as it SIEMs - Keith Kraus & grecs

BSides Las Vegas · 22:37 · 149 views · Published 2016-08
About this talk
Scalability: Not as Easy as it SIEMs - Keith Kraus & grecs | Proving Ground | BSidesLV 2016 - Tuscany Hotel - Aug 03, 2016
Transcript [en]

All right, let's get started. Welcome, everybody, to Proving Ground. First I want to thank our sponsors: we have VerSprite (I actually don't know how to pronounce that), Tenable, Amazon, and Source of Knowledge, and all of them are located here in the chillout area, so definitely check out their booths. Please keep your phones on silent and be respectful of the speaker, and also please fill out the forms that are online for feedback on the talk. This is Keith Kraus; he's an associate principal at Accenture, and he's going to be talking on scalability, not as simple as it SIEMs. Welcome.

Thank you. Got to start with a pun. All right, so hi everyone, I'm Keith Kraus.

I'm an associate principal engineer at the Accenture security lab, and just to give you a little bit of background, Accenture Labs is the most forward-looking part of Accenture: we're looking at things three to five years out that are going to cause disruption in an industry. My talk today is focused on some of the work that I've been doing at our security lab in the Washington, DC area. So, security data science is hard. Currently it's not being done well at the scale and the volume of data and alerts that analysts are receiving every day. Enterprises are constantly changing, with more devices being added at a faster pace than ever before.

There are more bring-your-own-device programs, and new technologies are constantly being added into enterprise IT. More devices, more people, distributed workforces: it's creating a really difficult security problem of ensuring that people aren't using your network maliciously, and it's easy to see that we're struggling, in a day and age when privacy and security are on everyone's minds. The Ponemon Institute put out a report recently that says for most financial services institutions it takes an average of 98 days to detect an advanced persistent threat or a zero-day malware. That's 98 days where someone malicious could be stealing your identity or stealing your financial information, and on the other hand this takes up to seven months for retailers.

Wired magazine put out an article recently about how the CIA handles cybersecurity differently than most enterprises. In addition to just trying to encrypt everything and protect the perimeter, the CIA is starting to develop systems that can prioritize events and problems and create intelligent ways to respond to threats. They said the challenge lies in efficiently scaling these technologies for practical deployment and making them reliable for large networks. My work has been focused on building a platform that can act as the base layer for such systems. Enterprise security has a data problem: in a modern Fortune 500 enterprise, a terabyte and a half of data consisting of 250 million events daily is the norm, and they are not equipped to handle it.

For most enterprises the starting point for tackling this problem is SIEM, but SIEM is only a starting point. Endless time and money have been spent by enterprises in refining their SIEM solutions, but SIEMs weren't originally designed to handle the amount of data that we're producing today, and this shows mostly through two main areas, which are the storage retention and the computational power of SIEMs. Most SIEMs are only able to keep a 30-day window of data available online, and data beyond this 30-day window typically has to be archived. This archived data is typically stored inefficiently, where you're either sacrificing storage space or speed in order to load it back into the environment to then query it.

Additionally, working on only a 30-day window vastly reduces your effectiveness in detecting and reacting to these advanced persistent threats. On top of this, asking simple questions such as "which machines did I see this executable on in the past few months?" shouldn't take hours or even days to run, and asking the wrong question, or just mistyping something, shouldn't take hours to then return the wrong results. What I'm trying to say is that the modern SIEM is a great pane of glass to see what's happening now, but it's very ineffective at answering more advanced questions. Enterprises need to move beyond these basic questions and rules and start building models and analytics to detect these more advanced threats.

In order to answer these complex and demanding questions and start building these models, a big-data-driven solution is needed. This solution needs to enable both analytics and models at a scale that can keep up with a modern enterprise's data volume and velocity. When an indicator is found, an analyst needs the ability to immediately pivot on data, or as we say, pull the thread, to follow a threat from its infancy all the way to its exfiltration. Which brings me to my research hypothesis: cybersecurity has a big data problem. The volume and velocity of data from devices requires a new approach that combines all data sources to allow for more intelligent and advanced cybersecurity hunting, through analytics and exploration at scale across enterprise data.

Along with this, open-source big data technologies reduce cost and act as the building blocks of a scalable platform with the speed necessary for enterprises to overcome these challenges. Combined with long-term historical data, enterprises will be able to reduce noise and empower analysts to effectively detect threats. Before I dive into a big data architecture, I want to make a point very clear: I'm not saying to replace or abandon your SIEM. CISOs would not be very happy if we told them to just get rid of the software that they've invested thousands of hours and millions of dollars into.

This time and money that's been invested by enterprises into fine-tuning their SIEMs makes them great data sources, but siloed data sources. If you're an analyst trying to hunt or follow the thread of a threat, you're going to need additional data sources, such as your vulnerability scanner, to see if a server had any known vulnerabilities, or your threat feed, to see if it accessed any known malicious entities. It's a huge waste of time for analysts to be jumping from the SIEM to their vulnerability scanner, back to the SIEM, to their threat feed, back to the SIEM. What they need is a single solution that gives them a clear picture.

Diving into the architecture now, the first part is Kafka. It's an ingestion engine, and by using Kafka we can ingest a number of different data sources; it allows us to combine these diverse and typically isolated data sources and store them in a unified place in HDFS. Using HDFS as a data lake is common in the big data space, but for cybersecurity what it does is offer us a way to break down these data silos, bring all these diverse data sources together, and provide a complete picture for a security analyst. In order for the security analyst to actually ask questions and hunt, they need a query layer, and what this architecture gives you is Spark.

Spark provides an easy way for analysts to very rapidly ask questions using industry-known languages such as SQL or Python. Kafka, HDFS, Parquet, and Spark are all open-source Apache projects, so there are no licensing fees, there's huge community support for them, and there are typically rapid updates that generally yield good performance increases and new useful features. Typically, analysts would interact with a platform like this through an interface such as a Jupyter notebook or a tool using Graphistry, but I'm not going to dive into those in this talk.
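[Editor's note: to make the query layer concrete, here is a minimal sketch of how an analyst might interact with this kind of stack from a notebook. The HDFS path, column names, and the Spark 2.x-style SparkSession API are illustrative assumptions, not the exact setup described in the talk.]

# Hypothetical sketch: querying the HDFS data lake from a notebook with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analyst-notebook").getOrCreate()

# Assumed path and schema: Parquet files that Kafka-ingested SIEM events were landed into.
events = spark.read.parquet("hdfs:///data/siem/events")

# Expose the data to SQL so analysts can work in whichever language they prefer.
events.createOrReplaceTempView("events")
suspicious = spark.sql("""
    SELECT src_ip, dst_ip, COUNT(*) AS hits
    FROM events
    WHERE event_type = 'connection_denied'
    GROUP BY src_ip, dst_ip
    ORDER BY hits DESC
    LIMIT 20
""")
suspicious.show()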

All right, so Kafka is a distributed publish-subscribe message queue that's extremely fast, scalable, and durable. At a lower level, Kafka is composed of producers, consumers, and brokers. Producers send messages to the Kafka cluster; consumers consume messages from the Kafka cluster. Producers and consumers write and read messages to and from different feeds of data called topics, and brokers are internal to the Kafka cluster and manage these topics: brokers receive messages from the producers and then send the messages to consumers. Another reason Kafka is important is that messages are ordered as they are sent, so as you get event data in, it feeds event data out in the same order, and you don't end up with things out of order.
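[Editor's note: a rough illustration of the producer/consumer model. The topic name, broker address, and the choice of the kafka-python client are assumptions made for this sketch, not details from the talk.]

# Hypothetical sketch of Kafka's publish/subscribe model using the kafka-python client.
import json
from kafka import KafkaProducer, KafkaConsumer

# A producer publishes events (for example, forwarded SIEM records) onto a topic.
producer = KafkaProducer(
    bootstrap_servers="broker1:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("siem-events", {"src_ip": "10.0.0.5", "event_type": "login_failure"})
producer.flush()

# A consumer subscribes to the same topic; brokers deliver messages in order within each partition.
consumer = KafkaConsumer(
    "siem-events",
    bootstrap_servers="broker1:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)   # downstream processing would land these into HDFS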

Messages are also delivered with at-least-once reliability, and Kafka allows for replication, so it protects you against any kind of failure. I touched on it in the previous slide, but Kafka gives us the ability to ingest a multitude of diverse data sources, such as your SIEM, your vulnerability scanner, your threat feed, or any other kind of data source that you want, and then combine them into a centralized location. Most importantly, it does this at the speed and reliability necessary for enterprises: in a large enterprise the SIEM alone can easily generate more than a billion events per day at peak volume.

But Kafka can easily handle this. LinkedIn actually recently ran extensive benchmarks on Kafka, and the results showed that on very commodity-level hardware Kafka is lightweight enough that the limiting factor is almost always disk I/O or network I/O, and that Kafka can scale to the extremes: at LinkedIn they're using a Kafka cluster in production that handles 800 billion messages a day, with over 175 terabytes of data moving through it. On the storage side, HDFS is very common in the big data space. HDFS is the Hadoop Distributed File System, a distributed file system that provides very scalable and reliable data storage using commodity hardware.

It's the data storage backbone of nearly all big data technologies, and all big data technologies integrate with it. Because of that integration, it allows us to exploit certain things that typical distributed processes don't allow, like data locality, so you're minimizing your network transfer to squeeze as much performance as possible out of a distributed system. On HDFS any file format can be used: typically you see things like CSV or JSON being used, and on the big data side there are other things like sequence files, but there's a newer file format called Parquet that has shown very promising results, especially for security data.
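[Editor's note: one possible way the ingested events could be landed into the data lake. The paths, the staging-as-JSON step, and the partition column are illustrative assumptions, not the lab's actual pipeline.]

# Hypothetical sketch: landing a batch of ingested events into HDFS as Parquet with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-batch").getOrCreate()

# Assume the raw events were staged as JSON, for example by a Kafka consumer.
raw = spark.read.json("hdfs:///staging/siem/2016-08-03/*.json")

# Partitioning by day keeps later time-bounded queries reading only the files they need.
(raw.write
    .mode("append")
    .partitionBy("event_date")
    .parquet("hdfs:///data/siem/events"))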

Parquet is a columnar storage format that was built from the ground up to support very efficient compression and encoding schemes, and what this means is that the data is stored much more efficiently and can be read much faster. In a typical table you have row-based storage, where data is stored A1, B1, C1, A2, B2, C2, etc. The problem with this is that A1, B1, and C1 will typically be very different kinds of data, so you can't really encode it intelligently, and that hurts your compression. Since Parquet is a columnar storage format, data is stored A1, A2, A3, A4, A5, B1, B2, etc., so the data is grouped by column.

Typically, if you're storing IP addresses, those IP addresses are going to be similar to each other, so it's much easier to encode and compress them. Allstate, the insurance company, recently did benchmarks on a couple of different data sets. They had one data set with three columns, and they found that counting messages using Parquet was about 10 times faster than using CSV, and the file size was actually five and a half times smaller. They ran the same benchmark against a data set with 103 columns, and counting messages was 25 times faster and the file size was about 41 times smaller. At the Accenture lab I actually ran a similar benchmark, except using SIEM data.

In a data set with more than 450 columns, I found that counting messages was about 30 times faster using Parquet than CSV, and the file size was about 45 times smaller. While the performance is great, another feature Parquet gives you, because it's a columnar store, is that you can actually add columns to existing data stores. Say you have your threat feed and then, for example, they add some new metadata: you can just add the columns as you go, and it maintains compatibility with your older data.
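[Editor's note: a small, hedged way to reproduce this kind of CSV-versus-Parquet comparison on your own data. The columns and row counts below are made up, and the exact speedups will depend on the data, but the shape of the test is the same.]

# Hypothetical sketch: comparing CSV and Parquet on size and scan speed with pandas/pyarrow.
import os
import time
import pandas as pd

# Fabricated example data standing in for wide, repetitive security events.
df = pd.DataFrame({
    "src_ip": ["10.0.0.%d" % (i % 50) for i in range(1_000_000)],
    "event_type": ["login_failure" if i % 10 == 0 else "session_open" for i in range(1_000_000)],
    "bytes_out": [i % 1500 for i in range(1_000_000)],
})

df.to_csv("events.csv", index=False)
df.to_parquet("events.parquet")          # requires pyarrow (or fastparquet)

print("CSV size:    ", os.path.getsize("events.csv"))
print("Parquet size:", os.path.getsize("events.parquet"))

start = time.perf_counter()
pd.read_csv("events.csv", usecols=["event_type"])["event_type"].count()
print("CSV count:     %.2fs" % (time.perf_counter() - start))

start = time.perf_counter()
pd.read_parquet("events.parquet", columns=["event_type"])["event_type"].count()
print("Parquet count: %.2fs" % (time.perf_counter() - start))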

Spark is a general cluster computing framework that's designed around providing in-memory computation for performance. Spark follows a master-worker model, where a master node issues tasks to worker nodes and the worker nodes deliver the results back to the master node. Under the hood, Spark uses a lazy evaluation model, which allows it to optimize things like data locality and predicate pushdown, squeezing more performance out of the resources that you have. Within the context of a cyber big data platform, Spark would be used to ask questions of your data. As you can see on the slide, Spark has very easy programming interfaces that are well known in the industry: you can see an example Python query, or you can do the same query in Spark SQL.
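[Editor's note: a minimal sketch of what "the same query two ways" might look like, since the slide itself isn't reproduced here. The table and column names are assumptions, and the session setup mirrors the earlier sketch. The point is that nothing executes until an action such as count() is called, which is what lets Spark plan lazily and push filters down.]

# Hypothetical sketch: the same question asked via the DataFrame API and via Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events = spark.read.parquet("hdfs:///data/siem/events")

# DataFrame API (Python): which hosts ran a given executable over the past few months?
hits_df = (events
           .filter(events.process_name == "evil.exe")
           .select("hostname")
           .distinct())

# Equivalent Spark SQL on the same data.
events.createOrReplaceTempView("events")
hits_sql = spark.sql(
    "SELECT DISTINCT hostname FROM events WHERE process_name = 'evil.exe'"
)

# Both plans are lazy; only these actions trigger the distributed work.
print(hits_df.count(), hits_sql.count())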

Basically, you can put this in front of any security analyst, and with minimal ramp-up time they can get running on it. Spark also supports programming in Scala and Java, which give access to a few more advanced features and a very small performance bump. In addition to basic features like filtering, grouping, and transforming data, Spark gives us a platform to build scalable models and analytics, and Spark also has a few libraries: a machine learning library called Spark MLlib, which is a full and somewhat mature machine learning library; a stream processing library called Spark Streaming; and a graph processing library called Spark GraphX.
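[Editor's note: as a taste of what "models at scale" on top of this stack could look like, here is a hedged MLlib sketch that clusters hosts by simple traffic features. The input path and feature columns are invented for illustration; this is not a model described in the talk.]

# Hypothetical sketch: clustering hosts by behavior with Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()

# Assume per-host daily aggregates were already computed from the raw event data.
host_stats = spark.read.parquet("hdfs:///data/siem/host_daily_stats")

assembler = VectorAssembler(
    inputCols=["event_count", "bytes_out", "distinct_destinations"],
    outputCol="features",
)
features = assembler.transform(host_stats)

model = KMeans(k=5, seed=42).fit(features)
clustered = model.transform(features)   # adds a 'prediction' column: the cluster per host

# Hosts in small or unusual clusters become candidates for an analyst to pull the thread on.
clustered.groupBy("prediction").count().show()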

Using this architecture, I actually ran a benchmark. Following the Kafka, HDFS, Parquet, and Spark architecture I previously outlined, a day of data and a week of data were tested. For the big data solution we used a 10-node cluster that was under $50,000 in hardware, and we tested this against the production SIEM instance being used at a Fortune 500 enterprise. The data set consisted of more than 450 columns and approximately 250 million events per day. For the big data solution, the data was stored in Parquet on HDFS, all the queries were run using Spark version 1.6, and the big data solution considered the query finished when the data was returned to the master and converted into a pandas DataFrame.

For those of you who aren't familiar, a pandas DataFrame is just a data structure within pandas, a very commonly used Python data science library. Here are some of the results from the benchmark, and as you can see, the big data solution is magnitudes faster than the SIEM, even when loading all 450 columns while the SIEM is typically only loading 10 or so. The SIEM solution was actually unable to finish running a query over a week of data in most cases, and in talking with the client team, they said that basically they would have had to split the query into multiple time units or allocate more resources to run it, and they just couldn't do that because it's a production SIEM.

They did estimate that the queries would take more than 20 hours each to complete. I just want to make another note: a lot of the time in the big data solution's numbers was actually spent pulling the data into memory on the master node and converting it into a pandas DataFrame. If we had instead written the results out in parallel to something like CSV or even Parquet files, the times would be even lower. This is important because, looking past a day or a week of data, this solution easily scales into months.
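[Editor's note: to make the methodology concrete, here is a hedged sketch of the two ways a query's finish line could be measured, collecting to the driver versus writing in parallel. The query, paths, and session setup are placeholders, not the benchmark's actual code.]

# Hypothetical sketch: timing a query that collects to the master versus one that writes in parallel.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events = spark.read.parquet("hdfs:///data/siem/events")
result = events.filter(events.event_type == "login_failure").groupBy("username").count()

# Variant 1: pull everything back to the master and build a pandas DataFrame (as in the benchmark).
start = time.perf_counter()
pdf = result.toPandas()
print("collect to pandas: %.1fs" % (time.perf_counter() - start))

# Variant 2: let the workers write the result out in parallel, never funneling it through the driver.
start = time.perf_counter()
result.write.mode("overwrite").parquet("hdfs:///results/failed_logins_by_user")
print("parallel write:    %.1fs" % (time.perf_counter() - start))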

At our lab we've easily queried more than six months of data at a time and been able to run the queries in well under an hour. And since this was done on Spark 1.6, and Spark 2.0 was just released showing another magnitude of performance increase, we can expect these times to keep improving. The new speed and flexibility offered by a big data solution allows new use cases where SIEM struggles, and most of these use cases break out of the 30-day window and rely on the ability to look at even 12-plus months of data. For example, full-text search over 12 months of data, so you can proactively search for an executable or a username over a year of data.

Compliance requests: this is something the client team specifically told us about, that finding all failed logins for a user over 3, 6, or 12 months of data is just very difficult in current SIEM solutions. Then we can get into some more advanced things: you can baseline log flows, so you can detect when a device is emitting more or less log data than usual, and that lets you find out whether there's a misconfiguration, whether the device has been brought off the network, or something similar. It also lets you baseline what the normal events you see from a device are.

So if we normally just see a ton of "session open, session closed" from a firewall, and then all of a sudden we're seeing something different, why are we seeing something different? Has something malicious happened? And most importantly, this will let you accelerate your new SIEM rules and filters: you can proactively test a new filter or rule that you're looking to put in place in your SIEM, and run it over 12 months of data very quickly. So it gives you a rapid, interactive prototyping environment that SIEM currently doesn't give you.
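[Editor's note: two of these use cases sketched out in hedged form. The column names, the username, the date ranges, and the three-standard-deviation threshold are assumptions chosen for illustration, not queries from the talk.]

# Hypothetical sketch: a 12-month compliance query and a per-device log-volume baseline.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.parquet("hdfs:///data/siem/events")

# Compliance request: all failed logins for one user over the past 12 months.
failed = events.filter((F.col("event_type") == "login_failure") &
                       (F.col("username") == "jdoe") &
                       (F.col("event_date") >= "2015-08-01"))
failed.select("event_date", "hostname", "src_ip").show(50)

# Baseline log flow: daily event counts per device, to spot devices going quiet or getting noisy.
daily = events.groupBy("device", "event_date").count()
stats = daily.groupBy("device").agg(F.avg("count").alias("avg_per_day"),
                                    F.stddev("count").alias("std_per_day"))
outliers = (daily.join(stats, "device")
                 .filter(F.abs(F.col("count") - F.col("avg_per_day")) > 3 * F.col("std_per_day")))
outliers.show()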

That's my talk. Any questions?

Okay, so you mentioned earlier in the talk that you shouldn't throw away your SIEM to institute something like this. What if you didn't have a SIEM already? Should that person forgo SIEM and just do big data? Should you do SIEM first and then big data, or do them at the same time?

It honestly depends on your environment. What current SIEMs do really well is correlation and some normalization, and at this stage we're not trying to recreate that. Why try to recreate something that already exists in most of these big enterprises? Just use that correlation and normalization.

Use what they've already done and put it in. If you don't have a SIEM, it's your prerogative whether you want to implement a SIEM or whether you want to build that correlation and normalization into a big data solution. One thing I will say is there's a project coming out from Hortonworks and Cisco called Apache Metron, which is a continuation of Cisco's OpenSOC; it's essentially a SIEM built on top of some of these big data technologies, and it's showing a lot of promise so far.

You mentioned Kafka. I'd just like to dig into the problem I always hear about with it: you mentioned it delivers at least once, so how do you see people dealing with the more-than-once case, where I got an event and then I got it again?

Yes, that is something that you will have to handle within your processing pipeline, unfortunately. There has been a lot of work, for example on the Spark Streaming side, on how to handle multiple message delivery from Kafka, but you would much rather have more than once than not at all (one common de-duplication pattern is sketched below).

How many days' worth of data do you keep, then? Like a year?

Currently I believe we're close to 18 months of data, and it's honestly limited by the size of your cluster. It's all commodity hardware with normal enterprise disks, and HDFS gives you that layer to just use as much disk space as you have.
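[Editor's note: going back to the at-least-once question, here is a hedged sketch of one common downstream de-duplication pattern. The event_id field and paths are assumptions about the data; this is one approach among several, not the one the speaker uses.]

# Hypothetical sketch: dropping duplicate deliveries after Kafka's at-least-once handoff.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events = spark.read.json("hdfs:///staging/siem/2016-08-03/*.json")

# If each event carries a stable identifier, duplicates can be dropped before landing to Parquet.
deduped = events.dropDuplicates(["event_id"])
deduped.write.mode("append").partitionBy("event_date").parquet("hdfs:///data/siem/events")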

I was wondering which SIEMs you were comparing against your Kafka-based solution in your test results there.

Unfortunately, I cannot share which SIEM the client we were benchmarking against was using. I can tell you it's a top-five SIEM in the marketplace, but I can't go further than that.

Sorry, just on that: would you expect to see different results if you compared with more SIEMs? You mentioned you just used one.

Yeah, depending on the SIEM, I imagine there would be different results. Unfortunately that was the client we were working with, so those are the results I have, but the point I was trying to make is: any Fortune 500, if you can give them that level of performance to ask questions on their data, will gladly spend $50,000 to get it.

So you priced out the cost of the hardware, but did you include the cost of the expertise required to run that, as opposed to running the SIEM? I'm just wondering, you know, it looks like it could scale, but what about that other cost?

That is not accounted for; I tried to price it based on hardware and software licensing only, and since it's all open-source software there is no software licensing, but yes, there is some expertise that goes in. There are companies such as Cloudera, Hortonworks, and MapR that give you a very easy interface to get up and running very quickly with a stack similar to this. You could put, I would say, a basic security analyst on it; it's a very followable tutorial to set it up.

Anybody else? All right, give him a round of applause.