
BSides Vancouver 2015 - Alex Loffler - Advanced Security Analytics

BSides Vancouver · 44:09 · 338 views · Published 2015-04
About this talk
What if we were able to study many months or years of historical data for baselining and incident forensics? Could we then apply this analysis to new data in real-time? This presentation will discuss an approach to this, with some example use-cases to illustrate the capabilities of a class of tools that enable the application of techniques traditionally associated with machine learning to TB/PB scale datasets.
Transcript (en)

This one's going to be interesting. A mutual friend actually told me about Alex a couple of years ago, and what he's doing is of interest to me as well for what I do. He's going to be talking about advanced security analytics and what he's been doing within TELUS. So let's give a warm welcome to Alex Loffler.

Thank you. Thanks, everyone. First of all I'd like to thank the organizers of BSides Vancouver for another fantastic year; I'm sure you agree it's been fantastic. As I'm the only thing between you and a cold beer on a fabulous Vancouver afternoon, I'm going to get cracking. What I'd like to talk to you about today is advanced security analytics. What I mean by that is: how do we get past some of the issues we have with current SIEM technologies and approaches? How do we build a more statistical, more advanced way of detecting anomalies and badness in our environments, as opposed to our simple if-then-else rule sets? Why do we need this kind of thing, how do we respond, how have we looked at implementing it, and then I'll try to ground it in a couple of use cases.

So, perimeter erosion. Things like BYOD, Wi-Fi, telecommuting, and mobile computing are making it increasingly difficult to define the boundary of an organization. The rate at which we're delivering products and services is increasing, the complexity of those products and services is increasing, and a side effect is that the expected lifespan of each product and service is decreasing. As the expected lifespan decreases, so does the appetite for investing in the hardening and security posture of the individual products and services. We're ending up with more of a Google-beta rather than a five-nines dial-tone delivery model.

Another catalyst is virtualization. Virtualization is designed to improve IT responsiveness and decrease cost, but you very rarely see security anywhere in that conversation, and some of the implications are overlooked. The ability to make changes at speed in an environment brings added risk of its own: an innocent mistake at two o'clock in the morning could generate instantaneous changes with far-reaching consequences, never mind whether those actions are actually malicious in nature. Another concept virtualization introduces is monoculture. In the old world, things like the network, storage, and application tiers used fundamentally different technology stacks, whereas today a single vulnerability in the virtualization tier can have massive potential impact. I'll spare you all the stats from Mandiant and the Verizon Data Breach Investigations Report; I'm sure I'm preaching to the choir.

So how do we respond to some of these new threats? Governance, of course. It's the knee-jerk reaction, the first thing we reach for when we see poor security posture in our environments, but unfortunately it tends to do more harm than good. We bring change control to a grinding halt, existing policies are applied too broadly, and new policies are created in this flurry of activity but are not maintained and eventually go stale. This drives change underground; it adds complexity and cost; it doesn't actually improve security, and in extreme cases it actively undermines the existing security posture, as the official process becomes so arduous that it encourages shadow IT, sidestepping the governance processes altogether, be they the new or the old ones. Don't get me wrong, governance is essential, but it has to be applied, or increased, with care.

If stronger governance isn't the answer, what is? Controls: let's add more controls. Today we have a vast array of security controls to choose from. The problem is that implementing even a small subset of them in anything but the simplest of environments results in complexity and overlap reminiscent of a Rube Goldberg machine. Most people think of an edge firewall, for example, as a simple, reliable, robust appliance. But what if our edge firewall contains a vulnerability? Do we put another firewall in front of that firewall? Well, actually, yes: for critical infrastructure that's exactly what our industry does today. We try to select a different vendor in the hope that the vulnerabilities don't overlap, and this basically doubles our operational overheads.

How about that mouse-over, let's see if that works... okay. Sorry, guys.

Cool, okay. So, this firewall-in-front-of-a-firewall concept: we just keep adding more and more controls to a core asset that's fundamentally broken to begin with, and we end up justifying the whole thing with a horribly recursive version of defense in depth. Instead of applying third-party band-aids, what if we focused our efforts on actually building a secure asset in the first place? Our environments and our lives would become quite a bit simpler. In this unicorns-and-rainbows version of the world, we could get rid of all the security controls that focus solely on detecting and mitigating flaws in the technology stack itself, which would leave us with a much smaller set of attack surfaces and attack vectors, most of which are basically around access control and identity: things like detecting credential theft and abuse. We could then deploy a naked OS, web server, and web app, fully exposed to the ravages of the internet, while maintaining a mathematically provable security posture, which would be fantastic. We could reduce provisioning lead time and complexity because there are fewer components to deliver; with no false positives, things like WAFs and IPSs become a thing of the past; there's no overhead from agents on the box, HIDS, AV, all of that stuff goes away; and no resources are required for maintaining these additional security controls.

This is all great stuff, but the problem is that building provably secure systems costs a huge amount of money and effort, so today they're relegated to mission-critical systems like nuclear power plant controls or flight control systems. But if you think about it, it's a one-time cost. Think about all the additional security overhead we expend as an industry, or as a planet, today. I would be much happier to spend ten thousand dollars per OS instance instead of fifty thousand dollars on all the controls trying to secure that OS.

Okay, so meanwhile, back in the real world: how do we actually respond? We have to accept that we're never going to keep a hundred percent of attackers out of our environments. Actually, if you believe some of the industry figures, between thirty and fifty percent of attacks originate from an insider threat anyway, be that employees, former employees, contractors, or the supply chain. They're already within the walls; they hold trusted positions with valid credentials in our environments. So we have to build a model that can fail gracefully. How do we detect this stuff and not be as brittle as the fortress mentality? Really, the only way is to move backwards in the kill chain: try to detect the attack behaviors before they actually get into our environments and cause damage. In effect, build a bigger ear and place it closer to the ground, both in terms of telemetry data and in terms of threat intel itself. From a threat intel perspective we're starting to see a lot of interesting activity in the industry: things like Avalanche from the FS-ISAC folks, CVSS version 3, which I believe is going to be ratified later this year, and STIX and TAXII, to allow us to start communicating and disseminating this kind of threat intel.

The next piece is visibility. A lack of visibility in the environment is really debilitating. It stops us from detecting or responding to threats, it stops us from defining internal security perimeters or zones of trust, and it makes us completely blind to the ability to identify actors across an organization. I'm using the term "actors" here in the broadest sense: employees, IT assets, network elements, customers, suppliers, attackers, anything that is active and can actually do something in your environment. Visibility requires telemetry, and Wikipedia defines telemetry as an automated process by which measurements are made and transmitted to receiving equipment for monitoring and analysis. I'm a bit of a Formula One fan, and it turns out that every Formula One car in a race will generate between 8 and 10 gigabytes of compressed telemetry data per car, per race. So the question I'd like you to ask yourself is this: if a single F1 car can generate over a billion events per hour, with a team of eight to ten people dedicated to monitoring, analyzing, and responding to that data in near real time, how does the logging and monitoring of your organization's critical assets compare, both in terms of the acuity and depth of the log data and in terms of the resources dedicated to its storage, processing, and incident response? We really have a gap here as an industry.

But even in the most poorly instrumented environments there's an avalanche of data being generated every second: huge numbers of logs, network data, human-generated events, et cetera. Typically 80 to 90 percent of the interesting actions in an environment are already logged today. The problem is that they're logged in places that are difficult to search, so the problem becomes moving that raw storage to a platform that's logically centralized and searchable at scale. Our platform is currently ingesting somewhere in the order of 4 billion events per hour, and we'll get into what we're doing with some of that data in a minute.

Given the sheer volume of data generated by today's environments, the traditional approach of waiting for that one golden event that flags the badness is no longer viable. Building these if-then-else rules doesn't get you enough visibility. We need to take a more statistical approach: we need to baseline, we need to understand what normal is. Building rule sets is not scalable anymore. The trick is to find a technology capable of ingesting, processing, and rendering this information in a cost-effective and timely manner, and, it sounds obvious, but the value of the analytics must exceed the platform's cost for this to be a viable proposition. I believe we're finally seeing tool sets emerge that can scale to this kind of volume while maintaining a TCO that doesn't void the business case. Unfortunately the SIEM industry as a whole has a lot of catching up to do with respect to these technologies, but we're seeing some glimmers of light: approaches such as ELK, which is open source, and the OpenSOC project, which I think is backed by Cisco and Hortonworks.

So, we're ingesting both log and packet data into the same repository. Our first tier, the ingestion tier, is based on a stream-processing paradigm; we're using technologies such as Kafka, Storm, and Esper to do the actual processing. This is our tier-one processing: when we receive the logs we do things like normalization, parsing, and enrichment of the data. The next tier is in effect the MapReduce tier, where the data actually hits disk for the first time. It's our long-term storage, and it's where we run CPU-intensive, batch-type processes against the data sets; we're using components such as HDFS, HBase, and Hive. The final tiers allow human access to the data and allow us to build models on top of the data set, and it's basically anything that can speak Hadoop: currently that's things like R, Pig, Splunk on Hadoop (which is called Hunk), Gephi, Maltego, Tableau, et cetera, basically anything that can run MapReduce against the cluster. This approach lets us combine the benefits of ingestion-time pre-processing with search-time schema application, which is critical at scale.

So, the ingestion tier. As I said, we're using a concept called stream processing, or complex event processing. This differs from traditional RDBMS- or message-based systems, which make temporal and real-time queries very difficult. They also typically generate a conflict between reads and writes: in a typical RDBMS-based SIEM you see an ODS, an operational data store, responsible for reading the log data in, and then a reporting data store to service queries. The problem as the data sets get large is that the synchronization of these two data stores becomes really ugly, really quickly. So instead of focusing on the data structures, building tables and relationships and then pouring the data into those structures, we focus on the flow of data through a topology of processes. As an example, we may have a syslog listener that geolocates IP addresses, feeding a filter that keeps only authentication events, which then builds authentication sessions and starts looking at brute-force detection. As you go from left to right, you're reducing the total volume of log data and increasing the value of the log data.

So now that we have this data pre-chewed, normalized, et cetera, what do we do with it? Let's pick a very simple example. Let's say I want to flag possible indications of compromise by monitoring the destinations of outbound web traffic. I'm looking at my user traffic: Alex is going out to google.com; is that an indication of compromise or not? Version one of a solution might be: go out and grab as much threat intelligence as I can about known-bad IPs and domains and generate a blacklist; if I see a web proxy hit or web traffic to one of those IPs or domains, I flag it to my CERT team. Simple stuff. Version two might be: well, a binary output may not be good enough, so I rank the badness of the event on a scale of, let's say, zero to a hundred, based on feedback from the CERT team. Version three: maybe you start implementing whitelists, because some of your threat intel is incomplete and it's saying things like "google.com may be bad," so I want to whitelist certain domains. As I build more and more logic in, it gets harder and harder to maintain the code base. Version four might be: let's take a completely different tack and inspect the DNS registration date of the domain being hit, the assumption being that, say, a production server has no business talking to a domain that's less than 24 hours old. So now what I have is a bunch of models all running in parallel, and I need some kind of combiner, some way of weighting the outputs from each of these models, so that I can determine whether I need to cut an incident or not. The problem is that manually maintaining the logic for this combination process is not scalable, especially when you start adding hundreds of rules and potentially hundreds more models.

So what we're looking to do here is use ensemble methods, with a way of automatically generating a feedback loop that modifies those weightings based on feedback from the CERT team itself. We can exploit the user feedback loop to automate the tuning: the ability to combine model outputs in an arbitrary way, such as weighted averages or champion/challenger-type models, but also the ability to modify those weights based on user feedback. If the user tells me "no, this is actually a false positive, this is not a bad domain," I can use something like Bayesian learning, naive Bayes learning, to modify those weights, so that over time I'm selecting models that have performed well historically with respect to the feedback from the users. The other nice thing about this approach is that we can introduce new rule sets or new models at very low risk: we introduce the new model with a weighting of zero, and it only starts contributing to the output once it has proven its performance over time. That's very different from traditional SIEM tools, where adding a rule can sometimes really break the platform.
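
A minimal sketch of this combiner-plus-feedback idea, with hypothetical model functions and a simple additive weight update standing in for the Bayesian learning mentioned; all names, scores, and learning rates below are illustrative.

```python
BLACKLIST = {"evil.example"}          # stand-in threat intel feed

def blacklist_model(event):
    return 100.0 if event["domain"] in BLACKLIST else 0.0

def domain_age_model(event):
    # Assumes an upstream enrichment step attached a WHOIS-derived age.
    return 100.0 if event.get("domain_age_hours", float("inf")) < 24 else 0.0

MODELS = [blacklist_model, domain_age_model]
weights = [1.0, 0.0]   # a new model starts at zero and must earn its weight

def combined_score(event):
    total = sum(weights)
    return sum(w * m(event) for w, m in zip(weights, MODELS)) / total if total else 0.0

def apply_feedback(event, was_true_positive, lr=0.1):
    """Models that agreed with the analyst gain weight; the rest lose it."""
    for i, m in enumerate(MODELS):
        agreed = (m(event) >= 50.0) == was_true_positive
        weights[i] = max(0.0, weights[i] + (lr if agreed else -lr))

alert = {"domain": "evil.example", "domain_age_hours": 3}
print(combined_score(alert))            # blacklist model dominates initially
apply_feedback(alert, was_true_positive=True)
print(weights)                          # domain-age model starts earning weight
```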

Okay, so let's get into some use cases. The first is device typing. The question we asked ourselves is: can we use log and NetFlow data to automatically classify devices based on observed behavior? If we can, can we then flag statistically significant deviations from that baseline? So I'm not defining normal myself; I'm letting the system work out what normal actually is. I took this a step further and asked: if we can type the log data, can I use the relative mix, the ratios between log types coming from a device, to group and classify those assets? It turns out that for this analysis we don't even need to deeply understand or parse the logs in the traditional sense. I don't need to build parsers for a data source. Instead, I can use string-manipulation techniques, things like n-gram analysis over fields and character positions, to work out things like the diversity of a particular field. A timestamp is monotonically increasing, so it's always going to change; if a field is an IP address for a particular event type, it will change according to the characteristics of an IP address string. We can build these rules into an engine without having to parse the individual data source types, which is the lion's share of the work when it comes to SIEM integration. Because this focuses on the string structure and not the semantic content, we don't need to build any parsers anymore, so we're calling this "zen parsing": there is no spoon.
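
Here is a toy illustration of that idea: collapsing log lines to structural shapes and measuring per-position character diversity, so variable fields (timestamps, PIDs, IPs) reveal themselves without a parser. The samples and heuristics are invented for illustration.

```python
from collections import Counter

def shape(line):
    """Collapse a line to a coarse structural signature:
    digits -> 'd', letters -> 'a', everything else kept as-is."""
    return "".join("d" if c.isdigit() else "a" if c.isalpha() else c
                   for c in line)

def position_diversity(lines):
    """Count distinct characters seen at each position across aligned samples
    of one event type: literal text stays at 1, while variable fields churn
    and score higher."""
    width = min(len(l) for l in lines)
    return [len({l[i] for l in lines}) for i in range(width)]

samples = [
    "2015-04-18 10:01:02 sshd[411]: Accepted password for alice from 10.0.0.5",
    "2015-04-18 10:03:47 sshd[913]: Accepted password for bob from 10.0.9.77",
]
print(Counter(shape(s) for s in samples))  # a device's mix of structural shapes
print(position_diversity(samples))         # higher values mark variable fields
```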

Once we have this measure of closeness, we can compare the logs coming from the same machine and from other machines. We're applying a combination of principal component analysis, support vector machines, and simple thresholding to do this classification, and we're clustering the assets along multiple dimensions. The obvious one is function: is this a web server, a database server, a DNS server, et cetera? The second is role: is this a server or a user endpoint, as a coarse-grained split? Based purely on NetFlow, our preliminary results show about a 90 percent accuracy rate for determining whether a device is a server or a user endpoint, using no other additional data. The third is variance: how changeable is the asset's behavior over time, and does the variance itself point to further dimensionality? For example, an endpoint belonging to a finance person typically shows a large spike in activity towards the end of the financial year, so can you use that to further subtype the classifications? And finally, cardinality, which is in effect the number of distinct elements in a data set.

Some of the concepts we're coming up with here are the popularity and the promiscuity of a device. A vulnerability scanner, for example, is an extremely promiscuous device: it generates a lot of outbound connections to a lot of hosts. A DHCP server, on the other hand, is a very popular device: it doesn't generate many outbound connections, but we see a lot of inbound traffic to it. We can start asking questions like: show me the top 100 hosts based on the volume of traffic sent by or to a device; show me the 100 least popular, the rarest, domains visited today from my entire infrastructure, and list the assets that made those requests. Things like port scanning become very easy to spot; they light up like a Christmas tree even if the scanning is low and slow, because we're computing the cardinality over large periods of time. A web server that's responding to requests is quite a popular device with a fairly low promiscuity score; if it's compromised and suddenly starts port-scanning its local network over a period of two, three, four days, it will be very easy to see that increase in promiscuity.
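
A small sketch of how promiscuity and popularity might be computed from flow records, using plain Python sets for clarity; at billions of flows you would swap the sets for HyperLogLog sketches, which is where this goes next. The field names are assumptions.

```python
from collections import defaultdict

flows = [
    {"src": "10.0.0.5", "dst": "10.0.0.1"},   # clients hitting a popular server
    {"src": "10.0.0.6", "dst": "10.0.0.1"},
    {"src": "10.0.0.9", "dst": "10.0.0.2"},   # one host fanning out: scanner-like
    {"src": "10.0.0.9", "dst": "10.0.0.3"},
    {"src": "10.0.0.9", "dst": "10.0.0.4"},
]

out_peers, in_peers = defaultdict(set), defaultdict(set)
for f in flows:
    out_peers[f["src"]].add(f["dst"])
    in_peers[f["dst"]].add(f["src"])

promiscuity = {h: len(p) for h, p in out_peers.items()}  # distinct outbound peers
popularity  = {h: len(p) for h, p in in_peers.items()}   # distinct inbound peers
print(sorted(promiscuity.items(), key=lambda kv: -kv[1])[:10])
```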

The problem is that measuring and reasoning about cardinality is computationally very expensive for large data sets, and this is where HyperLogLog comes in. HyperLogLog is an algorithm for approximating cardinality, and it turns out we don't need to be 100 percent accurate for any of our use cases; a statistical approach is sufficient. The way HyperLogLog works is this: if you take a hash of a value, that hash is pretty much random. If you consider the binary representation of that hashed value, what is the probability of the last bit being zero? Well, it's binary, so it's 50 percent. How about the last two bits both being zero? That's 25 percent. Since these probabilities are known, we can track the longest runs observed and hence estimate the number of unique values in the set. From this we can estimate the number of distinct values, and we can also accurately estimate the degree of error: the smaller the HyperLogLog object, or the hash length, the larger the error. But it gets better: HyperLogLogs also support set operations, so we can do unions and intersections, for example, how many new hosts were communicated with today compared to yesterday's activity? This allows us to do things like detection of stealthy port scans, and anomaly and outlier detection. It's also well suited to MapReduce jobs, so we can parallelize both the creation and the use of the HyperLogLogs. As an example, a set of one billion unique IPs resolves down to about a 1.5-kilobyte object with about a two percent error estimate on the count. Using a traditional approach, a full distinct-count search through that data set would take about 22 and a half minutes on our cluster; using HyperLogLogs, we drop that search down to about 10 seconds, which means we can now use this concept of cardinality in many more use cases.
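
To make the trick concrete, here is a toy HyperLogLog: hash each value, use the first p bits to pick a register, and track the longest run of leading zeros seen in the remaining bits. This sketch omits the small- and large-range bias corrections of the published algorithm, so treat it as an illustration rather than a production implementation.

```python
import hashlib

class ToyHLL:
    def __init__(self, p=12):
        self.p, self.m = p, 1 << p
        self.reg = [0] * self.m              # ~4 KB no matter how big the set

    def add(self, item):
        x = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        j = x >> (64 - self.p)                       # first p bits: register index
        w = x & ((1 << (64 - self.p)) - 1)           # remaining bits
        rho = (64 - self.p) - w.bit_length() + 1     # leading zeros + 1
        self.reg[j] = max(self.reg[j], rho)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # standard bias constant
        return alpha * self.m ** 2 / sum(2.0 ** -r for r in self.reg)

    def union(self, other):                          # e.g. today vs. yesterday
        self.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]

h = ToyHLL()
for i in range(100_000):
    h.add(f"10.{i >> 16}.{(i >> 8) & 255}.{i & 255}")
print(round(h.count()))   # ~100000; error is roughly 1.04/sqrt(m), ~1.6% here
```

The union is just a register-wise max, which is what makes the yesterday-versus-today questions cheap; intersections then fall out of inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|.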

So now that we've built this capability, what can we do with it? Some of the things we're looking at: can we detect new log event types, or a change in the mix of event types coming from a device? If we do detect a statistically significant change, could that be an IoC, or is it the result of patching or some other legitimate change, making it a change-management problem instead? We can actually use this to detect unpatched assets: we've been able to detect the difference between a 2008 and a 2008 R2 patched OS based purely on the log data coming out of it, without building parsers for that log data. The other nice thing is that if we do have a known compromise, we can look at the delta between the known-good and known-bad profiles and then look for that same delta elsewhere in our environments, building a pattern for how the compromise looks from a NetFlow perspective as well as a change in log profile. Another finding: everyone likes DNS, but too much of a good thing can be an indication of compromise, and we've actually detected compromised assets based on their use of DNS as a side channel.
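
One way to implement the "change in the mix of event types" check is to compare a device's baseline event-type distribution against the current window with Jensen-Shannon divergence. The talk doesn't name a specific statistic, so this particular choice, and the event names and thresholds below, are illustrative.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two count dictionaries (in bits)."""
    keys = set(p) | set(q)
    def norm(d):
        s = sum(d.values()) or 1
        return {k: d.get(k, 0) / s for k in keys}
    p, q = norm(p), norm(q)
    m = {k: (p[k] + q[k]) / 2 for k in keys}
    kl = lambda a, b: sum(a[k] * math.log2(a[k] / b[k]) for k in keys if a[k] > 0)
    return (kl(p, m) + kl(q, m)) / 2

baseline = {"sshd.accept": 120, "cron.run": 800, "http.get": 4000}
today    = {"sshd.accept": 130, "cron.run": 790, "http.get": 3900, "dns.query": 2500}

score = js_divergence(baseline, today)
if score > 0.05:   # threshold tuned against legitimate change (patching, deploys)
    print(f"log-mix drift {score:.3f}: possible IoC, or an unmanaged change")
```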

The final piece here is around topology and dependency modeling. The nice thing is that we can start building models of communication between assets and again look at what is normal. Is it normal for a web server to talk to another web server? Actually, not really, based on the clustering and typing we're seeing; it's much more common to see web servers talking to database-type assets. It's very unlikely that user endpoints would talk directly to a database asset; they typically talk to web servers. So we can use the typing and clustering we're pulling out of this to start looking for pairings and communications that are anomalous in nature.
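
A minimal sketch of baselining type-to-type communication and flagging rare pairings; the device types would come from the clustering described above, and are hard-coded here for illustration.

```python
from collections import Counter

observed = Counter()
for src_type, dst_type in [("web", "db"), ("web", "db"), ("endpoint", "web"),
                           ("endpoint", "web"), ("endpoint", "db")]:
    observed[(src_type, dst_type)] += 1

total = sum(observed.values())
for pair, n in observed.items():
    if n / total < 0.25:            # rarity threshold, illustrative
        print(f"anomalous pairing {pair[0]} -> {pair[1]} ({n}/{total} flows)")
```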

The final use case is around credential use and abuse. What we're looking at here is whether it's possible to map credentials back to identities themselves. That's fairly simple when I have a one-to-one mapping against, say, my HR records, but it turns out that what a user should know and what a user actually knows are almost never the same thing. If I look at my systems of record, my AD, my LDAP, et cetera, there's a certain set of credentials I should formally have access to, but over time that bleeds: I swap credentials with other people; it's just human nature. So instead of using the systems of record, we monitor usage records, and that turns out to be much more accurate over time. A local authentication event, for example, says "username X logged in at time T," but a remote authentication says "username X logged in at time T from source Y," where Y could be an IP or a host, and with that we can start stitching the path back together. In this case our Unix admin guy, Ozzy Osbourne, is logged into his laptop, and from his laptop he pivots throughout the day to multiple machines, and it's fairly obvious when a credential is different. I don't know if you can read that, but in this particular case it's a very simple pattern-match difference: using HR records, naming conventions for usernames, standard transforms, common account names, et cetera, it's really easy to take a first stab at determining whether this is bad. This is fairly standard stuff, and the nice thing is that it works for any number of pivots, as long as there is one, and only one, account logged into the asset at that point in time.

But what about a more real-world example, where we potentially have multiple users logged into the same box at the same time, who then pivot using more anonymous, shared credentials such as root or administrator? It turns out we can model this as well, by modeling the concurrent actors and adding weights to the relationships. Here we've got Ozzy logging into his laptop and Kim logging into her laptop, and at some point in time they both log into a jump point. Then one of the two pivots into an asset behind that jump point as root. At that point I've lost granularity: I can't assign the knowledge of that credential to Ozzy or to Kim, so we split the weighting, and it turns out this ambiguity resolves itself relatively quickly; Ozzy might take a day off, Kim logs in, and now I can assign it. For simplicity we've started with weightings of one here, but in the real world we would look at the type of authentication taking place: if it's single-factor authentication, maybe we weight it at 0.6; if it's two-factor, maybe 0.8; and we take it from there. Over time we build up a view like this, where the squares are identities and the circles are credentials, and we can start asking some interesting questions.
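
A sketch of the weight-splitting idea, including the single-factor 0.6 and two-factor 0.8 weightings just mentioned; the data structures and account names are illustrative.

```python
from collections import defaultdict

# knowledge[identity][credential] -> accumulated evidence that the identity
# knows the credential
knowledge = defaultdict(lambda: defaultdict(float))

AUTH_WEIGHT = {"password": 0.6, "two_factor": 0.8}   # weightings from the talk

def record_pivot(users_on_host, credential, auth_type="password"):
    """Attribute a pivot made with `credential` to every identity currently
    logged into the source host, splitting the weight among them."""
    share = AUTH_WEIGHT[auth_type] / len(users_on_host)
    for user in users_on_host:
        knowledge[user][credential] += share

record_pivot(["ozzy"], "ozzy@laptop", "two_factor")   # unambiguous
record_pivot(["ozzy", "kim"], "root@server9")         # ambiguous: split 50/50
record_pivot(["kim"], "root@server9")                 # next day, Kim alone

# The ambiguity resolves as unshared observations accumulate:
print(dict(knowledge["kim"]))    # Kim's root@server9 weight now exceeds Ozzy's
```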

To start with, excessive access: this guy at the top here, is he some kind of admin rock star whose role dictates that he should have this many credentials, or has the user moved around the organization and simply retained credentials that should have been revoked? Again, we can look at this by typing the identities themselves, using things like the AD topology and AD structures: if this is a marketing guy, maybe he shouldn't have this many credentials; if he's an admin, maybe that's fine. The inverse is the obvious use case: the dot in the middle with excessive knowledge of a credential. Shared credentials are bad by default, so maybe that's a good candidate for a password rotation, and we can very easily generate top-10 or top-100 lists of these kinds of credentials based on sharedness. The other interesting piece is what happens when an employee actually leaves the company: now we can identify the shared credentials the employee had, legitimately or otherwise, as candidates for password rotation, so we're not going to lock anybody out, and we can also, with a much higher level of confidence, disable the unshared credentials.
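
Those two queries, excessive access and excessive knowledge, reduce to degree counts over the identity-credential graph. A toy sketch, with invented edges:

```python
from collections import Counter

# (identity, credential) edges from the weighted usage graph; invented data
edges = [
    ("ozzy", "root@server9"), ("kim", "root@server9"), ("pat", "root@server9"),
    ("ozzy", "ozzy@laptop"), ("ozzy", "admin@fw1"), ("ozzy", "admin@db2"),
]

creds_per_identity = Counter(i for i, _ in edges)    # excessive access
identities_per_cred = Counter(c for _, c in edges)   # excessive knowledge

print("most credentials:", creds_per_identity.most_common(3))
print("rotation candidates:",
      [c for c, n in identities_per_cred.most_common() if n > 1])
```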

And finally, if you look at the authentications in the frequency domain, doing a frequency analysis of the authentications, users tend to be fairly stochastic in nature, which is a nice maths way of saying they're kind of random. Compare that with system-type logins, which are very periodic in nature, typically cron jobs and the like, and it's very easy to tell the difference. Within TELUS, for example, we have the concept of user accounts and service accounts: user accounts are typically all two-factor, while service accounts are used for scripted logins between business processes and by definition cannot be two-factor, so they're single-factor. Admins being human beings, being lazy, there's sometimes a tendency to use service accounts for interactive logins, and we were originally looking at building this use case specifically to catch that. But it turns out the inverse use case is actually way more valuable: somebody hardwires their own user credentials into a script, that script becomes part of a business process, and then they leave the company. Operations teams are typically fairly hesitant to decommission admin accounts when an admin leaves the company for exactly this reason; we don't want to break a business process. Now we can detect that very easily: we can detect the source desktop, and we can detect the credential that's actually in question.
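
A minimal sketch of that frequency-domain test: bin authentication timestamps into a series and measure how concentrated the spectrum is. A cron-driven account puts most of its energy into one peak, while a human spreads it out. The DFT here is pure Python to stay self-contained, and the bins and thresholds are illustrative.

```python
import cmath

def spectral_concentration(counts):
    """Fraction of (non-DC) spectral energy in the single strongest peak;
    values near 1.0 suggest a strongly periodic, script-driven login pattern."""
    n = len(counts)
    mags = []
    for k in range(1, n // 2):            # skip the DC component
        s = sum(c * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, c in enumerate(counts))
        mags.append(abs(s) ** 2)
    total = sum(mags) or 1.0
    return max(mags) / total

cron_like  = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # a login every third bin
human_like = [0, 1, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0]

print(spectral_concentration(cron_like))    # high: candidate service account
print(spectral_concentration(human_like))   # lower: stochastic, human-like
```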

So, in summary: interesting data is everywhere. There's a fascinating realization in biology that you're just as likely to discover a new species in a tropical rainforest as you are in your own backyard; the thought there is that nature is so rich it doesn't actually matter where you start looking, the important piece is that you start looking. So don't worry about "if only I could get agent X onto our endpoints so I could capture Y"; just start with what you have in front of you. I guarantee you'll find interesting correlations in the data sets you already have available. I believe a combination of stream-processing and MapReduce-type technologies is enabling analytics that were previously cost-prohibitive. ETL doesn't scale, and MapReduce-type technologies are inverting that paradigm: instead of having processing platforms and ETL-ing data through them, introducing delay at each point, we're bringing the compute to the data as opposed to pushing the data through the compute. In my opinion, security analytics is actually more complex than many other big-data domains, and the reason I say that is that we're trying to correlate multiple disparate data sets from thousands, if not millions, of endpoints, and we're trying to detect actors who are typically actively trying to evade detection.

To do this you really need tools that can reason about both log and packet data in the same engine, and this, I think, is a big failing of some of the tool sets we have today. We have some fantastic SIEM-type tools that specifically look at log data, and some great network-forensics tools that specifically look at network data, but we're really missing a trick: if we can combine the two and build correlation rules that let us reason about both types of data in the same engine, that will open up whole new use cases and whole new areas of visibility. And finally, if-then-else-type correlation rules are no longer sufficient: the level of effort to maintain a rule set increases exponentially as the rule set grows, creating a maintenance nightmare. Machine learning is key, preferably unsupervised learning, as a more effective way to make sense of these highly complex data sets. Thank you very much.

Any questions at all? This guy.

So this is a visual representation of all of the web traffic coming out of the TELUS corporate network for a day. If I zoom into some of this, you can pick out Facebook straight away. The size of a node represents its connectedness, the number of connections to it. The reason for this visualization is twofold: I talked a lot today about the dry, under-the-hood math of how the correlations and analytics are implemented, but another real area of work is how we visualize this data. Here's an example of why: we would never detect this kind of thing by looking at just lists of IP addresses and connections. I'm not going to zoom in too much, because I'd like to protect the innocent, but what you see here is two IP addresses, which are actually TELUS assets, talking to IPs on the internet, by IP rather than by domain name, in this lovely star configuration where each of these machines was talking to each of these external IP addresses. Nothing else in the TELUS enterprise was talking to any of those IPs. It jumped out like a sore thumb, and these two machines were indeed compromised.

We have some other interesting examples. This one is a TELUS IP address talking to a Canadian parliament site, which concerned me for a while, but it turns out that this IP address belongs to a machine streaming the parliament video feeds to a call center, so the folks in the call center can keep up to date with what's going on in parliament. So that's an interesting one. Some other pieces here: little islands of connectivity are always interesting. Looking at this one, we see redhat.com and a bunch of IP addresses talking to it, and these turn out to be our internal repo caches. So visualization is an interesting domain in its own right, and it helps when we're trying to manage, and wrap our heads around, data sets of this size. Sorry?

[Inaudible audience question.] Well, those servers are connecting to the redhat.com domain to pull down new versions, absolutely, but they were a little island: nothing else in our environment was talking to redhat.com.

[Inaudible audience question.] No, sorry, this is only outbound traffic. Yep. Any other questions? Oh, sorry, yes.

Sure, yeah, that's an overlay based on some of our threat-intel feeds, so it's more interesting than just the domains being spoken to.

Possibly.

I'm almost certain that's either Google or Microsoft, someone like that. It's horrible to read... yeah, it's google.com.

Sorry, bad rendering. Another view of this, which I think is also quite interesting: the holy grail is, can I remove all of the good stuff and only look at the bad stuff, only at connections going to something that is actually a malware site, et cetera? That's actually really hard to do, and sometimes it's worth flipping the question: can I just remove everything that I know is good? If I can guarantee that something is good, I'm not interested in it. So this is an example: the same data, the same point in our network, looking at egress traffic to the internet from our corporate environment, but filtered down to traffic to online file storage and file-sharing applications. If I look at things like this, something like SkyDrive should be fairly obvious, and we have the iCloud collection of boxes down here. But then I start seeing an interesting relationship: one IP address using one, two, three, four, five, six distinct online file storage or file-sharing applications. From our analysis of the full data set, it turns out that individuals either don't use online file storage at all or they use more than two services, with the sweet spot somewhere between two and three services, about 2.2 services. So this guy is a bit of an outlier: why is he using so many file-sharing services from a corporate TELUS asset? Again, maybe an interesting one to investigate. Any other questions?

Okay, I guess it's beer time.