
Security Metrics: Why, where and how?

BSides Lisbon · 2015 · 46:08 · 697 views · Published 2015-07 · Watch on YouTube ↗
About this talk
Security metrics, when used correctly, can help you paint a picture of the status of your security. They can also help you make security decisions, so that you can prioritize and reduce risk when making those decisions. This talk will focus on answering the following questions: What security metrics can I get? How can they help me? How can I get started with obtaining and using security metrics? How can I combine data science with security and automation? How can I be in an agile environment and still have security measures and processes apply? We will also provide some demos where we obtain and analyse security metrics from different sources.

So we'll get started then, yeah? Okay. So we're here to talk a bit about security data metrics and measurements at scale. To give you guys a bit of background on what this talk is about, just in case you don't like it, you can leave now. If you know me and Tiago and some of the work we've done in the past, you know that we've done scanning at scale and we've done a couple of talks here in Portugal about it. This is more an evolution of that talk. We enjoyed that research and we thought we would keep going with it and develop some products around it and all these things. This research builds a little bit

upon that. That was the foundation which brings us to today. So, the people that made this talk happen: me and Tiago, and the other one is Roberto, but he's chilling on a beach in Cape Verde, you know, just relaxing. So he's not here today, but he helped us a lot in building this talk and it's important to mention him. We're currently located in Switzerland, in Zurich. If anyone's in town and is up for a visit, make sure you ping us and we'll take you for some chocolate and cheese. So what exactly are we doing? We're trying to use machine learning techniques combined with data mining to solve security issues. And as a company, as me

and Tiago work at the same company at the moment, we're very much a data-focused company. Every decision we make, every product we develop is focused on data. And this is what we're trying to do: we're trying to grab these three areas, grab the data related to those three areas and correlate it in some way so that we can try and solve some problems in security. We very much believe that data is the new oil. Oil runs the world, and in a couple of years it's going to start being data. And privacy is the new currency. Privacy is key, and everyone at some point says they want their privacy but then posts everything on Facebook. And this is very important. So the way

we work, we start by collecting data. We store this data, we filter it and refine it, process it in some way. We make sure this data gets richer by some methods that we're going to show you guys. We then simplify access to this data, which we'll also talk about, and we then do some consultancy and advisory on top of this data that we have. So, one important thing: for you guys to understand some of the decisions we took on an engineering level, we have to explain a little bit the way we created the company we're currently working at. We are an exponential organization. This means that when we created the company, we looked at a set of skills we needed

in the company, which came to about 20 different skills, and we call these the core. And these core skills are going to be people that will be under contract in our company and will definitely be the people responsible for that area. Everyone else will be on demand. So if we need pen testers, we'll go and get them on demand from other companies; we're not going to hire them into our company. This is important for some of the security and engineering decisions that we made. Some of the exponential organizations you guys might have heard of are, for example, Uber. The drivers for Uber are not on their payroll, they are staff on demand at Uber.

Uber provides the framework and then they get the staff on demand, the drivers. Another example: Airbnb. Airbnb has a framework, people put their houses there and use that framework. Another example of an exponential organization that was bought not that long ago: WhatsApp. WhatsApp was bought by Facebook, I believe, for one billion or something close to that, and there were only 45 people running the entire thing. And this is something we're trying to aim for, to become an exponential organization. We've created four sections in our company: security, product, business, and operations. And this is very important for us, because one of the things we've been doing up until now is providing security services. However,

right now we're really going to focus just on building products. And again, this is also important for some of the decisions we made in engineering. So one of the things when you're building a system that collects data is the architecture. And the architecture is going to be the foundation of everything you're going to build upon. So it has to be your strongest asset. Our architecture has some goals, it has some requirements and of course it brings us some results. As you guys can read through that, these are some of the goals we have. When Tiago talks a bit about the engineering part, you guys are going to see some of the decisions we took in

terms of technologies, in terms of design. And they are all aiming at having this type of architecture, with these goals at least, so that we can exchange different components in our architecture and we don't just have this huge monster that you can never debug, you can never maintain, and that becomes very complicated. So in terms of our improvements to our data and to our architecture and to our products, what exactly do we want to do over the course of the next years? We started by working on something called the Minions, which again Tiago will explain in the engineering part. We then started receiving a huge amount of data and we had to deal with, obviously,

storage of this data and archiving it, being able to search and classify this data, then create a user API so that our users can access that data and use it themselves. We then do a couple of funny things like image processing and data visualization, and we're also working towards using machine learning for some of the stuff we're going to show you in a little bit. So, I'll now pass over to Tiago for the engineering part and then I'll come back for the machine learning. Hi everyone. So, I'm supposed to be off the presentation in five minutes, so I'm going to speak very slowly so I can occupy the 15 minutes. So, we are a very young startup with a

huge amount of experience. Our core team has a lot of experience; at least one of the guys has 20 years of experience — Tiago has 20 years of experience building scalable infrastructures, scalable platforms to support billions of clients, etc. So we are trying to build, from the ground up, an architecture that is based on the microservices approach. Like Tiago said, it's supposed to be replaceable and scalable in all its components. So, where do we start? Where do we pick the technologies? Do we design the architecture first, or start by prototyping and testing the technologies? So, we have many choices, and I'm going to pick some of them and try to explain those choices. So, we focused on the architecture,

designing the architecture from the ground up and we don't care about the technology. We are going to pick multiple technologies that can be replaceable and if we need to replace them, we will replace them in a later stage.

I'm going to explain this architecture. So it's basically a job-oriented architecture that has dynamic and fixed jobs to collect information. It has data-enrichment parts that will allow us to provide better information and gather information using some technologies that I'll talk about later. So, like Tiago said, we have an API that we use for everything. We have a basic HTTP API, some command-line clients that connect to the API, some modules that we can give to our clients to access our platform, and later we will integrate with third-party APIs so they can use our platform as well. The job types: there are multiple job types. We can group them by data collection, data processing and analytics.
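To make the job-oriented part concrete, here is a minimal sketch of what one of these job messages could look like. The field names and job types are illustrative assumptions, not the platform's actual schema:

```python
# Hypothetical job message for a job-oriented architecture like the one described.
# Field names and job types are illustrative assumptions, not the real schema.
import json
import uuid
from dataclasses import dataclass, asdict

@dataclass
class Job:
    job_id: str    # unique id so results can be traced back to the request
    job_type: str  # e.g. "collect.portscan", "process.ocr", "analytics.classify"
    channel: str   # queue/channel the minions listen on
    target: str    # what to work on: a CIDR range, an IP, a stored object key
    params: dict   # job-specific options (ports, rate limits, ...)

def new_portscan_job(cidr: str, ports: str) -> Job:
    """Build a data-collection job for a port-scanning minion."""
    return Job(
        job_id=str(uuid.uuid4()),
        job_type="collect.portscan",
        channel="jobs.portscan",
        target=cidr,
        params={"ports": ports, "rate": "slow"},
    )

if __name__ == "__main__":
    job = new_portscan_job("192.0.2.0/24", "22,80,443")
    print(json.dumps(asdict(job), indent=2))  # what would be published to the queue
```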

So, the fun part. Like Tiago said, we love the Minions, so we have multiple types of minions that execute the jobs. Some of them gather information, like port scans; others process information, screenshot the entire internet, or process those screenshots using OCR, etc. But there are also the ones that do whatever we want: we can custom-build very simple and small agents. To give some examples of the data collection: we collect different types of data, and as Tiago mentioned, I can give you guys an example. One of the things we do, obviously, is we still do port scanning at mass scale. One of the things we do now, we have a minion that literally all it does is screenshot every single page that we catch that's HTTP or

HTTPS — if we detect the HTTP protocol, we take a screenshot of it. We then pass this information over to a second minion, which does OCR on it. Let's forget HTTP for a second. If we catch VNCs or RDPs, same thing: we screenshot the thing. We then pass it over to OCR and we get lots of interesting things. So we have an IP address. We know this IP address belongs to this entity. Then, by doing OCR on RDP and VNC, we are able to start extracting usernames, server names, all of these things in an automated form. So we start building a profile about this IP as an entity just by using these automated techniques.
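A rough sketch of that screenshot-to-OCR step, assuming Pillow and pytesseract for the OCR part (the talk does not name the actual OCR stack, and the extraction patterns here are purely illustrative):

```python
# Sketch of the screenshot -> OCR -> extraction step described above.
# Assumes Pillow and pytesseract are installed; the talk does not say which
# OCR library is actually used, so treat this as an illustration only.
import re
from PIL import Image
import pytesseract

INTERESTING = {"scada", "smarthouse", "satellite", "administrator", "login"}

def ocr_screenshot(path: str) -> str:
    """Run OCR on a saved VNC/RDP/HTTP screenshot and return the raw text."""
    return pytesseract.image_to_string(Image.open(path))

def extract_profile_hints(text: str) -> dict:
    """Pull rough hints (keywords, Windows version, usernames) out of OCR text."""
    return {
        "keywords": sorted({w for w in re.findall(r"[a-zA-Z]+", text.lower())
                            if w in INTERESTING}),
        # Very naive patterns, purely illustrative:
        "windows_version": re.findall(r"Windows Server\s+\S+", text),
        "users": re.findall(r"(?:user(?:name)?|login)[:\s]+(\w+)", text, re.I),
    }

if __name__ == "__main__":
    text = ocr_screenshot("rdp_203.0.113.7.png")   # hypothetical screenshot file
    print(extract_profile_hints(text))
```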

So basically the agents listen to a channel or multiple channels and they can work in conjunction towards an end goal. So basically the information goes around all the agents and ends up on our infrastructure. The agents can be built with any language, at least the languages that support the technologies we think we might need in the future. So we mainly use Go, Python, Node.js, Scala, Java. And we try to at least support job control queues. Personally, I like NSQ because of its features — it has a lot. I don't know if you guys know it; it's the platform built by Bitly. We can also use Redis, but it doesn't have as many features; we would need to build something on top of that, on top

of the controller. Some stats — I don't know if they mean anything to you, but we can go from NATS to Kafka; we don't care only about the performance, we care about the accuracy and the resilience of the components. So we keep this in mind, at least for the brokered ones, but we won't use most of them because they don't have the features we need. For example, NATS is really fast, but does it perform well? Does it recover the jobs we need? Etc. There are also the brokerless ones that are really fast, but they also need something like Redis, they also need control on top of them: if you lose a job, you need to track it, and then you need a database, etc.
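As a sketch of how a minion could consume jobs from an NSQ channel, here is a minimal consumer using the pynsq client; the topic, channel and nsqlookupd address are assumptions for illustration:

```python
# Minimal sketch of a "minion" consuming jobs from an NSQ channel, using the
# pynsq client. Topic/channel names and the lookupd address are assumptions.
import json
import nsq

def handle_job(message):
    """Process one job message; returning True acknowledges (finishes) it."""
    job = json.loads(message.body)
    print("got job", job.get("job_type"), "for", job.get("target"))
    # ... do the actual work (scan, screenshot, OCR, ...) and publish results ...
    return True

reader = nsq.Reader(
    message_handler=handle_job,
    lookupd_http_addresses=["http://127.0.0.1:4161"],  # assumed nsqlookupd address
    topic="jobs.portscan",
    channel="minions",          # all minions in this channel share the work
    max_in_flight=4,
)

if __name__ == "__main__":
    nsq.run()
```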

There are also the managed ones: some Portuguese guys, realtime.co — they do messaging and you can build some job queues on top of that. The data enrichment: so, like Tiago said, minions gather information from a source, and depending on the infrastructure, you can define which types of operations are going to be performed on the data. So for example, if you scan an IP, you can then add GeoIP data, etc. This type of enrichment isn't only built by us; there are some options for that enrichment. For example, AnubisNetworks, if you know that one, from a Portuguese company — it can gather information and enrich it, for example adding GeoIP information, cross-validating information, etc. So, on top of that enrichment, you can generate alarms for a specific setting or definition — for example, if you find a specific product in a specific network, get alerted.
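A hedged sketch of that enrichment-plus-alerting idea, using the geoip2 library with a GeoLite2 database as an assumed enrichment source and a made-up alert definition:

```python
# Sketch of enrichment + alerting: take a raw scan record, add GeoIP data, and
# fire an alert when a watched product shows up in a watched network. The
# geoip2/GeoLite2 choice and the alert definition are illustrative assumptions.
import ipaddress
import geoip2.database

WATCHED = {"network": ipaddress.ip_network("198.51.100.0/24"),
           "product": "OpenSSH"}          # illustrative alert definition

def enrich(record: dict, reader: geoip2.database.Reader) -> dict:
    """Add country/city information to a scan record like {'ip': ..., 'product': ...}."""
    geo = reader.city(record["ip"])
    record["country"] = geo.country.iso_code
    record["city"] = geo.city.name
    return record

def check_alert(record: dict) -> bool:
    """Return True if the record matches the watched product/network definition."""
    in_net = ipaddress.ip_address(record["ip"]) in WATCHED["network"]
    return in_net and WATCHED["product"].lower() in (record.get("product") or "").lower()

if __name__ == "__main__":
    with geoip2.database.Reader("GeoLite2-City.mmdb") as reader:   # assumed DB path
        rec = enrich({"ip": "198.51.100.23", "port": 22, "product": "OpenSSH 6.6"}, reader)
        if check_alert(rec):
            print("ALERT:", rec)
```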

About the storage: we tested a lot of these services. We ended up using S3 — it's good enough, cheap — and we are also using the BizSpark and other startup programs that give us a very good incentive to use their services for free for a long period of time. This is actually something pretty interesting: if you guys are thinking of creating a startup, or if you are a startup, definitely sign up for the Microsoft, Google and Amazon startup programs, because just for Microsoft you guys can get 150k a year to use on their cloud services for free. For Google you can get 300k, and for Amazon

I'm not entirely sure. But when you're a startup you're counting your pennies, so you can't just spend money left and right. And these guys offer some pretty good services, and literally all you need to do is go to a form on a website, sign up, put in the legal information about your startup, and you get these credits on your account. We can also use the normal approaches, normal databases like Cassandra, Elasticsearch, Riak, Lucene, etc. It depends on the need; you can end up using multiple. We store all the raw information in the cloud, in multiple clouds if we need, or we can keep information in a specific country, in a specific cloud. We have

clients that request that — for example, they don't like the United States very much, I don't know why, but... And we end up encrypting all the customer data. So it's geolocated, it's encrypted, and only we can access the information. So, a couple of things that Tiago forgot to explain. For example, for our messaging systems, we have certain requirements: we get millions and millions of events per second. And in terms of database storage, right now we're at billions of records, and soon-ish we'll be reaching a trillion records because of the amount of information we're storing. So we can't just go for something like MySQL or Postgres, because there is an upper limit on that. So

that's why we're trying technologies like Lucene and Riak — they're pretty good, you guys should try them out if you're doing some analysis on that part. Another thing that Tiago was just explaining is that we have certain requirements from the client side regarding privacy and data. Because with us being in Switzerland, you know, privacy is everything for them — they have the banks, they have all these things. So one of the laws we have there is that no Swiss data can leave the country. It can leave the country to be processed, but it always has to be stored in Switzerland. So one of the requirements

for our architecture is that it needs to have geolocated storage. So in case the client that's using the data, or the IP, belongs to a certain country, the information about that address is only stored in that country. And this was a big challenge in terms of our architecture design as well. Okay, so, delivering the information — the gazillion events Tiago was talking about. We can stream it in real time; there are customers that request the real-time part. But we also store it in a way that the clients can access the information visually or processed, for example in Kibana using Elasticsearch, or InfluxDB, or Druid.
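Here is a rough sketch of how that geolocated, encrypted storage requirement could be approached, with boto3 and a Fernet key; the bucket names, region mapping and key handling are all illustrative assumptions:

```python
# Sketch of geolocation-aware, encrypted storage: pick a bucket/region based on
# the country the data is about, and encrypt the payload before it is uploaded.
# Bucket names, region mapping and key handling are illustrative assumptions.
import json
import boto3
from cryptography.fernet import Fernet

# Country -> (region, bucket) routing table; Swiss data stays in an EU bucket.
ROUTING = {
    "CH": ("eu-central-1", "scans-ch"),
    "PT": ("eu-west-1", "scans-eu"),
}
DEFAULT = ("eu-west-1", "scans-default")

def store_record(record: dict, country: str, key: bytes) -> str:
    """Encrypt a scan record and store it in the bucket mapped to its country."""
    region, bucket = ROUTING.get(country, DEFAULT)
    blob = Fernet(key).encrypt(json.dumps(record).encode("utf-8"))
    s3 = boto3.client("s3", region_name=region)
    object_key = f"{country}/{record['ip']}.json.enc"
    s3.put_object(Bucket=bucket, Key=object_key, Body=blob)
    return f"s3://{bucket}/{object_key}"

if __name__ == "__main__":
    fernet_key = Fernet.generate_key()   # in reality this would be a managed customer key
    print(store_record({"ip": "203.0.113.7", "port": 22}, "CH", fernet_key))
```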

So we have options. It depends on the features we want, and if one of them gets surpassed by something better, we just replace it. One of the things we saw over the past two years is that everyone was a big fan of Hadoop for doing the whole MapReduce jobs. However, if you look at the trend now, everyone is switching over to Spark, because for certain specific types of jobs, as Spark runs fully in memory, it's much, much faster than Hadoop. That's just an example of why it's important to have an architecture where you can exchange one component for another. Because there are always going to be new technologies coming along and you need to be able to move your architecture with time,

not just stay stuck with a technology or a language or a platform for a long time. So we can use Spark for that, to process information, or Hadoop if we need to. And we also have Kinesis from Amazon, but we need to connect it with other types of services. There's also the option to use machine learning in the cloud; there's a lot of buzz around that. Azure is pretty good — we are testing it and it will give us some insight into the information that we wouldn't be capable of getting without a very good data scientist. So we basically get a pretty good data scientist running in the cloud and we get information right away. You don't have to build an infrastructure or a Spark

cluster, etc. So, our agents — we already talked about this, but they are stupid, they are easy to maintain, they do simple tasks. But all of them in conjunction can build a space rocket. So we build them very simply. We distribute them around the world in multiple clouds, or on dedicated servers, or in Tiago's home. So we can guarantee that we don't end up with a crash on our platform, and the core services stay maintained — at least maintaining a very good network. That brings us some problems. The monitoring part is pretty hard, so we had to make some choices on that matter; we at least tried some of these ones. There's a screenshot of our platform from moments ago

using Grafana. And it allows us to monitor the messages that are passing through the network and get more accurate information. But there's also Kibana, which I'm going to try to show. We use Kibana not for the monitoring, but to get an idea of which minions are working better or more accurately. All of this requires automation and deployment. So our minions are deployed using scripts — it can be Ansible, SaltStack, etc. And also etcd. Want to explain that? Yeah, sure. Actually, I'll give a little bit of background on this part. So, the minions: essentially, when we spawn minions, we don't really know what they're going to run on. It can be a VPS, it can

be a Raspberry Pi, it can be a mobile phone, or it can be a dedicated server — it can be whatever. So we needed a way that we could customize these minions to get jobs and get them working straight away as soon as the machine comes up. One of the requirements we have for a host is that it comes up with SSH, of course. And the first part in our process is — essentially, who here has not heard of Ansible? So you all know Ansible. All right, so Ansible essentially is a script you create on your local machine and it executes it over SSH on many remote machines. So the first step in our deployment process

is exactly that. We add just the IP address of the new machine that spawned, we put it in our repository of minions, Ansible goes in, installs SaltStack — which is the fourth one over there — and everything else that we need for that minion, and then SaltStack just goes into listen mode. And with SaltStack we can do lots of things. We can update the OS, or we can just tell the minion: "You now go and scan this IP address" or "You now go and screenshot this range of IP addresses" or "You now start processing this data". Essentially Ansible goes in first because it's SSH, then it installs the SaltStack client and everything else, and then SaltStack goes

into listen mode and listens and waits for commands. As for etcd, what we use it for: essentially, when you ask for a job, that job needs to be assigned or split depending on the size of the job. So, for example, if you guys come into the platform and ask, "scan the entire world" — obviously, we're not just going to give one minion that task. We're going to look at etcd. etcd is a key-value-based technology; it has a time to live, and we can see which minions are up. We then split that task based on the number of minions that are up and available, and using etcd, essentially, we can monitor which minions can receive tasks or not.
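A minimal sketch of that etcd usage: minions heartbeat under a TTL key, and a dispatcher splits a large scan across whichever minions are alive. It talks to etcd's v2 HTTP API via requests, and the key layout and address are assumptions:

```python
# Sketch of the etcd-based liveness + task splitting described above. Minions
# register under a TTL key (so dead ones expire); the dispatcher splits a big
# range across the live ones. Key layout and the etcd address are assumptions.
import ipaddress
import math
import requests

ETCD = "http://127.0.0.1:2379/v2/keys"   # assumed etcd (v2 API) endpoint

def register_minion(name: str, ttl: int = 60) -> None:
    """Heartbeat: (re)create /minions/<name> with a TTL so it expires if we die."""
    requests.put(f"{ETCD}/minions/{name}", data={"value": "up", "ttl": ttl})

def live_minions() -> list:
    """List the minions whose TTL keys have not expired yet."""
    resp = requests.get(f"{ETCD}/minions").json()
    return [n["key"].rsplit("/", 1)[-1] for n in resp.get("node", {}).get("nodes", [])]

def split_scan(cidr: str, minions: list) -> dict:
    """Split a 'scan this whole range' request into sub-ranges, round-robin per minion."""
    if not minions:
        return {}
    diff = max(1, math.ceil(math.log2(len(minions))))
    subnets = list(ipaddress.ip_network(cidr).subnets(prefixlen_diff=diff))
    assignment = {m: [] for m in minions}
    for i, net in enumerate(subnets):
        assignment[minions[i % len(minions)]].append(str(net))
    return assignment

if __name__ == "__main__":
    register_minion("minion-1")
    register_minion("minion-2")
    print(split_scan("203.0.113.0/24", live_minions()))
```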

Oh, it's me again. So, machine learning and security. There are four main problems with machine learning and security: modeling large-scale networks, discovery of threats, network dynamics and cyber attacks, and privacy preservation in data mining. Looking at that image, we obviously do modeling of large-scale networks. Essentially, we treat each IP as an entity, and we then add information to that entity. Discovery of threats is not something we're doing right now — there's plenty of good people doing it out there; it's not something we're really focusing on. We do network dynamics and cyber attacks in the sense that every time we scan the internet or a network, whatever it is, we save every

single thing we found about that network, be it ports open, ports closed, whatever, the software that was running, the version that was running. And using our platform, one thing we create is timelines. For each IP address, we have a timeline, and you're able to see, from the first moment we scanned that IP address, which software was running and on which date there were changes. So one day you were running Apache 2, the other day you were running Apache 3, or one day you had port 443 open, the other day you didn't. And this is a bit of us understanding the network dynamics. So in this range, how often does that range change? Do we see many more ports open, or do we not? And it's also pretty cool because if there is an attack on a certain network, we're able to go back in time and at least evaluate, from a perimeter perspective, how that network was exposed.
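A small sketch of that per-IP timeline idea, diffing consecutive dated snapshots to surface the changes; the snapshot format is an illustrative assumption:

```python
# Sketch of a per-IP timeline: keep every dated observation for an IP and diff
# consecutive snapshots to see when ports or software versions changed.
# The snapshot format is an illustrative assumption.
from datetime import date

timeline = {
    "203.0.113.7": [
        (date(2015, 5, 1), {"ports": {22, 80}, "software": {"80": "Apache/2.2"}}),
        (date(2015, 6, 1), {"ports": {22, 80, 443}, "software": {"80": "Apache/2.4"}}),
    ]
}

def diff_timeline(entries):
    """Yield (date, changes) for every change between consecutive snapshots."""
    for (d1, a), (d2, b) in zip(entries, entries[1:]):
        yield d2, {
            "opened": sorted(b["ports"] - a["ports"]),
            "closed": sorted(a["ports"] - b["ports"]),
            "software": {p: (a["software"].get(p), v)
                         for p, v in b["software"].items()
                         if a["software"].get(p) != v},
        }

for when, change in diff_timeline(timeline["203.0.113.7"]):
    print(when, change)
```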

In terms of privacy preservation in data mining, as we mentioned, our architecture is geolocation aware, so we try to do a little bit of privacy preservation in that part. As I mentioned before, we like to scan things. We like to scan lots of things. We started by simply doing the blue bit — we really started at the basics in the beginning. We looked at open ports, we looked at the services, and then

we decided we wanted to do more. And we started looking at all the different things. So for this IP address, who does it belong to? A company. Okay, can I go on LinkedIn? Can I see who works for that company? Can I then see and try to find what the Twitter handles for those people are, and then see if there are personal blogs for those people? And we start building these profiles around IP addresses and around companies, based fully on scanning. And this is what our minions do. Any of our minions can go and collect data from any of these endpoints that you see. So, for example, if we talk about port scanning, as I mentioned, VNC

and RDP, we screenshot it, we OCR it — can we extract some usernames from it, can we extract a server name, can we extract a version of the operating system? And I'll show you guys a little bit of a demo about that. For the web, same thing: screenshot HTTP, screenshot HTTPS. Can we, via the screenshot or via a set of information, understand automatically with machine learning: is it a blog, is it a normal web page, is it a news web page? Has this page been defaced? And we look at different types of information. One of the things Tiago mentioned — if you guys look in the corner, it says malware. It's not an area we're going to touch on, because there's plenty of good people out there doing it

already. So what our architecture allows us to do is to consume that information in a stream format from whatever source we want to buy it from. Machine learning: there are many techniques. You've got artificial neural networks, which essentially are a type of technique used in statistical learning models; they're inspired by biological neural networks as well. And it's pretty cool because you can have a set of inputs passed through hidden layers, and a set of outputs, with nodes that send messages to each other. And you can classify different things using those techniques. You've also got SVMs. These are also used for supervised learning, so you guys can teach a system — you create a system that creates a model, you can teach that model, classify

a couple of things manually, and then the system will try to classify them automatically. And you've got a couple more techniques, like decision trees — which are usually made, as mentioned, for decisions — Bayesian networks, KNN. And each of these techniques we use in different ways, in different parts, because it's not a one-size-fits-all type of thing. So what exactly do we do with machine learning? We do classification, detection, clustering, automation, correlation, prediction, analysis. So, classification: as I mentioned, if we look at port 80, we'll try to classify a website. Is it a blog? Is it a news website?
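As a toy sketch of that classification step, here is a tiny supervised model (an SVM via scikit-learn) trained on made-up page snippets; the real system obviously trains on far richer features and labels:

```python
# Toy sketch of "classify a website automatically" with a supervised SVM via
# scikit-learn. The training snippets and labels are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

pages = [
    "wordpress powered by comments archive posts",
    "breaking news politics sports weather headlines",
    "shopping cart checkout add to basket free shipping",
    "blog about my travels photos comments rss",
    "latest news world economy markets report",
    "buy now discount products order online store",
]
labels = ["blog", "news", "shop", "blog", "news", "shop"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(pages, labels)

print(model.predict(["daily headlines and breaking world news"]))   # -> ['news']
```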

And again, I'm talking automatically — I'm not talking about a person clicking a button saying it's this or that. The other thing, for example, for clustering: one thing we try to do is look at common groups. So if we have this range of IP addresses, and they all have port 80, port 443 and SSH as well — are they all servers? Are they all Linux servers? There are different types of clustering we do on our backend. We try to classify them by the type of machine they are. Are they a server? Are they a home IP address? Are they web servers? Are they FTP servers? We try to do clustering based on the different data points that we gather. We try to do

this automatically, of course — that's why we have machine learning. We try to do correlation as well. Correlation is something we've been working on a little bit; I can give you an example of some of the things we're looking into. So, one of the data points we're starting to consume now is the Bitcoin ledgers, where you can see this IP address started mining this and got this result. The other thing we're also starting to consume: torrents. And we look at all the DHTs and all the people that are participating in that torrent. Why? Because if you see a new torrent showing up, for example, on Pirate Bay, and you see that lots of people downloaded

that torrent, we can monitor that — we can monitor which IP addresses downloaded that torrent. And then we start seeing these IP addresses showing up on the ledgers and mining to different Bitcoin addresses. One of the possibilities that triggers an alarm on that part of our architecture is that that torrent might be infected with a Bitcoin miner that's just running in the background for the people that downloaded it. So that's where we use correlation with machine learning to automatically try and identify those issues.
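A sketch of that correlation rule: intersect the peers seen on a torrent with the IPs later seen mining, and alarm when the overlap is large. The data and threshold are made up for illustration:

```python
# Sketch of the torrent/miner correlation described above: if many of the peers
# seen downloading a torrent later show up mining, raise an alarm that the
# torrent may carry a hidden miner. Data sets and threshold are illustrative.
def miner_overlap(torrent_peers: set, mining_ips: set) -> float:
    """Fraction of a torrent's peers that were later seen mining."""
    if not torrent_peers:
        return 0.0
    return len(torrent_peers & mining_ips) / len(torrent_peers)

torrent_peers = {"198.51.100.1", "198.51.100.2", "198.51.100.3", "203.0.113.9"}
mining_ips = {"198.51.100.1", "198.51.100.2", "198.51.100.3"}

ratio = miner_overlap(torrent_peers, mining_ips)
if ratio > 0.5:                         # arbitrary alert threshold for the sketch
    print(f"ALERT: {ratio:.0%} of peers for this torrent were later seen mining")
```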

Prediction is not really something we're touching right now, but I think in the future it's going to be pretty cool if we can look at a network and predict the way that network is going to be hacked — I think machine learning will have an impact on that part. And obviously we do analysis, because we try to have our own machine learning models, which I'll talk about in a minute. So, one important thing when you're doing machine learning and combining it with data is that you have to do a self-assessment. And for us a self-assessment is exactly this: we have support, which essentially indicates which percentage of the data in storage shows correlation. This means we're not just relying on one thing — for example, if we do a port scan, we don't just rely on that to say what an IP is.

Or, for example, okay, we do a port scan, and from the port scan we have the IP address; we pass it to a second module, the entities module, that we can query — there's a company that sells those types of databases — we query them, and they say this IP belongs to, I don't know, SAPO. Second step, we look at a whois or an NS lookup or something like that, and we extract, for example from the whois, who is the technical person assigned to that IP address. Next step, we look up the name of that person on LinkedIn. Do we see this person works at SAPO? If yes, it increases the support that one data

point gives the other. Do you guys get that? The more support each data point gives the others, the more our confidence in that data increases. This means that, as Tiago mentioned before, we consume data from different data points, and if we start to see that one data point is not supporting the others, there might be a confidence problem with that data point. It means that that data point might be providing us with false data, or that there is some problem there. So these are the two self-measurements that we do on our own data.
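A minimal sketch of that "support" self-assessment: corroboration between data points raises a confidence score. The weights and scoring rule are illustrative assumptions, not the real model:

```python
# Sketch of the support/confidence self-assessment: each data point that
# corroborates another raises our confidence in an entity's profile.
# Weights and the scoring rule are illustrative assumptions.
WEIGHTS = {"portscan": 0.2, "whois": 0.2, "geoip": 0.1, "linkedin": 0.25, "ocr": 0.25}

def confidence(profile: dict) -> float:
    """Sum the weights of data points that agree with at least one other data point."""
    score = 0.0
    for source, claims in profile.items():
        others = {c for s, cs in profile.items() if s != source for c in cs}
        if set(claims) & others:                 # corroborated by another source
            score += WEIGHTS.get(source, 0.1)
    return min(score, 1.0)

profile = {
    "whois": {"org:SAPO", "contact:jdoe"},
    "linkedin": {"contact:jdoe", "org:SAPO"},
    "portscan": {"port:22", "port:80"},
    "ocr": {"contact:jdoe"},
}
print(confidence(profile))   # whois, linkedin and ocr corroborate each other
```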

As I mentioned before, we go through various stages when the data comes into our architecture. We can start by doing a port scan, and the port scan has a certain weight, a certain confidence level that we have in that data point. And then we pass it over to geolocation. And then we pass it over to screenshotting, because you can have that as well. You can have a situation where you scan an IP address, there's a blog, and inside that blog you can extract usernames. And then you do an OCR on the RDP of that server, and you see that the username you got on the RDP is similar to or matches the one you have on the blog. And therefore, again, that increases the confidence level in that data point. And we go through these different rounds until there's a decision to be made. Either it keeps

going, or it passes over to manual classification. And that's where we have someone literally manually classifying that data and teaching it to the automated model, so you increase the quality of the model and you still get the classification anyway. This decision is based on the current level of confidence that we have. So if we reach phase 4 and our confidence in the results we're getting is about 40%, we're not going to proceed and try to continue down that path; we will pass it over to manual classification. One thing that's important: there are two techniques you can use to improve the way your machine learning works. One of them is Kalman filtering, where essentially you don't just take the information you got — it's measured over time, and

you produce a set of variables with a certain percentage chance that they might be correct. The other one is the adaptive filter, which essentially looks back at the data you've previously stored. So in our case, as I mentioned, we archive everything; we can look at the data we had before from that IP address and also use that as a data point to support the information we're getting right now. So this is our data chain: we collect the data, we process it, we execute machine learning on it. If the confidence level is not high enough, we re-inject it again to be processed. After we reach the confidence level that we're interested in, we create a report

and we store this data. It's very important for us to create a report in the sense that it's great to have the data, but it needs to be accessible in some way. A report can be a PDF, can be a dashboard that's automatically generated, can be a set of alarms — for us it doesn't matter, it's just a step in the data chain.
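A rough sketch of that data-chain loop — enrich, check confidence, re-inject or hand off to manual classification; the stage names and threshold are illustrative assumptions:

```python
# Sketch of the data-chain loop: run enrichment stages, check confidence, and
# either report or re-inject; stage names and the threshold are assumptions.
THRESHOLD = 0.8
MAX_ROUNDS = 4

def geoip_stage(record):
    record.setdefault("country", "CH")               # pretend enrichment
    record["confidence"] = record.get("confidence", 0.0) + 0.2
    return record

def ocr_stage(record):
    record["confidence"] = record.get("confidence", 0.0) + 0.15
    return record

def run_chain(record, stages):
    """Push a record through the stages until confident, else flag it for manual review."""
    for _ in range(MAX_ROUNDS):
        for stage in stages:
            record = stage(record)
        if record["confidence"] >= THRESHOLD:
            record["action"] = "report_and_store"
            return record
    record["action"] = "manual_classification"
    return record

print(run_chain({"ip": "203.0.113.7", "confidence": 0.3}, [geoip_stage, ocr_stage]))
```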

This is essentially the way it works for us in terms of data. We observe things, we run them through a set of orientations and steps, we then decide, with our models — our machine learning models — what we want to learn, what we want to do with it. There might be a situation where we need to give feedback observations, which we inject into the same process, or we just take an action with it: it can be stored, or we create a report. This is the same thing, essentially, explained another way: classification transforms knowledge into information, which we then assess into hypotheses, create resolutions and directives, and act again on the facts. We consume real-world data with our agents, we then funnel the telemetry and sensor data into different channels, we inject it as features into machine learning, and we then create different models. So what this means is: even though we have different functionalities on our minions, they also have their basis in certain models. So one minion will be a classifier, one minion will be a recommender, one minion is

just looking at Twitter and monitoring some things — it will be a social minion, or a social model in this case. Essentially, each minion gets assigned a certain type of model. And now I'll pass over to Tiago to show you guys some demos. Okay, so, scanning time again.

So I've already started a job to scan every port 22 in the world, some minutes ago — it's in a slow mode. So basically these are some minions on our demo platform that are telling us their status. If you know port-scanning software, this is basically identical to the status output of masscan, for example. So, this is the status. What about the real-time data? One of the important things for you guys to see here: it's also important for you to understand where the data is coming from, and that's why we have the provider field. It's essentially telling us which minion it's coming from. This is the demo platform, so the minions are not named correctly, but in the production platform

we will know which continent it came from, which hosting provider it came from, and what the number of the minion on that provider is as well. So this is the raw data — at least, this is the big view. We can see the open ports, we can see the service banners in real time. So this is basically the output of one of the channels that the agents listen to. So one of the things I explained to you guys is that you collect the data, you store it, and you re-inject it if needed. And this is the perfect example of it. If you guys look here, we have some IP addresses. So we've got this IP address, for example. We know it has port 22 open, but

we don't know what's running on it. This data will then be re-injected and re-scanned in a second round for service identification. You guys can look, for example — this data, however, has already been enriched by another minion. So we know it has port 22, it has been re-scanned, and we know it's SSH that's running on it, and the version of SSH. So essentially, data comes in, it goes through multiple steps, and then, when we consider a certain level of accuracy or completion final, it gets stored and inserted into a report.
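As a sketch of that second-round service identification, here is a minimal banner grab; a real scanner also handles protocols that expect the client to speak first, while this only covers services like SSH that send a banner immediately:

```python
# Minimal sketch of a second-round "service identification" step: connect to a
# host/port found open in the first pass and read whatever banner it offers.
import socket

def grab_banner(ip: str, port: int, timeout: float = 3.0) -> str:
    """Return the first line the service sends, e.g. 'SSH-2.0-OpenSSH_6.6'."""
    with socket.create_connection((ip, port), timeout=timeout) as s:
        s.settimeout(timeout)
        try:
            return s.recv(256).decode("utf-8", "replace").strip()
        except socket.timeout:
            return ""

if __name__ == "__main__":
    print(grab_banner("203.0.113.7", 22))   # hypothetical IP from the first-pass scan
```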

So Grafana gives us a view of the... So this was our previous test. You can see exactly what's happening. It's messages every 30 seconds with the status, and basically there are a lot of listeners. We identify which of these services are being identified. So we know that the ports are open, but we can also check which ones returned banners or are correctly identified by our agents. We also show you... Where's the box? Just a sec... So Kibana also gives you a real-time view of what's happening. So if I request the visualization, it gives me a view — at least until this moment the scan is at 10% or something — so you can see the distribution of open port 22 around the world,

the number of ports found, the services and the products found, and the events per second — so how many ports are being found. So this is basically the end part of our architecture, the data analytics part, and this is basically the at-scale part, where you have a view of what's happening on the platform. So you can see here there are six million messages to process; basically there are three minions processing it. We can launch more machines and process it faster and scale the platform. This is just one of the channels. We can show you the status one. So there's a channel no one is listening to, so there are a lot of messages. They are persistent, so if we turn

the XPTO agents back on, they'll start being dispatched. So, as many of you know, the port scanning part gets you a lot of abuse reports. Some of them are legitimate, but others, like this one, aren't — because this is their report: they don't say anything, it's just an automatic report. We contacted them and basically said, "You are reporting us, and we don't know why — you are posting complaints to us that we are abusing your network, and we don't know why." We can also show you this one, saying that we are running botnets. Why? Because we scanned a machine that has a sinkhole configured. This is the type of thing we have to handle in order to scan the entire

world. And this needs to be managed — it takes time and a lot of bonding with our hosting providers. Can I just have the screen for a second? Sure. So as I mentioned to you guys, I was not supposed to do a demo of this, so I'm sure it's going to go wrong. But hey, you only live once. Yeah, that's how it works. Okay, so. As I mentioned before, one of the things we do is we screenshot all the things and then run them through a minion that does OCR. I'm just going to show you an example of why we do that. So, if I run... So, one important thing before I show you this part is

that we have a set of keywords that we've built of what we consider interesting — things like "scada", "smart houses", "houses", "motor", "satellites". Satellites are pretty cool. And as we screenshot and OCR things, it gets stored, the minion does the OCR, and we can extract things like this: "intelligent house". As soon as this IP gets scanned and screenshotted and run through OCR — can you guys see it here? — essentially, it would automatically warn us that we have found something that was considered interesting to us. Another option as well is on this... And I can just show you guys the image as well. Essentially, this was the image that we got back. It's a VNC that's open worldwide and we took a screenshot of it

and that was it. And the same option, if I show you guys here another one, it's this one. So for RDP we screenshotted this thing. So we screenshotted that and what we got back from the OCR, as you guys can see, there's a bit here that's a bit interesting and also the version of Windows. So it's not perfect of course, but it's something we're working on and improving. But if we do this, and you guys can see it automatically extracts the user and the version of Windows. So it didn't extract perfectly, because if you guys look at Windows Server, it says it's Windows Server Zoo instead of 2008. But it's still pretty interesting information that has a weight on our classifier. It's not

perfect, but it's something we're building on. With that said, any questions?