
Hello everyone. I hope everyone has had a good lunch and that you're not too sleepy, because after lunch I tend to get my power nap in. So, let's see. For those of you that don't know who BinaryEdge is... actually, let me get a quick measure: if you've heard of BinaryEdge, can you put your hand up real quick? That's not too bad. If you've seen a talk done by us, it will probably have been on internet scanning, because that's what we do. We scan the internet around 250 times per month on different ports. We consume that data, we sell that data, we do stuff with that data. But this year we actually decided we wanted to talk
about something new that we've been doing. We understood that by just scanning the internet, or talking to the internet, we were actually having a bit of a gap, which was: who else is doing it, what are they doing, and what can we learn from it? So we decided to give this talk about a new system we set up that involves having honeypots, and I wanted to talk a little bit about some of the things we've learned from the data and whether or not it's worth it for you to have a honeypot these days. If I can ask one more question, which is important for this talk: how many of you work in a monitoring position, like a
SOC or a NOC? Can you just put your hands up? OK, cool. So this is our agenda for today. The things that we do, we do at scale. That means having honeypots spread all over the internet, not just one or two; it means they need to be replicable fast and do a bunch of other interesting things. We're going to talk about some of the data and some of the things that we found in terms of patterns and anomalies, and also what the benefits are of having honeypots these days.

So when thinking about honeypots, we decided that we wanted to understand what other people were doing. We thought, OK, so what are the honeypot
requirements for us? And we ended up gathering this list. It needs to be portable: run on x86, x64 or ARM, because we might want to put it on a server, or we might want to put it on a Raspberry Pi that sits at home, because there is a difference between listening on an enterprise IP address and listening on a residential IP address. So we needed to make it widely compatible. We need to have dynamic configuration deployment, and what this means is we need to edit one file, commit that file, and essentially change the configuration globally on all of the honeypots. That means, for example, if on port 80 we want to start giving a
specific response on some specific honeypots, I do not want to connect to all the honeypots. I need to edit one file, and it deploys to all the honeypots in one go. They need to be easy to deploy, because for scanning we deploy around 200 to 250 scanning agents every month and we rotate them all, and we wanted to be able to do the same thing for honeypots. A new data center pops up, we need to be able to deploy in there within seconds. Real time: if you've used the BinaryEdge platform, or you know about us, you know that all of our platform works in real time. Everything
is an event, which is why it also needs to be event-based, and everything works in real time. If someone connects to one of our honeypots, we want to see that event within milliseconds on our central platform as well. We needed a tagging system, because it's not just about receiving traffic, right? You need to understand what that traffic means, and we wanted to be able to give specific tags to the events. We're going to go through what all these things mean. We also needed anomaly detection, and again, we'll go through what that means in a little bit. So what we landed on was a custom-written solution in Python 3, and for the
tagging and the configurations we have these files, and if you look at them, they're actually fairly simple configuration files. So, for example, if on a port we receive the string EHLO, which is typically for SMTP and mail stuff, we want to be able to reply with a 220 banner that looks like ESMTP Postfix or Exim. Why is this useful? Because if a vulnerability pops up for a new piece of software, we can emulate that software super fast: we just need to deploy a response, and all of our honeypots immediately start answering correctly to the payloads that are being sent.

Again, we deploy our honeypots worldwide. Every time a new data center pops up, AWS,
GCP, every time we have a friend that says, "Hey, I have an IPv4 address and I don't mind hosting one of your Raspberry Pis at my home," we have scripts that automate the entire thing, so that we can deploy all of our honeypots worldwide. This is our current data center deployment; it's fairly widespread, and we're adding more and more every month. As I mentioned, every time a new data center pops up we just run a script that deploys to those regions. For example, these were the last ones: Microsoft Azure opened in Switzerland, AWS in Spain, and then Azure opened again in Norway. For us it was literally just typing one command on the command line, and we get a
honeypot spun up and feeding our main platform instantly. So I'll pass over to Florentino, who is going to talk a little bit about architecture and some decisions we made there.

Now, for the architecture, we try to keep things simple but also fairly dynamic and scalable. We have all these machines that we call sensors. They are really lightweight machines, and the only thing they do is receive those events, those scanning attempts, and then send that data to our data processing infrastructure. Kafka is our main repository of events, and then we use a Storm cluster to process the events. The processing we do here is the tagging that Tiago mentioned, and
we also enrich the data with ASN, rDNS and some other stuff that we find necessary. Then we import this data into our database, and if you were to go to our website right now, our free BinaryEdge portal, you could see the sensor data, the sinkhole data, that we have been receiving. And that's pretty much it.

So when you look at an event, this is pretty much what it looks like, and I'm going to go from top to bottom. The client ID is just an internal ID that we use for the sinkhole. The timestamp: every single event that we create or receive has a timestamp associated with it. The IP of origin is
the machine that tried to connect to our honeypot. The type of the event, again, is something that we use internally. The ASN is the one the IP belongs to. And then there's the data part: we generate a hash for every single payload, we actually keep a copy of the entire payload, and we generate extra metadata for some payloads; we'll go over what this extra metadata allows you to do. Then we have the associated tags, so we can see that this guy was doing an SSL scan of some sort, and we have the target, the honeypot that he targeted on our side.

As of today we have 266 tags, and this is super useful to try and
really understand what an IP address is doing, because we need to understand that not everyone on the internet is malicious and not everyone on the internet is benign. Maybe some of you freak out a little bit when you see BinaryEdge hitting your IP addresses; other people love it, because they get to log into our platform and understand their exposure without having to pay a lot of money. So we try to classify and add as many tags as possible to our events. This is done by having configurations deployed in these different files.

Also, let me give just a quick explanation. Tiago said we tried to have a simple way of configuring things, so these files, these
JSON files, follow a simple structure like Tiago showed before, and we can tag on pretty much everything we want. We tag based on ASN, on HTTP requests, on fingerprints such as JA3 or HASSH, on rDNS, and then we have general tags, which basically means tagging on the payload itself. For instance, this is one of the tags that we have for an SMB scanner: we look at the payload itself and we look for certain markers, where we know that if the payload contains them it is probably an SMB scan. Like this one, we have lots of others, and basically this is our own DSL, our
domain-specific language, to tag all these events, and it's as flexible as we want. If we want to add new functions to make it more complex, we can, because it's custom made.

One of the important things is that BinaryEdge has two versions of the platform. We have the Enterprise version, which is for enterprise clients: we sign contracts, we do the entire shebang. And we have the free version, which we treat as a community. You can come in and use it for free; if you want some extra credits, put in your credit card, but you can also contribute and get credits for that. We actually have a person that contributed 92 new tags in the space of
about a month and a half, and we actually ended up hiring him. He's sitting in the front row and is joining BinaryEdge in January: Phillipe. We have other people in the community that do this as well, because they are trying to use the sensor data for themselves and they end up enriching our own data with their knowledge. So we had to make something that was super simple to write, instead of having them write actual proper code.

But you might be asking why, and if you've worked in a SOC you might have an idea of why it's useful to have this stuff. Internet scanning has come a long way. I'm sure that if I asked whether you know Shodan, 90% of the audience would raise their hands, right? But these days it's not just Shodan anymore. There's us; there's Censys, who do their own scanning; Robert Graham, as an individual, created masscan and does his own scanning as well; there's [unclear], Rapid7. And of course this generates a lot of noise. If you're listening on your perimeter, if you're monitoring the firewall, it really makes a lot of noise. To prove my point, I'm actually going to connect to our real-time... am I going to connect? Maybe. Oh yeah. Just a second; you know, new laptop, and things fail. So this is actually our sinkhole stream in real time. This is the stuff that's hitting our honeypots, and as you
can see, trying to identify all of these things is a pain. If you are an ISP, or if you have a lot of assets exposed to the internet, you're getting just as much traffic as we are, and if you're monitoring an IDS or a firewall you're going to see lots and lots of these. So it's super useful to have something that enriches the events and tells you what those things are. Why? Because it allows you to prioritize. And going back to our presentation, if we can find it... Google Sheets, never again. It really, really disappeared. Jesus. Let's try this again. So, for the hundredth time... there we go. So if you work for a SOC or NOC, this should
be familiar to you: an IDS throwing a bunch of warnings. But what if we could have an enriched version of that? That means you could reduce your SOC exhaustion: you don't need to focus on 90% of the events. Why? Because these will typically be general scanners that are scanning the entire internet and are not focusing on you. You should actually enrich your events with BinaryEdge or GreyNoise, for example; maybe you've heard of them. I'm technology agnostic, even though I own one of the companies, so use whatever you want. But essentially it reduces the exhaustion of the people working at the NOC or at the SOC, because they only need to focus on
things that are specific to them, instead of looking at things that are hitting the entire world. Of course, this also allows us to detect funny things. We found these numbers to be quite high: there's a Russian ASN, and you can see on top as well that Russia is the top country scanning. And what are they scanning for? RDP. This gives you a bunch of ideas about a couple of different things, and we'll go over those as well. But we also detected, down there, /v1.16/version on the HTTP paths. We actually explored what this was and we made a bunch of changes, and we'll go deeper on
this. Of course, we also find fresh malware on the honeypots. And you can also monitor your own IP addresses: if you see one of the IP addresses that you own talking to one of our honeypots in the platform, you have some issues. Unless you run a scraper or you do your own internet-wide scanning, you're probably infected with something, or you've probably been pwned, and it's something you should keep an eye on. I'll just pass over to Florentino, who is going to talk a little bit about some of the problems we ran into when building our own custom honeypots.

Now, for the case of UDP it's fairly simple: you can just open a socket
and listen, because there is no connection, so there's no complication there. Basically, what we do here is create a socket, and then we use some firewall configuration to redirect everything to the port where we have the socket open. You could also create a raw socket and just listen to everything, but there's too much noise there, and you definitely want to filter out some ports. Now, for the case of TCP it's much more complex, because you have to have something bound to the port if you want to receive stuff. So we needed to create one version that built the connection, the full TCP handshake, and another version that used
the raw socket to catch all the TCP SYNs. This is needed because nowadays port scanning is mostly based on SYN scans, which don't finish the connection, so you need to catch both things. But then you also need to catch the SSL connections, so you need to complete not only the TCP connection but also the SSL handshake to catch whatever is beneath all that. So it's fairly complex, and you need several sockets open. Again, we did this using some firewall rules as well: we open one port and redirect everything to it.

One thing that comes to mind many times is: why not just use something like tcpdump or
Wireshark? The thing is, in itself it's not enough. It's useless on its own, because you do need to have something underneath that completes the connection. That's why we thought of building this all custom made, and the good part is that, because of this, we can manipulate whatever we want. When you had the question, "Can we respond to these requests any way we want?", it was easy because it was our code: we just needed to insert some packets in the middle, and that's it. Another problem, probably the biggest problem we have so far, is the volume of stuff. When those Russian guys start scanning for RDP, like once a
month, it's really the worst part for our infrastructure, because the density of data that comes in per second is really high, so we need to have a really big cluster to handle all of that. And it's... funny, let's just say this.

So now that we have the infrastructure in place: I mentioned malware, for example, and with malware we can do lots of different things. We have importers that can enrich data with things like VirusTotal and Joe Sandbox Cloud, or even from BinaryEdge itself. So, for example, if we want to see whether the people scanning for RDP had the RDP port open themselves, we can enrich our honeypot data with our scanning data as well. We started finding
a couple of interesting things. For example, we get things like this that get tagged as malign: you see that there's a wget in there with a URL, and then you check and it's just malware in there, of course. As I mentioned before, we have metadata that we generate, and we generate two types of metadata right now, JA3 and HASSH, and you might see them here. What are these two things, and why are they useful? JA3 is essentially a signature of the SSL connection that the clients connecting to our honeypots are using, and it generates fingerprints. This is not new; this was invented by the team at
Salesforce, and we add it to every single SSL connection that we receive on the honeypot side. You can see the explanation here: they generate the JA3 SSL string, which is the version, ciphers, extensions, elliptic curves and elliptic curve point formats, and then the digest is the MD5 of all these things concatenated. HASSH essentially uses the unencrypted part of an SSH connection, concatenates all the algorithms and then generates the MD5, which is the HASSH signature. This allows us to do two types of things. Number one, actor clustering: finding IP addresses that belong to the same actor, because it generates a signature for the tool that the actor is using. It also allows us to find common tooling, so
how many people are using the same tool to scan for SSL, how many people are using the same tool to scan for SSH. And if you have an actor that tries to hide from our honeypots and our tagging system by changing IP address, they also need to change the signatures of their tooling, because we catch them that way. With this we can generate an image like that, and you can see that there are a bunch of clusters. By combining it with the ASN information in our tagging system, we can enrich the image this way. So we can see that there's a cluster for BinaryEdge, there's a cluster for [unclear], and one cluster for
Shodan and Censys, because they use similar tooling, and this is specifically for SSL scanning. There's a large cluster of IPs from Iran and China that are using similar tooling as well. This is interesting knowledge, because you can really start to identify when they're trying to hide: if an actor all of a sudden decides to drop the reverse DNS, or rotates all of their IP addresses, we can still catch them on the signature of the tooling. Same thing for HASSH, but what this tells us is that there is a lot less tooling being used to scan for SSH, because you see a lot fewer clusters than you saw with TLS. And then
you start seeing weird things. On our honeypots we started seeing IPs from the Accenture ASNs starting to scan, or do TCP SYNs, on our honeypots. Does anyone have a clue why this happened? Essentially, someone was spoofing the IPs from Accenture. You see this typically when someone is trying to throw the blame on some other provider. And how do we know that this is spoofing? Because spoofing has a very typical signature: you only see TCP SYNs, because they never complete the handshake, and you see very distinct, very small spikes, because they rotate while trying to emulate whoever they're trying to spoof. We also saw the same thing happen to a bank, [unclear] Bank;
they had the same issue. And then you start seeing even weirder stuff. We started getting a bunch of SSH connections, and if you look at the rDNS, it's navy.gov.us, and we were like, OK, that's really weird, because in this case it wasn't just TCP SYNs: the connection was actually happening, and we were getting like two packets. So what was happening there? What happened is that this IP address actually belongs to a provider called Frantic, which doesn't verify the reverse DNS on their end, and someone just set up the reverse DNS to pretend to be navy.gov; the provider never did a confirmation on that.

Same thing with HTTP paths. We monitor HTTP paths, and we
started seeing a very large spike on /v1.16/version. What was happening there? We realized that we were not emulating the correct response, because we would only get the initial HTTP request. So we took a look, we noticed that it was a request for the Docker API, and all Florentino did was literally commit this to a repo, and then we started seeing the entire conversation, actually dropping malware. They have a curl up there, and essentially it deploys a container that starts to mine Monero. The cool thing is that, because of the way our system is developed, when we detect certain strings, all of these automatically got tagged as malign and malicious. We didn't have to do any
changes for that. And this is just some of the stuff that was inside the malware that was deployed, and you can see that it's essentially a Monero miner. [Music]

You also start seeing things happening in the real world reflected on your sensors. The Dutch police took down a fairly large DDoS botnet provider, and we saw a reflection of this on our sensors as well. If you look at the date here, October the 2nd: because we knew the IP range, since we were working with some people that were helping on this, we immediately saw in our sensor data a drop in events coming from them. So that's how you see that, you know,
what's happening in the news is actually effective.

Who knows what this is? Any ideas? CVE-2019-0708, right? BlueKeep. Publish date: 14 May 2019. And a couple of days later we started seeing that lovely Russian ASN doing something very, very interesting. We call it an inventory scan. They were not exploiting BlueKeep, but they are checking every single IP on the internet, on every single port, to see if it's exposing RDP or not. At some point they might trigger exploitation, but they're essentially looking at all 65,535 ports of every single IP on the internet to see if it's exposing RDP or not, constantly, and this generates a very, very
large volume of data. You can see that we went from having zero to huge amounts, and this is still going on, by the way. If you can ban this ASN on your edge, do it; you'll be doing yourself a favor.

You also find things that are really cool. There was a vulnerability that came out for Confluence on the 12th of April, and two days after we started catching an IP address exploiting it. This is actually really cool, because it's super interesting to measure, from when a vulnerability gets announced, how soon we start to see it getting exploited. And of course, because we get to see the full payload, lots of times we
can see what sort of malware is being deployed, what type of exploitation they're actually doing, all of that.

So what are some of the things that we want to work on in the future? To fit the BinaryEdge use case: what we do is, if you give us an organization name, we need to figure out the IP addresses and domains that belong to that company, and then we need to figure out what they're exposing or what they're involved in as well. So we have a system in our sensors, in our honeypots, that consumes all of these events and looks for anomalies. We call it snuggle-butt, and you can see here that it looks for different things. It looks for spikes in
IP addresses; it looks for, as you can see here, spikes in countries, in ports that are being targeted, in certain payloads, in certain HTTP paths, because that can give us an idea of: is there something new being exploited, yes or no? Who is exploiting it? Are we seeing masscan looking for something new that we haven't seen before? Is this a new actor, yes or no? We have an entire automated system that gives us these anomalies, but they are still very raw events. What we're working on right now is essentially turning the anomalies into stories. A story is essentially the beginning of an anomaly, the end of an anomaly, all the tags that enrich and tell the story of that event, and
the IPs that have participated in that event. That's super useful, because if you find one of your IPs talking to our honeypots, you're immediately going to be able to search for when it was scanning the honeypots and what it was scanning them for. Essentially, instead of you having to go through the raw data, we do the enrichment for you, and you don't have all that extra work. We're also working on something we call the internet scanning ethos. The reason for this is that we're starting to see a lot more malign and non-benign actors showing up (and these are two different things), and what we've decided is that from now on we're setting a
bunch of rules for ourselves and for our peers that scan the internet. If you're scanning the internet, you need to have reverse DNS set up on all your scanning IPs, you need to send non-malicious payloads, be responsive via some type of communication method, respect a blacklist and data removal requests, and actually do throttling if people ask you to. We do all this, because we don't want to break the internet; we're not here to hack organizations, we're not here to do anything like that, and we expect the same type of behavior from our peers in the industry. What happens is, if our sensors catch an actor that is failing at one of these, we will immediately tag
you as malign. You might ask why we do that, and the reason is very simple: we're going to start offering blacklists for free to every organization that wants them. If you want to load a list of malign IP addresses at your edge, we're going to give you that list for free. Load it up in your firewall, load it up in your IDS, and start blocking those guys, because they're not behaving, so you need to put a stop to that. We're hoping more organizations start consuming these blacklists and start blocking people that don't know how to behave and that are not helping you secure your organizations, because that's the objective of our internet scanning. If
you want to try and play with our data: app.binaryedge.io, promo code BSides Lisbon 2019. We'll give you six months for free. Not actually free: you'll have to pay one euro, but that's because the card provider doesn't allow us to do it for free, because they need their own money. But it will be a one-euro nominal fee, and you can play with all of the data. This will give you full access: scanning data, data leaks, domains, sensors, all of it. The only thing that we ask back is feedback. If there's something you think we can improve, if there's a tag that you can provide us, we appreciate it, because it helps everyone else. Any questions?
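
The anomaly detection described in the talk (spikes in IP addresses, countries, targeted ports, payloads and HTTP paths) can be illustrated with a minimal sketch. This is not BinaryEdge's system: the function name `find_spikes`, the z-score approach, the thresholds and the sample counts below are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def find_spikes(counts_by_key, threshold=3.0, min_events=10):
    """Return keys whose latest per-interval count is a spike.

    counts_by_key maps a key (an IP, a port, an HTTP path...) to a list of
    event counts per time interval, oldest first. A key is flagged when its
    latest count is more than `threshold` standard deviations above the mean
    of its earlier counts. Both thresholds are illustrative assumptions.
    """
    spikes = []
    for key, counts in counts_by_key.items():
        if len(counts) < 3:
            continue  # not enough history to compare against
        history, latest = counts[:-1], counts[-1]
        if latest < min_events:
            continue  # ignore tiny absolute volumes
        mu = mean(history)
        sigma = pstdev(history) or 1.0  # avoid division by zero on flat history
        if (latest - mu) / sigma > threshold:
            spikes.append(key)
    return spikes


# Hypothetical hourly counts per key.
sample = {
    "port:3389": [2, 3, 2, 4, 250],           # sudden RDP scanning burst
    "port:22": [100, 98, 102, 101, 99],       # steady background noise
    "path:/v1.16/version": [0, 0, 0, 0, 40],  # a path never seen before
}
print(find_spikes(sample))
```

A real system would also need to track when a spike starts and ends, which is what the "stories" mentioned above would add on top of raw anomalies.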
Thank you, Tiago. We have...

Hello, good afternoon. I have a question regarding the blacklists: if someone gets on the blacklist, how can they remove themselves from it?

Sure. One of the bigger problems with blacklists these days is that they have no time limits: you land on one and you're going to sit there for a month, and that's if it's a nice blacklist provider. With us, you jump on our Slack channel or talk to us on Twitter and we will remove you from the blacklist, assuming that you're following the ethos, of course.

OK, thank you.

Anyone else? Yes.

Sorry, so what platform did you build the honeypot on?

We wrote something fully
custom, done in Python 3.

I think I've seen it. No, it's really something [unclear].
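
To make the config-driven idea from the talk concrete (one JSON file that says "if this string arrives on this port, reply with this and apply these tags"), here is a minimal sketch in Python 3. The rule format, field names, tag names and the banner host are hypothetical; the real BinaryEdge configuration files and DSL are not shown in this transcript. The network side (sockets, firewall redirects, TLS handshakes) is deliberately left out.

```python
import json

# Hypothetical rule file. Field names, tags and the banner are assumptions
# for illustration only, not BinaryEdge's actual schema.
RULES_JSON = r"""
[
  {"port": 25, "contains": "EHLO",
   "reply": "220 mail.example.test ESMTP Postfix\r\n",
   "tags": ["SMTP_SCANNER"]},
  {"port": 2375, "contains": "/v1.16/version",
   "reply": "",
   "tags": ["DOCKER_API_PROBE"]}
]
"""


def load_rules(text):
    """Parse the JSON rule list."""
    return json.loads(text)


def handle_payload(rules, port, payload):
    """Return (reply_bytes, tags) for the first rule matching port and payload."""
    text = payload.decode("latin-1")  # latin-1 never fails on arbitrary bytes
    for rule in rules:
        if rule["port"] == port and rule["contains"] in text:
            return rule["reply"].encode("latin-1"), list(rule["tags"])
    return b"", []


rules = load_rules(RULES_JSON)
reply, tags = handle_payload(rules, 25, b"EHLO scanner.example\r\n")
print(reply, tags)
```

Because the behavior lives in data rather than code, emulating a newly targeted service would only require committing a new rule, which is the property the speakers describe.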
Hi. Do you have any experience regarding the blacklisting and blocking at the ISP level?

We find that it's a marketing thing, if I'm honest. We used to be super in the dark; no one knew us like four years ago, and we would get blacklisted a lot. It was hard. But now that our name is more out there, you know, we appear on TechCrunch because we had the big PDL leak, for example, last week, revealing it and helping the FBI with that. So we find that the more our name gets out there, the less we get people asking us to block them, and they're actually starting to work with us. So it's unfortunately
very much an image and marketing thing still. You talk to the ISPs, you show them what you have, they start using your platform, and they don't ask you to block them.

Anyone else?

Just another question: since the data gets so large so fast, how long are you keeping the data for?

We archive every single thing that we scan or listen to. We keep everything available via the API for six months, and everything on the portal, via the UI, for one month.

OK, thank you.

Anyone else? Oh, there's one over there. You want to shout? It's a sprint between here and there. Thank you.

So which kind of database are you using for storing the data,
because if you are storing it for six months, that's a lot of data?

Long-term storage goes into S3 after it gets out of the six-month range. The six-month storage, I believe, is on a Cassandra cluster, and our normal storage would be sitting on Elasticsearch.

OK, thank you.

Thanks. And the reason for that is that it's the stack we were familiar with on our scanning side, so we just tried to stick with tools that we're familiar with and that we like. We had the entire system working in less than a month, from development to having something functional. A month and a half, maybe, something like that.

Anyone else? Well, if you have any questions you can always come up and
chat with us; I'm going to be here for a little bit. And thank you very much, BSides. [Applause]
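
For readers who want to see the JA3 and HASSH fingerprints from the talk in concrete form, here is a minimal sketch. It follows the construction described by the speaker (fields concatenated, then MD5), using the separators from the public Salesforce descriptions: values inside a field joined with "-" and fields joined with "," for JA3, and comma-joined algorithm lists separated by ";" for HASSH. The function names and sample values are assumptions, and details such as filtering GREASE values or parsing the handshake off the wire are omitted.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string: version,ciphers,extensions,curves,point_formats."""
    fields = [str(version)]
    for values in (ciphers, extensions, curves, point_formats):
        fields.append("-".join(str(v) for v in values))
    return ",".join(fields)


def ja3_digest(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string; this is the fingerprint itself."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()


def hassh_digest(kex, encryption, mac, compression):
    """MD5 hex digest over the client's advertised SSH algorithm lists."""
    text = ";".join(",".join(algos) for algos in (kex, encryption, mac, compression))
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Illustrative values only, not taken from any real capture in the talk.
print(ja3_string(769, [47, 53], [0, 10], [23], [0]))
print(ja3_digest(769, [47, 53], [0, 10], [23], [0]))
print(hassh_digest(["curve25519-sha256"], ["aes128-ctr"], ["hmac-sha2-256"], ["none"]))
```

Two clients built from the same library and configuration produce the same digest regardless of source IP, which is why the fingerprints are useful for the actor and tooling clustering described in the talk.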