
All right, hi folks. My name is Ariana, I am also from Censys, and I'm here to talk about some ongoing work on lessons we learned when we scanned the internet about every 45 minutes.

Before I start with that, just a quick primer on the internet for folks who might not have as much of a networking background. "The internet" is in quotes because it's made up, obviously. The internet is made up of these things called hosts. You can think of them as network devices: your laptop, servers on the internet, IoT devices. When I think of a host, I think of a device that has an IP address,
so where we find it on the internet, as well as what port and protocol it is relaying information over. Again, to simplify this for the folks who might not have as much networking background, you can think of port and protocol combinations as languages and dialects. At a really high level, it's how these devices speak to each other on the internet and how they display content. There are a lot of these different port-protocol combinations on the internet. For example, you can have a device or a host that speaks 80 HTTP, which is pretty common; those are like your HTML pages. You can also see some that speak 80
HTTPS, which is weirder. It's stranger, but it exists; it happens on the internet because the internet's a wild place.

One of the tools that is useful for understanding the internet is this thing called internet-wide scanning. It's pretty self-explanatory: you take a program and you scan all the hosts on the internet, or maybe a subset of them, and you basically try to speak to them over their specific port-protocol combination and say, "Hey, what data are you willing to publicly tell me?" We're not breaking in, we're not hacking; this is all publicly available information. Again, just to up-level it into an analogy, imagine if Santa
is just going around the world every 12 hours, knocking on every host's door and saying, "Hey, are you home? Also, what do you speak? Okay, thanks, bye." That's essentially internet-wide scanning.

This is a super useful tool for security research, because you can look at hosts on your own network and see what ports and protocols are open, or you can look at the spread of CVEs over the entire internet. The nice thing for me, and for everyone in this room, is that you don't need to run your own server farm in order to get this information. In fact, there are now a number of internet-wide scanning engines, Censys being one of them,
that do the scanning and then make that data accessible to others.

But good research requires really good data, and at Censys we are always thinking about how we can get more accurate internet-wide scanning data. A little while ago we noticed this really strange anecdotal behavior where we would be scanning these hosts and they would be responsive, then all of a sudden they'd disappear, and then they'd respond again after two, four, six, or twelve hours. That flapping behavior seemed a little strange to us. Like I said, we're always thinking about how we can make our data better, because we want to enable better security research for ourselves and also for everyone in the community. And so we
didn't really understand what was going on, so we said, let's set out to understand this behavior to help us better our internet-wide scanning. This brings me to a deceptively simple question, which is what I'm going to try to answer for the rest of this talk: how ephemeral is the internet? Or in other words, how often do we really need to scan different parts of the internet in order to get the most accurate data?

Quick step back: who am I? My name is Ariana, as you can probably tell. I work as a senior security researcher at Censys. Prior to this I did my PhD at UCSD, where my focus was on
internet measurement and empowering security decisions. At this point in my slides I was going to say, "I'm wearing an orange blazer, please come talk to me, it's so easy to find me," but all the volunteers are also wearing orange, so this is just an example of how the best-laid plans can go to waste very quickly. I'll still be wearing this. Please come find me; I love internet measurement.

Okay, so back to the task at hand: how ephemeral is the internet? As with any good measurement question, you can often break down this overarching philosophical question into more concrete, measurable outcomes. So really what I want to find out is: if we scan really frequently, what trends do we
find across different ports, protocols, and autonomous systems? I'll get to what an AS is in a couple of slides, so if you're like, "What the heck is that?", don't worry yet.

To make sure we're on the same page, I'm just going to go over our methodology really quickly. We scanned hosts that had the 40 most common ports open every 45 minutes for a week. In an ideal world I would have had like 20 servers to do this experiment on, but we were limited by server load, so we ended up only scanning about 6 million IPs, because this takes time. The way that we picked those IPs is essentially that we got a list of
responsive IPs on these 40 most common ports, took a subset of them, and then kicked off the predictive protocol scanning tool that we use at Censys for our actual data set. This predictive protocol scanning is really key, and I'm going to take a second and a couple of slides to explain why.

Like I mentioned a little while ago, you can have hosts that speak different ports and protocols, 80 HTTP versus 80 HTTPS. If we just look at the spread of ports that the different IPs speak, we get this graph. The x-axis is port, the y-axis is just the raw number of IPs, and you'll see that
in our data set there's a heavy concentration of IPs that are speaking popular ports. This is very unsurprising. What was a little more surprising: we kicked off this predictive protocol scanning tool, so instead of us saying, "Hey, you're on port 80, you're always going to speak HTTP," we let our scanning engine predict that for us. This x-axis has 40 ports. I'm now going to show you a graph that shows not only the port but the port and protocol combined, and this is that graph. You can't read that x-axis, and that's intentional, because there are 412 port-protocol combinations when we use predictive scanning. And so
this is actually one of the takeaways that I really wanted to drive home in this talk today. This matches up with some prior research showing that a lot of ports do not only speak standard protocols. This speaks to the importance of predictive, or non-standard, scanning: as a measurement scientist, you cannot assume that a given port will always speak its standard protocol. Not only does prior research show that, but our own measurements back it up too.

A little bit more about the ground truth of the ground truth, as I like to call it. Like I said, we were limited by how many servers we could spin up to
run this experiment, so we scanned about 6 million IPs. Of those 6 million, 81% spoke exactly one protocol during the entire week, so for simplicity I'm going to focus on that 81%. There's some really interesting other stuff going on with the other 19%. Specifically, there's about 7% of hosts that our protocol scanner could talk to and get data from, and then all of a sudden they would respond with data that we couldn't parse; it was just "unknown" in our data set. So there's some weird stuff going on in that 19%, but that's a totally different topic and a totally different talk.

Bringing us back to our measurement question: what trends do we
uncover across port, protocol, and AS when we scan frequently? The first metric of interest that we set out to quantify was: what do we find when we examine lifespans? When I say lifespan, I mean a contiguous run of positive, or successful, protocol scans: the host is responding positively every 45 minutes on a given protocol, and the lifespan is how long it keeps successfully responding. A lifespan could be an hour, so something responds for an hour and disappears. It could be a day. It could be 7.8 days, which is the entire duration of the measurement experiment. I could show lifespans for just the port, but
like I keep saying, in research ports are often shown in the context of the protocol they're speaking, so for the rest of this talk everything I'm going to show is going to be port-protocol specific. We're going to look at some common port-protocol lifespans, but before I get to some major takeaways, I want to take 30 seconds to discuss what this type of graph is, because you're going to see a lot of these. Sorry.

This is a CDF, or cumulative distribution function. It is essentially a distribution of your data of interest. The y-axis goes from zero to one, but you can map that to
percentiles, and the x-axis is your metric of interest; for us it's days. Like I said, we ran this experiment for about 7.8 days, because that's when I cut off the scans. Just to really drive home how to read this graph, I've highlighted the 50th percentile, or the median, with the red line. If we see where that intersects with the blue line and let our eyes draw down, that means that the 50th percentile of 80 HTTP is at about 6 days and 15 hours. What that means is that a little under 50% of devices have lifespans that are longer, and that's this top arching curve, and then a little
under 50% of devices have lifespans that are shorter, and that's this really sharp uptick. The other thing I want to point out about these types of graphs is that, as your eyes follow this blue line, you might notice it goes straight up at the end. That essentially means that at the top percentiles, the 95th, 96th, 97th, 99th, 100th percentile, the lifespans maxed out. So the 99th percentile of devices that speak 80 HTTP had lifespans of 7.8 days. If you see those straight lines, that's essentially what that means. Like I said, sorry in advance, you're going to see a lot of these, but now that we've walked through
one of them, I hope that these make a little more sense. And if they don't, don't worry, I'm going to walk you through the takeaways anyway.

Let's look at the five most popular port-protocol combinations and their lifespans. If you remember that graph with the 412 combinations, these are those first five bars, and these are their distributions of lifespans. Just for clarification, I've typed the medians of these popular port-protocol combinations on the right-hand side. So 80 HTTP has a median of 6.6 days; 7547 HTTP, or CWMP for those in the room who know, is at 0.5 days. You'll see that the
green line, 443 HTTPS, is a weird outlier. It's really short-lived; it goes up and then over to the right at a much quicker pace than its counterparts, which I'll talk about closer to the end of these slides. But the really important part, the really interesting part about these CDFs, is that we can see the distribution. In cases where we have medians that are really similar, like 80 HTTP and 22 SSH, the blue line and the red line, they both have medians, or 50th percentiles, at about 6.6 and 6.9 days. But if we look at the blue line and the red line, their behaviors are really different. What this is telling me is that devices, hosts
that speak 22 SSH, sure, at the median they might be at 6.9 days in terms of their lifespan, but after that they become far longer lived. So 22 SSH is actually a far longer-lived protocol in terms of lifespans than 80 HTTP, its counterpart. This is why looking at the entire distribution is so key to understanding this ecosystem. We see a lot of variation in common ports and protocols, with the outlier 443 HTTPS, which like I said I'll get to.

I want to show you another set of port-protocol combinations for comparison that have very different intentions. So a lot of these,
you know, they're HTTP, HTTPS, SSH, TCP, SIP. This graph is all mail protocols. Again, just a quick background for those of you who don't know: email has its own set of ports and protocols. These distributions look very different. Those straight lines are super pronounced, and if you actually look at the medians on the right-hand side (sorry, I forgot to type "medians"), the medians are all between 6 and 7.8 days, the latter being the maximum of the experiment duration. So what we find is that mail protocols are far longer lived than their counterparts. If we take a step back and ask ourselves why that is happening, it is because the intention of
the port and protocol can really inform its behavior. Who here runs a mail server? Yeah, a couple of people. How much downtime do you have? A little, yeah. Mail servers are functionally meant to stay online forever, or as long as the admin wants, in order to transfer email back and forth. If there's downtime, then you can have downtime in the actual transport of the emails themselves, whereas with something like HTTP, your web page goes down, comes back up, no harm, no foul. This really speaks to understanding the intention behind some of these ports and protocols and why they are exhibiting these behaviors.

There are 412 port-protocol combinations.
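
To make the lifespan and CDF ideas from this part of the talk concrete, here is a minimal Python sketch. It is not Censys's tooling. It assumes each host's scan history on one port-protocol pair is a list of booleans (one per 45-minute scan), and that a run of n consecutive successes counts as n times the scan interval; the talk does not say how run lengths were converted to durations. The names `lifespans_hours` and `ecdf` are hypothetical.

```python
from itertools import groupby
from statistics import median

SCAN_INTERVAL_HOURS = 0.75  # 45 minutes between scans, as in the talk


def lifespans_hours(scan_results, interval_hours=SCAN_INTERVAL_HOURS):
    """Return the length, in hours, of each contiguous run of successful scans.

    Assumption: a run of n successes is counted as n * interval_hours.
    """
    return [
        sum(1 for _ in run) * interval_hours
        for ok, run in groupby(scan_results)
        if ok
    ]


def ecdf(values):
    """Return sorted (value, cumulative fraction) pairs for an empirical CDF."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


# Hypothetical host: responds for 3 scans, disappears for 2, responds for 4.
scans = [True, True, True, False, False, True, True, True, True]
spans = lifespans_hours(scans)
print(spans)          # one lifespan per contiguous run of successes
print(median(spans))  # the median lifespan for this host
print(ecdf(spans))    # points you would plot as the CDF curve
```

Plotting the `ecdf` pairs as a step line gives the kind of curve described above. A curve that shoots straight up at the right edge corresponds to lifespans capped by the length of the experiment.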
I'm not going to just keep going through five and five and five. Instead I'm going to take a quick step back and summarize this portion really quickly: when we look at lifespans based on responsiveness, whether there was a successful scan or not, we see lifespan medians varying from 0.8 hours all the way to 188 hours, which is, like I said, the duration of the experiment. In other words, these port-protocol lifespans can vary quite widely, actually a lot more widely than we anticipated.

Now, I love to make my life hard, so the next question is: what happens if we add autonomous systems, or
a third variable, into the mix? For those who don't know what an autonomous system is, it's essentially a set of IPs that's owned by the same organization and has the same routing. So Google has a set of autonomous systems, and Censys has its own autonomous system, or AS. For the sake of time I'm going to look at three ASes that have very different functions: we're going to look at Cloudflare, Microsoft, and KIXS, which is the largest Korea Telecom AS, as a case study. To keep things really simple, we're going to look at 80 HTTP to start off. What you'll see here, again a lovely CDF: the blue line is
the port-protocol distribution in aggregate, and the orange line is hosts that speak 80 HTTP specifically on that AS, Microsoft Corp MSN, et cetera, et cetera. When we compare these two, we can say, okay, the devices that speak 80 HTTP on this AS are much shorter lived than the entire population that we're looking at. When we do the same examination for Cloudflare, it's basically the exact opposite: 80 HTTP is far longer lived than the aggregate. And if we look at Korea Telecom, it's still longer lived, but not as pronounced. Again, I've posted the medians just to make this takeaway a little bit more salient. One of the things to
make note of here is that, similar to how ports and protocols can have different intentions, these autonomous systems have different purposes, different intentions. Microsoft sells hosts as a service, so you're going to have a lot of customers who spin something up, maybe they bring it down, they spin it up again. Cloudflare is a content delivery network; again, kind of similar to mail servers or mail protocols, it's meant to have really solid uptime. And then for Korea Telecom, part of its function is to provide residential access, and so that might be why it's not as pronounced as Cloudflare but still a little bit longer lived than the aggregate. These purposes, these
reasons that these different autonomous systems exist, can also start to inform trends and how we might want to scan these different ASes differently. I'm done with CDFs, by the way; that was your last CDF.

If we also look at ports and protocols that are meant to be really similar in intentionality, though, we don't necessarily see parity between the two distributions. What do I mean by that? These are the three autonomous systems, and these are just medians, because I figured at this point you might be graphed out. We see the medians for port 80: 1.1 hours, 7.8 days, and 1.7 days. If we look at port
8080, and again only HTTP: often folks on the internet treat 80 and 8080 very similarly; both are meant to serve web pages. With Cloudflare we see very similar behavior, but we don't necessarily see that with Microsoft and KIXS. So we see a huge variation not only between ASes but also between ports speaking the same protocol with the same intentionality. This was again where all my hypotheses started getting thrown out the window a little bit. Oh, and I already spoke to this: AS intention can also make a huge difference.

So with my last two remaining minutes, I want to dive into one other discussion, which is: what if we change
our definition of lifespan? So far we have categorized lifespan as successful protocol scans, up and down time. But we saw that with 443 HTTPS there were really short lifespans, which seems a little strange compared to the rest of the port-protocol combinations. That got us thinking: what if we changed our definition to look at how the host itself is changing? Maybe there's just some measurement error, or something weird is going on with the network; a lot of weird things can happen on the internet. So instead, let's look at how the host itself is changing over time. I came up with this
idea of a host cookie per protocol, which is the set of fields that are of most interest for that port and protocol. For these last couple of slides I'll do a case study on 443 HTTPS, where the two fields of interest were the SHA-256 body hash and the fingerprint of the certificate. We combined those together to make the host cookie, and we were like, surely the lifespans of 443 HTTPS must increase, because why would these things be changing so drastically? Instead, when we calculated lifespans based on change, the median lifespan increased from 0.8 hours to 1.1. This was not what I was anticipating, folks. After some digging, because this project has been a
lot of me digging, we realized that the SHA-256 body hash is actually too granular for our purposes, because what's often happening with HTTP, which is an incredibly dynamic protocol, is that you'd have frame IDs that change every time you visit the web page. If you visit the web page every 45 minutes, you're going to get a slightly different frame ID; take a SHA-256 of that, and the SHA-256 is going to be different every single time. We actually verified this hypothesis, because when we calculated the lifespan based on just the certificate fingerprint, the median lifespan all of a sudden became 188 hours, again the entire duration of the
experiment run. This brings me to this philosophical question, which is: what is the definition of a host? For us, for you, in your measurement experiment, it could be the SHA-256 body hash, it could be the certificate. For us, we're now looking at context-specific hashing, because if a body hash or the body HTML changes by three, four, ten bytes, to us that's functionally the same host. So there are some changes that we are examining for our own purposes.

Okay, this is my tunnel of terrors. Quick recap. Takeaway number one is that the internet is not homogeneous in its ephemerality. Single isolated scans, if you're a security researcher,
may be totally acceptable, but not if you're trying to take the pulse of something. At the port and protocol level, we find median lifespans varying all the way from 0.8 hours to 188 hours, and we find additional wide variation when we add an autonomous system. Takeaway number two is to understand what you are trying to measure and why it's important. This gets back to the deep philosophical question of what is a host. What does lifespan mean for you? Is it uptime? Is it change? With HTTPS 443 alone, these different definitions change our lifespan metrics quite wildly. And then finally, the internet is constantly evolving; we need to be conducting measurements more consistently to understand these weird
facets and what's going on. My colleague Emily had mentioned that counting is hard; what I really want to leave you folks with is that measuring is also very hard. There are a lot of different next steps, but I think I'm a minute over time. I just want to thank my colleagues at Censys really quick. Good research is not done in isolation, and I'm very thankful to be learning alongside my team members on the research and data team every day. And I want to thank you folks for your time. If you have questions, you can come find me in my orange blazer. Thank you so much.
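
As a footnote to the host-cookie discussion in the talk, the following is a minimal Python sketch of why a cookie that includes a body hash can change on every scan while a certificate-only cookie stays stable. It is an illustration under assumptions, not the method used at Censys: the talk does not say how the two fields were combined, so joining them with a colon is a guess, and `host_cookie`, `change_lifespans`, the sample HTML, and the sample fingerprint are all hypothetical.

```python
import hashlib
from itertools import groupby


def host_cookie(body: bytes, cert_fingerprint: str, use_body: bool = True) -> str:
    """Build a per-scan identity string from the fields of interest.

    Assumption: the body hash and certificate fingerprint are simply joined.
    """
    body_hash = hashlib.sha256(body).hexdigest() if use_body else ""
    return f"{body_hash}:{cert_fingerprint}"


def change_lifespans(cookies, interval_hours=0.75):
    """Return the length, in hours, of each run of scans with an unchanged cookie."""
    return [sum(1 for _ in run) * interval_hours for _, run in groupby(cookies)]


# Hypothetical scans: same certificate, body differs only by a per-visit ID.
scans = [(f"<html id='frame-{i}'>hi</html>".encode(), "ab:cd:ef") for i in range(4)]

with_body = [host_cookie(body, fp) for body, fp in scans]
cert_only = [host_cookie(body, fp, use_body=False) for body, fp in scans]

print(change_lifespans(with_body))  # every scan looks like a new host
print(change_lifespans(cert_only))  # one long-lived host
```

The design question the talk raises is which fields belong in the cookie at all; a context-specific hash would sit somewhere between these two extremes.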