
Hi everybody. My name is Noey. I'm the room host in here. Good afternoon and welcome to BSides Las Vegas. This is, uh, Proving Ground, right? No. >> Ground Truth. >> Ground Truth. Ah, um, the talk is >> Ariana. >> Yep. Ariana, and predicting the lifespan of internet services, falling down the rabbit hole. And this is given by Ariana Mirian. A few announcements before we begin. We want to thank our sponsors: our diamond sponsors Adobe and Aikido, and our gold sponsors Formal and RunZero. It's the support of our sponsors, along with our other donors and volunteers, that makes our event possible. The talks are being streamed live, and as a courtesy to our speaker and our audience, we want to make sure that your cell phones are down and silent. If you have any questions, I'm going to be using this microphone and you will speak into it, so our audience on YouTube can hear you. And, yeah, final announcement: the data science meetup is at 7 p.m. and it's going to be by the pool by the entrance. All right, and with that, I'll leave the floor to you.
>> Thank you so much. Thanks for being here, everyone. I appreciate it. Okay, this has got to get closer. Um, hi everyone. Thanks for coming. I know it's 5:00 p.m. on Tuesday. We've all got long weeks ahead
of us. I appreciate you all being here in person and online. My name is Ariana, and today I'm going to talk about predicting the lifespan of internet services: falling down the ML rabbit hole and what we learned from the thud. Just a quick disclaimer: if someone, maybe Christian, could take a photo of me giving the talk at some point when it looks good, that would be great, because everyone is busy and I need a photo for proof of life. Thank you, sir, for that photo. [laughter] Okay, great. Before I jump into this really long title and what it actually means, I'm just going to go over a little bit about who I am. Like I
said, my name is Ariana. I'm currently working as a senior security researcher at Censys. My job is kind of twofold. First, I focus on data quality: how do we have the absolute best data so that you folks doing security investigations also have the absolute best data? But I also combine the domains of internet measurement and security, or you can think about this as data science and security, to answer interesting research questions about the world and the internet. Before this I was at the University of California, San Diego. My PhD was, unsurprisingly, in driving security decisions via internet measurement. I'm also really into birds. So, if you like internet measurement, security, or birds, come talk to me. I am really into all three of those things. That's a kea. They're crazy. Ask me about them later. Great. So I'm super excited to be here today, specifically in Ground Truth, to talk about something really near and dear to my heart. Often when we talk about research, you are hearing about the end result, right? You read a blog, you see a paper, you read about the methods that were successful. Today I am giving partly a failure talk. So I want to talk about what happened in between the start and the end of our research project. What went wrong? How we reconfigured our solutions, how we reconfigured our thinking, in order to get to a correct
solution. And also specifically talk about, you know, what happens when you join two subjects that require a lot of domain expertise, cough cough, security and ML, together, and what can go wrong and also what can go right. And so, as you probably gleaned, I am approaching this from the perspective of a security researcher, right? Like, I have been doing security for 10 years. I'm pretty proficient in networking as well, but this was the first time I was kind of dipping my toes into the ML space, with the help of some ML scientist colleagues. So I wasn't doing this alone. But it was a really eye-opening experience, and you know, we don't often get to talk about the outcomes and the methods, and so I'm super excited to be at Ground Truth to talk about both of those things. And so every time I talk about methodology, I actually put a little light bulb on the top of the slide to just indicate that this is a methodology point, and I'll also call it out as much as I can. So today we're going to talk about it all, and specifically I'm going to go over a couple of points in the next 40 minutes. I'm going to give you a quick background about the internet and why internet-wide scanning matters, talk about the various iterations of the
problem statement and how it changed from problem V1 to problem V2, talk about the various solutions that we tried, and then finally what we learned, both for the project and for the future. So I'm going to really take you, you know, through us falling down the rabbit hole, which was trying to find an answer to this project, but also the thud at the end, which was us kind of looking back up at the process and being like, wow, that was kind of painful. What can we learn from that? What can we share with the world about what we learned from that? And that's why the title is so intricate: because it's been a really good learning experience. So, a little bit of background: what is the internet and why does internet-wide scanning matter? I always put the internet in quotes because it is fake; it should not be. It is just a strange entity that rules a large portion of my and others' lives. But jokes aside, you know, the internet is a very fraught and dangerous place. There are a lot of bad actors out there. And just to make sure we're all on the same page, when I say the internet, I think of the collection of hosts that make up this networking entity, right? And when I say a host, I think of
servers or network devices that are typically comprised of an IP, port, protocol, and autonomous system. And for those that aren't super familiar with networking, the autonomous system is like the routing policy of the specific device, so who controls how it gets routed to the rest of the internet. The hosts can tell us really interesting information about themselves. Not only are the port and protocol that they are using to speak with the rest of the internet inherently interesting, but we can also collect other interesting metadata, like what vulnerabilities are they purporting? What software or hardware version are they? And internet-wide scanning is a way that we can get all of that data in one place.
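As a concrete illustration of that host tuple, here is a minimal sketch of how one such record might be represented. This is my own toy example in Python; the field names are illustrative, not Censys's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class HostService:
    """One scanned service: the attributes described above."""
    ip: str                      # e.g. "203.0.113.7"
    port: int                    # e.g. 443
    protocol: str                # e.g. "HTTPS"
    asn: int                     # autonomous system number, e.g. 64496
    software: Optional[str] = None            # optional metadata from the scan
    vulnerabilities: Tuple[str, ...] = ()     # e.g. advertised CVEs

example = HostService(ip="203.0.113.7", port=443, protocol="HTTPS", asn=64496)
print(example)
```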
And so if you look at this graphic on the left, right, we have our set of devices, and then a shop like Censys, which is where I currently work, will go out and scan all of these devices to collect as much public data about those devices as possible. Again, to enable research about various vulnerabilities, various devices, software, hardware, products, etc. And the caveat here is that the more accurate your data is, the more accurate your insights are going to be, right? If you have scan data that is a week or a month old, your insights into that data are going to be a week or a month old. But if you have really accurate data that is super up-to-date, your insights will also, conversely, be up-to-date. And one thing that I have realized in my tenure with internet-wide scanning is that many hosts are ephemeral. Imagine a faulty Christmas light that's blinking in and out. There are a lot of hosts on the internet where we scan them and they are responsive and they're like, "Hey, here's who I am and here's what I talk and here's all the vulnerabilities I have." And then we scan them a couple of hours later and they are just gone, or they appear to be gone, and then like four hours later they're back again. And I'm like, "Come on, just make up your mind. This would make my life a lot easier." These really ephemeral hosts complicate this accurate picture that we are trying to depict. And so the first version of the problem is that we want an accurate representation of the internet at any given point, but we have these highly ephemeral hosts that make up a minority of the internet that are kind of mucking up the data. And so we're like, well, okay, what do we do about this? Well, you know, I'm a scientist by training, so I'm like, let's go run an experiment. And so we set out to quantify the ephemerality of the internet. And in doing so,
we scanned a representative set of services every hour for a week. And if you're like, "Hey, Ariana, this is sounding really familiar," that's because this is the talk I gave last year in Ground Truth at BSides Las Vegas, where we scanned the internet every hour, showed a bunch of interesting graphs and behaviors, and examined the internet at these different service-level attributes. So again, port, protocol, autonomous system. And I'm not going to go over that talk again. But the biggest takeaway from that descriptive analysis and from the talk about a year ago is that the internet is highly ephemeral. It's a minority of services, and also the internet is not uniform in its ephemerality. While we might find that a portion of hosts that are on port 443 are super ephemeral, that can vary widely depending on the autonomous system. That can vary widely depending on the protocol you're looking at. And so we were doing all these descriptive analyses to try and glean, you know, insights into the ephemeral internet. And the biggest thing we could take away is, wow, this is a mess. This is really noisy. [laughter] And so, you know, I presented this work. We continued to do some of the research, and then at some point we were like, man, the outcomes are not clear. We
are having a difficult time figuring out the takeaways, for us to use internally and for us to go publicize. And so at this point, and this is the little methodology light bulb, we're here; this is, you know, failure step number one. It is taking a step back, right? It's like, okay, the data is messy. We're having a hard time with, you know, simple descriptive analyses, trying to figure out what the takeaways and outcomes are for us. And so we took a step back and said, what are our end goals with this data set, with this project? And you know, we came up with two very specific end goals. One is to come up with a method that allows us to identify which services should be scanned at a faster rate, and that takes into account this noisiness, right? If one autonomous system has mail servers that should be scanned at a faster rate, I don't want to scan all the mail servers at a faster rate. I don't want to scan that entire autonomous system at a faster rate. It is a specific subpopulation of the internet. So we wanted a method that would more easily help us identify these different small portions of the internet that we should be scanning faster. We also wanted an understanding of what affects internet service lifespans. In other words, we wanted to figure out the feature importance of what causes a service to be ephemeral. Is it just its port? Is it its protocol? Is it a combination of those? And this feature importance is more of an underlying understanding of the behavior of these ephemeral hosts. But it allows us as an internet-wide scanning entity to go and then figure out, okay, does this change our internal scanning policies? Do we alter the way we think about ephemerality and how we scan internet services based on the model's outcome? So we came up with these; you know, we took a step back, made sure we were on the same page with our end goals, and
we're like, okay, well, let's reframe the problem, because the original problem, which is that we want an accurate picture of the internet, is great. That's like our guiding light, but that's not actually helping us solve the data problem. And so the problem then evolved into: how can we predict the lifespan of ephemeral services? And it was at this point that ML entered the room. It is so hot right now. It just felt very timely given all the conversations about ML. And I should say that, you know, I was very hesitant to dip my feet into the domain that is machine learning, because, right, like I said, I am a security expert, I am a networking person, and trying to figure out how to marry two different domains is going to be a challenge and a journey in and of itself. And that is why we're here, because it was in fact a journey. So many of you may be asking, okay, but why prediction? Why didn't you reframe, you know, your V2 of the question to be something different? Prediction allows us to have smarter and more accurate scanning. So, just to really concretize this, if we can say host A is going to change its state in 8 hours and host B is going to change its state in 12 hours, then those need to go into different scanning buckets that are scanned at different rates. Again, that helps us get more accurate data. So this is where prediction is really useful. But also, you know, my background is also more in statistics. I actually tried a lot of statistical models, and that is just a completely different talk for another day. That's like a whole other 45 minutes. The issue with a lot of the statistical models is that they had really strict assumptions that internet-wide scanning data breaks. And while they show relationships between variables, it was much harder to glean that feature importance, which was, you know, what are the service-level attributes that help predict ephemerality. And at the end of the day, we wanted both prediction and feature importance, right? Like, I didn't want to be running a ton of different models and then trying to pull it all together. Spoiler: I did actually try that. It really didn't work well. And so we're like, okay, let's dip our toes into the ML space, because it seems like there are some machine learning models that can help us get both prediction and relational information, which is going to solve both of our end goals. And so this is the explanation for why we wanted to use prediction. And then the next step was, okay, great. Well, which model are we going to use?
Because there are a lot, right? It's not just like, oh, ML, throw it in. This is an entire field where people have spent years training and understanding these models and what's going on under the hood, much like me and my colleagues have spent years understanding the internet and security and whatnot. And so again we're like, okay, well, what do we have and what are we trying to get? Our input features for each service, for each host, are port, protocol, autonomous system, which again is like the routing policy (we'll just leave it at that for lack of time), and whether the scan was successful or not. And so you can imagine, and this picture shows, that if we have the host, we have its port and protocol, and also kind of where it's getting routed or who it's getting routed by, and then also whether it was responsive or not, because we were doing these hourly scans. So there were times where hosts just went away and I was like, well, we'll check again in an hour and see if they're back. Given these, we wanted to predict the lifespan of a service. Again: is it going to change state in 1 hour? Is it going to change state in 6, 12, 24 hours, in a week? I want to know. I want to know about all the services. And so we were looking at the inputs and we're like, okay, well, the lifespan of services is a numerical output that's in hour chunks. And so, looking at all the different model families and also consulting with the ML scientists that we have in house, I was like, okay, well, let's try regression, right? Regression handles numerical output, and bonus: regressions don't always assume linear relationships, because there's no way that there are linear relationships in [laughter] internet-wide scanning data. And so Solution V1: let's use a regression. Seems super simple. My life is going to be great and we're going to have all the answers to the problem.
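To make that output variable concrete, here is a minimal sketch of how a lifespan label in hours could be derived from a week of hourly scan observations. This is my own illustration of the idea in Python, not the actual pipeline.

```python
from typing import List

def lifespan_hours(observations: List[bool]) -> int:
    """Hours until the service first changes state, given hourly scan results.

    observations[0] is the state at the first scan (True = responsive); if the
    state never changes during the week, the lifespan is the whole window.
    """
    first_state = observations[0]
    for hour, state in enumerate(observations[1:], start=1):
        if state != first_state:
            return hour
    return len(observations)

# A host that answers for 3 hourly scans and then disappears: lifespan of 3 hours.
print(lifespan_hours([True, True, True, False, False]))  # -> 3
# A host that stays up for the whole week of hourly scans: lifespan of 168 hours.
print(lifespan_hours([True] * 168))                       # -> 168
```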
No, that is not what happened. We actually ran into a lot of issues with trying to use regression for this problem. And this is where I'm going to walk through the difficulties, the failure cases of what exactly was going wrong, because we wanted to understand why this was failing so spectacularly, to figure out: is there a potential solution, or do we just abandon ship and move on to the next problem in the internet-wide scanning space? And so the first big issue was with our input variables. We had port, protocol, autonomous system; they're categorical. Lots of models require numerical input. Even if they take in categorical input, they're still converting it into numerical under the hood. But that's okay, because we have a way to handle this. You can essentially encode these variables, right? And there's a popular technique called one-hot encoding. This was one of the many that we tried. But just to walk you through it: on the left-hand side was the original data set. So each row is a specific host and what port it had open, what port we were scanning it on. And when you one-hot encode, you're essentially taking each of the possible values, converting them into their own column, their own variable name, and then listing the true and false values. And so if we convert the left-hand side into the right-hand side, we'll see that port 80 and port 443 are now their own columns. And since the first row had port 80 open, there is a one for the first row for port 80, a zero for port 443, and then the converse for the second row. This is a super straightforward technique. There are other ways to encode things, but this is just, you know, the one that is easiest for me to explain graphically in the time that I have.
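A minimal sketch of that transformation, assuming pandas; the toy two-row data set mirrors the slide's example:

```python
import pandas as pd

# Original data set: one row per host, one categorical "port" column.
hosts = pd.DataFrame({"port": [80, 443]})

# One-hot encode: every possible value becomes its own 0/1 column.
encoded = pd.get_dummies(hosts, columns=["port"], dtype=int)
print(encoded)
#    port_80  port_443
# 0        1         0
# 1        0         1
```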
Seems like this should work, right? Nope. The issue is that a lot of our variables had really high cardinality. They had a lot of potential categories they could be in. And just to put some numbers behind this, we had 40 potential ports, 300 potential protocols, and 19,000 potential autonomous systems. So imagine that each of those is becoming a column; your matrix all of a sudden is so sparse that the model, spoiler, can't do much with it. And on the right-hand side, for those who are unfamiliar with autonomous systems, I just have some examples of autonomous systems. You can have universities, financial institutions, you know; these are large organizations that own IP space and then route that IP space. And so this is where, and again we have some great ML scientists in house and we were chatting constantly about this, this is when one of our ML scientists was like, well, Ariana, just get rid of autonomous system, that's such a problem feature, you don't need that. And this is where having domain expertise in both domains is so helpful, because me, as the networking security person, was like, wait, autonomous system is so critical to the underlying nature of a host, we can't just throw it out. And again, my ML scientist colleague, fantastic person, did not have the networking background. So they were just like, "Oh, yeah, we'll just get rid of it." And I was like, "No, no, no. We're not
getting rid of that. No, no, we cannot." And so we reached a compromise where we're like, "Okay, let's reduce the cardinality. Instead of 19,000 potential autonomous system options, what if we boil down the autonomous system feature into its category?" So, you know, we say this is an educational autonomous system, a financial autonomous system. That dropped that feature down to about 200 options, and it was still a super sparse matrix, with slightly less abysmal results. And so even with this compromise, we weren't getting great outcomes. And just to put some numbers behind what I mean by not-great outcomes: when you run a regression, one of the key metrics of success is your R-squared. R-squared is essentially how well the model is explaining the variance in the data. It is on a scale of 0 to 1, and the best thing I could get was about 0.3. Most of the time it was below 0.2. And so that means that the model was taking the data in and only able to describe, at best, 30% of the variance based on the features that we were putting into it. In the words of Gordon Ramsay, that was just not good enough.
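As a rough sketch of the kind of check this involves: one-hot encode the (reduced-cardinality) categorical features, fit a regression, and score R-squared on held-out data. The file name and column names below are illustrative stand-ins, not the actual data set, and the code assumes pandas and scikit-learn.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Illustrative columns: port, protocol, asn_category, lifespan_hours.
df = pd.read_csv("services.csv")

X = pd.get_dummies(df[["port", "protocol", "asn_category"]].astype(str))  # sparse, wide matrix
y = df["lifespan_hours"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = Ridge().fit(X_train, y_train)

# R-squared: how much of the variance in lifespan the model explains (0 to 1).
print(r2_score(y_test, model.predict(X_test)))  # in our case, rarely above ~0.3
```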
And, you know, this is where the practicality of industry kind of butts heads with the science of academia and these mathematical models, because in a lot of cases, you know, an R-squared of 0.3 actually is quite sufficient. But one of the other difficulties is that we were trying a lot of different regression models to see if they would at least have the same output, right? Again, the feature importance was a secondary goal for us, and these models were outputting feature importance that was all over the place. I mean, one would say autonomous system was important, one would say port was important, one would say protocol was important. And again, in consulting with my ML domain expertise colleagues, our understanding is that the models just weren't really grasping what was happening underneath. They were not converging in a way that made sense to any of us, and we also couldn't explain it, right? Imagine going into a meeting with a bunch of your colleagues who are expecting you to come with outcomes, and you're like, so one model said this and the other one said this and the third said this, and, yeah, what do you take from that? Not great. The light at the end of the tunnel, or the slightly depressing thing, is that this is obviously not a problem unique to us. I started looking at published research in other domains; human-computer interaction, for example, has a lot of really good peer-reviewed published work where the authors were like, yeah, we have this high-cardinality data set and we tried a regression and it was not great, and so we moved on. And I was like, okay, that's encouraging, because maybe there is a solution here. A lot of research using high-cardinality data had the exact same results, and the results were abysmal. It was really discouraging, and so we were like, okay, let's take another step back; our light bulb is back. How can we reframe the research question to then use a model that actually works with the data, that gives us something that we
can act on, that we can explain to the data engineers, that we can explain to people who are like, okay, so why do you think we should change our scanning in the way that you're proposing? And when we were poking around at some of the data, we realized, you know, we revisited this facet of the data set, which is that the output variable was incredibly bimodal. So remember, our output is the lifespan of a host. That can be anywhere from an hour to a week, because that's how long we ran our experimental data set. And when we plotted the distribution of lifespans, there is a heavy concentration on the left side. So again, this small minority of hosts that are pretty ephemeral, and then a huge chunk that are just online. They are homies. They are online. They are just consistently responding positively. And so we were looking at this and we're like, okay, well, can we use this to our advantage, right? And I'm going to revisit our end goals, because we wanted a method to identify which services should be scanned at a faster rate, which really meant these services should be scanned in X hours, these should be scanned in Y, etc. What if we just reframed it and made our lives easier and said, hey, these are the less ephemeral hosts and these are the more ephemeral hosts? So taking these two obvious buckets, and instead treating our problem like a bucketing problem rather than a regression or a very exact, precise prediction problem. And so that led us to Solution V2, which is: hey, let's use a classification. Again, this may not be the most academically rigorous thing. If I tried to submit this to a research conference, I'm sure they would be like, wait, but regression is the thing that you should definitely use for this problem. But practically for us, if we want more accurate data, and we're scanning most hosts at roughly a daily cadence, even if we can say, okay, this set of hosts should be scanned on a 12-hour cadence or a 6-hour
cadence, and then everything else remains the same, that is still an improvement for us. That is still improving our accuracy. And again, this is kind of the joy of being able to work in a more practical industry setting, where I don't need to worry about the woes of reviewer 2. I can ask, what is the thing that is pract... I'm seeing some people laugh. What is the thing that's practically going to help us? A classification will practically get us to an answer to our problem. Classification went great. It was fantastic. I was like, "Thank God, I can do my job," because there were some dark moments in this. So we divided our output into highly ephemeral hosts, so hosts that had a lifespan of less than 12 hours, and low-ephemeral hosts, which is everything that was 12 hours plus, right? And again, that's because our scanning cadence is roughly daily. There's some nuance there. And so if we can even say, hey, this 10% we should be scanning at a 12-hour rate, again, that is still going to increase our accuracy and get us closer to the goalpost of having the most accurate representation of the internet. We also stumbled on this model called CatBoost, which natively handles categorical data. Again, it is still converting it into numerical inputs under the hood.
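Roughly what that setup looks like, as a sketch: bucket the lifespan at the 12-hour mark and train a CatBoost classifier with the categorical features passed in directly. The file name, column names, and defaults here are illustrative assumptions, not the production pipeline; the code assumes pandas, scikit-learn, and the catboost package.

```python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative columns: port, protocol, asn, num_services, lifespan_hours.
df = pd.read_csv("services.csv")

# Binary label: highly ephemeral (< 12 hours) vs. everything else.
df["highly_ephemeral"] = (df["lifespan_hours"] < 12).astype(int)

features = ["port", "protocol", "asn", "num_services"]
cat_features = ["port", "protocol", "asn"]   # handed to CatBoost as-is, no one-hot needed
X = df[features].astype({c: str for c in cat_features})

X_train, X_test, y_train, y_test = train_test_split(
    X, df["highly_ephemeral"], test_size=0.2, random_state=0)

model = CatBoostClassifier(verbose=False)
model.fit(X_train, y_train, cat_features=cat_features)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("ROC AUC: ", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
# Which features the model leaned on most (autonomous system, in our case).
print(sorted(zip(model.get_feature_importance(), features), reverse=True))
```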
But what was really solidifying about this decision is that we tried CatBoost, we tried random forests, we tried a couple of different classifiers, and though they were performing at slightly different accuracies and slightly different ROC scores, the feature importance was always the same. They were converging on what they thought were the features that most predicted ephemerality. This was not happening with regressions. Lol, sob. And so I was like, okay, there are some slight differences, but they're all mostly behaving the same way, which is what we would expect in a well-defined research problem with a well-defined data set. We also didn't really need to reduce the cardinality of the data either. I just threw in all the autonomous systems. I didn't need to do any of this category stuff. That's really helpful for us, because now I can say these hosts in these autonomous systems need to be scanned faster or not as quickly. And the results weren't abysmal. And when I say they weren't abysmal: we were able to get an accuracy of 87% and an ROC AUC of 94%, which is a night-and-day difference from what we were seeing with the regression. We also got a feature importance that was successfully ranked and reproducible across different models, and this is really interesting to me. So autonomous system is the highest-ranked feature when we are talking about the lifespan or the ephemerality of a service, and that is super interesting to me, and now we are having a lot of internal discussions and ongoing research about how that changes some of our base assumptions about how we've been scanning the internet to then get more accurate data. There were also some other features that were important, like the number of services on a host, the port, the protocol, but it was really autonomous system that stood out above the rest. And we had some really good precision and recall. So overall, this was a great success compared to the results that we were dealing with before. And then we were like, hey, that worked
really well. What if we modified our classification buckets slightly and actually expanded this into a three-class problem? So now it's like we scan certain hosts at 12 hours, we scan certain hosts at 24, and then everything else we can scan at more than a 24-hour cadence, which I think practically we'll still probably scan at 24. But this was like a nice little, let's expand this and see if this continues to work, right? And in the words of Moto Moto from Madagascar 2, so nice you got to say it twice. Anyone? No. Great movie. Highly recommend. [laughter] Thank you.
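The only real change from the binary setup is the label. A tiny sketch of that three-way bucketing, using the 12- and 24-hour thresholds from the talk and assuming pandas:

```python
import pandas as pd

lifespans = pd.Series([3, 10, 18, 30, 168])   # lifespans in hours (illustrative)

# Three scanning buckets: re-scan within 12 hours, within 24, or daily-plus.
buckets = pd.cut(lifespans,
                 bins=[0, 12, 24, float("inf")],
                 labels=["scan_12h", "scan_24h", "scan_daily_plus"])
print(buckets.tolist())
# ['scan_12h', 'scan_12h', 'scan_24h', 'scan_daily_plus', 'scan_daily_plus']
```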
Again, our accuracy and ROC came out really well. I kept running these, and these are with hyperparameter tuning and whatnot. I don't have the parameters up here because they didn't seem as relevant for this discussion, but I would run the base model with just some really basic input parameters and the accuracy would come out at like 75%, and I was like, am I dreaming or does this work? In this three-class problem we also again investigated our feature importance. The autonomous system was still highest by far, but then port and number of services flipped. And so that's something that we're investigating: why that is, what it means for us, and whether the two-class or three-class formulation is more suited for our needs. So, some interesting open questions. You know, as with all sorts of research, there is always more to dig into, but this was still a great success. So at this point, this was, you know, kind of the entire evolution of the research project. We're taking a lot of this. We're digging in further. I'm pretty sure I won't be giving a follow-up talk on this next year, but who knows? Never say never. And so at this point, I kind of want to take a step back and talk about the two different sets of lessons we learned. So, lessons we learned from falling down the rabbit hole, or the project-specific lessons, right? Using network-level features, we were able to train a classifier with high accuracy and predict service-level ephemerality with reproducible features. Autonomous system is the feature of highest importance, which, like I said, helps us understand how we should address internet ephemerality better. It also is helping us, again, with some of these internal conversations that are ongoing. Maybe that'll be the third version of this talk: how we changed internet-wide scanning. Probably not. The one thing I really want to point out is... oh no, I think that's on the next slide. I got too excited. So, it was like, great,
this is a really helpful project for us internally. We got our findings. Let's move forward with it and make our accuracy better. Let's make our data quality better. But what were the lessons that we learned from the thud, right? This was a very long process. I distilled it into roughly 35 minutes, but this was not a 35-minute task, and it took a lot of domain experts coming together. You know, first off, domain expertise in both areas is super critical. If I had been tackling this on my own, or if my ML colleagues had been tackling this on their own, I think it would have taken more time, it may not have been successful, and we may have come to different outcomes. And so being able to have both people in the room was super, super useful. And I would highly encourage, if you are ever dipping your toes into a different domain, find someone who works in that domain. It's also fun, you know, working with good people, doing good research. I had a great time, even though there were some days where I was banging my head against the metaphorical wall. Rethinking your analysis is necessary, even in the ML domain. And I've kind of touched on this before, but one of the biggest things for us was saying, "Hey, how do we address the practical issue outside of the scientific one?" Like, yes, I would love to get a research paper out of this, but the reality is we're not doing anything novel with the ML models. Really, what we need is a practical understanding of what dictates internet service ephemerality, such that it helps us internally with some policy decisions. And so that means that we converted a regression to a classification because of various factors, and that gave us really good performance, which, practically and scientifically, is helpful for us. And so really being able to take a step back and rethink what is important, what isn't important, what can you leave on the table, what do you need to take with
you, was really critical here. And I will say this process was not foreign to me. In my 10 years of doing research, there are so many times where we have needed to take a step back and rethink the problem: what are we doing? What are the analyses? And this was just the most recent case of that journey. Internet-level data is super noisy. [laughter] Surprise. And network-level features perform much better in these models than software-level features. One thing that I didn't include a slide on is that we were trying different features. We had the network-level features, because that is inherent to the internet-wide scan. But then we can also add some software-level features, like what product is this? What vendor is this? What version is this? And we tried to include some of these other features. And even though the model was still performing well in terms of accuracy and ROC, those software-level features were at the bottom of the feature importance list. I mean, they were just so obviously not important to the model output. And this is really important to us, because I'm like, great, so now moving forward we'll focus on network-level features. And that doesn't mean I've sworn off software-level features, but this is very useful for us in other endeavors that are going on internally. Science be sciencing. That's been great. With that, I want to thank you folks for your time, and I would be remiss if I didn't thank all the fantastic folks I work with at Censys. Good research is not done in a vacuum, and I have some really good people doing some great work at Censys, and I'm really thankful for them. With that, I'm happy to open the floor to questions. And thank you folks for your time.
>> Hey, thanks. Great talk. I'm curious: so, you looked at the autonomous systems having a substantial impact on the model. >> Yeah. >> Did you then go back and look at those autonomous systems, as to which autonomous systems were very ephemeral specifically? >> Yes, I did. Yes, I did. I'm not sure how much we want to share of that information. [laughter]
>> Great talk. A question about when you have that binary output of scan successful or not. >> Yeah. >> So, I mean, that could be a failure of the host to respond. It could also be a failure of the scan itself. What quality processes do you have in place, when you have a previously successful scan and then an unsuccessful scan, for validating that your own scan was successful on your side? >> Yeah. Yeah. Great question. So, before we launched this experiment, we did a bunch of quality testing, where we basically ran our scanner against hosts that we knew were definitely up. And we tested across different ports and protocols at different speeds, to try and find that optimal place of how quickly we can run our scanner while still getting a 100% success rate, again based on our ground-truth data set. So we basically tuned our own parameters of the scanning to feel confident in the quality. We also did this, what's the word I'm looking for? We further cleaned the data set, where we said, okay, if we see a host that is successful, successful, successful, because we had hourly scans, and then for an hour it blips away but then subsequently is back, we're actually going to count that as a false negative. So that was us cleaning the data based on our knowledge of the internet. I will say that those false negatives were few and far between, but that was another thing where we were like, listen, based on our understanding of the scanner, based on our understanding of the internet, things most likely aren't disappearing for just an hour. And so, because we had that high-granularity data set, we could essentially patch it, which was a cleaning mechanism. Yeah.
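That patching rule is simple enough to sketch. A minimal illustration of the idea, under the assumption that a single one-hour gap surrounded by successful hourly scans is treated as a scanner false negative (my own toy version, not the actual cleaning code):

```python
from typing import List

def patch_single_hour_blips(observations: List[bool]) -> List[bool]:
    """Treat one unsuccessful hourly scan sandwiched between two successful
    scans as a false negative and mark it successful."""
    patched = list(observations)
    for i in range(1, len(patched) - 1):
        if patched[i - 1] and not patched[i] and patched[i + 1]:
            patched[i] = True
    return patched

print(patch_single_hour_blips([True, True, False, True, True]))
# -> [True, True, True, True, True]
```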
>> Just a follow-up question to that. Have you considered, for hosts that are... sorry. >> Sorry, can you repeat that question? >> Sorry about this. It's just a follow-up question. So, for hosts that are traditionally not very ephemeral, you know, very well established, with very low downtime, have you considered, when you do have one of these one-offs, initiating a secondary scan at a slower frequency immediately afterwards? Have you considered other things like that, and do you deploy that? Just curious. >> Yeah. Yeah. So, for the experimental data collection, we did not do that, specifically because of time. This was some of the detail I talked about last year, but because we deployed these hourly scans, we were very limited in the sample of hosts that we could scan, because we can only scan so much in a given hour with the hardware that we have. And so we didn't do retries for the experiment, but that is something that we do at the production level. So there are other checks at the production level, so when you're in Censys, that data is the highest quality possible. But that's why, for our experimental data set where we ran one-off hourly scans, we did that patching mechanism to account for it. We didn't want to do the retry, because then the retry means instead of scanning four million hosts we could probably only have scanned three million, based on bandwidth load. So it was experimental design. Great question. Anyone else?
>> All right. Thank you, Ariana. Thanks, folks.