Social Media OSINT Without the Indigestion

Name: Social Media OSINT Without the Indigestion
Uploaded: 2019-11-01
Duration: 51 min 22 s

BSides DC · 2019 · 51:22 · 174 views · Published 2019-11

Speakers

Mark Orlando, Ryan Shaw

Tags

Category: Research

Topic: OSINT, Threat Intel

Team: Blue

Style: Talk

About this talk

Finding signal in 100,000+ security Twitter accounts, thousands of blogs and podcasts is overwhelming—most infosec social media content is noise or rebroadcast. This talk presents data-driven research and open-source tools to identify authentic influencers, practitioners flying under the radar, and actionable original content. The speakers share findings from analyzing 25 million tweets and demonstrate practical methods for blue teams to derive contextually relevant intelligence from social media sources efficiently.

Original YouTube description

By our count, there are 100,000+ security related Twitter accounts, 2000+ blogs, 1000+ conferences/events, 75+ podcasts, and countless other social media sources. The momentary euphoria of catching up on your Twitter feed hardly alleviates the more frequent anxiety of being behind on infosec “news” when work and life get busy. While there are many tools for aggregating and searching social media content, none of them are designed to identify and extract quality data for a particular topic. Our research shows that only 30% of Tweets by infosec-focused accounts are original content and only a fraction of those provide actionable information.

Are you new to security and want to know where to find the most original and timely social media posts? Do you want data-driven answers to who the real influencers are in our field? What about those practitioners who are doing great work, but are flying under the radar? In this talk, we will demonstrate tools we have built to address these questions and derive contextually relevant value from more social media sources in less time. We will also be sharing details about soon-to-be-available public access to the tools and plans for ongoing feature additions and refinements. With so many people doing and sharing amazing work, why miss out on content that would be impactful to you because you weren’t following the right person, had a busy day, or didn’t have the budget or time to go to a conference?

Mark Orlando (Founder at Bionic): Mark started his security career in 2001 as a Security Analyst, and since then has been both fighting for blue team resources and trying to automate them out of a job. He has built, assessed, and managed security teams at the Pentagon, the White House, the Department of Energy, global Managed Security Service Providers, and numerous financial sector and Fortune 500 clients. Short on patience and attention, Mark is constantly working on new projects to improve defensive security through automation and other short cut-y things so defenders can be more agile and creative. In 2012, Mark designed and launched a Managed Detection and Response (MDR) service offering and helped to invent an automated cyber threat hunting technology, both of which were later acquired. He enjoys teaching and learning from others but spends far more time doing the latter.

Ryan Shaw (Founder at Bionic): Data-driven security has been Ryan’s passion for 20 years. From IDS analysis using Network Flight Recorder (NFR) and being one of the first handful of certified SANS professionals in 2000, to construction of an enterprise-wide email analysis platform for the Transportation Security Administration and overseeing development of a patented threat intelligence hunting platform for an early Managed Detection and Response (MDR) provider, Ryan continually mines security insights using readily available data. Ryan enjoys building and leading teams both to explore new frontiers and to look for missed opportunities in well-traveled spaces. Ryan is currently co-founder of Bionic, a startup that brings advanced security operations to the 99%.

Transcript [en]

BSides DC would like to thank all of our sponsors, and a special thank you to all of our speakers, volunteers, and organizers. Good afternoon, everybody. My name is Mark Orlando, and today my colleague Ryan Shaw and I are gonna talk a little bit about social media OSINT without the indigestion. We called it that because I think this is a topic that has gotten some exposure. There have been other talks, other write-ups, some books written about using social media for OSINT collection. So hopefully the tack we're gonna take today, talking about it mostly from a defensive standpoint, will be a little bit different than what you've heard before, or at least

hopefully we'll make it interesting. So, by way of introduction, some quick background: both Ryan and I come from a security operations background (building, managing, strategy, automation, R&D), all to support essentially a blue team kind of mission. We've developed some custom technologies, holding patents on some of them, and we both now run a consultancy for blue teams and SOCs called Bionic, where we do this kind of work and consult on this kind of work, among other things, on a pretty regular basis. So first off, I just wanted to set the ground rules for the talk and talk about what it is and what it isn't. This talk is principally going to be

about finding the value where infosec and social media meet. And yes, we will be talking about influencers. There's really no good word for that. "Thought leaders"? That's horrible, right? I don't really know, but basically, people that you want to get information from, that you trust, that are trusted sources of good, actionable information that helps you in your mission. This talk is not going to be hardcore OSINT methods for intel analysts. Like I said, there's already a really great body of knowledge out there; that's just not going to be the focus today, although I think there will be a little bit of overlap. Okay, so first off, who here is on

social media, like, uses it pretty regularly? The rest of you are lying. It's okay, that's fine. Keep your secrets. So why bother with social media for cybersecurity, right? We've got all the cool tools in the world, we've got intel feeds, we've got all kinds of shiny, fun toys. So why bother going on Twitter to try to gather useful information? Why spend the time? In my experience, and our experience, I think there are a lot of smart people there, right? Sharing, posting. There's some positive discourse happening, so there's a lot of value, I think, to be mined from social media. And I think especially over the last couple of years, personally, again,

I've seen a dramatic increase in the researchers and practitioners and other folks who really have valuable things to say sharing information there. There's a relatively low barrier to entry, right? It's a relatively open community in terms of being able to access information and engage in discussion with other individuals there. If you don't look at it necessarily as a resource for collecting all the things (and we'll go into the signal-to-noise ratio and cutting through that a little bit), if you understand that it's not necessarily just a green field to be mined for data and raw information, it can be very useful. I think it's a good place to go to understand kind of evolving

tactics, techniques, and procedures, you know, new vulnerabilities, new methods on the defensive and offensive side. It's a good place to understand kind of non-obvious impacts to your organization. So in a few minutes we're going to talk about using your threat model, for whatever you're defending or whatever you're doing, as a point of reference for ingesting information and raw data from social media. And one of the things that I really like it for is its immediacy, right? Really long, detailed technical reports and write-ups are great, long-form blogs are great, reports and other things, all very useful, right? But in most cases that stuff doesn't come with the speed and the immediacy, sometimes, you know, for

good or for bad, that social media does. So it's a nice, quick way to get at some potentially useful and actionable information. Now, the downside, right? There's a lot of noise there. I think all of us can understand there's a lot of noise on social media, getting back to that low barrier to entry. So our approach with it is to look at it as a very large, robust data set. Potentially not the highest-fidelity data set, but a robust data set. There are existing utilities, tools, and third-party services that you can use to do things like indicator scraping from social media. We're gonna talk a little

bit about that today; that's not necessarily our focus. Maintaining that separation between personal consumption and work-oriented consumption for your intel sourcing is a really good idea to start to cut through some of that noise, trying to get to a targeted kind of following. So it's not only about cutting through the noise, but it's also about understanding what those trusted sources of information are. What are those data points, whether they're queries or hashtags or accounts or trends, that you can use to get to that high-quality information and data? Leveraging embedded capabilities like lists or channels in some of these services, you

know. And then, frankly, just dive into scripting, right? One of the really nice things about a lot of these platforms, although it's perhaps not as open as it used to be in many cases, is you can do a lot programmatically. And I'm sure some of you in the audience are doing a little bit today, if not a lot, programmatically, using APIs to mine the data and slice and dice that however you want. That's largely been our approach as well, just to get the most flexibility out of it. So those are all good ways to manage the overload. Another challenge that we've got to navigate in this area, and we'd be remiss

if we didn't talk about it, is bias. There's a lot of really good material out there on bias. Chris Sanders has done a lot of good research and writing about it from a defender and analytic perspective. But I took this quick overview from a writer named Buster Benson, the Cognitive Bias Cheat Sheet. These are, I think, some of the most common kinds of bias that we encounter, particularly in security operations or cyber defense. We run into this a lot, whether it's dealing with social media or any other data set, right? There's either too much information, which causes you to skip over things or make assumptions or just focus

on whatever is most interesting or shiny at any given moment. There might be insufficient meaning or lack of context. Especially for short-form social media, you see that a lot, right? Somebody just throws something up, you don't really know what the validity is, there might not be a lot of context in it, and it can cause you as a consumer to fill in some of the gaps, right? And again, make some assumptions, draw conclusions from a very small data set. Not necessarily the best thing for an analyst. Insufficient time and resources, and that kind of cuts to the heart of what we're talking about today, right?

What are some of the ways that you can use automation to mine the data and get meaning and get value out of this large, diverse data set, and not take shortcuts, not burn resources, not come out with data that's not actionable, not quality? And then insufficient memory, right? We're human. We can only retain so much information, so when you reach those limits, it's kind of natural to say, well, good enough is good enough, or to disregard some of the inputs you've received. That's just human nature. So these are some of the pitfalls, really, of trying to do analysis with any large data set, trying to come

to get information. Particularly when we're using social media for OSINT purposes, or even just news aggregation, the particular bias that comes into play is availability bias, right? And that's basically just saying that things that are more memorable come to mind more quickly, and they can cause you to make false assumptions about the larger data set, right? So I'm sure none of us can think of any parallels outside of the security community, especially in social media, where people just say things and it's inflammatory and it causes, I don't know, a huge part of the population to just make assumptions about a large data set. I know that none of us can think of any good examples of that, but

it does happen. Okay, so I promise in a few minutes we're gonna start digging into the technical part of it. Ryan's gonna talk about walking through this data set and some of the things that we're doing to try to get this big beast under control. But before we do that, I want to talk about how we approach this challenge and how we incorporate this into security, and specifically into security operations. And really it comes down to your threat model. I think social media is a really good source of information for this external discourse, but it has to be somehow tied to your threat model. If it's not, you're

not gonna get the transparency that you need, you're not going to get measurability, you're not going to have any of your conclusions be aligned to the business or what your mission is. And if you're not doing that, you're just kind of mining the data set just to do it and see what's interesting, right? You have to start off with a good threat model. You have to understand how your organization generates value, particularly if you're using social media to understand how people are talking about the organization that you're charged with defending, talking about how they might target it, talking about threats that aren't specific to you but might impact you. You have to

understand how your organization generates value and how that value could be disrupted or hijacked by a malicious party, right? If you don't understand that, you're probably not in the right starting place for doing any analysis, much less social media analysis. And that's really going to shape everything you're doing, from intelligence collection and applying that intelligence to your operation, to picking technologies, to actually doing your analysis, incident response, you know, everything else really has to flow from that. Okay, and really that can impact what your use case is. So OSINT is a hugely broad discipline, and we know people, some of whom are standing in this audience, colleagues who are using it for

everything from seeing who's trying to run disinformation campaigns to try to maybe game a system. If your business, or part of your business, deals with online ratings or reviews, I think a lot of us are aware of concerted campaigns to impact some of those services and sites. Social media is a good place to mine to understand where people might be doing that. You know, leaks, doxing, that kind of stuff. Understanding maybe where people are registering, for example, false accounts to try to socially engineer your employees. I mean, there are numerous different use cases for this kind of work, and we're gonna zero in for our

case study on really one specific service and one particular set of use cases, just because it is so incredibly broad, but we want to illustrate our approach with one of these things. Okay, I don't want to move forward without also acknowledging that there are lots of commercial threat intelligence services, feeds, third-party sites and services, public repos, news aggregators. I mean, there's a lot of stuff out there that exists for you to do this kind of intelligence collection or even just analytics. But what we found with most of those services, particularly where a lot of the social media APIs are starting to be narrowed and more heavily monetized, so there are a lot more

restrictions with what those third-party sites and services can do, is that we didn't really get the flexibility that we wanted to be able to slice and dice the data the way that we wanted to. And so we found that a lot of these, you know, they're very good for the purpose for which they're built, but if you're trying to repurpose them for customized intelligence collection or customized information collection, there are quite a few limitations. And so it just pays to really be aware before diving in on one particular third-party service. For our research we looked at blogs, you know, you name a major social media site

and we probably looked at it and messed around with the data set. Twitter, we found, was the one that I think is not only most active right now, at least in a public way, but also the easiest to navigate programmatically, and so that's the one that we focused on. Today I think we're gonna talk a little bit more about that. So as we narrowed down where our focus was gonna be with doing this kind of work using some of these social media sites, these are really the key questions that we wanted to ask to drive our research. And that was, you know, where can I find the most and the best,

most quality, original information and timely information? I'm sorry, I'm gonna use this word again: who are the real influencers in cybersecurity? And I don't mean the kind of, like, clever pictures that we're including in the slide deck, but I mean people who really do have something original to say or contribute that can help you in what you're doing day to day. Who are those people? How do we find them? It's not always about followers. It's not always about some of the stats that are right there on Twitter or other sites. We're going to talk about how to get through that to that group of people, because there are

some that are flying under the radar. And so what are the information sources that are under the radar, that maybe are not going to be trending, they're not going to be top followers, they're not going to show up in any of those third-party lists or services, but in terms of ratio, like signal to noise, there's a lot of signal in some of these accounts and in some of these queries that you can narrow down to? So I'm gonna hand off to Ryan. He's gonna talk about how we dove in, by way of a case study with Twitter, and how we dug through that dataset. Can you hear me

in the back? Okay. So as Mark alluded to, Twitter became one of the most obvious choices for us to dig into, based on the quantity of information but also what we could find at the tail end, which is the quality of information. We took an approach to leverage their API, which by far suited our programmatic goals. We did a mixture of local and AWS-based hosts using the EC2 free tier. I think we went over in a handful of months; really, this has been going on for about nine months, and we probably went over the free tier limit probably half that time, to the tune of $6 a month. So we were able to do a lot of data gathering that

you'll see some numbers on here in a minute, in that nine-month period, and we were able to do it at a very low cost. And we were also leveraging Python and shell script, so a low barrier to entry to be able to do that. The Python was largely focused around existing Python capabilities for Twitter's API, and the shell was a lot of cleanup focused around regex. When we started to dig into the data, the question became: where is our starting point, right? So if you are a Twitter user, or you're in security and you're considering trying to use Twitter as a mechanism to gather information, where do you even begin? So we kind of

took a tack of: all right, Google, who are the top Twitter accounts to follow? And Google's all too happy to show you fifteen blogs that have a top 30 or top hundred, and a lot of the same names bubbled to the top, some of whom you'll see here in a minute. But then we pivoted off those. Leveraging the API, you could dig into who's following those accounts, but also we knew of some specific accounts that worked in the blue team and defense, malware analysis, and IR space, and we then looked at who was following them and who they were following. So we pivoted in a couple of different directions off this, and we slowly built

this list of 170-plus thousand; I think over time it actually grew to over 200,000 users. So from there we grabbed profile information and we started to grab tweets in roughly a 200-per-user quantity. And so for any given user that's going to stretch back as far as two hundred tweets would take us. You'll see for our influencers that that range can be a matter of days, and for some folks it's a matter of years, so it proved to be an interesting study in that. Now, is that a great sample size when you're trying to assess a given user? Probably not the best. And so an extension of this

that we'll have moving forward is broadening the scope over a longer period, both in terms of number of tweets and time itself, and then actually digging in from there and seeing where keyword analysis on the profiles as well as the tweets validates: are these people actually in security? Because of course security people don't only follow, and aren't only followed by, people in security. So we had to then whittle down that massive haystack, where we had thousands of politicians that were being followed and in some cases following. We have obviously tons of media, some of whom are more engaged, and I would consider in the security community; others are more

consumers. So we had to ferret that out through both profile text and actual tweet text, to get into: just because I say I'm a security person, leet hacker, in my profile, if all my tweets are pictures of cats, am I really offering anything of measure in security, for example? Yeah, don't pull my account yet, I haven't cleaned it. So ancillary to all that was a somewhat related extension capability that we built to scrape domains. That sounds like a crawler, but it touches a single domain page, right? So you hand it a domain, and it's going to go to the default page and scrape any social media links that are on that page. It's not

spidering, it's not following links on the page, so it's a low-touch thing. The asterisk there represents the fact that I did find one site that someone had in their Twitter profile that immediately reported you to an IP abuse list if you visited the domain. So I had a great back-and-forth with AWS about why they shouldn't be concerned. I said, really, no, just go to the site yourself, you'll see. And they're like, no, we don't want to. So you've still got to be careful. Even if you're not scraping, you want to make sure you're on the right side of things. So as we dug into this, we got to the point of, what is, you

know, what is the reality of what we're dealing with if we're trying to get to meaningful content? There's no meaningful-content tag that everyone uses (and we'll talk a lot more about hashtags as we go through), right? It comes in all different shapes, all different forms. Different people have different backgrounds. We're going to talk about indicators and defanging of indicators, and people use different tacks there. And then people just generally don't only talk about work all the time, or specific malware or threat intel, on their social media. So how do we actually balance the noise-to-signal ratio more in our favor? It's never going to be eliminated, but the key is getting it so

it's consumable. And this is one of the things that I struggled with. I'm kind of a goal-oriented person; I like to check boxes. So when I get into Twitter and I'm following even a hundred or 200 users and I can't get through everything before it says I have more messages, that to me is a daunting thing. And heaven forbid I'm actually trying to get meaningful content out of it, not just see what the latest talking point is and have other people go, you know, "this," as a retweet into the Twitter echo chamber. So an example of this is, as I narrowed down the accounts to what we considered really valuable for intel

discussion, one SlideShare profile, as an example, showed up four times more than any other, and for some reason it was this B2B contacts account that has nothing to do with security. 2,200, I think, posted SlideShares, most of which I think are very short; they're probably just content that someone took and posted. And it's really interesting that this is the thing that still bubbles up even when you have a refined data set. So it's not going to be perfect, but the key is, if it really needs to get closer to perfect, you're going to have to do some scripting. It's not going to be something you're certainly going to do in the native client or a third party

car. So as we talked about Twitter, some user classifications bubble up that you'll start to see, and this is what we determined. Hopefully you don't judge those around you, and when you look at Mark and me, we certainly didn't judge each other around what our accounts are. But the person that we would all really like to focus on is the community builder. So this is someone who's not only originating some content and sharing meaningful things, but they're also engaged in the discourse with people; they're contributing. You see a lot of this out there in the infosec community, where people, you know, even if it's down

to the "hey, I have someone who's new to the community and they're looking for a job" or "I got let go," and those are beautiful stories where you see people actually engage to support each other. That's Twitter to me at its finest. And then you drop off a little bit, and it's not that these are necessarily bad roles that people can fall into as far as their nature, but the soapbox person is really someone who's going to be more just talking at people. You're going to look on there and see tens of thousands of followers and following three, and it's probably, like, family. There's the ghost, which

is someone who's been just off the system, off the grid, for a while. Maybe they were active for one period of time and either life got busy or what have you, and they go dormant, or they never were really active. The lurker, which I've certainly been for most of my time, is someone who's just taking in what other people are putting out there. Again, not bad things, just different ways of going. There are people who more consume content and then broadcast it back out there. Again, they're getting that message out, so it's not a bad thing to fall into the echo category. The commentator is very similar to the echo category, where you're rebroadcasting

content, but you're typically offering some kind of commentary, more than just the arrow pointing down, "this." You know, I've done that. I've done the arrow down, yeah. And then the last one, fraud, is kind of a hard word to have in there, but we do see evidence-based results where people are taking other people's tweets and repurposing them. It doesn't show up as a retweet, it doesn't show up as a quote, it just shows up as original content from that person, when clearly it is the exact same tweet we saw previously from someone else. And so that's disappointing. Sometimes that can happen just in the course of cutting, pasting, and doing all that. I get it, but

if you see it repeatedly, you've got to worry. I'll breeze through this; I don't need to educate everybody on Twitter. The first four categories really are your standard. When you get into retweeting yourself, that's where people are generally building threads, where one tweet is not enough to cover a topic, or where someone engages them later and they go back and respond to their own original tweet to clarify some things. So that's where thread building really comes in. Or, sorry, that's reply self. Retweeting yourself is more just, hey, I said something important, did you miss it? I've seen a lot where it's, hey, buy my book, right? Even in our community we see a fair bit of

that. Quoting yourself, again, that's doing the same thing as retweeting, but you're actually adding some new thought: hey, I said this two years ago and it's still relevant. And then the manual retweet is really what that fraud category was built up based on. So Mark talked about bias, and in everything we do, at least work-wise, hopefully not in all our lives, there's some element of bias that comes into play, and we wanted to acknowledge that in this research. Just by the nature of our original selections of where we started with the accounts, and then the resulting accounts that we followed, we ended up with mostly English-speaking accounts. You'll see results that go

outside that, but it's not a well-balanced global study. The keywords and phrases that we did a lot of searching based on are well-known things, so this is not originating new keywords. It's not at this point doing natural language processing to identify that, if you see certain lead-in text, the next phrase might be some new weapon or malware family. That's where we'd love to be, and that's where we're aiming to be, but that's not what you're going to see represented. So it's known knowns in a lot of cases that we search for. Even at 25 million tweets that we collected, it's a small sample size. That really isn't that much content. And then of course there's

noise in the data set. As you grab these keywords, you'll be probably not so surprised at just how many other instances of these keywords there are. If we're looking for raging panda, it's shocking how many people go to the zoo and describe what they saw as a raging panda. We'd like to think that there's Twitter and there's infosec Twitter, right, and that it's its own domain, but that's not the case. All of our stuff gets cluttered in with everybody else's. So what do we do? The most recent data set that we originated, what you're going to see in the slides, came from some work early this month; we wanted some currency. So we have 25

million, almost twenty-five and a half million, tweets that we collected, coming from 177 thousand profiles. We leveraged 220 somewhat generic security keywords and phrases, many of which you'll see on a later slide. As well, there's the APT Google spreadsheet out there that talks about actor campaigns, actor tools, actor groups, and general names of all these things. We collected terms off there, whittled out the stuff that we knew was going to be way too noisy, and leveraged a bunch of those terms to, again, see who's talking about the things that people would generally consider really interesting. And then you're gonna get ancillary benefits by following those people. So what did that

result in? Out of those thousand sixty-eight terms and keywords, we got about five million tweets that matched, right? So you can imagine that "infosec," "cybersec," "malware," the more generic terms, made up the brunt of that. So as we cleaned up some of that noise, we got down to a hundred thousand, comprised of thirty-four thousand users. Then we dug in a little more closely on just the APT terms and eliminated a lot of the two hundred twenty more generic terms, and that got us down to fifteen thousand users. And finally we said, okay, well, who's using more than one term in these two-hundred-tweet samples? And we got to

about sixty-seven hundred users. So we said, okay, that's a pool we can work with more closely. As we looked at those users, we looked only at their original tweets over the last 45 days, because we didn't want to catch stuff that they were rebroadcasting; again, to keep a manageable data set. We looked at who they were @-mentioning, what they were hashtagging, the URLs they were embedding, as well as other traits of those tweets. So what does it look like when we actually dig in? We pull back data from the API. The API does not serve data like this. You've got the tweet at the bottom;

that's how you would see it in your browser. What you're seeing up top is basically our flat-file data store of that same tweet. So why the tilde-based delimiter? It goes back 15 years to when we were working in data sets, and that's one of the few things I found that would never show up in URLs or other data we were working with. It just worked, and I've just leveraged that forever. I highlighted a couple of things. One, starting around where it says "wrote a post tonight": that's the main part of the tweet text, through to the end of the second line. You'll see that Twitter takes every URL in a tweet and shortens it

to t.co. They do provide the option in the API to pull back the actual URL, so you can have those, and those are separated with their own delimiter: less-than, tilde, greater-than. Same thing with the hashtags. They don't encode those in any way, but at least you have those in separate fields, which really makes digging into the data a lot quicker. Going back to that influencer discussion: who are the most followed people? Just based on profile follows, we had these 20 folks, and I apologize to the folks in the back. Sadly, I have to say that pod2g was not on my radar at

all. If you're not familiar with pod2g: a lot of discussion about iPhone exploits, certainly a year or two ago, so not something that was a focus for me professionally or personally. But you see the range: the most recent two hundred tweets from pod2g spanned 1,989 days, so not a lot of tweets, spread out over five years. By comparison, if you dig in, I think the shortest ones we have there are SwiftOnSecurity and Lesley Carhart, hacks4pancakes: four days to cover 200 tweets. Now, that's every type of tweet. This isn't just original content; in this case it's every type of tweet: replies, quotes, you name it.
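The per-account numbers described here (how many days an account's most recent tweets span, and the mix of tweet types) could be computed along these lines. This is only a sketch under assumptions: the record layout and the field names `created_at` and `kind` are hypothetical, not the speakers' actual flat-file schema.

```python
from collections import Counter
from datetime import datetime

def account_profile(tweets):
    """Summarize one account's sample: days spanned and share of each tweet type.

    `tweets` is a list of dicts with hypothetical keys:
      created_at -- ISO 8601 timestamp string
      kind       -- "original", "retweet", "reply", or "quote"
    """
    stamps = [datetime.fromisoformat(t["created_at"]) for t in tweets]
    span_days = (max(stamps) - min(stamps)).days  # whole days between oldest and newest
    kinds = Counter(t["kind"] for t in tweets)
    total = len(tweets)
    shares = {kind: count / total for kind, count in kinds.items()}
    return {"span_days": span_days, "shares": shares}

# Tiny made-up sample standing in for an account's most recent 200 tweets.
sample = [
    {"created_at": "2019-10-01T09:00:00", "kind": "original"},
    {"created_at": "2019-10-02T12:30:00", "kind": "reply"},
    {"created_at": "2019-10-05T08:00:00", "kind": "retweet"},
    {"created_at": "2019-10-05T18:00:00", "kind": "original"},
]
profile = account_profile(sample)
print(profile)
```

A short span with a high reply share suggests a very active, conversational account; a span of years with mostly originals suggests an occasional poster.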

Oh, by the way: all of the slides, as well as all of our raw data, our analyses, and tips on how to recreate what we did, are all going to be on a GitHub that we're sharing. It's not there at the moment; it will be as we start rolling out data. I think we have about 18 gigs of data we're going to put up there for people to have, as well as the scripts to recreate this entire environment on their own. Some other things to highlight: Schneier is maybe more of the soapbox, where it's only original tweets for the entire 200. So again, not a bad thing, you know,

you're going to get perceivable value there. You've got some folks who are more broadcasters of others' content via the retweet method. You've got folks who are more engaged, and that's, again, Lesley Carhart and Dan Kaminsky, with high levels of replying to others. So all kinds of different personality types are represented here, and you see that we've got a number of folks in the community with over hundreds of thousands of followers. [Applause] So as we look at tweets from a defender mindset, which is where Mark and I have come from historically, how can we derive value? The most obvious path that people go to is,

you know, where can I get indicators? That's just where people naturally fall: how can I collect more things to go plug into my SIEM, to go hunt for, to find out I've got Emotet? Great. I think we all know that probably one one-thousandth of indicators, if that, are actually useful to any given person at any time, certainly without any context. So that's really not where we want to end up. The goal is probably to have that analysis, to understand what the actor TTPs are: what are some behaviors I could start to build rules and logic around for my SIEM, my SOAR, or other platforms, and actually start to detect these things regardless of what the

indicators are the infrastructure is all going to change so some people are actually tweeting more that top level where we're talking about Sigma rules we're talking about yard rules so that was one of the focuses that we had what going into how do we get more to that actual net result so again there's a lot of ways you can get to that you can do keyword searching you can do hash tag lookups you can try and just carve stuff out of tweets not something you're gonna do in a Native Client but we found that both Sigma space rule as well as yarra space rule right case-insensitive provided a lot of relevant stuff with very low noise that's not to say if you

then go follow those accounts you're only going to get signal; you're going to get whatever noise comes along with that. But those are very good searches. If you go into a native Twitter client and you get value out of Sigma or YARA rules, those are searches that are probably going to have a lot fewer false positives, and you're going to find other places. As you get into blogs, obviously, you're going to get write-ups. This is coming off to the side from Twitter, but as you look at folks who are writing good content in a blog, their Twitter accounts tend to echo when

that content is available, as well as some of the findings in there, so that's another great source for those write-ups. Getting into indicators, or raw analysis: those are easier, in theory, to come by, looking up certain types of tags such as "c2:" and "IOC:", or specific keywords for malware families. If you look at njRAT, ScumBots' Twitter account is going to come up. You're going to see them; they skewed everything because they use not only the c2 tag but also malware family names, so they show up really high on our list. The same goes for looking at sandbox URLs, so if you're interested in seeing what's

going on out there in the sandbox world, there's a lot you can do to search there, and a lot to actually get from it. So what are some actual numbers? Over the last 45 days of content (we trimmed our content down to keep it current), again looking only at original tweets, you see some basic numbers around a few searches. You've got 331 hits for YARA and 74 for Sigma; 250 different users made up those 401 hits, and there were 374 embedded URLs. Typically, when you get a hit on a "yara rule" or "sigma rule" phrase, the URLs are not, you know, go shopping at Amazon; it's probably a write-up or

something meaningful, so there's good content here. Again, 374 URLs in the last 45 days, so that's a pretty high signal ratio for those tweets. What were the hashtags that came up most frequently? As you'd expect: #yara; #malware, noise; #DFIR, pretty good; #sigma, good and noise, as you'll see in a minute; #cybersecurity, noise; #infosec, noise; #threathunting, better; #APT, hit or miss, better with a number attached to it; #SIEM and #security, again hit or miss. I threw this out there because I thought it'd be interesting: out of those URLs, what was being linked? You see some VirusTotal things, and you see SOC Prime, which is a big player, certainly in the

Sigma rule space. Then the GitHub accounts: tons of people are moving to using GitHub more; it's a social platform, right? And I say moving, but there are people who have been doing this for a number of years. Certainly when you're getting into YARA and other rule logic, write-ups, and other things, GitHub is huge. We're going to dump a whole slew of GitHub accounts. At first it's probably going to be exactly that, a dump. What we're going to look to do is actually navigate those and break them down by discipline. That's where we're looking to go with all of our content, so that people can say, okay, I'm either new, or I'm working in a certain discipline, or I'm looking to

move to a different one, and be able to go in there and say, okay, I'm looking to get into DFIR, so here are things specific to DFIR: SlideShares, Speaker Deck, speaker broadcasts, you name it. We're going to aim to have our GitHub tie back to those things. If you are focused on IOCs, forget all the bad things we just said about IOC scraping; you guys are doing great work. It's a volume thing. With that in mind, trying to keep those numbers down: when you look at fanging of indicators, which is a great way to keep them from just being red-flagged by any number of systems that

might already be aware of a given indicator, the most common things we're seeing are bracket-period-bracket, bracket-d-bracket, and hxxp, so simple ones there. For the tags "c2" and "IOC" you see some numbers around the volume of use there; they're very similar numbers, 3 and 24, actually, on the URL count, and I think that is actually off, or low. But who is using those things most frequently? As I mentioned, ScumBots. Again, we had two hundred tweets max from every account, so clearly we're getting some redundancy from the ScumBots account, and that's coming from both the use of c2 and some fanging. You see a number of other good accounts.
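To illustrate the fanging conventions just mentioned, here is a minimal sketch of converting between fanged and live indicators. It assumes only the three patterns named in the talk, written here as `[.]`, `[d]`, and `hxxp`; the function names are hypothetical, and real-world tweets use many more variants than this handles.

```python
import re

def refang(text):
    """Turn a fanged indicator back into its live form."""
    # "[.]" and "[d]" both stand in for a literal dot.
    text = re.sub(r"\[(?:\.|d)\]", ".", text, flags=re.IGNORECASE)
    # "hxxp" / "hxxps" stand in for "http" / "https".
    text = re.sub(r"\bhxxp(s?)://", r"http\1://", text, flags=re.IGNORECASE)
    return text

def defang(indicator):
    """Fang a single indicator using the bracket-period-bracket and hxxp style.

    Meant for one indicator at a time, not running prose, since every
    period in the input is bracketed.
    """
    indicator = re.sub(r"\bhttp(s?)://", r"hxxp\1://", indicator, flags=re.IGNORECASE)
    return indicator.replace(".", "[.]")

fanged = "hxxp://evil[.]example[d]com/payload"
live = refang(fanged)
print(live)
print(defang("https://evil.example.com"))
```

Normalizing both bracket styles to one live form before counting is one way to avoid treating the same indicator as two different ones.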

These, again, are pretty good signal accounts if you want to be following them. We've got some lists that you'll see later that are tied to my Twitter account, so you can go look at those and pull them. I recommend the one that is tied to this, the IOC and fanged-indicators tag. You're probably going to get a little more noise if you go to the one that's focused more on the YARA and Sigma rules; I would actually do those searches explicitly. It's a little bit different when you look at the FQDNs: you're seeing a little more sandbox activity, you're seeing the pastebins, some PhishTank, a lot of phish activity. And the hashtags: again, you're getting a

fairly generic set of hashtags early on, but then you start to get into some malware families, and those can be valuable. Hashtag #rat, your mileage may vary depending on the country, but certainly as you get into LokiBot, Agent Tesla, those types of things, they're going to be more valuable to dig into. To that end, this scale is based on prevalence for those threat hunting keywords, so this is largely the APT Google spreadsheet terms, extended slightly. Some things get skewed: APT1, 2, 3, and 4 all get skewed, because if I search for those, I'm going to catch APT11 through 19 as part of APT1, right?

So APT11 and APT19 will have their own numbers, but APT1 will be the collective of all of them because of how the regex was working. Even in scripting you're going to run into some challenges there. Fortunately there wasn't too much noise out there for Fancy Bear, but there's a lot of value to be found out there. Looking at sandboxes, I thought this would be an interesting study. There are probably, I want to say, five, six, seven more sandboxes that are live out there, and probably even more than that; they're not getting tweeted about, certainly not recently. Over the last 45 days, the left two columns cover sandbox

mentions; the right side is across the entire data set. So again, for some users that's four days, for some users that's five years, and you will see some inflated numbers when you go across the full data set. One interesting note was that recently ANY.RUN is coming in at about 720 mentions, whereas VirusTotal is at 511. Historically, VirusTotal was much higher than ANY.RUN. I don't know if this is a trend, that people are moving more to ANY.RUN as a sandbox, but it seems like, at least in recent times, that is more the case. You also see some of these more fringe ones down at the bottom that were doing more specific analysis around Android, PDFs, et cetera; those

have kind of dropped off the radar. There are some country-specific ones in here as well, and there are probably many more that we didn't hit, just based on our user base and our ability to get through translations. When I talked about generic search terms, one of the things you're going to see is the prevalence of "malware." Any media story that talks about InfoSec is, of course, going to have "malware" in it; that just shows up, so these things all get inflated. The value for the keywords here is more in the ones hiding in the back: things like BSides, threat hunting, DEF CON, red team. They're the tiny ones that

I can't even see. I'm going to make all this available so you can actually look at the numbers, but this is more a case of your mileage may vary, so be aware of what it is that you're looking at. Searching for #malware certainly is going to bring back our good marketing friends' work. I think the peak for hashtags in one tweet that we have is 47. This one is not quite there, but there are a number of accounts out there, certainly small consultancies, many of which are overseas (and I can't speak to how much work they're actually doing), that use this as a means to get views. So unfortunately hashtags

are not this pristine thing that is protected; the well is very quickly poisoned, so you've got to pick and choose. It's also a means by which, if I'm a bad person (and I'm not saying, by the way, that these accounts are), I could do that if someone's starting to leverage hashtags surrounding what it is I do, say, if I don't want people creating YARA rules and being able to easily share them. I'm sure there is this nice woman named Yara who has fans following her and posting pictures, but #yara, trust me, you're going to get a lot of noise if you try to search that.
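Two matching pitfalls come up in this part of the talk: a search for APT1 also catching APT11 through APT19, and bare hashtags such as #yara being swamped by unrelated posts where the phrase "yara rule" is not. A rough sketch of one way to handle both is to require that a term not be embedded in a longer word or number. The function names and sample tweets are made up for illustration; this is not the speakers' actual regex.

```python
import re

def term_pattern(term):
    """Match `term` case-insensitively, but only when it stands alone.

    The lookarounds reject matches that are glued to a letter or digit,
    so "APT1" no longer matches inside "APT11".
    """
    return re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )

def count_hits(terms, tweets):
    """Count how many tweets contain each term at least once."""
    patterns = {term: term_pattern(term) for term in terms}
    return {
        term: sum(1 for tweet in tweets if pattern.search(tweet))
        for term, pattern in patterns.items()
    }

tweets = [
    "New write-up on APT1 infrastructure",
    "APT11 and APT19 overlap in tooling",
    "Sharing a YARA rule for this loader",
    "Loved the #yara concert photos",
    "sigma rule for suspicious PowerShell",
]
hits = count_hits(["APT1", "yara rule", "sigma rule"], tweets)
naive_apt1 = sum(1 for tweet in tweets if "apt1" in tweet.lower())
print(hits, naive_apt1)
```

The naive substring count for APT1 picks up the APT11/APT19 tweet as well, which is the inflation described in the talk; the bounded pattern counts only the standalone mention.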

So unfortunately, while it does show up in legitimate things, it's overwhelmed by the noise that's out there. Same thing for Sigma: apparently Sigma is something having to do with camera equipment, because there's a wealth of Japanese photography that's hashtagging #SIGMAfp and #SIGMA. So, circling back to our well-known friends: how did they perform when we looked at those two hundred and twenty malware or security keywords, as well as the eight hundred and eighty-ish APT-based ones? You see in the left-hand column we got some decent hits, with some surprises at how low the numbers are. But when we talk about things that, again from a defender perspective,

show a level of technical acumen in what you're doing in your day-to-day work and what you're actually sharing, the relevancy and currency, it's really almost negligible. There were so few that I figured I might as well show them. On the Kaspersky side: Turla, Zebrocy, [unclear]; and for Schneier: Cloud Hopper, NotPetya, and Trident. So again, context: I'm not even sure, they're potentially pointing to additional write-ups on those, but there's not a whole lot of discussion. If your focus is high-end research against advanced threats, you may want to look beyond this. There's plenty of value in those Twitter accounts; that's not what we're trying to say. So one of the things we mentioned

is that there is no template when people share. But here's what we're seeing, and we're going to work to try to help shape that and change it. I'm not trying to malign poor Josh Meltzer; we hired him years ago, great guy, great account to follow. This is an example: he really started to get going on Twitter and blog posts recently on the work he's doing, and as part of that, one of the people responded, hey, by the way, instead of the bracket-d-bracket, if you're going to fang your indicators, use bracket-period-bracket; YouTube will just auto-ingest that, take the brackets off, and it'll just go. And so Josh quickly changes behavior. So we're seeing some

good feedback to help shape things. I think it would help if we got to more standards in how tools are able to work with things like fanged indicators, but what would be really more helpful is, if you're going to include indicators of various types, to use certain labels or tags, and then to hopefully do that in a way where we're not going to allow that well to get poisoned by non-security things. So it's good; we're seeing a lot of positive peer pressure to get us where we need to go. Okay, so just to wrap things up: hopefully we've highlighted some of the differences in terms of formats and use cases

and analytics and consumption between these more non-standard social media data sets and something like an intel feed that might be in a very standard format and meant for programmatic consumption and processing. For us, ideally, being on the defensive side of things, this is all great: doing research is great, understanding things, discourse and discussion, all great. But at the end of the day I want to get as fast as possible to information that's actionable for me, that I can dump straight into my operation, whether it's rules, like Ryan said, playbooks, queries, hunting methodologies, things like that. So we're going to be

sharing, as Ryan mentioned, a bunch of resources to help folks do that. But once I get my list down to sources that I trust and queries that have very high fidelity and a good signal-to-noise ratio, now what do I do? This is where you can pivot into your more traditional cyber threat intelligence operationalization and processing workflows. Prioritize the intel that you have, validate it, contextualize it, pivot if you need to, and create those pieces of custom content, those playbooks, so they can be operationalized in your technology. And then, ideally, share back. I mean, that's the great

thing about these platforms (sometimes it's a great thing): you can share back with the community. Hopefully, with some of the data points we've shared today, those of you who are posting research, indicators, or other analyses and raw data can see there are more effective ways to do that and maybe less effective ways, so you can reach a broader audience and get the most feedback. But hey, it wouldn't be a party if you guys didn't leave with prizes, so we're going to be posting, like we've said a few times now, a lot of data from some of

this research, some of our scripts and queries, follower lists, and resources to the Bionic GitHub. You have to, you know, get it cleared by legal and export... no, that's true, we just haven't done it yet, but we'll do it very shortly, probably in the next 24 hours. Well, 24 to 48. 24 to 48 hours. It'll be data in the raw, so a lot of the stuff will get out there by tomorrow, and then some of it is just going to take however long it takes GitHub to accept it. In addition, on Ryan's Twitter he's got Twitter lists of followers, born out in large

part of some of the research we've done here. We'll be sharing that social scraper for domains that he mentioned, where you can pull out all the social links without spidering entire websites; just little things that are handy to have. We've talked a lot about Twitter today, principally because it was the easiest way to illustrate digging through the data set, and there's a lot of interesting data there to look at. But also, and I'm sure I don't need to tell many of you this, there are a lot of great Slack communities out there. That's another fantastic resource people don't always

think about and don't always look at first. A lot of those are closed-door, invite-only, but there's a lot of good stuff there. News aggregation sites: here are some examples of ones we find particularly valuable in terms of aggregation and providing actionable information. Even Twitch is now getting into the game, with a lot of instructional, tutorial-style videos on there that are really good. So if you're in that social space and you're interested in pursuing that more, or looking for more resources beyond going into Google and typing "how to cyber" or something like that, there are lots of

good resources out there; these are just some examples. Okay, future plans. We're going to continue this research. I think there's a lot we could do here that we just haven't gotten to yet because, quite frankly, there's so much data and so much stuff, and there are so few standards in terms of how our little niche is discussed and presented. That's kept us busy for quite some time. But we're going to continue to maintain the projects, and part of us sharing them is that, obviously, we love contributions and collaboration. I definitely plan to continue making refinements to what we're doing and making it more

specific and actionable. So we welcome you to come find us. We're also on Twitter. My feed is one of those mixed ones: you're going to hear InfoSec stuff, and then it'll be like, "I watched Watchmen last night, it was awesome," and there's going to be all kinds of stuff. But yeah, feel free to hit us up and ask any questions, either now or later. Thank you very much for coming out today. [Applause]
