
BSidesNcl 2021 Security and Data Science Colin Gillespie

BSides Newcastle · 32:44 · 15 views · Published 2021-10
About this talk
Data Science, Machine Learning, and AI: topics that aren’t often discussed with security in mind. This talk will show how a few simple hacks resulted in easy access to potentially very sensitive data. Every day we read about data science, machine learning and AI: data-driven processes, yet security is never mentioned. Many organisations that use machine learning keep this separate from standard software teams, resulting in potentially serious security holes. This talk will cover three (now fixed!) vulnerabilities. In the first example, we’ll investigate domain squatting on the Bioconductor website. By registering only thirteen domains, we had the potential to run arbitrary code for hundreds of users, in all major pharmaceutical companies. In the second example, we’ll look at techniques for guessing passwords on RStudio server instances; RStudio is the main IDE for running R code. Lastly, we’ll highlight how users can be a little too trusting when running arbitrary code from blogs.

so thank you very much for coming along for listening to me for staying put i'm one of the co-founders of jumping rivers and we are based in this building we're sort of four floors up and i also got roped into writing a book a few years ago about efficient r programming now lots of people are coming up here and saying well you know they're not a security person i'm really not a security person right so my background's in statistics and data now nobody pays me to do statistics so it's now in machine learning and ai because that's where all the money is and that sort of has implications and as i said i co-founded this

company called jumping rivers so some jumping rivers key facts right so everyone in the company uses linux so there's 15 of us and we've managed to get away with everyone from the person doing hr to finance everybody uses linux as a startup that did help a little with security of how to get people to upgrade windows and viruses and all that stuff it's not perfect obviously but it is possible we're not a water-based company regardless of the name we have no

company as such and we're the winner of the most mysterious name ever competition and we also have an amazing selection of coasters so if you want some swag there are some coasters at the front here uh we bought around about five thousand of these things two weeks before covid hit and then they were in every room of my house so they're propping up desks chests of drawers everything so grab some coasters we've got another 4980 up in the office so feel free so what does jumping rivers do well we do lots of data science we do lots of machine learning so we typically go and help companies and you know they've got data and it's like you know

we have gathered data what can we do with it and you usually say not very much because you need to do it properly but that's a whole other talk so we do lots of that we do lots of training we do lots of r and python i'm going to be telling you a bit about r because i suspect most people haven't heard of it and we do some managed services so we're a data science machine learning ai type company we're a consultancy and we do lots of training and that sort of stuff so oh one other thing we've also got a junior sort of mid-level sys admin place so if you're interested in that grab me at some point afterwards

so on to the talk does anyone know who this chap is getty images yes so that's getty on the left and mr images on the right but he also goes by the name of ryan clark and he hacked the cia and i like this picture because that's his mum taking him to court and it just astounds me that you can hack the cia and need your mum to take you to court right you know as someone who is not involved in this industry you know not a security person that shouldn't happen right you know the cia should be good right you know you might be able to hack someone but not the cia you know they should know their stuff

and i've got a very modest security goal right so i've got a very modest security goal and my modest security goal is i only want to get hacked by adults right so if you can't go for a drink with me after the end of this conference you shouldn't be able to hack my computer right this is not a challenge right i'm just putting this out there this is not a challenge right because i think we all realize that what should be a really easy target of not being hacked by a 14 year old or a 10 year old happens all the time right and that to me scares the life out of me that we can still talk about you know

nation state actors and then we get a bored not even teenager being able to hack you at times right that sort of stuff just shouldn't happen it's just bonkers anyone know who this chap is and that's not getty images' cousin either before anyone shouts out this fine man is a former greek minister of security and if you pay careful attention he's got absolutely fantastic security practices right and his security practices uh revolve around a post-it note and there's two surprising things well the first one's not going to be surprising because you know where this is going to go the first is he's got his username and password minister password12345 but the most surprising thing is why

would you have to write this down right you know if you take a step back if you have to write that thing down you're in trouble right but anyway we digress so how secure is the world of data science right by data science i just mean anyone doing data things right you know machine learning ai whatever you want to do you know how we make money so in this talk i'm going to talk a lot about r right we do lots and lots of r who's heard of this language called r oh not bad how many people have programmed with a little bit of r probably people who did a maths and stats degree maybe

so r is similar to python sort of but not really but if you have no idea what r is just think of python you know it's an interpreted language r is a domain specific language which means it's sort of aimed at doing statistics machine learning that sort of stuff am i getting a horrible buzzy thing or is it just no okay uh and so it's sort of like python so if you don't know what r is think python you know same sort of workflow that sort of you know so anyone know where the name came from so clearly it comes from the language s right so that was an obvious question you know so it's clearly derived from s

and s was a language developed around the late 1970s by at&t labs they did a lot of unixy things and it was the same bunch of people in that lab who had s for statistics at&t labs bundled that with unix and then it became popular at some point at&t labs thought why are we doing this for free that's a bit stupid started charging lots of money and then r came along r is uh the 14th most popular stack overflow tag right so you've not heard of it but by you know sort of waving your hands it's a reasonably popular language or it's a really badly documented language take your pick but we'll go for popular and

it's not a competition right it's not a competition and we're a nice bunch we don't like language wars but if we did right r beats .net sql server and excel not that we're sort of going down language wars because that's completely wrong but we are better than them in terms of stack overflow tags so it's a popular thing right so you might not have heard of it but lots of people are using it just not perhaps people that you know uh just a bit of quick background research project in auckland new zealand released 1995 r core was formed in 1997 we're now on version 4.1 base r is very stable right so i can run my really pathetic code from 2001 still using base r

python so that you know it is a well-maintained language really you know the sorts of patches that go into it are that they've got the 16th decimal place wrong in some bizarre statistical routine you know that's the sort of thing we're going for right now it's a it's a sort of language for that so let's get into the talk so one of the main ides that people use is something called rstudio right so it's a free open source product and it's got enterprise level products the person who developed this was a guy called jj allaire and he also wrote coldfusion so you might have heard of that in the past he's got a whole wikipedia page just one

of these really clever chaps that can come up with ideas and there's a desktop version you can download it install it there's a server version so you can launch it onto the cloud and that's where we're going right so this is used by lots of people right this is sort of the main ide you know it's either this or emacs right and trying to persuade someone to use emacs when they've never touched it it's hard so rstudio on the cloud is really easy to set up right you know you create a cloud account with someone who already knows everything about you google aws whatever you write a little bash if you don't know bash you can just google and you

can get a blog post that tells you exactly what to do it requires very little technical skill right so you can get this ide you can put it on a cloud and you can have almost zero technical skill you know a little bit of programming but you can just copy and paste half a dozen lines and you can get it up and running you could probably even you know get some kubernetes stuff behind it without really knowing what you're doing right you know it's not that hard sort of and someone has kindly made a docker image right so if you know a bit of docker you know people say oh docker is wonderful and then you go to docker and it says

docker is easy and if you've never come across docker five days later after you've killed the internet in the entire block you finally get your head around docker and there's an image for aws and the default username and password here for these things was rstudio rstudio and in bold letters it said well then it's quite serious don't be stupid make sure you change your username and password so what well you do the usual you go to shodan you type in rstudio that's an old image there's now a few thousand hits out there you get a whole bunch of rstudio sign-ins you get where they're from you get a username you sign in here and

you think what to do now what to do and you think well i listen to darknet diaries i know csi but i'll take it easy before we go into the hardcore stuff and so you try rstudio and you try rstudio i did this a couple of years ago and the very first one i got into and it was oh crap i really didn't want to do this and it was someone's proper serious work again if you're using r you've got data all right that data could be patient data financial data you're not doing it just for fun and games you tend to be doing it for a real purpose and to me that doesn't seem like hacking
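Editorial sketch, not from the talk: the point above is that a shipped default pair (rstudio / rstudio) was left unchanged on internet-facing servers. One way an administrator might audit their own accounts for that is shown below. The function names and the extra `admin` entry are hypothetical, and nothing here is part of RStudio Server itself.

```python
# Hypothetical audit helper: flags accounts that still use a known default pair.
# Only ("rstudio", "rstudio") comes from the talk; the other entry is illustrative.
DEFAULT_CREDENTIALS = {
    ("rstudio", "rstudio"),
    ("admin", "admin"),
}


def uses_default_credentials(username: str, password: str) -> bool:
    """Return True if the pair matches a known default (username compared case-insensitively)."""
    return (username.strip().lower(), password) in DEFAULT_CREDENTIALS


def audit(accounts: dict) -> list:
    """Given {username: password}, return the sorted usernames still on a default pair."""
    return sorted(
        user for user, password in accounts.items()
        if uses_default_credentials(user, password)
    )
```

For example, `audit({"rstudio": "rstudio", "colin": "a-long-unique-passphrase"})` flags only the `rstudio` account. A check like this is meant for accounts you administer yourself.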

you know that is at the level of schoolboy schoolgirl you know it's that sort of level that a child could do it and that's where we were this has now changed so this was 18 months ago so if you went and tried this now hopefully no one is doing that any longer because they enforce a password change but that's where we were bioconductor most of you have probably never heard of bioconductor bioconductor is an r repository for genomics so what that means is a whole bunch of packages around about 1200 and a package is like a module or add-on that sort of stuff and you would go to bioconductor if you're doing genomics

research right so if you're doing microarray analysis rna-seq snps not the sort of thing you think on a saturday night i think i'll crack out some bioconductor do a bit of microarray right anyone doing this is at a major university

the sorts of people that you you know this is serious stuff right you don't just do this for fun and well how do you install bioconductor well installing bioconductor is quite easy you run this one command in r right so you run this command source blah blah blah https bioconductor.org bioclite dot r great what does this mean this means you're running an r script with full permission that's the same as running a python script with full permission so it's like here's my python script would you please run it on your system right and you go well no or at least you should right so when you run that command you're running an r script and it can do

everything that a user can do right you know within five seconds you can download a bunch of packages that could be ordering pizza from domino's right you can do everything you want as a user all right and this is equivalent to just logging onto your laptop and walking away right so when you run that script that's what you're doing so do i trust bioconductor to be honest yes you know you need to trust some people in the world and the people that run bioconductor are really clever people you know you know you can look them up you go yep they're smart folk yep happy with them so trust bioconductor but do i trust myself well that's a that's a no right i

i don't trust myself and i don't trust really anyone else because we all make mistakes and like a year ago i was really bored so i made a few online purchases i thought well what can i do well i could buy bulk and doctor and then i could buy biconductor and i could buy bio conductor and then you get the idea and i bought all 13 misspellings for a hundred pound you know we're going for sophisticated attacks here right you know this is what we're after do you think anyone hit those domains and by hitting those domains i mean they've mistyped that source command so they can't spell bioconductor it's a tricky word they can't spell

bioconductor and then they just try to run an arbitrary r script that i could have put up the answer to this question is yes people hit my domains and these are unique ip addresses or unique ip hits over the course of a few months so there's an average of six to sort of an average of eight unique ip addresses it turns out would you believe that when something doesn't work people keep going and going and going you know if at first you don't succeed just keep hitting return all right and that's quite scary well it petrifies me and the sorts of organizations you know a hit means i could run an arbitrary r script i got all top 10 universities in

the world oxford cambridge yale harvard the lot right all on the logs i got every government you can think of any government that's worth it it was in my logs right you know all the big agencies from the uk the us europe china anyone like that they're all there i think i got every single pharma company out there as well now just to be honest i only put up a 404 i didn't do anything whatsoever you know i just had an apache monitoring log so there's nothing i did wrong you know i've just bought domains looked at logs nothing but that's really scary right you know someone who you know i don't know much

about security every single pharma company in the world you know all the big ones and again why would a pharma company be accessing bioconductor because they've been doing clinical trials because they're doing microarray because they've been spending hundreds of thousands millions tens of millions of pounds on experiments and what would you do you know if you're a bad person if you're a bad person the best that these pharma companies could hope for is ransomware right after that things get a lot worse you can steal their data perhaps you might change the data you're developing a new drug just change a few numbers let's spend hundreds of millions on this new drug let's you know potentially kill

people right you know that's the bad things you can do and that to me is a lot worse than ransomware because it only requires a few digits changing in an excel spreadsheet and you'd never catch it you know a little tweak here and there easy and that to me scares the life out of me right you know because i've done you know next to no sort of proper you know security it's just buying a bunch of domains again this has changed so if you go to bioconductor now they make you do something much more sensible right even though this is no longer the way to install bioconductor i'm still getting hits all the time in these domains which is a pain because i have to keep

renewing that hundred pound every year pain in the rear end uh but i feel responsible and why well you get post-it notes for passwords so you get post-it notes for how to install bioconductor and these post-it notes are getting passed down from student to supervisor well how do we install that again oh here's a bit of paper just run those commands so even though this is no longer the process for doing this for the last 18 months i'm still getting because
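Editorial sketch, not from the talk: the talk does not say how the 13 misspelled domains were chosen. Assuming they were simple single-keystroke slips, the candidates a project might want to register defensively can be listed as below. The function name and the three typo classes (dropped letter, doubled letter, swapped neighbours) are illustrative assumptions.

```python
def typo_variants(name: str) -> set:
    """Return single-slip misspellings of name: one letter dropped,
    one letter doubled, or two neighbouring letters swapped."""
    variants = set()
    for i in range(len(name)):
        # dropped letter, e.g. "bioconductor" -> "biconductor"
        variants.add(name[:i] + name[i + 1:])
        # doubled letter, e.g. "bioconductor" -> "bioconductorr"
        variants.add(name[:i] + name[i] + name[i:])
        # swapped neighbours, e.g. "bioconductor" -> "biocnoductor"
        if i < len(name) - 1:
            variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
    # swapping two identical neighbours reproduces the original, so drop it
    variants.discard(name)
    return variants
```

Calling `sorted(typo_variants("bioconductor"))` gives a list to review by hand. Real typing errors also include wrong-key and wrong-TLD slips, which this sketch ignores.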

it's a syndication site for lots of r posts so if you're interested in r go to r-bloggers and essentially it just syndicates a bunch of r blog posts right so you can look through them and go oh that looks interesting you can click on it right and so some of them are really you know this is even dull for me testing rounded data for a circular uniform distribution but then you've got all sorts of varieties of what people are using r for you know dashboards whatever now it's one of the most popular platforms out there for r suppose someone scanned the list of contributing blogs right there's round about 750 of these things so 750 people

putting their blog out there saying you know here grab my rss feed do that and suppose someone then thinks i could write a little for loop and see if any of them return a 404 and they get 43 blogs that return a 404 i'm guessing and then someone purchases some of these domains that return a 404 and then someone creates a little quick blog post on how to write cool sexy graphics you know something animation you know because we all like a bit of animation some nice pastel colors that's what really gets ai juices flowing so we create a little blog post and that sort of stuff would people run arbitrary code just because it's on the internet
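Editorial sketch, not from the talk: the "little for loop" over the contributor list can be pictured as below, for instance as something an aggregator maintainer might run to prune dead feeds before someone else picks up the lapsed domain. The function names and the injectable `status_of` parameter are assumptions made so the loop can be exercised without touching the network.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Optional


def http_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if no response came back."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error status such as 404
    except OSError:
        return None  # DNS failure, refused connection or timeout


def feeds_returning_404(
    urls: Iterable[str],
    status_of: Callable[[str], Optional[int]] = http_status,
) -> List[str]:
    """Return the URLs whose status is 404, in the order they were given."""
    return [url for url in urls if status_of(url) == 404]
```

A domain that has lapsed completely will often fail to resolve rather than return a 404, so in practice the `None` results deserve a look as well.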

clearly the answer is no but unfortunately in this case the answer is yes people would run arbitrary code because it's on the internet and what is even more slightly serious about this the blogs that i was taking over had sort of google juice already there you know again it's not a particularly sophisticated attack the blog was typically originally written by maybe a phd student and then they went and got a proper job i suppose and left a blog and it just lapsed but again you know the people using r these aren't sophisticated attacks and with almost no chance of being caught you could get away with anything you want the bioconductor stuff's even more scary if i

was a bad person what i would do is i'll do something horrible to your laptop and then install bioconductor for you anyway you know so you've got bioconductor that takes about 10 minutes so you've got a sort of tolerance of a minute that no one would notice right so there's lots of really bad things you can do when you install the code why is r so popular all right well r's really popular because of something called cran so cran's similar to pypi and if you go to wikipedia where we all do our best research if we're honest and you look up cran you get calorie restriction a dietary regimen not this we get telecommunication architecture nope

we get this is my favorite a measurement of uncleaned herring i think we should bring back the old imperial ways and just dump this si nonsense right now and go back to crans for measures of uncleaned herring uh so next time in tesco can i get two crans please uh but really it's a package archive for r and the way of picturing this and this is a really good image because it does capture the sort of people behind it is it's like pypi except done properly so when you submit a package to cran that package is checked against r past present and future it's checked against all operating systems you can think of

so linux windows mac solaris the bane of people's lives who uses solaris 32-bit 64-bit machines it makes sure it doesn't affect any dependencies you know future so it's also really stable it's got problems and then a human might actually look at the code right so you know it's good and what it means is like installing packages is easy right you know i can type one command i can give you one command on how to install a package you don't need anything else and automatically download all the dependencies and put them in the right place unlike python where i do a lot of python i'm slagging it off but it's just fun uh unlike python i've

got four versions of tensorflow kicking about my laptop at various places all right so that's one of the reasons why it's popular and essentially the command is install.packages put the package name magic just works okay when you run this command you're actually executing r code so it's not just the same as getting some files and putting it over here it's getting some files and then running code right so technically installing is a potential issue no not technically installing is a potential issue right you know when you install a package it could do anything it wants you have now given that package or the code in that package full permission you know it could do anything it wants right

it could set up an ftp server fire everything backwards and forwards across the internet do what you want and also when you install a package so that yellow little orange blob is an r package when you install that package it grabs all the dependencies right you know so this package depends on rmarkdown rmarkdown depends on catools catools depends on bitops so it automatically works out this dependency graph and installs all that stuff all right which makes things a bit more tricky to reason about so there's a project i've been working on it's been work with sonatype and this essentially is trying to help here right you know it's

trying to sort of when you install a package or when you develop a package or when you've got a package in a ci pipeline it will automatically essentially ping their index and say is there a vulnerability in one of your dependencies so it's trying to sort of get a handle on the package dependency graph all right so that's one of the one of the things that r does a lot is embed c plus plus and javascript right so one of the things that has sort of caught on is being able to build a dashboard so we can in our training courses we can take a bunch of people who have never coded before and

in around about five days they can be creating sort of basic dashboards with some regression lines interactive graphics all using r knowing no html no css right it's in five days of course there's gaps in knowledge and all that but that's the sort of level and that's because essentially a lot of r packages now wrap a lot of the horrible javascript and they might give you a nice r function so rather than trying to write a javascript graph which is clearly the better way of doing it you could sort of wrap a simplistic view around in a little r function say give me nice fancy graph widget please and it generates some javascript and does all that stuff

so there's lots of embedded c plus plus and javascript and that's quite scary because javascript is scary right i'm sure all security vulnerabilities can be traced back to javascript well it's not true but it sounds more fun now technically few security issues have ever been detected of course no one's actually ever looked which is perhaps part of the problem right you know so no one sort of cared about this and i'm certain that there's a lot of javascript kicking about in r packages because it basically embeds jquery you know version last year kicking about someone you know grabs some javascript minifies it throws it in and then forgets and that's quite scary so one final

question so would you ask i.t. to create dashboards to create a tensorflow ai model to analyze data most people think that's just stupid you know depending on your background you think that's not my job someone else's job not me but most data scientists so we're working with a pharma company in the us and it's four data scientists really bright people they're expected to know all of bioinformatics some statistics some machine learning programming to quite a high level docker aws setting up and maintaining a server some linux and then they've got the day job that's not possible for one or two people to know right there's a lot of stuff going on there would you believe it's not done well right it's not their

fault it's just you know they're expected to know everything but yet lots of data scientists are expected to secure web servers you know they're doing dashboards i.t. go don't know about r i've never really heard of this thing can we rewrite it in java i had that conversation yesterday all right you know they just wanted to rewrite something in java because that's what they're familiar with well no not really you know how can you rewrite an ai model in java you know especially when i want to change it tomorrow that's not how machine learning works they're expected to look after security but look after means they have no idea because it's never been in any course or

anything they've ever thought about so therefore they're looking after it by not knowing about it which is probably not the best strategy they're expected to load balance applications as well right there's so many things that small teams are expected to do because they're sort of falling through the cracks and again i think a lot of the talks were thinking about and you know i loved the first talk about you know fridges attacking things this stuff here in my mind is if you're going after someone who's using these sorts of tools there's far more juicy targets than locking down someone and getting a few bitcoins right you know you could really screw a company over by just tweaking one or two

lines of code that you'd never notice you know you target a bank you go for a bank you go for a mortgage there's going to be some mathematical model some api that's going to say whether or not you're going to get a mortgage you could just tweak that a bit you know not major so it isn't blatantly obvious but maybe you sort of knock it down 10 so now you're going to give a mortgage to perhaps someone you shouldn't have or you're going to give a loan to someone who possibly shouldn't have got a loan right really trying to detect that stuff would be really hard right so depressing summary i think there's lots of popular tech

that i.t. don't understand right you know again r is perhaps something you've never come across or you've only come across it tangentially but it's a really popular programming language there's lots of people using it right we've just trained 300 data scientists in the scottish government you know lots of people are doing it and r is secure but can be made insecure and it's just the same as anything else right it's not oh don't use that only use this you'll be fine well no it's more the fact that typically there's understaffed departments i.t. don't typically sit there drinking coffee and eating donuts they're stretched so they're getting a bunch of phone calls from you know data scientists who are trying

to communicate we need some sort of dashboard shiny thingy in this cloud is that the right phrase i don't know but we just want it and we need it and we need it now can you just do it tomorrow and i.t. say well what are your requirements you know what about ram ram what's that is that something we have with coffee and so that conversation is really hard and what happens is you get insecure applications so thank you very much that's the end of my talk any questions by all means feel free to ask or grab me over lunch or any other time during the day but thank you very much for your time hmm does anyone have any questions for

colin please raise your hand now anyone anyone on the verge of streaming here we go we go we've got a candidate sorry i'm going to get really close and personal hello hello hi thank you for the talk by the way you mentioned earlier on you were buying a bunch of domains how much does that actually set you back those 13 cost 100 pound in total how much did this cost at the start because i bought a few at the start of this year and you get like a deal for them oh so yeah those domains were just i bought all 13 for less than 100 pounds so 13 for 13 pound then came back a year

and found out yeah yeah yeah in terms of being able to hack all pharma companies in the world it's either a good or bad investment depending on your point of view [Laughter] anybody else have any questions for colin i've just got something i'd like to say i've never heard uncleaned herring in a talk before so i think you get extra points for that anyone heard that before no double points yes double points please grab some coasters because if you don't i have to put them in a bag and then take them upstairs again which is just a bit of a pain to be honest well thank you oh we have a question all right contender hi thanks for the talk

um so at the start of the talk when you were telling us about r i've used it a little bit in the past and you mentioned it's a domain specific language for statistics couldn't there be an argument made that you shouldn't be putting that on a web server and then you would avoid all of the issues with people hacking the website there is except it's not it's a it's not insecure if i was i've always been our service at it just need to do it properly you know it's a bit of software it's just like saying well wordpress you know you know it's not not a difference of you can set up this stuff properly so

there's a server data scientists can push the dashboard up to that server and it's all perfectly secure you know there's no reason why that's not other than the fact that no one tends to take responsibility also a lot of applications we deal with if you're in a major government organization you're pushing dashboards reports all the time but they're purely internal so surrounded by an intranet of course it might be pulling in sneaky little bits of javascript they didn't quite think of but we won't even dwell on that part because that gets scary again as well but yeah one more question down the front and then i think that will be it sure related really oh wait wait wait wait

wait wait wait you need a scotsman i saw a related question um we do some data analytics work in our organization and we often find that the analytics servers actually almost end up as shadow i.t. because you end up having a department within the organization that wants some analytics work wants some some ai some machine learning they have a budget for it they don't want to involve i.t. yeah and is that something that you've experienced and sort of contributes to this yeah it's yes would be the idea and again i suppose data scientists are incredibly technical you know you do have someone that knows about docker and you know if someone knows how to set up

servers what happens is they can do that properly then they get bored and they go on to the next shiny bauble because they don't want to do that and that's what happens uh we typically tend to we now do a lot of managed services so we need to go and say not for everything you know but for this particular stuff and i.t. love us because it's like great and data scientists love us because great you know so that's the way we see things going uh but yeah it's exactly what you said you know little shadowy domains universities are the worst uh they have them all over the place awesome thank you very much um yeah let's give one final round of

applause [Applause]