Vendor Data Science Buzzwords Hacked

Name: Vendor Data Science Buzzwords Hacked
Uploaded: 2019-06-10
Duration: 16 min 23 s

BSides London · 2019 · 16:23 · 429 views · Published 2019-06

Speakers

Thordis Thorsteins

Tags

Category: Technical

Style: Talk

About this talk

In this talk I will go through buzzwords commonly used by security vendors, explain what they mean from a data science perspective and give advice on how to treat them with a healthy dose of scepticism. Think "Artificial Intelligence", "Data Driven" and "Advanced Analysis". These terms can all describe important approaches that are used in security data science, but they tend to be used too freely, potentially with the aim of blinding security teams with science and avoiding giving too much away. After this talk the audience will walk away with an arsenal to decode these buzzwords and the knowledge to ask pertinent questions to vendors to discover what their products are *really* doing under the hood.

Transcript [en]

Okay, so, hello. My name is Thordis Thorsteins, and don't worry, you will not need to repeat that. I am a security data scientist from Panaseer and, as he said, today I'm here to talk to you about data science buzzwords and how to see through them.

To start off, I'm just going to bring up a few statements taken off the websites of real companies, and as I bring these up I just want you to think critically about what these statements actually mean. So, the first one: "Company 1 is a leading artificial intelligence vendor that offers targeted machine learning applications which optimize the bridge between digital and offline channels." Pretty self-explanatory. If you have some guesses, feel free to shout out, but it's not necessary; I'm not putting anyone on the spot. Yeah, that was more what I was leaning towards.

"Company 2 is the only provider of global, real-time, unlimited information and insights driven by leading artificial intelligence." Again, very obvious what this company does. And then my favourite one: "Learn how automated machine learning supercharges your analytics efforts." I don't know exactly what they mean by "supercharges", but I quite like the way they phrased this.

I don't know about you guys, but I do data science for a living and I still don't really have any idea what they mean by these statements. That being said, this doesn't tell us anything about how good these companies are. They may very well have a fantastic product under the hood, but the problem is that they're just not revealing anything about what they do, how they do it, or how well they do whatever they do. This stems from the fact that buzzwords are created to impress, not necessarily to give any information away. They often create this illusion that by sounding more complex things are necessarily better, but obviously that is not always the case at all.

I think we've probably all been guilty of falling for this at some point. I can definitely say that yesterday I was walking around Infosec and I saw a lot of headings where it was very ambiguous what the companies did, but I still found myself very compelled to go towards the booths that had some kind of short, impressive-sounding phrasing in a sleek font. And that's completely fine; as I say, sometimes the company is actually doing something very valid. But it's important to think critically about what they're doing and to actually ask about anything that isn't clear, because if someone can't explain what they do without relying on buzzwords, that's a huge warning sign.

So I'm going to focus on three, let's say, celebrities from the data science buzzword world, and the first one up is "data driven". A lot of companies use this, and what it essentially means is just that they use data to do whatever they do. To me this actually seems a little bit redundant when we're talking about security applications, because I think it's difficult to find applications that don't use any data. So this phrase is used more to impress than to actually tell you anything. The key things to ask when someone says they've got a data-driven application are what data they are using and how they are using it, because that's what's going to differentiate a good application from one that isn't as good.

I think the most commonly used data science buzzwords these days have to be "artificial intelligence" and "machine learning". What definitions you get is going to differ quite a lot depending on who you ask; even if you ask the experts you get slightly differing opinions. So I've included this slide for context, and just to explain that the language I'm using for the purpose of this presentation is the language that's most commonly agreed upon in the industry: that machine learning is a subset of artificial intelligence. As I go on I'm actually going to focus on artificial intelligence specifically, but keep in mind that for anything I say about AI you can use the same techniques whenever someone says something about machine learning.

That brings us to buzzword number two, which is "AI powered", and this can mean a whole host of things. The best-case scenario is that they've got some high-quality data or high-quality models that they're using where appropriate to make the application better. However, in a huge number of cases that's not the case at all. Research recently showed that 40% of EU start-ups that are classified as being in the AI space actually have nothing to do with AI, or don't use it in any measurable way. You also have cases where there is some machine learning element but it's mostly been included for marketing purposes. It's not actually making the application any better; they're trying to tack onto this idea that machine learning is cool and that they're doing something intelligent, so they've included it just to impress rather than to return better results. And then there's the weirdest case, where they actually don't have anything to do with AI.

So I'd say the most important thing to remember when anyone says anything about using AI is not to get too impressed. AI is a tool. It's a tool that's very powerful when used correctly, but it is also only a tool; it's not a magic fix or solution like it's often perceived to be. It's often marketed as if you can just throw any data at any problem and magically you'll get the correct answer out the other side, and that's obviously not the case. There are a lot of prerequisites for it to even make sense to use AI. First of all, you have to have a well-defined problem. So if anyone says they're doing AI but it doesn't sound like the problem they're trying to solve is well defined, then you can immediately rule out that they're actually doing anything to do with AI. And, to use a very popular phrase in the data science community: rubbish in, rubbish out. That just means that the quality of the model you have is only ever going to reflect the quality of the data you train it on. If you're using data that's not extensive enough, or that has biases you're not accounting for properly, then the model is going to be hugely flawed, and this is something that needs to be taken into account.

The last one is "advanced analytics". As we saw on the graph before, this is the one that's widest in scope, and what it essentially means is just that it's more advanced than basic analytics. This contains things such as artificial intelligence and machine learning, but it also contains things like particularly well-constructed graphs or a very advanced visualization. So the thing you need to ask when someone says they're doing advanced analytics is: what do you mean, what sort of advanced analytics?

To round the talk off, I've prepared a buzzword cheat sheet, so you have something to take away from this talk and can be armed next time you're presented with a very fancy-sounding statement: one that sounds very intelligent, like they're doing something very cool, but where you don't really know what they're doing. To help me demonstrate the use of this cheat sheet I'm going to rely on one of the statements from the first slide. There are a lot of things about this statement, which we loosely covered earlier, that are not clear at all.

So the first thing I would ask if someone said this to me is "What do you mean?", and get them to break it down into simpler terms, because if they know what they're doing they should be able to phrase it in a way that's easier to understand. If they say anything like "Trust me, it's AI", then just stay clear; that's not someone you want to get involved with.

Then, diving a bit deeper: "How specifically do you use...", I guess in this case machine learning, but you can use this for any of the other buzzwords. This is because, as I said earlier, there are prerequisites that need to be met for it to even make sense to use machine learning, so you need to understand what problem they are trying to solve and whether they have taken into consideration the limitations of these approaches.

"Why did you choose to use this technique?", so machine learning here. This is getting at whether they have actually considered the limitations, or whether they've just used it for the sake of using it, because everyone else is doing it and they wanted to get on that bus as well.

"How was the model trained?" This is quite key because, as I said earlier, the model is never going to be better than the quality of the data that was used to train it. There may be biases in the data they haven't accounted for, or they may have trained it without taking into consideration some security flaws. For example, neural networks often store information that they train on, so for specific use cases you actually wouldn't want to use them at all. So you need to know how they trained the model and whether they trained it correctly for the use case they have.

"Can I validate the decisions being made?" Now, this one is not always going to be possible to validate. For example, with neural networks it is always going to be a bit of a black box; you don't have full visibility of what's going on. But that also means that it will not be an applicable approach for, say, making important security decisions, because for those you need to have visibility of what's happening. So when that's the kind of question you're trying to solve, you need to make sure that you can validate what's being done.

And then lastly, a bit of a technical one: "How do you minimize false positives or false negatives?" This one is going to depend a little bit on the question, but in some cases it's going to be very costly to falsely predict a positive or a negative. Say, for example, you were working in healthcare and you were predicting whether someone had a specific disease. It would be quite bad to falsely say someone doesn't have it, whereas if you falsely say that a few more people have it than actually do, the result of that is not going to be as bad. So you need to know whether they've taken into consideration how to minimize the cost.

So if you take one thing away from this presentation, it's that you shouldn't worry about asking questions, even if you feel like they sound naive. Don't find yourself getting so impressed and so intimidated by these statements that you don't question them at all and just say, "Oh, these people sound like they definitely know what they're doing, they've used all the cool words, and I'm not going to give away that I don't know what these things mean." So ask about anything that's unclear. That rounds it up. Does anyone have any questions?

Moderator: I think that's wonderful. By the way, "trust me": that's a term we use in Texas. It means you're about to get hosed for a lot of money and we're not going to do anything that we claim. Any questions?

Audience member: This is awesome, thank you so much. How do you get past it when someone tells you the answer to one of those questions is "Oh, I'm going to have to get my data scientist in here to explain that to you"?

Thordis Thorsteins: Yeah, so that's probably going to be the case a lot of the time, and in a lot of cases they won't even want to give you the answers to the questions. So what you need to take into account is how important it is to you to know these answers. Definitely try to push, try to press for the answers, because someone in the organization should have them. Obviously, if you're like "Oh, tell me the secret, tell me the exact model you're using", they're not going to want to tell you that.

Audience member: Sorry, just for the record: as a follow-on to that, do you consider the number of layers and the model of a neural network to be proprietary?

Thordis Thorsteins: Personally, no. In some cases people will, because a lot of the time the models under the hood actually aren't that complicated at all, so companies don't want to reveal how simple their answer is, just because it would be so easy for other people to pick it up. But it shouldn't be the case, though.

Moderator: Yeah, usually when someone makes a comment like "I need to talk to my data architect" and they've been selling the product for a while, it means that they haven't got a clue. Exactly. Next question.

Audience member: So you're a data scientist, right? Do you think that somewhere in the future AI will take over the world?

Thordis Thorsteins: Sorry?

Audience member: That somewhere in the future AI will take over the world, dominate the world?

Thordis Thorsteins: So this is a hot topic and I've probably got to be careful about how I answer this. Personally, yes, I mean at some point in the future. I'm a mathematician, so when you say "at some point" I'm thinking of an indefinite time. So yes, at some point they're probably going to. In our lifetime, I actually don't think so, just because I think we're quite far from getting anything that's actually intelligent. We've made huge progress in making machine learning algorithms that solve specific tasks very, very well. Image classification, for example, is one that we've made huge progress on. But those algorithms wouldn't automatically decide to learn French or pick up a political opinion, which is what's required for actual intelligence, and I think that machines are going to need a level of intelligence to actually take over the world. So, yeah.

Moderator: If you don't mind, just to stop there: we don't think there'll be intelligence to take over politics in the world. I speak French and I've got political opinions, but I don't want to take over the world. Other questions?

Audience member: What do you do if, assuming you don't know anything about the data science side of things, you still want to understand whether it's snake oil? If they come back to you trying to bamboozle you, like "Oh, it's a seven-layer [unclear], we've trained it on the WTF data set" or something, how do you cut through that to actually understand what they're doing?

Thordis Thorsteins: I'd say just use the cheat sheet: ask them very simple questions and try to get beyond that, because it's not really going to matter what exact model they use. What's going to matter is whether they've taken into consideration the limitations of their approach, whether you think they have access to data that's of sufficient quality, how well they've thought about the problem, and how well they can explain why they're doing what they're doing and why it works.

Moderator: So that goes with your corollary: if they ever say, "Well, I don't think you would understand."

Audience member: Just a comment: it's not much different when they say "blockchain powered".

Thordis Thorsteins: Yeah, I mean that should have been on there as well, but I had to restrict the scope.

Moderator: Okay, any other questions? No? Going once, twice. Thank you, wonderful presentation.
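The last cheat-sheet question in the talk, how a vendor minimizes false positives or false negatives, can be made a little more concrete with a small sketch. The Python below is not from the talk; it is one possible illustration, assuming a model that outputs a score per case and a decision threshold that turns scores into alerts. The scores, labels, cost figures and every name in it (`Costs`, `confusion_counts`, `total_cost`, `cheapest_threshold`) are hypothetical and chosen only to show how the preferred threshold moves once a missed case is treated as more expensive than a false alarm, as in the healthcare example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Hypothetical per-error costs; the numbers used below are assumptions."""
    false_positive: float  # e.g. an unnecessary follow-up test
    false_negative: float  # e.g. a missed diagnosis


def confusion_counts(scores, labels, threshold):
    """Return (false positives, false negatives) when flagging scores >= threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn


def total_cost(scores, labels, threshold, costs):
    """Weight each kind of error by its cost and add them up."""
    fp, fn = confusion_counts(scores, labels, threshold)
    return fp * costs.false_positive + fn * costs.false_negative


def cheapest_threshold(scores, labels, costs):
    """Try every observed score, plus a 'flag nothing' cut-off, as the threshold."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda t: total_cost(scores, labels, t, costs))


# Made-up model scores and ground-truth labels (1 = has the disease).
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

symmetric = Costs(false_positive=1.0, false_negative=1.0)
miss_is_costly = Costs(false_positive=1.0, false_negative=10.0)

print(cheapest_threshold(scores, labels, symmetric))       # 0.6
print(cheapest_threshold(scores, labels, miss_is_costly))  # 0.35
```

With equal costs the cheapest cut-off on this toy data is 0.6, which misses one true case. Making a miss ten times as costly pulls the cut-off down to 0.35, accepting two false alarms to avoid any misses. A vendor who has thought about this trade-off should be able to say which of the two errors they weighted more heavily and why.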
