
Attacking Deep Learning-Based NLP Systems

BSidesSF · 2019 · 30:30 · 147 views · Published 2019-03
About this talk
Word embeddings form the foundation of modern NLP systems, but the process of gathering training data and deploying these models has received little security scrutiny. This talk demonstrates how attackers can manipulate training data to skew word vectors, influence sentiment analysis and other NLP tasks, and inject poisoned embeddings into real-world systems—with examples ranging from financial trading bots to medical chatbots.
Original YouTube description
Recent Deep Learning-based Natural Language Processing (NLP) systems rely heavily on Word Embeddings, a.k.a. Word Vectors, a method of converting words into meaningful vectors of numbers. However, the process of gathering data, training word embeddings, and incorporating them into an NLP system has received little scrutiny from a security perspective. In this talk we demonstrate that we can influence such systems by manipulating training data and how we can inject them into real-world systems.
Transcript [en]

Good afternoon everyone, and welcome to the next interesting talk, on attacking deep learning-based NLP systems with malicious word embeddings, by Toshiro. If you think about an agreement frame and define it as "yeah, I suppose I'd like world peace too," then you're in the right talk. We're going to hear some very interesting stuff about novel ways to really get into these NLP systems. So thank you, and off you go.

Thank you, thank you for that interesting introduction. Welcome to my talk. Is my audio working? OK, so welcome to my talk. This talk is about how to tamper with natural

language processing systems by manipulating and influencing something called word embeddings, which is how NLP systems understand words. I believe this threat will become increasingly substantial as NLP systems gain wider adoption. Here's the agenda: first, we'll talk about why any of this matters by discussing a few use cases. Then I'll give you a crash course on word embeddings: how they're created and how they're used. Third and fourth, I'll give you examples of how an attacker might exploit their weaknesses. And finally, I'll talk about some mitigation techniques. So first of all, why does any of this matter? To set the stage for this presentation, I'd like you to imagine two companies. The first is a hedge

fund that algorithmically trades stocks based on financial news and tweets, and the second is a start-up creating medical chatbots to help patients diagnose symptoms. The first company does something called sentiment analysis, which is the process of taking a chunk of text and assigning a score for how positive or negative it sounds. The hedge fund would use this system to read every news article, tweet, and discussion forum post it can get its hands on, figure out whether markets are trending bullish or bearish toward the subject (in this example, General Electric), and make a trading decision. Here's another fun example: somebody noticed a few years ago

that Donald Trump's tweets sometimes tend to have a large impact on the stock prices of companies he mentions. So this person decided to write a bot that listens for tweets from the commander-in-chief involving publicly traded companies, runs sentiment analysis on them in real time, and buys and sells stocks accordingly. I'm not sure if this person got rich, but it's an interesting story nonetheless; you can click the link in the downloadable slides to go to an NPR podcast episode about it. It's pretty cool. The second company uses NLP to create an interactive chatbot to answer health-related questions, and this isn't hypothetical at all. In fact, all of these

are screenshots from companies' websites showing off their chatbots. I found general health bots, and ones specific to cancer and mental illness; I even found one for sexually transmitted diseases. It goes without saying that for either of these use cases, any mistake in understanding or generating text could have real-world consequences in the form of bad medical advice or unwise financial trades. There are a bunch of other use cases too. Probably the most commercialized one is text generation, which is when a company takes data (for example, product specifications) and generates a product description, or takes a spreadsheet and generates a sales report. And we've all used Google Translate, and probably been

affected by some of the other uses up there. So what are word embeddings? Word embeddings are basically a way of assigning vectors of numbers to natural language words. We need to do this because all recent natural language systems based on machine learning, and that includes deep learning and deep neural networks, understand numbers; they don't understand characters, words, sentences, or anything symbolic like that. Embeddings completely define how these systems see words, so they form their very foundation. If my talk had a punch line, it would be something like this: by manipulating word embeddings and how they're created and used, we can influence how these systems understand language and manipulate their inputs and

outputs. To jump back in time a little, let's think about how people have represented words as numbers. One of the most traditional ways, and you might have done it this way as well, is something called a one-hot representation of words, which basically boils down to assigning a unique ID to each word in your vocabulary. Here I have five business and financial terms, and the problem with this sort of encoding is that there's no inherent semantic meaning embedded in these numbers. For example, to me as an investing layman, the words capital and asset are not the same, but they're about the same: something valuable you're supposed to acquire, right?

But you wouldn't know that from the numbers 723 and 1533, which are just arbitrary. Then, starting around 10 or 12 years ago, there's been a rise of algorithms that can assign to words vectors of numbers that contain semantic information. These vectors tend to be around 50 to 300 dimensions, though I've seen them larger and I've seen them smaller. Here are the same words from the last slide with word vectors assigned to them, and the thing I want to point out, hopefully you can see this laser pointer, is that

the vectors for the words asset and capital are similar: 2.66, 0.68, 0.3, minus 0.35, minus 0.3, and so on, which means they have a low Euclidean distance between them. And the magical thing about these word embeddings is that no human would have gone in and manually told the system that capital and asset are similar words. It would have just been learned from a large text data set, like the entire text of the English Wikipedia, with no manual labeling, no Mechanical Turking, anything like that.
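The "low Euclidean distance" idea can be illustrated with toy vectors; the four-dimensional values below are made up, standing in for real embeddings of 50 to 300 dimensions.

```python
import math

# Toy vectors: related words sit close together, unrelated ones far apart.
vectors = {
    "asset":   [0.66, 0.68, -0.35, 0.12],
    "capital": [0.61, 0.70, -0.30, 0.15],
    "banana":  [-0.80, 0.10, 0.55, -0.40],
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(euclidean(vectors["asset"], vectors["capital"]))  # small
print(euclidean(vectors["asset"], vectors["banana"]))   # large
```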

The entire point, basically, is that words with similar meaning cluster close to each other in their vector spaces, as they're called; in this case it could be a 100-dimensional vector space. To illustrate this point further, I looked at a bunch of words using a pre-existing word embedding from Stanford (a lot of NLP work happens at Stanford). For each of the words in the left-hand column, and this is a huge screen, I wasn't expecting something like this, I asked this pre-trained model what it thinks is a

related word to it. So, for example, the word debt pulls up the words credit, loan, and mortgage; presumably you're English speakers and agree those are related. The word debit pulls up payments, prepaid, and even MasterCard. The last row is kind of interesting, because HSBC is not an English word, but somehow the model learned that it's related to Citibank, Barclays, all these big banks. And my favorite one is bankruptcy, which pulls up the words foreclosure, creditors, and also divorce. And here's the same thing I did for medical terminology. I don't have too much to say about this, except that it seems to make sense.
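The "related word" queries described here amount to a nearest-neighbor search under a similarity measure. Below is a toy sketch over a hypothetical four-word embedding table, using cosine similarity; real tools (gensim, for instance) expose a similar operation over full pre-trained embeddings.

```python
import math

# Tiny hypothetical embedding table; real ones hold hundreds of
# thousands of words in far more dimensions.
emb = {
    "debt":     [0.9, 0.1, 0.0],
    "loan":     [0.85, 0.15, 0.05],
    "mortgage": [0.8, 0.2, 0.1],
    "sprain":   [0.0, 0.9, 0.4],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_similar(word, k=2):
    # rank every other word by cosine similarity, highest first
    scores = [(cosine(emb[word], v), w) for w, v in emb.items() if w != word]
    return [w for _, w in sorted(scores, reverse=True)[:k]]

print(most_similar("debt"))  # ['loan', 'mortgage']
```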

You know, sprains do happen to groins and tendons, biopsies might be performed on lesions, and salmonella indeed is foodborne. And here's a simple visualization of how you can picture this in your mind: each of the rows in the previous two slides forms a cluster in n-dimensional space, where n is again 100 or 200. Obviously you can't draw that, but if you projected it onto a 2D surface like the slide, it might look something like this. I'll mention that this particular word embedding was trained on, I think, Wikipedia and something called Gigaword; if you

ran the same algorithm on, say, a medical text corpus or a corpus of financial news articles, there would probably be much tighter clusters, and more industry-specific words would come up in those clusters. But given that there's no manual labeling or anything, I think this is pretty good. On this slide we show the power of word embeddings by showing that not only do related words cluster together in their vector spaces, but word pairs with similar relationships also have similar distances. On the top is a table of large companies and their CEOs, and those pairs have a distance of around seven; on the bottom we have some anatomical

relationships, and they also have roughly similar distances. This is harder to visualize, but, with some ridiculous hand-waving, you can think of the CEOs as forming a loose cluster at some point in the vector space, and the companies they run as forming another cluster somewhere else, and therefore the distance between any two corresponding points here and here is going to be roughly equivalent. That's a way to think about it; this is a toy visualization. So how are word embeddings computed? The primary assumption is that words appearing in similar contexts should have similar meanings.
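The "similar relationships have similar distances" property can be illustrated with made-up 2-D vectors: the offset (and hence distance) from each company to its CEO comes out roughly the same. All names and numbers below are hypothetical placeholders.

```python
# Hypothetical 2-D vectors illustrating the parallel-offset property:
# the vector from each company to its CEO points the same way.
vecs = {
    "apple":     [1.0, 2.0],
    "cook":      [1.5, 3.0],   # CEO of Apple
    "microsoft": [4.0, 1.0],
    "nadella":   [4.5, 2.0],   # CEO of Microsoft
}

offset1 = [a - b for a, b in zip(vecs["cook"], vecs["apple"])]
offset2 = [a - b for a, b in zip(vecs["nadella"], vecs["microsoft"])]
print(offset1, offset2)  # both [0.5, 1.0]
```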

For example, somewhere on the internet you might find the sentences "I feel a pain in my hand," "I feel an ache in my hand," and "I feel a cramp in my hand," and the algorithm would understand that these three words appear in similar contexts, so presumably they have similar meanings. You can cast this into an algorithm by imagining a function f that takes in the context words, the words on either side ("I feel a" and "in my hand" in this case), takes the current word vectors for them, and maps them onto the vector of the target word, or

the word in the center; it could be pain in this case, or ache, or cramp. The algorithm goes something like this: you take a large data set of text, like Wikipedia, Twitter, or Common Crawl, and for each word in that data set you grab its surrounding words, three on each side in our case, but you could use as many as you want, and then you optimize the function f using gradient descent or some other well-known method. There's a talk tomorrow that I think goes into more depth about this, including how you can also cast this problem as a shallow neural network; I think you should go to that talk too if you want a real-world example.
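The context-window step just described can be sketched as follows, using the talk's example sentence and a window of three words on each side; this only generates the (context, target) pairs, not the gradient-descent training itself.

```python
# Generate (context, target) pairs for CBOW-style training.
sentence = "i feel a pain in my hand".split()
window = 3

pairs = []
for i, target in enumerate(sentence):
    # up to `window` words before and after the target
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    pairs.append((context, target))

# For the center word "pain", the context is the surrounding words:
print(pairs[3])  # (['i', 'feel', 'a', 'in', 'my', 'hand'], 'pain')
```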

To jump ahead a bit, I'd like to point out that this opens up a possible attack vector, because things like Wikipedia, Twitter, and the websites crawled by the Common Crawl project are user-editable; they're just taking things that people have written in the wild. It's possible to strategically craft and inject sentences to skew the word embedding any way you want. For example, going back to this example, if you created a bunch of Twitter accounts, or maybe even just one, and tweeted the words "I feel a calmness in my hand," well, this algorithm is going to start

thinking that pain, ache, cramp, and calmness are similar words, which is obviously not the case; calmness means almost the opposite of those three words. If you want to learn more about this: the algorithm I just described is called continuous bag-of-words, which is one variant of word2vec. The other variant is called skip-gram, which predicts the context words from the target word, a minor difference. There's also a more advanced method called GloVe. If you want to know more, you can go to Stanford's CS224 course, which is natural language processing at the graduate level, I think; all

their videos are online, and the first three lectures go into a lot of depth; it's basically where I learned all of this. So let's talk about some example attacks. Let's pretend for a second that we can manipulate these word embeddings and word vectors however we like: what can we do with that? It turns out we can do basically anything we want, because the word vectors form such a fundamental part of the system; they're how it understands words. To demonstrate this, I grabbed a standard sentiment analysis data set, again from Stanford, called the Stanford Sentiment Treebank, on the right side here. This data set

contains sentences from movie reviews that are manually labeled, probably by graduate students, as very negative, negative, neutral, positive, or very positive. The point of such a data set is that it allows you to test a model that takes a sentence on the left and correctly predicts what a human labeler would have assigned as its sentiment. Unfortunately, I couldn't find a sentiment analysis data set specific to finance or medicine, which would have been more interesting, I guess, but bear with me here. I'm going to do some vigorous hand-waving now: since I have deep learning in my title, I feel obligated to describe at least one deep learning model that

could solve this. One way to tackle sentiment analysis is to take the word vectors of each word, which form the rows of this matrix here, stack them on top of one another to form a matrix, and then feed it into something called a convolutional neural network. This is a very rough drawing, and the details of a convolutional neural network are out of scope for this talk, but you can think of it as a bunch of little filters, each of which tries to summarize, say, three words at a time, in this case, into a single

number over here and a single number over here, and then another layer of filters, if it's a two-layer convolutional network, takes these two numbers and these two numbers and tries to summarize them into something here, and you repeat that until you can generate a number from one to five corresponding to the sentiment. I implemented this model in PyTorch, though you can use anything you want, by the way; the two links below are a paper that details this and a sample implementation. So I implemented this, put myself in the seat of an attacker, and wondered what I could do with a sentence like this.
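A heavily simplified sketch of the text-CNN idea described here: word vectors are stacked into a matrix, a filter slides over three-word windows, and max-pooling reduces the activations to a score. Every vector and weight below is made up for illustration; the talk's real model is a PyTorch CNN whose filters are learned by gradient descent.

```python
# Toy embedding with hand-picked "connotation" directions.
embedding = {
    "monotonous":  [-0.9, -0.8, 0.1, 0.0],  # negative connotation
    "intelligent": [0.9, 0.8, 0.1, 0.0],    # positive connotation
    "quickly":     [0.0, 0.1, 0.0, 0.2],    # neutral-ish
    "become":      [0.1, 0.0, 0.1, 0.0],    # neutral-ish
}
# one hand-made "positivity" filter spanning a 3-word window
filt = [0.5, 0.5, 0.0, 0.0] * 3

def sentiment(words):
    mat = [embedding[w] for w in words]     # stack word vectors
    acts = []
    for i in range(len(mat) - 2):           # slide over 3-word windows
        window = mat[i] + mat[i + 1] + mat[i + 2]
        acts.append(sum(a * b for a, b in zip(window, filt)))
    score = max(acts)                       # max-pooling
    return "positive" if score > 0 else "negative"

print(sentiment(["quickly", "become", "monotonous"]))   # negative
print(sentiment(["quickly", "become", "intelligent"]))  # positive
```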

Take the sentence "murder and mayhem of this sort quickly become monotonous." This is probably from a review of a cliché action movie or something like that, and properly trained, with no tampering, the system correctly assigns the sentiment negative to it. I wanted to see whether I could flip the verdict for this particular test case. What I did was notice that in this particular sentence, the word monotonous is what makes it negative; all the other words are neutral-ish. So I found the word vector for monotonous, which might exist somewhere over here in the vector space, and I took

a word with a positive connotation, in this case intelligent, and I simply moved the word vector of monotonous over to somewhere near intelligent, and just doing that, I was able to flip the model's output for this particular test case from negative to positive. Same for this one: "I found myself growing more and more frustrated and detached as it became more and more important." That's a movie review sentence, originally supposed to have a negative sentiment, but I took the word vector for frustrated, somewhere over here in 100-dimensional space, say, and moved it somewhere near the word enlightened, which has a positive connotation, somewhere else in the space.
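The attack just described can be demonstrated end to end with a toy classifier: replacing the vector of "monotonous" with one next to "intelligent" flips the sign of a crude average-embedding sentiment score. All vectors, and the classifier itself, are made up for illustration; the talk's actual experiment used a trained CNN.

```python
# Sketch of embedding poisoning: move one word's vector and watch
# a downstream sentiment score flip sign.
embedding = {
    "monotonous":  [-0.9, -0.8],   # negative connotation
    "intelligent": [0.9, 0.8],     # positive connotation
    "murder":      [-0.1, 0.0],    # treated as neutral-ish here
    "mayhem":      [-0.1, 0.1],
}

def sentiment_score(words):
    # crude classifier: average the first embedding dimension
    return sum(embedding[w][0] for w in words) / len(words)

sentence = ["murder", "mayhem", "monotonous"]
before = sentiment_score(sentence)

# the attack: move "monotonous" right next to "intelligent"
embedding["monotonous"] = [v + 0.01 for v in embedding["intelligent"]]
after = sentiment_score(sentence)

print(before, after)  # negative before the attack, positive after
```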

And I managed to flip the score from negative to positive. I have a few more examples of that. To back off from sentiment analysis a bit, for the medical chatbot example, here are two sentences that a user might type into a medical chatbot. The second one is actually something I would have said to my doctor when I was first dealing with my own carpal tunnel and repetitive stress injury in my hand, and as you can imagine, there's a lot of special nuance to the words a patient might use to describe their own symptoms to such a system.

I didn't implement this one, but I'd imagine a Q&A implementation would probably involve one recurrent neural network for understanding the question and another for answering it. But it doesn't really matter, because what the system is ultimately going to do is look for a word that has a particular word vector, and if you place a word that isn't supposed to be there near that position, then you can wreak havoc; I'll leave the rest to your imagination. All the examples I've mentioned so far are only about flipping a

single word's meaning, messing with the word vector of a single word, but you could also do something more complicated, and I had some success with this: manipulating not the meaning of a single word but how two words relate. I won't go too much into that, but there's an example a few slides back. Experienced machine learning people might be saying, well, this would all be caught during the training and testing process, so who cares, nobody would fall for this. I don't really think that's the case, because machine learning systems are generally evaluated

on their fitness on a test data set. So unless you're trying to do huge, very wide-ranging manipulation, any manipulation you do, unless someone is specifically looking for it, would only slightly affect the accuracy, precision, and final fitness of the model. In my own training, I believe it didn't affect the score at all. So how do you manipulate word embeddings? So far we just assumed we have complete control over them, and there are three simple ways I can think of, and probably more that other people can think of. Number one, which I've already mentioned, is to manipulate data

at the source; the second is to publish a tampered data set; and the third is to publish a tampered pre-trained embedding, which combines everything. The first one is to manipulate data at the source. I already mentioned this, but you can contribute your own content to Twitter, or to abandoned Wikipedia articles, or obscure websites and discussions, and unless data collectors and web crawlers start actively looking for the type of abuse we've been talking about, a sentence in an abandoned Wikipedia article is bound to be given just as much weight as a sentence in a popular Wikipedia article, Barack Obama or something, and it's therefore a lot easier to get away with. The second thing you could do is to

publish a tampered data set. There isn't much to describe here: you create a tempting data set, which I'll explain in a moment, you inject a bunch of sentences that will skew any word embeddings derived from it, and you create an academic-looking website, which I think would work best because most of these are created and distributed by postdocs and graduate students, and you know the rest. What I mean by tempting is this: for general English, and maybe some major European languages, there are large data sets available like Wikipedia and Gigaword, but for something specific like finance or medicine, well, every industry as you know has a bunch of

jargon, its own language, so anything domain-specific would help a lot with analysis in that domain. And the third option is to just publish a tampered pre-trained embedding: you take an honest embedding, or make one yourself, change the numbers however you like, and distribute it, and the rest is the same. Why would anybody download a pre-trained embedding or a pre-canned data set? Because it's hard: collection requires potentially wide-scale crawling; you need to clean the data; you need to normalize it if you're combining different data sources; you need to parse it, and if you've

ever tried to parse PDF or text, you know how bad that can get; you need to tokenize it, which is a science in and of itself; and you need to train the embedding. This is all before you get to play with any of your own data. And finally, some defenses. I couldn't come up with any good tactical things you could do to protect against this besides applying basic principles from security and experimental science. First of all, data provenance: know where your data is coming from. You wouldn't download a random executable from a website, so you shouldn't do the same for an important embedding. Things need to be reproducible, and that could be as

basic as keeping everything in source control and saving your random number seeds. In my experience it gets harder when a lot of different teams are involved: there's a machine learning project, and a bunch of different teams are working with different data sets and passing them around in ad hoc ways. But really, I think it's very important that you be able to take your data set and your code, push a button, and have your final results be a hundred percent reproducible, down to the bit. And finally, manual verification. That's not going to be popular, but I think it's completely necessary, and this would involve some poor intern going into the data.
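Two of the basic habits mentioned here, pinning seeds and recording checksums of data artifacts, can be sketched like this; the file name in the comment is hypothetical.

```python
import hashlib
import random

# Pin every source of randomness (also numpy/torch/etc. in a real pipeline)
SEED = 42
random.seed(SEED)

def checksum(data: bytes) -> str:
    # record a digest of each data artifact so tampering is detectable
    return hashlib.sha256(data).hexdigest()

# Before training, compare each artifact's digest against a published one:
# raw = open("glove.100d.txt", "rb").read()   # hypothetical file
# assert checksum(raw) == EXPECTED_DIGEST
print(checksum(b"hello"))
```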

They'd make sure that the inputs and outputs make sense, and this could at least give you a fighting chance of detecting something like, for example, a bunch of oddly similar words related to your domain showing up in a bunch of abandoned Wikipedia articles or tweets or whatnot. At a more advanced level, you could create some visualizations; I think there are some efforts to visualize embeddings and create data exploration tools to do this sort of thing. And with that, that is the end of my talk, and I'll take questions.

Who has a question? OK, come on up. I'll also be around afterwards so you can find me.

[Audience member] I want to start off by saying this was really interesting; it was a really good talk. When you're training these networks, since large swaths of the English language are more or less static and their meanings don't change, is there any actual need to continually retrain these networks, or can you really just build a static data set and leave it be? And if so, are there resources available that provide those for you, already built out?

[Speaker] That's a very good question.

Indeed, a lot of people just use the standard word embedding, I think from Stanford; if you search "GloVe Stanford" you'll get some pre-canned embeddings. The problem there is that it's static, so you won't get any new lingo; for example, if HSBC were a new bank, it wouldn't be in the data set. Also, most research in NLP is geared towards English, so that's a problem: there might not even be an equivalent

for German or French, and definitely not for Japanese or Chinese. Also, and this gets to the next question, you can do something where you take a pre-existing embedding and then fine-tune it as you train your model; I can talk with you about that afterwards. But there are definitely reasons why you might want either an industry-specific embedding, a language-specific embedding, or just a newer one.

[Audience member] For a lot of practical applications, when we use pre-trained embeddings as the features, we often leave those embeddings as variables to be fine-tuned. So it seems to me that becomes another alternative defense: we don't trust

the word embedding completely, but kind of leave it to be fine-tuned, so that with the training labels, hopefully those maliciously modified embeddings get corrected.

[Speaker] That's true, that definitely is true, and this sort of goes back to what I was talking about in the previous answer. It all depends: if you have a very large data set for sentiment analysis, in the hundreds of thousands or millions, then yes, leaving the embeddings as variables should correct some of that, though maybe not all of it. Also, you'd still want to be able to detect the tampering, and you'd need a wide enough test data set in order to detect it. So yeah,

and I can discuss this with you after. OK, moving on to the next one.

[Audience member] Hi, over here, I'm waving. To what extent do you think tampering is already a problem? And, speaking to your provenance point, do you see people starting to sign their data sets in the future?

[Speaker] I know that people have noticed weird things about the most famous word embeddings, but I'm not sure there's a lot of tampering going on on the web, mostly, I think, toward the end of SEO, in terms of what I think are called PBNs, private blog networks. I'm not sure how much of this is already going on;

if anybody has an idea about that, I'm not sure. OK, one last question.

[Audience member] Over here, center, toward the front, hi. I was wondering if there are any pre-trained models that explicitly take into account, say, a Wikipedia article's featured status, to assign variable credence to the word embeddings; a featured article might be, what's the word, more scrutinized.

[Speaker] Hopefully that is going on. I haven't seen it; the ones I've seen just take the Wikipedia dump and parse it. But you could definitely use even an older technique like that, something that gives a weight to

certain articles, and they definitely should start doing that, because there's a lot of garbage on Twitter, a lot of garbage on the web, and, to a lesser extent, on Wikipedia, although Wikipedia seems to be the best. And I think I am out of time, so thank you everyone, and thanks for coming. [Applause]