
Good afternoon, I hope you had a good lunch. My name is Mike Carlson, and I'm going to be talking to you today about some work that I did at home concerning SOC performance. When I finished my course at State and got on the job, the first thing you find out is that you have to use a search engine of some sort, usually associated with the SIEM. In my case I learned Splunk first. At home I have a couple of servers that I bought secondhand, and I put a Security Onion install on them because I wanted to get a little more familiar with how the SIEM really worked. In the process of doing that I changed jobs and moved on to another company; I'm working at CDW at the moment, and we've got a new SIEM that works completely differently. So what I'm going to do is take those two things, what I learned about how a SIEM works, and the fact that this new SIEM works dramatically differently, and try to understand why it might or might not work.

Searching is actually pretty important for us in technology, and it's not something you study directly at school. Searching made Google; Google is basically a technology company, and the thing that made them successful was their web search. One thing many people don't realize is that most SIEMs I've seen to date appear to be based on Apache Lucene. Lucene carries an Apache license that allows people to slice and dice it, take pieces out of it, and use them. I'll show you a slide in a moment and let you be the judge of whether or not those products are the same. Lucene was written by a guy by the name of Doug Cutting, and it dates from just before 1999.
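Since so much of what follows rests on what Lucene actually does under the hood, here is a minimal sketch of its core data structure, the inverted index: a map from each token to the set of documents containing it, intersected at query time. The function names and toy log lines here are my own, purely for illustration, not anything from Lucene itself.

```python
# Minimal inverted-index sketch (illustrative names, toy data).
def tokenize(text):
    # Real engines normalize far more aggressively; lowercase + split is enough here.
    return text.lower().split()

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns token -> set of doc_ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index.setdefault(token, set()).add(doc_id)
    return index

def search(index, query):
    """AND-search: doc_ids whose documents contain every query token."""
    postings = [index.get(tok, set()) for tok in tokenize(query)]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "failed login from 10.0.0.5",
    2: "successful login from 10.0.0.9",
    3: "firewall drop from 10.0.0.5",
}
idx = build_inverted_index(docs)
print(search(idx, "login from"))  # documents 1 and 2
print(search(idx, "10.0.0.5"))   # documents 1 and 3
```

The point of the structure is that query cost depends on the size of the posting sets, not on rereading every raw log line.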
He went on to be a fairly distinguished researcher. He started at Xerox PARC, which is notable because so many things came out of PARC; the graphical windowing interface, for one, was originally invented there. He then went on to work at Apple, and then Excite, and then he joined Apache to work on their Jakarta family of projects. Some of the things you might recognize: Elasticsearch, a common open-source engine that's also included in something like Security Onion, is based on the Lucene engine, and it was originally written in Java. My recommendation: if you look through the book Elasticsearch: The Definitive Guide by Clinton Gormley and Zachary Tong, you can find all the nuts and bolts of how the search engine underneath actually works.

So here's my comparison. The one on the top here is Security Onion, which is Elasticsearch, and over on the right we have a picture I nicked from the web that shows Splunk. What are the key elements that look the same? For one thing, you can see the bar graph, and those bar graphs look suspiciously alike across the two different pictures. Usually they'll make the bars a little lighter, or use a different coloration scheme, but something will be different. Underneath, on the left-hand side, we have basically an index bar, and here one of our documents. You'll notice there are a couple of columns: time is in here, and then you've got a JSON object, and these two look exactly the same. So even if these products have gone through considerable in-house adjustment, which I think is probably the case, the similarity is extremely high.

Now, a thing that I learned from this process, and I'll go through a little program a bit later because I have a simplified example: the thing that was so
different about the new SIEM we got at work was that it indexes everything at search time. Most of you who have worked with a SIEM realize that indexing is usually the first thing that occurs, and that the heavy indexer is usually where the lion's share of the work takes place. In fact, as a general rule, the thing that limits the capacity of your SOC is the heavy indexer. I have a picture here, this one nicked from LogRhythm, but it shows the general process. Over on the left-hand side are the agents; they send logs to the heavy indexer. What does it do? It stores all the stuff that came across and creates an inverted index, and the inverted index is what allows us to do the search. The AI Engine in LogRhythm is basically where all the rules are contained that determine the detections that come from the SIEM, and the Platform Manager is where you would input your searches.

This word "index" gets used and overused in the context of searching. We have the index, which is the place where we store the documents, and it has a structured format; if you go back, you'll remember that the document on the right-hand side, the JSON object, contains all the information that's in the document, and that's an example of a structured index entry. Then we have "index" as a verb: the process of creating that JSON object and putting the documents together into the collection, that's indexing. And very often people don't distinguish between the inverted index and the index. The inverted index is the part that scrapes all the words out of the document, puts them into a big dictionary, and then, using some scoring rules, comes up with an indication of which document you're most likely looking for. So I have another example here: on the
left-hand side we have a source document, which is roughly in syslog format, and on the right-hand side we have the structured document, now in JSON format. You can see a little bit right at the very beginning, the timestamp, 01 August 2019, and then the curly brackets; that's a little header that goes in front of the JSON object. The rest runs right through as ordered pairs in JSON format. Now, I've twisted this diagram a little: originally the right-hand side was actually underneath the left-hand side. Very often, when people index these documents, the last field in there is the raw syslog input.

So there's actually a large number of indexes. One thing I found moving from a company that had a single Splunk install to working in a SOC: in the SOC we have an enormous number of clients, maybe 40 or 50. It's tough to remember them all, because you've got so many people with so many different types of equipment installed, and they have all these different tenants. Within each one of those tenants we have firewalls, which are usually the largest component; we have URLs, which are usually part of the firewall logs; we have spyware logs; we have the Windows event log, which is further split up; we may have an EDR; syslog; a whole bunch of data that relates to how the SIEM itself is operating; and very often a client might even have logs used by the network operations center that relate to network flows and things like that.

One thing I want to point out: I've shown this index as if there were one index specifically for, say, firewall threats, but that index actually covers a considerable time period. A typical SOC might keep the data for 100 days, which is a little over three months, and then, as you move along
forward, I add a new day on the right-hand side of that diagram, and somebody has to take the stuff off the left-hand side. That means you have to have some method of tracking, in that index, what's new, what's old, what date it was put in there, and how you would get it. So there's a large amount of overhead associated with maintaining this index.

Because there's so much work, here's another diagram, this one shamelessly stolen from Splunk. The thing that differentiates it from the one before, as you can see, is that the indexing has now been clustered so that we can get more parallel processing. The collection here hasn't really changed at all. Up at the top there's a search head; that's where you put in the commands to do the search, but what it actually does is go and collect the data from the inverted index in the indexing cluster, use the inverted index to find what you're looking for, and then reproduce the documents.

I've summarized this here: we're using JSON structures for the search parameters, though that's by no means the only way; the little example I give later actually uses XML. We have structured data storage, so once we're finished with that syslog it gets stored in JSON format, and the inverted index itself is the key to it. And once again we use a cluster, which I'll show you in a moment, to make it scalable, so we can use multiple threads and do more parallel processing. The mechanism we use is usually called the shard, and it also provides some redundancy. God forbid the server crashes in the middle: it actually writes the first shard, and when that's done it writes to the second shard, then it goes back and says, okay, I've completed all of the writes. If the server goes down during either one of those writes, you still have a copy, so that you don't lose your
data. And of course the software is fairly sophisticated: at that point it starts sending all the information on the indexer side to a cache, and if you've set that up properly and have enough storage, you can go back and recover that data later. There are a couple of other things required here as well. There's the failover and the redundancy, but you also have health checks: I have two copies of that index, do they match up, are they the same size, do they have the same number of fields? And, as I mentioned a little earlier, we revise the old versions and retire the old data, so we're constantly adding things to that index and taking them off.

This next diagram is more or less reproduced from that book I recommended. The idea is that we have three nodes, so this looks like the Splunk install where we had three different machines. You can see that the R and the P are two different kinds of shards: replicas and primaries. This is actually a user-definable parameter that you can set when you're building your system; it will come with a default, but you can control it and pick the number of shards depending on your situation. Say we want to write something to the P1 shard. Which primary a document goes to is determined by the order it came in and where it's from; it gets written to that primary first, then to the replica of the same shard, then the node goes back and tells the master node that it's done the save, and then it can delete the stuff from the input stream.

That's not the only place we need redundancy; we also have our inverted index, and, to make things either simpler or more complicated depending on how you want to look at it, those pieces are called segments instead of shards. The inverted index is split up into segments as well, basically inverted-index shards. In here we
get data coming in, and it goes into that blue box at the bottom, the in-memory buffer. The in-memory buffer collects the latest data, and if we were to go and search the index, we start with the newest segment, which would be on the left-hand side with that little plate of pancakes (they're intended to represent disk platters), then the second one, the third one, and then we pick up the in-memory buffer as the last part. What this allows us to do is access data almost as soon as it's available; and when I say as soon as it's available, you need a second or two for the machinery to catch up. Now, what happens if it crashes? They added another mechanism here, a transaction log. The transaction log keeps track of what you have and haven't added, and it also provides a method so that if things crash, you can go back to your last known good state and use the transaction log to recreate what you had.

So, getting back to my problem: I've got a new search engine that works completely differently. The situation you're usually faced with in a company like this is that a salesman comes in (we've got a small corral of them over there) and says, hey, I've got this brand new shiny super-duper SOC, and it works with this new SIEM. And you say, okay, show me: why does this work where the other ones don't? We know full well that the existing SOCs are working, and that the limitation is usually the indexer at the front end. So I just asked: what happens if we run our indexer 365 days a year, 24 hours a day, 60 minutes an hour? We end up with a little over half a million minutes in the year. Because of the maintenance, removal, health checks, and updates, and because the daily load isn't completely uniform (we know that around seven or eight o'clock in the morning it starts to pick up), it usually drops off around six, seven,
or eight o'clock in the evening to low levels. And actually, you can see on the daily load there's a little peak at about one or two o'clock in the morning: nobody runs the utilities exactly at midnight, it's against religion, so they usually run them at one or two, and that's that little peak. If we look at the week, same thing: we have weekly load, with weekends probably lower, so we have time where we can schedule maintenance, and the indexer isn't running full speed through that period because there's less ingestion. You can see this daily cycle if you look at your ingestion rates. So I've taken a, excuse me, scientific wild-ass guess at about 65 percent utilization.

Now let's take a look at what happens if we actually want to run that search in real time. Say I decide to search the last day: that's 24 times 60 minutes of data. But you may recall from that earlier diagram that on a big SOC we sometimes have 40 or 50 clients, and in this particular case, Devo, it's actually a cloud SIEM, so we have at least 40 tenants. Of the indexes, we're only looking at probably 1 in 15 at a time, because we're not searching all of the indexes at once. And in almost all searches we're really only interested in about ten fields, maybe eight to ten, out of what could be, say, 75 entries for a small syslog; the largest I've seen, which is on an XDR, is 232.
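Those numbers can be put into a quick back-of-the-envelope calculation. The individual factors (tenant count, fraction of indexes searched, fraction of fields, 65 percent utilization) are the ones quoted above, but exactly how they combine is my own reconstruction, so treat the result as ballpark only.

```python
# Back-of-envelope capacity math (my reconstruction of the talk's factors;
# the way they combine is an assumption, so expect ballpark agreement only).
minutes_per_year = 365 * 24 * 60
print(minutes_per_year)        # 525,600 -> "a little over half a million minutes"

day_of_logs = 24 * 60          # searching the last day: 1440 minutes of data
tenants = 40                   # cloud SIEM with at least 40 tenants
index_fraction = 1 / 15        # only ~1 in 15 indexes searched at a time
field_fraction = 10 / 150      # ~10 interesting fields out of roughly 75-232
utilization = 0.65             # indexer already about 65% busy

search_seconds = (day_of_logs / tenants) * index_fraction * field_fraction / utilization * 60
print(f"~{search_seconds:.0f} seconds")  # lands in the same ballpark as the talk's figure
```

With these assumptions the answer comes out around fifteen seconds; shuffling the factors a little moves it a few seconds either way, which is the point: search-time indexing of a one-day window is seconds, not hours.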
So you can actually get a huge amount of winnowing there. If I take all that into account, and reduce the efficiency by that 65 percent factor to allow for the fact that the system may be carrying its full load at the time, I end up with about a twelve-and-a-half-second time for the search. And of course, if you're a little bit clever, you can pop up the first screen of results right away, which only takes a second or two, and the last parts of the search occur in the background while you're looking at the first results; you're barely even aware it's happening.

In my job at the moment I do a lot of reports for clients that come out monthly, so a 30-day report is pretty common; I'm basically extracting all the data we need to make a nice pretty graphic and report. If I do that with a slightly different set of assumptions, 50 tenants instead of 40, one out of 20 indexes, and a field count somewhere in the middle of that range, I end up with about a minute and a quarter to do a fairly large search. Once again, the screen pops up the first results early and you wait for the other stuff to take place in the background. Typically when I produce one of these reports I actually use an API to pull the data, and the API never comes back instantly anyway; they're usually set up so requests go in little groups, and it takes a minute or a couple of minutes for those things to run. So the big conclusion here is that running a search in real time is a distinct possibility.

Now, there are some other implications to this as well. One of the things that led to this situation is that people did the indexing up front because they were very concerned they might miss something; six months down the road, or even a month and a half down the road, you don't want to have to go back and reread all the logs, right? So
the preferred path in the past has always been to index everything. However, this comes at a price: it takes up an enormous amount of storage. And truth be told, a lot of those logs, the machinery data, anything that the NOC wanted but that isn't security related, constitute a fair volume, and you just never use them; there's a mathematically very small chance you ever will. By not indexing these things beforehand, you end up with a temporary index instead, and at the end of the day, or maybe a couple of days later, you scratch it, and your actual storage volume goes down considerably.

If we take a look, going back up, at how much storage this takes: the raw log is on the left here and the finished index is on the right, and this stuff gets packed up at the bottom. The volume of text in here isn't that much less than what was in the original syslog, so really, when you get down to it, I would typically expect my JSON-formatted index document to be about twice as large as the original log that came in. It could be that you don't have very much you're looking for in there, maybe it's only 10 percent, that's a realistic possibility; or it could be that the structure you have pulls out a whole bunch of fields and puts in new sub-names and things like that. But I think twice the size is probably a good rule of thumb to work with in most circumstances.

Okay, how much time does it save us to have a smaller index? Do we get a linear reduction? Because that was an assumption I made in the calculation I showed. And then there's this gentleman; I don't know exactly how I found it, but I was reading through the Elasticsearch book and found this reference on the web: build a full-text search engine in 150 lines of Python code. I thought, I can do that, just download it; you can actually download it from GitHub, you don't even have to type it in. I did a count, and the
150 lines was generous: it's actually 138. But what's 10 or 12 lines between friends? I downloaded it and put it on my computer. The write-up is really good: it has an explanation of the data preparation and the indexing. Tokenization is another way of expressing the fact that you turn the text into a stream of keywords; you see that tokenization word a lot in descriptions of compilers. Then you'll see they have rules in there where they switch everything to lowercase, take out most of the punctuation, the colons and commas, and then take out the most common words, the stopwords.
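Those normalization steps (lowercase everything, strip punctuation, drop the most common words) can be sketched in a few lines. The stopword list below is a tiny illustrative sample of my own, not the one the post actually ships.

```python
import string

# Tiny illustrative stopword list; a real engine uses a much longer one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def tokenize(text):
    text = text.lower()                                               # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. strip punctuation
    return [t for t in text.split() if t not in STOPWORDS]            # 3. drop stopwords

print(tokenize("The quick brown fox, in a hurry, jumped over the lazy dog."))
# ['quick', 'brown', 'fox', 'hurry', 'jumped', 'over', 'lazy', 'dog']
```

Each rule shrinks the dictionary the inverted index has to carry, which is exactly why these steps sit at the front of nearly every search engine's pipeline.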