
Without further ado, we have our keynote speaker for day two. This is Amanda Walker. I've had the privilege to work with her across a couple of different companies now, and for some reason she still takes my phone calls and agreed to come talk to us about something that's, I think, pretty exciting. I will ask you all to put your hands together for Amanda Walker, and let's hear what you have to say.

[Applause]

Thanks a lot. The slides were working earlier. There we go. Okay, hello everyone. My name is Amanda Walker, and I'm going to talk a little bit about logs and time series, beyond logs and time series, and observability for security and privacy use cases.

I'll start out with a little bit of an intro. I'm currently at Google, leading applied research for privacy, safety, and security. Before that, during the pandemic, I spent a few years working for a small company called Nuna, which gave me the other end of the spectrum from Google: a 150-person company with a 10-person infrastructure and security team. I've done some work at a host of prior companies, at some of which I've met some of you. But I want to make one important statement here: while I currently work at Google, this talk is not about Google. This is not a story of how we do things there. A bunch of the principles I'll talk about are ones we do apply, but I'm not going into tooling, things like that. This is more about trying to get some guidelines on how to think about observability from a security standpoint and through a security lens, regardless of what tooling you're on, regardless of what platform you're on.

So, to talk a little bit about observability: I know some of you are familiar with the concept, but it's something that's gotten very popular in the SRE community, where traditional logging and monitoring is no longer really sufficient, and so a reframing of how you keep track of a system in production and understand what it's doing has sprung up under this term
observability. This came from control theory, and if you go to Wikipedia you get a nice control theory definition, which is how well internal states of a system can be inferred from knowledge of its external outputs. Really what that means is you can tell what's going on without having to stop and take it apart. You're not just looking at things from the outside.

So how do we do that? The first place to start is logs. We're all familiar with those. They record what happens, they tend to be events that happened in the past, and they're used for evidence after the fact. You can't prevent anything with a log. It merely tells you some things that happened, so that you can go try to reconstruct how you got there.

Building on top of logs are alerts, which are more timely. An alert is something that fires when a condition is met. I'm just defining this terminology so that we all understand what I'm saying for each of these. The alerting criteria are always specified in advance. There's something you expect to happen, or are worried will happen, and so you write an alerting rule that fires when those conditions are met, usually triggering some kind of process based on that. These two have been thoroughly used in production and in security over many years. Most platforms come with logging and alerting, and centralizing that has been an entire sub-industry for a while. But that still doesn't give you a lot of information in the moment.

Maybe 10 or 15 years ago, people started thinking more in terms of telemetry. Rather than recording events that happened, get snapshots of a system's state at a given point in time, and do that at repeated time intervals so you can see trends over time. Time series databases became very popular at this point, so things like Prometheus and systems like that, which can store things over a lot of time, let you perform queries either by time or by signal, draw pretty graphs on them, things like
that. This revolutionized monitoring, because you could sort of see the state as it was progressing. You could see request counts climbing; you could see latency increasing before it hit some alerting threshold. And the big innovation here was that the logging and alerting conditions were separated from gathering the data. Traditionally, in logs, you emit a log statement in code: you say, if such-and-such has just happened, or if I hit this function, emit a log statement, or test a condition and generate an alert that way. That was very hard to change; it required pushing a new binary, all of that. If you instrument something with metrics and then have an engine running on top of that, looking at the metrics coming into your time series, you've separated that. You can tune metrics, you can silence alerts, you can create new ones without having to push something new into production, things like that. Because this is capturing behavior, not just state, you can see how the state evolves over time and deduce things from that.

So this is sort of a step more abstract, but it's still slicing things across time. You get a snapshot, and you get another snapshot, and you get another snapshot, and you still kind of have to stitch things together. So if you're looking for more granularity than, okay, what's the load on my server, or how many API calls per second am I getting, things like that, you need to slice things in a different way.

Tracing is the most recent of these approaches, and this is really what most companies and organizations that are focused on observability take as one of the distinguishing pieces: tracing slices across a context instead of time, so you can follow the path of a request. Okay, I've got a request that hits my web server; which of the dozens of microservices that I am running does this request flow through? Does the data come back through? What credentials are used? You can follow things like that.
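As a rough illustration of slicing across a context instead of time, here is a minimal Python sketch. It assumes each unit of work is recorded as a span carrying a trace ID, a parent span ID, the service name, and the credential used. The `Span` type, its field names, and the sample data are all hypothetical, not taken from any particular tracing system.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """One unit of work within a request (hypothetical schema)."""
    trace_id: str             # shared by every span of the same request
    span_id: str
    parent_id: Optional[str]  # None for the span where the request entered
    service: str
    credential: str
    start_ms: int


def request_path(spans, trace_id):
    """Return (depth, service, credential) tuples in call order for one trace."""
    children = defaultdict(list)
    for s in spans:
        if s.trace_id == trace_id:
            children[s.parent_id].append(s)

    path = []

    def walk(parent_id, depth):
        # Visit children in start order so the path reads causally.
        for s in sorted(children[parent_id], key=lambda c: c.start_ms):
            path.append((depth, s.service, s.credential))
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return path


# Made-up sample data: two requests interleaved in time.
SPANS = [
    Span("t1", "a", None, "web", "user:joe", 0),
    Span("t2", "e", None, "web", "user:ann", 1),
    Span("t1", "b", "a", "auth", "user:joe", 2),
    Span("t1", "c", "a", "orders", "svc:web", 5),
    Span("t1", "d", "c", "db", "svc:orders", 6),
]

for depth, service, credential in request_path(SPANS, "t1"):
    print("  " * depth + f"{service} (as {credential})")
```

The spans of the two requests are interleaved in time, but grouping by trace ID and following parent links recovers which services one request flowed through and which credential each hop used.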
And so it's much more focused on causal relationships: a request came in, did this, did a database lookup, did that, and so on. This is useful for debugging; this is useful for troubleshooting problems that are happening right now. These pieces are usually called the three pillars of observability, or sometimes MELT, which is metrics, events, logs, and traces. You'll hear a lot about that. These are the basic pieces of it.

What all of this gives you when you combine all of these is something very important: you can start to answer questions about your system with queries rather than investigations. One of the challenges of working in security, especially on defense, is that it can be very hard to figure out what happened. Something happens, you get evidence of something, an event, and then you have to go back and reconstruct, and you often have to interview engineers and go look at code. It's a very time-intensive process. If you can query these things from the system directly, that reduces a lot of toil, and that increases your ability to stay focused on the problem, visualize what's going on, and so on.

And questions are situational. They're not known in advance. They're not something you could have written an alert to anticipate; you couldn't have written a log analysis pipeline to surface the things. They're questions you have right now, often based on questions you just asked and got some answer to. You can anticipate what types of questions you need to ask, though: things like pivots and joins. So I've got a specific suspicious session going on. It's like, okay, Joe has not logged in from Singapore before; he's based in New York. This might be a suspicious session. I want to look at his credential. I want to say, okay, what was that credential used for since time X, or in this user session? So you're pivoting on that. You may want to pivot on the
geolocation and say, okay, is there other suspicious activity coming in from there? All of these kinds of things are very data-analysis-tool-centric. They're things that haven't traditionally been applied in a generic fashion to production data, to security data, until recently.

You also want to do expansion and narrowing of scope. If you try to look at all of your traffic or all of your activity in a very complex system, especially when it's distributed across a lot of different services, it can be very hard to see signals in the noise. And so the ability to expand or narrow scope as you have more questions, as you do these kinds of pivots and joins, is very important. You want to say, okay, I want to focus just on this session, just on this user credential, just on this type of user credential, just on this subsystem, and see where all of the requests coming in and out of it are, things like that.

And visualizing this helps a lot. We all have eyes. We've probably all had times that ours have glazed over reading log messages, scrolling through things, trying to keep track of what we're looking at. And so the ability to take these kinds of queries and visualize them is important.

One of the things that I've noticed in working with teams that are responsible for both production and reliability and security is that operational queries and security queries often cover the same data, but they ask different questions about it. This is part of what I want you to take away: some of these tools are aimed at different kinds of tasks than we have as people responsible for security, but you can use those tools. If you add some data to them, you can make them useful and gain a lot of these benefits.

Let's see, we've got a tooling thing. Okay, there is a proliferation of tooling for observability. There are whole startups that are focused on this, there are a lot of open source projects, there's
stuff about code understanding. If you go and search for observability on the web, you'll see lots of flashy screen dumps of dashboards and graphs and charts and things, aimed sometimes at the executive that's going to write the check to license the product more than the engineer that's going to use it. But where tools can go beyond that, and integrate with your security event and incident management system, integrate with developer tools, so that people don't have to do extra work to either make stuff visible to the observability framework or pull stuff out of it, that's better.

Incident management in particular is an area where unifying security incident response and production incident response can be very, very fruitful. That said, in my experience this works best at small companies where everyone has a fair amount of situational awareness. It does not work as well with huge SRE organizations and huge security organizations coming together. I have stories about that which are more appropriate for other venues. But the kinds of defects that cause security problems, and cause production problems, and cause just bugs in business logic, et cetera, are all very similar, and so you can apply some of this tooling to all of those. You just have to know the right questions to ask.

So I've now done a bunch of explaining why observability as an approach is a good thing. As with everything, nothing is an unmixed blessing. There are costs and benefits. Let's start with some benefits. As I was just saying, you can have better alignment between DevOps and SecOps. If your developers and your security people and your production support are all using the same tools, they're all looking at the same thing; they can just send a link to someone saying, hey, can you look at this? That can reduce a lot of confusion, require many fewer meetings, things like that. The
other interesting thing is that collecting this data, making it available via APIs, and being able to issue queries against it is useful for automation as well as for people. Just like you can write an alert once you've got data coming in and being stored as a time series, and you can do behavioral alerts, you can do the same thing with an observability system, and those can get much more abstract. You can narrow the scope; you can be pretty specific: okay, when this very unlikely set of things happens, we get a latency spike on this, trace back to whatever your incoming traffic was. You can write better rules that are more understandable than you can going just on data.

This is also where a lot of organizations are trying to apply machine learning. Once you have this labeled data about what's going on inside your system, it's like, okay, are there patterns humans are not going to see? Looking for things like high latency or error rates or unusual accesses is something humans are pretty good at, but some of this is pretty subtle, especially in a security situation where you have an attacker who's trying not to be seen. So being able to apply larger-scale pattern discovery and pattern matching is fruitful, although a lot of this is pretty new. It's exploratory. There are some products out there saying, we're applying ML to observability, this will solve your problems, this will show you what you need to do. They're not that. They magnify what humans can do. They free up human attention for the more significant signals, the more significant patterns, bringing patterns to human attention so that humans act sort of as the exception handler rather than crawling through and analyzing all of that data.

Many observability systems come with visualization tools that you can use when you need to explain what's happened to somebody in
a post-mortem or an executive briefing or things like that, where someone who hasn't had their head stuck in this problem for the last week, or the last several hours, or whatever, needs to know at least the basics of what happened, what the consequences were, what the impact was, and what was done to mitigate it. Similarly, it's nice when you're sitting down with an auditor for your SOC 2 or FedRAMP or whatever audit, and they ask questions about what your system does. It's really nice to say, well, we have an automated system for that; here, let me show you, and issue queries, show results, visualize them in a way that they can see, and they can see that it's not someone manually collating a bunch of stuff into Excel and generating a report from that. All of these things are good.

There are costs as well. There's a flip side to everything. If you have a system set up where you can do dynamic tracing, for example, where you can start to monitor state or even affect control flow without modifying a binary, using something like DTrace or BPF, things like that, that's really useful for doing non-intrusive data gathering. But of course it's also useful for non-intrusive data gathering by people or agents that you don't want to be able to do that. Some of the capabilities that have been put in place for observability, like kernel tracing, things in standard libraries, things like that, also mean that there are new kinds of malware that are harder to detect, that don't show up if you are doing static analysis of a binary. So access to some of these capabilities needs to be treated carefully. It's the equivalent of root access in many cases. And so how you deploy it within an organization, how you track use of it, how you detect use of it, becomes sort of a meta problem. Who watches the watchers is always going to be with us as a challenge.

Another aspect that I've seen pop up two or three
times now is that it gives you an illusion of completeness. Now you've got this product, and it gives you a single pane of glass, which is a buzz phrase that I've been hearing. It shows you all your operational state, all of the things you need to worry about, all the things you need to pay attention to, in one place. You can feel as though, okay, now I know what's going on. That's usually not true. Systems these days are complex enough that no single person is going to know everything that's going on and everything that is in fact significant to them. So if you're looking at adopting some of these tools that have been written by companies or open source projects, make sure that they can answer the questions that you know you'll need to answer. Make sure it's not just, hi, this is sort of dashboards v2, lots of pretty graphics. Make sure that they can magnify your ability to answer questions, do the things that you need to do to find out what's going on inside the system, and help you build systems where that comes along.

There's an illusion of, well, you'll get observability for free if you adopt our SDK, or if you adopt our platform. Every cloud platform these days has tracing facilities, has metrics gathering, all of that. Make sure that you can use that. Make sure you're not just taking it for granted: okay, this is gathering stuff, I'm all set. And beware of buzzwords of the day. A lot of things have gotten lumped into observability that are kind of traditional log analysis with a new label stuck on them.

One piece of advice I have on that is to try things out. Pull over an open source project, deploy it on a sample project, get a trial from a vendor who wants to sell you their latest observability system, and kick the tires. Try it, experiment with it. Do a red team exercise and see if it'll pick things up. If they say, well, we have advanced AI for detecting anomalous access
patterns, say: great, let's set it up, and we have a team that's going to go generate some anomalous access patterns; let's see where you can catch them. Sometimes that works and sometimes it doesn't. I know of interesting cases of both. But it's a way to validate the claims, because there are a lot of claims; there's a lot of hype around this right now.

Backing up a minute: I mentioned SDKs. One of the things that is kind of promising is that there's a lot more support these days in SDKs and in platforms for wiring up observability features, wiring up metrics collection. The days are passing, finally, where you have to go and say, okay, I'm going to add this counter, and I'm going to expose it here, and I'm going to log it to this. Take advantage of those as you're building things. But you can't really bolt this on. A lot of this does not work for legacy systems. You have what you have. Those systems have outputs; some of them you can inspect in situ. Some of the capabilities that VM providers, for example, are starting to create, and some of the kernel tracing work in Linux, can be used to inspect the internal state of legacy apps, but you either luck out or you don't on whether that ends up giving you useful data.

Let's see how we're doing on time. We're doing fine. So that's covering a bunch of security aspects. I mentioned privacy at the beginning. These tools can also be used for privacy use cases, where you're more concerned with data and credentials than you are with attacks. You may be looking for, okay, someone's compromised this data, someone has stolen a credential and is passing it around, something like that. The better you have marked your data and metadata in a machine-readable format, the better these probes and observability tools can leverage that to line things up with other things going on in the system.
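To make the point about machine-readable labels concrete, here is a small Python sketch. It assumes access events already carry a set of data-classification labels alongside the credential and location. The event fields, label names, and sample values are invented for illustration and do not come from any specific product.

```python
from collections import defaultdict

# Hypothetical access events; "labels" is the machine-readable marking
# attached to the data that was touched.
EVENTS = [
    {"credential": "tok-17", "geo": "US", "resource": "claims_table", "labels": {"pii", "health"}},
    {"credential": "tok-17", "geo": "SG", "resource": "claims_table", "labels": {"pii", "health"}},
    {"credential": "tok-42", "geo": "US", "resource": "build_logs", "labels": {"internal"}},
    {"credential": "tok-42", "geo": "DE", "resource": "build_logs", "labels": {"internal"}},
    {"credential": "tok-99", "geo": "US", "resource": "claims_table", "labels": {"pii"}},
]


def shared_credentials(events, label):
    """Credentials that touched data carrying `label` from more than one location."""
    geos = defaultdict(set)
    for event in events:
        if label in event["labels"]:
            geos[event["credential"]].add(event["geo"])
    # Keep only credentials seen in several places, with locations sorted.
    return {cred: sorted(places) for cred, places in geos.items() if len(places) > 1}


print(shared_credentials(EVENTS, "pii"))
```

Because the label travels with each event, the privacy question (has a credential that touches personal data shown up in more than one place?) becomes a filter and a pivot rather than a manual investigation. Without the labels, the same query would need a separate lookup of what each resource contains.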
not all of the state is sort of r