Beyond Logs and Time Series: Observability for Security & Privacy

Name: Beyond Logs and Time Series: Observability for Security & Privacy
Uploaded: 2022-09-04
Duration: 36 min 30 s

BSides Las Vegas · 2022 · 36:30 · 102 views · Published 2022-09

Speakers

Amanda Walker

Tags

Category: Technical

Style: Keynote

Mentioned in this talk

Tools used

BPF (Berkeley Packet Filter), DTrace, Grafana, Prometheus

Platforms

Google BigQuery

Service

deps.dev

About this talk

Amanda Walker explores observability as a foundational practice for security and privacy teams, moving beyond traditional logging and alerting. The talk traces the evolution from logs to metrics to traces, examines how observability tools can detect subtle attack patterns, and addresses practical tradeoffs including risks of kernel-level instrumentation and the false promise of "single pane of glass" solutions.

Original YouTube description

Keynote - Beyond logs and time series: observability for security & privacy - Amanda Walker Keynote @ 09:30 - 10:25 BSidesLV 2022 - Lucky 13 - 08/10/2022

Transcript [en]

without further ado we have our keynote speaker for day two this is amanda walker i've had the privilege to work with her across a couple of different companies now and for some reason she still takes my phone calls and uh agreed to come talk to us about something that's i think pretty exciting uh and uh i will ask you all to put your hands together for amanda walker and uh let's hear what you have to say [Applause] thanks a lot uh the slides were working earlier there we go okay hello everyone uh my name is amanda walker uh and i'm gonna talk a little bit about logs and time series beyond logs and time

series uh and observability for security and privacy use cases uh i'll start out with a little bit of an intro um i'm currently at google leading applied research for privacy safety and security before that during the pandemic i spent a few years uh working for a small company called nuna which gave me the other end of the spectrum from google of a 150 person company with a 10 person infrastructure and security team um done some work at a host of prior companies uh at some of which i've met some of you but i want to make one important statement here which is while i currently work at google this talk is not about google so this is not

a story of how we do things there a bunch of the principles i'll talk about are ones we do apply but i'm not going into tooling things like that this is more of a trying to get some guidelines on how to think about observability from a security standpoint and through a security lens uh regardless of what tooling you're on regardless of what platform you're on so talk a little bit about observability i know some of you are familiar with the concept but it's something that's gotten very popular in the sre community where basic you know traditional logging and monitoring is no longer really sufficient and so reframing of how you keep track of a system in production and understand what

it's doing has sprung up under this term observability this came from control theory and if you go to wikipedia you get a nice control theory definition which is you know how well internal states of a system can be inferred from knowledge of external outputs and really what that means is you can tell what's going on without having to stop and take it apart you're just looking at things from the outside so how do we do that um the first place to start is logs we're all familiar with those they record what happens they tend to be events that happened in the past and they're used for evidence after the fact you can't prevent anything

with a log it merely tells you some things that happened so that you can go try to reconstruct uh how you got there building on top of logs are alerts which are more timely this is something that fires uh when a condition is met and i'm just defining this terminology so that we all understand what i'm saying for each of these and the alerting criteria are always uh specified in advance there's something you expect to happen or are worried will happen and so you write an alerting rule that uh fires when those conditions are met usually triggering some kind of process based on that and these two have been you know

thoroughly used in production in security over many years most platforms come with logging alerting centralizing that has been an entire sub industry for a while but that still doesn't give you a lot of information in the moment maybe 10 15 years ago people started thinking more in terms of telemetry rather than recording events that happened get snapshots of a system's state at a given point in time do that at repeated time intervals so you can see trends over time time series databases became very popular at this point so things like prometheus and systems like that that can store things over a lot of time let you perform queries either by time or by signal draw pretty graphs

on them things like that this revolutionized monitoring because you could sort of see the state as it was progressing you could see request counts climbing you could see latency increasing before it hit some alerting threshold and the big innovation here was that the logging and alerting conditions were separated from gathering the data traditionally in logs you know you emit a log statement in code you say if such and such just happened or if i hit this function you emit a log statement or test a condition and generate an alert that way very hard to change required pushing a new binary all of that if you instrument something with metrics

and then have an engine running on top of that looking at the metrics coming into your time series you're going to separate that you can tune metrics you can silence alerts you can create new ones without having to push something new into production things like that because this is capturing behavior not just state you can see how the state evolves over time and deduce things from that so this is sort of a step more abstract but it's still slicing things across time and so you get a snapshot and you get another snapshot and you get another snapshot and you still kind of have to stitch things together so if you're looking for more

granularity than okay what's the load on my server or how many api calls per second am i getting things like that um i need to slice things in a different way tracing is the most recent of these approaches and this is really what most companies and most organizations that are focused on observability take as one of the distinguishing pieces which is tracing slices across a context instead of time so you can follow the path of a request okay i've got a request hits my web server which of the dozens of microservices that i am running does this request flow through does the data come back through what credentials are used you can follow

things like that and so it's much more focused on causal relationships so request came in did this did a database look up did that and so on this is useful for debugging this is useful for troubleshooting problems that are that are happening right now and it is these three these three pieces are usually called the three pillars of observability or sometimes melt which is metrics events uh logs and uh traces um you'll hear a lot about that um these are these are these are the basic pieces of it and what all of this gives you when you combine all of these is something very important which is you can start to answer questions about your system with queries

rather than investigations one of the challenges of working in security especially on defense is it can be very hard to figure out what happened something happens you get evidence of something you know an event and then you have to go back and reconstruct and you often have to interview engineers go look at code it's a very time intensive process if you can query these things from the system directly that reduces a lot of toil that increases your ability to stay focused on the problem uh visualize what's going on and so on and questions are situational they're not known in advance you know they're not something you could have written an alert to anticipate
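To make the idea of answering a question with a query more concrete, here is a minimal Python sketch of following one request through its trace. It is an illustration only: the `Span` fields, the sample data, and `request_path` are hypothetical names, not the schema of any particular tracing product.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    """One unit of work done on behalf of a request (hypothetical schema)."""
    trace_id: str             # shared by every span belonging to one request
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    credential: str           # credential the service acted under


SPANS = [
    Span("t1", "a", None, "web", "user:joe"),
    Span("t1", "b", "a", "orders", "svc:web"),
    Span("t1", "c", "b", "db", "svc:orders"),
    Span("t2", "d", None, "web", "user:ann"),
]


def request_path(spans: List[Span], trace_id: str) -> List[Tuple[str, str]]:
    """Return (service, credential) pairs for one request, root first."""
    children = {}
    for span in spans:
        if span.trace_id == trace_id:
            children.setdefault(span.parent_id, []).append(span)

    path = []
    stack = list(reversed(children.get(None, [])))
    while stack:  # depth-first walk along parent -> child links
        span = stack.pop()
        path.append((span.service, span.credential))
        stack.extend(reversed(children.get(span.span_id, [])))
    return path


print(request_path(SPANS, "t1"))
# [('web', 'user:joe'), ('orders', 'svc:web'), ('db', 'svc:orders')]
```

The question "which services did this request flow through, and under which credentials" becomes a lookup keyed on the trace identifier rather than a reconstruction from separate logs.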

you couldn't have written a log analysis pipeline to surface the things they're questions you have right now often based on questions you just asked and got some answer to and so you can anticipate what types of questions you need to ask though so things like pivots and joins so i've got a specific suspicious session going on it's like okay joe has now logged in from singapore he's based in new york this might be a suspicious session i want to look at his credential i want to say okay what was that credential used for since time x or in this user session so you're pivoting on that you may want to pivot

on the geolocation and say okay is there other suspicious activity coming in from there all of these kinds of things are very sort of data analysis tool-centric they're things that are haven't been traditionally applied in a general generic fashion to production data to security data until recently you also want to do expansion and narrowing of scope if you try to look at all of your traffic or all of your activity in a very complex system especially when distributed across a lot of different services it can be very hard to see signals in the noise and so the ability to expand or narrow scope as you have more questions as you do these kinds of pivots and joins

um it's very important you want to say okay i want to focus just on just on this session just on this user credential just on this type of user credential just on this subsystem and see where all of the requests coming in and out of it are uh things like that and visualizing this helps a lot um we all have eyes you know we've probably all had times that ours have glazed over reading log messages scrolling through things trying to keep track of what is what we're looking at and so the ability to take these kinds of queries and visualize them is important um and one of the things that i've noticed in working with

teams that are responsible for both production and reliability and security is operational queries and security queries often cover the same data but they ask different questions about it and this is part of what i want you to take away which is some of these tools are aimed at different kinds of tasks than we have as people responsible for security but you can use those tools you know if you add some data to it you can then make it useful gain a lot of these benefits and let's see we've got a tool thing okay there is a proliferation of tooling for observability there are whole startups

that are focused on this there are a lot of open source projects there's stuff about code understanding if you go and search for observability on the web you'll see lots of flashy screen dumps of uh you know dashboards and graphs and charts and things and you know aimed at aimed at the executive that's going to write the check to go license the product more than the engineer that's going to use it sometimes but where tools can go beyond that and integrate integrate with your security event incident management system integrate with developer tools so that people don't have to do extra work to make stuff visible in the in the either visible into the observability framework or pull stuff out of it

that's the better incident management in particular is an area where unifying security incident response and production incident response can be very very fruitful that said in my experience this works best at small companies where everyone has a fair amount of situational awareness it does not work as well with huge sre organizations and huge huge security organizations coming together i have stories about that more appropriate for other venues but the kinds of defects that cause security problems and cause production problems and cause just bugs in business logic etc are all all very similar and so you can apply some of this tooling to all of those you just have to know the right questions to ask
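As a rough illustration of operational and security queries covering the same data while asking different questions, the Python sketch below runs both kinds of question over one event list, including the pivot on a credential from the earlier Singapore example. The event fields, the `HOME` table, and every function name are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical home locations; in practice this would come from an HR or
# identity system.
HOME = {"joe": "new york", "ann": "london"}

# One shared event stream (hypothetical fields).
EVENTS = [
    {"user": "joe", "credential": "cred-1", "geo": "new york",
     "service": "web", "latency_ms": 40},
    {"user": "joe", "credential": "cred-1", "geo": "singapore",
     "service": "billing", "latency_ms": 55},
    {"user": "joe", "credential": "cred-1", "geo": "singapore",
     "service": "export", "latency_ms": 900},
    {"user": "ann", "credential": "cred-2", "geo": "london",
     "service": "web", "latency_ms": 35},
]


def mean_latency_by_service(events):
    """Operational question: how is each service performing?"""
    samples = defaultdict(list)
    for event in events:
        samples[event["service"]].append(event["latency_ms"])
    return {svc: sum(vals) / len(vals) for svc, vals in samples.items()}


def suspicious_credentials(events, home):
    """Security question: which credentials were used away from home?"""
    return {e["credential"] for e in events if home.get(e["user"]) != e["geo"]}


def pivot_on_credential(events, credential, geo=None):
    """Pivot: what did this credential touch? Pass geo to narrow the scope."""
    return [
        e["service"]
        for e in events
        if e["credential"] == credential and (geo is None or e["geo"] == geo)
    ]


print(mean_latency_by_service(EVENTS)["web"])                 # 37.5
print(suspicious_credentials(EVENTS, HOME))                   # {'cred-1'}
print(pivot_on_credential(EVENTS, "cred-1"))                  # ['web', 'billing', 'export']
print(pivot_on_credential(EVENTS, "cred-1", geo="singapore")) # ['billing', 'export']
```

The optional `geo` argument stands in for the expansion and narrowing of scope: the same pivot can cover everything a credential did or only the part that looks out of place.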

so i've now done a bunch of sort of explaining why observability as an approach is a good thing um as with everything nothing is an unmixed blessing there are costs and benefits let's start with some benefits you know as i was just saying you can have better alignment between devops and secops you know if your uh developers and your security people and your production support are all using the same tools they're all looking at the same thing they can just send a link to someone saying hey can you look at this that can reduce a lot of confusion require many fewer meetings things like

that the other interesting thing is that given you know collecting this data making it available via apis being able to issue queries against it is useful for automation as well as people just like you can write an alert once you've got data coming in and being stored as a time series and you can do behavioral alerts you do the same thing with uh with an observability system and those can get much more abstract so you can narrow the scope you can be pretty specific on okay you know when this very unlikely set of things happens you know we get a latency spike on this trace back to whatever your incoming traffic was
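The separation between gathering metrics and deciding what to alert on can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Prometheus or any other product implements it: `MetricStore`, `AlertRule`, `evaluate`, and the metric name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


class MetricStore:
    """Toy in-memory time series: metric name -> list of (timestamp, value)."""

    def __init__(self) -> None:
        self.series: Dict[str, List[Tuple[float, float]]] = {}

    def record(self, metric: str, ts: float, value: float) -> None:
        # Instrumented code only ever calls this; it knows nothing about alerts.
        self.series.setdefault(metric, []).append((ts, value))

    def window(self, metric: str, now: float, window_s: float) -> List[float]:
        return [v for ts, v in self.series.get(metric, [])
                if now - window_s <= ts <= now]


@dataclass
class AlertRule:
    name: str
    metric: str
    window_s: float
    condition: Callable[[List[float]], bool]  # sees the values in the window


def evaluate(store: MetricStore, rules: List[AlertRule], now: float) -> List[str]:
    """Return the names of rules whose condition holds over their window."""
    firing = []
    for rule in rules:
        values = store.window(rule.metric, now, rule.window_s)
        if values and rule.condition(values):
            firing.append(rule.name)
    return firing


store = MetricStore()
for ts, latency_ms in [(0, 100), (10, 120), (20, 400), (30, 450)]:
    store.record("request_latency_ms", ts, latency_ms)

rules = [
    # Fixed threshold: fires only after things are already bad.
    AlertRule("latency_over_1s", "request_latency_ms", 30, lambda v: max(v) > 1000),
    # Behavioral rule: fires on the trend before the threshold is reached.
    AlertRule("latency_doubled", "request_latency_ms", 30, lambda v: v[-1] > 2 * v[0]),
]

print(evaluate(store, rules, now=30))  # ['latency_doubled']
```

Adding, tuning, or silencing a rule changes only the `rules` list; the code that calls `record` does not need a new release.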

um you can write better rules that are more understandable than you can going just on data this is also where a lot of organizations are trying to apply machine learning you know once you have this labeled data about what's going on inside your system it's like okay are there patterns humans are not going to see looking for things like high latency or error rates or unusual accesses and stuff is something humans are pretty good at but some of this is pretty subtle especially if you're in a security situation where you have an attacker who's trying not to be seen so being able to apply larger scale pattern discovery pattern

matching uh is fruitful although a lot of this is pretty new it's exploratory there are some products out there saying we're applying ml to observability this will solve your problems this will show you what you need to do that's not what they do they magnify what humans can do they free up human attention for the more significant signals more significant patterns bringing patterns to human attention to act sort of as the exception handler rather than crawling through analyzing all of that data many observability systems come with visualization tools that you can use when you need to explain

what's happened to somebody in a post-mortem or an executive briefing or things like that where someone who hasn't had their head stuck in this problem for the last week or the last several hours or whatever needs to know uh at least the basics of what happened what the consequences were what the impact was uh what was done to mitigate it similarly to that it's nice when you're sitting down with an auditor for your soc 2 or fedramp or whatever audit and they ask questions about what your system does it's really nice to say well we have an automated system for that here let me show you and issue queries show results uh visualize them in a way that

they can see and they can see that it's not you know someone manually collated a bunch of stuff into excel and generated a report from that all of these things are good let's see there are costs as well there's a flip side to everything if you have a system set up where you can do dynamic tracing for example where you can start to monitor state or even affect control flow without modifying a binary you know using something like dtrace or bpf things like that that's really useful for doing non-intrusive data gathering but of course it's also useful for non-intrusive data gathering by people or agents that you don't want to

be able to do that some of the capabilities that have been put in place for observability like kernel tracing uh things in standard libraries things like that also mean that there are new kinds of malware that are harder to detect that don't show up if you are doing static analysis of a binary and so access to some of these capabilities needs to be treated carefully it's the equivalent of root access in many cases and so how you deploy it within an organization how you track use of it uh how you detect use of it becomes sort of a meta problem uh you know who watches the watchers is always going to be with us as a

challenge another aspect that i've seen pop up two or three times now is it gives you an illusion of completeness now you've got this product it gives you a a single pane of glass thing it's a buzz phrase that i've been hearing it shows you all your operational state all of the things you need to worry about all the things you need to pay attention to one place you can feel as though okay now i know what's going on uh that's usually not true systems these days are complex enough that no single person is going to know everything that's going to go on and everything that is in fact significant to them so if you're looking at

adopting some of these tools that have been written by companies or open source make sure that they can answer the questions that you know you'll need to answer make sure it's not just hi this is sort of dashboards v2 lots of pretty graphics make sure that they can magnify your ability to answer questions do things that you need to do to find out what's going on inside the system help you build systems where that comes along there's an illusion of well you'll get observability for free if you adopt our sdk or if you adopt our platform you know every cloud platform these days has tracing facilities has

metrics gathering all of that make sure that you can use that make sure you're not just sort of taking it for granted of okay this is gathering stuff i'm all set uh and beware of buzzwords of the day uh a lot of things have gotten lumped into observability that are kind of traditional log analysis with a new label stuck on it one piece of advice i have on that is to try things out you know pull over an open source project deploy it on a sample project uh get a trial from a vendor who wants to sell you their latest observability system and kick the tires try it

experiment with it uh do a red team and see if it'll pick things up you know if they say well we have advanced ai for detecting anomalous access patterns say great let's set it up and we have a team that's going to go generate some anomalous access patterns let's see whether you can catch them sometimes that works and sometimes it doesn't uh i know of interesting cases of both uh but it's a way to validate the claims because there are a lot of claims there's a lot of hype around this right now backing up a minute um i mentioned sdks one of the

things that is kind of promising is that there's a lot more support these days in sdks in platforms for wiring up observability features wiring up metrics collection the days are passing finally where you have to go and you have to add okay i'm going to add this counter and i'm going to expose it here and i'm going to log it to this take advantage of those as you're building things but you can't really bolt this on uh a lot of this does not work for legacy systems you know you have what you have uh those systems have outputs some of them you can inspect in situ uh some of

the uh capabilities that vm providers for example are starting to create uh and some of the kernel tracing uh work in linux can be used to inspect internal state of legacy apps but uh you either luck out or you don't on whether that ends up giving you useful data uh let's see how we're doing on time we're doing fine so that's covering a bunch of security aspects i mentioned privacy at the beginning these tools can also be used for privacy use cases where you're more concerned with data and credentials than you are with attacks you may be looking for okay someone's compromised this data someone has stolen a credential passing it around something

like that the better you have marked your data and metadata in machine readable format the better these probes and observability tools can leverage that to line things up with other things going on in the system not all of the state is sort of run time metrics and a flip side of allowing developers to use these tools you know the positive side is it's a great debugging tool being able to see how a big distributed system works and how the logic that they are trying to implement uh is behaving if you're exposing data that way be very careful i mean some of this is basic data hygiene

but having the trace probes emit an identifier that's opaque still lets you trace it through the system without without exposing underlying data so going back to the essential piece of the approach which is frame frame the problem as how do i answer these questions uh when you know when i need to find out what's going on inside the system uh rather than firing off an investigation it does require you to build systems and processes that can answer questions and not just display info so logs and metrics can come across as being very read-only they're sort of reporting adding queries to that really turns that from sort of a static record of what happened into something that you can work with to

understand what's going on right now and with that take some questions if anybody has them um
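The earlier point about having trace probes emit an opaque identifier could look something like the Python sketch below. The talk does not say how the identifier should be derived; a keyed hash (HMAC-SHA256) is one common choice and is used here purely as an assumption. `opaque_id`, `emit_trace_event`, the event fields, and the 16-character truncation are all illustrative.

```python
import hashlib
import hmac


def opaque_id(secret_key: bytes, raw_identifier: str) -> str:
    """Derive a stable token that does not reveal the raw identifier.

    The same input always maps to the same token, so it can still be
    followed through the system and joined across services.
    """
    digest = hmac.new(secret_key, raw_identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability (assumption)


def emit_trace_event(secret_key: bytes, service: str, user_email: str) -> dict:
    """Build a trace event carrying only the opaque token, never the email."""
    return {"service": service, "subject": opaque_id(secret_key, user_email)}


# Hypothetical key; in practice it would live in a secrets manager and be
# unavailable to the people browsing trace data.
KEY = b"example-key-not-for-production"

web_event = emit_trace_event(KEY, "web", "joe@example.com")
billing_event = emit_trace_event(KEY, "billing", "joe@example.com")

# The two events can be correlated without exposing who the subject is.
print(web_event["subject"] == billing_event["subject"])  # True
```

A plain unkeyed hash would also be stable, but anyone could recompute it from a guessed email address; keeping a secret key out of reach of trace readers is what makes the token opaque to them.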

what do you see as the next steps as it relates to logging and monitoring alerting integration what do i see as the next steps um right now a lot of the observability products that are out there and projects in the open source land are extending on things that have been built already building on things like prometheus and grafana and time series metric storage and trying to bolt machine learning models on those trying to bolt more abstract models on those i think that's a fairly limited gain i think that the next step is going to be there's going to be something that's along the lines of going from logs to storing things in time series

so that we can capture behavior so that we can go find the next level of abstraction i don't think anybody's found that yet and so i'm very interested in where people get frustrated with the observability tools that we built now that okay this still doesn't answer my question i need a way to answer x and that's going to trigger the next step of okay let's let's pivot our our framing from gathering logs and events and traces what's the next thing past traces i don't know what that is yet but there's something and it's going to be born out of frustration all of this stuff is a product of somebody getting very frustrated they have a problem to solve

they can't find the information they're looking for they they want to be able to ask a computer and get an answer rather than tracking down a human and saying okay what what what does this mean um so i wish i had a clear view of what was next there i think that's an area of active experimentation across the industry so i'm looking forward to finding out what that

like is can we improve in like viewing the not investigations of the query like have you done anything around like how that would be in a process so the question was uh how do you we structure the query process so that we can get better answers out of the data there are a couple different ways to do that uh i know that uh there have been some applications of large database uh technology so dumping things into something like uh google bigquery or another sort of large database that can do very rapid queries around huge amounts of data one example of that that relates to observability uh is deps.dev which is an experiment being run by

google where every day we crawl all the major open source repositories and grab all of the dependency data and put that into a database you can query so that if you want to issue queries against okay have the dependencies for this package changed has such and such happened you can issue a query against that instead of having to go crawl it or inspect it yourself uh i think there's likely to be more of that as we identify what corpuses of data are useful across organizations to query the open source dependencies was kind of an obvious target for that it's a question many people have especially after log4j uh and people worrying about the next

log4j things like that so those technologies have some applicability uh they're a little manual now you know to use them right now you exit your observability environment you go start typing sql queries at something or wrap a ui around that um i think the success of those will start to push towards more purpose-built tools people have been logging stuff into sql and issuing queries about it forever but the time series database was definitely a step up from that i think we could see similar stuff for other kinds of signals i don't know if that answers your question okay yeah um [Laughter] [Music]
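The kind of question described in this answer, whether the dependencies of a package have changed, can be illustrated with a small self-contained Python sketch using the standard library `sqlite3` module. This is not the deps.dev or BigQuery schema: the `dependency_snapshots` table, its columns, and the package names are invented for the example.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table of daily dependency snapshots (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dependency_snapshots ("
        "snapshot_date TEXT, package TEXT, dependency TEXT, version TEXT)"
    )
    conn.executemany(
        "INSERT INTO dependency_snapshots VALUES (?, ?, ?, ?)",
        [
            ("2022-08-09", "shop-api", "logging-lib", "2.14.0"),
            ("2022-08-09", "shop-api", "http-lib", "1.3.0"),
            ("2022-08-10", "shop-api", "logging-lib", "2.17.1"),
            ("2022-08-10", "shop-api", "http-lib", "1.3.0"),
        ],
    )
    return conn


def changed_dependencies(conn, package, old_date, new_date):
    """Return (dependency, old_version, new_version) for versions that changed."""
    return conn.execute(
        """
        SELECT later.dependency, earlier.version, later.version
        FROM dependency_snapshots AS later
        JOIN dependency_snapshots AS earlier
          ON earlier.package = later.package
         AND earlier.dependency = later.dependency
        WHERE later.package = ?
          AND earlier.snapshot_date = ?
          AND later.snapshot_date = ?
          AND earlier.version <> later.version
        ORDER BY later.dependency
        """,
        (package, old_date, new_date),
    ).fetchall()


conn = build_demo_db()
print(changed_dependencies(conn, "shop-api", "2022-08-09", "2022-08-10"))
# [('logging-lib', '2.14.0', '2.17.1')]
```

The sketch only reports version changes; dependencies added or removed between the two snapshots would need an outer join or a second query.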

okay so trying to repeat the question for the stream uh question is i've talked a bunch about having one unified source of truth for a bunch of teams to be able to query but it seems like we have more and more of those as time goes on more and more sources of data more and more things we have to uh we have to consult why is that and i think the answer to that is that we don't have we don't have that good unified source of truth and we may never you know teams solve problems and build tools around the problems they're facing sometimes having one grand unified platform is not does not actually make life easier those

of us who work for big enterprises doing it doing it the one true way can often be harder than just rolling something yourself i do think that as the analysis tools get better and they allow you to to answer questions about the data you are gathering um that provides some incentives for federation for other kinds of taking all those multiple sources not necessarily all dumping them into the same store but being able to query across them and correlate things across them so that if you have something that's generating timelines and you have something else that's measuring production traffic being able to do that kind of join i think is going to be uh be the way to handle that kind of

scaling multiple sources so big joins across multiple data sources is its own research topic indeed in database land but i think that that's likely to be the most fruitful approach to that i don't think we're ever going to get to one single source of truth for data just like i don't think that uh you know understanding everything going on in a system is ever going to be that one seamless pane of glass but where we can reduce fragmentation the better and where we can move information from one context to another without having a human having to copy and paste it or type in

type in a query get an id right things like that that kind of interoperability i think is going to be what's going to make the biggest difference
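A very small Python sketch of the join described in this answer, correlating a timeline from one source with production traffic from another without merging them into one store, might look like this. Both sample data sets and the `correlate` function are hypothetical, and the sources are assumed to share a service name and a comparable clock.

```python
# Source 1: a deployment timeline (hypothetical data, timestamps in seconds).
DEPLOYS = [
    {"ts": 100, "service": "billing", "change": "v42"},
    {"ts": 500, "service": "web", "change": "v7"},
]

# Source 2: error spikes measured from production traffic (hypothetical data).
ERROR_SPIKES = [
    {"ts": 130, "service": "billing", "errors_per_min": 80},
    {"ts": 900, "service": "web", "errors_per_min": 12},
]


def correlate(spikes, deploys, max_lag_s):
    """Pair each spike with deploys to the same service shortly before it.

    Returns (service, change, lag_seconds) tuples. Each source stays in its
    own structure; only the join key and timestamps are shared.
    """
    matches = []
    for spike in spikes:
        for deploy in deploys:
            lag = spike["ts"] - deploy["ts"]
            if deploy["service"] == spike["service"] and 0 <= lag <= max_lag_s:
                matches.append((spike["service"], deploy["change"], lag))
    return matches


print(correlate(ERROR_SPIKES, DEPLOYS, max_lag_s=60))
# [('billing', 'v42', 30)]
```

The nested loop is fine for a sketch; at real scale the hard part is exactly the cross-source join mentioned above, which is why it is described as a research topic in its own right.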

going once all right one more question

so uh to summarize the question uh what happens when something that's gathering data so you know a traceability probe an observability probe of some sort has been compromised so something is feeding bad data into your big central store that's the same problem that uh we face all the time how do we detect that something is anomalous that does touch on the who watches the watchers question i had there are levels to this that we have to think about you know how do we know that the data we're asking questions over is accurate how do we audit end points how do we understand if an endpoint is under an attacker's control

or someone is misusing the data that we're gathering i don't think there's anything special about that when it comes to observability you know it's the same problem we have with endpoint management in general with any kind of remote sensor you can look at the data under your and and the traffic under your control you can look at what that remote sensor is telling you and if they don't match that's something to flag you know that and that is something you want to be able to see so you know being able to detect that on your end of it is is a good place to put some kind of sensor some kind of probe and feed

into your observability system like are we suddenly seeing a bunch of flaky phone handsets or unusual activity or lack of usual activity from laptops or remote systems that are outside of our security boundary things like that so being able to answer those kinds of questions is itself an observability use case uh but it does highlight the importance of picking what it is you're gonna monitor you know you can't necessarily deduce all of these things from data you think of in advance and so when something comes up learn from every incident okay we should be reporting this we

should be gathering this data in addition to this sensor we've already got deployed so that we can detect when these kinds of inconsistencies happen and you know like any kind of automation it's easier to automate cases that you've already experienced rather than to anticipate well what could go wrong in the future that is a lot of all of our jobs that's something no one has figured out how to automate
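The cross-check described in this answer, comparing what a remote sensor reports against what you can observe from your own side and flagging a mismatch, can be sketched as follows in Python. The function name, the request-count inputs, and the 20 percent tolerance are assumptions chosen for illustration.

```python
def find_mismatches(reported, observed, tolerance=0.2):
    """Flag devices whose self-reported activity disagrees with ours.

    reported: device -> request count claimed by the remote sensor
    observed: device -> request count seen on infrastructure we control
    tolerance: allowed relative difference before a device is flagged
    """
    flagged = {}
    for device in sorted(set(reported) | set(observed)):
        claimed = reported.get(device, 0)
        seen = observed.get(device, 0)
        baseline = max(claimed, seen)
        if baseline == 0:
            continue  # nothing claimed and nothing seen
        if abs(claimed - seen) / baseline > tolerance:
            flagged[device] = (claimed, seen)
    return flagged


reported = {"laptop-1": 100, "laptop-2": 10, "phone-3": 50}
observed = {"laptop-1": 95, "laptop-2": 400, "phone-3": 50, "laptop-4": 30}

print(find_mismatches(reported, observed))
# {'laptop-2': (10, 400), 'laptop-4': (0, 30)}
```

Here `laptop-2` reports far less traffic than was actually seen, and `laptop-4` generated traffic but reported nothing at all; either could be a broken sensor or a compromised one, which is why the result is something to flag for a person rather than act on automatically.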

all right well i took less time than i expected sorry if i was talking fast uh any last-minute questions otherwise give everybody a bit of a break
