
Modern Application Debugging - An Introduction to OpenTelemetry

BSides Buffalo · 2025 · 48:16 · Published 2025-06
Speakers: Joshua Lee (Developer Advocate, Altinity)
Tags
Category: Technical
Difficulty: Intro
Style: Talk
About this talk
In this talk, Joshua will share his insights and experiences with OpenTelemetry, an open-source project that offers protocols, APIs, and SDKs for collecting metrics, traces, and logs from applications and services. He will cover the comprehensive toolkit provided by the OpenTelemetry community, including language SDKs, the Collector, and the OTLP formats for metrics, traces, and logs. He will demonstrate how to instrument and monitor a microservices application running on a Kubernetes cluster, utilizing the full potential of OpenTelemetry. Attendees will learn how to use powerful open-source tools like Jaeger and Prometheus to effectively analyze telemetry signals from their applications. By the end of this session, attendees will have a solid understanding of how to implement OpenTelemetry in their projects, enhancing their debugging and observability practices. Join us as we delve into the world of OpenTelemetry, unlocking the capabilities of this powerful technology for your development needs.

About The Speaker

Joshua Lee, Developer Advocate, Altinity

Joshua is a seasoned software developer with over a decade of experience, specializing in a broad range of topics including operations, observability, agile methodologies, and accessibility. His passion for technology is matched by his enthusiasm for sharing knowledge through public speaking. Currently, Joshua serves as a Developer Advocate for Altinity, where he creates educational content on ClickHouse and OpenTelemetry. Additionally, he is an active contributor to the OpenTelemetry project, helping to advance the field of observability in software development.
Transcript [en]

[Music] All right, should I go ahead and get started? Hi everybody, and welcome. We're going to talk about... let's get rid of that whole thing.

I guess we're just going to keep it, because I can't get my mouse over there. So that's funny. No... oh, now I have it. Okay.

All right. Technology is hard, right? If it was easy, we wouldn't be doing it. Awesome, I'll sit down. So hi everyone, welcome, and thanks for coming. We're going to talk about OpenTelemetry. My name is Josh, and I love to talk about open source. I work at a company called Altinity. We do open source hosting and support for ClickHouse, which is a massively scalable analytics database. It's something you might use to, for example, store distributed traces, which is one of the things we're going to talk about more in this talk. But beyond that, I'm not really going to say much about databases.

We're going to talk about the telemetry signals, right, the stuff that we're putting in the database. So the question I really want you to think about while I'm up here talking for the next hour is this: how do we debug our applications? Like this, right? So that was a pretty short talk, thank you all for coming. No. If we're in this room, we all laugh at that, because we know that this is actually the easy part. Once you get here, your job is almost done. You've done a lot of debugging and investigative work to get to the point where you are now googling.

Or, since it's 2025, probably asking ChatGPT: what does this specific error message mean? Because none of us have all of that crap in our heads, nor are we expected to. So this is the easy part. How do we get to this point, right? How do we get our mean time to Google low? That's what we're going to be talking about. And the buzzword for this, of course, is observability. Every vendor, every marketing team has their own definition of observability, and they're all very complex and flowery, but to me the key word in all of these definitions is understanding. There's visibility and understanding: can you see what's happening, and, seeing what's happening, can you make sense of it?

That's really the ultimate goal, and everything else is just a means to achieve it. And also, right, it's about the outputs the system already provides, meaning we're not necessarily going to modify the system in order to get new outputs that help us diagnose whatever the problem is. We should already have enough information coming out of the system that we can understand it. That's not always true; sometimes we are going to have to modify the outputs to test a hypothesis or something like that. But that's the idea. So, to step back from this at a very high level and into another field entirely: think about an airplane cockpit.

An airplane is kind of like a modern microservices architecture, right? You've got a lot of disparate systems that are interconnected in various ways, maybe more interconnected than we want to believe; it's not a neat little hexagonal architecture. So we have all these different systems that are behaving in various ways, affecting one another and creating pressures on one another. And we have all of these gauges and dials to see what's happening with those systems, and then all these buttons and levers to put inputs back into those systems, to make them behave in the optimal way and get the airplane where we want to go. This is a fairly old cockpit, though; I think it's a DC-something from the '70s, or maybe even older.

So if we take a look at a modern cockpit, here's a 787. This is a lovely airplane. We'll notice that things have changed. Well, first of all, I will mention that there are still a lot of buttons, and this is for a really good reason: Boeing tried to make a cockpit that was entirely touchscreens, and you can't really use a touchscreen in turbulence, so that was a bad idea. Pilots really need buttons. I still prefer physical buttons in my car. Buttons are good.

But the information coming to the pilots: all of those gauges are gone, and they've been replaced with these four multi-purpose screens. And the thing these screens do is provide context. No longer does the pilot have to stare at a whole wall of gauges and hope they're looking at the right one. You can bring the context for whatever goal or task you're trying to achieve in that moment onto the screen right in front of you. This, to me, is the perfect encapsulation of what we're trying to do with observability. We're trying to take this wealth of data, because we capture pretty much everything, ship it off somewhere, and pay a lot of money to store it on a hard drive.

We want to make use of it, and the way we make use of it is by bringing it up, contextually, in the right moment. So how do we do that? That sounds great, but how do we actually do it? Is it with tags and filters and full-text search and Splunk and all of these things? Well, maybe, but I think there are some other ways to do it, too. Now, if we step back into the observability market space, there's a conversation going on between Charity Majors from Honeycomb and some other members of the community about what they're calling observability 2.0, or maybe 3.0, I don't know. They're trying to put numbers on it.

But the point that I really like from all of this discussion is this one. Raise your hand if you've heard of metrics, logs, and traces as the three pillars. Okay, so some background: going back in monitoring and observability, we had metrics, logs, and traces, and you can find a million blog posts from vendors online, trying to get their SEO up, talking about the three pillars of observability: metrics, logs, and traces. And yes, they're right, those are the three fundamental data types that most observability systems deal with. If you're setting up a stack for this, you're going to have Loki for your logs, Prometheus for your metrics, and then maybe Tempo for your traces. So these really are the three pillars.

But if we think about what a distributed trace is, it's actually kind of just a really, really good structured log. And then we think, okay, well, we just have really good structured logs and then things that we can count. That's it; those are the two things. So let's look at a typical log. I've de-structured this one, but in your actual systems this should be JSON or something structured; I've laid it out this way just so we can read it quickly. Here's a very typical request log.

We've all seen this multiple times a day throughout our careers, right? It actually has a lot of information in it, if we start to aggregate these and count them. Actually, it was missing something really important: it was missing duration. So we'll add duration, and now we start to have performance metrics, latency, things like that. We can actually build this whole dashboard just from request log messages that look like the previous slide: the number of requests coming into the system, the percentage of them that had an error, and the average latency. So that's your RED metrics: requests, errors, duration.
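To make that concrete, here is a hedged sketch of what such a structured request log line might look like once duration is added; the field names are illustrative, not taken from the slide:

```json
{
  "timestamp": "2025-06-14T15:02:11Z",
  "level": "info",
  "http.method": "GET",
  "http.route": "/api/products/{id}",
  "http.status_code": 500,
  "duration_ms": 742,
  "client.ip": "10.0.3.17"
}
```

Count lines like this and you get the request rate, count the ones with 5xx status codes and you get the error rate, and average duration_ms and you get latency: all three RED metrics from one log format.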

Or the golden signals, as Google calls them, if you've read the Google SRE book. So that's it, we've got our golden signals. The one other thing Google would add to this is the saturation of the resources on the host, like CPU saturation and memory saturation, and that's where metrics come in. So we can build this from logs: these are all metrics derived from counting things in logs, and then just the CPU utilization and memory utilization come in as actual metrics, and we have our complete set of Google's golden signals. This is cool if you have it for your public endpoints, but you also want it for every single service in your system.

All of a sudden, you start to understand the health of your entire system just by having this view. Okay, that's pretty good. We're starting to get the really big picture and to identify errors: is there a problem that we need to investigate further? But there's still some missing information. A big thing that's been happening in the last few years is that we've focused on being more user-centric in our ways of thinking. We've evolved in all of our various fields, whether it's security or anything else, to think about things from the end user's point of view rather than our own.

So back in 2016, I was working at a startup, and we had this brilliant idea. We were the only people ever on the face of the earth who did this, right? We put the request ID into our request logs, and not just on the API: we would then propagate this request ID to all of the log messages for every downstream service in our poorly designed microservices architecture. This was really cool, because it meant that if something went wrong, we could really quickly, with one simple query, pull all of the log messages for everything that might have been a factor. But we're still missing some information: we're missing the chain of events. This lets us get a bag of logs that are relevant, but it doesn't capture any relationships between them.

So we can make it even better, and all we really have to do is add one thing: the parent ID. So here we've got trace and span, right? The trace is the overall request that was kicked off by a user hitting something on the public API, and then you'd typically have a span for each downstream service that handles any portion of that request. You might also have more internal spans for interesting functions or things that are happening. But really, all it is is a tree: a trace, and then a branching tree of spans coming off of it.
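As an illustrative sketch (field names made up, IDs borrowed from the W3C spec's examples), the same log lines with trace, span, and parent span IDs would form the tree described here:

```json
{"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7", "parent_span_id": null,               "service": "api-gateway",   "duration_ms": 742}
{"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "b7ad6b7169203331", "parent_span_id": "00f067aa0ba902b7", "service": "cart-service",  "duration_ms": 310}
{"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "c3e1f0a9d4b27f55", "parent_span_id": "00f067aa0ba902b7", "service": "cart-database", "duration_ms": 121}
```

Grouping by trace_id gives you the bag of logs from before; following parent_span_id gives you the chain of events.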

And now, when traces go the wrong way, we can do something with them: we can visualize them, just like the waterfall graph in the Chrome developer tools that we get for our front end, except now we're doing it for our entire back end. All of our databases, all of our caches, all of that stuff. So this is what it looks like to view a single distributed trace, and I think the value of this is immediately apparent: it's just a great way to visualize a single request. But I also think there's a lot more value here than what is immediately apparent, because if you start to aggregate this information, then we really can start to understand how our system works based on real traffic, not based on some outdated documentation or bugging a senior architect on Slack.

All right. So, I think traces are awesome, but it is important to keep in mind that observability is not any one signal. It's not distributed tracing, it's not any of these. It's the goal, right? It's the goal of understanding, and each one of these signals serves some purpose in that. Metrics, whether you aggregate them at the source, which is how most tools like Prometheus do it now, or count them after the fact the way we did with that dashboard we saw, are really good at answering one question: is there something wrong? Is there smoke?

Because they're numbers, right? We can just ask: do the numbers fall within the expected range or not? It's a binary question. Traces: their fundamental characteristic is that they are scoped to a single user request. That gives us a user-centric point of view to go through our entire distributed system and identify where the actual problem is. And then, at the end of the day, logs still really, really matter, because they're verbose. That's where we can put all of the information about what the actual problem was, which we're then going to go type into Google or ChatGPT so that we can solve it.

So we need all of them, but distributed tracing, I think, is kind of the killer app. I think it's really key to why we're now calling this observability instead of monitoring. It lets us understand the complete request flow through our entire system. It lets us create a real-time map of the system topology: what depends on what, what calls what, what talks to what. We can count things in traces; we can derive things from the richness of the trace metadata. And traces, unlike metrics, are not really limited in cardinality: most trace data stores are not limited the way that, say, Prometheus is. By cardinality I mean you only get a thousand or so unique values across a dimension in Prometheus before things start to slow down.

Thousands is the cardinality we're dealing with in these metric systems; in trace systems we can have cardinalities reaching into the millions, and the system doesn't care, it will still handle it if it's designed for that. So we can put whatever kind of metadata we want on our traces, and then maybe it will be useful for some analysis after the fact. We don't really need to think about this too much in advance. We don't need to decide, while we're building the system, which questions we're going to ask about it. We can just put all the data in there and then ask the questions later.

And logs: we just talked about how useful logs are. They're a lot more useful when they're really easy to find, because they're attached to a trace. Maybe it just makes your logging scheme better when you have tracing. And this is why we have OpenTelemetry. Before OpenTelemetry, it was really easy to do logging in a sort of standardized way, because a log is just a text string, or maybe a JSON document. There's nothing that needs to be standardized about that to make it interoperable across all of your systems. Same thing with metrics: a Prometheus metric is basically just a string too, a bunch of labels and then a value at the end.

So those things already had good-enough standards that were universally applicable to our systems. But distributed tracing, and I skipped this when I was telling the story earlier about putting that request ID into all of the downstream services, is actually really hard to do. It's not an easy task. It was hard for us working as a small team of about five developers with a single language, a very homogeneous ecosystem that we had complete control over, and it was still a challenge. Then you take this idea to an organization with 4,000 developers, every technology under the sun, and a bunch of third-party technologies that you can't modify and don't have complete control over, and this becomes a really, really big challenge.

And that's where the observability vendors stepped up. So you have Datadog, Dynatrace, New Relic, all of these vendors that give you tools to do this. But those tools require you to put their proprietary code inside your code, inside your code artifacts. And that, to me, is really why OpenTelemetry exists: we needed code that spoke a universal standard across every programming language and every tech stack, and it needed to be vendor neutral. Because if I'm making Phoenix for Elixir or Express for Node.js, I'm not going to go embed Datadog libraries in my open source project, but I might embed open source libraries in my open source project.

And then, all of a sudden, anybody using that project can just turn on their telemetry and have it in this open, standard format. That's why OpenTelemetry needed to exist. So what actually is it? Let's get into that a little bit. Most of you are going to interact with it through the SDKs, so that's where I'm going to start. These are the parts that do go in your code, so they replace those vendor libraries that we didn't want. We have SDKs for 11 different programming languages, in various phases of stability.

But this isn't really changing a lot; some of the things that are marked as development or beta are marked that way out of an abundance of caution, and they're perfectly fine to use. Things have definitely stabilized a lot within the last year. So these are all the different programming languages you can use it with. But that's not the whole story. OpenTelemetry is a lot more than just the SDKs. It is the specifications, which are really important; we just talked about why we need this open, common, universal standard. It's the actual libraries and tools, like the SDKs we just talked about. And then it's the community that creates it.

So, the specifications: W3C Trace Context. This is the most important one. This is how the trace ID and the span ID get propagated from service to service. It's a very simple specification: basically, you put the trace ID and the span ID into your HTTP headers, and each downstream service takes them and says, okay, I'm going to create spans that have that span ID as their parent, and then I will put my own span ID into any outgoing HTTP calls that I make to further downstream services.
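Concretely, W3C Trace Context is carried in a traceparent header. A hedged example, using the sample IDs from the specification:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

The four dash-separated fields are the version, the 16-byte trace ID, the 8-byte span ID of the caller, and the trace flags (01 meaning sampled). Each downstream service keeps the trace ID, records the incoming span ID as its parent, and substitutes its own span ID before making further calls.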

So each service emits its own telemetry based on the context provided by the service that called it, and that context comes in via W3C Trace Context. Then there are the language APIs. This is a specification for the functions you can actually call from within a specific language to create a trace, and the libraries that provide these APIs come with a no-op implementation. This is how you can include them in a library that other people will use: it doesn't actually have to do anything, it can just provide the no-op functions and work, and then when the user installs their own configuration, it replaces the no-op implementation with the one they want.

And then we have OTLP. This is the wire protocol that is actually used to transmit telemetry. It can go over HTTP or gRPC, and there are protobufs for it. This is my favorite part of OpenTelemetry, because it's what makes it possible for anything to talk to anything else. Say you don't like the SDKs; they're kind of cumbersome, and I'm going to talk about them in a bit. But say you don't like them: make your own. As long as it talks OTLP, it can talk to anything else in the OpenTelemetry ecosystem. You can replace any piece, and OTLP just becomes the lingua franca between all the pieces. It's like functional programming: if you design your data structure well, you can have a million different functions operating on that same data structure.

And then there are the semantic conventions. Naming things is one of the hardest things that we do, so the semantic conventions give us a universal way to name things. Now we don't have to worry about one team calling it HTTP reqs and another team calling it requests, or one using an underscore and the other a dash; that kind of thing throws off all of these tag- and label-based systems that we have. So OpenTelemetry gives us semantic conventions.
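A few examples of what the conventions standardize, taken from the HTTP semantic conventions as I understand them (the exact keys evolve over time, so treat these as illustrative):

```
http.request.method       = "GET"
http.response.status_code = 500
url.path                  = "/api/products/42"
server.address            = "cart-service.internal"
```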

The message is: look, just follow the semantic conventions. They're very well defined for most of the things we care about, like databases, message queues, and requests. If they're not well defined, they're open standards that you can contribute to, so submit a pull request and say, hey, I wanted to add this metadata about my app and I didn't find the correct keys for it, so I'm proposing some to the community. Naming things consistently makes all of this a lot easier. Then there's the language SDK. This is what actually provides the implementation inside your services: you install the language SDK and provide it with some plugins that basically tell it what you want to do with the telemetry generated by your application.

Then the instrumentation libraries. These basically map from the libraries you are using to the language APIs. Earlier I said I was using Express; say Express didn't come with built-in instrumentation (it doesn't). You can install the Express auto-instrumentation library, and it knows how to observe Express running in the same process and create telemetry for it by calling the language APIs, which call the language SDK, which gets the telemetry out of your service. There will be a visual for this, don't worry. Then the Collector. This is another really big piece of the project. The OpenTelemetry Collector is a binary written in Go. It's highly extensible; we're going to talk about it a lot more, but basically it replaces your vendor's node agent, and it also acts as a proxy for your storage back end, among other things. That's why it's called the Collector.

One of the things that would be nice about the OpenTelemetry project is if it didn't take generic words like "collector" and turn them into The Collector™, because then it gets really confusing when we're talking about this stuff. But that's just the fun of building something new. And then there's the Kubernetes Operator. This is a convenient tool that makes it really, really easy to deploy the Collector in a Kubernetes cluster, and it can also do auto-instrumentation.

So say you have services in Python, Java, JavaScript, Ruby, and, kind of, PHP: it can auto-instrument your containers. You just deploy a pod to Kubernetes that doesn't have any instrumentation in it at all, but it matches some metadata that you gave to the Operator, and the Operator will mutate that pod and add instrumentation. There are a bunch of different mechanisms for how that happens; there's a whole other talk about it on YouTube if you want to go check it out. And then, community: OpenTelemetry is the second largest project under the CNCF, Kubernetes being the largest, of course. I think it kind of makes sense that OpenTelemetry is such a large project, and by large I mean by number of contributors, to be clear.

Because it's so pervasive, right? It touches so many aspects of what all of us do, so it makes sense that all of us would be feeding back into it and contributing. There are meetups; thanks for coming, this is awesome. And on Slack and on GitHub there's the OpenTelemetry end-user working group, and then all of the other working groups and special interest groups. If you Google "OpenTelemetry community calendar", you'll find a list of when all of them meet. You can reach out to the organizers and maintainers on Slack and say, hey, I have this issue, and they'll tell you which meeting to come to and bring it up.

So this is a great way to get involved. I especially like the interviews that the end-user working group does with actual companies using OpenTelemetry at scale; some of them are really solid. Okay, so let's talk about the stack a little bit. This is kind of the same stack we would have for any observability solution, open source or not; a vendor one would look exactly like this. We have our service, we have some SDKs that go in our service and send data to a thing running on the node, probably a vendor agent or the OpenTelemetry Collector, and then all of that stuff gets shipped off somewhere to be centrally stored and analyzed.

Very importantly, the OpenTelemetry project itself does not include anything for this yellow box, the central storage and analysis part. There are tons of open source options, and I'm going to talk about some of them, but none of them are part of the OpenTelemetry project or officially sanctioned. The OpenTelemetry project is largely contributed to by observability vendors who would very much like to sell you something for this part, and with good reason. But if we make it a little more specific to OpenTelemetry, here's what it starts to look like.

Can everybody read this? I am going to zoom in on the individual sections. Is it horrible, or should I just wait till I get to the sections? All right, we're going to wait till we get to the sections. So we've got some stuff in our code, some stuff on our node, and some stuff somewhere else; I'll show what "somewhere else" is shortly. The API and SDK go in our code. They're extensible: they have processors and exporters. You don't need to think about this too much; there's a batch processor that you will probably use to save network, and the default exporter is the OTLP one, which you should absolutely be using. Do not use any vendor exporters; if a vendor tells you to use their exporter, that advice was wrong three years ago already.

Then the instrumentation libraries, which we talked about: those are the things that translate from your libraries to OpenTelemetry. And then manual instrumentation. This is the part where you actually go in and tell OpenTelemetry about the things the instrumentation libraries didn't know about, like orders, regions, business units: anything you care about that isn't built into the libraries you're using. Then we have the OpenTelemetry Collector, running on our node, asterisk.

The asterisk is because, in a Kubernetes environment, you're going to run it as a DaemonSet, and you might also run a Deployment that acts as a kind of gateway or proxy, gathering telemetry from the DaemonSet and forwarding it, because a lot of back ends don't like having a thousand open connections. Then receivers, processors, and exporters. The Collector is even more extensible than the SDKs; it gives us these three types of plugins (actually four, and we'll talk about the fourth in a moment). But basically we can configure the Collector with all these various extensions.

The Collector is also what gathers our host metrics and logs. If there's anything that isn't specific to a particular process with the SDK installed, the Collector will have receivers capable of gathering it. There's the filelog receiver, which is used on a Kubernetes node to grab the logs directly from the node's log files. And then there's automatic instrumentation. We talked about the Kubernetes Operator and how that works; you can also do automatic instrumentation in some of those same languages by setting environment variables, or in some cases by installing a single package that takes care of the rest, if you do have the ability to modify the package or the binary.
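For Node.js, for example, the environment-variable route looks roughly like this; a hedged sketch with a hypothetical service name, entry file, and the default local OTLP endpoint, so check the docs for your language before relying on the details:

```
# zero-code instrumentation of an existing Node.js service (sketch)
npm install @opentelemetry/api @opentelemetry/auto-instrumentations-node

export OTEL_SERVICE_NAME=checkout-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
node --require @opentelemetry/auto-instrumentations-node/register app.js
```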

And then we've got to put it all somewhere else so we can actually do useful things with it; otherwise we're just paying for expensive storage. All right, is this easier to read? Hopefully. Yeah. All right, great. So, in your code: this is what I was just talking about, but just to give you a clear mental model of how it all fits together. Our service is the whole purple box, and then this other little purple box would be some manual OpenTelemetry API call that we make, like adding metadata to spans, or creating internal spans for particularly expensive functions, things like that.

Then all the other pieces come from the OpenTelemetry project; those are just plugins or extensions that we install, and I'll show what that looks like in Node.js in a moment. So yeah, that's how it fits together in each one of your services, and then it gets shipped off to the OpenTelemetry Collector. The Collector, as we mentioned, has receivers, processors, and exporters. Receivers do what it says on the tin: they receive telemetry data. They can do that by exposing a gRPC endpoint, or in a pull-based way, like a Prometheus scraper; in fact, one of the receivers is basically a Prometheus scraper.

So you can replace your Prometheus agents with the OpenTelemetry Prometheus receiver, and then any of your existing Prometheus exporters can be read by it. I would argue that you shouldn't be pre-aggregating metrics, but this can give you an on-ramp to keep using the things you already have in place. Processors: there are some really interesting ones. There's the OpenTelemetry Transformation Language, which has a processor that lets you do things like transform some of the metadata, and there's a processor that lets you simply count things. And then there are the exporters.

There are some arrows missing on this diagram; there should be a big arrow pointing in here and a big arrow pointing out here, because the connectors are not the most important piece. The exporters send the data to your back end, or back ends: most organizations have more than one observability tool in use. Maybe you merged with another team, and they've been using their favorite tool while you've been using yours. So one of the things OpenTelemetry enables, which was not possible before, is that because we have all this vendor-neutral stuff on the source side, we can just multicast the telemetry to multiple destinations.

And since all the vendors have bought in on this, and they all speak OpenTelemetry to varying degrees, you can collect the telemetry data once and then send it to all the different back ends that care about it. Maybe you want to do this all the time, because different teams like different tools, or maybe just for a little while, to run a bake-off and compare and contrast different back ends. This is really cool, and it wasn't possible with vendor-based instrumentation: you'd basically have to stand up two different copies of the application to send data to two different back ends.

Because, and this is something I'll talk about in a moment, you do not want to instrument an application with two different stacks: don't use OpenTelemetry plus a vendor agent, or two different vendors; that's a bad thing. Oh, and then connectors. Connectors are really cool, though you mostly won't use them. The one really cool one is the span metrics connector. What connectors do is connect the end of one telemetry pipeline to the beginning of another, and by the way, each of these pipelines carries a single signal, so it's for metrics, or for traces, or for logs. The span metrics connector attaches to the exporter end of your span pipeline and feeds into the receiver at the beginning of your metrics pipeline, and it basically just produces those RED metrics: it counts the requests, errors, and durations it sees in your spans.

That way you don't have to spend compute aggregating that in your services, and you don't have to decide in advance which dimensions you care about; it can all be done after the fact in your agent, which is really cool. You can add other things, like order totals; there's tons of cool stuff you can do with connectors. But basically, they're for translating data coming out of one signal's pipeline into inputs for another.
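To show how that wiring looks, here is a hedged sketch of a Collector config with the span metrics connector sitting between a traces pipeline and a metrics pipeline (component names as found in the collector-contrib distribution; the endpoints are placeholders):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp:
    endpoint: traces-backend.example.com:4317            # placeholder
  prometheusremotewrite:
    endpoint: http://metrics-backend.example.com/api/v1/push   # placeholder

connectors:
  spanmetrics: {}   # derives request / error / duration metrics from the spans it sees

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp, spanmetrics]    # spans go to the trace backend and into the connector
    metrics:
      receivers: [spanmetrics]          # the connector acts as a receiver for the metrics pipeline
      exporters: [prometheusremotewrite]
```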

If you want to read more about the Collector, there's a QR code here for a blog post that I wrote about it, which goes into a little more detail about what we're talking about right now. But basically, its uses are to gather and forward the telemetry to the back end; to apply filtering, sampling, and batching rules; to translate between any compatible sources and destinations, which is that multicast use case I was talking about; and to gather node- and cluster-level telemetry. It's kind of like a Swiss Army knife. Because it has all these various receivers (there are more than 90) and all these exporters, you can use it to translate between pretty much anything.

Anything on the receiver side becomes a generic OTLP trace, metric, or log in the middle, and then you can export it to any of the exporters that support that signal. These are just some of the ones that I use, but there are tons of community-supported ones, and it's also possible to make your own if you like messing around with Go. So, how do you actually use this thing? It's pretty straightforward, actually. It seems scary, but it's not. You instrument your application; we showed that you need those three pieces. Then you add some of your own context. Then you configure the Collector to process the data and forward it to the back end.

So, instrumenting. This is from the OpenTelemetry demo project, in Node.js. These are all the includes; this file is actually called instrumentation.js, and there are about 12 more lines that are relevant. Those 12 lines basically just set up the Node.js auto-instrumentation library that we included here on lines two and three. That's it. That's the whole instrumentation for the entire API layer of this Node.js application, which provides a dozen or so endpoints. So that's cool. I wish it were two includes instead of five.
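The slide itself isn't reproduced here, but a minimal sketch of what an instrumentation.js built on the Node SDK and the auto-instrumentations package typically looks like (an approximation, not the demo's exact file):

```js
// instrumentation.js: a sketch of Node.js setup with auto-instrumentation
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter(),            // defaults to a local Collector endpoint
  instrumentations: [getNodeAutoInstrumentations()], // Express, http, gRPC, and friends
});

sdk.start();
```

You would load it before the application starts, for example with node --require ./instrumentation.js server.js.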

But that just goes to show you: OpenTelemetry is very much a bag of Lego blocks, and no one gives you the instructions. You're going to have to put them together yourself, in the way that works best for you. It's very flexible and very powerful; sometimes you might step on a piece. Then we need to add additional manual instrumentation and context. This is really straightforward. Basically, you just need to get that context variable; depending on the programming language you're working with, it might flow implicitly or there might be a function you call to get it. But you get that context, it holds the currently active span, and then you can add metadata to that span or create sub-spans.
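In Node.js, that pattern looks roughly like this, using the upstream @opentelemetry/api package; the order object, attribute names, and calculateShipping function are made up for illustration:

```js
const { trace } = require('@opentelemetry/api');

// Hypothetical application objects, purely for illustration.
const order = { id: 'ord-123', total: 42.5 };
const calculateShipping = (o) => (o.total > 40 ? 0 : 5);

// Add business metadata to whatever span is currently active
// (undefined if no SDK is registered, hence the optional chaining).
const active = trace.getActiveSpan();
active?.setAttribute('app.order.id', order.id);
active?.setAttribute('app.order.total', order.total);

// Or wrap a particularly expensive function in its own child span.
const tracer = trace.getTracer('checkout-service');
tracer.startActiveSpan('calculate-shipping', (span) => {
  try {
    calculateShipping(order);
  } finally {
    span.end();
  }
});
```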

You can also create metrics, other types of instruments like gauges, and you can make logs, but distributed tracing is the best: just put everything on the span as an attribute and make sure your back end can handle it; that's what I would do. The Collector is configured with YAML. Who loves YAML? Come on. [Music] Sorry. It's better than JSON? Yeah, I'll take that, thank you. It's horrible, but it's the least horrible thing we have. This is what the Collector configuration looks like. For each one of those plugin types there's a key, under it go the plugins you're using, and each of those has its own configuration.

For a receiver, that's things like what we actually want to scrape, if we're using the Prometheus receiver. If it's just the generic OTLP receiver, there's really nothing to configure; you're just saying, hey, if someone sends you data on port 4318, take it and treat it as OTLP. And on the exporters, this is pretty much what every exporter configuration looks like: you have the endpoint and a few keys to configure things like authentication. So, very, very straightforward.
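I can't reproduce the exact slide, but a hedged sketch of the shape being described (an OTLP receiver on the default port, the batch processor, and an exporter with an endpoint plus authentication keys) looks roughly like this:

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318        # accept OTLP over HTTP on the default port

processors:
  batch: {}                            # batch to save network, as mentioned earlier

exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318   # placeholder backend
    headers:
      authorization: "Bearer ${env:BACKEND_TOKEN}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```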

Specifically, the one on the slide is the ClickHouse exporter, which I'm a big fan of; that's the database my company supports. It has this async insert feature, and if you think about what that does, it's pretty straightforward: it doesn't wait around for the data to be guaranteed, which, for telemetry data, is probably okay. Oh, hey, we're done. Everything's awesome, right? That was easy. Okay.

So, we don't want to run too fast. There are some pitfalls and warnings, and I'm going to warn you about them now so that you will not fall flat on your face like this poor little girl. We talked about this one a little bit already: there are a lot of libraries and tools to stitch together, and there's a lot of new vocabulary to learn when you're coming into the OpenTelemetry ecosystem. But don't let it scare you. It's just vocabulary; at the end of the day, it maps back to concepts that you already know very well, they just have different names. Next, the specifications are semi-stable.

This was worse two years ago, and even last year: there was a major upheaval where the HTTP specifications got changed, which then means every vendor has to change too. Like I said, if it speaks OTLP, it speaks OTLP, great; but don't change OTLP, because then a lot of stuff has to change. They're pretty stable now, though. There's that one instance in the six-and-a-half-year history of OpenTelemetry where they made a major specification change on people. Next, the SDKs are idiomatic for the source language. This can be a pro or a con. If you're a Java developer, you're very familiar with Java idioms and Java ways of doing things, and the OpenTelemetry project very much tries to adhere to that within each of the 11 languages it supports.

It also means there isn't really an OpenTelemetry-idiomatic way to do things. So if you're an SRE team trying to instrument a bunch of services that were lobbed over the wall to you from your dev teams, and you don't have expertise in the languages they were using, this can be kind of a pain. Probably what you should be doing in that case is making the dev teams, sorry, learn this stuff and instrument their own services, but that's a topic for a whole other talk. Next, duplicate data can confuse tools. Don't instrument the same service with two different technologies.

Just don't do it in general. Don't install two different node agents on the same node. There are ways you can make it work, but I've never seen it actually be worth it. And you really can confuse the back-end tools. This is a whole new can of worms that OpenTelemetry opens up, because the vendors used to have complete control over the provenance of the data: they knew exactly where it came from and how it was processed along the way to them. Now they don't. Now you can send them whatever you want. Now you can send them metrics for a Kubernetes pod and tell them, through the metadata, that it's a VM, and their back end is going to say, yep, that's a VM.

So yeah, this is a brave new world. And then, technically speaking, specifically with metrics: the way most of the OpenTelemetry ecosystem tools are built is with a preference for extreme cardinality. Everything is a label, and sometimes, depending on what system you're using to store this, you might actually be paying a lot of money for labels you don't need, unless you're using something specifically designed to handle that kind of cardinality. And then low granularity. When I say low granularity on time series, I mean multiples of 15 seconds.

So 15, 30, maybe 60 seconds of granularity in terms of how often you're scraping and updating the metrics, versus other tools that might be on the order of 5, 3, or 1 second. That's just something you have to keep in mind with your storage engines, the way you structure your metadata, and the way you structure your queries. It doesn't mean you're stuck with it: you don't have to use OpenTelemetry metrics just because you use OpenTelemetry tracing. You could use the tracing and keep Prometheus metrics. And that's another cool thing about this project: holy cow, how rare is it in our entire industry that we get a consolidation of formats instead of a proliferation of formats?

Everyone knows the xkcd strip: now we have 15 competing standards. I didn't even put it on here because it's so popular. But OpenTelemetry is actually an example of us reducing that number, and that's awesome. And it's been so effective that even the Prometheus project, which has been the de facto standard for metrics for as long as I've been in the field, is co-evolving with OpenTelemetry. They're saying, okay, OpenTelemetry is the new way; we need to be at least compatible with it, if not participating in it. So that's really powerful and really cool.

Okay. And then the last pitfall warning is that there's a lack of examples. That's just categorically not true anymore. There's this awesome project called the OpenTelemetry demo project. This is the GitHub URL; note that there is a hyphen in the GitHub organization name. That's about the only place you'll ever see a space or a hyphen in OpenTelemetry; everywhere else it's written as one word, with no space or hyphen, and that's also how it's referred to pretty much everywhere publicly. This project is awesome. It gives you this telescope shop, kind of based on the Hipster Shop Google project, if you're familiar with that, but it's an astronomy shop.

Because the OpenTelemetry logo is a telescope, it's the telescope store. It's a Node.js front end with a very, very contrived back end. The goal of the back end is not to be efficient or well designed; the goal of the back end is to demonstrate every language and feature that OpenTelemetry supports. If you think about how you would design a back end with that goal in mind, you end up with 18 different microservices to power a store that has three products. But you can do some really, really cool stuff with it: it has a bunch of built-in monitoring. So here's a different example of that RED metrics dashboard that we were looking at at the very beginning of the talk.

This is actual live data coming out of the OpenTelemetry demo project. You can see some of those microservices I was talking about, and you can see all of them here. You can just pull this down right now, run docker compose up, and in a few seconds you'll see this. It has built-in Jaeger; Jaeger is a tool used for visualizing traces, a great tool for development and pre-prod environments. So here we can visualize some of the traces coming from the OpenTelemetry demo project.

This is a trace coming from the product catalog service. I mentioned that you can have all these high-cardinality dimensions on the traces, and these are some of them. Here's a product ID, which could be really high cardinality at some scale; this demo store obviously isn't doing real sales volume. Addresses are insanely high cardinality, and we just throw them in there. We put it under the app key if it's something specific to our app; that's just a convention. And how did we get this in here? Really easy. This service happens to be Go, not JS, but it's the same idea: get the context, apply the tag.

So this is not rocket science. But it's also not complete: what I've been showing you is not a complete solution, and it's not going to solve all of your observability and monitoring woes. You need more. We've been talking about how OpenTelemetry can cover a lot of the left side of that graph: it's got your application SDKs covered, it's got your host and node agents covered, and in most cases it's got collection, sampling, and processing covered, where you can use either the OpenTelemetry-provided tools or any of the many vendor-provided tools that are compatible with it. I've talked to people who are getting all of their telemetry from every source into a Kafka stream, and then that's the middle box: just your Kafka stream of all your observability data.

Then you're going to ship all of that off somewhere to be stored, some database or some collection of databases, and, like we mentioned, there might be some kind of gateway, because if your storage technology doesn't like having a lot of open connections, you're going to create a gateway or proxy that manages those connections; that might also be an instance of the OpenTelemetry Collector. So OpenTelemetry is blooming farther and farther to the right on this graph.

Three years ago, when I first gave this talk, no one was really talking about the OpenTelemetry Collector; it was all about the SDKs and node agents. It just keeps getting closer and closer to that back-end side. But okay, that's all really cool. Finally, we need the analysis UI and alerting. The de facto open source choices for this would be Grafana and Alertmanager, and there are tons of vendor-based ones that are really good. There are also some other really exciting technologies that are not under OpenTelemetry but live in the same familiar ecosystem. eBPF is one of them. It's actually used in the OpenTelemetry Go auto-instrumentation. I didn't mention Go as one of the languages you can do auto-instrumentation for, because it's compiled.

How do you do auto-instrumentation for a compiled language, where we've discarded all of the human-readable context by the time we have the binary running on our machine? Well, eBPF gives us a way to observe compiled binaries. Right now it works pretty well for Go; there's work underway to make it work for some of the other compiled languages, but for Go it's pretty strong. Raise your hand if you're not familiar with eBPF. Okay, for those folks, and thank you for your honesty: it's a technology built into the Linux kernel. I think it's been around since 2016 or 2017, and it was originally designed for network and security applications; WireGuard, for example, leverages it heavily.

But basically, what it does is let you add hooks to syscalls, and it does that in a way that is safe. It doesn't let you write arbitrary C; it gives you a subset of C that it can guarantee is safe, so it can give you things like a guarantee that the program will halt, which you couldn't actually do with a Turing-complete language. So you get this subset of C that you can do really, really useful stuff with, hooked into the kernel-space view of another process, from your own user space. So it empowers a ton of things, security-focused ones, but observability too.

This is a tool that uses eBPF and OpenTelemetry; it's called Coroot. It's a batteries-included open source observability tool; you can just install it. For this screenshot, all I did was pull down the OpenTelemetry demo project and type docker compose up, then pull down Coroot and type docker compose up. That's it, no extra work to instrument anything. Now, the OpenTelemetry demo project does have all the instrumentation built in, but that's not what's coming through here; this is coming through from eBPF. If I wanted to send that OTel data too, and enrich everything with the extra metadata that's in those OTel attributes, I could send it here as well, and Coroot is smart enough to merge it all together.

But even before I've done that, I still get my complete topology. I still get my complete map of everything talking to everything else, through the power of eBPF, by actually tracing the syscalls into the networking stack on the Linux boxes these services are running on. And Coroot, I just love it, because you install it and you immediately get value out of it. It's not going to go as deep as some other, more mature observability platforms, but just turn it on and see what you get. It's really, really cool.

I'm going to wrap things up, and then we'll have some time for questions. So, my closing thoughts: why do we care? Why open source observability? I kind of opened with this, but I'll reiterate. It's vendor-neutral instrumentation. This is what empowers people to include instrumentation in the things they're sharing with the world; that's why vendor-neutral instrumentation was so important, even if you're still ultimately sending the telemetry to a vendor. This is just a really big change in the ecosystem, in terms of how the software gets made. Then we have the portable telemetry formats: OTLP, my favorite part of OpenTelemetry, which enables this whole ecosystem to function and enables everything to talk to everything else all the time.

Not just on that fateful day when you say, oh, we're switching from Datadog to some other vendor, but all the time; everything is interoperable. So vendor-neutral instrumentation combined with portable telemetry formats powers these interoperable tool chains, which are just beautiful. It gives us the ability to mix and match and use whatever works best for our situation. And it's also all open source, which means we all get to learn and grow together. If someone improves the auto-instrumentation for a library, you get to benefit from that. We're not all chasing the same dragons; we get to work together.

And I think that's beautiful. I think that's awesome. And that's all I've got; time for questions. [Applause] Yes? "What is the susceptibility to log tampering attacks for OpenTelemetry?" That's not something I've thought about before, at least not articulately. Can you tell I don't come from security? I come from observability, and usually I think observability and security people hate each other the most, right? So, yeah. No, it is a change in the world: before, that would have been a question for your vendor, and they would have to tell you what protocols they're using for the communication and how they're encrypting it.

[Music] Yeah, I don't know. I think it's probably an area that needs more exploration, and if you want to come into the OpenTelemetry community and tell us how to do it better, please do. Any other questions? Okay, I answered them all. Great. I have a few stickers if you're prone to talking about open source.