Observability is a brilliant term for anyone writing about technology. Debate and disagreement over what it is, what it isn’t and why it matters makes it ripe for comment and analysis.
But although it arouses a great deal of interest from a technical point of view (is it, for example, just another word for monitoring?), one aspect is often overlooked: even as it is a deeply technical problem, it is also very much a human one.
Wherever businesses worry about downtime and wherever software developers have to fight fires while on call in the wee hours of the morning, observability needs to be treated as fundamental to how we think about and practice software engineering. In short, observability is a human issue and, more often than not, a work issue as well.
Doing it right will not only guarantee software engineers the respect and empathy they deserve in organizations, but it will also help them become more collaborative, curious, and engaged in the workplace.
Bridging the Gap: Engineers and Everyone Else
Nora Jones, CEO of incident analytics platform Jeli, recounted a time when a company she worked for managed to land a commercial spot in the Super Bowl. When the ad aired, the company’s systems crashed – a highly publicized (and costly) outage.
At the incident review meeting held within a week, she said, no one from marketing or public relations was present — only site reliability engineers.
That experience, Jones said, was a big inspiration in Jeli’s development. “What I wanted to do was create those bridges so that the other functions could understand each other and how they were participating.”
Jeli is a particularly interesting platform in that it sheds light on how observability should be seen as much more than a technical challenge. It should be understood as a people problem, symptomatic of the inevitable headaches that come with building and using systems that involve different areas and different types of expertise.
It enables representatives from different teams, be they engineers, marketers or support personnel, to not only collaborate in the recovery of a given incident, but also to document the necessary context that makes responding to an incident – and taking the necessary steps to prevent it from happening in the future – so much easier.
One of the most curious aspects of the tool is that its user interface invites users to build a “narrative”. “You’ll notice you don’t see the word ‘incident’ much here,” Jones said. “Because we’re trying to make it that collaborative opportunity.”
It’s ultimately about creating a psychologically safe workplace, she said: “People are more likely to participate in an open and honest way if they don’t feel distressed.”
That’s important in itself, of course, but we shouldn’t overlook the fact that if people don’t feel comfortable participating, the problems will persist. It will become nearly impossible to cope with the inevitable consequences of complexity, creating an even worse working environment and sending workplaces into a spiral of misery and stress. Ultimately, this will only further exacerbate burnout, which is already incredibly prevalent in the industry.
For software developers working on complex software systems, the technical difficulty of pinpointing the causes of problems and incidents is felt at the human and interpersonal level. Ironically, rather than fostering a blameless culture, it creates the conditions for blame to thrive.
In the absence of data, be it application logs or contextual notes and documentation, it is tempting to look for other explanations. To use Jeli’s terminology, if there is no explicit storytelling, others will emerge in communication backchannels and whispering networks. It’s a recipe for toxicity.
The intersection of humans and technology
To understand the importance of observability, we must therefore pay attention to how humans and technology interact. It is not enough to think of metrics in terms of specific activity taking place within a system or application: they can be useful, of course, but in the context of incident response and reliability, such an approach inevitably limits what you can see and, by extension, the kinds of questions you can even ask.
“Observability is centered on an exploratory and interactive workflow. Asking new questions. Making sense of the ‘unknown unknowns’. Determining what matters, in the context of your business,” said Liz Fong-Jones (no relation to Nora Jones), developer advocate for the Honeycomb.io observability platform and co-author of Observability Engineering from O’Reilly.
“An investigation workflow can start with a predefined dashboard, but should always, always allow the user to browse and customize the questions posed to their systems.”
To follow this line of thinking, observability is something that can change the way individuals and collectives relate to and think about systems. Without wanting to sound too idealistic, this does give a certain level of autonomy and agency to the people responsible for developing and maintaining these systems.
Perhaps this is why Nora Jones describes reliability as both “an art and a science” — it’s more than just passive empiricism; it’s an ability to think carefully about what the data means for your organization.
As important as it is to bridge the gap between business and engineering functions, we must not overlook the fact that its greatest impact is on how engineers work with each other.
Indeed, if it bridges the gap between technical and non-technical teams, it should lead to technologists gaining greater empathy from parts of the business that may previously have had little knowledge of, or sensitivity to, the actual work they do.
This is especially important in a time of economic downturn and the so-called Great Resignation, Jones suggested. While working at Slack as the head of chaos engineering and human factors, she recalled the company’s management’s palpable sense of panic as a cohort of experienced engineers left at the time of the company’s IPO: “What do they know that no one else knows?”
Observability, when done well, will not only ensure transparency of pockets of expertise, but will also drive engagement and, yes, maybe even fun and enjoyment.
Noting that the Jeli team uses Honeycomb, Jones mentioned how excited the engineers were when the team brought it in. “It helps me invest in my engineers as CEO because I give them time, space and tools to learn and understand,” she said. “It gives them power.”
Observability encourages curiosity; it opens up ways for engineers to ask new questions and explore things in ways that a standard monitoring dashboard doesn’t.
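The contrast between fixed dashboards and open-ended questioning can be made concrete. The sketch below is purely illustrative — none of the field names or functions belong to any vendor's actual API — but it captures the idea behind tools like Honeycomb: record one wide, structured event per unit of work, so that questions nobody anticipated can still be asked after the fact.

```python
# Illustrative sketch only: wide structured events plus ad-hoc queries.
# All field names (route, status, region, plan) are hypothetical.
import json
from collections import defaultdict

def emit_event(buffer, **fields):
    """Record one rich event per unit of work (e.g. per request)."""
    buffer.append(json.dumps(fields))

def query(buffer, group_by, where=None):
    """Ad-hoc aggregation: count events per value of `group_by`,
    optionally filtered by a predicate -- the kind of question a
    pre-built dashboard cannot anticipate."""
    counts = defaultdict(int)
    for line in buffer:
        event = json.loads(line)
        if where is None or where(event):
            counts[event.get(group_by)] += 1
    return dict(counts)

events = []
emit_event(events, route="/checkout", status=500, region="eu-west", plan="enterprise")
emit_event(events, route="/checkout", status=200, region="us-east", plan="free")
emit_event(events, route="/search", status=500, region="eu-west", plan="free")

# A question no dashboard was built for: which customer plans are seeing errors?
errors_by_plan = query(events, "plan", where=lambda e: e["status"] >= 500)
print(errors_by_plan)  # {'enterprise': 1, 'free': 1}
```

Because each event carries all its context rather than a pre-aggregated counter, the same data can answer tomorrow's question as easily as today's — which is exactly the exploratory workflow Fong-Jones describes.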
Observability and work
As important as curiosity and fun are, we also need to recognize the usefulness of observability in the context of work pressure in engineering teams.
With on-call rotations now normal in many organizations thanks to the growing importance of digital infrastructure (and the rising cost of downtime), observability is key to ensuring that knowledge can be effectively shared within teams and between them.
It can also help ensure that people have precisely what they need when trying to fix a bug they may never have encountered before – possibly in the middle of the night.
The industry sometimes suffers from what Jones called a “hero syndrome”: the tendency for some individuals to make themselves more valuable by black-boxing themselves (keeping their essential work hidden) and positioning themselves as the only person able to solve every problem.
Undoubtedly, observability can go a long way in solving this problem by opening up knowledge and giving teams the clarity and context needed to be able to effectively debug and repair systems.
“Without access to observability, developers will certainly be worse off in terms of working conditions,” Fong-Jones said. However, she pointed out that observability is only a small element in the context of labor rights and working conditions.
“The key question is how power and control are distributed,” she said. “Having the data is a good start, but there’s a need to talk about how on-call happens in organizations and whether it’s a punishment or something that developers dread, or something over which they have agency and control.”
In other words, since observability sits at the intersection of the technical and the social, for it to have a real impact on the lives of software engineers, teams need to have frank and open conversations about how they work and collaborate, and even about the value of their work.
One technique or one platform alone cannot effect change. However, as industries and organizations begin to feel the strains and stresses that come with complexity, it looks like observability will provide a way for engineers – and those around them – to affirm the importance of humans in building and maintaining software.
Despite all the talk about the coming era of automation, observability reminds us that how we build software and work together will remain questions to be answered again and again, for years to come.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Honeycomb.io.
Featured image by Pascal Swier on Unsplash.