A pre-built dashboard that enables software engineers to capture incident data for post-mortem reports. It became a key feature that unblocked deals with prospective customers.
A valuable incident management tool helps its users learn from past incidents in order to prevent future ones. However, our users struggled to understand their incident performance: they lacked access to important metrics and an intuitive way to view and interpret them.
My Role Product Designer and Researcher
Team Product Designer, Senior Backend Engineer, Senior Frontend Engineer, Engineering Manager
Duration 3 months
Tools Figma, FigJam, Miro, Dovetail, GitHub
Users were unable to learn from their incident response and improve their performance because they did not have access to the data, nor an easy way to view and interpret it.
Technical complexity in choosing the right datasource
No dedicated product manager, lack of product direction
Stakeholder and customer pressure for quick release
Blocked deals
Showcase incident-related metrics
Help users understand the dashboard and improve their incident response as a result
Let users customize and build their own dashboards
Connect with the greater dashboard ecosystem of Grafana
Reach feature parity with competitors
Unblocked deals
A native dashboard experience
Immediate adoption
A deep dive into our research repository uncovered a thorough research report that revealed users' goals connected to incident metrics.
⚡️ Spot inconsistencies in incident response to better train their teams
⚡️ Maintain proper data hygiene
⚡️ Resolve incidents faster and have a baseline to track performance over time
⚡️ Report incidents to leadership and/or impacted customers with minimal toil
? Should the metrics be configurable?
? Should we publish these metrics so that users can create their own dashboard?
? What metrics would enable them to learn from incidents?
? What datasource should be used?
A brief competitive analysis revealed the most common incident metrics in the DevOps industry.
It also opened my eyes to the huge advantage we had: thanks to the company's "Big Tent" philosophy, we could expose metrics that span the entire stack.
We could cover everything from the moment an alert started firing to the conclusion of an incident post-mortem and the monitoring that happens thereafter.
Incidents by status
Incidents by severity
Histogram of incidents over time
Mean time to detect
Mean time to acknowledge
Mean time to resolve (see the sketch after this list for how these time-based metrics are derived)
Filter by incident labels
Key roles assignment
Post-mortem completion
Task completion
Operational readiness: timezone coverage for OnCall schedules
Wellbeing: distribution of incidents per team
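To make the time-based metrics above concrete, here is a minimal sketch of how MTTD, MTTA, and MTTR can be derived from incident timestamps. The `Incident` shape and its field names are hypothetical, for illustration only, not our actual data model.

```ts
// Hypothetical incident record; field names are illustrative only.
interface Incident {
  startedAt: Date;      // when the first alert started firing
  declaredAt: Date;     // when the incident was declared
  acknowledgedAt: Date; // when a responder took ownership
  resolvedAt: Date;     // when the incident was closed
}

// Average a list of millisecond durations, expressed in minutes.
const meanMinutes = (durationsMs: number[]): number =>
  durationsMs.reduce((sum, d) => sum + d, 0) / durationsMs.length / 60_000;

// Mean time to detect: alert firing -> incident declared.
const mttd = (incidents: Incident[]): number =>
  meanMinutes(incidents.map(i => i.declaredAt.getTime() - i.startedAt.getTime()));

// Mean time to acknowledge: declared -> first responder.
const mtta = (incidents: Incident[]): number =>
  meanMinutes(incidents.map(i => i.acknowledgedAt.getTime() - i.declaredAt.getTime()));

// Mean time to resolve: declared -> closed.
const mttr = (incidents: Incident[]): number =>
  meanMinutes(incidents.map(i => i.resolvedAt.getTime() - i.declaredAt.getTime()));
```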
I started mocking things up in Figma to think visually and generate discussions with product and engineering peers.
Based on the previous research findings, I proposed we:
Split the dashboard into three sections: overview, data hygiene, and operational readiness, each containing its own set of panel visualizations.
Show a month-over-month comparison in each panel (a simple delta, sketched below) to help leadership answer simple questions: are we getting better or worse? What trends can we spot?
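The month-over-month comparison itself is a simple delta. A minimal sketch, assuming each panel receives an aggregate value for the current and previous month (the helper name is illustrative):

```ts
// Percentage change from the previous to the current month (illustrative helper).
function monthOverMonthDelta(current: number, previous: number): number | null {
  if (previous === 0) return null; // no baseline to compare against
  return ((current - previous) / previous) * 100;
}

// Example: 12 incidents this month vs. 16 last month -> -25 (an improvement).
const delta = monthOverMonthDelta(12, 16); // -25
```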
As a cross-functional squad, we then agreed on a set of metrics to start with, taking technical constraints into account so we could get a simple MVP out the door.
This meant the operational readiness section had to be parked due to RBAC (role-based access control) constraints.
A technical decision regarding the datasource blocked this project for a while. Different options were on the table and a decision had to be made.
🤔 Should we go with the quick fix of using our own product datasource?
🤔 Or should we implement Prometheus, drawing on its wide popularity within the observability industry and among our users?
Eventually, the pressure to launch and the constraints around adopting Prometheus led the team to go with the first option: the product's own datasource and query language. This got us shipping faster, but compromised the goal of dashboard customization.
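For context on what the Prometheus route would have implied, here is a sketch of the kind of PromQL panel queries we could have offered. The metric and label names (`incidents_active`, `incidents_created_total`, `incident_resolution_seconds`) are assumptions for illustration, not metrics the product actually exported.

```ts
// Hypothetical PromQL panel queries, assuming the product exported
// Prometheus metrics with these (illustrative) names.
const panelQueries = {
  // Currently open incidents, broken down by severity label.
  incidentsBySeverity: 'sum by (severity) (incidents_active)',

  // Incidents opened per week, for the histogram-over-time panel.
  incidentsOverTime: 'sum(increase(incidents_created_total[1w]))',

  // Mean time to resolve over the last 30 days, from a summary metric.
  meanTimeToResolve:
    'sum(increase(incident_resolution_seconds_sum[30d]))' +
    ' / sum(increase(incident_resolution_seconds_count[30d]))',
};
```

Queries like these would have been portable across the wider Grafana dashboard ecosystem, which is exactly the customization the product datasource ended up compromising.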
I learned, somewhat painfully, that building a live dashboard was a far better way to prototype than static Figma mockups. It was one of the best decisions we made: it forced me to put myself in our users' shoes and experience their future pain points first-hand.
To meet accessibility standards and support color-blind users, I changed the color fill of all the statuses.
Severity colors were changed to match the product UI, bringing consistency and building on the meaning already attributed to those colors.
I applied this research method here to understand users’ mental models, expectations, pain-points, and unmet needs that touched on:
👉🏼 The discoverability of this feature
👉🏼 The metrics users expect to see
👉🏼 How users interpret the data
👉🏼 How users slice and dice the data to fit it to their unique use case
The backend engineer working with me shadowed these sessions, which helped us both empathize with users. It also kept us on the same page about the feature changes these sessions would expose.
Success rate of 72%
🧡 Users loved exploring the detailed-view table and using labels to filter by team, service, or customer impact; it spared them manual work.
🔎 The ability to filter the dashboard by severity was requested multiple times, as it helped participants decide how much time to spend understanding a metric.
📝 MTTR (mean time to resolve) was a helpful starting point, but participants requested more metrics like it, especially for reporting incidents to customers.
Previous experience with Grafana dashboards made or broke participants' ability to customize the insights dashboard.
⚠️ Low to medium-experienced users found it challenging to duplicate the dashboard, edit and add panels.
⚠️ Power users struggled to understand the query language, but improvised and were able to crack it (kind of).
"I have no idea how to translate that query"
ℹ️ The help icon-button was a crucial element for understanding the query, but only 30% of participants found it.
Added a global severity filter
Removed the "Responders" panel, since nobody understood what that meant
Invested in documentation to guide users through customizing the dashboard and understanding the query language, since the positioning of the help button could not be changed. 👉🏼 View documentation here
The research made it clear that the product datasource wasn’t achieving the customization goals we had in mind.
I’m proud to have been a supportive team player when my engineering peers were blocked, and needed someone to brainstorm possibilities.
This was the first big project where I instrumented the code to analyze usage data in FullStory. It was the beginning of my data-driven approach to design.
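As an illustration of the kind of instrumentation involved, here is a minimal sketch using FullStory's browser API to track dashboard interactions. The event and property names are hypothetical, not the actual events we shipped.

```ts
// FullStory's browser snippet exposes a global FS object.
// Event and property names below are illustrative only.
declare const FS: {
  event: (name: string, properties?: Record<string, unknown>) => void;
};

// Fired when a user applies the global severity filter.
function trackSeverityFilter(severity: string): void {
  FS.event('Insights Dashboard: Severity Filter Applied', { severity });
}

// Fired when a user finds and opens the query help button.
function trackQueryHelpOpened(panelTitle: string): void {
  FS.event('Insights Dashboard: Query Help Opened', { panelTitle });
}
```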