A pre-built dashboard that enables software engineers to capture incident data for post-mortem reports. It became a key feature that unblocked deals with prospective customers.
A valuable incident management tool helps its users learn from past incidents in order to prevent future ones. However, our users struggled to understand their incident performance: they lacked access to important metrics and an intuitive way to view and interpret them.
My Role Product Designer and Researcher
Team Product Designer, Senior Backend Engineer, Senior Frontend Engineer, Engineering Manager
Duration 3 months
Tools Figma, FigJam, Miro, Dovetail, GitHub
Users were unable to learn from their incident response and improve their performance because they did not have access to the data, nor an easy way to view and interpret it.
Technical complexity in choosing the right datasource
No dedicated product manager, lack of product direction
Stakeholder and customer pressure for quick release
Blocked deals
Showcase incident-related metrics
Help users understand the dashboard and improve their incident response as a result
Let users customize and build their own dashboards
Connect with the greater dashboard ecosystem of Grafana
Reach feature parity with competitors
Unblocked deals
A native dashboard experience
Immediate adoption
A deep dive into our research repository uncovered a thorough research report that revealed users' goals connected to incident metrics.
⚡️ Spot inconsistencies in incident response to better train their teams
⚡️ Maintain proper data hygiene
⚡️ Resolve incidents faster and have a baseline to track performance over time
⚡️ Report incidents to leadership and/or impacted customers with minimal toil
? Should the metrics be configurable?
? Should we publish these metrics so that users can create their own dashboard?
? What metrics would enable them to learn from incidents?
? What datasource should be used?
A brief competitive analysis revealed the most common incident metrics in the DevOps industry.
It also opened my eyes to the huge advantage we had: thanks to the company's "Big Tent" philosophy, we could expose metrics that span the entire stack.
We could cover everything from the moment an alert started firing to the conclusion of an incident post-mortem and the monitoring that happens thereafter.
Incidents by status
Incidents by severity
Histogram of incidents over time
Mean time to detect
Mean time to acknowledge
Mean time to resolve (see the sketch after this list for how these time-based metrics are derived)
Filter by incident labels
Key roles assignment
Post-mortem completion
Task completion
Operational readiness: timezone coverage for OnCall schedules
Wellbeing: distribution of incidents per team
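To make the time-based metrics above concrete, here is a minimal sketch of how MTTD, MTTA, and MTTR can be derived from incident timestamps. The `Incident` shape and its field names are hypothetical, for illustration only, not our actual data model.

```ts
// Hypothetical incident record; field names are illustrative only.
interface Incident {
  startedAt: Date;      // when the first alert started firing
  declaredAt: Date;     // when the incident was declared
  acknowledgedAt: Date; // when a responder took ownership
  resolvedAt: Date;     // when the incident was closed
}

// Average a list of millisecond durations, expressed in minutes.
const meanMinutes = (durationsMs: number[]): number =>
  durationsMs.reduce((sum, d) => sum + d, 0) / durationsMs.length / 60_000;

// Mean time to detect: alert firing -> incident declared.
const mttd = (incidents: Incident[]): number =>
  meanMinutes(incidents.map(i => i.declaredAt.getTime() - i.startedAt.getTime()));

// Mean time to acknowledge: declared -> first responder.
const mtta = (incidents: Incident[]): number =>
  meanMinutes(incidents.map(i => i.acknowledgedAt.getTime() - i.declaredAt.getTime()));

// Mean time to resolve: declared -> closed.
const mttr = (incidents: Incident[]): number =>
  meanMinutes(incidents.map(i => i.resolvedAt.getTime() - i.declaredAt.getTime()));
```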
I started mocking things up in Figma to think visually and generate discussions with product and engineering peers.
Based on the previous research findings, I proposed we:
Split the dashboard into three sections: overview, data hygiene, and operational readiness, each containing its own set of panel visualizations.
Show a month-over-month comparison in each panel (a simple delta, sketched below) to help leadership answer simple questions: are we getting better or worse? What trends can we spot?
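The month-over-month comparison itself is a simple delta. A minimal sketch, assuming each panel receives an aggregate value for the current and previous month (the helper name is illustrative):

```ts
// Percentage change from the previous to the current month (illustrative helper).
function monthOverMonthDelta(current: number, previous: number): number | null {
  if (previous === 0) return null; // no baseline to compare against
  return ((current - previous) / previous) * 100;
}

// Example: 12 incidents this month vs. 16 last month -> -25 (an improvement).
const delta = monthOverMonthDelta(12, 16); // -25
```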
As a cross-functional squad, we then agreed on a set of metrics to start with, taking technical constraints into account so we could get a simple MVP out the door.
This meant the operational readiness section had to be parked due to RBAC (role-based access control) constraints.
A technical decision regarding the datasource blocked this project for a while. Different options were on the table and a decision had to be made.
🤔 Should we go with the quick fix of using our own product datasource?
🤔 Or should we implement Prometheus, drawing on its wide popularity within the observability industry and among our users?
Eventually, the pressure to launch and the constraints around adopting Prometheus led the team to go with the first option: the product's own datasource and query language. This got us shipping faster, but compromised the goal of dashboard customization.
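For context on what the Prometheus route would have implied, here is a sketch of the kind of PromQL panel queries we could have offered. The metric and label names (`incidents_active`, `incidents_created_total`, `incident_resolution_seconds`) are assumptions for illustration, not metrics the product actually exported.

```ts
// Hypothetical PromQL panel queries, assuming the product exported
// Prometheus metrics with these (illustrative) names.
const panelQueries = {
  // Currently open incidents, broken down by severity label.
  incidentsBySeverity: 'sum by (severity) (incidents_active)',

  // Incidents opened per week, for the histogram-over-time panel.
  incidentsOverTime: 'sum(increase(incidents_created_total[1w]))',

  // Mean time to resolve over the last 30 days, from a summary metric.
  meanTimeToResolve:
    'sum(increase(incident_resolution_seconds_sum[30d]))' +
    ' / sum(increase(incident_resolution_seconds_count[30d]))',
};
```

Queries like these would have been portable across the wider Grafana dashboard ecosystem, which is exactly the customization the product datasource ended up compromising.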
I learned, somewhat painfully, that building a live dashboard was a far better way to prototype than static Figma mockups. It was one of the best decisions we made: it forced me to put myself in our users' shoes and experience their future pain points first-hand.
To meet accessibility standards and support color-blind users, I changed the color fill of all the statuses.
Severity colors were changed to match the product UI, bringing consistency and building on the meaning already attributed to those colors.
I applied this research method here to understand users’ mental models, expectations, pain-points, and unmet needs that touched on:
👉🏼 The discoverability of this feature
👉🏼 The metrics users expect to see
👉🏼 How users interpret the data
👉🏼 How users slice and dice the data to fit it to their unique use case
The backend engineer working with me shadowed these sessions, which helped us both empathize with users. It also kept us on the same page about the feature changes these sessions would expose.
Success rate of 72%
🧡 Users loved exploring the detailed-view table and using labels to filter by team, service, or customer impact; it spared them manual work.
🔎 The ability to filter the dashboard by severity was requested multiple times, as it helped participants decide how much time to spend understanding a metric.
📝 MTTR (mean time to resolve) was a helpful starting point, but participants requested more metrics like it, especially for reporting incidents to customers.
Previous experience with Grafana dashboards made or broke participants' ability to customize the insights dashboard.
⚠️ Low to medium-experienced users found it challenging to duplicate the dashboard, edit and add panels.
⚠️ Power users struggled to understand the query language, but improvised and were able to crack it (kind of).
"I have no idea how to translate that query"
ℹ️ The help icon-button was a crucial element for understanding the query, but only 30% of participants found it.
Added a global severity filter
Removed the "Responders" panel, since nobody understood what that meant
Invested in documentation to guide users through customizing the dashboard and understanding the query language, since the positioning of the help button could not be changed. 👉🏼 View documentation here
The research made it clear that the product datasource wasn’t achieving the customization goals we had in mind.
I’m proud to have been a supportive team player when my engineering peers were blocked, and needed someone to brainstorm possibilities.
This was the first big project where I instrumented the code to analyze usage data in FullStory. It was the beginning of my data-driven approach to design.
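As an illustration of the kind of instrumentation involved, here is a minimal sketch using FullStory's browser API to track dashboard interactions. The event and property names are hypothetical, not the actual events we shipped.

```ts
// FullStory's browser snippet exposes a global FS object.
// Event and property names below are illustrative only.
declare const FS: {
  event: (name: string, properties?: Record<string, unknown>) => void;
};

// Fired when a user applies the global severity filter.
function trackSeverityFilter(severity: string): void {
  FS.event('Insights Dashboard: Severity Filter Applied', { severity });
}

// Fired when a user finds and opens the query help button.
function trackQueryHelpOpened(panelTitle: string): void {
  FS.event('Insights Dashboard: Query Help Opened', { panelTitle });
}
```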