A cross-squad collaboration to enable Site Reliability Engineers (SREs) to quickly get help during an active incident. Increased product discovery by 40%.
For many incident response teams, pulling in other people is a manual process that gets in the way of actually responding to the incident. This new feature streamlined the process of connecting workflows in Grafana IRM to reduce toil and give all the necessary parties proper visibility into the incident.
My Role: Product Designer II and Researcher
Team: Product Designer II, Senior Product Designer, 3 Senior Software Engineers, Product Manager, Engineering Manager
Duration: 3 months
Tools: Figma, FigJam, Miro, GitHub
Disconnected IRM workflows slowed down SREs and incident responders when they needed to pull people into an active incident for help.
High complexity
Siloed teams
Stakeholder pressure
Resolve an incident faster by bringing the right people on-call
Reduce toil and context switching for users
Automate a manual process
Improve time-to-value
Reach feature parity with competitors
40% increase in MAUs between IRM products
Reduced toil through automation
A seamless and unified workflow
Quicker time to value
Cross-functional collaboration that broke down a silo and encouraged the creation of a new unified team
Data from customer feedback calls and support tickets showed that our primary users needed tighter integration between OnCall and Incident, both part of Grafana IRM, so that they could streamline their workflows for faster incident response with minimal toil.
First, we set aside a short time to understand the why...
I led a Lean UX brainstorm, with input from my UX colleague and our PM, that helped our squad identify the business problems, possible solutions, and outcomes we expected to achieve with this feature.
An end-to-end workflow map showed us opportunities for automation to reduce unnecessary steps.
We identified three steps that could be eliminated by applying sensible configuration defaults, since users, unsurprisingly, did not want to configure anything during an active incident.
Picking this low-hanging fruit resulted in a simpler flow for paging incident responders:
The final workflow was significantly shortened by removing schedules and adding an integration with default notification settings to every team. Users could edit these settings later, when there was no pressure to resolve an incident.
After careful consideration of each option's implications, we chose the dropdown design because it would scale in the long run and let us eventually expand this feature to other pages without a major redesign.
Some helpful functionality could not be implemented due to API constraints and had to be parked for the time being.
Despite stakeholder opinions, the team decided to begin with only two tabs (Users and Teams) and gradually add more as the feature evolved, thereby reducing initial complexity.
While engineers were coding the MVP, I conducted usability tests to understand what could be improved for the next version. It was important for me to understand how discoverable the feature was and how users interacted with it.
Was it findable and intuitive enough?
Success rate of 84% on average
In spite of the high success rate, clear pain points emerged.
🤔 Users were unsure about paging a team because the UI was unclear about who was being paged. They were scared of making a mistake by bothering someone unrelated to the incident.
🤹 Users were pulled out of context to page someone from the UI; they would prefer to stay in context and do everything from Slack / MS Teams.
🧠 Users were relying only on memory and expertise to know who to page, which presented a challenge to new team members, and potentially delayed incident response.
📩 Users found that the copy failed to convey the urgency of the incident. "There's nothing useful for me in the notification title '[username] is inviting you to "incident title"'"
It takes time to build something meaningfully usable between two different teams, and stakeholders had different expectations for this project's timeline. Our communication with them could have been more frequent and clearer.
We found that this feature would be quite helpful for smaller orgs where everybody is known, but painful for orgs with 5k+ users, so it was not very scalable in the short term.
The feature relied heavily on human memory in times of stress, which put the emphasis on knowledge in the head rather than knowledge in the world.
My efforts to instrument the dashboard and track data showed that this feature brought a 40% increase in cross-product adoption and organic discovery.
Following the research results, I wrote recommendations and influenced the team to iterate on the next quarter's roadmap, making this feature even more useful to our users.