Refocus: The One-Stop Shop for Monitoring System Health
In this talk we will share why we decided to create Refocus: our internally developed, self-service tool for monitoring computing systems. We’ll cover how it is extensible, its tech stack of open source components (including Node.js), how it differs from other monitoring tools on the market, and how data is modeled in Refocus.
In today’s world we deal with huge data volumes. Web applications that serve and manage millions of Internet users, such as Facebook™, Instagram™, and Twitter™, and online retail shops, such as Amazon.com™ and eBay™, face the challenge of quickly processing high volumes of data to provide end users with a real-time experience.
These interactions are collectively referred to as Big Data, which is a major contributor to the concept and paradigm called the “Internet of Things” (IoT). IoT is about a pervasive presence in the environment of things that interact with each other through wireless and wired connections to create new levels of applications and services. IoT is applied to manage assets efficiently and make them “smart.” Smart cities (regions), smart cars, smart homes, and assisted living are some of the ways IoT is applied.
One role of the IoT is to connect device data to CRMs: user success platforms that engage customers through sales, service, marketing, communities, applications and analytics.
To ensure the performance and security that enable organizations to serve and manage millions of IoT users, global data centers, which host thousands of enterprise companies, require constant monitoring by engineers. Until recently, monitoring software visualized results for a single level of the stack. One example is Nagios, which we used at Salesforce to monitor the core app stack. But as more services were built and their stacks diverged, we needed to collect and visualize aspects of disparate devices in a hierarchical ecosystem. In response, we built an internal tool called Focus, which monitored systems across different stacks. However, as more systems came under monitoring, it became imperative to let users customize their views so they could target the devices and systems that mattered most to them.
Overall, we needed a platform to simplify the monitoring of resources used for Big Data computing and analytical tasks. We could have built a one-off solution that was neither extensible nor applicable outside of Salesforce. Instead, we invested in building something that others could use too. Thus Refocus was built as open source, to serve both us and the broader community. It aims to:
1. Connect all monitoring sources into one platform.
2. Quickly onboard new services.
3. Visualize the data in ways that make sense to all users.
Efficient monitoring of global data centers increases system reliability and uptime percentages, which improves the overall user experience. Further, a user monitoring the hierarchical representation of a system can be alerted to the status of operational aspects that require close monitoring or immediate remedial attention.
How are services modeled? Services are modeled as a hierarchy. Why? Because there was a trend towards natural hierarchies in the various systems we were attempting to monitor. So we designed a flexible, hierarchical data model which could be adapted to just about any use case.
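To make the idea of a hierarchical model concrete, here is a minimal sketch of a tree of monitored systems. It is only an illustration, not Refocus's actual schema: the `Subject` class, its fields, the dot-separated path format, and the example names are all assumptions chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator


@dataclass
class Subject:
    """Hypothetical node in a hierarchy of monitored systems."""
    name: str
    description: str = ""
    children: Dict[str, "Subject"] = field(default_factory=dict)

    def add_child(self, child: "Subject") -> "Subject":
        # Attach a child under this node and return it for chaining.
        self.children[child.name] = child
        return child

    def walk(self, prefix: str = "") -> Iterator[str]:
        # Yield the dot-separated path of this node, then of every descendant.
        path = f"{prefix}.{self.name}" if prefix else self.name
        yield path
        for child in self.children.values():
            yield from child.walk(path)


# Example hierarchy: data centers -> region -> host (names are made up).
root = Subject("DataCenters")
region = root.add_child(Subject("NorthAmerica"))
region.add_child(Subject("Web01"))

for path in root.walk():
    print(path)
```

Because every node has the same shape, the same structure can describe regions, clusters, hosts, or customers, which is what lets one model fit many use cases.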
How is Refocus self-serve? Instead of needing developers to change code to monitor a new system, users can import what to monitor. All the endpoints are exposed via an API, with no backdoors into the system. On top of the API, the metadata on objects reduces the need for tribal knowledge by providing all the relevant information about the service in a single place for all users. The metadata includes a set of tags to enable filtering, a list of related links that point to a time-series database or other related information, and a description string that describes the system or metric monitored.
The API endpoints allow a developer or user to define which systems to monitor, which metrics to monitor, and how often monitoring data is collected, and to specify how the data is visualized. With a well-documented API (thanks to Swagger!), the endpoints make connecting monitoring sources and onboarding users faster. In Refocus the systems under monitoring are called ‘Subjects’, and the metrics to monitor are called ‘Aspects’. An Aspect specifies which ranges of values map to which status. Depending on the status, the UI for that Aspect renders a certain color. As the metric of a service changes over time, the color changes in near-real time on Refocus to reflect the latest available data.
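The mapping from value ranges to a status, and from a status to a color, can be sketched as follows. This is a simplified illustration under our own assumptions, not the Refocus implementation: the status names, the half-open range convention, the color table, and the `Latency` thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical status-to-color table used by the UI.
STATUS_COLORS: Dict[str, str] = {
    "OK": "green",
    "Warning": "yellow",
    "Critical": "red",
    "Invalid": "grey",
}


@dataclass
class Aspect:
    """Hypothetical metric definition: each status owns a [low, high) range."""
    name: str
    ranges: Dict[str, Tuple[float, float]]

    def status_for(self, value: float) -> str:
        # Return the first status whose half-open range contains the value.
        for status, (low, high) in self.ranges.items():
            if low <= value < high:
                return status
        # Values outside every configured range are flagged rather than guessed.
        return "Invalid"

    def color_for(self, value: float) -> str:
        return STATUS_COLORS[self.status_for(value)]


# Made-up thresholds for a latency metric measured in milliseconds.
latency = Aspect(
    name="Latency",
    ranges={
        "OK": (0, 200),
        "Warning": (200, 500),
        "Critical": (500, float("inf")),
    },
)

for value in (120, 350, 900, -5):
    print(value, latency.status_for(value), latency.color_for(value))
```

Keeping the thresholds in data rather than in code is what allows a new metric to be onboarded without a developer changing the monitoring tool itself.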
What about the value of an Aspect at a given point in time? We specify that using Samples: a Sample is the value of an Aspect for a specified Subject at a point in time. As time passes, Refocus shows the updated data in near-real time without page refreshes. The user can define Samples, Aspects, and Subjects through the REST API or through the UI.
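As a rough sketch of how Samples relate to Subjects and Aspects, the snippet below keeps only the most recent value per (Subject, Aspect) pair. The `Sample` and `SampleStore` names, the field names, and the in-memory dictionary are assumptions made for illustration; they do not describe how Refocus stores data.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical reading: one Aspect value for one Subject at one time."""
    subject_path: str
    aspect_name: str
    value: float
    timestamp: float  # seconds since the epoch


class SampleStore:
    """Keeps the latest Sample for each (subject, aspect) pair."""

    def __init__(self) -> None:
        self._latest: Dict[Tuple[str, str], Sample] = {}

    def upsert(self, sample: Sample) -> bool:
        # Accept the sample only if it is newer than what we already hold.
        key = (sample.subject_path, sample.aspect_name)
        current = self._latest.get(key)
        if current is not None and current.timestamp >= sample.timestamp:
            return False
        self._latest[key] = sample
        return True

    def latest(self, subject_path: str, aspect_name: str) -> Optional[Sample]:
        return self._latest.get((subject_path, aspect_name))


store = SampleStore()
store.upsert(Sample("DataCenters.NorthAmerica.Web01", "Latency", 120.0, 1000.0))
store.upsert(Sample("DataCenters.NorthAmerica.Web01", "Latency", 350.0, 1060.0))
print(store.latest("DataCenters.NorthAmerica.Web01", "Latency"))
```

A UI that subscribes to accepted updates could then redraw only the affected Subject and Aspect, which is one way to get near-real-time changes without a page refresh.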
In addition to specifying both which systems to monitor and what to monitor, Refocus allows the end user to reuse existing lenses and create new ones. Lenses are data visualizations, and can be uploaded as a zip file. Screenshots of lenses are included as pictures in this report. The ability to import custom lenses to visualize data according to the user’s needs speaks to Refocus’s extensibility. We also have an endpoint called Perspectives, which specifies which system is monitored (Subject), which metrics are monitored (Aspects), and which data visualization is used to show them (lens).
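One way to picture a Perspective is as a saved filter that ties a Subject subtree, a set of Aspects, and a lens together. The sketch below is illustrative only: the `Perspective` class, its field names, the dot-separated Subject paths, and the lens name are assumptions, not the actual Refocus API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Perspective:
    """Hypothetical saved view: a Subject subtree, some Aspects, and a lens."""
    name: str
    root_subject: str
    aspect_names: FrozenSet[str]
    lens: str

    def selects(self, subject_path: str, aspect_name: str) -> bool:
        # A reading is shown if its Subject is the root or lies below it,
        # and its Aspect is one of the Aspects chosen for this view.
        in_subtree = (
            subject_path == self.root_subject
            or subject_path.startswith(self.root_subject + ".")
        )
        return in_subtree and aspect_name in self.aspect_names

    def filter(self, readings: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
        return [r for r in readings if self.selects(*r)]


view = Perspective(
    name="NA web latency",
    root_subject="DataCenters.NorthAmerica",
    aspect_names=frozenset({"Latency"}),
    lens="example-tree-lens",  # made-up lens name
)

readings = [
    ("DataCenters.NorthAmerica.Web01", "Latency"),
    ("DataCenters.NorthAmerica.Web01", "Errors"),
    ("DataCenters.Europe.Web07", "Latency"),
]
print(view.filter(readings))
```

Separating what is selected from how it is drawn means the same Subjects and Aspects can be shown through different lenses for different audiences.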
Internally, Refocus serves 1000+ users and 20+ teams. Our first customers, the Site Reliability Engineers who monitor disparate systems for downtime, have a special lens built for them that shows only the systems with troubled metrics. Other teams built their own lenses. They include Mission Critical Support, which uses Refocus to holistically visualize the health of individual customers in real time. This is the first time they have had access to real-time data visualization. Another team is Database Build and Deploy, a team of database administrators who use Refocus as a real-time checklist to deliver new database capacity. For this they created two new lenses: one with detailed steps for engineers and project managers, and another for managers. Both teams built their lenses using the README and Quickstart provided by the Refocus team, with little to no consultation otherwise. Since the project is open source, lenses have also been developed during hackathons in under 24 hours, which shows how extensible Refocus is.
Outside Salesforce, Zuora is our first external user. They use Refocus to:
• Reduce tribal knowledge by having everything in one place
• Take in data from any source (including Graphite and synthetic tests)
• Correlate monitoring sources
• Cut down on costs for commercial monitoring tools (saving $10,000/month)
Anny He has spoken at meetups on web development topics. Her latest talk: https://www.meetup.com/CSS-Brigade-Vancouver/events/181732852/?eventId=181732852&chapter_analytics_code=UA-41505020-1
Furthermore, she is an avid speaker at Toastmasters, an organization where attendees go to improve their public speaking and communication skills.
Anny He was bitten by the web development bug in 2014. She graduated with a Computing Science degree from Simon Fraser University in Vancouver in 2016. She currently works at Salesforce and has been on the Refocus core team since 2015. She also speaks at Toastmasters and meetups; her latest talk: https://www.meetup.com/CSS-Brigade-Vancouver/events/181732852/?eventId=181732852&chapter_analytics_code=UA-41505020-1