The APM Blindspot
Why Traditional Monitoring Tools Are Insufficient - And How To Address The Problem
By Mehdi Daoudi, CEO and co-founder, Catchpoint
In the early days of the Internet age, performance monitoring was a simple pursuit: you had a few systems to track, mostly within your own datacenter; a few third-party applications or social tags in place; and your customers, just getting acquainted with broadband, were OK when load times occasionally lagged. Oh, the good ol' days! That level of simplicity is now ancient history.
Fast-forward to today, and application performance monitoring (APM) is transforming by necessity, driven by increasingly impatient consumers and business users who expect Google-like speeds. Despite this, many APM systems and approaches remain traditional - correlating website and application performance levels to internal datacenter assets to identify what is causing an online service to be slow or unavailable. Furthermore, these systems lack vital contextual information, such as how a poorly performing service is affecting users and business metrics like conversions and revenue.
Today's websites and Internet-based services are far more complex, incorporating many external third-party elements and infrastructures beyond the firewall. This loss of control is a core reason APM is transitioning into what Gartner calls digital experience management (DEM), a practice where the user experience is the ultimate metric, and where organizations have no choice but to gain visibility into all externally-sourced elements.
We realized this transition was necessary based on the needs of our customers, whether they are Fortune 100 or startups. So we expanded the Catchpoint monitoring platform to address the following:
The Third Party Explosion: Third-party services make life easy for organizations wanting specialized functions without having to build from scratch. These include social network plug-ins, photo and video services, ad servers, and analytics or marketing tags. Of course, we also all depend to some degree on third-party infrastructure like CDNs, the cloud, and DNS providers. Just ask the hundreds of websites and businesses directly affected by the Amazon S3 outage on February 28. The lesson here is clear: you must monitor these external elements as closely as your home host.
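Monitoring external dependencies as closely as your own hosts can start very simply: probe each third-party endpoint, time the response, and classify it. The sketch below is a minimal illustration of that idea - the URLs and the 1,000 ms slowness threshold are hypothetical, not Catchpoint configuration, and a production system would probe from many geographies rather than one machine.

```python
import time
from urllib.request import urlopen
from urllib.error import URLError

SLOW_MS = 1000.0  # illustrative threshold for flagging a slow dependency


def probe(url, fetch=urlopen, timeout=5.0):
    """Time a single fetch of `url` and classify it as ok, slow, or down.

    `fetch` is injectable so the check can be exercised without a live
    network connection.
    """
    start = time.monotonic()
    try:
        with fetch(url, timeout=timeout):
            pass
    except Exception:
        return {"url": url, "status": "down", "elapsed_ms": None}
    elapsed_ms = (time.monotonic() - start) * 1000.0
    status = "slow" if elapsed_ms > SLOW_MS else "ok"
    return {"url": url, "status": status, "elapsed_ms": elapsed_ms}


# Hypothetical third-party dependencies: a CDN asset, an analytics tag, etc.
THIRD_PARTIES = [
    "https://cdn.example.com/lib.js",
    "https://tags.example.com/analytics.js",
]
```

Running `probe` over `THIRD_PARTIES` on a schedule, and alerting on any `slow` or `down` result, gives early warning when an external provider - not your own datacenter - is the problem.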
The Holistic View: Once you expand your view to include both internal and external elements, you'll find yourself awash in data. Which of these elements are relevant? How do they interact with each other? And what's the direct impact on your customers? Ultimately you want the ability to precisely correlate all of them with the customer experience. The key is synthesizing information on these elements in a single dashboard-like view, providing IT and DevOps teams with a quick summary and contextual information, without the need to toggle between multiple screens.
Even when the offending element is outside the firewall - for example, a slow regional ISP or a poorly performing cloud service - this visibility enhances your level of control. Time wasted in war rooms can be avoided, SLAs can be enforced, and contingency plans such as decommissioning a third-party service can be initiated quickly.
Speeding MTTR: According to the latest Ponemon Institute figures, the average cost of IT downtime is about $9,000 per minute - and can go much higher. This places a premium on fast mean time to repair (MTTR), which consists of four phases: detecting an issue, identifying the source, fixing it, and verifying the fix has worked. For many organizations, the second phase - identification - is the longest, most inefficient, and most ripe for improvement.
Because MTTR also depends heavily on time to detect (TTD), companies are investing in automated monitoring services that speed detection. The sooner you detect a problem, the sooner you can engage the appropriate people to deliver the appropriate fix.
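The four phases above add up to MTTR, so it is worth knowing which phase dominates for your team. A minimal sketch, with entirely made-up durations for a single hypothetical incident:

```python
# Illustrative MTTR breakdown for one hypothetical incident.
# Durations are in minutes and are invented for the example.
COST_PER_MINUTE = 9_000  # Ponemon Institute average cited above

incident = {
    "detect": 18,    # time to detect (TTD)
    "identify": 45,  # time to identify the source
    "fix": 12,       # time to apply the fix
    "verify": 10,    # time to verify the fix worked
}

mttr = sum(incident.values())              # total minutes of the outage
cost = mttr * COST_PER_MINUTE              # estimated downtime cost
longest = max(incident, key=incident.get)  # the phase to optimize first
```

With these example numbers, identification dominates - consistent with the observation above that it is usually the phase most ripe for improvement - and every minute shaved off detection or identification comes straight off the downtime bill.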
As we've seen, the increased complexity of today's IT infrastructure has resulted in massive amounts of performance data being generated by both internal and third-party systems. According to Gartner, 70 percent of performance problems are "discovered" via calls from users (ouch). This is why advanced analytics are needed to quickly identify root causes lurking below the surface.
Advanced analytics make it easier and faster to identify root causes, while reducing the number of false positives and false negatives. Predictive capabilities are also needed, allowing teams to see the likely impact of a third-party issue on user experience and even on business metrics like revenue. This helps companies stay ahead of potential problems, ideally before users are impacted.
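One way analytics reduce false positives is by comparing each new measurement against a recent baseline rather than a fixed threshold. A minimal sketch of that idea, using a z-score over recent response times (the 3-sigma threshold is an illustrative default, not a recommendation):

```python
import statistics


def is_anomaly(history, sample, z_threshold=3.0):
    """Flag `sample` only if it deviates more than `z_threshold` standard
    deviations from the recent baseline in `history`.

    Unlike a fixed threshold, this adapts to each metric's normal noise
    level, which damps one-off blips that would otherwise page someone.
    """
    if len(history) < 2:
        return False  # not enough baseline data yet
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return sample != mean
    return abs(sample - mean) / stdev > z_threshold
```

For example, against a baseline of response times hovering around 100 ms, a 103 ms reading stays quiet while a 110 ms reading is flagged - the same 10 ms jump that a naive fixed threshold might miss entirely or fire on constantly.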
Artificial or Guided Intelligence?: AI is to technology today what the cloud was a decade ago. It's hot, and beyond capturing our imagination, it's starting to provide practical benefit in many arenas. Still, we came to realize that IT issue resolution isn't a completely hands-off process in performance monitoring or digital experience monitoring and, in fact, may never be.
The real value is in empowering humans to take appropriate action, a process we call Guided Intelligence: the intelligence provides insight into what's going on, and the human acts on it. To this end, we just announced a new addition to the Catchpoint platform, Guided Intelligence, which automates data collection and analysis, yielding actionable insights for IT and DevOps teams. There are certain things a machine can never learn or be taught to do - such as negotiating with a third-party service on needed performance improvements or issuing an earnest apology to impacted customers.
Guided Intelligence includes several new features:
- Smartboard, which automatically highlights issues impacting user performance in a single, interactive view, providing IT and DevOps teams with a quick summary and contextual information, without the need to sift through multiple analyses;
- Outage Analyzer, which identifies regional performance problems for third-party services and infrastructures in all relevant geographies using predictive models from real-user data; and
- User Engagement Estimator, which creates "what if" scenarios to understand the impact of potential changes on performance and business metrics such as revenue (for example, switching from one public cloud service provider to another offering a higher performance guarantee).
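The shape of such a "what if" calculation can be sketched in a few lines. Everything here is a hypothetical illustration, not the Catchpoint model: the 1% conversion lift per 100 ms of latency improvement is an assumed sensitivity figure, and the latencies and revenue are invented.

```python
def revenue_delta(baseline_ms, candidate_ms, monthly_revenue,
                  lift_per_100ms=0.01):
    """Estimate the monthly revenue change from a latency change.

    `lift_per_100ms` is an ASSUMED sensitivity (1% revenue per 100 ms
    of improvement) used purely for illustration; a real estimator
    would derive it from the site's own real-user data.
    """
    improvement_ms = baseline_ms - candidate_ms
    return monthly_revenue * lift_per_100ms * (improvement_ms / 100.0)


# e.g. switching from a 900 ms cloud provider to a 700 ms one,
# on a site doing $1M/month
delta = revenue_delta(900, 700, monthly_revenue=1_000_000)
```

The same function also answers the downside question: if the candidate provider is slower, the delta goes negative, quantifying the revenue at risk before any switch is made.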
Conclusion: The complexity of customer-facing IT continues to make our lives more challenging, with the overgrowth of performance-impacting elements a primary cause. Thinking you are responsible only for what resides within your datacenter is a recipe for failure. Unless you are constantly monitoring external third-party services and systems, you are running blind. As APM evolves into DEM, the practices described here will be critical to satisfying the ultimate metric - the customer's experience.