Fire drills and the all-hands-on-deck calls that follow are nothing new for IT operations teams. I remember working on a site reliability engineering (SRE) team for a large, global enterprise in the shipping industry back in the late 1990s. Early in my tenure there, an issue occurred that affected the public-facing site. Initially, we thought an app issue was responsible for the outage.
In response, a tech lead pulled together an all-hands-on-deck call, with at least 30 people attending. What we didn’t know initially was that, at the same time, the network operations center (NOC) team was holding the same type of meeting, with at least 15 people on a call investigating a network issue.
While there was supposed to be better coordination, there was a clear communication breakdown. Many on the app team spent significant cycles on triage, fruitlessly looking for the cause of an app issue. Only after much time had been wasted was the team notified that the site downtime had been caused by a network issue.
I remember thinking to myself, how will we ever get better when these kinds of disconnects can happen?
The Power of Analyzing Metrics from Multiple Domains
Not too long after that, a tech leader began holding training exercises that made a big difference. Here’s how it worked: He’d notify all teams about a fictitious business issue, such as the online tracking and fulfillment site being down, with 5,000 customers having lodged complaints because they weren’t able to complete their transactions.
He had everyone join a chat room (which was rare in those days). Teams across each discipline had to gather metrics they were seeing from their respective monitoring tools, and post those metrics to the chat room.
The SRE team then looked at the metrics being supplied and began to analyze them. While this was a rudimentary, manual process, and took a lot of coordination, it was invaluable. In effect, we were able to establish service-level observability through this process.
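To make the idea concrete, here is a minimal sketch of what that manual pooling of metrics amounts to. It is not what we ran back then; the team names, metric names, values, and the 50% threshold are all hypothetical, chosen only to illustrate how comparing each domain's readings against its own baseline can point to the domain that deserves attention first.

```python
from dataclasses import dataclass


@dataclass
class MetricReport:
    team: str        # hypothetical domain name, e.g. "network" or "app"
    metric: str      # hypothetical metric name
    baseline: float  # typical value under normal conditions
    observed: float  # value reported during the incident


def deviation(report: MetricReport) -> float:
    """Relative change from baseline (0.5 means 50% away from baseline)."""
    if report.baseline == 0:
        return float("inf") if report.observed else 0.0
    return abs(report.observed - report.baseline) / abs(report.baseline)


def rank_suspects(reports, threshold=0.5):
    """Return (team, worst deviation) pairs at or above threshold, worst first."""
    worst = {}
    for r in reports:
        worst[r.team] = max(worst.get(r.team, 0.0), deviation(r))
    flagged = [(team, d) for team, d in worst.items() if d >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Illustrative values only: each team posts what its own tool shows.
reports = [
    MetricReport("app", "p95_latency_ms", 200.0, 230.0),
    MetricReport("network", "packet_loss_pct", 0.1, 4.0),
    MetricReport("database", "query_time_ms", 50.0, 55.0),
]
suspects = rank_suspects(reports)
print(suspects)
```

With these sample numbers, only the network domain crosses the threshold, which is the kind of signal that would have saved the app team hours of triage in the outage described earlier.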
The Need for the App-Aware NOC
Since that time, technology environments have continued to get more complex, interrelated, and dynamic. So the question is, how can you establish observability in these modern environments? How can you ensure the right teams get engaged right away, whether the issue is in a cloud provider’s network, a hybrid worker’s home Wi-Fi, an enterprise application, or any other area?
The scenario I encountered back in the 1990s is still all too familiar for too many IT ops teams. This type of incident underscores the criticality of having effective network and app intelligence, and gaining the ability to correlate that data.
In recent years, I’ve moved from the enterprise IT ops side to working with Broadcom, focusing on the company’s observability solutions. In this role, I’ve had the opportunity to work with tech leaders at some of the largest enterprises and government agencies. In the process, I’ve seen a trend emerging: Many teams are moving to make their NOCs app-aware. In the following sections, I’ll offer some insights for teams looking to make this transition.
How to Staff an App-Aware NOC
For pretty much every enterprise tech leader I speak with, staffing and resources are tight. Therefore, it is vital to ensure you have the right people for the roles that the app-aware NOC requires. For example, while NOC teams have historically been composed of network engineers, those individuals need to be supplemented by people with the app expertise needed to intelligently assess app monitoring data.
Tools You Need for an App-Aware NOC
Within the app-aware NOC, teams need to right-size their tools and leverage integration and correlation. This is vital if teams are going to actually be able to use what’s being gathered, rather than being overwhelmed by massive data volumes.
NOC teams should look to build upon traditional tools like event viewers, fault management, and performance management tools. Those tools need to be augmented with intelligent network management and monitoring tools that offer app-level visibility. This is a key reason why teams are increasingly leveraging experience-driven NetOps solutions. These solutions stand apart because they deliver these key capabilities:
- Unified environment coverage and end-to-end visibility. Today, NetOps teams need capabilities for uniformly monitoring all the networks critical services rely upon, regardless of whether networks are running in the cloud or on-premises, or on legacy or modern and software-defined technologies. With experience-driven NetOps solutions, teams can establish complete, end-to-end, hop-by-hop visibility, from user to cloud to data center and all points in between. With this visibility, teams can do fast troubleshooting, spot and prevent issues, and optimize the user experience.
- Unified active and passive monitoring. To establish cohesive visibility, teams need active, synthetic capabilities that are integrated with passive monitoring. With experience-driven NetOps solutions, teams can leverage active and passive monitoring of network delivery. This enables teams to track network connections and users’ experiences—including for customers, users at corporate offices, and hybrid workers.
- Rich data and analytics. Experience-driven NetOps solutions enable teams to correlate end-user experience metrics with alarms, faults, performance data, flows, configurations, logs, and more. These solutions also employ AI to improve noise reduction.
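As a rough illustration of the correlation idea in the last bullet, the sketch below pairs each user-experience degradation with the alarms raised close to it in time. It does not represent how any particular product works; the record fields, device names, and five-minute window are assumptions made up for the example.

```python
from datetime import datetime, timedelta


def correlate(degradations, alarms, window=timedelta(minutes=5)):
    """Map each degraded service to the alarm sources raised within `window` of it."""
    result = {}
    for d in degradations:
        related = [
            a["source"] for a in alarms
            if abs(a["time"] - d["time"]) <= window
        ]
        result[d["service"]] = related
    return result


# Hypothetical sample data: one experience degradation and two alarms.
t0 = datetime(2024, 1, 1, 9, 0)
degradations = [{"service": "tracking-site", "time": t0}]
alarms = [
    {"source": "edge-router-1", "time": t0 - timedelta(minutes=2)},
    {"source": "app-server-3", "time": t0 - timedelta(hours=3)},
]
matches = correlate(degradations, alarms)
print(matches)
```

Time proximity is the simplest possible correlation key. A real deployment would likely also use topology and service dependencies, but even this crude join filters out the three-hour-old alarm and leaves the one worth investigating.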
Experience-driven NetOps capabilities are vital today—and I sure wish I had them back in the 1990s. These solutions represent a core foundation for building app- and business-service-aware NOCs. With these capabilities, NetOps teams can make significant improvements in user experiences and operational efficiency.
Joining Broadcom in 2021 has given me the opportunity to help design and implement a new observability model that incorporates many of the lessons I have learned over more than 30 years in IT operations. Based on this experience, we’ve been able to distill key principles into a simple methodology for establishing an optimized, app-aware NOC. To learn more about this methodology, visit our Contact Us page and get in touch.