On New Year’s Eve, most IT Ops pros had never heard of Spectre and Meltdown. Within two days, the latest vulnerabilities became a key IT management priority for 2018. Now IT Ops pros are coming to terms with the range of systems they may need to patch and the potential impact the fixes might have on the performance of business-critical infrastructure.

Unlike most vulnerabilities that have dominated IT, Spectre and Meltdown exploit a weakness in hardware features designed decades ago to improve system performance. The revelation also dispels presumptions that in-memory data and virtual machines can’t leak.

The existence and scope of Spectre and Meltdown surfaced unexpectedly in a report published by The Register on Jan. 2, a week before Project Zero, a team organized by Google, planned to announce the vulnerabilities. After Jann Horn, a 22-year-old cybersecurity researcher at Google, discovered the flaws last summer, the Project Zero team of respected research experts drafted papers for the patches covering the three variants (two for Spectre and one for Meltdown).

In addition to Google, Project Zero coordinated with Amazon Web Services (AWS), Microsoft, IBM, Oracle, Red Hat and SUSE, among other key software players. Likewise, Project Zero worked with the major providers of hardware, including Intel, AMD and ARM, and a swath of systems vendors, as the industry scrambled to patch core systems before the release of the technical documentation authored by Horn and published by Google.  

At the heart of the discovery are side-channel analysis techniques that can enable unauthorized access to secure data by exploiting what’s known as “speculative execution,” a processor capability that enables high performance. To date, there are no known breaches exploiting Spectre and Meltdown, but now that the vulnerabilities are public, experts are advising organizations with sensitive, accessible data to patch their systems.

Despite initial misinformation when news of Spectre and Meltdown surfaced earlier this month, experts say the vulnerabilities are not expected to affect most PC and smartphone users because they are read-only. Hence, an intruder could only read information such as passwords and cryptographic keys but couldn’t spread or execute malware or ransomware. Presuming PC and smartphone users keep their devices patched, the risk is minimal, according to experts.

The headache for IT Ops pros is the potential for Spectre and Meltdown exploits to gather sensitive data from servers, storage, networks (even CDNs), virtual machines and, notably, public cloud providers, which host multitenant infrastructure. Among unpatched systems, the most vulnerable are those not properly secured using encryption and privileged access controls. The problem is that applying the patches can lead to varying degrees of performance degradation because of increased CPU utilization brought on by a variety of factors, including the impact on speculative execution.

IT Ops pros and infosec teams are in uncharted territory on several fronts. Unlike most vulnerabilities, which target weaknesses in software, Spectre and Meltdown exploit areas that were presumed to be safe havens. “Security that was believed to be in place to separate data used by one application from being accessed by another may be compromised,” according to a blog post by Chad Erbe, a professional services architect at BeyondTrust, whose software manages privileged access to systems. “All of this is taking place at the hardware level, but the flaws are at the software level. Essentially, we have a physical security issue in the virtual world.”

Thomas LaRock, head geek at performance management vendor SolarWinds, said the flaws allow any user process to access kernel memory. “So, any host is at the mercy of the guests,” LaRock said. “This applies to VMs and containers. Nothing is safe. That’s different than saying this is high risk, as you still need the bad actors to have access to the systems.”

Making the problem go away entirely would require replacing every potentially vulnerable processor, said Morey Haber, VP of technology at BeyondTrust. Given that every Intel CPU manufactured since 1995 (except Itanium and Atom chips made before 2013) is potentially vulnerable, according to Project Zero’s FAQ created and posted by the Graz University of Technology in Austria, the cost would be untenable. Also, many older systems run on hardware that is no longer manufactured, yet host legacy applications that can’t run on newer systems.

“The only viable short-term mitigation is to patch our operating systems and hypervisors at the lowest level (kernel) to prevent inappropriate memory calls that can leak information from an application or virtual machine,” Haber said. BMC Software executives Sean Berry, a solution evangelist in BMC’s Data Center Automation and Cloud group, and Shawn Jaques, the company’s director of marketing, explained in a blog post how to balance the tradeoff between patching systems and addressing performance.

“The pervasiveness of the vulnerability across servers, devices and operating systems is nearly unprecedented,” they noted. “Since this vulnerability impacts a feature that improves performance, there is a potential significant performance hit from applying the software patches. The real challenge for most organizations is to effectively apply their patching process across a multitude of tools and teams to correct the systems before hackers start to exploit the now-public vulnerabilities.”

SolarWinds’ LaRock agreed, and added the following advice:

  1. Assess your risk. If your server is isolated from intrusion (no browser, etc.), and not sharing the memory of other servers (so, not a guest with others on a host), then you have lower risk.
  2. Assess the importance of the workload and/or data. Not every system is top secret or mission critical.
  3. Gather inventory details: know what chips you have, which OS, what versions of database software, etc.
  4. Build a patching plan, using the above details to help you prioritize.
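
As a rough illustration of how those four steps could feed a patch queue, here is a minimal Python sketch. The `Host` fields, the scoring weights and the sample inventory are all hypothetical, invented for this example; they are not LaRock's or SolarWinds' method, and a real plan would use an organization's own inventory data and risk model.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """One inventory record (step 3). All fields are illustrative."""
    name: str
    exposed: bool      # runs a browser or other untrusted code (step 1)
    shared_host: bool  # guest sharing a physical host with others (step 1)
    criticality: int   # 1 = low, 3 = mission critical (step 2)
    cpu: str
    os: str


def risk_score(host: Host) -> int:
    """Hypothetical scoring: criticality plus 2 points per exposure factor."""
    score = host.criticality
    if host.exposed:
        score += 2
    if host.shared_host:
        score += 2
    return score


def patch_plan(inventory: list[Host]) -> list[Host]:
    """Order hosts for patching, highest score first (step 4)."""
    return sorted(inventory, key=lambda h: (-risk_score(h), h.name))


# Made-up sample inventory, for illustration only.
inventory = [
    Host("build-01", False, False, 1, "Intel Xeon", "Linux"),
    Host("db-01", False, True, 3, "Intel Xeon", "Linux"),
    Host("vdi-07", True, True, 2, "Intel Core", "Windows"),
]

for host in patch_plan(inventory):
    print(f"{host.name}: score {risk_score(host)}")
```

With this sample data, the shared and exposed desktop host sorts ahead of the isolated build server, which matches the spirit of the advice: isolation lowers priority, while shared hosts and important workloads raise it.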

The ones with the most at stake are the cloud providers, LaRock noted, because they have agreed-upon SLAs. LaRock’s colleague, Mike Heffner, a co-founder and lead engineer for Librato, now a SolarWinds company, noted a degradation in the performance of instances running on AWS in the weeks leading up to the disclosure of Spectre and Meltdown. In his original blog post about 10 days ago, Heffner documented the impact based on charts of a Python service worker tier in late December, where CPU utilization rose approximately 25 percent. Likewise, on Jan. 3, he noted that Cassandra tiers saw similar spikes, at one point as high as 45 percent. In an update late last week, Heffner reported a marked reduction in CPU utilization to pre-patch levels.
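
For teams that want to make a similar before-and-after comparison on their own systems, a minimal Python sketch follows. It assumes the percentage is a relative change in mean CPU utilization (the article does not say whether Heffner's figures are relative or absolute percentage points), and the sample values are made up for illustration, not taken from his charts.

```python
from statistics import mean


def relative_increase(before: list[float], after: list[float]) -> float:
    """Percent change in mean CPU utilization between two sample windows."""
    baseline = mean(before)
    if baseline == 0:
        raise ValueError("baseline utilization must be non-zero")
    return (mean(after) - baseline) / baseline * 100.0


# Hypothetical utilization samples (percent busy), before and after patching.
before_patch = [40.0, 42.0, 38.0]
after_patch = [50.0, 52.0, 48.0]

print(f"CPU utilization change: {relative_increase(before_patch, after_patch):.1f}%")
```

Comparing windows of equal length under similar load keeps the result from being skewed by normal traffic swings.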

It’s not surprising that the cloud providers would be on top of their game, considering they must provide the extra capacity or take a financial hit. “The cloud providers have agreed upon SLAs already in place, they will need to find a way to scale,” LaRock said. “If they suddenly had a bump in 30 percent CPU utilization due to customer usage, they would find a way to make it work. And not every server is being impacted with performance issues as far as I know.”

While LaRock believes the problem will be short-lived, others say it’ll remain an issue indefinitely. “Expect this one to linger for a long time,” wrote Forrester Research principal analyst Jeff Pollard. “Thankfully, microcode fixes are available, but those fixes are being distributed by hardware manufacturers. That is a challenge; although enterprise organizations with support contracts can overcome it, for end-user systems it is a nightmare. The development, distribution, and installation of these patches will never end. On systems that don’t get patched, it means that information is at risk.”

The Spectre and Meltdown FAQ published by Project Zero includes a list of, and links to, vulnerability advisories and documentation from more than 30 top software and hardware providers, which include access to the latest patches.
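
Once patches are applied, Linux administrators have one quick way to confirm the result: kernels from 4.15 onward, and many distribution backports, report mitigation status as small text files under `/sys/devices/system/cpu/vulnerabilities`. The Python sketch below simply reads whatever files are present there. The function name is invented for this example, and on systems without that directory (older kernels, other operating systems) it returns an empty result rather than a verdict.

```python
from pathlib import Path

# Standard sysfs location on patched Linux kernels.
SYSFS_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(base: Path = SYSFS_DIR) -> dict[str, str]:
    """Return {vulnerability name: kernel-reported status} for each file found."""
    if not base.is_dir():
        return {}  # directory absent: kernel does not report status this way
    status = {}
    for entry in sorted(base.iterdir()):
        if entry.is_file():
            try:
                status[entry.name] = entry.read_text().strip()
            except OSError:
                status[entry.name] = "unreadable"
    return status


for name, state in read_mitigation_status().items():
    print(f"{name}: {state}")
```

An empty result should not be read as "safe"; it only means the kernel does not expose this interface, in which case the vendor advisories remain the reference.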