
Maintaining uptime in the data center is no game of checkers

13 Feb 2020

Article by Intel Data Center Management Solutions general manager Jeff Klaus.

While popular notions of artificial intelligence (AI) may have once conjured up images of automation such as 2001’s HAL 9000, The Terminator’s Skynet, and Ava of Ex Machina, in reality, AI and its subset, machine learning, had more benign origins.

Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed on where to look for them.

Arthur Samuel, one of the pioneers of machine learning, taught a computer program to play checkers, a skill he could not have programmed explicitly. By 1962, Samuel's machine learning program not only beat him at checkers, but went on to defeat the Connecticut state champion.

AI enters the data center

Today, Gartner estimates that 37% of enterprise organizations are already implementing AI in some form, and associated technologies such as machine learning and deep learning promise to save organizations billions of dollars over the next few decades as financial services, healthcare, oil and gas, and retail companies build data science applications, recommendation engines, large-scale analytics, and other new applications driven by high-performance computing (HPC) environments.

AI and machine learning technologies can also be leveraged to improve the efficiency of IT operations in the data center. In the operation of enterprise and cloud service provider (CSP) data centers, IT equipment is commonly managed and operated in a passive manner.

That is, IT operators can do little to nothing before server, network, and storage equipment failures happen; afterwards, they ask their equipment vendors to repair the devices or take reactive measures of their own, such as bringing up a standby environment or redeploying the business load.

This method can be adequate for managing small-scale server clusters. However, for today's large enterprises and CSPs, which can operate many thousands of servers, IT teams would be under tremendous operational and maintenance pressure if they continued to manage their equipment this way.

Enterprises and CSPs require high reliability in data center operations, and public cloud service providers face especially stringent availability and stability demands. The inherent risks of reactive management could prove damaging to both the balance sheet and the organization's brand reputation.

A 2019 global survey of enterprise organizations by Statista found that for one out of four companies worldwide, the average cost of server downtime was between $301,000 and $400,000 per hour. With these stakes in play, maintaining uptime in the data center is no game of checkers.

Today, many types of IT equipment provide logs that help diagnose and analyze problems. Data can be collected out-of-band or through operating system agents, error patterns in the logs can be learned with machine learning algorithms, and corresponding models can be established to judge and identify abnormal behavior.

Furthermore, analyzing equipment operating status makes fine-grained predictions of equipment health possible. By examining the results or trends before a failure happens, the operator or the software system can take the next action, for example adjusting the load on a server or migrating workloads away from it.
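As a minimal sketch of the idea, the snippet below learns a baseline from a hypothetical history of hourly correctable-error counts for one device and flags a reading that deviates sharply from it. The field values, threshold, and z-score test are illustrative assumptions, not a description of any vendor's actual model.

```python
from statistics import mean, stdev

# Hypothetical hourly correctable-error counts collected out-of-band
# from one DIMM; the numbers and threshold are illustrative only.
baseline = [2, 1, 3, 2, 2, 1, 3, 2]   # normal operating history
latest = 14                            # most recent reading

def is_anomalous(history, reading, z_threshold=3.0):
    """Flag a reading whose z-score against the learned baseline
    exceeds the threshold -- a stand-in for a trained model."""
    mu, sigma = mean(history), stdev(history)
    return (reading - mu) / sigma > z_threshold

if is_anomalous(baseline, latest):
    print("DIMM error rate anomalous -- schedule workload migration")
```

A production system would replace the z-score with a model trained on labeled failure logs, but the control flow is the same: collect, score, then act before the failure occurs.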

Machine learning in service of uptime

Memory failures are one of the top three hardware failures that occur in data centers today. Using machine learning to analyze real-time memory health data would make it possible to predict such failures ahead of time, and this ultimately translates to a better experience for end users of the application.

Intel Memory Failure Prediction (MFP) is an AI-based technology that improves memory reliability by making predictions based on analysis of micro-level memory failure logs. It is well suited to enterprise businesses and CSPs that rely heavily on server hardware reliability, availability, and serviceability. Intel MFP helps significantly reduce memory failure events by analyzing data and predicting catastrophic events before they happen.

Intel MFP uses machine learning to analyze server memory errors down to the Dual Inline Memory Module (DIMM), bank, column, row, and cell levels to generate a memory health score, which can be used to predict potential failures. By analyzing memory errors and predicting potential memory failures before they happen, Intel MFP can help improve DIMM replacement and purchase decisions.

Additionally, Intel MFP allows data center staff to migrate workloads before a catastrophic memory failure happens, use page-offlining policies to isolate unreliable memory cells or pages, or replace failing DIMMs before they reach a terminal stage, reducing downtime by responding appropriately before a server fails.
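The graduated responses above can be sketched as a simple policy keyed to a health score. The 0-100 scale, tier boundaries, and action names below are assumptions for illustration; they do not reflect Intel MFP's actual interface or thresholds.

```python
# Illustrative remediation policy keyed to an assumed 0-100 memory
# health score; tiers and actions are hypothetical, not Intel MFP's.
def remediation_action(health_score: int) -> str:
    if health_score >= 80:
        return "monitor"            # healthy: keep watching
    if health_score >= 50:
        return "offline_pages"      # isolate suspect cells or pages
    if health_score >= 20:
        return "migrate_workloads"  # move load off the host
    return "replace_dimm"           # approaching terminal stage

for score in (95, 60, 30, 10):
    print(score, remediation_action(score))
```

The point of the tiers is that the cheapest intervention (page offlining) is tried while the module is still mostly healthy, and the expensive one (replacement) is reserved for modules near failure.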

Recently, a Beijing-based company, whose online platform and applications connect consumers with local businesses for everything from food delivery and hotel bookings to health and fitness products and services, integrated Intel MFP into its existing data center management solution to monitor the health of its servers' memory modules.

The initial test deployment of Intel MFP indicated that if the company deployed the solution across its full server network, server crashes caused by hardware failures could be reduced by up to 40 percent, delivering a better experience for hundreds of millions of its customers and local vendors.

Read the whitepaper here.
