Partner content

New performance evaluation method helps realize computing power for hyperscale cloud clusters

By Contributor
Tue 27 Apr 2021

Article by Intel general manager of data center management solutions Jeff Klaus.

Today’s cloud-based internet services are commonly hosted in hyperscale data centers with a large fleet of computers. A hardware upgrade, a software update, or even a system configuration change can be costly in these types of environments. 

For this reason, the potential performance impact needs to be thoroughly evaluated so that IT staff can decide whether to accept or reject the change. Understanding the evaluation results, including the root cause of any performance change, is important for further system optimization and customization.

Evaluating performance at the scale of a cluster has long been a difficult challenge. Traditional benchmark-based performance analysis conducts load testing in an isolated environment. However, this type of analysis cannot represent the behavior of a variety of co-located workloads with varying intensities in the field.

Alibaba, a global leader in cloud computing and artificial intelligence, providing services to thousands of enterprises, developers, and government organizations in more than 200 countries, developed the System Performance Estimation, Evaluation and Decision (SPEED) platform to address this very challenge. 

SPEED is a data center performance analysis platform that handles a mix of businesses in production, including eCommerce, Big Data, and colocation. It has been used for years to safeguard software and hardware upgrades at scale and to maintain cluster resource stability. It also provides the data-driven foundation for critical decision-making involving technology evaluation and capacity planning.

Intel Platform Resource Manager (Intel PRM) is a suite of software packages that helps monitor and analyze performance in a large-scale cluster. The tool supports the collection of system performance data and platform performance counters at the core level, the container level, and the virtual machine (VM) level.

The suite contains an Agent (eris agent) to monitor and control platform resources (CPU cycles, last-level cache, memory bandwidth, etc.) on each node, and an Analysis tool (analyze tool) to build a model for detecting platform resource contention. Some derived metrics depend on certain unique performance counters in Intel Xeon Scalable processors.

A key metric of SPEED, resource usage effectiveness (RUE), measures the resource consumed for each piece of work done, for example, a transaction completed in an eCommerce workload. Because the RUE metric is monitored for each workload instance, the resource consumed is preferably measured from system metrics through a low-overhead mechanism.
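As a rough illustration of the definition above, RUE can be computed as resource consumed divided by work done. The sketch below is not taken from SPEED or Intel PRM: the `Sample` type, the choice of CPU-seconds as the resource, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring window for a single workload instance (hypothetical type)."""
    cpu_seconds: float  # resource consumed in the window (assumed: CPU time)
    transactions: int   # units of work completed in the same window


def rue(sample: Sample) -> float:
    """Resource consumed per unit of work done; a lower value is better."""
    if sample.transactions <= 0:
        raise ValueError("RUE is undefined for a window with no completed work")
    return sample.cpu_seconds / sample.transactions


# Illustrative values only, not measurements from the article.
samples = [Sample(12.0, 400), Sample(9.0, 300), Sample(15.0, 600)]
rue_values = [rue(s) for s in samples]
```

Any other resource named in the article, such as memory bandwidth, could stand in for CPU time in the numerator, as long as the same resource is used before and after a change.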

As Jianmei Guo, an Alibaba staff engineer and the leader of the SPEED platform, explained, “RUE requires reliable measurements for resource utilization, and as a server provider, Intel brings in tools and methodology to work with Alibaba on the hyperscale cluster performance evaluation. Intel and Alibaba worked together on strengthening the data collection and analysis in SPEED.” 

Solving the challenges of performance evaluation 

We start by sampling RUE per instance in the cluster, then aggregate the samples into a workload-level metric after removing the bias from varying workload intensities, and eventually derive the performance results.

To better understand the performance speed-up, we speculate on the potential cause of the performance change and select auxiliary metrics derived from the platform performance counters or the system statistics. Further analysis of the auxiliary metrics strengthens the empirical justification of the performance change, in addition to the primary analysis of RUE.

In traditional offline performance analysis, a small-scale benchmark workload is often stressed to peak or to a certain fixed intensity by a load generator in an isolated environment. This type of benchmark is simple, reproducible, and useful for offline performance profiling.

However, in a cloud computing environment with a large number of workloads co-located together, the workload behaviors are quite diverse due to the varying intensities and possible micro-architectural interferences between workloads, making this load-testing approach less practical.

To address these shortcomings, a new performance evaluation method follows this step-by-step procedure:

1. Propose a system change for performance improvement and speculate why and how the system change impacts the performance from a certain perspective.

2. Monitor multiple instances of important workloads in the cluster for workload throughput and resource utilization before and after applying the system change, and calculate the baseline average RUE (primary metric) and the new average RUE for each job. During the calculation, the bias of RUE with respect to the varying workload intensity is identified and eliminated through a regression approach.

3. Calculate the job-level performance speed-up using the baseline average RUE and the new average RUE of the job. Where each high-importance job is assigned an importance weight, a cluster-level speed-up can be aggregated as a weighted average across the jobs.

4. Based on the speculation in Step 1, define the relevant auxiliary metrics that can be used as the evidence to strengthen the results of speed-up in Step 3. Monitor and calculate the auxiliary metrics for a sampled set of instances before and after applying the system change.

5. Calculate the job-level metric change using the baseline auxiliary metrics and the new auxiliary metrics. If the metric change complies with the speculation, the performance speed-up is regarded as strengthened from the specific perspective speculated in Step 1.
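Steps 2 and 3 can be sketched in a few lines. The article does not spell out the regression model or the speed-up formula, so everything here is an assumption for illustration: a straight-line fit of RUE against workload intensity, a comparison of the two fits at a common reference intensity, a speed-up defined as baseline RUE divided by new RUE, and made-up job names, weights, and sample values.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def rue_at(ref_intensity, samples):
    """Estimate RUE at a reference intensity from (intensity, rue) samples.

    Comparing both fits at the same intensity is one simple way to remove
    the bias caused by different load levels before and after the change.
    """
    a, b = fit_line([s[0] for s in samples], [s[1] for s in samples])
    return a + b * ref_intensity


def job_speedup(baseline, changed, ref_intensity):
    """Assumed definition: baseline RUE / new RUE, so a value > 1 is a gain."""
    return rue_at(ref_intensity, baseline) / rue_at(ref_intensity, changed)


def cluster_speedup(speedups, weights):
    """Importance-weighted average of the job-level speed-ups."""
    total = sum(weights[job] for job in speedups)
    return sum(speedups[job] * weights[job] for job in speedups) / total


# Hypothetical (intensity, RUE) samples per job, before and after a change.
baseline = {"web": [(100, 0.050), (200, 0.040), (300, 0.030)],
            "batch": [(10, 2.0), (20, 2.0)]}
changed = {"web": [(150, 0.036), (250, 0.028), (350, 0.020)],
           "batch": [(10, 2.0), (30, 2.0)]}
ref = {"web": 200, "batch": 15}   # reference intensity chosen per job
weights = {"web": 3, "batch": 1}  # hypothetical importance weights

speedups = {job: job_speedup(baseline[job], changed[job], ref[job])
            for job in baseline}
overall = cluster_speedup(speedups, weights)
```

The same before-and-after comparison applies to the auxiliary metrics in Steps 4 and 5: replace the RUE values with the chosen metric and check whether the direction of the change matches the speculation from Step 1.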

Intel recently released a white paper featuring two case studies on this new performance evaluation method. The first case study evaluates the performance improvement from a change to the core pinning setting. Before the change, two containers running instances of a job are not pinned to the processors. Instead, their CPU resource usage is simply limited by the CPU quota assignment.

At run time, the Linux CPU scheduler moves the worker threads of a workload across the processor boundary. With the current NUMA architecture and the default first-touch NUMA policy, such movement introduces higher latency in memory access.

The change applied is to pin the two containers to the two processors, respectively. After applying the change, we speculate that workload performance will improve, owing to a micro-architectural gain from reducing the amount of remote memory access.
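To make the idea of pinning concrete, here is a minimal Linux-only sketch that restricts a process to the CPUs of one NUMA node. It is not the mechanism used in the case study, which pins containers and whose tooling the article does not describe. The helper names are hypothetical, and the sketch assumes that one NUMA node corresponds to one processor, which is common but not guaranteed.

```python
import os
from typing import Set


def parse_cpulist(text: str) -> Set[int]:
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_of_numa_node(node: int) -> Set[int]:
    """Read the CPUs that belong to a NUMA node from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())


def pin_to_numa_node(pid: int, node: int) -> None:
    """Restrict a process to the CPUs of one NUMA node (pid 0 means the caller)."""
    os.sched_setaffinity(pid, cpus_of_numa_node(node))


# Parsing works anywhere; the pinning call itself needs Linux and permissions.
example = parse_cpulist("0-3,8-11")
```

Keeping a workload's threads on one node means that memory placed there under the first-touch policy stays local to the threads that use it, which is the effect the speculation above relies on.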

The second case study evaluates the performance impact of replacing Pouch, an open source container engine created by Alibaba, with the Kata container in a workload colocation context. A container created from Pouch is a standard Linux container from the performance perspective.

Kata is a secure container based on a lightweight VM mechanism that offers a better security context. After replacing Pouch with Kata, we speculated that performance may be impacted by virtualization overheads, such as the translation of system calls and an additional layer of page table translation in the guest OS.

As Alibaba’s Jianmei Guo said, “The reason we concern ourselves with performance in Alibaba is that we aim at cost reduction while continuously improving performance through constant technological innovation. 

“We’re happy to work with Intel to support the performance monitoring and analysis challenge in Alibaba SPEED with Intel PRM.”
