Scaling AI: Making machine-learning models more effective and efficient
Mon, 29th Nov 2021

High-quality, clean and appropriately labelled data is undeniably crucial in today's world.

Companies are increasingly dependent on the ability of AI and machine learning (ML) models to provide real-time insights that drive business and customer engagement outcomes.

With an exponential increase in data, AI and ML algorithms are integral to leveraging this data effectively. This is key to enabling everything from self-driving cars and cashier-less shopping services to cancer detection.

In the telecom world specifically, AI and ML are being used across a range of use cases that enhance customers' experience of solutions and services. This includes speech recognition and voice-activated commands, which have become almost must-have smart features in today's fast-paced world.

As that reliance grows, the quality of the data, and of data models that minimise unconscious bias from human data labellers, becomes even more important. With customer behaviour and genome analysis increasingly used for customer mapping, telecoms can confidently hyper-personalise offerings only when the data has been effectively cleansed.

As the data is crucial, the design and testing process, which includes data cleansing and labelling, must be extensive to minimise bias in data. The industry is awash with new and dedicated data labellers, such as San Francisco-based start-ups Scale AI and Sama.

Google and Amazon also complete gargantuan manual labelling tasks, especially in the legal and healthcare industries, but often charge businesses a particularly high fee.

Across all these data labelling services, there is no guarantee that the output will be comprehensive, unbiased, or free from noise, which adds a risk of flawed outcomes and inefficiencies. The length of time required to successfully clean and label data is often too long for agile companies.

At Infosys, we understand that 25-60% of ML project costs come from manual labelling and validation of data. Expenditure on these tasks seems to be increasing, with little guarantee of quality. AI consultancy Cognilytica estimates enterprises will collectively spend US$4.1 billion on data labelling by 2024.

So, what's a faster and more effective way to reduce bias and deliver clean data for hungry ML algorithms?

An approach that combines intelligent learners and programmatic data creation is required. By allowing AI to do the heavy lifting of deskilled data labelling, overall bias can be reduced and both efficiency and effectiveness boosted. Here are some of the ways this transformation can take place:

Active Learning 

During the active learning process, an intelligent learner examines unlabelled data and picks parts of it for further human labelling. A classifier controls which data is selected, directing human effort towards the areas the model has not yet learned well. This makes the labelling process active rather than passive and, in turn, increases data quality.
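To make the mechanism concrete, the sketch below shows one common flavour of active learning, uncertainty sampling, in Python with scikit-learn. The synthetic dataset, batch size and number of rounds are illustrative assumptions rather than details of any particular project.

```python
# Uncertainty-sampling sketch: a classifier trained on a small labelled seed set
# picks the unlabelled examples it is least sure about, and only those are sent
# to human labellers. Dataset, batch size and round count are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

labelled = np.arange(50)                    # small human-labelled seed set
unlabelled = np.arange(50, len(X))          # pool still awaiting labels

for _ in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])

    # Uncertainty: how close the predicted probability sits to 0.5
    proba = clf.predict_proba(X[unlabelled])[:, 1]
    uncertainty = 1 - np.abs(proba - 0.5) * 2

    # Send only the 20 most ambiguous examples for human labelling this round
    query = unlabelled[np.argsort(-uncertainty)[:20]]
    labelled = np.concatenate([labelled, query])   # y[query] stands in for the human labels
    unlabelled = np.setdiff1d(unlabelled, query)

print(f"Labelled {len(labelled)} of {len(X)} examples after 5 rounds")
```

Because the learner asks only about the examples it finds most ambiguous, far fewer human labels are needed to reach a given level of accuracy.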

Active learning was recently used in the legal industry to label contractual clauses. Through the process, data accuracy increased from 66% to 80%, even when using fewer data points, while the cost and time involved were also significantly lower.

In a situation where an AI-based decision appears biased, it is also easier to interrogate the model and find the reason why. A Netflix recommendation, for example, is based on a set of rules driven by user data. If those rules appear to be producing biased results, the machine learning model can, while complicated, be investigated to find out why and corrected to remove the perceived bias.

Distant supervision 

Using distant or weak supervision to programmatically create datasets is the best way to use AI at scale. In both approaches, labelling functions are programmed to create labels from input datasets, which means noisy signals can be combined and conflicting labels resolved without any reference to a "ground truth".
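As a rough sketch of the labelling-function idea, the snippet below encodes a few hand-written rules that each vote on a label, with conflicts resolved by simple majority; frameworks built for weak supervision, such as Snorkel, replace the majority vote with a learned label model. The rules and example texts are invented purely for illustration.

```python
# Programmatic labelling sketch: each labelling function encodes one noisy rule,
# and a simple majority vote resolves conflicts between them. The rules and the
# example emails below are hypothetical.
from collections import Counter

SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_mentions_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_mentions_urgent(text):
    return SPAM if "urgent" in text.lower() else ABSTAIN

def lf_is_reply(text):
    return HAM if text.startswith("Re:") else ABSTAIN

LABELLING_FUNCTIONS = [lf_mentions_free, lf_mentions_urgent, lf_is_reply]

def weak_label(text):
    votes = [lf(text) for lf in LABELLING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN                      # no rule fired: leave unlabelled
    return Counter(votes).most_common(1)[0][0]

emails = ["Re: project update", "URGENT: claim your free prize", "free trial ends soon"]
print([weak_label(e) for e in emails])      # -> [0, 1, 1]
```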

Distant supervision produces training data using distant knowledge bases. By looking across multiple data sources and databases, distant supervision can map labels onto training data for machine learning models.

The process can reach 98% accuracy, but there may still be noise in the labels, depending on the type and number of knowledge bases available for the training data. One challenge with this model is that finding distant knowledge bases can be difficult, and ML engineers need domain experts to help them uncover the appropriate information.
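To illustrate where that noise comes from, the toy sketch below projects relation pairs from a small, hypothetical knowledge base onto raw sentences: any sentence mentioning both entities inherits the knowledge base's label, even when the sentence does not actually express that relation.

```python
# Distant-supervision sketch: relation pairs from a knowledge base are projected
# onto raw text, and any sentence mentioning both entities inherits the relation
# as its label. The tiny knowledge base and sentences are hypothetical.
KNOWLEDGE_BASE = {
    ("Ada Lovelace", "London"): "born_in",
    ("Alan Turing", "London"): "born_in",
}

sentences = [
    "Ada Lovelace was born in London in 1815.",
    "Ada Lovelace corresponded with Charles Babbage.",
    "Alan Turing gave a lecture in London.",   # mentions both entities, but not the relation: label noise
]

def distant_labels(sentences, kb):
    labelled = []
    for sentence in sentences:
        for (head, tail), relation in kb.items():
            if head in sentence and tail in sentence:
                labelled.append((sentence, head, tail, relation))
    return labelled

for example in distant_labels(sentences, KNOWLEDGE_BASE):
    print(example)
```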

When data needs to be sourced from unreliable sources, it is best to use weak supervision.

Synthetic data generation 

When data and labelling functions don't yet exist, there's an option to make up the data.

Amazon took this approach at its Go Stores, small convenience stores where no check-out is required. Amazon created virtual shoppers using graphics software, which in turn trained computer vision algorithms to recognise what real-world shoppers take off the shelf.

NASA's Perseverance mission to Mars also relied on synthetic data generation, with the Martian landscape recreated synthetically.

Like the virtual shoppers, synthetic data has the same representative characteristics as the real-world data from which it is derived. The data must be exposed to diverse use cases and outliers to reduce uncertainty and ensure it is fair, safe, reliable and inclusive.

This can be seen in churn prediction, which analyses relevant data to identify factors indicating that a given customer is a flight risk. If you know which customers are about to cancel their subscription or terminate their contract, you can take proactive measures to prevent them from leaving. Synthetic data allows such a model to be built without generating data through outbound calls, which can annoy customers who may already have been contacted about other services by the same provider.
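As an illustrative sketch only, the snippet below generates synthetic churn records from assumed distributions and trains a model on them; the features, rates and the hidden churn rule are invented and stand in for whatever a real provider would derive from its own data.

```python
# Synthetic-data sketch for churn prediction: records are sampled from assumed
# distributions, churn is driven by a simple hidden rule, and a model is trained
# without touching real customer interactions. All numbers are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5000

tenure_months = rng.integers(1, 72, n)
monthly_spend = rng.normal(60, 20, n).clip(10, 200)
support_calls = rng.poisson(1.5, n)

# Hidden rule standing in for real behaviour: short tenure plus many support
# calls pushes the probability of churn up.
logit = -1.5 - 0.04 * tenure_months + 0.8 * support_calls
churned = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([tenure_months, monthly_spend, support_calls])
X_train, X_test, y_train, y_test = train_test_split(X, churned, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(f"Hold-out accuracy on synthetic churn data: {model.score(X_test, y_test):.2f}")
```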

AI projects require quality labelling of data in a timely manner. At the moment, about one-quarter of the time devoted to a machine learning task is spent labelling – well above the 3% of time devoted to developing algorithms.

As large corporations seek to scale AI into every part of their business, they will likely struggle with the trade-off between making the process effective and making it efficient. But active learning, distant supervision and synthetic data generation can do the heavy lifting: significantly reducing the cost and increasing the efficiency of deskilled data labelling, while also improving the quality needed to build powerful AI models into the future.

For more information on Infosys, visit: https://www.infosys.com/australia/