Alibaba has secured first place in the latest global Visual Question Answering (VQA) leaderboard, performing better than humans in the same context.
This is the first time a machine has outperformed humans at understanding images in order to answer text questions: Alibaba's algorithm recorded an 81.26% accuracy rate in answering questions about images, compared with the human benchmark of 80.83%.
The challenge has been held annually since 2015 at the Conference on Computer Vision and Pattern Recognition (CVPR). It attracts global players, including Facebook, Microsoft, and Stanford University. Each evaluation item pairs an image with a related natural-language question, and participants are asked to provide an accurate natural-language answer. This year, the challenge contained over 250,000 images and 1.1 million questions.
Alibaba says the results were made possible thanks to the algorithm design from Alibaba DAMO Academy, a global research and development initiative of Alibaba Group.
The company says that by leveraging its proprietary technologies, including diverse visual representations, multimodal pre-trained language models, adaptive cross-modal semantic fusion, and alignment technology, the Alibaba team made significant progress in analysing images, understanding the intent of questions, and answering them with sound reasoning in a human-like conversational style.
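Alibaba has not published the details behind these components, but the general idea of cross-modal fusion can be illustrated with a deliberately simplified sketch: the question acts as a query that attends over image-region features, and the fused representation is scored against candidate answers. All vectors, names, and values below are invented for illustration and bear no relation to Alibaba's actual system.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse(question_vec, region_vecs):
    """Attend over image-region features with the question as the query,
    returning the attention-weighted sum (a toy 'fused' representation)."""
    weights = softmax([dot(question_vec, r) for r in region_vecs])
    dim = len(region_vecs[0])
    return [sum(w * r[i] for w, r in zip(weights, region_vecs)) for i in range(dim)]

def answer(question_vec, region_vecs, answer_vecs, answers):
    """Score each candidate answer against the fused representation
    and return the highest-scoring candidate."""
    fused = fuse(question_vec, region_vecs)
    scores = [dot(fused, a) for a in answer_vecs]
    return answers[scores.index(max(scores))]

# Toy 3-dimensional features (entirely made up for illustration).
regions = [[0.9, 0.1, 0.0],   # e.g. a red object region
           [0.0, 0.2, 0.8]]   # e.g. a blue sky region
question = [1.0, 0.0, 0.0]    # a question that leans toward the first feature
candidates = {"ball": [0.8, 0.1, 0.1], "sky": [0.1, 0.1, 0.9]}
print(answer(question, regions, list(candidates.values()), list(candidates.keys())))
# → ball
```

Real VQA systems learn these representations end to end from large image–text corpora; this sketch only shows the attention-and-score shape of the fusion step.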
The VQA technology has already been widely applied across Alibaba's ecosystem. For example, it powers Alibaba's intelligent chatbot Alime Shop Assistant, which serves tens of thousands of merchants on Alibaba's retail platforms.
"We're proud to have achieved another milestone in machine intelligence, which underscores our continuous efforts in driving research and development in related AI fields," says Si Luo, head of natural language processing at Alibaba DAMO Academy.
"This is not implying humans will be replaced by robots one day. Rather, we're confident that smarter machines can be used to assist our daily work and life, and hence, people can focus on the creative tasks they are best at."
Luo says VQA can be used in a wide range of areas, such as searching for products on eCommerce sites, supporting the analysis of medical images for initial disease diagnosis, and smart driving, where an in-car AI assistant can offer basic analysis of photos captured by the vehicle's camera.
This is not the first time Alibaba's machine-learning models have eclipsed others. An Alibaba model also topped the GLUE benchmark rankings, an industry leaderboard widely regarded as a key baseline test for NLP models.
In 2019, Alibaba's model exceeded human scores on the Microsoft Machine Reading Comprehension (MS MARCO) dataset, one of the AI world's most challenging tests of reading comprehension. The model scored 0.54 on the MS MARCO question-answering task, outperforming the human score of 0.539, a benchmark provided by Microsoft. And in 2018, Alibaba also scored higher than the human benchmark on the Stanford Question Answering Dataset (SQuAD), one of the most popular machine reading comprehension challenges worldwide.