MLCommons Association Unveils Open Datasets and Tools to Drive Democratization of Machine Learning

Engineering consortium advances data-centric AI with pioneering datasets and tools

The MLCommons Association, an open engineering consortium dedicated to improving machine learning for everyone, today announced the general availability of the People’s Speech Dataset and the Multilingual Spoken Words Corpus (MSWC). These trailblazing, permissively licensed datasets advance innovation in machine learning research and commercial applications. Also today, the MLCommons Association is issuing a call for participation in the new DataPerf benchmark suite, which measures and encourages innovation in data-centric AI.

The People’s Speech Dataset

The People’s Speech Dataset is among the world’s largest English speech recognition datasets licensed for academic and commercial use. The 30,000-hour supervised conversational dataset is an order of magnitude larger than what was available just a few years ago. Released under a Creative Commons license, the dataset democratizes access to speech technology such as voice assistants and transcription, and unlocks innovation in the machine learning community. Contributors to the dataset include researchers from Baidu, Factored, Harvard University, Intel, Landing AI, and NVIDIA. The dataset can be downloaded at mlcommons.org/speech.

Multilingual Spoken Words Corpus

Also available today is the Multilingual Spoken Words Corpus (MSWC), a rich audio speech dataset of more than 340,000 keywords in 50 languages, with upwards of 23.4 million examples. Previous datasets relied on manual efforts to collect and validate thousands of utterances for each keyword and were commonly restricted to a single language. A diverse multilingual dataset that spans languages spoken by over five billion people, MSWC advances the research and development of applications such as voice interfaces for a broad global audience. Contributors to the MSWC include researchers from Coqui, Factored, Google, Harvard University, Intel, Landing AI, NVIDIA, and the University of Michigan. The corpus can be downloaded at mlcommons.org/words.

DataPerf

The new DataPerf benchmark suite supports data-centric AI innovation by measuring the quality of datasets for common ML tasks and the impact of enhancing datasets. Training and test datasets are a key part of creating an ML solution, since the solution can only be as good as the data, but much less effort is spent on understanding and improving datasets than on mastering and improving models. DataPerf fosters and measures progress in this vital area. The MLCommons Association will support a series of challenges with leaderboards in 2022 to encourage participation in DataPerf. Contributors to the suite include researchers from Alibaba, Coactive.AI, ETH Zurich, Google, Harvard University, Landing AI, Meta, Stanford University, and TU Eindhoven, drawing on the teams responsible for Cats4ML, the Data-Centric AI Competition, DCBench, Dynabench, and the MLPerf™ benchmarks. The MLCommons Association invites other participants to join the DataPerf effort at dataperf.ai.

Historically, most AI research has focused on improving model architectures and making them available to the community; in contrast, the engineering and maintenance of datasets has received less attention and is often manual and ad hoc. The MLCommons Association is a firm proponent of Data-Centric AI (DCAI), the discipline of systematically engineering the data for AI systems by developing efficient software tools and engineering practices to make dataset creation and curation easier. Our open datasets and tools like DataPerf concretely support the DCAI movement and drive machine learning innovation.

“The machine learning model architecture for many applications is basically a solved problem. In many cases, focusing on engineering the data is more important for unlocking successful AI applications. Data is food for AI, and our systems need not just massive amounts of calories, but also high-quality nutrition. We need not just big data, but good data,” said Andrew Ng, founder and CEO of Landing AI, founding lead of Google Brain, co-founder and chairman of Coursera, and adjunct professor at Stanford University. “Thanks to the shared efforts by the community, including the work initiated by the MLCommons Association and its members, the movement demonstrates the potential for Data-Centric AI, and how we can collectively implement a greater AI adoption.”

“Speech technology can empower billions of people across the planet, but there’s a real need for large, open, and diverse datasets to catalyze innovation,” said David Kanter, the MLCommons Association co-founder and executive director. “The People’s Speech is a large-scale dataset in English while MSWC offers a tremendous breadth of languages. I’m excited for these datasets to improve everyday experiences like voice-enabled consumer devices and speech recognition.”

CIO Influence News Desk
