Organizations are tuning their Artificial Intelligence (AI) models for the best accuracy, using more and more data. Every day, new and improved algorithms become available on the market. So it is no surprise that much time and energy is spent trying the latest algorithms and measuring their impact on the model’s accuracy. Companies must ask themselves, “does continuously improving the algorithm guarantee us the highest accuracy?”
We will discuss why focusing more closely on the data can significantly impact accuracy. Organizations have access to more data than they can afford to analyze, and the noise in that data prevents them from extracting the highest possible business value. A greater focus on improving data quality, together with data augmentation, is among the innovations promising higher accuracy in AI solutions.
Ingredients of an AI project
An AI project requires two main components to be successful. First, the “data” is the main ingredient as it encapsulates past events and is essential to make predictions based on those events. The second component is the ability to “analyze” the data in a timely fashion with the highest possible accuracy.
A question I frequently get is, “How much data do we need for our AI project?” Unfortunately, there is no straight answer, as it depends on many factors. Most of the time, people respond with, “You can never have enough data.” The issue is that collecting the data is just the beginning. The data needs to be processed with the help of infrastructure and AI algorithms, and storing and analyzing data in an AI model is expensive. The more data, the higher the cost, and budgets have limits. So understanding early in the process which subset of data will give us the best results will also deliver the highest ROI. Improving data quality is how we achieve this goal.
Data-centric AI
Andrew Ng, co-founder of Coursera, former chief scientist at Baidu, and founder of the Google Brain research lab, is a leader and pioneer in AI. He understands the state of AI and which areas need to evolve further. Most recently, he started a movement to improve the quality of data in order to unlock its full potential. The concept is new and still in the “ideas and principles” phase.
Andrew coined the term “Data-centric AI.” It is the discipline of systematically engineering the data to build an AI system. The emphasis is on data quality and creating tools to facilitate building AI solutions with the highest accuracy. That said, not all problems are the same, and the need for flexibility is still a factor.
To be clear, improving the algorithms and improving the data quality are both critical. The argument, however, is that focusing on the quality of the data will do the most to unlock its full potential.
“Focusing on the quality of the data will help unlock its full potential”
– Andrew Ng, AI pioneer.
Data quality
No algorithm in the world can compensate for poor data quality. In other words, without quality data, expectations of high accuracy have to be tempered. To be called quality data, the data must fit the intended business domain. Data quality must be measured as early as possible, followed by the actions needed to increase it. After all, what you do not measure, you cannot improve.
Data Management tools can help automate and streamline the data process. However, as the 2021 ESG Data Management Survey outlines, organizations struggle to integrate these tools into their frameworks. The survey indicates that organizations use several data management tools to deliver data management within their environment, yet the tools lack maturity and have too narrow a focus. The resulting complexity makes the overall approach far from efficient.
The most crucial data quality characteristics that need to be addressed by Data Management tools are listed below, followed by a short sketch of how a few of them might be checked in practice:
| Characteristic | Description |
| --- | --- |
| Uniqueness | The presence of duplicate data. |
| Accuracy | Incorrect information or lack of detail. |
| Completeness | Incomplete data or missing required values. |
| Consistency | Contradictions in the data. |
| Timeliness | Out-of-date information (stale data). |
| Validity | Data in the wrong format or order (as defined by business rules). |
| Relevance | Data irrelevant to the business domain. |
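As an illustration, several of these characteristics can be checked programmatically early in the pipeline. The following is a minimal sketch using pandas; the file name, column names (customer_id, email, last_updated), and thresholds are hypothetical examples, not a prescribed schema.

```python
# Minimal sketch of early data quality checks with pandas.
# File name, column names, and rules are hypothetical examples.
import pandas as pd

df = pd.read_csv("customers.csv")

# Uniqueness: duplicate rows and duplicate keys.
duplicate_rows = df.duplicated().sum()
duplicate_ids = df["customer_id"].duplicated().sum()

# Completeness: missing required values per column.
missing_per_column = df.isna().sum()

# Validity: values violating a simple business rule (a basic email format check).
invalid_emails = (~df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=True)).sum()

# Timeliness: records not updated within the last year (stale data).
stale_records = ((pd.Timestamp.now() - pd.to_datetime(df["last_updated"])).dt.days > 365).sum()

print(f"Duplicate rows: {duplicate_rows}, duplicate IDs: {duplicate_ids}")
print(f"Missing values per column:\n{missing_per_column}")
print(f"Invalid emails: {invalid_emails}, stale records: {stale_records}")
```

Even lightweight checks like these, run before training starts, surface issues that no amount of algorithm tuning can repair later.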
Data Augmentation
Another challenge is that the data might simply not be a good fit for your domain. We can describe these shortcomings as data gaps, as they reflect the imperfections of real-world data.
The gaps can be observed as follows:
- Domain Gaps: training data does not match the real world where the model is used.
- Data Bias: the collected data might have a bias impacting individuals and society.
- Data Noise: data with a low signal-to-noise ratio. A frequent problem is mislabeled data.
Data Augmentation can help mitigate those imperfections with a data-centric approach in mind. It is a technique where the training data is artificially expanded by generating modified versions of existing data.
The topic of Data Augmentation deserves a more in-depth analysis, and we will leave that for a future blog. However, the central concept is to add additional training data to overcome the data gaps.
One approach is self-supervision: taking existing data and making minor changes to it. When dealing with images, one could, for example, apply cropping, resizing, or contrast changes to generate additional training data, as in the sketch below.
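Here is a minimal sketch of such an image augmentation pipeline, assuming PyTorch and torchvision are available; the specific transforms and parameter values are illustrative choices, not a recommended recipe.

```python
# Minimal image augmentation sketch with torchvision.
# Transform choices and parameters are illustrative only.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),       # random crop, then resize to 224x224
    transforms.RandomHorizontalFlip(p=0.5),  # mirror the image half the time
    transforms.ColorJitter(contrast=0.3),    # randomly vary the contrast
    transforms.ToTensor(),                   # convert the PIL image to a tensor
])

# Each call produces a new, slightly modified training example
# from the same original image:
# augmented_example = augment(original_pil_image)
```

Because every pass over the dataset sees slightly different versions of the same images, the model effectively trains on more data than was originally collected.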
The second approach is to generate synthetic data, which is created artificially rather than drawn from the original dataset. Creating synthetic data is a relatively new concept but a significant advancement. The idea has already been introduced in AI fields such as healthcare, self-driving cars, Natural Language Processing (NLP), and Automatic Speech Recognition.
The techniques behind synthetic data generation often use DNNs (Deep Neural Networks) and GANs (Generative Adversarial Networks).
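To make the GAN idea concrete, here is a minimal sketch in PyTorch: a generator learns to produce samples that a discriminator cannot tell apart from real data, and the trained generator is then used on its own to emit synthetic examples. The dimensions, layer sizes, and hyperparameters below are illustrative assumptions, not a production recipe.

```python
# A minimal GAN sketch in PyTorch for generating synthetic data.
# Dimensions, layer sizes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 8  # hypothetical sizes for a small tabular dataset

# The generator maps random noise to candidate data points.
generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, data_dim),
)

# The discriminator estimates the probability that a sample is real.
discriminator = nn.Sequential(
    nn.Linear(data_dim, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1), nn.Sigmoid(),
)

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_batch: torch.Tensor):
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # 1) Train the discriminator to separate real samples from generated ones.
    fake_batch = generator(torch.randn(batch_size, latent_dim)).detach()
    d_loss = loss_fn(discriminator(real_batch), real_labels) + \
             loss_fn(discriminator(fake_batch), fake_labels)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Train the generator to produce samples the discriminator accepts as real.
    noise = torch.randn(batch_size, latent_dim)
    g_loss = loss_fn(discriminator(generator(noise)), real_labels)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()

# After training, synthetic samples come from the generator alone:
# synthetic = generator(torch.randn(1000, latent_dim))
```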
Measuring Data Quality
When benchmarking an AI solution, one should consider measuring the model’s accuracy, speed, and efficiency. A benchmark gives us a mechanism for consistent measurements and a process to validate improvements to our model. One such benchmark is MLPerf™ from MLCommons.
Data is now considered the bottleneck limiting future advancements. To address this problem, it is essential to put data at the center and create meaningful metrics; without metrics, and without measuring against them, it is impossible to improve datasets. The metrics help measure the vital qualities of data and indicate where improvements are needed.
The people at MLCommons have started the DataPerf initiative to create appropriate training and testing data for delivering new machine learning (ML) solutions. In other words, benchmarking datasets in order to build better datasets.
The AI industry is moving towards a Data-centric AI approach, and tools that make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable will be front and center.
DataPerf is backed by a strong community and is always looking for people to help. Please reach out to them if you are considering contributing.
Wrapping Up
To achieve accurate results, it’s crucial to understand not only the algorithms but also the source and type of data being used. Assess the data regularly as you test your algorithms, and know that understanding where the data lives and what it is worth is vital to a cost-effective AI strategy. More data is better than less data. But better data beats more data!