If you are on a machine learning team, do you still spend most of your time tuning the model to improve accuracy?
Professor Andrew Ng introduces Data-centric AI, a far more efficient way to improve the accuracy of AI models than the model-centric approach. Throughout his lecture, he emphasizes that machine learning teams should move from Model-centric to Data-centric AI.
After reading this article, you will have a better understanding of Data-centric AI and learn efficient ways to increase the accuracy of AI models!
Andrew Ng is currently an adjunct professor at Stanford University, where he leads a research group focusing on AI, Machine Learning, and Deep Learning. He co-founded and led Google Brain and was formerly Chief Scientist at Baidu, where he built the company’s Artificial Intelligence Group into a team of several thousand people.
Data-centric AI is an approach that iteratively improves the consistency and quality of the data to increase the accuracy of the AI model, while the code and algorithms are held fixed.
In contrast, Model-centric AI is an approach in which the data is held fixed after standard preprocessing, and the model is optimized so that it can deal with the noise in the data.
Let me show you an example: a model that detects defects in steel. Tuning the model's hyperparameters brought the accuracy to 76.2%, while improving the data quality raised it to 93.1%. (This assumes the baseline model already performs reasonably well.)
This shows that the model's performance improved far more after the data was made consistent.
- Train a model
- Perform error analysis to identify the types of data the algorithm does poorly on (e.g., speech with car noise)
- Either get more of that data via data augmentation, data generation, or data collection (change the inputs x), or give a more consistent definition for the labels if they were found to be ambiguous (change the labels y)
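The error-analysis step above can be sketched in code: tag each example with metadata (e.g., audio conditions) and compute the error rate per tag to find the slices the model does poorly on. This is a minimal illustration; the tag names, toy examples, and stand-in model below are made up, not from the lecture.

```python
from collections import defaultdict

def error_rates_by_tag(examples, predict):
    """Group error rates by metadata tag to find the slices the model does poorly on."""
    counts = defaultdict(lambda: [0, 0])  # tag -> [wrong, total]
    for features, true_label, tags in examples:
        wrong = predict(features) != true_label
        for tag in tags:
            counts[tag][0] += int(wrong)
            counts[tag][1] += 1
    return {tag: wrong / total for tag, (wrong, total) in counts.items()}

# Toy speech examples: (features, label, metadata tags). All names are made up.
examples = [
    ("clip1", "weather", ["clean_audio"]),
    ("clip2", "music", ["car_noise"]),
    ("clip3", "music", ["car_noise"]),
    ("clip4", "music", ["clean_audio"]),
]

# Stand-in for a trained model: always predicts "weather"
predict = lambda features: "weather"

rates = error_rates_by_tag(examples, predict)
print(rates)  # the "car_noise" slice errs on every example here
```

A high error rate on one tag (here, car noise) tells you which slice of data to augment, collect, or relabel next.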
If you were asked to label the following speech clip, which of these transcriptions would be the perfect label?
- “Um, today’s weather”
- “Um…today’s weather”
- “Today’s Weather”
All of them are good labels, but what matters most is consistency in data labeling. If your data labels are inconsistent, your model's performance will suffer.
- Ask two independent labelers to label a sample of the data
- Measure consistency between labelers to discover where they disagree
- For classes where the labelers disagree, revise the labeling instructions until they become consistent.
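A common way to quantify the second step is Cohen's kappa, which corrects raw agreement for the agreement two labelers would reach by chance. Below is a minimal sketch; the label names and the six hypothetical annotations are made up for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers (1.0 = perfect agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both labelers pick the same class independently
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators transcribing the same six clips:
# does each clip keep ("um_kept") or drop ("um_dropped") the filler word?
labeler_1 = ["um_kept", "um_kept", "um_dropped", "um_kept", "um_dropped", "um_kept"]
labeler_2 = ["um_kept", "um_dropped", "um_dropped", "um_kept", "um_dropped", "um_dropped"]

print(round(cohens_kappa(labeler_1, labeler_2), 3))  # prints 0.4
```

A kappa this far below 1.0 is a signal that the labeling instructions are ambiguous and need revising.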
- When you have small data and noisy (inconsistent) labels, you will not be able to get a clear picture of the decision boundary.
- When you have big data and noisy (inconsistent) labels, you will still be able to get a clear picture of the decision boundary.
- When you have small data and clean (consistent) labels, you will also be able to get a clear picture of the decision boundary.
If 60 out of 500 examples are inconsistently or incorrectly labeled, the following two solutions are roughly equally effective at improving the model's performance:
- Clean up the noise
- Collect another 500 new examples (double the training set)
Of the two, cleaning up the noise is the much faster and cheaper way to improve the model's performance.
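A small simulation can illustrate this trade-off. The sketch below is an assumption-laden toy, not from the lecture: synthetic 2-D data, 12% of labels flipped (mirroring 60 of 500), and a 1-nearest-neighbour classifier chosen precisely because it is sensitive to label noise. It compares training on the noisy set, the cleaned set, and a doubled noisy set.

```python
import random

def make_dataset(n, noise_rate, rng):
    """Two 2-D Gaussian clusters; a noise_rate fraction of labels is flipped."""
    data = []
    for _ in range(n):
        true = rng.randint(0, 1)
        center = 0.0 if true == 0 else 4.0
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        label = 1 - true if rng.random() < noise_rate else true
        data.append((point, label, true))
    return data

def knn_accuracy(train, test):
    """1-nearest-neighbour accuracy; deliberately sensitive to label noise."""
    correct = 0
    for (px, py), _, true in test:
        nearest = min(train, key=lambda t: (t[0][0] - px) ** 2 + (t[0][1] - py) ** 2)
        correct += nearest[1] == true
    return correct / len(test)

rng = random.Random(0)
noisy = make_dataset(500, 0.12, rng)                 # ~60 of 500 labels are wrong
cleaned = [(p, true, true) for p, _, true in noisy]  # same points, noise removed
doubled = noisy + make_dataset(500, 0.12, rng)       # collect 500 more noisy examples
test = make_dataset(200, 0.0, rng)                   # clean held-out set

acc_noisy = knn_accuracy(noisy, test)
acc_clean = knn_accuracy(cleaned, test)
acc_doubled = knn_accuracy(doubled, test)
print(f"noisy 500: {acc_noisy:.2f}  cleaned 500: {acc_clean:.2f}  noisy 1000: {acc_doubled:.2f}")
```

In this toy setup, cleaning the 500 existing labels helps more than collecting 500 additional noisy examples, because a nearest neighbour with a flipped label stays wrong no matter how many more noisy points you add.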
Did you know that you need roughly twice as much noisy data to reach the same model accuracy as with clean data? It is more efficient to fix inconsistent or incorrect data than to simply collect more. Professor Ng argues that the most important task of MLOps is to make high-quality data available through all stages of the ML project lifecycle.
Not at all: it can also be applied to the small-data problems hidden inside big data. For example, when improving an AI model for an autonomous car, you can use the Data-centric approach to handle rare events on the road, which are small data. Likewise, when improving an e-commerce model, you can use it to handle little-known products, which are also small data.
- Data that is defined consistently (no ambiguous labels)
- Data that covers important cases
- Data that receives timely feedback from production (so its distribution covers data drift and concept drift)
- Data that is sized appropriately
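The first property, consistently defined labels, can be checked mechanically: flag any input that appears in the dataset with more than one label. A minimal sketch with made-up utterances and labels:

```python
from collections import defaultdict

def find_conflicting_labels(dataset):
    """Return inputs that appear with more than one label, i.e. ambiguous definitions."""
    seen = defaultdict(set)
    for x, y in dataset:
        seen[x].add(y)
    return {x: labels for x, labels in seen.items() if len(labels) > 1}

# Made-up dataset: the same utterance was labeled two different ways
dataset = [
    ("um today's weather", "weather"),
    ("um today's weather", "smalltalk"),
    ("play some jazz", "music"),
]

conflicts = find_conflicting_labels(dataset)
print(conflicts)  # only "um today's weather" is flagged
```

Each flagged input is a candidate for the instruction-revision loop described earlier: decide on one definition, update the labeling guidelines, and relabel.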
Yes, there is a tool called Datasheets for Datasets, which enables better communication between dataset creators and users and helps the AI community move toward greater transparency and accountability. It was created by a Microsoft Research team led by Timnit Gebru.
Yes. Ainize has launched Teachable NLP, which helps users focus on developing Data-centric AI. You just need to prepare a good-quality dataset to train an NLP model with Teachable NLP. Try Teachable NLP to learn more about Data-centric AI!
- The Teachable NLP Challenge has begun and runs until May 23rd. The challenge is open to everyone with a good idea and a dataset for building their own AI model! ( Go to Teachable NLP Challenge )
If you have any questions about the article, submit them via the following link . They will be answered during the upcoming Clubhouse event (the schedule will be announced via invitation email). ( Register Clubhouse invitation )