In the June 13th issue of The Economist, an article entitled “Data: Not so Big” got us thinking about how companies leverage big data, and how obtaining valid data to build software may be harder than most people outside the development world think. So, how do developers use data? Why is data so valuable, and so costly? We wanted to take some time to explain how we at Wiip.ai use data to improve our product, and how our investment in that data could be valuable for other companies as well.
So, what is all this data used for, really? Let’s walk through it step by step…
Algorithms learn from data. A developer can create a machine learning algorithm, but much like a child learning a language, the algorithm can’t improve without data to learn from.
The quality and quantity of the training data (the data a ‘baby’ algorithm will learn from) greatly affect the accuracy and value of the algorithm itself. Think of training data as the building blocks of the algorithm’s brain. Like a child learning to speak, the algorithm needs a starting point to learn from.
The catch is that having a data set is not enough to train a machine learning model. Even if you have vast quantities of data, the algorithm can’t learn from them on its own unless they are annotated. To create a training data set, a human needs to identify or give meaning to the data points before training. It is like teaching a child to speak: you need to point to an image for them to connect a word to a concept. Language alone is not enough.
So, the data you want to use for training usually needs to be enriched or labeled. Additionally, the more quality data available, the stronger the algorithm tends to be.
So, while training data is used to make sure the machine recognizes patterns in the data, the cross-validation or test data is used to check the accuracy and efficiency of the algorithm used to train the machine. The test data shows how well the machine has learned from the training data, and can provide a percentage of accuracy. While training data needs to be carefully tagged in order to teach the machine learning model, test data can be a little bit more rogue, although it still needs known correct answers if you want to score the model against it. The test data should challenge the model so that developers can pick up on weak points and improve the algorithm.
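To make the idea of labeled training data concrete, here is a minimal sketch in Python. It is not how Wiip.ai builds its models; the sentences, the `en`/`fr` labels, and the `train` and `predict` functions are all hypothetical and only illustrate that each example must carry a human-assigned label before anything can be learned from it.

```python
from collections import Counter, defaultdict

# Hypothetical labeled training data: a human has tagged each sentence
# with the language it is written in.
LABELED_EXAMPLES = [
    ("the cat is on the table", "en"),
    ("where is the train station", "en"),
    ("le chat est sur la table", "fr"),
    ("où est la gare", "fr"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.lower().split())
    return word_counts


def predict(word_counts, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {
        label: sum(counts[word] for word in words)
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)


model = train(LABELED_EXAMPLES)
print(predict(model, "the station is here"))  # en
print(predict(model, "la gare est ici"))      # fr
```

Without the second element of each pair, `train` would have nothing to attach the words to, which is the point of annotation. A real model would also need far more than four examples.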
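As a rough illustration of how test data yields a percentage of accuracy, the sketch below holds out part of a labeled set and scores a model against it. Everything here is an assumption made for the example: `split_examples`, `accuracy_percent`, the toy `toy_predict` rule, and the sentences are hypothetical stand-ins, not a description of any particular product.

```python
import random


def split_examples(examples, test_fraction=0.25, seed=0):
    """Shuffle labeled examples and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy_percent(predict, test_examples):
    """Percentage of test examples where the prediction matches the label."""
    correct = sum(1 for text, label in test_examples if predict(text) == label)
    return 100.0 * correct / len(test_examples)


def toy_predict(text):
    """Hypothetical stand-in for a trained model: guesses French
    when it sees one of a few marker words, English otherwise."""
    markers = ("le", "la", "est")
    return "fr" if any(word in markers for word in text.split()) else "en"


# Hypothetical test set. The last sentence is deliberately challenging:
# it is French but contains none of the marker words.
TEST_EXAMPLES = [
    ("the cat is on the table", "en"),
    ("la gare est ici", "fr"),
    ("le train est en retard", "fr"),
    ("je ne sais pas", "fr"),
]

print(accuracy_percent(toy_predict, TEST_EXAMPLES))  # 75.0
```

The one missed sentence is the useful part: a test example the model gets wrong points directly at a weak spot to fix. `split_examples` shows one simple way to keep test examples separate from the ones used for training.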
At wiip.ai, for example, when testing out language detection models we
While there are some open source datasets available, in most cases training data needs to either be built from scratch or bought from a provider. Depending on the size and scope of your AI-powered project, you may want to consider purchasing large amounts of data that have been tailored for your domain to save the time and energy it takes to generate the data yourself. However, if your model is simpler, you may be able to create the data yourself, or find an organization willing to donate some of its data in return for the use of your service or product.
As data collection and privacy policies are changing, there are two things to consider when collecting and tagging data for training and testing your model. Even the biggest players in the market are facing challenges. Just recently Google was hit with a $5 billion lawsuit for tracking what was thought to be private browsing data, and the popular TikTok is facing a class action suit for collecting biometric data.
First of all, these lawsuits are so large because the data is so valuable. Building datasets to train and test is a time-consuming and costly endeavor, and a lot of tech giants try to take advantage of their consumers’ data without paying for it. Legal consequences can be costly, so it is worth giving some thought to how you will obtain your data, and whether you are doing so legally.
Secondly, there are services that work offline and still collect data privately, with consumers’ permission. At Wiip, we work with our clients to let them improve voice recognition and translation AI models offline without the risk of sharing data with other developers. Use your clients’ data with their permission to improve your models’ domain-specific knowledge without the risk of a lawsuit down the road.
Reach out to us to learn more about the ins and outs of quality data for machine learning and artificial intelligence training and testing.