2020 Trends: Data Markets and the Value of Data

Date: February 10th, 2020

This article is a guest blog post from one of our friends in the industry, Dr. Nikolaos Vasiloglou. Previous posts in our ongoing "Trends" series include:

2019 Trends: AI Interview

2019 Trends: Kubeflow

2019 Trends: Kubernetes

Framing the Value of Data

Everybody knows the tired cliché that “Data is the new oil”, as stated by the Economist a few years ago. There are endless case studies about the value of data in modern businesses, but most of them work at a very coarse granularity. For example, companies know that by storing the location or transaction data they were already producing, they were able to save x million dollars or create a new product that generated y million dollars. It becomes trickier when companies have to make decisions about buying data.

The biggest problem is pricing individual data points. It is tempting to conclude that online advertisers have already solved the problem with auctions. You have to be careful, though: publishers auction the advertising space and the visitor. It is the advertiser’s (or the DSP’s) problem to find the data needed to make the right bid with the right return. Indeed, we have seen the rise of marketplaces like AWS, where people can subscribe to various feeds (at a very high price) and then have to figure out for themselves how useful each feed is. That sounds like buying a service that delivers a box of potatoes to you every day, but it is not exactly the same. All the potatoes have the same utility to you (except for the rotten ones), but not all data points are useful to you. So while you feel OK buying potatoes in bulk, you don’t feel the same when you buy datasets.

The Devil is in the Features

My friends at nitrogen.ai have created a site that helps you find the value of specific features. Under the hood it is the good old Pearson correlation coefficient. They have built a service (an engineering masterpiece in the background) where you upload your labels keyed by location and time; they join them with public data sources and tell you which features are highly correlated with your labels. That works wonderfully when you can join your labels with these attributes. Things are not so easy when you want, for example, to buy images to train your image classifier. How do you make a decision like that?
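To make the idea concrete, here is a minimal sketch of that kind of scoring: join labels with candidate features on a (location, date) key and rank the features by the absolute value of their Pearson correlation. This is not nitrogen.ai's code or API; the function names, the key format, and the toy data are all assumptions made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_features(labels, features):
    """Join each feature with the labels on their shared keys and rank by |correlation|.

    labels:   {(location, date): value}
    features: {feature_name: {(location, date): value}}
    """
    scores = {}
    for name, column in features.items():
        keys = sorted(set(labels) & set(column))  # inner join on the key
        if len(keys) < 2:
            continue  # not enough overlap to compute a correlation
        scores[name] = pearson([labels[k] for k in keys],
                               [column[k] for k in keys])
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical toy data: daily sales per city, plus two candidate public features.
labels = {
    ("ATL", "2020-01-01"): 10.0, ("ATL", "2020-01-02"): 12.0,
    ("NYC", "2020-01-01"): 20.0, ("NYC", "2020-01-02"): 23.0,
}
features = {
    "temperature": {
        ("ATL", "2020-01-01"): 5.0, ("ATL", "2020-01-02"): 6.0,
        ("NYC", "2020-01-01"): 10.0, ("NYC", "2020-01-02"): 11.5,
    },
    "holiday_flag": {
        ("ATL", "2020-01-01"): 0.0, ("ATL", "2020-01-02"): 1.0,
        ("NYC", "2020-01-01"): 1.0, ("NYC", "2020-01-02"): 0.0,
    },
}

ranked = rank_features(labels, features)
for name, score in ranked:
    print(f"{name}: {score:+.3f}")
```

Note that the whole approach depends on having a shared key to join on, which is exactly what is missing when the thing you want to buy is a pile of images.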

This is an interesting problem: you are given a big dataset and you need to choose which points are useful to you. The problem looks familiar. Undoubtedly, every data scientist has used feature selection at some point. There, the problem is that you have a set of features and need to find the most important ones so that you can work with a smaller subset. The truth is that most people don’t do feature selection exactly; they prefer to call it feature importance, as interpretability and explainability have become very popular. The most famous method is the Shapley value. I will stay away from the mathematics of the term; all I will say is that it is just a game (the concept comes from cooperative game theory) and that it is very costly to compute.
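For readers who do want a peek at why it is so costly, here is a minimal sketch of the exact Shapley computation for a handful of "players" (features, in this case). Each player's value is its marginal contribution averaged over every subset of the other players, so the number of evaluations grows exponentially with the number of players. The value function `toy_value` and the feature names are hypothetical; in practice the value of a subset would be something like the validation score of a model trained on those features.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating every subset of the other players.

    `value` maps a frozenset of players to a number. The cost is on the
    order of 2**n evaluations of `value`, which is why exact computation
    is only feasible for a handful of players.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

def toy_value(feature_set):
    """Hypothetical model score for a feature subset (made up for illustration)."""
    score = 0.0
    if "age" in feature_set:
        score += 0.20
    if "income" in feature_set:
        score += 0.10
    if "age" in feature_set and "income" in feature_set:
        score += 0.06  # interaction bonus, split equally by the Shapley value
    return score  # "zip" never changes the score

phi = shapley_values(["age", "income", "zip"], toy_value)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.3f}")
```

In this toy game the interaction bonus is shared equally between the two features that create it, the useless feature gets zero, and the values add up to the score of the full feature set.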

Conclusion

A lot of approximations to the Shapley value have been proposed. Recently, the same algorithm has also been proposed for selecting the data points that are useful for your model. Don’t confuse this with active learning: there, you assume you have an abundant source of unlabeled data and you seek to find which points you want to label. Here we are talking about the problem of buying labeled data and deciding which points are useful to you.
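As a rough illustration of what valuing labeled points can look like, the sketch below uses one well-known style of approximation: Monte Carlo sampling over random orderings of the training points, crediting each point with the change in validation accuracy when it is added. It is not taken from any particular paper or library; the 1-nearest-neighbor utility, the function names, and the toy one-dimensional data (including one deliberately mislabeled point) are assumptions chosen to keep the example small.

```python
import random

def knn1_accuracy(train, validation):
    """Utility of a training set: accuracy of a 1-nearest-neighbor classifier.

    Points are (feature, label) pairs with a single numeric feature.
    An empty training set is given utility 0.0 by convention.
    """
    if not train:
        return 0.0
    correct = 0
    for x, y in validation:
        nearest = min(train, key=lambda point: abs(point[0] - x))
        correct += nearest[1] == y
    return correct / len(validation)

def monte_carlo_data_shapley(train, validation, num_permutations=200, seed=0):
    """Estimate each training point's Shapley value by sampling permutations."""
    rng = random.Random(seed)
    n = len(train)
    values = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset = []
        previous = knn1_accuracy(subset, validation)
        for idx in order:
            # Marginal contribution of this point given the points before it.
            subset.append(train[idx])
            current = knn1_accuracy(subset, validation)
            values[idx] += current - previous
            previous = current
    return [v / num_permutations for v in values]

# Hypothetical labeled data on offer; the last point is mislabeled on purpose.
train = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1), (0.5, 1)]
validation = [(0.2, 0), (0.8, 0), (9.5, 1), (9.8, 1)]

values = monte_carlo_data_shapley(train, validation)
for point, value in zip(train, values):
    print(point, round(value, 3))
```

The estimates always sum to the utility of the full training set minus that of the empty set, so they can be read as a way of splitting the model's accuracy among the points you are considering buying. A point that hurts the model when added, such as a mislabeled one, would be expected to end up with a low or even negative share.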

If you want to learn more about how Shapley values are used to price data, I highly recommend the recent blog post by Ruoxi Jia.