☎️ E17: In Conversation with Shiv Malik, CEO of Pool.io on the State of Data Unions 🤝₿
The data broker market is worth $250bn. The entire machine-learning market was worth $40 billion-ish. 5x more interesting.
Welcome 👋. I’m Lawrence Lundy-Bryan. I do research for Lunar Ventures, a deep tech venture fund. I write State of the Future, the World’s most comprehensive deep tech tracker. You get your Vector Databases, LLMs, and Decentralised AI as well as your Optical, Neuromorphic, and Analog Computing. If you like your technology with a dose of good old-fashioned humour, just 👇
Data unions are one of the most interesting ideas in technology at the moment. And I’m the guy who assessed 100 technologies. Few concepts can change an existing value chain materially. Obviously, Xenobots. Xenobots and whole brain emulation. But I can’t just always write about xenobots and whole brain emulation. That’s not a newsletter.
Data unions say: what if we got rid of the data brokers that sit between data sellers and buyers, and gave the data sellers a cut of the sales? It’s interesting because the data broker market is worth $250+ billion. The entire machine-learning market was worth $40 billion-ish. The data broker market, which, in theory, data unions could remove from the market entirely (or at least displace), is more than 5x larger than the entire ML market!
Do I have your attention?
In my assessment of data unions on the site, I write:
“Data unions offer a way for data creators to be paid for their “labour” (aka data exhaust). They are part of a narrative of exploitative Big Tech and the Web3 world in which users “own” their data. The high-impact scenario sees the wallet become core to the next iteration of the Web, in the same vein as the personal “data pods” of the Tim Berners-Lee and Solid vision. If that scenario is true, then data unions, and thus dataDAOs, will become the primary interface between users and data buyers, aggregating individual data into privacy-protected cohorts for advertisers and brands. A better vision is a DataDAO that uses Multi-Party Computation (MPC) to run private analytics as well as selling data directly. A low-impact scenario is that data unions don’t exist at all, as consumers are happy with the current data broker market and the amount unions can pay users isn’t sufficient to make consumers switch en masse. As much as I would like to believe in the high-impact scenario, consumer behaviour to date, the complexity of a data union versus the advertising model, and the fact that individual data isn’t worth that much make the low-impact case more plausible.”
Individual data isn’t worth much. That’s the crux of it, isn’t it? Individual data does not a business make.
To Tell Me Why I’m Wrong™ (the name of my upcoming podcast), I spoke to my good friend, Shiv Malik, the CEO and founder of Pool Data, a startup building a zero-party log-in solution. You sign in with their product called Pocket, and you choose which brands and publishers to share your data with. Sort of like Tim Berners-Lee’s Solid project. Originally Pool was trying to build infrastructure for data unions. This pivot was interesting in what it said about the maturity and capabilities of data unions.
You won’t find these insights in your LLM.
6 Things I Learned
🛒Market context: Data unions belong to a zero-party data model where individuals voluntarily provide data; it differs from a first-party data model, where data is collected directly from a company's own interactions, and the third-party data model, where data is acquired from external sources like data brokers. Shiv thinks zero-party outcompetes first-party models on the supply side by persuading users to share more data, and demand-side, by having higher quality data than first-party providers like Meta, Google, Amazon, or Apple. This is a pretty compelling argument.
🚦Technical Viability: Data unions were almost impossible before two developments: the Digital Markets Act (DMA), which forces social media companies to offer real-time API access to user data, and recent progress in crypto, with stablecoins and Layer 2s, on the payments side. It’s unclear whether privacy-enhancing technologies are a differentiator for data unions or a prerequisite. Progress in data unions is a canonical example of a supply-side driver.
🚦Commercial Viability: Most (all) data unions have had a weak business model. Mirroring data brokers isn’t sufficient. Adding analytic capabilities and selling insights not data would make a difference on the economics. An even better model is to enable users to sell their data directly but enrich it through some federated learning/MPC tools.
🛞Drivers: The Digital Markets Act (DMA) comes into force in March 2024 and forces large social media platforms to open up certain core platform functionalities to third-party apps and services via APIs. This is basically Open Banking for social media. This should make it much easier to port personal data to new services. This is likely to be a huge catalyst for the market.
🔮Impact: A data union aggregating user data might be the wrong mental model. A federated learning protocol might be what we are talking about here. We may never need to consolidate user data if we can generate insights without moving it to a third-party server. This is the federated learning story. We just need a twist where users are paid for their contributions. Or where the entire FL model is collectively owned by a DAO.
⏰Timing: Shiv tells me that if data unions aren’t a thing by the end of 2025, they never will be. If the DMA going live in March 2024 and Chrome phasing out cookies from the middle of 2024 don’t create the necessary push, nothing will.
Lawrence: Shiv, data unions are more of an idea and a business model. But technically how easy are they to build? What are the technical barriers today? I'm defining data unions as any entity collecting and selling data and remunerating the members.
Shiv: There are some technical challenges. A few elements make up the tech stack of a data union. There's data ingestion, insights, privacy enhancing technologies and payment. At a crude level, all the pieces of the puzzle are there. But I wouldn’t say it’s easy to pull them all together.
With data ingestion, data unions struggle to gather enough data. It’s still hard to pull together a coherent dataset. A web browser plugin gives you 90 days of historical browser data. That's pretty useful. But when you try and do the same on mobile, you can now do it on Safari, but you can't do it on Android with Chrome. So that’s a blocker, and that’s just browsing data. There are hundreds of possible streams that are changing constantly. So it's a technical impediment. Banking data is even more complicated. Again, technically, it's all there. In the US, it’s set up through bilateral arrangements. Plaid is the leader in that. In the UK and Europe, it's enforced through open banking legislation. A lot of work is going on through the Digital Markets Act in Europe to open up social data in Google, Apple, Facebook, LinkedIn, etc., in an Open Banking-style framework. The technical impediments are the complexity of APIs and the economic incentive of the big firms that have first-party data not to make it easy to access.
With analytics, it's less of a technical problem. There are tons of analytical engines, data experience, and off-the-shelf products. You plug in the data sources, and off you go. Obviously, you can begin to use some ML or train your own models, too. The privacy side of things is more complex, and it’s less clear to me what the stack looks like here. Ideally, you would have an easy-to-use synthetic data engine or something; as a developer, you really don’t want to be figuring out how to connect these things together and worry about data leakage and risks. Maybe it’s some federated learning and differential privacy tied together. But the field is developing pretty rapidly. I do think this will be crucial to the success of data unions. You want to be able to say to users, we can’t read your data. It’s safe because either it never leaves your device unencrypted, or we don’t ever decrypt it.
And finally, payments. This is where crypto fits in. This is why you see data unions starting in crypto because, before crypto, you literally couldn’t distribute small payments back to members. Even with crypto, 3 years ago, it was also impossible. But in 2023, with more sophisticated smart wallets, Layer 2s, stablecoins and DeFi, it’s pretty easy to take fiat, deal with crypto in the back end and even pay members back in fiat if they want. Transaction costs continue to fall, and UX continues to get better.
So it’s basically a data ingestion problem, which is getting solved with the DMA, and a privacy-enhancing tech problem. And we can probably launch products before the privacy side is solved entirely.
[Editor’s Note: We agree. See our thesis on this below. Tell your friends]
Lawrence: On the data ingestion side, what would make it easier to build a product? Is it as simple as if every data stream had a real-time open API?
Shiv: It’s precisely what the DMA has already legislated, but it doesn’t come into force until March next year. Large social media platforms designated as gatekeepers are required to open up certain functionalities to third-party apps and services via APIs. It’s a real win. But when it arrives it changes the market, because the most valuable data is the most recent data. And what better data than real-time feeds? Real-time data enables a whole new class of products that aren’t possible with today’s data portability rules.
Lawrence: Has the market fully digested this yet? It seems that in 2023, you can technically get a name or email address. But from 2024, a third party can pull all Facebook's data on an individual, assuming they get consent.
Shiv: It's still a little uncertain as to what will happen. But broadly, the legislation means that as long as they have the end user’s consent, third parties can legally request much more data about that individual. And they will be able to regularly pull new information via an API. It would probably work like the auth system works now, but instead of getting basic account information like first name, last name and an email (which they obviously can't use if it's a personal email, for GDPR purposes), they would get a real-time connection to someone's account information. The third parties requesting the data don’t have to be a data union, obviously, but they do need to get consent. But it will be easy to do. So, there won’t be a technical barrier; the barrier will be convincing users that you can provide enough value for them to hand over their data. This is the opportunity for data unions to differentiate. To say, give us your data, and we will remunerate you, instead of giving it to another company that won’t. Of course, this new world relies on gatekeepers adhering to the legislation. We know from Open Banking that gatekeepers can sometimes drag their feet.
Lawrence: Interesting; what I am hearing is that data unions are about to go from maybe a 3/4 on the scoring to a 5; technically, it’s about to become much more accessible. I might go as far as to say in 2022/23, it was close to impossible to be a data union. But with the DMA, data unions are now viable. Is there a straightforward commercial offering? Is there enough value in an individual’s data to make it worthwhile for them to sign up? If they get back 10 dollars, is that good enough? What about 100? What do the economics look like?
Shiv: The basic model for almost all data unions today is to collectivise. To gather lots of user information, remove personal information, and sell it in bulk as a real-time data set. There’s a massive market for that data, whether open banking or browsing data. Or even better, linking the two. It's also feeding into the world of first-party data, which is growing as cookies recede. Every brand and publisher is in the market for this data. The problem, as you’ve noted, is that even if the aggregated data set is expensive, divided up by hundreds and thousands of contributors, the numbers get pretty small, maybe 5 or 10 dollars a month.
Lawrence: The challenge here is that 5 to 10 isn’t enough for the most valuable users, right?
Shiv: Well, it’s 5 to 10 more dollars than you get monthly. That's two cups of coffee. (Editor’s note: not anymore in London). But as always, it's in the framing. It's the psychological framing that turns users off. So, the members of data unions so far tend to be bargain hunters. People looking for a discount, for a deal. We know this because we are at the nexus of information; we were serving other data unions so we could see what they could see across multiple data unions. This user persona gets a thrill out of getting an extra 3 or 4 bucks every month, but they don't tend to spend money. They are frugal. The thing is, brands and publishers don’t like this persona, right? Because they don’t spend money. So you see the problem. The current cohort of data unions has incompatible users and buyers. So it’s not a problem you solve with a more significant number. More bargain hunters won’t solve your problem. You need to appeal to other sub-groups.
Lawrence: Interesting. And I assume the crossover between people in crypto and bargain hunters is even smaller still?
Shiv: You can analyse the same user persona and determine what motivates crypto users. Obviously, it’s much more nuanced than this, but a large proportion of crypto users identify with wanting to get rich quick. They like the thrill of gambling to some extent. And people looking to get rich are generally not the same people as bargain hunters.
Lawrence: I wonder how many users of moneysavingexpert also have a crypto wallet? That crossover has to be vanishingly small.
Shiv: There is probably more crossover than you think. Both segments care about money more than the average person in the sense they want to earn more. But yeah, I think broadly you are right. But the point is that a data union paying 5 to 10 dollars a month attracts the wrong member base. Data unions and other community groups outside crypto tend to give out points, rewards, and vouchers. It’s rarely cash directly.
Lawrence: If the user took the time to understand the value of a reward point, they would realise the cash value is really small and probably not worth it. So crypto doesn’t really enable anything because, actually, micropayments attract the wrong members?
Shiv: Well, not quite. You could give crypto straight to users; that’s not quite direct cash. It’s closer to a points system because the crypto can be used for many things, including crypto services, DeFi, etc. But still, yes, there is a tension for sure. We’ve seen some experiments as data unions attempt to attract a more extensive user base by turning their token into a meme coin. Or where DAOs are trying to figure out a way to incentivise members by offering services beyond just crypto. But it really depends on the value system of the DAO, for example, because many of them spin up to sell a coin and get rich. But there certainly are value-aligned ones in which you build a data union where getting paid for your data is a proposition, but there are many other benefits for the member that make the 5 or 10 dollars nice but not the reason they joined. This is probably how the economics work out if data unions are to succeed.
Lawrence: This is close to how real-world unions work anyway. Much of it is collective bargaining, but there are a lot of benefits, including legal representation, insurance, and even restaurant discounts. Let’s move on to drivers. What are the drivers of the market today? Is it regulation, the demise of cookies and crypto?
Shiv: Two significant trends are happening. The first is the opening up of data pipelines of real-time information, of which the DMA will be a huge part. The second is the decline of traditional service spyware. So cookies and Apple killing IDFA tracking are the big things. And very soon, we won’t be sharing emails for sign-in and tracking. Google is already heading in that direction, and Apple’s Hide My Email feature is another step. Google told the UK Competition and Markets Authority they would get rid of IP tracking from Chrome in three to four years. At this point, in the name of privacy, you have created a world in which only Meta, Google or Apple can identify an individual user. This might be better than our mess now, but that’s also not a stable equilibrium.
Lawrence: tldr, we end up in a world where individual users cannot be tracked across the web; how does this benefit data unions?
Shiv: Well, post March ’24, it will never have been easier for individuals to give third parties real-time access to their data. But you see, it’s the user that has to consent. Today, or at least for the past decade, data was shared without consent. So, every data broker and buyer of data basically has to explicitly ask for it. The middle won’t hold. So the question is: how do data unions win?
I think this is why the business model is so essential, and we shouldn’t write off all data unions because they have all alighted on one single unsuitable business model.
They are basically data aggregators. They’ve all persuaded users to hand over their data; the data unions pool it and then sell it to data buyers. There is often little privacy protection here, but let’s park that for a second. Basically, it’s up to the data buyer to do the analytics and figure out what they want to know. This seems wrong to me. The way I see a future data union is an entity that does the aggregation and analytics all in-house. They then find buyers for the insights, not the data. This is not only better from a privacy perspective, but you’ve cut out a middleman, reducing costs and adding more value. You can take that a step further because that is still an aggregation business model. But what if the data union didn’t even sit in the middle of the transaction? You can imagine a lightweight private analytics protocol that does federated learning, allowing the users to build 1-1 relationships with brands or publishers. The users have the benefit of agreeing on which entities can have access to their enriched data. This would be an experimental model, and it’s unclear exactly how it looks. It’s somewhat close to the Tim Berners-Lee and Solid pod model. Or an even older idea of a data vault. However, with some private data enrichment, it is more valuable and addresses the economic problem. There are many working through these issues; Snickerdoodle springs to mind.
Lawrence: Okay, so to clarify, data unions to date have got it wrong. They are just replicating the old data broker model. Hoover up data and sell it in aggregate. You’re saying, first, why not do some analytics and sell insights rather than data? That adds more value. But more importantly, you’re saying aggregating it all up is a mistake. A better model is to collect, enrich and allow users to trade their data directly. Is that right?
Shiv: This is how the economics add up. The best a data union could ever give you would be $5 to $10 a month. That’s not going to work to go mainstream. But people love discounts. An individual could easily trade their data for a 10% discount off a $200 spend, and that's already $20. So imagine doing that multiple times in a month. Now you're making real money. So you can already see how the proposition markedly changes because you've changed where the data is being delivered to and cut out those middlemen.
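[Editor’s Note: Shiv’s arithmetic can be laid out as a quick back-of-the-envelope sketch. The $5 to $10 monthly payout and the 10% discount on a $200 spend come from the conversation; the figure of three discounted purchases a month is purely our assumption for illustration, as are the function and variable names.]

```python
# Back-of-the-envelope comparison of the two models discussed above.
# From the interview: $5-$10 monthly payout, 10% off a $200 spend.
# Assumption (not from the interview): three discounted purchases a month.

def monthly_discount_value(spend: float, discount_rate: float, purchases: int) -> float:
    """Total monthly value of trading data directly for discounts."""
    return spend * discount_rate * purchases

aggregation_payout_range = (5.0, 10.0)  # best-case aggregation payout per month
single_discount = monthly_discount_value(200.0, 0.10, 1)
assumed_monthly = monthly_discount_value(200.0, 0.10, 3)

low, high = aggregation_payout_range
print(f"Aggregation model: ${low:.0f} to ${high:.0f} a month")
print(f"One discounted purchase: ${single_discount:.0f}")
print(f"Three discounted purchases (assumed): ${assumed_monthly:.0f}")
```

Even a single discounted purchase is worth at least twice the best-case aggregation payout, which is the point Shiv is making; how often a real user would redeem such discounts is an open question.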
Lawrence: You describe something much more like a federated private analytics protocol than a data union. This speaks to something I wrote about in the collaborative computing thesis: once users trust privacy technologies, they will be more willing to share their data. This is a fundamental value unlock.
Shiv: No need to quibble over the terminology. Everyone has their own view of a data union, ethical data broker, or data trust. I don’t care that much because, to date, nothing has worked. I am more interested in tweaking the model to appeal to users. And if that means not aggregating the data and doing federated or distributed analytics, then so be it. Is it a “union”? Well, individual data is used to train a local model, which is then used to train a global model. Is that collectivisation? Does it matter? If the user receives a cut of the data shared as cash or discounts, I’m not sure it matters. This model is much closer to empowering users than the aggregation model. Maybe the language matters for marketing. A data trust or union is 100% better than a broker or intermediary. But none of the terminology matters unless there are strong confidentiality guarantees.
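[Editor’s Note: For readers who want to see what “a local model, which is then used to train a global model” means in practice, here is a toy sketch of federated averaging in plain Python. It is not Pool’s protocol or anyone else’s product; the users, their data, the one-parameter model and the learning rate are all made up for illustration, and a real system would add protections such as secure aggregation or differential privacy on top.]

```python
# Toy federated averaging sketch (pure Python, no real library or protocol).
# Each user fits a one-parameter model y = w * x on data that stays local;
# only the updated weight and the sample count are shared for averaging.

def local_update(w, data, lr=0.01, epochs=20):
    """Run a few gradient-descent steps on one user's private data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(global_w, users):
    """Average the users' local weights, weighted by how much data each holds."""
    total = sum(len(data) for data in users)
    return sum(local_update(global_w, data) * len(data) for data in users) / total

# Hypothetical users whose private data roughly follows y = 3x.
users = [
    [(1.0, 3.1), (2.0, 5.9)],
    [(1.0, 2.9), (3.0, 9.2), (4.0, 11.8)],
    [(2.0, 6.0), (5.0, 15.1)],
]

global_w = 0.0
for _ in range(30):
    global_w = federated_round(global_w, users)

print(f"Global weight after 30 rounds: {global_w:.2f}")
```

The raw (x, y) pairs never leave each user’s list; the coordinator only ever sees weights. Paying users for their contribution, the twist discussed in this interview, would sit on top of a loop like this one.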
Lawrence: I’m not sure my prediction stands up in the face of this conversation. I’ve said that by 2030, data unions and dataDAOs fail because data brokers will rebrand as ethical. I imagine a world in which Experian continues to dominate by using privacy tools and federated learning. They say, “We never see your data, we never bring it to our servers”. And there's some marketing around the fact that they're ethical and the value chain doesn't change. What’s your take?
Shiv: From the position we are in today, it's a reasonable expectation that the first-party data system succeeds. Why? Because it's got every advantage. You have an ecosystem with companies worth billions and a considerable incentive to maintain the first-party system. And regulation, despite having good intentions, entrenches this world. But I can’t imagine a world where it’s so easy to transfer data that users won’t choose zero-party over first-party. First-party can never get as much data as a zero-party model, so, in isolation, the data wallet/pod will always be more valuable to the data buyer. Why would a brand go to Amazon, Experian or Google when they can go directly to the user? And ideally, that data has been enriched, remember, with other user data. And as a data seller, why would I give it away for free when I can get discounts with the data union? Zero-party wins on the supply and demand side.
Lawrence: Where do dataDAOs fit here? Crypto is this adjacent market with unique datasets. However, it is structured as a zero-party data model with users giving permissions for all transactions using their private keys. This is what you are describing as the winning model. However, as far as I know, these dataDAOs still use that aggregation business model, which you say is sub-optimal.
Shiv: DAOs have their own problems above and beyond the aggregation model. It’s like combining an extremely nascent organising structure with an extremely nascent business model. DAOs do collectively manage an asset or resource, but managing data is an order of magnitude more complicated than managing a pot of money. This generation of DAOs probably isn’t suited to managing data. We need all sorts of new infrastructure, but more importantly, a much broader and more diverse membership to get anywhere close to these models making sense. I can’t imagine it working until you combine blockchain data with all the web data because wallet data isn’t valuable to the average brand in 2023. But your point is well made; the underlying structure of crypto with user-controlled wallets is a zero-party model.
Lawrence: If I were looking for a catalyst that would catapult data unions into the mainstream, what should I be looking for?
Shiv: The end of cookies. That’s the catalyst, and it’s happening. So that’s the shift from third to first-party data. The value chain is now up in the air. Data unions have a moment to seize right now. Google will phase out cookies from Chrome just after the DMA comes into force, so 2024 is the moment.
Lawrence: That's an excellent testable prediction. We could say if data unions haven’t taken off by the end of 2025, they won’t take off at all.
Shiv: From today, you've got a year and a half, two years max. But data is this big amorphous thing with many regulations and stakeholders, and there really isn’t a single thing as “data”. So, we might not see the emergence of “data unions” as such. Like 100 organisations saying they are data unions. It will be piecemeal, industry-by-industry, where data providers will spring up to take advantage of the DMA. I'm not sure whether they look like federated learning platforms, federated analytics, or privacy-preserving data intermediaries. But I am sure that the current data economy is slowly breaking, and it will be broken next year. So something big is coming.
Lawrence: Something big is coming. There is no better place to leave an interview. Find Shiv here and sign up for the waitlist for Pocket — The app that allows you to take control of your data and trade it with the brands you trust in return for the discounts and deals you deserve.
here be all the data unions out there today amiright?