☎️ Interview: Rick Hao, Partner at Speedinvest, on the State of Privacy-Enhancing Technologies #005
On more data as the driver of PET adoption; why healthcare needs its own infrastructure; and why machine learning is likely to be a catalyst for adoption
The premise of Collaborative Computing is that when data can be shared internally and externally without barriers, the value of all data assets can be maximized for private and public value.
To explore this vision more deeply, I spoke with Rick Hao, a deep tech investor at Speedinvest, a venture capital fund with more than €450m to invest in early-stage tech startups across Europe. Highlights include:
Why getting more data will continue to be a business driver for the foreseeable future, despite frontier trends toward cost-efficient and more data-efficient algorithms;
Why healthcare is likely to need a different data infrastructure than other markets; and
Why machine learning will be the main driver of data sharing tools.
"So I'm thinking of data sharing as more of a data acquisition strategy. If data sharing or data aggregation platforms are framed in that way, then there is a massive market."
---
Let’s start at the top; how do you think about investing in the data-sharing space? Is the governance, sharing and monetization framework useful?
Broadly, it’s a useful way to think about data management in the enterprise. I would say regulations and privacy concerns mean data governance is far and away the most important of those categories today. One piece is either missing from that framework or sits under sharing: I think a lot about enriching the data pipeline for machine learning algorithms. The reality is that, today, more data means more accurate algorithms, which makes companies want to collect or get access to more data, even if that is third-party data. So I'm thinking of data sharing as more of a data acquisition strategy. If data sharing or data aggregation platforms are framed in that way, then there is a massive market.
Financial services and healthcare have a huge need for this stuff. In healthcare, though, we are a little further behind because there just isn’t any data infrastructure to work with. Even if healthcare providers want to make data available to researchers, or pool datasets and share the model, they just can’t do that today. Most use cases in healthcare, like genomic sequencing and drug discovery, can’t just take off-the-shelf tools designed for manufacturing or commerce markets. The workflow process, data formats, and size of datasets all require a vertical-specific approach. Speedinvest, as an investor, looks at vertical-specific approaches.
The idea of external data sharing is laughable for many in the enterprise, in the same way open-source server software was laughable in the 1990s, though we are starting to see a shift. What might it take to make the idea of data sharing more acceptable?
Maybe, but open-source software was laughable mainly because of security considerations; that’s not quite the challenge with data sharing. Open-source server software had lots of measurable benefits, including being cheaper. With data sharing, the benefits are more abstract: there are fewer concrete ROI examples and less of a clear business case. The benefits are less clear and the risks are obvious, and that makes it hard. Take synthetic data: in theory there are lots of benefits, from reducing the risk of working with personally identifiable information (PII) to the ability to generate more of it easily. But when you apply those benefits to specific use cases, the ROI is harder to quantify. The way to do it with data sharing is really to tie it to algorithmic performance, which should be tied to revenues. You can then say: ”enlarging the training dataset by X will improve the algorithm by Y, and therefore increase revenues by Z”. For AI use cases like language and computer vision, this benefit can be strong.
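To make that X → Y → Z chain concrete, here is a back-of-the-envelope sketch. The function and every number in it are illustrative assumptions for the sake of the example, not figures from the interview.

```python
# Hypothetical sketch of the "more data -> better model -> more revenue" chain.
# All parameters are illustrative assumptions, not benchmarks from the interview.

def roi_from_more_data(extra_records: int,
                       accuracy_gain_per_10k: float = 0.002,
                       revenue_per_accuracy_point: float = 50_000,
                       data_acquisition_cost: float = 20_000) -> float:
    """Estimate net return from enlarging the training set.

    extra_records              -- X: additional third-party records acquired
    accuracy_gain_per_10k      -- Y: assumed accuracy lift per 10k extra records
    revenue_per_accuracy_point -- Z: assumed revenue per percentage point of accuracy
    """
    accuracy_gain = (extra_records / 10_000) * accuracy_gain_per_10k
    incremental_revenue = (accuracy_gain * 100) * revenue_per_accuracy_point
    return incremental_revenue - data_acquisition_cost


if __name__ == "__main__":
    # e.g. 100k extra records -> +2pp accuracy -> +$100k revenue -> $80k net
    print(f"Net ROI: ${roi_from_more_data(100_000):,.0f}")
```

The point of the exercise is less the specific numbers than the structure: if each link in the chain can be measured, the business case for acquiring or sharing data stops being abstract.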
When you think about investing in the enterprise data infrastructure space, how much weight do you place on the technology vs. the go-to-market plan?
As a true deep tech investor, I’m always looking for some innovative tech. There has to be a real, hard engineering challenge being addressed that unlocks a huge market, so the technical risk is worth taking as an investor. I’m very hands-on and will want to at least try the product and take a look at the code. But the reality is I’m looking for some early sign of product-market fit, especially with privacy-preserving technologies. Some PETs, like federated learning and multi-party computation, don’t require huge technical breakthroughs to scale, unlike, say, fully homomorphic encryption. So it depends on the technique, but, generally, we are at the stage now where I want to see that there is a market pull for the solution. The key for data infrastructure is usability. It is not a straightforward deployment, and so integration is crucial. Of course, the go-to-market might change, but you can see very early on if a team understands product and solving customer pain points rather than just building technology.
What are your thoughts on distributed computation or federated learning for enterprise? The pitch of running analytics locally without sending data back to a remote server feels like a compelling proposition. Do you think it will catch on?
Yes, it will be one of the hottest topics in machine learning in the next few years. Privacy-preserving machine learning is not just a buzzword; it will start to reach the C-suite as a way of addressing privacy concerns. It’s still a relatively early topic for enterprise: many firms are still in the early stages of the data management journey and haven’t reached the stage of doing real machine learning yet, so for them this is a theoretical problem. That raises market size and market growth questions. I expect to see federated learning or distributed computation projects take the open-core business model to start gaining adoption with developers. The focus has to be on making this new distributed workflow as easy to use as possible.
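For readers less familiar with the workflow the question describes, the following is a minimal federated averaging sketch, assuming a simple linear model and synthetic data split across three clients. It is illustrative only, not any particular vendor's or framework's implementation.

```python
# Minimal federated averaging (FedAvg) sketch on synthetic data.
# Assumption: a least-squares linear model and three simulated data silos.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """Each client trains on its own data; raw records never leave the client."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient
        w -= lr * grad
    return w

def federated_average(client_weights, client_sizes):
    """The server aggregates only model weights, weighted by client dataset size."""
    sizes = np.array(client_sizes, dtype=float)
    return np.average(client_weights, axis=0, weights=sizes)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):                           # three simulated data silos
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    clients.append((X, y))

global_w = np.zeros(2)
for _ in range(20):                          # communication rounds
    updates = [local_update(global_w, X, y) for X, y in clients]
    global_w = federated_average(updates, [len(y) for _, y in clients])

print("learned weights:", global_w)          # should approach [2, -1]
```

The design point is that only model weights travel to the aggregator on each round; the training records themselves stay inside each silo, which is what makes the pitch attractive for regulated data.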
The crypto market has been pushing forward the state of the art in private and collaborative data sharing with things like zero-knowledge proofs and MPC. Do you think that, because this is happening in the crypto world, it is being ignored or underestimated?
My personal view, beyond the technology itself, is that zero-knowledge proofs have value. They are useful for so many use cases, but we only see them applied to blockchain technology today. That seems to be a strong use case, one that needs a scalable way to validate transactions, but we haven’t seen many other really clear industry use cases. I don’t think it’s being underestimated as such; it’s just that either companies aren’t taking the technology and applying it to industry use cases yet, or the pain points just aren’t strong enough yet. As I mentioned, privacy-preserving machine learning is a potential use case, but we are very early. The question I would ask is: is there an alternative and easier way to solve the same problem without using complicated cryptography?
For data markets to be realised, we will need strong privacy tools, but also a way to track, pay and exchange the data quickly. Can you imagine widespread data markets not built on blockchain technology? If not on open infrastructure, how do we keep the markets themselves open? Or should we be comfortable repeating the same mistakes as Facebook, when it comes to data rights?
Right now you can’t look past the privacy problems. We are still far from protecting privacy, but even if you suppose data assets are private, you still have the challenge of ownership or rights attached to the assets being sold. Most data has a long value chain, and it’s rarely as simple as a single creator and owner. So how can we attribute ownership or property rights to data assets that might be traded hundreds or thousands of times in a market? Data markets are part of the future, but they will have to be limited to data assets that don’t have long value chains and contested ownership stakes. Geo-satellite data is very specific, so a market around it could work. Financial data is likely to end up in a market, too.
In terms of open infrastructure, generally, the world isn’t that interesting if big tech just provides API access to huge trained models, something like GPT-3 or Codex. I don’t think it will go that way anyway, as the future of ML is cost-efficient models, not just massive pre-trained models and datasets. That future means the tendency toward consolidation and monopoly might reverse.
Finally, the idea of ‘collaborative computing’ is a computing environment in which strong privacy tools are built into software and data infrastructure, letting anyone anywhere package up and trade data on a global market, much in the same way encryption with SSL allowed anybody anywhere to securely transact with someone else on the web. How much does this idea resonate?
It resonates, but it's hard to see a pathway for this in the medium term. There are just so many moving pieces that go into creating a marketplace for an asset that isn’t traded today. I think the big open question is how to resolve the ownership piece. Very few data assets are simple enough to be traded easily.