💡 Thesis: Unbundling the Database: Serverless, Edge and Secure
Building the future with serverless, edge, and secure databases
Part 1: The unbundling is coming
As databases move to the cloud, hosting and management will be left to others. High-performance, low-latency, scalable databases will be table stakes. Databases will “just work”, just like compute and storage. As the back-end is outsourced (that is, goes serverless), innovation and differentiation move up the value chain to query and data analysis. As this process unfolds, we see three opportunities:
1. Serverless databases
Hosting as an API or UX. The “database” splits into two: “hosting” on the back-end and “analytics/query” on the front-end. Hosting is a commodity, while analytics and query, differentiated by algorithms, become the point of leverage. New database companies look more like data analytics companies than database management system (DBMS) companies.
2. Edge databases
Hosting on edge data centres or directly on the client, potentially even synchronizing using a p2p architecture. Databases at the edge, potentially with p2p synchronization features, support extremely low-latency and offline-first applications.
3. Secure databases
Integrity and confidentiality using secure enclaves and private information retrieval (PIR) techniques. Fast private information retrieval is the ultimate query engine and, combined with other privacy-preserving tools, offers a confidential and secure computing stack.
Part 2: Trends
Cloud transition
On the one hand, 75% of databases are expected to be deployed in the cloud by 2022. On the other, cloud-native engines like Amazon DynamoDB, Microsoft CosmosDB, and Amazon Redshift accounted for just 3.7% in 2019 (the rest are servers running on someone else’s computers). More than 95% of the market is still migrating legacy databases to the cloud. Cloud deployments will reduce complexity, but migration, maintenance, and multi-vendor support will continue to make things harder in the short term. Sometimes it feels like the Cloud is over, but we’re still very early. The market is still growing by 15%+ and will reach $1.3 trillion by 2025.
The Cloud is, and will continue to be, the most important trend in enterprise IT for the next 5 years, and the implications are yet to really be felt.
Performance demands
Making sure databases are not the bottleneck for application performance is one of the key tasks, if not the key task, for database administrators (DBAs) today. On a day-to-day basis, DBAs work to improve performance by optimising everything from the server hardware and network to the database design (via tuning) and the queries themselves. The job is made more complicated by two trends: more databases and more countries. First, more databases, excitingly known as polyglot persistence (also the name of a hip-hop/acid house collective in Mexico City), means DBAs simply have more to optimise. These databases are generally deployed to handle different data types (relational, document-store, key-value, graph, etc.), which need skills the company likely doesn’t have or can’t hire for easily. Second, databases are now serving customers across the globe. Applications need to be near instantaneous, and latency needs to be driven to near zero (this also needs to be tackled at the protocol level; see Codavel, a portfolio company). This demand means organisations need to host databases on servers around the world to get close to the customer. Optimising database performance is an increasingly complex challenge that few organisations have the capacity to solve.
Privacy regulation
The way personally identifiable information (PII) is handled is now a key concern for DBAs and data scientists. The EU General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) forced compliance teams to get involved in data management for the first time. Organisations have been forced to invest in tools to govern the way data is used. Tools for data scanning, metadata management, data lineage, and consent management need to be incorporated into the business, and they affect who can access and query databases and how. The easiest way to avoid fines is to have a small group of privileged users. But the trade-off is that data scientists either do not have access to the data they need or have to go through a lengthy request process. As more granular application-level policies are adopted, more people will query databases, but today, there’s a tremendous amount of valuable data sitting idle.
Solutions
Serverless
The “as-a-service” model has come for databases as death will come for us all. Regardless of the nuances of whether it’s serverless or database-as-a-service, vendors offer to manage the database and provide a simple API to use. Pay for what you use and shift CAPEX to OPEX. It’s more convenient as well as cheaper. The vendor deals with all the back-end stuff like migrations, updates, and performance optimisation. Extending database-as-a-service into back-end-as-a-service, some vendors also manage things like user authentication, push notifications, and cloud storage. The application developer is left to just deal with the front-end and serve the customer. The logic extends further to offering the front-end-as-a-service too, but this isn’t an essay about low-code.
When databases are abstracted away, they get unbundled into hosting and indexing. The hosting is all back-end and is done by the big cloud players with scale. The index is where the next-gen databases will be built. These companies will be serverless and will compete to win customers not on performance, scale, or data types, since all of these will be table stakes. What matters is the software and algorithms on top, which should not just serve queries quickly but also offer value-add features that help companies make data-driven decisions.
A new index/query layer will be the fastest-growing segment of the database market, and companies that successfully land will be able to expand into the enterprise knowledge base market. Understanding what data people want, and for what purpose, within an organisation is a much richer starting point than, say, a CRM, email, or project management tool.
Edge databases
Edge databases are enabled by the shift to serverless. Bringing data closer to the user is a well-worn way to increase application performance. Geo-distributed databases such as CockroachDB, ElectricDB, and HarperDB run on edge data centers to improve response times and distribute computational load effectively. Edge databases address the problem of delivering high-performance, low-latency applications to users across the world. As hosting becomes distributed and moves closer to the customer, the index and query engine may end up running directly on the end-point for extreme low-latency requirements like autonomous vehicles, voice recognition, and computer vision.
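To give a rough sense of why proximity matters, the sketch below estimates a physical lower bound on round-trip time from distance alone. It assumes signals travel through optical fibre at roughly 200,000 km/s (about two-thirds of the speed of light in a vacuum) and ignores routing, queuing, and processing delays, so real latencies will be higher. The function name and the example distances are illustrative assumptions, not figures from any vendor.

```python
# Rough lower bound on network round-trip time from distance alone.
# Assumption: light in optical fibre travels at about 200,000 km/s.
FIBRE_SPEED_KM_PER_S = 200_000


def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip time in milliseconds over a straight fibre path."""
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    one_way_seconds = distance_km / FIBRE_SPEED_KM_PER_S
    return 2 * one_way_seconds * 1000


# Hypothetical distances between a user and the database serving them.
for label, km in [
    ("same metro edge site", 100),
    ("same continent", 2_000),
    ("other side of the world", 10_000),
]:
    print(f"{label}: at least {min_round_trip_ms(km):.1f} ms")
```

Under these assumptions, a single query to a database 10,000 km away cannot complete in much under 100 ms, while one served from a nearby edge site has a floor of around 1 ms. Applications that issue several sequential queries per interaction multiply that gap.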
The logical end point of the push for low latency is to remove the server from the architecture completely, so that data no longer has to travel to and from a relatively distant server. A shorter travel path is from one end-point to another in a peer-to-peer architecture. Peer-to-peer database synchronization is still nascent, but progress over the past few years has been fast. With no server to resolve conflicts, there needs to be a resolution protocol that manages consistency. The technology does exist: conflict-free replicated data types (CRDTs) are used in Azure Cosmos, Redis, Riak, and many others, but those systems rely on a server for merges. Outstanding problems that still need to be solved include the buildup of large change histories, peer-to-peer networking, and p2p data store integration.
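To make the idea of server-free conflict resolution more concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter. Each peer only ever increments its own slot, and two replicas merge by taking the per-peer maximum, so merges can happen in any order, any number of times, and still converge. The class and peer names are hypothetical, and this is a toy illustration rather than the data structure used by any of the products mentioned above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GCounter:
    """Grow-only counter CRDT: one slot per peer, merged by element-wise max."""

    peer_id: str
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # A peer may only grow its own slot; this is what makes merging safe.
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.peer_id] = self.counts.get(self.peer_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so no central server is needed to decide who "wins".
        for peer, n in other.counts.items():
            self.counts[peer] = max(self.counts.get(peer, 0), n)

    @property
    def value(self) -> int:
        return sum(self.counts.values())


# Two hypothetical peers update independently while offline...
alice = GCounter("alice")
bob = GCounter("bob")
alice.increment(2)
bob.increment(3)

# ...then exchange state directly, in either order, possibly more than once.
alice.merge(bob)
bob.merge(alice)
alice.merge(bob)  # repeating a merge changes nothing

assert alice.value == bob.value == 5
```

Richer CRDTs for text, lists, and maps follow the same principle but need to track more metadata, which is one source of the large change histories mentioned above.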
There are projects trying to build pieces of a p2p software stack. The Solid project is developing protocols for p2p interactions. Automerge, ObjectBox, and OrbitDB (we’re invested in some of those) are working to develop fundamental technologies to bring p2p databases into production. CouchDB and PouchDB are databases designed not specifically for p2p apps but for offline and local-first applications. We’re seeing the most progress in the Web3 community, with its vision to remove privileged administrators from the stack. Even though there are few full-fat p2p databases today, all of the pieces are progressing and will be glued together in a new bundle soon.
Secure databases
Secure databases refer to confidential databases and private querying, or more broadly the use of cryptography to protect against data leakage when accessing a database. These are complementary techniques: hardware like secure enclaves potentially enables confidential databases, while multi-party computation (MPC) techniques like private information retrieval (PIR) enable private querying. Edgeless Systems offers EdgelessDB, which they call the first confidential database. It’s a SQL database that runs entirely inside runtime-encrypted Intel SGX enclaves. This is different to most secure database solutions today, which encrypt data for storage and use a hardware security module (HSM) to store the corresponding cryptographic keys.
But ideally we wouldn’t rely on trusted hardware. Cryptographic protocols like private information retrieval (PIR) address the problem of retrieving an element from a database without revealing to the database administrator any information about which element was retrieved. Several PIR schemes attempt to make this practical, but it’s a hard problem (Zama, a portfolio company, partially addresses this). It will be a long time before PIR performance is practically viable, which is why few people have it on their radar. And yet practical implementations will be transformative, making it possible for databases to be hosted on untrusted servers or clients. As with all privacy-enhancing technologies, the implications are broader than just security. When people are not worried about data leakage, data can flow more smoothly.
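As a small illustration of what private retrieval means, the sketch below implements a classic two-server, XOR-based PIR construction. It assumes two servers that hold identical copies of the database and do not collude; each sees only a random-looking bit vector and so learns nothing about which record the client wants. This is a toy under those assumptions: the query is as long as the database, the function names are hypothetical, and it is not the single-server or encryption-based approach taken by any company mentioned here.

```python
import secrets
from typing import List, Tuple


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_queries(n: int, index: int) -> Tuple[List[int], List[int]]:
    """Client side: build one bit vector per server.

    The first vector is uniformly random; the second is identical except
    that the bit at `index` is flipped. Each vector on its own is uniformly
    random, so a single server cannot tell which record is wanted.
    """
    q_a = [secrets.randbelow(2) for _ in range(n)]
    q_b = q_a.copy()
    q_b[index] ^= 1
    return q_a, q_b


def server_answer(db: List[bytes], query: List[int]) -> bytes:
    """Server side: XOR together every record whose query bit is 1."""
    acc = bytes(len(db[0]))  # all-zero string of record length
    for record, bit in zip(db, query):
        if bit:
            acc = xor_bytes(acc, record)
    return acc


def retrieve(db_a: List[bytes], db_b: List[bytes], index: int) -> bytes:
    """Client side: query both replicas and combine the two answers."""
    q_a, q_b = make_queries(len(db_a), index)
    # Every record selected by both queries cancels out under XOR,
    # leaving only the record at `index`.
    return xor_bytes(server_answer(db_a, q_a), server_answer(db_b, q_b))


# Hypothetical database of equal-length records, replicated on two servers.
DB = [b"rec0", b"rec1", b"rec2", b"rec3"]
assert retrieve(DB, DB, 2) == b"rec2"
```

Each server still has to touch every record to answer a single query, which hints at why PIR performance remains the main obstacle to practical use.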
Unbundling the database
The transition to the cloud, performance needs, and privacy regulation are driving the unbundling of the database. What we think of as the “database” is being unbundled into commodity hosting and increasingly valuable private querying and data analytics services.
Serverless databases will dominate the market, offering much-needed simplicity.
Databases will move to the edge, all the way to the end-point for some low-latency and high-privacy applications.
Secure enclaves and private information retrieval will allow for highly secure databases with data integrity and confidentiality.
Thanks to Luis Shemtov, Martin C. Smith, James Arthur, Andrew McDonough, Ole Moeller-Nilsson, Emil Eifrem, and Martin Pompery for conversations and feedback on this essay.