Opinion
Data infrastructure: The picks and shovels of the AI gold rush
"While we believe it is too early to call what AI will look like in a few years’ time given the mind-blowing pace of innovation, it is clear to us that a solid data foundation is key to enabling its progress," writes Itay Inbar of Greenfield Partners
Much has been written about AI over the past year. The progress ushered in by Generative AI models such as GPT-4 and Stable Diffusion is driving a new era of technological innovation and has kicked off an “AI Gold Rush.” As a result, companies are competing for a piece of a massive pie, estimated by Goldman Sachs at over $150 billion in annual revenues.
In a gold rush, invest in picks and shovels
During the Gold Rush of the late 19th century, it was clear there was gold in the ground, but the success of any one gold mining operation was highly uncertain. One thing, however, was certain: the demand for picks and shovels.
Similarly, in the AI gold rush, a lot of uncertainty remains as to what the AI value chain will look like in just a few short years – will the leading foundation models be closed- or open-sourced? Will startups be able to successfully compete with their own novel models, or will we all be using fine-tuned flavors of models built by OpenAI and the big cloud service providers (CSPs)? Will the AI application layer spawn a cohort of new AI-native software providers, or will the current leaders be able to successfully embed AI into their offerings to stay on top?
Whatever the future of AI may be, one certainty remains: the importance of data infrastructure in enabling this revolution – the “picks and shovels” of the AI gold rush.
Data infrastructure is the key enabler of AI at scale
While AI models form the cornerstone of this recent progress, scaling AI requires a robust data foundation for training models and serving them effectively.
This process involves collecting and storing raw data, utilizing computational power to transform data and train models, and processing and ingesting data in real time for inference. Ultimately, turning raw data into AI insights in production is complex and depends on strong data infrastructure. Data engineering teams will play a crucial role in enabling AI and must lean into an ever-improving set of tools to address rapidly growing volumes of data, larger models, and the need for real-time processing and movement of data.
Data infrastructure has transformed over the past decade irrespective of AI, driven by the shift to the cloud and a greater focus on analytics. This transformation has created huge commercial successes with the likes of Snowflake, Databricks, Confluent, Elastic, MongoDB, and others.
Today, storage and compute limitations have largely been erased thanks to the cloud. As a result, the leading trends revolve around developing processes that make the data universe faster, more reliable, more efficient, and more impactful, all critical elements for successful AI deployment.
Key trends shaping data infrastructure
We expect the following areas to play a key role in shaping the next generation of data infrastructure, and highlight some of the promising Israeli startups within them:
Advances in datacenter hardware – As models and data grow in size and the volume of inference expands, faster compute capabilities are required to keep models feasible from both speed and cost perspectives. A new cohort of dedicated hardware accelerators for both ML and data transformation/querying, such as Neuroblade, NextSilicon, and Speedata, is vying to challenge incumbent chipmakers, while companies like Run:AI are focused on virtualizing GPU clusters to extract greater utilization from existing resources.
In parallel, novel software-defined storage architectures, such as Vast Data (a Greenfield Partners portfolio company), and faster networking and I/O technologies, such as the silicon photonics developed by DustPhotonics (a Greenfield Partners portfolio company), are evolving to meet faster data transfer requirements.
Accelerating compute engines – As AI becomes more widespread, we anticipate an accelerated convergence of cloud data warehouses and data lakes towards a unified data lakehouse architecture. This architecture provides the flexibility to support a wide range of use cases and compute engines while maintaining the necessary structure.
While Snowflake, Databricks, and the CSPs are by far the market leaders in the space, we maintain that there remains an opportunity for newer players with compute engines optimized for specific tasks to take additional share, and we expect new entrants to tackle this challenge.
Real-time data retrieval and processing – Enabling low-latency model inference and feedback requires high-performance data retrieval and pipelines. Several emerging tools address this need: new categories of databases optimized for ML, such as Pinecone’s vector database; faster caching enabled by the likes of Redis and DragonflyDB; real-time datastores such as market leaders Pinot, Druid, Clickhouse, and Materialize, as well as Israeli Epsio; and streaming data pipelines and in-stream processing built on popular open-source projects such as Kafka (commercialized by Confluent) and Flink, among other approaches. These tools enable rapid movement of data, a key focus for optimizing model performance.
Increased data governance, security, and observability – With the growing complexity of companies' data infrastructure, and with more sources feeding it and more users accessing it, data governance is becoming an increasingly significant focus area for data engineers. Observability, cataloging, privacy, and security are just some of the areas growing in importance as companies seek more control and visibility over their data stacks. This, in turn, has spawned promising Israeli companies leading the charge, such as Monte Carlo, Illumex, Sentra, Cyera, Dig Security, and others.
Evolving MLOps landscape – Though MLOps can perhaps be thought of as a separate category from data infrastructure, it is still well worth a mention in this context. As a connecting thread between the underlying data platform, model development, and model deployment, MLOps tools are gaining in importance (just as DevOps did in traditional software development) to streamline the process of developing and deploying models. Following the rapid evolution of AI models in recent months, we expect this category to experience significant shifts, but we recognize its importance in the years to come and see significant promise in several Israeli startups such as Qwak, Aporia, Deci, and others.
We are living through a historic technological shift that will have widespread implications for business and for society more broadly. While we believe it is too early to call what AI will look like in a few years’ time given the mind-blowing pace of innovation, it is clear to us that a solid data foundation is key to enabling its progress and will be crucial to those building AI applications, just as picks and shovels were essential during the Gold Rush.
Itay Inbar is a Senior Associate at Greenfield Partners