An Assay of the Modern AI Tech Stack
Author’s Note: An Assay is my version of a VC investment thesis. Read the Prelude section of An Assay of the Creator Economy to learn more.
First and foremost, I want to give a big thank you to Max Rimpel, Derick En’Wezoh, Michael Bervell, Palak Goel, and Sam Udotong for reviewing earlier versions of this piece. A special shoutout to Hassan Khajeh-Hosseini, Nader Khalil, Kashish Gupta, Chris Lu (and also Sam once again!) for allowing me to cover your respective startups over the years.
Introduction
Generative AI was heralded by the seminal 2017 paper “Attention Is All You Need” by Ashish Vaswani et al., but saw some of its first popular commercial applications in the form of OpenAI’s GPT-3 in 2020. Since then, generative AI has sucked up all of the oxygen in the room, completely displacing web3 and crypto in tech discourse.
However, is it here to last?
Rackspace Technology states that 90% of their 1,420 survey respondents indicated their modernization efforts are motivated by their desire to use AI technologies. However, in a separate Rackspace report, “The 2023 AI and Machine Learning Research Report,” the share of participants citing the cost of implementing AI as an issue grew from 26% in December 2020 to 57% in December 2022.
Thus, the limelight of AI hype casts a long shadow over the overlooked but critical data infrastructure needed to make mass adoption of machine learning applications possible. The overwhelming majority of SMBs and non-tech Fortune 500 enterprises, still struggling to solidify the basics of their data infrastructure, are not currently able to implement the latest technologies from AI startups. Several key trends, outlined below, are seeding AI/ML’s proliferation.
This Assay will cover a broad overview of the 180+ hardware and software startups comprising the Modern AI Tech Stack: Hardware, Frameworks, Data Infrastructure, Machine Learning Operations (MLOps), Foundation Models, and Applications. The Assay identifies ten key trends and eight novel opportunities for founders to innovate and help realize the AI-generated future of tomorrow.
Overarching Trends & Enduring Tailwinds List:
- Cloud computing spend continues to experience massive growth. Gartner forecasts that worldwide end-user spending on public cloud services will total $679 billion in 2024 and is projected to exceed $1 trillion in 2027.
- However, organizations continue to focus on optimizing cloud computing spend. Virtana states that 94% of IT leaders report their cloud storage costs are rising, and 54% confirm their storage spend is growing faster than overall cloud costs. This trend is promising for startups like Infracost, which helps businesses reduce their cloud spend, and Brev, which helps companies access low-cost GPUs to train, test, and deploy their ML/AI models.
- Small-to-medium-sized businesses make up nearly 40% of U.S. GDP. McKinsey notes they make up “half of the roughly $370 billion in overall tech spending.”
- Despite the current slowdown in economic activity, it hasn’t stopped SMBs from adopting new technologies. Salesforce reports that in 2024, “60% of small business owners expect to increase their budget this year, with 50% planning to allocate that budget toward technology and infrastructure. 35% are excited to implement new tech or update tech for their business in 2024 — and of those, 49% are planning to implement new productivity and collaboration tools, as well as other software tools (53%) — reflecting a clear inclination toward maximizing efficiency with limited resources.”
- The modern data stack (MDS) will consolidate in the wake of massive tech layoffs and SMB/enterprise customer churn. Anna Geller, a popular AWS cloud engineer and subject matter expert in data infrastructure, captures this shift as “practitioners are incentivized by their management to pick a tightly integrated all-in-one product to better control costs (one vendor and contract to negotiate).” In addition, Hightouch reports that “…less than 50% of our enterprise customers use another major solution in the MDS.”
- MDS software vendors are beginning to target customers that store most of their data within their on-premise computing infrastructure. Atlan observes that more MDS providers are beginning to target enterprise customers that use Oracle ($320B market cap) and SAP ($229B market cap) to store and analyze their data.
- Despite cloud computing’s rapid growth, on-premise data center spending is still a massive market in its own right. DCD calculates that the current annual enterprise data center spend is roughly $100B (based on statistics from Synergy Research Group).
- As companies of all sizes seek to implement AI/ML solutions into their core business, they struggle to trust their data. Precisely’s 2023 Data Integrity Trends And Insights Report states that 70% of those who struggle to trust their data say data quality is their biggest issue.
- In its 2022 State of Data Governance and Empowerment Report, Erwin reports that 42% of all respondents indicated at least half of their data was “dark,” meaning they had limited visibility into and access to necessary information.
- Data downtime is getting worse, not better. Monte Carlo’s 2023 The State of Data Quality survey reveals that data downtime, or the period where critical business data is inaccessible, doubled from last year among its survey respondents.
Promising Opportunities List:
Data Infrastructure:
- Data quality services for SMBs
- Data quality products for non-technical users
- Connectivity between different systems and services
- Data contracts for enterprises
MLOps:
- Monitoring data drift and performance degradation
- Process sustainability
- Lower cost of implementation
Application Layer:
- Consumer & Prosumer
Hardware
The software side of AI usually gets the limelight, but it wouldn’t be possible without the hardware behind it. Hardware has been abstracted away in the form of the major cloud platforms: Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. Here are the common integrated circuits, or chips, used in powering today’s AI applications at the hardware level.
- Central Processing Units (CPUs): These chips are designed to handle a wide variety of computing tasks. While they can be used to train machine learning models, they are not optimized for machine learning calculations.
- Graphics Processing Units (GPUs): Originally designed for rendering computer graphics, GPUs turned out to be well suited for machine learning as well because they can perform parallel processing and accelerate technical calculations.
- Tensor Processing Units (TPUs): These are application-specific integrated circuits (ASICs) developed by Google for machine learning operations, such as neural network calculations.
- Field Programmable Gate Arrays (FPGAs): These are integrated circuits that can be reprogrammed after manufacturing for specific tasks, unlike ASICs.
There are two major bottlenecks to AI’s greater adoption: the first is memory; the second is computation. All modern chip designs separate memory and computing based on the von Neumann architecture. While computing power can be scaled by adding more cores to a chip’s processor, memory does not scale the same way; there are spatial and energy limits to adding more memory to a chip. On the energy side, non-trivial amounts of energy are lost as heat when data moves between the memory unit and the processing unit. Because processing power has grown faster than memory, processors cannot be fully utilized; they are limited by how quickly memory can be read and written, which is known as the memory wall problem. In addition, bandwidth limits how much data can move between the memory and processing units at once.
The largest AI models, which have hundreds of billions of parameters, require hundreds of chips, each containing gigabytes of memory, to store and retrieve training and test data as the model is being trained.
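To make the memory pressure concrete, here is a rough back-of-envelope sketch in Python. The parameter count, precision, and per-chip memory figures below are illustrative assumptions, not vendor specifications:

```python
# Rough back-of-envelope estimate of how many accelerators are needed just to
# hold a large model's weights. All numbers below are illustrative assumptions.

params = 175e9           # assume a 175-billion-parameter model
bytes_per_param = 2      # assume 16-bit (fp16/bf16) weights
memory_per_chip_gb = 80  # assume ~80 GB of on-chip memory per accelerator

weights_gb = params * bytes_per_param / 1e9
chips_needed = weights_gb / memory_per_chip_gb

print(f"Weights alone: ~{weights_gb:.0f} GB")
print(f"Accelerators needed just to store weights: ~{chips_needed:.0f}")
# Training also requires optimizer state, gradients, and activations, which
# multiply this footprint several times over and push chip counts far higher.
```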
Novel chip design and production are capital-intensive initiatives that only large-scale enterprises such as Intel, AMD, and NVIDIA can successfully pursue. Moreover, physical chip fabrication is dominated by Taiwan Semiconductor Manufacturing Company (TSMC), which produces roughly 90% of the world’s most advanced chips. However, a few notable hardware startups are aiming to create new AI chip technologies:
- Silicon photonics technology (using light rather than electricity to move data between memory and processing units): Lightmatter, Celestial AI, Ayar Labs
- AI-specific chips: Cerebras Systems
- Connectivity (software and cable hardware serving vendors and data centers): Astera Labs
- Edge AI: EdgeQ, Sima.ai
- Other: Ampere Systems, D-Matrix
Frameworks
ML frameworks allow users to quickly build, test, and refine their models without implementing the underlying algorithms, math, and statistics from scratch. There are countless ML frameworks out there, but some of the most popular include PyTorch, TensorFlow, Keras, Scikit-learn, and ONNX.
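As a minimal illustration of that convenience, the sketch below trains and evaluates a small classifier with scikit-learn in a handful of lines. The dataset and model choice are arbitrary placeholders; any of the frameworks above offers a similar level of abstraction:

```python
# Minimal sketch: build, train, and evaluate a model without writing any of
# the underlying math by hand. Dataset and model choice are arbitrary.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100)  # the framework supplies the algorithm
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```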
Given that the most popular frameworks are open-source and ubiquitous, there doesn’t seem to be a potential market here, as most developers will use these widely popular libraries. It’s worth noting that Modular, one of the new entrants to the framework space, was backed by General Catalyst and others to the tune of $100M in a Series B funding round.
Data Infrastructure
Data infrastructure relates to the hardware, software, networking, services, processes, policies, and governance mechanisms needed to share and consume data within a company. Only hardware and software will be explored in detail in this section.
Beyond chip hardware, the main bottleneck most firms already face, or soon will, in adopting AI/ML solutions is their outdated data infrastructure. SMBs and non-tech enterprises fail to be data-driven because they do not trust the quality of their data. Without trust in the quality of their data, they cannot depend on the results of operations or applications that leverage such data. Garbage in; garbage out. Therefore, most firms, as much as they profess their desire to incorporate AI/ML into their core businesses, are realistically unable to do so.
Two major areas to overcome for AI’s mass proliferation are data quality and data governance.
Data quality is the first and most important bottleneck to ensuring that AI/ML operations and applications are successful. When I was a PM at an early-stage startup, I spent hours writing SQL queries to build a basic dashboard in Metabase. I wanted to compare my dashboard results to our transaction data from Stripe. I couldn’t get the two to match; therefore, I spent hours poring through the rows of the Metabase data to find any errors or invalid entries. I eventually found them and adjusted the SQL queries to match the Stripe data, but it took a long time. If I, as a single PM trying to build a simple dashboard, couldn’t trust the data collected within my own systems, how could my former company reliably use AI/ML in future products?
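Here is a lightweight sketch of the kind of reconciliation I was doing by hand: compare daily totals from an internal table against a payment processor’s export and surface the days that disagree. The file names and column names are hypothetical:

```python
# Hypothetical reconciliation of internal transaction data against a payment
# processor's export. File and column names are made up for illustration.
import pandas as pd

internal = pd.read_csv("internal_transactions.csv", parse_dates=["created_at"])
stripe = pd.read_csv("stripe_export.csv", parse_dates=["created_at"])

internal_daily = internal.groupby(internal["created_at"].dt.date)["amount"].sum()
stripe_daily = stripe.groupby(stripe["created_at"].dt.date)["amount"].sum()

# Missing dates on either side count as zero; flag anything beyond rounding noise.
diff = internal_daily.subtract(stripe_daily, fill_value=0)
mismatches = diff[diff.abs() > 0.01]

print(f"{len(mismatches)} days where the two sources disagree:")
print(mismatches)
```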
Assuming a team, organization, or company has clean data, the next major challenge is data governance. Who gets access to what data and why? How are the permissions systems set up? Where does the data live? Who is responsible for it over time? Within many companies, the answers are a mix of informal procedures and formal systems gatekeeping access to key sets of siloed data. I faced this firsthand at Boeing, working on the Core Analytics team within the Customer Support Center in Seal Beach, CA. I spent two months on a project where I had to produce process diagrams for the workflows my team used to deliver data packages to other consumers within the organization. I remember spending a lot of time drawing basic flowchart diagrams in Microsoft Visio or examining representations of databases and tables in Visual Studio. One of the more time-consuming tasks was finding and talking with subject matter experts on other teams who controlled access to critical information that my team needed. On average, it would take an individual weeks or months to get access to that data, depending on what was being requested.
Software-driven data governance can greatly help reduce the time and labor needed to provision access to data for individuals and teams within large organizations.
Hardware:
Most startups rely on one of the big three cloud platforms to provide the computing needed to run certain sections of their data infrastructure: AWS, GCP, or Azure. These platforms also offer additional tools and services to execute key tasks in an ML data pipeline. In conjunction, major players like Databricks and Snowflake offer data lakehouses and data clouds, respectively, to centralize structured and unstructured data and make it more accessible to more teams, especially ML teams. Hardware mostly matters to larger enterprises that routinely spend millions of dollars in capital expenditures on their on-premise computing infrastructure; most startups simply use one or a mix of the big three cloud vendors to meet their hardware needs.
Below is a breakdown of the software used in the various layers of the Modern AI Tech Stack:
Software:
Data Infrastructure Layer:
- Data Quality & Monitoring: Soda, Superconductive, Elementary Data, Metaplane
- Gathering: OpenML, OpenCV
- Transformation: dbt, Fivetran, Talend, Singer, Hightouch, Census
- Labeling: Scale AI, Sagemaker, Labelbox, Google AI, Figure Eight
- Version Control: Pachyderm, dataiku, dvc
- Storage: Databricks, Snowflake, Redshift, BigQuery
Model Layer:
- ML Frameworks: PyTorch, Keras, TensorFlow, Scikit-learn, ONNX
- Feature Engineering: Tecton, Feast, Databricks
- Monitoring: Evidently AI, Censius, Comet, Arize
Application Layer:
- Deployment and Serving: Kubeflow, DataRobot, Cortex, TFX, TorchServe, PerceptiLabs.
A common challenge that larger organizations have is that their data infrastructure is not fully connected or accessible in a cohesive way for the relevant teams. The lack of connections is a bottleneck between the production and consumption of data. Enterprises looking to get the most out of AI will require robust data governance mechanisms and high-throughput networking technology to allow the necessary systems to talk to one another.
One trend to consider is the unbundling of the responsibilities a typical software engineer has concerning data applications. A simple example would be a software engineer being asked by their marketing team to build (and maintain!) an ETL pipeline connecting data collected in Snowflake to their HubSpot CRM. Developing these pipelines takes significant time and effort away from the software engineer’s core responsibilities.
Now, there’s a tool called Hightouch that makes creating and maintaining this pipeline easier for the engineer. From this example, there are two trends to pay attention to: one, the proliferation of Snowflake as a data warehouse, and two, the time the engineer wins back for higher-priority tasks by not having to build a custom pipeline to move data from one piece of software to another. There’s an opportunity for software to help unbundle or distribute the responsibilities data engineers have in maintaining their organization’s data infrastructure, in the same way Hightouch did for software engineers. Moreover, founders should focus on building complementary technologies to dominant incumbents such as Snowflake (as Hightouch did).
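For context, here is a hedged sketch of what such a warehouse-to-CRM sync looks like when built by hand, which is exactly the work tools like Hightouch remove. The query, the CRM endpoint, and the field names are hypothetical:

```python
# Hand-rolled "reverse ETL": pull rows from a warehouse and push them to a
# CRM's REST API. Query, endpoint, and field names are hypothetical.
import requests
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="...",
    warehouse="ANALYTICS", database="PROD", schema="MARKETING",
)
cur = conn.cursor()
cur.execute("SELECT email, lifetime_value FROM customer_summary")

for email, ltv in cur.fetchall():
    # Hypothetical CRM endpoint; a real integration would also need batching,
    # retries, deduplication, auth management, and rate-limit handling.
    requests.post(
        "https://api.example-crm.com/v1/contacts",
        json={"email": email, "properties": {"lifetime_value": ltv}},
        headers={"Authorization": "Bearer <token>"},
        timeout=10,
    )
```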
Potential Opportunities:
Resolving data quality issues for SMBs: Most organizations have their data spread across many systems, applications, and services. Each one of these silos may have data that is missing key information, which can affect the overall performance of an ML model if such data is used in training, testing, or validation datasets. Tools that can help address data quality issues upfront can save time and prevent future headaches.
- Companies like Soda, Superconductive (Great Expectations), and Elementary Data all service large enterprises. SMBs have largely gone ignored. There’s a clear opportunity for startups to help smaller businesses trust their data. Metaplane is one of the few (if only) solutions that serve small teams.
- McKinsey notes that SMBs make up “half of the roughly $370 billion in overall tech spending.”
- Precisely’s 2023 Data Integrity Trends And Insights Report states that 70% of those who struggle to trust their data say data quality is their biggest issue.
Data quality products for non-technical users: As someone who spent an excessive amount of time manually reviewing rows of customer data, I know that a tool allowing non-technical team members to set checks reflecting key business logic would save them time and money when investigating the integrity of their data (a small sketch of such checks follows the stats below).
- Lightup.ai is a clear example (and currently the only solution) of a no-code offering in this space.
- AWS reports that more than half of SMBs don’t have the knowledge or the experience to use data to drive growth.
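To illustrate what “checks that reflect key business logic” could look like, here is a small hypothetical sketch in which the rules are simple declarative entries a non-technical user could maintain, evaluated against a customer table. The column names and rules are assumptions:

```python
# Hypothetical declarative data-quality rules, evaluated against a customer
# table with pandas. Column names and rules are made up for illustration.
import pandas as pd

rules = {
    "amount is never negative":       lambda df: (df["amount"] >= 0).all(),
    "email is always present":        lambda df: df["email"].notna().all(),
    "signup_date is never in future": lambda df: (
        pd.to_datetime(df["signup_date"]) <= pd.Timestamp.now()).all(),
}

customers = pd.read_csv("customers.csv")
for name, check in rules.items():
    status = "PASS" if check(customers) else "FAIL"
    print(f"[{status}] {name}")
```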
Software-based Data Governance: Data governance can and should be driven by software. As organizations get larger and more siloed, having a common platform where data access is shared between producers and consumers is critical for allowing teams to rapidly locate and leverage the data they need for their ML/AI applications and operations.
- Moreover, almost half of all organizations surveyed by Rackspace reported that the technical challenges of modernizing their infrastructure were a significant barrier to AI adoption.
- Precisely reports that 57% of survey respondents say data governance results in better analytics and insights.
Data contracts for enterprise teams: Defining how data is shared between teams whose work relies on collaboration is critical to ensuring that trust in the integrity of the data is maintained when either side changes its data output. Data catalog companies such as Atlan are among the first to provide a version of this product (a minimal sketch of a data contract follows the stats below).
- Data contracts between teams will help reduce data downtime and help bridge the gap between data producers and data consumers. Monte Carlo reports that 74% of respondents said data consumers were the first to identify poor data quality issues.
- According to Metaplane, data contracts are predicted to have greater adoption in 2024.
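A minimal sketch of what a data contract between a producing and a consuming team might look like, expressed here as a pydantic schema that the producer’s pipeline validates against before publishing. The field names, types, and validation rule are assumptions for illustration:

```python
# Hypothetical data contract: the producing team promises this shape, and the
# pipeline rejects records that break it before consumers ever see them.
from datetime import datetime
from pydantic import BaseModel, field_validator

class OrderEvent(BaseModel):
    order_id: str
    customer_email: str
    amount_usd: float
    created_at: datetime

    @field_validator("amount_usd")
    @classmethod
    def amount_must_be_non_negative(cls, v: float) -> float:
        if v < 0:
            raise ValueError("amount_usd must be non-negative")
        return v

# Producer side: validate before publishing downstream.
record = {"order_id": "o_123", "customer_email": "a@example.com",
          "amount_usd": 42.0, "created_at": "2024-01-15T10:00:00"}
event = OrderEvent(**record)  # raises a validation error if the contract is violated
```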
Machine Learning Operations (MLOps)
MLOps was initially the domain of data scientists. However, with newer software tools, software engineers can also handle these tasks themselves or work alongside their data science teams. There are three layers of the modern AI stack: infrastructure, model, and application. While the data pipeline varies from firm to firm, here are the general steps that every ML data pipeline shares (a minimal code skeleton follows the list):
Infrastructure:
- Data Ingestion: Pulling in data from a source (e.g., database, web form, application, SaaS tool)
- Data Preprocessing: Organizing the data into a format that the ML model can train and test in development and operate on in production
- Feature Engineering and Selection: Identifying relevant/most important parameters of the data for the model to account for
Model:
- Model Training: Training the model on the training data
- Model Evaluation: Testing the model on test data after it has been trained
- Model Versioning: Saving and updating iterations of the model
Application:
- Model Deployment: Having the model launched in a production environment and performing inference
- Monitoring and Maintenance: Tracking the performance of the model over time and making adjustments as needed
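The sketch below strings the infrastructure and model steps together, with scikit-learn as a stand-in. The file name, feature names, and model choice are illustrative assumptions, not a prescribed stack:

```python
# Minimal skeleton of the pipeline steps above, using scikit-learn as a
# stand-in. File, feature, and model choices are illustrative assumptions.
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Infrastructure: ingest, preprocess, and select features
df = pd.read_csv("events.csv")                      # data ingestion
df = df.dropna(subset=["label"])                    # data preprocessing
features = ["feature_a", "feature_b", "feature_c"]  # feature selection

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["label"], test_size=0.2, random_state=0)

# Model: train, evaluate, and version
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
joblib.dump(model, f"model_v1_auc_{auc:.3f}.joblib")  # model versioning

# Application: deployment/serving and monitoring would wrap the saved model
# behind an API and track its live performance over time.
```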
There are a plethora of startups in the machine learning operations space. Most are point solutions that take care of a specific step in the ML pipeline workflow. A smaller group consider themselves end-to-end, handling every step from data ingestion to deployment into production.
Potential Opportunities:
Monitoring data drift and performance degradation: Even with a successful ML implementation, there is still the risk of data drift, when the production data the model runs on is substantially different from the training, test, and validation data used in its development. This leads to performance degradation over time, which can be hard to observe, diagnose, and fix quickly (a minimal detection sketch follows the stats below).
- An Arize AI survey shows that 26.2% of data scientists and ML engineers say it takes their team a week or more to detect and fix an issue with a model in production.
- CometML reports that 41% of ML experiments had to be scrapped for various reasons.
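One common drift check is to compare the distribution of a feature in production against its training distribution. The sketch below uses a two-sample Kolmogorov-Smirnov test from scipy on synthetic data; the feature values and the alerting threshold are assumptions:

```python
# Minimal drift check: compare a feature's training distribution against what
# the model is seeing in production. Data and threshold are assumptions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)   # reference data
production_feature = rng.normal(loc=0.4, scale=1.2, size=2_000)  # live traffic

stat, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:  # arbitrary alerting threshold
    print(f"Drift detected (KS statistic={stat:.3f}, p={p_value:.1e})")
else:
    print("No significant drift detected")
```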
Process sustainability: One of the more challenging issues is building sustainable practices around ML operations, so that maintaining an implemented solution across teams does not demand ever more resources (time, money, labor, technology).
- CometML’s report states that 41% of machine learning practitioners are concerned with developing sustainable processes to support their AI efforts.
- Arize AI states that 48.6% of data scientists say their jobs are more difficult in the wake of COVID-19 due to elevated drift, data quality, and performance issues.
Lower cost of implementation: Most AI/ML products work with other popular or bleeding-edge SaaS tools. There is a clear opportunity to work with legacy systems so that enterprise customers don’t have to upgrade their own infrastructure immediately to get modern-day solutions to work.
- Rackspace reports that one of the major barriers to adopting ML solutions is the cost of implementation.
- Geller, a popular AWS cloud engineer and subject matter expert in the space, notes that “practitioners are incentivized by their management to pick a tightly integrated all-in-one product to better control costs (one vendor and contract to negotiate).”
Foundation Models
Foundation models are pre-trained ML models that can handle more operations than their standard, task-specific counterparts. These generalist models are the basis of generative AI. Here are some examples below:
- Text-based: OpenAI, Anthropic, Cohere, AI21 Labs, Character.ai, Mistral
- Image-based: Stable Diffusion, Midjourney
- Video-based: Pika Labs, Runway
- Audio-based: Suno.ai, ElevenLabs, AudioCraft
- Human-to-computer Interface: Adept
Most companies cannot build their own foundation models due to the massive expenditure of labor, time, data, and money required to build, train, test, deploy, and monitor them at scale. According to Wired, training GPT-4 alone cost over $100 million. However, smaller companies and startups can leverage these pre-trained models by repurposing them for more specific, narrow tasks via fine-tuning.
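As a hedged sketch of what that fine-tuning can look like in practice, the example below adapts a pre-trained checkpoint to a narrow classification task using the Hugging Face transformers Trainer, one common route among several. The checkpoint, dataset, and hyperparameters are placeholder choices, not recommendations:

```python
# Sketch of fine-tuning a pre-trained model for a narrow classification task
# with Hugging Face transformers. Checkpoint, dataset, and hyperparameters are
# placeholder choices for illustration only.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

dataset = load_dataset("imdb")  # stand-in for a company's own labeled data
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"].shuffle(seed=0).select(range(2_000)),
    eval_dataset=tokenized["test"].select(range(500)),
)
trainer.train()  # adapts the general-purpose model to the narrow task
```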
The caveat, however, is that companies that regularly develop, maintain, and update new versions of these foundation models will absorb this capability, leaving the smaller companies relying on them without any defensibility in their product or service. Developers building on top of OpenAI’s GPT models have experienced this firsthand, as many “ChatGPT with a wrapper” companies have been made redundant by later versions of the foundation model.
Applications
Foundation models offer incredible convenience, as you don’t have to build from scratch. However, there is a risk that later versions of an LLM will replicate the functionality you built on top of it, rendering your business redundant.
The choice of application for building on top of a foundation model is a critical decision.
Differentiation of AI/ML takes on four different forms:
Hardware:
- What chips are you using that increase the performance of your application by an order of magnitude? Why did you pick that particular hardware component?
Model:
- What techniques or algorithms have you built your model upon to provide better results to your end user?
Application:
- What do you know about the application domain that makes it tractable by AI, which others don’t?
- How have you structured your go-to-market strategy to drive user adoption rapidly?
Data:
- Do you have proprietary access to the information used to train, test, and validate your model?
- What do you understand about the data that no one else does?
There are countless applications of AI, ranging from recommendation systems to fraud detection. However, the value realized from AI depends on the strength and connectivity of the data infrastructure supporting it and the technical proficiency of the ML engineers implementing the solution.
One notable trend to watch is whether large enterprises sitting on massive amounts of data within their data lakes will aim to build their own enterprise-specific foundation models or leverage existing closed-source models instead. If the latter, security is a major concern, as the enterprise does not want its proprietary data leaking into the external foundation model.
Potential Opportunities: Consumer & Prosumer Applications
AI’s biggest successes so far have mostly been on the consumer side. ChatGPT took five days to reach 1M users. PhotoRoom, an AI-based photo editing tool, generated $50M+ ARR in three months. On the prosumer (professional consumer) side, Microsoft reports that, across all programming languages, developers accept 46% of the code GitHub Copilot creates. Companies such as Copy.ai have surpassed 10M+ users. Fireflies.ai has continued to build an enduring business, evolving from an AI-powered notetaking tool into a conversation intelligence platform.
I think there’s a serious opportunity for vertical generative AI companies focused on B2B2C or B2C. Some suggest that with the launch of OpenAI’s GPT Store, several “wrapper” companies will fall by the wayside, but I don’t believe that to be true. OpenAI’s app marketplace has no intrinsic moat like Apple’s or Google’s because it’s not hardware-based. Developers in the marketplace will still have to compete with individual startups; a presence in the marketplace does not give them any built-in advantage when it comes to distribution. Moreover, because ChatGPT is a generalist across many tasks, it is difficult for it to excel at a few specific ones.
Therefore, generative AI companies focused on a specific vertical or use case can succeed by rebuilding common workflows that professionals use in their day-to-day jobs with the help of AI. In the same way that people have built their careers off learning Figma, professionals can do the same with vertical AI startups.
Summary
The modern B2B infrastructure tech stack continues to evolve at a rapid pace. The most promising and accessible opportunities lie within data infrastructure, MLOps, and the application layer, while hardware remains the bottleneck for the throughput and performance of AI solutions.
- [1706.03762] Attention Is All You Need
- The 2023 AI and Machine Learning Research Report
- The complete guide to the modern AI stack | by Ayush Patel | Towards Data Science
- ML Infrastructure Tools — ML Observability | by Aparna Dhinakaran | Towards Data Science
- ML Infrastructure Tools for Data Preparation — Arize AI
- Emerging Architectures for Modern Data Infrastructure | Andreessen Horowitz
- The Industry Is Ready for Machine Learning Observability At Scale | Arize AI
- The great acceleration: CIO perspectives on generative AI | MIT Technology Review
- Machine Learning Model Management: What It Is, Why You Should Care, and How to Implement It
- State of MLOPs Industry Report | 2023 Machine Learning Practitioner Survey — Comet
- AI voice startup ElevenLabs lands $80M round, launches marketplace of cloned voices | VentureBeat
- The 10 Hottest Semiconductor Startups Of 2023 (So Far) | CRN
- How Data Engineering Will Change in 5 Years | Secoda
- The Future Of Data Engineering As An Engineer | Monte Carlo
- Navigating the Data Quality & Data Observability Landscape: Understanding Architectural Nuances | Lightup Data Blog
- Forecasting 2023 Data Engineering Trends in 2024 | Metaplane
- OpenAI’s CEO Says the Age of Giant AI Models Is Already Over | WIRED
- 2023 DATA INTEGRITY TRENDS AND INSIGHTS REPORT | LeBow College of Business
- Survey: The Industry Is Ready for ML Observability At Scale — Arize AI
- Gartner Says Cloud Will Become a Business Necessity by 2028
- The 2023 Cloud Modernization Research Report | Rackspace Technology
- Winning the SMB tech market in a challenging economy
- 2024 Is a ‘Make or Break’ Year, According to Slack’s Survey of U.S. Small Business Owners — Salesforce
- The Annual State Of Data Quality Survey, 2023
- 2022 State of Data Governance and Empowerment Analyst Report