LLM Infrastructure Stack Market Map
A deep dive into the components and tools of the modern LLM infra stack in 2024
In 2024, the AI scene continues to experience an electrifying transformation. At the heart of this revolution lies the cutting-edge LLM infrastructure stack, a groundbreaking architecture poised to fuel AI applications for the next generation.
Gone are the days of untamed frontiers in LLM development. Today, developers are uniting around a harmonized infrastructure and leveraging their tech stacks to stay ahead of the curve and meet the demands of the business world.
In this article, we will dive deeper into the components of the modern LLM infrastructure stack and explore the tools and technologies that are driving this transformation. This stack is comprised of several infrastructure components that work together seamlessly to support the entire lifecycle of an LLM production including data and pre-development, model development, deployment, and monitoring.
We will also unveil the key trends we see as a VC investing in this space and after an exhaustive analysis of the market and new rising stars.
The future of AI is bright, and the modern LLM infra stack stands as the bedrock upon which this future is being constructed.
Join us as we delve into the captivating advancements that are shaping the very foundation of the next wave of AI applications.
📩 Building in this space? We’d love to hear from you at aperez@caixacapitalrisc.es
Note: Companies are categorized into specific areas based on what we believe they excel at.
Defining the Modern LLM Infrastructure Stack
According to Menlo Ventures, in 2023, enterprises spent over $1.1 billion on the modern AI stack—making it the largest new market in generative AI and a massive opportunity for startups.
The total funding raised by LLM infra companies from investors is $10 billion as of March 2024 (excluding big companies that raised more than $1Bn). VC-backed companies in the map founded in 2022 and 2023, 67 in total, have raised so far $650 million.
At Criteria, we define the key layers of the modern LLM infrastructure stack as:
Layer 1: Data. The data layer contains the infrastructure to ingest, curate, and manage data to train LLMs and connect the model to the right context wherever it may exist within enterprise data systems. Core components include data pre-processing, ETL and data pipelines, and databases like vector databases, data management, synthetic data for data augmentation and privacy, and data labeling and annotation.
Layer 2: Model. The model layer contains the infrastructure to train and build the model from fine-tuning to quality testing and evaluations. It also includes federated learning, experiment tracking, prompt engineering, RAG, and LLM orchestration frameworks.
Layer 3: Deployment. The deployment layer provides the tools for managing and orchestrating AI applications, from the model to the GPU. It includes model routing, model orchestration, model computing, inference, serving, and hosting to allow developers to easily deploy and manage LLM in production environments.
Layer 4: Monitoring. The monitoring layer contains the tools that help developers observe and analyze the LLM and Gen AI apps to understand the behavior of the models in their run-time, costs, latencies, bugs, user behavior, and feedback.
Layer 5: Enterprise considerations: This layer involves addressing the various enterprise-wide considerations that are critical to the successful deployment and management of LLMs. This includes ensuring privacy and security, governance, managing model risks, and ensuring compliance with relevant regulations and standards.
Bonus track: E2E Approach. Holistic platforms that span across all the layers of the modern LLM infrastructure stack, from model playground to deployment and monitoring. Some of them also have data management solutions.
Layer 1 - Data
Vector Database & Search
A vector database is a specialized database designed to store and query large-scale vector data, such as unstructured data like word embeddings, sentence embeddings, and document embeddings. The vector search function is used to retrieve the most relevant vectors from the database.
Data Management
Data management refers to the process of collecting, preparing, integrating, cleaning, and pre-processing the data used to train and fine-tune LLMs. Here, raw data that is tailored to a company’s requirements can be gathered, preprocessed, masked, and transformed into a format suitable for LLMs.
Data Labeling & Annotation
The data labeling and annotation layer is responsible for assigning labels and annotations to the data that is used to train and test the LLMs. This includes tasks such as sentiment analysis, entity recognition, and part-of-speech tagging. Data labeling and annotation tools ensure that the data is accurately labeled and annotated, which is critical for achieving high-quality model performance.
Data Pipeline & Orchestration
The data pipeline & orchestration layer contains the infrastructure to ingest, transform, and connect data to train LLMs and connect the model to the right context wherever they may exist within enterprise data systems.
Synthetic Data
Synthetic data refers to data that is generated artificially, rather than collected from real-world sources. Synthetic data can be used to augment real-world data, providing more examples for training and fine-tuning smaller and specialized models without security and privacy risks
Layer 2 - Model
Fine Tuning & Training
Fine-tuning and training consist in the process of adapting a pre-trained LLM to a specific task or domain. This includes adjusting model hyperparameters, adding task-specific layers, RLFH and DPO techniques, and training the model on a new dataset.
RAG
The retrieval-augmented generation (RAG) platforms help get the data from outside the LLM, an external knowledge base, to ground LLMs on the most accurate, up-to-date information improving the coherence and accuracy of answers by searching for and attaching relevant information to enrich the context for the model giving users high-quality responses.
LLM Orchestration Framework
Orchestration frameworks are comprehensive tools that streamline the construction and management of Gen AI applications with the integration of multiple models. They are designed to simplify the complex processes of prompt engineering, API interaction, data retrieval, and state management across conversations with language models.
Experiment Tracking & Prompt Engineering
Experiment tracking and prompt engineering refer to the process of designing and testing different prompts and input parameters to generate the desired output from an LLM. This includes tasks such as iterating on prompts, Chain-of-Thought prompting, and testing different input parameters, and evaluating the performance of the model on different tasks.
Quality Testing & Evaluation
LLM evaluation refers to the process of assessing the quality, performance, and effectiveness of an LLM system. This includes evaluating the accuracy and fluency of the model's output, as well as its ability to generate relevant and coherent text. Evaluation is important for ensuring that an LLM system is meeting the desired performance criteria and for identifying bias, model drifts, hallucinations, and other areas for improvement.
Federated Learning
Federated learning refers to a decentralized approach to train LLMs, where data is collected and processed on devices or servers at the edge of the network, rather than in a centralized data center. This can help to improve the privacy and security of the LLM as well as reduce the latency and bandwidth requirements for efficient model training and deployment.
Layer 3 - Deployment
The deployment layer includes several things like serving and hosting, computing, inference, model compression, model routing, and others.
Model Serving
Serving refers to the process of deploying a trained LLM into a production environment. There are several serving options available, including serverless functions, containerization, and virtual machines.
Model Computing
Computing refers to the process of generating text or performing other tasks using a deployed LLM through computational resources. It is important to have the ability to have scalable, and flexible resources to handle large workloads efficiently.
Model Inference
Model Inference refers to the process of getting a response from a trained model for the user's query or prompts. A fast and efficient inference process is critical in LLM deployment.
Layer 4 - Monitoring
Observability
Observability refers to the ability to monitor and understand the behavior and performance of a LLM system. This includes monitoring and tracking metrics, traces, and logging, such as latency, throughput, and error rates, as well as understanding how the model is making predictions and generating output. Observability is important for identifying issues and optimizing the performance of an LLM system.
App/User Analytics
App/user analytics refers to the process of collecting and analyzing data about how users interact with LLM systems. This includes tasks such as tracking user behavior, analyzing user feedback, and identifying areas for improvement.
Layer 5 - Enterprise considerations
Security & Privacy
Security and privacy refer to the processes and measures to protect LLM systems and the processed data from unauthorized access or malicious attacks. This includes tasks such as encryption, protection, hardening, prompt security, firewalls, vulnerability scanning, data leak protection and data loss prevention, and other data privacy and security solutions.
Governance, Compliance & Risk
Governance, compliance, and risk refer to the processes and measures put in place to ensure that LLM systems are aligned with business objectives and comply with relevant regulations and standards. This includes AI explainability, AI alignment, AI risk management.
E2E Approach
E2E LLMOps Platform
The E2E LLMOps platform is a comprehensive platform that provides a suite of tools and services for managing the entire LLM lifecycle. This includes tasks such as data management, model training, and model deployment. The E2E LLMOps platform ensures that the LLM infrastructure stack is integrated and working together seamlessly.
Low/No-code Gen AI App builder
The low/no-code AI application builder is a tool for non-technical and AI expert users to build and deploy AI-powered applications without writing any line of code.
What’s next? (trends we see…)
RAG FRAMEWORKS
RAG is essential and required because of the predictive nature of LLMs' lack of reasoning. It is crucial to provide the LLM access to relevant, domain-specific content to improve context understanding. RAG frameworks are emerging to ensure the quality and efficiency of the whole RAG process.
🌟 Rising stars:
AI ALIGNMENT & RESPONSIBLE AI
The rapidly changing AI regulatory landscape emphasizes the need for companies to understand the implications of these regulations for deploying LLM-powered applications. Organizations need to adopt processes from building responsible LLM to navigating compliance challenges, focusing on aspects such as fairness, explainability, safety, toxicity, and misinformation. Interesting solutions are emerging to align LLMs with regulatory requirements and implementing guardrails.
🌟 Rising stars:
SLM vs LLM
Data-centric fine-tuning is essential to craft in-house SML for specific use cases or tasks optimized for higher performance, efficiency, accuracy, and reliability. With improvements in training techniques, knowledge distillation, hardware advancements, and efficient architectures, the gap between SLMs and LLMs will continue to narrow.
🌟 Rising stars:
FASTER & EFFICIENT INFERENCE
Developers are facing challenges in model deployment as models continue to grow in size and complexity. The inference process is becoming increasingly time-consuming, leading to slow response times and decreased efficiency. The computational resources required to perform inference can be substantial, limiting the scalability of the model.
Developers are looking for solutions in model compression, serverless computing, hardware acceleration, distributed GPU, job queues & batch processing, and runtime optimization to handle large workloads cost-effectively.
🌟 Rising stars:
LLM & PROMPT INJECTION SECURITY
LLM attacks serve as a stark reminder of the critical need for continuous security enhancements, ongoing vulnerability assessments, and robust detection and response solutions. Safeguarding LLM models from malicious actors is of utmost importance to maintain the integrity and reliability of these systems.
One emerging vulnerability that deserves attention is Prompt Injection, which represents a modern incarnation of SQL injection attacks. Prompt Injection exploits the dark side of prompt engineering, where malicious actors slip in malicious code into LLM databases. This poses a significant threat as it can compromise the functionality and trustworthiness of LLMs.
🌟 Rising stars:
FEDERATED LEARNING TRAINING
Federated Learning enables LLM training while preserving data privacy and security by leveraging distributed data across multiple devices or edge nodes. Developers are look ingfor alternatives to train cost-effectively while preserving the sensitive data privacy.
🌟 Rising stars:
EVALUATIONS & OBSERVABILITY TOOLING
Currently, logging and evaluation tasks are carried out manually due to the high stakes involved. Gen AI and LLM vendors are well aware that delivering high-quality outputs with an optimized infrastructure is crucial, as customer expectations in quality, speed and costs are demanding. They recognize the risks associated with potential issues like hallucinations, biased outcomes, and misinformation, which could lead to a loss of customer trust. As a result, there is a significant opportunity to develop new tools that enhance observability, quality testing, and evaluation procedures.
🌟 Rising stars:
SHORTEN DEV CYCLES WITH NO/LOW-CODE GEN AI APP BUILDER
As LLM infra stack continues to grow, there has been a proliferation of no/low-code tools to effortlessly build Gen AI Apps empowering product managers and data scientists to give them the ability to build an enterprise-ready Gen AI App without extensive reliance on ML engineers.
🌟 Rising stars:
WHAT’S EXCLUDED:
· Hardware / GPU Providers
· Foundation models
· LLM applications and interface
· Knowledge graph databases
Crafted and written by Aleix Pérez, Associate at Criteria Venture Tech
Great article, thanks Aleix. There are a few MS offerings that might be added to the tooling:
- CosmosDB is also a vector DB
- Semantic Kernel is an orchestrator
- Azure AI Search provides RAG capabilities
There is new tooling poping up every day! :)
Great resource!
btw. you missed Wordware.AI in the prompt engineering/experimentation part ;)