Part 0.5: The LLM Engineering Glossary
LLM Engineering Concepts, Techniques, and Frameworks at a Glance.
This post is a quick, structured glossary of the most important ideas in LLM engineering — from prompt techniques to agentic frameworks. It’s built as a tree, not an alphabet soup. Perfect if you’re building, learning, or just trying to make sense of what everyone’s talking about.
LLM Engineering
LLM Engineering is the umbrella discipline focused on building useful, reliable, and goal-driven systems using large language models. It includes prompt design, tool integration, memory handling, reasoning strategies, and application orchestration. It’s where software engineering meets machine intelligence.
Prompt Engineering
Prompt Engineering is the foundational skill in LLM Engineering. It involves crafting inputs (prompts) that guide the model to generate accurate and useful outputs. Good prompts control the model’s tone, reasoning, and behavior — and are key to making LLMs useful in the real world.
Key Techniques
- Zero-shot prompting – Asking the model a question directly, without giving any examples. Useful for simple or general tasks.
- Few-shot prompting – Providing a few examples in the prompt to show the model the desired pattern or format. Helps improve accuracy for custom tasks.
- Chain-of-Thought (CoT) – Encouraging the model to reason step by step, either by including worked reasoning in the prompt's examples or by asking it to “think out loud.” Great for solving complex or multi-step problems.
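To make the three techniques concrete, here is a minimal sketch of how each prompt might be assembled as a plain string. The helper names (`zero_shot`, `few_shot`, `chain_of_thought`) and the `Input:`/`Output:` layout are illustrative assumptions, not a standard format; any real model API would take these strings as input.

```python
def zero_shot(task: str) -> str:
    """Ask directly, with no examples."""
    return f"{task}\nAnswer:"


def few_shot(task: str, examples: list[tuple[str, str]]) -> str:
    """Prepend worked input/output pairs so the model can copy the pattern."""
    shots = "\n\n".join(f"Input: {q}\nOutput: {a}" for q, a in examples)
    return f"{shots}\n\nInput: {task}\nOutput:"


def chain_of_thought(task: str) -> str:
    """Append a cue that asks for intermediate reasoning before the answer."""
    return f"{task}\nLet's think step by step."


# Hypothetical sentiment-labelling examples, made up for illustration.
examples = [("I loved it", "positive"), ("Total waste of money", "negative")]
prompt = few_shot("The battery died in a day", examples)
print(prompt)
```

The only difference between the three is what surrounds the task text, which is why prompt techniques are cheap to try and easy to combine.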
Agentic Frameworks
Agentic frameworks provide the structure and tools needed to build AI agents — systems that can reason, make decisions, take actions, and sometimes collaborate with other agents or tools. These frameworks help manage memory, tool use, step-by-step logic, and multi-agent planning, so developers don’t have to build everything from scratch. They’re a key part of turning LLMs into reliable, goal-oriented systems.
Popular Frameworks
- LangChain – One of the most widely used open-source frameworks for connecting LLMs with tools, APIs, memory, and external data. It supports complex multi-step workflows and is commonly used in chatbots, agents, and RAG applications.
- AutoGen (Microsoft) – Designed for multi-agent communication and collaborative reasoning. Agents can have roles and interact with each other to solve tasks, making it well suited to advanced workflows like team-based coding or analysis.
- CrewAI – A role-based agent framework that simulates team collaboration. Each agent has a defined role (e.g., researcher, planner) and works together with others to complete tasks efficiently. Great for orchestrated, modular agent setups.
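Under the hood, most of these frameworks wrap some version of a decide, act, observe loop. The sketch below is framework-free and does not use the LangChain, AutoGen, or CrewAI APIs. `scripted_policy` is a hypothetical stand-in for the LLM call that picks the next step, and the `search` tool is a stub.

```python
from typing import Callable


def scripted_policy(goal: str, observations: list[str]) -> dict:
    """Hypothetical stand-in for an LLM deciding the next step."""
    if not observations:
        return {"action": "search", "input": goal}
    return {"action": "finish", "input": f"Summary: {observations[-1]}"}


# Stub tool registry; a real agent would call APIs or databases here.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: f"3 notes found about '{query}'",
}


def run_agent(goal: str, policy, tools, max_steps: int = 5) -> str:
    """Loop: ask the policy for a step, run the chosen tool, record the result."""
    observations: list[str] = []
    for _ in range(max_steps):
        decision = policy(goal, observations)
        if decision["action"] == "finish":
            return decision["input"]
        tool = tools[decision["action"]]
        observations.append(tool(decision["input"]))
    return "stopped: step limit reached"


result = run_agent("kv caching", scripted_policy, TOOLS)
print(result)
```

The `max_steps` cap is a deliberate design choice: without it, a policy that never chooses `finish` would loop forever.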
LLM Applications
These are the real-world systems and tools built using large language models. Applications can range from simple Q&A bots to complex assistants that handle reasoning, memory, and tool use. They may use techniques like RAG, CAG, or full-blown agent frameworks, depending on the problem they’re solving.
Common Examples
- Chatbots – Conversational interfaces that answer questions or assist users based on predefined logic or real-time reasoning.
- Copilots – Assistive tools that work alongside users in apps (like coding IDEs or CRMs) to complete tasks faster.
- Support Agents – LLM-powered assistants that handle customer queries, escalate issues, or pull relevant knowledge from help docs.
- Summarizers – Tools that condense long text into digestible summaries for fast understanding or reporting.
RAG (Retrieval-Augmented Generation)
RAG is a powerful technique that gives LLMs access to external knowledge — like documents, databases, or internal systems — at query time. Instead of relying only on what the model was trained on, RAG lets you fetch the most relevant information and feed it into the prompt. This makes responses more accurate, up-to-date, and context-aware.
Core Components
- Embeddings – Numerical representations of text that allow the system to measure similarity between a question and chunks of content.
- Chunking – Breaking long documents into smaller, searchable pieces so they can be embedded and retrieved effectively.
- Vector Databases – Specialized stores that hold embeddings and support fast similarity search. Examples include databases like Pinecone and Weaviate, as well as libraries like FAISS that provide the underlying index.
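The three components fit together roughly as in the sketch below. It is a toy under stated assumptions: the "embedding" is a plain word count rather than a learned model, the "vector database" is a Python list, and the sample document text is made up for illustration.

```python
import math
from collections import Counter


def chunk(text: str, size: int = 8) -> list[str]:
    """Split text into pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. Real systems use a learned model."""
    return Counter(w.strip(".,?!").lower() for w in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


doc = (
    "Refunds are accepted within 30 days of purchase. "
    "Shipping takes five business days in most regions."
)
chunks = chunk(doc)
context = retrieve("How long does shipping take?", chunks)[0]
prompt = f"Context: {context}\nQuestion: How long does shipping take?"
print(prompt)
```

Swapping the word-count `embed` for a real embedding model and the list for a vector index changes the quality and scale, but not the shape of the pipeline.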
CAG (Code-Augmented Generation)
CAG refers to giving LLMs the ability to use code or external tools to complete a task. Instead of just generating a response from text, the model can call functions, run calculations, or query structured data. This makes it ideal for analytical, logic-driven, or tool-integrated workflows.
Common Capabilities
- Tool Use – The LLM can invoke external tools (e.g., a calculator or weather API) as part of its reasoning process.
- Function Calling – The model is guided to choose from a set of functions, pass arguments, and interpret results.
- Code Execution – In some setups, the LLM can generate code (e.g., Python or SQL), run it in a safe environment, and use the output in its reply.
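Function calling usually comes down to the model emitting a structured request that your code validates and dispatches. Here is a minimal sketch, assuming the model returns a JSON string with `name` and `arguments` fields. That shape is an assumption for illustration, not any specific provider's format, and both tools are local stubs.

```python
import json


def add(a: float, b: float) -> float:
    """Simple calculator tool."""
    return a + b


def get_weather(city: str) -> str:
    """Hypothetical stub; a real tool would call a weather API."""
    return f"Sunny in {city}"


# Only functions listed here can be invoked by the model.
REGISTRY = {"add": add, "get_weather": get_weather}


def dispatch(model_output: str):
    """Parse the model's tool request and run the matching function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in REGISTRY:
        raise ValueError(f"unknown tool: {name}")
    return REGISTRY[name](**call["arguments"])


# Pretend the model produced this string in response to "What is 2 + 3?"
result = dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}')
print(result)
```

Keeping an explicit registry means the model can only trigger functions you have chosen to expose, which matters even more once code execution is involved.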
Acronym Overlap: Code-Augmented vs. Cache-Augmented Generation
If you've seen “CAG” floating around the internet and felt confused — you're not alone. In the world of LLM engineering, CAG can stand for two completely different concepts, depending on context:
- Code-Augmented Generation (CAG) – This is when an LLM is enhanced with the ability to generate or call code, use tools, or execute functions during reasoning. It's often used for logic-heavy tasks, calculations, or tool-integrated workflows.
- Cache-Augmented Generation (CAG) – A newer, performance-focused strategy where static or semi-static knowledge is preloaded into the model's extended context, and its attention states are cached for reuse across sessions. It reduces latency and infrastructure load by skipping real-time retrieval.
This acronym collision is unfortunate but real. To keep things clear in your writing and architecture:
- Always spell out the full term on first use.
- Use context clues — "code" means action and logic, "cache" means memory and performance optimization.
- Consider adding a brief clarification note (like this one) when both appear in the same project or post.
In this glossary and series, Code-Augmented Generation is the default unless stated otherwise.
Cache-Augmented Generation (CAG)
Cache-Augmented Generation is an optimization technique that improves the performance and efficiency of LLM-based systems by reusing static knowledge across sessions. Instead of performing retrieval or reasoning from scratch on every query, the model loads stable information into its context window and caches the internal attention states (Key-Value pairs) for fast reuse. It's especially useful in low-latency, high-consistency use cases like internal FAQ bots, onboarding copilots, or compliance advisors.
Core Components
- Static Context Encoding – Before the session begins, known reference material (e.g., policy docs, product info) is fed into the model. This serves as a fixed “knowledge payload” for the session.
- KV Caching (Key-Value Attention States) – The model’s internal attention cache (used during inference) is saved and reused across multiple queries. This avoids recomputation and ensures consistent grounding.
- Warm Starts – Sessions begin with a pre-populated context and KV cache. Unlike retrieval-based approaches (RAG), no similarity search is needed at runtime.
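The idea behind these components can be shown with a toy single-head attention in plain Python. This is a sketch under heavy assumptions: `embed` and the key/value "projections" are made-up deterministic functions standing in for a real model, and `WarmSession` is a hypothetical name. What it does show faithfully is the caching pattern: keys and values for the static context are computed once and reused for every question.

```python
import math

stats = {"encoded": 0}  # counts how many tokens were run through encode_kv


def embed(token: str) -> list[float]:
    """Toy deterministic embedding; NOT a real model."""
    codes = [ord(c) for c in token]
    return [len(token) / 10.0, (sum(codes) % 17) / 17.0, (codes[0] % 5) / 5.0]


def encode_kv(tokens: list[str]):
    """Compute a (key, value) pair per token; stands in for the costly prefill."""
    keys, values = [], []
    for tok in tokens:
        stats["encoded"] += 1
        e = embed(tok)
        keys.append([2.0 * x for x in e])    # toy key projection
        values.append([x + 1.0 for x in e])  # toy value projection
    return keys, values


def attend(query, keys, values):
    """Softmax-weighted average of values, scored by query-key dot products."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class WarmSession:
    def __init__(self, static_tokens: list[str]):
        # Static context is encoded once; this is the cached KV state.
        self.keys, self.values = encode_kv(static_tokens)

    def answer(self, question_tokens: list[str]):
        # Only the new tokens are encoded; cached keys/values are reused.
        q_keys, q_values = encode_kv(question_tokens)
        query = embed(question_tokens[-1])
        return attend(query, self.keys + q_keys, self.values + q_values)


policy = "refunds are accepted within 30 days".split()
session = WarmSession(policy)
first = session.answer("how long for refunds".split())
second = session.answer("are refunds accepted".split())
print(stats["encoded"])
```

Comparing `stats["encoded"]` against what a cold run would need (the full policy re-encoded for each question) shows where the savings come from, and the cached result matches a from-scratch computation over the same tokens.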
Benefits
- Consistency – Responses are grounded in a fixed, pre-approved knowledge base.
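- Speed – Skips the runtime retrieval step and avoids re-encoding the static context, which lowers latency.
- Cost Efficiency – The static context is encoded once rather than on every query, so each request spends compute mainly on its new tokens.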
Best Used For
- Employee onboarding assistants with fixed content
- Policy or HR bots where information rarely changes
- Use cases that require predictable responses from a stable dataset
Cache-Augmented Generation is not a replacement for RAG — it complements it. While RAG is great for dynamic knowledge lookup, Cache-Augmented Generation is ideal when your data is known, stable, and reused across many queries.
Infrastructure Tools
These tools and platforms help you run, host, and experiment with large language models. Whether you're working locally or deploying at scale, infrastructure tools handle model serving, APIs, memory, and performance. Some are aimed at developers building full apps, while others are more for testing or local exploration.
Popular Tools
- Ollama – A developer-friendly way to run LLMs locally via a simple command-line interface and a local HTTP API. Great for integrating local models into apps or workflows.
- LM Studio – A desktop GUI for chatting with local models like LLaMA or Mistral. Ideal for quick testing without code.
- llama.cpp – A low-level C/C++ inference engine for running quantized models on CPU, with optional GPU acceleration. Lightweight and highly portable; it even runs on a Raspberry Pi.
- OpenAI API / Amazon Bedrock – Cloud-based platforms that provide access to hosted models via API. OpenAI serves its own proprietary models, while Bedrock offers models from several providers, including open-weight ones.
Core Building Blocks
These are the foundational components that power almost every LLM engineering system. Whether you're building a simple chatbot or a complex agentic workflow, these elements provide the capabilities that everything else builds on. Understanding them is essential for designing scalable, reliable AI systems.
Key Components
- LLMs – The language models themselves (e.g., GPT-4, Claude, LLaMA) that generate responses based on prompts. They are the "brains" of the system.
- Memory – Stores and recalls user interactions or task context. Can be short-term (conversation-level) or long-term (user preferences, history).
- Embeddings – Vector representations of text that help with similarity search, clustering, or context alignment. Essential for RAG and intent matching.
- Tool Calling – The ability of a model or framework to invoke APIs, functions, or external code to extend the model's capabilities.
These building blocks are reused across prompting techniques, agentic frameworks, and both RAG and CAG approaches. Mastering them gives you the power to compose more advanced and trustworthy LLM applications.
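As one example of these building blocks in practice, here is a minimal sketch of the short-term versus long-term memory split described above. `ConversationMemory` and its methods are hypothetical names made up for illustration; real frameworks offer richer versions (summarization, vector-backed recall) behind similar interfaces.

```python
from collections import deque


class ConversationMemory:
    def __init__(self, max_turns: int = 3):
        # Short-term: only the most recent turns are kept.
        self.short_term = deque(maxlen=max_turns)
        # Long-term: durable facts such as user preferences.
        self.long_term: dict[str, str] = {}

    def add_turn(self, role: str, text: str) -> None:
        self.short_term.append((role, text))

    def remember(self, key: str, value: str) -> None:
        self.long_term[key] = value

    def as_prompt(self) -> str:
        """Render both memory types as text to prepend to the next prompt."""
        facts = "; ".join(f"{k}={v}" for k, v in self.long_term.items())
        turns = "\n".join(f"{role}: {text}" for role, text in self.short_term)
        return f"Known facts: {facts}\n{turns}"


memory = ConversationMemory(max_turns=2)
memory.remember("language", "Python")
memory.add_turn("user", "Hi there")
memory.add_turn("assistant", "Hello!")
memory.add_turn("user", "Show me a loop")
print(memory.as_prompt())
```

The bounded `deque` drops the oldest turn automatically, which keeps the prompt within a predictable size while the long-term facts persist.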
Conclusion
This glossary doesn’t cover everything there is to know about LLM engineering, and honestly, it never could. The field is evolving fast, and new tools, patterns, and ideas are emerging every week. But what we’ve covered here are the core concepts that matter most: the parts that show up again and again in real-world systems, from prompt design to agent frameworks to generation strategies like RAG, Code-Augmented Generation, and Cache-Augmented Generation.
Whether you’re building your first LLM-powered chatbot or architecting an enterprise-grade AI assistant, these terms and structures will keep showing up. Think of this as your grounding — something to come back to as the rest of the series builds deeper and wider.