Agentic AI Systems in the Cloud: LLM Workflows with Tools, Memory & Guardrails
One-line description: Practical patterns for building distributed, safe, tool-using LLM agents with memory—plus what to say in system design interviews.
Tags: agentic-ai, llm, distributed-systems, cloud-native, system-design, ml-architecture, guardrails
Introduction
Agentic AI systems are LLM-powered workflows that can plan, call tools, observe results, store/retrieve memory, and take actions toward a goal—often across multiple steps and services. Unlike a single prompt/response “chatbot,” an agentic system behaves more like a distributed application where the LLM is an orchestrator embedded in a broader architecture.
Why this topic matters
In production, most valuable LLM use cases require more than text generation:
- Fetching data from internal systems (CRM, ticketing, logs, metrics)
- Executing operations (create a Jira ticket, run a query, deploy a change)
- Coordinating multi-step workflows (triage → diagnose → propose fix → verify)
- Applying safety and compliance constraints (PII redaction, policy checks)
- Maintaining context over time (user preferences, previous incidents)
This pushes you into classic distributed systems territory: retries, idempotency, consistency, observability, access control, rate limits, and failure modes.
Real-world context
Common agentic workloads:
- SRE Copilot: Investigates alerts by querying metrics/logs, proposes mitigations, opens incidents.
- Customer Support Agent: Reads ticket history, looks up account state, drafts responses, escalates when needed.
- Sales Ops Assistant: Pulls pipeline data, generates summaries, schedules follow-ups, updates CRM.
- Developer Productivity Agent: Reads repos, runs tests, creates PRs, ensures policy compliance.
In interviews, this topic is increasingly used to test whether you can design LLM-enabled systems with robust cloud-native foundations, not just prompt engineering.
Core Concepts
1) Agent loop: Plan → Act → Observe → Reflect
A typical agent runs an iterative loop:
- Plan: Decide next step(s) based on goal + context.
- Act: Call a tool (API/DB/function) or ask a human.
- Observe: Ingest tool results.
- Reflect: Update state/memory; decide whether done.
This resembles a workflow engine, except the decision logic is probabilistic and must be constrained.
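The loop above can be sketched in a few lines; here `plan`, `act`, and `reflect` are placeholders for the model call, the tool invocation, and the state/memory update:

```python
# Hypothetical skeleton of the Plan → Act → Observe → Reflect loop.
# `plan`, `act`, and `reflect` stand in for model and tool calls.
def run_loop(goal, plan, act, reflect, max_steps=5):
    state = {"goal": goal, "observations": []}
    for _ in range(max_steps):
        step = plan(state)                         # Plan: choose next action
        if step is None:                           # Planner signals completion
            break
        observation = act(step)                    # Act: invoke a tool
        state["observations"].append(observation)  # Observe: record result
        state = reflect(state)                     # Reflect: update state/memory
    return state
```

Note the `max_steps` bound: even this toy loop is constrained, because the planner is probabilistic and may never decide it is done on its own.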
2) Tools and function calling
A “tool” is any external capability the agent can invoke:
- HTTP APIs (internal microservices)
- SQL queries (read-only or controlled write)
- Vector search (RAG retrieval)
- Code execution sandbox (carefully)
- Message queues / workflow triggers
- Human-in-the-loop approvals
Function calling (structured tool invocation) is essential to avoid brittle “parse text” approaches. Your system should treat tool invocation as a typed API contract with validation.
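As a sketch, a tool contract in the JSON-schema style most function-calling APIs accept might look like this; the exact envelope and field names vary by provider, and `validate_args` is a deliberately minimal stand-in for a real JSON-schema validator:

```python
# Illustrative tool definition in the JSON-schema style used by common
# function-calling APIs (the exact envelope varies by provider).
create_ticket_tool = {
    "name": "create_ticket",
    "description": "Open an incident ticket in the ticketing system.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "maxLength": 120},
            "severity": {"type": "string", "enum": ["SEV1", "SEV2", "SEV3"]},
            "description": {"type": "string"},
        },
        "required": ["title", "severity", "description"],
        "additionalProperties": False,
    },
}

def validate_args(tool, args):
    """Minimal structural check; production code would use a JSON-schema library."""
    schema = tool["parameters"]
    missing = [k for k in schema["required"] if k not in args]
    unknown = [k for k in args if k not in schema["properties"]]
    return not missing and not unknown
```

The point is that the schema, not the prose description, is the contract: the runtime rejects any call that does not validate, regardless of what the model intended.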
3) Memory: short-term context vs long-term state
Agentic systems need multiple memory layers:
- Short-term: Current conversation/workflow state (task, intermediate results).
- Episodic memory: Past runs, outcomes, user preferences; used to personalize or avoid repeated mistakes.
- Semantic memory: Retrieved knowledge (docs, runbooks) via RAG.
- Operational memory: Audit logs, traces, tool call records (for debugging and compliance).
Key design decision: What belongs in the LLM context window vs external stores?
- Put volatile, bounded state in context (recent messages, current plan).
- Put large, durable, queryable state in external storage (DB, object store, vector DB).
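One common way to apply this split is to keep the last few messages in the context verbatim and collapse everything older into a summary that lives in external storage. A minimal sketch, with `summarize` standing in for a cheap model call:

```python
# Sketch: bounded context assembly. Recent messages enter the prompt
# verbatim; older history is collapsed into a summary that would be
# persisted externally. `summarize` is a placeholder for a model call.
def build_context(messages, keep_recent=4,
                  summarize=lambda ms: f"[summary of {len(ms)} earlier messages]"):
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [{"role": "system", "content": summarize(older)}] + recent
```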
4) Guardrails: safety, security, and reliability controls
Guardrails are not just moderation. In production they include:
- Input validation: tool schemas, allowed parameters, max ranges
- Policy enforcement: role-based tool access, PII handling, data residency
- Prompt injection defenses: treat retrieved content as untrusted
- Output constraints: JSON schema, safe templates, citations
- Human approvals: for high-risk actions (writes, deletions, deployments)
- Rate limits and budgets: token spend, tool call limits, timeouts
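The last bullet, budgets, can be enforced with a small per-run accumulator checked before every model or tool call; the limits below are illustrative:

```python
# Sketch of per-run budget guardrails: token spend and tool-call counts
# are tracked and enforced before each action. Limits are illustrative.
class RunBudget:
    def __init__(self, max_tokens=50_000, max_tool_calls=20):
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.tokens_used = 0
        self.tool_calls = 0

    def charge_tokens(self, n):
        self.tokens_used += n
        if self.tokens_used > self.max_tokens:
            raise RuntimeError("token budget exceeded")

    def charge_tool_call(self):
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise RuntimeError("tool call budget exceeded")
```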
5) Orchestration patterns: single agent vs multi-agent vs workflows
- Single agent: One LLM orchestrates all steps. Simpler, but can become monolithic.
- Multi-agent: Specialized agents (retriever, planner, executor, reviewer). Helps separation of concerns but adds coordination complexity.
- Workflow-first: Deterministic workflow engine (Temporal/Step Functions/Durable Functions) with LLM used only for specific tasks (classification, summarization, plan generation). Often the most reliable for enterprise.
6) Distributed systems concerns
Treat the agent runtime as a distributed system component:
- Idempotency: tool calls must be safe to retry
- Consistency: avoid “double writes” when retries happen
- Timeouts: LLM calls and tools can be slow; design async
- Backpressure: protect dependencies from floods
- Observability: trace each tool call and model call with correlation IDs
- Versioning: prompts, tools, policies, and model versions must be tracked
Implementation Details
This section outlines a practical, cloud-native reference architecture and includes code examples for tool use, memory, and guardrails.
Reference architecture
flowchart LR
U[User / Client] -->|HTTP/WebSocket| API[Agent API Gateway]
API --> RT[Agent Runtime Service]
RT -->|call| LLM[LLM Provider / Model Gateway]
RT -->|tool calls| TOOLS[Tool Router]
TOOLS --> SVC1[Internal Services]
TOOLS --> DB[(SQL/NoSQL)]
TOOLS --> VS[(Vector DB)]
TOOLS --> Q[Queue/Workflow Engine]
RT --> MEM[State Store<br/>(Redis/Postgres)]
RT --> LOG[(Audit Log / Event Store)]
RT --> OBS[Tracing + Metrics]
subgraph Guardrails
POL[Policy Engine<br/>(OPA / Cedar)]
MOD[Content Safety / DLP]
SCHEMA[Schema Validation]
end
RT --> POL
RT --> MOD
RT --> SCHEMA
Key ideas:
- Put the LLM behind a Model Gateway (centralized auth, routing, caching, cost controls).
- Route tools through a Tool Router that enforces schemas, policies, and rate limits.
- Store workflow state in a State Store; store immutable traces in an Audit Log.
- Use a workflow engine/queue for long-running or high-latency tasks.
Data model: agent run, steps, and tool calls
In production, you’ll want durable records:
- agent_run: run_id, user_id, goal, status, started_at, model_version
- agent_step: step_id, run_id, type (plan/tool/reflect), input/output, timestamp
- tool_call: tool_call_id, tool_name, args, result, latency, error
- policy_decision: decision_id, allow/deny, reason, attributes
This supports debugging, compliance, and offline evaluation.
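As a sketch, these records map naturally onto relational tables; the column types and the in-memory SQLite database below are purely illustrative:

```python
# Sketch of the durable records as SQLite tables (column names follow the
# data model above; types and the in-memory DB are for illustration).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agent_run (
    run_id TEXT PRIMARY KEY, user_id TEXT, goal TEXT,
    status TEXT, started_at TEXT, model_version TEXT
);
CREATE TABLE agent_step (
    step_id TEXT PRIMARY KEY, run_id TEXT REFERENCES agent_run(run_id),
    type TEXT CHECK (type IN ('plan','tool','reflect')),
    input TEXT, output TEXT, timestamp TEXT
);
CREATE TABLE tool_call (
    tool_call_id TEXT PRIMARY KEY, tool_name TEXT,
    args TEXT, result TEXT, latency_ms REAL, error TEXT
);
""")
```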
Practical code: a minimal agent runtime with tools + guardrails (Python)
Below is an intentionally “framework-light” example showing:
- Typed tool definitions
- Schema validation
- Policy checks
- Memory persistence
- Iterative loop with tool calls
1) Tool contracts (Pydantic schemas)
from typing import Any, Dict, Optional, Literal, List
from pydantic import BaseModel, Field, ValidationError, conint, constr
class ToolCall(BaseModel):
    name: str
    arguments: Dict[str, Any]

class SearchDocsArgs(BaseModel):
    query: constr(min_length=3, max_length=256)
    top_k: conint(ge=1, le=10) = 5

class GetCustomerArgs(BaseModel):
    customer_id: constr(min_length=3, max_length=64)

class CreateTicketArgs(BaseModel):
    title: constr(min_length=5, max_length=120)
    severity: Literal["SEV1", "SEV2", "SEV3"]
    description: constr(min_length=10, max_length=4000)
    customer_id: Optional[str] = None
2) Tool implementations (stubs)
import time

def tool_search_docs(args: SearchDocsArgs) -> Dict[str, Any]:
    # In reality: vector DB query + reranking
    return {
        "results": [
            {"doc_id": "runbook-123", "title": "Payment latency runbook", "snippet": "Check p95..."}
        ]
    }

def tool_get_customer(args: GetCustomerArgs) -> Dict[str, Any]:
    # In reality: call internal customer service
    return {"customer_id": args.customer_id, "plan": "enterprise", "region": "us-east-1"}

def tool_create_ticket(args: CreateTicketArgs) -> Dict[str, Any]:
    # In reality: call Jira/ServiceNow with idempotency key
    time.sleep(0.1)
    return {"ticket_id": "INC-4567", "status": "created"}
3) Policy engine hook (OPA/Cedar-style)
In interviews, it’s valuable to describe policy as a separate service. Here’s a simplified local check:
class PolicyDecision(BaseModel):
    allow: bool
    reason: str

def authorize_tool_call(user_role: str, tool_name: str, args: Dict[str, Any]) -> PolicyDecision:
    # Example: only SREs can create SEV1 tickets
    if tool_name == "create_ticket" and args.get("severity") == "SEV1" and user_role != "sre":
        return PolicyDecision(allow=False, reason="Only SRE role can create SEV1 tickets")
    # Example: customer lookup allowed for support roles
    if tool_name == "get_customer" and user_role not in ("support", "sre"):
        return PolicyDecision(allow=False, reason="Insufficient role for customer lookup")
    return PolicyDecision(allow=True, reason="Allowed")
4) Memory store (workflow state)
Use Redis/Postgres for state; keep it simple here:
import json
from dataclasses import dataclass, field

@dataclass
class RunState:
    run_id: str
    user_id: str
    goal: str
    messages: List[Dict[str, str]] = field(default_factory=list)
    scratchpad: Dict[str, Any] = field(default_factory=dict)

class InMemoryStateStore:
    def __init__(self):
        self._store: Dict[str, RunState] = {}

    def get(self, run_id: str) -> RunState:
        return self._store[run_id]

    def put(self, state: RunState) -> None:
        self._store[state.run_id] = state
5) LLM interface (function calling)
This is pseudo-LLM code: in real systems you’ll call your model gateway.
class LLMResponse(BaseModel):
    assistant_message: str
    tool_call: Optional[ToolCall] = None
    done: bool = False

def llm_step(messages: List[Dict[str, str]], available_tools: List[str]) -> LLMResponse:
    """
    Replace with an actual model call. We assume the model returns either:
    - a tool call (name + JSON args), or
    - a final response.
    """
    last = messages[-1]["content"].lower()
    if "runbook" in last or "how do i" in last:
        return LLMResponse(
            assistant_message="I'll search internal docs.",
            tool_call=ToolCall(name="search_docs", arguments={"query": messages[-1]["content"], "top_k": 3}),
        )
    if "create ticket" in last:
        return LLMResponse(
            assistant_message="Creating an incident ticket.",
            tool_call=ToolCall(name="create_ticket", arguments={
                "title": "Payment latency investigation",
                "severity": "SEV2",
                "description": "Customer reports elevated latency. Investigate p95 and dependencies.",
            }),
        )
    return LLMResponse(assistant_message="Here is what I found and recommend...", done=True)
6) Agent runtime loop with guardrails and tool routing
TOOL_REGISTRY = {
    "search_docs": (SearchDocsArgs, tool_search_docs),
    "get_customer": (GetCustomerArgs, tool_get_customer),
    "create_ticket": (CreateTicketArgs, tool_create_ticket),
}

class ToolError(Exception):
    pass

def execute_tool(user_role: str, tool_call: ToolCall) -> Dict[str, Any]:
    if tool_call.name not in TOOL_REGISTRY:
        raise ToolError(f"Unknown tool: {tool_call.name}")
    # Policy check
    decision = authorize_tool_call(user_role, tool_call.name, tool_call.arguments)
    if not decision.allow:
        raise ToolError(f"Policy denied tool call: {decision.reason}")
    # Schema validation
    args_model, fn = TOOL_REGISTRY[tool_call.name]
    try:
        validated_args = args_model(**tool_call.arguments)
    except ValidationError as ve:
        raise ToolError(f"Invalid tool args: {ve}")
    # Execute
    return fn(validated_args)

def run_agent(state_store: InMemoryStateStore, run_id: str, user_role: str, max_steps: int = 8) -> str:
    state = state_store.get(run_id)
    for step in range(max_steps):
        resp = llm_step(state.messages, available_tools=list(TOOL_REGISTRY.keys()))
        state.messages.append({"role": "assistant", "content": resp.assistant_message})
        if resp.done:
            state_store.put(state)
            return resp.assistant_message
        if resp.tool_call:
            try:
                result = execute_tool(user_role, resp.tool_call)
                # Tool result is appended as a structured message (do not mix with user text)
                state.messages.append({"role": "tool", "content": json.dumps({
                    "name": resp.tool_call.name,
                    "result": result,
                })})
            except ToolError as e:
                state.messages.append({"role": "tool", "content": json.dumps({
                    "name": resp.tool_call.name,
                    "error": str(e),
                })})
        # Persist after every step so retries and async continuation can resume
        state_store.put(state)
    return "Stopped: max steps reached. Consider escalating to a human."
What this demonstrates (interview talking points):
- Tool calls are validated and authorized before execution.
- Tool results are appended as structured tool messages (reduces prompt injection surface).
- State is persisted each step (supports retries and async continuation).
Memory architecture patterns (short-term + long-term)
A robust system typically uses three stores:
flowchart TB
RT[Agent Runtime] --> ST[(State Store<br/>Redis/Postgres)]
RT --> VS[(Vector DB<br/>RAG Memory)]
RT --> ES[(Event Store / Audit Log)]
- State Store: current run state, cursor, pending tool calls; TTL for ephemeral runs.
- Vector DB: embeddings of docs and optionally “memories” (preferences, summaries).
- Event Store: immutable append-only record of actions for audit, replay, evaluation.
Design tip: store summaries of long conversations as episodic memory to control token growth.
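A minimal sketch of that tip: at the end of a run, write a compact episodic record rather than the full transcript. The summarizer is stubbed out here; in practice a small model produces it:

```python
# Sketch: after a run completes, persist a compact episodic memory record
# instead of the full transcript. The summary here is a placeholder for
# what a cheap summarization model would produce.
def record_episode(memory_store, run_id, user_id, messages, outcome):
    episode = {
        "run_id": run_id,
        "outcome": outcome,
        "summary": f"{len(messages)}-message run ended with: {outcome}",
    }
    memory_store.setdefault(user_id, []).append(episode)
    return episode
```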
Guardrails in depth: where to enforce what
A common mistake is relying on the model to “behave.” Instead, enforce constraints at multiple layers:
- Before LLM call
- sanitize user input (PII detection, malware links, prompt injection heuristics)
- attach user identity and entitlements to the request context
- After LLM proposes a tool call
- schema validate + policy authorize
- ensure tool is in an allowlist for that user/workspace
- enforce budgets: max tool calls, max write actions
- After tool returns
- redact sensitive fields (PII) before feeding back to LLM
- tag tool output as untrusted input (especially from web/RAG)
- Before final output
- content safety checks (toxicity, secrets, regulated advice)
- enforce response format (JSON schema, templates, citations)
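A minimal sketch of the redaction step applied to tool output before it re-enters the prompt; the two patterns here (emails and US-style SSNs) only hint at what a real DLP service covers:

```python
import re

# Sketch of a DLP-style redaction pass for tool output. Production systems
# use a dedicated DLP service with far broader coverage than these two rules.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text
```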
Workflow-first pattern (Temporal / Step Functions)
For long-running or high-stakes actions, deterministic orchestration often beats a free-running agent loop.
sequenceDiagram
participant C as Client
participant W as Workflow Engine
participant A as LLM Planner
participant T as Tool Services
participant P as Policy Engine
C->>W: Start workflow(goal, user_ctx)
W->>A: Generate plan (read-only)
A-->>W: Plan steps (structured)
loop steps
W->>P: Authorize(step)
P-->>W: allow/deny
alt allow
W->>T: Execute tool call
T-->>W: Result
W->>A: Summarize/decide next (bounded)
A-->>W: Next step / done
else deny
W-->>C: Escalate / request approval
end
end
W-->>C: Final result + audit trail
Why interviewers like this:
- Clear separation between decisioning (LLM) and execution (workflow engine).
- Strong story for retries, timeouts, idempotency, and auditability.
Reliability and scaling considerations
Idempotency and retries
Tool calls must be safe to retry. Use:
- Idempotency keys for write operations (e.g., X-Idempotency-Key: run_id:step_id)
- At-least-once execution semantics with dedupe on the server side
- Store “tool call already executed” markers in state store
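The pattern can be sketched as a wrapper that consults a durable result store before executing a write; an in-memory dict stands in for that store here:

```python
# Sketch: idempotent tool execution keyed by run_id:step_id. A retry with
# the same key returns the stored result instead of repeating the write.
class IdempotentExecutor:
    def __init__(self):
        self._results = {}  # In production: a durable store, typically with TTL

    def execute(self, run_id, step_id, fn, *args):
        key = f"{run_id}:{step_id}"
        if key in self._results:
            return self._results[key]  # Dedupe: skip the side effect
        result = fn(*args)
        self._results[key] = result
        return result
```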
Timeouts and async execution
- LLM calls can take seconds; tools can take longer.
- Use async job queues for slow tools and let the agent “await” results.
- In UI, stream partial progress and show step-by-step traces.
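A sketch of the timeout-then-defer pattern with asyncio: the tool gets a bounded wait, and on timeout the work is handed to a queue for an async worker while the agent receives a pending marker:

```python
import asyncio

# Sketch: run a slow tool with a timeout; on timeout, hand the work to a
# queue and return a pending marker so the agent can await it later.
async def call_tool_with_timeout(tool, timeout_s=2.0, fallback_queue=None):
    try:
        return await asyncio.wait_for(tool(), timeout=timeout_s)
    except asyncio.TimeoutError:
        if fallback_queue is not None:
            await fallback_queue.put(tool)  # Defer to an async worker
        return {"status": "pending"}
```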
Concurrency control
If multiple agent runs can mutate shared resources:
- Use optimistic concurrency (ETags/version fields)
- Or enforce a “single writer” workflow per resource (e.g., per ticket/customer)
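Optimistic concurrency with a version field reduces to compare-and-swap against the version the writer originally read; a minimal sketch:

```python
# Sketch of optimistic concurrency: an update applies only if the version
# the writer read is still current; otherwise the caller re-reads and retries.
class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def write(self, key, expected_version, value):
        current_version, _ = self._data.get(key, (0, None))
        if current_version != expected_version:
            return False  # Conflict: stale version
        self._data[key] = (current_version + 1, value)
        return True
```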
Cost controls
- Token budgets per run/user/team
- Cache retrieval results (RAG) and deterministic tool outputs
- Prefer smaller models for routing/classification; reserve large models for synthesis
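Caching deterministic tool outputs can be as simple as memoizing on the JSON-serialized arguments; this sketch is only safe for read-only tools with stable results:

```python
import functools
import json

# Sketch: memoize deterministic tool calls keyed by their serialized args.
# Only appropriate for read-only tools whose outputs are stable.
def cached_tool(fn):
    cache = {}
    @functools.wraps(fn)
    def wrapper(**kwargs):
        key = json.dumps(kwargs, sort_keys=True)
        if key not in cache:
            cache[key] = fn(**kwargs)
        return cache[key]
    return wrapper
```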
Observability
Instrument:
- Model latency, tool latency, success/error rates
- Step counts, abandonment rates, policy denials
- Traces with run_id correlation across services
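Correlation across services starts in-process: a context variable can carry the run_id through nested calls so every log record is attributable to its run. A minimal sketch:

```python
import contextvars

# Sketch: propagate run_id through nested calls with a context variable so
# every model/tool log record can be correlated back to its run.
run_id_var = contextvars.ContextVar("run_id", default=None)

def log_event(event, **fields):
    record = {"run_id": run_id_var.get(), "event": event, **fields}
    return record  # In production: emit to the tracing/metrics backend

def with_run(run_id, fn):
    token = run_id_var.set(run_id)
    try:
        return fn()
    finally:
        run_id_var.reset(token)
```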
Best Practices
Industry standards and practical guidance
Model gateway and centralized governance
- Standardize auth, logging, routing, fallback models, and cost controls.
- Version prompts and tool schemas like APIs.
Treat tool outputs as untrusted
- Especially web content or user-uploaded docs.
- Apply output filtering/redaction before feeding back to the LLM.
Use structured interfaces everywhere
- Function calling with JSON schema
- Typed tool args and typed tool results
- Structured final outputs for downstream automation
Prefer workflow-first for high-risk actions
- Human approval steps for destructive operations
- Deterministic state machine with explicit transitions
Memory minimization and summarization
- Keep only what you need in the context window.
- Summarize long histories into compact episodic memory.
- Store raw logs in the audit store, not in prompts.
Defense-in-depth guardrails
- Policy engine for authorization
- DLP for PII/secrets
- Content moderation
- Rate limiting + budgets
- Safe tool allowlists
Evaluation and red-teaming
- Offline test suites with adversarial prompts
- Tool misuse simulations
- Regression tests on prompt/tool changes
Common pitfalls to avoid
- Letting the LLM directly call internal APIs without a tool router and policy checks.
- No idempotency on write actions → duplicate tickets, duplicate refunds, repeated emails.
- Prompt injection via RAG: retrieved docs can contain malicious instructions.
- Unbounded loops: agent keeps calling tools; enforce max steps and budgets.
- Overstuffed context: cost spikes and degraded accuracy; summarize and externalize state.
- No audit trail: impossible to debug incidents or satisfy compliance requirements.
- Mixing concerns: planner, executor, and safety logic all in one prompt.
Interview Relevance
Agentic AI design shows up in system design interviews in two main ways:
- “Design an AI assistant for X” (support, SRE, finance ops, dev productivity)
- “Add LLM automation to an existing platform” (ticketing, CRM, monitoring)
How to frame your solution (a strong interview narrative)
Start with requirements:
- What actions can it take? read-only vs write actions
- Latency expectations: interactive vs asynchronous
- Safety/compliance: PII, approvals, audit logs
- Scale: concurrent users, tool QPS, cost budget
Propose a reference architecture:
- Agent Runtime Service
- Model Gateway
- Tool Router with schema validation
- Policy Engine (OPA/Cedar)
- State Store + Event/Audit Store
- Vector DB for RAG
- Workflow engine for long-running/high-risk steps
Discuss failure modes explicitly:
- Tool timeouts, partial failures, retries
- Model hallucination → mitigated by tool grounding + citations
- Prompt injection → mitigated by isolation and validation
- Cost overruns → budgets and caching
Explain data and control planes:
- Control plane: tool registration, policy management, prompt/model versioning
- Data plane: runtime execution, tool calls, state transitions, logs
Key discussion points interviewers probe
- Guardrails: Where are they enforced? How do you prevent unauthorized actions?
- Idempotency: How do you avoid duplicate side effects?
- Observability: Can you trace a bad action back to a tool call and model output?
- Memory: What do you store, where, and for how long? How do you handle deletion (GDPR)?
- Workflow vs agent: When do you use a deterministic workflow engine?
- Multi-tenancy: How do you isolate customers, rate limit, and enforce entitlements?
- Evaluation: How do you test changes to prompts/tools/models safely?
A concise “system design answer” template
- Clarify scope and actions (read vs write).
- Draw the architecture (runtime, tools, memory, guardrails).
- Walk through one end-to-end request with steps and data flow.
- Cover reliability (timeouts, retries, idempotency).
- Cover safety/security (policy engine, DLP, approvals).
- Cover scaling and cost (caching, model selection, budgets).
- End with observability and evaluation strategy.
Conclusion
Agentic AI systems are best understood as distributed workflows where an LLM proposes actions but the platform enforces correctness, safety, and reliability. The most production-ready designs separate:
- Decisioning (LLM planning and synthesis),
- Execution (tool router + workflow engine),
- State (short-term run state + long-term memory + audit logs),
- Guardrails (policy, schema validation, DLP, moderation, budgets).
In interviews, strong answers emphasize cloud-native fundamentals—idempotency, observability, security boundaries, and deterministic orchestration—while showing how tool use and memory make LLMs genuinely useful without sacrificing control.