Agentic AI Systems in the Cloud: LLM Workflows with Tools, Memory & Guardrails

A comprehensive guide for software engineers: designing LLM-powered workflows with tool use, memory, and guardrails.

One-line description: Practical patterns for building distributed, safe, tool-using LLM agents with memory—plus what to say in system design interviews.
Tags: agentic-ai, llm, distributed-systems, cloud-native, system-design, ml-architecture, guardrails


Introduction

Agentic AI systems are LLM-powered workflows that can plan, call tools, observe results, store/retrieve memory, and take actions toward a goal—often across multiple steps and services. Unlike a single prompt/response “chatbot,” an agentic system behaves more like a distributed application where the LLM is an orchestrator embedded in a broader architecture.

Why this topic matters

In production, most valuable LLM use cases require more than text generation:

  • Fetching data from internal systems (CRM, ticketing, logs, metrics)
  • Executing operations (create a Jira ticket, run a query, deploy a change)
  • Coordinating multi-step workflows (triage → diagnose → propose fix → verify)
  • Applying safety and compliance constraints (PII redaction, policy checks)
  • Maintaining context over time (user preferences, previous incidents)

This pushes you into classic distributed systems territory: retries, idempotency, consistency, observability, access control, rate limits, and failure modes.

Real-world context

Common agentic workloads:

  • SRE Copilot: Investigates alerts by querying metrics/logs, proposes mitigations, opens incidents.
  • Customer Support Agent: Reads ticket history, looks up account state, drafts responses, escalates when needed.
  • Sales Ops Assistant: Pulls pipeline data, generates summaries, schedules follow-ups, updates CRM.
  • Developer Productivity Agent: Reads repos, runs tests, creates PRs, ensures policy compliance.

In interviews, this topic is increasingly used to test whether you can design LLM-enabled systems with robust cloud-native foundations, not just prompt engineering.


Core Concepts

1) Agent loop: Plan → Act → Observe → Reflect

A typical agent runs an iterative loop:

  1. Plan: Decide next step(s) based on goal + context.
  2. Act: Call a tool (API/DB/function) or ask a human.
  3. Observe: Ingest tool results.
  4. Reflect: Update state/memory; decide whether done.

This resembles a workflow engine, except the decision logic is probabilistic and must be constrained.

2) Tools and function calling

A “tool” is any external capability the agent can invoke:

  • HTTP APIs (internal microservices)
  • SQL queries (read-only or controlled write)
  • Vector search (RAG retrieval)
  • Code execution sandbox (carefully)
  • Message queues / workflow triggers
  • Human-in-the-loop approvals

Function calling (structured tool invocation) is essential to avoid brittle “parse text” approaches. Your system should treat tool invocation as a typed API contract with validation.

3) Memory: short-term context vs long-term state

Agentic systems need multiple memory layers:

  • Short-term: Current conversation/workflow state (task, intermediate results).
  • Episodic memory: Past runs, outcomes, user preferences; used to personalize or avoid repeated mistakes.
  • Semantic memory: Retrieved knowledge (docs, runbooks) via RAG.
  • Operational memory: Audit logs, traces, tool call records (for debugging and compliance).

Key design decision: What belongs in the LLM context window vs external stores?

  • Put volatile, bounded state in context (recent messages, current plan).
  • Put large, durable, queryable state in external storage (DB, object store, vector DB).
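This split can be sketched as a small context assembler: keep a bounded window of recent messages in the prompt and hand everything older to an external store. The function and the `max_context_messages` parameter are illustrative, not from any particular framework:

```python
from typing import Dict, List, Tuple

def build_context(
    messages: List[Dict[str, str]],
    max_context_messages: int = 6,
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split history into in-context messages and older ones to externalize.

    Recent messages stay in the prompt; everything older is returned
    separately so it can be summarized or written to durable storage.
    """
    if len(messages) <= max_context_messages:
        return messages, []
    return messages[-max_context_messages:], messages[:-max_context_messages]

# Usage: 10 messages of history, keep the last 6 in the context window
history = [{"role": "user", "content": f"msg {i}"} for i in range(10)]
in_context, to_externalize = build_context(history)
```

The same boundary also makes the state store the source of truth: the context window is always reconstructable from it.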

4) Guardrails: safety, security, and reliability controls

Guardrails are not just moderation. In production they include:

  • Input validation: tool schemas, allowed parameters, max ranges
  • Policy enforcement: role-based tool access, PII handling, data residency
  • Prompt injection defenses: treat retrieved content as untrusted
  • Output constraints: JSON schema, safe templates, citations
  • Human approvals: for high-risk actions (writes, deletions, deployments)
  • Rate limits and budgets: token spend, tool call limits, timeouts
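The budget guardrail in particular is cheap to enforce in code. A minimal sketch, with illustrative default limits (a production version would persist counters in the state store):

```python
from dataclasses import dataclass

@dataclass
class RunBudget:
    """Per-run spending limits; the defaults here are illustrative."""
    max_tool_calls: int = 10
    max_tokens: int = 50_000
    tool_calls_used: int = 0
    tokens_used: int = 0

    def charge_tool_call(self) -> None:
        # Check before incrementing so a denied call consumes no budget
        if self.tool_calls_used >= self.max_tool_calls:
            raise RuntimeError("Budget exceeded: tool call limit")
        self.tool_calls_used += 1

    def charge_tokens(self, n: int) -> None:
        if self.tokens_used + n > self.max_tokens:
            raise RuntimeError("Budget exceeded: token limit")
        self.tokens_used += n

# Usage: a run limited to two tool calls
budget = RunBudget(max_tool_calls=2)
budget.charge_tool_call()
budget.charge_tool_call()
```

Charging the budget before executing the tool (not after) guarantees the limit is a hard ceiling even when calls fail mid-flight.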

5) Orchestration patterns: single agent vs multi-agent vs workflows

  • Single agent: One LLM orchestrates all steps. Simpler, but can become monolithic.
  • Multi-agent: Specialized agents (retriever, planner, executor, reviewer). Helps separation of concerns but adds coordination complexity.
  • Workflow-first: Deterministic workflow engine (Temporal/Step Functions/Durable Functions) with LLM used only for specific tasks (classification, summarization, plan generation). Often the most reliable for enterprise.

6) Distributed systems concerns

Treat the agent runtime as a distributed system component:

  • Idempotency: tool calls must be safe to retry
  • Consistency: avoid “double writes” when retries happen
  • Timeouts: LLM calls and tools can be slow; design async
  • Backpressure: protect dependencies from floods
  • Observability: trace each tool call and model call with correlation IDs
  • Versioning: prompts, tools, policies, and model versions must be tracked

Implementation Details

This section outlines a practical, cloud-native reference architecture and includes code examples for tool use, memory, and guardrails.

Reference architecture

flowchart LR
  U[User / Client] -->|HTTP/WebSocket| API[Agent API Gateway]

  API --> RT[Agent Runtime Service]
  RT -->|call| LLM[LLM Provider / Model Gateway]
  RT -->|tool calls| TOOLS[Tool Router]

  TOOLS --> SVC1[Internal Services]
  TOOLS --> DB[(SQL/NoSQL)]
  TOOLS --> VS[(Vector DB)]
  TOOLS --> Q[Queue/Workflow Engine]

  RT --> MEM["State Store<br/>(Redis/Postgres)"]
  RT --> LOG[(Audit Log / Event Store)]
  RT --> OBS[Tracing + Metrics]

  subgraph Guardrails
    POL["Policy Engine<br/>(OPA / Cedar)"]
    MOD[Content Safety / DLP]
    SCHEMA[Schema Validation]
  end

  RT --> POL
  RT --> MOD
  RT --> SCHEMA

Key ideas:

  • Put the LLM behind a Model Gateway (centralized auth, routing, caching, cost controls).
  • Route tools through a Tool Router that enforces schemas, policies, and rate limits.
  • Store workflow state in a State Store; store immutable traces in an Audit Log.
  • Use a workflow engine/queue for long-running or high-latency tasks.

Data model: agent run, steps, and tool calls

In production, you’ll want durable records:

  • agent_run: run_id, user_id, goal, status, started_at, model_version
  • agent_step: step_id, run_id, type (plan/tool/reflect), input/output, timestamps
  • tool_call: tool_call_id, tool_name, args, result, latency, error
  • policy_decision: decision_id, allow/deny, reason, attributes

This supports debugging, compliance, and offline evaluation.
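One minimal way to persist these records is plain SQL. The sketch below uses SQLite for illustration; column names follow the list above, while the foreign-key linkage (tool_call → agent_step) is an assumption about how you might relate them:

```python
import sqlite3

SCHEMA = """
CREATE TABLE agent_run (
    run_id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    goal TEXT NOT NULL,
    status TEXT NOT NULL,
    started_at TEXT NOT NULL,
    model_version TEXT NOT NULL
);
CREATE TABLE agent_step (
    step_id TEXT PRIMARY KEY,
    run_id TEXT NOT NULL REFERENCES agent_run(run_id),
    type TEXT NOT NULL CHECK (type IN ('plan', 'tool', 'reflect')),
    input TEXT,
    output TEXT,
    started_at TEXT,
    finished_at TEXT
);
CREATE TABLE tool_call (
    tool_call_id TEXT PRIMARY KEY,
    step_id TEXT NOT NULL REFERENCES agent_step(step_id),
    tool_name TEXT NOT NULL,
    args TEXT NOT NULL,
    result TEXT,
    latency_ms INTEGER,
    error TEXT
);
CREATE TABLE policy_decision (
    decision_id TEXT PRIMARY KEY,
    tool_call_id TEXT REFERENCES tool_call(tool_call_id),
    allow INTEGER NOT NULL,
    reason TEXT NOT NULL,
    attributes TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO agent_run VALUES (?, ?, ?, ?, ?, ?)",
    ("run-1", "u-42", "triage payment latency alert", "running",
     "2024-01-01T00:00:00Z", "model-v1"),
)
```

Append-only writes to these tables (never updates to past steps) are what make replay and offline evaluation straightforward.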


Practical code: a minimal agent runtime with tools + guardrails (Python)

Below is an intentionally “framework-light” example showing:

  • Typed tool definitions
  • Schema validation
  • Policy checks
  • Memory persistence
  • Iterative loop with tool calls

1) Tool contracts (Pydantic schemas)

from typing import Any, Dict, Optional, Literal, List
from pydantic import BaseModel, ValidationError, conint, constr

class ToolCall(BaseModel):
    name: str
    arguments: Dict[str, Any]

class SearchDocsArgs(BaseModel):
    query: constr(min_length=3, max_length=256)
    top_k: conint(ge=1, le=10) = 5

class GetCustomerArgs(BaseModel):
    customer_id: constr(min_length=3, max_length=64)

class CreateTicketArgs(BaseModel):
    title: constr(min_length=5, max_length=120)
    severity: Literal["SEV1", "SEV2", "SEV3"]
    description: constr(min_length=10, max_length=4000)
    customer_id: Optional[str] = None

2) Tool implementations (stubs)

import time

def tool_search_docs(args: SearchDocsArgs) -> Dict[str, Any]:
    # In reality: vector DB query + reranking
    return {
        "results": [
            {"doc_id": "runbook-123", "title": "Payment latency runbook", "snippet": "Check p95..."}
        ]
    }

def tool_get_customer(args: GetCustomerArgs) -> Dict[str, Any]:
    # In reality: call internal customer service
    return {"customer_id": args.customer_id, "plan": "enterprise", "region": "us-east-1"}

def tool_create_ticket(args: CreateTicketArgs) -> Dict[str, Any]:
    # In reality: call Jira/ServiceNow with idempotency key
    time.sleep(0.1)
    return {"ticket_id": "INC-4567", "status": "created"}

3) Policy engine hook (OPA/Cedar-style)

In interviews, it’s valuable to describe policy as a separate service. Here’s a simplified local check:

class PolicyDecision(BaseModel):
    allow: bool
    reason: str

def authorize_tool_call(user_role: str, tool_name: str, args: Dict[str, Any]) -> PolicyDecision:
    # Example: only SREs can create tickets with SEV1
    if tool_name == "create_ticket" and args.get("severity") == "SEV1" and user_role != "sre":
        return PolicyDecision(allow=False, reason="Only SRE role can create SEV1 tickets")
    # Example: customer lookup allowed for support roles
    if tool_name == "get_customer" and user_role not in ("support", "sre"):
        return PolicyDecision(allow=False, reason="Insufficient role for customer lookup")
    return PolicyDecision(allow=True, reason="Allowed")

4) Memory store (workflow state)

Use Redis/Postgres for state; keep it simple here:

import json
from dataclasses import dataclass, field

@dataclass
class RunState:
    run_id: str
    user_id: str
    goal: str
    messages: List[Dict[str, str]] = field(default_factory=list)
    scratchpad: Dict[str, Any] = field(default_factory=dict)

class InMemoryStateStore:
    def __init__(self):
        self._store: Dict[str, RunState] = {}

    def get(self, run_id: str) -> RunState:
        return self._store[run_id]

    def put(self, state: RunState) -> None:
        self._store[state.run_id] = state

5) LLM interface (function calling)

This is pseudo-LLM code: in real systems you’ll call your model gateway.

class LLMResponse(BaseModel):
    assistant_message: str
    tool_call: Optional[ToolCall] = None
    done: bool = False

def llm_step(messages: List[Dict[str, str]], available_tools: List[str]) -> LLMResponse:
    """
    Replace with actual model call. We assume the model returns either:
    - a tool call (name + JSON args), or
    - a final response.
    """
    last = messages[-1]["content"].lower()
    if "runbook" in last or "how do i" in last:
        return LLMResponse(
            assistant_message="I'll search internal docs.",
            # Truncate to respect the SearchDocsArgs max_length constraint
            tool_call=ToolCall(name="search_docs", arguments={"query": messages[-1]["content"][:256], "top_k": 3}),
        )
    if "create ticket" in last:
        return LLMResponse(
            assistant_message="Creating an incident ticket.",
            tool_call=ToolCall(name="create_ticket", arguments={
                "title": "Payment latency investigation",
                "severity": "SEV2",
                "description": "Customer reports elevated latency. Investigate p95 and dependencies."
            }),
        )
    return LLMResponse(assistant_message="Here is what I found and recommend...", done=True)

6) Agent runtime loop with guardrails and tool routing

TOOL_REGISTRY = {
    "search_docs": (SearchDocsArgs, tool_search_docs),
    "get_customer": (GetCustomerArgs, tool_get_customer),
    "create_ticket": (CreateTicketArgs, tool_create_ticket),
}

class ToolError(Exception):
    pass

def execute_tool(user_role: str, tool_call: ToolCall) -> Dict[str, Any]:
    if tool_call.name not in TOOL_REGISTRY:
        raise ToolError(f"Unknown tool: {tool_call.name}")

    # Policy check
    decision = authorize_tool_call(user_role, tool_call.name, tool_call.arguments)
    if not decision.allow:
        raise ToolError(f"Policy denied tool call: {decision.reason}")

    # Schema validation
    args_model, fn = TOOL_REGISTRY[tool_call.name]
    try:
        validated_args = args_model(**tool_call.arguments)
    except ValidationError as ve:
        raise ToolError(f"Invalid tool args: {ve}") from ve

    # Execute
    return fn(validated_args)

def run_agent(state_store: InMemoryStateStore, run_id: str, user_role: str, max_steps: int = 8) -> str:
    state = state_store.get(run_id)

    for step in range(max_steps):
        resp = llm_step(state.messages, available_tools=list(TOOL_REGISTRY.keys()))
        state.messages.append({"role": "assistant", "content": resp.assistant_message})

        if resp.done:
            state_store.put(state)
            return resp.assistant_message

        if resp.tool_call:
            try:
                result = execute_tool(user_role, resp.tool_call)
                # Tool result is appended as a structured message (do not mix with user text)
                state.messages.append({"role": "tool", "content": json.dumps({
                    "name": resp.tool_call.name,
                    "result": result
                })})
            except ToolError as e:
                state.messages.append({"role": "tool", "content": json.dumps({
                    "name": resp.tool_call.name,
                    "error": str(e)
                })})

        state_store.put(state)

    return "Stopped: max steps reached. Consider escalating to a human."

What this demonstrates (interview talking points):

  • Tool calls are validated and authorized before execution.
  • Tool results are appended as structured tool messages (reduces prompt injection surface).
  • State is persisted each step (supports retries and async continuation).

Memory architecture patterns (short-term + long-term)

A robust system typically uses three stores:

flowchart TB
  RT[Agent Runtime] --> ST[(State Store<br/>Redis/Postgres)]
  RT --> VS[(Vector DB<br/>RAG Memory)]
  RT --> ES[(Event Store / Audit Log)]

  • State Store: current run state, cursor, pending tool calls; TTL for ephemeral runs.
  • Vector DB: embeddings of docs and optionally “memories” (preferences, summaries).
  • Event Store: immutable append-only record of actions for audit, replay, evaluation.

Design tip: store summaries of long conversations as episodic memory to control token growth.
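That tip can be sketched as a compaction step: once history crosses a threshold, fold older messages into one summary message. The `summarize` function here is a naive stand-in for what would be an LLM summarization call:

```python
from typing import Dict, List

def summarize(messages: List[Dict[str, str]]) -> str:
    """Stand-in for an LLM summarization call over old messages."""
    return f"[summary of {len(messages)} earlier messages]"

def compact_history(
    messages: List[Dict[str, str]], keep_recent: int = 4
) -> List[Dict[str, str]]:
    """Fold older messages into a single summary message to bound token growth."""
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [{"role": "system", "content": summarize(older)}] + recent

# Usage: a 10-turn history compacted to a summary plus the last 4 turns
history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact_history(history)
```

The uncompacted originals still belong in the event store; compaction only controls what re-enters the prompt.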


Guardrails in depth: where to enforce what

A common mistake is relying on the model to “behave.” Instead, enforce constraints at multiple layers:

  1. Before LLM call
    • sanitize user input (PII detection, malicious link checks, prompt injection heuristics)
    • attach user identity and entitlements to the request context
  2. After LLM proposes a tool call
    • schema validate + policy authorize
    • ensure tool is in an allowlist for that user/workspace
    • enforce budgets: max tool calls, max write actions
  3. After tool returns
    • redact sensitive fields (PII) before feeding back to LLM
    • tag tool output as untrusted input (especially from web/RAG)
  4. Before final output
    • content safety checks (toxicity, secrets, regulated advice)
    • enforce response format (JSON schema, templates, citations)
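Layer 4 can be enforced with a response contract in the same Pydantic style as the tool schemas earlier. `FinalAnswer` and its fields are illustrative; the point is that malformed model output is rejected, never passed through:

```python
import json
from typing import List
from pydantic import BaseModel, ValidationError

class FinalAnswer(BaseModel):
    """Illustrative contract for the agent's final output."""
    answer: str
    citations: List[str] = []
    escalate_to_human: bool = False

def enforce_output_contract(raw: str) -> FinalAnswer:
    """Reject any model output that is not valid JSON matching the schema."""
    try:
        return FinalAnswer(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError) as exc:
        raise ValueError(f"Output failed contract: {exc}") from exc

# Usage: a well-formed final response passes; free text does not
good = enforce_output_contract(
    '{"answer": "Check p95 latency per the runbook.", "citations": ["runbook-123"]}'
)
```

Downstream automation can then consume `FinalAnswer` fields directly instead of parsing prose.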

Workflow-first pattern (Temporal / Step Functions)

For long-running or high-stakes actions, deterministic orchestration often beats a free-running agent loop.

sequenceDiagram
  participant C as Client
  participant W as Workflow Engine
  participant A as LLM Planner
  participant T as Tool Services
  participant P as Policy Engine

  C->>W: Start workflow(goal, user_ctx)
  W->>A: Generate plan (read-only)
  A-->>W: Plan steps (structured)
  loop steps
    W->>P: Authorize(step)
    P-->>W: allow/deny
    alt allow
      W->>T: Execute tool call
      T-->>W: Result
      W->>A: Summarize/decide next (bounded)
      A-->>W: Next step / done
    else deny
      W-->>C: Escalate / request approval
    end
  end
  W-->>C: Final result + audit trail

Why interviewers like this:

  • Clear separation between decisioning (LLM) and execution (workflow engine).
  • Strong story for retries, timeouts, idempotency, and auditability.

Reliability and scaling considerations

Idempotency and retries

Tool calls must be safe to retry. Use:

  • Idempotency keys for write operations (e.g., X-Idempotency-Key: run_id:step_id)
  • At-least-once execution semantics with dedupe on the server side
  • Store “tool call already executed” markers in state store
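The "already executed" marker pattern can be sketched as a key-to-result map. Here it is in-memory for illustration; a production version would use the state store with an atomic set-if-absent, keyed by something like `run_id:step_id`:

```python
from typing import Any, Callable, Dict

class IdempotentExecutor:
    """Dedupe write operations by idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def execute(self, key: str, fn: Callable[[], Any]) -> Any:
        if key in self._results:
            # Retry path: return the recorded result, do not re-execute
            return self._results[key]
        result = fn()  # first execution
        self._results[key] = result
        return result

# Usage: a retried step returns the original ticket instead of creating a second one
calls = []
def create_ticket() -> Dict[str, str]:
    calls.append(1)
    return {"ticket_id": f"INC-{1000 + len(calls)}"}

ex = IdempotentExecutor()
first = ex.execute("run-1:step-3", create_ticket)
retry = ex.execute("run-1:step-3", create_ticket)
```

Recording the result (not just a "done" flag) matters: the retry needs the original ticket ID to continue the workflow.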

Timeouts and async execution

  • LLM calls can take seconds; tools can take longer.
  • Use async job queues for slow tools and let the agent “await” results.
  • In UI, stream partial progress and show step-by-step traces.
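Bounding a slow tool call can be sketched with `asyncio.wait_for`; on timeout, the agent receives a structured error it can reason about instead of hanging the run (the tool and timeout values are illustrative):

```python
import asyncio

async def slow_tool() -> str:
    await asyncio.sleep(0.05)  # stand-in for a slow tool call
    return "result"

async def call_with_timeout(timeout_s: float) -> str:
    """Bound every tool call with a deadline and fail into a structured error."""
    try:
        return await asyncio.wait_for(slow_tool(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "error: tool timed out"

# Usage: a generous deadline succeeds; a tight one degrades gracefully
fast = asyncio.run(call_with_timeout(1.0))
slow = asyncio.run(call_with_timeout(0.01))
```

For tools that routinely exceed interactive latency, the same pattern extends to enqueueing a job and suspending the run until a completion event arrives.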

Concurrency control

If multiple agent runs can mutate shared resources:

  • Use optimistic concurrency (ETags/version fields)
  • Or enforce a “single writer” workflow per resource (e.g., per ticket/customer)
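Optimistic concurrency can be sketched as compare-and-swap on a version field (ETag-style). The `Store` class here is an in-memory stand-in for whatever database holds the shared resource:

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class VersionedRecord:
    value: str
    version: int = 0

class ConflictError(Exception):
    pass

class Store:
    """Writes succeed only if the caller holds the current version."""

    def __init__(self) -> None:
        self._records: Dict[str, VersionedRecord] = {}

    def put(self, key: str, value: str, expected_version: int) -> int:
        current = self._records.get(key, VersionedRecord("", 0))
        if current.version != expected_version:
            raise ConflictError(
                f"Version mismatch: expected {expected_version}, got {current.version}"
            )
        self._records[key] = VersionedRecord(value, expected_version + 1)
        return expected_version + 1

# Usage: the first writer bumps the version; a stale writer will conflict
store = Store()
v1 = store.put("ticket-1", "open", expected_version=0)
```

On conflict, the agent should re-read the resource and re-plan rather than blindly retry the write.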

Cost controls

  • Token budgets per run/user/team
  • Cache retrieval results (RAG) and deterministic tool outputs
  • Prefer smaller models for routing/classification; reserve large models for synthesis
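Caching deterministic tool outputs can be sketched by keying on the tool name plus canonical (sorted-key) JSON of its arguments. This is only safe for read-only, deterministic tools; write actions must never be cached:

```python
import hashlib
import json
from typing import Any, Callable, Dict

class ToolResultCache:
    """Cache results of deterministic, read-only tools."""

    def __init__(self) -> None:
        self._cache: Dict[str, Any] = {}
        self.hits = 0

    def _key(self, tool: str, args: Dict[str, Any]) -> str:
        # sort_keys makes equivalent arg dicts hash identically
        canonical = json.dumps(args, sort_keys=True)
        return hashlib.sha256(f"{tool}:{canonical}".encode()).hexdigest()

    def call(self, tool: str, args: Dict[str, Any], fn: Callable[..., Any]) -> Any:
        key = self._key(tool, args)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        result = fn(**args)
        self._cache[key] = result
        return result

# Usage: the second identical lookup is served from cache
cache = ToolResultCache()
def search_docs(query: str) -> list:
    return [{"doc_id": "runbook-123"}]

r1 = cache.call("search_docs", {"query": "payment latency"}, search_docs)
r2 = cache.call("search_docs", {"query": "payment latency"}, search_docs)
```

A real deployment would add a TTL so cached retrievals don't outlive the freshness of the underlying data.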

Observability

Instrument:

  • Model latency, tool latency, success/error rates
  • Step counts, abandonment rates, policy denials
  • Traces with run_id correlation across services
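A minimal sketch of that instrumentation: wrap each model and tool call in a context manager that emits one structured event tagged with the run_id. The in-memory `EVENTS` list stands in for a real metrics/tracing backend:

```python
import time
from contextlib import contextmanager
from typing import Iterator, List

EVENTS: List[dict] = []  # stand-in for a tracing/metrics exporter

@contextmanager
def traced(run_id: str, span: str) -> Iterator[None]:
    """Emit one structured event per model/tool call, correlated by run_id."""
    start = time.perf_counter()
    error = None
    try:
        yield
    except Exception as exc:
        error = str(exc)
        raise
    finally:
        EVENTS.append({
            "run_id": run_id,
            "span": span,
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
            "error": error,
        })

# Usage: every span, including failures, lands in the event stream
with traced("run-1", "tool:search_docs"):
    time.sleep(0.01)  # stand-in for a tool call
```

Because errors are recorded in `finally`, policy denials and tool failures show up in the same stream as successes, which is what makes denial rates and step counts queryable.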

Best Practices

Industry standards and practical guidance

  1. Model gateway and centralized governance

    • Standardize auth, logging, routing, fallback models, and cost controls.
    • Version prompts and tool schemas like APIs.
  2. Treat tool outputs as untrusted

    • Especially web content or user-uploaded docs.
    • Apply output filtering/redaction before feeding back to the LLM.
  3. Use structured interfaces everywhere

    • Function calling with JSON schema
    • Typed tool args and typed tool results
    • Structured final outputs for downstream automation
  4. Prefer workflow-first for high-risk actions

    • Human approval steps for destructive operations
    • Deterministic state machine with explicit transitions
  5. Memory minimization and summarization

    • Keep only what you need in the context window.
    • Summarize long histories into compact episodic memory.
    • Store raw logs in the audit store, not in prompts.
  6. Defense-in-depth guardrails

    • Policy engine for authorization
    • DLP for PII/secrets
    • Content moderation
    • Rate limiting + budgets
    • Safe tool allowlists
  7. Evaluation and red-teaming

    • Offline test suites with adversarial prompts
    • Tool misuse simulations
    • Regression tests on prompt/tool changes

Common pitfalls to avoid

  • Letting the LLM directly call internal APIs without a tool router and policy checks.
  • No idempotency on write actions → duplicate tickets, duplicate refunds, repeated emails.
  • Prompt injection via RAG: retrieved docs can contain malicious instructions.
  • Unbounded loops: agent keeps calling tools; enforce max steps and budgets.
  • Overstuffed context: cost spikes and degraded accuracy; summarize and externalize state.
  • No audit trail: impossible to debug incidents or satisfy compliance requirements.
  • Mixing concerns: planner, executor, and safety logic all in one prompt.

Interview Relevance

Agentic AI design shows up in system design interviews in two main ways:

  1. “Design an AI assistant for X” (support, SRE, finance ops, dev productivity)
  2. “Add LLM automation to an existing platform” (ticketing, CRM, monitoring)

How to frame your solution (a strong interview narrative)

Start with requirements:

  • What actions can it take? read-only vs write actions
  • Latency expectations: interactive vs asynchronous
  • Safety/compliance: PII, approvals, audit logs
  • Scale: concurrent users, tool QPS, cost budget

Propose a reference architecture:

  • Agent Runtime Service
  • Model Gateway
  • Tool Router with schema validation
  • Policy Engine (OPA/Cedar)
  • State Store + Event/Audit Store
  • Vector DB for RAG
  • Workflow engine for long-running/high-risk steps

Discuss failure modes explicitly:

  • Tool timeouts, partial failures, retries
  • Model hallucination → mitigated by tool grounding + citations
  • Prompt injection → mitigated by isolation and validation
  • Cost overruns → budgets and caching

Explain data and control planes:

  • Control plane: tool registration, policy management, prompt/model versioning
  • Data plane: runtime execution, tool calls, state transitions, logs

Key discussion points interviewers probe

  • Guardrails: Where are they enforced? How do you prevent unauthorized actions?
  • Idempotency: How do you avoid duplicate side effects?
  • Observability: Can you trace a bad action back to a tool call and model output?
  • Memory: What do you store, where, and for how long? How do you handle deletion (GDPR)?
  • Workflow vs agent: When do you use a deterministic workflow engine?
  • Multi-tenancy: How do you isolate customers, rate limit, and enforce entitlements?
  • Evaluation: How do you test changes to prompts/tools/models safely?

A concise “system design answer” template

  1. Clarify scope and actions (read vs write).
  2. Draw the architecture (runtime, tools, memory, guardrails).
  3. Walk through one end-to-end request with steps and data flow.
  4. Cover reliability (timeouts, retries, idempotency).
  5. Cover safety/security (policy engine, DLP, approvals).
  6. Cover scaling and cost (caching, model selection, budgets).
  7. End with observability and evaluation strategy.

Conclusion

Agentic AI systems are best understood as distributed workflows where an LLM proposes actions but the platform enforces correctness, safety, and reliability. The most production-ready designs separate:

  • Decisioning (LLM planning and synthesis),
  • Execution (tool router + workflow engine),
  • State (short-term run state + long-term memory + audit logs),
  • Guardrails (policy, schema validation, DLP, moderation, budgets).

In interviews, strong answers emphasize cloud-native fundamentals—idempotency, observability, security boundaries, and deterministic orchestration—while showing how tool use and memory make LLMs genuinely useful without sacrificing control.