AI Guardian

Core Concepts

This page explains the key concepts behind AI Guardian: the filter engine, risk scoring system, routing logic, human review queue, and policy configuration.

Filter Engine

The Filter Engine is the heart of AI Guardian. It scans text content for threat patterns using a library of compiled regular expressions, each associated with a score delta. Multiple matches accumulate, but per-category scores are capped to prevent any single category from dominating. The total risk score is clamped to 100.
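A minimal sketch of this accumulate, cap, and clamp scoring. The pattern table, score deltas, and the per-category cap of 50 below are illustrative assumptions, not AI Guardian's actual rules:

```python
import re

# Hypothetical pattern library: category -> list of (compiled regex, score delta).
PATTERNS = {
    "prompt_injection": [
        (re.compile(r"ignore (all )?previous instructions", re.I), 40),
    ],
    "sql_injection": [
        (re.compile(r"\bUNION\s+SELECT\b", re.I), 35),
        (re.compile(r"\bDROP\s+TABLE\b", re.I), 35),
    ],
}

CATEGORY_CAP = 50  # assumed per-category cap

def risk_score(text: str) -> int:
    """Accumulate deltas per category, cap each category, clamp the total to 100."""
    total = 0
    for rules in PATTERNS.values():
        cat_score = sum(delta for pattern, delta in rules if pattern.search(text))
        total += min(cat_score, CATEGORY_CAP)  # no single category dominates
    return min(total, 100)  # total risk score is clamped to 100
```

Note how two SQL-injection hits (35 + 35) would exceed the category cap and contribute only 50 to the total.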

Input Filter

Applied to every incoming request before it reaches the LLM. Scans all message content (including multipart messages) across these categories:

  • Prompt Injection — "ignore previous instructions", DAN jailbreaks, system prompt leakage
  • SQL Injection — UNION SELECT, DROP TABLE, comment injection
  • Command Injection — shell metacharacters, ; rm -rf, curl exfiltration patterns
  • Data Exfiltration — requests to print credentials, config files, or environment variables

Output Filter

Applied to every LLM response before it is returned to the caller. Detects:

  • PII Leaks — credit card numbers (Luhn-checked), SSNs, email addresses
  • Secret Leaks — OpenAI API keys (sk-...), AWS access keys, generic tokens
  • Harmful Content — patterns suggesting harmful instructions in responses
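The Luhn check on credit card candidates filters out digit strings that merely look like card numbers. A sketch of that validation; the candidate-matching regex here is an assumption:

```python
import re

# Loose candidate matcher (illustrative): 13-16 digits, optionally space/dash separated.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum over the digits of a candidate card number."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:        # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Report only candidates that pass the checksum, cutting false positives."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
```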

If the output filter flags a response, it is blocked even if the input was safe. The caller receives a 403 with "code": "request_blocked".

Risk Scoring

Every request and response receives a risk score between 0 and 100 and a corresponding risk level:

Score Levels

Level      Score Range   Default Action
Low        0–30          Auto-allow → forward to LLM
Medium     31–60         Queue for human review
High       61–80         Queue for human review (priority)
Critical   81–100        Auto-block → 403 returned

The thresholds (auto_allow_threshold and auto_block_threshold) are configurable per tenant via the Policy Engine.
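With the default thresholds, the score-to-action mapping can be sketched as follows; the function name and exact boundary handling are illustrative:

```python
def route(score: int,
          auto_allow_threshold: int = 30,
          auto_block_threshold: int = 81) -> str:
    """Map a 0-100 risk score to a routing decision using per-tenant thresholds."""
    if score <= auto_allow_threshold:
        return "allow"   # Low: forward to the upstream LLM
    if score >= auto_block_threshold:
        return "block"   # Critical: 403 returned, no LLM call
    return "queue"       # Medium/High: held for human review
```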

Request Routing

Based on the risk score and configured policy, every request is routed to one of three paths:

  • Allow — forwarded to the upstream LLM. Response is filtered before return.
  • Queue — held in the review queue. A 202 response is returned immediately to the caller, containing a review_item_id.
  • Block — rejected with a 403. No LLM call is made.
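The three paths translate to HTTP responses roughly as in this sketch. The respond helper is hypothetical, and any body fields beyond review_item_id and code are assumptions:

```python
import uuid

def respond(decision: str) -> tuple[int, dict]:
    """Illustrative mapping from routing decision to HTTP status and body."""
    if decision == "allow":
        return 200, {"status": "forwarded"}              # body comes from the LLM in practice
    if decision == "queue":
        return 202, {"review_item_id": str(uuid.uuid4())}  # returned immediately
    return 403, {"code": "request_blocked"}              # no LLM call is made
```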

Human-in-the-Loop (HitL)

Queued requests appear in the Review Queue in the dashboard. Reviewers can take three actions:

  • Approve — the request is forwarded to the LLM and the response returned.
  • Reject — the request is rejected with a 403.
  • Escalate — the item is flagged for senior review without taking action.

SLA & Fallback

Each review item has an SLA deadline (default: 30 minutes, configurable per policy). A background worker polls every 60 seconds for expired items and applies the configured SLA fallback action:

  • block (default) — expired items are auto-blocked
  • allow — expired items are auto-approved and forwarded
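A sketch of that fallback pass, using an assumed in-memory item shape; ReviewItem and its fields are illustrative, not the real store:

```python
from datetime import datetime, timedelta, timezone

class ReviewItem:
    """Illustrative review-queue item with an SLA deadline."""
    def __init__(self, sla_minutes: int = 30):
        self.deadline = datetime.now(timezone.utc) + timedelta(minutes=sla_minutes)
        self.status = "pending"

def apply_sla_fallback(items, fallback: str = "block") -> None:
    """Resolve expired pending items per the policy's sla_fallback setting."""
    now = datetime.now(timezone.utc)
    for item in items:
        if item.status == "pending" and item.deadline <= now:
            item.status = "blocked" if fallback == "block" else "approved"

# The background worker would run this pass on each 60-second poll.
```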

Policy Engine

Each tenant has one active Policy that controls routing behavior:

  • auto_allow_threshold — scores ≤ this value are auto-allowed (default: 30)
  • auto_block_threshold — scores ≥ this value are auto-blocked (default: 81)
  • review_sla_minutes — SLA time for human review (default: 30)
  • sla_fallback — "block" or "allow"
  • custom_rules — array of additional regex patterns with custom score deltas
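Put together, a policy might look like this sketch; the Policy class and the custom rule shown are illustrative assumptions, not the stored schema:

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    """Per-tenant policy with the defaults described above."""
    auto_allow_threshold: int = 30
    auto_block_threshold: int = 81
    review_sla_minutes: int = 30
    sla_fallback: str = "block"  # "block" or "allow"
    # Hypothetical custom rule: extra regex patterns with their own score deltas.
    custom_rules: list = field(default_factory=list)

# A tenant tightening review: lower the allow threshold, extend the SLA.
strict = Policy(auto_allow_threshold=10, review_sla_minutes=60,
                custom_rules=[{"pattern": r"internal-project-\w+", "score_delta": 20}])
```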

Multi-Tenancy

AI Guardian is fully multi-tenant. Each tenant has isolated data: their own policies, users, review queues, and audit logs. Tenant identity is determined by the API key used to authenticate proxy requests. JWT tokens are used for dashboard access.
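Tenant resolution from the API key can be sketched as a simple lookup; the key format and the resolve_tenant helper here are hypothetical:

```python
# Illustrative key-to-tenant mapping; real keys would live in a datastore.
API_KEYS = {"gk_live_abc123": "tenant-a", "gk_live_def456": "tenant-b"}

def resolve_tenant(api_key: str) -> str:
    """Scope every proxy request to the tenant that owns the presented key."""
    tenant_id = API_KEYS.get(api_key)
    if tenant_id is None:
        raise PermissionError("unknown API key")
    return tenant_id
```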