AI Guardian

Core Concepts

This page explains the key concepts behind AI Guardian: the filter engine, risk scoring system, routing logic, human review queue, and policy configuration.

Filter Engine

The Filter Engine is the heart of AI Guardian. It scans text content for threat patterns using a library of compiled regular expressions, each associated with a score delta. Multiple matches accumulate, but per-category scores are capped to prevent any single category from dominating. The total risk score is clamped to 100.
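A minimal sketch of this accumulate, cap, and clamp scoring. The pattern table, score deltas, and the per-category cap of 50 below are illustrative assumptions, not AI Guardian's actual rules:

```python
import re

# Hypothetical pattern library: category -> list of (compiled regex, score delta).
PATTERNS = {
    "prompt_injection": [
        (re.compile(r"ignore (all )?previous instructions", re.I), 40),
    ],
    "sql_injection": [
        (re.compile(r"\bUNION\s+SELECT\b", re.I), 35),
        (re.compile(r"\bDROP\s+TABLE\b", re.I), 35),
    ],
}

CATEGORY_CAP = 50  # assumed per-category cap

def risk_score(text: str) -> int:
    """Accumulate deltas per category, cap each category, clamp the total to 100."""
    total = 0
    for rules in PATTERNS.values():
        cat_score = sum(delta for pattern, delta in rules if pattern.search(text))
        total += min(cat_score, CATEGORY_CAP)  # no single category dominates
    return min(total, 100)  # total risk score is clamped to 100
```

Note how two SQL-injection hits (35 + 35) would exceed the category cap and contribute only 50 to the total.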

Input Filter

Applied to every incoming request before it reaches the LLM. Scans all message content (including multipart messages) across these categories:

  • Prompt Injection — "ignore previous instructions", DAN jailbreaks, system prompt leakage
  • SQL Injection — UNION SELECT, DROP TABLE, comment injection
  • Command Injection — shell metacharacters, ; rm -rf, curl exfiltration patterns
  • Data Exfiltration — requests to print credentials, config files, or environment variables

Output Filter

Applied to every LLM response before it is returned to the caller. Detects:

  • PII Leaks — credit card numbers (Luhn-checked), SSNs, email addresses
  • Secret Leaks — OpenAI API keys (sk-...), AWS access keys, generic tokens
  • Harmful Content — patterns suggesting harmful instructions in responses
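The Luhn check on credit card candidates filters out digit strings that merely look like card numbers. A sketch of that validation; the candidate-matching regex here is an assumption:

```python
import re

# Loose candidate matcher (illustrative): 13-16 digits, optionally space/dash separated.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum over the digits of a candidate card number."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:        # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Report only candidates that pass the checksum, cutting false positives."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
```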

If the output filter flags a response, it is blocked even if the input was safe. The caller receives a 403 with "code": "request_blocked".

Risk Scoring

Every request and response receives a risk score between 0 and 100 and a corresponding risk level:

Score Levels

Level      Score Range   Default Action
Low        0–30          Auto-allow → forward to LLM
Medium     31–60         Queue for human review
High       61–80         Queue for human review (priority)
Critical   81–100        Auto-block → 403 returned

The thresholds (auto_allow_threshold and auto_block_threshold) are configurable per tenant via the Policy Engine.
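With the default thresholds, the score-to-action mapping can be sketched as follows; the function name and exact boundary handling are illustrative:

```python
def route(score: int,
          auto_allow_threshold: int = 30,
          auto_block_threshold: int = 81) -> str:
    """Map a 0-100 risk score to a routing decision using per-tenant thresholds."""
    if score <= auto_allow_threshold:
        return "allow"   # Low: forward to the upstream LLM
    if score >= auto_block_threshold:
        return "block"   # Critical: 403 returned, no LLM call
    return "queue"       # Medium/High: held for human review
```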

Request Routing

Based on the risk score and configured policy, every request is routed to one of three paths:

  • Allow — forwarded to the upstream LLM. Response is filtered before return.
  • Queue — held in the review queue. A 202 response is returned immediately to the caller, containing a review_item_id.
  • Block — rejected with a 403. No LLM call is made.
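The three paths translate to HTTP responses roughly as in this sketch. The respond helper is hypothetical, and any body fields beyond review_item_id and code are assumptions:

```python
import uuid

def respond(decision: str) -> tuple[int, dict]:
    """Illustrative mapping from routing decision to HTTP status and body."""
    if decision == "allow":
        return 200, {"status": "forwarded"}              # body comes from the LLM in practice
    if decision == "queue":
        return 202, {"review_item_id": str(uuid.uuid4())}  # returned immediately
    return 403, {"code": "request_blocked"}              # no LLM call is made
```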

Human-in-the-Loop (HitL)

Queued requests appear in the Review Queue in the dashboard. Reviewers can take three actions:

  • Approve — the request is forwarded to the LLM and the response returned.
  • Reject — the request is rejected with a 403.
  • Escalate — the item is flagged for senior review without taking action.

SLA & Fallback

Each review item has an SLA deadline (default: 30 minutes, configurable per policy). A background worker polls every 60 seconds for expired items and applies the configured SLA fallback action:

  • block (default) — expired items are auto-blocked
  • allow — expired items are auto-approved and forwarded
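A sketch of that fallback pass, using an assumed in-memory item shape; ReviewItem and its fields are illustrative, not the real store:

```python
from datetime import datetime, timedelta, timezone

class ReviewItem:
    """Illustrative review-queue item with an SLA deadline."""
    def __init__(self, sla_minutes: int = 30):
        self.deadline = datetime.now(timezone.utc) + timedelta(minutes=sla_minutes)
        self.status = "pending"

def apply_sla_fallback(items, fallback: str = "block") -> None:
    """Resolve expired pending items per the policy's sla_fallback setting."""
    now = datetime.now(timezone.utc)
    for item in items:
        if item.status == "pending" and item.deadline <= now:
            item.status = "blocked" if fallback == "block" else "approved"

# The background worker would run this pass on each 60-second poll.
```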

Policy Engine

Each tenant has one active Policy that controls routing behavior:

  • auto_allow_threshold — scores ≤ this value are auto-allowed (default: 30)
  • auto_block_threshold — scores ≥ this value are auto-blocked (default: 81)
  • review_sla_minutes — SLA time for human review (default: 30)
  • sla_fallback — "block" or "allow"
  • custom_rules — array of additional regex patterns with custom score deltas
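Put together, a policy might look like this sketch; the Policy class and the custom rule shown are illustrative assumptions, not the stored schema:

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    """Per-tenant policy with the defaults described above."""
    auto_allow_threshold: int = 30
    auto_block_threshold: int = 81
    review_sla_minutes: int = 30
    sla_fallback: str = "block"  # "block" or "allow"
    # Hypothetical custom rule: extra regex patterns with their own score deltas.
    custom_rules: list = field(default_factory=list)

# A tenant tightening review: lower the allow threshold, extend the SLA.
strict = Policy(auto_allow_threshold=10, review_sla_minutes=60,
                custom_rules=[{"pattern": r"internal-project-\w+", "score_delta": 20}])
```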

Multi-Tenancy

AI Guardian is fully multi-tenant. Each tenant has isolated data: their own policies, users, review queues, and audit logs. Tenant identity is determined by the API key used to authenticate proxy requests. JWT tokens are used for dashboard access.
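Tenant resolution from the API key can be sketched as a simple lookup; the key format and the resolve_tenant helper here are hypothetical:

```python
# Illustrative key-to-tenant mapping; real keys would live in a datastore.
API_KEYS = {"gk_live_abc123": "tenant-a", "gk_live_def456": "tenant-b"}

def resolve_tenant(api_key: str) -> str:
    """Scope every proxy request to the tenant that owns the presented key."""
    tenant_id = API_KEYS.get(api_key)
    if tenant_id is None:
        raise PermissionError("unknown API key")
    return tenant_id
```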