Core Concepts
This page explains the key concepts behind AI Guardian: the filter engine, risk scoring system, routing logic, human review queue, and policy configuration.
Filter Engine
The Filter Engine is the heart of AI Guardian. It scans text content for threat patterns using a library of compiled regular expressions, each associated with a score delta. Multiple matches accumulate, but per-category scores are capped to prevent any single category from dominating. The total risk score is clamped to 100.
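The accumulate-cap-clamp behavior described above can be sketched as follows. The rule set, the per-category cap value, and the individual score deltas here are illustrative assumptions, not the engine's actual rule library:

```python
import re

# Hypothetical rules: (compiled pattern, category, score delta).
# The real engine ships a much larger compiled-regex library.
RULES = [
    (re.compile(r"ignore previous instructions", re.I), "prompt_injection", 40),
    (re.compile(r"you are DAN", re.I), "prompt_injection", 40),
    (re.compile(r"UNION\s+SELECT", re.I), "sql_injection", 35),
    (re.compile(r";\s*rm\s+-rf", re.I), "command_injection", 50),
]

CATEGORY_CAP = 60  # assumed per-category cap

def risk_score(text: str) -> int:
    """Accumulate deltas per category, cap each category, clamp the total to 100."""
    per_category: dict[str, int] = {}
    for pattern, category, delta in RULES:
        if pattern.search(text):
            per_category[category] = min(
                per_category.get(category, 0) + delta, CATEGORY_CAP
            )
    return min(sum(per_category.values()), 100)
```

With these assumed rules, a message matching two prompt-injection patterns scores 60 (capped), not 80, while matches in different categories add together up to the overall clamp of 100.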
Input Filter
Applied to every incoming request before it reaches the LLM. Scans all message content (including multipart messages) across these categories:
- Prompt Injection — "ignore previous instructions", DAN jailbreaks, system prompt leakage
- SQL Injection — UNION SELECT, DROP TABLE, comment injection
- Command Injection — shell metacharacters, `; rm -rf`, curl exfiltration patterns
- Data Exfiltration — requests to print credentials, config files, or environment variables
Output Filter
Applied to every LLM response before it is returned to the caller. Detects:
- PII Leaks — credit card numbers (Luhn-checked), SSNs, email addresses
- Secret Leaks — OpenAI API keys (`sk-...`), AWS access keys, generic tokens
- Harmful Content — patterns suggesting harmful instructions in responses
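The Luhn check mentioned for credit card numbers rejects digit sequences that merely look like card numbers, cutting false positives. A standard implementation of the checksum:

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to confirm credit-card-shaped matches before flagging."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:  # shorter than any real card number
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9       # equivalent to summing the two digits
        checksum += d
    return checksum % 10 == 0
```

Only matches that pass the checksum are scored as PII leaks.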
If the output filter flags a response, it is blocked even if the input was safe. The caller receives a 403 with `"code": "request_blocked"` in the body.
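A blocked response body might look like the following. Only the `"code": "request_blocked"` field is documented above; the surrounding envelope and the other fields shown here are illustrative assumptions:

```json
{
  "error": {
    "code": "request_blocked",
    "message": "Response blocked by output filter",
    "risk_score": 87
  }
}
```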
Risk Scoring
Every request and response receives a risk score between 0 and 100 and a corresponding risk level:
Score Levels
| Level | Score Range | Default Action |
|---|---|---|
| Low | 0–30 | Auto-allow → forward to LLM |
| Medium | 31–60 | Queue for human review |
| High | 61–80 | Queue for human review (priority) |
| Critical | 81–100 | Auto-block → 403 returned |
The thresholds (auto_allow_threshold and auto_block_threshold) are configurable per tenant via the Policy Engine.
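The default bands in the table above map directly to a level function. A minimal sketch using the default boundaries (a tenant's configured thresholds would shift the allow/block decisions, not the level names):

```python
def risk_level(score: int) -> str:
    """Map a 0-100 risk score to its level, per the default bands."""
    if score <= 30:
        return "low"
    if score <= 60:
        return "medium"
    if score <= 80:
        return "high"
    return "critical"
```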
Request Routing
Based on the risk score and configured policy, every request is routed to one of three paths:
- Allow — forwarded to the upstream LLM. Response is filtered before return.
- Queue — held in the review queue. A 202 response is returned immediately to the caller, containing a `review_item_id`.
- Block — rejected with a 403. No LLM call is made.
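The three-way decision reduces to two threshold comparisons. A sketch using the default thresholds (in practice these come from the tenant's active Policy):

```python
AUTO_ALLOW_THRESHOLD = 30  # default; per-tenant via the Policy Engine
AUTO_BLOCK_THRESHOLD = 81  # default; per-tenant via the Policy Engine

def route(score: int) -> str:
    """Map a risk score to one of the three routing paths."""
    if score >= AUTO_BLOCK_THRESHOLD:
        return "block"   # 403 returned, no LLM call
    if score <= AUTO_ALLOW_THRESHOLD:
        return "allow"   # forwarded upstream; response filtered before return
    return "queue"       # 202 returned with a review_item_id
```

Note that block is checked first, so a score that somehow satisfied both thresholds would still be blocked.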
Human-in-the-Loop (HITL)
Queued requests appear in the Review Queue in the dashboard. Reviewers can take three actions:
- Approve — the request is forwarded to the LLM and the response returned.
- Reject — the request is rejected with a 403.
- Escalate — the item is flagged for senior review without taking action.
SLA & Fallback
Each review item has an SLA deadline (default: 30 minutes, configurable per policy). A background worker polls every 60 seconds for expired items and applies the configured SLA fallback action:
- `block` (default) — expired items are auto-blocked
- `allow` — expired items are auto-approved and forwarded
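The expiry sweep the background worker performs can be sketched as follows. The item shape and statuses are hypothetical simplifications; the real worker would query the database rather than an in-memory list:

```python
from datetime import datetime, timedelta, timezone

def apply_sla_fallback(items: list[dict], fallback: str, now: datetime) -> None:
    """Resolve pending items whose SLA deadline has passed.

    A real worker runs a sweep like this every 60 seconds and applies
    the tenant policy's configured fallback action.
    """
    for item in items:
        if item["status"] == "pending" and now >= item["deadline"]:
            item["status"] = "blocked" if fallback == "block" else "approved"
```

With `fallback="block"` (the default), a queued request whose deadline has passed is auto-blocked; with `"allow"` it would be auto-approved and forwarded.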
Policy Engine
Each tenant has one active Policy that controls routing behavior:
- `auto_allow_threshold` — scores ≤ this value are auto-allowed (default: 30)
- `auto_block_threshold` — scores ≥ this value are auto-blocked (default: 81)
- `review_sla_minutes` — SLA time for human review (default: 30)
- `sla_fallback` — `"block"` or `"allow"`
- `custom_rules` — array of additional regex patterns with custom score deltas
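Putting the fields together, a policy using all defaults plus one custom rule might look like this. The shape of a `custom_rules` entry (field names and all) is an illustrative assumption:

```json
{
  "auto_allow_threshold": 30,
  "auto_block_threshold": 81,
  "review_sla_minutes": 30,
  "sla_fallback": "block",
  "custom_rules": [
    {"pattern": "project-codename", "category": "data_exfiltration", "score_delta": 40}
  ]
}
```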
Multi-Tenancy
AI Guardian is fully multi-tenant. Each tenant has isolated data: their own policies, users, review queues, and audit logs. Tenant identity is determined by the API key used to authenticate proxy requests. JWT tokens are used for dashboard access.
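Since the API key carries tenant identity for proxy requests, every request begins with a key-to-tenant lookup. A minimal sketch; the key format and in-memory map are assumptions, and a real deployment would look up hashed keys in per-tenant storage:

```python
# Hypothetical key -> tenant map; real keys would be stored hashed.
API_KEYS: dict[str, str] = {"gk_live_example": "tenant_acme"}

def resolve_tenant(api_key: str) -> str:
    """Return the tenant ID for an API key, or raise if the key is unknown."""
    tenant = API_KEYS.get(api_key)
    if tenant is None:
        raise PermissionError("unknown API key")
    return tenant
```

All subsequent reads (policy, review queue, audit log) are then scoped to the resolved tenant, which is what keeps tenant data isolated.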