Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
253 changes: 253 additions & 0 deletions capabilities/ai-red-teaming/skills/trace-analysis-advisor/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,253 @@
---
name: trace-analysis-advisor
description: Analyzes historical attack traces to provide intelligent recommendations for attack selection, transform effectiveness, and strategic adaptation
allowed-tools: analyze_attack_effectiveness suggest_optimal_transforms predict_attack_success identify_vulnerability_patterns get_historical_metrics
---

# Trace Analysis Advisor

Leverages historical OTEL trace data from previous assessments to provide intelligent, data-driven recommendations for AI red teaming operations. This skill transforms the agent from reactive trial-and-error to strategic, evidence-based attack planning.

## Core Capabilities

### 1. **Attack Effectiveness Analysis**
Analyzes success rates of different attack types against specific target models and goal categories to recommend optimal attack strategies.

**Use case:** "Which attack should I use against claude-sonnet for credential extraction goals?"

### 2. **Transform Optimization**
Evaluates historical effectiveness of transforms across different targets and attack types to suggest optimal obfuscation strategies.

**Use case:** "What transforms work best for system prompt extraction against this target model?"

### 3. **Success Prediction**
Predicts likelihood of attack success based on target characteristics, attack type, and historical patterns.

**Use case:** "What's the probability that TAP will succeed against this target for this goal category?"

### 4. **Vulnerability Fingerprinting**
Identifies response patterns and characteristics that indicate specific vulnerability types in target models.

**Use case:** "This target shows similar patterns to models vulnerable to multi-turn attacks. Recommending Crescendo."

## API Endpoints Used

The skill integrates with the Dreadnode platform's ClickHouse-backed analytics API:

### Primary Endpoints
- `GET /workspaces/{ws}/airt/assessments/{id}/analytics` - Get aggregated metrics
- `GET /workspaces/{ws}/airt/assessments/{id}/traces/attacks` - Get attack-level performance data
- `GET /workspaces/{ws}/airt/assessments/{id}/traces/trials` - Get trial-level results with filtering
- `GET /workspaces/{ws}/airt/projects/{project}/summary` - Get project-wide statistics

### Query Parameters for Analysis
```
/traces/trials?attack_name={attack}&min_score={threshold}&limit={count}
/traces/attacks?assessment_id={id}
```

### Data Sources
- **OTEL Traces** - Full conversation history, prompts, responses, scores
- **Attack Spans** - ASR, best scores, transform effectiveness by attack type
- **Trial Spans** - Individual attempt outcomes with filtering capabilities
- **Analytics Snapshots** - Materialized metrics including severity, compliance

## Tool Functions

### `analyze_attack_effectiveness`
**Purpose:** Determine which attack types work best for specific targets and goals

**Parameters:**
- `target_model` - Target model identifier (e.g., "claude-sonnet-4")
- `goal_category` - Goal category (e.g., "system_prompt_leak", "credential_extraction")
- `lookback_days` - Historical window to analyze (default: 90)

**Returns:**
```json
{
"recommendations": [
{
"attack": "tap",
"asr": 0.73,
"avg_score": 8.2,
"confidence": 0.89,
"sample_size": 156,
"reasoning": "TAP shows 73% ASR vs 45% for Crescendo on this target/goal combination"
}
],
"target_profile": {
"vulnerability_level": "high",
"common_weaknesses": ["multi_turn_degradation", "tool_manipulation"],
"resistant_to": ["direct_prompting", "simple_obfuscation"]
}
}
```

### `suggest_optimal_transforms`
**Purpose:** Recommend transform combinations based on historical effectiveness

**Parameters:**
- `target_patterns` - Response characteristics of target
- `attack_type` - Attack being used
- `goal_category` - Attack objective category

**Returns:**
```json
{
"transform_rankings": [
{
"transform": "base64",
"effectiveness_boost": 0.12,
"asr_with": 0.68,
"asr_without": 0.56,
"confidence": 0.85,
"reasoning": "Base64 encoding shows 12% ASR improvement for this target pattern"
}
],
"optimal_sequence": ["base64", "authority", "role_play"],
"avoid": ["caesar", "leetspeak"],
"explanation": "This target shows strong resistance to simple ciphers but weak against encoding + persuasion"
}
```

### `predict_attack_success`
**Purpose:** Estimate probability of success before running expensive attacks

**Parameters:**
- `attack_type` - Attack to predict
- `target_fingerprint` - Target characteristics
- `transforms` - Planned transforms
- `goal_category` - Attack objective

**Returns:**
```json
{
"success_probability": 0.78,
"estimated_trials": 45,
"estimated_duration": "8-12 minutes",
"confidence": 0.82,
"similar_targets": 23,
"risk_factors": ["strong_refusal_training", "output_filtering"],
"success_factors": ["multi_turn_weakness", "tool_access"]
}
```

### `identify_vulnerability_patterns`
**Purpose:** Analyze target responses to identify exploitable patterns

**Parameters:**
- `target_responses` - Sample responses from target
- `response_metadata` - Timing, length, format characteristics

**Returns:**
```json
{
"vulnerability_fingerprint": {
"primary_weakness": "multi_turn_degradation",
"confidence": 0.91,
"indicators": [
"Refusal strength decreases after turn 3",
"Responds to authority figures in conversation",
"Shows tool selection confusion under pressure"
]
},
"recommended_attacks": ["crescendo", "tool_restriction_bypass"],
"predicted_success_rate": 0.74,
"similar_vulnerability_count": 18
}
```

### `get_historical_metrics`
**Purpose:** Provide context and trending data for strategic planning

**Parameters:**
- `metric_type` - "asr_trends", "transform_effectiveness", "target_coverage"
- `time_range` - Analysis window
- `filters` - Target model, attack type, goal category filters

**Returns:**
```json
{
"trends": {
"overall_asr": 0.67,
"trend_direction": "improving",
"monthly_change": 0.03
},
"top_performing": {
"attacks": [{"name": "tap", "asr": 0.73}],
"transforms": [{"name": "base64", "boost": 0.15}],
"combinations": [{"attack": "tap", "transform": "base64", "asr": 0.81}]
},
"coverage_gaps": [
"Limited data for agentic_memory_poisoning goals",
"Few assessments against gemini models"
]
}
```

## Implementation Strategy

### Phase 1: Basic Analytics Integration
- Connect to existing `/analytics` and `/traces/attacks` endpoints
- Implement attack effectiveness analysis
- Basic transform recommendation

### Phase 2: Advanced Pattern Recognition
- Trial-level analysis using `/traces/trials` with filtering
- Response pattern classification
- Vulnerability fingerprinting

### Phase 3: Predictive Intelligence
- Success probability modeling
- Cross-target pattern recognition
- Strategic attack sequencing

## Security and Privacy

- **Data Minimization** - Only analyze aggregated metrics, not raw conversation content
- **Tenant Isolation** - Analysis scoped to organization/workspace data only
- **Retention Policy** - Respect platform data retention settings
- **Anonymization** - Strip PII from pattern analysis

## Usage Examples

### Strategic Attack Planning
```
Operator: "I need to test claude-sonnet for system prompt leakage"

Trace Advisor: "Based on 47 previous assessments against claude-sonnet models:
- TAP has 68% ASR for system_prompt_leak goals
- Crescendo has 52% ASR but finds different vulnerability classes
- Recommend: Start with TAP + base64 transform (78% historical success)
- Predicted: 15-25 trials, 6-8 minutes to first jailbreak"
```

### Transform Optimization
```
Operator: "TAP isn't working well, suggest better transforms"

Trace Advisor: "Current ASR with your transforms: 23%
Historical analysis shows:
- Authority + role_play combination: 71% ASR on similar targets
- Your target pattern matches 'authority-responsive' cluster
- Switch recommendation: Replace leetspeak with authority persuasion"
```

### Vulnerability Assessment
```
Operator: "This target seems different, what's the best approach?"

Trace Advisor: "Response analysis indicates:
- Strong single-turn refusal (98% refusal rate)
- Degrades significantly in conversation (turn 4+: 34% refusal)
- Similar to Pattern-C targets (tool-enabled models with conversation memory)
- Recommendation: Crescendo attack, expect 65-80% ASR after turn 5"
```

## Integration with Existing Skills

- **Analytics Interpretation** - Provides raw data that this skill converts to recommendations
- **Attack Selection Guide** - Enhanced with historical evidence rather than theoretical guidance
- **Error Troubleshooting** - Identifies why attacks fail based on historical patterns

This skill provides data-driven attack recommendations based on historical trace analysis.
Loading