244 changes: 244 additions & 0 deletions .claude/skills/ci-visibility/SKILL.md
@@ -0,0 +1,244 @@
---
name: ci-visibility
description: Query and analyze GitLab CI pipelines using Datadog CI Visibility MCP tools. Use when users ask about CI status, failed jobs, pipeline issues, deployment jobs, or want to fix CI failures. Also use when searching for specific GitLab job names or understanding pipeline structure.
---

# CI Visibility Skill

This skill enables querying and analyzing GitLab CI pipelines for the datadog-agent repository using the Datadog MCP server's CI Visibility tools.

## Available Tools

### Primary Tools

| Tool | Purpose |
|------|---------|
| `search_datadog_ci_pipeline_events` | Search for pipelines, stages, or jobs |
| `aggregate_datadog_ci_pipeline_events` | Compute statistics (counts, averages, percentiles) |
| `get_datadog_flaky_tests` | Find flaky tests causing CI failures |

### Key Parameters

**ci_level** - Granularity of search:
- `pipeline` - Entire pipeline execution
- `stage` - Pipeline stage (e.g., "build", "test", "deploy")
- `job` - Individual job within a stage
- `step` - Individual step within a job

**Common query filters:**
- `@ci.pipeline.name:"DataDog/datadog-agent"` - Filter to this repository
- `@git.branch:<branch-name>` - Filter by branch
- `@ci.pipeline.id:<id>` - Filter by specific pipeline
- `@ci.status:error` - Only failed items
- `@ci.status:success` - Only successful items
- `@ci.job.name:*keyword*` - Jobs matching a pattern
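
The filters above combine into one space-separated query string, as the workflows below show. A minimal sketch of a builder, assuming a hypothetical helper `build_ci_query` (not part of the MCP server) that only assembles the string:

```python
def build_ci_query(pipeline_name=None, branch=None, pipeline_id=None,
                   status=None, job_pattern=None):
    """Join the provided filters into one space-separated query string.

    Hypothetical helper: it only formats text, it does not call any tool.
    """
    parts = []
    if pipeline_name:
        # Pipeline names contain a slash, so they are quoted as in the examples.
        parts.append(f'@ci.pipeline.name:"{pipeline_name}"')
    if branch:
        parts.append(f"@git.branch:{branch}")
    if pipeline_id:
        parts.append(f"@ci.pipeline.id:{pipeline_id}")
    if status:
        parts.append(f"@ci.status:{status}")
    if job_pattern:
        parts.append(f"@ci.job.name:{job_pattern}")
    return " ".join(parts)


query = build_ci_query(pipeline_name="DataDog/datadog-agent",
                       branch="my-branch", status="error")
print(query)
```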

## Common Workflows

### 1. Find Pipelines for Current Branch

```
Tool: search_datadog_ci_pipeline_events
ci_level: pipeline
query: @ci.pipeline.name:"DataDog/datadog-agent" @git.branch:<branch-name>
from: now-7d
```

### 2. Find All Failed Jobs in a Pipeline

```
Tool: search_datadog_ci_pipeline_events
ci_level: job
query: @ci.pipeline.id:<pipeline-id> @ci.status:error
page_limit: 50
```

### 3. Help User Fix CI Failures

When users ask "help me fix CI" or "why is CI failing":

1. Get current branch: `git branch --show-current`
2. Get current commit: `git rev-parse --short HEAD`
3. Search for failed jobs:
```
ci_level: job
query: @git.branch:<branch> @ci.status:error
from: now-24h
```
4. For each failed job, examine:
- `@error.message` - The error summary
- `@error.domain` - Error category (code, platform, setup)
- `@error.subdomain` - More specific category (test, build, script)
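
Steps 1 to 3 can be sketched in Python. This is an illustration only: `current_branch` and `failed_jobs_request` are hypothetical names, and the returned dict simply mirrors the parameters shown above rather than a confirmed tool-call format.

```python
import subprocess


def current_branch():
    """Return the checked-out branch name (step 1)."""
    result = subprocess.run(
        ["git", "branch", "--show-current"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def failed_jobs_request(branch, window="now-24h"):
    """Build the search parameters for failed jobs on a branch (step 3)."""
    return {
        "ci_level": "job",
        "query": f"@git.branch:{branch} @ci.status:error",
        "from": window,
    }


# Usage inside a git checkout: failed_jobs_request(current_branch())
print(failed_jobs_request("my-branch"))
```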

### 4. Find Deployment Jobs

Search for jobs with "deploy" or "staging" in the name:

```
ci_level: job
query: @ci.pipeline.name:"DataDog/datadog-agent" @ci.job.name:*deploy* @git.branch:<branch>
```

Or for staging specifically:
```
query: @ci.pipeline.name:"DataDog/datadog-agent" @ci.job.name:*staging*
```

### 5. Get Job Details with Error Messages

When searching for failed jobs, the response includes:
- `job_name` - Full job name
- `job_id` - GitLab job ID (for constructing URLs)
- `@error.message` - Error summary
- `@error.domain` / `@error.subdomain` - Error classification
- `duration_seconds` - How long the job ran
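
With many failed jobs, grouping them by error domain helps separate code problems from platform problems. A minimal sketch, assuming each job arrives as a flat dict keyed by the field names listed above (the actual response shape may differ):

```python
from collections import defaultdict


def group_by_error_domain(jobs):
    """Map each error domain to the names of the failed jobs in it."""
    grouped = defaultdict(list)
    for job in jobs:
        # Jobs without a classification fall under "unknown".
        domain = job.get("@error.domain", "unknown")
        grouped[domain].append(job["job_name"])
    return dict(grouped)


# Made-up sample data for illustration only.
sample = [
    {"job_name": "tests_deb-x64", "@error.domain": "code"},
    {"job_name": "docker_build", "@error.domain": "platform"},
    {"job_name": "lint", "@error.domain": "code"},
    {"job_name": "mystery_job"},
]
print(group_by_error_domain(sample))
```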

### 6. Construct GitLab URLs

From pipeline and job IDs, construct URLs:
- **Pipeline:** `https://gitlab.ddbuild.io/DataDog/datadog-agent/-/pipelines/<pipeline_id>`
- **Job:** `https://gitlab.ddbuild.io/DataDog/datadog-agent/-/jobs/<job_id>`
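
The two URL patterns can be wrapped in small helpers. A sketch with hypothetical function names; the base URL is the one given above:

```python
GITLAB_BASE = "https://gitlab.ddbuild.io/DataDog/datadog-agent/-"


def pipeline_url(pipeline_id):
    """URL of a pipeline in the datadog-agent GitLab project."""
    return f"{GITLAB_BASE}/pipelines/{pipeline_id}"


def job_url(job_id):
    """URL of a single job in the datadog-agent GitLab project."""
    return f"{GITLAB_BASE}/jobs/{job_id}"


print(pipeline_url(12345))
print(job_url(67890))
```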

## Staging Deployment Guide

**Authoritative documentation:** https://datadoghq.atlassian.net/wiki/spaces/agent/pages/3457679550/How+to+deploy+custom+images+on+staging

### CRITICAL: Two Different Staging Targets

There are TWO different "staging" destinations that are commonly confused:

| Target | Registry | Job | Use Case |
|--------|----------|-----|----------|
| **Public Dev** | `docker.io/datadog/agent-dev` | `dev_branch_multiarch-a7` | Quick testing, external sharing |
| **Internal Staging** | `registry.ddbuild.io/images/datadog-agent` (ECR 727006795293) | `publish_internal_container_image-full` | Internal compute infrastructure, CNAB deployments |

### Public Dev Images (`dev_container_deploy` stage)

**Job:** `dev_branch_multiarch-a7`
**Output:** `docker.io/datadog/agent-dev:<branch-name>-py3`
**Trigger:** Manual, available in all pipelines
**Child pipeline:** Triggers `DataDog/public-images`

This is for quick dev testing but does NOT deploy to internal staging infrastructure.

### Internal Staging Images (`internal_image_deploy` stage)

**Job:** `publish_internal_container_image-full`

P1 Badge Reference an existing internal image publish job

The guide instructs users to trigger publish_internal_container_image-full, but that job is not defined in our pipeline config; .gitlab/deploy/internal_image_deploy/internal_image_deploy.yml defines publish_internal_container_image-jmx, publish_internal_container_image-fips, and related variants (lines 73-121). Following this instruction will send users looking for a non-existent manual job and block the intended staging publish flow.


**Output:** `registry.ddbuild.io/images/datadog-agent:<branch-name>-full`
**Trigger:** Manual, but **ONLY available in deploy pipelines**

P2 Badge Remove deploy-only claim for internal image jobs

The statement that internal image jobs are “ONLY available in deploy pipelines” is inaccurate for current CI rules: internal image jobs use .on_deploy_internal_or_internal_image_change_or_manual (.gitlab/deploy/internal_image_deploy/internal_image_deploy.yml line 8), and that rule set includes a general manual fallback (.gitlab-ci.yml lines 583-596). This can misdiagnose missing-job situations and send users down the wrong troubleshooting path.


**Child pipeline:** Triggers `DataDog/images`

This is required for deploying to internal compute infrastructure (biscuits cluster, etc.).

### How to Get Internal Staging Images

#### Option 1: Run a deploy pipeline (recommended)
```bash
inv pipeline.run --deploy --here
```
This creates a pipeline with the `internal_image_deploy` stage. Then manually trigger `publish_internal_container_image-full`.

#### Option 2: Manual DataDog/images pipeline
Go to: https://gitlab.ddbuild.io/DataDog/images/-/pipelines/new

Variables to set:
```
IMAGE_NAME = datadog-agent
IMAGE_VERSION = tmpl-v7
RELEASE_TAG = <your-branch-name>
BUILD_TAG = <your-branch-name>
TMPL_SRC_IMAGE = v<pipeline-id>-<commit-sha>-7-full
TMPL_SRC_REPO = ci/datadog-agent/agent
RELEASE_STAGING = true
RELEASE_PROD = false
```

P1 Badge Use the correct IMAGE_VERSION for datadog-agent deploys

For manual DataDog/images pipelines, this skill sets IMAGE_NAME=datadog-agent with IMAGE_VERSION=tmpl-v7, but the current internal deploy config pins datadog-agent to IMAGE_VERSION: tmpl-v23 (.gitlab/deploy/internal_image_deploy/internal_image_deploy.yml lines 51-54). Using tmpl-v7 here targets the wrong template/version and can cause the manual publish pipeline to fail or produce the wrong artifact set.

Get `TMPL_SRC_IMAGE` from the `docker_build_agent7_full` job output in your pipeline.
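
The variable set can be assembled from the branch, pipeline ID, and commit SHA. A minimal sketch, assuming a hypothetical helper `images_pipeline_variables`; `IMAGE_VERSION` is a required argument here because the right value should be read from the current pipeline config rather than hard-coded:

```python
def images_pipeline_variables(branch, pipeline_id, commit_sha, image_version):
    """Build the variables for a manual DataDog/images pipeline.

    TMPL_SRC_IMAGE follows the v<pipeline-id>-<commit-sha>-7-full pattern
    from this guide; verify it against the docker_build_agent7_full output.
    """
    return {
        "IMAGE_NAME": "datadog-agent",
        "IMAGE_VERSION": image_version,
        "RELEASE_TAG": branch,
        "BUILD_TAG": branch,
        "TMPL_SRC_IMAGE": f"v{pipeline_id}-{commit_sha}-7-full",
        "TMPL_SRC_REPO": "ci/datadog-agent/agent",
        "RELEASE_STAGING": "true",
        "RELEASE_PROD": "false",
    }


# Placeholder inputs for illustration only.
for name, value in images_pipeline_variables(
        "my-branch", 123456, "abc1234", "tmpl-v7").items():
    print(f"{name} = {value}")
```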

### Checking Which Stages Exist in a Pipeline

Search at the `stage` level to check whether `internal_image_deploy` is available:
```
Tool: search_datadog_ci_pipeline_events
ci_level: stage
query: @ci.pipeline.id:<pipeline-id>
```

If you only see `dev_container_deploy` but not `internal_image_deploy`, the pipeline wasn't triggered with `--deploy`.
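
The same check can be expressed over a list of stage names taken from the search results. A sketch with a hypothetical `diagnose_stages` helper; how stage names are extracted from the response is left out because the response shape is not documented here:

```python
def diagnose_stages(stage_names):
    """Summarize which deploy stages a pipeline contains."""
    stages = set(stage_names)
    if "internal_image_deploy" in stages:
        return "internal_image_deploy is available"
    if "dev_container_deploy" in stages:
        # Per this guide, the usual cause is a pipeline run without --deploy.
        return "only dev_container_deploy found"
    return "no deploy stages found"


print(diagnose_stages(["container_build", "dev_container_deploy"]))
```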

## GitLab CI Structure

### Key Stages (in order)

1. **container_build** - Build container images
2. **dev_container_deploy** - Deploy to public dev registry (always available)
3. **internal_image_deploy** - Deploy to internal staging ECR (**deploy pipelines only**)
4. **deploy_containers** - Production container deployment
5. **trigger_release** - Release triggers

### Related Repositories

| Repo | Purpose | Triggered By |
|------|---------|--------------|
| `DataDog/datadog-agent` | Main agent repo | Direct |
| `DataDog/public-images` | Public registry publishing | `dev_container_deploy` jobs |
| `DataDog/images` | Internal registry publishing | `internal_image_deploy` jobs |
| `DataDog/k8s-datadog-agent-ops` | CNAB deployment configs | Manual for staging deploys |

## Query Examples

### Find why tests are failing

```
ci_level: job
query: @ci.pipeline.name:"DataDog/datadog-agent" @git.branch:my-branch @ci.job.name:*test* @ci.status:error
from: now-24h
```

### Find longest running jobs

```
Tool: aggregate_datadog_ci_pipeline_events
aggregation: PC95
metric: @duration
ci_level: job
query: @ci.pipeline.name:"DataDog/datadog-agent"
group_by: ["@ci.job.name"]
```

### Check if a specific job exists

```
ci_level: job
query: @ci.pipeline.name:"DataDog/datadog-agent" @ci.job.name:"exact-job-name"
from: now-30d
page_limit: 5
```

## Error Classification

Datadog CI Visibility classifies errors into domains:

| Domain | Subdomain | Meaning |
|--------|-----------|---------|
| `code` | `test` | Test failure |
| `code` | `build` | Build/compilation error |
| `platform` | `setup` | Infrastructure setup issue |
| `platform` | `script` | Script execution error |
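
When summarizing failures, the table can be used as a lookup. A minimal sketch that covers only the four combinations listed above; anything else falls back to a generic label (the fallback text is an assumption, not a Datadog term):

```python
ERROR_MEANINGS = {
    ("code", "test"): "Test failure",
    ("code", "build"): "Build/compilation error",
    ("platform", "setup"): "Infrastructure setup issue",
    ("platform", "script"): "Script execution error",
}


def describe_error(domain, subdomain):
    """Translate an (@error.domain, @error.subdomain) pair into plain words."""
    return ERROR_MEANINGS.get((domain, subdomain), "Unclassified error")


print(describe_error("code", "test"))
print(describe_error("platform", "network"))
```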

## Tips

1. **Always specify ci_level** - Default is `pipeline`, but you usually want `job` for debugging
2. **Use wildcards carefully** - `*deploy*` matches "deploy", "undeploy", "deploy_staging", etc.
3. **Check time range** - Jobs older than 7 days may need `from: now-30d`
4. **Pipeline vs Job IDs** - Pipeline IDs identify the overall run; job IDs identify individual jobs within it
5. **Branch name format** - Use exact branch name with slashes: `sopell/my-feature`
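
Tip 2 is easy to check locally before running a search. The sketch below uses Python's `fnmatch` as a stand-in for wildcard matching; it approximates the behavior described above and is not Datadog's own matcher. The job names are made up.

```python
from fnmatch import fnmatchcase


def matching_jobs(pattern, job_names):
    """Return the job names that a shell-style wildcard pattern matches."""
    return [name for name in job_names if fnmatchcase(name, pattern)]


jobs = ["deploy", "undeploy", "deploy_staging", "build_agent"]
print(matching_jobs("*deploy*", jobs))   # broad: also catches "undeploy"
print(matching_jobs("deploy*", jobs))    # narrower: prefix match only
```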

## Integration with GitHub

For PRs, also check GitHub Actions status:
```bash
gh pr checks
```

This shows ALL checks (GitLab + GitHub Actions) with required/optional status from branch protection rules.