APPENG-4467 - RPM analyzer milestone one #222

Open
RedTanny wants to merge 95 commits into RHEcosystemAppEng:main from RedTanny:APPENG-4467-Rpm-Checker

RedTanny (Collaborator) commented Apr 15, 2026

Summary

This PR implements the RPM Vulnerability Checker (Milestone 1) - a standalone pipeline branch that provides focused, two-level vulnerability investigation for RPM packages. Unlike the full E2E executor path, this checker answers three specific questions for a target package: Does the CVE apply? Where is the vulnerable code? Is a fix/mitigation in place?

JIRA: APPENG-4467


Architecture

The checker integrates as a conditional branch after add_start_time, selected via pipeline_mode: PACKAGE_CHECKER on the request JSON. It runs independently from the full pipeline while sharing fetch_intel, add_completed_time, and output_results.

START -> add_start_time
              |
    [conditional: pipeline_mode]
              |
    ┌─────────┴────────────────────┐
    |                              |
    v                              v
  [full_pipeline]           [package_checker]
  generate_vdbs            checker_init_state
  (UNCHANGED)                     |
    |                              v
    ...                    checker_fetch_intel
                                  |
                                  v
                           source_acquisition
                                  |
                                  v
                           checker_segmentation
                                  |
                                  v
                           l1_investigation -> route_after_l1
                             [vulnerable/uncertain -> l2_build_agent]
                             [protected -> generate_report]
                                  |
                                  v
                           add_completed_time -> output_results -> END
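
The conditional branch in the diagram can be sketched as a plain routing function. This is a minimal illustration, not the PR's code: it assumes the request arrives as a dict shaped like the example request in this description, and `route_by_pipeline_mode` is a hypothetical name. Only the node names come from the diagram.

```python
# Node names taken from the diagram; everything else is an assumption.
FULL_PIPELINE_ENTRY = "generate_vdbs"
CHECKER_ENTRY = "checker_init_state"


def route_by_pipeline_mode(request: dict) -> str:
    """Return the first node to run after add_start_time.

    Hypothetical helper: reads pipeline_mode from request["image"] and
    falls back to FULL_PIPELINE, which the PR names as the default.
    """
    mode = request.get("image", {}).get("pipeline_mode", "FULL_PIPELINE")
    if mode == "PACKAGE_CHECKER":
        return CHECKER_ENTRY
    return FULL_PIPELINE_ENTRY
```

In a graph framework this function would typically be registered as the condition on the edge leaving add_start_time, so the full pipeline stays untouched when the field is absent.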

Two-Level Investigation

Level 1: Package Code Agent (Always runs)

Operates on extracted SRPM source (.spec, .patch, changelogs, source code).

| Stage | Purpose |
| --- | --- |
| Target Package Analysis | Deterministic: find CVE-named .patch, parse spec PatchN:, extract %changelog, check build log for patch application |
| Reference Intel Gathering | Fetch fixed SRPM via BrewDownloader, detect rebase fixes, retrieve OSV/GitHub patches when Brew unavailable |
| ReAct Agent Loop | LLM-guided code search using Source Grep, Code Keyword Search to verify vulnerable/fixed patterns |
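
As a rough illustration of the deterministic stage, the sketch below parses PatchN: lines from a spec file and keeps the ones whose file name mentions the CVE. The function name, the regex, and the sample spec text (including the patch file names) are made up for this example and are not taken from the PR.

```python
import re

# Matches spec lines such as "Patch12: some-fix.patch" (the number is optional).
_PATCH_LINE_RE = re.compile(r"^Patch(\d*):\s*(\S+)", re.IGNORECASE | re.MULTILINE)


def find_cve_patches(spec_text: str, cve_id: str) -> list[tuple[int, str]]:
    """Return (patch_number, file_name) pairs whose file name mentions cve_id."""
    hits = []
    for number, file_name in _PATCH_LINE_RE.findall(spec_text):
        if cve_id.lower() in file_name.lower():
            hits.append((int(number or 0), file_name))
    return hits


# Hypothetical spec fragment, for illustration only.
spec = """\
Name: openssl
Patch1: openssl-1.1.1-build.patch
Patch52: openssl-1.1.1-CVE-2023-0464.patch
"""
print(find_cve_patches(spec, "CVE-2023-0464"))
```

A hit here only shows that a CVE-named patch is declared; the stage described above would still need to confirm in the build log that the patch was applied.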

Verdicts: code_not_present, protected_by_mitigating_control, vulnerable, uncertain
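
The routing after L1 can be sketched from the diagram's edges. This is an assumption-laden sketch: it treats code_not_present and protected_by_mitigating_control as the diagram's "protected" branch, and it follows the diagram in sending uncertain to L2 as well.

```python
# The four L1 verdicts listed in this description.
L1_VERDICTS = (
    "code_not_present",
    "protected_by_mitigating_control",
    "vulnerable",
    "uncertain",
)


def route_after_l1(verdict: str) -> str:
    """Map an L1 verdict to the next node, following the diagram's edges."""
    if verdict not in L1_VERDICTS:
        raise ValueError(f"unexpected L1 verdict: {verdict!r}")
    if verdict in ("vulnerable", "uncertain"):
        return "l2_build_agent"
    # Assumed grouping: both remaining verdicts go straight to the report.
    return "generate_report"
```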

Level 2: Build Agent (Optional, runs when L1 = vulnerable)

Two sequential phases:

  1. BuildCompilationCheck (Phase 1): Is the vulnerable code compiled into the binary? Uses .spec %build/%configure, Makefile, CMakeLists.txt, #ifdef guards, build log. May override L1 verdict to NOT_VULNERABLE if code is provably not compiled.

  2. HardeningCheck (Phase 2): Do compiler/linker flags mitigate the CVE? Parses CFLAGS/LDFLAGS from build log, evaluates relevance to CVE mechanism (CWE). May refine to VULNERABLE_MITIGATED.
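
To make Phase 2 more concrete, here is a minimal sketch of spotting hardening options in one compiler command line from a build log. It covers only flag detection, not the CWE relevance step, and the flag table is a small illustrative subset chosen for this example rather than the PR's knowledge base.

```python
import shlex

# A few well-known GCC/linker hardening options, listed for illustration only.
KNOWN_HARDENING_FLAGS = {
    "-fstack-protector-strong": "stack canaries on functions with local buffers",
    "-D_FORTIFY_SOURCE=2": "checked variants of common libc calls",
    "-fPIE": "position-independent executable, needed for full ASLR",
    "-Wl,-z,relro": "relocation sections made read-only after startup",
    "-Wl,-z,now": "symbols bound at load time (full RELRO together with relro)",
}


def hardening_flags_in(command_line: str) -> dict[str, str]:
    """Return the known hardening flags that appear in a compiler command line."""
    tokens = shlex.split(command_line)
    return {flag: why for flag, why in KNOWN_HARDENING_FLAGS.items() if flag in tokens}
```

Whether a detected flag actually mitigates a given CVE depends on the vulnerability class; a stack protector, for example, says nothing about a logic bug.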


Key Features

  • Mandatory target_package input: User specifies the package to investigate (name, version, release, arch)
  • Brew profile support: Internal (Red Hat VPN) and External (Fedora public Koji) profiles via rpm_user_type
  • Multi-architecture build logs: Stored per-arch at logs/{arch}/build.log
  • VulnerabilityIntel extraction: Structured, grep-ready patterns from CVE descriptions and patches
  • OSV/GitHub patch retrieval: Fallback when Brew patches unavailable
  • Kernel package support: Kconfig-based prompts, hardening phase skipped (RHEL kernels assumed hardened)
  • Spec-only fallback: L2 runs without build log using spec and build-system file analysis

New Files

| File | Purpose |
| --- | --- |
| src/vuln_analysis/functions/cve_package_code_agent.py | L1 agent graph and investigation |
| src/vuln_analysis/functions/code_agent_graph_defs.py | L1 state schemas, search pipelines, report generation |
| src/vuln_analysis/functions/cve_build_agent.py | L2 Build Agent graph |
| src/vuln_analysis/functions/build_agent_graph_defs.py | L2 state, BuildHarvestReport, harvest_build_data() |
| src/vuln_analysis/functions/cve_checker_report.py | Final markdown report (L1 + L2 synthesis) |
| src/vuln_analysis/utils/rpm_checker_prompts.py | All L1/L2 prompt templates |
| src/vuln_analysis/utils/osv_patch_retriever.py | OSV/GitHub patch retrieval |
| src/vuln_analysis/utils/vulnerability_intel_sanitizer.py | Post-extraction intel cleanup |
| src/vuln_analysis/tools/source_inspector.py | SourceInspector (multi-pattern grep) |
| src/vuln_analysis/tools/source_grep.py | Source Grep LangGraph tool |
| src/vuln_analysis/tools/brew_downloader.py | BrewDownloader (Koji/Brew SRPM download) |
| src/vuln_analysis/configs/brew/internal-user-profile.yml | Red Hat Brew profile |
| src/vuln_analysis/configs/brew/external-user-profile.yml | Fedora Koji profile |

API Changes

  • New pipeline_mode field: FULL_PIPELINE (default) or PACKAGE_CHECKER
  • New target_package field: Required when pipeline_mode == PACKAGE_CHECKER
  • OpenAPI spec updated: See src/vuln_analysis/configs/openapi/openapi.json

Example request:

{
  "scan": {
    "vulns": [{ "vuln_id": "CVE-2024-12345" }]
  },
  "image": {
    "pipeline_mode": "PACKAGE_CHECKER",
    "target_package": {
      "name": "openssl",
      "version": "1.1.1k",
      "release": "8.el9_9",
      "arch": "x86_64"
    }
  }
}
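
The rule "target_package is required when pipeline_mode == PACKAGE_CHECKER" could be checked along these lines. This is a sketch using plain dicts; the function name is hypothetical, and treating all four package keys as mandatory is an assumption based on the Key Features list.

```python
# Assumed to be mandatory, based on the Key Features list.
REQUIRED_PACKAGE_KEYS = ("name", "version", "release", "arch")


def validate_checker_request(request: dict) -> list[str]:
    """Return a list of problems; an empty list means the request looks valid."""
    image = request.get("image", {})
    mode = image.get("pipeline_mode", "FULL_PIPELINE")
    if mode not in ("FULL_PIPELINE", "PACKAGE_CHECKER"):
        return [f"unknown pipeline_mode: {mode!r}"]
    if mode == "FULL_PIPELINE":
        return []
    package = image.get("target_package")
    if not isinstance(package, dict):
        return ["target_package is required when pipeline_mode is PACKAGE_CHECKER"]
    return [
        f"target_package.{key} is missing"
        for key in REQUIRED_PACKAGE_KEYS
        if not package.get(key)
    ]
```

Returning every problem at once, rather than raising on the first, lets an API layer report all missing fields in a single response.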

Testing

  • Unit tests for BrewDownloader, package identification, intel sanitizer
  • Integration test via /test vulnerability-analysis-on-pr
  • Smoke test for external profile: scripts/test_fedora_brew_download.py

Limitations (v1)

  • RPM-only (Python/Go/Java/npm ecosystems deferred)
  • Single-package focus (no transitive dependency analysis)
  • Binary checksec/readelf path not yet implemented (planned Phase C)
  • External profile limited to Fedora Koji (RHEL packages require VPN)

RedTanny force-pushed the APPENG-4467-Rpm-Checker branch from cde60ba to 50638ee on April 27, 2026 13:30
batzionb commented May 6, 2026

Can the API payload be simplified somehow?
Would this simpler structure work?

{
  "scan": {
    "vulns": [
      { "vuln_id": "CVE-2023-0464" }
    ]
  },
  "rpm": {
    "name": "openssl",
    "version": "1.1.1k",
    "release": "8.el9_9",
    "arch": "x86_64"
  }
}

That is:

  • without the additional ecosystem option
  • with a more intuitive name than "target_package"
  • without all the other fields being required, and without pipeline_mode
  • if rpm is present, use the RPM pipeline mode
  • not under the image field, which is confusing since this is not an image
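
One way to explore this proposal without touching the pipeline is a small adapter that maps the simplified form onto the structure the PR currently expects. The sketch below is hypothetical, including the function name; it only illustrates the "if rpm is present, use the RPM mode" rule.

```python
def to_current_payload(simplified: dict) -> dict:
    """Translate a top-level "rpm" object into image.pipeline_mode/target_package.

    Requests without an "rpm" key are returned unchanged, which keeps the
    full pipeline as the default. The input dict is not modified.
    """
    if "rpm" not in simplified:
        return simplified
    converted = {key: value for key, value in simplified.items() if key != "rpm"}
    image = dict(converted.get("image", {}))
    image["pipeline_mode"] = "PACKAGE_CHECKER"
    image["target_package"] = dict(simplified["rpm"])
    converted["image"] = image
    return converted
```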

batzionb commented May 6, 2026

In addition, can the API changes appear in the OpenAPI spec?

batzionb commented May 6, 2026

Follow-up on #222 (comment):
Using the API as is requires the client to define a dummy repo URL so that the request won't be rejected.
See here

zvigrinberg (Collaborator) left a comment

Hi @RedTanny
Very good job.
Please see my comments; several things still need to be done.

In addition, once the code-understanding sub-agent PR, which refactors the sub-agent skeleton, is merged, please rebase and adapt the two new sub-agents to the new template.



_PROFILE_PATHS: dict[BrewProfileType, Path] = {
    BrewProfileType.INTERNAL: _CONFIGS_DIR / "internal-user-profile.yml",
Collaborator

@RedTanny What about the External profile configuration? Do you have a template file that shows how to configure it? I just saw that it isn't implemented yet (it raises a not-implemented exception), so it would be better to add a comment about that, and it may be worthwhile to add some documentation that states this explicitly.

Comment on lines +37 to +40
    base_code_index_dir: str = Field(
        default=".cache/am_cache/code_index",
        description="Base directory for Tantivy code index storage.",
    )
Collaborator

@RedTanny Heads up about this one: Theo also touched this tool, so I anticipate conflicts here.

 for root, _, files in os.walk(code_path):
     for file in files:
-        if any(file.endswith(ext) for ext in include_extensions):
+        if any(file.endswith(ext) for ext in include_extensions) or file in no_extension:
Collaborator

@RedTanny What files with no extension are you adding here?
This could potentially add a lot of noise, or a lot of irrelevant files, to the documents.
Can you characterize the pattern of extensionless files that you want to add to the search?

Collaborator Author

@zvigrinberg

The original intent was to include build system files (Makefile, GNUmakefile, configure) in the full-text search index, since these files contain compilation flags and conditional logic that can reveal whether vulnerable code paths are actually built.

However, looking at this again - you raise a valid point. The agent typically uses the grep tool directly when searching for build-related patterns (like checking for -DFEATURE flags or Makefile targets), not the lexical search index. So this addition may be unnecessary noise.

I can remove this change, since the grep tool covers these cases.

def _is_binary_file_path(path: str) -> bool:
    """Check if file path has a binary file extension."""
    path_lower = path.lower()
    return any(path_lower.endswith(ext) for ext in _BINARY_FILE_EXTENSIONS)
Collaborator

@RedTanny Have you considered using the Linux file utility to determine the type of the file dynamically, based on its content? (A binary does not always have the expected extension, especially when it is extracted from payloads as base64 or from databases.) Of course there is a performance cost, but I think you could run that check only when the file has no extension or does not have a code file extension.

Collaborator Author

@zvigrinberg
Good point about the limitations of extension-based detection. In this specific context, we're parsing patches fetched from GitHub APIs, so we only have file paths (not actual file content) to work with. The unidiff library's is_binary_file check (line 106) handles the content-based detection from the patch format itself. The extension check is a secondary filter for paths that might slip through.


logger = LoggingFactory.get_agent_logger(__name__)

_RPM_NEVRA_RE = re.compile(r"^(.+?)-(\d+):(.+?)-(.+)$")
Collaborator

@RedTanny Please add a comment explaining what NEVRA means.

Collaborator Author

Comment thread Dockerfile
Comment thread src/vuln_analysis/functions/build_agent_graph_defs.py
Comment thread src/vuln_analysis/functions/code_agent_graph_defs.py
Comment on lines +61 to +69
_JUSTIFICATION_LABEL_TO_STATUS: dict[str, _StatusLiteral] = {
    "code_not_present": "FALSE",
    "code_not_reachable": "FALSE",
    "protected_by_mitigating_control": "FALSE",
    "protected_by_compiler": "FALSE",
    "requires_environment": "FALSE",
    "vulnerable": "TRUE",
    "uncertain": "UNKNOWN",
}
Collaborator

@RedTanny Aren't the justification label categories protected_at_runtime, protected_at_perimeter, requires_dependency and requires_configuration relevant here?

RedTanny (Collaborator, Author) commented May 17, 2026

@zvigrinberg The only one that may be relevant is requires_configuration, but it would be a duplicate, since that area is already covered by code_not_present.

code_not_present --> the code is not compiled
code_not_reachable --> the code is compiled, but configuration defaults make it unreachable
protected_by_mitigating_control --> the code is patched
protected_by_compiler --> compiler hardening flags protect against the exploit
requires_environment --> for example, the vulnerability affects only 32-bit systems, but the code is compiled for 64-bit
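
A lookup over this mapping might look like the sketch below. The table copies the labels from the snippet under review; the helper name and the choice to fall back to UNKNOWN for labels outside the table (such as protected_at_runtime) are assumptions, not the PR's behavior.

```python
from typing import Literal

StatusLiteral = Literal["TRUE", "FALSE", "UNKNOWN"]

# Same label-to-status pairs as the reviewed snippet.
JUSTIFICATION_LABEL_TO_STATUS: dict[str, StatusLiteral] = {
    "code_not_present": "FALSE",
    "code_not_reachable": "FALSE",
    "protected_by_mitigating_control": "FALSE",
    "protected_by_compiler": "FALSE",
    "requires_environment": "FALSE",
    "vulnerable": "TRUE",
    "uncertain": "UNKNOWN",
}


def status_for(label: str) -> StatusLiteral:
    """Return the status for a justification label (assumed default: UNKNOWN)."""
    return JUSTIFICATION_LABEL_TO_STATUS.get(label.strip().lower(), "UNKNOWN")
```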

@zvigrinberg
Collaborator

batzionb wrote: "In addition - can the API changes appear in the openapi spec"

Yes, good point @batzionb.
@RedTanny, when you're running the agent locally, can you please download the updated openapi.json schema from this endpoint:

http://localhost:26466/openapi.json

Then beautify it and add the updated file to the PR.
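
The download-and-beautify step could be scripted with the standard library alone, roughly as below. This is a sketch: the function names are hypothetical, and it assumes the agent is running locally and serving the endpoint quoted above.

```python
import json
from urllib.request import urlopen


def beautify_json(raw: str) -> str:
    """Re-serialize JSON text with two-space indentation and a trailing newline."""
    return json.dumps(json.loads(raw), indent=2, ensure_ascii=False) + "\n"


def download_openapi(url: str = "http://localhost:26466/openapi.json") -> str:
    """Fetch the schema from a locally running agent and return it beautified."""
    with urlopen(url, timeout=10) as response:
        return beautify_json(response.read().decode("utf-8"))
```

The returned text can then be written over src/vuln_analysis/configs/openapi/openapi.json, the path named in the PR description.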

RedTanny force-pushed the APPENG-4467-Rpm-Checker branch from dff2e5c to 4d0b194 on May 17, 2026 13:11
zvigrinberg (Collaborator) left a comment

@RedTanny Thank you for the work you've done.
In general, very good job.

Still, two general comments:

  1. The PR description should list all the details, features, architecture and implementation details, especially for such a huge PR; currently they are missing.
  2. Some tests for the new RPM agent logic are still missing, especially for code_agent_graph_defs.py and build_agent_graph_defs.py. Given the magnitude of this PR, I suggest adding them separately in a new PR once this one is done.

Moreover, please see the more specific comments below.

Comment thread src/exploit_iq_commons/data_models/input.py
Comment on lines +61 to +82
    def __new__(cls):
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self, json_path: str | Path | None = None) -> None:
        if not hasattr(self, '_initialized'):
            base_path = Path(__file__).resolve().parents[1]
            default_json = base_path / "data" / "hardening_kb" / "hardening_kb.json"
            self.json_path = Path(json_path) if json_path else default_json

            self._entries: list[HardeningEntry] = []
            self._cwe_index: dict[str, list[HardeningEntry]] = {}
            self._initialized = True
            self._load()

    @classmethod
    def get_instance(cls) -> "HardeningKB":
        """Get the singleton instance of HardeningKB."""
        return cls()
Collaborator

@RedTanny As the code is arranged now, the default JSON will always be used and the json_path argument is ignored, since it is currently never passed (always None).
Consider propagating json_path through that flow, or leave it as is and add another factory method, from_path, to get an instance.
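
One possible shape for the suggested from_path factory is sketched below. `KnowledgeBase` is a simplified stand-in for HardeningKB, and `DEFAULT_JSON` and the loading logic are hypothetical; the point is only to keep a shared default instance while allowing separate instances built from an explicit file.

```python
import json
import threading
from pathlib import Path


class KnowledgeBase:
    """Simplified stand-in for HardeningKB, for illustration only."""

    # Hypothetical default location; the PR derives its own from __file__.
    DEFAULT_JSON = Path("hardening_kb.json")
    _instance = None
    _lock = threading.Lock()

    def __init__(self, json_path: Path) -> None:
        self.json_path = Path(json_path)
        self.entries = json.loads(self.json_path.read_text(encoding="utf-8"))

    @classmethod
    def get_instance(cls) -> "KnowledgeBase":
        """Return the shared instance, loading DEFAULT_JSON on first use."""
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls(cls.DEFAULT_JSON)
        return cls._instance

    @classmethod
    def from_path(cls, json_path: Path) -> "KnowledgeBase":
        """Build a separate, non-shared instance from an explicit file."""
        return cls(json_path)
```

Keeping the singleton logic out of __new__ means the constructor arguments are never silently dropped, and tests can use from_path with a fixture file without touching the shared instance.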

Collaborator Author

Will fix.

Collaborator

@RedTanny Should the hardening knowledge base be configurable, or should it change only through a PR (if at all), after enough tests?

Comment thread src/exploit_iq_commons/utils/hardening_kb.py Outdated
Comment thread src/vuln_analysis/functions/code_agent_graph_defs.py Outdated
Comment thread src/exploit_iq_commons/data_models/checker_status.py Outdated
Comment on lines +70 to +76
result = subprocess.run(
    ['bsdtar', '-xf', str(file), '-C', str(output_path)],
    capture_output=True,
    text=True
)
if result.returncode != 0:
    logger.error(f"Failed to extract {file.name}: {result.stderr.strip()}")
Collaborator

@RedTanny Why switch from a Python tar library to a tar CLI utility in this function? (In addition, this needs to be documented for local development, so that developers install it when running locally.)

Collaborator Author

@zvigrinberg bsdtar is more reliable and supports more compression formats.

Collaborator

@RedTanny OK, then just add it to the local development / running locally section of the documentation.
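
Alongside the documentation note, the code could fail early with a clear message when bsdtar is not installed. The sketch below is one way to do that; `require_tool` and `extract_archive` are hypothetical names, and raising instead of logging is a design choice made for this example.

```python
import shutil
import subprocess
from pathlib import Path


def require_tool(name: str) -> str:
    """Return the full path of a CLI tool, or raise with an actionable message."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"{name} was not found on PATH; install it before running locally")
    return path


def extract_archive(file: Path, output_path: Path) -> None:
    """Extract an archive with bsdtar, mirroring the call in the reviewed snippet."""
    bsdtar = require_tool("bsdtar")
    output_path.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        [bsdtar, "-xf", str(file), "-C", str(output_path)],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"Failed to extract {file.name}: {result.stderr.strip()}")
```

Checking for the tool up front turns a confusing FileNotFoundError from subprocess into a message that points developers at the missing dependency.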

Comment thread src/exploit_iq_commons/utils/source_rpm_downloader.py Outdated
Comment thread Dockerfile
Comment thread src/vuln_analysis/configs/brew/external-user-profile.yml Outdated
Comment thread src/vuln_analysis/tools/source_inspector.py
@RedTanny RedTanny requested a review from zvigrinberg June 3, 2026 05:37
Comment thread .tekton/on-pull-request.yaml Outdated
Comment thread src/vuln_analysis/tools/brew_downloader.py Outdated
Comment thread src/vuln_analysis/tools/source_inspector.py
Comment thread src/vuln_analysis/tools/source_inspector.py