APPENG-4467- rpm analyzier mile stone one#222
Conversation
cde60ba to
50638ee
Compare
|
Can the API payload be simplified somehow? without the additional ecosystem option |
|
In addition - can the API changes appear in the openapi spec |
|
Follow up on #222 (comment) |
zvigrinberg
left a comment
There was a problem hiding this comment.
Hi @RedTanny
Very Good job.
Please see my comments, several things should be done.
In addition, once the code understanding sub-agent PR with the refactoring of sub-agent skeleton is merged, please rebase and adapt the 2 new sub-agents to this new template accordingly.
|
|
||
|
|
||
| _PROFILE_PATHS: dict[BrewProfileType, Path] = { | ||
| BrewProfileType.INTERNAL: _CONFIGS_DIR / "internal-user-profile.yml", |
There was a problem hiding this comment.
@RedTanny What about External profile configuration? Do you have a template file how to configure that? Just saw it wasn't implemented yet (raise of not implemented exception ), so better add comment about that, and maybe worthwhile adding some documentation about the process, that explicitly stating it.
| base_code_index_dir: str = Field( | ||
| default=".cache/am_cache/code_index", | ||
| description="Base directory for Tantivy code index storage.", | ||
| ) |
There was a problem hiding this comment.
@RedTanny Heads up about this one, Theo also touched this tool, so i anticipate conflicts here
| for root, _, files in os.walk(code_path): | ||
| for file in files: | ||
| if any(file.endswith(ext) for ext in include_extensions): | ||
| if any(file.endswith(ext) for ext in include_extensions) or file in no_extension: |
There was a problem hiding this comment.
@RedTanny What files with no extension are you adding here?
Potentially Maybe it could add a lot of noise or a lot of irrelevant files to the documents...
Can you characterize the pattern of files that are with no extensions that you willing to add to the search??
There was a problem hiding this comment.
The original intent was to include build system files (Makefile, GNUmakefile, configure) in the full-text search index, since these files contain compilation flags and conditional logic that can reveal whether vulnerable code paths are actually built.
However, looking at this again - you raise a valid point. The agent typically uses the grep tool directly when searching for build-related patterns (like checking for -DFEATURE flags or Makefile targets), not the lexical search index. So this addition may be unnecessary noise.
I can remove this change since the grep tool cover these cases
| def _is_binary_file_path(path: str) -> bool: | ||
| """Check if file path has a binary file extension.""" | ||
| path_lower = path.lower() | ||
| return any(path_lower.endswith(ext) for ext in _BINARY_FILE_EXTENSIONS) |
There was a problem hiding this comment.
@RedTanny Have you considered using the linux file utilitiy to dynamically determine that type of the file based on content? ( not all the time a binary is with the expected ext, especially when extracted from payloads as base64 or from databases...) , off course there is the performance issue here, but i think you can check that if the file is without extension or not a code extension file also.
There was a problem hiding this comment.
@zvigrinberg
Good point about the limitations of extension-based detection. In this specific context, we're parsing patches fetched from GitHub APIs, so we only have file paths (not actual file content) to work with. The unidiff library's is_binary_file check (line 106) handles the content-based detection from the patch format itself. The extension check is a secondary filter for paths that might slip through.
|
|
||
| logger = LoggingFactory.get_agent_logger(__name__) | ||
|
|
||
| _RPM_NEVRA_RE = re.compile(r"^(.+?)-(\d+):(.+?)-(.+)$") |
There was a problem hiding this comment.
@RedTanny Please add comment for what does NEVRA means
| _JUSTIFICATION_LABEL_TO_STATUS: dict[str, _StatusLiteral] = { | ||
| "code_not_present": "FALSE", | ||
| "code_not_reachable": "FALSE", | ||
| "protected_by_mitigating_control": "FALSE", | ||
| "protected_by_compiler": "FALSE", | ||
| "requires_environment": "FALSE", | ||
| "vulnerable": "TRUE", | ||
| "uncertain": "UNKNOWN", | ||
| } |
There was a problem hiding this comment.
@RedTanny Doesn't justification labels categories - protected_at_runtime and protected_at_perimeter , requires_dependency and requires_configuration relevant here?
There was a problem hiding this comment.
@zvigrinberg the only thing maybe relevant is requires_configuration
but it is duplicate is the area is covered by code_not_present
code_not_present -- > code not compile
code_not_reachable --> code compile but because configuration defaults it is not reachable
protected_by_mitigating_control -->code is patch
protected_by_compiler --> compiler hardening flags protect from the exploit
requires_environment --> a case where vulnerability is only for 32bit system but code is compile to 64bit
Yes, good point @batzionb. http://localhost:26466/openapi.json Just beautify it, and put it updated in the PR. |
dff2e5c to
4d0b194
Compare
zvigrinberg
left a comment
There was a problem hiding this comment.
@RedTanny Thank you for the work you've done.
In general , very good job.
Still, two general comments:
- The PR description should list all the details , features, architecture and implementation details, especially for such a huge PR, currently it's missing.
- Still missing some tests for the new RPM agent logic - especially for
code_agent_graph_defs.py,build_agent_graph_defs.py, i suggest adding them separately in a new PR after that one will be done, due to the magnitude of this PR.
Moreover, please see more specific comments below.
| def __new__(cls): | ||
| if cls._instance is None: | ||
| with cls._lock: | ||
| if cls._instance is None: | ||
| cls._instance = super().__new__(cls) | ||
| return cls._instance | ||
|
|
||
| def __init__(self, json_path: str | Path | None = None) -> None: | ||
| if not hasattr(self, '_initialized'): | ||
| base_path = Path(__file__).resolve().parents[1] | ||
| default_json = base_path / "data" / "hardening_kb" / "hardening_kb.json" | ||
| self.json_path = Path(json_path) if json_path else default_json | ||
|
|
||
| self._entries: list[HardeningEntry] = [] | ||
| self._cwe_index: dict[str, list[HardeningEntry]] = {} | ||
| self._initialized = True | ||
| self._load() | ||
|
|
||
| @classmethod | ||
| def get_instance(cls) -> "HardeningKB": | ||
| """Get the singleton instance of HardeningKB.""" | ||
| return cls() |
There was a problem hiding this comment.
@RedTanny Using this code disposition, the default json will be used always, ignoring the json_path argument, which is currently not passed ( always None).
Consider propagate the json_path in that flow, or leave it as is and create another factory method to get instance from_path
There was a problem hiding this comment.
@RedTanny Is the Hardening knowledge base should be configurable or it must be changed in a PR ( if at all) only after enough tests?
| result = subprocess.run( | ||
| ['bsdtar', '-xf', str(file), '-C', str(output_path)], | ||
| capture_output=True, | ||
| text=True | ||
| ) | ||
| if result.returncode != 0: | ||
| logger.error(f"Failed to extract {file.name}: {result.stderr.strip()}") |
There was a problem hiding this comment.
@RedTanny Why switching from a python library of tar to a tar cli utility in the function? ( in addition, need to document it as well for local development, that the developer would install it if running locally).
There was a problem hiding this comment.
@zvigrinberg bsdtar more reliable and support more formats of compression
There was a problem hiding this comment.
@RedTanny OK Then just add it to local development/running locally section in the documentation.
Summary
This PR implements the RPM Vulnerability Checker (Milestone 1) - a standalone pipeline branch that provides focused, two-level vulnerability investigation for RPM packages. Unlike the full E2E executor path, this checker answers three specific questions for a target package: Does the CVE apply? Where is the vulnerable code? Is a fix/mitigation in place?
JIRA: APPENG-4467
Architecture
The checker integrates as a conditional branch after
add_start_time, selected viapipeline_mode: PACKAGE_CHECKERon the request JSON. It runs independently from the full pipeline while sharingfetch_intel,add_completed_time, andoutput_results.Two-Level Investigation
Level 1: Package Code Agent (Always runs)
Operates on extracted SRPM source (
.spec,.patch, changelogs, source code)..patch, parse specPatchN:, extract%changelog, check build log for patch applicationVerdicts:
code_not_present,protected_by_mitigating_control,vulnerable,uncertainLevel 2: Build Agent (Optional, runs when L1 = vulnerable)
Two sequential phases:
BuildCompilationCheck (Phase 1): Is the vulnerable code compiled into the binary? Uses
.spec%build/%configure, Makefile, CMakeLists.txt,#ifdefguards, build log. May override L1 verdict toNOT_VULNERABLEif code is provably not compiled.HardeningCheck (Phase 2): Do compiler/linker flags mitigate the CVE? Parses CFLAGS/LDFLAGS from build log, evaluates relevance to CVE mechanism (CWE). May refine to
VULNERABLE_MITIGATED.Key Features
target_packageinput: User specifies the package to investigate (name, version, release, arch)rpm_user_typelogs/{arch}/build.logNew Files
src/vuln_analysis/functions/cve_package_code_agent.pysrc/vuln_analysis/functions/code_agent_graph_defs.pysrc/vuln_analysis/functions/cve_build_agent.pysrc/vuln_analysis/functions/build_agent_graph_defs.pyBuildHarvestReport,harvest_build_data()src/vuln_analysis/functions/cve_checker_report.pysrc/vuln_analysis/utils/rpm_checker_prompts.pysrc/vuln_analysis/utils/osv_patch_retriever.pysrc/vuln_analysis/utils/vulnerability_intel_sanitizer.pysrc/vuln_analysis/tools/source_inspector.pySourceInspector(multi-pattern grep)src/vuln_analysis/tools/source_grep.pysrc/vuln_analysis/tools/brew_downloader.pyBrewDownloader(Koji/Brew SRPM download)src/vuln_analysis/configs/brew/internal-user-profile.ymlsrc/vuln_analysis/configs/brew/external-user-profile.ymlAPI Changes
pipeline_modefield:FULL_PIPELINE(default) orPACKAGE_CHECKERtarget_packagefield: Required whenpipeline_mode == PACKAGE_CHECKERsrc/vuln_analysis/configs/openapi/openapi.jsonExample request:
{ "scan": { "vulns": [{ "vuln_id": "CVE-2024-12345" }] }, "image": { "pipeline_mode": "PACKAGE_CHECKER", "target_package": { "name": "openssl", "version": "1.1.1k", "release": "8.el9_9", "arch": "x86_64" } } }Testing
BrewDownloader, package identification, intel sanitizer/test vulnerability-analysis-on-prscripts/test_fedora_brew_download.pyLimitations (v1)
checksec/readelfpath not yet implemented (planned Phase C)