
Add .tar.gz / .tgz archive support for extension, preset, and workflow installation #2395

Open
Copilot wants to merge 16 commits into main from copilot/add-tar-gz-support

Conversation

Contributor

Copilot AI commented Apr 28, 2026

The extension, preset, and workflow download pipelines only accepted ZIP archives, blocking use of npm registries and CI artifact stores that serve tarballs natively.

Core utilities (extensions.py)

  • detect_archive_format(url, content_type="") — infers format from URL path extension (.zip, .tar.gz, .tgz) with Content-Type header fallback (application/gzip, application/x-gzip, application/x-tar+gzip)
  • safe_extract_tarball(archive_path, dest_dir, error_class) — safe extraction with:
    • Pre-extraction validation of all members: rejects absolute paths, .. traversal, symlinks, hard links, devices, and FIFOs
    • PAX headers (XHDTYPE, XGLTYPE, SOLARIS_XHDTYPE) and GNU metadata-only entries (GNUTYPE_LONGNAME, GNUTYPE_LONGLINK) are silently skipped — they carry no extractable payload and are emitted by many common archiving tools; GNUTYPE_SPARSE is intentionally not skipped because sparse entries carry a real file payload and isreg() returns True for them
    • Python 3.11: passes pre-validated safe_members list to extractall()
    • Python 3.12+: uses tarfile.data_filter for additional OS-level protection
    • tarfile.TarError/OSError are caught and re-raised as the caller-supplied error_class for consistent error handling

Both helpers are public (no underscore prefix) and imported directly by presets.py and __init__.py.
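
A minimal sketch of the detection logic described above, assuming the extension and Content-Type mapping from the PR description (the `application/zip` fallback and exact return values are illustrative, not copied from the diff):

```python
from urllib.parse import urlparse

# Content-Type values treated as gzip tarballs (per the PR description).
_TARBALL_CONTENT_TYPES = {
    "application/gzip",
    "application/x-gzip",
    "application/x-tar+gzip",
}


def detect_archive_format(url: str, content_type: str = "") -> str:
    """Return "zip", "tar.gz", or "" when the format cannot be determined."""
    path = urlparse(url).path.lower()
    if path.endswith(".zip"):
        return "zip"
    if path.endswith((".tar.gz", ".tgz")):
        return "tar.gz"
    # Fall back to the Content-Type header (parameters such as charset stripped).
    base = content_type.split(";")[0].strip().lower()
    if base == "application/zip":
        return "zip"
    if base in _TARBALL_CONTENT_TYPES:
        return "tar.gz"
    return ""
```

Returning an empty string for unknown formats lets callers reject them explicitly rather than silently defaulting to ZIP, matching the behavior described below.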

Extensions & presets

  • install_from_zip() on both managers now detects archive format from the file extension and dispatches to ZIP or tarball extraction accordingly — existing callers are unaffected
  • download_extension() / download_pack() detect format from the download URL (or Content-Type fallback) and persist the archive with the correct extension (.zip or .tar.gz); unknown formats are rejected with a clear error rather than silently defaulting to ZIP
  • Both functions capture response.geturl() as the canonical post-redirect URL, use it for Content-Type fallback format detection, and re-validate the final URL's scheme to guard against scheme-downgrade via redirects
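
The redirect-handling pattern in the last bullet can be sketched roughly as follows; the helper name `download_archive` and the exact error type are illustrative, not taken from the diff:

```python
import urllib.request
from urllib.parse import urlparse


def download_archive(url: str, timeout: int = 60) -> tuple[bytes, str]:
    """Download an archive, returning (data, final_url) after redirects.

    Re-validates the post-redirect scheme so an https:// URL cannot be
    silently downgraded to http:// by a redirecting server.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        final_url = response.geturl()  # canonical URL after any redirects
        data = response.read()
    if urlparse(final_url).scheme != "https":
        raise ValueError(f"Insecure redirect target: {final_url}")
    return data, final_url
```

The returned `final_url` is also the right input for the Content-Type fallback, since a redirect may change the effective filename and hence the detected format.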

__init__.py call sites

  • extension add --from: Detects format from URL/Content-Type before saving; unknown formats rejected; HTTPS re-checked on the post-redirect URL
  • preset add --from: Same
  • extension update: Inline manifest peek handles both ZIP and tar.gz; cache filename sanitized via Path(extension).name to prevent path traversal
  • workflow add (URL): Extracts workflow.yml from the archive when the URL points to one; temp-file paths initialized before write to avoid UnboundLocalError on disk-full
  • workflow add (local): Accepts local .tar.gz/.tgz/.zip archive files (case-insensitive detection)
  • workflow add (catalog): Same archive detection for catalog-sourced URLs

A shared _extract_workflow_yml(archive_path, fmt) helper handles root-level and single-nested-directory layouts for both formats; tarfile.extractfile() handles are properly closed via context managers.

Tests

30 new tests across test_extensions.py, test_presets.py, and test_workflows.py covering:

  • test_extensions.py: format detection (URL + Content-Type), flat and nested tarball install, missing manifest errors, path traversal rejection, symlink rejection
  • test_presets.py: flat and nested tarball install, missing manifest errors, path traversal rejection, symlink rejection
  • test_workflows.py (TestWorkflowAddArchive, 9 CLI-level tests): local ZIP (flat/nested), local .tar.gz (flat/nested), .tgz alias, missing workflow.yml error cases, URL-based archive download for both ZIP and tar.gz formats

Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:04
Copilot AI changed the title from "[WIP] Add support for .tar.gz and .tgz archives" to "Add .tar.gz / .tgz archive support for extension, preset, and workflow installation" Apr 28, 2026
Copilot AI requested a review from mnriem April 28, 2026 18:16
Comment thread src/specify_cli/__init__.py Fixed
@mnriem mnriem marked this pull request as ready for review May 5, 2026 22:58
@mnriem mnriem requested review from Copilot and removed request for Copilot May 5, 2026 22:58
Contributor

Copilot AI left a comment


Copilot's findings

  • Files reviewed: 5/5 changed files
  • Comments generated: 2

Comment thread src/specify_cli/extensions.py Outdated
Comment on lines +111 to +116
def _detect_archive_format(url: str, content_type: str = "") -> str:
    """Detect archive format from URL path extension or Content-Type header.

    Args:
        url: URL or file path to inspect.
        content_type: Optional ``Content-Type`` header value from the HTTP response.
Comment on lines +4926 to +4931
def _extract_workflow_yml(archive_path: Path, archive_fmt: str) -> bytes:
    """Extract ``workflow.yml`` from a ZIP or ``.tar.gz`` archive.

    Searches the archive root and a single nested top-level subdirectory
    (e.g., ``repo-name-1.0/workflow.yml``).

@mnriem
Collaborator

mnriem commented May 6, 2026

@copilot apply changes based on the comments in this thread

…add archive CLI tests

Agent-Logs-Url: https://github.com/github/spec-kit/sessions/845e41d1-75e3-49fb-a580-a7fb805dd716

Co-authored-by: mnriem <15701806+mnriem@users.noreply.github.com>
Contributor Author

Copilot AI commented May 6, 2026

@copilot apply changes based on the comments in this thread

Applied both changes in commit 0a02369:

  • Public API (extensions.py): Renamed _detect_archive_format to detect_archive_format and _safe_extract_tarball to safe_extract_tarball (dropped the underscore prefix). All callers in extensions.py, presets.py, __init__.py, and tests/test_extensions.py have been updated accordingly.
  • CLI-level tests (test_workflows.py): Added TestWorkflowAddArchive with 9 tests covering local ZIP (flat/nested), local .tar.gz (flat/nested), .tgz alias, missing workflow.yml error cases, and URL-based download for both ZIP and tar.gz formats.

Contributor

Copilot AI left a comment


Copilot's findings

Comments suppressed due to low confidence (1)

src/specify_cli/__init__.py:3669

  • archive_path.write_bytes(archive_data) can raise OSError/IOError (disk full, permission issues), but this block only handles URLError. That means specify extension add --from ... may crash with a traceback instead of emitting a controlled CLI error. Add except OSError (or broader I/O handling) around the write/install steps and surface a clear message before exiting.
                    suffix = ".tar.gz" if archive_fmt == "tar.gz" else ".zip"
                    safe_name = Path(extension).name or "extension"
                    archive_path = download_dir / f"{safe_name}-url-download{suffix}"
                    archive_path.write_bytes(archive_data)

  • Files reviewed: 6/6 changed files
  • Comments generated: 2

    if f is not None:
        with f:
            manifest_data = yaml.safe_load(f.read()) or {}
except KeyError:
Comment thread src/specify_cli/__init__.py Outdated
Comment on lines +2653 to +2654
archive_path.write_bytes(archive_data)
manifest = manager.install_from_zip(archive_path, speckit_version, priority)
- Fix tar.gz extractfile() None fallback in extension_update: nested-directory
  search now runs whenever manifest_data is still None, not only on KeyError
- Add OSError handling around write_bytes in preset --from URL path
- Add OSError handling around write_bytes in extension --from URL path
Comment thread src/specify_cli/__init__.py Fixed
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 7, 2026 16:39
Contributor

Copilot AI left a comment


Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 2

Comment thread src/specify_cli/extensions.py Outdated
Comment on lines +204 to +208
# Only allow regular files and directories.
if not (member.isreg() or member.isdir()):
    raise error_class(
        f"Non-regular file in archive: {member.name}"
    )
Comment thread tests/test_workflows.py Outdated
Comment on lines +1891 to +1893
with __import__("unittest.mock", fromlist=["patch"]).patch.object(
    __import__("pathlib", fromlist=["Path"]).Path, "cwd", return_value=project_dir
):
@mnriem
Collaborator

mnriem commented May 7, 2026

@copilot apply changes based on the comments in this thread

…ock imports in workflow tests

Agent-Logs-Url: https://github.com/github/spec-kit/sessions/c1fcc1ff-8766-4d97-90a5-368447980acf

Co-authored-by: mnriem <15701806+mnriem@users.noreply.github.com>
Contributor Author

Copilot AI commented May 7, 2026

@copilot apply changes based on the comments in this thread

Applied both changes in commit 05798a9:

  • safe_extract_tarball() metadata members: PAX headers (XHDTYPE, XGLTYPE, SOLARIS_XHDTYPE) and GNU metadata entries (GNU_TYPES: longname/longlink/sparse) are now skipped instead of rejected. The _TAR_METADATA_TYPES tuple is defined once before the loop for efficiency. Symlinks, hard links, devices, and FIFOs are still rejected; only regular files and directories are extracted.
  • Test imports: Replaced the __import__(...) inline pattern in all 9 TestWorkflowAddArchive tests with from unittest.mock import patch / patch.object(Path, "cwd", ...) — consistent with the rest of the file.
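
A sketch of how the metadata-skip and member validation described above might fit together in safe_extract_tarball(); the traversal checks and exception wrapping are paraphrased from the PR discussion, not copied from the diff:

```python
import sys
import tarfile
from pathlib import Path

# Metadata-only entry types with no extractable payload.
# GNUTYPE_SPARSE is deliberately absent: sparse entries are real files.
_TAR_METADATA_TYPES = (
    tarfile.XHDTYPE,           # PAX extended header
    tarfile.XGLTYPE,           # PAX global extended header
    tarfile.SOLARIS_XHDTYPE,   # Solaris PAX extended header
    tarfile.GNUTYPE_LONGNAME,  # GNU long-name metadata entry
    tarfile.GNUTYPE_LONGLINK,  # GNU long-link metadata entry
)


def safe_extract_tarball(archive_path: Path, dest_dir: Path, error_class: type) -> None:
    """Extract a .tar.gz archive, rejecting anything but regular files/dirs."""
    try:
        with tarfile.open(archive_path, "r:gz") as tf:
            safe_members = []
            for member in tf.getmembers():
                if member.type in _TAR_METADATA_TYPES:
                    continue  # skip silently: nothing to extract
                name = member.name
                if name.startswith("/") or ".." in Path(name).parts:
                    raise error_class(f"Unsafe path in archive: {name}")
                if not (member.isreg() or member.isdir()):
                    raise error_class(f"Non-regular file in archive: {name}")
                safe_members.append(member)
            if sys.version_info >= (3, 12):
                # data_filter adds OS-level hardening on top of the checks above.
                tf.extractall(dest_dir, members=safe_members, filter="data")
            else:
                tf.extractall(dest_dir, members=safe_members)
    except (tarfile.TarError, OSError) as exc:
        raise error_class(str(exc)) from exc
```

Passing members=safe_members on both branches keeps extraction behavior consistent across Python versions, which is the point raised in a later review comment below.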

Contributor

Copilot AI left a comment


Copilot's findings

Comments suppressed due to low confidence (2)

src/specify_cli/extensions.py:231

  • In the Python 3.12+ branch, safe_extract_tarball() validates members into safe_members (and intentionally skips PAX/GNU metadata entries), but then calls tf.extractall(dest_dir, filter="data") without restricting extraction to safe_members. This makes extraction behavior inconsistent across Python versions and can reintroduce metadata entries (or other unexpected members) into the extraction step. Pass members=safe_members in the 3.12+ branch as well (and keep the filter="data" hardening), or extract only the validated members explicitly.
            # Extract — use the "data" filter on Python 3.12+ for extra hardening.
            # On older versions pass only the pre-validated members so that no
            # unvetted entry (added concurrently or via a race) slips through.
            if sys.version_info >= (3, 12):
                tf.extractall(dest_dir, filter="data")  # type: ignore[call-arg]
            else:
                tf.extractall(dest_dir, members=safe_members)  # noqa: S202 — validated above

src/specify_cli/__init__.py:3663

  • extension add --from detects archive format from the original URL and optional Content-Type, but does not validate the final URL after redirects. An HTTPS URL can redirect to non-HTTPS, and format detection can be wrong if the redirect changes the effective filename/extension. Consider validating response.geturl() against the HTTPS/localhost policy and running detect_archive_format() against that final URL.
                    with urllib.request.urlopen(from_url, timeout=60) as response:
                        if not archive_fmt:
                            content_type = response.headers.get("Content-Type", "")
                            archive_fmt = detect_archive_format(from_url, content_type)
                        archive_data = response.read()

  • Files reviewed: 6/6 changed files
  • Comments generated: 4

Comment on lines +176 to +181
_TAR_METADATA_TYPES = (
    tarfile.XHDTYPE,          # PAX extended header
    tarfile.XGLTYPE,          # PAX global extended header
    tarfile.SOLARIS_XHDTYPE,  # Solaris PAX extended header
    *tarfile.GNU_TYPES,       # GNU longname / longlink / sparse
)
Comment thread src/specify_cli/__init__.py Outdated
Comment on lines +2635 to +2640
  archive_fmt = _det_fmt(from_url)
  try:
      with urllib.request.urlopen(from_url, timeout=60) as response:
-         zip_path.write_bytes(response.read())
+         if not archive_fmt:
+             content_type = response.headers.get("Content-Type", "")
+             archive_fmt = _det_fmt(from_url, content_type)
Comment on lines +2155 to +2164
  # Detect archive format from URL; resolve via Content-Type when needed.
  archive_fmt = detect_archive_format(download_url)

  # Download the archive
  try:
      with self._open_url(download_url, timeout=60) as response:
-         zip_data = response.read()
-
-         zip_path.write_bytes(zip_data)
-         return zip_path
+         if not archive_fmt:
+             content_type = response.headers.get("Content-Type", "")
+             archive_fmt = detect_archive_format(download_url, content_type)
+         archive_data = response.read()
Comment on lines +2316 to +2324
  # Detect archive format from URL; resolve via Content-Type when needed.
  archive_fmt = detect_archive_format(download_url)

  try:
      with self._open_url(download_url, timeout=60) as response:
-         zip_data = response.read()
-
-         zip_path.write_bytes(zip_data)
-         return zip_path
+         if not archive_fmt:
+             content_type = response.headers.get("Content-Type", "")
+             archive_fmt = detect_archive_format(download_url, content_type)
+         archive_data = response.read()
@mnriem
Collaborator

mnriem commented May 7, 2026

@copilot apply changes based on the comments in this thread

…r redirect-safe format detection and HTTPS re-check

Agent-Logs-Url: https://github.com/github/spec-kit/sessions/739d3f73-200b-417a-8a86-134329200560

Co-authored-by: mnriem <15701806+mnriem@users.noreply.github.com>
Contributor Author

Copilot AI commented May 7, 2026

@copilot apply changes based on the comments in this thread

Applied all four changes in commit 1015ff2:

  • _TAR_METADATA_TYPES: Replaced *tarfile.GNU_TYPES with explicit GNUTYPE_LONGNAME and GNUTYPE_LONGLINK only. GNUTYPE_SPARSE is intentionally excluded — sparse entries carry a real file payload and member.isreg() returns True for them (they're included in REGULAR_TYPES), so they pass the existing file-type check and are extracted correctly.
  • download_extension() / download_pack(): response.geturl() is now captured as final_url inside the with block. The Content-Type fallback uses final_url (not the original URL), and after reading the response the final URL's scheme is re-validated to catch scheme-downgrade via redirects.
  • preset add --from / extension add --from (__init__.py): Same redirect-safe pattern applied — final_url = response.geturl() captured inside the with block, used for format detection, with a post-redirect HTTPS scheme check.
  • Three mock responses in test_extensions.py and test_presets.py were updated with mock_response.geturl.return_value = <download_url> to satisfy the new geturl() call.
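
The mock-response update in the last bullet amounts to making the stubbed HTTP response expose geturl() alongside read(); a hypothetical factory (all names here are illustrative, not the actual test code):

```python
from unittest.mock import MagicMock


def make_mock_response(url: str, body: bytes,
                       content_type: str = "application/gzip") -> MagicMock:
    """Build a mock usable as `with opened_url as response:` in tests.

    Setting geturl's return value satisfies the new post-redirect URL capture.
    """
    response = MagicMock()
    response.read.return_value = body
    response.geturl.return_value = url
    response.headers = {"Content-Type": content_type}
    response.__enter__.return_value = response
    response.__exit__.return_value = False
    return response
```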

Copilot finished work on behalf of mnriem May 7, 2026 18:54
download_dir.mkdir(parents=True, exist_ok=True)
zip_path = download_dir / f"{extension}-url-download.zip"
archive_fmt = detect_archive_format(from_url)
final_url = from_url
# Detect archive format from URL; resolve via Content-Type when needed.
# `final_url` may differ from `download_url` if the server redirects.
archive_fmt = detect_archive_format(download_url)
final_url = download_url
# Detect archive format from URL; resolve via Content-Type when needed.
# `final_url` may differ from `download_url` if the server redirects.
archive_fmt = detect_archive_format(download_url)
final_url = download_url


Development

Successfully merging this pull request may close these issues.

[Enhancement] Support .tar.gz / .tgz archives for extension, preset, and workflow installation
