Skip to content

Harden XML parsing via copernik-xml-factory#640

Draft
ppkarwasz wants to merge 2 commits into
apache:masterfrom
ppkarwasz:use-copernik-xml-factory
Draft

Harden XML parsing via copernik-xml-factory#640
ppkarwasz wants to merge 2 commits into
apache:masterfrom
ppkarwasz:use-copernik-xml-factory

Conversation

@ppkarwasz

Copy link
Copy Markdown
Member

Summary

Route every direct JAXP factory instantiation through the hardened factories from eu.copernik:copernik-xml-factory 0.1.1. These factories block external DTD and entity fetching and bound internal entity expansion, regardless of the JAXP implementation on the classpath.

Hardening the parsing of a configuration file is admittedly not necessary: configuration files are normally trusted. This limits the side-effects if a user (against advice) decides to parse untrusted configuration files.

Changes

  • Add the copernik-xml-factory dependency.
  • Route factory creation through XmlFactories (DocumentBuilderFactory, SAXParserFactory, TransformerFactory) in XMLConfiguration, XMLDocumentHelper, XMLPropertiesConfiguration and XMLPropertyListConfiguration, plus the affected tests.
  • Harden the source passed to XMLDocumentHelper.transform.
  • XMLConfiguration's entity resolver now throws a SAXException for unregistered entities instead of returning null. Returning null would let the parser fall back to fetching the external resource and would override the deny-all resolver the hardening factories install on some implementations. This is done in an anonymous DefaultEntityResolver subclass local to XMLConfiguration, leaving the public DefaultEntityResolver contract (return null for unknown entities) unchanged.

Testing

  • mvn test for the affected XML test classes: 145 tests, all passing.
  • mvn checkstyle:check: clean.

Replace all direct JAXP factory instantiations (DocumentBuilderFactory,
SAXParserFactory, TransformerFactory) with the hardened factories from
eu.copernik:copernik-xml-factory 0.1.1. These factories block external
DTD and entity fetching and bound internal entity expansion, regardless
of the JAXP implementation on the classpath.

Hardening the parsing of a configuration file is admittedly not
necessary: configuration files are normally trusted. This limits the
side-effects if a user (against advice) decides to parse untrusted
configuration files.

Changes:
- Add the copernik-xml-factory dependency.
- Route factory creation through XmlFactories in XMLConfiguration,
  XMLDocumentHelper, XMLPropertiesConfiguration and
  XMLPropertyListConfiguration, plus the affected tests.
- Harden the source passed to XMLDocumentHelper.transform.
- XMLConfiguration's entity resolver now throws a SAXException for
  unregistered entities instead of returning null. Returning null would
  let the parser fall back to fetching the external resource and would
  override the deny-all resolver the hardening factories install on some
  implementations. This is done in an anonymous DefaultEntityResolver
  subclass local to XMLConfiguration, leaving the public
  DefaultEntityResolver contract (return null for unknown entities)
  unchanged.

Assisted-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 2001 test configuration references its schema through an absolute
https URI (xsi:noNamespaceSchemaLocation). The hardened parser does not
fetch external resources, so testSchemaValidationError no longer reached
the intended schema validation error; the entity resolver refused the
remote fetch first.

Register the local testMultiConfiguration.xsd for that system URI via an
XML catalog (CatalogResolver pointing at the existing catalog.xml, which
already rewrites https://commons.apache.org/ to the local test
resources) and set it as the builder's entity resolver. The schema now
loads locally, so the expected SAXParseException validation error is
raised again.

Assisted-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ppkarwasz ppkarwasz marked this pull request as draft June 18, 2026 09:28
@ppkarwasz

Copy link
Copy Markdown
Member Author

Converting this to draft, since some Java-version dependent problems arose.

@garydgregory

Copy link
Copy Markdown
Member

-1: We've already discussed making an new Commons component for this code, let's do that please.

@ppkarwasz

Copy link
Copy Markdown
Member Author

-1: We've already discussed making an new Commons component for this code, let's do that please.

Sorry, I was busy iterating on code and I forgot to mention that I created this PR only to test the capabilities (or missing capabilities) of copernik-xml-factory, before it gets included as a separate Commons component.

For a new component to be useful, it should be adopted by at least a couple of projects. Currently I have tested support for such a library:

@garydgregory

Copy link
Copy Markdown
Member

Shading can cause more problems than it's worth as exemplified with the mess some projects make shading ASM. When downstream users blow up because they can't update ASM to test or support a new Java version since that code is now a hard copy 🤔

@ppkarwasz

Copy link
Copy Markdown
Member Author

JDK 8 is affected by a bug, which denies access based on ACCESS_EXTERNAL_SCHEMA before consulting the resolver. This is why tests with xsi:schemaLocation/xsi:noNamespaceSchemaLocation hints fail.

This was theoretically fixed by JDK-8159139, but still doesn't work.

@ppkarwasz

Copy link
Copy Markdown
Member Author

Shading can cause more problems than it's worth as exemplified with the mess some projects make shading ASM.

I totally agree, but each project should be able to choose its poison.

Unlike ASM, however, XML support is unlikely to change very much in the future, so if a project decides to replace a bunch of cargo-culted hardening options without any tests, with a shaded and relocated class, this still counts as a win.

@garydgregory

Copy link
Copy Markdown
Member

Thanks for the clarifications Piotr. Let's bring the component into Commons and iterate there. It will become more visible and it doesn't need to be perfect or bug free on the first commit. We'll want to create an alpha or milestone or pick-your-name release before a 1.0 anyway. WDYT?

@ppkarwasz

Copy link
Copy Markdown
Member Author

This was theoretically fixed by JDK-8159139, but still doesn't work.

Apparently OpenJDK never got the fix, probably the fix is only present in Oracle's builds. This mean that on JDK 8 parsing this file will always fail if ACCESS_EXTERNAL_SCHEMA does not contain https. This happens even if an EntityResolver is set.

<configuration
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="https://commons.apache.org/testMultiConfiguration.xsd">
  ...
</configuration>

This is a bummer, because on JDK 9+ libraries can choose to call DocumentBuilderFactory.newDefaultInstance(), which solves the problem of implementation-specific hardenings altogether. Almost: Android's default instance will be different from the one in the JDK and will not accept the same options.

@ppkarwasz

Copy link
Copy Markdown
Member Author

We'll want to create an alpha or milestone or pick-your-name release before a 1.0 anyway. WDYT?

Sounds good to me: please choose a repository name and I'll transfer the code to Commons. Do we need a new JIRA project too?

@garydgregory

garydgregory commented Jun 18, 2026

Copy link
Copy Markdown
Member

Here we go:

  • https://github.com/apache/commons-xml
  • I asked for a COMMONSXML Jira project but it's not there yet. Note that there already is a XMLCOMMONS Jira project but I can't tell what that's part of, maybe Xerces.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants