Change TDMLRunner to use XMLTextInfosetInputter/Outputter as default by olabusayoT · Pull Request #1650 · apache/daffodil

olabusayoT · 2026-04-03T15:16:57Z

it still uses scala results for certain things so we expose getScalaResult in the TDML Inputters/Outputters
Update TDML Schema to add support for custom validation name/type and use in stringAsXML tests
Drop whitespace between elements to keep expected matching actual, but keep alll others like mixed whitespace, attributes, comments unchanged
Introduced tests for stringAsXML validation and namespace handling.
Added a noNormalizations flag to control whether comments, processing instructions, and line endings are normalized.
Updated associated XML parsing methods and test cases to support the new option.
Revised whitespace removal to handle specific scenarios for improved XML processing.
Verify prefixes resolve to the same namespaces when checking prefixes

stevedlawrence · 2026-04-08T18:06:48Z

+            NamespaceBinding(null, null | "", _),
+            _*
+          ) =>
+        dropWhitespace(e)


It looks like the dropWhitespace function recursively drops all whitespace, which I'm not sure we want to do. I would expect that infosets that use stringAsXML would want to ensure the stringAsXml portion is exactly the same, including whitespace.

Instead, maybe we should only expect the stringAsXml element to have a single Elem child, and any other children should be empty text nodes or we should error--that gives the user an indication that the expected stringAsXml messed up (or we have have a buggy infoset outputter).

Maybe instead we want something like:

case e @ Elem(..., stringAsXml, ..., children) => { val (elemChildren, otherChildren) = e.filter { _.isInstanceOf[Elem] } if (elemChildren.length != 1) ... // throw exception, stringAsXml must contain a single child element nonElemChildren.foreach { c => case Text(data) if data.matches("""\*""") => // no-op, empty text siblings are fine case _ => ... // throw exception, c is some kind of mixed content not allowed as a stringAsXml child } c.copy(child = elemChildren) }

This way the only thing we throw away are sibling whitespace Text nodes, and it also verfies that there is only single child element.

It's probably also worth adding a comment here explaining that this is specifically for the stringAsXml feature and that we avoid making changes to any of its children except removing any surrounding whitespace, requiring that stringAsXml in the infoset match results exactly.

stevedlawrence · 2026-04-08T18:13:20Z

-    val noMixedWS = removeMixedWhitespace(combinedText)
-    noMixedWS
+    n match {
+      case x @ Elem(


This function is only called on the root element which is rarely going to be the stringAsXml element, so I don't think this case will ever really match in practice. I think instead the functions below should all have cases to be no-ops for the stringAsXml element, similar to what you have for removeAttributes1

stevedlawrence · 2026-04-08T18:22:24Z

+              nsbB.getURI(prefixB),
+              b.getNamespace(prefixB)
+            )
+          )


Is it possible for these new checks to fail? I thought we validated the expected infoset to make sure it was valid, which I think should check to make sure namespaces resolve?

I don't think we validate the expected infoset, it looks like it's still an open ticket

https://issues.apache.org/jira/projects/DAFFODIL/issues/DAFFODIL-288

That being said I don't think it can actually fail, since a.getNamespace(prefixA) is usually equal to nsbA.getURI(prefixA). I cannot think of a failing elem example

stevedlawrence · 2026-04-08T18:24:58Z

    val before = testSuite.loadingExceptions.clone()

-    val elem = loader.load(infosetSrc, None) // no schema
+    val elem = loader.load(infosetSrc, None, noNormalizations = true) // no schema


My only concern with disabling normalizations is that we don't remove things like comments now. That's important for string as XML, but I feel like that has caused problems in the past with pattern matching in the daffodil compiler (for example, expecting a Seq(Elem("foo") but the children are actually Seq(Comment(..), Elem(foo)), but maybe the TDML runner doesn't do any of that kind of pattern matching to parse TDML files and just uses XML paths to access elements, which ignores things like comments?

I think the TDML runner calls normalize which would remove comments for non-stringAsXML elems, the issue was the load was removing it wholesale, which is not what we wanted

stevedlawrence · 2026-04-08T18:27:56Z

    val dafpr = parseResult.asInstanceOf[DaffodilTDMLParseResult]
    val inputter = dafpr.inputter
-    val resNode = dafpr.getResult
+    val resNode = dafpr.getScalaResult


What happens if DAFFODIL_TDML_API_INFOSETS is "xml", then I think there wont' be a scala result?

stevedlawrence · 2026-04-09T13:15:16Z

+import java.util.Properties
+
+import org.apache.daffodil.api.validation.ValidatorFactory
+


Suggset we combine this file with the TestStringAsXmlValidator.scala file. It's small enough that I don't think the separate adds a whole lot. And it's nice to have al the custom validation logic in a single file for quick reference.

stevedlawrence · 2026-04-09T13:21:46Z

+
+  <parserTestCase name="stringAsXml_09" root="binMessage"
+                  model="/org/apache/daffodil/infoset/stringAsXml/namespaced/xsd/binMessage.dfdl.xsd"
+                  validation="on">


Should these use the TestStringAsXmlValidator or just inherit from the default if changed above?

stevedlawrence · 2026-04-09T13:22:48Z

+                xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
+                xmlns:xs="http://www.w3.org/2001/XMLSchema"
+                xmlns:ex="http://example.com"
+                defaultValidation="off">


Should we set the default validation so we don't have to repeat it in the tests?

stevedlawrence · 2026-04-09T13:40:21Z

 *                          as not normalizing CRLFs is non-standard for XML.
- *
+ * @param noNormalizations True to not remove comments and processing instructions and to not normalize
+ *                       CRLF/CR to LF. This is used to keep the XML as close to the original as possible


I don't love the name noNormalizations--setting noNormalizations=false is kidnof a double negative and a bit tricky to make sense of, and it also kindof makes it so normalizeCRLFtoLF is ignored if true. I would maybe suggest we just add additional flags for the specific behaviors (e.g. removeComments and removeProcInstr). It makes it clear exactly what those flags will do and gives control to users about exactly what they want to keep. I imagine in most cases normalizeCRLF, removeComments, and removeProcInstr will all be set to the same thing (a user either wants everything removed or everything kept), but it at least gives the option.

stevedlawrence · 2026-04-09T13:48:09Z

      TDMLInfosetOutputterAll()
-    } else {
+    } else if (tdmlApiInfosetsEnv == "scala") {
      TDMLInfosetOutputterScala()


Do we need this outputter? Feels like we really just need the core XML one, and then the all one used for CI to makes sure all of our infoset inputters/outputters do the same thing.

- it still uses scala results for certain things so we expose getScalaResult in the TDML Inputters/Outputters - Update TDML Schema to add support for custom validation name/type and use in stringAsXML tests - Drop whitespace between elements to keep expected matching actual, but keep all others like mixed whitespace, attributes, comments unchanged - Introduced tests for `stringAsXML` validation and namespace handling. - Added a `noNormalizations` flag to control whether comments/processing instructions are normalized. - Updated associated XML parsing methods and test cases to support the new option. - Revised whitespace removal to handle specific scenarios for improved XML processing. - Verify prefixes resolve to the same namespaces when checking prefixes - update TDMLException with more information on why getSimpleText isn't matching - NullInfosetInputter should be received UTF-8 bytes for its events Deprecation/Compatibility Instead of ScalaXMLInfosetInputter/Outputter being the default inputter/outputter for TDML Runner, it is now XMLTextInfosetInputter/Outputter which supports stringsAsXml feature DAFFODIL-2909

- Removed unneeded `TDMLInfosetOutputterScala`. - Enhanced `DaffodilXMLLoader` and related classes with `removeComments` and `removeProcInstr` flags, replacing too-broad `noNormalizations`. - Updated `TestStringAsXmlValidator` with a unified validator for namespace and non-namespace cases. - add comments for clarity - remove getScalaResult from all but TDMLInfosetOutputterAll - add clarifying info to TDMLInfosetInputter TDML Exceptions in case of non-matches - remove intermingling of ScalaInfosetOutputter with TDMLInfosetOutterXML - Turn off pretty printing from XMLTextInfosetOutputter in TDMLInfosetOutputterAll - undo type aware changes for ScalaXMLInfosetOutputter (the way it was adding namespace bindings for scalaXML was not exactly correct as it wasn't part of the child element's minimized scope. We found it would be too much trouble to implement correctly for scalaXML, so we decided to remove the functionality) DAFFODIL-2909

jadams-tresys · 2026-05-10T00:37:31Z

+        if (elemChildren.length != 1)
+          throw new Exception("stringAsXml must contain a single child element.")


Thoughts on changing these and the below generic Exception throws to asserts? Does having it be an exception allow it to be caught or converted into a ParseError later on? Not sure how these sort of failure cases are supposed to be handled with the stringAsXML feature.

I don't think this wants to be an assert since the input could be an expected infoset in a TDML file. So if user provides invalid XML in the infoset, instead of an assert we probably want this to be some kind of exception.

Maybe we want a custom exception (maybe something like InvalidInfosetException?), then users and the TDML runner can handle this however they want. The TDMLRunner an just convert it to a TDMLException like it does most things.

jadams-tresys

+1 with one question/suggestion

- ensure stringAsXml file line endings are not normalized in windows - change generic exception to InvalidInfosetException

- make XMLUtils normalize CRLFs on expected/actual normalization during comparison if it is not StringAsXML DEPRECATION/COMPATIBILITY This code checks the actual infoset for the presence of XMLTextInfoset.stringAsXML(currently stringAsXML) and won't normalize CRLF to LF in infosets that contain that element

- reformat code

stevedlawrence

+1, minor suggestoins for refactor/cleanup

stevedlawrence · 2026-05-19T13:13:29Z

-/KEYS export-ignore 
+/KEYS export-ignore
+# ensure stringAsXml file line endings are not normalized in windows
+/daffodil-test/src/test/resources/org/apache/daffodil/infoset/stringAsXml/** -text


Did you look into any other alternatives to this? It would be nice if we didn't need to have this one special case in a hidden file. It might be hard to remember that this exists if this ever leads to issues.

If we do keep this, instead of doing a directory glob, can we list the exact files that this applies to to make it clear exactly which files have this issue? I think it's only one or two XML files? Is it also possible to add a comment to those XML files that explains that the file intentionally contains CR's for testing purposes, and that git's auto.crlf feature is disabled for this file via the .gitattributes to prevent that from changing?

stevedlawrence · 2026-05-19T13:15:49Z

+   * normalizes CRLF to LF within text nodes in non-stringAsXML elements
+   */
+  private def normalizeCRLFtoLF(ns: Node): Node = {
+    if (!ns.isInstanceOf[Elem]) return ns


Instead of this early return, can we change the default case to case e: Elem =>, and then have the default case be case _ => ns? It's maybe a bit slower since we'll have to do some matching, but it's more scala-y and I imagine it doesn't make that much of a difference in performance.

stevedlawrence · 2026-05-19T13:23:12Z

+      // NOTE: this is specifically for the stringAsXml feature as we avoid
+      // making changes to any of its children requiring that stringAsXml in
+      // the infoset match results exactly.
+      case e @ Elem(


We have this same kindof-complex match/case in a number of places to check for the stringAsElem. Thoughts on doing something like this:

private def isStringAsXmlElem(e: Elem) => e match { case Elem(null, XMLTextInfoset.stringAsXml, ...) => true case _ => false }

And then these become something like

case e: Elem if isStringAsXmlElem(e) => ...

It's also makes it easier to update that one function if we ever change what it stringAsXml elements look like.

stevedlawrence · 2026-05-19T13:34:20Z

+            }
+            case c => c
+          }
+          .map(normalizeCRLFtoLF)


Might be a bit cleaner to do something like:

case e: Elem ... => e // stringAsXml case case e: Elem => { val normalized = e.child.map(normalizeCRLFtoLF) val res = if (normalized eq children) e else e.copy(child = normalized) } } case Text(data) if data.contains("\r") => { val replaced = ... Text(replaced) } case _ => n

This makes it a bit more clear that it just recurses down non-stringAsXmlElements and the only thing that actually changes is Text nodes that contain CR's.

stevedlawrence · 2026-05-19T14:12:54Z


+  /**
+   * normalizes CRLF to LF within text nodes in non-stringAsXML elements
+   */


Can you add something that explain why this is important? Something about how fields in infosets could contain LF, but could be changed to CRLF due to git's autocrlf feature. And since infoset outputters always output LF we need to undo with git might do and normalize those CRLF's to LF.

stevedlawrence · 2026-05-19T14:27:25Z

+          cB.toString,
+          maybeType,
+          maybeFloatEpsilon,
+          maybeDoubleEpsilon


Should we pass in None for type, and epsilons for comments and PCData? I'm not even really sure how maybeType coul dbe defined here. Maybe if stringAsXml contained something like this:

<stringAsXml xmlns=""> <foo xsi:type="xs:int">  5 </foo> </stringAsXml>

I think in that case we might e usuing xs:int for type aware comparisons? And we'll try to compare the comment as if it were an int, which might break? I'm not postive, but I think we can avoid this all if we just pass in None for these to disable type awareness.

So I tested this and inline comment ends up having its own type who is None. foo is a separate iteration of computeDiff and is what actually carries the type. But for extra precaution, will do!

stevedlawrence · 2026-05-19T14:36:55Z

      }
+      case (cA: Comment, cB: Comment) => {
+        val thisDiff = computeTextDiff(
+          zPath,


It might be worth adding something to zpath to make it clear this is a comment that is a child of whatever zpath is. Maybe somethin like zPath + "/@comment". The path step make sit clear it's a child, and @ is what we've been using to indicate something is like an attribute or somethig other than an element. Same idea for pcadata below.

- Introduced `binMessageA` element and relevant schema changes, including `stringAsXmlGroupA` and updated XML payload references to test inline comment in stringAsXml. - Enhanced `XMLUtils` to streamline `CRLF` normalization and `stringAsXml` element identification. - Adjusted `.gitattributes` to prevent line ending normalization for specific test files.

olabusayoT requested review from jadams-tresys and stevedlawrence April 3, 2026 15:16

olabusayoT force-pushed the daf-2909-tdml-runner-stringAsXML branch 5 times, most recently from 2e3475b to c42b8ae Compare April 6, 2026 18:06

stevedlawrence reviewed Apr 9, 2026

View reviewed changes

olabusayoT force-pushed the daf-2909-tdml-runner-stringAsXML branch from 8567dbb to d1ca111 Compare April 14, 2026 19:17

olabusayoT added 2 commits April 14, 2026 15:22

olabusayoT force-pushed the daf-2909-tdml-runner-stringAsXML branch from d1ca111 to 0ac13c9 Compare April 15, 2026 17:49

jadams-tresys reviewed May 10, 2026

View reviewed changes

jadams-tresys approved these changes May 10, 2026

View reviewed changes

fixup! fixup!

47912e4

- ensure stringAsXml file line endings are not normalized in windows - change generic exception to InvalidInfosetException

olabusayoT force-pushed the daf-2909-tdml-runner-stringAsXML branch from ba7308f to 2c5cadf Compare May 18, 2026 19:30

olabusayoT requested a review from stevedlawrence May 18, 2026 19:53

olabusayoT force-pushed the daf-2909-tdml-runner-stringAsXML branch from 2c5cadf to 6ddad8d Compare May 18, 2026 21:13

fixup! fixup! fixup! fixup!

1e8ba75

- reformat code

stevedlawrence approved these changes May 19, 2026

View reviewed changes

		import java.util.Properties

		import org.apache.daffodil.api.validation.ValidatorFactory

		if (elemChildren.length != 1)
		throw new Exception("stringAsXml must contain a single child element.")

Conversation

olabusayoT commented Apr 3, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jadams-tresys left a comment

Choose a reason for hiding this comment

Uh oh!

stevedlawrence left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants