Change TDMLRunner to use XMLTextInfosetInputter/Outputter as default#1650
Change TDMLRunner to use XMLTextInfosetInputter/Outputter as default#1650olabusayoT wants to merge 6 commits into
Conversation
2e3475b to
c42b8ae
Compare
| NamespaceBinding(null, null | "", _), | ||
| _* | ||
| ) => | ||
| dropWhitespace(e) |
There was a problem hiding this comment.
It looks like the dropWhitespace function recursively drops all whitespace, which I'm not sure we want to do. I would expect that infosets that use stringAsXML would want to ensure the stringAsXml portion is exactly the same, including whitespace.
Instead, maybe we should only expect the stringAsXml element to have a single Elem child, and any other children should be empty text nodes or we should error--that gives the user an indication that the expected stringAsXml messed up (or we have have a buggy infoset outputter).
Maybe instead we want something like:
case e @ Elem(..., stringAsXml, ..., children) => {
val (elemChildren, otherChildren) = e.filter { _.isInstanceOf[Elem] }
if (elemChildren.length != 1) ... // throw exception, stringAsXml must contain a single child element
nonElemChildren.foreach { c =>
case Text(data) if data.matches("""\*""") => // no-op, empty text siblings are fine
case _ => ... // throw exception, c is some kind of mixed content not allowed as a stringAsXml child
}
c.copy(child = elemChildren)
}This way the only thing we throw away are sibling whitespace Text nodes, and it also verfies that there is only single child element.
It's probably also worth adding a comment here explaining that this is specifically for the stringAsXml feature and that we avoid making changes to any of its children except removing any surrounding whitespace, requiring that stringAsXml in the infoset match results exactly.
| val noMixedWS = removeMixedWhitespace(combinedText) | ||
| noMixedWS | ||
| n match { | ||
| case x @ Elem( |
There was a problem hiding this comment.
This function is only called on the root element which is rarely going to be the stringAsXml element, so I don't think this case will ever really match in practice. I think instead the functions below should all have cases to be no-ops for the stringAsXml element, similar to what you have for removeAttributes1
| nsbB.getURI(prefixB), | ||
| b.getNamespace(prefixB) | ||
| ) | ||
| ) |
There was a problem hiding this comment.
Is it possible for these new checks to fail? I thought we validated the expected infoset to make sure it was valid, which I think should check to make sure namespaces resolve?
There was a problem hiding this comment.
I don't think we validate the expected infoset, it looks like it's still an open ticket
https://issues.apache.org/jira/projects/DAFFODIL/issues/DAFFODIL-288
That being said I don't think it can actually fail, since a.getNamespace(prefixA) is usually equal to nsbA.getURI(prefixA). I cannot think of a failing elem example
| val before = testSuite.loadingExceptions.clone() | ||
|
|
||
| val elem = loader.load(infosetSrc, None) // no schema | ||
| val elem = loader.load(infosetSrc, None, noNormalizations = true) // no schema |
There was a problem hiding this comment.
My only concern with disabling normalizations is that we don't remove things like comments now. That's important for string as XML, but I feel like that has caused problems in the past with pattern matching in the daffodil compiler (for example, expecting a Seq(Elem("foo") but the children are actually Seq(Comment(..), Elem(foo)), but maybe the TDML runner doesn't do any of that kind of pattern matching to parse TDML files and just uses XML paths to access elements, which ignores things like comments?
There was a problem hiding this comment.
I think the TDML runner calls normalize which would remove comments for non-stringAsXML elems, the issue was the load was removing it wholesale, which is not what we wanted
| val dafpr = parseResult.asInstanceOf[DaffodilTDMLParseResult] | ||
| val inputter = dafpr.inputter | ||
| val resNode = dafpr.getResult | ||
| val resNode = dafpr.getScalaResult |
There was a problem hiding this comment.
What happens if DAFFODIL_TDML_API_INFOSETS is "xml", then I think there wont' be a scala result?
| import java.util.Properties | ||
|
|
||
| import org.apache.daffodil.api.validation.ValidatorFactory | ||
|
|
There was a problem hiding this comment.
Suggset we combine this file with the TestStringAsXmlValidator.scala file. It's small enough that I don't think the separate adds a whole lot. And it's nice to have al the custom validation logic in a single file for quick reference.
|
|
||
| <parserTestCase name="stringAsXml_09" root="binMessage" | ||
| model="/org/apache/daffodil/infoset/stringAsXml/namespaced/xsd/binMessage.dfdl.xsd" | ||
| validation="on"> |
There was a problem hiding this comment.
Should these use the TestStringAsXmlValidator or just inherit from the default if changed above?
| xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/" | ||
| xmlns:xs="http://www.w3.org/2001/XMLSchema" | ||
| xmlns:ex="http://example.com" | ||
| defaultValidation="off"> |
There was a problem hiding this comment.
Should we set the default validation so we don't have to repeat it in the tests?
| * as not normalizing CRLFs is non-standard for XML. | ||
| * | ||
| * @param noNormalizations True to not remove comments and processing instructions and to not normalize | ||
| * CRLF/CR to LF. This is used to keep the XML as close to the original as possible |
There was a problem hiding this comment.
I don't love the name noNormalizations--setting noNormalizations=false is kidnof a double negative and a bit tricky to make sense of, and it also kindof makes it so normalizeCRLFtoLF is ignored if true. I would maybe suggest we just add additional flags for the specific behaviors (e.g. removeComments and removeProcInstr). It makes it clear exactly what those flags will do and gives control to users about exactly what they want to keep. I imagine in most cases normalizeCRLF, removeComments, and removeProcInstr will all be set to the same thing (a user either wants everything removed or everything kept), but it at least gives the option.
| TDMLInfosetOutputterAll() | ||
| } else { | ||
| } else if (tdmlApiInfosetsEnv == "scala") { | ||
| TDMLInfosetOutputterScala() |
There was a problem hiding this comment.
Do we need this outputter? Feels like we really just need the core XML one, and then the all one used for CI to makes sure all of our infoset inputters/outputters do the same thing.
8567dbb to
d1ca111
Compare
- it still uses scala results for certain things so we expose getScalaResult in the TDML Inputters/Outputters - Update TDML Schema to add support for custom validation name/type and use in stringAsXML tests - Drop whitespace between elements to keep expected matching actual, but keep all others like mixed whitespace, attributes, comments unchanged - Introduced tests for `stringAsXML` validation and namespace handling. - Added a `noNormalizations` flag to control whether comments/processing instructions are normalized. - Updated associated XML parsing methods and test cases to support the new option. - Revised whitespace removal to handle specific scenarios for improved XML processing. - Verify prefixes resolve to the same namespaces when checking prefixes - update TDMLException with more information on why getSimpleText isn't matching - NullInfosetInputter should be received UTF-8 bytes for its events Deprecation/Compatibility Instead of ScalaXMLInfosetInputter/Outputter being the default inputter/outputter for TDML Runner, it is now XMLTextInfosetInputter/Outputter which supports stringsAsXml feature DAFFODIL-2909
- Removed unneeded `TDMLInfosetOutputterScala`. - Enhanced `DaffodilXMLLoader` and related classes with `removeComments` and `removeProcInstr` flags, replacing too-broad `noNormalizations`. - Updated `TestStringAsXmlValidator` with a unified validator for namespace and non-namespace cases. - add comments for clarity - remove getScalaResult from all but TDMLInfosetOutputterAll - add clarifying info to TDMLInfosetInputter TDML Exceptions in case of non-matches - remove intermingling of ScalaInfosetOutputter with TDMLInfosetOutterXML - Turn off pretty printing from XMLTextInfosetOutputter in TDMLInfosetOutputterAll - undo type aware changes for ScalaXMLInfosetOutputter (the way it was adding namespace bindings for scalaXML was not exactly correct as it wasn't part of the child element's minimized scope. We found it would be too much trouble to implement correctly for scalaXML, so we decided to remove the functionality) DAFFODIL-2909
d1ca111 to
0ac13c9
Compare
| if (elemChildren.length != 1) | ||
| throw new Exception("stringAsXml must contain a single child element.") |
There was a problem hiding this comment.
Thoughts on changing these and the below generic Exception throws to asserts? Does having it be an exception allow it to be caught or converted into a ParseError later on? Not sure how these sort of failure cases are supposed to be handled with the stringAsXML feature.
There was a problem hiding this comment.
I don't think this wants to be an assert since the input could be an expected infoset in a TDML file. So if user provides invalid XML in the infoset, instead of an assert we probably want this to be some kind of exception.
Maybe we want a custom exception (maybe something like InvalidInfosetException?), then users and the TDML runner can handle this however they want. The TDMLRunner an just convert it to a TDMLException like it does most things.
jadams-tresys
left a comment
There was a problem hiding this comment.
+1 with one question/suggestion
- ensure stringAsXml file line endings are not normalized in windows - change generic exception to InvalidInfosetException
ba7308f to
2c5cadf
Compare
- make XMLUtils normalize CRLFs on expected/actual normalization during comparison if it is not StringAsXML DEPRECATION/COMPATIBILITY This code checks the actual infoset for the presence of XMLTextInfoset.stringAsXML(currently stringAsXML) and won't normalize CRLF to LF in infosets that contain that element
2c5cadf to
6ddad8d
Compare
- reformat code
stevedlawrence
left a comment
There was a problem hiding this comment.
+1, minor suggestoins for refactor/cleanup
| /KEYS export-ignore | ||
| /KEYS export-ignore | ||
| # ensure stringAsXml file line endings are not normalized in windows | ||
| /daffodil-test/src/test/resources/org/apache/daffodil/infoset/stringAsXml/** -text No newline at end of file |
There was a problem hiding this comment.
Did you look into any other alternatives to this? It would be nice if we didn't need to have this one special case in a hidden file. It might be hard to remember that this exists if this ever leads to issues.
If we do keep this, instead of doing a directory glob, can we list the exact files that this applies to to make it clear exactly which files have this issue? I think it's only one or two XML files? Is it also possible to add a comment to those XML files that explains that the file intentionally contains CR's for testing purposes, and that git's auto.crlf feature is disabled for this file via the .gitattributes to prevent that from changing?
| * normalizes CRLF to LF within text nodes in non-stringAsXML elements | ||
| */ | ||
| private def normalizeCRLFtoLF(ns: Node): Node = { | ||
| if (!ns.isInstanceOf[Elem]) return ns |
There was a problem hiding this comment.
Instead of this early return, can we change the default case to case e: Elem =>, and then have the default case be case _ => ns? It's maybe a bit slower since we'll have to do some matching, but it's more scala-y and I imagine it doesn't make that much of a difference in performance.
| // NOTE: this is specifically for the stringAsXml feature as we avoid | ||
| // making changes to any of its children requiring that stringAsXml in | ||
| // the infoset match results exactly. | ||
| case e @ Elem( |
There was a problem hiding this comment.
We have this same kindof-complex match/case in a number of places to check for the stringAsElem. Thoughts on doing something like this:
private def isStringAsXmlElem(e: Elem) => e match {
case Elem(null, XMLTextInfoset.stringAsXml, ...) => true
case _ => false
}And then these become something like
case e: Elem if isStringAsXmlElem(e) => ...
It's also makes it easier to update that one function if we ever change what it stringAsXml elements look like.
| } | ||
| case c => c | ||
| } | ||
| .map(normalizeCRLFtoLF) |
There was a problem hiding this comment.
Might be a bit cleaner to do something like:
case e: Elem ... => e // stringAsXml case
case e: Elem => {
val normalized = e.child.map(normalizeCRLFtoLF)
val res =
if (normalized eq children) e
else e.copy(child = normalized)
}
}
case Text(data) if data.contains("\r") => {
val replaced = ...
Text(replaced)
}
case _ => nThis makes it a bit more clear that it just recurses down non-stringAsXmlElements and the only thing that actually changes is Text nodes that contain CR's.
|
|
||
| /** | ||
| * normalizes CRLF to LF within text nodes in non-stringAsXML elements | ||
| */ |
There was a problem hiding this comment.
Can you add something that explain why this is important? Something about how fields in infosets could contain LF, but could be changed to CRLF due to git's autocrlf feature. And since infoset outputters always output LF we need to undo with git might do and normalize those CRLF's to LF.
| cB.toString, | ||
| maybeType, | ||
| maybeFloatEpsilon, | ||
| maybeDoubleEpsilon |
There was a problem hiding this comment.
Should we pass in None for type, and epsilons for comments and PCData? I'm not even really sure how maybeType coul dbe defined here. Maybe if stringAsXml contained something like this:
<stringAsXml xmlns="">
<foo xsi:type="xs:int">
<!-- inline comment -->
5
</foo>
</stringAsXml>I think in that case we might e usuing xs:int for type aware comparisons? And we'll try to compare the comment as if it were an int, which might break? I'm not postive, but I think we can avoid this all if we just pass in None for these to disable type awareness.
There was a problem hiding this comment.
So I tested this and inline comment ends up having its own type who is None. foo is a separate iteration of computeDiff and is what actually carries the type. But for extra precaution, will do!
| } | ||
| case (cA: Comment, cB: Comment) => { | ||
| val thisDiff = computeTextDiff( | ||
| zPath, |
There was a problem hiding this comment.
It might be worth adding something to zpath to make it clear this is a comment that is a child of whatever zpath is. Maybe somethin like zPath + "/@comment". The path step make sit clear it's a child, and @ is what we've been using to indicate something is like an attribute or somethig other than an element. Same idea for pcadata below.
- Introduced `binMessageA` element and relevant schema changes, including `stringAsXmlGroupA` and updated XML payload references to test inline comment in stringAsXml. - Enhanced `XMLUtils` to streamline `CRLF` normalization and `stringAsXml` element identification. - Adjusted `.gitattributes` to prevent line ending normalization for specific test files.
stringAsXMLvalidation and namespace handling.noNormalizationsflag to control whether comments, processing instructions, and line endings are normalized.DAFFODIL-2909