XML Entities by example #5634

Open
alfsb wants to merge 3 commits into php:master from alfsb:xml-entity-by-example


@alfsb

alfsb commented Jun 23, 2026

Member

This PR is a tour of the XML Entities project, showing how it works by example. The final objective is to remove all DTD entities in favor of "XML Entities", implemented over the last year.

Entity deletion

The first example is the removal of unused entities from doc-en. In particular, the file doc-en/contributors.ent has held two empty DTD entities for a long time. But simply removing these entities would break the build of the manual, as both are referenced by a file in another repository (in this case, doc-base/manual.xml). Such interdependence between repositories is common, more so between doc-base and the translations, and it makes evolving the doc-en manual harder than it needs to be.

So, to remove these unused entities from doc-en without breaking all translations, they are "moved" to doc-en/entities/entities-remove.ent, where they remain declared and usable but are explicitly marked as deleted in all translations. This way, doc-en does not need to maintain unused entities, and translations do not need to keep translating them.

Transformation of a DTD Entity into an XML Entity

As an example of the main objective of the XML Entities project, I moved two entities from doc-en/extensions.ent to the new format.

First, the text of the extcat.intro entity was transformed from:

<!ENTITY extcat.intro '<title xmlns="http://docbook.org/ns/docbook">Extension List/Categorization</title><simpara xmlns="http://docbook.org/ns/docbook">This
appendix categorizes more than 150 extensions documented in the PHP
Manual by several criteria.</simpara>'>

to this:

<entity name="extcat.intro">
 <title>Extension List/Categorization</title>
 <simpara>This appendix categorizes more than 150 extensions documented
 in the PHP Manual by several criteria.</simpara>
</entity>

Note the clearer syntax. In particular, note the removal of the two namespace declarations, which are indeed obligatory in the DTD format in some obscure cases, but are optional and unnecessary in the XML format. These new files are fully valid XML, down to the namespace declarations, and they can hold textual entities, well balanced texts, and also multirooted XML fragments. So not only are they easier to process as normal XML, they are also easier to edit, as any XML error will be detected by IDEs, something that does not happen with DTD entities.

The extcat.alphabetical entity demonstrates support for text-only entities. It is duplicated in the example to demonstrate another aspect of the XML Entities project: more detailed reporting of entity collisions and of under- and over-translated entities.
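To illustrate why a single namespace declaration is enough in the XML format, here is a minimal Python sketch that reads such a file with a stock XML parser. The `<entities>` root element, its default namespace, and the text of the second entity are assumptions made for the example; only the `<entity name="...">` element itself appears in this PR, and the real file layout may differ.

```python
import xml.etree.ElementTree as ET

DOCBOOK_NS = "http://docbook.org/ns/docbook"

# Hypothetical file layout: the <entities> root and its default namespace are
# assumptions; "placeholder text" is not the real content of the entity.
SAMPLE = f"""<?xml version="1.0"?>
<entities xmlns="{DOCBOOK_NS}">
 <entity name="extcat.intro">
  <title>Extension List/Categorization</title>
  <simpara>This appendix categorizes more than 150 extensions documented
  in the PHP Manual by several criteria.</simpara>
 </entity>
 <entity name="extcat.alphabetical">placeholder text</entity>
</entities>
"""

def read_entities(xml_text):
    """Return {entity name: {"children": [local tag names], "text": leading text}}."""
    root = ET.fromstring(xml_text)
    result = {}
    for entity in root.iter(f"{{{DOCBOOK_NS}}}entity"):
        # Children inherit the DocBook namespace from the root declaration,
        # so no per-element xmlns attribute is needed in the source file.
        children = [child.tag.split("}", 1)[-1] for child in entity]
        text = (entity.text or "").strip()
        result[entity.get("name")] = {"children": children, "text": text}
    return result

if __name__ == "__main__":
    for name, info in read_entities(SAMPLE).items():
        print(name, info)
```

The same parser that reads the file will also reject an unbalanced tag, which is the kind of error an IDE can flag in this format.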

Detailed reporting

After the XML Entities project is fully merged (see php/doc-base#307), merging this PR will add a new line to the output of doc-base/configure.php runs. Something like this:

Running text-entities.php... done: 4 entities, 1 untranslated, 1 other failures.
(run "doc-base/scripts/text-entities.php lang [lang] --debug" for details) 

Running text-entities.php directly, with one language, will then report:

Running text-entities.php...
 Normal entity, redefined 1 times: extcat.alphabetical
done: 4 entities, 1 other failures.

And running text-entities.php directly with two languages, will then report:

Running text-entities.php...
 Not translated:                   extcat.intro
 Multiple redefined/translated:    extcat.alphabetical
done: 4 entities, 1 untranslated, 1 other failures.

So it will be possible to detect, from one place, all duplicated entities, all duplicated translations, and all redefinitions of entities that should not occur.
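The kind of bookkeeping behind such a report can be sketched in a few lines of Python. This is not the logic of text-entities.php, which is not shown in this PR; the function name, the input shape, and the language codes are hypothetical, and the counts are not meant to reproduce the output above.

```python
from collections import Counter

def report(definitions, languages):
    """Summarize entity definitions.

    definitions: list of (language, entity name) pairs, one per definition found.
    languages:   languages in load order; the first one is the source language.
    """
    per_lang = {lang: Counter() for lang in languages}
    for lang, name in definitions:
        per_lang[lang][name] += 1

    # An entity defined more than once inside the same language is a collision.
    redefined = sorted(
        {name for counts in per_lang.values() for name, n in counts.items() if n > 1}
    )

    # An entity present in the source language but in no other language
    # is untranslated. With a single language there is nothing to compare.
    untranslated = []
    if len(languages) > 1:
        source, others = languages[0], languages[1:]
        for name in per_lang[source]:
            if all(name not in per_lang[lang] for lang in others):
                untranslated.append(name)

    return {
        "total": len({name for _, name in definitions}),
        "redefined": redefined,
        "untranslated": sorted(untranslated),
    }

if __name__ == "__main__":
    found = [
        ("en", "extcat.intro"),
        ("en", "extcat.alphabetical"),
        ("en", "extcat.alphabetical"),
        ("pt_BR", "extcat.alphabetical"),
    ]
    print(report(found, ["en", "pt_BR"]))
```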

Big entities as individual files

The new doc-lang/entities/ path is not limited to .ent files. Any name.xml file placed there will be compiled as an individual entity, where the entity name is the file name (minus the .xml) and the entity contents are the contents of the file.

No more gigantic and hard to edit entities in language-snippets.ent. These can be moved here.
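The file-name-to-entity mapping can be sketched as follows. This is a Python illustration under assumptions, not the doc-base implementation: the function name is hypothetical, and whether subdirectories of entities/ are scanned is not stated here, so the sketch only looks at files directly inside the given directory.

```python
import tempfile
from pathlib import Path

def file_entities(entities_dir):
    """Map entity name -> contents for each *.xml file directly in entities_dir."""
    entities = {}
    for path in sorted(Path(entities_dir).glob("*.xml")):
        # "name.xml" becomes the entity "name"; only the final ".xml" is dropped,
        # so dotted names such as "a.b.xml" map to "a.b".
        entities[path.stem] = path.read_text(encoding="utf-8")
    return entities

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical entity name and content, for illustration only.
        sample = Path(tmp) / "example.big-snippet.xml"
        sample.write_text("<simpara>Long text</simpara>", encoding="utf-8")
        print(file_entities(tmp))
```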

Per extension entities

Finally, another feature of XML Entities is the ability to have per extension .ent files inside each doc-lang/reference/ dir, so language-snippets.ent can also be split. These per extension entity files are processed in one place and are reported in consolidated form, as above.
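Collecting per extension files "in one place" could look like the Python sketch below. The function name and the grouping by the first directory under reference/ are assumptions for illustration; the actual discovery rules live in doc-base and may differ.

```python
import tempfile
from pathlib import Path

def per_extension_ent_files(lang_root):
    """Group *.ent file names found under <lang_root>/reference/ by extension dir."""
    reference = Path(lang_root) / "reference"
    found = {}
    for path in sorted(reference.rglob("*.ent")):
        # Assumption: the first directory below reference/ names the extension.
        extension = path.relative_to(reference).parts[0]
        found.setdefault(extension, []).append(path.name)
    return found

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical extension directories and file names.
        for ext in ("array", "strings"):
            ext_dir = Path(tmp) / "reference" / ext
            ext_dir.mkdir(parents=True)
            (ext_dir / "entities.ent").write_text("", encoding="utf-8")
        print(per_extension_ent_files(tmp))
```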


The plan

If there are no hard objections, wait for the merging of doc-base PR 307, then merge this.

Then, start new PRs converting all .ent files of doc-en (minus language-snippets.ent). As language-snippets.ent is manually split into per extension entity files, it will shrink until it can be deleted, and finally all manually edited DTD files can be erased from the manual.

And... that's it.

Comments and reviews are welcome. I plan to leave this open for at least two weeks before merging.

@alfsb

alfsb commented Jun 23, 2026

Member Author

The build failures are expected. They will be fixed by doc-base PR 307. I opened this early to allow more time for comments.

@jordikroon

Member

We should also update docbookcs.xml, or these entities will not be loaded. Can you add those, or would you prefer that I do it?

@alfsb

alfsb commented Jun 23, 2026

Member Author

I will update docbookcs.xml, as one entire file is being removed.

@jordikroon

Member

And the entities folder should be added as well, if I am correct.

@alfsb

alfsb commented Jun 23, 2026

Member Author

Isn't the entities/ folder already included by the <directory> or <path> tags? That is, don't they already recurse?

@jordikroon

Member

<directory> is not referencing anything from the entities in doc-en. Only doc-base.

It's a bit of a wacky setup, though, because of the doc-base dependency, so I understand the confusion.

  • Project: the full scope of the project from the parent folder.
  • Path: all files to scan as entry point. Only loads .xml files.
  • Entities: All entity files and folders. Only loads .ent files.

@alfsb

alfsb commented Jun 23, 2026

Member Author

I'm reading these like this:

 <entities>
  <!-- These are files from doc-en... -->
  <file>contributors.ent</file>   
  <file>extensions.ent</file>
  <!-- ... so anything outside doc-en need a relative path -->
  <directory>../doc-base/entities/</directory>
  <file>../doc-base/temp/file-entities.ent</file>

Is this correct? As in, is the current directory the path of the first /project/directory?

Also, if a <directory> inside <entities> loads all .ent files recursively, then would this:

 <entities>
  <directory>.</directory>
  <directory>../doc-base/</directory>
 </entities>

not do the same as listing all the files in each directory?

@jordikroon

Member

It does do the same thing. It just scans more files. But since we will have .ent files in the references folder (per ext), it's fine with me for doc-base.

Though I would prefer that the doc-base paths stay as specific as they were. WDYT?
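The reading discussed above (entries resolved against one base directory, with a `<directory>` recursing for .ent files) can be sketched in Python as follows. This is an illustration of that reading only, not docbookcs code: the function name is hypothetical, and the base-directory rule is the interpretation asked about in this thread rather than confirmed behavior.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

CONFIG = """
<entities>
  <!-- a file inside the base directory -->
  <file>contributors.ent</file>
  <!-- anything outside the base directory uses a relative path -->
  <directory>../doc-base/entities/</directory>
</entities>
"""

def resolve_entities(config_xml, base_dir):
    """Expand <file> and <directory> entries into a flat list of paths."""
    base = Path(base_dir)
    files = []
    # ElementTree skips XML comments by default, so only elements are visited.
    for node in ET.fromstring(config_xml):
        target = base / (node.text or "").strip()
        if node.tag == "file":
            files.append(target)
        elif node.tag == "directory":
            # Assumption: a directory entry loads every .ent file below it.
            files.extend(sorted(target.rglob("*.ent")))
    return files

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "doc-en").mkdir()
        (root / "doc-base" / "entities").mkdir(parents=True)
        (root / "doc-base" / "entities" / "global.ent").write_text("")
        for path in resolve_entities(CONFIG, root / "doc-en"):
            print(path)
```

Under this reading, a broad `<directory>` entry and an explicit list of files yield the same set as long as no unrelated .ent files sit below the directory, which is why specific paths scan less.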

@alfsb

alfsb commented Jun 23, 2026

Member Author

> Though I would prefer that the doc-base paths stay as specific as they were. WDYT?

Specific paths for doc-base are a very good thing; that was only an example.

> It does do the same thing. It just scans more files. But since we will have .ent files in the references folder (per ext), it's fine with me for doc-base.

Now this got me worried. There will be new .ent files in the entities and reference folders. But these are neither Docbook XML files nor concatenated DTD entities. They are normal XML files, yes, but with a lot of Docbook fragments.

These new .ent files can be detected because they have a <?xml header. But many new .xml files in the entities folder are at most well balanced texts, without a <?xml header. That is, they are for the most part invalid Docbook XML, and many are in fact not even valid XML.

How will docbookcs react to them?
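The header check mentioned above is cheap to implement. A minimal Python sketch, with a hypothetical function name, might look like this; it assumes UTF-8 compatible files and is not taken from docbookcs or doc-base.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"
# "<?xml" followed by whitespace, so "<?xml-stylesheet" does not match.
XML_DECL = re.compile(rb"<\?xml\s")

def has_xml_declaration(data: bytes) -> bool:
    """True if the content starts with an XML declaration (after an optional BOM)."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    # The declaration must be the very first thing in the file,
    # so leading whitespace is deliberately not skipped.
    return XML_DECL.match(data) is not None

if __name__ == "__main__":
    print(has_xml_declaration(b'<?xml version="1.0"?><entities/>'))
    print(has_xml_declaration(b"<simpara>just a fragment</simpara>"))
```

A tool could use a check like this to tell the new XML entity files apart from well balanced text files that share the same folder.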


For now, having only specific paths for DTD entities will suffice, but the new entities folder may need some form of exclusion. Something like a <path>.<exclude>entities</exclude></path> or a docbookcs-skip marker file.

Thinking ahead, if docbookcs is a local project, I have some bold suggestions:

  • <dtd-entities> for concatenated DTD entity definitions, instead of <entities> (.ent);
  • <xml-entities> for the new XML entities files (.ent);
  • <xml-files> for single rooted, quasi-valid XML files (see below) (.xml);
  • <wbt-files> for well balanced text (WBT) files (.xml).

In fact, I'm very curious about how the sniffer loads the actual XML files with undefined entities, as I cannot make any bundled PHP XML parser read those. I will try to study how docbookcs does that.

But if docbookcs is not a local project... I think I will need to change XML Entities to use a different extension for XML fragments in the entities/ folder.

@jordikroon

Member

I think we would have to do some testing outside of this PR. docbook-cs is a project we control.

Perhaps we can do something like:

<entities>
    <directory type="dtd" />
    <directory type="xml" />
</entities>

@alfsb

alfsb commented Jun 24, 2026

Member Author

> Perhaps we can do something like:

That would be very nice, and it maps directly onto what the manual sources already do, which is for now buried inside the DTD entities and which XML entities will make visible as individual XML fragment files.

About these fragment files in entities/, I could specify that they are .xml files without any XML declaration, so all files in entities/ (.ent and .xml) would be special in some way, if that would make things easier on your side.

Two side things:

  1. The distinction between files and directories at the definition level is something that makes everything much harder inside doc-base/scripts/file-entities.php. I'm working on a new version that sidesteps this completely. There are only "paths", and whether one is a file or a directory is one is_dir() away.
  2. That new version will have a "well balanced [XML] region parser" that preserves all undefined entities as is, in case docbookcs needs something related to the entities themselves.
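One way to check that a fragment is well balanced while leaving undefined entities untouched is to swap entity references for placeholders before parsing. The Python sketch below shows that placeholder technique; it is not the parser planned for doc-base, all names in it are hypothetical, and it does not handle fragments that carry an XML declaration or use unbound namespace prefixes.

```python
import re
import xml.etree.ElementTree as ET

PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}
# Named references only; character references such as &#160; are left alone.
ENTITY_REF = re.compile(r"&([A-Za-z_:][\w.:-]*);")
OPEN, CLOSE = "\ue000", "\ue001"  # private-use characters, legal in XML text

def protect(text):
    """Replace undefined entity references with inert placeholders."""
    def swap(match):
        name = match.group(1)
        return match.group(0) if name in PREDEFINED else f"{OPEN}{name}{CLOSE}"
    return ENTITY_REF.sub(swap, text)

def restore(text):
    """Turn placeholders back into the original entity references."""
    return re.sub(f"{OPEN}([^{CLOSE}]*){CLOSE}", r"&\1;", text)

def is_well_balanced(fragment):
    """True if the fragment parses once wrapped in a synthetic root element."""
    wrapped = f"<wrapper>{protect(fragment)}</wrapper>"
    try:
        ET.fromstring(wrapped)
    except ET.ParseError:
        return False
    return True

if __name__ == "__main__":
    # "&example.entity;" is an illustrative, undefined entity name.
    print(is_well_balanced("<title>&example.entity;</title><simpara>text</simpara>"))
    print(is_well_balanced("<title>oops</simpara>"))
```

Because the references are restored verbatim afterwards, a tool built this way can report on a fragment without ever needing the entity definitions.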
