========================
eZ Publish markup format
========================

Summary of the discussion results on the new internal eZ Publish markup
format.

Scope
=====

The discussed format will be used for the storage of documents in the data
backend and therefore needs to be able to represent a sufficient superset of
the markup used by the various input and output formats.

Common use cases
----------------

Common use cases which should be covered by the document format:

1) Web content management

   In web content management the user will most likely edit the contents using
   some rich text editor [#]_ in the browser, and the contents will be
   transformed to (X)HTML for output on the website. Depending on the
   customer's preferences the output language might be anything from HTML 4
   to HTML 5, or X/HTML 1, 1.1, 2 or 5.

2) Content management

   Content management normally involves more formats, like the already known
   Office document import and export, and also exporting documents using known
   print output formats like PDF and LaTeX. The storage format must be able to
   match the markup offered by those documents as closely as possible, so that
   as little document semantics as possible is lost.

3) Website styling

   Some users want to use web content management systems for easy editing and
   styling of their web contents, which includes formatting of contents beside
   pure semantic markup. It should be possible to store this markup in the
   backend as well, although it should remain easy to filter out for later
   content cleaning.

4) Extensibility

   Content management and publication also means we must offer an easy way to
   integrate external contents (like images, videos or other external data
   providers). We cannot foresee which applications will evolve here, so the
   markup format should stay extensible with custom tags.

Document component
==================

In the `eZ Components`__ project we develop the `document component`__, which
aims to provide document conversions between all relevant markup formats.
Currently we can convert documents in all directions between RST__,
Docbook__, XHTML 1 and HTML <=4.

Next we will work on integrating the eZ Publish markup formats into the chain
and then integrate `wiki markup languages`__, as well as PDF__ and maybe
other common markup languages like the `Open Document Format`__.

The document component currently uses a subset of Docbook as the internal
conversion format, because an initial evaluation showed that it covers most
semantic markup structures of the formats in use and is easy to process,
since one of its supported syntaxes is XML. Each format added to the document
component is therefore required to convert from and to Docbook. This way we
will be able to convert between all formats using Docbook as an intermediate
step.

The document component will offer a base for the conversions required by some
of the above-mentioned use cases.

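The conversion scheme described in this section, where every format only needs
a converter from and to Docbook, can be sketched as follows. This is a minimal
illustration in Python and not the API of the document component: the tag
mapping, the two toy formats ("xhtml" and "text") and all function names are
assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag mapping for a tiny XHTML subset; not the component's real rules.
XHTML_TO_DOCBOOK = {"body": "section", "h1": "title", "p": "para", "em": "emphasis"}
DOCBOOK_TO_XHTML = {v: k for k, v in XHTML_TO_DOCBOOK.items()}


def _rename(root, mapping):
    """Return a copy of the tree with every tag renamed via mapping."""
    copy = ET.fromstring(ET.tostring(root))
    for node in copy.iter():
        node.tag = mapping.get(node.tag, node.tag)
    return copy


def xhtml_to_docbook(text):
    return _rename(ET.fromstring(text), XHTML_TO_DOCBOOK)


def docbook_to_xhtml(tree):
    return ET.tostring(_rename(tree, DOCBOOK_TO_XHTML), encoding="unicode")


def text_to_docbook(text):
    """Treat blank-line separated blocks of plain text as paragraphs."""
    section = ET.Element("section")
    for block in text.split("\n\n"):
        if block.strip():
            ET.SubElement(section, "para").text = " ".join(block.split())
    return section


def docbook_to_text(tree):
    """Emit only the paragraphs; titles are ignored in this sketch."""
    return "\n\n".join("".join(p.itertext()) for p in tree.iter("para"))


# Each format registers exactly one reader and one writer.
READERS = {"xhtml": xhtml_to_docbook, "text": text_to_docbook}
WRITERS = {"xhtml": docbook_to_xhtml, "text": docbook_to_text}


def convert(source, source_format, target_format):
    """Convert between any two registered formats via the Docbook tree."""
    return WRITERS[target_format](READERS[source_format](source))
```

With this layout, adding a new format means writing one reader and one writer
instead of one converter per existing format. For example,
``convert("First.\n\nSecond.", "text", "xhtml")`` yields
``<body><p>First.</p><p>Second.</p></body>``.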
Format considerations
=====================

Given the use cases above and the already existing conversion tools, the
following markup languages are up for consideration.

RST / Wiki markup
-----------------

These are so-called "lightweight markup formats", which are easily editable by
the user and offer great flexibility, because they are commonly extensible by
custom plugins. They will be available as input and output formats using the
document component, but are not suitable as an internal storage format,
because:

- There are no common tools to parse such languages, so the parser has to be
  implemented in PHP, which is slower than established markup parser
  frameworks like libxml2, available through the XML extensions in PHP.

- RST is not even a context-free language, so common parsing approaches do
  not work here.

- A common base for wiki syntaxes is evolving__ but not really defined yet,
  and a lot of different dialects of the language still exist.

- The general tool support is quite poor for both language flavors: there are
  only two tools which are really able to parse RST (docutils__ and the
  document component), and most wiki markup parsers are dialect specific.

X/HTML 1 / X/HTML 5
-------------------

X/HTML is easy to parse, because it uses XML as its syntax, and it is widely
used in the web environment as a markup format for textual contents. A dialect
similar to XHTML 1.1 is already used in some versions of eZ Publish as a
markup language in the database.

X/HTML semantics
^^^^^^^^^^^^^^^^

X/HTML improves its semantic markup from version to version, and version 5 of
X/HTML introduces several new elements like <video>, <audio> and <section>.

Generally the X/HTML markup is centered on document representation and lacks
markup elements for structures often used in text semantics, like:

- Footnotes

  Footnotes are available in all other markup formats, like in RST__ and
  Docbook__, but cannot really be represented in X/HTML.

- Names, addresses, mail addresses, etc.

  Docbook defines lots of already available markup for elements commonly used
  in various documents, which are only available in X/HTML through external,
  not yet solidified extensions like microformats__.

X/HTML still includes a lot of markup which is used only or partly for
representation. The most common example is tables used to lay out websites.
But also elements like <div> and <span>, or the attributes style="" and
on(load|click|...)="", are used for representational or behavioral purposes
rather than semantic ones. X/HTML is not designed for document centric markup,
but is still designed as a mix of representational and semantic markup
[CIT_IAN_2008]_.

    However, it lacks elements to express the semantics of many of the
    non-document types of content often seen on the Web. For instance, forum
    sites, auction sites, search engines, online shops, and the like, do not
    fit the document metaphor well, and are not covered by XHTML2

    -- Ian Hickson, HTML 5, W3C Working Draft 22 January 2008

X/HTML conversion benefits
^^^^^^^^^^^^^^^^^^^^^^^^^^

One might think that X/HTML offers the benefit of fewer conversions in the
most traditional use case, web content management. Considering the fourth use
case, however, X/HTML would also always have to be processed on input and
output.

The input processing would need to filter representational elements from a
document to sanitize the contents stored in the data backend.

The output processing would need to transform custom extensions, like
<ezp:object node_id="23"/> or <mymodule:gallery/>, into valid X/HTML code, not
to mention the conversions still necessary from X/HTML 5 to X/HTML 1 / HTML 4.

X/HTML editor integration
^^^^^^^^^^^^^^^^^^^^^^^^^

X/HTML integrates perfectly with existing editors, even though they often do
not focus on semantically correct markup but on representation centric WYSIWYG
editing.

The rich text editors will probably be updated to generate X/HTML 5 sooner or
later, which could spare us the work of making the editors create a custom
markup.

Custom formatting
^^^^^^^^^^^^^^^^^

Custom user defined formatting like colors, as mentioned in use case 3, is
offered in X/HTML by default. This may make it hard to filter out later on
because, as mentioned above, semantic and representational markup are mixed by
design in X/HTML. On the other hand, no markup extensions are required.

A filter can still remove all elements and attributes not defined in a
whitelist of valid markup.

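Such a whitelist filter could look roughly like the following Python sketch.
The whitelist contents and the function names are assumptions for
illustration only. The sketch drops the tags of unknown elements but keeps
their text, and it removes every attribute that is not explicitly allowed for
an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist: allowed element names mapped to their allowed attributes.
WHITELIST = {
    "div": set(),
    "p": set(),
    "em": set(),
    "a": {"href"},
}


def _append_text(parent, text):
    """Append text to parent, after its last child if it has one."""
    if not text:
        return
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def _copy_filtered(source, target):
    """Copy the content of source into target, keeping only whitelisted markup."""
    _append_text(target, source.text)
    for child in source:
        if child.tag in WHITELIST:
            allowed = WHITELIST[child.tag]
            kept = {k: v for k, v in child.attrib.items() if k in allowed}
            _copy_filtered(child, ET.SubElement(target, child.tag, kept))
        else:
            # Unknown element: drop the tag but keep its text content.
            _copy_filtered(child, target)
        _append_text(target, child.tail)


def sanitize(markup):
    """Return markup with non-whitelisted elements and attributes removed."""
    root = ET.fromstring(markup)
    if root.tag not in WHITELIST:
        raise ValueError("root element is not whitelisted: " + root.tag)
    kept = {k: v for k, v in root.attrib.items() if k in WHITELIST[root.tag]}
    clean = ET.Element(root.tag, kept)
    _copy_filtered(root, clean)
    return ET.tostring(clean, encoding="unicode")
```

Applied to ``<div style="color:red"><p>Hello <span>big</span> world</p></div>``
this returns ``<div><p>Hello big world</p></div>``: the style attribute and the
<span> tag are removed, while the text is preserved.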
X/HTML 2
--------

X/HTML 2 is also a strong improvement over X/HTML 1, offering section
definitions similar to those in Docbook and X/HTML 5 as well as other small
improvements. It still has many of the same drawbacks as X/HTML 5, as
mentioned in the sections `X/HTML conversion benefits`_, `X/HTML semantics`_
and `X/HTML editor integration`_.

X/HTML 1
--------

Besides the drawbacks mentioned for X/HTML 2 and 5, X/HTML 1 and 1.1 have
additional problems. They lack several of the markup structures introduced in
X/HTML 2 and 5, especially the <section> element, which makes it hard to
decide which block level element belongs to which section, as the following
example shows::

    <h1>Header 1</h1>
    <p>First paragraph...</p>
    <h2>Header 2</h2>
    <p>Second paragraph...</p>
    <p>Third paragraph...</p>

Here it is not decidable whether the third paragraph belongs to the first or
to the second section, introduced by the respective headers. The same is true
for the second paragraph. The resulting documents could look like::

    <section>
        <header>Header 1</header>
        <para>First paragraph...</para>
        <section>
            <header>Header 2</header>
            <para>Second paragraph...</para>
        </section>
        <para>Third paragraph...</para>
    </section>

Or::

    <section>
        <header>Header 1</header>
        <para>First paragraph...</para>
        <section>
            <header>Header 2</header>
            <para>Second paragraph...</para>
            <para>Third paragraph...</para>
        </section>
    </section>

This may be problematic when converting documents edited in the web interface
to output formats which are aware of those structures and style documents
accordingly.

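A converter has to pick one of these interpretations. The following Python
sketch shows one possible heuristic, which assumes that every block belongs to
the most recently opened header and therefore always produces the second
variant. The function name and the <document> wrapper element are
hypothetical, and inline markup inside the blocks is ignored for brevity.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"h([1-6])$")


def nest_sections(xhtml):
    """Wrap flat XHTML blocks into nested <section> elements.

    Assumption: every block belongs to the most recently opened heading.
    """
    root = ET.Element("document")
    # Stack of (heading level, element); level 0 is the document root.
    stack = [(0, root)]
    for block in ET.fromstring(xhtml):
        match = HEADING.match(block.tag)
        if match:
            level = int(match.group(1))
            # Close all sections at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section")
            ET.SubElement(section, "header").text = block.text
            stack.append((level, section))
        else:
            # Only the plain text of the block is kept in this sketch.
            para = ET.SubElement(stack[-1][1], "para")
            para.text = block.text
    return root
```

The heuristic can never produce the first variant, because flat X/HTML 1
markup carries no information about where a subsection ends. This is exactly
the information that is lost without a <section> element.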
Docbook
-------

Docbook is one of the most complete XML based markup languages with purely
semantic markup.

Docbook semantics
^^^^^^^^^^^^^^^^^

Docbook is by far the most complete and established markup language,
comparable with LaTeX, but XML based. The only problems experienced so far
when converting other markup languages to Docbook are documented in the
`documentation of the document component`__. None of the described problems
is really relevant from a semantic point of view; they are only small
possible conversion losses.

Docbook editor integration
^^^^^^^^^^^^^^^^^^^^^^^^^^

The rich text editor in use would be required to create non-X/HTML elements
to offer the user a WYSIWYG experience with a Docbook markup format. The
elements created by the editor can be styled as usual using CSS, as
`documented here`__.

Another possibility would be to keep the editor creating X/HTML and to convert
it to Docbook before storing the document in the database, as already
supported by the document component. This would, of course, reduce the set of
markup language features which can be used.

Custom formatting
^^^^^^^^^^^^^^^^^

Since Docbook is also XML, custom formatting and modules can be integrated
into the XML source using different XML namespaces, and be converted on output
to X/HTML including the required representational markup.

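How a namespaced custom element might be expanded on output is sketched below
in Python. The namespace URI, the handler registry and the generated link are
assumptions made for this example; only the <ezp:object node_id="23"/>
element is taken from the discussion above. The sketch covers just the
expansion of custom elements, not the conversion of the surrounding Docbook
markup to X/HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for custom elements; the source does not define one.
EZP_NS = "http://example.org/ns/ezp"


def render_object(element):
    """Render a hypothetical <ezp:object> as an XHTML link."""
    link = ET.Element("a", {"href": "/node/" + element.get("node_id")})
    link.text = "Object " + element.get("node_id")
    return link


# Maps qualified element names to handlers producing XHTML elements.
HANDLERS = {"{%s}object" % EZP_NS: render_object}


def expand_custom_elements(parent):
    """Replace custom-namespace children in place, recursing into the rest."""
    for index, child in enumerate(list(parent)):
        handler = HANDLERS.get(child.tag)
        if handler is None:
            expand_custom_elements(child)
            continue
        replacement = handler(child)
        replacement.tail = child.tail  # keep the text following the element
        parent[index] = replacement
```

Because the custom elements live in their own namespace, they can be found by
their qualified name without any risk of clashing with Docbook elements, and a
new module only needs to register one more handler.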
Conclusion
==========

All formats require conversions during input and output of contents, because
of the above-mentioned use cases. Even though there is progress in X/HTML 2
and 5, the markup offered by those languages is not nearly as complete as the
Docbook markup and still includes purely representational markup, which would
require us to define a subset of X/HTML which is valid to store. Also, the
X/HTML standards in the versions 2 and 5 have not settled down yet and may be
subject to future modifications.

All formats offer enough capabilities to extend them with custom markup
directives.

The XML based formats should offer faster processing than the text based
formats, especially because of the integration of libxml2 with PHP 5.

Because of the above considerations Docbook seems to be the best choice for
the internal markup format in eZ Publish.

.. [#] Rich text editors on the web commonly means editors like TinyMCE__ or
   FCKEditor__, which offer WYSIWYG capabilities in web browsers.

.. [CIT_IAN_2008] `"HTML 5, 1.1.2. Relationship to XHTML2"`__. World Wide Web
   Consortium. Retrieved on 2008-07-19. “… XHTML2… defines a new HTML
   vocabulary with better features for hyperlinks, multimedia content,
   annotating document edits, rich metadata, declarative interactive forms,
   and describing the semantics of human literary works such as poems and
   scientific papers… However, it lacks elements to express the semantics of
   many of the non-document types of content often seen on the Web. For
   instance, forum sites, auction sites, search engines, online shops, and the
   like, do not fit the document metaphor well, and are not covered by XHTML2…
   This specification aims to extend HTML so that it is also suitable in these
   contexts…”

__ http://ezcomponents.org/
__ http://ezcomponents.org/docs/tutorials/Document
__ http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html
__ http://docbook.org/tdg/en/html/docbook.html
__ http://www.wikicreole.org/wiki/Engines
__ http://en.wikipedia.org/wiki/Portable_Document_Format
__ http://de.wikipedia.org/wiki/OpenDocument
__ http://www.wikicreole.org/wiki/Engines
__ http://docutils.sourceforge.net/
__ http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#footnotes
__ http://docbook.org/tdg/en/html/footnote.html
__ http://en.wikipedia.org/wiki/Microformat
__ http://ezcomponents.org/docs/api/trunk/Document_conversion.html
__ http://kore-nordmann.de/blog/the_long_way_to_semantic_web.html#id6
__ http://tinymce.moxiecode.com/
__ http://www.fckeditor.net/
__ http://www.w3.org/TR/2008/WD-html5-20080122/#relationship0