113 lines
4.4 KiB
Plaintext
113 lines
4.4 KiB
Plaintext
eZ publish Enterprise Component: Document, Requirements
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
:Author: Kirill Subbotin
|
|
:Revision: $Revision: $
|
|
:Date: $Date: $
|
|
|
|
Introduction
|
|
============
|
|
|
|
Description
|
|
-----------
|
|
The purpose of the Document component is to provide an easy and universal way to
|
|
deal with structured text documents of any format.
|
|
|
|
Current implementation
|
|
----------------------
|
|
In eZ publish 3 there is a datatype 'ezxmltext' that stores structured text in
|
|
it's own internal XML format, but this format has some disadvantages. The main
|
|
weak point is that it's scheme is very different from XHTML, which makes
|
|
it hard to present it in WYSIWYG editors and has a negative impact on the
|
|
performance. The main things are:
|
|
- All content is wrapped in paragraphs, even "block" elements like tables,
|
|
which is not allowed in XHTML.
|
|
- Using sections instead of plain headers like in XHTML 1.*.
|
|
- Using "line" tags with line's content instead of single "br".
|
|
- "Custom" element's placement in the schema depends on it's attribute, which is
|
|
not compatible with XML schema definitions.
|
|
|
|
This datatype can take input in some formats including 'simplified XML' and basic
|
|
HTML (for OE) and produces output in HTML and PDF using eZ publish's templates.
|
|
|
|
Also there is an ODF extension for eZ publish that can convert simple documents
|
|
in Oasis Open Document format to the internal format of 'ezxmltext' datatype and
|
|
back. It uses the custom code plus OpenOffice.org in daemon mode for conversion.
|
|
|
|
|
|
Requirements
|
|
============
|
|
The main task of the component is to take an input in one of the supported
|
|
formats and convert it to another. There should be always a way to convert
|
|
one format to another, it could be a direct conversion, or chain conversion,
|
|
that involves one or more intermediate formats.
|
|
|
|
There should be one special format that is recommended to use as intermediate
|
|
format in complex conversions, unless there are reasons to peek another. This
|
|
format is called "internal format".
|
|
|
|
Additionally the next features should be implemented:
|
|
* Validating the input document according to the given schema
|
|
* Auto-fix (tidy) incorrect input.
|
|
* Generate/edit a document in internal format using PHP API.
|
|
|
|
Main formats the component deal with include:
|
|
* Simple text markup formats (ReST, BBcode, WikiText, 'simplified XML')
|
|
* XML-based formats (XHTML, DocBook, eZ publish 4 XML text)
|
|
* Complex formats that include styles and additional info (doc, odt, pdf)
|
|
|
|
The purpose of the component is to present document's structured content only,
|
|
so all styling information that comes from the input will be ignored. But if
|
|
some semantics is coded with styles in the input document, it should be
|
|
possible to convert it properly.
|
|
|
|
|
|
Design goals
|
|
============
|
|
The component should be able to:
|
|
|
|
* Take an input and present it as a DOM tree internally.
|
|
* Validate it by the schema of it's format and report about errors or tidy
|
|
document automatically.
|
|
* Transform to another formats using one of the available methods.
|
|
* Provide an interface to create/edit a document in the internal format.
|
|
|
|
Internal format should have a schema that is rich enough to present any document.
|
|
There should always be a away to convert from any of the supported formats to
|
|
the internal format. This rule guarantees that there is always a way between
|
|
any formats.
|
|
|
|
XML transformations can be done with XSL templates. Thus XSL extension for
|
|
PHP 5 is required for this component.
|
|
|
|
Design document should contain a list of formats we are willing to support in
|
|
the first release and the ways they can be processed.
|
|
|
|
|
|
Formats
|
|
=======
|
|
|
|
The good candidate for the inner document format is DocBook (http://docbook.org/)
|
|
This is open format that has a lot of features while not being too complex.
|
|
Also there are a lot of already existing tools that support this format, so
|
|
we can use some of them if it is allowed by the license.
|
|
|
|
In the first release we will most likely support a subset of this format, so it
|
|
probably has a sense to use an already existing subset called Simplified DocBook
|
|
(http://docbook.org/schemas/simplified).
|
|
|
|
Other document formats that will defenitely supported in some way:
|
|
|
|
- ReST
|
|
- Wikitext
|
|
- HTML
|
|
- XHTML
|
|
- OpenDocument
|
|
- PDF
|
|
- eZ publish 4 XML text
|
|
- eZ publish 4 simplified XML text
|
|
|
|
The attached diagram (document-formats.svg) shows which formats will be
|
|
supported in the first release of the component and possible directions of
|
|
transforming from one format to anthoer.
|
|
|