Wikitext interchange through XHTML dialect declarations
Wiki content interchange poses a challenge because of both variant wikitext syntax and the diversity of wiki markup constructs. Limiting the set of wiki markup constructs to a standard set is neither possible nor desirable. Instead, a WIF (Wiki Interchange Format) must accommodate wiki innovation through vendor-specific dialects that extend a common base. XHTML has a natural role as the common base, and the microformats initiative demonstrates the viability of representing dialects by overloading base elements. By serializing wiki document instances in XHTML and declaring dialect extensions in RDF, wiki adopters can enable general-purpose conversion tools and integrate across system boundaries.
The wiki and wikitext explosions
Invented by Ward Cunningham in 1995 to allow a community of developers to maintain a public software patterns repository, wikis have branched out to almost any community or team endeavor to collect and maintain content. Development teams use wikis for design artifacts. Publishers use wikis for contributions from external authors. Corporations use wikis for internal collaborations controlled by project-specific access lists. With rapid adoption has come a profusion of Open Source and proprietary implementers. The Wiki Matrix website lists 124 different wikis as of this writing. Wiki uptake has also resulted in vendor-specific as well as general conferences, including WikiSym (the International Symposium on Wikis and Open Collaboration, which is now in its fifth year). As a final indicator, according to the Alexa web analytics website, Wikipedia is among the top 10 websites in the world by traffic, visited by 12% of all internet users on average. In short, wikis hold an ever-increasing share of actively maintained content.
Each vendor has its own vision of the wiki way, occupying niches ranging from personal document management to content management, social collaboration, and application infrastructure. Inevitably, these solutions differ, both in the markup constructs supported for wiki documents (such as italic phrases within documents) and in the wikitext syntax for expressing those markup constructs (such as an underscore immediately before the first word and after the last word in an italic phrase).
As a result, a document edited in one wiki can be opened in a different wiki only after conversion from one wikitext markup to another. Conversion between any two wikitexts is a new implementation effort in each direction. Conversions can share document parsing and document building logic only through abstraction or serendipitous similarity. Mathematically, the number of such directed transforms grows as a summation: each new wikitext requires a separate conversion to and from every pre-existing wikitext, so n wikitexts require n(n - 1) transforms in all.
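To make the growth concrete, the following minimal Python sketch counts the directed converters needed when every wikitext converts directly to every other. The function names are illustrative. Applying the count to the 124 wikis listed by Wiki Matrix assumes, as a simplification, that each listed wiki has its own wikitext.

```python
def directed_converters(wikitext_count: int) -> int:
    """Number of one-way converters needed for full pairwise interchange."""
    # Each wikitext needs a converter to every other wikitext.
    return wikitext_count * (wikitext_count - 1)


def converters_added_by_newcomer(existing_count: int) -> int:
    """Converters a new wikitext adds: one to and one from each existing wikitext."""
    return 2 * existing_count


if __name__ == "__main__":
    for n in (2, 5, 124):
        print(n, directed_converters(n))
```

Under that assumption, full pairwise interchange across the listed wikis would require more than fifteen thousand separate converters.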
Aggravating the frustration about such conversion challenges, the distinctions between many wikitexts can appear unnecessary or arbitrary to an outsider. The wikitexts (see the Syntax examples section) of the five most popular wikis according to Wiki Matrix at this writing express the basic constructs of structured text with different syntax. For example, Confluence (an enterprise wiki) indicates an italic phrase with leading and trailing underscores (as in _an italic phrase_) while MediaWiki (the Open Source wiki engine behind Wikipedia) uses two apostrophes (as in ''an italic phrase''). The rationale for such differences is small consolation to anyone attempting to exchange content between wikis.
Standard wikitext and Wiki Creole
From 2003, the Meatball wiki (a discussion wiki for wiki practitioners) hosted an exploratory discussion for a standard wikitext (see MBSTAN). Discussion enumerated some common wiki constructs and variant syntax, and some participants floated a proposal for a Working Group of the Internet Engineering Task Force (IETF), but the discussion never reached agreement. Observers made instructive arguments against a standard wikitext:
- Wikis with an existing wikitext have no motivation to migrate their users.
- In some cases, wikitext differences reflect legitimate differences in goals for the wiki.
- Wikis already have a common standard in HTML.
- GUI interfaces are reducing the importance of wikitext.
During the 2006 WikiSym, the Wiki Markup Standard workshop (with the authoritative participation of Ward Cunningham) redirected the wikitext standard effort to a Wiki Creole (see CREOLE). Similar to the role of a pidgin in allowing speakers of different native languages to interact, a Wiki Creole provides a wikitext for the common wiki constructs so that wiki vendors can offer users an alternative wikitext that is the same across wiki engines. With leadership by researchers from Heilbronn University, the workgroup followed an open process, conducted a careful inventory of existing wikitexts, conformed to thoughtful guidelines for decisions, and produced a stable recommendation. By such criteria, the Wiki Creole effort was a success.
Adoption of Wiki Creole has been slow. (Of the five popular wikis identified previously, only two provide support.) More importantly, Wiki Creole is not intended to solve the wikitext explosion problem. Wiki Creole only provides a standard for the wiki constructs in the intersection of existing wikitexts. The areas of differentiation for wiki vendors — where the wikitext reflects the vendor’s innovation — are by definition out of scope for Wiki Creole.
For example, to serve the structural needs of Wikipedia articles, much of the MediaWiki innovation is in the area of linking, inclusion, categorization, and templates. The Semantic MediaWiki innovations focus on formalizing categories and properties with vocabulary names. By contrast, many of the distinguishing features of Confluence support styling, layout, interactivity, and integration with information sources.
In short, while the divergence of wikitext syntax poses a problem, the diversity of wiki constructs is, if anything, a greater problem. Mitigating the explosion of wiki constructs is a challenge of vocabulary relations. That is, any solution for interchange across MediaWiki and Confluence must be able to identify and manage the common and distinct constructs in those wiki vocabularies. Before considering the requirements for such an interchange solution, however, it’s important to recognize how traditional XML markup differs from wiki markup.
Differences between traditional XML and wiki perspectives on markup
In traditional XML markup, the document type defines a complete vocabulary for a document from the top down. For instance, a DocBook document typically has a root element of book. The book element defines possible contained elements, such as preface, chapter, and appendix. In turn, preface defines its valid contents, and so on. Both general-purpose elements for structured text (such as bullet lists) and special-purpose elements with precise semantics and structure (such as a function synopsis) have equal visibility in the document markup.
The wiki approach simplifies document markup so any reader can act as an immediate, infrequent writer. While a mechanism for editing a document in a web browser is critical to the approach, the markup simplicity is equally important. Writers must be able to correlate changes in the source markup with changes in the rendered output. In particular, writers must be shielded from positional obligations (the sequences and nested groups made possible by markup grammars).
A discussion page disqualifying HTML as a format for authoring wikis (see NOHTML) argues for wiki principles of markup simplicity, which can be summarized as follows:
- Minimal distraction due to markup obligations while writing.
- Minimal distraction due to formatting considerations while writing.
- Minimal distraction due to obtrusive markup while reading source.
- Ease of learning.
The fundamental constructs of structured text (the common markup identified by Wiki Creole such as bullet list, numbered list, and table at the block level and bold, italic, and link at the phrase level) resemble the conventions a writer might use for layout and emphasis in a text-only document.
The structured text acts as a kind of substrate for more visible, labeled markup (typically introduced through embedded HTML tags or wiki-specific macros) that specifies the innovative and differentiating features of the wiki. For instance, MediaWiki annotates preformatted blocks with the HTML opening <pre> and closing </pre> tags. Similarly, Confluence annotates preformatted blocks with initial and terminating {noformat} tags. The labeled markup stands out from the substrate of structured text. In short, such salient markup is the exception, not the rule.
Salient markup breaks down broadly into two categories:
- Special semantics for structured text. Examples include preformatted blocks, caution, tip, or warning note blocks, quoted blocks or phrases, copyright or trademark phrases, and values organized as tables or trees.
- Parameters to transclusion, interaction, or media object generation. Examples include retrieved calendars and tables of contents, Atom / RSS feeds, and recent changes.
As a markup-rich wiki, Confluence provides abundant examples from both categories. Confluence has approximately 50 constructs for semantic structured text (40 using macro tags) and another 50 constructs for parameterized instructions. Between supported HTML tags, supplemental XML tags, and macros, MediaWiki has about 30 constructs for semantic structured text and around 10 constructs for parameterized instructions. Such considerations are not limited to document-oriented wikis. For instance, spreadsheet wikis (as in SocialCalc), map wikis (as in WikiMapia), and bug reporting wikis (as in Trac) have human readable values (that is, semantic structured text) as their primary content. Finally, extensibility is particularly crucial for wikis that allow adopters to add new salient markup through templates or pluggable macros.
As part of the wiki commitment to simplicity, the content of wiki markup exhibits none of the complex patterns supported by XML grammars. In addition, some wikis currently have no formal grammar. For instance, the MediaWiki parser currently uses a series of regular expressions rather than a grammar. Thus, instead of nested groups of choices and sequences, wiki markup typically contains properties (with zero to one occurrence) and textual or mixed content (with zero to unbounded occurrence).
To summarize, wiki interchange must handle three distinct types of content: the structured text substrate, semantic text, and parameterized instructions.
Wiki Interchange Format
The wiki community has considered a WIF (Wiki Interchange Format) as an alternative to wikitext standardization (see WIF and MBWIF). In this approach, each wiki vendor is responsible for export and import conversion between its wikitext and WIF. In effect, the WIF approach distributes the conversion effort with each vendor working in parallel on the one wikitext syntax the vendor already understands.
To account for overlapping wiki constructs, conversion requires more than a single WIF vocabulary and a process with export by one vendor and import by another vendor. Instead, WIF must be extensible by each vendor to support its distinguishing wiki constructs. The extended WIF can be considered a dialect of the base WIF and an alternative serialization of the wiki document. Thus, the vendor must be responsible for a WIF dialect and export and import operations on the WIF dialect.
Conversion between two wikitexts breaks down into the following steps:
- Export from the source wikitext to the source WIF dialect.
- Conversion from the source WIF dialect to the target WIF dialect.
- Import from the target WIF dialect to the target wikitext.
This approach solves the wikitext syntax explosion by using a common format for both input and output of the conversion step. The approach also solves the wiki construct explosion by managing new constructs as extensions on the base vocabulary.
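The three steps can be pictured as a composition of functions. The following Python sketch is a toy under stated assumptions: it handles only the italic construct, using the Confluence and MediaWiki syntax cited earlier, it assumes the base XHTML element em for italics, and all function names are hypothetical. A real exporter would need a full parser rather than regular expressions.

```python
import re


def export_confluence(wikitext: str) -> str:
    """Step 1: export Confluence-style italics (_phrase_) to base XHTML."""
    return re.sub(r"_([^_\n]+)_", r"<em>\1</em>", wikitext)


def convert_dialect(xhtml: str) -> str:
    """Step 2: convert between dialects.

    Italics belong to the base vocabulary (assumed here to be em),
    so this toy conversion has nothing to change.
    """
    return xhtml


def import_mediawiki(xhtml: str) -> str:
    """Step 3: import base XHTML italics as MediaWiki-style ''phrase''."""
    return re.sub(r"<em>(.*?)</em>", r"''\1''", xhtml)


def convert(wikitext: str) -> str:
    """Run export, dialect conversion, and import in sequence."""
    return import_mediawiki(convert_dialect(export_confluence(wikitext)))


if __name__ == "__main__":
    print(convert("This is _an italic phrase_ in a sentence."))
```

Only the first and last functions know anything about a particular wikitext syntax. The middle step works purely on the common format.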
To support this strategy, WIF must have the following characteristics:
- Processability. Export from a wikitext to WIF and import from WIF to a wikitext must be efficient and direct.
- Interpretability. Developers who write imports and exports must be able to understand WIF without extraordinary effort. Otherwise, conversions will be error prone. Also, to handle unexpected exigencies, manual editing of WIF instances must be possible.
- Completeness. WIF fails in its primary goal unless content can be exported from a wikitext and imported to the same wikitext without loss of semantics or structure in the roundtrip. In other words, the WIF must be a complete representation of any source wikitext. (Complete lexical fidelity to the source, however, need not be a requirement for an interchange format.)
- Out-of-the-box support for structured text. Representing the constructs of the structured text substrate of wiki markup must not require any additions to WIF.
- Extensibility for semantic text and parameterized instructions. To be complete, WIF must be able to accommodate new wiki markup constructs that innovate and differentiate. Instances of the WIF must be able to represent such extensions. The complete set of WIF extensions for any wikitext must be declarable (instead of having to guess extensions from instances). Such declarations provide an inventory of challenges for conversion from one wikitext to another.
- Distributed implementation. Each wiki vendor must be able to maintain its declaration of WIF extensions without coordination with other wiki vendors. Otherwise, coordination will hinder innovation, which will prevent adoption of WIF.
- Mapping of extensions between WIF dialects. Where an extension in a source dialect has an equivalent in the target dialect, the equivalence should be declarable. For instance, if source and target wikitext both have a caution note construct, a mapping can equate the two constructs. Maximizing flexibility, the mapping could be maintained by the vendor of either the source or target wikitexts or, indeed, by a third party. Generic processing against the WIF declarations can implement the conversion from the source WIF dialect to the target WIF dialect. Simplified content models of wikitexts enable automation of conversions based on declarations.
- Reversion of semantic text extensions to structured text. Because a semantic text extension such as caution note markup is a special kind of block, a wikitext that does not have a caution note construct must be free to treat the caution note as a block. Even if semantics are lost on conversion to the target wikitext, the retention of structure makes the content readable and thus manually editable in the target wiki.
- Optionality of parameterized instruction extensions. Because the values for a parameterized instruction such as the range for a calendar are meaningless when the instruction is not supported, the parameterized instruction must be ignorable on conversion to a target wikitext that lacks the markup construct. The rest of the wiki document remains editable in the target wiki, and the presence of the instruction in the WIF allows the target wiki vendor to add support for the instruction at a later time without a new export from the source wiki.
While a base representation of extensions that have no meaning in the base document type might seem strange, the alternative (defining a special representation) incurs the effort of reading a special representation just to ignore the parts that are special. To put it another way, the base format must make it easy to skip over unknown constructs while reading recognized constructs.
By definition, conversion between two wikitexts will be lossy if the source wikitext supports constructs with no equivalent in the target wikitext. Because the wiki constructs are declared, however, problematic constructs can be identified. The source wiki can even provide users who have interchange requirements with a mode that issues edit-time warnings for problematic constructs.
XHTML as WIF
Candidates proposed for WIF have included the following:
- An abstract model similar to SAX events or a DOM structure (see Wiki Model). While an abstract model has value, a serialization that can be stored and transported seems necessary.
- A direct serialization of the parsed AST (Abstract Syntax Tree) for each wikitext. Because an open-ended AST does not manage the base or extension structure or semantics and is difficult to understand, such a serialization would not by itself be sufficient for a WIF.
- A new XML vocabulary. A vocabulary for managing the base and extension structure and semantics of wikitext would have to be created.
- A new wikitext format. The syntax and vocabulary for managing the base and extension structure and semantics of wikitext would have to be created.
In the discussions cited above, some participants have suggested XHTML as the natural candidate for WIF. XHTML provides a vocabulary with obvious sufficiency for the structured text substrate of wikitexts. Indeed, HTML may well have provided the model for the structured text constructs of some wikitexts. As an XML vocabulary, XHTML gets the full benefit of the XML ecosystem including parsers, transform processors, and community expertise. For wikis such as MediaWiki that allow embedding of HTML tags, mapping from the wikitext to XHTML is direct for a subset of the constructs. Other wikis still have a close relationship between their wiki constructs and HTML simply because HTML is the primary output for wikis. In particular, the strong association between wiki constructs and HTML elements makes it easier for infrequent writers to correlate changes to the source markup with effects in the document presentation.
For interchange, the existing output HTML is insufficient because of styling. Instead, the wikitext must be mapped to semantic XHTML similar to the POSH (Plain Old Semantic HTML) approach advocated by the microformats community (see POSH). Still, serializing wiki documents in an XHTML WIF has the potential to share some of the code for rendering the wikitext.
In pursuit of this approach, the wiki community has produced two definitions of restricted versions of XHTML for wikitext interchange — the InterWikiMarkupLanguage: A Common Interchange Syntax for Wiki (see IWML) and Structured Text Interchange Format (see STIF). Restricting XHTML for WIF, however, is misguided. The purpose of WIF must be descriptive rather than prescriptive. For instance, most wikitexts cannot construct a table within a table, but if one could, WIF should represent such structures faithfully so the potential interchange problem is visible. In short, the full range of the XHTML document type should be available in WIF.
WIF dialects, microformats, and restrictive substitution
XHTML alone is not sufficient, however, for wikitext extensions. WIF requires extensibility for the semantic text and parameterized instruction constructs that characterize the WIF dialect for the wiki.
HTML modularization (see HTMLMOD) provides a well-established strategy for adding new elements to HTML. For each new set of extensions, a new HTML profile integrates HTML modules with the new elements for the extensions. While the HTML profile approach provides sharing for the common constructs of structured text, documents for different profiles have incompatible XML schemas. For interchange, each new HTML profile would require a custom transform to every other HTML profile. As observed previously, conversions should not have to make a special effort to read constructs only to ignore them.
The microformats initiative offers an alternative strategy with proven practicability for extending XHTML with new vocabularies. In the microformats approach, existing elements are overloaded with new semantics. In particular, the refactoring of the Atom syndication format as the hAtom microformat demonstrates that this approach can scale up from simple to complex vocabularies.
By definition, the content of a microformat can only be the same as or more restrictive than its overloaded HTML element. DITA specialization (see DITASPEC) uses the same fundamental principle for derivation of a wide variety of vocabularies (though from base XML vocabularies different from HTML). This derivation strategy essentially combines the XML Schema mechanisms of restriction and substitution. For clarity, this paper refers to this approach as restrictive substitution. Restrictive substitution has particular benefit for interchange because dialects all share the same base vocabulary. By contrast with HTML profiles, restrictive substitution permits serialization of extensions in the base document type without loss of semantics.
While restrictive substitution of XHTML provides the best candidate for supporting WIF dialects, the process for defining microformats requires central discussion and agreement. Thus, the traditional microformats process conflicts with the distributed requirement for WIF dialects. For example, the microformats process has a direct impact on naming. Because vocabularies result from central agreement, microformats can assign simple names without risk of collision. For distributed definition of WIF dialects, simple names pose a significant risk for collision. Thus, namespaces — the standard mechanism for preventing naming collisions — become essential for WIF dialects. More generally, microformats requires hard-coding an awareness of the extension instead of providing mechanisms for discovery and automated processing.
The challenge of independent derivation and alignment of vocabularies for automated processing resides squarely in the problem space for RDF (see RDF) and its upper layers, SKOS and OWL (see SKOS AND OWL). While RDF lacks facilities for declaration of complex content models — instead having simple properties with cardinality but not sequence — this limited capability has a good fit with the simple content models of wikitexts (due to their rejection of the complex positional constraints typical of grammars). In addition, identifying extensions in RDF has the potential for valuable synergies with Semantic Wiki initiatives. In short, an RDF vocabulary could meet the requirement for declaring the constructs of a WIF dialect including their base HTML terms, restricted content, and mapping to constructs in other WIF dialects.
The WIF serializations of individual wiki documents would have the following characteristics:
- Validity as an XHTML document.
- Namespace declarations for XHTML and the WIF dialect.
- Identification of the RDF declaration of the WIF dialect with the XHTML profile attribute.
- Identification of the WIF extension on overloaded XHTML elements with a modified QName identifier in the class attribute. Such identifiers can conform to the basic principles of CURIE identifiers (see CURIE) while remaining valid tokens in the class attribute by separating the prefix with two hyphens (and prohibiting two hyphens in a prefix) instead of a colon.
- Representation of parameter values that aren’t part of the content — particularly for parameterized instruction constructs — with a special processing instruction inside the overloaded element. Processing instructions can appear freely within any element including empty elements. Thus, a dialect has the freedom to accommodate any parameters without constraint by the XHTML base.
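As an illustration of the identifier convention in the class attribute, the following Python sketch splits a token such as conf--code into its prefix and local name and expands it against a namespace binding. The namespace URIs and function names are placeholders invented for the example. A real processor would read the bindings from the namespace declarations in the document rather than from a hard-coded table.

```python
from typing import List, Optional, Tuple

# Hypothetical prefix-to-namespace bindings; the URIs are placeholders.
NAMESPACES = {
    "conf": "http://example.org/dialect/confluence#",
    "doku": "http://example.org/dialect/dokuwiki#",
}


def split_dialect_token(token: str) -> Optional[Tuple[str, str]]:
    """Split 'prefix--name' into (prefix, name); return None for ordinary class tokens."""
    # A prefix may not contain two hyphens, so the first '--' is the separator.
    prefix, separator, name = token.partition("--")
    if not separator or not prefix or not name:
        return None
    return prefix, name


def expand_class_attribute(class_value: str) -> List[str]:
    """Return the full URIs of the dialect constructs named in a class attribute."""
    uris = []
    for token in class_value.split():
        parts = split_dialect_token(token)
        if parts and parts[0] in NAMESPACES:
            uris.append(NAMESPACES[parts[0]] + parts[1])
    return uris


if __name__ == "__main__":
    print(expand_class_attribute("highlight conf--code"))
```

Tokens without the double-hyphen separator pass through untouched, so ordinary styling classes can coexist with dialect identifiers.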
To validate converted documents before import and to test exports during development, automated tooling could check the consistency of WIF dialect instances against the WIF dialect declaration.
WIF examples
Here is an example of salient markup for semantic text in the Confluence wikitext:
{code:title=Bar.java}
public String getFoo()
{
return foo;
}
{code}
The HTML vocabulary offers the following obvious base terms for the Confluence markup constructs:
| Confluence | HTML | Rationale |
| code | pre | Line endings and spaces should be preserved in code. |
| title | strong | Titles should receive strong emphasis. |
The XHTML serialization of the WIF dialect might resemble the following:
<pre class="conf--code"><strong class="conf--title">Bar.java</strong>
public String getFoo()
{
return foo;
}
</pre>
Because MediaWiki has no equivalent to the semantic code block, conversion to the MediaWiki dialect would consider only the base XHTML element, yielding the following instance in the MediaWiki dialect:
<pre><strong>Bar.java</strong>
public String getFoo()
{
return foo;
}
</pre>
If MediaWiki later added an equivalent to the semantic code block, anyone could declare a mapping between the Confluence and MediaWiki constructs. An automated process could then change the identifiers in the class attributes to preserve semantics during conversion.
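The reversion of semantic text to its base element can be sketched with the Python standard library. In this minimal sketch, the function name and the idea of passing the target dialect's recognized tokens as a set are assumptions made for illustration. Dialect tokens that the target does not recognize are dropped, and the base XHTML structure and content are left alone.

```python
import xml.etree.ElementTree as ET


def revert_unsupported(element: ET.Element, supported_tokens: set) -> None:
    """Drop dialect class tokens that the target dialect does not recognize."""
    for node in element.iter():
        tokens = node.get("class", "").split()
        # Keep ordinary tokens and dialect tokens known to the target.
        kept = [t for t in tokens if "--" not in t or t in supported_tokens]
        if kept:
            node.set("class", " ".join(kept))
        elif "class" in node.attrib:
            del node.attrib["class"]


source = (
    '<pre class="conf--code"><strong class="conf--title">Bar.java</strong>\n'
    "public String getFoo()\n{\nreturn foo;\n}\n</pre>"
)
root = ET.fromstring(source)
# A target dialect with no equivalent constructs recognizes no dialect tokens.
revert_unsupported(root, supported_tokens=set())
reverted = ET.tostring(root, encoding="unicode")
print(reverted)
```

If a mapping were later declared, the same traversal could rewrite the tokens to the target dialect's identifiers instead of dropping them.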
For an example of a parameterized instruction, consider the following wikitext in Confluence:
{rss:url=http://slashdot.org/index.rss|max=5}
An HTML div or span element can represent the parameterized instruction because an empty div or span has no meaning and can be ignored. The element contains special processing instructions for the parameter values:
<div class="conf--rss">
<?dialect-property url=http://slashdot.org/index.rss?>
<?dialect-property max=5?>
</div>
Because MediaWiki has no equivalent to the rss instruction, conversion to the MediaWiki dialect would ignore the parameter values and revert to the base element, yielding a no-op instance in the MediaWiki dialect:
<div/>
Because DokuWiki does have the feed instruction, conversion to the DokuWiki dialect would map the element as well as the parameter values:
<div class="doku--rss">
<?dialect-property feed=http://slashdot.org/index.rss?>
<?dialect-property number=5?>
</div>
The DokuWiki dialect would import to the following DokuWiki wikitext instance:
{{rss>http://slashdot.org/index.rss 5 }}
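Both outcomes for the parameterized instruction (reversion to an empty base element and mapping to an equivalent construct) can be sketched with xml.dom.minidom, which preserves processing instructions. The mapping tables below are hypothetical in-memory stand-ins for declared equivalences, with names copied from the Confluence and DokuWiki instances above. The function names are also invented for the example.

```python
from xml.dom import minidom

# Hypothetical stand-ins for declared equivalences (Confluence to DokuWiki).
ELEMENT_MAP = {"conf--rss": "doku--rss"}
PROPERTY_MAP = {"url": "feed", "max": "number"}

SOURCE = (
    '<div class="conf--rss"> '
    "<?dialect-property url=http://slashdot.org/index.rss?> "
    "<?dialect-property max=5?> </div>"
)


def read_properties(element):
    """Collect name=value pairs from dialect-property processing instructions."""
    properties = {}
    for node in element.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "dialect-property"):
            name, _, value = node.data.strip().partition("=")
            properties[name] = value
    return properties


def convert_instruction(xml_text, element_map, property_map):
    """Map an instruction to the target dialect, or revert to an empty base element."""
    source = minidom.parseString(xml_text).documentElement
    target_doc = minidom.Document()
    target = target_doc.createElement(source.tagName)
    target_class = element_map.get(source.getAttribute("class"))
    if target_class is not None:
        target.setAttribute("class", target_class)
        for name, value in read_properties(source).items():
            if name in property_map:
                target.appendChild(target_doc.createProcessingInstruction(
                    "dialect-property", f"{property_map[name]}={value}"))
    return target.toxml()


if __name__ == "__main__":
    print(convert_instruction(SOURCE, {}, {}))                     # no equivalent declared
    print(convert_instruction(SOURCE, ELEMENT_MAP, PROPERTY_MAP))  # equivalent declared
```

With empty mapping tables, the sketch discards the parameters and emits only the base element, which corresponds to the MediaWiki case.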
This serialization meets the objective for easy interpretation. The reversion of semantic text to structured text is visible and enabled by default (as with the code block above). Similarly, by default, the skipped values of a parameterized instruction are visible in the source dialect (though not in the rendered flow). A developer can see exactly what is lost on conversion. A vendor can improve compatibility at any time by enhancing the target dialect. Even in the worst case, maximizing the conservation of base structured text has value for many scenarios.
The RDF declaration for the Confluence constructs might resemble the following sketch (where the “dx” namespace prefix qualifies the dialect declaration vocabulary and the “conf” namespace prefix qualifies the constructs of the Confluence dialect):
conf:code a dx:Type ;
    dx:hasBase html:pre ;
    dx:containerFor [ a dx:Position ;
        dx:hasContained conf:title ;
        dx:hasOccurrence dx:ZeroToOneRange
    ] ;
    dx:containerFor [ a dx:Position ;
        dx:hasContained xs:string ;
        dx:hasOccurrence dx:ZeroToOneRange
    ] .
conf:title a dx:Type ;
    dx:hasBase html:strong ;
    dx:containerFor [ a dx:Position ;
        dx:hasContained xs:string ;
        dx:hasOccurrence dx:ZeroToOneRange
    ] .
conf:rss a dx:Type ;
    dx:hasBase html:div ;
    dx:containerFor [ a dx:Position ;
        dx:hasContained conf:url ;
        dx:hasOccurrence dx:OneRange
    ] ;
    dx:containerFor [ a dx:Position ;
        dx:hasContained conf:max ;
        dx:hasOccurrence dx:ZeroToOneRange
    ] .
conf:url a dx:Type ;
    dx:hasBase dx:DialectProperty ;
    dx:containerFor [ a dx:Position ;
        dx:hasContained xs:anyURI ;
        dx:hasOccurrence dx:OneRange
    ] .
conf:max a dx:Type ;
    dx:hasBase dx:DialectProperty ;
    dx:containerFor [ a dx:Position ;
        dx:hasContained xs:positiveInteger ;
        dx:hasOccurrence dx:OneRange
    ] .
DokuWiki would provide a similar declaration of the RSS construct for its WIF dialect. A declaration of equivalence between dialect constructs would resemble the following:
conf:rss dx:equivalent doku:rss .
conf:url dx:equivalent doku:feed .
conf:max dx:equivalent doku:number .
The declarations provide enough information for a general-purpose process to discover and perform valid conversions. In particular, where salient markup in the source instance has an equivalent in the target dialect and has equivalent content for all required content in the target dialect, the general-purpose process can rewrite the source instance as a target instance. In addition to equivalence, a declaration of subsumption (where the target construct has a broader semantic and broader content than the source construct) can also enable conversion.
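The convertibility test described here can be sketched in a few lines of Python. The dictionaries below are a hypothetical in-memory form of the declarations. Because the DokuWiki declaration is not spelled out in this paper, its required and optional flags are assumed to mirror the Confluence declaration. Subsumption is left out of the sketch.

```python
# Hypothetical target declaration: construct -> {contained construct: is_required}.
# The flags are an assumption mirroring the Confluence rss declaration.
DOKU = {
    "doku:rss": {"doku:feed": True, "doku:number": False},
}

# Declared equivalences between source and target constructs.
EQUIVALENT = {
    "conf:rss": "doku:rss",
    "conf:url": "doku:feed",
    "conf:max": "doku:number",
}


def can_convert(construct, present, target_dialect, equivalent):
    """True if the construct has a target equivalent and supplies all required target content."""
    target = equivalent.get(construct)
    if target is None or target not in target_dialect:
        return False
    # Translate the content present in the source instance into target terms.
    supplied = {equivalent[c] for c in present if c in equivalent}
    required = {c for c, is_required in target_dialect[target].items() if is_required}
    return required <= supplied


if __name__ == "__main__":
    print(can_convert("conf:rss", {"conf:url", "conf:max"}, DOKU, EQUIVALENT))
    print(can_convert("conf:rss", {"conf:max"}, DOKU, EQUIVALENT))
```

An instance that fails the check would fall back to reversion or omission, as in the MediaWiki examples above.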
While RDF provides for distributed vocabulary declarations with automated alignment, the precision of the RDF notation reduces its usability in some cases for authoring. A more concise XML format that makes use of CURIEs and has a transform to RDF could provide an authorable convenience for dialect declarations. The following fragment shows one possibility:
<dx:Type about="conf:code">
    <hasBase ref="html:pre"/>
    <contains ref="conf:title" range="zeroToOne"/>
    <contains ref="xs:string" range="zeroToOne"/>
</dx:Type>
<dx:Type about="conf:title">
    <hasBase ref="html:strong"/>
    <contains ref="xs:string" range="zeroToOne"/>
</dx:Type>
<dx:Type about="conf:rss">
    <hasBase ref="html:div"/>
    <contains ref="conf:url" range="one"/>
    <contains ref="conf:max" range="zeroToOne"/>
</dx:Type>
<dx:Type about="conf:url">
    <hasBase ref="dx:DialectProperty"/>
    <contains ref="xs:anyURI" range="one"/>
</dx:Type>
<dx:Type about="conf:max">
    <hasBase ref="dx:DialectProperty"/>
    <contains ref="xs:positiveInteger" range="one"/>
</dx:Type>
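A transform from the concise format to RDF might work along the following lines. This Python sketch makes several assumptions: the dx namespace URI is a placeholder, the fragment is wrapped in a synthetic root element so that it parses, CURIEs are carried through as plain strings rather than expanded, and blank nodes get simple generated labels. It emits subject, predicate, and object tuples that parallel the RDF sketch shown earlier.

```python
import xml.etree.ElementTree as ET

DX_NS = "http://example.org/dx#"  # placeholder namespace URI (assumption)
RANGES = {"one": "dx:OneRange", "zeroToOne": "dx:ZeroToOneRange"}


def declaration_to_triples(fragment: str) -> list:
    """Turn concise dx:Type declarations into (subject, predicate, object) tuples."""
    # The fragment has no root or namespace binding, so wrap it before parsing.
    root = ET.fromstring(f'<root xmlns:dx="{DX_NS}">{fragment}</root>')
    triples = []
    blank_count = 0
    for type_element in root.findall(f"{{{DX_NS}}}Type"):
        subject = type_element.get("about")
        triples.append((subject, "a", "dx:Type"))
        for child in type_element:
            if child.tag == "hasBase":
                triples.append((subject, "dx:hasBase", child.get("ref")))
            elif child.tag == "contains":
                # Each contains element becomes a blank dx:Position node.
                node = f"_:b{blank_count}"
                blank_count += 1
                triples.append((subject, "dx:containerFor", node))
                triples.append((node, "a", "dx:Position"))
                triples.append((node, "dx:hasContained", child.get("ref")))
                triples.append((node, "dx:hasOccurrence", RANGES[child.get("range")]))
    return triples


FRAGMENT = """
<dx:Type about="conf:title">
    <hasBase ref="html:strong"/>
    <contains ref="xs:string" range="zeroToOne"/>
</dx:Type>
"""

if __name__ == "__main__":
    for triple in declaration_to_triples(FRAGMENT):
        print(*triple)
```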
Full exploitation of XHTML dialects
Based on the declaration of the WIF dialect, general-purpose tooling could support an XML document type specific to the dialect. The tooling could read the dialect declaration and generate a basic document type schema for the WIF dialect. In addition, tooling could serialize wiki documents as valid XML documents, using namespaced dialect elements instead of XHTML elements overloaded by namespaced dialect tokens in the class attribute. Elements that serialize parameterized instruction constructs would have extension attributes instead of nested processing instructions. The direct XML representation of the dialect would have the benefit of taking full advantage of the XML ecosystem for authoring, storing, and processing wiki documents without losing the option of serialization in the wikitext or WIF. Some wiki vendors may find value in such capabilities, adding incentive to declare the WIF dialect.
XHTML dialects could be used to exchange other content beyond the wiki documents. For instance, some wikis maintain wiki navigation structures outside of the wiki documents. These navigation structures could be serialized as an XHTML dialect, extending nested div elements for structure and a elements for links (similar to the XML navigation formats such as Eclipse TOC and JavaHelp TOC).
More generally, many applications (including RESTful applications, ePUB sources, and others unrelated to wikis) could model their content as XHTML dialects. While outside the scope of this paper, this broader value increases the practicability of WIF because the implementation effort can attract more resources and because the serialization has potential interchange outside of wikitexts. As a particular example, the convenience sketched in the previous section for XML authoring of WIF dialect declarations could, in fact, be an XHTML dialect.
Benefits of the approach
To summarize the interchange proposal, each wiki vendor declares a WIF dialect that represents their differentiating wiki constructs as restrictive substitutions of XHTML. The vendor also writes export and import processing for converting wiki documents between the WIF dialect and the vendor’s wikitext. Third parties can declare mappings between WIF dialects. Based on the declarations, general-purpose tools can check compatibility of WIF dialects from different wiki vendors and convert instances. Because the conversion work is distributed across vendors and mappers, the approach does not raise process barriers to rapid innovation in the wiki space.
This robust method for interchange of wiki content provides many benefits:
- Federation across wikis (in particular, transclusion of document fragments) becomes possible.
- XHTML editor vendors can adapt their tools to support editing wiki documents in the base XHTML serialization.
- Users can avoid vendor lock-in by migrating their content (potentially with manual cleanup for edge cases).
- Wiki vendors can provide broader value (especially in the enterprise) by integrating wiki content through XML export and import.
Formalizing a strong alignment between wikitext and XHTML also increases the value of XHTML as an ecosystem. In particular, the re-energized HTML5 effort, the RESTful emphasis on HTML as the preferred format for data exchange, and the uptake of the ePUB format would have strong synergy with an XHTML representation of wiki documents.
Pragmatic steps to definition and adoption
This paper has sketched an approach that would benefit from deeper exploration. For instance, automated conversion between WIF dialects may be able to flatten containers while preserving basic semantics and structure. Such exploration needs significant participation from the wiki and XML communities. The following steps would help flesh out the proposal and build momentum for the approach:
- In collaboration with at least one Open Source wiki vendor, one enterprise wiki vendor, and one enterprise XML vendor, create a proof of concept for conversion based on WIF dialect declarations.
- Present findings from the proof of concept and solicit feedback at the WikiSym and SemWiki conferences.
- Convene a W3C workshop to follow up on the conferences. Because of the important tie-in to XHTML, W3C is the most appropriate standards body for WIF dialects based on XHTML.
- Convene a W3C committee for standardizing the WIF dialect declarations and WIF instance serializations.
Bibliography
CREOLE Sauer, Christoph et al. “WikiCreole: A Common Wiki Markup.” Presented at the International Symposium on Wikis, Montréal, Canada, October 21 – 25, 2007. In Proceedings of the 2007 international symposium on Wikis. http://ws2007.wikisym.org/space/SauerSmithBenzPaper/Sauer_WikiSym2007_WikiCreole.pdf See also http://www.wikicreole.org/
CURIE Birbeck, Mark and McCarron, Shane, Ed. “CURIE Syntax 1.0”. W3C, 16 Jan 2009. http://www.w3.org/TR/curie/
DITASPEC Priestley, Michael and Hackos, JoAnn, Ed. “DITA Version 1.1 Architectural Specification”. OASIS, 31 May 2007. http://docs.oasis-open.org/dita/v1.1/CS01/archspec/ditaspecialization.html
HTMLMOD Austin, Daniel et al. “XHTML Modularization 1.1”. W3C, 8 October 2008. http://www.w3.org/TR/xhtml-modularization/
IWML Altheim, Murray. “InterWikiMarkupLanguage (IWML): A Common Interchange Syntax for Wiki”. 27 Mar 2004. http://www.altheim.com/specs/iwml/
MBWIF Shah, Sunir et al. “Meatball Wiki: WikiInterchangeFormat”. Last modified 4 Apr 2007. http://meatballwiki.org/wiki/WikiInterchangeFormat
MBSTAN Shah, Sunir et al. “Meatball Wiki: WikiMarkupStandard”. Last modified 16 Apr 2010. http://meatballwiki.org/wiki/WikiMarkupStandard
NOHTML Shah, Sunir et al. “Why Doesnt Wiki Do Html”. Last modified 28 Feb 2010. http://c2.com/cgi/wiki?WhyDoesntWikiDoHtml
OWL W3C OWL Working Group, Ed. “OWL 2 Web Ontology Language Document Overview”. 27 October 2009. http://www.w3.org/TR/owl2-overview/
POSH Çelik, Tantek et al. “Plain Old Semantic HTML (POSH)”. 26 Apr 2010. http://microformats.org/wiki/posh
RDF Klyne, Graham and Carroll, Jeremy J., Ed. “Resource Description Framework (RDF): Concepts and Abstract Syntax”. W3C, 10 Feb 2004. http://www.w3.org/TR/rdf-concepts/
SKOS Miles, Alistair and Bechhofer, Sean, Ed. “SKOS Simple Knowledge Organization System Reference”. W3C, 18 Aug 2009. http://www.w3.org/TR/skos-reference/
STIF Völkel, Max et al. “Structured Text Interchange Format”. Last modified 19 Oct 2009. http://semanticweb.org/wiki/Structured_Text_Interchange_Format
WIF Voght, Tim et al. “Wiki Interchange Format”. Last modified 22 Dec 2006. http://c2.com/cgi/wiki?WikiInterchangeFormat
Notes
1 The wikitext critique of XML markup echoes the Dynamic Language critique of Object-Oriented programming languages like Java. According to the critique, the formality that ensures valid sources turns source creation into a complex ceremony that impedes agility. The wikitext critique also echoes the argument for a DSL (Domain Specific Language) as a better fit for a specific niche than a general-purpose language (but also invites the counter-argument that simplifying assumptions can become constraining legacy when a DSL outgrows its niche).
2 A particular challenge for an XML representation of wikitext is that some wikitexts can have overlapping tags as in HTML (for example, <i>italic <b>bold italic</i> bold</b>). One possibility would be for the WIF serialization to use the class attribute to represent these special cases (as in <i>italic </i><span class="html--i html--b">bold italic</span><b> bold</b>). The importance of this case and other possible solutions, however, require more discussion.
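The class-attribute possibility can be sketched in a few lines of Python. The function below tracks the set of open tags and wraps each text run according to that set, using a span with `html--` class tokens whenever more than one tag is active. It is an illustration under simplifying assumptions, not a proposed algorithm: it handles only attribute-free tags, passes text through unchanged, and silently ignores unmatched close tags.

```python
import re

# Matches an attribute-free open or close tag, or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")


def flatten_overlaps(source):
    """Rewrite possibly overlapping inline tags as well-formed markup."""
    active = []  # names of the currently open tags, in opening order
    pieces = []
    for closing, name, text in TOKEN.findall(source):
        if text:
            if not active:
                pieces.append(text)
            elif len(active) == 1:
                pieces.append("<%s>%s</%s>" % (active[0], text, active[0]))
            else:
                classes = " ".join("html--" + tag for tag in active)
                pieces.append('<span class="%s">%s</span>' % (classes, text))
        elif closing:
            if name in active:
                active.remove(name)
        else:
            active.append(name)
    return "".join(pieces)


print(flatten_overlaps("<i>italic <b>bold italic</i> bold</b>"))
# <i>italic </i><span class="html--i html--b">bold italic</span><b> bold</b>
```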