[MarkLogic Dev General] Wikimedia parse

David Lee David.Lee at marklogic.com
Wed Jun 27 10:33:42 PDT 2012


Ah, the API does!!! Woo hoo.
Maybe I can get XML out of this after all ... I smell an xmlsh extension in the making :)

I actually have a similar problem with the xmlsh docs ... they are all currently in WakiWiki ... but that's a black box ... I want to turn them into XML like DocBook ...
Someone (Dave Pawson, I think?) wrote me a Python lib to do that, but it only gets 90% of the way ... the last 10%, as usual, is 99% of the work.
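
A minimal sketch of what "getting XML out of this" might look like for the infobox part of the sample quoted further down. This is plain Python and does not use the gwtwiki API. The function names (split_top_level, infobox_to_xml) and the <template>/<param> element names are made up for illustration, and it only handles a single named-parameter template, which is roughly the easy 90%.

```python
import re
import xml.etree.ElementTree as ET

# Wiki source carries HTML-style comments; drop them before parsing.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

# Trimmed-down version of the infobox from the thread.
SAMPLE = """{{Infobox settlement
<!-- Basic info  ---------------->
|name          = Teichibe
|other_name             =
|settlement_type        =Village
|subdivision_name       = {{flag|Mali}}
|subdivision_type1      = [[Regions of Mali|Region]]
|latd=15|latm=16|lats=30 |latNS=N
}}"""


def split_top_level(text, sep="|"):
    """Split on sep, ignoring separators nested inside {{...}} or [[...]]."""
    parts, buf, depth, i = [], [], 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            buf.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            buf.append(pair)
            i += 2
        elif text[i] == sep and depth == 0:
            parts.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(text[i])
            i += 1
    parts.append("".join(buf))
    return parts


def infobox_to_xml(wikitext):
    """Turn one {{Template |key = value ...}} into a <template> element.

    Empty and positional (unnamed) parameters are skipped; nested
    templates and links are kept as raw text in the value.
    """
    text = COMMENT_RE.sub("", wikitext).strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError("expected a single {{...}} template")
    fields = split_top_level(text[2:-2])
    root = ET.Element("template", name=fields[0].strip())
    for field in fields[1:]:
        key, eq, value = field.partition("=")
        key, value = key.strip(), value.strip()
        if not eq or not key or not value:
            continue
        ET.SubElement(root, "param", name=key).text = value
    return root


if __name__ == "__main__":
    print(ET.tostring(infobox_to_xml(SAMPLE), encoding="unicode"))
```

Nested values such as {{flag|Mali}} stay as opaque strings here; expanding those, plus <ref> bodies and positional arguments, is where the remaining work would be.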

-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
dlee at marklogic.com
Phone: +1 650-287-2531
Cell:  +1 812-630-7622
www.marklogic.com

This e-mail and any accompanying attachments are confidential. The information is intended solely for the use of the individual to whom it is addressed. Any review, disclosure, copying, distribution, or use of this e-mail communication by others is strictly prohibited. If you are not the intended recipient, please notify us immediately by returning this message to the sender and delete all copies. Thank you for your cooperation.


> -----Original Message-----
> From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Michael Blakeley
> Sent: Wednesday, June 27, 2012 1:23 PM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Wikimedia parse
> 
> Yes, it does. The API gives you access to much of that wiki-structured markup,
> but you have to decide what to do with it. Naturally the online tool and much
> of the sample code doesn't do anything interesting.
> 
> -- Mike
> 
> On 27 Jun 2012, at 10:19 , David Lee wrote:
> 
> > Thanks, I tried the online tool on a sample I have and it strips out much of the meaningful stuff :(
> >
> > -----------  Input
> > {{Infobox settlement
> > <!--See the Table at Infobox Settlement for all fields and descriptions of usage-->
> > <!-- Basic info  ---------------->
> > |name          = Teichibe
> > |other_name             =
> > |native_name            =  <!-- for cities whose native name is not in English -->
> > |nickname               =
> > |settlement_type        =Village
> > |motto                  =
> > <!-- images and maps  ----------->
> > |image_skyline          =
> > |imagesize              =
> > |image_caption          =
> > |image_flag             =
> > |flag_size              =
> > |image_seal             =
> > |seal_size              =
> > |image_shield           =
> > |shield_size            =
> > |image_map              =
> > |mapsize                =
> > |map_caption            =
> > |pushpin_map            =Mali<!-- the name of a location map as per http://en.wikipedia.org/wiki/Template:Location_map -->
> > |pushpin_label_position =bottom
> > |pushpin_mapsize        =300
> > |pushpin_map_caption    =Location in Mali
> > <!-- Location ------------------>
> > |coordinates_display    = inline,title
> > |coordinates_region     = ML
> > |subdivision_type       = Country
> > |subdivision_name       = {{flag|Mali}}
> > |subdivision_type1      = [[Regions of Mali|Region]]
> > |subdivision_name1      = [[Kayes Region]]
> > |subdivision_type2      =[[Cercles of Mali|Cercle]]
> > |subdivision_name2      = [[Kayes Cercle]]
> > |subdivision_type3      =[[Communes of Mali|Commune]]
> > |subdivision_name3      = [[Karakoro]]
> > |<!-- Politics ----------------->
> > |government_footnotes   =
> > |government_type        =
> > |leader_title           =
> > |leader_name            =
> > |leader_title1          =  <!-- for places with, say, both a mayor and a city manager -->
> > |leader_name1           =
> > |established_title      =  <!-- Settled -->
> > |established_date       =
> > <!-- Area    --------------------->
> > |area_magnitude         =
> > |unit_pref                =Imperial <!--Enter: Imperial, if Imperial (metric) is desired-->
> > |area_footnotes           =
> > |area_total_km2           =  <!-- ALL fields dealing with a measurements are subject to automatic unit conversion-->
> > |area_land_km2            = <!--See table @ Template:Infobox Settlement for details on automatic unit conversion-->
> > <!-- Population   ----------------------->
> > |population_as_of               =
> > |population_footnotes           =
> > |population_note                =
> > |population_total               =
> > |population_density_km2         =
> > |population_density_sq_mi       =
> > |population_metro               =
> > |population_density_metro_km2   =
> > |population_density_metro_sq_mi =
> > |population_blank1_title        =Ethnicities
> > |population_blank1              =
> > |population_density_blank1_km2 =
> > |population_density_blank1_sq_mi =
> > <!-- General information  --------------->
> > |timezone               =[[GMT]]
> > |utc_offset             = +0
> > |timezone_DST           =
> > |utc_offset_DST         =
> > |latd=15|latm=16|lats=30 |latNS=N
> > |longd=11|longm=42|longs=25|longEW=W
> > |elevation_footnotes    =  <!--for references: use <ref> </ref> tags-->
> > |elevation_m            =
> > |elevation_ft           =
> > <!-- Area/postal codes & others -------->
> > |postal_code_type       =  <!-- enter ZIP code, Postcode, Post code, Postal code... -->
> > |postal_code            =
> > |area_code              =
> > |blank_name             =
> > |blank_info             =
> > |blank1_name            =
> > |blank1_info            =
> > |website                =
> > |footnotes              =
> > }}
> >
> > '''Teichibe''' is a village and principal settlement (''[[chef-lieu]]'') of the [[Karakoro|commune of Karakoro]] in the [[Kayes Cercle|Cercle of Kayes]] in the [[Kayes Region]] of south-western [[Mali]].<ref>{{citation | title=Communes de la Région de Kayes | publisher= Ministère de l'administration territoriale et des collectivités locales, République du Mali | url=http://www.matcl.gov.ml/pdf/ComRegKayes.pdf | language=French }}.</ref>
> >
> >
> > ==References==
> > {{reflist}}
> >
> > [[Category:Populated places in the Kayes Region]]
> >
> >
> > {{Kayes-geo-stub}}
> >
> > --------------  Output
> >
> > <p>{{Infobox settlement
> > &#60;!--See the Table at Infobox Settlement for all fields and descriptions of usage--&#62;
> > &#60;!-- Basic info  ----------------&#62;}} </p>
> > <p><b>Teichibe</b> is a village and principal settlement (<i><a href="Chef-lieu" title="chef-lieu">chef-lieu</a></i>) of the <a href="Karakoro" title="Karakoro">commune of Karakoro</a> in the <a href="Kayes_Cercle" title="Kayes Cercle">Cercle of Kayes</a> in the <a href="Kayes_Region" title="Kayes Region">Kayes Region</a> of south-western <a href="Mali" title="Mali">Mali</a>.&#60;ref&#62;{{citation}}.&#60;/ref&#62;</p>
> >
> > <h2><span class="mw-headline" id="References">References</span></h2>
> > <p>{{reflist}}</p>
> >
> > <p>{{Kayes-geo-stub}}</p>
> >
> >
> >> -----Original Message-----
> >> From: general-bounces at developer.marklogic.com [mailto:general-
> >> bounces at developer.marklogic.com] On Behalf Of Michael Blakeley
> >> Sent: Wednesday, June 27, 2012 1:14 PM
> >> To: MarkLogic Developer Discussion
> >> Subject: Re: [MarkLogic Dev General] Wikimedia parse
> >>
> >> Not in XQuery: it would be much too ugly for my taste. I've used
> >> http://code.google.com/p/gwtwiki/ and contributed a couple of patches.
> >> Hsiao could show you some sample code with xhtml-like output.
> >>
> >> If you need to use it from XQuery, I suppose you could wrap it in a web service.
> >>
> >> -- Mike
> >>
> >> On 27 Jun 2012, at 09:59 , David Lee wrote:
> >>
> >>> Has anyone seen an XQuery or XSLT parser for WikiMedia (markup for Wikipedia)?
> >>>
> >>> I found this list
> >>>
> >>> http://www.mediawiki.org/wiki/Alternative_parsers
> >>>
> >>>
> >>> What I'm looking for is a way to take the XML dump of Wikipedia and enrich it to something more useful.  Right now all the body of an article is in Wikimedia format and largely opaque to ML except as one long string.
> >>>
> >>>
> >>> _______________________________________________
> >>> General mailing list
> >>> General at developer.marklogic.com
> >>> http://community.marklogic.com/mailman/listinfo/general

