How to learn to parse huge XML documents by doing it wrong for 5 years


Excerpt

Tyler Riddle will cover his learning experiences creating the Parse::MediaWikiDump, XML::TreePuller, and MediaWiki::DumpFile modules, which were built to process the 24 gigabyte English Wikipedia dump files in a reasonable time frame.

Description

When an XML document can’t fit into memory, the vast majority of the solutions available on CPAN are no longer an option; when the document is so large that even the standard tools for handling big documents take up to 16 hours to process it, your hands are tied even more. Tyler Riddle will cover his learning experiences creating the Parse::MediaWikiDump, XML::TreePuller, and MediaWiki::DumpFile modules, which were built to process the 24 gigabyte English Wikipedia dump files in a reasonable time frame.
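
As a rough illustration of the streaming constraint described above, here is a minimal event-oriented sketch using XML::Parser (a standard CPAN module, not one of the modules covered in the talk). The dump filename and the element names are assumptions based on the MediaWiki export format; only callbacks and a little state are kept, so memory use stays flat regardless of file size.

    #!/usr/bin/env perl
    # Minimal SAX-style sketch: print every <title> in a MediaWiki dump
    # without ever loading the whole document into memory.
    # Assumes the standard dump layout (<title> inside <page>); the
    # filename is a placeholder.
    use strict;
    use warnings;
    use XML::Parser;

    my ($in_title, $title) = (0, '');

    my $parser = XML::Parser->new(Handlers => {
        Start => sub {
            my (undef, $element) = @_;
            if ($element eq 'title') { $in_title = 1; $title = ''; }
        },
        Char => sub {
            my (undef, $text) = @_;
            $title .= $text if $in_title;
        },
        End => sub {
            my (undef, $element) = @_;
            if ($element eq 'title') { $in_title = 0; print "$title\n"; }
        },
    });

    $parser->parsefile('enwiki-pages-articles.xml');

Keeping the bookkeeping for that kind of state by hand is exactly what makes raw event-oriented parsing awkward, which motivates the nicer interfaces discussed in the topics below.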

Major topics:

1) Real-world benchmarks of C and Perl libraries used to process huge XML documents.

2) The dirty little secret about XS and what it means for you in this context.

3) The evolution of the implementation of a nice interface around event-oriented (SAX-style) XML parsing.

4) Why XML::TreePuller is what you need for huge documents (see the sketch after this list).
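
A minimal sketch of the kind of interface topic 4 refers to, assuming XML::TreePuller's iterate_at/next pull interface and element accessors as I recall them from its documentation, plus the standard /mediawiki/page dump layout; the filename is a placeholder.

    #!/usr/bin/env perl
    # Sketch: pull one <page> subtree at a time from a Wikipedia dump.
    # Assumes XML::TreePuller's iterate_at/next interface and its
    # element accessors (get_elements/text); filename is a placeholder.
    use strict;
    use warnings;
    use XML::TreePuller;

    my $pull = XML::TreePuller->new(location => 'enwiki-pages-articles.xml');

    # Build a small tree for each /mediawiki/page element; the rest of
    # the document is never held in memory at once.
    $pull->iterate_at('/mediawiki/page' => 'subtree');

    while (defined(my $page = $pull->next)) {
        my $title = $page->get_elements('title')->text;
        my $text  = $page->get_elements('revision/text')->text;
        printf "%s: %d bytes of wikitext\n", $title, length($text);
    }

With this approach the memory cost is bounded by the largest single <page> element rather than by the whole document, which is what makes a 24 gigabyte dump tractable.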

Speaking experience

Speaker

  • Tyler Riddle

    Biography

    Tyler has been using Linux and writing Perl since 1994. As a professional systems administrator he frequently uses Perl to solve real-world problems elegantly and quickly. Tyler is an accomplished open source author with six modules on CPAN and a patch in Apache 2.2.