Narendra Dhami

Importing huge XML files using PHP5

Posted by Narendra Dhami on August 25, 2008


At work I had to implement the synchronization between an online shop and
a commodity management system. The data exchange format was XML: one big
file containing all of the products (several thousand, each with dozens of
attributes). The big question: how do I import the file in a way that is
convenient for me as a programmer, without exhausting the machine's RAM
when loading a 1 GiB file?

I personally prefer SimpleXML for everything XML-related in PHP, even for
generating XML, although that was never its primary purpose.
The big problem is that SimpleXML, just like DOM, builds the whole XML
tree in memory. That's a no-go for large files.

So what's left? Yes, our old and rusty SAX parser. It's not really
convenient, because you have to handle all these events for opening tags,
closing tags, data sections etc., but it reads the XML file incrementally.
Parsing huge files is no problem if you use SAX. PHP 5's more comfortable
take on this streaming approach is XMLReader, a pull parser that hands you
one node at a time, and that is what I chose to use.
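To make the streaming style concrete, here is a minimal sketch that walks a small in-memory document with XMLReader and collects the text of every `name` element. The element names and the inline sample are made up for illustration; a real import would call `open()` on the file instead of `XML()` on a string.

```php
<?php
// Hypothetical sample data; the element names are illustrative only.
$xml = '<programs>'
     . '<program><name>Kate</name></program>'
     . '<program><name>gedit</name></program>'
     . '</programs>';

$reader = new XMLReader();
$reader->XML($xml); // for a file: $reader->open('/path/to/file.xml')

$names = [];
// read() moves the cursor to the next node; the document is never
// loaded as one complete tree.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'name') {
        // readString() returns the text content of the current element.
        $names[] = $reader->readString();
    }
}
$reader->close();

print_r($names);
```

The loop body decides what to keep, so memory use depends on what you collect, not on the size of the input.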

On the other side, in the program that synced the data with the database,
I wanted something dead simple, like a foreach loop. So the task was to
combine XMLReader with SPL's Iterator interface.
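It helps to know in which order foreach calls the five Iterator methods. The sketch below uses a hypothetical `LoggingIterator` class (not part of the import code) that simply records each call; the return type declarations assume PHP 8.0 or later.

```php
<?php
// Minimal Iterator that records the order in which foreach calls its methods.
// Hypothetical helper for illustration; return types assume PHP 8.0+.
class LoggingIterator implements Iterator
{
    public array $log = [];
    private array $items;
    private int $pos = 0;

    public function __construct(array $items)
    {
        $this->items = $items;
    }

    public function rewind(): void
    {
        $this->log[] = 'rewind';
        $this->pos = 0;
    }

    public function valid(): bool
    {
        $this->log[] = 'valid';
        return $this->pos < count($this->items);
    }

    public function current(): mixed
    {
        $this->log[] = 'current';
        return $this->items[$this->pos];
    }

    public function key(): mixed
    {
        $this->log[] = 'key';
        return $this->pos;
    }

    public function next(): void
    {
        $this->log[] = 'next';
        ++$this->pos;
    }
}

$it = new LoggingIterator(['Kate']);
foreach ($it as $key => $value) {
    $it->log[] = "body:$value";
}
echo implode(' ', $it->log), "\n";
```

Because `valid()` runs right after `rewind()` and again after every `next()`, it is a natural place to load the next record lazily, which is what the iterator code in this post does.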

Sample XML



 
  Kate
  LGPL
  
   Editor
   Nice KDE text editor
  
  
   3.5.9
   4.0.5
  
 

 
  gedit
  LGPL
  
   Editor
   Standard gnome text editor
  
  
   2.22.3
   2.22.4-rc1
  
 

Preferred PHP import code

The import code I was aiming for is as easy and beautiful as reading an
XML file can get: a plain foreach loop over the programs in the file.
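A sketch of what a foreach-based import can look like, not the original snippet: the helper `readPrograms()`, the element names (`programs`, `program`, `name`, `versions`, `stable`) and the sample data are hypothetical, and a generator stands in for the Iterator class to keep the example short. Nested tag names are joined with `-`, following the iterator code in this post.

```php
<?php
/**
 * Hypothetical helper: yields one flat array per <$tag> element.
 * Nested element names are joined with '-' (e.g. 'versions-stable').
 */
function readPrograms(string $file, string $tag = 'program'): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException("Cannot open $file");
    }

    $values = [];
    $path   = [];
    $inside = false;

    while ($reader->read()) {
        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                if (!$inside) {
                    if ($reader->name === $tag) {
                        $inside = true;
                        $values = [];
                        $path   = [];
                    }
                    break;
                }
                $path[] = $reader->name;
                $values[implode('-', $path)] = null;
                if ($reader->isEmptyElement) {
                    array_pop($path); // <tag/> has no END_ELEMENT node
                }
                break;

            case XMLReader::TEXT:
                if ($inside) {
                    $values[implode('-', $path)] = $reader->value;
                }
                break;

            case XMLReader::END_ELEMENT:
                if (!$inside) {
                    break;
                }
                if ($reader->name === $tag) {
                    $inside = false;
                    yield $values; // hand one finished record to the caller
                } else {
                    array_pop($path);
                }
                break;
        }
    }
    $reader->close();
}

// Write a small hypothetical sample to a temp file so the sketch runs as-is.
$file = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($file, '<programs>
 <program><name>Kate</name><versions><stable>3.5.9</stable></versions></program>
 <program><name>gedit</name><versions><stable>2.22.3</stable></versions></program>
</programs>');

$lines = [];
foreach (readPrograms($file) as $program) {
    $lines[] = $program['name'] . ' ' . $program['versions-stable'];
}
unlink($file);

echo implode("\n", $lines), "\n";
```

Only one record is held in `$values` at a time, so the loop body can write each program to the database and forget it.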


Iterator code

Here is the iteration code – without comments – in case you (or /me) need
to do the same thing again.

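<?php
// NOTE: the class name and the default value of $strObjectTagname are
// assumptions; the original declaration did not survive.
class ProgramIterator implements Iterator
{
    protected $strFile          = null;
    protected $strObjectTagname = 'program';
    protected $reader           = null;
    protected $program          = null;
    protected $nKey             = null;

    public function __construct($strFile) {
        $this->strFile = $strFile;
    }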

    public function current() {
        return $this->program;
    }

    public function key() {
        return $this->nKey;
    }

    public function next() {
        $this->program = null;
    }

    public function rewind() {
        $this->reader = new XMLReader();
        $this->reader->open($this->strFile);
        $this->program = null;
        $this->nKey    = null;
    }

    public function valid() {
        if ($this->program === null) {
            $this->loadNext();
        }

        return $this->program !== null;
    }

    /**
     * Loads the next program
     *
     * @return void
     */
    protected function loadNext()
    {
        $strElementName = null;
        $bCaptureValues = false;
        $arValues       = array();
        $arNesting      = array();

        while ($this->reader->read()) {
            switch ($this->reader->nodeType) {
                case XMLReader::ELEMENT:
                    $strElementName = $this->reader->name;
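                    if ($bCaptureValues) {
                        $arNesting[] = $strElementName;
                        $arValues[implode('-', $arNesting)] = null;
                        if ($this->reader->isEmptyElement) {
                            // <tag/> produces no END_ELEMENT, so pop it right away
                            array_pop($arNesting);
                        }
                    }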
                    if ($strElementName == $this->strObjectTagname) {
                        $bCaptureValues = true;
                    }
                    break;

                case XMLReader::TEXT:
                    if ($bCaptureValues) {
                        $arValues[implode('-', $arNesting)] = $this->reader->value;
                    }
                    break;

                case XMLReader::END_ELEMENT:
                    if ($this->reader->name == $this->strObjectTagname) {
                        $this->program = $arValues;
                        ++$this->nKey;
                        break 2;
                    }
                    if ($bCaptureValues) {
                        array_pop($arNesting);
                    }
                    break;
            }
        }
    }//protected function loadNext()

}

There are some things missing: namespace and attribute support, and
handling of tags with the same name at different hierarchy levels
(especially the main tag) or of tags that may show up several times.
I didn't need any of that, so add it yourself if you do.
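One way to get attributes and repeated tags without writing more state handling is to let XMLReader locate each record and hand only that record to SimpleXML. This is a different technique from the iterator above, sketched under assumptions: the function name `streamRecords()`, the element names and the `id` and `type` attributes are hypothetical. Memory use is then bounded by the largest single record instead of the whole file.

```php
<?php
/**
 * Hypothetical helper: yields each <$tag> element as a SimpleXMLElement.
 */
function streamRecords(string $file, string $tag): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException("Cannot open $file");
    }

    try {
        // Advance to the first <$tag> element.
        $found = false;
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $tag) {
                $found = true;
                break;
            }
        }

        while ($found) {
            // Parse only the current record into a small SimpleXML tree.
            yield new SimpleXMLElement($reader->readOuterXml());
            // Skip the record's subtree and jump to the next <$tag>.
            $found = $reader->next($tag);
        }
    } finally {
        $reader->close();
    }
}

// Small hypothetical sample with attributes and a repeated <version> tag.
$file = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($file, '<programs>
 <program id="1"><name>Kate</name><version type="stable">3.5.9</version><version type="unstable">4.0.5</version></program>
 <program id="2"><name>gedit</name><version type="stable">2.22.3</version></program>
</programs>');

$summary = [];
foreach (streamRecords($file, 'program') as $program) {
    $versions = [];
    foreach ($program->version as $version) {
        $versions[] = $version['type'] . '=' . $version;
    }
    $summary[] = $program['id'] . ' ' . $program->name . ': ' . implode(', ', $versions);
}
unlink($file);

echo implode("\n", $summary), "\n";
```

The trade-off is that every record is parsed twice, once by XMLReader and once by SimpleXML, which costs some speed in exchange for the convenient API.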

2 Responses to “Importing huge XML files using PHP5”

  1. jzhang said

    You may also want to look at vtd-xml, the latest and most advanced XML processing
    API

    vtd-xml

  2. anon said

    you may also want to check out vtd-xml, the latest and most advanced xml processing model

    vtd-xml
