Wednesday, April 11, 2012

Parsing BIG XML in PHP

I need to parse a big XML file, e.g. 100 MB (it can be even more).



For example, the XML looks like this:



<notes>
<note>
<id>cdsds32da435-wufdhah</id>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>


<!-- ... repeated for 1,000,000 different notes (or even more) ... -->

</notes>


Each note has a unique ID. While parsing the XML, I first need to check whether a note with that ID already exists in the DB; if not, I INSERT it.
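As an aside, one way to avoid the existence check in PHP entirely is to let MySQL deduplicate. This is a hedged sketch of that alternative, not what the post does: it assumes a UNIQUE index on `id` (the index name `uniq_id` is made up), after which INSERT IGNORE silently skips rows whose ID already exists.

```sql
-- One-time setup: make MySQL enforce uniqueness of the note ID.
ALTER TABLE notes ADD UNIQUE KEY uniq_id (id);

-- Per note: insert unconditionally; duplicates are skipped, not errors.
INSERT IGNORE INTO notes (id, `to`, `from`, heading, body)
VALUES ('cdsds32da435-wufdhah', 'Tove', 'Jani', 'Reminder', 'Don''t forget me this weekend!');
```

With this approach no ID list has to be held in PHP memory at all.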



The problem is performance (it takes 2 hours). I try to fetch all IDs from the DB (which is also big) with one SELECT, so I don't query the DB each time, and I keep them in a PHP array (in memory).



$sql = "SELECT id FROM `notes`";
...
$ids = ...; // array with all ids
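A sketch of how that array could be made fast to search (the sample IDs here are made up): store the IDs as array keys instead of values. Checking a key with isset() is a hash lookup, roughly O(1), while array_search() scans the whole array, O(n), on every call.

```php
<?php
// $rows stands in for the result of "SELECT id FROM `notes`".
$rows = ['cdsds32da435-wufdhah', 'abc123', 'def456'];

// array_flip() turns values into keys; the values are then irrelevant.
$ids = array_flip($rows);

var_dump(isset($ids['abc123']));      // true  -> already in the DB
var_dump(isset($ids['brand-new']));   // false -> needs an INSERT
```

With a million IDs this is the difference between a constant-time probe and a million-element scan per note.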


I've also parsed the XML with xml_parser in a loop:



while ($data = fread($Xml, 512)) {
    xml_parse($xmlParser, $data, feof($Xml));
}


I think that parsing the whole XML with SimpleXML would create a variable too big for PHP to handle.
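A middle ground, sketched here under the assumption that the file is named notes.xml, is XMLReader: it streams the document node by node, so memory use stays flat however big the file is, and each &lt;note&gt; can still be expanded into a small, convenient SimpleXML object on its own. The demo writes a tiny notes.xml first; the real 100 MB file would be opened the same way.

```php
<?php
// Write a tiny sample file so the sketch is self-contained.
file_put_contents('notes.xml', '<notes>'
    . '<note><id>cdsds32da435-wufdhah</id><to>Tove</to></note>'
    . '<note><id>abc-123</id><to>Jani</to></note>'
    . '</notes>');

$reader = new XMLReader();
$reader->open('notes.xml');

$seen = [];
while ($reader->read()) {
    // React only to opening <note> elements; inner tags are skipped.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'note') {
        // Expand just this one <note> into a small SimpleXML object.
        $note = new SimpleXMLElement($reader->readOuterXml());
        $seen[] = (string) $note->id;
        // Here you would do: if (!isset($ids[(string) $note->id])) { insert it }
    }
}
$reader->close();

print_r($seen); // the IDs, in document order
```

Only one note is ever materialized at a time, so a 100 MB file costs roughly the same memory as a 1 KB one.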



And then, when I have a note ID, I check whether it exists in $ids:



if (array_search($note->id, $ids) === false) {
    // then insert it
}
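Besides being slow, array_search() has a correctness pitfall worth a tiny demo: it returns the KEY of the match, and key 0 is falsy, so a bare `!array_search(...)` wrongly treats the very first element as "not found". The comparison has to be a strict `=== false`.

```php
<?php
$ids = ['cdsds32da435-wufdhah', 'abc-123'];

// Found at index 0 -> int(0), which is falsy even though it IS a match.
var_dump(array_search('cdsds32da435-wufdhah', $ids)); // int(0)

// Only a strict comparison distinguishes "index 0" from "not found".
var_dump(array_search('missing', $ids) === false);    // bool(true)
```

This bug would make the importer re-insert whichever note happens to sit at index 0.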


But it takes too long. So I found that PHP has special arrays called Judy arrays, http://php.net/manual/en/book.judy.php, but I don't know exactly whether they are meant for this, I mean for quickly searching BIG arrays.
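For what it's worth, a hedged sketch of the PECL judy extension from that manual page, assuming the extension is actually installed (it is not part of core PHP). A Judy array is used through array syntax like a normal PHP array but is much more memory-compact, which matters when holding millions of string IDs; for pure lookup speed, a plain PHP array probed with isset() is already a hash lookup, so Judy mainly helps with memory, not speed.

```php
<?php
// Requires the PECL judy extension; this will fatal without it.
$ids = new Judy(Judy::STRING_TO_MIXED);  // string keys -> mixed values

// Load known IDs (in practice, from the SELECT above).
$ids['cdsds32da435-wufdhah'] = true;

// Membership test works like a normal array.
if (!isset($ids['some-new-id'])) {
    // not seen before -> insert the note
}
```

Whether the memory savings justify the extension dependency depends on how close the plain-array version comes to exhausting memory_limit.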



I'm also thinking about Memcached, to store the IDs from the DB across many variables, but I want to find a proper solution.
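A hedged sketch of that idea, assuming a Memcached server on localhost:11211 and the Memcached extension installed ($noteId is a placeholder for the ID taken from the current &lt;note&gt;). Each known ID becomes its own key, so the PHP process no longer holds the whole list; the trade-off is a network round trip per lookup, so for a tight parse loop an in-process array is usually faster.

```php
<?php
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

// While loading IDs from the DB, mark each one as known:
$mc->set('note:cdsds32da435-wufdhah', 1);

// While parsing the XML ($noteId comes from the current <note>):
$noteId = 'some-new-id';
if ($mc->get('note:' . $noteId) === false) {
    // unknown ID -> insert the note, then remember it
    $mc->set('note:' . $noteId, 1);
}
```

This mainly buys memory headroom and sharing between processes, not raw lookup speed.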



There are also indexes on the DB table to speed things up. The XML grows every week :) and each time it contains all the notes from the previous XML plus the new ones.



QUESTION:
How can I quickly search BIG arrays in PHP? Are Judy arrays meant for this?




