Tuesday, October 5, 2010

Using DOMDocument to parse a non-UTF-8 web page in PHP

Recently I was digging around in PHP + curl + DOMDocument. There are quite a lot of impressive facilities, such as DOMXPath, curl POST and cookies, and together they make it almost effortless to simulate actions on a website without depending on JavaScript. Here are some problems and tricks I found when handling non-UTF-8 pages with curl and DOMDocument.

Case 1:
Parsing a non-UTF-8 page into DOMDocument. Some web pages put their tags in the following order, with the title before the charset declaration:

<html><head><title>NON UTF-8 TITLE</title><meta http-equiv="Content-Type" content="text/html; charset=ENCODING"/>

Assume you have just received the HTML content from curl_exec:

//....
$htmlContent = curl_exec($ch);
$doc = new DOMDocument('1.0', 'ENCODING'); // create a new DOMDocument object
$doc->loadHTML($htmlContent); // you will probably get a warning here
$doc->save('test.html');

Open test.html in any text editor and you may find that the HTML body is gone and the head is incomplete.
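To look at the warning mentioned above more closely, libxml's internal error buffer can be switched on so that parse problems are collected instead of printed. This is only a minimal sketch: the `$htmlContent` string below is a made-up stand-in for the result of curl_exec, and whether any entries show up depends on the actual page and the libxml version.

```php
<?php
// Made-up stand-in for the string returned by curl_exec().
$htmlContent = '<html><head><title>Sample</title>'
             . '<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>'
             . '</head><body><p>Hello</p></body></html>';

// Collect libxml parse problems in memory instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($htmlContent);

// Each entry is a LibXMLError object with level, line and message properties.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Printing the collected messages right after loadHTML makes it easier to see which part of the markup the parser complained about.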

To resolve this problem, you have to put the title after the Content-Type meta tag.

Here is a simple trick to do that:

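$htmlContent = curl_exec($ch);
// Capture the title and the Content-Type meta tag that follows it
// (gb2312 here stands in for ENCODING; use the charset your page declares)
$pattern = "/(<title>.*<\/title>)[.\s]*(<meta\s*http-equiv=\"Content-Type\"\s*content=\"text\/html; charset=gb2312\"\s*\/>)/i";
// Swap the two so the meta tag comes before the title
$htmlContent = preg_replace($pattern, "$2\r\n$1", $htmlContent);
$doc = new DOMDocument('1.0', 'ENCODING');
$doc->loadHTML($htmlContent);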

Now you should obtain the proper document content without losing anything.
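The same swap can be written so that it does not hard-code one charset. The sketch below is a hedged generalization, not the exact code from this post: the function name `moveCharsetMetaBeforeTitle` is hypothetical, and it assumes the meta tag directly follows the title and uses double-quoted attributes in the order shown in Case 1.

```php
<?php
// Hypothetical helper: move the Content-Type <meta> tag in front of <title>,
// whatever charset it declares.
function moveCharsetMetaBeforeTitle(string $html): string
{
    // Group 1: the whole <title> element (non-greedy, may span lines).
    // Group 2: the Content-Type meta tag that immediately follows it.
    $pattern = '/(<title>.*?<\/title>)\s*'
             . '(<meta\s+http-equiv="Content-Type"\s+'
             . 'content="text\/html;\s*charset=[\w-]+"\s*\/?>)/is';

    // Swap the two groups; limit 1 because a page has a single <title>.
    $result = preg_replace($pattern, '$2' . "\r\n" . '$1', $html, 1);

    // preg_replace() returns null on failure; fall back to the input then.
    return $result ?? $html;
}

// Made-up sample input in the same tag order as Case 1.
$html = '<html><head><title>Sample</title>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>'
      . '</head><body><p>Hello</p></body></html>';

$fixed = moveCharsetMetaBeforeTitle($html);
```

The result in `$fixed` can then be passed to loadHTML as before. If the pattern does not match, for example when the meta tag already precedes the title, the input comes back unchanged.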
