Parsing a page in a non-UTF-8 encoding with DOMDocument: some web pages put their tags in the following order:
<html><head><title>NON UTF-8 TITLE</title><meta http-equiv="Content-Type" content="text/html; charset=ENCODING"/>
Assume you have just received the HTML content from curl_exec:
//....
$htmlContent = curl_exec($ch);
$doc = new DOMDocument('1.0', 'ENCODING'); // create a new DOMDocument object
$doc->loadHTML($htmlContent); // you will probably get a warning here
$doc->save('test.html');
Open test.html in any text editor and you may find that the HTML body is gone and the header is incomplete.
To resolve this problem, you have to put the title after the Content-Type meta tag.
Here is a simple trick to do that:
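$htmlContent = curl_exec($ch);
// $1 captures the title element, $2 the Content-Type meta tag
// (gb2312 is the charset of the example page; use the charset of your own page)
$pattern = "/(<title>.*<\/title>)[.\s]*(<meta\s*http-equiv=\"Content-Type\"\s*content=\"text\/html; charset=gb2312\"\s*\/>)/i";
// swap the two so that the meta tag comes before the title
$htmlContent = preg_replace($pattern, "$2\r\n$1", $htmlContent);
$doc = new DOMDocument('1.0', 'ENCODING');
$doc->loadHTML($htmlContent);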
Now you should obtain the proper document content without losing anything.
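The pattern above is tied to one charset (gb2312). As a rough sketch, the same swap could be made charset-agnostic. The function name moveMetaBeforeTitle and the sample HTML string are hypothetical and only serve as illustration; the pattern assumes the meta tag directly follows the title and has its attributes in the same order as in the example.

```php
<?php
// Hypothetical helper (not from the original note): swaps a <title> element
// with the Content-Type <meta> tag that immediately follows it, for any charset.
function moveMetaBeforeTitle(string $html): string
{
    // Group 1: the title element. Group 2: the Content-Type meta tag.
    // Flags: i = case-insensitive, s = let "." match newlines inside the title.
    $pattern = '/(<title>.*?<\/title>)\s*'
             . '(<meta\s+http-equiv=["\']Content-Type["\']\s+'
             . 'content=["\']text\/html;\s*charset=[\w-]+["\']\s*\/?>)/is';

    // Emit the meta tag first, then the title.
    // preg_replace() returns null on error, so fall back to the input.
    return preg_replace($pattern, "$2\r\n$1", $html) ?? $html;
}

// Sample input for illustration; real input would come from curl_exec().
$html = '<html><head><title>Sample</title>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>'
      . '</head><body><p>Body</p></body></html>';

$fixed = moveMetaBeforeTitle($html);
```

$fixed can then be passed to $doc->loadHTML() in place of the raw content. Input that does not match the pattern is returned unchanged.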