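Get div data from the DOM where the id starts with a specific name
I want to get the data of HTML divs whose id starts with a specific name or string.
For example, suppose the website returns this markup: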
<html>
<div id="post_message_1">
somecontent1
</div>
<div id="post_message_2">
somecontent2
</div>
<div id="post_message_3">
somecontent3
</div>
</html>
To fetch the page, I tried cURL.
<?php
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl("myUrl");
$fh = fopen("test.html", 'w'); // we create the file, notice the 'w'. This is to be able to write to the file once.
//writing response in newly created file
fwrite($fh, $html); // here we write the data to the file.
fclose($fh);
?>
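If I use
$select = $doc->getElementById("post_message_");
it returns no data, because it searches the DOM for exactly this id, whereas the div ids in the HTML only start with that string. The actual id could be post_message_1 or post_message_2.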
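The behaviour described above can be reproduced with a small sketch. The markup here is a hypothetical one-div sample modelled on the question, not the real page:

```php
<?php
// Hypothetical sample markup, mirroring the structure shown in the question.
$html = '<html><body><div id="post_message_1">somecontent1</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// getElementById() looks for an element whose id equals the argument exactly,
// so passing only the common prefix finds nothing.
$byPrefix = $doc->getElementById('post_message_');
var_dump($byPrefix);

// Looking up the complete id is the only form this method supports.
$byFullId = $doc->getElementById('post_message_1');
var_dump($byFullId !== null);
```

Since the numeric suffix is not known in advance, a prefix match needs something other than getElementById(), such as an XPath query.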
I would convert the output of file_get_contents_curl into a SimpleXMLElement object and use one of its methods, xpath. For example, you can do this:
$html = <<<HTML
<html>
<div id="post_message_1">
somecontent1
</div>
<div id="post_message_2">
somecontent2
</div>
<div id="post_message_3">
somecontent3
</div>
</html>
HTML;
$dom = new SimpleXMLElement($html);
var_dump($dom->xpath('//div[starts-with(@id, "post_message_")]'));
UPDATE
In your case, you should do this:
$doc = new DOMDocument();
$doc->loadHTML(file_get_contents_curl($url));
$sxml = simplexml_import_dom($doc);
var_dump($sxml->xpath('//div[starts-with(@id, "post_message_")]'));
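I would probably iterate over all the divs and apply a regular expression to their ids to pick out the ones I need.
I don't think there is a cleaner way to do this, unless you can edit the page's HTML and add a class to the divs that contain the messages.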
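A minimal sketch of that iterate-and-filter idea might look like the following. The sample markup, the `sidebar` div and the `$messages` array are assumptions made for illustration, and the pattern assumes the suffix is always numeric:

```php
<?php
// Hypothetical sample markup based on the question, plus one unrelated div.
$html = '<html><body>'
      . '<div id="post_message_1">somecontent1</div>'
      . '<div id="sidebar">ignored</div>'
      . '<div id="post_message_2">somecontent2</div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$messages = [];
foreach ($doc->getElementsByTagName('div') as $div) {
    $id = $div->getAttribute('id');
    // Keep only ids made of the prefix followed by one or more digits.
    if (preg_match('/^post_message_\d+$/', $id)) {
        $messages[$id] = trim($div->textContent);
    }
}

print_r($messages);
```

The regular expression is applied to the id attribute only, not to the raw HTML, so the DOM parser still handles the markup itself.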
I am extracting data from another website, so I cannot add a class to the HTML. – neo
This is a bit of a sledgehammer, but I have never been able to get that kind of pattern working in PHP. A regular expression will work, though:
$subject = $html;
$pattern = '/id\=\"post_message_\d+\"\>(?<match>.*)<\/div\>/isU';
preg_match($pattern, $subject, $matches);
var_dump(trim($matches['match']));
Explanation of the regex:
/id\=\"post_message_\d+\"\>(?<match>.*)<\/div\>/isU
id matches the characters id literally (case insensitive)
\= matches the character = literally
\" matches the character " literally
post_message_ matches the characters post_message_ literally (case insensitive)
\d+ match a digit [0-9]
Quantifier: + Between one and unlimited times, as few times as possible, expanding as needed [lazy]
\" matches the character " literally
\> matches the character > literally
(?<match>.*) Named capturing group match
.* matches any character
Quantifier: * Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
< matches the characters < literally
\/ matches the character / literally
div matches the characters div literally (case insensitive)
\> matches the character > literally
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
s modifier: single line. Dot matches newline characters
U modifier: Ungreedy. The match becomes lazy by default. Now a ? following a quantifier makes it greedy
It only gives me the first div's data. It may need some modification. – neo
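preg_match() stops after the first match, which would explain seeing only the first div. One possible adjustment, sketched here against the sample markup from the question rather than the real page, is preg_match_all(), which collects every match:

```php
<?php
// Sample markup copied from the question; the real page may differ.
$html = <<<HTML
<html>
<div id="post_message_1">
somecontent1
</div>
<div id="post_message_2">
somecontent2
</div>
<div id="post_message_3">
somecontent3
</div>
</html>
HTML;

// i: case-insensitive, s: dot matches newlines, U: quantifiers are lazy by default.
$pattern = '/id="post_message_\d+">(?<match>.*)<\/div>/isU';

// preg_match_all() fills $matches['match'] with one entry per matching div.
preg_match_all($pattern, $html, $matches);
$contents = array_map('trim', $matches['match']);

print_r($contents);
```

This still assumes the message divs contain no nested div, because the lazy `.*` stops at the first closing `</div>` it meets. A DOM parser does not have that limitation.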
I found a solution, and it works fine. Maybe this code will help someone else. Thanks to @smarber, whose pattern helped me solve this.
<?php
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl("myUrl");
$dom = new DOMDocument();
$result = $dom->loadHTML($html);
$finder = new DomXPath($dom);
$nodes = $finder->query('//div[starts-with(@id, "post_message_")]');
$tmp_dom = new DOMDocument();
foreach ($nodes as $node)
{
$tmp_dom->appendChild($tmp_dom->importNode($node,true));
}
$innerHTML = trim($tmp_dom->saveHTML());
$fh = fopen("test.html", 'w'); // we create the file, notice the 'w'. This is to be able to write to the file once.
//writing response in newly created file
fwrite($fh, $innerHTML); // here we write the data to the file.
fclose($fh);
?>
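As a possible variation on the code above, DOMDocument::saveHTML() also accepts a node argument, so the temporary document may not be needed. The helper name `divs_with_id_prefix` and the sample markup are hypothetical, and the sketch assumes the prefix contains no double quotes, since it is placed inside an XPath string literal:

```php
<?php
// Hypothetical helper: returns the outer HTML of every <div> whose id starts with $prefix.
// Assumption: $prefix contains no double quotes (it is embedded in an XPath string literal).
function divs_with_id_prefix(string $html, string $prefix): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $finder = new DOMXPath($dom);
    $nodes = $finder->query('//div[starts-with(@id, "' . $prefix . '")]');

    $fragments = [];
    foreach ($nodes as $node) {
        // Serialise just this node instead of importing it into a second document.
        $fragments[] = $dom->saveHTML($node);
    }
    return $fragments;
}

// Hypothetical sample markup with one div that should be skipped.
$sample = '<html><body>'
        . '<div id="post_message_1">somecontent1</div>'
        . '<div id="sidebar">x</div>'
        . '<div id="post_message_2">somecontent2</div>'
        . '</body></html>';

$fragments = divs_with_id_prefix($sample, 'post_message_');
echo implode("\n", $fragments), "\n";
```

The joined result could then be written to test.html with fwrite(), as in the solution above.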
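Hi, thanks for your suggestion, but when I use xpath I get this error: "PHP Fatal error: Uncaught exception 'Exception' with message 'String could not be parsed as XML'". Note that the HTML I gave was just for demonstration purposes. – neo
Do you mean that you cannot convert the HTML output of 'file_get_contents_curl' into a PHP SimpleXMLElement object? My point is that if you can make use of 'SimpleXMLElement', you can very easily do whatever you need. – smarber
Yes, and many warnings are also shown, such as "PHP Warning: SimpleXMLElement::__construct(): English in /var/www/index.php on line 1". – neo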
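The error in these comments is what one would expect when real-world HTML is handed straight to SimpleXMLElement, which requires well-formed XML. A sketch of the difference follows. The markup is a made-up example that is acceptable as HTML but not as XML, and libxml_use_internal_errors() is used here only to keep the parser warnings out of the output:

```php
<?php
// Hypothetical markup that is acceptable HTML but not well-formed XML:
// a bare ampersand and an unclosed <br>.
$html = '<html><body><div id="post_message_1">a & b<br></div>'
      . '<div id="sidebar">x</div>'
      . '<div id="post_message_2">c</div></body></html>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

// Strict XML parsing rejects this markup with an exception.
$xmlFailed = false;
try {
    new SimpleXMLElement($html);
} catch (Exception $e) {
    $xmlFailed = true;
}

// The HTML parser is tolerant, and its result can be imported into SimpleXML.
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$sxml = simplexml_import_dom($doc);
$divs = $sxml->xpath('//div[starts-with(@id, "post_message_")]');

foreach ($divs as $div) {
    echo $div['id'], "\n";
}
```

Parsing with DOMDocument::loadHTML() first, as in the UPDATE above, sidesteps the XML well-formedness requirement.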