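preg_match pattern to scan these prices
I'm trying to scrape prices from THIS page, and I'd like to use preg_match to extract the price from this markup: <span class="price"><b>519,00 €</b></span>
What is the correct preg_match pattern?
This is my extraction script: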
<?php
echo "funziona!";
if(!$fp = fopen("https://www.google.it/webhp?sourceid=chrome-instant&ion=1&espv=2&es_th=1&ie=UTF-8#tbs=vw:l,mr:1&tbm=shop&q=samsung+galaxy+note+4&tbas=0" ,"r")) {
    return false;
} //our fopen is right, so let's go
$content = "";
while(!feof($fp)) { //while it is not the last line, we will add the current line to our $content
    $content .= fgets($fp, 1024);
}
fclose($fp); //we are done here, don't need the main source anymore
?>
<?php
//our fopen, fgets here
//our magic regex here
preg_match_all('/<span class=\"price">(.*?)<\/span>/s',$content, $prices); //THIS IS PREG_MATCH
echo $prices[0][0]."<br />";
?>
I have never used preg_match before, and I'm struggling to adapt this script.
Thanks.
Take a look at this:
<?php
function getUrl($Url,$Options = array(),&$optOut = array())
{
    // Default cURL settings; callers can override any of them through $Options
    $CURL_DEFAULT_SETTINGS = array
    (
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_AUTOREFERER => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_MAXREDIRS => 10,
        CURLOPT_TIMEOUT => 10,
        CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8'
    );
    if (!($ch = curl_init($Url)))
        throw new Exception("Couldn't initialize cURL library",100);
    if (is_array($CURL_DEFAULT_SETTINGS) && count($CURL_DEFAULT_SETTINGS) > 0)
        curl_setopt_array($ch,$CURL_DEFAULT_SETTINGS);
    if (is_array($Options) && count($Options) > 0)
    {
        foreach ($Options as $k => $v)
        {
            curl_setopt($ch,$k,$v);
        }
    }
    $Data = curl_exec($ch);
    $Error = curl_error($ch);
    $optOut['CURLINFO_HEADER_OUT'] = curl_getinfo($ch, CURLINFO_HEADER_OUT);
    curl_close($ch);
    if (!$Data)
    {
        if ($Error)
            throw new Exception($Error);
        return false;
    }
    return $Data;
}
function getPriceFor($query) {
    $data = getUrl('https://www.google.it/search?tbs=vw:l,mr:1&tbm=shop&q='.rawurlencode($query).'&tbas=0&bav=on.2,or.&cad=b&fp=6a24b60e09fe0b18&biw=1196&bih=703&dpr=2&ion=1&espv=2&tch=1&ech=1&psi=byWgVee9A4TNeIXRgLAK.1436558704099.3');
    // The response is treated as JSON chunks separated by /*""*/ markers:
    // drop markers that end a line (such as the final one), turn the rest
    // into commas, and wrap everything in [ ] so it decodes as one array.
    $data = '['.preg_replace('/\/\*""\*\//msi',',',preg_replace('/\/\*""\*\/[\s]*$/msi','',$data)).']';
    $data = json_decode($data,true);
    // $data[3]['d'] is the HTML fragment searched for price/seller pairs
    preg_match_all('/<div[\s]+class="_OA"><div><b>([^<]+)[\s]*<\/b><\/div><div>([^<]+)<\/div><\/div>/msi',$data[3]['d'],$res);
    $re = array();
    foreach ($res[1] as $k=>$r)
        $re[] = array('price'=>$r,'from'=>$res[2][$k]);
    return $re;
}
print_r(getPriceFor('samsung galaxy note 4'));
That should output something like this:
Array
(
[0] => Array
(
[price] => 515,00 €
[from] => phoneshopping.it
)
[1] => Array
(
[price] => 519,00 €
[from] => Smartyrama
)
[2] => Array
(
[price] => 519,00 €
[from] => Smartyrama
)
[3] => Array
(
[price] => 519,00 €
[from] => Smartyrama
)
[4] => Array
(
[price] => 690,45 €
[from] => Amazon.it - Seller
)
[5] => Array
(
[price] => 673,99 €
[from] => da 2 negozi
)
[6] => Array
(
[price] => 345,00 €
[from] => da 2 negozi
)
[7] => Array
(
[price] => 342,00 €
[from] => Amazon.it - Seller
)
[8] => Array
(
[price] => 699,99 €
[from] => ePRICE.it
)
[9] => Array
(
[price] => 730,00 €
[from] => in oltre 5 negozi
)
[10] => Array
(
[price] => 20,00 €
[from] => Amazon.it - Seller
)
[11] => Array
(
[price] => 208,99 €
[from] => eGlobal Central Italia
)
[12] => Array
(
[price] => 711,00 €
[from] => in oltre 5 negozi
)
[13] => Array
(
[price] => 322,99 €
[from] => eGlobal Central Italia
)
[14] => Array
(
[price] => 40,09 €
[from] => da 4 negozi
)
[15] => Array
(
[price] => 15,99 €
[from] => acadattatore.com
)
[16] => Array
(
[price] => 339,99 €
[from] => ePRICE.it
)
[17] => Array
(
[price] => 412,90 €
[from] => da 3 negozi
)
[18] => Array
(
[price] => 343,33 €
[from] => Amazon.it - Seller
)
[19] => Array
(
[price] => 629,00 €
[from] => BestPriceStore
)
)
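Thank you and Chris, I really appreciate your support. Tin, when I try your code I get this error: Fatal error: Uncaught exception 'Exception' with message 'SSL certificate problem: unable to get local issuer certificate' in C:\xampp\htdocs\index.php:41 Stack trace: #0 C:\xampp\htdocs\index.php(50): getUrl('https://www.goo...') #1 C:\xampp\htdocs\index.php(63): getPriceFor('samsung galaxy ...') #2 {main} thrown in C:\xampp\htdocs\index.php on line 41 – leofabri
I see you are using Windows. You have to set up an SSL certificate for curl, or use file_get_contents instead of the getUrl function I call. I can give you further instructions in a few hours. – tin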
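For anyone hitting the same SSL error on Windows, here is a rough sketch of the "set up a certificate for curl" route. It assumes you have downloaded a CA bundle (the curl project publishes one as cacert.pem) and saved it somewhere; the path below is hypothetical, so adjust it to your install. The options are ordinary cURL options that getUrl() would apply through its $Options parameter.

```php
<?php
// Hypothetical location of the downloaded CA bundle; change it to
// wherever you actually saved cacert.pem.
$caBundle = 'C:\\xampp\\php\\extras\\ssl\\cacert.pem';

// Extra cURL options: keep certificate verification on and tell cURL
// which CA bundle to trust.
$sslOptions = array(
    CURLOPT_SSL_VERIFYPEER => true,
    CURLOPT_CAINFO         => $caBundle,
);

// With the getUrl() function from the answer, the call would look like:
// $html = getUrl('https://www.example.com/', $sslOptions);
```

Keeping verification on and supplying a bundle is safer than switching CURLOPT_SSL_VERIFYPEER off, which would also silence the error but accept any certificate.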
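Oh, your program works fine on Ubuntu. Yes, I initially tried your code on a XAMPP machine, but since I want to run it on an Ubuntu-based machine, I don't need to modify the code for Windows. I really appreciate your support; you have been clear and direct. – leofabri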
You should use a parser, not a regex, for this task. Here is an example of how it can be done using simple html dom parser.
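include_once 'simple_html_dom.php';
$html = file_get_html('http://www.example.com');
foreach ($html->find('span') as $element) {
    // strpos() returns 0 when the class starts with "price", so compare against false explicitly
    if (strpos((string) $element->class, 'price') !== false) {
        echo $element->innertext . "\n";
    }
}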
This is also a fairly loose check, so you may get more results than you want. It only checks that the span's class contains the word price.
http://simplehtmldom.sourceforge.net/manual.htm#section_quickstart
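If a third-party library is not an option, the same parser idea can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath). This is a swapped-in alternative, not the simple html dom API. The sample markup, including the extra old-price-label span, is made up for illustration. The XPath expression matches price only as a whole class token, which tightens the loose "contains" check.

```php
<?php
// Made-up sample markup; in practice this would be the fetched page.
$html = '<div><span class="price"><b>519,00 €</b></span>'
      . '<span class="old-price-label">ignored</span></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep warnings about sloppy HTML quiet
// The XML declaration prefix hints to libxml that the input is UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match spans whose class list contains "price" as a complete token,
// so "old-price-label" does not qualify.
$query = '//span[contains(concat(" ", normalize-space(@class), " "), " price ")]';

$prices = array();
foreach ($xpath->query($query) as $span) {
    $prices[] = trim($span->textContent);
}

print_r($prices);
```

Because the match is on the class token rather than a substring, only the first span is collected here.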
What happens with the current code? You don't need to escape the double quote (\"). You also want index 1, not index 0, for the prices. – chris85
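To make those two points concrete, here is a minimal sketch run against the snippet from the question rather than the live page (the real page may use different markup). The pattern is one possible choice, not the only one: it drops the unnecessary backslash and also captures inside the <b> tag so the result is just the price text.

```php
<?php
// Sample input copied from the question; the surrounding div is assumed.
$content = '<div><span class="price"><b>519,00 €</b></span></div>';

// No backslash is needed before " inside a single-quoted PHP string.
preg_match_all('/<span class="price">\s*<b>(.*?)<\/b>\s*<\/span>/s', $content, $prices);

// $prices[0] holds the full matches (tags included);
// $prices[1] holds what the first capture group matched.
foreach ($prices[1] as $price) {
    echo $price . "<br />";
}
```

With the original pattern, $prices[0][0] would be the whole span element, tags and all, which is why index 1 is the one to read.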
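This should print the prices from the web page, but there is an error. The complete code is in this guide: http://www.1stwebdesigner.com/php-crawler-tutorial/ – leofabri
There is no "correct" preg. regexes + html = bad idea. Use a DOM parser. –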