Web crawler - fetching data from 2000+ web pages (TED website as an example)
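Question:
I wrote a PHP script that will run once a week as a cron job.
The main purpose of this script is to get the details of all the TED talks available on the TED website (used here as an example, to make the question easier to understand).
The script takes about 70 minutes to run and goes over 2000 web pages.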
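My questions are:
1) Is there a better/faster way to fetch each web page? At the moment I call this function every time:
file_get_contents_curl($url)
2) Is it good practice to keep all the talks in an array (which can become quite large)?
3) Is there a better way to get, for example, all the TED talk details from the website? What is the best way to "crawl" the TED website to get all the talks?
**I have checked the option of using the RSS feeds, but they lack some of the details I need.
Thanks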
<?php
define("START_ID", 1);
define("STOP_TED_QUERY", 20);
// Title of the generic page TED returns when a talk ID does not exist or was removed.
// Despite the constant's name, a page with exactly this title is NOT a talk page.
define("VALID_PAGE", "TED | Talks");

/**
 * This script will run as a cron job and will go over all pages
 * on TED http://www.ted.com/talks/view/id/
 * from id 1 until there are no more pages.
 */

/**
 * Get a file using curl (fast).
 * @param string $url url whose content we want to get
 * @return string|false the data of the file, or false if the request failed
 * @author XXXXX
 */
function file_get_contents_curl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

// will hold all talks in an array
$tedTalks = array();
// id to start the query from
$id = START_ID;
// counts consecutive missing pages; tells us when we have run past the last id on the TED website
$endOFQuery = 0;
// get the time
$time_start = microtime(true);

// start the query on the TED website
// if we query 20 pages in a row that do not exist, we stop and assume there are no more
while ($endOFQuery < STOP_TED_QUERY) {
    // get the page of the talk
    $html = file_get_contents_curl("http://www.ted.com/talks/view/id/$id");

    // parsing begins here (skipped when the fetch failed or returned nothing)
    $title = '';
    $doc = new DOMDocument();
    if ($html !== false && $html !== '') {
        @$doc->loadHTML($html);
        $nodes = $doc->getElementsByTagName('title');
        if ($nodes->length > 0) {
            $title = $nodes->item(0)->nodeValue;
        }
    }

    // check whether this is a talk page: strcmp() returns 0 when the title equals the generic title
    if ($title === '' || !strcmp($title, VALID_PAGE)) {
        // failed fetch, removed talk, or the end of the ids, so raise the counter
        // (if we get enough of these in a row we will stop)
        $endOFQuery++;
    } else {
        // this is a valid TED talk, get its details
        // reset the counter for end of query
        $endOFQuery = 0;

        // get the meta tag we need (keywords); start empty so that a page
        // without the tag does not reuse the previous talk's keywords
        $keywords = '';
        $metas = $doc->getElementsByTagName('meta');
        for ($i = 0; $i < $metas->length; $i++) {
            $meta = $metas->item($i);
            if ($meta->getAttribute('name') == 'keywords') {
                $keywords = $meta->getAttribute('content');
            }
        }

        // create a new talk object and populate it
        // (the Talk class is not shown here and is assumed to be defined elsewhere)
        $talk = new Talk();
        // set its TED id from the TED website
        $talk->setID($id);
        // parse the name (the title has unneeded characters after the '|')
        $pipe = strpos($title, '|');
        $talk->setName($pipe === false ? $title : substr($title, 0, $pipe));
        // parse the string of tags into an array
        $keywords = ($keywords === '') ? array() : array_map('trim', explode(",", $keywords));
        // remove unneeded items from it
        $keywords = array_diff($keywords, array("TED", "Talks"));
        // add the filtered tags to the talk
        $talk->setTags($keywords);
        // add to the total talks array
        $tedTalks[] = $talk;
    }

    // move to the next TED talk id to query
    $id++;
} // end of the while

$time_end = microtime(true);
$execution_time = ($time_end - $time_start);
echo "this took (sec) : " . $execution_time;
?>
Answer:
I uploaded the script to github.com, in case anyone is looking for how it works:
https://github.com/Nimrod007/TED-talks-details-from-TED.com-and-youtube
I've published a freemium API on Mashape that implements this script: https://market.mashape.com/bestapi/ted
Enjoy!
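You can use curl multi mode to fetch the pages in parallel. You could also look into Yahoo Pipes, which can do the crawling and parsing of the specific data you need from a page for you. – 2013-02-18 03:42:10
Henley Chiu - can you show a code snippet of curl multi mode? – Nimrod007 2013-02-24 07:51:17
I think there are good examples here: http://php.net/manual/en/function.curl-multi-exec.php – 2013-03-01 13:57:39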
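Since a curl multi snippet was asked for above, here is a minimal sketch of what it could look like for this script. It is only an illustration under assumptions: `fetch_parallel()` and `talk_urls()` are hypothetical helper names that do not appear in the original script, the 30-second timeout is arbitrary, and an empty response is treated as a failure.

```php
<?php
/**
 * Build the talk page URLs for a batch of ids, keyed by id.
 * (Hypothetical helper; the URL pattern is the one used in the question.)
 */
function talk_urls($startId, $count)
{
    $urls = array();
    for ($id = $startId; $id < $startId + $count; $id++) {
        $urls[$id] = "http://www.ted.com/talks/view/id/$id";
    }
    return $urls;
}

/**
 * Fetch several URLs in parallel with curl_multi.
 * (Hypothetical helper, same curl options as file_get_contents_curl.)
 *
 * @param array $urls URLs keyed by any identifier, e.g. the talk id
 * @return array same keys; each value is the body, or false if nothing came back
 */
function fetch_parallel(array $urls)
{
    if (count($urls) === 0) {
        return array();
    }

    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $key => $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30); // assumption: 30 s per page
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none of them is still running.
    $running = 0;
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            // select() can fail on some systems; sleep briefly instead of spinning
            usleep(1000);
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect the bodies and release the handles.
    $results = array();
    foreach ($handles as $key => $ch) {
        $body = curl_multi_getcontent($ch);
        $results[$key] = ($body === null || $body === '') ? false : $body;
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $results;
}

// Usage (commented out so that loading this file makes no requests):
// $pages = fetch_parallel(talk_urls(1, 10));
// foreach ($pages as $id => $html) { /* parse $html as in the script above */ }
```

Fetching in small batches (for example 10 ids at a time) keeps the stop condition from the question workable: after each batch, the pages can be parsed in id order and the "20 missing pages in a row" counter updated as before. A modest batch size is also gentler on the remote server than requesting all 2000 pages at once.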