Data scraping tool: puppeteer (example: scraping Douban data)
github:https://github.com/GoogleChrome/puppeteer#readme
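Install: npm i puppeteer -S
Problem during installation: Error: EBUSY: resource busy or locked
Attempted fix: make sure node is on the latest stable version, delete node_modules, and reinstall the dependencies.
Even after that, puppeteer still would not install. Installing with cnpm instead succeeded.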
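Problems encountered during actual scraping
Suppose only the Chinese text of a scraped string is wanted. In the record below, the title property is mixed with spaces and line breaks, so how can the text be extracted step by step?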
{ doubanId: 26302614,
title:
'\n \n\n 请回答1988\n\n \n 9.7\n \n',
rate: 9.7,
poster:
'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2272563445.jpg' },
First remove the line breaks, then remove the spaces, then process the remaining text:
let title = it.find('p').text().replace(/[\r\n]/g,"").replace(/\ +/g,"")
{ doubanId: 26415300,
title: '天下足球9.7',
rate: 9.7,
poster:
'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2264941730.jpg' },
{ title: '', rate: 0 }
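The two replace() calls above can be tried outside the browser on the raw title of the first record. This is only a minimal sketch in plain Node.js; the helper name stripWhitespace is hypothetical and not part of the original script, and the exact whitespace in the sample string is illustrative.

```javascript
// Hypothetical helper (not in the original script): removes line breaks
// first, then runs of spaces, mirroring the two replace() calls above.
function stripWhitespace(raw) {
  return raw.replace(/[\r\n]/g, '').replace(/ +/g, '')
}

// Raw title of the first record; the whitespace layout is illustrative.
const raw = '\n \n\n 请回答1988\n\n \n 9.7\n \n'
console.log(stripWhitespace(raw)) // prints "请回答19889.7"
```

The rating ends up glued to the title because both sit inside the same p element, which is why a further extraction step is needed.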
Extract the Chinese characters:
var reg = /[\u4e00-\u9fa5]/g;
title = title.match(reg)
title =title.join('')
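In my node environment the join() call threw an error, so the title had to be saved as an array instead. A likely cause: match() returns null when nothing matches (as with the empty title above), and null has no join() method.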
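The join() failure described above is consistent with match() returning null for a title that contains no Chinese characters. A possible workaround is sketched below; extractChinese is a hypothetical helper name, not something from the original script.

```javascript
// Hypothetical helper: keeps only characters in the \u4e00-\u9fa5 range.
// match() with the g flag returns null when nothing matches, so fall back
// to an empty array before calling join().
function extractChinese(text) {
  const chars = text.match(/[\u4e00-\u9fa5]/g) || []
  return chars.join('')
}

console.log(extractChinese('天下足球9.7'))   // prints "天下足球"
console.log(extractChinese('请回答19889.7')) // prints "请回答"
console.log(extractChinese('') === '')       // prints true
```

Note that this pattern also drops digits that belong to the title itself, so 请回答1988 comes out as 请回答.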
Run the scraping script:
node server/crawler/trailer-list.js
The complete scraping code:
const puppeteer = require('puppeteer')

// Keep the URL on one line: a line break inside a template literal becomes part of the URL
const url = `https://movie.douban.com/tv/#!type=tv&tag=%E7%83%AD%E9%97%A8&sort=rank&page_limit=20&page_start=0`

// Resolves after `time` milliseconds
const sleep = time => new Promise(resolve => {
  setTimeout(resolve, time)
})

;(async () => {
  console.log('Start scraping data')

  const browser = await puppeteer.launch({
    args: ['--no-sandbox'],
    dumpio: false
  })

  const page = await browser.newPage()
  await page.goto(url, {
    waitUntil: 'networkidle2'
  })

  await sleep(3000)

  // Wait for the .more element (presumably the "load more" button), then click it once
  await page.waitForSelector('.more')
  for (let index = 0; index < 1; index++) {
    await sleep(3000)
    await page.click('.more')
  }

  // This callback runs inside the page and uses the $ that the page itself exposes
  const result = await page.evaluate(() => {
    var $ = window.$
    var items = $('.list-wp a')
    var links = []

    if (items.length >= 1) {
      items.each((index, item) => {
        let it = $(item)
        console.log(it)
        let doubanId = it.find('div').data('id')
        // let title = it.find('p').text()
        // Strip line breaks and spaces, then keep only the Chinese characters.
        // match() returns an array of characters, or null when nothing matches.
        let title = it.find('p').text().replace(/[\r\n]/g, "").replace(/\ +/g, "")
        var reg = /[\u4e00-\u9fa5]/g
        title = title.match(reg)
        let rate = Number(it.find('strong').text())
        let poster = it.find('img').attr('src')
        links.push({
          doubanId,
          title,
          rate,
          poster
        })
      })
    }
    return links
  })

  // Wait for the browser to shut down before printing the result
  await browser.close()
  console.log(result)
})()
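Since the script returns each title as an array of characters (or null), the result could be tidied up in Node after page.evaluate() finishes. The following is only a sketch under that assumption; normalizeItems is a hypothetical helper, and dropping entries without a doubanId is a design choice made here, not something the original script does.

```javascript
// Hypothetical post-processing step for the array returned by page.evaluate().
// Drops entries without a doubanId and turns the title array into a string.
function normalizeItems(items) {
  return items
    .filter(item => item.doubanId !== undefined && item.doubanId !== null)
    .map(item => ({
      ...item,
      // title is an array of characters, or null when nothing matched
      title: Array.isArray(item.title) ? item.title.join('') : ''
    }))
}

// Sample data shaped like the script's output.
const sample = [
  { doubanId: 26415300, title: ['天', '下', '足', '球'], rate: 9.7 },
  { title: null, rate: 0 }
]
console.log(normalizeItems(sample))
```

Doing this in Node rather than inside page.evaluate() keeps the in-page code short and makes the cleanup easy to test without launching a browser.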