Data Storage for Web Crawlers (TXT, JSON, CSV)
TXT Text Storage
Save the content of Zhihu's Explore section to a txt file
import requests
from pyquery import PyQuery as pq

url = "https://www.zhihu.com/explore"
# Send a browser-like User-Agent with the request
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome"
}
html = requests.get(url, headers=headers).text
doc = pq(html)
items = doc(".explore-tab .feed-item").items()
# Open the file once in append mode instead of reopening it for every item
with open("explore.txt", "a", encoding="utf-8") as file:
    for item in items:
        question = item.find("h2").text()
        author = item.find(".author-link-line").text()
        # Re-parse the answer HTML so that only its plain text is kept
        answer = pq(item.find(".content").html()).text()
        # Write the question as well; it was extracted but never saved
        file.write("\n".join([question, author, answer]))
        file.write("\n" + "=" * 50 + "\n")
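File open modes: the second argument to open() selects the mode, for example "r" to read, "w" to write (overwriting any existing content), and "a" to append, as used above.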
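The effect of the mode argument can be sketched as follows. This is a minimal illustration, not part of the crawler: the file name modes_demo.txt and the temporary directory are assumptions chosen so the snippet does not touch explore.txt.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# "w" creates the file, or truncates it if it already exists.
with open(path, "w", encoding="utf-8") as f:
    f.write("first\n")

# "a" keeps the existing content and writes at the end.
with open(path, "a", encoding="utf-8") as f:
    f.write("second\n")

# "r" opens the file read-only.
with open(path, "r", encoding="utf-8") as f:
    after_append = f.read()

# Opening with "w" again discards everything written so far.
with open(path, "w", encoding="utf-8") as f:
    f.write("third\n")

with open(path, "r", encoding="utf-8") as f:
    after_overwrite = f.read()

print(repr(after_append))
print(repr(after_overwrite))
```

Append mode suits a crawler that is run repeatedly and should keep earlier results; write mode suits a file that should be regenerated on every run.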
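JSON File Storage
Reading JSON
Call the json library's loads() method to convert a JSON text string into a Python object (here, a list of dicts), and its dumps() method to convert such an object back into a text string.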
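import json

# Named json_text rather than str so that the built-in str type is not shadowed
json_text = '''
[{
    "name": "Bob",
    "gender": "male",
    "birthday": "1992-10-18"
}, {
    "name": "Selina",
    "gender": "female",
    "birthday": "1995-10-18"
}]
'''
print(type(json_text))
word = json.loads(json_text)
print(word)
print(type(word))
Output:
<class 'str'>
[{'name': 'Bob', 'gender': 'male', 'birthday': '1992-10-18'}, {'name': 'Selina', 'gender': 'female', 'birthday': '1995-10-18'}]
<class 'list'>
There are two ways to read a value: square brackets with the key name, or the get() method with the key name. get() also accepts a second argument, a default value that is returned when the key is missing, whereas square brackets raise KeyError in that case.
word = json.loads(json_text)
print(word[0]["name"])
print(word[0].get("name"))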
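The difference between the two access styles only shows up when a key is missing. The sketch below illustrates this with a shortened record; the key "age" and the default value 25 are assumptions made up for the example.

```python
import json

# Shortened, hypothetical record: it deliberately has no "age" key.
text = '[{"name": "Bob", "gender": "male"}]'
first = json.loads(text)[0]

name = first["name"]                # key exists, so both styles work
age = first.get("age")              # missing key: get() returns None
age_default = first.get("age", 25)  # missing key: get() returns the default

# Square brackets raise KeyError for a missing key.
try:
    first["age"]
    missing_raises = False
except KeyError:
    missing_raises = True

print(name, age, age_default, missing_raises)
```

For scraped data, where fields are often absent, get() with a default tends to be the safer choice.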
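Writing JSON
The dumps() method converts a Python object into a JSON string.
import json

# Named data rather than str so that the built-in str type is not shadowed
data = [{
    "name": "Bob",
    "gender": "male",
    "birthday": "1992-10-18"
}, {
    "name": "Selina",
    "gender": "female",
    "birthday": "1995-10-18"
}]
with open("datas.txt", "w", encoding="utf-8") as file:
    file.write(json.dumps(data))
dumps() also accepts an indent parameter, the number of spaces used per indentation level.
To write Chinese characters as they are rather than as \uXXXX escapes, also set ensure_ascii=False, and specify the encoding of the output file:
with open("datas.txt", "w", encoding="utf-8") as file:
    file.write(json.dumps(data, ensure_ascii=False))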
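The effect of indent and ensure_ascii is easier to see on a record that actually contains Chinese text. A minimal sketch follows; the record with the name 王伟 is a made-up example and is not part of the data above.

```python
import json

# Hypothetical record containing non-ASCII text, for illustration only.
record = [{"name": "王伟", "gender": "male"}]

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(record, ensure_ascii=False)

# indent=2 puts each element on its own line, indented by two spaces per level.
pretty = json.dumps(record, indent=2, ensure_ascii=False)

print(escaped)
print(readable)
print(pretty)
```

Both forms decode to the same object with json.loads(); ensure_ascii=False only changes how readable the file is to a human.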
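CSV File Storage
Writing CSV files
import csv

# newline="" prevents blank rows between records on Windows;
# the encoding matches the one used when the file is read back
with open("datas.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["id", "name", "age"])
    writer.writerow(["001", "wuyou", "21"])
    writer.writerow(["002", "chenwei", "20"])
To change the separator between columns, pass the delimiter parameter to csv.writer().
You can also call writerows() to write several rows at once; its argument must then be a two-dimensional list.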
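Both options can be sketched together. This example writes to an in-memory io.StringIO buffer instead of datas.csv so that the result can be inspected directly; the buffer and the choice of a space as the delimiter are assumptions made for illustration.

```python
import csv
import io

# Two-dimensional list: one inner list per row.
rows = [
    ["001", "wuyou", "21"],
    ["002", "chenwei", "20"],
]

# In-memory buffer standing in for a file opened with newline="".
buffer = io.StringIO()

# delimiter=" " separates columns with a space instead of a comma.
writer = csv.writer(buffer, delimiter=" ")
writer.writerow(["id", "name", "age"])  # header: a single row
writer.writerows(rows)                  # body: all rows in one call

output = buffer.getvalue()
print(output)
```

A file written with a custom delimiter has to be read back with the same delimiter argument passed to csv.reader().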
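Reading CSV files
Using the csv library
import csv

with open("datas.csv", "r", newline="", encoding="utf-8") as csvfile:
    reader = csv.reader(csvfile)
    # Each row is returned as a list of strings
    for row in reader:
        print(row)
Using the read_csv method of the pandas library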
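A minimal sketch of the pandas approach, assuming pandas is installed. An in-memory string stands in for datas.csv so that the snippet is self-contained; with a real file, the path would be passed to read_csv() instead. The dtype={"id": str} argument is an added choice, not something the text above prescribes: without it pandas would parse the id column as integers and drop the leading zeros.

```python
import io

import pandas as pd

# Same content as the datas.csv written earlier, held in memory for illustration.
csv_text = "id,name,age\n001,wuyou,21\n002,chenwei,20\n"

# read_csv() takes the first line as column names and returns a DataFrame.
# dtype={"id": str} keeps ids such as "001" as text.
df = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(df)
print(df["name"].tolist())
```

Unlike csv.reader, which yields every field as a string, read_csv() infers a type for each column, so age comes back as integers here.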