Data Storage for Web Crawlers (TXT, JSON, CSV)

TXT Text Storage

Save the content of Zhihu's Explore section to a TXT file.

import requests
from pyquery import PyQuery as pq

url = "https://www.zhihu.com/explore"
# Send a browser-like User-Agent header with the request
myheader = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome"
}
html = requests.get(url, headers=myheader).text
doc = pq(html)
items = doc(".explore-tab .feed-item").items()
# Open the file once in append mode; the with block closes it automatically
with open("explore.txt", "a", encoding="utf-8") as file:
    for item in items:
        question = item.find("h2").text()
        author = item.find(".author-link-line").text()
        answer = pq(item.find(".content").html()).text()
        # One record per item: question, author, answer, then a separator line
        file.write("\n".join([question, author, answer]))
        file.write("\n" + "=" * 50 + "\n")

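Open modes:
The second argument to open() sets the mode. The example above uses "a" (append to the end of the file, creating it if needed); "w" overwrites any existing content, and "r" opens the file for reading only.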
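To illustrate how the open mode changes the result, here is a minimal sketch that needs no network access. The helper write_records, the sample records, and the file name explore_demo.txt are hypothetical and only stand in for the scraped data.

```python
import os
import tempfile


def write_records(path, records, mode):
    """Write (author, answer) pairs to path, each followed by a '=' separator line."""
    with open(path, mode, encoding="utf-8") as f:
        for author, answer in records:
            f.write("\n".join([author, answer]))
            f.write("\n" + "=" * 50 + "\n")


# Hypothetical sample data standing in for scraped results
records = [("Alice", "First answer"), ("Bob", "Second answer")]
path = os.path.join(tempfile.mkdtemp(), "explore_demo.txt")

write_records(path, records, "w")      # "w" creates the file or truncates it
write_records(path, records[:1], "a")  # "a" keeps the content and appends

with open(path, "r", encoding="utf-8") as f:
    content = f.read()

# Two records from the first call plus one appended record
separator_count = content.count("=" * 50)
print(separator_count)
```

Calling write_records with "w" a second time would instead discard the earlier records, which is why the crawler script uses "a" when it is run repeatedly.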

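JSON File Storage

Reading JSON
Call the json library's loads() method to convert a JSON text string into a Python object (a list or dict), and call dumps() to convert a Python object back into a JSON text string.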

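import json

# A JSON array as text; JSON strings must use double quotes
json_text = '''
[{
   "name":"Bob",
   "gender":"male",
   "birthday":"1992-10-18"
   },{
   "name":"Selina",
   "gender":"female",
   "birthday":"1995-10-18"
}]
'''
print(type(json_text))
word = json.loads(json_text)
print(word)
print(type(word))

Output:

<class 'str'>
[{'name': 'Bob', 'gender': 'male', 'birthday': '1992-10-18'}, {'name': 'Selina', 'gender': 'female', 'birthday': '1995-10-18'}]
<class 'list'>

There are two ways to read a value by key: index with square brackets and the key name, or call get() with the key name. get() also accepts a second argument, a default value that is returned when the key is missing.

word = json.loads(json_text)
print(word[0]["name"])
print(word[0].get("name"))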
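The difference between the two access styles only shows up when a key is missing. A small sketch, assuming a one-record JSON string and a hypothetical "age" key that the record does not contain:

```python
import json

json_text = '[{"name": "Bob", "gender": "male", "birthday": "1992-10-18"}]'
person = json.loads(json_text)[0]

name = person.get("name")            # key exists, same result as person["name"]
age = person.get("age")              # key missing, get() returns None
age_default = person.get("age", 25)  # key missing, get() returns the default

# Square brackets raise KeyError for a missing key instead of returning None
try:
    person["age"]
    missing_raises = False
except KeyError:
    missing_raises = True

print(name, age, age_default, missing_raises)
```

get() is therefore the safer choice for scraped data in which some fields may be absent.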

Writing JSON
The dumps() method converts a Python object into a JSON string.

import json

# A list of dicts to serialize
data = [{
   "name":"Bob",
   "gender":"male",
   "birthday":"1992-10-18"
   },{
   "name":"Selina",
   "gender":"female",
   "birthday":"1995-10-18"
}]
with open("datas.txt", "w", encoding="utf-8") as file:
    file.write(json.dumps(data))

dumps() also accepts an indent parameter, the number of spaces used for each indentation level.
To write Chinese characters as they are rather than as \uXXXX escapes, also set ensure_ascii to False and specify the encoding of the output file:

with open("datas.txt", "w", encoding="utf-8") as file:
    file.write(json.dumps(data, indent=2, ensure_ascii=False))
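The effect of the two parameters is easiest to see by comparing the strings that dumps() returns. The sketch below uses a hypothetical record with Chinese values and prints to the console instead of writing a file.

```python
import json

# Hypothetical sample record containing non-ASCII text
data = [{"name": "王伟", "gender": "男", "birthday": "1992-10-18"}]

escaped = json.dumps(data)                               # default: non-ASCII is escaped
readable = json.dumps(data, ensure_ascii=False)          # characters kept as-is
pretty = json.dumps(data, indent=2, ensure_ascii=False)  # one key per line, 2-space indent

print(escaped)
print(readable)
print(pretty)
```

All three strings parse back to the same Python object with json.loads(); only the text representation differs.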

CSV File Storage

Writing a CSV File

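import csv

# newline="" stops the csv module's line endings from being translated,
# which otherwise produces blank rows on Windows
with open("datas.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["id", "name", "age"])  # header row
    writer.writerow(["001", "wuyou", "21"])
    writer.writerow(["002", "chenwei", "20"])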

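To change the separator between columns, pass the delimiter parameter to csv.writer().
You can also call writerows() to write several rows at once; its argument must then be a two-dimensional list.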
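Both options can be combined, as in this minimal sketch. It writes to an in-memory io.StringIO buffer instead of datas.csv so the result can be inspected directly; the choice of a space as the delimiter is only an example.

```python
import csv
import io

buffer = io.StringIO()
# delimiter=" " separates columns with a space instead of a comma
writer = csv.writer(buffer, delimiter=" ")
writer.writerow(["id", "name", "age"])
# writerows() takes a list of rows, i.e. a two-dimensional list
writer.writerows([
    ["001", "wuyou", "21"],
    ["002", "chenwei", "20"],
])

lines = buffer.getvalue().splitlines()
print(lines)
```

When reading such a file back, the same delimiter has to be passed to csv.reader().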

Reading a CSV File

Using the csv library

import csv

with open("datas.csv", "r", newline="", encoding="utf-8") as csvfile:
    reader = csv.reader(csvfile)
    # Each row is returned as a list of strings
    for row in reader:
        print(row)

Using the read_csv method of the pandas library