如何按类别刮取文本并生成json文件?
问题描述:
我们刮网站www.theft-alerts.com。现在我们得到所有的文字。如何按类别刮取文本并生成json文件?
connection = urllib2.urlopen('http://www.theft-alerts.com')
soup = BeautifulSoup(connection.read().replace("<br>","\n"), "html.parser")
theftalerts = []
for sp in soup.select("table div.itemspacingmodified"):
for wd in sp.select("div.itemindentmodified"):
text = wd.text
if not text.startswith("Images :"):
print(text)
with open("theft-alerts.json", 'w') as outFile:
json.dump(theftalerts, outFile, indent=2)
输出:
STOLEN : A LARGE TAYLORS OF LOUGHBOROUGH BELL
Stolen from Bromyard on 7 August 2014
Item : The bell has a diameter of 37 1/2" is approx 3' tall weighs just shy of half a ton and was made by Taylor's of Loughborough in 1902. It is stamped with the numbers 232 and 11.
The bell had come from Co-operative Wholesale Society's Crumpsall Biscuit Works in Manchester.
Any info to : PC 2361. Tel 0300 333 3000
Messages : Send a message
Crime Ref : 22EJ/50213D-14
No of items stolen : 1
Location : UK > Hereford & Worcs
Category : Shop, Pub, Church, Telephone Boxes & Bygones
ID : 84377
User : 1 ; Antique/Reclamation/Salvage Trade ; (Administrator)
Date Created : 11 Aug 2014 15:27:57
Date Modified : 11 Aug 2014 15:37:21;
怎能类别文本的JSON文件。 JSON文件现在是空的。
输出JSON:
[]
答
您可以定义列表,并追加您创建的列表中的所有字典对象。 e.g:
import json
theftalerts = [];
atheftobject = {};
atheftobject['location'] = 'UK > Hereford & Worcs';
atheftobject['category'] = 'Shop, Pub, Church, Telephone Boxes & Bygones';
theftalerts.append(atheftobject);
atheftobject['location'] = 'UK';
atheftobject['category'] = 'Shop';
theftalerts.append(atheftobject);
with open("theft-alerts.json", 'w') as outFile:
print(json.dump(theftalerts, outFile, indent=2))
在此之后运行theft-alerts.json
将包含此JSON对象:
[
{
"category": "Shop",
"location": "UK"
},
{
"category": "Shop",
"location": "UK"
}
]
你可以玩这个生成自己的JSON对象。 结帐json模块
答
您的JSON输出保持空白,因为您的循环不附加到列表。
这是我会怎么提取类别名称: