Python: loop through a .csv of URLs and save using another column as the filename
New to Python; I've read a bunch and watched a lot of videos, but I can't get this to work and I'm getting frustrated.
I have a list of links that looks like this:
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
I'm trying to get Python to go to each "URL" and save it in a folder named after "Location", with "API".las as the file name.
EX) ...... "Location"/Sec/"API".las  ->  C://.../T32S R29W/Sec. 27/15-119-00164.las
The file has hundreds of rows of links to download. I also want to implement a sleep function so I don't bomb the server.
What are some different ways to do this? I've tried pandas and a few other methods... any ideas?
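(A minimal sketch of what the question asks for, assuming Python 2 and the column layout shown above: one folder per "Location", one file per "API", and a pause between downloads. The MeadeLAS.csv filename comes from the comments below; the one-second delay is an arbitrary choice.)
import csv
import os
import time
import urllib

with open('MeadeLAS.csv', 'rb') as f:
    reader = csv.DictReader(f)                   # keys come from the header row
    for row in reader:
        # "T32S R29W, Sec. 27, SW SW NE" -> ["T32S R29W", "Sec. 27", "SW SW NE"]
        parts = [p.strip() for p in row['Location'].split(',')]
        out_dir = os.path.join(parts[0], parts[1])
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)                 # e.g. "T32S R29W/Sec. 27"
        out_path = os.path.join(out_dir, row['API'] + '.las')
        urllib.urlretrieve(row['URL'], out_path) # download the linked file
        time.sleep(1)                            # don't bomb the server
(Note the sample URLs point at .zip archives, so the saved bytes are zip data rather than .las text; unpacking first would need the zipfile module.)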
You will have to do something like this:
import urllib

for link, file_name in zip(links, file_names):
    u = urllib.urlopen(link)
    udata = u.read()
    f = open(file_name + ".las", 'wb')   # 'wb': the downloads are binary zip data
    f.write(udata)
    f.close()
    u.close()
If the contents of the files are not what you want, you might want to look at a scraping library like BeautifulSoup for parsing.
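(The question also asked for a delay between requests; a hedged addition to the loop above would be time.sleep from the standard library, placed at the end of each iteration. The one-second value is arbitrary.)
import time
# at the end of each pass through the loop above:
time.sleep(1)   # pause between downloads so the server isn't hammered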
Method 1:
Assume your file has 1000 rows. Create a masterlist which has the data stored in this form:
[row1, row2, row3, ...]
Once that's done, loop over it. On every iteration you get one row in string format; split it to make a list, splice out the last column (the url), i.e. row[-1],
and append it to an empty list named result_url. Once it has run through all the rows, save the list to a file; you can easily create a directory using the os module and move your file over there. A sketch of this approach follows.
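(A minimal sketch of Method 1 as just described, assuming Python 2 and the data.csv / Locations / API.las names that appear in the Method 2 code below; with the csv module each row already arrives as a list, so no manual split is needed.)
import csv, os

result_url = []                          # the empty list the answer describes
with open('data.csv', 'rb') as f:
    reader = csv.reader(f)
    reader.next()                        # skip the header row
    masterlist = [row for row in reader] # [row1, row2, row3, ...]

for row in masterlist:
    result_url.append(row[-1])           # splice out the last column, the url

os.mkdir('Locations')                    # create a directory with the os module
with open('./Locations/API.las', 'w') as out:
    out.write('\n'.join(result_url))     # save every url to one file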
Method 2:
If the file is too huge, read it line by line inside a try block and process your data as you go (with the csv module you get each row as a list, so splice out the url and write it to the file API.las each time).
Once your program moves on to the 1001st row it will jump to the except block, where you can 'pass' or add a print to get notified.
In method 2 you are not holding all the data in any data structure, you only store a single row at a time while executing, so it is faster.
import csv, os

os.mkdir('Locations')
fme = open('./Locations/API.las', 'w+')
with open('data.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    print spamreader.next()            # print (and skip) the header row
    while True:
        try:
            row = spamreader.next()
            get_url = row[-1]          # the url is the last column
            to_write = get_url + '\n'
            fme.write(to_write)
        except StopIteration:          # raised once every row has been read
            print "Program has run. Check output."
            fme.close()
            exit(1)
This code should do everything you mentioned efficiently and in less time.
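(Note the code above writes the url column into Locations/API.las rather than downloading anything. As a hedged extension, the fme.write(to_write) line inside the try block could be swapped for an actual download plus the sleep the question asked for; urllib.urlretrieve and the one-second pause are additions, not part of this answer.)
import time
import urllib

# hypothetical replacement for fme.write(to_write) inside the try block:
urllib.urlretrieve(get_url, './Locations/' + row[6] + '.las')  # row[6] is the "API" column
time.sleep(1)   # throttle requests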
The file has 240,163 rows and the unique url addresses need to be saved – gdink1020
Please use method 2 (I have re-edited it). Don't accumulate all the data in a list. –
@gdink1020 is my code not working? –
This might be a bit dirty, but it's a first pass at attacking the problem. It all hinges on every value in the CSV being wrapped in double quotes. If that's not true, this solution will need heavy adjusting.
Code:
import os

csv = """
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
""".strip() # trim excess space at top and bottom

root_dir = '/tmp/so_test'

lines = csv.split('\n') # break CSV on newlines
header = lines[0].strip('"').split('","') # grab first line and consider it the header

lines_d = [] # we're about to perform the core actions, and we're going to store it in this variable
for l in lines[1:]: # we want all lines except the top line, which is a header
    line_broken = l.strip('"').split('","') # strip off leading and trailing double-quote
    line_assoc = zip(header, line_broken) # creates a tuple of tuples out of the line with the header at matching position as key
    line_dict = dict(line_assoc) # turn this into a dict
    lines_d.append(line_dict)

    section_parts = [s.strip() for s in line_dict['Location'].split(',')] # break Section value to get pieces we need

    file_out = os.path.join(root_dir, '%s%s%s%sAPI.las' % (section_parts[0], os.path.sep, section_parts[1], os.path.sep)) # format output filename the way I think is requested

    # stuff to show what's actually put in the files
    print file_out, ':'
    print ' ', '"%s"' % ('","'.join(header),)
    print ' ', '"%s"' % ('","'.join(line_dict[h] for h in header))
Output:
~/so_test $ python so_test.py
/tmp/so_test/T32S R29W/Sec. 27/API.las :
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
/tmp/so_test/T34S R26W/Sec. 2/API.las :
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
~/so_test $
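(The loop above only prints what would be written. Under the same assumptions, the three print lines could be replaced with something like the following to actually create the directories and fetch each file; urllib.urlretrieve and the pause are additions to the original answer.)
import time
import urllib

out_dir = os.path.dirname(file_out)        # e.g. /tmp/so_test/T32S R29W/Sec. 27
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)
urllib.urlretrieve(line_dict['URL'], file_out)  # download instead of printing
time.sleep(1)                              # spread the requests out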
What have you tried so far? – Mortz
`import pandas as pd  data = pd.read_csv('MeadeLAS.csv')  links = data.URL  file_names = data.API  for link, file_name in zip(links, file_names):  file = pd.read_csv(link).to_csv(file_name+'.las', index=False)` – gdink1020
@Mortz forgot to tag you – gdink1020