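Comparing two CSV files with Python pandas
Question:
I have two CSV files, each consisting of two columns.
The first column holds the product ID and the second holds the serial number.
I need to look up all the serial numbers from the first CSV and find matches in the second CSV. The resulting report should contain each matched serial number and the corresponding product ID from each CSV, in separate columns. I tried modifying the code below, with no luck.
How would you approach this?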
import pandas as pd
A=set(pd.read_csv("c1.csv", index_col=False, header=None)[0]) #reads the csv, takes only the first column and creates a set out of it.
B=set(pd.read_csv("c2.csv", index_col=False, header=None)[0]) #same here
print(A-B) #set A - set B gives back everything thats only in A.
print(B-A) # same here, other way around.
Answer
I think you need merge:
import pandas as pd

A = pd.DataFrame({'product id': [1455, 5452, 3775],
                  'serial number': [44, 55, 66]})
print (A)
B = pd.DataFrame({'product id': [7000,2000,1000],
'serial number':[44,55,77]})
print (B)
print (pd.merge(A, B, on='serial number'))
   product id_x  serial number  product id_y
0          1455             44          7000
1          5452             55          2000
Just one small modification is needed: in the snippet above, how can I pass two file names as input instead of hard-coded values? – poyim
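Regarding the follow-up comment above, here is a minimal sketch of the same merge reading from two files instead of hard-coded values. The function name match_serials and the column labels are made up for illustration, and it assumes that neither file has a header row and that the columns are ordered product ID, then serial number; the question does not show the real file layout.

```python
import pandas as pd

# Assumption: no header row, columns ordered as product id, serial number.
COLUMNS = ["product id", "serial number"]


def match_serials(path_a, path_b):
    """Return the rows whose serial number occurs in both CSV files."""
    a = pd.read_csv(path_a, header=None, names=COLUMNS)
    b = pd.read_csv(path_b, header=None, names=COLUMNS)
    # An inner merge keeps only serial numbers present in both files;
    # the suffixes show which file each product id came from.
    return pd.merge(a, b, on="serial number", suffixes=("_a", "_b"))
```

Calling match_serials("c1.csv", "c2.csv").to_csv("report.csv", index=False) would then write the report to disk. If the files do have a header row, dropping header=None and names=COLUMNS and merging on the real column name should be enough.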
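Answer
import pandas as pd

first_one = pd.read_csv(file_path)           # file_path: path to the first CSV
second_one = pd.read_csv(second_file_path)   # same way for second_one (placeholder name)
# If product_id is the first column, its position is 0.
# Note: this compares the files row by row, so it only finds values
# that sit on the same row in both files.
for i in range(min(len(first_one), len(second_one))):
    if first_one.iloc[i, 0] == second_one.iloc[i, 0]:
        # It is a match; do whatever you want with this matched data.
        print(first_one.iloc[i, 0])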
Answer
Try this:
A = pd.read_csv("c1.csv", header=None, usecols=[0], names=['col']).drop_duplicates()
B = pd.read_csv("c2.csv", header=None, usecols=[0], names=['col']).drop_duplicates()
# A - B
pd.merge(A, B, on='col', how='left', indicator=True).query("_merge == 'left_only'")
# B - A
pd.merge(A, B, on='col', how='right', indicator=True).query("_merge == 'right_only'")
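The two merges above can also be folded into a single outer merge, since the _merge column already records where each value came from. The sketch below uses made-up in-memory values in place of c1.csv and c2.csv so that it runs on its own.

```python
import pandas as pd

# Made-up values standing in for the first column of c1.csv and c2.csv.
A = pd.DataFrame({"col": [44, 55, 66]})
B = pd.DataFrame({"col": [44, 55, 77]})

# An outer merge keeps every value; _merge is 'left_only', 'right_only' or 'both'.
merged = pd.merge(A, B, on="col", how="outer", indicator=True)

only_in_A = merged.loc[merged["_merge"] == "left_only", "col"].tolist()
only_in_B = merged.loc[merged["_merge"] == "right_only", "col"].tolist()
in_both = merged.loc[merged["_merge"] == "both", "col"].tolist()
print(only_in_A, only_in_B, in_both)
```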
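Answer
You can convert each DataFrame to a set of row tuples, which ignores the index when comparing the data, and then use set symmetric_difference.
ds1 = set(tuple(values) for values in df1.values.tolist())
ds2 = set(tuple(values) for values in df2.values.tolist())
ds1.symmetric_difference(ds2)  # rows that appear in only one of the two frames
print(df1, '\n\n')
print(df2, '\n\n')
print(pd.DataFrame(list(ds1.difference(ds2))), '\n\n')  # rows only in df1
print(pd.DataFrame(list(ds2.difference(ds1))), '\n\n')  # rows only in df2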
DF1
    id  Name  score  isEnrolled               Comment
0  111  Jack   2.17        True  He was late to class
1  112  Nick   1.11       False             Graduated
2  113   Zoe   4.12        True                   NaN
DF2
    id  Name  score  isEnrolled               Comment
0  111  Jack   2.17        True  He was late to class
1  112  Nick   1.21       False             Graduated
2  113   Zoe   4.12       False           On vacation
Output
     0     1     2      3            4
0  113   Zoe  4.12   True          NaN
1  112  Nick  1.11  False    Graduated
     0     1     2      3            4
0  113   Zoe  4.12  False  On vacation
1  112  Nick  1.21  False    Graduated
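For anyone who wants to run the set approach above end to end, here is a self-contained sketch. It rebuilds the sample frames without the Comment column, a simplification made here because rows containing NaN may not compare equal inside a set, and the helper name rows_as_set is made up for illustration.

```python
import pandas as pd

cols = ["id", "Name", "score", "isEnrolled"]
df1 = pd.DataFrame(
    [[111, "Jack", 2.17, True], [112, "Nick", 1.11, False], [113, "Zoe", 4.12, True]],
    columns=cols,
)
df2 = pd.DataFrame(
    [[111, "Jack", 2.17, True], [112, "Nick", 1.21, False], [113, "Zoe", 4.12, False]],
    columns=cols,
)


def rows_as_set(df):
    # index=False drops the index; name=None yields plain tuples.
    return set(df.itertuples(index=False, name=None))


only_in_df1 = rows_as_set(df1) - rows_as_set(df2)
only_in_df2 = rows_as_set(df2) - rows_as_set(df1)
# The union of the two differences is the symmetric difference.
changed = rows_as_set(df1).symmetric_difference(rows_as_set(df2))
print(pd.DataFrame(sorted(changed), columns=cols))
```

Keep in mind that a set discards duplicate rows and row order, so this only answers which distinct rows differ, not how many times each occurs.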
Could you add some sample data and the expected output? It is a bit unclear what exactly is needed. – jezrael