Comparing two CSV files with Python pandas

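Question:

The first one has the product ID, the second has the serial number.

I need to look up all the serial numbers from the first CSV and find matches in the second CSV. The resulting report should contain the matching serial numbers along with the corresponding product ID from each CSV, in separate columns. I tried modifying the code below, with no luck.

How would you approach this?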

import pandas as pd

# Read each CSV, take only the first column, and build a set from it.
A = set(pd.read_csv("c1.csv", index_col=False, header=None)[0])
B = set(pd.read_csv("c2.csv", index_col=False, header=None)[0])
print(A - B)  # values that appear only in A
print(B - A)  # values that appear only in B

Could you add some sample data and the desired output? It is a bit unclear what exactly is needed. – jezrael

Answer

I think you need merge:

import pandas as pd

A = pd.DataFrame({'product id': [1455, 5452, 3775],
                  'serial number': [44, 55, 66]})

print(A)

B = pd.DataFrame({'product id': [7000, 2000, 1000],
                  'serial number': [44, 55, 77]})

print(B)

print(pd.merge(A, B, on='serial number'))
   product id_x  serial number  product id_y
0          1455             44          7000
1          5452             55          2000

Just one small modification: in the snippet above, how can I pass two file names as input instead of hard-coded values? – poyim
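One possible way to address this comment is to read each file with `pd.read_csv` and pass the resulting frames to `pd.merge`. This is a minimal sketch: the function name `match_serials`, the `_a`/`_b` suffixes, and the assumption that both files have a header row containing a `serial number` column are illustrative and not from the original posts.

```python
import pandas as pd


def match_serials(path_a, path_b, key="serial number"):
    """Return the rows whose `key` value appears in both CSV files.

    Assumption: both files have a header row that includes the `key`
    column. Change `key` to match the real column name.
    """
    a = pd.read_csv(path_a)
    b = pd.read_csv(path_b)
    # Inner join on the key; the suffixes show which file a column came from.
    return pd.merge(a, b, on=key, how="inner", suffixes=("_a", "_b"))
```

It could then be called as `match_serials("c1.csv", "c2.csv")`, or with paths taken from `sys.argv` if the file names should come from the command line.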

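Answer

import pandas as pd

first_one = pd.read_csv(file_path)
# Read second_one the same way (second_file_path is a placeholder name).
second_one = pd.read_csv(second_file_path)

# If product_id is the first column, its position is 0.
# Note: this compares the two files row by row, at the same position.
for i in range(min(len(first_one), len(second_one))):
    if first_one.iloc[i, 0] == second_one.iloc[i, 0]:
        # It is a match; do whatever you want with the matched data.
        print(first_one.iloc[i, 0])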
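The loop in this answer compares the two files row by row at the same position. If matching serial numbers can sit on different rows, a membership test may be closer to what the question asks. A minimal sketch, using hypothetical in-memory data in place of the CSV files:

```python
import pandas as pd

# Hypothetical sample data standing in for the two CSV files.
first_one = pd.DataFrame({"product id": [1455, 5452, 3775],
                          "serial number": [44, 55, 66]})
second_one = pd.DataFrame({"product id": [1000, 7000, 2000],
                           "serial number": [77, 44, 55]})

# Boolean mask: True where the serial number occurs anywhere in second_one.
mask = first_one["serial number"].isin(second_one["serial number"])
matched = first_one[mask]
print(matched)
```

Unlike `merge`, `isin` only filters the first frame; it does not bring over the product ID from the second file.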

Answer

Try this:

A = pd.read_csv("c1.csv", header=None, usecols=[0], names=['col']).drop_duplicates() 
B = pd.read_csv("c2.csv", header=None, usecols=[0], names=['col']).drop_duplicates() 
# A - B 
pd.merge(A, B, on='col', how='left', indicator=True).query("_merge == 'left_only'") 
# B - A 
pd.merge(A, B, on='col', how='right', indicator=True).query("_merge == 'right_only'")
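The `indicator=True` technique above can also be run once with an outer merge, which yields both differences and the common values from a single frame. A minimal sketch with hypothetical in-memory data standing in for the first columns of `c1.csv` and `c2.csv`:

```python
import pandas as pd

# Hypothetical stand-ins for the first column of each CSV file.
A = pd.DataFrame({"col": [44, 55, 66, 66]}).drop_duplicates()
B = pd.DataFrame({"col": [44, 55, 77]}).drop_duplicates()

# An outer merge keeps every value; `_merge` records where each one came from.
merged = pd.merge(A, B, on="col", how="outer", indicator=True)

only_in_a = merged.loc[merged["_merge"] == "left_only", "col"].tolist()
only_in_b = merged.loc[merged["_merge"] == "right_only", "col"].tolist()
in_both = merged.loc[merged["_merge"] == "both", "col"].tolist()

print(only_in_a, only_in_b, in_both)
```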

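Answer

You can convert each DataFrame to a set of row tuples, which ignores the index when comparing the data, and then use the set method symmetric_difference.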

ds1 = set(tuple(values) for values in df1.values.tolist())
ds2 = set(tuple(values) for values in df2.values.tolist())

# Rows that appear in exactly one of the two frames.
print(ds1.symmetric_difference(ds2), '\n\n')
print(df1, '\n\n')
print(df2, '\n\n')

# Rows only in df1, then rows only in df2.
print(pd.DataFrame(list(ds1.difference(ds2))), '\n\n')
print(pd.DataFrame(list(ds2.difference(ds1))), '\n\n')

df1

    id  Name  score  isEnrolled               Comment
0  111  Jack   2.17        True  He was late to class
1  112  Nick   1.11       False             Graduated
2  113   Zoe   4.12        True                   NaN

df2

    id  Name  score  isEnrolled               Comment
0  111  Jack   2.17        True  He was late to class
1  112  Nick   1.21       False             Graduated
2  113   Zoe   4.12       False           On vacation

Output

     0     1     2      3          4
0  113   Zoe  4.12   True        NaN
1  112  Nick  1.11  False  Graduated


     0     1     2      3            4
0  113   Zoe  4.12  False  On vacation
1  112  Nick  1.21  False    Graduated
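For completeness, here is a self-contained sketch of the set-based comparison that rebuilds the two sample frames shown above. The variable names `only_in_df1` and `only_in_df2` are illustrative. Row order in the result is not guaranteed, because sets are unordered.

```python
import pandas as pd

columns = ["id", "Name", "score", "isEnrolled", "Comment"]
df1 = pd.DataFrame([
    [111, "Jack", 2.17, True, "He was late to class"],
    [112, "Nick", 1.11, False, "Graduated"],
    [113, "Zoe", 4.12, True, float("nan")],
], columns=columns)
df2 = pd.DataFrame([
    [111, "Jack", 2.17, True, "He was late to class"],
    [112, "Nick", 1.21, False, "Graduated"],
    [113, "Zoe", 4.12, False, "On vacation"],
], columns=columns)

# Turn each row into a hashable tuple so whole rows can be compared as set items.
ds1 = set(tuple(row) for row in df1.values.tolist())
ds2 = set(tuple(row) for row in df2.values.tolist())

only_in_df1 = pd.DataFrame(list(ds1 - ds2), columns=columns)
only_in_df2 = pd.DataFrame(list(ds2 - ds1), columns=columns)

print(only_in_df1)
print(only_in_df2)
```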
