Join 2 CSV files (1-N relationship)
The CSV files are tab-separated.
file1.csv:
id_album name date
001 Nevermind 24/09/1991
...
file2.csv:
id_song id_album name
001 001 Smells Like Teen Spirit
002 001 In Bloom
...
I would like to obtain this output.csv:
id_album name date songs
001 Nevermind 24/09/1991 001,Smells Like Teen Spirit,002,In Bloom,...
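Do you see a way to do this in Bash (preferably) or Python?
I have a lot of records in my CSV files (millions of lines).
Edit
I tried join/sed/awk but could not get them to handle the 1-to-N relationship.
An awk solution: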
awk -F'[[:space:]][[:space:]]+' 'NR==FNR{ if(NR>1) a[$2]=($2 in a? a[$2]",":"")$1","$3; next}
FNR==1{ print $0,"songs" }
$1 in a{ print $0,a[$1] }' file2.csv OFS='\t' file1.csv > output.csv
The output.csv contents:
id_album name date songs
001 Nevermind 24/09/1991 001,Smells Like Teen Spirit,002,In Bloom
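For readers who would rather stay in Python, the same two-pass idea (index the songs by album id, then stream the albums) can be sketched with only the standard library. This is a rough sketch, not a translation of the awk script: the function name `join_albums_with_songs` is made up for illustration, and it assumes real tab-separated input with exactly three columns per song row.

```python
import csv
import io
from collections import defaultdict


def join_albums_with_songs(albums_file, songs_file, out_file):
    """Append a comma-joined 'songs' column to each album row (hypothetical helper)."""
    # Pass 1: collect id_song and name per album id, in file order.
    songs_by_album = defaultdict(list)
    songs = csv.reader(songs_file, delimiter="\t")
    next(songs, None)  # skip the header row
    for id_song, id_album, name in songs:
        songs_by_album[id_album].extend([id_song, name])

    # Pass 2: stream the albums and write each one together with its songs.
    albums = csv.reader(albums_file, delimiter="\t")
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    writer.writerow(next(albums) + ["songs"])
    for row in albums:
        # Like the awk script, albums without any song are skipped.
        if row[0] in songs_by_album:
            writer.writerow(row + [",".join(songs_by_album[row[0]])])


# Tiny in-memory stand-ins for file1.csv and file2.csv.
albums_tsv = "id_album\tname\tdate\n001\tNevermind\t24/09/1991\n"
songs_tsv = (
    "id_song\tid_album\tname\n"
    "001\t001\tSmells Like Teen Spirit\n"
    "002\t001\tIn Bloom\n"
)

out = io.StringIO()
join_albums_with_songs(io.StringIO(albums_tsv), io.StringIO(songs_tsv), out)
result = out.getvalue()
print(result)
```

With real files, the three arguments would be file objects opened with `newline=""`, as the `csv` module documentation recommends. Only the songs index is held in memory; the albums are processed one row at a time.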
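Super awk'ward but super cool! – alvas
Maybe '{print $0"\tnew"}' to align the header as well? – Inian
@Inian, not needed. That is handled by OFS='\t'. Code rendered in the SO frame may look different from what you see in the console. Sometimes the SO formatting is annoying... – RomanPerekhrest
TL;DR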
import pandas as pd
from io import StringIO
file1 = """id_album,name,date
001,Nevermind,24/09/1991"""
file2 = """id_song,id_album,name
001,001,Smells Like Teen Spirit
002,001,In Bloom"""
df1 = pd.read_csv(StringIO(file1))
df1 = df1.rename(columns={'name':'album_name'})
df2 = pd.read_csv(StringIO(file2))
df2 = df2.rename(columns={'name':'song_name'})
df3 = df1.merge(df2, on='id_album')
df4 = pd.DataFrame(list({album['id_album'].unique()[0]:','.join(list(album[['id_song', 'song_name']].astype(str).stack())) for idx, album in df3.groupby(['id_album'])}.items()), columns=['id_album', 'song_id_name'])
df_want = df1.merge(df4)
[OUT]:
>>> df_want
id_album album_name date song_id_name
0 1 Nevermind 24/09/1991 1,Smells Like Teen Spirit,2,In Bloom
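Note that the ids print as 1 and 2 rather than 001 and 002, because `read_csv` infers an integer column by default. If the leading zeros matter, as in the output requested in the question, one option is to read everything as text. A minimal sketch, assuming the same comma-separated sample data as above:

```python
from io import StringIO

import pandas as pd

file1 = """id_album,name,date
001,Nevermind,24/09/1991"""

# Default type inference parses "001" as the integer 1.
inferred = pd.read_csv(StringIO(file1))

# dtype=str keeps every column as text, so "001" survives unchanged.
as_text = pd.read_csv(StringIO(file1), dtype=str)

print(inferred["id_album"].tolist())
print(as_text["id_album"].tolist())
```

For the tab-separated files from the question, `sep="\t"` would be passed to `read_csv` as well.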
In long
Given:
>>> import pandas as pd
>>> from io import StringIO
>>> file1 = """id_album,name,date
... 001,Nevermind,24/09/1991"""
>>> file2 = """id_song,id_album,name
... 001,001,Smells Like Teen Spirit
... 002,001,In Bloom"""
>>> df1 = pd.read_csv(StringIO(file1))
>>> df1 = df1.rename(columns={'name':'album_name'})
>>> df2 = pd.read_csv(StringIO(file2))
>>> df2 = df2.rename(columns={'name':'song_name'})
>>> df1
id_album album_name date
0 1 Nevermind 24/09/1991
>>> df2
id_song id_album song_name
0 1 1 Smells Like Teen Spirit
1 2 1 In Bloom
First, merge the 2 DataFrames on the id_album column:
>>> df3 = df1.merge(df2, on='id_album')
>>> df3
id_album album_name date id_song song_name
0 1 Nevermind 24/09/1991 1 Smells Like Teen Spirit
1 1 Nevermind 24/09/1991 2 In Bloom
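Two `merge` options may be worth knowing for a 1-N join like this one. The sketch below uses a made-up second album with no songs (purely hypothetical data) to show how `validate` checks the relationship and how `how="left"` changes which albums survive the merge.

```python
import pandas as pd

# "Hypothetical album" is invented here only to show an album without songs.
albums = pd.DataFrame({
    "id_album": ["001", "002"],
    "album_name": ["Nevermind", "Hypothetical album"],
})
songs = pd.DataFrame({
    "id_song": ["001", "002"],
    "id_album": ["001", "001"],
    "song_name": ["Smells Like Teen Spirit", "In Bloom"],
})

# validate="one_to_many" raises pandas.errors.MergeError if id_album is
# duplicated on the albums side, i.e. if the relation is not really 1-N.
inner = albums.merge(songs, on="id_album", validate="one_to_many")

# how="left" additionally keeps albums that have no songs;
# their id_song and song_name values are NaN.
left = albums.merge(songs, on="id_album", how="left", validate="one_to_many")

print(len(inner), len(left))
```

The default inner merge silently drops albums without songs, which may or may not be what is wanted.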
Now for some pandas tricks:
1. First group the rows by the `id_album` column:
2. In each group, get the `id_song` and `song_name` columns and stack them
>>> [','.join(list(album[['id_song', 'song_name']].astype(str).stack())) for idx, album in df3.groupby(['id_album'])]
['1,Smells Like Teen Spirit,2,In Bloom']
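To see why `stack()` produces the interleaved id/name order, it may help to run it on a single group in isolation. A small sketch, using a hand-built frame that stands in for one album's rows:

```python
import pandas as pd

# Stand-in for the two song columns of one album group.
album = pd.DataFrame({
    "id_song": [1, 2],
    "song_name": ["Smells Like Teen Spirit", "In Bloom"],
})

# stack() walks the frame row by row and emits each row's columns in order,
# so ids and names come out interleaved: id, name, id, name, ...
stacked = album.astype(str).stack()
flat = stacked.tolist()
joined = ",".join(flat)

print(flat)
print(joined)
```

The `astype(str)` call is needed because `str.join` only accepts strings, and `id_song` is an integer column here.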
In a similar way, get the album_name from the .groupby():
>>> [album['album_name'].unique()[0] for idx, album in df3.groupby(['id_album'])]
['Nevermind']
Let's combine the two groupby operations:
>>> {album['album_name'].unique()[0]:','.join(list(album[['id_song', 'song_name']].astype(str).stack())) for idx, album in df3.groupby(['id_album'])}
{'Nevermind': '1,Smells Like Teen Spirit,2,In Bloom'}
>>> album2songs = {album['album_name'].unique()[0]:','.join(list(album[['id_song', 'song_name']].astype(str).stack())) for idx, album in df3.groupby(['id_album'])}
Turn that album2songs into a DataFrame:
>>> df4 = pd.DataFrame(list(album2songs.items()), columns=['album_name', 'song_id_name'])
>>> df4
album_name song_id_name
0 Nevermind 1,Smells Like Teen Spirit,2,In Bloom
Now join df1 and df4:
>>> df1.merge(df4)
id_album album_name date song_id_name
0 1 Nevermind 24/09/1991 1,Smells Like Teen Spirit,2,In Bloom
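The walkthrough above could probably be compressed by building the "id,name" pairs first and joining them per album with `groupby(...).agg`. This is an alternative sketch rather than the method shown above; it reads all columns as text to keep the leading zeros and writes tab-separated output, with variable names chosen only for illustration.

```python
from io import StringIO

import pandas as pd

file1 = """id_album,name,date
001,Nevermind,24/09/1991"""
file2 = """id_song,id_album,name
001,001,Smells Like Teen Spirit
002,001,In Bloom"""

albums = pd.read_csv(StringIO(file1), dtype=str)
songs = pd.read_csv(StringIO(file2), dtype=str)

# Build "id_song,name" per song row, then join those pieces per album.
pairs = songs["id_song"] + "," + songs["name"]
songs_per_album = (
    pairs.groupby(songs["id_album"]).agg(",".join).rename("songs").reset_index()
)

# One row per album, with the concatenated songs as an extra column.
result = albums.merge(songs_per_album, on="id_album")

# With no path given, to_csv returns the text instead of writing a file.
tsv = result.to_csv(sep="\t", index=False)
print(tsv)
```

Passing a file name such as `"output.csv"` as the first argument of `to_csv` would write the result to disk instead.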
BTW, @RomanPerekhrest's awk solution is way neater!
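What have you tried so far? Post it along with the question – Inian
A perfect use case for pandas right here. –
I tried using join/sed/awk. The results were too poor to be worth posting... sorry. – eli0tt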