Pandas: transposing one column into multiple columns in a DataFrame
Here is the data I have:
              cmte_id trans entity  st  amount     fec_id
date
2007-08-15  C00112250   24K    ORG  DC    2000  C00431569
2007-09-26  C00119040   24K    CCM  FL    1000  C00367680
2007-09-26  C00119040   24K    CCM  MD    1000  C00140715
2007-07-20  C00346296   24K    CCM  CA    1000  C00434571
2007-09-24  C00346296   24K    CCM  MA    1000  C00433136
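There are other descriptive columns that I have left out for brevity. I would like to transform the data so that the values in [cmte_id] become column headers and the values in [amount] become the corresponding values in the new columns. I know this is probably a simple pivot operation. I have tried the following: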
dfy.pivot('cmte_id', 'amount')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-203-e5d2cb89e880> in <module>()
----> 1 dfy.pivot('cmte_id', 'amount')
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in pivot(self, index, columns, values)
3761 """
3762 from pandas.core.reshape import pivot
-> 3763 return pivot(self, index=index, columns=columns, values=values)
3764
3765 def stack(self, level=-1, dropna=True):
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in pivot(self, index, columns, values)
323 append = index is None
324 indexed = self.set_index(cols, append=append)
--> 325 return indexed.unstack(columns)
326 else:
327 if index is None:
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in unstack(self, level)
3857 """
3858 from pandas.core.reshape import unstack
-> 3859 return unstack(self, level)
3860
3861 #----------------------------------------------------------------------
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in unstack(obj, level)
402 if isinstance(obj, DataFrame):
403 if isinstance(obj.index, MultiIndex):
--> 404 return _unstack_frame(obj, level)
405 else:
406 return obj.T.stack(dropna=False)
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in _unstack_frame(obj, level)
442 else:
443 unstacker = _Unstacker(obj.values, obj.index, level=level,
--> 444 value_columns=obj.columns)
445 return unstacker.get_result()
446
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in __init__(self, values, index, level, value_columns)
96
97 self._make_sorted_values_labels()
---> 98 self._make_selectors()
99
100 def _make_sorted_values_labels(self):
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in _make_selectors(self)
134
135 if mask.sum() < len(self.index):
--> 136 raise ValueError('Index contains duplicate entries, '
137 'cannot reshape')
138
ValueError: Index contains duplicate entries, cannot reshape
The desired end result (apart from the additional columns such as 'trans', 'fec_id', 'st', etc.) would look something like this:
date         C00112250  C00119040  C00119040  C00346296  C00346296
2007-Aug-15       2000
2007-Sep-26                  1000
2007-Sep-26                             1000
2007-Jul-20                                        1000
2007-Sep-24                                                   1000
Does anyone know how I can get close to this final result?
Try this:
pvt = pd.pivot_table(df, index=df.index, columns='cmte_id',
                     values='amount', aggfunc='sum', fill_value=0)
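For context, here is a minimal sketch of why `pivot` fails on this data while `pivot_table` does not, using the five sample rows from the question. The traceback shows the signature `pivot(self, index, columns, values)`, so the original call treated `cmte_id` as the index and `amount` as the columns, and some (cmte_id, amount) pairs occur more than once. The sketch assumes `date` is an ordinary column rather than the index (so it uses `index='date'` instead of `index=df.index`), and the variable name `dup_pairs` is purely illustrative.

```python
import pandas as pd

# The five sample rows from the question, with `date` as a regular column.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2007-08-15", "2007-09-26", "2007-09-26", "2007-07-20", "2007-09-24"]
    ),
    "cmte_id": ["C00112250", "C00119040", "C00119040", "C00346296", "C00346296"],
    "trans": ["24K"] * 5,
    "entity": ["ORG", "CCM", "CCM", "CCM", "CCM"],
    "st": ["DC", "FL", "MD", "CA", "MA"],
    "amount": [2000, 1000, 1000, 1000, 1000],
    "fec_id": ["C00431569", "C00367680", "C00140715", "C00434571", "C00433136"],
})

# pivot() needs every (index, columns) pair to be unique.
# Count the rows that repeat an earlier (cmte_id, amount) pair.
dup_pairs = df.duplicated(subset=["cmte_id", "amount"]).sum()
print(dup_pairs)

# pivot_table() aggregates repeated pairs (here with a sum) instead of raising.
pvt = pd.pivot_table(df, index="date", columns="cmte_id",
                     values="amount", aggfunc="sum", fill_value=0)
print(pvt)
```

Note that with only `date` as the index, the two 2007-09-26 rows for C00119040 are summed into a single cell, so this form collapses rows that share a date and a committee.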
To keep the other columns:
In [213]: pvt = pd.pivot_table(df.reset_index(), index=['index','trans','entity','st', 'fec_id'],
.....: columns='cmte_id', values='amount', aggfunc='sum', fill_value=0) \
.....: .reset_index()
In [214]: pvt
Out[214]:
cmte_id index trans entity st fec_id C00112250 C00119040 \
0 2007-07-20 24K CCM CA C00434571 0 0
1 2007-08-15 24K ORG DC C00431569 2000 0
2 2007-09-24 24K CCM MA C00433136 0 0
3 2007-09-26 24K CCM FL C00367680 0 1000
4 2007-09-26 24K CCM MD C00140715 0 1000
cmte_id C00346296
0 1000
1 0
2 1000
3 0
4 0
In [215]: pvt.head()['st']
Out[215]:
0 CA
1 DC
2 MA
3 FL
4 MD
Name: st, dtype: object
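The output above still carries `cmte_id` as the name of the column axis, which is why it shows up in the header row. As an optional follow-up, the sketch below rebuilds the same pivot on the question's sample rows and clears that name with `rename_axis`. It assumes the date is held in a column called `date` (the session above uses `index` because it resets an unnamed index), so treat the column names as illustrative.

```python
import pandas as pd

# The five sample rows from the question, with `date` as a regular column.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2007-08-15", "2007-09-26", "2007-09-26", "2007-07-20", "2007-09-24"]
    ),
    "cmte_id": ["C00112250", "C00119040", "C00119040", "C00346296", "C00346296"],
    "trans": ["24K"] * 5,
    "entity": ["ORG", "CCM", "CCM", "CCM", "CCM"],
    "st": ["DC", "FL", "MD", "CA", "MA"],
    "amount": [2000, 1000, 1000, 1000, 1000],
    "fec_id": ["C00431569", "C00367680", "C00140715", "C00434571", "C00433136"],
})

# Putting the descriptive columns in the pivot index keeps them in the result;
# reset_index() then turns them back into ordinary columns.
pvt = (
    pd.pivot_table(df, index=["date", "trans", "entity", "st", "fec_id"],
                   columns="cmte_id", values="amount",
                   aggfunc="sum", fill_value=0)
    .reset_index()
    .rename_axis(None, axis=1)  # drop the leftover "cmte_id" axis name
)

print(pvt.columns.tolist())
print(pvt["st"].tolist())
```

Because `fec_id` differs on every sample row, each input row stays on its own output row here instead of being summed with another.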
UPDATE:
import pandas as pd
import glob
# if you don't use ['cand_id'] column - remove it from `usecols` parameter
dfy = pd.concat(
    [pd.read_csv(f, sep='|', low_memory=False, header=None,
                 names=['cmte_id', '2', '3', '4', '5', 'trans_typ', 'entity_typ',
                        '8', '9', 'state', '11', 'employer', 'occupation',
                        'date', 'amount', 'fec_id', 'cand_id',
                        '18', '19', '20', '21', '22'],
                 usecols=['date', 'cmte_id', 'trans_typ', 'entity_typ',
                          'state', 'amount', 'fec_id', 'cand_id'],
                 dtype={'date': str})
     for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas2**.txt')],
    ignore_index=True)
dfy['date'] = pd.to_datetime(dfy['date'], format='%m%d%Y')
# remove not needed column ASAP in order to save memory
del dfy['cand_id']
dfy = dfy[(dfy['date'].notnull()) & (dfy['date'] > '2007-01-01') & (dfy['date'] < '2014-12-31') ]
#df = dfy.set_index(['date'])
pvt = pd.pivot_table(dfy, index=['date', 'trans_typ', 'entity_typ', 'state', 'fec_id'],
                     columns='cmte_id', values='amount', aggfunc='sum', fill_value=0) \
        .reset_index()
print(pvt.info())
pvt.to_excel('out.xlsx', index=False)
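If you want to try the same steps without the FEC files, the sketch below runs them on a small in-memory sample. The seven-field layout is a hypothetical reduction of the real 22-field file, the first five rows reuse the values from the question, and the sixth row (committee `C00000000`, dated 2006) is invented purely to show the date filter dropping a record.

```python
import io
import pandas as pd

# Hypothetical, reduced pipe-delimited sample; dates are MMDDYYYY strings.
raw = io.StringIO(
    "C00112250|24K|ORG|DC|08152007|2000|C00431569\n"
    "C00119040|24K|CCM|FL|09262007|1000|C00367680\n"
    "C00119040|24K|CCM|MD|09262007|1000|C00140715\n"
    "C00346296|24K|CCM|CA|07202007|1000|C00434571\n"
    "C00346296|24K|CCM|MA|09242007|1000|C00433136\n"
    "C00000000|24K|CCM|NY|01152006|500|C00000001\n"  # invented row, outside the window
)

names = ["cmte_id", "trans_typ", "entity_typ", "state", "date", "amount", "fec_id"]

# Reading `date` as str keeps the leading zero that the '%m%d%Y' format needs.
dfy = pd.read_csv(raw, sep="|", header=None, names=names, dtype={"date": str})
dfy["date"] = pd.to_datetime(dfy["date"], format="%m%d%Y")

# Same date window as above.
dfy = dfy[dfy["date"].notnull()
          & (dfy["date"] > "2007-01-01")
          & (dfy["date"] < "2014-12-31")]

pvt = pd.pivot_table(dfy, index=["date", "trans_typ", "entity_typ", "state", "fec_id"],
                     columns="cmte_id", values="amount",
                     aggfunc="sum", fill_value=0).reset_index()
print(pvt)
```

The invented 2006 row is filtered out before the pivot, so its committee never becomes a column.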
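It works. Thanks again! I am often humbled and relieved by my experiences on this site!
Always glad to help! :) – MaxU
I think there may be a small problem. I ran `pvt.head()[['state']]` to check that the other columns are still there, and I got an error: `KeyError: "['state'] not in index"`. Is that how it is supposed to be? If so, how can I keep my other columns, or can I keep them at all while still performing this operation?
Please check your sample input data and your expected result set: they are definitely not right. If you want to turn 'cmte_id' into columns, then the column names in your expected output should contain values from the 'cmte_id' column of the input DataFrame, and they do not. Besides that, there is no 'id.thomas' column in your input, so how does it appear in the output? – MaxU
Thanks MaxU, I just edited it. I was getting ahead of myself.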