Pandas: transposing one column into multiple columns in a DataFrame

Problem description:

  cmte_id trans entity st amount fec_id 
date       
2007-08-15 C00112250 24K  ORG  DC 2000 C00431569 
2007-09-26 C00119040 24K  CCM  FL 1000 C00367680 
2007-09-26 C00119040 24K  CCM  MD 1000 C00140715 
2007-07-20 C00346296 24K  CCM  CA 1000 C00434571 
2007-09-24 C00346296 24K  CCM  MA 1000 C00433136

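There are other descriptive columns that I have left out for brevity. I would like to reshape the frame so that the values in [cmte_id] become column headers and the values in [amount] become the corresponding values in the new columns. I know this is probably a simple pivot operation. I have tried the following: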

dfy.pivot('cmte_id', 'amount') 
--------------------------------------------------------------------------- 
ValueError        Traceback (most recent call last) 
<ipython-input-203-e5d2cb89e880> in <module>() 
----> 1 dfy.pivot('cmte_id', 'amount') 

/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in pivot(self, index, columns, values) 
    3761   """ 
    3762   from pandas.core.reshape import pivot 
-> 3763   return pivot(self, index=index, columns=columns, values=values) 
    3764 
    3765  def stack(self, level=-1, dropna=True): 

/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in pivot(self, index, columns, values) 
    323   append = index is None 
    324   indexed = self.set_index(cols, append=append) 
--> 325   return indexed.unstack(columns) 
    326  else: 
    327   if index is None: 

/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in unstack(self, level) 
    3857   """ 
    3858   from pandas.core.reshape import unstack 
-> 3859   return unstack(self, level) 
    3860 
    3861  #---------------------------------------------------------------------- 

/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in unstack(obj, level) 
    402  if isinstance(obj, DataFrame): 
    403   if isinstance(obj.index, MultiIndex): 
--> 404    return _unstack_frame(obj, level) 
    405   else: 
    406    return obj.T.stack(dropna=False) 

/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in _unstack_frame(obj, level) 
    442  else: 
    443   unstacker = _Unstacker(obj.values, obj.index, level=level, 
--> 444        value_columns=obj.columns) 
    445   return unstacker.get_result() 
    446 

/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in __init__(self, values, index, level, value_columns) 
    96 
    97   self._make_sorted_values_labels() 
---> 98   self._make_selectors() 
    99 
    100  def _make_sorted_values_labels(self): 

/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in _make_selectors(self) 
    134 
    135   if mask.sum() < len(self.index): 
--> 136    raise ValueError('Index contains duplicate entries, ' 
    137        'cannot reshape') 
    138 

ValueError: Index contains duplicate entries, cannot reshape

The desired end result (apart from the additional columns such as 'trans', 'fec_id', 'st', etc.) would look something like this:

date C00112250 C00119040 C00119040 C00346296 C00346296 
2007-Aug-15 2000     
2007-Sep-26    1000    
2007-Sep-26       1000  
2007-Jul-20          1000 
2007-Sep-24             1000

Does anyone know how I can get close to this end result?

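Please check your sample input data and expected result set - it is definitely not right. If you want to turn 'cmte_id' into columns, then the column names in your expected output should be values from the 'cmte_id' column of the input data frame - and that is not the case. Besides that, there is no 'id.thomas' column in your input - so how does it appear in the output? – MaxU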

Thanks MaxU, I just edited. I was getting ahead of myself. –

Answer

Try this:

pvt = pd.pivot_table(df, index=df.index, columns='cmte_id', 
        values='amount', aggfunc='sum', fill_value=0)
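
As a minimal sketch of why `pivot_table` succeeds where `pivot` failed: `pivot` refuses to reshape when the same index/column pair occurs more than once, while `pivot_table` aggregates the duplicates. The frame below is rebuilt by hand from the sample rows in the question, and it pivots on a `date` column obtained via `reset_index()` rather than on `df.index`; both choices are assumptions made for illustration.

```python
import pandas as pd

# Sample rows copied from the question; the index is the `date` column.
df = pd.DataFrame(
    {
        "cmte_id": ["C00112250", "C00119040", "C00119040", "C00346296", "C00346296"],
        "trans": ["24K"] * 5,
        "entity": ["ORG", "CCM", "CCM", "CCM", "CCM"],
        "st": ["DC", "FL", "MD", "CA", "MA"],
        "amount": [2000, 1000, 1000, 1000, 1000],
        "fec_id": ["C00431569", "C00367680", "C00140715", "C00434571", "C00433136"],
    },
    index=pd.to_datetime(
        ["2007-08-15", "2007-09-26", "2007-09-26", "2007-07-20", "2007-09-24"]
    ).rename("date"),
)

flat = df.reset_index()

# pivot() cannot reshape here: (2007-09-26, C00119040) occurs twice.
try:
    flat.pivot(index="date", columns="cmte_id", values="amount")
    pivot_error = None
except ValueError as exc:
    pivot_error = str(exc)

# pivot_table() aggregates the duplicate pair with aggfunc instead of failing.
pvt = pd.pivot_table(
    flat, index="date", columns="cmte_id",
    values="amount", aggfunc="sum", fill_value=0,
)
print(pivot_error)
print(pvt)
```

Note that with `aggfunc='sum'` the two 2007-09-26 rows for C00119040 are merged into a single row with a combined amount, so this variant does not keep them as separate rows the way the desired output in the question does.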

To keep the other columns:

In [213]: pvt = pd.pivot_table(df.reset_index(), index=['index','trans','entity','st', 'fec_id'], 
    .....:      columns='cmte_id', values='amount', aggfunc='sum', fill_value=0) \ 
    .....:   .reset_index() 

In [214]: pvt 
Out[214]: 
cmte_id  index trans entity st  fec_id C00112250 C00119040 \ 
0  2007-07-20 24K CCM CA C00434571   0   0 
1  2007-08-15 24K ORG DC C00431569  2000   0 
2  2007-09-24 24K CCM MA C00433136   0   0 
3  2007-09-26 24K CCM FL C00367680   0  1000 
4  2007-09-26 24K CCM MD C00140715   0  1000 

cmte_id C00346296 
0    1000 
1    0 
2    1000 
3    0 
4    0 

In [215]: pvt.head()['st'] 
Out[215]: 
0 CA 
1 DC 
2 MA 
3 FL 
4 MD 
Name: st, dtype: object

UPDATE:

import pandas as pd 
import glob 


# if you don't use ['cand_id'] column - remove it from `usecols` parameter 
dfy = pd.concat([pd.read_csv(f, sep='|', low_memory=False, header=None, 
          names=['cmte_id', '2', '3', '4','5', 'trans_typ', 'entity_typ', '8', '9', 'state', '11', 'employer', 'occupation', 'date', 'amount', 'fec_id', 'cand_id', '18', '19', '20', '21', '22'], 
          usecols= ['date', 'cmte_id', 'trans_typ', 'entity_typ', 'state', 'amount', 'fec_id', 'cand_id'], 
          dtype={'date': str}) 
       for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas2**.txt') 
       ], 
       ignore_index=True) 

dfy['date'] = pd.to_datetime(dfy['date'], format='%m%d%Y') 

# drop the unneeded column as early as possible to save memory 
del dfy['cand_id'] 

dfy = dfy[(dfy['date'].notnull()) & (dfy['date'] > '2007-01-01') & (dfy['date'] < '2014-12-31') ] 

#df = dfy.set_index(['date']) 

pvt = pd.pivot_table(dfy, index=['date','trans_typ','entity_typ','state','fec_id'], 
        columns='cmte_id', values='amount', aggfunc='sum', fill_value=0) \ 
     .reset_index() 


print(pvt.info()) 

pvt.to_excel('out.xlsx', index=False)

It works. Thanks again! I am often humbled and relieved by my experiences on this site! –

Always glad to help! :) – MaxU

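I think there may be a small problem. I ran 'pvt.head()[['state']]' to check that the other columns are still there, and got an error: 'KeyError: "['state'] not in index"'. Is this how it is supposed to be? If so, how can I keep my other columns, or is it possible to keep them at all (while still performing this operation)? –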
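
One possible cause of that `KeyError`, offered only as a guess since the thread does not show the code that was actually run: if the trailing `.reset_index()` is left out, the columns passed as `index=` become levels of a MultiIndex and are no longer ordinary columns. The sketch below uses a small hand-made frame (not the FEC data) to show both cases.

```python
import pandas as pd

# Small hypothetical frame that mimics the relevant columns.
dfy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2007-08-15", "2007-09-26", "2007-09-26"]),
        "state": ["DC", "FL", "MD"],
        "cmte_id": ["C00112250", "C00119040", "C00119040"],
        "amount": [2000, 1000, 1000],
    }
)

# Without reset_index(): `date` and `state` are index levels, not columns.
no_reset = pd.pivot_table(
    dfy, index=["date", "state"], columns="cmte_id",
    values="amount", aggfunc="sum", fill_value=0,
)

try:
    no_reset[["state"]]
    lookup_error = None
except KeyError as exc:
    lookup_error = exc

# The values are still reachable through the index level ...
states_from_index = no_reset.index.get_level_values("state").tolist()

# ... or as a regular column once the index is reset.
with_reset = no_reset.reset_index()
states = with_reset["state"].tolist()
print(states)
```

If `pvt` was built with `.reset_index()` as in the update above, `pvt['state']` should work, so it may be worth checking which version of `pvt` is in memory.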
