如何解析特殊的日期格式在不同的列与大熊猫

问题描述:

与特殊的日期格式XLS数据,如:如何解析特殊的日期格式在不同的列与大熊猫

start day(utc) start time(utc) 
    20160401   100 
    20160401   200 
    20160401   300 
    20160401   400 
    20160401   500 

我想分析它作为格式2016年4月1日凌晨1点, 我用熊猫读了表;

parse = lambda x: datetime.strptime(str(x), '%Y%m%d %H') 
    content=pd.read_excel(filepath,skiprows=1, 
          na_values=['nan',-9999.0,9999.0, 
          '-9999.0 -',-99,'-99.000 -',-999], 
          parse_cols=[1,2,3,4,5,6,7,8,9,10,11,12,14], 
          header=None, parse_dates = [0,1], 
          index_col = 0, 
          date_parser=parse) 

但发生错误。它显示:

File "D:\Anaconda2\lib\_strptime.py", line 332, in _strptime 
    (data_string, format)) 

    ValueError: time data '100' does not match format '%Y%m%d' 

我该如何处理它?

您可以使用to_timedelta,因为100必要的鸿沟:

content=pd.read_excel(filepath,skiprows=1, 
         na_values=['nan',-9999.0,9999.0, 
         '-9999.0 -',-99,'-99.000 -',-999], 
         parse_cols=[1,2,3,4,5,6,7,8,9,10,11,12,14], 
         header=None, parse_dates = [0], 
         index_col = 0) 

df.index = df.index + pd.to_timedelta(df['start time(utc)']/100., unit='h') 
df = df.drop('start time(utc)', axis=1) 

如果没有必要的(时间是0,1,2..23)变化parse_dates = [0,1]parse_dates = [[0,1]]

样品

import pandas as pd 
from pandas.compat import StringIO 

temp=u"""start day(utc);start time(utc);a 
20160401;1;1 
20160401;2;7 
20160401;3;7 
20160401;4;5 
20160401;5;3""" 
#after testing replace 'StringIO(temp)' to 'filename.csv' 
parse = lambda x: datetime.strptime(x, '%Y%m%d %H') 
df = pd.read_csv(StringIO(temp), sep=";", 
          parse_dates = [[0,1]], 
          index_col = 0, 
          date_parser=parse) 

print (df) 
           a 
start day(utc)_start time(utc) 
2016-04-01 01:00:00    1 
2016-04-01 02:00:00    7 
2016-04-01 03:00:00    7 
2016-04-01 04:00:00    5 
2016-04-01 05:00:00    3 

print (df.index) 
DatetimeIndex(['2016-04-01 01:00:00', '2016-04-01 02:00:00', 
       '2016-04-01 03:00:00', '2016-04-01 04:00:00', 
       '2016-04-01 05:00:00'], 
       dtype='datetime64[ns]', name='start day(utc)_start time(utc)', freq=None) 
+0

我用上面的方法解决了它。谢谢:) –

+0

很高兴能帮助,祝你好运! – jezrael