Convert varying text duration fields to seconds in pandas
Problem description:
I have a DataFrame that contains trip durations as text values, as shown below in the driving_duration_text column.
print df
yelp_id driving_duration_text \
0 alexander-rubin-photography-napa 1 hour 43 mins
1 jumas-automotive-napa-2 1 hour 32 mins
2 larson-brothers-painting-napa 1 hour 30 mins
3 preferred-limousine-napa 1 hour 32 mins
4 cardon-y-el-tirano-miami 1 day 16 hours
5 sweet-dogs-miami 1 day 3 hours
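As you can see, some durations are written in hours and minutes and others in days and hours. How can I convert this format to seconds?
Answer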
UPDATE:
In [150]: df['seconds'] = (pd.to_timedelta(df['driving_duration_text']
.....: .str.replace(' ', '')
.....: .str.replace('mins', 'min'))
.....: .dt.total_seconds())
In [151]: df
Out[151]:
yelp_id driving_duration_text seconds
0 alexander-rubin-photography-napa 1 hour 43 mins 6180.0
1 jumas-automotive-napa-2 1 hour 32 mins 5520.0
2 larson-brothers-painting-napa 1 hour 30 mins 5400.0
3 preferred-limousine-napa 1 hour 32 mins 5520.0
4 cardon-y-el-tirano-miami 1 day 16 hours 144000.0
5 sweet-dogs-miami 1 day 3 hours 97200.0
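As a sanity check on the seconds column above, the expected values for two of the rows can be reproduced with the standard library alone. This is only an arithmetic cross-check, not part of the pandas approach in this answer, and the variable names are made up for illustration.

```python
from datetime import timedelta

# Expected values for two rows from the question, computed with the
# standard library only (no pandas involved).
one_hour_43_min = timedelta(hours=1, minutes=43).total_seconds()
one_day_16_hours = timedelta(days=1, hours=16).total_seconds()

print(one_hour_43_min)   # 6180.0
print(one_day_16_hours)  # 144000.0
```

Both results agree with rows 0 and 4 of the output.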
OLD answer:
You can do it like this:
from collections import defaultdict
import re

def humantime2seconds(s):
    d = {
        'w': 7*24*60*60,
        'week': 7*24*60*60,
        'weeks': 7*24*60*60,
        'd': 24*60*60,
        'day': 24*60*60,
        'days': 24*60*60,
        'h': 60*60,
        'hr': 60*60,
        'hour': 60*60,
        'hours': 60*60,
        'm': 60,
        'min': 60,
        'mins': 60,
        'minute': 60,
        'minutes': 60
    }
    mult_items = defaultdict(lambda: 1).copy()
    mult_items.update(d)
    parts = re.search(r'^(\d+)([^\d]*)', s.lower().replace(' ', ''))
    if parts:
        return int(parts.group(1)) * mult_items[parts.group(2)] + humantime2seconds(re.sub(r'^(\d+)([^\d]*)', '', s.lower()))
    else:
        return 0

df['seconds'] = df.driving_duration_text.map(humantime2seconds)
Output:
In [64]: df
Out[64]:
yelp_id driving_duration_text seconds
0 alexander-rubin-photography-napa 1 hour 43 mins 6180
1 jumas-automotive-napa-2 1 hour 32 mins 5520
2 larson-brothers-painting-napa 1 hour 30 mins 5400
3 preferred-limousine-napa 1 hour 32 mins 5520
4 cardon-y-el-tirano-miami 1 day 16 hours 144000
5 sweet-dogs-miami 1 day 3 hours 97200
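The same unit-table idea can also be written without recursion. The following is a minimal sketch, not the answer author's code: `duration_to_seconds` and `UNIT_SECONDS` are hypothetical names, the list of accepted unit spellings is an assumption, and unlike the function above it raises an error on an unknown unit instead of treating it as a multiplier of 1.

```python
import re

# Seconds per unit; the spellings listed here are an assumption about
# what the input may contain, so extend the table as needed.
UNIT_SECONDS = {
    'w': 604800, 'week': 604800, 'weeks': 604800,
    'd': 86400, 'day': 86400, 'days': 86400,
    'h': 3600, 'hr': 3600, 'hrs': 3600, 'hour': 3600, 'hours': 3600,
    'm': 60, 'min': 60, 'mins': 60, 'minute': 60, 'minutes': 60,
    's': 1, 'sec': 1, 'secs': 1, 'second': 1, 'seconds': 1,
}

def duration_to_seconds(text):
    """Sum every '<number> <unit>' pair found in text (hypothetical helper)."""
    total = 0
    # findall returns one (count, unit) tuple per pair, e.g. ('1', 'hour').
    for count, unit in re.findall(r'(\d+)\s*([a-z]+)', text.lower()):
        if unit not in UNIT_SECONDS:
            raise ValueError('unknown unit %r in %r' % (unit, text))
        total += int(count) * UNIT_SECONDS[unit]
    return total

print(duration_to_seconds('1 hour 43 mins'))  # 6180
print(duration_to_seconds('1 day 16 hours'))  # 144000
```

It can be applied to the column the same way as above, with df.driving_duration_text.map(duration_to_seconds). Note that a string containing no number-unit pair at all yields 0 rather than an error.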
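Answer
Given that the text appears to follow a standard format, this is relatively simple. We need to split the string apart, group the pieces into number and unit pairs, and then process each pair.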
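def parse_duration(duration):
    items = duration.split()
    counts = items[::2]   # the numbers, e.g. '1', '43'
    words = items[1::2]   # the units, e.g. 'hour', 'mins'
    seconds = 0
    for word, count in zip(words, counts):
        # split() yields strings, so convert the count before multiplying
        seconds += get_seconds(word, int(count))
    return seconds

def get_seconds(word, count):
    counts = {
        'second': 1,
        'min': 60,  # the question's data uses 'mins'
        'minute': 60,
        'hour': 3600,
        'day': 86400
        # and so on
    }
    # Handle plurals: try the word with its last letter stripped first,
    # then the word as given; unknown units count as 0
    base = counts.get(word[:-1], counts.get(word, 0))
    return base * count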
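Great answer. I like this approach. – DJGrandpaJ
@unutbu, thank you for the correction! I have updated my answer. – MaxU
Your old answer worked for me, but with your updated answer I get 'ValueError: cannot create timedelta string converter'. Thanks anyway! –
@JasonMeloHall: you [may need to upgrade](http://stackoverflow.com/questions/28294803/calculate-total-of-string-column#comment44989331_28302352) your version of pandas to get the updated answer to work. – unutbu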