How to find the positions of a list of substrings in a string?

Question:

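Given a string:

"The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."

and a list of substrings:

['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']

Desired output: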

>>> s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday." 
>>> tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.'] 
>>> find_offsets(tokens, s) 
[(0, 3), (4, 9), (9, 10), (11, 16), (17, 20), (21, 23), (24, 34), 
     (34, 35), (36, 43), (44, 46), (47, 52), (52, 54), (55, 60), (61, 67), 
     (68, 72), (73, 75), (76, 83), (84, 89), (90, 98), (99, 103), (104, 109), 
     (110, 119), (120, 122), (123, 131), (131, 132)]

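To explain the output: the first substring "The" can be found in the string s using its (start, end) indices, here (0, 3).

So if we loop over all the integer tuples in the desired output (called out here), we get the list of substrings back: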

>>> [s[start:end] for start, end in out] 
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']

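I have tried:

def find_offset(tokens, s):
    index = 0  # position in s just past the previous token
    offsets = []
    for token in tokens:
        # Search only the remainder of s, then shift back to absolute indices.
        start = s[index:].index(token) + index
        index = start + len(token)
        offsets.append((start, index))
    return offsets

Is there another way to find the positions of a list of substrings in a string?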

Answer

If we know nothing about the substrings, there is no way other than rescanning the entire text for each of them.
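For the general case just described, a minimal sketch of the full rescan could look like the following. The name all_occurrences is hypothetical and not from the question; it makes no assumption about order and simply reports every place each distinct substring occurs.

```python
def all_occurrences(text, fragments):
    """Map each distinct fragment to every (start, end) span where it occurs."""
    found = {}
    for fragment in set(fragments):
        spans = []
        start = text.find(fragment)  # full scan from the beginning
        while start != -1:
            spans.append((start, start + len(fragment)))
            # Step one character forward so overlapping occurrences are found too.
            start = text.find(fragment, start + 1)
        found[fragment] = spans
    return found
```

With repeated words this returns several spans per fragment, so the caller still has to decide which occurrence belongs to which token.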

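If, as the data suggests, we know these are consecutive fragments of the text, given in text order, it is easy to scan only the rest of the text after each match. There is no point in slicing the text each time, though; str.index accepts a start position.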

def spans(text, fragments):
    result = []
    point = 0  # Where we are in the text.
    for fragment in fragments:
        found_start = text.index(fragment, point)
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
        point = found_end
    return result

Test:

>>> spans('foo in bar', ['foo', 'in', 'bar']) 
[(0, 3), (4, 6), (7, 10)]

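This assumes that every fragment is present in the text at the right place. Your output format does not show how a mismatch should be reported. Using .find instead of .index could help, though only partially.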
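As a rough sketch of the .find variant, assuming a missing fragment should be reported as None (that convention, and the name spans_or_none, are my assumptions, since the question does not specify a mismatch format):

```python
def spans_or_none(text, fragments):
    """Like spans(), but record None for a fragment that is not found."""
    result = []
    point = 0  # where we are in the text
    for fragment in fragments:
        found_start = text.find(fragment, point)  # -1 instead of ValueError
        if found_start == -1:
            result.append(None)
            continue  # leave point unchanged and try the next fragment
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
        point = found_end
    return result
```

For example, spans_or_none('foo in bar', ['foo', 'baz', 'bar']) gives [(0, 3), None, (7, 10)]. It helps only partially because a fragment that is absent at its expected place but happens to occur later is matched there, and everything in between is silently skipped.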

Answer

Solution one:

# Use a list comprehension and str.index (t is the list of tokens).
[(s.index(e), s.index(e) + len(e)) for e in t]

Solution two, which corrects the problem with the first solution:

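def find_offsets(tokens, s):
    tid = [list(e) for e in tokens]
    i = 0
    for id_token, token in enumerate(tid):
        # Advance to the next character equal to the token's first character.
        # Note: only the first character is compared, not the whole token.
        while token[0] != s[i]:
            i += 1
        tid[id_token] = (i, i + len(token))
        i += len(token)

    return tid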


find_offsets(tokens, s) 
Out[201]: 
[(0, 3), 
(4, 9), 
(9, 10), 
(11, 16), 
(17, 20), 
(21, 23), 
(24, 34), 
(34, 35), 
(36, 43), 
(44, 46), 
(47, 52), 
(52, 54), 
(55, 60), 
(61, 67), 
(68, 72), 
(73, 75), 
(76, 83), 
(84, 89), 
(90, 98), 
(99, 103), 
(104, 109), 
(110, 119), 
(120, 122), 
(123, 131), 
(131, 132)] 

#another test 
s = 'The plane, plane' 
t = ['The', 'plane', ',', 'plane'] 
find_offsets(t,s) 
Out[212]: [(0, 3), (4, 9), (9, 10), (11, 16)]

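Nicely short and cheerfully inefficient, calling `.index()` twice. – 9000

Also, this will not work properly if there are repeated words. `.index()` always picks up only the first instance =( – alvas

Try `s = 'The plane, plane'; t = ['The', 'plane', ',', 'plane']` – alvas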
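The problem raised in these comments is easy to reproduce. Here is a small sketch using the same s and t as the comment above; the running offset pos is a name introduced here for illustration:

```python
s = 'The plane, plane'
t = ['The', 'plane', ',', 'plane']

# Solution one: every lookup starts at index 0, so the second 'plane'
# resolves to the first occurrence again.
naive = [(s.index(e), s.index(e) + len(e)) for e in t]
print(naive)    # [(0, 3), (4, 9), (9, 10), (4, 9)]

# Passing a start position to str.index resumes after the previous token.
pos = 0
ordered = []
for e in t:
    start = s.index(e, pos)
    pos = start + len(e)
    ordered.append((start, pos))
print(ordered)  # [(0, 3), (4, 9), (9, 10), (11, 16)]
```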

Answer

import re 

s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday." 
tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.'] 


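pos = 0  # resume each search where the previous match ended
for token in tokens:
    pattern = re.compile(re.escape(token))
    match = pattern.search(s, pos)  # without pos, the second ',' would match at (9, 10) again
    print(match.span())
    pos = match.end()

Result:

(0, 3)
(4, 9)
(9, 10)
(11, 16)
(17, 20)
(21, 23)
(24, 34)
(34, 35)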
(36, 43) 
(44, 46) 
(47, 52) 
(52, 54) 
(55, 60) 
(61, 67) 
(68, 72) 
(73, 75) 
(76, 83) 
(84, 89) 
(90, 98) 
(99, 103) 
(104, 109) 
(110, 119) 
(120, 122) 
(123, 131) 
(131, 132)
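If the tokens are known to cover the string in order and to be separated only by optional whitespace (an assumption that holds for the example sentence but not necessarily for other tokenizers), a single regular expression with one capturing group per token is another option. This is only a sketch; find_offsets_regex is a hypothetical name:

```python
import re

def find_offsets_regex(tokens, s):
    """Match all tokens in order with one regex; return their (start, end) spans."""
    # One escaped capturing group per token, joined by optional whitespace.
    pattern = r'\s*'.join('(%s)' % re.escape(token) for token in tokens)
    match = re.search(pattern, s)
    if match is None:
        raise ValueError('tokens do not match the string in order')
    # Group 0 is the whole match, so the token groups are numbered from 1.
    return [match.span(i) for i in range(1, len(tokens) + 1)]
```

For instance, find_offsets_regex(['The', 'plane', ',', 'plane'], 'The plane, plane') returns [(0, 3), (4, 9), (9, 10), (11, 16)]. Unlike the loop above, this fails as a whole when any token is out of place, which may or may not be what you want.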
