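Python - Extracting hashtags from text; ending a hashtag at punctuation

Question:

For the end of my programming class, I have to write a function based on the following description:

The parameter is a tweet. The function should return a list containing all of the hashtags in the tweet, in the order they appear in the tweet. Each hashtag in the returned list should have the initial hash symbol removed, and hashtags should be unique. (If a tweet uses the same hashtag twice, it is included in the list only once. The order of the hashtags should match the order of the first occurrence of each tag in the tweet.)

I'm not sure how to make a hashtag end when punctuation is encountered (see the second doctest example). My current code does not output anything: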

def extract(start, tweet): 
    """ (str, str) -> list of str 

    Return a list of strings containing all words that start with a specified character. 

    >>> extract('@', "Make America Great Again, vote @RealDonaldTrump") 
    ['RealDonaldTrump'] 
    >>> extract('#', "Vote Hillary! #ImWithHer #TrumpsNotMyPresident") 
    ['ImWithHer', 'TrumpsNotMyPresident'] 
    """ 

    words = tweet.split() 
    return [word[1:] for word in words if word[0] == start] 

def strip_punctuation(s): 
    """ (str) -> str 

    Return a string, stripped of its punctuation. 

    >>> strip_punctuation("Trump's in the lead... damn!") 
    'Trumps in the lead damn' 
    """ 
    return ''.join(c for c in s if c not in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')

def extract_hashtags(tweet): 
    """ (str) -> list of str 

    Return a list of strings containing all unique hashtags in a tweet. 
    Outputted in order of appearance. 

    >>> extract_hashtags("I stand with Trump! #MakeAmericaGreatAgain #MAGA #TrumpTrain") 
    ['MakeAmericaGreatAgain', 'MAGA', 'TrumpTrain'] 
    >>> extract_hashtags("NEVER TRUMP. I'm with HER. Does #this! work?")
    ['this'] 
    """ 

    hashtags = extract('#', tweet) 

    no_duplicates = [] 

    for item in hashtags: 
     if item not in no_duplicates and item.isalnum(): 
      no_duplicates.append(item) 

    result = [] 
    for hash in no_duplicates: 
     for char in hash: 
      if char.isalnum() == False and char != '#': 
       hash == hash[:char.index()] 
       result.append() 
    return result

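I'm pretty lost at this point; any help would be greatly appreciated. Thank you in advance.

Note: we are *not* allowed to use regular expressions or import any modules.

So... if a hashtag has to end at punctuation, and there aren't *that* many punctuation characters, why not check whether the next character is punctuation? – Pythonista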
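The comment's idea can be sketched as a character-by-character scan: after each '#', keep advancing while the next character is still a letter or digit, and stop at the first one that is not. This is only a minimal sketch under a few assumptions that are not in the question: the name extract_hashtags_scan is made up for illustration, a '#' anywhere in the text starts a tag (not only at the start of a word), and str.isalnum() decides what counts as part of a tag, so an underscore also ends it. It uses no regular expressions and no imports.

```python
def extract_hashtags_scan(tweet):
    """Return the unique hashtags in tweet, without '#', in order of first appearance.

    Hypothetical helper: a tag starts right after '#' and ends at the first
    character that is not alphanumeric (assumption: '_' also ends a tag).
    """
    result = []
    i = 0
    while i < len(tweet):
        if tweet[i] == '#':
            # Move j forward while the next character still belongs to the tag.
            j = i + 1
            while j < len(tweet) and tweet[j].isalnum():
                j += 1
            tag = tweet[i + 1:j]
            # Skip empty tags (a lone '#') and tags already collected.
            if tag and tag not in result:
                result.append(tag)
            i = j
        else:
            i += 1
    return result
```

With this approach a tag such as "#this!" ends at the "!", and text after interior punctuation (for example the "bar" in "#foo,bar") is not glued onto the tag.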

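Answer

You look a bit lost. The key to solving these kinds of problems is to break the problem into smaller parts, solve those, and then combine the results. You already have every piece you need: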

def extract_hashtags(tweet): 
    # strip the punctuation on the tags you've extracted (directly) 
    hashtags = [strip_punctuation(tag) for tag in extract('#', tweet)] 
    # hashtags is now a list of hash-tags without any punctuation, but possibly with duplicates 

    result = [] 
    for tag in hashtags:
        # check that we haven't seen the tag already
        # (we know it doesn't contain punctuation at this point)
        if tag not in result:
            result.append(tag)
    return result
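One edge case that may be worth guarding against: a word that is only "#", or "#" followed by nothing but punctuation, becomes an empty string once the punctuation is stripped, and it would then land in the result. The block below is a self-contained variation on the same pieces, assuming that empty tags should simply be dropped (the question does not say either way). The PUNCTUATION constant is a name introduced here for readability.

```python
# Same characters as the strip_punctuation in the question.
PUNCTUATION = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'


def extract(start, tweet):
    # Words that begin with `start`, with that first character removed.
    return [word[1:] for word in tweet.split() if word[0] == start]


def strip_punctuation(s):
    return ''.join(c for c in s if c not in PUNCTUATION)


def extract_hashtags(tweet):
    result = []
    for word in extract('#', tweet):
        tag = strip_punctuation(word)
        # Assumption: drop empty tags, and keep only the first occurrence.
        if tag and tag not in result:
            result.append(tag)
    return result
```

Note that stripping removes punctuation anywhere in the word, so "#foo,bar" comes out as "foobar" rather than ending at the comma.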

PS: this is a problem that regular expressions are very well suited to, but if you want a fast strip_punctuation you can use:

def strip_punctuation(s): 
    # Python 2 only: str.translate(None, deletechars) deletes the listed characters
    return s.translate(None, '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
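The two-argument translate(None, ...) call is the Python 2 form. On Python 3, str.translate takes a single translation table, which the built-in str.maketrans can build; its third argument lists characters to delete. A minimal sketch of the equivalent, with PUNCTUATION as a name introduced here (no imports are needed, since str.maketrans is a method on the built-in str type):

```python
PUNCTUATION = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'


def strip_punctuation(s):
    # str.maketrans('', '', chars) builds a table mapping each char in chars to None,
    # which makes translate() delete those characters.
    return s.translate(str.maketrans('', '', PUNCTUATION))
```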
