How to display all words that contain these characters?

Problem description:

I have a text file, and I want to display all the words in it that contain both the z and x characters.

How can I do that?

+2

Where exactly is the problem? What have you tried? – 2010-10-18 19:56:32

+0

I don't know how to parse the text file :) – xRobot 2010-10-18 19:58:30

+0

Regular expressions are important when parsing text. Take a look at Ishpeck's solution. – Squirrelsama 2010-10-18 20:54:58

If you don't want to have 2 problems:

for word in open('myfile.txt').read().split():
    if 'x' in word and 'z' in word:
        print word
+1

Thank goodness you provided an answer that does *not* use regular expressions. – gotgenes 2010-10-18 20:14:13

+0

+1: I really like this. The only problem I can see is that you'll get any punctuation surrounding your words, not just the words themselves. – 2010-10-18 20:16:34

+0

Indeed, I'm using Python's definition of "words", which may not be a reasonable one here. – geoffspear 2010-10-18 20:18:52
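
One way to address the punctuation concern raised in the comments above is to strip leading and trailing punctuation from each whitespace-separated token before testing it. A minimal sketch in Python 3, where `words_with_all` is a hypothetical helper name and the sample sentence is made up for illustration:

```python
import string


def words_with_all(text, chars="xz"):
    """Return the words in text that contain every character in chars."""
    found = []
    for token in text.split():
        # Strip punctuation around the token; punctuation inside it is kept.
        word = token.strip(string.punctuation)
        if word and all(c in word for c in chars):
            found.append(word)
    return found


print(words_with_all("A (zax) and a xyzzy, but no zebra."))
```

This keeps the plain `split()` approach and only changes what is reported, so apostrophes inside a word (as in "lizard's") survive.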

Sounds like a job for Regular Expressions. Read up on them and try it out. If you run into problems, update your question and we can help you with the specifics.

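Assuming you have the whole file in memory as one big string, and that the definition of a word is "a contiguous sequence of letters", you could do something like this:

import re
for word in re.findall(r"\w+", mystring):
    if 'x' in word and 'z' in word:
        print word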
+0

I like this answer. It's the cleanest solution. If performance becomes an issue, time it against my solution and pick the winner. – 2010-10-18 20:14:08
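
Note that `\w` matches digits and the underscore as well as letters, so `\w+` is a little broader than "a contiguous sequence of letters". A small sketch of the difference in Python 3, assuming ASCII-only text and a made-up sample string; `[A-Za-z]+` is one possible stricter pattern, not something proposed in the answer above:

```python
import re

sample = "xyz_1 foo-zx x2z"

# \w+ treats letters, digits and underscores as word characters.
loose = [w for w in re.findall(r"\w+", sample) if "x" in w and "z" in w]

# [A-Za-z]+ accepts ASCII letters only, so digits and underscores split words.
strict = [w for w in re.findall(r"[A-Za-z]+", sample) if "x" in w and "z" in w]

print(loose)
print(strict)
```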

>>> import re 
>>> pattern = re.compile(r'\b(\w*z\w*x\w*|\w*x\w*z\w*)\b')
>>> document = '''Here is some data that needs 
... to be searched for words that contain both z 
... and x. Blah xz zx blah jal akle asdke asdxskz 
... zlkxlk blah bleh foo bar''' 
>>> print pattern.findall(document) 
['xz', 'zx', 'asdxskz', 'zlkxlk'] 
+0

I can confirm this works, and it's better than my reply. I'm going to delete mine. – Ishpeck 2010-10-18 21:03:58
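
A detail that is easy to miss with patterns like this one: in an ordinary Python string literal, `'\b'` is the backspace character, so the word-boundary escape only reaches the regex engine when the pattern is written as a raw string. A quick sketch in Python 3, with a sample text made up for illustration:

```python
import re

text = "xz zx foo"

# Raw string: \b reaches the regex engine as a word boundary.
raw = re.findall(r"\b\w*z\w*\b", text)

# Plain string: "\b" is the backspace character (0x08), which is not in text.
plain = re.findall("\b\\w*z\\w*\b", text)

print(raw)
print(plain)
```

The second call finds nothing because the pattern begins by looking for a literal backspace.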

>>> import re 
>>> print re.findall(r'(\w*x\w*z\w*|\w*z\w*x\w*)', 'axbzc azb axb abc axzb')
['axbzc', 'axzb'] 
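
The alternation patterns above spell out each ordering of the required letters, so three letters would already need six alternatives. One possible alternative, not taken from any of the answers here, is to use one lookahead per required character. A minimal sketch in Python 3, where `build_pattern` is a hypothetical helper:

```python
import re


def build_pattern(chars):
    """Compile a regex matching whole words that contain every char in chars."""
    # One lookahead per required character, all checked at the start of a word.
    lookaheads = "".join(r"(?=\w*" + re.escape(c) + ")" for c in chars)
    return re.compile(r"\b" + lookaheads + r"\w+")


pattern = build_pattern("xz")
print(pattern.findall("axbzc azb axb abc axzb"))
```

Whether this is faster or slower than the alternation has not been measured here; the point is only that the pattern grows linearly with the number of required characters.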

I don't know about the performance of this generator, but for me this is the way:

from __future__ import print_function 
import string 

bookfile = '11.txt' # Alice in Wonderland 
hunted = 'az' # in your case xz but there is none of those in this book 

with open(bookfile) as thebook: 
    # read text of book and split from white space 
    print('\n'.join(set(word.lower().strip(string.punctuation) 
        for word in thebook.read().split() 
        if all(c in word.lower() for c in hunted)))) 
""" Output: 
zealand 
crazy 
grazed 
lizard's 
organized 
lazy 
zigzag 
lizard 
lazily 
gazing 
"""

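I just want to point out how heavy-handed some of these regular expressions can be compared with the simple string methods-based solution provided by Wooble.

Let's do some timings, shall we?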

#!/usr/bin/env python 
# -*- coding: UTF-8 -*- 

import timeit 
import re 
import sys 

WORD_RE_COMPILED = re.compile(r'\w+') 
Z_RE_COMPILED = re.compile(r'(\b\w*z\w*\b)') 
XZ_RE_COMPILED = re.compile(r'\b(\w*z\w*x\w*|\w*x\w*z\w*)\b') 

########################## 
# Tim Pietzcker's solution 
# https://stackoverflow.com/questions/3962846/how-to-display-all-words-that-contain-these-characters/3962876#3962876 
# 
def xz_re_word_find(text):
    for word in re.findall(r'\w+', text):
        if 'x' in word and 'z' in word:
            print word


# Tim's solution, compiled
def xz_re_word_compiled_find(text):
    pattern = re.compile(r'\w+')
    for word in pattern.findall(text):
        if 'x' in word and 'z' in word:
            print word


# Tim's solution, with the RE pre-compiled so compilation doesn't get
# included in the search time
def xz_re_word_precompiled_find(text):
    for word in WORD_RE_COMPILED.findall(text):
        if 'x' in word and 'z' in word:
            print word


################################
# Steven Rumbalski's solution #1
# (provided in the comment)
# https://stackoverflow.com/questions/3962846/how-to-display-all-words-that-contain-these-characters/3963285#3963285
def xz_re_z_find(text):
    for word in re.findall(r'(\b\w*z\w*\b)', text):
        if 'x' in word:
            print word


# Steven's solution #1 compiled
def xz_re_z_compiled_find(text):
    pattern = re.compile(r'(\b\w*z\w*\b)')
    for word in pattern.findall(text):
        if 'x' in word:
            print word


# Steven's solution #1 with the RE pre-compiled
def xz_re_z_precompiled_find(text):
    for word in Z_RE_COMPILED.findall(text):
        if 'x' in word:
            print word


################################
# Steven Rumbalski's solution #2
# https://stackoverflow.com/questions/3962846/how-to-display-all-words-that-contain-these-characters/3962934#3962934
def xz_re_xz_find(text):
    for word in re.findall(r'\b(\w*z\w*x\w*|\w*x\w*z\w*)\b', text):
        print word


# Steven's solution #2 compiled
def xz_re_xz_compiled_find(text):
    pattern = re.compile(r'\b(\w*z\w*x\w*|\w*x\w*z\w*)\b')
    for word in pattern.findall(text):
        print word


# Steven's solution #2 pre-compiled
def xz_re_xz_precompiled_find(text):
    for word in XZ_RE_COMPILED.findall(text):
        print word


#################################
# Wooble's simple string solution
def xz_str_find(text):
    for word in text.split():
        if 'x' in word and 'z' in word:
            print word


functions = [
    'xz_re_word_find',
    'xz_re_word_compiled_find',
    'xz_re_word_precompiled_find',
    'xz_re_z_find',
    'xz_re_z_compiled_find',
    'xz_re_z_precompiled_find',
    'xz_re_xz_find',
    'xz_re_xz_compiled_find',
    'xz_re_xz_precompiled_find',
    'xz_str_find'
]

import_stuff = functions + [
    'text',
    'WORD_RE_COMPILED',
    'Z_RE_COMPILED',
    'XZ_RE_COMPILED'
]


if __name__ == '__main__':

    text = open(sys.argv[1]).read()
    timings = {}
    setup = 'from __main__ import ' + ','.join(import_stuff)
    for func in functions:
        statement = func + '(text)'
        timer = timeit.Timer(statement, setup)
        min_time = min(timer.repeat(3, 10))
        timings[func] = min_time

    for func in functions:
        print func + ":", timings[func], "seconds"

Running this script on a plaintext copy of Moby Dick obtained from Project Gutenberg, on Python 2.6, I get the following timings:

xz_re_word_find: 1.21829485893 seconds 
xz_re_word_compiled_find: 1.42398715019 seconds 
xz_re_word_precompiled_find: 1.40110301971 seconds 
xz_re_z_find: 0.680151939392 seconds 
xz_re_z_compiled_find: 0.673038005829 seconds 
xz_re_z_precompiled_find: 0.673489093781 seconds 
xz_re_xz_find: 1.11700701714 seconds 
xz_re_xz_compiled_find: 1.12773990631 seconds 
xz_re_xz_precompiled_find: 1.13285303116 seconds 
xz_str_find: 0.590088844299 seconds 

On Python 3.1 (after using 2to3 to fix the print statements), I get the following timings:

xz_re_word_find: 2.36110496521 seconds 
xz_re_word_compiled_find: 2.34727501869 seconds 
xz_re_word_precompiled_find: 2.32607793808 seconds 
xz_re_z_find: 1.32204890251 seconds 
xz_re_z_compiled_find: 1.34104800224 seconds 
xz_re_z_precompiled_find: 1.34424304962 seconds 
xz_re_xz_find: 2.33851099014 seconds 
xz_re_xz_compiled_find: 2.29653286934 seconds 
xz_re_xz_precompiled_find: 2.32416701317 seconds 
xz_str_find: 0.656699895859 seconds 

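We can see that the regular expression-based functions tend to take about twice as long to run as the string methods-based function, and over 3 times as long in Python 3. For one-off parsing the time difference is trivial (nobody will miss those milliseconds), but for cases where the function must be called many times, the string methods-based approach is both simpler and faster.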
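
The functions timed above print every match, so the measurements include output cost as well as search cost. A hedged sketch in Python 3 of timing the search alone, using hypothetical variants that return lists instead of printing; the repeated sample sentence is a made-up stand-in for the book, and the numbers will differ by machine and Python version:

```python
import re
import timeit

WORD_RE = re.compile(r"\w+")


def xz_str_list(text):
    """String-method variant that returns matches instead of printing them."""
    return [w for w in text.split() if "x" in w and "z" in w]


def xz_re_list(text):
    """Regex variant: tokenize with the precompiled pattern, then filter."""
    return [w for w in WORD_RE.findall(text) if "x" in w and "z" in w]


if __name__ == "__main__":
    # Made-up sample text; substitute the contents of a real file if desired.
    text = "the lazy fox waxed a zax and a xyzzy " * 2000
    for func in (xz_str_list, xz_re_list):
        best = min(timeit.repeat(lambda: func(text), repeat=3, number=10))
        print(func.__name__ + ":", best, "seconds")
```

Taking the minimum of several repeats follows the same approach as the script above.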

+0

I like the string methods too. But here's a nitpick. I changed the definition of zx_re_find(text) and it's 4x faster than the pure string method: def zx_re_find(text): pat = re.compile('(\b\w*z\w*\b)') for word in pat.findall(text): if 'x' in word: print word – 2010-10-18 21:25:43

+0

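@Steven I've updated my answer to include both the solution you suggested in the comment and the one from your answer. I don't see a 4x performance gain from any of the regular expressions compared with the string method. For me, the RE solutions still trail behind. What text did you use to test performance? – gotgenes 2010-10-18 22:33:00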

+0

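@gotgenes I used the same plaintext copy of Moby Dick. I was running Python 2.7 on Windows XP (hmm, I forget which chip is in my work laptop). I remember the first three digits of the timings: 0.311 for the string method and 0.088 for the regex (not really 4x, but close). I maintain that if the requirements were more complicated, regex would win on both simplicity and performance. – 2010-10-18 23:47:29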