How can I easily remove certain words from a text string?

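Problem description:

I hope everyone is in good health. How can I easily remove certain words from a text string?

I tried str_replace, but without any success.

What I want is to remove all of these " from this whole text, because I don't want to obfuscate the script, and these HTML special characters really mess it up so that it doesn't work. There are far too many of them now, so how can I replace all of them with ""?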

#!/usr/bin/perl 
# This script is NOT written or modified by me, I only copy pasted it from the internet. 
# It was First originally Written by [email protected] 
# & Have been modified by various persons over the net to fix/add various functions. 
# Like For Example modified by member of comstuff.net to satisfy common and dynamic content. 
# th30nly @comstuff.net a.k.a invisible_theater , Syaifudin JW , Ucok Karnadi and possibly other people too. 
# For more info, http://wiki.squid-cache.org/ConfigExamples/DynamicContent/YouTube 
# Syed Jahanzaib/[email protected] 
# http://aacable.wordpress.com/2012/01/19/youtube-caching-with-squid-2-7-using-storeurl-pl/ 
####################### 
# Special thanks to some indonesian friends who provided some updates, 
## UPDATED on 20 January, 2014/Syed Jahanzaib 

##################### 
#### REFERENCES ##### http://www2.fh-lausitz.de/launic/comp/misc/squid/projekt_youtube/ 
##################### 
##################### 

use IO::File; 
$|=1; 
STDOUT->autoflush(1); 
$debug=1; ## recommended:0 
$bypassallrules=0; ## recommended:0 
$sucks=""; ## unused 
$sucks="sucks" if ($debug>=1); 
$timenow=""; 
$printtimenow=1; ## print timenow: 0|1 
my $logfile = '/tmp/storeurl.log'; 

open my $logfh, '>>', $logfile 
or die "Couldn't open $logfile for appending: $!\n" if $debug; 
$logfh->autoflush(1) if $debug; 

#### main 
while (<>) { 
$timenow=time()." " if ($printtimenow); 
print $logfh "$timenow"."in : $_" if ($debug>=1); 
chop; ## strip eol 
@X = split; 
$x = $X[0]; ## 0 
$u = $X[1]; ## url 
$_ = $u; ## url 

if ($bypassallrules){ 
$out="$u"; ## map 1:1 

#youtube with range (YOUTUBE has split its videos into segments) 
}elsif (m/(youtube|google).*videoplayback\?.*range/){ 
@itag = m/[&?](itag=[0-9]*)/; 
@id = m/[&?](id=[^\&]*)/; 
@range = m/[&?](range=[^\&\s]*)/; 
@begin = m/[&?](begin=[^\&\s]*)/; 
@redirect = m/[&?](redirect_counter=[^\&]*)/; 
$out="http://video-srv.youtube.com.SQUIDINTERNAL/@itag&@id&@range&@redirect"; 
#sleep(1); ## delay loop 

#youtube without range 
}elsif (m/(youtube|google).*videoplayback\?/){ 
@itag = m/[&?](itag=[0-9]*)/; 
@id = m/[&?](id=[^\&]*)/; 
@redirect = m/[&?](redirect_counter=[^\&]*)/; 
$out="http://video-srv.youtube.com.SQUIDINTERNAL/@itag&@id&@redirect"; 
#sleep(1); ## delay loop 

#speedtest 
}elsif (m/^http:\/\/(.*)\/speedtest\/(.*\.(jpg|txt))\?(.*)/) { 
$out="http://www.speedtest.net.SQUIDINTERNAL/speedtest/" . $2 . ""; 

#mediafire 
}elsif (m/^http:\/\/199\.91\.15\d\.\d*\/\w{12}\/(\w*)\/(.*)/) { 
$out="http://www.mediafire.com.SQUIDINTERNAL/" . $1 ."/" . $2 . ""; 

#fileserve 
}elsif (m/^http:\/\/fs\w*\.fileserve\.com\/file\/(\w*)\/[\w-]*\.\/(.*)/) { 
$out="http://www.fileserve.com.SQUIDINTERNAL/" . $1 . "./" . $2 . ""; 

#filesonic 
}elsif (m/^http:\/\/s[0-9]*\.filesonic\.com\/download\/([0-9]*)\/(.*)/) { 
$out="http://www.filesonic.com.SQUIDINTERNAL/" . $1 . ""; 

#4shared 
}elsif (m/^http:\/\/[a-zA-Z]{2}\d*\.4shared\.com(:8080|)\/download\/(.*)\/(.*\..*)\?.*/) { 
$out="http://www.4shared.com.SQUIDINTERNAL/download/$2\/$3"; 

#4shared preview 
}elsif (m/^http:\/\/[a-zA-Z]{2}\d*\.4shared\.com(:8080|)\/img\/(\d*)\/\w*\/dlink__2Fdownload_2F(\w*)_3Ftsid_3D[\w-]*\/preview\.mp3\?sId=\w*/) { 
$out="http://www.4shared.com.SQUIDINTERNAL/$2"; 

#photos-X.ak.fbcdn.net where X a-z 
}elsif (m/^http:\/\/photos-[a-z](\.ak\.fbcdn\.net)(\/.*\/)(.*\.jpg)/) { 
$out="http://photos" . $1 . "/" . $2 . $3 . ""; 

#YX.sphotos.ak.fbcdn.net where X 1-9, Y a-z 
} elsif (m/^http:\/\/[a-z][0-9]\.sphotos\.ak\.fbcdn\.net\/(.*)\/(.*)/) { 
$out="http://photos.ak.fbcdn.net/" . $1 ."/". $2 . ""; 

#maps.google.com 
} elsif (m/^http:\/\/(cbk|mt|khm|mlt|tbn)[0-9]?(.google\.co(m|\.uk|\.id).*)/) { 
$out="http://" . $1 . $2 . ""; 

# compatibility for old cached get_video?video_id 
} elsif (m/^http:\/\/([0-9.]{4}|.*\.youtube\.com|.*\.googlevideo\.com|.*\.video\.google\.com).*?(videoplayback\?id=.*?|video_id=.*?)\&(.*?)/) { 
$z = $2; $z =~ s/video_id=/get_video?video_id=/; 
$out="http://video-srv.youtube.com.SQUIDINTERNAL/" . $z . ""; 
#sleep(1); ## delay loop 

} elsif (m/^http:\/\/www\.google-analytics\.com\/__utm\.gif\?.*/) { 
$out="http://www.google-analytics.com/__utm.gif"; 

#Cache High Latency Ads 
} elsif (m/^http:\/\/([a-z0-9.]*)(\.doubleclick\.net|\.quantserve\.com|\.googlesyndication\.com|yieldmanager|cpxinteractive)(.*)/) { 
$y = $3;$z = $2; 
for ($y) { 
s/pixel;.*/pixel/; 
s/activity;.*/activity/; 
s/(imgad[^&]*).*/\1/; 
s/;ord=[?0-9]*//; 
s/;&timestamp=[0-9]*//; 
s/[&?]correlator=[0-9]*//; 
s/&cookie=[^&]*//; 
s/&ga_hid=[^&]*//; 
s/&ga_vid=[^&]*//; 
s/&ga_sid=[^&]*//; 
# s/&prev_slotnames=[^&]*// 
# s/&u_his=[^&]*//; 
s/&dt=[^&]*//; 
s/&dtd=[^&]*//; 
s/&lmt=[^&]*//; 
s/(&alternate_ad_url=http%3A%2F%2F[^(%2F)]*)[^&]*/\1/; 
s/(&url=http%3A%2F%2F[^(%2F)]*)[^&]*/\1/; 
s/(&ref=http%3A%2F%2F[^(%2F)]*)[^&]*/\1/; 
s/(&cookie=http%3A%2F%2F[^(%2F)]*)[^&]*/\1/; 
s/[;&?]ord=[?0-9]*//; 
s/[;&]mpvid=[^&;]*//; 
s/&xpc=[^&]*//; 
# yieldmanager 
s/\?clickTag=[^&]*//; 
s/&u=[^&]*//; 
s/&slotname=[^&]*//; 
s/&page_slots=[^&]*//; 
} 
$out="http://" . $1 . $2 . $y . ""; 

#cache high latency ads 
} elsif (m/^http:\/\/(.*?)\/(ads)\?(.*?)/) { 
$out="http://" . $1 . "/" . $2 . ""; 

# spicific servers starts here.... 
} elsif (m/^http:\/\/(www\.ziddu\.com.*\.[^\/]{3,4})\/(.*?)/) { 
$out="http://" . $1 . ""; 

#cdn, varialble 1st path 
} elsif (($u =~ /filehippo/) && (m/^http:\/\/(.*?)\.(.*?)\/(.*?)\/(.*)\.([a-z0-9]{3,4})(\?.*)?/)) { 
@y = ($1,$2,$4,$5); 
$y[0] =~ s/[a-z0-9]{2,5}/cdn./; 
$out="http://" . $y[0] . $y[1] . "/" . $y[2] . "." . $y[3] . ""; 

#rapidshare 
} elsif (($u =~ /rapidshare/) && (m/^http:\/\/(([A-Za-z]+[0-9-.]+)*?)([a-z]*\.[^\/]{3}\/[a-z]*\/[0-9]*)\/(.*?)\/([^\/\?\&]{4,})$/)) { 
$out="http://cdn." . $3 . "/SQUIDINTERNAL/" . $5 . ""; 

} elsif (($u =~ /maxporn/) && (m/^http:\/\/([^\/]*?)\/(.*?)\/([^\/]*?)(\?.*)?$/)) { 
$out="http://" . $1 . "/SQUIDINTERNAL/" . $3 . ""; 

#like porn hub variables url and center part of the path, filename etention 3 or 4 with or without ? at the end 
} elsif (($u =~ /tube8|pornhub|xvideos/) && (m/^http:\/\/(([A-Za-z]+[0-9-.]+)*?(\.[a-z]*)?)\.([a-z]*[0-9]?\.[^\/]{3}\/[a-z]*)(.*?)((\/[a-z]*)?(\/[^\/]*){4}\.[^\/\?]{3,4})(\?.*)?$/)) { 
$out="http://cdn." . $4 . $6 . ""; 

#...spicific servers end here. 

#photos-X.ak.fbcdn.net where X a-z 
} elsif (m/^http:\/\/photos-[a-z].ak.fbcdn.net\/(.*)/) { 
$out="http://photos.ak.fbcdn.net/" . $1 . ""; 

#for yimg.com video 
} elsif (m/^http:\/\/(.*yimg.com)\/\/(.*)\/([^\/\?\&]*\/[^\/\?\&]*\.[^\/\?\&]{3,4})(\?.*)?$/) { 
$out="http://cdn.yimg.com//" . $3 . ""; 

#for yimg.com doubled 
} elsif (m/^http:\/\/(.*?)\.yimg\.com\/(.*?)\.yimg\.com\/(.*?)\?(.*)/) { 
$out="http://cdn.yimg.com/" . $3 . ""; 

#for yimg.com with &sig= 
} elsif (m/^http:\/\/(.*?)\.yimg\.com\/(.*)/) { 
@y = ($1,$2); 
$y[0] =~ s/[a-z]+[0-9]+/cdn/; 
$y[1] =~ s/&sig=.*//; 
$out="http://" . $y[0] . ".yimg.com/" . $y[1] . ""; 

#youjizz. We use only domain and filename 
} elsif (($u =~ /media[0-9]{2,5}\.youjizz/) && (m/^http:\/\/(.*)(\.[^\.\-]*?\..*?)\/(.*)\/([^\/\?\&]*)\.([^\/\?\&]{3,4})((\?|\%).*)?$/)) { 
@y = ($1,$2,$4,$5); 
$y[0] =~ s/(([a-zA-A]+[0-9]+(-[a-zA-Z])?$)|(.*cdn.*)|(.*cache.*))/cdn/; 
$out="http://" . $y[0] . $y[1] . "/" . $y[2] . "." . $y[3] . ""; 

#general purpose for cdn servers. add above your specific servers. 
} elsif (m/^http:\/\/([0-9.]*?)\/\/(.*?)\.(.*)\?(.*?)/) { 
$out="http://squid-cdn-url//" . $2 . "." . $3 . ""; 

#generic http://variable.domain.com/path/filename."ex" "ext" or "exte" with or withour "? or %" 
} elsif (m/^http:\/\/(.*)(\.[^\.\-]*?\..*?)\/(.*)\.([^\/\?\&]{2,4})((\?|\%).*)?$/) { 
@y = ($1,$2,$3,$4); 
$y[0] =~ s/(([a-zA-Z]+[0-9]+(-[a-zA-Z])?$)|(.*cdn.*)|(.*cache.*))/cdn/; 
$out="http://" . $y[0] . $y[1] . "/" . $y[2] . "." . $y[3] . ""; 

} else { 
$out="$u"; ##$X[2]="$sucks"; 
} 
print $logfh "$timenow"."out: $x $out $X[2] $X[3] $X[4] $X[5] $X[6] $X[7]\n" if ($debug>=1); 
print "$x $out $X[2] $X[3] $X[4] $X[5] $X[6] $X[7]\n"; 
} 
close $logfh if ($debug); 
+0

Oohh, so much text, it is very hard to read. – nicael

+2

So, doesn't 'str_replace(array('&','amp;','quot;'),'',$s)' work? http://ideone.com/8qEvXx

+0

I don't believe this is PHP. Please remove the PHP tag from this question.

You can always do something like this

Find: &(?:(?:amp|gt|lt|quot);)+
Replace: ''
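The find/replace pair above can also be applied in code. This is a minimal sketch, assuming PHP and a made-up sample line with a doubly-encoded quote; the variable names are illustrative and not part of the original answer.

```php
<?php
// Hypothetical sample line: a doubly-encoded quote (&amp;quot;) around a Perl string.
$strSrc = '$out=&amp;quot;$u&amp;quot;; ## map 1:1';

// Same pattern as the Find field above: an ampersand followed by
// one or more of the entity names amp, gt, lt or quot, each ending in ';'.
$pattern = '~&(?:(?:amp|gt|lt|quot);)+~';

// Replace every match with the empty string, as in the Replace field.
$strClean = preg_replace($pattern, '', $strSrc);

echo $strClean, PHP_EOL;
```

For this sample the result is `$out=$u; ## map 1:1`. Note that the entities are deleted outright, so the quotes they stood for are gone as well.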

Edit
For the extra part, a PHP sample

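$strNewSrc = preg_replace_callback('~http://([^;\r\n]+);~', 
    function($matches) 
    { 
     // $matches[1] is everything between "http://" and the next ";" 
     $tmp = $matches[1]; 
     // Strip all whitespace from the captured part 
     $tmp = preg_replace('~\s+~', '', $tmp); 
     // PHP concatenates strings with ".", not "+" 
     $repl = "http://" . $tmp . ";"; 
     return $repl; 
    }, 
    $strSrc 
); 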
+0

Great, it works in Sublime Text :D Except that I also want to remove the extra spaces in the links, such as 'http:// . . . / . $y . .'. Any regex for that would be good too.

+0

Yes. After you have run the code above, use this: an ordinary regex replace with a callback function. Find: 'http://([^;\r\n]+);' Replace with: "http://" + RemoveWsp($1) + ";" – sln

+0

Here is a clearer example on Pastebin: http://pastebin.com/icBRa9t4

The commented steps here may help you do this:

<?php 

    // READ THE CONTENTS OF THE FILE INTO A VARIABLE: $fileContents 
    $fileContents = file_get_contents(__DIR__ . "/_data_1.txt"); 

    // CLEAN UP THE TEXT (REMOVING: quot; AND/OR amp; AND/OR &amp;) 
    // PLAIN ALTERNATION, LONGEST FORM FIRST; A LAZY "*?" GROUP WOULD MATCH THE EMPTY STRING AND REMOVE NOTHING 
    $cleanText  = preg_replace("#&amp;|amp;|quot;#i", "", $fileContents); 

    // RE-SAVE THE FILE 
    file_put_contents(__DIR__ . "/_data_2.txt", $cleanText); 
    var_dump($cleanText);
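
If the aim is to get the original characters back rather than delete them, decoding the entities is another option. The sketch below swaps in PHP's built-in html_entity_decode instead of a regex; it is not what the answers above propose. The helper name decodeEntitiesFully, the pass limit and the sample line are all assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the answers above): decode entities
// repeatedly, because text such as &amp;quot; has been encoded twice.
function decodeEntitiesFully(string $text, int $maxPasses = 5): string
{
    for ($i = 0; $i < $maxPasses; $i++) {
        $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        if ($decoded === $text) {
            break; // nothing left to decode
        }
        $text = $decoded;
    }
    return $text;
}

// Made-up sample line in the style of the script from the question.
$sample = '$out=&amp;quot;$u&amp;quot;; @id = m/[&amp;?](id=[^\&amp;]*)/;';
echo decodeEntitiesFully($sample), PHP_EOL;
```

For this sample the output is `$out="$u"; @id = m/[&?](id=[^\&]*)/;`. Repeated decoding would also change any text that is meant to contain a literal entity, so it is worth checking the result against a known-good line of the script.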