sed to remove URLs from a file

Question:

I am trying to write a sed expression that can remove URLs from a file.

For example:

http://samgovephotography.blogspot.com/ updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:) 

Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB Contributor @kdpartak :)

I want the URLs removed, but my attempt does not work:

sed 's/[\w \W \s]*http[s]*:\/\/\([\w \W]\)\+[\w \W \s]*/ /g' posFile

FIXED!!!!!

This handles almost all cases, even malformed URLs:

sed 's/[\w \W \s]*http[s]*[a-zA-Z0-9 : \. \/ ; % " \W]*/ /g' positiveTweets | grep "http" | more

When working with URLs, file paths, etc., I prefer to use "|" as the sed delimiter so I don't have to escape /. For example: sed 's|/path/to/some/file/|/newpath/to/new/file/|g' – 2010-11-26 09:55:43

@JP19, love it, will try this – daydreamer 2010-11-26 22:38:50
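To illustrate the delimiter tip from the comment above, here is a minimal sketch that writes the same substitution twice, once with "/" and once with "|". The function names and the sample input line are hypothetical, chosen only for illustration.

```shell
# Hypothetical helpers: both perform the same path replacement.

# Default "/" delimiter: every slash in the pattern must be escaped.
replace_with_slash() {
    sed 's/\/path\/to\/some\/file\//\/newpath\/to\/new\/file\//g'
}

# "|" delimiter: slashes can be written literally.
replace_with_pipe() {
    sed 's|/path/to/some/file/|/newpath/to/new/file/|g'
}

printf '%s\n' 'cp /path/to/some/file/a.txt .' | replace_with_pipe
```

Any character that does not occur in the pattern or the replacement can serve as the delimiter, which is why the answers below use "!" for URL patterns.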

Answer

The following removes http:// or https:// and everything up to the next whitespace character:

sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g' posFile 
updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:) 

Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N Thx to HMB Contributor @kdpartak :)

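Edit:

I should have used:

sed -e 's!http[s]\?://\S*!!g' posFile

"[s]\?" is a far more readable way of writing "an optional s" than "\(s\)\{0,1\}".

"\S*" is a more readable version of "any non-space characters" than "[^[:space:]]*".

I must have been using the sed that came installed with my Mac at the time I wrote this answer (brew install gnu-sed FTW).

There are better URL regular expressions out there (ones that take into account schemes other than HTTP(S), for instance), but this will work for you, given the examples you provided. Why complicate things?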
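Since "\?" and "\S" are GNU extensions, the shorter command may not behave the same on other sed implementations. Below is a minimal sketch of a middle ground, assuming a sed that accepts the -E (extended regular expression) option, which both GNU sed and the BSD sed shipped with macOS do. The function name strip_urls is hypothetical.

```shell
# Hypothetical helper: remove http:// and https:// URLs from standard input.
# -E enables extended regular expressions, so "s?" means "an optional s".
# [^[:space:]]* is the POSIX spelling of "any run of non-whitespace characters".
# "#" is used as the delimiter so the slashes in "://" need no escaping.
strip_urls() {
    sed -E 's#https?://[^[:space:]]*##g'
}

# Usage: read the file from the question and print it without URLs.
# strip_urls < posFile
```

As with the commands above, the whitespace that surrounded a URL is left in place, so a removed URL in the middle of a line leaves two adjacent spaces.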

Answer

The accepted answer provides the approach I used to remove URLs, etc. from my files. However, it left "blank" lines. Here is a solution.

sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' input_file 

perl -i -pe 's/^'`echo "\012"`'${2,}//g' input_file

The GNU sed flags and expressions used are:

-i Edit in-place 
-e [-e script] --expression=script : basically, add the commands in script 
     (expression) to the set of commands to be run while processing the input 
^ Match start of line 
$ Match end of line 


? Match zero or one of preceding regular expression (written \? in the sed commands above) 
{2,} Match 2 or more of preceding regular expression 
\S* Zero or more non-whitespace characters; alternative to: [^[:space:]]*

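However,

sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g'

leaves nonprinting character(s) behind, presumably \n (newlines). Standard sed-based approaches for removing "blank" lines, tabs and spaces, e.g.

sed -i 's/^[ \t]*//; s/[ \t]*$//'

do not work here: unless you use a "branch label" to process newlines, you cannot replace them with sed (which reads its input one line at a time).

The solution is to use the following Perl expression: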

perl -i -pe 's/^'`echo "\012"`'${2,}//g'

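It uses a shell substitution,

'`echo "\012"`'

to replace an octal value

\012

(i.e., a newline, \n) that occurs 2 or more times,

{2,}

(otherwise we would unwrap all lines), with something else; here:

//

i.e., nothing.

[The second reference below provides a wonderful table of these values!]

The perl flags used are: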

-p Places a printing loop around your command, 
    so that it acts on each line of standard input 

-i Edit in-place 

-e Allows you to provide the program as an argument, 
    rather than in a file
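As a possible alternative to the Perl step, a line left empty after URL removal is still a line of its own, so it can be dropped with sed's d command without joining any newlines. The following is a minimal sketch, assuming a sed that accepts -E and assuming it is acceptable to also drop lines that were already blank in the input. The function name is hypothetical, and the ftp pattern is narrowed to ftp:// rather than the ftp: used above.

```shell
# Hypothetical helper: strip URLs, then delete lines left empty or whitespace-only.
strip_urls_and_blank_lines() {
    # 1st expression: remove http://, https:// and ftp:// URLs.
    # 2nd expression: remove bare www. addresses.
    # 3rd expression: delete any line that now holds only whitespace (or nothing).
    sed -E \
        -e 's#(https?|ftp)://[^[:space:]]*##g' \
        -e 's#www\.[^[:space:]]*##g' \
        -e '/^[[:space:]]*$/d'
}

# Usage: write the cleaned text to a new file instead of editing in place.
# strip_urls_and_blank_lines < url_test_input.txt > a
```

Writing to a new file avoids the -i flag, whose syntax differs between GNU and BSD sed.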

References:

Perl flags: Perl flags -pe, -pi, -p, -w, -d, -i, -t?
ASCII control codes: https://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/
Removing URLs: sed to remove URLs from a file
Branch labels: How can I replace a newline (\n) using sed?
GNU sed manual: https://www.gnu.org/software/sed/manual/sed.html
Quick regular expression guide: https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html

Example:

$ cat url_test_input.txt 

Some text ... 
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file 
https://www.google.ca/search?dcr=0&ei=QCsyWtbYF43YjwPpzKyQAQ&q=python+remove++citations&oq=python+remove++citations&gs_l=psy-ab.3...1806.1806.0.2004.1.1.0.0.0.0.61.61.1.1.0....0...1c.1.64.psy-ab..0.0.0....0.-cxpNc6youY 
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html 
https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html 
http://datasynce.org/2017/05/sentiment-analysis-on-python-through-textblob/ 
https://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw 
http://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw 
www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw 
ftp://ftp.ncbi.nlm.nih.gov/ 
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/alignment_indices/20100804.alignment.index 
Some more text. 

$ sed -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' url_test_input.txt > a 

$ cat a 

Some text ... 










Some more text. 

$ perl -i -pe 's/^'`echo "\012"`'${2,}//g' a 

Some text ... 
Some more text. 

$
