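How to recursively extract email addresses from a website using Linux commands
Problem description:
I want to recursively extract email addresses from a website, so that if the site links to other pages that contain email addresses, those pages are traversed as well and their email IDs are extracted.
I tried the following for a depth of 2 levels: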
wget -r -l 2 -O - some site name |grep -E -o "[a-z0-9.][email protected][a-z0-9.-]+.[a-z]{2,4}">>some.txt
However, when I run this command it only creates a file "some.txt" and does not extract any email IDs.
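Answer
Why not run a recursive wget that stores the site locally, and then run a recursive grep (grep -r) over the local copy? Just add rm -Rf to the end of the script to delete the downloaded site once you are done.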
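A minimal sketch of that approach, assuming GNU wget and GNU grep are installed. The function names `extract_emails` and `crawl_emails`, the URL `http://example.com`, and the regular expression are illustrative assumptions, not part of the original answer. The pattern is deliberately simple and will miss some valid addresses.

```shell
#!/bin/sh
# Sketch: mirror a site locally, grep the mirror for addresses, then delete it.
# All names and the URL below are hypothetical placeholders.

# Print every unique email-like string found under directory $1.
# -r recurse, -h omit file names, -o print only the match,
# -i ignore case, -E extended regular expressions.
extract_emails() {
    grep -rhoiE '[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}' "$1" | sort -u
}

# Mirror URL $1 two levels deep into a temporary directory,
# print the addresses found there, and remove the mirror.
crawl_emails() {
    tmp_dir=$(mktemp -d) || return 1
    # -q quiet, -r recursive, -l 2 depth limit, -P download directory
    wget -q -r -l 2 -P "$tmp_dir" "$1"
    extract_emails "$tmp_dir"
    rm -Rf "$tmp_dir"
}

# Usage (not run automatically):
#   crawl_emails "http://example.com" > some.txt
```

Saving the pages to disk first means grep runs over the complete mirror after the download finishes, which avoids combining -r with -O.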
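Answer
Using PHP, as follows:
Point 1) Developers sometimes write the email ID as HTML entities (for example, the letters "rish" encoded as entities), so the markup has to be decoded first.
Point 2) The email address is written in href="mailto:[email protected]", so we can match it with a regular expression.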
<?php
$str = '<div class="call-to-action ">
<a title="Email" class="contact contact-main contact-email "
href="mailto:[email protected]?subject=Enquiry%2C%20sent%20from%20yellowpages.com.au&
body=%0A%0A%0A%0A%0A------------------------------------------%0AEnquiry%20via%20yellowpages.com.au%0Ahttp%3A%2F%2Fyellowpages.com.au%2Fact%2Fphillip%2Fcanberra-eye-laser-15333167-listing.html%3Fcontext%3DbusinessTypeSearch"
rel="nofollow" data-email="[email protected]">
<span class="glyph icon-email border border-dark-blue with-text"></span><span class="contact-text">Email</span>
<a href="mailto:rishabhdubey20@gmail.com">
</a>
</div>';
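// To read from a URL instead: $str = file_get_contents('http://example.com'); (I prefer cURL for fetching).
// Decode HTML entities first, because addresses are sometimes written as entities.
$str = html_entity_decode($str);
// Capture everything after "mailto:" up to the first ?, quote, > or whitespace,
// so an address with no query string does not swallow the rest of the markup.
$regex = '/mailto:([^?"\'>\s]+)/';
if (preg_match_all($regex, $str, $matches_out)) {
    echo "Found a match!";
    echo "<pre>";
    // Index 1 holds the captured addresses without the "mailto:" prefix.
    var_dump($matches_out[1]);
} else {
    echo "The regex pattern does not match. :(";
}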
?>
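See this link: http://stackoverflow.com/questions/13858344/extract-email-addresses-from-a-website-using-scripts
@user3125133 I went through the link. It says we can't use -O together with -r: wget will not write the recursively fetched links to standard output, so there is no way to send the recursive links to stdout. – Mohit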