How do I obey robots.txt with Python 2.7?

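Problem description:

I am trying to crawl an entire website using Python 2.7:

I use robotparser.
I open every "a" link on the website and add it to the list of pages to retrieve.
I parse the robots.txt file.
The point is: I am trying to avoid all the paths listed in robots.txt, but they still end up in the list of pages to crawl.

How can I remove the robots.txt paths from my crawl list?

I couldn't find any help on Stack Overflow yet.

My code is below: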

import robotparser
import urlparse
import urllib
import urllib2
from BeautifulSoup import *

AGENT_NAME = 'PYMOTW'
URL_BASE = 'website'
urls = [URL_BASE]
visited = [URL_BASE] # Create a copy
parser = robotparser.RobotFileParser()
parser.set_url(urlparse.urljoin(URL_BASE, 'robot.txt'))
parser.read()

PATHS = [
    '/..../',
    ]

for path in PATHS:
    print '%6s : %s' % (parser.can_fetch(AGENT_NAME, path), path)
    url = urlparse.urljoin(URL_BASE, path)
    print '%6s : %s' % (parser.can_fetch(AGENT_NAME, url), url)
    robot = [url]

while (len(urls) > 0 and robot != True):
    html = urllib.urlopen(urls[0]).read()
    soup = BeautifulSoup(html) # Parse All HTML using BeautifulSoup
    urls.pop(0)

    # Retrieve all of Tags as a list
    for tags in soup.findAll('a', href = True):
        tags['href'] = urlparse.urljoin(URL_BASE, tags['href'])
        if URL_BASE in tags['href'] and tags['href'] not in visited:
            urls.append(tags['href'])
            visited.append(tags['href'])
        c = len(visited)

print visited
print 'page visited', c

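Welcome to Stack Overflow! I edited your post to remove the code snippet feature, which is only for HTML/JavaScript that runs in a web browser. As well as removing the Python 3 tag, I also fixed the spelling and added formatting to improve readability. Improving your question like this increases the chances of people reading it and giving you good answers.

Thanks @AnthonyGeoghegan – CDS

Hi @J.F.Sebastian. The return value is a list of True values. – CDS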

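Answer

You have several errors in your script.

I refactored a lot and tried to explain.

First, you are using a loop where you want recursion (for each page you fetch, you collect its links and repeat the process).
Then, for some reason I don't know, urlparse.urljoin fails... (your URL gets truncated), so I concatenate manually.
Beautiful Soup is heavy, so I refactored to parse only the links rather than the whole page. A page can have both relative and absolute links, so you need to handle both.
robotparser seems pretty dumb: the path must be exact (/test and /test/ are not the same to it). It also does not understand full URLs if they are not specified as such in robots.txt (testing http://example.com/test matches */test but not /test ...).
Edit: I made the script a bit more robust by filtering the matched URLs.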
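
The truncation mentioned in the second point is consistent with how urljoin resolves references: a base URL without a trailing slash has its last path segment replaced, and a reference starting with / replaces the whole base path. A small sketch of that behaviour, assuming the base URL used in this answer; the try/except import is only there so the snippet runs on both Python 2.7 and Python 3 and is not part of the original script:

```python
try:
    from urlparse import urljoin          # Python 2.7
except ImportError:
    from urllib.parse import urljoin      # Python 3

BASE = 'http://www.dcs.bbk.ac.uk/~martin/sewn/ls3'

# No trailing slash: the last segment ("ls3") is treated like a file name and replaced.
no_slash = urljoin(BASE, 'robots.txt')

# Trailing slash: the reference is resolved inside the ls3/ directory.
with_slash = urljoin(BASE + '/', 'robots.txt')

# A leading "/" in the reference discards the whole base path.
rooted = urljoin(BASE + '/', '/testpage.html')

print(no_slash)
print(with_slash)
print(rooted)
```

So appending a trailing slash to the base and joining with slash-free relative paths may be an alternative to manual concatenation.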

This gives me:

import robotparser
import urlparse
import urllib
from BeautifulSoup import BeautifulSoup, SoupStrainer

AGENT_NAME = 'PYMOTW'
URL_BASE = 'http://www.dcs.bbk.ac.uk/~martin/sewn/ls3'
DOMAIN = urlparse.urlparse(URL_BASE).hostname
visited = ['/']  # URLs already handled, seeded with the root path

parser = robotparser.RobotFileParser()
parser.set_url(URL_BASE + '/robots.txt')
parser.read()


def process_url(url):
    # transform relative paths into absolute URLs
    parsed_path = urlparse.urlparse(url)
    if not parsed_path.hostname:
        url = URL_BASE + url

    # check domain (this only logs: there is no return, so processing continues;
    # hostname is None for relative paths, so they are reported here too)
    if parsed_path.hostname != DOMAIN:
        print 'External domain ignored: %s' % parsed_path.hostname

    # ensure we are allowed to fetch url
    if not parser.can_fetch(AGENT_NAME, parsed_path.path):
        print 'Not allowed to fetch %s' % parsed_path.path
        return

    # ensure we did not already visit it
    if url in visited:
        print 'Ignoring already visited %s' % url
        return

    print 'Visiting: %s' % url
    html = urllib.urlopen(url).read()
    visited.append(url)
    # parse only the <a href=...> tags instead of the whole page
    links = BeautifulSoup(html, parseOnlyThese=SoupStrainer('a', href=True))

    # recurse into every link found on the page
    for link in links:
        parsed_link = urlparse.urlparse(link['href'])
        if len(link['href']) == 0:
            print 'Ignoring empty link'
        elif link['href'][0] == '#':
            print 'Ignoring hash link %s' % link['href']
        elif parsed_path.hostname and parsed_link.scheme not in [None, 'http', 'https']:
            print 'Ignoring non http(s) links %s' % link['href']
        else:
            process_url(link['href'])

PATHS = [
    '/testpage.html',
    '/files/',
    '/images/',
    '/private/'
]
for path in PATHS:
    process_url(path)
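
On the original question of keeping disallowed paths out of the crawl list, one option is to test each candidate with can_fetch before it is appended. The following is a minimal sketch, not the script above: the rules and the example.com URL are hypothetical (they are not the real robots.txt of the site), they are parsed from a string so the snippet needs no network access, and the helper name allowed_only is made up for illustration.

```python
try:
    import robotparser                     # Python 2.7
except ImportError:
    from urllib import robotparser         # Python 3

# Hypothetical rules, used instead of fetching a real robots.txt.
RULES = """\
User-agent: *
Disallow: /private/
Disallow: /files/
"""

AGENT_NAME = 'PYMOTW'

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())
# Mark the rules as loaded; redundant on versions where parse() already does this.
parser.modified()


def allowed_only(candidates):
    """Return the candidates that the parsed rules allow AGENT_NAME to fetch."""
    return [url for url in candidates if parser.can_fetch(AGENT_NAME, url)]


candidates = [
    '/testpage.html',
    '/private/secret.html',
    '/private',                              # no trailing slash: not under "/private/"
    'http://example.com/files/report.pdf',   # full URL: only its path is matched
]
crawl_list = allowed_only(candidates)
print(crawl_list)
```

The '/private' entry stays in the list because rule matching is a plain prefix test on the path, which is the /test versus /test/ behaviour described above.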

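I had not defined functions yet. I am a beginner, so I tried to build a crawler with my basic knowledge. Your corrections are clear to me. I understand the recursion well, even inside a function (a first for me), but I have some questions: - Why are all the domains on the stack visited? Can I work only on the domain /ls3/...? - In def process_url(url): my variable is not clear to me: url = URL_BASE + url (url?). Is it the parameter of the called function process_url? – CDS

You can filter so that nothing is crawled when you are outside the domain. Put a new if condition before 'print 'Visiting: %s' % url'. Also note that this "bot" does not track which domain it is on and always prepends the URL_BASE variable. Yes, 'url' is the parameter of the function 'process_url', but since it can be an absolute or a relative path, I redefine it so that it always ends up as an absolute URL. – Cyrbil

@Cyrbil If I insert a new condition before print 'Visiting: %s' % url, the program keeps reading the robots rules instead of the tags I actually need to crawl. Do I have to create a new function to read my links? That is not clear to me. How do I crawl the pages if I am crawling the opposite part? What exactly does "bot" mean? – CDS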
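
The filter suggested in the comments above could look roughly like the sketch below. It is only one possible reading of that suggestion: the helper name in_scope is hypothetical, it expects an absolute URL, and it treats "inside the domain" as "same host and under the path of URL_BASE".

```python
try:
    from urlparse import urlparse          # Python 2.7
except ImportError:
    from urllib.parse import urlparse      # Python 3

URL_BASE = 'http://www.dcs.bbk.ac.uk/~martin/sewn/ls3'
_base = urlparse(URL_BASE)


def in_scope(url):
    """True if an absolute url is on URL_BASE's host and under its path."""
    parsed = urlparse(url)
    if parsed.hostname != _base.hostname:
        return False
    base_path = _base.path.rstrip('/')
    # Accept the base path itself or anything below it, but not a sibling such as /ls30.
    return parsed.path == base_path or parsed.path.startswith(base_path + '/')


print(in_scope(URL_BASE + '/testpage.html'))
print(in_scope('http://example.com/~martin/sewn/ls3/testpage.html'))
```

In process_url, a check such as "if not in_scope(url): return" placed just before the 'Visiting' print would then skip anything outside /ls3/, assuming url has already been made absolute at that point.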
