Scraper throws an invalid URL error

Problem description:

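I wrote a scraper in Python to fetch the lot numbers from a web page. However, when I run it, the console shows "The requested URL is invalid". I printed the response URL and it looks valid. Is there something wrong with the way I make the request? This is the script I tried: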

import requests 
from lxml import html 

payload = {"keyword":"degas"} 

headers={ 
"Content-Type":"text/html; charset=UTF-8", 
"User-Agent":"Mozilla/5.0" 
} 

response = requests.get("http://www.sothebys.com/en/search-results.html?", params=payload, headers=headers, allow_redirects=False) 
# tree = html.fromstring(response.text) 
# for item in tree.cssselect("div.search-results-lot-number"): 
#  print(item.text) 

print(response.url) 
print(response.text) 
print(response.status_code)

This is what the console shows when I print "response.url", "response.text" and "response.status_code":

http://www.sothebys.com/en/search-results.html?keyword=degas 
<HTML><HEAD> 
<TITLE>Invalid URL</TITLE> 
</HEAD><BODY> 
<H1>Invalid URL</H1> 
The requested URL "&#91;no&#32;URL&#93;", is invalid.<p> 
Reference&#32;&#35;9&#46;541d2017&#46;1503578560&#46;40be2bd 
</BODY></HTML> 

400

By the way, if I open the URL manually in a browser, it does take me to the desired page.

The URL you have redirects to a "not found" page: http://www.sothebys.com/en/notfound.html – Mekicha

Since your request includes 'allow_redirects=False', it simply returns an error. – Mekicha

@Mekicha, "allow_redirects=True" gives me the same result as passing False. – SIM
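
One way to check the redirect claim made in these comments is to look at what `requests` records about each hop. The helper below is a minimal sketch: `describe_redirects` is a hypothetical name, and it only reads the `history`, `status_code`, `url` and `headers` attributes that a `requests.Response` exposes, so it assumes nothing about this particular site.

```python
def describe_redirects(response):
    """Return one line per hop: each redirect that was followed, then the final response."""
    lines = []
    # response.history holds the intermediate responses, oldest first.
    # It is empty when no redirect was followed (e.g. allow_redirects=False).
    for hop in response.history:
        lines.append(f"{hop.status_code} {hop.url} -> {hop.headers.get('Location')}")
    lines.append(f"{response.status_code} {response.url} (final)")
    return lines
```

Calling `print("\n".join(describe_redirects(response)))` after the `requests.get(...)` call in the question would show whether any 3xx hop occurred. In the output above the status is 400 rather than a 3xx code, which suggests the server rejected the request outright instead of redirecting it.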

Answer

I got it working as follows.

import requests 

payload = { 
    'keyword':'degas', 
    'pageSize':'24', 
    'offset':'0' 
    } 

headers={ 
    'Accept':'application/json, text/javascript, */*; q=0.01', 
    'Referer':'http://www.sothebys.com/en/search-results.html?keyword=degas', 
    "User-Agent":"Mozilla/5.0" 
    } 

response = requests.get("http://www.sothebys.com/en/search", params=payload, headers=headers) 

print(response.url) 
print(response.status_code) 
print(response.text)
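
The request above asks for a single page of results. If `offset` and `pageSize` behave as their names suggest (an assumption; this answer does not document them), later pages could be requested by stepping `offset` in multiples of `pageSize`. `search_params` below is a hypothetical helper, not part of the site or of `requests`.

```python
def search_params(keyword, page=0, page_size=24):
    """Build the query parameters for one page of search results.

    Assumes `offset` counts results rather than pages, which the
    answer above does not confirm.
    """
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return {
        "keyword": keyword,
        "pageSize": str(page_size),
        "offset": str(page * page_size),
    }
```

Under that assumption, passing `params=search_params("degas", page=1)` to the same `requests.get` call would ask for results 24 to 47.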

Answer

I think you are using the wrong headers. The following header worked for me:

headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}

Output:

<!DOCTYPE html> 
<!--[if lt IE 7]>  <html xml:lang="en" lang="en" class="no-js pre-ie9"> <![endif]--> 
<!--[if IE 7]>   <html xml:lang="en" lang="en" class="no-js ie7 pre-ie9"> <![endif]--> 
<!--[if IE 8]>   <html xml:lang="en" lang="en" class="no-js ie8 pre-ie9"> <![endif]--> 
<!--[if IE 9]>   <html xml:lang="en" lang="en" class="no-js ie9"> <![endif]--> 
<!--[if gt IE 9]><!--> <html xml:lang="en" lang="en" class="no-js"> <!--<![endif]--> 

<head> 
    <!--GLOBAL META--> 
<!-- requestUrl=/content/sothebys/en/search-results.html?keyword=degas --> 
<title>Search Results | Sotheby's</title> 
<meta name="description" content="View auction details, art exhibitions and online catalogues; bid, buy and collect contemporary, impressionist or modern art, old masters, jewellery, wine, watches, prints, rugs and books at sotheby's auction house"> 
<meta name="keywords" content="auction, art, exhibition, online, catalogue, bid, buy, collect, contemporary, impressionist, modern, old mast...

...
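
Combining this answer with the original script, a sketch of the corrected request might look like the following. It sends only a browser-style `User-Agent` and drops the `Content-Type` header, which describes a request body and so has no role in a body-less GET; whether that header contributed to the 400 is not established here. `build_search_request` is a hypothetical helper that only assembles the arguments, so it can be checked without touching the network.

```python
def build_search_request(keyword, user_agent):
    """Return keyword arguments suitable for requests.get(**kwargs)."""
    if not user_agent.strip():
        raise ValueError("user_agent must not be empty")
    return {
        "url": "http://www.sothebys.com/en/search-results.html",
        "params": {"keyword": keyword},
        # Only the User-Agent is set; no Content-Type, since a GET sends no body.
        "headers": {"User-Agent": user_agent},
    }
```

With `ua` holding the full User-Agent string from this answer, the call would be `response = requests.get(**build_search_request("degas", ua))`.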

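You're right, Sam Chats, it works for me too. Sometimes lean isn't smart. I tried to keep the User-Agent short so the scraper would look tidy, but that turned out to be a mistake. One last thing: if I uncomment the lines in the scraper that fetch the lot numbers, why don't they return anything? Can you give me any idea? Thanks again. – SIM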

@Shahin Try using "//div.search-results-lot-number". –
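
The selector suggested in this comment mixes two syntaxes: `//` is XPath, while `div.search-results-lot-number` is a CSS selector. With lxml, an XPath form would be `//div[contains(@class, "search-results-lot-number")]`. Neither form will find anything if the lot numbers are absent from the HTML that `requests` receives, for example because the page fills them in with JavaScript; that is a possibility to check, not something confirmed here. The sketch below shows the class-matching idea on made-up sample markup, using the standard library's `html.parser` rather than lxml so that it runs without extra packages. `LotNumberParser` and the sample HTML are illustrative, not the site's real structure.

```python
from html.parser import HTMLParser


class LotNumberParser(HTMLParser):
    """Collect the text of every <div> whose class list contains the target class."""

    def __init__(self, target_class="search-results-lot-number"):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # greater than 0 while inside a matching <div>
        self.buffer = []
        self.lot_numbers = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested <div> inside a match
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.lot_numbers.append("".join(self.buffer).strip())
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


# Made-up markup for illustration only.
sample = """
<div class="search-results-lot-number">Lot 12</div>
<div class="other">ignored</div>
<div class="result search-results-lot-number"> Lot 13 </div>
"""
parser = LotNumberParser()
parser.feed(sample)
print(parser.lot_numbers)  # ['Lot 12', 'Lot 13']
```

Feeding `response.text` to the parser instead of `sample` and getting an empty list would indicate that the class is not present in the downloaded HTML at all.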
