R Web Scraping Series, Part 3 | The HTTP Protocol
To scrape data from the web with R, we first have to give R the ability to communicate with servers and web services. The common language of network communication on the internet is HTTP (HyperText Transfer Protocol). So what exactly is this protocol?
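HTTP is an application-layer protocol for distributed, collaborative, hypermedia information systems. It is the standard for requests and responses between a client (the user) and a server (the website), normally carried over TCP: using a web browser, a web crawler, or some other tool, the client sends an HTTP request to a given port on the server (port 80 by default) to retrieve a network resource. In plain terms, HTTP is how a browser or a scraping tool asks for and receives a page's HTML.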
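In everyday life, when we sit at a computer and shop on Taobao in a browser, we almost never deal with HTTP directly. Creating and sending HTTP requests and handling the HTTP responses the server returns are all taken care of by the browser. Imagine having to hand-build a request like "use HTTP to send me the link for a certain product on www.taobao.com" every time you shop; it would be maddening. Maddening or not, that is exactly what we now need to do once in R, to see how R imitates a browser's network communication while scraping. To cover the variety of scraping tasks we will meet, we need to study the protocol that transfers files over the web more closely and learn to construct requests accurately.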
Visiting the NBA China homepage
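louwill will first show you how, when we visit the NBA China homepage, the browser builds a request for us over HTTP and how the server responds to it. In this example we first establish a connection to http://china.nba.com/ and ask the server to send index.html. The HTTP client first resolves the host name to an IP address and then opens a connection to the server on the default HTTP port (port 80). As an analogy, port 80 is the door of the house where the web server lives, and the HTTP client establishes the connection by knocking on that door. The request and response process is summarized below: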
Client session information:
About to connect() to china.nba.com port 80 (#0)
Trying 127.0.0.1:50049... connected
Connected to china.nba.com(127.0.0.1:50049) port 80 (#0)
Connected #0 to host china.nba.com left intact
Once the connection is established, the server waits for a request, and the browser sends the following HTTP request to the server:
GET /index.html HTTP/1.1
Host: china.nba.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Proxy-Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0
Then the server responds to the browser's request:
HTTP/1.1 200 OK
Date:Thu, 14 Sep 2017 06:34:35 GMT
Server:squid/3.5.20
Keep-Alive:timeout=38
Vary:Accept-Encoding
...
<!DOCTYPE html>
<html>
<head>
After all the data has been received, the browser closes the connection and the visit is complete.
closing connection #0
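The connect, request, respond, close sequence above can be reproduced by hand with a plain TCP socket. The following is only a rough sketch in base R: the function name fetch_raw and its arguments are made up for illustration, it assumes the server accepts plain HTTP on port 80, and it ignores details such as redirects, chunked bodies, and character encodings.

```r
# Hypothetical helper: send a minimal GET request over a raw TCP socket
# and return the response as a character vector of lines.
fetch_raw <- function(host, path = "/", port = 80, timeout = 10) {
  # Open a blocking TCP connection for reading and writing
  con <- socketConnection(host = host, port = port, open = "r+",
                          blocking = TRUE, timeout = timeout)
  on.exit(close(con))  # always close the connection when done

  # Start line, two headers, then an empty line that ends the header block
  request <- paste0(
    "GET ", path, " HTTP/1.1\r\n",
    "Host: ", host, "\r\n",
    "Connection: close\r\n",   # ask the server to close after responding
    "\r\n"
  )
  writeLines(request, con, sep = "")

  # Read until the server closes the connection
  readLines(con, warn = FALSE)
}

# Example call (needs network access, so it is left commented out):
# response <- fetch_raw("china.nba.com", "/index.html")
# head(response)
```

In day-to-day scraping a package such as RCurl does all of this for us; the sketch is only meant to show that nothing more than text over a TCP connection is involved.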
URL syntax
A URL is what we usually call a web address; the full name is Uniform Resource Locator. Although URLs are not part of HTTP, they are how we usually address resources when communicating over HTTP and other protocols. The general form of a URL is:
scheme://hostname:port/path?querystring#fragment
A concrete example from the NBA China site:
http://nbachina.qq.com/a/20170914/004815.htm
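scheme is the URL's scheme: it defines the protocol used between browser and server, and in the NBA example the scheme is http. It is followed by the host name (hostname) and the port number (port). The host name gives the name of the server that holds the resource we are interested in and uniquely identifies that server. The port defaults to 80 for http, which is why it is usually left out, as in the example above. Together, host name and port tell the browser which door to knock on to reach the requested resource. The path after the host name and port locates the requested resource on the server; like a file system path, it is split into segments with /.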
In addition, a URL often carries extra information after the path that helps the server handle more complex requests correctly: a query string of name=value pairs, introduced by ?, passes additional parameters, and a fragment, introduced by #, points to a specific part of the page.
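To make the pieces of scheme://hostname:port/path?querystring#fragment concrete, here is a small sketch that splits a URL with a regular expression in base R. The function name parse_url_sketch and the example.com URL are invented for illustration, and the pattern only covers the simple form shown above (no user names, passwords, or IPv6 hosts).

```r
# Hypothetical helper: split a URL into the components described above
parse_url_sketch <- function(url) {
  pattern <- paste0(
    "^([A-Za-z][A-Za-z0-9+.-]*)://",  # scheme
    "([^/:?#]+)",                     # hostname
    "(:([0-9]+))?",                   # optional :port
    "([^?#]*)",                       # path
    "(\\?([^#]*))?",                  # optional ?querystring
    "(#(.*))?$"                       # optional #fragment
  )
  m <- regmatches(url, regexec(pattern, url))[[1]]
  if (length(m) == 0) stop("not a URL of the form scheme://hostname/...")
  list(
    scheme      = m[2],
    hostname    = m[3],
    port        = if (nzchar(m[5])) as.integer(m[5]) else NA_integer_,
    path        = m[6],
    querystring = m[8],
    fragment    = m[10]
  )
}

# The article's example has no explicit port, query string, or fragment
str(parse_url_sketch("http://nbachina.qq.com/a/20170914/004815.htm"))

# A made-up URL that uses every component
str(parse_url_sketch("http://example.com:8080/players?name=Kobe#stats"))
```

A missing port comes back as NA here; a real client would then fall back to the scheme's default, which is 80 for http.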
Finally, URLs are encoded using the ASCII character set. Characters outside the 128 ASCII characters, as well as reserved special characters, must be escaped into a standard representation. URL encoding is also called percent-encoding, because every such escape starts with %. In R, the base functions URLencode() and URLdecode() encode and decode strings:
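char <- "Golden State Warriors is the NBA Champion in 2017"
encoded <- URLencode(char, reserved = TRUE)  # percent-encode spaces and reserved characters
encoded
[1] "Golden%20State%20Warriors%20is%20the%20NBA%20Champion%20in%202017"
URLdecode(encoded)  # decode the encoded string back to the original
[1] "Golden State Warriors is the NBA Champion in 2017"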
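Percent-encoding matters most when parameter values go into a query string, because characters such as & and = would otherwise be mistaken for separators. The sketch below assembles name=value pairs with URLencode(); the helper name build_query and the parameter values are invented for illustration.

```r
# Hypothetical helper: turn a named list into an encoded query string
build_query <- function(params) {
  keys <- vapply(names(params), URLencode, character(1), reserved = TRUE)
  values <- vapply(
    params,
    function(v) URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  # Join each pair with "=" and the pairs with "&"
  paste(paste0(keys, "=", values), collapse = "&")
}

query <- build_query(list(name = "Kobe Bryant", team = "LA Lakers", q = "a&b=c"))
query
# [1] "name=Kobe%20Bryant&team=LA%20Lakers&q=a%26b%3Dc"

# Decoding a single value gives back the original text
URLdecode("a%26b%3Dc")
# [1] "a&b=c"
```

Only the keys and values are encoded, not the = and & that join them, so the server can still tell the pairs apart.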
HTTP messages
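An HTTP message, whether request or response, consists of three parts: the start line, the headers (also called message headers), and the body. The start line is the first line of every HTTP message. In a request it gives the method, the path of the requested resource, and the highest HTTP version the browser can handle; in a response it gives the HTTP version and a status code. The headers that follow supply meta-information to browser and server as a set of name-value header fields. The body holds plain text or binary data, depending on the content-type header, which declares a MIME (Multipurpose Internet Mail Extensions) type telling the browser or server what kind of data is being transferred. The start line and each header end with a line break (CRLF), and an empty line separates the headers from the body.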
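The three-part structure can be seen by taking a message apart. The sketch below splits a raw response string into start line, headers, and body in base R. The function name parse_http_message is made up, the sample message reuses the header values from the NBA response shown earlier, and the parsing is deliberately naive (it assumes CRLF line endings and one header per line).

```r
# Hypothetical helper: split a raw HTTP message into its three parts
parse_http_message <- function(raw) {
  # An empty line (CRLF CRLF) separates the header block from the body
  parts <- strsplit(raw, "\r\n\r\n", fixed = TRUE)[[1]]
  head_lines <- strsplit(parts[1], "\r\n", fixed = TRUE)[[1]]
  body <- if (length(parts) > 1) paste(parts[-1], collapse = "\r\n\r\n") else ""

  # Every line after the start line is a "name: value" header field
  header_lines <- head_lines[-1]
  header_names <- trimws(sub(":.*$", "", header_lines))
  header_values <- trimws(sub("^[^:]*:", "", header_lines))

  list(
    start_line = head_lines[1],
    headers    = setNames(header_values, header_names),
    body       = body
  )
}

raw_response <- paste0(
  "HTTP/1.1 200 OK\r\n",
  "Date:Thu, 14 Sep 2017 06:34:35 GMT\r\n",
  "Server:squid/3.5.20\r\n",
  "Vary:Accept-Encoding\r\n",
  "\r\n",
  "<!DOCTYPE html>\n<html>\n<head>"
)

msg <- parse_http_message(raw_response)
msg$start_line          # "HTTP/1.1 200 OK"
msg$headers[["Server"]] # "squid/3.5.20"
cat(msg$body)
```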
Among request methods, GET and POST are the most common and are essential for scraping. Both request a resource from the server, but they differ in how they use the body: a GET request sends nothing in the request body, whereas POST uses the body to send data.
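The difference is easy to see when both messages are written out in full. The sketch below assembles a GET and a POST message as plain strings. The helper build_request, the host example.com, and the form fields are assumptions made for illustration; the Content-Type and Content-Length headers are what a typical form submission would carry.

```r
# Hypothetical helper: assemble a raw HTTP request message
build_request <- function(method, path, host, body = NULL) {
  lines <- c(
    paste(method, path, "HTTP/1.1"),  # start line
    paste0("Host: ", host)            # mandatory header in HTTP/1.1
  )
  if (!is.null(body)) {
    # Headers describing the body; only needed when one is sent
    lines <- c(
      lines,
      "Content-Type: application/x-www-form-urlencoded",
      paste0("Content-Length: ", nchar(body, type = "bytes"))
    )
  }
  # Each line ends with CRLF; an empty line separates headers from body
  paste0(
    paste(lines, collapse = "\r\n"), "\r\n\r\n",
    if (is.null(body)) "" else body
  )
}

get_msg <- build_request("GET", "/form.html", "example.com")
post_msg <- build_request("POST", "/greetings.html", "example.com",
                          body = "name=Kobe&age=39")

cat(get_msg)        # ends after the empty line: no body
cat(post_msg, "\n") # form data travels in the body
```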
A GET request looks like this:
GET /form.html HTTP/1.1(CRLF)
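In R, the RCurl package provides several high-level functions for performing a GET request to fetch a resource from a web server. The most commonly used is getURL(), which works out the host, port, and requested resource automatically. In practice we only need to pass the URL to this function, and HTML form parameters can also be specified manually: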
library(RCurl)
getURL("http://nbachina.qq.com/a/20170914/004815.htm")
[1] "<!DOCTYPE html><html lang=\"zh-CN\"><head>\n<meta content=\"text/html;
charset=gb2312\" http-equiv=\"Content-Type\">\n<meta charset=\"gb2312\">
\n<title>1tμ<c7><b6><d4>±è<bb>e<bc>yó<eb>2012à×<f6>aï¿¡o<b6><bc>·<c7>
,... <truncated>
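The unreadable title in the output above is most likely an encoding mismatch: the page declares charset=gb2312, while R is interpreting the bytes in a different encoding. This is an assumption based on the meta tag, not something verified against the live page. A minimal sketch of the usual fix is to convert the text with iconv(); the two bytes below are simply the GB2312 encoding of one Chinese character, used as stand-in data, and the "GB2312" encoding name is assumed to be supported by the local iconv.

```r
# Stand-in data: the two bytes that encode U+4E2D in GB2312
gb_bytes <- as.raw(c(0xD6, 0xD0))
gb_text <- rawToChar(gb_bytes)

# Convert from the page's declared encoding to UTF-8
utf8_text <- iconv(gb_text, from = "GB2312", to = "UTF-8")

# Check the result by looking at its Unicode code point
utf8ToInt(utf8_text)  # 20013, i.e. U+4E2D
```

The same iconv() call could be applied to the string returned by getURL(). getURL() also accepts an .encoding argument, which may be worth trying first.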
Once the request has completed, what happens next depends on the follow-up steps the scraping task requires.
A POST request looks like this:
POST /greetings.html HTTP/1.1
To perform a POST request in R there is no need to build it by hand; we can use the postForm() function:
url <- "http://www.r-datacollection.com/materials/http/POSTexample.php"
cat(postForm(url, name = "Kobe", age = 39, style = "post"))
Hello Kobe!
You are 39 years old.
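When filling the declared parameters into the form, remember to state explicitly, via the style argument, how the form data should be submitted. Common HTTP request methods are listed below:
Method | Description
GET | Retrieve a resource from the server
POST | Send data to the server in the message body, then retrieve a resource from the server
HEAD | Retrieve a resource from the server, but return only the start line and headers
PUT | Store the body of the request on the server
DELETE | Delete a resource from the server
TRACE | Trace the route the message takes to reach the server
OPTIONS | Return the list of supported HTTP methods
CONNECT | Establish a network connection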
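After the browser sends a request, the server has to respond, and it returns a status code in the start line of the response. You may not be sure what a status code is, but louwill bets everyone has seen a 404: that is the status code meaning the server could not find the resource.
The status code for a normal response is 200.
Common HTTP status codes are as follows: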
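1xx: Informational -- the request has been received and processing continues
2xx: Success -- the request has been successfully received, understood, and accepted
3xx: Redirection -- further action is needed to complete the request
4xx: Client error -- the request has a syntax error or cannot be fulfilled
5xx: Server error -- the server failed to fulfil a valid request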
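The ones seen most often: 200 means the resource was found successfully, 404 means the resource was not found, 500 means an internal server error, and 502 means a bad gateway.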
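Since the first digit alone determines the class, a scraper can decide how to react to a response from the status code's hundreds digit. A minimal sketch, with the helper name status_class invented for illustration:

```r
# Hypothetical helper: map status codes to the five classes listed above
status_class <- function(code) {
  classes <- c("Informational", "Success", "Redirection",
               "Client error", "Server error")
  idx <- code %/% 100  # hundreds digit selects the class
  # Clamp the index so out-of-range codes do not produce NA lookups
  ifelse(idx >= 1 & idx <= 5, classes[pmin(pmax(idx, 1), 5)], "Unknown")
}

status_class(c(200, 404, 500, 502))
# [1] "Success"      "Client error" "Server error" "Server error"
```

A scraper might, for example, retry later on a server error but skip the page on a client error.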
That is all louwill will cover on the HTTP basics needed for scraping. Deeper and more practical topics, such as using HTTP for identification, authentication, and proxies, a detailed look at the libcurl library, and RCurl's low-level functions, will follow in later parts of the R scraping series. Stay tuned~
References:
Automated Data Collection with R