Filtering links

Question:

I'm working on a program that extracts the links from a web page and then filters them. This is my code so far, using HtmlAgilityPack:

HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load(txt_url.Text);
// "//a[@href]" selects every <a> element that has an href attribute
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    // Get the value of the HREF attribute
    string hrefValue = link.GetAttributeValue("href", string.Empty);
    listbox1.Items.Add(hrefValue);
}

This code extracts every link on the page. My question is how to filter these URLs by extension, for example ".html".

Answer

Use the WebClient.DownloadString method to get the HTML.

Then apply a regular-expression pattern to the string to capture all the URLs.
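A minimal sketch of this approach, not a definitive implementation. The names LinkExtractor, ExtractHrefs and ExtractHrefsFromUrl are hypothetical, and the pattern is an assumption that only handles quoted href values; real-world HTML can contain forms it will miss.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Matches href="..." or href='...' and captures the quoted value.
    // Simplification (assumption): unquoted attribute values are not matched.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*[\"'](?<url>[^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Returns every href value found in the given HTML string.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    // Downloads a page and returns its href values (performs a network request).
    public static List<string> ExtractHrefsFromUrl(string address)
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString(address);
            return ExtractHrefs(html);
        }
    }
}
```

Keeping the download and the pattern matching in separate methods means the matching can be tried on a literal HTML string without touching the network.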

Answer

First, you have to perform an HTTP GET request and read the HTML from the response body.

// Namespaces needed by the code below
using System.IO;
using System.Net;

// Send an HTTP GET request
ServicePointManager.Expect100Continue = false;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Proxy = null;
request.Method = "GET";

string html = "";

// Read the whole response body into a string
WebResponse response = request.GetResponse();
StreamReader sr = new StreamReader(response.GetResponseStream());
html = sr.ReadToEnd();
sr.Close();
response.Close();

Then you can parse the HTML with a regular expression to extract the files you need.
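For the filtering step the question asks about, here is a minimal sketch that assumes the href values have already been collected into a list. It uses a plain case-insensitive suffix comparison instead of a regular expression, and the names LinkFilter, HasExtension and FilterByExtension are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class LinkFilter
{
    // True when the path part of href ends with the given extension,
    // e.g. ".html". The comparison ignores case.
    public static bool HasExtension(string href, string extension)
    {
        // Drop the query string and fragment so "page.html?x=1" still matches.
        int cut = href.IndexOfAny(new[] { '?', '#' });
        string path = cut >= 0 ? href.Substring(0, cut) : href;
        return path.EndsWith(extension, StringComparison.OrdinalIgnoreCase);
    }

    // Returns only the hrefs whose path ends with the given extension.
    public static List<string> FilterByExtension(IEnumerable<string> hrefs, string extension)
    {
        var result = new List<string>();
        foreach (string href in hrefs)
        {
            if (HasExtension(href, extension))
            {
                result.Add(href);
            }
        }
        return result;
    }
}
```

With ".html" as the extension, this keeps values such as "index.html" and "docs/page.HTML?x=1" and drops "style.css" and "page.htm".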
