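Parsing an HTML table with Nokogiri and Mechanize

Question:

With the following code I am trying to scrape call logs from our phone provider's web application so that I can load the information into my Ruby on Rails application.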

desc "Import incoming calls" 
task :fetch_incomingcalls => :environment do 

    # Logs into manage.phoneprovider.co.uk and retrieves the list of incoming calls.
    require 'rubygems' 
    require 'mechanize' 
    require 'logger' 

    # Create a new mechanize object 
    agent = Mechanize.new { |a| a.log = Logger.new(STDERR) } 

    # Load the Phone Provider website 
    page = agent.get("https://manage.phoneprovider.co.uk/login") 

    # Select the first form 
    form = agent.page.forms.first 
    form.username = 'username'
    form.password = 'password'

    # Submit the form 
    page = form.submit form.buttons.first 

    # Click on link called Call Logs 
    page = agent.page.link_with(:text => "Call Logs").click 

    # Click on link called Incoming Calls 
    page = agent.page.link_with(:text => "Incoming Calls").click 

    # Prints out table rows (page.search runs the selector against the parsed page)
    # puts page.search('table tr')

    # Print out the body as a test 
    # puts page.body 

end 

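As you can see from the last five lines, I have tested that 'puts page.body' works, so the code above is functional. It logs in successfully, navigates to Call Logs and then to Incoming Calls. The incoming calls table looks like this: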

| Timestamp | Source | Destination | Duration | 
| 03 Jan 13:40 | 12345678 | 12345679  | 00:01:01 |  
| 03 Jan 13:40 | 12345678 | 12345679  | 00:01:01 |  
| 03 Jan 13:40 | 12345678 | 12345679  | 00:01:01 |  
| 03 Jan 13:40 | 12345678 | 12345679  | 00:01:01 |  

It is generated from the following markup:

<thead> 
<tr> 
<td>Timestamp</td> 
<td>Source</td> 
<td>Destination</td> 
<td>Duration</td> 
<td>Cost</td> 
<td class='centre'>Recording</td> 
</tr> 
</thead> 
<tbody> 
<tr class='o'> 
<tr> 
<td>03 Jan 13:40</td> 
<td>12345678</td> 
<td>12345679</td> 
<td>00:01:14</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr> 
<tr class='e'> 
<tr> 
<td>30 Dec 20:31</td> 
<td>12345678</td> 
<td>12345679</td> 
<td>00:02:52</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr> 
<tr class='o'> 
<tr> 
<td>24 Dec 00:03</td> 
<td>12345678</td> 
<td>12345679</td> 
<td>00:00:09</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr> 
<tr class='e'> 
<tr> 
<td>23 Dec 14:56</td> 
<td>12345678</td> 
<td>12345679</td> 
<td>00:00:07</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr> 
<tr class='o'> 
<tr> 
<td>21 Dec 13:26</td> 
<td>07793770851</td> 
<td>12345679</td> 
<td>00:00:26</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr> 

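I would like to find out how to select the cells I want (Timestamp, Source, Destination and Duration) and output them. After that I can worry about writing them to the database rather than the terminal.

I have tried using Selector Gadget, but it only shows 'td', or 'tr:nth-child(6) td, tr:nth-child(2) td' if I select several cells.

Any help or pointers would be appreciated!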

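There is a pattern in the table that is easy to exploit with XPath. The <tr> tags of the rows that hold the required information have no class attribute. Fortunately, XPath provides some simple logical operations, including not(), which gives us exactly the functionality we need.

Once we have reduced the set of rows we are dealing with, we can iterate over them and extract the text of the necessary columns using XPath's element[n] selector. One important note here is that XPath counts elements starting at 1, so the first column of a table row is td[1].

Example code, using Nokogiri (and RSpec for the tests):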

require "rspec" 
require "nokogiri" 

HTML = <<HTML 
<table> 
    <thead> 
    <tr> 
     <td> 
     Timestamp 
     </td> 
     <td> 
     Source 
     </td> 
     <td> 
     Destination 
     </td> 
     <td> 
     Duration 
     </td> 
     <td> 
     Cost 
     </td> 
     <td class='centre'> 
     Recording 
     </td> 
    </tr> 
    </thead> 
    <tbody> 
    <tr class='o'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     03 Jan 13:40 
     </td> 
     <td> 
     12345678 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:01:14 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    <tr class='e'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     30 Dec 20:31 
     </td> 
     <td> 
     12345678 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:02:52 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    <tr class='o'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     24 Dec 00:03 
     </td> 
     <td> 
     12345678 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:00:09 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    <tr class='e'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     23 Dec 14:56 
     </td> 
     <td> 
     12345678 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:00:07 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    <tr class='o'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     21 Dec 13:26 
     </td> 
     <td> 
     07793770851 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:00:26 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    </tbody> 
</table> 
HTML 

class TableExtractor
    # Returns one hash per data row, i.e. per <tr> without a class attribute.
    def extract_data(html)
        Nokogiri::HTML(html).xpath("//table/tbody/tr[not(@class)]").collect do |row|
            # XPath positions are 1-based: td[1] is the first cell of the row.
            timestamp   = row.at_xpath("td[1]").text.strip
            source      = row.at_xpath("td[2]").text.strip
            destination = row.at_xpath("td[3]").text.strip
            duration    = row.at_xpath("td[4]").text.strip
            {:timestamp => timestamp, :source => source, :destination => destination, :duration => duration}
        end
    end
end

describe TableExtractor do
    before(:all) do
        @html = HTML
    end

    it "should extract the timestamp properly" do
        expect(subject.extract_data(@html)[0][:timestamp]).to eq "03 Jan 13:40"
    end

    it "should extract the source properly" do
        expect(subject.extract_data(@html)[0][:source]).to eq "12345678"
    end

    it "should extract the destination properly" do
        expect(subject.extract_data(@html)[0][:destination]).to eq "12345679"
    end

    it "should extract the duration properly" do
        expect(subject.extract_data(@html)[0][:duration]).to eq "00:01:14"
    end

    it "should extract all informational rows" do
        expect(subject.extract_data(@html).count).to eq 5
    end
end

I'm not sure how to apply this to the code I already have. If you look at the following you can see what I had in mind: https://gist.github.com/1574942 – dannymcc 2012-01-07 14:53:10

I didn't notice your reply until now. I have [forked your gist and added some code](https://gist.github.com/1592493). I have also answered your other question on this topic. – 2012-01-11 02:03:04

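Using XPath selectors, you should be able to reach the exact node you need, starting from the root in the worst case. Using XPath with Nokogiri is described here

For details on how to access all elements with XPath, see here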

Is the documentation you linked to only for parsing XML data, or can it be used with HTML web pages as well? – dannymcc 2012-01-06 17:59:43

Yes. Check this one too: nokogiri.org/tutorials/searching_a_xml_html_document.html – jake 2012-01-06 19:50:18