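Parsing HTML with BeautifulSoup in Python

Question:

I want to parse HTML with BeautifulSoup in Python, but I can't manage to get what I need.

This is a small module of a personal application I'm building. It includes a web login step with credentials, and once the script has logged in, I need to parse some information so that I can manage and process it.

The HTML I get after logging in is: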

<div class="widget_title clearfix">
     <h2>Account Balance</h2>
    </div>
    <div class="widget_body">
     <div class="widget_content">
      <table class="simple">
       <tr>
        <td><a href="#" id="west1" title="Total earned daily">Daily Earnings</a></td>
        <td style="text-align: right; width: 125px; color: #119911; font-weight: bold;">
         150
        </td>
       </tr>
       <tr>
        <td><a href="#" id="west2" title="Total weekly earnings">Weekly Earnings</a></td>
        <td style="text-align: right; border-bottom: 1px solid #000; color: #119911; font-weight: bold;">
         500      </td>
       </tr>
       <tr>
        <td><a href="#" id="west3" title="Total Monthly earnings">Monthly Earnings</a></td>
        <td style="text-align: right; color: #119911; font-weight: bold;">
         1500      </td>
       </tr>
       <tr>
        <td><a href="#" id="west4" title="Total expenses">Total expended</a></td>
        <td style="text-align: right; border-bottom: 1px solid #000; color: #880000; font-weight: bold;">
         430      </td>
       </tr>
       <tr>
        <td><a href="#" id="west5" title="Total available">Account Balance</a></td>
        <td style="text-align: right; border-bottom: 3px double #000; color: #119911; font-weight: bold;">
         840      </td>
       </tr>
       <tr>
        <td></td>
        <td style="padding: 5px;">
         <center>
          <form id="request_bill" method="POST" action="index.php?page=dashboard">
           <input type="hidden" name="secret_token" value="" />
           <input type="hidden" name="request_payout" value="1" />
           <input type="submit" class="btn blue large" value="Request Payout" />
          </form>
         </center>
        </td>
       </tr>
      </table>
     </div>
    </div>
</div>

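As you can see, the HTML is not well formatted, but I need to extract the elements and their values, for example: "Daily Earnings" and "150", "Weekly Earnings" and "500", and so on.

I thought the "id" attribute might help, but when I try to parse it, it crashes.

The Python code I'm working with is: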

def parseo(archivohtml): 
    html = archivohtml 
    parsed_html = BeautifulSoup(html) 
    par = parsed_html.find('td', attrs={'id':'west1'}).string 
    print par

where archivohtml is the HTML file saved after logging in to the website.

When I run the script, I only get errors.

I also tried this:

def parseo(archivohtml): 
    soup = BeautifulSoup() 
    html = archivohtml 
    parsed_html = soup(html) 
    par = soup.parsed_html.find('td', attrs={'id':'west1'}).string 
    print par

But the result is still the same.

What errors? – 2013-03-22 17:44:40

What do you mean by "it crashes"? Does it print an exception with a traceback and then exit? If so, please show us the exception and the traceback (and, of course, the code the traceback refers to). – abarnert 2013-03-22 18:01:09

File "C:\py\projectparse\logparse.py", line 53, in parseo par = parsed_html.find('td', attrs={'id':'west1'}).string AttributeError: 'NoneType' object has no attribute 'string' – dexafree 2013-03-22 18:51:05

Answer

The tag with id="west1" is an <a> tag, not a <td>. What you are looking for is the <td> tag that comes after this <a> tag:

import BeautifulSoup as bs 

content = '''<div class="widget_title clearfix"> 
     <h2>Account Balance</h2> 
    </div> 
    <div class="widget_body"> 
     <div class="widget_content"> 
      <table class="simple"> 
       <tr> 
        <td><a href="#" id="west1" title="Total earned daily">Daily Earnings</a></td> 
        <td style="text-align: right; width: 125px; color: #119911; font-weight: bold;"> 
         150       
        </td> 
       </tr> 
       <tr> 
        <td><a href="#" id="west2" title="Total weekly earnings">Weekly Earnings</a></td> 
        <td style="text-align: right; border-bottom: 1px solid #000; color: #119911; font-weight: bold;"> 
         500      </td> 
       </tr> 
       <tr> 
        <td><a href="#" id="west3" title="Total Monthly earnings">Monthly Earnings</a></td> 
        <td style="text-align: right; color: #119911; font-weight: bold;"> 
         1500      </td> 
       </tr> 
       <tr> 
        <td><a href="#" id="west4" title="Total expenses">Total expended</a></td> 
        <td style="text-align: right; border-bottom: 1px solid #000; color: #880000; font-weight: bold;"> 
         430      </td> 
       </tr> 
       <tr> 
        <td><a href="#" id="west5" title="Total available">Account Balance</a></td> 
        <td style="text-align: right; border-bottom: 3px double #000; color: #119911; font-weight: bold;"> 
         840      </td> 
       </tr> 
       <tr> 
        <td></td> 
        <td style="padding: 5px;"> 
         <center> 
          <form id="request_bill" method="POST" action="index.php?page=dashboard"> 
           <input type="hidden" name="secret_token" value="" /> 
           <input type="hidden" name="request_payout" value="1" /> 
           <input type="submit" class="btn blue large" value="Request Payout" /> 
          </form> 
         </center> 
        </td> 
       </tr> 
      </table> 
     </div> 
    </div> 
</div>''' 

def parseo(archivohtml):
    html = archivohtml
    parsed_html = bs.BeautifulSoup(html)
    # The id is on the <a> tag; the value sits in the next <td> after it.
    par = parsed_html.find('a', attrs={'id':'west1'}).findNext('td')
    print(par.string.strip())

parseo(content)

This prints:

150
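The same lookup can be repeated for every row to collect all the label/value pairs the question asks for. The following is only a minimal sketch, assuming BeautifulSoup 4 (the bs4 package) on Python 3 rather than the BeautifulSoup 3 import used in the answer; the function name extract_pairs and the trimmed two-row sample are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed sample of the table from the question (illustration only).
SAMPLE_HTML = """
<table class="simple">
 <tr>
  <td><a href="#" id="west1" title="Total earned daily">Daily Earnings</a></td>
  <td style="text-align: right;">
   150
  </td>
 </tr>
 <tr>
  <td><a href="#" id="west2" title="Total weekly earnings">Weekly Earnings</a></td>
  <td style="text-align: right;">
   500  </td>
 </tr>
 <tr>
  <td></td>
  <td><form id="request_bill"><input type="submit" value="Request Payout" /></form></td>
 </tr>
</table>
"""


def extract_pairs(html):
    """Return (label, value) tuples for every row that contains an <a id=...>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.find_all("tr"):
        link = row.find("a", id=True)      # rows without a labelled link are skipped
        if link is None:
            continue
        value_cell = link.find_next("td")  # first <td> that starts after the link
        if value_cell is None:
            continue
        pairs.append((link.get_text(strip=True), value_cell.get_text(strip=True)))
    return pairs


for label, value in extract_pairs(SAMPLE_HTML):
    print(label, value)
```

Skipping rows where find() returns None keeps the last row, which holds only the payout form, from raising the AttributeError seen in the question.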

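Thank you very much for the quick answer! I tried your code, but I have a problem with the bs.BeautifulSoup(html) expression... Where should I declare bs? My import is "from BeautifulSoup import BeautifulSoup". Do I have to add bs = BeautifulSoup() at the start? I also saw that BeautifulSoup can be imported with "import BeautifulSoup as bs", but it still doesn't work; I get "AttributeError: 'NoneType' object has no attribute 'findNext'". I don't know what I'm doing wrong! – dexafree 2013-03-22 18:54:24

I've added runnable code. Hope that helps. – unutbu 2013-03-22 20:33:39

Thank you very much! Now the code runs and shows what it is expected to show, and I've managed to see what the error was. The problem was that I was also saving a .html file with all the content, to check that the whole process was going well, and I wasn't applying BS to the HTML code itself; I was applying it to the .html file. Now it works perfectly and I can keep working! Thank you very much :) – dexafree 2013-03-23 00:29:28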
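Both AttributeError messages quoted in this thread share one root cause: find() returns None when nothing matches, and the chained .string or .findNext access then fails on None. Below is a minimal sketch of checking for that case, assuming BeautifulSoup 4 (the bs4 package) on Python 3; the helper name cell_text_after and the one-row sample are hypothetical.

```python
from bs4 import BeautifulSoup

# One-row sample modelled on the question's table (illustration only).
html = ('<table><tr><td><a href="#" id="west1">Daily Earnings</a></td>'
        '<td> 150 </td></tr></table>')
soup = BeautifulSoup(html, "html.parser")

# No <td> carries id="west1", so find() returns None instead of raising.
missing = soup.find("td", attrs={"id": "west1"})

# The id lives on the <a> tag, so this lookup succeeds.
link = soup.find("a", attrs={"id": "west1"})


def cell_text_after(tag):
    """Hypothetical helper: text of the next <td>, or None if nothing was found."""
    if tag is None:
        return None
    cell = tag.find_next("td")
    return cell.get_text(strip=True) if cell is not None else None


print(missing)                   # None
print(cell_text_after(link))     # 150
print(cell_text_after(missing))  # None
```

Printing the result of find() before chaining further calls is a quick way to tell whether the selector is wrong or the input is not the HTML you expected.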

Answer

I can't tell from your question whether this would work for you, but here is another approach:

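from bs4 import BeautifulSoup  # stripped_strings requires BeautifulSoup 4

def parseo(archivohtml):
    html = archivohtml
    parsed_html = BeautifulSoup(html)
    # stripped_strings yields each text node with surrounding whitespace removed
    for line in parsed_html.stripped_strings:
        print(line.strip())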

which produces:

Account Balance 
Daily Earnings 
150 
Weekly Earnings 
500 
Monthly Earnings 
1500 
Total expended 
430 
Account Balance 
840

If you want the data in a list:

data = [line.strip() for line in parsed_html.stripped_strings]

[u'Account Balance', u'Daily Earnings', u'150', u'Weekly Earnings', u'500', u'Monthly Earnings', u'1500', u'Total expended', u'430', u'Account Balance', u'840']
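A flat list like this can be turned into label/value pairs by slicing. This is only a sketch under an assumption that holds for the list shown here but is not guaranteed for other pages: the first string is the widget heading and the remaining strings alternate label, value. The name earnings is made up for illustration.

```python
# The list printed above, written as plain Python 3 strings.
data = ['Account Balance', 'Daily Earnings', '150', 'Weekly Earnings', '500',
        'Monthly Earnings', '1500', 'Total expended', '430',
        'Account Balance', '840']

# Assumption: data[0] is the heading; after it, labels and values alternate.
labels = data[1::2]
values = data[2::2]
earnings = {label: int(value) for label, value in zip(labels, values)}

print(earnings["Daily Earnings"])   # 150
print(earnings["Account Balance"])  # 840
```

If the page layout changes, this positional pairing breaks silently, so looking values up through the id attributes is the more robust option.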

Thank you very much! The code works now, and this way of storing and formatting the information is much better than the one I was using! – dexafree 2013-03-23 11:21:40
