How do I get specific data from an HTML file using Jsoup?

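Question:

I have the HTML file of a newspaper written in a local language, and I want to collect only the local-language words from that newspaper.

Looking at the HTML, I noticed that all the local-language words sit under div elements with the class field-content, so I selected those elements to get the data. However, the div also contains nested elements, and the local-language words are inside them: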

<div class = "field-content"></div>

So how do I get only the local-language text from the HTML file?

Website URL: http://www.andhrabhoomi.net/

My code:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://www.andhrabhoomi.net/").userAgent("Mozilla").get();
            String title = doc.title();

            System.out.println("title : " + title);

            // get all links
            //Elements links = doc.select("a[href]");

            Elements body = doc.select("div.field-content");

            for (Element link : body) {
                // prints the whole element, including its nested HTML tags
                System.out.println(link);

                // get the value from href attribute
                //System.out.println("\nlink : " + link.attr("href"));
                //System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            System.out.println("error\n");
        }
    }
}

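Answer

I'm not sure exactly what you're after here, but if my guess is right, this should help. If not, say so and we'll go from there.

You'll want to change your selection so that it grabs only the elements that have the class field-content, and then get rid of all the other HTML content. To do that, add text() to your print statement: System.out.println(link.text()); see the code below.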

Elements body = doc.getElementsByClass("field-content"); 

for(Element link : body) 
{ 
    System.out.println(link.text()); 
}
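
If the extracted text still mixes local-language words with English words or numbers, one possible follow-up is to filter the tokens by Unicode script. The sketch below assumes the local language is Telugu (inferred from the site name, not stated in the question) and that the input is the string returned by link.text(). The class and method names (TeluguWordFilter, isTelugu, teluguWords) are made up for illustration and use only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;

public class TeluguWordFilter {

    // True if the word is non-empty and every code point is in the Telugu script.
    // Assumption: Telugu is the target script; swap the constant for another language.
    static boolean isTelugu(String word) {
        if (word.isEmpty()) {
            return false;
        }
        return word.codePoints()
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.TELUGU);
    }

    // Splits the text on anything that is not a letter or combining mark,
    // then keeps only the tokens written entirely in Telugu.
    static List<String> teluguWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{M}]+")) {
            if (isTelugu(token)) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the source compiles regardless of file encoding.
        String sample = "Hello \u0C24\u0C46\u0C32\u0C41\u0C17\u0C41 world "
                + "\u0C35\u0C3E\u0C30\u0C4D\u0C24\u0C32\u0C41 2015";
        List<String> words = teluguWords(sample);
        System.out.println(words.size() + " Telugu words: " + words);
    }
}
```

Inside the loop above, this could be called as teluguWords(link.text()). Tokens containing zero-width joiners would be split at the joiner, so real newspaper text may need extra handling.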

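Thanks, it works – Labeo

So .text() here prints the data directly, skipping the elements, right? – Labeo

'.text()' gets the combined text of an element and all of its children; so in this case we select the 'div' and get all the text of the 'div' and of every child element inside it. So yes, it pretty much strips out all the tags. But if you are only after the text of the 'div' itself, you can use 'ownText()', though you may get a lot of whitespace that needs cleaning up. –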
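
To make the difference between text() and ownText() concrete, here is a minimal sketch that parses a small HTML string instead of the live site. The sample markup, the class name TextVsOwnText and the helper methods are assumptions for illustration; only Jsoup.parse, select, first, text and ownText come from Jsoup itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class TextVsOwnText {

    // Finds the first div.field-content in the given HTML.
    static Element firstFieldContent(String html) {
        return Jsoup.parse(html).select("div.field-content").first();
    }

    // Text of the div plus the text of all its descendants.
    static String combinedText(String html) {
        return firstFieldContent(html).text();
    }

    // Only the text nodes that are direct children of the div.
    static String directText(String html) {
        return firstFieldContent(html).ownText();
    }

    public static void main(String[] args) {
        String html = "<div class=\"field-content\">Hello <b>there</b> now!</div>";
        System.out.println(combinedText(html)); // Hello there now!
        System.out.println(directText(html));   // Hello now!
    }
}
```

Running this needs the Jsoup jar on the classpath. If the words you want live inside nested elements, as in the question, text() is the one that keeps them.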

Answer

The solution is:

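import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://www.andhrabhoomi.net/").userAgent("Mozilla").get();
            String title = doc.title();

            System.out.println("title : " + title);

            //get all links
            //Elements links = doc.select("a[href]");
            //Elements body = doc.select("div.field-content");

            // select only <a> elements that are direct children of div.field-content
            Elements body = doc.select("div[class=\"field-content\"] > a");

            for (Element link : body) {
                System.out.println("---------------------------------------------------------------------------------------------------------------");
                System.out.println(link);

                // get the src attribute of the first <img> inside the link (empty string if none)
                Elements img = link.select("img");
                System.out.print("\nSrc Img : " + img.attr("src"));

                // get the value from href attribute
                System.out.println("\nHref : " + link.attr("href"));
                //System.out.println("text : " + link.text());
            }
        } catch (Exception e) {
            System.out.println("error\n");
        }
    }
}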
