Android Crawler (Part 1): Building a Web Crawler with OkHttp + Jsoup

Over the past few days I put together a simple crawler demo on Android.
The scraped data is displayed in a RecyclerView; this first article covers the data-scraping part.

The site I used for testing is the Chinese deals site SMZDM (什么值得买).

The goal is to scrape the featured articles on the home page: each article's title, image, and summary.


Here is the data the crawler pulled down (screenshot).
The project needs the Jsoup and OkHttp libraries. I downloaded the jars and added them to the project, but you can also declare the dependencies directly in the module's build.gradle:

implementation 'org.jsoup:jsoup:1.11.3'
implementation 'com.squareup.okhttp3:okhttp:3.4.1'
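One detail the setup above glosses over: network access on Android also requires the INTERNET permission, declared in AndroidManifest.xml, or every request will fail:

```xml
<!-- Inside the <manifest> element, alongside the <application> element -->
<uses-permission android:name="android.permission.INTERNET" />
```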

With that in place, we can start writing the crawler.

The entity class, Article.java:

/*
 *@Author:Swallow
 *@Date:2019/3/21
 * Encapsulates the data scraped for one article
 */
public class Article {
    private String title;
    private String author;
    private String imgUrl;
    private String context;
    private String articleUrl;
    private String date;
    private String from;

    // A few fields aren't populated yet, so the constructor only takes the four that are actually scraped
    public Article(String title, String author, String imgUrl, String context) {
        this.title = title;
        this.author = author;
        this.imgUrl = imgUrl;
        this.context = context;
    }


    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getAuthor() {
        return author;
    }

    public void setAuthor(String author) {
        this.author = author;
    }

    public String getImgUrl() {
        return imgUrl;
    }

    public void setImgUrl(String imgUrl) {
        this.imgUrl = imgUrl;
    }

    public String getContext() {
        return context;
    }

    public void setContext(String context) {
        this.context = context;
    }

    public String getArticleUrl() {
        return articleUrl;
    }

    public void setArticleUrl(String articleUrl) {
        this.articleUrl = articleUrl;
    }

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }

    public String getFrom() {
        return from;
    }

    public void setFrom(String from) {
        this.from = from;
    }

    @Override
    public String toString() {
        return "Article{" +
                "title='" + title + '\'' +
                ", author='" + author + '\'' +
                ", imgUrl='" + imgUrl + '\'' +
                ", context='" + context + '\'' +
                ", articleUrl='" + articleUrl + '\'' +
                ", date='" + date + '\'' +
                ", from='" + from + '\'' +
                '}';
    }
}

Requesting the page with OkHttp:


/*
 *@Author:Swallow
 *@Date:2019/3/7
 */
public class OkHttpUtils {
    public static String OkGetArt(String url) {
        String html = null;
        OkHttpClient client = new OkHttpClient();
        Request request = new Request.Builder()
                .url(url)
                .build();
        try (Response response = client.newCall(request).execute()) {
            html = response.body().string();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return html;
    }
}

The scraping class.
This is where Jsoup comes in: it parses the HTML of the downloaded page so we can extract data from specific tags.
Open the target page in a browser and view its source to see the page structure and which tags hold the data we want.
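To illustrate how Jsoup's select() and attr() calls work before diving into the real scraping code, here is a tiny standalone snippet run against a hardcoded HTML fragment (the class names mimic smzdm's markup, which may have changed since this was written):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDemo {
    public static void main(String[] args) {
        // A toy fragment mimicking the list structure the crawler targets
        String html = "<ul class=\"feed-list-hits\">"
                + "<li class=\"feed-row-wide\">"
                + "<h5 class=\"feed-block-title\">Hello Jsoup</h5>"
                + "<div class=\"z-feed-img\"><a><img src=\"http://example.com/a.png\"></a></div>"
                + "</li></ul>";
        Document doc = Jsoup.parse(html);
        // CSS-style selector: every <li> under the <ul class="feed-list-hits">
        Elements items = doc.select("ul.feed-list-hits li");
        for (Element li : items) {
            String title = li.select("h5.feed-block-title").text();    // text content
            String img = li.select("div.z-feed-img img").attr("src");  // attribute value
            System.out.println(title + " -> " + img);
        }
    }
}
```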


/*
 *@Author:Swallow
 *@Date:2019/3/21
 */
public class GetData {
    /**
     * Scrapes the featured articles from the smzdm home page
     * @param html the raw HTML of the page
     * @return ArrayList<Article> articles
     */
    public static ArrayList<Article> spiderArticle(String html){
        ArrayList<Article> articles = new ArrayList<>();

        Document document = Jsoup.parse(html);
        Elements elements = document
                .select("ul[class=feed-list-hits]")
                .select("li[class=feed-row-wide J_feed_za ]");
        for (Element element : elements) {
            String title = element
                    .select("h5[class=feed-block-title]")
                    .text();
            String author = element
                    .select("div[class=feed-block-info]")
                    .select("span")
                    .text();

            String imgUrl = element
                    .select("div[class=z-feed-img]")
                    .select("a")
                    .select("img")
                    .attr("src");
            String context = element
                    .select("div[class=feed-block-descripe]")
                    .text();
            String url = element
                    .select("div[class=feed-block z-hor-feed ]")
                    .select("a")
                    .attr("href");

            Article article = new Article(title, author, imgUrl, context);
            articles.add(article);
            Log.e("DATA>>", article.toString());
        }
        return articles;
    }
}

After that, just call these methods.
One thing to watch out for: on Android, network requests must run off the main thread, so the call has to be wrapped in a new worker thread.

final String url = "https://www.smzdm.com/";
new Thread() {
    public void run() {
        String html = OkHttpUtils.OkGetArt(url);
        ArrayList<Article> articles = GetData.spiderArticle(html);
        // Send the result to the handler to update the UI
        Message message = handler.obtainMessage();
        message.what = 1;
        message.obj = articles;
        handler.sendMessage(message);
    }
}.start();
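The receiving side isn't shown above; a minimal sketch of what the Handler on the main thread might look like (the adapter field, its setData method, and the what-code 1 are assumptions for illustration, not from the original project):

```java
// Sketch only: assumes an ArticleAdapter backing the RecyclerView;
// what-code 1 matches the message sent from the worker thread above.
private final Handler handler = new Handler(Looper.getMainLooper()) {
    @Override
    public void handleMessage(Message msg) {
        if (msg.what == 1) {
            @SuppressWarnings("unchecked")
            ArrayList<Article> articles = (ArrayList<Article>) msg.obj;
            adapter.setData(articles);        // hypothetical adapter method
            adapter.notifyDataSetChanged();   // refresh the RecyclerView
        }
    }
};
```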