Submitting historical resource records to Baidu Xiongzhanghao (熊掌号) with Java
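I recently had a requirement to submit a large number of historical URLs to the Baidu Xiongzhanghao resource search platform. Xiongzhanghao does provide a manual submission tool, but it is slow and tedious, and clearly inefficient once there are many URLs to submit, so submitting through the API it provides is the better choice.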
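I. Get a Baidu Xiongzhanghao account (search Baidu for how to apply for one yourself)
II. See the figure above: it is the official API description (you need to log in to your own account to see it). With that, it is already mostly clear how to submit data in batches, but a few points are worth spelling out: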
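1) For batch submission, set the type parameter in the request URL to batch.
2) A single request can carry at most 2000 URLs; beyond that the API returns an error saying the submission limit was exceeded.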
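The two points above can be sketched as a small helper that splits a URL list into groups of at most 2000 and turns each group into a request body with one URL per line. This is only an illustration: the class name UrlBatcher and its methods are hypothetical and do not appear in the original code, and the request URL itself (with type=batch) should be copied from the API page of your own account.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the original post. */
final class UrlBatcher {
    /** Per-request limit mentioned in point 2) above. */
    static final int MAX_URLS_PER_REQUEST = 2000;

    private UrlBatcher() {
    }

    /** Splits urls into consecutive groups of at most batchSize entries. */
    static List<List<String>> partition(List<String> urls, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < urls.size(); start += batchSize) {
            int end = Math.min(start + batchSize, urls.size());
            // Copy the view so each batch is independent of the source list
            batches.add(new ArrayList<>(urls.subList(start, end)));
        }
        return batches;
    }

    /** Builds a plain-text request body with one URL per line. */
    static String toRequestBody(List<String> batch) {
        return String.join("\n", batch);
    }
}
```

With 4500 URLs, partition(urls, UrlBatcher.MAX_URLS_PER_REQUEST) yields three groups of 2000, 2000 and 500 entries, and each group can then be sent as one POST body.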
III. Code implementation
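import org.apache.commons.io.FileUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.StatusLine;
import org.apache.http.client.HttpResponseException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;
// Also used by the code below but missing from the original list
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.List;
// The next three are assumptions about which libraries the author used
import com.alibaba.fastjson.JSONObject;
import org.apache.commons.collections.CollectionUtils;
import org.slf4j.Logger;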
// Create the client, with automatic retries disabled
private static final CloseableHttpClient client = HttpClientBuilder.create().disableAutomaticRetries().build();
// URL, logger, maxPushSize and successSize are fields of the enclosing class;
// the post does not show their declarations.
private void urlsPush(List<String> urlList, String type) {
    if (CollectionUtils.isEmpty(urlList)) {
        return;
    }
    // A single request may carry at most 2000 URLs, so push in groups
    final int batchSize = 2000;
    for (int start = 0; start < urlList.size(); start += batchSize) {
        int end = Math.min(start + batchSize, urlList.size());
        List<String> subList = urlList.subList(start, end);
        // Request body: one URL per line
        StringBuilder sb = new StringBuilder();
        for (String url : subList) {
            sb.append(url).append("\r\n");
        }
        String params = sb.toString();
        // Decide whether to push or save to a file: once the remaining quota
        // is no larger than one batch, stop pushing and save instead
        if (maxPushSize <= batchSize) {
            saveAsFile(params, type);
            continue;
        }
        HttpPost request = new HttpPost(URL);
        request.setHeader("content-type", "text/plain");
        // Fixed charset instead of the platform default
        request.setEntity(new StringEntity(params, Charset.forName("UTF-8")));
        String respStr;
        logger.info("Pushing data: {} URLs in this request, content type: {}", subList.size(), type);
        long singlePushStart = System.currentTimeMillis();
        // try-with-resources closes the response so the connection is released
        try (CloseableHttpResponse response = client.execute(request)) {
            logger.info("Single push finished in {}ms", System.currentTimeMillis() - singlePushStart);
            HttpEntity responseEntity = response.getEntity();
            if (response.getStatusLine().getStatusCode() != 200 || responseEntity == null) {
                logger.info("Unexpected response from the server");
                saveAsFile(params, type);
                continue;
            }
            respStr = EntityUtils.toString(responseEntity);
        } catch (Exception e) {
            logger.error("Failed to push data", e);
            saveAsFile(params, type);
            continue;
        }
        // An empty body cannot be counted as a success, so keep the URLs
        if (StringUtils.isBlank(respStr)) {
            saveAsFile(params, type);
            continue;
        }
        try {
            PushUrlsResponse result = JSONObject.parseObject(respStr, PushUrlsResponse.class);
            this.maxPushSize = result.getRemain_batch();
            this.successSize += result.getSuccess_batch();
        } catch (Exception e) {
            logger.info("Failed to parse the response, body: {}", respStr);
            saveAsFile(params, type);
        }
    }
}
/**
 * Response returned by the URL submission API
 */
private static class PushUrlsResponse {
    /**
     * Number of URLs submitted successfully
     */
    private int success_batch;
    /**
     * Remaining number of URLs that can still be submitted
     */
    private int remain_batch;

    public int getSuccess_batch() {
        return success_batch;
    }

    public void setSuccess_batch(int success_batch) {
        this.success_batch = success_batch;
    }

    public int getRemain_batch() {
        return remain_batch;
    }

    public void setRemain_batch(int remain_batch) {
        this.remain_batch = remain_batch;
    }
}
In my case, data that was not submitted successfully is written to a file and kept, which is why the code calls a save method:
saveAsFile
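The params argument is the URL content being submitted, and type marks the kind of content in the file (my URLs fall into quite a few categories, so I need a type to label what each file contains).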
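The body of saveAsFile is not shown in the post. As a rough sketch of what such a method could look like, the example below appends the failed URLs to one file per type. Everything here is an assumption made for illustration: the class name FailedUrlStore, the target directory and the type + ".txt" naming are hypothetical, and it uses java.nio.file rather than the commons-io FileUtils that the original imports suggest the author relied on.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical stand-in for the author's saveAsFile; not the original code. */
final class FailedUrlStore {
    private final Path directory;

    FailedUrlStore(Path directory) {
        this.directory = directory;
    }

    /**
     * Appends params to <directory>/<type>.txt, creating the directory and
     * the file if needed, and returns the path that was written.
     */
    Path saveAsFile(String params, String type) {
        try {
            Files.createDirectories(directory);
            Path target = directory.resolve(type + ".txt");
            // APPEND keeps earlier failed batches of the same type
            Files.write(target, params.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            return target;
        } catch (IOException e) {
            // Unchecked, so callers can invoke it without a try/catch
            throw new UncheckedIOException(e);
        }
    }
}
```

Appending rather than overwriting means several failed batches of the same type end up in a single file, which can later be fed back into the push method once the quota is available again.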