
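Project preview

[Image (1): screenshot of the finished project]

1. Overview

A Jsoup crawler scrapes product information from the JD mall and stores it in Elasticsearch. The application then serves full-text search over HTTP requests and highlights the matched keyword in the results.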

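2. Tools

Jsoup: jsoup is an HTML parser for Java. It can parse a URL or a string of HTML directly, and it offers a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

HttpClient: HttpClient started as a subproject of Apache Jakarta Commons and now lives in Apache HttpComponents. It is an efficient, feature-rich client-side toolkit for the HTTP protocol that tracks the latest HTTP versions and recommendations. (The crawler in section 3.5 sends its request with the JDK's HttpURLConnection.)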

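3. Steps

3.1 Create a Spring Boot project

[Image (2): creating the Spring Boot project]

3.2 Select the required integrations

[Image (3): selecting the integrations]

3.3 Add the dependencies the project needs (watch out for conflicts between the Spring Boot version and the ES version)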

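Version compatibility:

    Spring Data Release Train | Spring Data Elasticsearch | Elasticsearch | Spring Framework | Spring Boot
    --------------------------+---------------------------+---------------+------------------+------------
    2021.2 (Raj)              | 4.4.x                     | 7.17.4        | 5.3.x            | 2.7.x
    2021.1 (Q)                | 4.3.x                     | 7.15.2        | 5.3.x            | 2.6.x
    2021.0 (Pascal)           | 4.2.x [1]                 | 7.12.0        | 5.3.x            | 2.5.x
    2020.0 (Ockham) [1]       | 4.1.x [1]                 | 7.9.3         | 5.3.2            | 2.4.x
    Neumann [1]               | 4.0.x [1]                 | 7.6.2         | 5.2.12           | 2.3.x
    Moore [1]                 | 3.2.x [1]                 | 6.8.12        | 5.2.12           | 2.2.x
    Lovelace [1]              | 3.1.x [1]                 | 6.2.2         | 5.1.19           | 2.1.x
    Kay [1]                   | 3.0.x [1]                 | 5.5.0         | 5.0.13           | 2.0.x
    Ingalls [1]               | 2.1.x [1]                 | 2.4.0         | 4.3.25           | 1.5.x

    [1] Marked as out of maintenance in the Spring Data Elasticsearch reference documentation.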
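The compatibility table can also be kept as a small lookup, for example to log a warning when the Spring Boot version and the Elasticsearch server version drift apart. The following is only a sketch using the JDK: the class name `EsVersionMatrix` and the method `expectedEsVersion` are made up for illustration, and the map simply copies the Spring Boot and Elasticsearch columns from the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: maps a Spring Boot version line to the Elasticsearch
// version listed for it in the compatibility table.
class EsVersionMatrix {

    private static final Map<String, String> BOOT_TO_ES = new LinkedHashMap<>();

    static {
        BOOT_TO_ES.put("2.7", "7.17.4");
        BOOT_TO_ES.put("2.6", "7.15.2");
        BOOT_TO_ES.put("2.5", "7.12.0");
        BOOT_TO_ES.put("2.4", "7.9.3");
        BOOT_TO_ES.put("2.3", "7.6.2");
        BOOT_TO_ES.put("2.2", "6.8.12");
        BOOT_TO_ES.put("2.1", "6.2.2");
        BOOT_TO_ES.put("2.0", "5.5.0");
        BOOT_TO_ES.put("1.5", "2.4.0");
    }

    // Accepts a full version such as "2.7.1" and looks up its "major.minor" line.
    static Optional<String> expectedEsVersion(String bootVersion) {
        String[] parts = bootVersion.split("\\.");
        if (parts.length < 2) {
            return Optional.empty();
        }
        return Optional.ofNullable(BOOT_TO_ES.get(parts[0] + "." + parts[1]));
    }

    public static void main(String[] args) {
        System.out.println(expectedEsVersion("2.7.1").orElse("not in table"));
    }
}
```

A version line that is missing from the table yields an empty `Optional`, so the caller decides whether that is an error or just a warning.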

Maven dependencies to add:

    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.75</version>
    </dependency>
    <!-- Web page parsing: jsoup (for video parsing, see tika) -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.13.1</version>
    </dependency>
    <dependency>
        <groupId>cn.hutool</groupId>
        <artifactId>hutool-all</artifactId>
        <version>5.4.6</version>
    </dependency>
    <!-- HttpClient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
    </dependency>

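3.4 Write the ES client configuration class ElasticSearchClientConfig (so that Spring manages the client)

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class ElasticSearchClientConfig {

        // Register one RestHighLevelClient bean so it can be injected wherever it is needed.
        @Bean
        public RestHighLevelClient restHighLevelClient() {
            return new RestHighLevelClient(
                    RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")));
        }
    }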

3.5 Write the crawler utility class HtmlParseUtil

    import cn.hutool.core.net.url.UrlBuilder;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;

    // HTML parsing utility
    public class HtmlParseUtil {

        public static void main(String[] args) throws IOException {
            // "洗衣机" is the search keyword "washing machine"
            List<Content> list = HtmlParseUtil.parseJDSearchKeyByPage("洗衣机", 2);
            System.out.println(list.size());
        }

        // Crawl result pages 1..page for the keyword and merge the items into one list.
        public static List<Content> parseJDSearchKeyByPage(String key, int page) throws IOException {
            List<Content> list = new ArrayList<>();
            for (int i = 1; i <= page; i++) {
                list.addAll(HtmlParseUtil.parseJDSearchKey(key, i));
            }
            return list;
        }

        public static List<Content> parseJDSearchKey(String key, int page) throws IOException {
            // Assemble the URL path and query parameters
            String url = UrlBuilder.create()
                    .setScheme("https")
                    .setHost("search.jd.com")
                    .addPath("Search")
                    .addQuery("keyword", key)
                    .addQuery("enc", "utf-8")
                    .addQuery("page", String.valueOf(2 * page - 1)) // logical page n is sent as page=2n-1
                    .build();
            HttpURLConnection httpConn = (HttpURLConnection) new URL(url).openConnection();
            httpConn.setRequestMethod("GET");

            // Send browser-like headers so that JD's anti-crawler checks do not reject the request
            httpConn.setRequestProperty("authority", "search.jd.com");
            httpConn.setRequestProperty("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9");
            httpConn.setRequestProperty("accept-language", "zh-CN,zh;q=0.9");
            httpConn.setRequestProperty("cache-control", "max-age=0");
            // Copy the Cookie header from your own logged-in browser session; never publish a real session cookie.
            httpConn.setRequestProperty("cookie", "<cookie string from your own browser session>");
            httpConn.setRequestProperty("sec-ch-ua", "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"");
            httpConn.setRequestProperty("sec-ch-ua-mobile", "?0");
            httpConn.setRequestProperty("sec-ch-ua-platform", "\"macOS\"");
            httpConn.setRequestProperty("sec-fetch-dest", "document");
            httpConn.setRequestProperty("sec-fetch-mode", "navigate");
            httpConn.setRequestProperty("sec-fetch-site", "none");
            httpConn.setRequestProperty("sec-fetch-user", "?1");
            httpConn.setRequestProperty("upgrade-insecure-requests", "1");
            httpConn.setRequestProperty("user-agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36");

            InputStream responseStream = httpConn.getResponseCode() / 100 == 2
                    ? httpConn.getInputStream()
                    : httpConn.getErrorStream();
            if (responseStream == null) {
                return new ArrayList<>();
            }
            String response;
            // The "\\A" delimiter makes the Scanner return the whole stream as a single token
            try (Scanner s = new Scanner(responseStream, "UTF-8").useDelimiter("\\A")) {
                response = s.hasNext() ? s.next() : "";
            }
            Document document = Jsoup.parse(response);
            // Alternative: let Jsoup fetch the page itself
            // Document document = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36").cookie("wlfstk_smdl","4jxg7p5cy2jz7afp41rull7hc3y9mkjr").timeout(30000).get();

            Element goodsList = document.getElementById("J_goodsList");
            if (goodsList == null) {
                return new ArrayList<>();
            }
            Element glWarp = goodsList.getElementsByClass("gl-warp").first();
            if (glWarp == null) {
                return new ArrayList<>();
            }
            List<Content> contents = new ArrayList<>();
            for (Element child : glWarp.children()) {
                // The image URL is stored in the lazy-loading attribute, not in src
                String img = child.getElementsByTag("img").eq(0).attr("data-lazy-img");
                String price = child.getElementsByClass("p-price").eq(0).text();
                String name = child.getElementsByClass("p-name").eq(0).text();
                Content content = new Content();
                content.setImg(img);
                content.setTitle(name);
                content.setPrice(price);
                contents.add(content);
            }
            return contents;
        }
    }
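To make the URL step easier to follow on its own, here is a JDK-only sketch of the same request URL: the keyword is percent-encoded as UTF-8 and logical page n is sent as `page=2n-1`, as in the utility class above. The class name `JdSearchUrl` and its methods are hypothetical; the article itself builds the URL with Hutool's `UrlBuilder`, and the exact encoding of edge cases may differ slightly between the two.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that mirrors the URL assembled in HtmlParseUtil.
class JdSearchUrl {

    // Logical result page n (1, 2, 3, ...) is sent as page=2n-1 (1, 3, 5, ...).
    static int toJdPageParam(int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        return 2 * page - 1;
    }

    static String build(String keyword, int page) {
        try {
            // Percent-encode the keyword so non-ASCII text survives the query string
            String encoded = URLEncoder.encode(keyword, "UTF-8");
            return "https://search.jd.com/Search?keyword=" + encoded
                    + "&enc=utf-8&page=" + toJdPageParam(page);
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u6d17\u8863\u673a is the keyword used in the article ("washing machine")
        System.out.println(build("\u6d17\u8863\u673a", 2));
    }
}
```

Keeping the page mapping in one small method makes it easy to adjust if the site changes how it numbers result pages.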

3.6 Write the front-end page index.html

    <!DOCTYPE html>
    <html xmlns:th="http://www.thymeleaf.org">
    <head>
        <meta charset="utf-8"/>
        <title>ES JD-style search in practice</title>
        <link rel="stylesheet" th:href="@{/css/style.css}"/>
    </head>
    <body class="pg">
    <div class="page" id="app">
        <div id="mallPage" class=" mallist tmall- page-not-market ">
            <!-- Header search -->
            <div id="header" class=" header-list-app">
                <div class="headerLayout">
                    <div class="headerCon ">
                        <!-- Logo -->
                        <h1 id="mallLogo">
                            <img th:src="@{/images/jdlogo.png}" alt="">
                        </h1>
                        <div class="header-extra">
                            <!-- Search -->
                            <div id="mallSearch" class="mall-search">
                                <form name="searchTop" class="mallSearch-form clearfix">
                                    <fieldset>
                                        <legend>Tmall search</legend>
                                        <div class="mallSearch-input clearfix">
                                            <div class="s-combobox" id="s-combobox-685">
                                                <div class="s-combobox-input-wrap">
                                                    <input v-model="keyword" type="text" autocomplete="off" value="dd" id="mq" class="s-combobox-input" aria-haspopup="true" >
                                                </div>
                                            </div>
                                            <button type="submit" id="searchbtn" @click.prevent="searchKey">Search</button>
                                        </div>
                                    </fieldset>
                                </form>
                                <ul class="relKeyTop">
                                    <li><a>Java</a></li>
                                    <li><a>Front end</a></li>
                                    <li><a>Linux</a></li>
                                    <li><a>Big data</a></li>
                                    <li><a>Finance</a></li>
                                </ul>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
            <!-- Product list page -->
            <div id="content">
                <div class="main">
                    <!-- Brand categories -->
                    <form class="navAttrsForm">
                        <div class="attrs j_NavAttrs" style="display:block">
                            <div class="brandAttr j_nav_brand">
                                <div class="j_Brand attr">
                                    <div class="attrKey"> Brand </div>
                                    <div class="attrValues">
                                        <ul class="av-collapse row-2">
                                            <li><a href="#"> </a></li>
                                            <li><a href="#"> Java </a></li>
                                        </ul>
                                    </div>
                                </div>
                            </div>
                        </div>
                    </form>
                    <!-- Sort options -->
                    <div class="filter clearfix">
                        <a class="fSort fSort-cur">Default<i class="f-ico-arrow-d"></i></a>
                        <a class="fSort">Popularity<i class="f-ico-arrow-d"></i></a>
                        <a class="fSort">New<i class="f-ico-arrow-d"></i></a>
                        <a class="fSort">Sales<i class="f-ico-arrow-d"></i></a>
                        <a class="fSort">Price<i class="f-ico-triangle-mt"></i><i class="f-ico-triangle-mb"></i></a>
                    </div>
                    <!-- Product details -->
                    <div class="view grid-nosku">
                        <!-- Static sample markup, kept for reference:
                        <div class="product">
                            <div class="product-iWrap">
                                (cover image)
                                <div class="productImg-wrap">
                                    <a class="productImg">
                                        <img src="https://img.alicdn.com/bao/uploaded/i1/3899981502/O1CN01q1uVx21MxxSZs8TVn_!!0-item_pic.jpg">
                                    </a>
                                </div>
                                (price)
                                <p class="productPrice"><em><b>¥</b>2590.00</em></p>
                                (title)
                                <p class="productTitle"><a> DKNY autumn solid-color A-line lace dress </a></p>
                                (shop name)
                                <div class="productShop"><span>Shop: Java </span></div>
                                (sales info)
                                <p class="productStatus">
                                    <span>Monthly sales <em>999</em></span>
                                    <span>Reviews <a>3</a></span>
                                </p>
                            </div>
                        </div>
                        -->
                        <div class="product" v-for="(item,index) in result" :key="index">
                            <div class="product-iWrap">
                                <!-- Cover image -->
                                <div class="productImg-wrap">
                                    <a class="productImg">
                                        <img :src="'http:' + item.img">
                                    </a>
                                </div>
                                <!-- Price -->
                                <p class="productPrice">
                                    <em>{{item.price}}</em>
                                </p>
                                <!-- Title: v-html lets the highlight <span> tags from the back end render as markup -->
                                <p class="productTitle">
                                    <a v-html="item.title"> </a>
                                </p>
                                <!-- Shop name -->
                                <div class="productShop">
                                    <span>Shop: Java </span>
                                </div>
                                <!-- Sales info -->
                                <p class="productStatus">
                                    <span>Monthly sales <em>999</em></span>
                                    <span>Reviews <a>3</a></span>
                                </p>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </div>
    <script th:src="@{/js/jquery.min.js}"></script>
    <script th:src="@{/js/axios.min.js}"></script>
    <script th:src="@{/js/vue.min.js}"></script>
    <script>
        new Vue({
            el: "#app",
            data: {
                keyword: "",
                result: []
            },
            methods: {
                async searchKey() {
                    let keyword = this.keyword;
                    console.log(keyword);
                    // POST the keyword and paging parameters to the search endpoint
                    let res = await axios.post("ES/Search", {
                        keyword,
                        pageSize: 20,
                        pageNo: 1
                    });
                    console.log(res);
                    if (res != null && res != undefined) {
                        this.result = res.data;
                    }
                }
            }
        })
    </script>
    </body>
    </html>

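3.7 Create the product POJO class Content

    import lombok.Data;

    // One crawled product: image URL, title, and price as displayed on the page
    @Data
    public class Content {
        private String img;
        private String title;
        private String price;
    }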
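For readers who do not use Lombok, `@Data` generates getters, setters, `equals`, `hashCode`, and `toString` for every field. The class below is a hand-written approximation of what that means for the three fields of Content; it is named `PlainContent` purely to avoid clashing with the real class, and the code Lombok actually generates differs in detail.

```java
import java.util.Objects;

// Approximate hand-written equivalent of a Lombok @Data class with three String fields.
class PlainContent {
    private String img;
    private String title;
    private String price;

    public String getImg() { return img; }
    public void setImg(String img) { this.img = img; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getPrice() { return price; }
    public void setPrice(String price) { this.price = price; }

    // Two objects are equal when all three fields are equal
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PlainContent)) return false;
        PlainContent other = (PlainContent) o;
        return Objects.equals(img, other.img)
                && Objects.equals(title, other.title)
                && Objects.equals(price, other.price);
    }

    @Override
    public int hashCode() {
        return Objects.hash(img, title, price);
    }

    @Override
    public String toString() {
        return "PlainContent(img=" + img + ", title=" + title + ", price=" + price + ")";
    }
}
```

The getters and setters matter here because fastjson and the JSON layer of Spring MVC read and write the fields through them.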

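3.8 Write the crawl-and-sync logic

    /** Controller layer: ESController.java */
    import lombok.extern.slf4j.Slf4j;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.PathVariable;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RestController;

    import javax.annotation.Resource;

    @Slf4j
    @RestController
    @RequestMapping("/ES")
    public class ESController {

        @Resource
        EsDataSearchService esDataSearchService;

        /**
         * Crawl products for the keyword and import them into ES.
         * @param keyword search keyword to crawl
         * @return true if every document was indexed without failure
         * @throws Exception if crawling or indexing fails
         */
        @GetMapping("/data/{keyword}")
        public boolean SynchronizeData(@PathVariable("keyword") String keyword) throws Exception {
            return esDataSearchService.SynchronizeData(keyword);
        }
    }

    /** Service layer: EsDataSearchServiceImpl.java */
    import com.alibaba.fastjson.JSON;
    import org.elasticsearch.action.bulk.BulkRequest;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    // In newer 7.x clients this class lives in org.elasticsearch.xcontent
    import org.elasticsearch.common.xcontent.XContentType;
    import org.springframework.stereotype.Service;

    import javax.annotation.Resource;
    import java.util.List;

    @Service
    public class EsDataSearchServiceImpl implements EsDataSearchService {

        @Resource
        RestHighLevelClient restHighLevelClient;

        @Override
        public boolean SynchronizeData(String keyword) throws Exception {
            // Crawl the first two result pages for the keyword
            List<Content> contents = HtmlParseUtil.parseJDSearchKeyByPage(keyword, 2);
            // Create the bulk request
            BulkRequest jd_goods = new BulkRequest();
            jd_goods.timeout("2m");
            // Add one index operation per crawled item
            for (Content content : contents) {
                jd_goods.add(
                        new IndexRequest("jd_goods")
                                .source(JSON.toJSONString(content), XContentType.JSON)
                );
            }
            // Send all operations in a single round trip
            BulkResponse response = restHighLevelClient.bulk(jd_goods, RequestOptions.DEFAULT);
            return !response.hasFailures();
        }
    }

Note: the crawled items are collected into a list and written to ES in a single bulk request.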
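It may help to see what such a bulk request looks like on the wire. Elasticsearch's `_bulk` REST endpoint takes newline-delimited JSON: an action line followed by the document source, each terminated by a newline. The sketch below builds that body by hand with the JDK only; `BulkBodySketch`, `escape`, and `toBulkBody` are hypothetical names, the escaping covers just quotes, backslashes, and common whitespace controls, and in the project itself RestHighLevelClient produces the body for you.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the newline-delimited JSON body of a bulk index request.
class BulkBodySketch {

    // Escapes only quotes, backslashes, newlines, carriage returns, and tabs;
    // a real project would leave this to a JSON library.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Each item is {img, title, price}, mirroring the fields of Content.
    static String toBulkBody(String index, List<String[]> items) {
        StringBuilder body = new StringBuilder();
        for (String[] item : items) {
            // Action line: index the following document into the given index
            body.append("{\"index\":{\"_index\":\"").append(escape(index)).append("\"}}\n");
            // Source line: the document itself
            body.append("{\"img\":\"").append(escape(item[0]))
                .append("\",\"title\":\"").append(escape(item[1]))
                .append("\",\"price\":\"").append(escape(item[2]))
                .append("\"}\n");
        }
        return body.toString();
    }

    public static void main(String[] args) {
        List<String[]> items = new ArrayList<>();
        items.add(new String[] {"//img.example/1.jpg", "Washer \"A\"", "1999.00"});
        System.out.print(toBulkBody("jd_goods", items));
    }
}
```

Because every operation travels in one request, indexing a page of crawled products costs one round trip instead of one per product.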

3.9 Write the search endpoint

    /** Controller layer */
    @PostMapping("/Search")
    public List<Content> SearchData(@RequestBody SearchObject searchObject) {
        return esDataSearchService.SearchData(searchObject, true);
    }

    /** Service layer */
    @SneakyThrows
    @Override
    public List<Content> SearchData(SearchObject searchObject, boolean flag) {
        SearchRequest request = new SearchRequest();
        request.indices("jd_goods");
        SearchSourceBuilder builder = new SearchSourceBuilder();

        // Pagination
        builder.from((searchObject.getPageNo() - 1) * searchObject.getPageSize());
        builder.size(searchObject.getPageSize());

        // Highlighting: wrap every match in the title in a red <span>
        HighlightBuilder highlightBuilder = new HighlightBuilder();
        highlightBuilder.requireFieldMatch(false);
        highlightBuilder.preTags("<span style='color:red;'>");
        highlightBuilder.postTags("</span>");
        highlightBuilder.field("title");
        builder.highlighter(highlightBuilder);

        // A term query (QueryBuilders.termQuery) only matches when the indexed term is exactly
        // the same, so a phrase match is used instead; it also works for Chinese keywords.
        BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
        boolQueryBuilder.must(QueryBuilders.matchPhraseQuery("title", searchObject.getKeyword()));
        builder.query(boolQueryBuilder);
        builder.timeout(new TimeValue(60, TimeUnit.SECONDS));
        request.source(builder);

        // Run the search
        SearchResponse response = restHighLevelClient.search(request, RequestOptions.DEFAULT);

        // Collect the results
        List<Content> res = new ArrayList<>();
        SearchHits hits = response.getHits();
        for (SearchHit hit : hits.getHits()) {
            Content content = JSON.parseObject(hit.getSourceAsString(), Content.class);
            Map<String, HighlightField> highlightFields = hit.getHighlightFields();
            HighlightField title = highlightFields.get("title");
            if (title != null) {
                // Replace the plain title with the concatenated highlighted fragments
                StringBuilder str = new StringBuilder();
                for (Text fragment : title.fragments()) {
                    str.append(fragment);
                }
                content.setTitle(str.toString());
            }
            res.add(content);
        }

        // Nothing found on the first attempt: crawl the keyword, then search once more
        if (res.size() == 0 && flag) {
            this.SynchronizeData(searchObject.getKeyword());
            // Wait 1 s: newly indexed documents only become searchable after the next
            // index refresh, which happens every second by default.
            Thread.sleep(1000);
            res = this.SearchData(searchObject, false);
        }
        return res;
    }
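The article does not show the SearchObject class that the endpoint receives. Judging from the getters used in the service and the JSON the page posts (`keyword`, `pageNo`, `pageSize`), it could look roughly like the sketch below. Everything here is a reconstruction: the default values and the `offset()` helper are assumptions added for illustration, not the author's code.

```java
// Hypothetical reconstruction of the request object posted by index.html.
class SearchObject {
    private String keyword;
    private int pageNo = 1;     // assumed default: first page
    private int pageSize = 20;  // assumed default: matches the value the page sends

    public String getKeyword() { return keyword; }
    public void setKeyword(String keyword) { this.keyword = keyword; }

    public int getPageNo() { return pageNo; }
    public void setPageNo(int pageNo) { this.pageNo = pageNo; }

    public int getPageSize() { return pageSize; }
    public void setPageSize(int pageSize) { this.pageSize = pageSize; }

    // Zero-based index of the first hit on the requested page,
    // i.e. the value passed to SearchSourceBuilder.from(...).
    public int offset() {
        if (pageNo < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNo and pageSize must be >= 1");
        }
        return (pageNo - 1) * pageSize;
    }
}
```

With `pageNo = 3` and `pageSize = 20`, `offset()` returns 40, so the third page starts at the 41st hit. Validating the values here keeps a negative `from` from ever reaching Elasticsearch.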

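3.10 Start the project and open it in a browser on the application's port (remember to start the ES service first)

[Image (4): the running application]

Elasticsearch is a distributed, highly scalable, near-real-time search and data analytics engine. It makes large volumes of data easy to search, analyze, and explore.

It can also serve as a real-time data store. Search in ES scales well: a cluster can grow to hundreds of servers and handle petabytes of data.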
