
Python HTML Parser BeautifulSoup: Usage Examples Explained [Crawler Parsers]

Date: 2021-10-23 10:13:27 | Category: Script Collection


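This article walks through how to use BeautifulSoup, an HTML parser for Python, with worked examples. It is shared here for reference; the details follow.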

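Introduction to BeautifulSoup

Python ships with a capable built-in HTML parser module, HTMLParser (html.parser in Python 3). A more powerful tool for parsing HTML or XML is BeautifulSoup, which is a third-party library. Put simply, its main job is extracting data from web pages. The rest of this article shows what it can do.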

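Installing BeautifulSoup

BeautifulSoup 3 is no longer developed, so current projects should use BeautifulSoup 4. It has been moved into the bs4 package, which means it is imported from bs4 (import bs4). It can be installed with either pip or easy_install; pip is used below.

    pip install beautifulsoup4
    pip install lxml

Installing the lxml module as well is recommended. BeautifulSoup supports the HTML parser in the Python standard library (html.parser) and several third-party parsers. If no third-party parser is installed, BeautifulSoup uses Python's built-in parser. The lxml parser is more capable and faster, so it is the recommended choice.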
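As a minimal sketch of the parser choice described above, the helper below tries lxml first and falls back to the standard-library parser. The name make_soup is hypothetical and only used for illustration; FeatureNotFound is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: parse with lxml if available, else html.parser."""
    try:
        return BeautifulSoup(markup, "lxml"), "lxml"
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(markup, "html.parser"), "html.parser"


soup, parser_used = make_soup("<p>hello</p>")
print(parser_used)    # which parser was actually used
print(soup.p.string)  # hello
```

Naming the parser explicitly keeps results consistent across machines, since different parsers can build slightly different trees from malformed markup.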

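Creating the object

After installation, create a BeautifulSoup object:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'lxml')  # html_doc: the HTML text to parse

Pretty-printed output:

    print(soup.prettify())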
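The two steps above can be tried end to end with a short, self-contained sketch. It assumes the standard-library html.parser so that it runs without lxml; the markup in html_doc is a shortened sample made up for this example.

```python
from bs4 import BeautifulSoup

# A one-line sample document (illustrative only).
html_doc = (
    "<html><head><title>the dormouse's story</title></head>"
    "<body><p class='title'><b>the dormouse's story</b></p></body></html>"
)

soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns a string with one tag per line, indented by depth.
pretty = soup.prettify()
print(pretty)
```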
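The four BeautifulSoup object types

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types:

1. Tag

A Tag is a complete HTML tag. For example, to get the <title> tag:

    print(soup.title)
    # <title>the dormouse's story</title>

A Tag has two important attributes: name and attrs.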

name

The name of the HTML tag:

    print(soup.name)
    # [document]
    print(soup.head.name)
    # head
attrs

A dictionary of the tag's HTML attributes:

    print(soup.p.attrs)
    # {'class': ['title'], 'name': 'dromouse'}

To get a single attribute:

    print(soup.p['class'])
    # ['title']
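The name and attrs lookups can be checked on a one-tag document. This is a small sketch using html.parser; the markup is a cut-down version of the sample paragraph, and the .get() call is added to show a lookup that does not raise KeyError for a missing attribute.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="title" name="dromouse"><b>story</b></p>', "html.parser"
)
p = soup.p

print(p.name)       # p
print(p.attrs)      # {'class': ['title'], 'name': 'dromouse'}
print(p["class"])   # ['title'] (class is multi-valued, so it is a list)
print(p["name"])    # dromouse
print(p.get("id"))  # None: .get() returns None instead of raising KeyError
```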
2. NavigableString

Having obtained the whole tag, how do we get the text inside it? Use .string:

    print(soup.p.string)
    # the dormouse's story
3. BeautifulSoup

A BeautifulSoup object represents the entire content of a document:

    print(soup.name)
    # [document]
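4. Comment

The content of an HTML comment. Note that it does not include the comment delimiters. Check first whether the value is a Comment, then do something with it, such as printing it:

    import bs4
    if isinstance(soup.a.string, bs4.element.Comment):
        print(soup.a.string)
    # prints " elsie ", the text inside <!-- elsie --> without the delimiters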
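To see why the type check matters, the sketch below compares a link whose only content is a comment with a link that holds ordinary text. It assumes html.parser and a two-link snippet made up for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    '<a id="link1"><!-- elsie --></a><a id="link2">lacie</a>', "html.parser"
)

kinds = {}
for a in soup.find_all("a"):
    # .string returns the single child node; a Comment is a special string.
    kinds[a["id"]] = "comment" if isinstance(a.string, Comment) else "text"

print(kinds)                                  # {'link1': 'comment', 'link2': 'text'}
print(repr(str(soup.find(id="link1").string)))  # ' elsie '
```

Without the check, the comment text would be treated like visible page text.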
Traversing the document tree

1. Child nodes

contents

Returns all child nodes as a list:

    print(soup.head.contents)
    # [<title>the dormouse's story</title>]
children

Returns all child nodes as an iterator rather than a list:

    print(soup.head.children)
    # an iterator object such as <list_iterator object at 0x7f71457f5710>
    # iterate over it to reach the nodes
    for child in soup.body.children:
        print(child)
    # Output:
    # <p class="title" name="dromouse"><b>the dormouse's story</b></p>
    # <p class="story">once upon a time there were three little sisters; and their names were
    # <a class="sister" href="http://example.com/elsie" id="link1"><!-- elsie --></a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">lacie</a> and
    # <a class="sister" href="http://example.com/tillie" id="link3">tillie</a>;
    # and they lived at the bottom of a well.</p>
    # <p class="story">...</p>
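The difference between contents and children is easiest to see side by side. A minimal sketch, assuming html.parser and a two-paragraph body invented for this example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>one</p><p>two</p></body>", "html.parser")
body = soup.body

# contents is a real list, so it supports len() and indexing.
print(len(body.contents))  # 2
print(body.contents[0])    # <p>one</p>

# children must be iterated; it yields the same nodes in the same order.
child_names = [child.name for child in body.children]
print(child_names)         # ['p', 'p']
```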
2. Node content

string

Returns a single piece of text. If a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one child tag, .string returns the innermost text. If a tag has more than one child node, it is ambiguous which child .string should refer to, so the result is None. For example:

    print(soup.head.string)
    # the dormouse's story
    print(soup.title.string)
    # the dormouse's story
    print(soup.html.string)
    # None
strings

Returns every piece of text, including blank lines and whitespace.

stripped_strings

Returns every piece of text, with blank lines and surrounding whitespace removed:

    for string in soup.stripped_strings:
        print(repr(string))
    # "the dormouse's story"
    # "the dormouse's story"
    # 'once upon a time there were three little sisters; and their names were'
    # 'elsie'
    # ','
    # 'lacie'
    # 'and'
    # 'tillie'
    # ';\nand they lived at the bottom of a well.'
    # '...'
The get_text() method

Returns the text of the current node and all of its descendants.

    from bs4 import BeautifulSoup
    html_doc = """
    <html><head><title>the dormouse's story</title></head>
    <body>
      <p class="title"><b>the dormouse's story</b></p>
      <p class="story">once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister1" id="link1">elsie</a>,
        <a href="http://example.com/lacie" class="sister2" id="link2">lacie</a> and
        <a href="http://example.com/tillie" class="sister3" id="link3">tillie</a>;
        and they lived at the bottom of a well.
      </p>
      <p class="story">...</p>
    </body>
    </html>
    """
    soup = BeautifulSoup(markup=html_doc, features='lxml')
    node_p_text = soup.find('p', class_='story').get_text()  # note the trailing underscore in class_
    print(node_p_text)
    # Output:
    # once upon a time there were three little sisters; and their names were
    #     elsie,
    #     lacie and
    #     tillie;
    #     and they lived at the bottom of a well.
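The output above keeps the indentation of the source markup. As a hedged aside, get_text() also accepts separator and strip arguments, which can tidy this up. The sketch below assumes html.parser and a compact one-line paragraph made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p>names: <a>elsie</a>, <a>lacie</a> and <a>tillie</a></p>", "html.parser"
)
p = soup.p

# Default: all text pieces are concatenated as they appear.
plain = p.get_text()
print(plain)   # names: elsie, lacie and tillie

# strip=True trims each piece; separator is placed between the pieces.
joined = p.get_text(separator="|", strip=True)
print(joined)  # names:|elsie|,|lacie|and|tillie
```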
3. Parent nodes

parent

Returns the direct parent of a node:

    p = soup.p
    print(p.parent.name)
    # body
parents

Iterates over all ancestors of a node, from its parent up to the top of the document:

    content = soup.head.title.string
    for parent in content.parents:
        print(parent.name)
    # Output:
    # title
    # head
    # html
    # [document]
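Both attributes can be checked on a tiny document. A minimal sketch, assuming html.parser and a shortened sample; it also shows that the top-level BeautifulSoup object has no parent.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>story</title></head></html>", "html.parser"
)
title_text = soup.title.string

# parent: the tag that directly contains the text node.
print(title_text.parent.name)  # title

# parents: walk upward until the BeautifulSoup object itself.
chain = [parent.name for parent in title_text.parents]
print(chain)                   # ['title', 'head', 'html', '[document]']

# The document object is the root, so it has no parent.
print(soup.parent)             # None
```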
4. Sibling nodes

next_sibling

The next_sibling attribute gets the next sibling of a node. The result is often a string or whitespace, because whitespace and line breaks also count as nodes.

previous_sibling

The previous_sibling attribute gets the previous sibling of a node.

    print(soup.p.next_sibling)
    # (blank: the next sibling here is whitespace)
    print(soup.p.previous_sibling)
    # None when there is no previous sibling; if whitespace precedes the tag, that whitespace is returned instead
    print(soup.p.next_sibling.next_sibling)
    # <p class="story">once upon a time there were three little sisters; and their names were
    # <a class="sister" href="http://example.com/elsie" id="link1"><!-- elsie --></a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">lacie</a> and
    # <a class="sister" href="http://example.com/tillie" id="link3">tillie</a>;
    # and they lived at the bottom of a well.</p>
    # the sibling after the whitespace node is the next visible tag
next_siblings and previous_siblings

Iterate over all following or all preceding siblings of a node.
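Because text between tags counts as a sibling too, it often helps to filter the iteration by node type. The sketch below assumes html.parser and a single paragraph with three links, made up for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup(
    '<p><a id="link1">elsie</a>, <a id="link2">lacie</a> and '
    '<a id="link3">tillie</a>;</p>',
    "html.parser",
)
first = soup.find(id="link1")
last = soup.find(id="link3")

# Siblings after link1: text nodes and tags are interleaved.
tags_after = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
text_after = [str(s) for s in first.next_siblings if isinstance(s, NavigableString)]
print(tags_after)   # ['link2', 'link3']
print(text_after)   # [', ', ' and ', ';']

# previous_siblings walks backwards, nearest sibling first.
tags_before = [s["id"] for s in last.previous_siblings if isinstance(s, Tag)]
print(tags_before)  # ['link2', 'link1']
```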

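5. Next and previous elements

next_element and previous_element

These are not limited to siblings. They refer to the node parsed immediately after or before the current one across the whole tree, regardless of nesting level.

next_elements and previous_elements

Iterate over all following or all preceding nodes.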
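The contrast with sibling navigation shows up clearly on two adjacent tags. A minimal sketch, assuming html.parser and a made-up snippet: the next sibling of <b> is the <i> tag, while its next element is its own text.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p><b>bold</b><i>italic</i></p>", "html.parser")
b = soup.b

# next_sibling stays on the same level of the tree.
print(b.next_sibling)        # <i>italic</i>

# next_element follows parse order, so it descends into <b> first.
print(repr(b.next_element))  # 'bold'

# next_elements keeps going in parse order until the document ends.
order = [el.name if isinstance(el, Tag) else str(el) for el in b.next_elements]
print(order)                 # ['bold', 'i', 'italic']
```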

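Searching the document tree

1. find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

The find_all() method searches all tag descendants of the current tag and checks whether each one matches the filter conditions.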
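Two parameters in the signature above, limit and recursive, are not covered by the later examples, so here is a brief sketch of their effect. It assumes html.parser and a nested snippet made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p>1</p><p>2</p><section><p>3</p></section></div>", "html.parser"
)
div = soup.div

# Default: search every descendant of the tag.
everything = [p.get_text() for p in div.find_all("p")]
print(everything)  # ['1', '2', '3']

# limit stops the search after that many matches.
limited = [p.get_text() for p in div.find_all("p", limit=1)]
print(limited)     # ['1']

# recursive=False only looks at direct children.
direct = [p.get_text() for p in div.find_all("p", recursive=False)]
print(direct)      # ['1', '2']
```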

Parameters

The name parameter

The name parameter is flexible and accepts several kinds of values. It finds all tags whose name matches; string nodes are ignored.

(a) Passing a tag name

The simplest filter is a tag name. Pass a tag name to a search method and BeautifulSoup finds the tags whose name matches it exactly. The example below finds all <a> tags in the document:

    print(soup.find_all('a'))
    # [<a class="sister" href="http://example.com/elsie" id="link1"><!-- elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">tillie</a>]

The elements of the returned list are still BeautifulSoup objects (Tag instances), so they can be searched further.

(b) Passing a regular expression

If you pass a regular expression, BeautifulSoup tests tag names against it using the pattern's search() method. The example below finds all tags whose names start with b, so both <body> and <b> are found:

    import re
    for tag in soup.find_all(re.compile("^b")):
        print(tag.name)
    # body
    # b
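Since the pattern is applied with search(), an unanchored pattern matches anywhere in the tag name. The sketch below contrasts an anchored and an unanchored pattern; it assumes html.parser and a small document made up for the example.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>t</title></head><body><b>x</b></body></html>",
    "html.parser",
)

# Anchored: only names that begin with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]
print(starts_with_b)  # ['body', 'b']

# Unanchored: any name that contains "t" somewhere.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]
print(contains_t)     # ['html', 'title']
```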
(c) Passing a list

If you pass a list, BeautifulSoup returns everything that matches any item in the list. The code below finds all <a> tags and <b> tags in the document:

    soup.find_all(["a", "b"])
    # [<b>the dormouse's story</b>,
    # <a class="sister" href="http://example.com/elsie" id="link1">elsie</a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">lacie</a>,
    # <a class="sister" href="http://example.com/tillie" id="link3">tillie</a>]
(d) Passing True

True matches any value. The code below finds every tag in the document but returns no string nodes:

    for tag in soup.find_all(True):
        print(tag.name)
    # html
    # head
    # title
    # body
    # p
    # b
    # p
    # a
    # a
    # a
    # p
(e) Passing a function

If none of the other filters fits, you can define a function that takes a single element as its argument. It should return True if the element matches and should be included, and False otherwise:

    def has_class_but_no_id(tag):
        return tag.has_attr('class') and not tag.has_attr('id')
    soup.find_all(has_class_but_no_id)
    # [<p class="title"><b>the dormouse's story</b></p>,
    # <p class="story">once upon a time there were...</p>,
    # <p class="story">...</p>]
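The same idea works for any condition that can be written as a function of one tag. As an illustrative sketch, the hypothetical filter has_href_but_no_id below is a made-up variant of the function above, run with html.parser on a three-tag snippet.

```python
from bs4 import BeautifulSoup


def has_href_but_no_id(tag):
    """Hypothetical filter: keep tags that link somewhere but carry no id."""
    return tag.has_attr("href") and not tag.has_attr("id")


soup = BeautifulSoup(
    '<a href="/a" id="x">A</a><a href="/b">B</a><span>C</span>', "html.parser"
)

# find_all calls the function once per tag and keeps the True results.
matches = soup.find_all(has_href_but_no_id)
texts = [tag.get_text() for tag in matches]
print(texts)  # ['B']
```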
Keyword arguments

Note that any named argument that is not one of the built-in search parameters is treated as a filter on the tag attribute of that name. If you pass an argument named id, BeautifulSoup filters on each tag's "id" attribute:

    soup.find_all(id='link2')
    # [<a class="sister" href="http://example.com/lacie" id="link2">lacie</a>]
If you pass an href argument, BeautifulSoup filters on each tag's "href" attribute:

    soup.find_all(href=re.compile("elsie"))
    # [<a class="sister" href="http://example.com/elsie" id="link1">elsie</a>]
Several named arguments can be combined to filter on more than one attribute at once:

    soup.find_all(href=re.compile("elsie"), id='link1')
    # [<a class="sister" href="http://example.com/elsie" ...
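The keyword filters above can be run together on a small document. This sketch assumes html.parser and a two-link snippet modeled on the sample; it also shows class_ (with the trailing underscore, since class is a Python keyword) and the attrs dictionary form.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<a class="sister" href="http://example.com/elsie" id="link1">elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")

# One attribute, matched exactly.
by_id = [a.get_text() for a in soup.find_all(id="link2")]
print(by_id)     # ['lacie']

# Two attributes at once: both conditions must hold.
by_both = [a.get_text() for a in soup.find_all(href=re.compile("elsie"), id="link1")]
print(by_both)   # ['elsie']

# class is reserved in Python, so the keyword is spelled class_.
by_class = [a["id"] for a in soup.find_all("a", class_="sister")]
print(by_class)  # ['link1', 'link2']

# The attrs dictionary is an equivalent way to pass attribute filters.
by_attrs = [a.get_text() for a in soup.find_all(attrs={"id": "link1"})]
print(by_attrs)  # ['elsie']
```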
