Unit 4: Getting Started with the Beautiful Soup Library

Installation

pip install beautifulsoup4

Basic Elements of the Beautiful Soup Library

Importing the Beautiful Soup Library

# The Beautiful Soup library is also called beautifulsoup4 or bs4; it can be imported in two ways
from bs4 import BeautifulSoup
#   or
import bs4

The BeautifulSoup Class

Entire content of an HTML/XML document ←→ tag tree ←→ BeautifulSoup class
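
To make this correspondence concrete, here is a minimal sketch that parses a small inline HTML string, so it runs without network access. The string is a made-up stand-in, not the course demo page. One BeautifulSoup object represents the whole document's tag tree.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used here instead of the demo URL
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# One BeautifulSoup object corresponds to the whole document's tag tree
soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__)   # class name of the parsed-document object
print(soup.title)            # first <title> tag found in the tree
print(soup.prettify())       # the tree serialized again, with indentation
```

Tags are reached as attributes of `soup` (for example `soup.title`), which is how the rest of these notes navigate the tree.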

Beautiful Soup Library Parsers
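
The code in these notes always passes "html.parser", Python's built-in HTML parser, as the second argument. The sketch below is only an illustration: it assumes that the third-party parsers "lxml" and "html5lib" may or may not be installed, and it skips any that are missing. The input string is hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<p>Hello<b>world"  # deliberately unclosed tags (hypothetical input)

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        # The second argument names the parser used to build the tag tree
        results[parser] = str(BeautifulSoup(html, parser))
    except FeatureNotFound:
        # lxml and html5lib are separate packages and may not be installed
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output if output is not None else "not installed")
```

Different parsers may repair malformed markup differently, so it is probably safest to pick one and use it consistently.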

Basic Elements of the BeautifulSoup Class

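# First attempt
from bs4 import BeautifulSoup
demo = 'https://python123.io/ws/demo.html'
soup = BeautifulSoup(urlopen(demo), "html.parser")  # soup represents the parsed demo page
soup.title  # returns the <title> Tag

# Note: running the code above as-is fails with a NameError, because urlopen has not been imported
# Revised version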

from bs4 import BeautifulSoup
from urllib.request import urlopen
demo = 'https://python123.io/ws/demo.html'
soup = BeautifulSoup(urlopen(demo), "html.parser")  # soup represents the parsed demo page
soup.title  # returns the <title> Tag, not a plain string
#Out[1]:<title>This is a python demo page</title>
soup.a.name  # name of the first <a> tag
#Out[2]:'a'
soup.a.parent.name  # name of the <a> tag's parent
#Out[3]:'p'
tag = soup.a  # take the <a> tag as an example
tag.attrs  # the tag's attributes
#Out[4]:{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
tag.attrs['class']  # value of the class attribute
#Out[5]:['py1']
type(tag.attrs)  # the attributes are stored in a dict
#Out[6]:dict
type(tag)  # a tag is an instance of bs4's Tag type
#Out[7]:bs4.element.Tag
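
The same attribute lookups can be tried offline. This is a minimal sketch on an inline fragment modeled on the demo page's first link. The surrounding `<p class="course">` wrapper and the `title` lookup are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# Inline fragment modeled on the demo page's first link (wrapper <p> is made up)
html = ('<p class="course">Courses: '
        '<a href="http://www.icourse163.org/course/BIT-268001" '
        'class="py1" id="link1">Basic Python</a></p>')
soup = BeautifulSoup(html, "html.parser")

tag = soup.a                  # first <a> Tag in the document
print(tag.name)               # the tag's name
print(tag.parent.name)        # name of the enclosing tag
print(tag.attrs["class"])     # class can hold several values, so it is a list
print(tag.get("id"))          # id holds a single value, so it is a string
print(tag.get("title"))       # .get returns None for a missing attribute
```

Indexing `tag.attrs["title"]` would raise a KeyError here, so `tag.get(...)` is the gentler choice when an attribute might be absent.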
# Get the content between the opening and closing tags

soup.a  # the full <a> tag
#Out[1]:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
soup.a.string
#Out[2]:'Basic Python'
type(soup.a.string)
#Out[3]:bs4.element.NavigableString
soup.p  # the full <p> tag
#Out[4]:<p class="title"><b>The demo python introduces several python courses.</b></p>
soup.p.string
#Out[5]:'The demo python introduces several python courses.'
type(soup.p.string)
#Out[6]:bs4.element.NavigableString

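# Note: the <p> tag contains a <b> tag, yet the string shown does not include the <b> markup. This shows that .string (a NavigableString) can reach through nested tag levels. It does so only when each tag along the way has exactly one child; if a tag has several children, .string is None.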
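
The behavior of `.string` can be checked on a small inline example. This is a sketch with made-up markup and ids, not content from the demo page. It contrasts a `<p>` whose only child is a `<b>` with a `<p>` that mixes text and a `<b>`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the ids "one" and "two" exist only for this example
html = ('<p id="one"><b>Only child text</b></p>'
        '<p id="two">Leading text <b>bold</b> trailing text</p>')
soup = BeautifulSoup(html, "html.parser")

one = soup.find("p", id="one")
two = soup.find("p", id="two")

print(one.string)       # descends through the single <b> child
print(two.string)       # several children, so there is no single string
print(two.get_text())   # concatenates all text found below the tag
```

When a tag may contain mixed content, `get_text()` is usually the more dependable way to collect its text.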