Python Data Collection: Using BeautifulSoup


This article is excerpted from php中文网; the author is 巴扎黑. It will be removed on request in case of infringement.

Web Scraping with Python, Part 1: Using BeautifulSoup

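These notes follow the book Web Scraping with Python (《Python网络数据采集》) by Ryan Mitchell. The examples are copied from the book as-is; typing them out myself still seemed worthwhile, so I am recording them here.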

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it with the lxml parser
res = requests.get('https://www.pythonscraping.com/pages/page1.html')
soup = BeautifulSoup(res.text, 'lxml')
print(soup.h1)  # the first <h1> tag in the document

<h1>An Interesting Title</h1>

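Fetching a page with urllib looks like this. read() returns bytes, which have to be decoded into UTF-8 text, for example a.read().decode('utf-8'). When parsing with bs4, however, the response object returned by urllib can be passed in directly.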

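import urllib.request
from bs4 import BeautifulSoup

# BeautifulSoup reads the response object itself, so no manual decode is needed
a = urllib.request.urlopen('https://www.pythonscraping.com/pages/page1.html')
soup = BeautifulSoup(a, 'lxml')
print(soup.h1)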

<h1>An Interesting Title</h1>
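The two ways of feeding urllib data to BeautifulSoup can be compared without touching the network. The following is a minimal sketch under some assumptions: the bytes literal stands in for what read() would return, io.BytesIO stands in for the response object, and the built-in 'html.parser' is used instead of 'lxml' so that no extra parser is required.

```python
import io

from bs4 import BeautifulSoup

# Stand-in for the bytes that urlopen(...).read() would return (made up for this sketch).
raw = b"<html><body><h1>An Interesting Title</h1></body></html>"

# Option 1: decode the bytes yourself, then parse the resulting text.
soup_from_text = BeautifulSoup(raw.decode("utf-8"), "html.parser")

# Option 2: pass a file-like object; BeautifulSoup calls read() on it.
soup_from_file = BeautifulSoup(io.BytesIO(raw), "html.parser")

print(soup_from_text.h1)
print(soup_from_file.h1)
```

Both calls should produce the same parsed tree; the second simply saves the explicit read() and decode() step.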

Grab every span tag whose CSS class attribute is green; on this page these are the names of people.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/warandpeace.html')

soup = BeautifulSoup(res.text, 'lxml')
# class is a Python keyword, so the keyword argument is spelled class_
green_names = soup.find_all('span', class_='green')
for name in green_names:
    print(name.string)


Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
...
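One detail worth knowing about name.string: it is None whenever a tag has more than one child node, while get_text() joins all the text inside the tag. The sketch below illustrates this on a made-up HTML fragment (not the real warandpeace.html page), again with 'html.parser' as an assumption to avoid needing lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, written only for this illustration.
html = """
<p>
  <span class="green">Anna Pavlovna</span> spoke to
  <span class="red">Well, Prince</span>
  <span class="green">Prince <b>Vasili</b></span>
</p>
"""
soup = BeautifulSoup(html, "html.parser")
green_spans = soup.find_all("span", class_="green")

for span in green_spans:
    # .string is None for the second span because it holds text plus a <b> tag
    print(repr(span.string), "|", span.get_text())

names = [span.get_text() for span in green_spans]
```

If the spans on a page might contain nested tags, get_text() is probably the safer choice.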

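Children and descendants are not the same thing. A child tag is a direct, first-generation offspring of its parent tag, while the descendants are every tag nested anywhere below the parent. Put simply, the descendants include the children.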

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# .children yields only the direct children of the table
gifts = soup.find('table', id='giftList').children
for name in gifts:
    print(name)


<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>

<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>

<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
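To make the child/descendant distinction concrete, here is a small sketch that lists the tag names reached through .children and through .descendants. The one-row table is a made-up stand-in for the gift list, and 'html.parser' is used as an assumption in place of 'lxml'.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical one-row table, written only for this illustration.
html = '<table id="giftList"><tr><td>Basket</td><td><span>New!</span></td></tr></table>'
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="giftList")

# Keep Tag objects only; text nodes are skipped.
children = [node.name for node in table.children if isinstance(node, Tag)]
descendants = [node.name for node in table.descendants if isinstance(node, Tag)]

print(children)
print(descendants)
```

The only child of the table is the tr, whereas the descendants also include the td and span tags nested inside it.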

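After finding the table, select its first tr as the current node and take all the siblings that follow it. Because the first tr is the table header, this extracts every data row while skipping the header.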

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# .tr is the first row (the header); next_siblings yields everything after it
gifts = soup.find('table', id='giftList').tr.next_siblings
for name in gifts:
    print(name)


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>

<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
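Note that next_siblings yields every following node, including the whitespace text between rows, which is why blank lines show up in the printed output. If only the tr tags are wanted, find_next_siblings('tr') is one way to filter them. The sketch below uses a made-up, simplified table and 'html.parser' as assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified version of the gift table.
html = """<table id="giftList">
<tr><th>Item Title</th><th>Cost</th></tr>
<tr class="gift" id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr class="gift" id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")
header_row = soup.find("table", id="giftList").tr

# All following nodes: data rows plus the newline text nodes between them.
all_siblings = list(header_row.next_siblings)

# Only the following <tr> tags.
row_siblings = header_row.find_next_siblings("tr")
row_ids = [row["id"] for row in row_siblings]

print(len(all_siblings), len(row_siblings))
print(row_ids)
```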

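To look up an item's price, start from the item's image, go up to its parent <td> tag, and take that tag's previous sibling, which is the price cell.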

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# img -> parent <td> -> previous sibling <td> (the price cell) -> its text
price = soup.find('img', src='../img/gifts/img1.jpg').parent.previous_sibling.string
print(price)


$15.00
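The previous_sibling lookup works on this page because the td tags sit directly next to each other (</td><td>). If the cells were separated by line breaks, previous_sibling would return the whitespace text node instead. A more defensive variant, sketched below on made-up HTML with 'html.parser' as an assumption, is find_previous_sibling('td').

```python
from bs4 import BeautifulSoup

# Hypothetical row in which each cell is on its own line.
html = """<table><tr>
<td>Vegetable Basket</td>
<td>$15.00</td>
<td><img src="../img/gifts/img1.jpg"/></td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")
img_cell = soup.find("img", src="../img/gifts/img1.jpg").parent

# Here the node right before the image cell is a newline, not the price cell.
print(repr(img_cell.previous_sibling))

# find_previous_sibling skips text nodes and returns the nearest preceding <td>.
price = img_cell.find_previous_sibling("td").get_text(strip=True)
print(price)
```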

Collect all the product images. To keep other images from slipping in, match the src attribute precisely with a regular expression.

import re
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# The dots are escaped so that each one matches a literal '.' only
imgs = soup.find_all('img', src=re.compile(r'\.\./img/gifts/img.*\.jpg'))
for img in imgs:
    print(img['src'])


../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
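In a regular expression an unescaped dot matches any character, so a pattern meant to match the literal path ../img/gifts/img1.jpg should escape its dots. The standard-library sketch below compares an unescaped and an escaped pattern; the candidate strings are made up for illustration.

```python
import re

loose = re.compile(r"../img/gifts/img.*.jpg")       # every '.' matches any character
strict = re.compile(r"\.\./img/gifts/img.*\.jpg")   # '\.' matches a literal dot only

# Hypothetical src values to test against.
candidates = [
    "../img/gifts/img1.jpg",
    "../img/gifts/logo.jpg",
    "xx/img/gifts/img1-jpg",
]

results = {}
for src in candidates:
    results[src] = (bool(loose.search(src)), bool(strict.search(src)))
    print(src, results[src])
```

Both patterns accept the real gift image and reject the logo, but only the unescaped one also accepts the malformed third string.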

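find_all() also accepts a function. The function must take a tag as its only argument and return a boolean: tags for which it returns True are kept, and those for which it returns False are dropped.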

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# Alternative filter: lambda tag: tag.name == 'img'
tags = soup.find_all(lambda tag: tag.has_attr('src'))
for tag in tags:
    print(tag)


<img src="../img/gifts/logo.jpg" style="float:left;"/>
<img src="../img/gifts/img1.jpg"/>
<img src="../img/gifts/img2.jpg"/>
<img src="../img/gifts/img3.jpg"/>
<img src="../img/gifts/img4.jpg"/>
<img src="../img/gifts/img6.jpg"/>

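Here tag is a Tag object (bs4.element.Tag): has_attr checks whether the tag has the given attribute, and tag.name returns the tag name. On the page above, the two filters below return the same result, because every tag with a src attribute there is an img tag.
lambda tag: tag.has_attr('src')
lambda tag: tag.name == 'img'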
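The two filters only agree when every tag carrying a src attribute is an img. The sketch below shows them diverging on a made-up fragment that also contains a script tag, and adds a named filter function (is_gift_image is a hypothetical helper, not from the book); 'html.parser' is again an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a non-image tag that also has a src attribute.
html = """<div>
<img src="../img/gifts/logo.jpg" style="float:left;"/>
<img src="../img/gifts/img1.jpg"/>
<script src="app.js"></script>
<a href="/about">About</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")

def is_gift_image(tag):
    """Keep <img> tags that have a src attribute and no inline style."""
    return tag.name == "img" and tag.has_attr("src") and not tag.has_attr("style")

by_src = [tag.name for tag in soup.find_all(lambda tag: tag.has_attr("src"))]
by_name = [tag.name for tag in soup.find_all(lambda tag: tag.name == "img")]
gift_images = [tag["src"] for tag in soup.find_all(is_gift_image)]

print(by_src)
print(by_name)
print(gift_images)
```

A named function reads better than a lambda once the condition combines several checks.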
