A Complete Guide to urllib, Python's Built-in Web Crawling Library

January 14, 2024

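Python's urllib is a package of modules for working with URLs. It provides functions for manipulating URLs and making network requests. It consists of four submodules: urllib.request, urllib.error, urllib.parse, and urllib.robotparser.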

1. urllib.request

This module opens and reads URLs. It provides the interface for retrieving data from a URL.

Basic usage

Opening and reading a URL:

import urllib.request

# The context manager closes the connection when the block exits.
with urllib.request.urlopen('http://example.com/') as response:
    html = response.read()  # raw bytes; decode to get a str
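
Because response.read() returns bytes rather than str, the result usually needs to be decoded before it can be treated as text. The following is a minimal sketch of that step. So that it runs without network access, it assumes a data: URL as a stand-in for a real page (urlopen handles the data: scheme). The URL contents and the UTF-8 fallback are illustrative choices, not part of the original example.

```python
import urllib.request

# Illustrative stand-in for a real page: a data: URL carrying UTF-8 text.
url = 'data:text/plain;charset=utf-8,Hello%2C%20urllib'

with urllib.request.urlopen(url) as response:
    raw = response.read()  # bytes
    # Use the charset declared in Content-Type; fall back to UTF-8 (an assumption).
    charset = response.headers.get_content_charset() or 'utf-8'

text = raw.decode(charset)
print(text)  # Hello, urllib
```

With a real http:// URL the same pattern applies, although servers do not always declare a charset, which is why a fallback is included.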

Advanced usage

Sending data and setting request headers:

import urllib.parse
import urllib.request

# Form data must be URL-encoded and then converted to bytes.
data = urllib.parse.urlencode({'key1': 'value1', 'key2': 'value2'}).encode('utf-8')
# Supplying data turns the request into a POST.
request = urllib.request.Request('http://example.com/', data=data)
request.add_header('User-Agent', 'Mozilla/5.0')
with urllib.request.urlopen(request) as response:
    response_text = response.read()  # bytes, despite the variable name
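
It can help to inspect a Request object before sending it. The sketch below builds the same kind of request and prints what urllib would send, without opening a connection. Passing the header through the headers argument instead of add_header is just an alternative spelling chosen for illustration. Request stores header names in capitalized form, so the lookup uses 'User-agent'.

```python
import urllib.parse
import urllib.request

form = {'key1': 'value1', 'key2': 'value2'}
data = urllib.parse.urlencode(form).encode('utf-8')

# Nothing is sent here; the Request object only describes the request.
request = urllib.request.Request(
    'http://example.com/',
    data=data,
    headers={'User-Agent': 'Mozilla/5.0'},
)
plain_request = urllib.request.Request('http://example.com/')

print(request.get_method())              # POST, because data was supplied
print(plain_request.get_method())        # GET, because there is no data
print(request.data)                      # b'key1=value1&key2=value2'
print(request.get_header('User-agent'))  # Mozilla/5.0
```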

2. urllib.error

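This module defines the exceptions raised by urllib.request. The most common are URLError and HTTPError. HTTPError is a subclass of URLError, so it has to be caught first.

Example

Handling exceptions: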

from urllib.request import urlopen
from urllib.error import URLError, HTTPError

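try:
    response = urlopen('http://example.com/')
except HTTPError as e:
    # The server responded, but with an error status such as 404 or 500.
    print('HTTPError:', e.code)
except URLError as e:
    # The server could not be reached (DNS failure, refused connection, ...).
    print('URLError:', e.reason)
else:
    print('Request handled successfully')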
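
The order of the except clauses matters because HTTPError derives from URLError. The sketch below illustrates this offline by constructing the two exceptions by hand rather than triggering them over the network. The helper describe_error, the URL, and the error messages are hypothetical and exist only for this illustration.

```python
from email.message import Message
from urllib.error import HTTPError, URLError


def describe_error(exc):
    """Hypothetical helper: turn a urllib exception into a short message."""
    # Check the subclass first; an HTTPError is also a URLError.
    if isinstance(exc, HTTPError):
        return f'HTTP error {exc.code}: {exc.reason}'
    if isinstance(exc, URLError):
        return f'URL error: {exc.reason}'
    raise TypeError('not a urllib error')


# Hand-built exceptions standing in for real failures.
not_found = HTTPError('http://example.com/missing', 404, 'Not Found', Message(), None)
unreachable = URLError('connection refused')

print(issubclass(HTTPError, URLError))  # True
print(describe_error(not_found))        # HTTP error 404: Not Found
print(describe_error(unreachable))      # URL error: connection refused
```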

3. urllib.parse

This module parses and constructs URLs. It provides the basic operations for working with URLs.

Example

Parsing a URL:

from urllib.parse import urlparse

parsed_url = urlparse('http://example.com:80/foo/bar?baz=qux#quux')
print(parsed_url)
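
The value returned by urlparse is a named tuple, so the individual components can be read as attributes. As a short sketch using the same URL, the following pulls out each part and additionally decodes the query string with parse_qs, which is not used in the original example.

```python
from urllib.parse import urlparse, parse_qs

parsed = urlparse('http://example.com:80/foo/bar?baz=qux#quux')

print(parsed.scheme)    # http
print(parsed.netloc)    # example.com:80
print(parsed.hostname)  # example.com
print(parsed.port)      # 80
print(parsed.path)      # /foo/bar
print(parsed.query)     # baz=qux
print(parsed.fragment)  # quux

# parse_qs maps each key to a list, since a key may appear more than once.
query = parse_qs(parsed.query)
print(query)            # {'baz': ['qux']}
```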

Building a URL query string:

from urllib.parse import urlencode

query_string = urlencode({'key1': 'value1', 'key2': 'value2'})
print(query_string)
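
urlencode also escapes characters that are not safe in a query string, and with doseq=True it expands list values into repeated keys. The sketch below shows both and attaches the result to a URL. The /search path and the parameter names are made up for illustration.

```python
from urllib.parse import urlencode

# Spaces become '+', and non-string values are converted with str().
params = {'q': 'python urllib', 'page': 2}
query_string = urlencode(params)
print(query_string)  # q=python+urllib&page=2

# Hypothetical endpoint, used only to show how the pieces fit together.
url = 'http://example.com/search?' + query_string
print(url)           # http://example.com/search?q=python+urllib&page=2

# doseq=True emits one key=value pair per list element.
tags = urlencode({'tag': ['a', 'b']}, doseq=True)
print(tags)          # tag=a&tag=b
```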

4. urllib.robotparser

urllib.robotparser parses robots.txt files. It can determine which pages a crawler is allowed to fetch.

Example

Parsing robots.txt:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()
can_fetch = rp.can_fetch('*', 'http://example.com/foo')
print(can_fetch)
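
To see how the rules are applied without making a network request, the parser can also be fed the lines of a robots.txt file directly through parse(). The sketch below does that. The rules, the user agent name MyBot, and the paths are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, supplied as a list of lines instead of being downloaded.
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = RobotFileParser()
rp.parse(robots_lines)

# 'MyBot' is a hypothetical crawler name; it falls under the '*' rules.
blocked = rp.can_fetch('MyBot', 'http://example.com/private/page')
allowed = rp.can_fetch('MyBot', 'http://example.com/public/page')
print(blocked)  # False
print(allowed)  # True
```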

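Summary

urllib is part of the Python standard library and provides a simple interface for HTTP requests. It is less feature-rich than third-party libraries such as requests, but it is built in, needs no extra installation, and is sufficient for simple tasks. When using urllib, pay attention to exception handling and to network security concerns such as SSL certificate verification and honoring robots.txt.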
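
As one possible way to make the SSL point concrete, the sketch below creates a default SSL context, which verifies certificates and hostnames, and passes it to urlopen explicitly. The fetch helper and the 10-second timeout are assumptions made for illustration. The helper is defined but not called here, so the snippet runs without network access.

```python
import ssl
import urllib.request

# The default context requires a valid certificate and a matching hostname.
context = ssl.create_default_context()


def fetch(url, timeout=10):
    """Hypothetical helper: download a URL with certificate verification on."""
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read()


print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True
```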