
4.1 Getting a Web Page's Source Code with Python


Updated: 2025-10-05 11:05:02



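1) Installing third-party libraries

a. Online installation

pip install <package-name>

b. Local installation

Download the .whl file that matches your Python version and platform, cd into the directory that contains it, and run:

pip install xxx.whl

2) Getting web page source code with requests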

a. GET requests

import requests

html = requests.get('URL')  # returns a Response object
html_bytes = html.content  # the .content attribute holds the page source as bytes
html_str = html_bytes.decode()  # the .decode() method converts bytes to str; the default encoding is UTF-8

Common encodings are UTF-8, GBK, GB2312, and GB18030. Use whichever one displays the Chinese text correctly. The code above can be shortened to:

html_str = requests.get('URL').content.decode()
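When the page's encoding is not known in advance, one option is to try the encodings listed above in turn. The following is a minimal sketch that works on raw bytes, so it runs without a network connection. The helper name decode_html and the candidate order are assumptions made for illustration. UTF-8 is tried first because it rejects most non-UTF-8 input, and GB18030 second because it is a superset of GBK and GB2312.

```python
def decode_html(raw, encodings=("utf-8", "gb18030")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to the first candidate and
    # substitute a replacement character for undecodable bytes.
    return raw.decode(encodings[0], errors="replace"), encodings[0]


# Stand-ins for the bytes that requests.get('URL').content would return
utf8_bytes = "网页源代码".encode("utf-8")
gbk_bytes = "网页源代码".encode("gbk")

utf8_result = decode_html(utf8_bytes)
gbk_result = decode_html(gbk_bytes)
print(utf8_result)
print(gbk_result)
```

A clean decode does not prove the guess is right, since some byte sequences are valid in more than one encoding, so checking that the Chinese text displays correctly is still the final test.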

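b. POST requests

Some pages return different results for the same URL depending on whether it is accessed with GET or POST. Other pages can only be accessed with POST and return an error message when accessed with GET. The post() method is used like this:

import requests

data = {'key1': 'value1', 'key2': 'value2'}
html_formdata = requests.post('URL', data=data).content.decode()
# Some sites expect the submitted content as JSON:
# html_formdata = requests.post('URL', json=data).content.decode()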
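The difference between data= and json= is the shape of the request body. The sketch below uses only the standard library to show approximately what each form looks like for the same dictionary. It does not call requests itself, and the exact bytes that requests sends may differ in minor details.

```python
import json
from urllib.parse import urlencode

data = {'key1': 'value1', 'key2': 'value2'}

# Approximate body for requests.post('URL', data=data):
# form-encoded, sent as application/x-www-form-urlencoded
form_body = urlencode(data)

# Approximate body for requests.post('URL', json=data):
# a JSON document, sent as application/json
json_body = json.dumps(data)

print(form_body)  # key1=value1&key2=value2
print(json_body)  # {"key1": "value1", "key2": "value2"}
```

If a POST request returns an error or an empty result, it may help to check which of the two formats the site expects.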

3) Combining requests with regular expressions

① Extract the title

title = re.search('title>(.*?)<', html_str, re.S).group(1)

② Extract the body text and join the two paragraphs with a newline

content_list = re.findall('p>(.*?)<', html_str, re.S)
content_str = '\n'.join(content_list)
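The two steps above can be tried offline on a small HTML string. This sketch uses a made-up sample_html and, as a swapped-in variation, stricter patterns that match the full opening and closing tags. The looser pattern 'p>(.*?)<' can also match at a closing </p> tag and capture the whitespace after it. The sketch also guards against re.search returning None when no title is found.

```python
import re

# Hypothetical page source, standing in for the html_str returned by requests
sample_html = """<html>
<head><title>Test Page</title></head>
<body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body>
</html>"""

# re.S lets '.' match newlines, so tags that span lines are still captured
match = re.search('<title>(.*?)</title>', sample_html, re.S)
title = match.group(1) if match else ''

content_list = re.findall('<p>(.*?)</p>', sample_html, re.S)
content_str = '\n'.join(content_list)

print(title)
print(content_str)
```

These patterns assume tags without attributes. A page that uses something like <p class="..."> would need a different pattern or an HTML parser.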

The complete code is as follows:

import requests
import re

# requests needs the URL scheme (http://) to be present
html_str = requests.get('http://exercise.kingname.info/exercise_requests_get.html').content.decode()
title = re.search('title>(.*?)<', html_str, re.S).group(1)
content_list = re.findall('p>(.*?)<', html_str, re.S)
content_str = '\n'.join(content_list)
print(f'Page title: {title}')
print(f'Page body:\n{content_str}')

Summary

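Local installation is recommended for third-party libraries, because some libraries download very slowly when installed online.

Patterns for getting web page source code

# GET
html_str = requests.get('URL').content.decode()  # optionally pass an encoding name; the default is UTF-8
# POST, with form data or JSON
html_str = requests.post('URL', data=data).content.decode()
html_str = requests.post('URL', json=data).content.decode()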


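Copyright notice: This article, "4.1 Getting a Web Page's Source Code with Python", was contributed voluntarily by Deputy Director Lin Shujun (林淑君), and the views expressed are the author's alone. To reprint it, contact the author and cite the source: http://www.xiehuijuan.com/baike/1686815598a106385.html. This site only provides information storage services, holds no ownership of the content, and assumes no related legal liability. If you find content on this site suspected of plagiarism, infringement, or violating laws or regulations, it will be removed immediately once verified.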