@lize240810
Created June 20, 2019 02:18
Web Scraping in Three Steps
# A First Look at Python Web Scraping
## Environment Setup
1. Download [Python (Anaconda distribution)](https://repo.anaconda.com/archive/Anaconda3-2019.03-MacOSX-x86_64.pkg)
2. Download the [Chrome browser](https://www.google.cn/chrome/browser/desktop/index.html)
   - [Chrome extensions](https://www.zhihu.com/question/20054116)
3. Download the [PyCharm IDE](http://www.jetbrains.com/pycharm/download/#section=windows)
## Creating a Scraper
1. Install the package
```
pip install requests
```
2. A quick example
```
import requests  # import the requests library
r = requests.get('https://www.baidu.com/')
# fetch the page with requests.get
r.text  # inspect the result (the encoding may look wrong at first)
r.encoding = 'utf-8'  # fix the encoding
r.text  # inspect the result again
```
## The Three Steps of Scraping
#### Step 1: Fetch the source
1. Import the requests library
2. Disguise the request as a browser
3. Fetch the page source with `requests.get`
4. Convert the encoding
```
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
resp = requests.get('https://baike.baidu.com/item/中国知网', headers=headers)
resp.encoding = 'utf-8'
```
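To see what the browser disguise actually does without sending anything over the network, you can build the request and inspect it before it goes out. This is a minimal sketch using requests' `Request`/`prepare()` API; the URL and header are just the ones from the example above:

```python
import requests

# The same browser-style User-Agent header used in Step 1
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/74.0.3729.169 Safari/537.36'}

# Build and prepare the request without sending it,
# so we can inspect exactly what the server would receive
req = requests.Request('GET', 'https://baike.baidu.com/item/中国知网',
                       headers=headers)
prepared = req.prepare()
print(prepared.headers['User-Agent'])  # the site sees a Chrome browser
```

Without the header, requests identifies itself as `python-requests/...`, which some sites block; the disguise makes the request look like an ordinary Chrome visit.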
#### Step 2: Parse the source
1. Import bs4
2. Parse the page data
3. Locate the data
4. Print the data
```
from bs4 import BeautifulSoup
soup = BeautifulSoup(resp.text, 'html.parser')  # name the parser explicitly
text = soup.find("div", attrs={"class": "basic-info cmn-clearfix"}).text
print(text)
```
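The `find` call above can be tried offline on a small HTML snippet. The class name mirrors the Baidu Baike infobox div targeted above, but the snippet itself is made up for illustration:

```python
from bs4 import BeautifulSoup

# An invented snippet shaped like the infobox div from the page
html = '''
<html><body>
  <div class="basic-info cmn-clearfix">
    <dt>中文名</dt><dd>中国知网</dd>
  </div>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')
# find() returns the first matching tag; .text collects all its inner text
text = soup.find('div', attrs={'class': 'basic-info cmn-clearfix'}).text
print(text)
```

Note that searching for the full string `"basic-info cmn-clearfix"` matches a tag whose class attribute is exactly that string, which is what the real page uses.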
##### Step 2 extension: clean the data
> Use a list comprehension with a nested if
```
lists = text.split('\n')  # split the scraped block into lines
text = [''.join(i.strip().split()) for i in lists if i != '']
```
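Here is the same comprehension applied to a few hand-made messy lines (the sample strings are invented; real infobox text will differ):

```python
# Invented sample: lines with stray whitespace and empty entries
lists = ['  中文名  中国知网 ', '', '  外文名\tCNKI  ', '']

# strip() trims the ends, split() breaks on any internal whitespace,
# ''.join() glues the pieces back together, and the if drops empty lines
text = [''.join(i.strip().split()) for i in lists if i != '']
print(text)  # → ['中文名中国知网', '外文名CNKI']
```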
#### Step 3: Save the data
1. Import pandas
2. Build a list object
3. Save with to_csv
```
import pandas
comments = []
for item in text:
    comments.append(item)
df = pandas.DataFrame(comments)
df.to_csv('comments.csv', encoding='utf_8_sig')
```
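To check what `to_csv` produces without touching disk, you can write to an in-memory buffer instead of a filename. A sketch with made-up rows standing in for the scraped text:

```python
import io
import pandas

# Invented cleaned rows, standing in for the scraped infobox text
comments = ['中文名中国知网', '外文名CNKI']
df = pandas.DataFrame(comments)

buf = io.StringIO()
df.to_csv(buf)  # same call as above, but writing to an in-memory buffer
csv_text = buf.getvalue()
print(csv_text)
```

When writing a real file, the `encoding='utf_8_sig'` argument used above adds a BOM so that Excel opens the Chinese text correctly.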
## If the syntax above is unfamiliar, work through this tutorial first
[Python 3 beginner tutorial](https://www.runoob.com/python3/python3-tutorial.html)