Created
June 20, 2019 02:18
Web Scraping in Three Steps
# Getting Started with Python Web Scraping
## Environment Setup
1. Download [Python](https://repo.anaconda.com/archive/Anaconda3-2019.03-MacOSX-x86_64.pkg) (the link points to the Anaconda3 2019.03 installer for macOS)
2. Download the [Chrome browser](https://www.google.cn/chrome/browser/desktop/index.html)
   - [Chrome extensions](https://www.zhihu.com/question/20054116)
3. Download the [PyCharm IDE](http://www.jetbrains.com/pycharm/download/#section=windows)
## Creating a Scraper
1. Install the package
```
pip install requests
```
2. Create an example
```
import requests  # import the requests library

# Fetch the page with requests.get
r = requests.get('https://www.baidu.com/')
print(r.text)  # print the result; Chinese text may appear garbled
r.encoding = 'utf-8'  # set the encoding used to decode the response
print(r.text)  # print again, now decoded as UTF-8
```
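Why setting `r.encoding` changes the printed text can be shown without any network access. The sketch below assumes the server sent UTF-8 bytes without declaring a charset, in which case requests falls back to ISO-8859-1 for text responses. The sample string is made up for illustration.

```python
# Hypothetical sample: the bytes a server might send for a UTF-8 page
raw = '百度一下'.encode('utf-8')

# Decoding with the wrong codec yields garbled text instead of raising an error
garbled = raw.decode('iso-8859-1')

# Decoding with the right codec recovers the original text
fixed = raw.decode('utf-8')

print(repr(garbled))
print(fixed)
```

Because ISO-8859-1 maps every byte to some character, the wrong decoding never fails loudly, which is why the garbled output is easy to miss.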
## The Three Steps of Scraping
#### Step 1: Fetch the Source
1. Import the requests library
2. Disguise the request as a browser
3. Fetch the source with `requests.get`
4. Set the encoding
```
import requests

# Send a browser-like User-Agent so the request looks like it comes from Chrome
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
resp = requests.get('https://baike.baidu.com/item/中国知网', headers=headers)
resp.encoding = 'utf-8'  # decode the response body as UTF-8
```
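The URL in this step contains Chinese characters, which requests is expected to percent-encode before sending. A minimal sketch of the equivalent encoding, using the standard library's `urllib.parse.quote` purely for illustration (it is not part of the original code):

```python
from urllib.parse import quote, unquote

base = 'https://baike.baidu.com/item/'
title = '中国知网'

# Percent-encode the UTF-8 bytes of the non-ASCII path segment
encoded_url = base + quote(title)
print(encoded_url)

# unquote reverses the encoding and restores the readable URL
decoded_url = unquote(encoded_url)
print(decoded_url)
```

The encoded form is what tends to appear in server logs or when the address is copied out of a browser, so it helps to recognize both spellings as the same page.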
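#### Step 2: Parse the Source
1. Import bs4
2. Parse the page data
3. Find the data
4. Print the data
```
from bs4 import BeautifulSoup

# Name the parser explicitly; without it bs4 picks one itself and emits a warning
soup = BeautifulSoup(resp.text, 'html.parser')
# Find the <div> whose class attribute is exactly "basic-info cmn-clearfix".
# find() returns None when nothing matches, in which case .text raises AttributeError.
text = soup.find("div", attrs={"class": "basic-info cmn-clearfix"}).text
print(text)
```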
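##### Step 2 Extension: Clean the Data
> Use a list comprehension with an `if` condition
```
# `lists` is not defined in the original; splitting the Step 2 text into lines is an assumption
lists = text.split('\n')
# Drop blank lines and remove all whitespace inside each remaining line
text = [''.join(i.split()) for i in lists if i.strip()]
```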
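A worked example of the cleaning step on a made-up string (hypothetical data, shaped like text extracted from an info box). It compares a filter on `!= ''` with a filter on `.strip()`, since the two differ for lines that contain only spaces:

```python
# Hypothetical sample resembling text extracted from an HTML info box
raw_text = '\n中文名\n  中国 知网\n\n   \n外文名\n CNKI \n'

lines = raw_text.split('\n')

# Filtering on != '' keeps whitespace-only lines, which then collapse to ''
naive = [''.join(line.strip().split()) for line in lines if line != '']

# Filtering on line.strip() drops whitespace-only lines as well
cleaned = [''.join(line.split()) for line in lines if line.strip()]

print(naive)
print(cleaned)
```

Here `naive` still contains one empty string from the line of spaces, while `cleaned` holds only the four meaningful entries.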
#### Step 3: Save the Data
1. Import pandas
2. Create a new list object
3. Save with `to_csv`
```
import pandas

comments = []
for item in text:
    comments.append(item)  # collect each cleaned line
df = pandas.DataFrame(comments)
# utf_8_sig writes a byte order mark so that Excel detects UTF-8
df.to_csv('comments.csv', encoding='utf_8_sig')
```
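The `utf_8_sig` encoding prepends a byte order mark (BOM), which spreadsheet programs such as Excel use to detect UTF-8. The sketch below shows the effect with the standard-library `csv` module rather than pandas, writing to a temporary directory. The rows and the file location are made up for illustration.

```python
import csv
import os
import tempfile

rows = ['中文名', '中国知网']  # hypothetical cleaned lines

path = os.path.join(tempfile.mkdtemp(), 'comments.csv')

# newline='' lets the csv module control line endings itself
with open(path, 'w', encoding='utf_8_sig', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow([row])

# Inspect the raw bytes: the file starts with the UTF-8 BOM (EF BB BF)
with open(path, 'rb') as f:
    has_bom = f.read().startswith(b'\xef\xbb\xbf')
print(has_bom)

# Reading back with utf_8_sig strips the BOM again
with open(path, encoding='utf_8_sig', newline='') as f:
    read_back = [record[0] for record in csv.reader(f)]
print(read_back)
```

Reading such a file with plain `utf-8` would leave the BOM character at the start of the first field, so the same encoding name should be used for both writing and reading.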
## If the syntax above is unfamiliar, please read this tutorial first
[Python beginner tutorial](https://www.runoob.com/python3/python3-tutorial.html)