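Preface
As I understand it, writing a web scraper breaks down into the following steps:
- Analyze the target site
- Request a single page URL and get its HTML source
- Extract the data
- Save the data
- Crawl the remaining pages
Now for the main part.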
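1. Analyze the target site
- The target is Jianshu's seven-day trending articles (简书七日热门). The fields to extract are author, title, view count, comment count, like count, and tip count.
- Inspecting the page with Chrome DevTools shows that it is loaded via AJAX. Looking for the pattern, the URL takes a page parameter that runs from page=1 to page=5.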
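2. Request a single page URL and get its HTML source
- Set the URL
url = 'http://www.jianshu.com/trending/weekly?page=1'
- Set the request headers (used to disguise the request as a browser; optional in this case)
import urllib2  # Python 2; this module became urllib.request in Python 3
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
request = urllib2.Request(url=url, headers=headers)
- Send the request and receive the response
html = urllib2.urlopen(request)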
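The step-by-step snippets above use Python 2's urllib2, while the complete code at the end uses Python 3. As a rough Python 3 sketch of this same step: build_request and fetch_html are hypothetical helper names, and the 10-second timeout is an assumption, not something the original code sets.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')


def build_request(url):
    """Return a Request carrying a browser-like User-Agent header."""
    return Request(url=url, headers={'User-Agent': USER_AGENT})


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None on an HTTP or URL error."""
    try:
        # timeout=10 is an assumed value, not taken from the original code
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except (HTTPError, URLError) as e:
        print('request failed:', e)
        return None
```

Returning None instead of exiting lets the caller decide whether to skip a failed page or stop the whole run.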
3. Extract data from the source
import re
from bs4 import BeautifulSoup
# Parse the response with BeautifulSoup first so it can be queried afterwards
bsObj = BeautifulSoup(html.read(), 'lxml')
- Pull out each article's markup and extract the target fields (crudely written, but it just works)
items = bsObj.findAll("div", {"class": "content"})
for item in items:
    author = item.find("a", {"class": "blue-link"}).get_text()
    title = item.find("a", {"class": "title"}).get_text()
    other = item.find("div", {"class": "meta"}).get_text()
    # Collect every run of digits in the meta text, in order of appearance
    pattern = re.compile(r'(\d+)')
    content = re.findall(pattern, other)
    view = content[0]
    comment = content[1]
    like = content[2]
    money = content[3] if (len(content) == 4) else 0  # very sloppy; good enough for now
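The positional indexing above is the part flagged as sloppy: it raises IndexError when fewer than three numbers are found. One possible tightening, as a sketch only: parse_counts and FIELDS are hypothetical names, the sample strings are made up, and the sketch still assumes the numbers appear in the order view, comment, like, tip. Unlike the original, it converts the counts to int.

```python
import re

# Field names in the order the numbers are assumed to appear
FIELDS = ('view', 'comment', 'like', 'money')


def parse_counts(meta_text):
    """Map the numbers found in meta_text onto FIELDS, padding with 0."""
    numbers = [int(n) for n in re.findall(r'\d+', meta_text)]
    # Pad missing trailing values with 0 and ignore any extras
    numbers = (numbers + [0] * len(FIELDS))[:len(FIELDS)]
    return dict(zip(FIELDS, numbers))


print(parse_counts(' 1024  56  78 '))     # no tip count present
print(parse_counts(' 1024  56  78  3 '))  # tip count present
```

Because the mapping is still positional, a missing middle value (for example no comment count) would shift the later fields; selecting each number by its own element would be more reliable if the markup allows it.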
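4. Save the data
import csv
# Append one row per article; 'a' keeps rows written by earlier iterations
with open('articlesOfSevenDays.csv', 'a') as resultFile:
    wr = csv.writer(resultFile, dialect='excel')
    wr.writerow([author, title, view, comment, like, money])
Because I ran into an encoding problem, I added the following code (Python 2 only)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')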
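In Python 3 the reload(sys) / setdefaultencoding workaround is neither available nor needed, because str is already Unicode and open() accepts an encoding argument. A minimal sketch of the same save step under that assumption (save_rows is a hypothetical helper name):

```python
import csv


def save_rows(path, rows):
    """Append rows to a UTF-8 encoded CSV file."""
    # newline='' lets the csv module control line endings itself
    with open(path, 'a', newline='', encoding='utf-8') as result_file:
        writer = csv.writer(result_file, dialect='excel')
        writer.writerows(rows)
```

Passing the encoding explicitly avoids depending on the platform's default, which is a common source of encoding errors when writing Chinese text.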
5. Crawl the remaining pages
for i in range(1, 6):
    print "Fetching page {}...".format(i)
    url = 'http://www.jianshu.com/trending/weekly?page={}'.format(i)
    # Repeat the extraction and saving code from the previous steps
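To make the "repeat the previous code" step concrete, here is one way the loop could be organised in Python 3. It is a sketch under assumptions: crawl_pages is a hypothetical name, fetch and parse are callables supplied by the caller (standing in for the request and extraction steps), and the optional delay between requests is not part of the original code.

```python
import time

URL_TEMPLATE = 'http://www.jianshu.com/trending/weekly?page={}'


def crawl_pages(fetch, parse, first=1, last=5, delay=0.0):
    """Fetch and parse pages first..last, returning all rows in order."""
    rows = []
    for page in range(first, last + 1):
        print('Fetching page {}...'.format(page))
        html = fetch(URL_TEMPLATE.format(page))
        rows.extend(parse(html))
        time.sleep(delay)  # optional pause between requests (assumed)
    return rows
```

Passing fetch and parse in as arguments keeps the loop free of network and parsing details, so it can be tried out with stand-in functions before touching the real site.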
Complete code (Python 3)
#!/usr/bin/env python
# coding=utf-8
from urllib.request import Request, urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re
import csv
import os


def getHTML(i):
    """Fetch trending page i and return its <div class="content"> elements."""
    url = 'http://www.jianshu.com/trending/weekly?page={}'.format(i)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
    try:
        request = Request(url=url, headers=headers)
        html = urlopen(request)
        bsObj = BeautifulSoup(html.read(), 'lxml')
        items = bsObj.findAll("div", {"class": "content"})
    except HTTPError as e:
        print(e)
        exit()
    return items


def getArticleInfo(items):
    """Extract [author, title, view, comment, like, money] for each article."""
    articleInfo = []
    for item in items:
        author = item.find("a", {"class": "blue-link"}).get_text()
        title = item.find("a", {"class": "title"}).get_text()
        other = item.find("div", {"class": "meta"}).get_text()
        pattern = re.compile(r'(\d+)')
        content = re.findall(pattern, other)
        view = content[0]
        comment = content[1]
        like = content[2]
        money = content[3] if (len(content) == 4) else 0  # not very rigorous
        articleInfo.append([author, title, view, comment, like, money])
    return articleInfo


# Renamed from dir to avoid shadowing the built-in dir()
outDir = "../jianshu/"
if not os.path.exists(outDir):
    os.makedirs(outDir)

# newline='' is what the csv module expects for files it writes to
csvFile = open("../jianshu/jianshuSevenDaysArticles.csv", "wt", newline='', encoding='utf-8')
writer = csv.writer(csvFile)
writer.writerow(("author", "title", "view", "comment", "like", "money"))
try:
    for i in range(1, 6):
        items = getHTML(i)
        articleInfo = getArticleInfo(items)
        for item in articleInfo:
            writer.writerow(item)
finally:
    csvFile.close()
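Scraping results
Summary
- My page-parsing skills are weak; next I need to study regular expressions, BeautifulSoup, and lxml
- The encoding problem I ran into still needs further study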