Weibo Crawler

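A year ago I started a WeChat official account where I shared each day's newest iOS technical articles from Weibo in the form of notes. I kept it up for a year, but I have been busy with a startup lately and have not updated it for quite a few days. When readers asked about it, I put them off by saying I was too busy.

As more and more people asked, I began to think about how to generate these notes automatically. I had been learning Python not long before, so I decided to solve the problem with a crawler.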

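This post covers the following:

  1. Analyze the Weibo site and find the URL that returns a given user's weibo data.
  2. Fetch the weibo content through that URL: the post text, repost count, time, and so on.
  3. Clean the data. Here we keep only the previous day's weibo with a repost count >= 20.
  4. Generate a markdown file named by date and save it.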

Analyzing the Weibo site

The site we crawl is:

http://weibo.cn/

To get a particular user's weibo, for example those of @逻辑思维:

http://weibo.cn/u/1853923717?page=1
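
The `page` query parameter is what lets the crawler walk through older posts one page at a time. A minimal sketch of building these URLs follows; `build_user_url` is a hypothetical helper name, and the URL pattern is simply the one shown above.

```python
from urllib.parse import urlencode

BASE_URL = "http://weibo.cn/u/"  # URL pattern taken from the example above


def build_user_url(user_id: str, page: int = 1) -> str:
    """Return the weibo.cn URL for one page of a user's posts."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return BASE_URL + user_id + "?" + urlencode({"page": page})


print(build_user_url("1853923717", 1))
```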

Fetching the weibo content

The code below fetches the HTML source of page 1 of @逻辑思维's weibo:

from urllib.request import Request
from urllib.request import urlopen

url = 'http://weibo.cn/u/1853923717?page=1'
# Browser-like request headers; the Cookie comes from your own logged-in session
header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Upgrade-Insecure-Requests": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4,ja;q=0.2",
    "Cookie": "put your Cookie value here"
}
request = Request(url, None, header)
response = urlopen(request, timeout=1000)  # timeout is in seconds
# read() returns bytes; decode them (assuming the page is UTF-8) to get readable HTML
print(response.read().decode('utf-8'))
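
Pasting the Cookie into the script makes it easy to leak by accident. One possible alternative, sketched below, reads it from an environment variable and builds the request without sending it. The variable name `WEIBO_COOKIE`, the helper `build_request`, and the shortened header set are assumptions made for this sketch, not part of the original script.

```python
import os
from urllib.request import Request

# A shortened header set for illustration; the full one appears above.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0)",
    "Accept-Language": "zh-CN,zh;q=0.8",
}


def build_request(url: str, cookie: str) -> Request:
    """Build (but do not send) a request that carries the given Cookie."""
    headers = dict(DEFAULT_HEADERS)
    headers["Cookie"] = cookie
    return Request(url, None, headers)


# WEIBO_COOKIE is a hypothetical variable name chosen for this sketch.
cookie = os.environ.get("WEIBO_COOKIE", "")
request = build_request("http://weibo.cn/u/1853923717?page=1", cookie)
print(request.full_url, bool(request.get_header("Cookie")))
```

The resulting `request` can be passed to `urlopen` exactly as in the code above.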

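From the page above, we need to find the following data:

  • Weibo content (including the content of the reposted weibo)
  • Weibo creation time
  • Weibo repost count
  • Weibo URL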

Getting everything in a single weibo

"Everything in a single weibo" here means the post text, the repost count, the post time, and so on.

[Image: Weibo crawler]

The code:

from bs4 import BeautifulSoup
...
bsObj = BeautifulSoup(response.read(), "html.parser")
# Each weibo on the page sits in a <div class="c">
cardListElement = bsObj.find_all("div", {"class": "c"})
for cardElement in cardListElement:
    if cardElement.get('id') is None:  # skip divs that are not weibo posts
        continue
    if cardElement.find("span", {"class": "kt"}) is not None:  # skip pinned posts
        continue
    print('cardElement:', cardElement)

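Getting the weibo text

The weibo text here includes both the text of the post itself and the text of the weibo it reposts.

How to tell whether a post contains a reposted weibo: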

import re

weiboContent = ""  # weibo text
# A '转发理由:' (repost reason) label means the post reposts another weibo
haveRetweetWeibo = len(cardElement.find_all(string='转发理由:')) != 0
if haveRetweetWeibo:  # the post reposts another weibo
    # Everything before the "attitude" link in the same block is the repost reason
    attitudeLink = cardElement.find_all('a', {'href': re.compile(r'http://weibo\.cn/attitude/')})[0]
    for repostReasonItem in attitudeLink.previous_siblings:
        # previous_siblings runs backwards, so prepend to restore reading order;
        # .string is None for tags without a single text child, hence the fallback
        weiboContent = (repostReasonItem.string or '') + weiboContent
    weiboContent = weiboContent[5:]  # drop the 5 characters of '转发理由:'
    weiboContent = weiboContent + '\n\n' + str(cardElement.find("span", {"class": "ctt"}))
else:  # a plain post without a reposted weibo
    weiboContent = str(cardElement.find("span", {"class": "ctt"}))
print('weiboContent:', weiboContent)
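
The reverse concatenation is easy to get wrong, so here is a minimal, library-free sketch of the idea. The list `backwards` merely stands in for what BeautifulSoup's `previous_siblings` yields (nearest sibling first); the sample strings and the helper name `join_previous` are assumptions made for illustration.

```python
REASON_LABEL = "转发理由:"  # label that precedes the reposter's own comment


def join_previous(pieces_nearest_first):
    """Rebuild reading order from pieces listed nearest-first, then drop the label."""
    text = ""
    for piece in pieces_nearest_first:
        text = piece + text  # prepend, because iteration runs backwards
    if text.startswith(REASON_LABEL):
        text = text[len(REASON_LABEL):]
    return text


# Reading order on the page would be: label, "Great ", "article"
backwards = ["article", "Great ", REASON_LABEL]
print(join_previous(backwards))
```

Checking for the label with `startswith` instead of always cutting five characters keeps the text intact if the label happens to be missing.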

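Weibo creation time

...
timeLongStr = cardElement.find("span", {"class": "ct"}).get_text()
# The time is the part before the first non-breaking space (\xa0)
timeArray = timeLongStr.split('\xa0')
timeStr = timeArray[0]

The value obtained here looks like: 今天 10:23 (今天 means "today").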
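
To make the split concrete, here is a small sketch run on a made-up string. The sample text, including its trailing source label, is an assumption about what the `ct` span contains; only the `\xa0` separator comes from the code above, and `extract_time_text` is a hypothetical helper name.

```python
def extract_time_text(ct_text: str) -> str:
    """Return the part of the `ct` span text that precedes the first \\xa0."""
    return ct_text.split("\xa0")[0].strip()


sample = "今天 10:23\xa0来自微博 weibo.com"  # hypothetical span text
print(extract_time_text(sample))
```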

Weibo repost count

The repost count covers both the repost count of the post itself and the repost count of the weibo it reposts.

The code:

import re

# Repost count of the post itself
repostCountStr = '0'
repostItems = cardElement.find_all('a', {'href': re.compile(r'http://weibo\.cn/repost/')})  # find the repost link
for repost in repostItems:
    repostCountLongStr = repost.get_text()  # e.g. '转发[100]'
    repostCountStr = repostCountLongStr[3:len(repostCountLongStr) - 1]  # e.g. '100'
print('repost count:', repostCountStr, '\n\n')

# Repost count of the reposted weibo
repostWeiboRepostCountStr = '0'
repostWeiboRepostItems = cardElement.find_all("span", {"class": "cmt"})
for i, cmt in enumerate(repostWeiboRepostItems):
    if i == 2:  # the third "cmt" span holds the reposted weibo's repost count
        repostWeiboRepostCountLongStr = cmt.get_text()  # e.g. '原文转发[14]'
        repostWeiboRepostCountStr = repostWeiboRepostCountLongStr[5:len(repostWeiboRepostCountLongStr) - 1]  # e.g. '14'
print('repost count of the reposted weibo:', repostWeiboRepostCountStr, '\n\n')
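
Slicing by fixed offsets works as long as the label length never changes. As an alternative sketch (not the approach used in this post), a regular expression can pull the number out of either label; `parse_count` is a hypothetical helper name.

```python
import re

COUNT_PATTERN = re.compile(r"\[(\d+)\]")  # digits inside square brackets


def parse_count(label: str) -> int:
    """Extract the number from text such as '转发[100]'; return 0 if absent."""
    match = COUNT_PATTERN.search(label)
    return int(match.group(1)) if match else 0


print(parse_count("转发[100]"))     # 100
print(parse_count("原文转发[14]"))  # 14
```

Returning an `int` directly also removes the later `int(...)` conversions when the two counts are added together.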

Weibo URL

weiboURL = ""
# The link with class "cc" (the comment link) carries the weibo's URL
for ccA in cardElement.find_all("a", {"class": "cc"}):
    weiboURL = ccA["href"]

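Cleaning the data

Repost count >= 20

We are after technical articles, so not every weibo is worth collecting. I set one condition here: keep only weibo whose repost count is >= 20.

The repost count is: the post's own repost count + the repost count of the weibo it reposts.

We already obtained both numbers above, so a simple check is enough.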

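Posted on the previous day

We know the current time, so we also know when yesterday started and ended; we just convert both to timestamps. The code above gave us the weibo's creation time, which we convert to a timestamp as well.

The condition is therefore: yesterday's start timestamp <= creation timestamp <= yesterday's end timestamp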
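
Putting the two conditions together, a minimal sketch of the filter could look like this. The function name `should_keep` and its parameters are hypothetical; the threshold of 20 and the inclusive time window are the ones described above.

```python
def should_keep(created_ts: int, own_reposts: int, original_reposts: int,
                start_ts: int, end_ts: int, min_reposts: int = 20) -> bool:
    """Keep a weibo posted inside [start_ts, end_ts] with enough total reposts."""
    in_window = start_ts <= created_ts <= end_ts
    popular = own_reposts + original_reposts >= min_reposts
    return in_window and popular


# A post inside the window with 6 + 14 = 20 reposts passes the filter.
print(should_keep(1500, 6, 14, start_ts=1000, end_ts=2000))  # True
# The same post with only 19 reposts in total does not.
print(should_keep(1500, 5, 14, start_ts=1000, end_ts=2000))  # False
```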

Getting yesterday's start and end timestamps:

import time
import datetime

yesterdayCurrentTime = datetime.datetime.now() - datetime.timedelta(days=1)  # this moment, one day ago
yesterdayStructTime = yesterdayCurrentTime.timetuple()
yesterdayYear = yesterdayStructTime.tm_year  # yesterday's year
yesterdayMonth = yesterdayStructTime.tm_mon  # yesterday's month
yesterdayDay = yesterdayStructTime.tm_mday  # yesterday's day of the month

# Timestamp of yesterday's start
yesterdayStartTime = "%d-%d-%d 00:00:00" % (yesterdayYear, yesterdayMonth, yesterdayDay)
yesterdayStartTimeArray = time.strptime(yesterdayStartTime, "%Y-%m-%d %H:%M:%S")
yesterdayStartTimeStamp = int(time.mktime(yesterdayStartTimeArray))

# Timestamp of yesterday's end
yesterdayEndTime = "%d-%d-%d 23:59:59" % (yesterdayYear, yesterdayMonth, yesterdayDay)
yesterdayEndTimeArray = time.strptime(yesterdayEndTime, "%Y-%m-%d %H:%M:%S")
yesterdayEndTimeStamp = int(time.mktime(yesterdayEndTimeArray))
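
The same bounds can be computed without formatting and re-parsing strings. The sketch below is one possible alternative using `datetime.combine`; `day_bounds` is a hypothetical helper name, and `now` is passed in explicitly so the function is easy to test.

```python
from datetime import datetime, time, timedelta


def day_bounds(now: datetime, days_ago: int = 1):
    """Return the first and last second of the day `days_ago` days before `now`."""
    day = (now - timedelta(days=days_ago)).date()
    start = datetime.combine(day, time.min)        # 00:00:00
    end = datetime.combine(day, time(23, 59, 59))  # 23:59:59
    return start, end


start, end = day_bounds(datetime(2016, 11, 2, 10, 23))
print(start, end)
# .timestamp() treats naive datetimes as local time, like time.mktime() does
print(int(start.timestamp()), int(end.timestamp()))
```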

Timestamp of the weibo's creation time:

timeLongStr = cardElement.find("span", {"class": "ct"}).get_text()
timeArray = timeLongStr.split('\xa0')
timeStr = timeArray[0]
separatorArray = timeStr.split('-')
if len(separatorArray) == 1:
    # No '-' means the year is missing (a month/day style value), so prepend one.
    # Yesterday's year is used because only yesterday's weibo are kept.
    timeStr = str(yesterdayYear) + '年' + timeStr
timeStr = timeStr.replace('年', '-')
timeStr = timeStr.replace('月', '-')
timeStr = timeStr.replace('日', '')
colonArray = timeStr.split(':')
if len(colonArray) == 2:
    timeStr = timeStr + ':00'  # add the missing seconds
try:
    weiboCreateTimeArray = time.strptime(timeStr, "%Y-%m-%d %H:%M:%S")
    createdTimestamp = int(time.mktime(weiboCreateTimeArray))
    print("timeStr:", timeStr)
    print('createdTimestamp:', str(createdTimestamp))
    print('yesterdayStartTimeStamp:', str(yesterdayStartTimeStamp), ' yesterdayEndTimeStamp:', str(yesterdayEndTimeStamp))
except (ValueError, OverflowError) as err:
    # Relative values such as '今天 10:23' cannot be parsed and end up here
    print("timeStr:", timeStr)
    print(err)
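
The code above lets values such as 今天 10:23 fail and relies on the exception handler. A more explicit parser is sketched below as an alternative. It handles three shapes: `今天 HH:MM` (seen in the output earlier), `MM月DD日 HH:MM` (inferred from the `replace` calls above, so treat it as an assumption), and the full `YYYY-MM-DD HH:MM:SS` form. The name `parse_weibo_time` is hypothetical.

```python
import re
from datetime import datetime


def parse_weibo_time(text: str, now: datetime) -> datetime:
    """Parse '今天 HH:MM', 'MM月DD日 HH:MM' or 'YYYY-MM-DD HH:MM:SS'."""
    text = text.strip()
    today = re.fullmatch(r"今天 (\d{1,2}):(\d{2})", text)
    if today:
        hour, minute = map(int, today.groups())
        return now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    short = re.fullmatch(r"(\d{1,2})月(\d{1,2})日 (\d{1,2}):(\d{2})", text)
    if short:
        month, day, hour, minute = map(int, short.groups())
        year = now.year
        # Assumption: a year-less date that would lie in the future
        # actually belongs to the previous year (e.g. seen on January 1).
        if (month, day, hour, minute) > (now.month, now.day, now.hour, now.minute):
            year -= 1
        return datetime(year, month, day, hour, minute)
    # Anything else must be a full timestamp; strptime raises ValueError otherwise
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


now = datetime(2016, 11, 2, 12, 0)
print(parse_weibo_time("今天 10:23", now))
print(parse_weibo_time("11月01日 10:23", now))
```

Calling `.timestamp()` on the result gives a value comparable with the start and end timestamps computed earlier.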

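Tidying up the weibo URL

The URL we printed looks like this:

http://weibo.cn/comment/EhdDavwxC?uid=1853923717&rl=0#cmtfrm

It is not very pretty, but the mobile site http://m.weibo.cn has a much cleaner form:

http://m.weibo.cn/1853923717/EhdDavwxC

The following code does the conversion: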

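import re

# http://weibo.cn/comment/<id>?uid=<uid>&rl=0#cmtfrm -> http://m.weibo.cn/<uid>/<id>
url1 = weiboURL.replace('weibo.cn', 'm.weibo.cn')
url1Array = url1.split('=')
uidArray = url1Array[1].split('&')
uid = uidArray[0]  # the value of the uid query parameter
weiboURL = url1.replace('comment', uid)
weiboURL = re.sub(r'\?uid[\s\S]*', '', weiboURL)  # drop the query string and fragment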
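
String replacement works for the URL shown above but depends on the exact order of the query parameters. As an alternative sketch, the standard library's URL parser can pick the pieces apart by name; `to_mobile_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit


def to_mobile_url(comment_url: str) -> str:
    """Turn a weibo.cn comment URL into the shorter m.weibo.cn form."""
    parts = urlsplit(comment_url)
    uid = parse_qs(parts.query)["uid"][0]     # e.g. '1853923717'
    weibo_id = parts.path.rsplit("/", 1)[-1]  # e.g. 'EhdDavwxC'
    return "http://m.weibo.cn/%s/%s" % (uid, weibo_id)


print(to_mobile_url("http://weibo.cn/comment/EhdDavwxC?uid=1853923717&rl=0#cmtfrm"))
# http://m.weibo.cn/1853923717/EhdDavwxC
```

A URL without a `uid` parameter raises `KeyError` here, which makes a malformed link visible instead of silently producing a wrong address.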

Generating the markdown file

That is, creating the .md file:

import os

mdContent = '....'
fileName = str(yesterdayYear) + "-" + str(yesterdayMonth) + '-' + str(yesterdayDay)
os.makedirs('./notes', exist_ok=True)  # make sure the output directory exists
filePath = './notes/' + fileName + '.md'
print('filePath:', filePath, 'mdContent:', mdContent)
# The with-statement closes the file even if write() fails
with open(filePath, mode='w', encoding="UTF-8") as f:
    f.write(mdContent)
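
The content written to the file is a title line followed by one section per weibo. A minimal sketch of assembling that text is shown below. The function name `render_notes`, the tuple layout of `entries`, and the sample values are assumptions for illustration; the overall layout (a `#` title, then a `##` section, the link in angle brackets, and the time) mirrors the full script in the next section.

```python
def render_notes(title: str, entries) -> str:
    """Build the markdown text: one '# ' title plus a '## ' section per weibo."""
    lines = ["# " + title, ""]
    for index, (content, url, created) in enumerate(entries, start=1):
        lines += ["## %d. %s" % (index, content), "", "<%s>" % url, "", created, ""]
    return "\n".join(lines) + "\n"


sample_entries = [
    ("Hello", "http://m.weibo.cn/1853923717/EhdDavwxC", "2016-11-01 10:23:00"),
]
print(render_notes("2016-11-1", sample_entries))
```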

Source code

from urllib.request import urlopen
from urllib.request import Request
from bs4 import BeautifulSoup
import os
import re
import time
import datetime


def getWeibo(userWeiboID, pageIndex):
    global weiboIndex
    global mdContent
    url = 'http://weibo.cn/u/%s?page=%d' % (userWeiboID, pageIndex)
    header = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
        "Connection": "keep-alive",
        "Cache-Control": "max-age=0",
        "Upgrade-Insecure-Requests": "1",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4,ja;q=0.2",
        "Cookie": "put your Cookie value here"
    }
    request = Request(url, None, header)
    response = urlopen(request, timeout=1000)
    bsObj = BeautifulSoup(response.read(), "html.parser")
    cardListElement = bsObj.find_all("div", {"class": "c"})
    print(url)
    foundWeibo = False  # whether this page contains any weibo at all
    for cardElement in cardListElement:
        if cardElement.get('id') is None:  # skip divs that are not weibo posts
            continue
        foundWeibo = True
        if cardElement.find("span", {"class": "kt"}) is not None:  # skip pinned posts
            continue

        # Creation time
        timeLongStr = cardElement.find("span", {"class": "ct"}).get_text()
        timeArray = timeLongStr.split('\xa0')
        timeStr = timeArray[0]
        separatorArray = timeStr.split('-')
        if len(separatorArray) == 1:
            # The year is missing, so prepend the year of the day being crawled
            timeStr = str(yesterdayYear) + '年' + timeStr
        timeStr = timeStr.replace('年', '-')
        timeStr = timeStr.replace('月', '-')
        timeStr = timeStr.replace('日', '')
        colonArray = timeStr.split(':')
        if len(colonArray) == 2:
            timeStr = timeStr + ':00'
        try:
            weiboCreateTimeArray = time.strptime(timeStr, "%Y-%m-%d %H:%M:%S")
            createdTimestamp = int(time.mktime(weiboCreateTimeArray))
            print("timeStr:", timeStr)
            print('createdTimestamp:', str(createdTimestamp))
            print('yesterdayStartTimeStamp:', str(yesterdayStartTimeStamp), ' yesterdayEndTimeStamp:', str(yesterdayEndTimeStamp))
            if createdTimestamp > yesterdayEndTimeStamp:  # newer than the target day
                continue
            if createdTimestamp < yesterdayStartTimeStamp:  # older: stop crawling this user
                return
        except (ValueError, OverflowError) as err:
            # Relative values such as '今天 10:23' cannot be parsed; skip them
            print("timeStr:", timeStr)
            print(err)
            continue

        # Repost count of the post itself
        repostCountStr = '0'
        repostItems = cardElement.find_all('a', {'href': re.compile(r'http://weibo\.cn/repost/')})
        for repost in repostItems:
            repostCountLongStr = repost.get_text()
            repostCountStr = repostCountLongStr[3:len(repostCountLongStr) - 1]
        print('repost count:', repostCountStr, '\n\n')

        # Repost count of the reposted weibo
        repostWeiboRepostCountStr = '0'
        for i, cmt in enumerate(cardElement.find_all("span", {"class": "cmt"})):
            if i == 2:
                repostWeiboRepostCountLongStr = cmt.get_text()
                repostWeiboRepostCountStr = repostWeiboRepostCountLongStr[5:len(repostWeiboRepostCountLongStr) - 1]
        print('repost count of the reposted weibo:', repostWeiboRepostCountStr, '\n\n')
        if int(repostCountStr) + int(repostWeiboRepostCountStr) < 20:
            continue

        # Weibo text
        weiboContent = ""
        haveRetweetWeibo = len(cardElement.find_all(string='转发理由:')) != 0
        if haveRetweetWeibo:  # the post reposts another weibo
            attitudeLink = cardElement.find_all('a', {'href': re.compile(r'http://weibo\.cn/attitude/')})[0]
            for repostReasonItem in attitudeLink.previous_siblings:  # iterates backwards, so prepend
                weiboContent = (repostReasonItem.string or '') + weiboContent
            weiboContent = weiboContent[5:]
            weiboContent = weiboContent + '\n\n' + str(cardElement.find("span", {"class": "ctt"}))
        else:  # a plain post without a reposted weibo
            weiboContent = str(cardElement.find("span", {"class": "ctt"}))

        # Link to the weibo, converted to the m.weibo.cn form
        weiboURL = ""
        for ccA in cardElement.find_all("a", {"class": "cc"}):
            weiboURL = ccA["href"]
        url1 = weiboURL.replace('weibo.cn', 'm.weibo.cn')
        url1Array = url1.split('=')
        uidArray = url1Array[1].split('&')
        uid = uidArray[0]
        weiboURL = url1.replace('comment', uid)
        weiboURL = re.sub(r'\?uid[\s\S]*', '', weiboURL)

        weiboIndex += 1
        mdContent = mdContent + '## ' + str(weiboIndex) + '.' + weiboContent + '\n\n' + '<' + weiboURL + '>\n\n' + timeStr + '\n\n'

    if not foundWeibo:  # an empty page means there is nothing more to crawl
        return
    pageIndex += 1
    getWeibo(userWeiboID, pageIndex)


for day in range(1):  # range(1) crawls yesterday only
    yesterdayCurrentTime = datetime.datetime.now() - datetime.timedelta(days=day + 1)
    yesterdayStructTime = yesterdayCurrentTime.timetuple()
    yesterdayYear = yesterdayStructTime.tm_year
    yesterdayMonth = yesterdayStructTime.tm_mon
    yesterdayDay = yesterdayStructTime.tm_mday
    yesterdayStartTime = "%d-%d-%d 00:00:00" % (yesterdayYear, yesterdayMonth, yesterdayDay)
    yesterdayStartTimeArray = time.strptime(yesterdayStartTime, "%Y-%m-%d %H:%M:%S")
    yesterdayStartTimeStamp = int(time.mktime(yesterdayStartTimeArray))
    yesterdayEndTime = "%d-%d-%d 23:59:59" % (yesterdayYear, yesterdayMonth, yesterdayDay)
    yesterdayEndTimeArray = time.strptime(yesterdayEndTime, "%Y-%m-%d %H:%M:%S")
    yesterdayEndTimeStamp = int(time.mktime(yesterdayEndTimeArray))
    weiboIndex = 0
    fileName = str(yesterdayYear) + "-" + str(yesterdayMonth) + '-' + str(yesterdayDay)
    mdContent = '# ' + fileName + '\n\n'
    # Weibo accounts to crawl: display name -> user ID
    userWeiboDic = {
        "唐巧_boy": "1708947107",
        "onevcat": "2210132365",
        "iOS程序犭袁": "1692391497",
        "没故事的卓同学": "1926303682",
        "叶孤城___": "1438670852",
        "我就叫Sunny怎么了": "1364395395",
        "nixzhu": "2076580237",
        "ibireme": "2477831984",
        "bang": "1642409481",
        "KITTEN-YANG": "2854163804",
        "StackOverflowError": "1765732340",
        "图拉鼎": "1846569133",
        "lzwjava": "1695406573",
        "董宝君_iOS": "3026163601",
        "移动开发前线": "5861126740"
    }
    for name, weiboID in userWeiboDic.items():
        print("------- crawling the weibo of %s ------" % name)
        getWeibo(weiboID, 1)
    os.makedirs('./notes', exist_ok=True)  # make sure the output directory exists
    filePath = './notes/' + fileName + '.md'
    print('filePath:', filePath, 'mdContent:', mdContent)
    with open(filePath, mode='w', encoding="UTF-8") as f:
        f.write(mdContent)
print('Crawling finished')

The crawl result looks like this:

[Image: Weibo crawler]

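To be continued ...

Planned features:

  • Store the data in a database
  • Generate an email template
  • Scheduled jobs: crawl on a schedule and send emails on a schedule
  • Read the Cookie from Chrome automatically
  • Use more advanced third-party libraries
  • Use the Scrapy crawler framework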