Crawling Tech Experts' Weibo Posts Daily and Generating Markdown

[Image: Weibo crawler]

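A year ago I started a public account (公众号) where I shared, in the form of daily notes, the latest iOS technical articles posted on Weibo. I kept that up for a year, but since I started working on a startup I have not updated it for quite a few days. Readers recently asked me about it, and I brushed them off by saying I was too busy.

As more and more people asked, I started thinking about how to generate these notes automatically. I had been learning Python a while back, so I decided to solve the problem with a crawler.

This article covers the following:

Analyze the Weibo site and find the URL that returns a given user's posts.
Fetch the post data from that URL: post text, repost count, creation time, and so on.
Clean the data. Here we keep only posts from the previous day with a repost count >= 20.
Generate a Markdown file named by date and save it.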

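Analyzing the Weibo site

The site we crawl is:

http://weibo.cn/

To get a particular user's posts, for example those of @逻辑思维:

http://weibo.cn/u/1853923717?page=1

Fetching the post content

The code below fetches the HTML source of page 1 of @逻辑思维's posts: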

from urllib.request import Request
from urllib.request import urlopen
url = 'http://weibo.cn/u/1853923717?page=1'
header = {
        "User-Agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
        "Connection" : "keep-alive",
        "Cache-Control" : "max-age=0",
        "Upgrade-Insecure-Requests" : "1",
        "Accept" :"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language":"zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4,ja;q=0.2",
        "Cookie": "paste your Cookie value here" # copy it from a logged-in browser session
}
    
request = Request(url,None,header)
response = urlopen(request,timeout=1000) # the timeout is given in seconds

print(response.read()) # raw bytes of the page source
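
As a variation on the request above, the sketch below separates building the request from sending it, so the URL and headers can be checked without touching the network. The helper names `build_timeline_url` and `build_request`, the shortened header set, and the `cookie` parameter are assumptions made for illustration; they are not part of the original script.

```python
from urllib.request import Request

def build_timeline_url(user_id, page=1):
    """Return the weibo.cn URL for one page of a user's posts."""
    return 'http://weibo.cn/u/%s?page=%d' % (user_id, page)

def build_request(user_id, page, cookie):
    """Build (but do not send) a request for one page of a user's posts."""
    headers = {
        "User-Agent": "Mozilla/5.0",  # shortened for the example
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Cookie": cookie,  # value copied from a logged-in browser session
    }
    return Request(build_timeline_url(user_id, page), None, headers)

request = build_request('1853923717', 1, 'your Cookie value here')
print(request.full_url)
```

Passing the resulting object to `urlopen` would then send it, exactly as in the snippet above.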

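From the page above we need to extract the following:

Post content (including the content of the reposted post)
Post creation time
Post repost count
Post URL

Getting the full content of a single post

The full content of a single post here includes the post text, the repost count, the post time, and so on.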

The code:

from bs4 import BeautifulSoup
...
bsObj = BeautifulSoup(response.read(),"html.parser")
cardListElement = bsObj.find_all("div",{"class":"c"})
for cardElement in cardListElement:
    if cardElement.get('id') is None: # skip elements that are not posts
        continue
    if cardElement.find("span",{"class":"kt"}) is not None: # skip pinned posts
        continue
    print('cardElement:',cardElement)

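Getting the post text

The post text here covers both the text of the post itself and the text of the post it reposts.

How to tell whether a post contains a repost: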

import re

weiboContent = "" # post content
haveRetweetWeibo = False # whether the post contains a repost
if len(cardElement.find_all(text='转发理由:')) != 0: # a '转发理由:' (repost reason) label means there is a repost
    haveRetweetWeibo = True
if haveRetweetWeibo: # the post contains a repost
    repostReasonItems = cardElement.find_all('a',{'href': re.compile(r'http://weibo\.cn/attitude/')})[0].previous_siblings
    for repostReasonItem in repostReasonItems: # siblings come in reverse order, so prepend
        weiboContent = (repostReasonItem.string or '') + weiboContent # .string is None for tags with nested children
    weiboContent = weiboContent[5:] # strip the 5-character prefix '转发理由:' once, after the loop
    weiboContent = weiboContent + '\n\n' + str(cardElement.find("span",{"class":"ctt"}))
else: # no repost
    weiboContent = str(cardElement.find("span",{"class":"ctt"}))
print('weiboContent:', weiboContent)

Post creation time

...
timeLongStr = cardElement.find("span",{"class":"ct"}).get_text()
timeArray = timeLongStr.split('\xa0')
timeStr = timeArray[0]

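The value obtained here looks like 今天 10:23 ("today 10:23").

Post repost count

The repost count covers both the repost count of the post itself and the repost count of the post it reposts.

The code: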

import re

# repost count of the post itself
repostCountStr = '0'
repostItems = cardElement.find_all('a',{'href': re.compile(r'http://weibo\.cn/repost/')}) # find the repost link
for repost in repostItems:
    repostCountLongStr = repost.get_text() # e.g. '转发[100]'
    repostCountStr = repostCountLongStr[3:len(repostCountLongStr) - 1] # keeps '100'
print('repost count:',repostCountStr,'\n\n')

# repost count of the reposted post
repostWeiboRepostCountStr = '0'
repostWeiboRepostItems = cardElement.find_all("span",{"class":"cmt"})
for i,cmt in enumerate(repostWeiboRepostItems):
    if i == 2: # the third 'cmt' span holds the repost count
        repostWeiboRepostCountLongStr = cmt.get_text() # e.g. '原文转发[14]'
        repostWeiboRepostCountStr = repostWeiboRepostCountLongStr[5:len(repostWeiboRepostCountLongStr) - 1] # keeps '14'
print('repost count of the reposted post:',repostWeiboRepostCountStr,'\n\n')
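
The slicing above depends on the exact length of the label prefix. A regular expression is one possible alternative that reads the number between the square brackets regardless of the prefix. The function name `parse_repost_count` is hypothetical, and returning 0 for a label without a bracketed number is an assumption of this sketch.

```python
import re

def parse_repost_count(label):
    """Extract the number from labels such as '转发[100]' or '原文转发[14]'."""
    match = re.search(r'\[(\d+)\]', label)
    # Assumption: a label without a bracketed number counts as zero reposts
    return int(match.group(1)) if match else 0

print(parse_repost_count('转发[100]'))
print(parse_repost_count('原文转发[14]'))
```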

Post URL

weiboURL = ""
for ccA in cardElement.find_all("a",{"class":"cc"}):
    weiboURL = ccA["href"]

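Cleaning the data

Repost count >= 20

We are after technical articles, so not every post is a target. I set one condition here: keep only posts whose repost count is >= 20.

The repost count is: the post's own repost count + the repost count of the post it reposts

We already have both numbers from the steps above, so a simple check is enough.

Posted on the previous day

Knowing the current time, we also know when yesterday started and ended, and both can be converted to timestamps. The code above already gives us the post's creation time, which we convert to a timestamp as well.

The condition is therefore: yesterday's start timestamp <= creation timestamp <= yesterday's end timestamp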
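
The two conditions can be combined into a single predicate. This is only a sketch of the rule described above; the function name `is_wanted` and its parameter names are made up for the example, and the threshold of 20 is the one stated in the text.

```python
def is_wanted(created_ts, repost_count, reposted_repost_count,
              day_start_ts, day_end_ts, min_reposts=20):
    """Return True if a post falls inside the day window and is reposted enough."""
    # Total = the post's own reposts + the reposts of the post it reposts
    total_reposts = repost_count + reposted_repost_count
    in_window = day_start_ts <= created_ts <= day_end_ts
    return in_window and total_reposts >= min_reposts

# A post created inside the window with 15 + 6 reposts passes the filter
print(is_wanted(1500, 15, 6, 1000, 2000))
```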

Getting yesterday's start and end timestamps:

import time
import datetime
yesterdayCurrentTime = datetime.datetime.now() - datetime.timedelta(days=1) # this moment, one day ago
yesterdayStructTime = yesterdayCurrentTime.timetuple()
yesterdayYear = yesterdayStructTime.tm_year # yesterday's year
yesterdayMonth = yesterdayStructTime.tm_mon # yesterday's month
yesterdayDay = yesterdayStructTime.tm_mday # yesterday's day of the month
# timestamp of the start of yesterday
yesterdayStartTime = "%d-%d-%d 00:00:00"%(yesterdayYear, yesterdayMonth, yesterdayDay)
yesterdayStartTimeArray = time.strptime(yesterdayStartTime, "%Y-%m-%d %H:%M:%S")
yesterdayStartTimeStamp = int(time.mktime(yesterdayStartTimeArray))
# timestamp of the end of yesterday
yesterdayEndTime = "%d-%d-%d 23:59:59"%(yesterdayYear, yesterdayMonth, yesterdayDay)
yesterdayEndTimeArray = time.strptime(yesterdayEndTime, "%Y-%m-%d %H:%M:%S")
yesterdayEndTimeStamp = int(time.mktime(yesterdayEndTimeArray))
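
The same bounds can also be computed without formatting and re-parsing strings. The sketch below uses `datetime.combine`; the function name `yesterday_bounds` and the choice to pass `now` in as a parameter (which makes the function easy to check) are assumptions of this example, not the original approach.

```python
from datetime import datetime, timedelta, time as dt_time

def yesterday_bounds(now):
    """Return (start, end) datetimes for the day before `now`."""
    yesterday = (now - timedelta(days=1)).date()
    start = datetime.combine(yesterday, dt_time(0, 0, 0))    # 00:00:00
    end = datetime.combine(yesterday, dt_time(23, 59, 59))   # 23:59:59
    return start, end

start, end = yesterday_bounds(datetime.now())
# Naive datetimes are interpreted as local time, like time.mktime does
start_ts, end_ts = int(start.timestamp()), int(end.timestamp())
print(start, end, start_ts, end_ts)
```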

Timestamp of the post's creation time:

timeLongStr = cardElement.find("span",{"class":"ct"}).get_text()
timeArray = timeLongStr.split('\xa0')
timeStr = timeArray[0]
separatorArray = timeStr.split('-')
if len(separatorArray) == 1:
    # No '-' means the label carries no year. We only keep yesterday's posts,
    # so prepend yesterday's year (yesterdayYear, computed above)
    timeStr = str(yesterdayYear) + '年' + timeStr
timeStr = timeStr.replace('年','-')
timeStr = timeStr.replace('月','-')
timeStr = timeStr.replace('日','')
colonArray = timeStr.split(':')
if len(colonArray) == 2:
    timeStr = timeStr + ':00' # add the missing seconds
try:
    weiboCreateTimeArray = time.strptime(timeStr, "%Y-%m-%d %H:%M:%S")

    createdTimestamp = int(time.mktime(weiboCreateTimeArray))
    print("timeStr:",timeStr)
    print('createdTimestamp:',str(createdTimestamp))
    print('yesterdayStartTimeStamp:',str(yesterdayStartTimeStamp),' yesterdayEndTimeStamp:',str(yesterdayEndTimeStamp))
except Exception as err:
    # labels such as '今天 10:23' cannot be parsed and end up here
    print("timeStr:",timeStr)
    print(err)
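
The parsing above relies on string replacement and lets unparseable labels fall into the `except` branch. As a possible alternative, the sketch below matches each label format explicitly. It assumes three formats: `今天 HH:MM`, `MM月DD日 HH:MM`, and `YYYY-MM-DD HH:MM:SS`. The function name `parse_weibo_time`, the handling of `今天`, and the rule that a label without a year refers to its most recent occurrence are all assumptions of this example.

```python
import re
from datetime import datetime

def parse_weibo_time(text, now):
    """Convert a time label into a datetime, or None if it is not recognised."""
    text = text.strip()
    match = re.fullmatch(r'今天 (\d{1,2}):(\d{2})', text)
    if match:
        hour, minute = map(int, match.groups())
        return now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    match = re.fullmatch(r'(\d{1,2})月(\d{1,2})日 (\d{1,2}):(\d{2})', text)
    if match:
        month, day, hour, minute = map(int, match.groups())
        # No year in the label: try this year first, then last year
        for year in (now.year, now.year - 1):
            try:
                candidate = datetime(year, month, day, hour, minute)
            except ValueError:  # e.g. 29 February in a non-leap year
                continue
            if candidate <= now:
                return candidate
        return None
    try:
        return datetime.strptime(text, '%Y-%m-%d %H:%M:%S')
    except ValueError:
        return None

print(parse_weibo_time('11月07日 10:23', datetime(2016, 11, 8, 9, 0)))
```

Calling `.timestamp()` on the result gives a value comparable with the start and end timestamps of yesterday.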

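Tidying up the post URL

The printed URL looks like this:

http://weibo.cn/comment/EhdDavwxC?uid=1853923717&rl=0#cmtfrm

It is not very pretty, but the mobile site http://m.weibo.cn has a much cleaner form:

http://m.weibo.cn/1853923717/EhdDavwxC

The following code does the conversion: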

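import re

url1 = weiboURL.replace('weibo.cn','m.weibo.cn') # switch to the mobile host
url1Array = url1.split('=')
uidArray = url1Array[1].split('&')
uid = uidArray[0] # value of the uid query parameter
weiboURL = url1.replace('comment',uid) # /comment/<id> becomes /<uid>/<id>
weiboURL = re.sub(r'\?uid[\s\S]*','',weiboURL) # drop the query string and fragment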
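
The string replacements above raise an `IndexError` when the URL has no `=` in it. A version built on `urllib.parse` is one way to make the conversion more defensive. The function name `to_mobile_url` and the decision to return the input unchanged when it does not look like a comment URL are assumptions of this sketch.

```python
from urllib.parse import urlsplit, parse_qs

def to_mobile_url(comment_url):
    """Turn a weibo.cn comment URL into the shorter m.weibo.cn form."""
    parts = urlsplit(comment_url)
    uid = parse_qs(parts.query).get('uid', [None])[0]
    segments = [s for s in parts.path.split('/') if s]
    # Expect a path of the form /comment/<post id> plus a uid parameter
    if uid is None or len(segments) != 2 or segments[0] != 'comment':
        return comment_url
    return 'http://m.weibo.cn/%s/%s' % (uid, segments[1])

print(to_mobile_url('http://weibo.cn/comment/EhdDavwxC?uid=1853923717&rl=0#cmtfrm'))
```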

Generating the Markdown file

That is, creating the .md file:

import os

mdContent = '....'
fileName = str(yesterdayYear) + "-" + str(yesterdayMonth) + '-' + str(yesterdayDay)
os.makedirs('./notes', exist_ok=True) # open() fails if the directory does not exist
filePath = './notes/' + fileName + '.md'
print('filePath:',filePath,'mdContent:',mdContent)
with open(filePath,mode='w',encoding="UTF-8") as f: # the with block closes the file
    f.write(mdContent)
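
To make the file layout concrete, the sketch below assembles the Markdown text and writes it to disk. The layout (a `# date` title, then one `## index.content` entry per post followed by the link and the time) mirrors what the full script in this article produces. The function names `render_markdown` and `write_note`, the sample entry, and the use of a temporary directory are assumptions for the example.

```python
import os
import tempfile

def render_markdown(title, entries):
    """Build the note text from (content, url, time_str) tuples."""
    parts = ['# ' + title + '\n\n']
    for index, (content, url, time_str) in enumerate(entries, start=1):
        parts.append('## %d.%s\n\n<%s>\n\n%s\n\n' % (index, content, url, time_str))
    return ''.join(parts)

def write_note(directory, file_name, md_content):
    """Write the note to <directory>/<file_name>.md and return the path."""
    os.makedirs(directory, exist_ok=True)
    file_path = os.path.join(directory, file_name + '.md')
    with open(file_path, mode='w', encoding='UTF-8') as f:
        f.write(md_content)
    return file_path

# Hypothetical sample entry, written to a throwaway directory
md = render_markdown('2016-11-7', [
    ('Example post', 'http://m.weibo.cn/1853923717/EhdDavwxC', '2016-11-07 10:23:00'),
])
path = write_note(os.path.join(tempfile.mkdtemp(), 'notes'), '2016-11-7', md)
print(path)
```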

Source code

from urllib.request import urlopen
from urllib.request import Request
from bs4 import BeautifulSoup
import re
import time
import datetime
def getWeibo(userWeiboID,pageIndex):
    global weiboIndex
    global mdContent
    url = 'http://weibo.cn/u/%s?page=%d'%(userWeiboID,pageIndex)
    header = {
        "User-Agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
        "Connection" : "keep-alive",
        "Cache-Control" : "max-age=0",
        "Upgrade-Insecure-Requests" : "1",
        "Accept" :"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language":"zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4,ja;q=0.2",
        "Cookie": "paste your Cookie value here"
    }
    request = Request(url,None,header)
    response = urlopen(request,timeout=1000)
    bsObj = BeautifulSoup(response.read(),"html.parser")
    cardListElement = bsObj.find_all("div",{"class":"c"})
    print(url)
    for cardElement in cardListElement:
        if cardElement.get('id') is None:
            continue
        if cardElement.find("span",{"class":"kt"}) is not None:
            continue
        # creation time
        timeLongStr = cardElement.find("span",{"class":"ct"}).get_text()
        timeArray = timeLongStr.split('\xa0')
        timeStr = timeArray[0]
        separatorArray = timeStr.split('-')
        if len(separatorArray) == 1:
            # No '-' means the label carries no year. We only keep yesterday's
            # posts, so prepend yesterday's year
            timeStr = str(yesterdayYear) + '年' + timeStr
        timeStr = timeStr.replace('年','-')
        timeStr = timeStr.replace('月','-')
        timeStr = timeStr.replace('日','')
        colonArray = timeStr.split(':')
        if len(colonArray) == 2:
            timeStr = timeStr + ':00'
        try:
            weiboCreateTimeArray = time.strptime(timeStr, "%Y-%m-%d %H:%M:%S")
        
            createdTimestamp = int(time.mktime(weiboCreateTimeArray))
            print("timeStr:",timeStr)
            print('createdTimestamp:',str(createdTimestamp))
            print('yesterdayStartTimeStamp:',str(yesterdayStartTimeStamp),' yesterdayEndTimeStamp:',str(yesterdayEndTimeStamp))
            if createdTimestamp > yesterdayEndTimeStamp:
                continue
            if createdTimestamp < yesterdayStartTimeStamp:
                return
        except Exception as err:
            print("timeStr:",timeStr)
            print(err)
            continue
    
        # repost count of the post itself
        repostCountStr = '0'
        repostItems = cardElement.find_all('a',{'href': re.compile('http://weibo.cn/repost/*')})
        for repost in repostItems:
            repostCountLongStr = repost.get_text()
            repostCountStr = repostCountLongStr[3:len(repostCountLongStr) - 1]
            print('repost count:',repostCountStr,'\n\n')

        # repost count of the reposted post
        repostWeiboRepostCountStr = '0'
        for i,cmt in enumerate(cardElement.find_all("span",{"class":"cmt"})):
            if i == 2:
                repostWeiboRepostCountLongStr = cmt.get_text()
                repostWeiboRepostCountStr = repostWeiboRepostCountLongStr[5:len(repostWeiboRepostCountLongStr) - 1]
                print('repost count of the reposted post:',repostWeiboRepostCountStr,'\n\n')
        if int(repostCountStr) + int(repostWeiboRepostCountStr) < 20:
            continue
        # post content
        weiboContent = ""
        haveRetweetWeibo = False
        if len(cardElement.find_all(text='转发理由:')) != 0:
            haveRetweetWeibo = True
        if haveRetweetWeibo: # the post contains a repost
            repostReasonItems = cardElement.find_all('a',{'href': re.compile('http://weibo.cn/attitude/*')})[0].previous_siblings
            for repostReasonItem in repostReasonItems: # siblings come in reverse order, so prepend
                weiboContent = (repostReasonItem.string or '') + weiboContent # .string is None for tags with nested children
            weiboContent = weiboContent[5:] # strip the prefix '转发理由:'
            weiboContent = weiboContent + '\n\n' + str(cardElement.find("span",{"class":"ctt"}))
        else: # no repost
            weiboContent = str(cardElement.find("span",{"class":"ctt"}))
        # link to the original post
        weiboURL = ""
        for ccA in cardElement.find_all("a",{"class":"cc"}):
            weiboURL = ccA["href"]
        url1 = weiboURL.replace('weibo.cn','m.weibo.cn')
        url1Array = url1.split('=')
        uidArray = url1Array[1].split('&')
        uid = uidArray[0]
        weiboURL = url1.replace('comment',uid)
        weiboURL = re.sub(r'\?uid[\s\S]*','',weiboURL)
        weiboIndex += 1
        mdContent = mdContent + '## ' +str(weiboIndex) + '.' + weiboContent + '\n\n' + '<' + weiboURL + '>\n\n' + timeStr + '\n\n'
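    # Go on to the next page, but stop if this page had no posts at all;
    # otherwise the recursion would never end for a user with no older posts
    if not any(card.get('id') for card in cardListElement):
        return
    getWeibo(userWeiboID,pageIndex + 1)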
for day in range(1):
    structTime = time.localtime(time.time())
    yesterdayCurrentTime = datetime.datetime.now() - datetime.timedelta(days=day+1)
    yesterdayStructTime = yesterdayCurrentTime.timetuple()
    
    yesterdayYear = yesterdayStructTime.tm_year
    yesterdayMonth = yesterdayStructTime.tm_mon
    yesterdayDay = yesterdayStructTime.tm_mday
    
    yesterdayStartTime = "%d-%d-%d 00:00:00"%(yesterdayYear, yesterdayMonth, yesterdayDay)
    yesterdayStartTimeArray = time.strptime(yesterdayStartTime, "%Y-%m-%d %H:%M:%S")
    
    yesterdayStartTimeStamp = int(time.mktime(yesterdayStartTimeArray))
    
    yesterdayEndTime = "%d-%d-%d 23:59:59"%(yesterdayYear, yesterdayMonth, yesterdayDay)
    yesterdayEndTimeArray = time.strptime(yesterdayEndTime, "%Y-%m-%d %H:%M:%S")
    
    yesterdayEndTimeStamp = int(time.mktime(yesterdayEndTimeArray))
    
    weiboIndex = 0
    
    fileName = str(yesterdayYear) + "-" + str(yesterdayMonth) + '-' + str(yesterdayDay)
    mdContent = '# ' + fileName + '\n\n'
    
    userWeiboDic = {
        "唐巧_boy" : "1708947107",
        "onevcat" : "2210132365",
        "iOS程序犭袁" : "1692391497",
        "没故事的卓同学" : "1926303682",
        "叶孤城___" : "1438670852",
        "我就叫Sunny怎么了" : "1364395395",
        "nixzhu" : "2076580237",
        "ibireme" : "2477831984",
        "bang" : "1642409481",
        "KITTEN-YANG" : "2854163804",
        "StackOverflowError" : "1765732340",
        "图拉鼎" : "1846569133",
        "lzwjava" : "1695406573",
        "董宝君_iOS" : "3026163601",
        "移动开发前线" : "5861126740"
    }
    
    for name,weiboID in userWeiboDic.items():
        print("------- Crawling the posts of %s ------"%name)
        getWeibo(weiboID,1)
    filePath = './notes/' + fileName + '.md' # the ./notes directory must already exist
    print('filePath:',filePath,'mdContent:',mdContent)
    with open(filePath,mode='w',encoding="UTF-8") as f:
        f.write(mdContent)
print('Crawl finished')
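
The script above walks through the pages by calling `getWeibo` recursively. A loop with an upper bound on the page count is one possible alternative that cannot run away when a user has no older posts. In the sketch below, `fetch_page` is a hypothetical callable that returns a list of `(created_timestamp, post)` pairs for a page, newest first; it stands in for the download and parsing steps, and `max_pages=50` is an arbitrary assumption.

```python
def collect_posts(fetch_page, start_ts, end_ts, max_pages=50):
    """Collect posts whose timestamp lies within [start_ts, end_ts]."""
    kept = []
    for page in range(1, max_pages + 1):
        posts = fetch_page(page)
        if not posts:            # empty page: nothing older to look at
            break
        for created_ts, post in posts:
            if created_ts > end_ts:    # newer than the window: skip it
                continue
            if created_ts < start_ts:  # older than the window: we are done
                return kept
            kept.append(post)
    return kept

# Fake pages standing in for real downloads
pages = {1: [(300, 'a'), (250, 'b')], 2: [(200, 'c'), (90, 'd')], 3: [(50, 'e')]}
print(collect_posts(lambda page: pages.get(page, []), 100, 260))
```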

The crawl result looks like this:

[Image: Weibo crawler]

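To be continued ...

Planned features:

Store the data in a database
Generate an email template
Scheduled jobs: crawl on a schedule and send the emails on a schedule
Read the Cookie from Chrome automatically
Use higher-level third-party libraries
Use the Scrapy crawler framework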