scrapy实现新浪微博爬虫(2)-木庄网络博客

当前第2页返回上一页

对比两次js加载的页面，我们不难发现，Request_URL差别的地方仅仅是在pagebar和_rnd两个参数上，第一个代表页数，第二个是时间的加密(测试不带上也无妨),因此我们仅仅需要构建页数即可。有些微博量巨多的可能还需要翻页，道理相同。

class SpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    nickname = scrapy.Field()
    follow = scrapy.Field()
    fan = scrapy.Field()
    weibo_count = scrapy.Field()
    authentication = scrapy.Field()
    address = scrapy.Field()
    graduated = scrapy.Field()
    date = scrapy.Field()
    content = scrapy.Field()
    oid = scrapy.Field()

设置需要爬取的字段nickname昵称，follow关注，fan粉丝，weibo_count微博数量，authentication认证信息，address地址，graduated毕业院校，有些微博不显示的默认设置为空，以及oid和博文内容及发布时间。
这里说一下内容的解析，还是吴彦祖微博，如果我们还是像之前一样直接用scrapy的解析规则去用xpath或者css选择器解析会发现明明结构找的正确却匹配不出数据，这就是微博坑的地方，点开源代码。我们发现：
在这里插入图片描述
微博的主题内容全是用script包裹起来的！！！这个问题当初也是困扰了博主很久，反复换着法子用css和xpath解析始终不出数据。
解决办法：正则匹配(无奈但有效)
至此，就可以愉快的进行采集了，附上运行截图：
在这里插入图片描述
输入导入mongodb：