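Web Scraping with CasperJS

I have scraped data with jsoup (Java, Scala, Groovy) and with cheerio (Node.js). Each time I had to study the page's HTML structure and the URLs the data comes from, and also set request headers to get past the site's anti-scraping checks. It is tedious, and some data-heavy sites have validation complex enough that their anti-scraping flow is hard to get around with simple tricks.

CasperJS is a utility library built on top of PhantomJS that simplifies PhantomJS operations and lowers the barrier to entry. PhantomJS is a headless browser; you can think of it as a regular browser without a window. So with CasperJS you can drive a browser to visit a URL, wait for the page to finish loading, and then extract the data. This way you mostly don't have to worry about being blocked by anti-scraping measures, and getting at the data is comparatively easy.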

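Start with the official example: Hello World and how to debug

Install the latest CasperJS with npm install. For PhantomJS, download version 1.9.8; I don't recommend 2.x, because some features have problems there.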

R:\test>set PATH=C:\Users\winse\AppData\Roaming\npm\node_modules\casperjs\bin;E:\phantomjs-1.9.8-windows;%PATH%

R:\test>cat hello.js
var casper = require('casper').create();
// debugger

casper.start('http://casperjs.org/', function() {
    this.echo(this.getTitle());
    
    this.echo("Star: " + this.evaluate(function () { 
        return $(".octicon-star").parent().text().trim()
    }) )
});

casper.thenOpen('http://phantomjs.org', function() {
    this.echo(this.getTitle());
    
    this.echo("Intro: " + this.evaluate(function () { 
        return $(".intro h1").innerHTML
        // return document.querySelector(".intro h1").innerHTML
    }) )
});

casper.run();

R:\test>casperjs  hello.js
CasperJS, a navigation scripting and testing utility for PhantomJS and SlimerJS
Star: 6,337 Stargazers
PhantomJS | PhantomJS
Intro: null

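Extracting page data with JavaScript works very well. Compared with requesting the data URLs directly, CasperJS is just somewhat slower (a bit like opening a browser for every visit; this could be mitigated by setting up a service that keeps a resident PhantomJS instance around to load pages).

The second value above (Intro: null) is not what we want, so let's debug to find out what causes it. Add a debugger line before start, then run: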

casperjs hello.js --verbose --log-level=debug --remote-debugger-port=9000

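Open localhost:9000 in a browser, click the about:blank link, and run __run() in the Console. After a moment execution stops at the debugger line, and from there you can debug as usual.

Set a breakpoint in the evaluate code of the http://phantomjs.org step. Once execution reaches that breakpoint, open localhost:9000 again: there is now an extra link for the page currently being visited. Click it and you get a debugging window just like the one you normally see with F12.

Note: use a Chrome version below 54.

The debugging session looks like this: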

> $(".intro h1")
null
> $
bound: function () {
        return document.getElementById.apply(document, arguments);
    }
> document.querySelector(".intro h1").innerHTML
"
        Full web stack<br>
        No browser required
      "

So we change the script to use querySelector to get the data. Run it again:

R:\test>casperjs  hello.js
CasperJS, a navigation scripting and testing utility for PhantomJS and SlimerJS
Star: 6,337 Stargazers
PhantomJS | PhantomJS
Intro:
        Full web stack<br>
        No browser required
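
The debugging session shows that `$` on a page is not guaranteed to be jQuery, so a selector helper built only on the standard DOM API is a safer default. The following is a minimal sketch, not part of the original script: `innerHtmlOf` is a hypothetical name, and passing it to `evaluate` together with a selector argument assumes a CasperJS version whose `evaluate` accepts extra arguments after the function. It is written in ES5 only, since it has to run inside PhantomJS 1.9.8.

```javascript
// Intended to run in the page context, so it must be self-contained
// and may only rely on the page's global `document`.
function innerHtmlOf(selector) {
    var el = document.querySelector(selector);
    // Return null instead of throwing when nothing matches the selector.
    return el ? el.innerHTML.trim() : null;
}

// Hypothetical usage inside hello.js (assumption, not run here):
// casper.thenOpen('http://phantomjs.org', function () {
//     this.echo("Intro: " + this.evaluate(innerHtmlOf, ".intro h1"));
// });
```

Because the function does not depend on whatever `$` happens to be on the target site, the same helper can be reused across pages.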

Features

Screenshots

There is a ready-made method for this, but you have to handle the background color yourself; see Tips and Tricks.

> cat capture.js
var casper = require('casper').create({
    waitTimeout: 120000,
    logLevel: "debug",
    verbose: true
});
casper.userAgent('Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0')

casper.start('https://xueqiu.com/2054435398/32283614', function () {
    this.waitForSelector("div.status-content a[title*=xueqiu]");
}).then(function () {
    // white background
    this.evaluate(function () {
        var style = document.createElement('style'),
            text = document.createTextNode('body { background: #fff }');
        style.setAttribute('type', 'text/css');
        style.appendChild(text);
        document.head.insertBefore(style, document.head.firstChild);
    });
}).then(function () {
    this.capture('结庐问山.jpg');
});

casper.run()

> casperjs capture.js --load-images=yes --disk-cache=yes --ignore-ssl-errors=true --output-encoding=gbk

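This is remarkably good for full-page screenshots. The screenshot tools built into Chrome and similar browsers get slow and awkward once the content is long; this approach handles it without any trouble.

Scraping hierarchical pages

Scraping usually starts from a list page: you collect the detail-page URLs from the list, then fetch the data from each detail page.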
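
If several scripts need the same background fix, it can be pulled out into one named function and handed to `evaluate`. This is only a sketch under that assumption: `forceWhiteBackground` is a hypothetical name, and the body mirrors the inline snippet from capture.js.

```javascript
// Runs in the page context: injects a style rule so that pages without
// an explicit background are rendered on white instead of transparent.
function forceWhiteBackground() {
    var style = document.createElement('style');
    style.setAttribute('type', 'text/css');
    style.appendChild(document.createTextNode('body { background: #fff }'));
    // Insert first so the page's own background rules still take precedence.
    document.head.insertBefore(style, document.head.firstChild);
}

// Hypothetical usage in a CasperJS step (assumption, not run here):
// casper.then(function () {
//     this.evaluate(forceWhiteBackground);
//     this.capture('page.jpg');
// });
```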

> cat xueqiu.js
debugger

var fs = require('fs');
var casper = require('casper').create({
    waitTimeout: 120000,
    logLevel: "debug",
    verbose: true
});
casper.userAgent('Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0')

var links = []
var basedir = '.'
casper.start('https://xueqiu.com/2054435398/32283614', function () {
    this.waitForSelector("div.status-content a[title*=xueqiu]");
}).then(function () {
    var items = this.evaluate(function () {
        return $("div.status-content a[title*=xueqiu]").map(function (i, a) {
            return $(a).attr('href')
        }).get() // .get() converts the jQuery object into a plain array of hrefs
    })

    for (var i = 0; i < items.length; i++) {
        links.push(items[i]);
    }
    
    fs.write('all.html', this.getHTML(), 'w');
}).then(function () {
    this.eachThen(links, function (link) {
        var pathname = undefined;
        var url = link.data;

        this.thenOpen(url, function () {
            this.waitForSelector("div.status-content .detail");
        }).then(function () {
            pathname = this.evaluate(function () {
                var style = document.createElement('style'),
                    text = document.createTextNode('body { background: #fff }');
                style.setAttribute('type', 'text/css');
                style.appendChild(text);
                document.head.insertBefore(style, document.head.firstChild);

                return window.location.pathname;
            });
        }).then(function () {
            if (url.indexOf(pathname) !== -1) // indexOf returns -1 (truthy) when not found, so compare explicitly
                this.capture(basedir + pathname + ".jpg");
            else
                this.echo(url);
        });

    })

});

casper.run()

> casperjs xueqiu.js --load-images=yes --disk-cache=yes --ignore-ssl-errors=true --output-encoding=gbk --remote-debugger-port=9000

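After that, a whole pile of images gets generated. The limited access speed cuts both ways: because it is slow, there is no need to throttle requests, and it feels a bit like a person operating the browser. Afterwards, handle the few that failed and run those again (the one that failed here turned out to be a 404 anyway).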

$("div.status-content a[title*=xueqiu]").map(function(i, a){ return $(a).attr('href') }).length
177

$ find . -name '*.jpg' | wc -l
176
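
With 177 links and 176 images, it helps to know which link has no screenshot. The sketch below is one way to check, under stated assumptions: `pathnameOf` and `missingCaptures` are hypothetical helpers, the links are absolute URLs, and the files are named `basedir + pathname + ".jpg"` as in xueqiu.js, listed in the same form that `find .` prints.

```javascript
// Extract the path portion of an absolute URL (ES5-only, no URL class).
// Returns null when the input is not an absolute URL.
function pathnameOf(url) {
    var m = /^[a-z][a-z0-9+.-]*:\/\/[^\/?#]+(\/[^?#]*)?/i.exec(url);
    return m ? (m[1] || '/') : null;
}

// Return the links for which no screenshot file exists.
function missingCaptures(links, files, basedir) {
    var seen = {};
    files.forEach(function (f) { seen[f] = true; });
    return links.filter(function (link) {
        var p = pathnameOf(link);
        return p === null || !seen[basedir + p + '.jpg'];
    });
}

// Example with made-up data (not taken from the real run):
var exampleLinks = [
    'https://xueqiu.com/2054435398/1',
    'https://xueqiu.com/2054435398/2'
];
var exampleFiles = ['./2054435398/1.jpg'];
console.log(missingCaptures(exampleLinks, exampleFiles, '.'));
```

The returned list can then be fed back into the script to retry only the pages that failed.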

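Note: in the Windows command window, press Enter a few times now and then; it is easy to slip into edit mode by accident.

Even compressed, the result is over 100 MB! CasperJS is powerful enough, and there is a lot more to explore. That's all for this post.

Postscript

On scraping data, the article "抓取前端渲染的页面" ("Scraping Front-End Rendered Pages") gives a fair assessment. If it suits your case, WebMagic, written by the same author, is also a good choice.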

–END

Winse Blog
