空夜

Building a Douyin App crawler with appium + selenium + the Nox emulator (夜神模拟器)
2019/04/17


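Installation and environment setup

You need to install appium, the Android SDK, and the Nox emulator, and configure environment variables for the Android SDK and the emulator. Detailed tutorials are easy to find with a web search (I found plenty), so what follows is only a brief outline.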

Installing appium

Download Appium from the official site, http://appium.io, and install it.


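Installing the Android SDK

Download an Android SDK, either by searching for one or from http://tools.android-studio.org/index.php/sdk. You can also use my copy: Android SDK, password cyup

After installation, add these two directories to the PATH environment variable: D:\Android\Sdk\tools and D:\Android\Sdk\platform-tools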


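Installing the Nox emulator

Testing is normally done on a real device, but to save cost and simplify deployment I recommend the Nox emulator (other emulators on the market can probably do the same). Download it from the official site and install it.

After installation, add its bin directory to the PATH environment variable: D:\software\夜神模拟器\Nox\bin

Next, make sure the adb in the Android SDK and the adb bundled with the Nox emulator are the same version:

Check the version of the adb in your local Android SDK. If it differs from the emulator's nox_adb, replace the emulator's nox_adb.exe with it (copy adb.exe, rename the copy to nox_adb.exe, and paste it over the original).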

# Check the Android SDK adb version: adb --version
# Check the Nox emulator adb version: nox_adb --version
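The comparison can also be done in code. The following is a minimal Java sketch, assuming you have already captured the text printed by the two commands and that it contains the usual "Android Debug Bridge version x.y.z" banner line. The class and method names are hypothetical and are not part of the crawler script.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: compares the versions reported by adb and nox_adb.
final class AdbVersionCheck {
    // Assumed banner format: "Android Debug Bridge version 1.0.41"
    private static final Pattern VERSION =
            Pattern.compile("Android Debug Bridge version (\\d+(?:\\.\\d+)*)");

    // Extracts the version number from the command output, if present.
    static Optional<String> parseVersion(String output) {
        Matcher m = VERSION.matcher(output);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // True only when both outputs contain a version and the versions are equal.
    static boolean sameVersion(String sdkOutput, String noxOutput) {
        Optional<String> sdk = parseVersion(sdkOutput);
        return sdk.isPresent() && sdk.equals(parseVersion(noxOutput));
    }
}
```

If sameVersion returns false, replace nox_adb.exe as described above.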



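Configuring fiddler to capture App traffic

Link: capturing Android App traffic with fiddler

The linked guide covers the fiddler side of the configuration.

To capture URLs that match a fixed pattern and save the responses to files, edit the FiddlerScript in Fiddler. A simple example is to modify the OnBeforeResponse method: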

    static function OnBeforeResponse(oSession: Session) {
        // Match Douyin search-result requests
        var oRegEx = /aweme\/v1\/search\/item/i;
        var file = "d://douyinvideo/";
        if ((oSession.responseCode == 200) && oRegEx.test(oSession.fullUrl))
        {
            // Build a timestamp for the file name. Starting from "" forces string
            // concatenation; adding the numeric parts directly would sum them.
            var date = new Date();
            var currentdate = "" + date.getFullYear()
                + ("0" + (date.getMonth() + 1)).slice(-2)
                + ("0" + date.getDate()).slice(-2)
                + '_' + ("0" + date.getHours()).slice(-2)
                + ("0" + date.getMinutes()).slice(-2)
                + ("0" + date.getSeconds()).slice(-2)
                + '_' + ("00" + date.getMilliseconds()).slice(-3);
            // Remove any compression or chunking so the saved file is plain text
            oSession.utilDecodeResponse();
            oSession.SaveResponseBody(file + currentdate + ".txt");
        }
        if (m_Hide304s && oSession.responseCode == 304) {
            oSession["ui-hide"] = "true";
        }
    }

Once the responses are saved as files, a program can parse them and load the results into a database.
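As a starting point for that analysis step, here is a minimal Java sketch that lists the saved .txt files in a directory and reads one of them. It assumes the files are UTF-8 text; the class and method names are hypothetical, and the actual parsing and database code depends on the response format and is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper for reading the files saved by the FiddlerScript.
final class CaptureReader {
    // Returns the .txt files directly inside dir, sorted by path.
    static List<Path> listCaptures(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (Stream<Path> files = Files.list(dir)) {
            files.filter(p -> p.getFileName().toString().endsWith(".txt"))
                 .forEach(result::add);
        }
        Collections.sort(result);
        return result;
    }

    // Reads one capture as text (assumption: the body was saved as UTF-8).
    static String readCapture(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }
}
```

A caller would loop over listCaptures(Paths.get("d:/douyinvideo")), parse each body, and move or delete the file once it has been stored.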

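Configuring the Nox emulator

So that the program can drive the device, the emulator needs some configuration. After starting the Nox emulator, enable developer mode, set the screen to never sleep, and turn on USB debugging.
If you need to capture traffic, then after installing and configuring fiddler, set a network proxy in the emulator: open WLAN, long-press the connected network, and configure the proxy with the IP of the machine running fiddler, for example 192.168.11.52, and port 8888 (fiddler's default listening port).

Next, install the App you want to test or capture. It is best to download it from the official site and copy it into the emulator, which installs it automatically. This article uses Douyin as the example.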


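Writing the script

Getting device and App information

To get the package name and Activity of the App currently in the foreground, run the following commands in order in a command-line window: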

adb shell
dumpsys window windows | grep -E 'mFocusedApp'
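The line printed by that command contains the package and Activity as a "package/activity" pair. The following is a minimal Java sketch that extracts the two values from such a line. The sample line in the usage note is illustrative only (the exact dumpsys format varies between Android versions), and the class name is hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls "package/activity" out of an mFocusedApp line.
final class FocusedApp {
    // Matches e.g. com.example.app/.MainActivity
    private static final Pattern COMPONENT =
            Pattern.compile("([A-Za-z0-9_.]+)/([A-Za-z0-9_.$]+)");

    final String appPackage;
    final String appActivity;

    private FocusedApp(String appPackage, String appActivity) {
        this.appPackage = appPackage;
        this.appActivity = appActivity;
    }

    // Returns the first package/activity pair found in the line, if any.
    static Optional<FocusedApp> parse(String dumpsysLine) {
        Matcher m = COMPONENT.matcher(dumpsysLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new FocusedApp(m.group(1), m.group(2)));
    }
}
```

For a line such as `mFocusedApp=AppWindowToken{1c2b3d4 token=Token{5e6f7a8 ActivityRecord{9b0c1d2 u0 com.ss.android.ugc.aweme/.main.MainActivity t12}}}`, appPackage is com.ss.android.ugc.aweme and appActivity is .main.MainActivity, which are the values appium needs as appPackage and appActivity.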

appium needs the following device information for the Nox emulator (using the Douyin App as the example):

{
  "platformName": "Android",
  "deviceName": "127.0.0.1:62001",
  "appPackage": "com.ss.android.ugc.aweme",
  "appActivity": ".main.MainActivity"
}

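Note: the Nox emulator's default device name is 127.0.0.1:62001. For a real device, look the name up from the command line.

After starting the Nox emulator, run adb devices on the command line. The output must include the device, as follows: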

C:\\User\zfh>adb devices
List of devices attached
127.0.0.1:62001 device

If the device is not listed, connect it from the command line with nox_adb connect 127.0.0.1:62001
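A crawler that runs unattended may want to make this check itself before starting a session. Here is a minimal Java sketch that inspects the text printed by adb devices, assuming the output has already been captured as a string. The class name is hypothetical.

```java
// Hypothetical helper: checks the output of "adb devices" for an online device.
final class AdbDevices {
    // True if a line of the output reads "<serial> device".
    // Other states such as "offline" or "unauthorized" do not count.
    static boolean isOnline(String adbDevicesOutput, String serial) {
        for (String line : adbDevicesOutput.split("\\R")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2 && parts[0].equals(serial) && parts[1].equals("device")) {
                return true;
            }
        }
        return false;
    }
}
```

When isOnline(output, "127.0.0.1:62001") is false, the program can run the nox_adb connect command above and check again.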


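Douyin crawler script

Using the Douyin App as the example, this script searches Douyin videos for a fixed list of keywords.

Required dependencies: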

        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-api</artifactId>
            <version>3.141.59</version>
        </dependency>

        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-remote-driver</artifactId>
            <version>3.141.59</version>
        </dependency>

        <dependency>
            <groupId>io.appium</groupId>
            <artifactId>java-client</artifactId>
            <version>6.1.0</version>
        </dependency>

        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>

        <dependency>
            <groupId>cn.hutool</groupId>
            <artifactId>hutool-all</artifactId>
            <version>4.3.2</version>
        </dependency>

For the DesiredCapabilities options, see: DesiredCapabilities options explained (fairly complete)

Code:

import io.appium.java_client.TouchAction;
import io.appium.java_client.android.AndroidDriver;
import io.appium.java_client.remote.MobileCapabilityType;
import io.appium.java_client.touch.WaitOptions;
import io.appium.java_client.touch.offset.PointOption;
import org.apache.commons.io.FileUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Random;

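/**
 * Douyin crawler thread, Nox emulator edition.
 * Drives the App through appium while fiddler saves the Douyin video data to files.
 * Note: this code, especially the coordinates and some of the xpaths, only fits the device it was written for.
 * Also: if Douyin is not logged in when the crawler starts, or the run is set to clear the cache
 * (which forces a new login), Toutiao (今日头条) must already be installed and logged in on the same device.
 * @author zfh
 * @version 1.0
 * @since 2019/4/11 18:27
 */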
public class DouYinCrawlerThread extends Thread {

    public static Integer PAGE_LIMIT = 20; // Douyin paging limit: stop after 20 pages per keyword

    private static Logger logger = LoggerFactory.getLogger(DouYinCrawlerThread.class);
    private static List<String> keyList = new ArrayList<>();
    private static Integer keyIndex = 0; // index of the keyword currently being searched
    private static boolean noReset = false;

    static {
        keyList.add("java");
        keyList.add("python");
        keyList.add("php");
        keyList.add("sql");
        keyList.add("程序员");
    }

    public DouYinCrawlerThread(boolean noReset) {
        this.setName("DouyinVideoCrawlerThread_Nox");
        DouYinCrawlerThread.noReset = noReset;
    }

    @Override
    public void run() {
        logger.info("Starting: " + this.getName());
        while (true) {
            Date startTime = new Date();
            try {
                keyIndex = 0;
                startCrawler();
            } catch (MalformedURLException e) {
                e.printStackTrace();
            }
            long consumeTime = new Date().getTime() - startTime.getTime();
            long sleepTime = 60 * 60 * 1000 - consumeTime;
            if (sleepTime <= 0) {
                sleepTime = (20 + new Random().nextInt(40)) * 60 * 1000;
            }

            if (!noReset) {
                noReset = true; // from the second round on, the cache no longer needs clearing
            }
            logger.info("Round finished in " + consumeTime/(60*1000) + " min; sleeping " + sleepTime/(60*1000) + " min before the next round");
            sleepTime(sleepTime); // one full search round per hour
        }
    }

    /**
     * Start the crawler
     */
    private static void startCrawler() throws MalformedURLException {
        DesiredCapabilities desiredCapabilities = new DesiredCapabilities();
        desiredCapabilities.setCapability(MobileCapabilityType.DEVICE_NAME, "127.0.0.1:62001"); // the Nox emulator's default device name
        desiredCapabilities.setCapability("platformName", "Android");
        desiredCapabilities.setCapability("appPackage","com.ss.android.ugc.aweme");
        desiredCapabilities.setCapability("appActivity",".main.MainActivity");
        desiredCapabilities.setCapability("unicodeKeyboard", true);
        desiredCapabilities.setCapability("resetKeyboard", true); // restore the original input method afterwards
        desiredCapabilities.setCapability("noReset", noReset);
        logger.info(noReset ? "Starting without clearing the cache" : "Starting with the cache cleared");

        URL url = new URL("http://127.0.0.1:4723/wd/hub");

        boolean fail = false;
        do {
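            try {
                logger.info("Launching the Douyin App....");
                startApp(desiredCapabilities, url);
                fail = false;
            } catch (Exception ex) {
                fail = true;
                // keyIndex equals keyList.size() if the failure came after the last keyword
                String keyword = keyIndex < keyList.size() ? keyList.get(keyIndex) : "(none)";
                logger.info("Unexpected error; interrupted at keyword: " + keyword + ", restarting");
                ex.printStackTrace();
            }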
        } while (fail) ;
    }

    /**
     * Launch the Douyin App
     * @param desiredCapabilities
     * @param url
     * @throws Exception
     */
    private static void startApp(DesiredCapabilities desiredCapabilities, URL url) throws Exception {
        AndroidDriver webDriver = new AndroidDriver(url, desiredCapabilities);
        Thread.sleep(30*1000);

        String pageSource = webDriver.getPageSource();
        // Check whether there are pop-ups that need to be dismissed
        while (needToCheckAndClick(pageSource)) {
            exCheckAndReset(webDriver);
            Thread.sleep(5 * 1000);
            pageSource = webDriver.getPageSource();
        }

        /*// Tap to accept the policy
        try {
            webDriver.findElement(By.xpath("/hierarchy/android.widget.FrameLayout/android.widget.FrameLayout/android.widget." +
                    "FrameLayout/android.widget.FrameLayout/android.widget.LinearLayout/android.widget.TextView[3]")).click();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        // Tap to grant permissions
        try {
            Thread.sleep(3*1000);
            webDriver.findElement(By.id("com.android.packageinstaller:id/permission_allow_button")).click();
            Thread.sleep(3*1000);
            webDriver.findElement(By.id("com.android.packageinstaller:id/permission_allow_button")).click();
        } catch (Exception ex) {
            ex.printStackTrace();
        }*/
        // Tap the screen to enter
        Thread.sleep(3*1000);
        logger.info("Tapping the screen to enter");
        click(340, 720, webDriver);

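        if (!noReset) {
            // The cache was reset, so a new login is required
            logger.info("Cache was reset; logging in again");
            try {
                login(webDriver);
            } catch (Exception ex) {
                logger.info("Login via Toutiao authorization failed");
                getScreenShot(webDriver);
                ex.printStackTrace();
                throw ex; // login failed, rethrow
            }

            Thread.sleep(5 * 1000);
            pageSource = webDriver.getPageSource();
            // After login, check again for pop-ups that need to be dismissed
            while (needToCheckAndClick(pageSource)) {
                exCheckAndReset(webDriver);
                Thread.sleep(5 * 1000);
                pageSource = webDriver.getPageSource(); // refresh, otherwise the loop never ends
            }
        }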


        // Tap to open the search page
        Thread.sleep(3*1000);
        logger.info("Opening the search page");
        click(675, 75, webDriver);

        // Start searching
        int time = 0;
        int successTime = 0;
        boolean isNew = true;
        for (; keyIndex < keyList.size(); keyIndex++) {
            time++;
            String keywords = keyList.get(keyIndex);
            logger.info((time) + ": starting a new query, keyword: " + keywords);
            try {
                boolean res = search(webDriver, keywords, isNew);
                if (res) {
                    successTime++;
                }
                if (isNew) {
                    isNew = false; // only the first search starts from the fresh search page
                }
            } catch (Exception ex) {
                logger.info("Query failed:");
                getScreenShot(webDriver); // take a screenshot
                throw ex; // give up this run; the caller restarts the App
            }
            Thread.sleep(5 * 1000);
        }
        logger.info("Closing the App");
        webDriver.closeApp();
        Thread.sleep(10 * 1000);
        logger.info("Ran " + (time) + " queries: " + successTime + " succeeded, " + (time - successTime) + " failed");
    }

    /**
     * Log in through Toutiao
     * @param webDriver
     * @throws Exception
     */
    public static void login(AndroidDriver webDriver) throws Exception {
        // Log in
        Thread.sleep(5 * 1000);
        logger.info("Tapping login");
        click(650, 1240, webDriver);

        // Tap "log in with Toutiao"
        Thread.sleep(5 * 1000);
        if (!webDriver.getPageSource().contains("密码登录")) { // "密码登录" (password login) is UI text on the login page
            Thread.sleep(5 * 1000);
        }
        logger.info("Choosing Toutiao login");
        click(215, 580, webDriver);

        // Tap "authorize login" in Toutiao
        Thread.sleep(10 * 1000);
        logger.info("Tapping Toutiao authorized login");
        click(340, 650, webDriver);

        // Skip binding a phone number
        Thread.sleep(10 * 1000);
        logger.info("Skipping phone-number binding");
        click(670, 75, webDriver);
    }

    /**
     * Search for one keyword
     * @param webDriver
     * @throws Exception
     */
    private static boolean search(AndroidDriver webDriver, String keywords, boolean isNew) throws Exception {

         if (!isNew) {
            Thread.sleep(3*1000);

            // Tap the input box
            click(320, 80, webDriver);

            logger.info("Entering the keyword");
            WebElement editElement = webDriver.findElement(By.xpath("/hierarchy/android.widget.FrameLayout/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.FrameLayout/android.widget.FrameLayout[2]/android.widget.RelativeLayout/android.widget.RelativeLayout/android.widget.FrameLayout/android.widget.EditText"));
           
            editElement.clear(); // clear the input box; some sources say the cursor must first be moved to the end, but a test on a Samsung phone showed it works without that
            editElement.click();
            editElement.sendKeys(keywords);
        } else {
            Thread.sleep(3*1000);

            webDriver.findElement(By.xpath("/hierarchy/android.widget.FrameLayout/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.FrameLayout[1]/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.RelativeLayout/android.widget.LinearLayout/android.widget.FrameLayout[2]"))
                    .click();

            Thread.sleep(3 * 1000);
            logger.info("Entering the keyword");
            webDriver.findElement(By.xpath("/hierarchy/android.widget.FrameLayout/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.FrameLayout[1]/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.RelativeLayout/android.widget.LinearLayout/android.widget.FrameLayout[2]/android.widget.EditText"))
                    .sendKeys(keywords); // enter the keyword
        }

        Thread.sleep(5*1000);
        logger.info("Tapping search");
        click(670, 70, webDriver); // tap search

        if (isSearchNull(webDriver)) {
            return false;
        }

        Thread.sleep(3*1000);
        // Tap the "video" tab
        webDriver.findElement(By.xpath("/hierarchy/android.widget.FrameLayout/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.FrameLayout/android.widget.FrameLayout[2]/android.widget.RelativeLayout/android.widget.FrameLayout/android.widget.RelativeLayout/android.widget.HorizontalScrollView/android.widget.LinearLayout/android.support.v7.app.ActionBar.Tab[2]"))
                .click();


        int width = webDriver.manage().window().getSize().width;
        int height = webDriver.manage().window().getSize().height;

        int i = 0; // pages turned so far
        int swipeFailNum = 0; // failed page turns
        while (swipeFailNum < 1) {
            if (i >= PAGE_LIMIT) {
                logger.info("Reached the page limit (" + PAGE_LIMIT + " pages); stopping");
                break;
            }

            logger.info("Turning to page: " + (i+1));
            try {
                Thread.sleep( 2 * 1000);
                swipe(webDriver, width/2, (int)(height * 0.75), width/2, (int)(height/3));
                swipe(webDriver, width/2, (int)(height * 0.75), width/2, (int)(height/3)); // swipe twice
                logger.info("Page turn succeeded");
            } catch (Exception ex) {
                logger.info("Page turn failed");
                swipeFailNum++;
                ex.printStackTrace();
            } finally {
                i++;
            }

            // Check whether this is the last page
            String source = webDriver.getPageSource();
            if (source.contains("没有搜索到相关")) { // UI text: "nothing relevant found"
                getScreenShot(webDriver); // the keyword may have been garbled, or the device may be temporarily blocked
                logger.info(source);
                logger.info("Nothing relevant found");
                break;
            }
            if (source.contains("没有更多")) { // UI text: "no more"
                logger.info(source);
                logger.info("Reached the last page; no more data");
                break;
            }

            try {
                viewAndReturn(webDriver); // tap a random video, watch it, and go back
            } catch (Exception e) {
                e.printStackTrace();
                logger.info("Error while opening a video");
            }
        }

        logger.info("Video query succeeded");
        return true;
    }

    /**
     * Open a random video on the current page, wait a random time, then go back
     */
    private static void viewAndReturn(AndroidDriver driver) throws Exception {
        Thread.sleep(5 * 1000);
        logger.info("Tapping to open a video");
        int width = driver.manage().window().getSize().width;
        int height = driver.manage().window().getSize().height;

        logger.info("width = " + width + ", height = " + height);
        // Pick a point within a quarter of the screen size around the center
        Random random = new Random();
        int randomX = width/2 + (random.nextBoolean() ? random.nextInt(width/4) : - random.nextInt(width/4));
        int randomY = height/2 + (random.nextBoolean() ? random.nextInt(height/4) : - random.nextInt(height/4));
        TouchAction action = new TouchAction(driver);
        action.tap(PointOption.point(randomX, randomY));
        action.perform();

        logger.info("randomX = " + randomX + ", randomY = " + randomY);

        Thread.sleep(3 * 1000);
        String resource = driver.getPageSource();
        // Still seeing the "视频" (video) and "综合" (general) tabs means we are still on the results page
        if (resource != null && resource.contains("视频") && resource.contains("综合")) {
            logger.info("Failed to open the video; returning");
            return;
        }
        int sleepTime = 3 + random.nextInt(9);
        logger.info("Sleeping " + sleepTime + " s");
        Thread.sleep(sleepTime * 1000);

        logger.info("Done watching; going back");
        action.tap(PointOption.point(35, 80));
        action.perform();
        action.release();
    }

    /**
     * Tap a coordinate
     * @param x
     * @param y
     * @param driver
     */
    private static void click(int x, int y, AndroidDriver driver) {
        TouchAction action = new TouchAction(driver);
        action.tap(PointOption.point(x, y));
        action.perform();
        action.release();
    }

    /**
     * Swipe
     * @param driver
     * @param fromX
     * @param fromY
     * @param toX
     * @param toY
     */
    private static void swipe(AndroidDriver driver, int fromX, int fromY, int toX, int toY) {
        Duration duration = Duration.ofMillis(800);
        TouchAction action = new TouchAction(driver)
                .press(PointOption.point(fromX, fromY))
                .waitAction(WaitOptions.waitOptions(duration))
                .moveTo(PointOption.point(toX, toY)).release();
        action.perform();
        action.release();
    }

    /**
     * Take a screenshot
     * @param driver
     */
    private static void getScreenShot(AndroidDriver driver) {
        File img = driver.getScreenshotAs(OutputType.FILE);
        if (img != null && img.exists()) {
            File file = new File(new Date().getTime() + "." + img.getName().split("\\.")[1]);
            try {
                FileUtils.copyFile(img, file);
                img.delete();
                logger.info("Screenshot: " + file.getAbsolutePath());
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Check whether the search result is empty
     * @param driver
     * @return
     */
    private static boolean isSearchNull(AndroidDriver driver) {
        String resource = driver.getPageSource();
        return resource != null && resource.contains("搜索结果为空");
    }

    /**
     * Check for and recover from abnormal App states at runtime
     * @param driver
     */
    private static void exCheckAndReset(AndroidDriver driver) {
        String resource = driver.getPageSource();
        if (resource != null) {
            if (resource.contains("隐私政策") && resource.contains("仅浏览") && resource.contains("同意")) {
                logger.info("Accepting the privacy policy");
                click(460, 940, driver);
            } else if (resource.contains("青少年模式")) {
                // Dismiss the prompt to enable teen mode
                logger.info("Dismissing the teen-mode prompt");
                click(350, 875, driver);
            } else if ((resource.contains("通知") && resource.contains("去打开")) ||
                    (resource.contains("新版本") && resource.contains("升级")) ||
                    (resource.contains("通讯录好友"))) {
                /*getScreenShot(driver);*/
                logger.info("Declining notifications / version upgrade / contacts friends");
                click(250, 900, driver);
            }
        }
    }

    /**
     * Decide whether a pop-up needs to be dismissed
     * @param pageSource
     * @return
     */
    private static boolean needToCheckAndClick(String pageSource) {
        if (pageSource != null) {
            return pageSource.contains("青少年模式")
                    || pageSource.contains("隐私政策")
                    || pageSource.contains("通知")
                    || pageSource.contains("去打开")
                    || pageSource.contains("通讯录好友");
        }

        return false;
    }

    /**
     * thread.sleep(millis)
     * @param millis
     */
    private static void sleepTime(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        DouYinCrawlerThread thread = new DouYinCrawlerThread(false);
        thread.start();
    }
}

Once the environment is configured, run the script and it will search Douyin for video information keyword by keyword, in order.
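The script relies on fixed Thread.sleep calls between steps, which either waste time or turn out too short on a slow emulator. One possible refinement, not part of the original script, is to poll for a condition with a timeout. The following is a minimal Java sketch; the class and method names are hypothetical.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper: polls a condition instead of sleeping a fixed time.
final class Waits {
    // Returns true as soon as the condition holds, or false once timeoutMillis
    // has elapsed (or the thread is interrupted). The condition is checked at
    // least once, then again every pollMillis.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set
                return false;
            }
        }
    }
}
```

In login, for example, the fixed wait before choosing Toutiao login could become `Waits.waitUntil(() -> webDriver.getPageSource().contains("密码登录"), 15_000, 500)`, where the 15 s timeout and 500 ms interval are arbitrary values to tune.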

Last modification:April 18th, 2019 at 08:51 am