Douyin Scraper

Open-source repository

https://gitee.com/erma0/douyin

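Introduction

Python for fetching data + Vue for the UI + Aria2 for downloading

Given any of several kinds of Douyin links or IDs, the tool collects video posts through the web API and downloads them to local disk.

Supported inputs: a user's profile link or sec_uid, and topic-challenge or music (original sound) links or IDs.

Downloading a user's likes list is also supported (the likes list must be publicly visible).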
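
As a rough illustration of how a link and a bare ID might be told apart, the sketch below checks for '://' and, for links, reads sec_uid from the query string. The function name classify_input and the example.com URL are assumptions made for illustration; they are not part of the project.

```python
from urllib.parse import parse_qs, urlparse


def classify_input(param: str) -> dict:
    """Return {'kind': 'url' | 'id', 'sec_uid': str} for a link or a bare ID."""
    param = param.strip()
    if '://' not in param:
        # No scheme: treat the whole string as an ID such as a sec_uid.
        return {'kind': 'id', 'sec_uid': param}
    # A link: look for sec_uid in the query string, if present.
    query = parse_qs(urlparse(param).query)
    return {'kind': 'url', 'sec_uid': query.get('sec_uid', [''])[0]}


print(classify_input('https://example.com/share/user/1?sec_uid=MS4wABC'))
print(classify_input(' MS4wABC '))
```

A short link carries no sec_uid in its query string, so it would first have to be resolved to its redirect target.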

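Usage

0x00 Install dependencies

Open a command line in the program directory and run

pip install -r requirements.txt

0x01 Use the UI

Double-click 启动.bat, or open a command line in the program directory and run

python ui.py

0x02 Edit the relevant parameters in `douyin.py` directly

If you don't know Python at all, use the command line or the UI instead.

0x03 Use `exec.py` from the command line

Run it with no arguments to list the commands, or pass -h for help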

python exec.py

python exec.py -h

python exec.py download -h

python exec.py download_batch -h

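Commands are invoked by function name

--type  selects the download type; default: --type=user

--limit sets the number of posts to collect; default: --limit=0 (no limit)

For example, to collect all posts of a user:

python exec.py download https://v.douyin.com/xxxx/

python exec.py download <user's sec_uid>

For example, to collect the first 10 posts a user has liked:

python exec.py download MS4wLjABAAAAl7TJWjJJrnu11IlllB6Mi5V9VbAsQo1N987guPjctc8 --type=like --limit=10

python exec.py download <user's sec_uid> --type=like --limit=10

For example, to collect the first 10 posts that use a music track:

python exec.py download https://v.douyin.com/xxxx/ --type=music --limit=10

python exec.py download <music ID> --type=music --limit=10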

TODO

[x] Collect a user's posts

[x] Download through Aria2

[x] Collect posts for topics / music tracks

[x] Collect liked posts

[x] Batch collection from an imported file

[x] Command-line invocation

[x] Build the UI with webview

[x] ~~Package as an exe~~ Dropped; installing a Python environment is simpler

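Notes

Douyin

The web API is working again; a single request returns the data

UID is now almost useless: it can no longer be used to build a profile link, and every API takes sec_uid

The signature can now be a fixed value, so there is no need to reverse-engineer the JS anymore

Posts now include the watermark-free video URL directly, and the redirect works without a mobile UA

Number of posts under a topic / music track

2021.04.02 The likes list returns data as well now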

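Aria2

The aria2p library has been pleasant to use

Most Aria2 download integrations go through the RPC interface, and this one does too

You have to download aria2c.exe yourself to run the service, so the service has to be started automatically from code

Ways to skip a download when the file already exists:
1. --auto-file-renaming=false works, but it reports an error when run from the console; the error is harmless
2. -c works, and the console reports no error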
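
A minimal sketch of passing the second option per download over RPC, assuming an aria2c RPC server is already listening on the default localhost:6800 with no secret. The helper names build_skip_options and add_download are hypothetical; the option keys are aria2's long option names without the leading dashes.

```python
def build_skip_options(directory: str, filename: str) -> dict:
    """Build per-download aria2 options that avoid re-downloading existing files."""
    return {
        'dir': directory,      # same as -d / --dir
        'out': filename,       # same as -o / --out
        'continue': 'true',    # same as -c / --continue
    }


def add_download(url: str, directory: str, filename: str):
    """Queue one URL through aria2p; requires a running aria2c RPC server."""
    import aria2p  # third-party, imported lazily so the helper above works without it

    api = aria2p.API(aria2p.Client(host='http://localhost', port=6800, secret=''))
    return api.add_uris([url], options=build_skip_options(directory, filename))


print(build_skip_options('downloads', 'demo.mp4'))
```

Only build_skip_options runs here; add_download is shown for shape and is not called because it needs a live server.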

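When adding a download task, set the file name with options = {'out': filename}, i.e. -o

Aria2 creates the download directory automatically from the given path and file name

Paths and file names passed to Aria2 must not contain illegal characters (*, | and so on), hence the Download.title2path static method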
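
The body of Download.title2path is not shown in this post. A minimal sketch of such a sanitizer, assuming the characters Windows reserves in file names and using the hypothetical name sanitize_title, could look like this:

```python
import re

# Characters Windows forbids in file names, plus control characters.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def sanitize_title(title: str, replacement: str = '_') -> str:
    """Replace illegal file-name characters and trim trailing dots/spaces."""
    cleaned = _ILLEGAL.sub(replacement, title)
    cleaned = cleaned.strip().rstrip('. ')
    return cleaned or 'untitled'


print(sanitize_title('a*b|c?.mp4'))
```

Falling back to a fixed name for empty results avoids handing Aria2 an empty file name when a post has no description.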

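Event listening must be stopped manually; otherwise it blocks the process and the program cannot exit

I found no function for getting task progress and download speed in real time, so I wrote my own polling loop with a callback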
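
The polling idea can be sketched independently of Aria2. In the version below, fetch stands in for whatever status query the real program makes; poll_progress, fetch, on_update and max_rounds are all hypothetical names, and the demo uses a fake data source.

```python
import time
from typing import Callable, Dict


def poll_progress(fetch: Callable[[], Dict[str, float]],
                  on_update: Callable[[Dict[str, float]], None],
                  interval: float = 0.5,
                  max_rounds: int = 100) -> bool:
    """Call fetch() repeatedly and hand each snapshot to on_update().

    fetch returns a mapping of task id -> progress in [0, 1].
    Returns True once every task reports 1.0, False if max_rounds is hit.
    """
    for _ in range(max_rounds):
        snapshot = fetch()
        on_update(snapshot)
        if snapshot and all(p >= 1.0 for p in snapshot.values()):
            return True
        time.sleep(interval)
    return False


# Demo with a fake data source standing in for a status query.
_state = {'gid1': 0.0}


def _fake_fetch() -> Dict[str, float]:
    _state['gid1'] = min(1.0, _state['gid1'] + 0.5)
    return dict(_state)


seen = []
finished = poll_progress(_fake_fetch, seen.append, interval=0.0)
print(finished, seen)
```

The max_rounds cap gives the loop a guaranteed exit, which matters when it runs next to a UI that must be able to close.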

Python

Use os.popen or subprocess.Popen to launch a program as a child process, with no window and without blocking
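
A minimal sketch of the subprocess.Popen variant. The helper name launch_detached is hypothetical, and the Python interpreter stands in for the real service here; for Aria2 the argument list would presumably start with the path to aria2c and include --enable-rpc.

```python
import subprocess
import sys


def launch_detached(args: list) -> subprocess.Popen:
    """Start a child process and return immediately without waiting for it."""
    kwargs = {'stdout': subprocess.DEVNULL, 'stderr': subprocess.DEVNULL}
    if sys.platform == 'win32':
        # Windows only: do not open a console window for the child.
        kwargs['creationflags'] = subprocess.CREATE_NO_WINDOW
    return subprocess.Popen(args, **kwargs)


# Demo: the Python interpreter stands in for a long-running service.
proc = launch_detached([sys.executable, '-c', 'import time; time.sleep(0.2)'])
still_running = proc.poll() is None   # None means the child has not exited yet
exit_code = proc.wait()
print(still_running, exit_code)
```

Keeping the returned Popen object around lets the program terminate the service on exit.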

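When overriding __init__ in a subclass, call the parent constructor with super().__init__()

When overriding methods in a subclass, private (double-underscore) methods cannot be overridden and private members cannot be read, because such names are mangled per class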
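
Both points can be seen in a small example. The class and attribute names below are made up for illustration.

```python
class Base:
    def __init__(self, name: str):
        self.name = name
        self.__token = 'base-secret'   # stored as _Base__token

    def __helper(self) -> str:         # stored as _Base__helper
        return 'base helper'

    def run(self) -> str:
        return self.__helper()         # always resolves to _Base__helper


class Child(Base):
    def __init__(self, name: str, limit: int = 0):
        super().__init__(name)         # run the parent constructor first
        self.limit = limit

    def __helper(self) -> str:         # stored as _Child__helper, not an override
        return 'child helper'

    def peek(self) -> str:
        try:
            return self.__token        # looks up _Child__token, which is missing
        except AttributeError:
            return 'not accessible'


c = Child('demo', limit=3)
print(c.run(), '|', c.peek(), '|', c._Base__token)
# prints: base helper | not accessible | base-secret
```

A single leading underscore avoids the mangling and is the usual choice when subclasses are expected to override or read a member.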

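Type hints on parameters are handy: they enable autocompletion wherever the function is called

On Windows, if 'PROGRAMFILES(X86)' in os.environ is a simple check for whether the system is 64-bit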
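
A small sketch wrapping that check in a function, next to a separate check of the interpreter itself, since a 32-bit Python can run on a 64-bit system. The function names are hypothetical.

```python
import os
import struct


def is_64bit_windows_env(environ=os.environ) -> bool:
    """The check from the note: this variable is set on 64-bit Windows."""
    return 'PROGRAMFILES(X86)' in environ


def is_64bit_interpreter() -> bool:
    """True when this Python build uses 8-byte pointers."""
    return struct.calcsize('P') * 8 == 64


print(is_64bit_windows_env(), is_64bit_interpreter())
```

The first answer decides which aria2c build can run on the machine; the second only describes the running Python.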

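Pylance's auto-import feature is great, though it seems to work only intermittently; toggling it off and on makes it work again

VS Code's default launch directory is the project root; adding "cwd": "${fileDirname}", to launch.json changes it to the file's directory, but Pylance autocompletion then no longer resolves relative directories

Generate the project's dependency list in one step with pipreqs: cd to the project directory in cmd and run pipreqs ./ --encoding=utf-8 --force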

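The fire command-line module

The simplest approach is a bare fire.Fire(), which exposes every function

If you expose a class or an object, the class's constructor arguments have to be specified separately

Grouped commands need separate classes, with the exposed class pulling in the classes for the grouped commands; that felt cumbersome for this batch-download case, so I just added a parameter and split the work into two functions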
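
A sketch of the two-function layout, assuming signatures that mirror the --type and --limit flags described in the usage section. The function bodies are placeholders, not the project's exec.py, and main() is left uncalled so the block runs without fire installed.

```python
def download(target: str, type: str = 'user', limit: int = 0) -> str:
    """Handle a single link or ID (placeholder body)."""
    return 'download {} type={} limit={}'.format(target, type, limit)


def download_batch(path: str, type: str = 'user', limit: int = 0) -> list:
    """Handle every non-empty line of a text file."""
    with open(path, encoding='utf-8') as f:
        targets = [line.strip() for line in f if line.strip()]
    return [download(t, type=type, limit=limit) for t in targets]


def main() -> None:
    import fire  # third-party: pip install fire
    # Expose only the two commands instead of every name in the module.
    fire.Fire({'download': download, 'download_batch': download_batch})


print(download('some_sec_uid', type='like', limit=10))
```

Calling main() at the script's entry point would let a command line such as `download some_sec_uid --type=like --limit=10` dispatch to download() with those keyword arguments.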

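The pywebview UI module

An instance of a class can be exposed to the page as js_api; the page then calls Python functions with pywebview.api.func().then(() => {})

A server instance such as Flask can also be exposed to the page through js_api (no url parameter needed), with index.html served internally

From Python, call JS with window.evaluate_js('JS code')

In the UI, the class cannot take constructor arguments, so __init__ has to be redefined

In the UI, instance methods that should be exposed must not start with an underscore _

The window width and height set when creating the UI seem to differ from the size seen in the page; the values need to be somewhat larger than the page's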
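
The points above can be put together in a small sketch. The Api class, its methods, the HTML snippet and the window size are all made up for illustration; main() is left uncalled so the block runs without pywebview installed.

```python
import json


class Api:
    """Public methods become callable from the page as pywebview.api.<name>()."""

    def __init__(self):
        # No constructor arguments: state is filled in by later calls.
        self.limit = 0
        self._window = None            # leading underscore: not exposed to the page

    def set_limit(self, limit):
        self.limit = int(limit)
        return {'limit': self.limit}   # delivered to the JS promise's then()

    def notify(self, message: str) -> None:
        if self._window is not None:
            # Python -> JS: run a snippet inside the page.
            self._window.evaluate_js('console.log({})'.format(json.dumps(message)))


HTML = """<button onclick="pywebview.api.set_limit(10).then(r => console.log(r))">
Set limit</button>"""


def main() -> None:
    import webview  # third-party: pip install pywebview
    api = Api()
    api._window = webview.create_window('Demo', html=HTML, js_api=api,
                                        width=820, height=640)
    webview.start()


print(Api().set_limit('10'))
```

Encoding the message with json.dumps keeps quotes and newlines from breaking the generated JS snippet.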

Partial source of the Douyin scraper

# -*- encoding: utf-8 -*-
'''
@File    :   douyin.py
@Time    :   2021-03-12 18:16:57, Friday
@Author  :   erma0
@Version :   1.0
@Contact :   https://erma0.cn
@Desc    :   Douyin user post scraper
'''
import json
import os
import time
from urllib.parse import parse_qs, urlparse

import requests

from download import Download



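class Douyin(object):
    """
    Douyin user class
    Collects the list of posts
    """

    def __init__(self, param: str, limit: int = 0):
        """
        Initialise the task
        The parameter is detected automatically: ID or URL
        """
        self.limit = limit
        self.http = requests.Session()
        self.url = ''
        self.type = 'unknow'
        self.download_path = 'undefined_dir'
        # ↑ Attributes predefined so they always exist when accessed ↑
        self.param = param.strip()
        self.sign = 'TG2uvBAbGAHzG19a.rniF0xtrq'  # the sign can be a fixed value
        self.__get_type()  # detect the task type: link or ID
        self.aria2 = Download()  # start the Aria2 download service; the directory is added later, together with the file name
        self.has_more = True
        self.finish = False
        # A dict is convenient for storage keyed by id (lookup/update), but tables expect a list
        self.videosL = []  # list form
        # self.videos = {}  # dict form
        self.gids = {}  # maps gid to post index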



    def __get_type(self):
        """
        Detect the task type
        Link or ID
        """
        if '://' in self.param:  # link
            self.__url2redirect()
        else:  # ID
            self.id = self.param

    def __url2redirect(self):
        """
        Read the 302 redirect target
        Short link to long link
        """
        headers = {  # posts used to need parsing to strip the watermark, which required a mobile UA; no longer necessary
            'User-Agent':
            'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1 Edg/89.0.4389.82'
        }
        try:
            r = self.http.head(self.param, headers=headers, allow_redirects=False)
            self.url = r.headers['Location']
        except Exception:
            # no redirect (or the request failed): use the link as given
            self.url = self.param



    def __url2id(self):
        # the ID is the fourth '/'-separated piece of the path (index 3)
        try:
            self.id = urlparse(self.url).path.split('/')[3]
        except Exception:
            self.id = ''

    def __url2uid(self):
        # sec_uid is read from the query string of the redirected link
        try:
            query = urlparse(self.url).query
            self.id = parse_qs(query)['sec_uid'][0]
        except Exception:
            self.id = ''



    def get_sign(self):
        """
        Web sign algorithm; no longer needed, a fixed value is used
        """
        self.sign = 'TG2uvBAbGAHzG19a.rniF0xtrq'
        return self.sign



    def get_user_info(self):
        """
        Fetch user information
        The result is stored in self.user_info
        """
        if self.url:
            self.__url2uid()
        url = 'https://www.iesdouyin.com/web/api/v2/user/info/?sec_uid=' + self.id
        try:
            res = self.http.get(url).json()
            info = res.get('user_info', dict())
        except Exception:
            info = dict()
        self.user_info = info
        # download path
        username = '{}_{}_{}'.format(self.user_info.get('short_id', '0'),
                                     self.user_info.get('nickname', 'no_nickname'), self.type)
        self.download_path = Download.title2path(username)  # illegal characters must be handled first



    def get_challenge_info(self):
        """
        Fetch topic-challenge information
        The result is stored in self.challenge_info
        """
        if self.url:
            self.__url2id()
        url = 'https://www.iesdouyin.com/web/api/v2/challenge/info/?ch_id=' + self.id
        try:
            res = self.http.get(url).json()
            info = res.get('ch_info', dict())
        except Exception:
            info = dict()
        self.challenge_info = info
        # download path for the topic challenge
        username = '{}_{}_{}'.format(self.challenge_info.get('cid', '0'),
                                     self.challenge_info.get('cha_name', 'untitled'), self.type)
        self.download_path = Download.title2path(username)  # illegal characters must be handled first



    def get_music_info(self):
        """
        Fetch music (original sound) information
        The result is stored in self.music_info
        """
        if self.url:
            self.__url2id()
        url = 'https://www.iesdouyin.com/web/api/v2/music/info/?music_id=' + self.id
        try:
            res = self.http.get(url).json()
            info = res.get('music_info', dict())
        except Exception:
            info = dict()
        self.music_info = info
        # download path for the music track
        username = '{}_{}_{}'.format(self.music_info.get('mid', '0'), self.music_info.get('title', 'untitled'),
                                     self.type)
        self.download_path = Download.title2path(username)  # illegal characters must be handled first



    def crawling_users_post(self):
        """
        Collect a user's posts
        """
        self.type = 'post'
        self.__crawling_user()

    def crawling_users_like(self):
        """
        Collect a user's likes
        """
        self.type = 'like'
        self.__crawling_user()



    def crawling_challenge(self):
        """
        Collect a topic challenge
        """
        self.type = 'challenge'
        self.get_challenge_info()  # fetch the info; it is used for the download directory

        # https://www.iesdouyin.com/web/api/v2/challenge/aweme/?ch_id=1570693184929793&count=9&cursor=9&aid=1128&screen_limit=3&download_click_limit=0&_signature=AXN-GQAAYUTpqVxkCT6GHQFzfg
        url = 'https://www.iesdouyin.com/web/api/v2/challenge/aweme/'

        cursor = '0'
        while self.has_more:
            params = {
                "ch_id": self.id,
                "count": "21",  # can be raised; the page's initial value is 9
                "cursor": cursor,
                "aid": "1128",
                "screen_limit": "3",
                "download_click_limit": "0",
                "_signature": self.sign
            }
            try:
                res = self.http.get(url, params=params).json()
                cursor = res['cursor']
                self.has_more = res['has_more']
                self.__append_videos(res)
            except Exception:
                print('Error while collecting the topic challenge')
        print('Topic challenge collection finished')



    def crawling_music(self):
        """
        Collect a music track (original sound)
        """
        self.type = 'music'
        self.get_music_info()  # fetch the info; it is used for the download directory

        # https://www.iesdouyin.com/web/api/v2/music/list/aweme/?music_id=6928362875564067592&count=9&cursor=18&aid=1128&screen_limit=3&download_click_limit=0&_signature=5ULmIQAAhRYNmMRcpDm2COVC5j
        url = 'https://www.iesdouyin.com/web/api/v2/music/list/aweme/'

        cursor = '0'
        while self.has_more:
            params = {
                "music_id": self.id,
                "count": "21",  # can be raised; the page's initial value is 9
                "cursor": cursor,
                "aid": "1128",
                "screen_limit": "3",
                "download_click_limit": "0",
                "_signature": self.sign
            }
            try:
                res = self.http.get(url, params=params).json()
                cursor = res['cursor']
                self.has_more = res['has_more']
                self.__append_videos(res)
            except Exception:
                print('Error while collecting the music track')
        print('Music track collection finished')



    def __crawling_user(self):
        """
        Collect a user's posts or likes
        """
        self.get_user_info()  # fetch the user info; the nickname is used for the download directory

        max_cursor = 0
        # https://www.iesdouyin.com/web/api/v2/aweme/like/?sec_uid=MS4wLjABAAAAaJO9L9M0scJ_njvXncvoFQj3ilCKW1qQkNGyDc2_5CQ&count=21&max_cursor=0&aid=1128&_signature=2QoRnQAAuXcx0DPg2DVICdkKEY&dytk=
        # https://www.iesdouyin.com/web/api/v2/aweme/post/?sec_uid=MS4wLjABAAAAaJO9L9M0scJ_njvXncvoFQj3ilCKW1qQkNGyDc2_5CQ&count=21&max_cursor=0&aid=1128&_signature=DrXeeAAAbwPmb.wFM3e63w613m&dytk=
        url = 'https://www.iesdouyin.com/web/api/v2/aweme/{}/'.format(self.type)

        while self.has_more:
            params = {
                "sec_uid": self.id,
                "count": "21",
                "max_cursor": max_cursor,
                "aid": "1128",
                "_signature": self.sign,
                "dytk": ""
            }
            try:
                res = self.http.get(url, params=params).json()
                max_cursor = res['max_cursor']
                self.has_more = res['has_more']
                self.__append_videos(res)
            except Exception:
                print('Error while collecting posts')
        print('Post collection finished')



    def __append_videos(self, res):
        """
        Store the collected data
        """
        if res.get('aweme_list'):
            for item in res['aweme_list']:
                info = item['statistics']
                info.pop('forward_count')
                info.pop('play_count')
                info['desc'] = Download.title2path(item['desc'])  # illegal characters must be handled first
                info['uri'] = item['video']['play_addr']['uri']
                info['play_addr'] = item['video']['play_addr']['url_list'][0]
                info['dynamic_cover'] = item['video']['dynamic_cover']['url_list'][0]
                info['status'] = 0  # download progress: waiting = 0, downloading = 0.xx, finished = 1

                # list form
                self.videosL.append(info)
                # dict form
                # self.videos[info['aweme_id']] = info

                # Download tasks could be added right here, but downloading eats bandwidth and slows collection, so downloads start after collection finishes
            if self.limit:
                # self.videos (dict form) is commented out above, so count the list instead
                more = len(self.videosL) - self.limit
                if more >= 0:
                    # If a collection limit was given, drop the excess and stop
                    self.has_more = False
                    # list form
                    self.videosL = self.videosL[:self.limit]
                    # dict form
                    # for i in range(more):
                    #     self.videos.popitem()
                    # return

        else:  # reached when more posts are expected but no data came back
            print('Collection not finished, but the returned post list is empty')



    def download_all(self):
        """
        Once all posts are collected, add the download tasks in one go
        A callback can optionally be registered externally to monitor the download tasks
        """
        for id, video in enumerate(self.videosL):
            # for id, video in self.videos.items():
            gid = self.aria2.download(url=video['play_addr'],
                                      filename='{}/{}_{}.mp4'.format(self.download_path, video['aweme_id'],
                                                                     video['desc'])
                                      # ,options={'gid': id}  # specify the gid
                                      )
            self.gids[gid] = id  # a passed-in gid must be 16 characters, so no gid is specified; a separate dict keeps the mapping
        print('Download tasks submitted')
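
The three crawling loops above share one cursor-based pattern. As written, a request that fails every time leaves has_more at True, so the loop would keep retrying. The sketch below shows the same pattern with an error cap; collect_pages, fetch_page and max_errors are hypothetical names, and a fake three-page data source stands in for the HTTP request.

```python
from typing import Callable, List


def collect_pages(fetch_page: Callable[[int], dict],
                  limit: int = 0,
                  max_errors: int = 3) -> List[dict]:
    """Follow max_cursor/has_more until done, the limit is hit, or errors pile up."""
    items: List[dict] = []
    cursor, has_more, errors = 0, True, 0
    while has_more and errors < max_errors:
        try:
            res = fetch_page(cursor)
            cursor = res['max_cursor']
            has_more = bool(res['has_more'])
        except Exception:
            errors += 1            # give up after max_errors failures in total
            continue
        items.extend(res.get('aweme_list') or [])
        if limit and len(items) >= limit:
            return items[:limit]
    return items


# Demo: a fake three-page API standing in for the HTTP request.
PAGES = {0: {'aweme_list': [{'n': 1}, {'n': 2}], 'max_cursor': 10, 'has_more': True},
         10: {'aweme_list': [{'n': 3}], 'max_cursor': 20, 'has_more': True},
         20: {'aweme_list': [], 'max_cursor': 20, 'has_more': False}}

print(collect_pages(PAGES.__getitem__))
print(collect_pages(PAGES.__getitem__, limit=2))
print(collect_pages(lambda cursor: {}))   # every call fails with KeyError
```

In the class, fetch_page would wrap self.http.get(url, params=params).json() with the cursor placed in params.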
