Trending Topics Crawler

2022-09-29

Python 3

1. Three common ways to fetch a page

(1) Send the request with the requests module

import requests

url = "https://www.zhihu.com/hot"
headers = {
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Mobile Safari/537.36'
}
# Set a timeout so a stalled connection cannot hang the crawler
resp = requests.get(url, headers=headers, timeout=10)
print(resp.text)
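
The request above can be wrapped in a small helper that fails loudly on HTTP errors instead of returning an error page as if it were content. This is only a sketch: fetch_html and DEFAULT_HEADERS are hypothetical names, and the session argument is assumed to be a requests.Session (or anything with a compatible get method).

```python
from typing import Any

# Same user-agent string as in the snippet above
DEFAULT_HEADERS = {
    "user-agent": (
        "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/105.0.0.0 Mobile Safari/537.36"
    )
}


def fetch_html(session: Any, url: str, timeout: float = 10.0) -> str:
    """Fetch url through a requests.Session-like object and return the body text."""
    resp = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    # Raise on 4xx/5xx responses rather than handing an error page to the parser
    resp.raise_for_status()
    return resp.text
```

With requests installed, a call would look like fetch_html(requests.Session(), "https://www.zhihu.com/hot"). Reusing one session across platforms also reuses the underlying connections.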

(2) Use selenium + PhantomJS

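Selenium is a web automation tool that was originally developed for automated website testing. It works much like the keyboard macro tools used in games, carrying out scripted commands automatically; the difference is that Selenium drives a real browser directly. It supports all major browsers and can also run without a visible window, either through the headless browser PhantomJS or through Chrome in headless mode.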

Install selenium

pip3 install selenium==2.48.0

Install PhantomJS

Windows

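# Download the build that matches your system
https://phantomjs.org/download.html
# Add the directory containing phantomjs.exe to the PATH environment variable
# Check the version
phantomjs -v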

Linux

# Download with wget
wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
# Extract the archive and create a symlink
tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2
sudo cp -R phantomjs-2.1.1-linux-x86_64 /usr/local/share/
sudo ln -sf /usr/local/share/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin/

https://blog.csdn.net/xudailong_blog/article/details/79674107

Usage example

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('https://www.kuaishou.com')
# page_source holds the HTML after the page's JavaScript has run
r = driver.page_source

(3) Use selenium + ChromeDriver

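Selenium has dropped support for PhantomJS, so use Firefox or Chrome in headless mode instead. This requires installing the browser itself and the driver that matches its version.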

Install the driver and add it to the PATH environment variable

https://chromedriver.storage.googleapis.com/index.html?path=105.0.5195.52/

from selenium import webdriver

option = webdriver.ChromeOptions()
# Run Chrome in headless mode (no visible window)
option.add_argument('--headless')

driver = webdriver.Chrome(options=option)

url = 'https://tophub.today/n/K7GdaMgdQy'
driver.get(url)

page_text = driver.page_source

# Close the browser once the page source has been read
driver.quit()

2. Three ways to parse the page

Method	Performance
Regular expressions	Fast
Beautiful Soup	Slow
lxml	Fast
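
As an illustration of the regular-expression approach from the table, the sketch below pulls rank, link, title, and popularity out of a small HTML fragment. The markup in SAMPLE_HTML is invented for the example and is not the real structure of any of the sites mentioned here; the actual pattern has to be written against the page source seen in the browser. ROW_PATTERN and parse_rows are hypothetical names.

```python
import re

# Hypothetical markup, used only to demonstrate the technique
SAMPLE_HTML = """
<tr><td>1.</td><td><a href="/l?e=abc">First topic</a></td><td>1.2M</td></tr>
<tr><td>2.</td><td><a href="/l?e=def">Second topic</a></td><td>980K</td></tr>
"""

# One named group per field we want to store
ROW_PATTERN = re.compile(
    r'<tr><td>(?P<rank>\d+)\.</td>'
    r'<td><a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a></td>'
    r'<td>(?P<number>[^<]*)</td></tr>'
)


def parse_rows(html):
    """Return one dict per matched table row."""
    return [match.groupdict() for match in ROW_PATTERN.finditer(html)]


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row["rank"], row["title"], row["number"], row["url"])
```

A regular expression is fast but breaks as soon as the markup changes slightly, which is the usual reason to accept the slower Beautiful Soup or to use lxml instead.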

https://www.jianshu.com/p/f5d9cf7eeb2d

https://blog.csdn.net/piflf/article/details/125059881

3. Run scheduled tasks with the schedule module

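import time

import schedule

# Run startProgram (the crawler entry point, defined elsewhere) on a fixed interval.
# The first run happens only after 10 seconds have elapsed.
schedule.every(10).seconds.do(startProgram)
while True:
    schedule.run_pending()
    time.sleep(1)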
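
One thing to watch with this loop: as far as I know, schedule does not catch exceptions raised inside a job, so a single failed crawl would propagate out of run_pending() and end the while loop. A possible guard is a small decorator that logs the error and lets the next run proceed. This is a sketch; catch_exceptions and flaky_crawl are hypothetical names, and flaky_crawl only simulates a failing job.

```python
import functools
import logging


def catch_exceptions(job_func):
    """Wrap a job so that an exception is logged instead of propagating."""
    @functools.wraps(job_func)
    def wrapper(*args, **kwargs):
        try:
            return job_func(*args, **kwargs)
        except Exception:
            # Log the full traceback and carry on with the next scheduled run
            logging.exception("job %s failed", job_func.__name__)
            return None
    return wrapper


@catch_exceptions
def flaky_crawl():
    # Stand-in for a crawl that hits a network or parsing error
    raise RuntimeError("simulated crawl failure")


result = flaky_crawl()  # the error is logged, nothing is raised
```

The wrapped job is registered the same way as before, for example schedule.every(10).seconds.do(catch_exceptions(startProgram)).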

https://www.cnblogs.com/longsongpong/p/10998619.html

4. Connect to a MySQL database

(1) Install the pymysql library

pip install pymysql

(2) Create the database and table

# Create the news table
CREATE TABLE news (
ID int not null primary key auto_increment comment 'primary key',
HotRank varchar(255) not null comment 'rank of the hot topic',
Title varchar(255) not null comment 'title of the hot topic',
HotNumber varchar(255) not null comment 'popularity count of the hot topic',
Platform varchar(255) not null comment 'source platform',
Url varchar(2048) not null comment 'link to the hot topic',
Ctime timestamp not null default current_timestamp comment 'creation time'
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

# Inspect the table structure
show full columns from news;
describe news;

(3) Connect to the database

import pymysql

# password and database replace the deprecated aliases passwd and db
db = pymysql.connect(host='xxx.xxx.xxx.xxx', port=3306, user='xxx', password='xxxxxx', database='yolo', charset='utf8')
cursor = db.cursor()

If the database runs on the local machine, host can be set to 127.0.0.1 or localhost.

(4) Query and insert data

def queryDB(cursor):
    # Query the database
    try:
        sql = "select Platform from news"
        cursor.execute(sql)
        data = cursor.fetchall()
        print(data)
    except Exception as e:
        print(e)

def insertDB(hot_rank, hot_title, hot_number, hot_Platform, hot_url, db, cursor):
    # Insert one row; values are passed as parameters, never formatted into the SQL string
    try:
        sql = "insert into news(HotRank,Title,HotNumber,Platform,Url) values (%s,%s,%s,%s,%s)"
        params = (hot_rank, hot_title, hot_number, hot_Platform,hot_url)
        cursor.execute(sql,params)
        db.commit()
    except Exception as e:
        db.rollback()
        print(e)
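
insertDB commits once per row. When a crawl yields a whole ranking at once, a batch variant built on the cursor's executemany method commits the list in a single transaction. The following is a sketch under that assumption; insert_many and INSERT_SQL are hypothetical names, and rows is assumed to be a list of (hot_rank, hot_title, hot_number, platform, url) tuples.

```python
INSERT_SQL = (
    "insert into news(HotRank,Title,HotNumber,Platform,Url) "
    "values (%s,%s,%s,%s,%s)"
)


def insert_many(rows, db, cursor):
    """Insert all rows in one transaction; return True on success."""
    try:
        # executemany runs the same parameterized statement for every tuple
        cursor.executemany(INSERT_SQL, list(rows))
        db.commit()
        return True
    except Exception as e:
        # Undo the partial batch so the table never holds half a ranking
        db.rollback()
        print(e)
        return False
```

Because the commit happens once at the end, either every row of a ranking is stored or none is.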

(5) Close the connection

# Close the cursor object
cursor.close()
# Close the database connection
db.close()
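
If an exception is raised before these two calls are reached, the connection stays open. One way to make the cleanup unconditional is contextlib.closing from the standard library, which calls close() when the with block exits for any reason. The sketch below assumes only that the connection and cursor objects expose close(); run_with_connection is a hypothetical helper name.

```python
from contextlib import closing


def run_with_connection(connect, work):
    """Open a connection, run work(db, cursor), and always close both.

    connect: zero-argument callable returning a connection object
    work:    callable taking (db, cursor) and returning a result
    """
    # The cursor is closed first, then the connection, even if work() raises
    with closing(connect()) as db, closing(db.cursor()) as cursor:
        return work(db, cursor)
```

With the functions from this section, a call could look like run_with_connection(lambda: pymysql.connect(...), lambda db, cursor: queryDB(cursor)), with the connection arguments filled in as in step (3).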

https://blog.csdn.net/wenxuhonghe/article/details/121945942