[Website source code] Using a crawler to download every photo set of a model from a gallery site (with multithreading)

Posted on 2022-8-17 11:33:33

Last edited by 开车司机 on 2022-8-17 11:34

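====================================
== Two approaches == original author @zrq648022547
== Given a model, crawl all of her galleries ==
== Given a gallery's first-page URL, crawl that single gallery's images ==
====================================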
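Code 1: crawl all galleries of a given model
Drawbacks:
1. Although it uses threads, it still seems fairly slow in testing;
2. The downloaded images cannot be named by sequence number;
3. Anti-scraping measures are not handled.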
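Drawbacks 1 and 2 could probably be tackled together. Judging from the listing in this post, each gallery gets its own thread, but that thread is started and joined before the next gallery begins, so the work is effectively serial. Below is a minimal sketch, not the original author's method, of one shared thread pool that also gives files sequential names. `build_tasks`, `download_all`, and the `fetch` callback are hypothetical names introduced here; `fetch` stands in for something like `requests.get(url, headers=headers, timeout=30).content`.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def build_tasks(image_urls, folder):
    """Pair each URL with a zero-padded, sequential file path."""
    tasks = []
    for index, url in enumerate(image_urls, start=1):
        # Keep the URL's extension if it has one; assume .jpg otherwise.
        ext = os.path.splitext(url)[1] or '.jpg'
        tasks.append((url, os.path.join(folder, f'{index:03d}{ext}')))
    return tasks


def download_all(tasks, fetch, max_workers=4):
    """Call fetch(url) for every task in one shared pool and save the bytes."""
    def save(task):
        url, path = task
        data = fetch(url)  # download first, so a failure leaves no empty file
        os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
        with open(path, 'wb') as f:
            f.write(data)
        return path

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in task order, whichever finishes first.
        return list(pool.map(save, tasks))
```

Because the file name comes from the position in the URL list rather than from the URL itself, the order on disk matches the order on the page. A small `max_workers` is deliberate: more threads mean more load on the site.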

# Multi-threaded crawler
import os
import random
import threading
import time

import requests
from bs4 import BeautifulSoup

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
    "UCWEB7.0.2.37/28/999",
    "NOKIA5700/ UCWEB7.0.2.37/28/999",
    "Openwave/ UCWEB7.0.2.37/28/999",
    "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
    # iPhone 6:
    "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
]

# The User-Agent is picked once at start-up and reused for every request.
headers = {
    'User-Agent': random.choice(USER_AGENTS),
    "referer": "https://www.xiurenji.vip/"
}


def get_item(item, main_title):
    title = item['title']
    threads = []
    item_links = [start_url + item['href']]
    # print(f'{title}>>>{item_links}')
    # One thread per link. Each gallery has a single link and the thread is
    # joined right away, so galleries are still processed one at a time.
    for item_link in item_links:
        threads.append(threading.Thread(target=get_images, args=(title, item_link, main_title)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


def get_images(title, item_link, main_title):
    folder = os.path.join(main_folder, main_title, title)
    os.makedirs(folder, exist_ok=True)
    # Follow the "next page" link until the last page is reached.
    while item_link:
        item_res = requests.get(url=item_link, headers=headers, timeout=30)
        # 'gzip' is a content encoding, not a charset; decode the body as UTF-8.
        item_res.encoding = 'utf-8'
        item_soup = BeautifulSoup(item_res.text, 'lxml')
        for img in item_soup.select('.content_left p img'):
            img_link = start_url + img['src']
            img_name = img_link.split('/')[-1]
            starttime = time.time()
            # Download before opening the file so a failed request leaves no empty file.
            image = requests.get(url=img_link, headers=headers, timeout=30).content
            with open(os.path.join(folder, img_name), 'wb') as f:
                f.write(image)
            time.sleep(1)
            endtime = time.time()
            print(f'Saving >>> {title} >>> {img_name} >>> took {endtime - starttime:.3f} seconds')
        # The last pager link carries class="current" on the final page.
        next_link = item_soup.select_one('.page a:last-of-type')
        if next_link is None or 'current' in next_link.get('class', []) or not next_link.get('href'):
            break
        item_link = start_url + next_link['href']


if __name__ == '__main__':
    start = time.time()
    start_url = 'https://www.xrmn5.cc'  # site root URL
    main_title = '秀人网'
    main_folder = '单项目2'  # main folder path (fill in your own folder path)
    main_url = 'https://www.xrmn5.cc/younisi.html'
    itemlist_res = requests.get(url=main_url, headers=headers, timeout=30)
    itemlist_res.encoding = 'utf-8'
    itemlist_soup = BeautifulSoup(itemlist_res.text, 'lxml')
    itemlist = itemlist_soup.select('.list_n2 a')
    # print(itemlist)
    for item in itemlist:
        get_item(item, main_title)
    end = time.time()
    print('Total time:', end - start, 'seconds')
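Code 2: crawl a single gallery
Drawbacks:
1. The folder name has to be set by hand;
2. The gallery's first-page URL has to be set by hand;
3. The number of downloaded images does not match what the pages show; some images on some pages are skipped, and I don't know why.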
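One plausible cause of drawback 3, which cannot be confirmed without the live page, is the positional XPath: `p/img[i]` means "the i-th `<img>` inside each `<p>`", and taking only the first match for i = 1..3 drops any image that sits in a second `<p>` or beyond the third position. The sketch below reproduces that effect on made-up markup with the standard library's `xml.etree.ElementTree` (used here only so the example runs offline; the markup and file names are assumptions, not the site's real HTML).

```python
import xml.etree.ElementTree as ET

# Hypothetical page: three images in one <p>, two more in a second <p>.
PAGE = """
<div>
  <div class="content_left">
    <p><img src="/uploadfile/a.jpg"/><img src="/uploadfile/b.jpg"/><img src="/uploadfile/c.jpg"/></p>
    <p><img src="/uploadfile/d.jpg"/><img src="/uploadfile/e.jpg"/></p>
  </div>
</div>
"""

root = ET.fromstring(PAGE)
BASE = ".//*[@class='content_left']/p/img"


def srcs(path):
    """Return the src attribute of every <img> matched by path."""
    return [img.get('src') for img in root.findall(path)]


# What a loop over i = 1..3 that keeps only the first match would collect.
picked_by_loop = []
for i in range(1, 4):
    matches = srcs(f"{BASE}[{i}]")
    if matches:
        picked_by_loop.append(matches[0])

# Dropping the positional index selects every image in document order.
all_images = srcs(BASE)
missed = [src for src in all_images if src not in picked_by_loop]
print('picked:', picked_by_loop)
print('missed:', missed)
```

If this is what happens on the real pages, selecting `//*[@class="content_left"]/p/img/@src` without an index and iterating over every result (parsel's `getall()`) would likely pick up the skipped images.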
import datetime
import os
import random
import time

import parsel
import requests


starttime = datetime.datetime.now()
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
    "UCWEB7.0.2.37/28/999",
    "NOKIA5700/ UCWEB7.0.2.37/28/999",
    "Openwave/ UCWEB7.0.2.37/28/999",
    "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
    # iPhone 6:
    "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
]

headers = {
    'User-Agent': random.choice(USER_AGENTS),
    # 'Connection': 'close'
    "referer": "https://www.xrmn5.cc"
}

# 0. Create the output folder (named by hand after the gallery)
directory = '[YouMi尤蜜荟]Vol.809_女神尤妮丝Egg红色轻透上衣配红短裙半脱露红色内衣诱惑写真60P'
os.makedirs(directory, exist_ok=True)

# 1. Choose the pages to crawl
for page in range(0, 30):
    if page >= 1:
        base_url = 'https://www.xrmn5.com/YouMi/2022/202211014_{}.html'.format(page)
    else:
        base_url = 'https://www.xrmn5.com/YouMi/2022/202211014.html'
    # print('========== Crawling page {} ============='.format(page))
    # 2. Send the request (once per page; the original fetched each page twice)
    response = requests.get(url=base_url, headers=headers, timeout=30)
    response.encoding = 'UTF-8'  # set the charset explicitly
    html_1 = parsel.Selector(response.text)
    # 3. Extract image URLs; the loop assumes at most three images per <p>
    for i in range(1, 4):
        img_list_1 = html_1.xpath('//*[@class="content_left"]/p/img[{}]/@src'.format(i)).extract_first()
        if img_list_1 is None:
            # No i-th image on this page. The original called .replace() on
            # None here, which raises AttributeError rather than IndexError.
            continue
        img_list = img_list_1.replace('uploadfile', 'Uploadfile')
        img_url = 'https://p.xrmn5.com/' + img_list
        # Request the image itself
        img_data = requests.get(img_url, headers=headers, timeout=30).content
        img_name_1 = str(page + 1)  # image file name: page number
        # 4. Save the data
        with open(os.path.join(directory, img_name_1 + '-' + str(i) + '.jpg'), 'wb') as f:
            f.write(img_data)
        time.sleep(10)  # pause between downloads, after the file is closed
        print('##################### Downloaded page', page + 1, ', image', i, '#####################')
endtime = datetime.datetime.now()
print('#################### Download finished, total time', int((endtime - starttime).total_seconds()), 'seconds ###################')


Posted on 2022-10-30 11:34:07
666
Posted on 2024-6-2 19:03:38
a