[Question] Page apparently not updating; crawler keeps writing the same post

Board: Python | Author: (The_rabbit) | Time: 1 year ago (2022/12/15 12:39), edited | Push/boo: 0(0010)
10 comments, 3 participants, latest 1 year ago | Thread 1/1
Hi everyone,

I have recently been learning how to write a web crawler, so I picked the web version of PTT as a practice target. The problem I ran into is that after 10 posts it keeps grabbing the title and link of the same post over and over.
https://imgur.com/a/Bnqo2B1

My guess is that the page is not loading new data. But shouldn't a dynamic page that loads on scroll update as soon as you scroll down? I can also see that the Selenium scroll in ChromeDriver is actually being executed. What could be causing this?

My code is below:

import urllib.request as req
import requests
import selenium
import schedule
import time
import json
from time import sleep
import json
import openpyxl
import random
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
import bs4

pttWeb = openpyxl.load_workbook('pttweb.xlsx')
ws = pttWeb.active
i = 1
scroll_time = int(input("scroll_Times"))
options = Options()
options.chrome_executable_path = "C:\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(options = options)
sleep(3)
driver.get('https://www.pttweb.cc/hot/all/today')
sleep(5)
prev_ele = None
for now_time in range(1, scroll_time+1):
    sleep(2)
    eles = driver.find_elements(by=By.CLASS_NAME,value='e7-right.ml-2')
    # If the list contains the last element from the previous round, slice from
    # that element to the current last element and crawl only that range
    try:
        # print(eles)
        # print(prev_ele)
        eles = eles[eles.index(prev_ele):]
    except:
        pass
    for ele in eles:
        try:
            titleInfo = ele.find_element(by=By.CLASS_NAME, value = "e7-article-default")
            title = titleInfo.text
            href = titleInfo.get_attribute('href')
            ws.cell(i,1,i)
            ws.cell(i,2,title)
            ws.cell(i,3,href)
            sleep(3)
            inner =req.Request(href, headers ={
                "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
            })
            with req.urlopen(inner) as innerRespomse:
                articleData = innerRespomse.read().decode("utf-8")
            articleRoot = bs4.BeautifulSoup(articleData, "html.parser")
            main_content = articleRoot.find("div", itemprop="articleBody")
            boardInfo= articleRoot.find("span", class_="e7-board-name-standalone")
            authorInfo = articleRoot.find("span", itemprop="name")
            timeInfo = articleRoot.find("time", itemprop="datePublished")
            countInfo = articleRoot.find_all("span", class_="e7-head-content")
            board = boardInfo.text
            author = authorInfo.text
            Time = timeInfo.text
            count = countInfo[4].text
            allContent = main_content.text
            pre_text = allContent.split('--')[0]
            ws.cell(i,4,board)
            ws.cell(i,5,author)
            ws.cell(i,6,Time)
            ws.cell(i,7,count)
            ws.cell(i,8,pre_text)
            pttWeb.save('pttweb.xlsx')
            sleep(random.uniform(5,20))
            i = i+1
        except:
            pass
    prev_ele = eles[-1]
    print(f"now scroll {now_time}/{scroll_time}")
    js = "window.scrollTo(0, document.body.scrollHeight);"
    driver.execute_script(js)
    sleep(40)
driver.quit()
_____________________
Thanks in advance, everyone.

--
※ Origin: 批踢踢實業坊 (ptt.cc), From: 49.158.79.67 (Taiwan)
※ Article URL: https://www.ptt.cc/bbs/Python/M.1671079197.A.34F.html
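Whatever the underlying cause turns out to be, the repeated rows described in the post could be guarded against by tracking which links have already been written. Note also that the slice `eles[eles.index(prev_ele):]` starts at `prev_ele` itself, so that element is handled again on every round. The following is a minimal sketch, assuming each post is uniquely identified by its href; `new_links` and `seen` are hypothetical names that do not appear in the original script.

```python
def new_links(seen, links):
    """Return the links not seen before, in order, and record them in `seen`.

    `seen` is a set that persists across scroll rounds (hypothetical helper).
    Empty or missing hrefs are skipped.
    """
    fresh = []
    for href in links:
        if href and href not in seen:
            seen.add(href)
            fresh.append(href)
    return fresh


seen = set()
first_round = new_links(seen, ["/post/a", "/post/b"])
# The second round overlaps with the first and contains a missing href.
second_round = new_links(seen, ["/post/b", "/post/c", None, "/post/c"])
```

In the crawler loop, the hrefs collected from `eles` would be passed through such a filter before any row is written to the worksheet, so a link is stored at most once no matter how often the page returns it.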

12/15 13:09, 1 year ago, 1F
I'd suggest getting rid of the try-except: pass first, and posting the code on pastebin so it is easier to read

12/15 16:34, 1 year ago, 2F
Update: https://pastebin.com/cyUdWYLZ (Pastebin of the code)

12/15 16:37, 1 year ago, 3F
Update: https://pastebin.com/cyUdWYLZ (Pastebin of the code)

12/16 01:28, 1 year ago, 4F
Blind guess: you are selecting the wrong class. Titles do not only use e7-article-default

12/16 01:29-01:30, 1 year ago, 5F-6F
There are also e7-article-viewed and e7-article-most-recently-viewed
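The point about the title using more than one class can be sketched as a single CSS selector that accepts any of the three class names mentioned above. This is only a sketch: the class list comes from the comments and may be incomplete, and `TITLE_CLASSES` and `build_title_selector` are hypothetical names.

```python
# Class names quoted from the comments; the list may not be exhaustive.
TITLE_CLASSES = (
    "e7-article-default",
    "e7-article-viewed",
    "e7-article-most-recently-viewed",
)


def build_title_selector(class_names=TITLE_CLASSES):
    """Build a CSS selector matching an element that has any of the classes."""
    return ", ".join(f".{name}" for name in class_names)


selector = build_title_selector()

# With Selenium the selector could replace the single-class lookup
# (not executed here, since it needs a running browser):
#   titleInfo = ele.find_element(By.CSS_SELECTOR, selector)
```

A comma-separated selector matches if any alternative matches, so one `find_element` call would cover default, viewed, and most-recently-viewed titles alike.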

12/16 01:31, 1 year ago, 7F
Also, don't put pass in your try/except

12/16 01:32, 1 year ago, 8F
It is surely raising a class-not-found error. Why pass on it?

12/16 01:33, 1 year ago, 9F
If you are not going to debug, you might as well delete all the try/except

12/16 01:33, 1 year ago, 10F
Writing it and then passing is like taking off your pants to fart
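The advice about try/except in the comments above can be sketched as follows: instead of `except: pass`, log the exception with its traceback and keep a record of what failed, so a missing class or a failed request becomes visible. In the real script the natural exception to catch around `find_element` would be Selenium's `NoSuchElementException`; the sketch below catches `Exception` only so that it runs without a browser. `process_all` and `handler` are hypothetical names.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crawler")


def process_all(items, handler):
    """Run `handler` on each item; log failures instead of silently hiding them.

    Returns a list of (index, exception) pairs for the items that failed.
    """
    failures = []
    for index, item in enumerate(items):
        try:
            handler(item)
        except Exception as exc:  # narrow this to the expected exception types
            # log.exception records the message plus the full traceback
            log.exception("item %d failed: %r", index, item)
            failures.append((index, exc))
    return failures


# Stand-in workload: dividing by zero plays the role of a failed lookup.
failures = process_all([1, 0, 2], lambda x: 1 / x)
```

With this pattern the loop still continues past a bad post, but every skipped item leaves a traceback in the log, which is usually the quickest way to see which lookup is failing.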
Article ID (AID): #1ZcgKTDF (Python)