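Re: [Question] Crawler error
※ Quoting darklimit ():
I added a random sleep between requests, but this error still occurs:
error: [Errno 10054] An existing connection was forcibly closed by the remote host.
If I handle it with except and continue,
the genre and rating that correspond to each nameTag all get out of order.
How should I handle this?
Thanks.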
# Python 2 (urllib2), as in the original post
import random
import time
import urllib2
from urllib2 import HTTPError, URLError
from bs4 import BeautifulSoup

# The post only shows the function body; the name and the list setup are assumed.
def crawl(idlist):
    titlelist, genrelist, valuelist = [], [], []
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT '
               '6.1;en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    for i in idlist:
        req = urllib2.Request("http://www.imdb.com/title/tt" + i + "/",
                              headers=headers)
        try:
            # Pass req (not the bare URL) so the User-Agent header is sent.
            html = urllib2.urlopen(req, timeout=30)
            htmls = html.read()
            html.close()
            # And sleep here after every connection.
            rsleep = random.randint(10, 40)
            time.sleep(rsleep)
        except HTTPError as e:
            # Handle the error, then break.
            # Best to save the data you have processed so far at this point,
            # so you can stop safely and pick up again next time.
            print e.code
            print e.read()
            break
        except URLError as e:
            print 'Reason: ', e.reason
            break
        # Parse only after the page was fetched successfully.
        soup = BeautifulSoup(htmls)
        nameTag = [a.get_text() for a in soup.find_all("title")]
        genreTag = [a.get_text() for a in
                    soup.find_all("span", {"itemprop": "genre"})]
        ratingTag = soup.find_all("span", {"itemprop": "ratingValue"})
        for tag in nameTag:
            titlelist.append(nameTag)
        for tag in genreTag:
            genrelist.append(genreTag)
            break
        for tag in ratingTag:
            val = ''.join(tag.find(text=True))
            valuelist.append(val)
    return zip(titlelist, genrelist, valuelist)
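
The three parallel lists are what lets things get out of step: if one page fails or is missing a field, every later entry shifts. One possible way around that, as a rough Python 3 sketch (urllib.request takes the place of urllib2 there), is to keep one record per movie id and stop at the first failure so the partial result can be saved and resumed. The names fetch_page, crawl, save_checkpoint and load_checkpoint are made up for this example, and parse stands for whatever function turns a page into fields; none of them come from the original post.

```python
import json
import random
import time
import urllib.request


def fetch_page(movie_id, timeout=30):
    """Download one title page (hypothetical helper, Python 3)."""
    url = "http://www.imdb.com/title/tt" + movie_id + "/"
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def crawl(idlist, fetch, parse, done=None, pause=(10, 40)):
    """Return (records, failed_id); failed_id is None if every id succeeded.

    records maps movie id -> parsed fields, so a failed id cannot shift
    the title/genre/rating of any other movie.
    """
    records = dict(done or {})
    for movie_id in idlist:
        if movie_id in records:          # already fetched in an earlier run
            continue
        try:
            page = fetch(movie_id)
        except OSError as e:
            # In Python 3, URLError, HTTPError and ConnectionResetError
            # (the Errno 10054 case) are all subclasses of OSError.
            print("stopped at", movie_id, "reason:", e)
            return records, movie_id
        records[movie_id] = parse(page)
        time.sleep(random.uniform(*pause))   # rest after every connection
    return records, None


def save_checkpoint(records, path):
    """Write what has been collected so far to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)


def load_checkpoint(path):
    """Read an earlier checkpoint, or start empty if there is none."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}
```

A run would then look roughly like: records, failed = crawl(idlist, fetch_page, parse, done=load_checkpoint("imdb.json")), followed by save_checkpoint(records, "imdb.json"), where "imdb.json" is just a placeholder file name. Running it again later skips the ids that are already in the checkpoint.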
--
※ Posted from: 批踢踢實業坊 (ptt.cc)
◆ From: 118.160.190.62
--
※ Posted from: 批踢踢實業坊 (ptt.cc)
◆ From: 114.42.51.172