[Question] Scraping PTT web pages: where is the error in this code?

Board: R_Language | Author: (M) | Posted: 2016/10/18 15:23 (edited)
7 comments, 2 participants, thread 1/2
[Question type]: Programming help (I want to do something in R but do not know how to write it in R)

[Software familiarity]: User (have already built quite a few things with R)

[Problem description]:
I typed in the following code from a book to try to scrape the text of articles on the StupidClown board, but after running it I get:

Error in regexpr("www", line) : argument "line" is missing, with no default

After reading up on how regexpr is used, I found that the way "line" appears in this program does not match that function's usage. In this example, how should I modify the code so that it fetches the articles on StupidClown? Thanks for your help.

[Code sample]:
install.packages("XML")
install.packages("RCurl")
library(XML)
library(RCurl)

data <- list()
for(i in 1058:1118){
  tmp <- paste(i, '.html', sep = '')
  url <- paste('https://www.ptt.cc/bbs/StupidClown/index',tmp,sep = '')
  get_url <- getURL(url,ssl.verifypeer = FALSE)
  html <- htmlParse(get_url)
  url.list <- xpathSApply(html,"//div[@class='title']/a[@href]",xmlAttrs)
  data <- rbind(data, paste('https://www.ptt.cc',url.list,sep = ''))
}
data <- unlist(data)

getdoc <- function(line){
  start <- regexpr('www', line)[1]
  end <- regexpr('html', line)[1]
  if(start != -1 & end != -1){
    url <- substr(line, start, end+3)
    html <- htmlParse(getURL(url,ssl.verifypeer = FALSE),encoding = 'UTF-8')
    doc <- xpathSApply(html, "//div[@id='main-container']",xmlValue)
    name <- strsplit(url,'/')[[1]][4]
    write(doc,gsub('html','txt',name))
  }
}

getdoc()
sapply(data, getdoc)
setwd("C://Documents and Settings//12345//桌面//R_textmining")
write.table(getdoc,file = "getdoc.txt",row.names = F,quote = F)

[Environment]:
R version 3.3.1 (2016-06-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows XP (build 2600) Service Pack 3

[Keywords]: regexpr, xpathSApply, PTT crawler

--
※ Origin: 批踢踢實業坊 (ptt.cc), From: 118.163.143.251
※ Article URL: https://www.ptt.cc/bbs/R_Language/M.1476775409.A.321.html
※ Edited: mikemlbb (118.163.143.251), 10/18/2016 15:29:00
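For anyone who runs into the same message: it most likely comes from the bare getdoc() call, which runs the function without a value for "line". The sapply(data, getdoc) line right after it already passes each URL in, so dropping the bare call is probably enough to get past this particular error. Below is a small offline sketch of that behaviour. The helper name extract_url and the sample URL are made up for illustration and are not part of the original code.

```r
# Hypothetical helper that mirrors the string handling inside getdoc();
# it needs no network access, so the behaviour is easy to check.
extract_url <- function(line) {
  start <- regexpr("www", line)[1]
  end   <- regexpr("html", line)[1]
  if (start != -1 && end != -1) {
    # "html" is 4 characters long, so end + 3 is the position of its last letter
    substr(line, start, end + 3)
  } else {
    NA_character_
  }
}

# Calling the function with no argument reproduces the message from the post.
msg <- tryCatch(extract_url(), error = function(e) conditionMessage(e))
print(msg)

# Passing one element works. The article id below is an invented placeholder.
sample_line <- "https://www.ptt.cc/bbs/StupidClown/M.0000000000.A.000.html"
print(extract_url(sample_line))
```

Note that the extracted string starts at "www", so the "https://" prefix is dropped before the value is handed to getURL.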

10/19 09:41, , 1F
Hi, I am trying to solve your problem.

10/19 09:41, , 2F
Would you tell me what your expected output is?

10/19 09:42, , 3F
The "data" object contains 1220 URL strings.

10/21 02:23, , 4F
I'm trying to crawl the content of the StupidClown board,

10/21 02:25, , 5F
including article title and content, by no.1058 to11

10/21 02:27, , 6F
But the code seems to be wrong. When I run "getdoc()",

10/21 02:29, , 7F
the error appears, saying "line" is not defined.
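
Following up on the goal described in the comments above (saving the title and content of each article), here is one possible way to reorganise the script. It is a sketch, not a tested crawler. It assumes the XML and RCurl packages from the original post are installed, and the names get_article_urls, save_article, run_crawl and out_dir are hypothetical. It also writes one text file per article instead of calling write.table(getdoc, ...), because getdoc is a function and write.table cannot turn a function into a table.

```r
# Sketch only: XML and RCurl (the packages used in the post) must be installed
# before the functions below are actually called.
base_url <- "https://www.ptt.cc"

# Index pages 1058 to 1118, the range used in the original loop.
index_urls <- paste0(base_url, "/bbs/StupidClown/index", 1058:1118, ".html")

# Collect the article links found on one index page.
get_article_urls <- function(index_url) {
  page  <- RCurl::getURL(index_url, ssl.verifypeer = FALSE)
  html  <- XML::htmlParse(page, asText = TRUE, encoding = "UTF-8")
  hrefs <- XML::xpathSApply(html, "//div[@class='title']/a",
                            XML::xmlGetAttr, "href")
  if (length(hrefs) == 0) return(character(0))
  paste0(base_url, unlist(hrefs))
}

# Download one article and write its text to "<article id>.txt" in out_dir.
save_article <- function(url, out_dir = ".") {
  page <- RCurl::getURL(url, ssl.verifypeer = FALSE)
  html <- XML::htmlParse(page, asText = TRUE, encoding = "UTF-8")
  doc  <- XML::xpathSApply(html, "//div[@id='main-container']", XML::xmlValue)
  out  <- file.path(out_dir, sub("\\.html$", ".txt", basename(url)))
  writeLines(unlist(doc), out)
  invisible(out)
}

# Wrapped in a function so that nothing is downloaded until run_crawl() is called.
run_crawl <- function(out_dir = ".") {
  article_urls <- unlist(lapply(index_urls, get_article_urls))
  invisible(vapply(article_urls, save_article, character(1), out_dir = out_dir))
}
```

Calling run_crawl("some_folder") after setwd() would then fetch every index page and save the articles into that folder. Passing the full URL to save_article keeps the "https://" prefix, so the regexpr and substr step is no longer needed.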
Article code (AID): #1O1StnCX (R_Language)