[Question] Scraping PTT article content

Board: R_Language  Author: miao2361 (Miao)  Time: 2015/04/01 17:44  Votes: 5 (5 push, 0 boo, 5 neutral)

10 comments, 3 participants, thread 1/1

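[Question type]: Programming help (I want to do something in R but don't know how to write it in R)

[Familiarity with the software]: User (have already built quite a few things with R)

[Problem description]:

Question 1: I am using the httr and XML packages to save PTT articles as .txt files for later text mining. On the Gossiping board, however, the "confirm you are 18 or older" page gets in the way and the article is not saved.

Question 2: A question about RCurl (details below).

Question 1

    library(XML)
    library(httr)

    # locate the URL inside the input line
    start <- regexpr('www', line)[1]
    end <- regexpr('html', line)[1]
    if (start != -1 && end != -1) {
      url <- substr(line, start, end+3)
      html <- content(GET(url), encoding="UTF8")
      # extract the article body
      doc <- xpathSApply(html, "//div[@id='main-content']", xmlValue)
      # 4th path component is the file name, e.g. M.1427811176.A.552.html
      name <- strsplit(url, '/')[[1]][4]
      write(doc, gsub('html', 'txt', name))
    }

# When the input is a PTT article URL from any board other than Gossiping

    line = "https://www.ptt.cc/bbs/StupidClown/M.1427811176.A.552.html"

a new .txt file appears in the working directory, containing the text of this StupidClown article.

# When the input is a Gossiping board URL

    line = "https://www.ptt.cc/bbs/Gossiping/M.1427816656.A.450.html"

the saved .txt file is empty. My guess is that the Gossiping board's over-18 confirmation page is the cause:

    https://www.ptt.cc/bbs/Gossiping/M.1427816656.A.450.html

Could anyone on the board tell me how to get past this page?

Question 2

Original code:

    url <- substr(line, start, end+3)
    html <- content(GET(url), encoding="UTF8")
    doc <- xpathSApply(html, "//div[@id='main-content']", xmlValue)
    name <- strsplit(url, '/')[[1]][4]
    write(doc, gsub('html', 'txt', name))

I originally wanted to use the RCurl package for the second line:

    html <- htmlParse(getURL(url), encoding='UTF-8')

but the resulting html is not the article:

    > html
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html>
    <head><title>301 Moved Permanently</title></head>
    <body bgcolor="white">
    <center><h1>301 Moved Permanently</h1></center>
    <hr>
    <center>nginx</center>
    </body>
    </html>

After I switched to content() and GET() from the httr package it worked, but I don't understand why XD

---

After looking into it a bit myself, I think these problems touch on HTTP request headers, which I have never studied. Could anyone on the board recommend articles I could use to teach myself? Thank you very much!

[Environment]:

    > sessionInfo()
    R version 3.1.2 (2014-10-31)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

--
※ Origin: PTT (ptt.cc), from: 111.80.167.159
※ Article URL: https://www.ptt.cc/bbs/R_Language/M.1427881473.A.71F.html
※ Edited: miao2361 (111.80.167.159), 04/01/2015 17:59:17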
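One possible way to approach both questions is sketched below. It is not a confirmed answer and rests on assumptions that are not in the post. The first is that PTT's age check is satisfied by sending a cookie named over18 with value 1. The second is that the 301 response comes from the request going out without the https:// scheme, because the URL was cut starting at "www": httr follows redirects by default, while RCurl's getURL() does not unless followlocation = TRUE is set. The helper names extract_ptt_url, txt_name and fetch_ptt_text are hypothetical.

```r
# Hypothetical helpers (names are not from the original post).
# httr and XML are only needed when fetch_ptt_text() is actually called.

# Extract the full article URL from a line, keeping the https:// scheme
# so the server has no reason to redirect from http to https.
extract_ptt_url <- function(line) {
  m <- regmatches(line, regexpr("https?://www\\.ptt\\.cc/\\S+?\\.html",
                                line, perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

# Derive the output file name: M.123.A.456.html -> M.123.A.456.txt
txt_name <- function(url) sub("\\.html$", ".txt", basename(url))

# Fetch one article and return the text of its main-content div.
# Assumption: the age gate is passed by sending the cookie over18=1.
fetch_ptt_text <- function(url) {
  resp <- httr::GET(url, httr::set_cookies(over18 = "1"))
  httr::stop_for_status(resp)  # fail loudly on 4xx/5xx
  page <- httr::content(resp, as = "text", encoding = "UTF-8")
  doc <- XML::htmlParse(page, asText = TRUE, encoding = "UTF-8")
  XML::xpathSApply(doc, "//div[@id='main-content']", XML::xmlValue)
}

# Usage (needs network access, so it is left commented out):
# line <- "https://www.ptt.cc/bbs/Gossiping/M.1427816656.A.450.html"
# url  <- extract_ptt_url(line)
# write(fetch_ptt_text(url), txt_name(url))

# A rough RCurl equivalent, under the same assumptions:
# page <- RCurl::getURL(url, followlocation = TRUE, cookie = "over18=1")
```

Keeping the scheme in the extracted URL and using basename() for the file name means the file name no longer depends on the position of a path component.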