
scrapy xpath extraction and its encoding problem

Board: Python | Author: stevec (steve) | Time: 2014/11/29 19:20 | Pushes: 0 (0 up, 0 down, 5 neutral)

5 comments, 2 participants, post 1 of 2 in this thread

I can't figure out why this happens, so I'd appreciate it if someone could look at the code below.
The goal is to use XPath in scrapy to extract some information I want.
raw_html_article_content_ stores the part I actually want to extract.
raw stores a larger range, so in theory raw should contain everything in raw_html_article_content_.
But the text inside raw comes out slightly different from what is in raw_html_article_content_, for example:

raw: 結婚並無Z&gt;B (this matches what Chrome shows when viewing the page source)
raw_html_article_content_: 結婚並無Z>B

How can I make what is stored in raw the same as what is in raw_html_article_content_?

P.S. Environment: Win 7, Python 2.7, scrapy 1.4

# -*- coding: utf-8 -*-
from scrapy.http import HtmlResponse
from scrapy.selector import Selector
import urllib2

address = "http://www.ptt.cc/bbs/Boy-Girl/M.1416362560.A.881.html"

# Download the page and wrap it so scrapy selectors can be used on it
response = urllib2.urlopen(address)
html = response.read()
html_response = HtmlResponse(address, body=html)
sel = Selector(html_response)

# Marker text of the footer span; everything before the last such span is the article
recog_assist_word = u"※ 文章網址: "
xpath = u"""/html/body/div[@id="main-container"]/div[@id="main-content"]/
span[@class="f2" and text()="%s"][last()]/preceding-sibling::node()""" % recog_assist_word

# All nodes preceding the marker, joined into one string
raw_html_article_content_ = u"".join(sel.xpath(xpath).extract())

# The whole document, serialized from the root element
raw = sel.xpath(u"/html").extract()[0]

print raw_html_article_content_
print raw

--
※ Origin: PTT (ptt.cc), from: 140.112.218.124
※ Article URL: http://www.ptt.cc/bbs/Python/M.1417260034.A.579.html
※ Edited: stevec (140.112.218.124), 11/29/2014 19:39:32
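One possible explanation, offered as an assumption rather than something confirmed in this thread: when an element is extracted it is serialized back to markup, so a literal > inside text is written as the entity &gt;, while a text node extracted on its own comes back already decoded. The sketch below reproduces that difference using only the standard library (Python 3, xml.etree.ElementTree and html) instead of scrapy; the snippet string and the helper name unescape_markup are made up for illustration.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical one-element stand-in for the PTT page
snippet = "<div>結婚並無Z&gt;B</div>"
element = ET.fromstring(snippet)

# Text content of the node: the parser has already decoded the entity
text_node = element.text

# Serializing the element writes the special character back as an entity
serialized = ET.tostring(element, encoding="unicode")


def unescape_markup(markup):
    """Decode HTML entities in a serialized string (illustrative helper)."""
    return html.unescape(markup)


print(text_node)
print(serialized)
print(unescape_markup(serialized))
```

If this is indeed the cause, decoding the entities in raw would make its text match raw_html_article_content_. Note that the decoded string may no longer be valid markup, so this only makes sense when the result is treated as plain text. On Python 2.7, as far as I know, the comparable call is HTMLParser.HTMLParser().unescape().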

→ 11/30 01:27, 1F
→ 11/30 11:03, 2F
→ 11/30 11:04, 3F
→ 11/30 11:05, 4F
→ 11/30 11:05, 5F


Article ID (AID): #1KUQm2Lv (Python)
