[Question] How should lxml handle <br />?
Hi everyone, I've recently been trying to write a web scraper
and want to grab this part of a page:
http://imgur.com/rNdE4hh

Here is what I tried:
# -*- coding: utf-8 -*-
from urllib2 import urlopen
import xml.etree.ElementTree as ET
from lxml import etree
import mechanize
import sys
url = "http://www.tham.com.tw/recipe6.php"
path = "//*[@id=\"left-inner\"]/div[2]/div[3]"
html = urlopen(url).read()
tree = etree.HTML(html)
startindex = 4
data = tree.xpath(path)
print data[0].text
Output:
>>> ================================ RESTART ================================
>>>
材料 2人份
>>>
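Looking at the page source, my guess is that the <br /> tags are cutting the text off, so only the first line comes back.
Is there a way around this?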
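
For what it's worth, this matches how lxml (and ElementTree) store text: an element's .text holds only the text before its first child element, and whatever follows a child such as <br /> is kept on that child as .tail. A minimal sketch of the behaviour, assuming a made-up snippet rather than the real page markup (the ingredient strings are placeholders, and the stdlib fallback import is only there so the sketch runs without lxml):

```python
# Sketch only: the markup below is a made-up stand-in, not the real page.
try:
    from lxml import etree  # the library used in the post
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib, same .text/.tail/itertext() behaviour

snippet = "<div>Ingredients (serves 2)<br />item A<br />item B</div>"
div = etree.fromstring(snippet)

# .text is only the text before the first child element (the first <br />).
first_chunk = div.text

# Text that follows each <br /> is stored on that <br /> element as .tail.
after_each_br = [br.tail for br in div]

# itertext() walks .text and every .tail in document order.
lines = [t.strip() for t in div.itertext() if t.strip()]

print(first_chunk)
print(after_each_br)
print(lines)
```

In the script above, that would probably mean joining data[0].itertext() instead of printing data[0].text, though I have not run it against the actual page.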
--
※ Posted from: PTT (ptt.cc), from: 123.195.222.114
※ Article URL: https://www.ptt.cc/bbs/Python/M.1457968017.A.79E.html
One more question: how do I debug the XPath part? Are there any tricks?
The output below also looks wrong.
# -*- coding: utf-8 -*-
from urllib2 import urlopen
import xml.etree.ElementTree as ET
from lxml import etree
import mechanize
import sys
url = "https://icook.tw/recipes/133425"
html = urlopen(url).read()
tree = etree.HTML(html)
path = "//*[@id=\"recipes_show\"]/div[3]"
title = tree.xpath(path)
print title
Output:
>>>
[]
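
One way to narrow down an empty result is to test the path one step at a time and see where the match count drops to zero. A common cause is that the HTML returned by urlopen differs from the DOM the browser shows (for example content built by JavaScript), but I can't confirm that is what happens here. A rough sketch, assuming a made-up page and a hypothetical helper name; the paths start with ".//" so the same strings work with lxml's xpath() and with the stdlib fallback's findall():

```python
# Sketch only: hypothetical helper and made-up markup, not the real icook page.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback so the sketch still runs

def last_matching_prefix(root, steps):
    """Try ever longer prefixes of an XPath; return the last one that matched."""
    # Use xpath() when lxml is available, otherwise ElementTree's findall().
    query = getattr(root, "xpath", None) or root.findall
    last_ok = None
    for i in range(1, len(steps) + 1):
        path = "/".join(steps[:i])
        hits = query(path)
        print("%s -> %d match(es)" % (path, len(hits)))
        if not hits:
            break  # this step is where the path stops matching
        last_ok = path
    return last_ok

page = ("<html><body><div id='recipes_show'>"
        "<div>a</div><div>b</div>"
        "</div></body></html>")
root = etree.fromstring(page)

# Only two child <div>s exist here, so the div[3] step is the one that fails.
result = last_matching_prefix(root, [".//*[@id='recipes_show']", "div[3]"])
print(result)
```

If even the first step matches nothing, it is worth printing or saving the downloaded html and searching it for the id before blaming the XPath itself.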
※ Edited: girl5566 (123.195.222.114), 03/15/2016 20:24:59