Re: [Discussion] HTML/XML processing tools

Board: Python | Author: yoco315 (眠月) | Time: 2007/09/13 17:35 | Score: 1 (1 push, 0 boo, 0 neutral)

1 comment, 1 participant, thread 2/2

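※ Quoting seLain (建築的永恆之道):
: There are so few posts on this board ^^b
: I'd like to see whether anyone wants to discuss Python's HTML/XML processing tools,
: since I may need them before long and want to know whether any better tools have appeared in the last year or two.
: When I first picked up Python two years ago I used BeautifulSoup [1] to write programs
: that scrape data from SourceForge.net for analysis (minor gripe: SF.net's page markup is not very standard).
: Everything I wrote later for XML processing was a small program, and for those I switched to ElementTree [2].
: Can anyone on the board recommend other good tools, share their experience, and so on? Thanks.
: (Ideally something with a distinctive data model; I'd like to see how else this can be handled besides a tree model.)
: References
: [1] BeautifulSoup, URL
: http://www.crummy.com/software/BeautifulSoup/
: [2] ElementTree, URL
: http://effbot.org/zone/element-index.htm

I always use BeautifulSoup and find it works very well O_O
It is fairly fault-tolerant. It is not yet as forgiving as a browser, but combined with a little re preprocessing or some custom rules it can handle almost anything. The syntax is also very simple to use.

Take an HTML document like this:

    <html>
    <head>Title Text</head>
    <body>
    <a href="http://www.google.com"> <b> Google </b> </a>
    <table>
    <tr>
    <td> TD1 </td>
    <td> TD2 </td>
    <td> <a href="http://www.yahoo.com"> Yahoo! </a> </td>
    </tr>
    </table>
    </body>
    </html>

soup.body.a['href'] returns 'http://www.google.com'
soup.body('a') returns a list of all the a tags inside body
soup.table.tr('td')[2].a.string returns the 'Yahoo!' string (with the surrounding spaces from the markup; the td itself holds more than one child, so its own .string is None)

If you are not very familiar with it, you might write
soup.find('table').findAll('td')[2].find('a').string
which is far too tiring @@"

--
To iterate is human, to recurse is divine.
(Recursion belongs only in heaven; mortals should stick to loops.)
L. Peter Deutsch
--
※ Origin: PTT (ptt.cc)
◆ From: 140.114.78.40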
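
For comparison, here is a rough sketch of the same three lookups done with ElementTree [2], the other library mentioned in the quoted post. It assumes the sample page has first been tidied into well-formed XML (every tag closed), because ElementTree does not tolerate broken markup the way BeautifulSoup does. The variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tidied copy of the sample page: every tag is closed so it parses as XML.
SAMPLE = """\
<html>
<head>Title Text</head>
<body>
<a href="http://www.google.com"> <b> Google </b> </a>
<table>
<tr>
<td> TD1 </td>
<td> TD2 </td>
<td> <a href="http://www.yahoo.com"> Yahoo! </a> </td>
</tr>
</table>
</body>
</html>
"""

root = ET.fromstring(SAMPLE)  # root is the <html> element
body = root.find("body")

# Counterpart of soup.body.a['href']: the first <a> directly under <body>.
first_href = body.find("a").get("href")

# Counterpart of soup.body('a'): every <a> anywhere below <body>.
all_links = [a.get("href") for a in body.iter("a")]

# Third <td> of the first table row, then the text of the <a> inside it.
third_cell = root.find("body/table/tr").findall("td")[2]
yahoo_text = third_cell.find("a").text.strip()

print(first_href)
print(all_links)
print(yahoo_text)
```

The paths are more explicit than the BeautifulSoup attribute shortcuts, which is roughly the trade-off described in the reply: ElementTree is fine for small, well-formed XML, while BeautifulSoup is the more comfortable choice for messy real-world HTML.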