Re: [Question] How do I strip HTML tags from an email?

Board: Perl | Posted: 2009/03/04 02:11 | Thread: 3 of 4
※ Quoting flylinux (ㄚ琪):
: s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
: Give this a try!
: ※ Quoting deh3215 ():
: : Using only s/<.*?>//gsi; doesn't seem to work very well. Does anyone have
: : experience stripping HTML tags with a module?

<FONT face=Verdana size=1><br>&nbsp;&nbsp;&nbsp;Dear PayPalCustomer,<BR></div>&nbsp;&nbsp;<DIV>&nbsp; <b>&nbsp;CONGRATULATIONS!</b><DIV>&nbsp;&nbsp;<DIV>&nbsp;&nbsp; You have been chosen by our online department <br>&nbsp;&nbsp;&nbsp;to take part in our quick and easy online departament.<br>&nbsp;&nbsp;&nbsp;In return we will credit $20 to your account - Just for your time!

<====> The above is the content of 222.html / 222.txt
---------------------------------------------------
Result after stripping with HTML::FormatText:

Dear PayPal Customer,
CONGRATULATIONS!
You have been chosen by our online department
to take part in our quick and easy online departament.
In return we will credit $20 to your account - Just for your time!
---------------------------------------------------
require HTML::TreeBuilder;
$tree = HTML::TreeBuilder->new->parse_file("c:/222.html");
require HTML::FormatText;
$formatter = HTML::FormatText->new;
print $formatter->format($tree);
---------------------------------------------------
Result after slightly modifying the above (code below):
The tags all appear to be stripped, but some of the words were wiped out too.

Dear ______          ==> PayPal, time! <==
Customer,            It seems related to whitespace. Is this a bug in the module?
CONGRATULATIONS!
You have been chosen by our online department
to take part in our quick and easy online departament.
In return we will credit $20 to your account - Just for your ______
-----------------------------------------------
open(INPUT, "c:/222.txt") or die;
@temp = <INPUT>;
chomp @temp;
use HTML::TreeBuilder;
use HTML::FormatText;
foreach $string(@temp)
{
    $tree = HTML::TreeBuilder->new;
    $tree->parse($string);
    $formatter = HTML::FormatText->new;
    print $formatter->format($tree);
}
--------------------------------------------------
Result after stripping with HTML::Strip:

??Dear PayPalCustomer, ??CONGRATULATIONS!?? You have been chosen by our online department ??to take part in our quick and easy online departament. ??In return we will credit $20 to your account - Just for your time!
---------------------------------------------------
use HTML::Strip;
open(INPUT, "c:/222.txt");
@temp = <INPUT>;
chomp @temp;
foreach $t (@temp)
{
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse($t);
    $hs->eof;
    print "$clean_text";
}
---------------------------------------------------
With HTML::Strip, I don't know why the question marks show up.
====> Deleting every "&nbsp;" from the txt file makes the question marks go away.
---------------------------------------------------
HTML::TreeBuilder combined with HTML::FormatText still gives the cleaner
result, but some randomly inserted hexadecimal strings such as 0x123544 may
have to be removed by hand @@

--
※ Origin: 批踢踢實業坊(ptt.cc)
◆ From: 59.116.2.192
※ Edited: deh3215 from: 140.117.168.75 (03/04 20:21)
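
A follow-up on the missing words: this is most likely not a bug in the module. The HTML::TreeBuilder documentation says that after calling parse() you must call eof() once all chunks have been fed in (parse_file() does this for you). Until then the parser may hold back trailing text, which would explain why the last word of a line (PayPal, time!) disappeared when each line got its own tree and eof() was never called. Below is a minimal sketch, assuming all lines are fed to a single tree. The helper name html_to_text and the margin values are hypothetical, and the \xA0 substitution rests on a guess that the "?" characters are decoded &nbsp; entities the console cannot display.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper (not from the original post): turn a list of HTML
# chunks, e.g. the lines of a file, into plain text using ONE tree.
sub html_to_text {
    my @chunks = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($_) for @chunks;    # feed every chunk to the same tree
    $tree->eof;                      # required after parse(): flushes buffered text
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text = $formatter->format($tree);
    $tree->delete;                   # release the tree explicitly
    $text =~ s/\x{A0}/ /g;           # assumption: turn any leftover non-breaking spaces into plain spaces
    return $text;
}

# Usage: keep the newlines (no chomp) so words on adjacent lines stay separated.
print html_to_text("<b>Dear PayPal\n", "Customer,</b><br>\n", "Just for your time!");
```

To run it on the file from the post, read the lines as before but skip the chomp and pass the whole list in one call, e.g. html_to_text(<INPUT>).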
Article ID (AID): #19hNBOs6 (Perl)