Re: [Question] How do I strip the HTML tags out of an email?

Board: Perl | Author: deh3215 | Time: 2009/03/04 02:11 | Votes: 0 (0 up, 0 down, 0 neutral)

0 comments, 0 participants. Post 3 of 4 in this thread.

※ Quoting flylinux (ㄚ琪):
: s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
: Give this a try!
: ※ Quoting deh3215 ():
: : Using only s/<.*?>//gsi; doesn't seem to work very well. Does anyone have experience stripping HTML tags with a module?

<FONT face=Verdana size=1><br>
  Dear PayPalCustomer,<BR></div>
 <DIV>
 <b> CONGRATULATIONS!</b><DIV>
 <DIV>
  You have been chosen by our online department <br>
  to take part in our quick and easy online departament.<br>
  In return we will credit $20 to your account - Just for your time!

<====> The above is the content of 222.html/txt
---------------------------------------------------
Result after cleaning with HTML::FormatText:

Dear PayPal Customer, CONGRATULATIONS! You have been chosen by our online department to take part in our quick and easy online departament. In return we will credit $20 to your account - Just for your time!
---------------------------------------------------
require HTML::TreeBuilder;
$tree = HTML::TreeBuilder->new->parse_file("c:/222.html");
require HTML::FormatText;
$formatter = HTML::FormatText->new;
print $formatter->format($tree);
---------------------------------------------------
Result after slightly modifying the above (code below):
The tags all appear to be removed, but some words were wiped out as well.

Dear ______ ==> "PayPal" and "time!" are missing <== Customer,
It seems to be related to whitespace. Is this a bug in the module?
CONGRATULATIONS! You have been chosen by our online department to take part in our quick and easy online departament. In return we will credit $20 to your account - Just for your ______
-----------------------------------------------
open(INPUT, "c:/222.txt") or die;
@temp = <INPUT>;
chomp @temp;
use HTML::TreeBuilder;
use HTML::FormatText;
foreach $string (@temp) {
    $tree = HTML::TreeBuilder->new;
    $tree->parse($string);
    $formatter = HTML::FormatText->new;
    print $formatter->format($tree);
}
--------------------------------------------------
Result after cleaning with HTML::Strip:

??Dear PayPalCustomer, ??CONGRATULATIONS!?? You have been chosen by our online department ??to take part in our quick and easy online departament. ??In return we will credit $20 to your account - Just for your time!
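---------------------------------------------------
use HTML::Strip;
open(INPUT, "c:/222.txt") or die "Cannot open c:/222.txt: $!";
@temp = <INPUT>;
chomp @temp;
foreach $t (@temp) {
    # A fresh stripper per line; eof resets its state after each parse.
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse($t);
    $hs->eof;
    print $clean_text;
}
---------------------------------------------------
After cleaning with HTML::Strip, question marks show up and I don't know why.
====> Deleting the " " in the txt file makes the question marks go away.
---------------------------------------------------
HTML::TreeBuilder combined with HTML::FormatText still gives the cleanest result, but randomly inserted hexadecimal strings such as 0x123544 will probably have to be removed by hand @@

--
※ Posted from: PTT (ptt.cc)
◆ From: 59.116.2.192
※ Edited: deh3215 from 140.117.168.75 (03/04 20:21)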
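
A possible explanation for the words dropped by the line-by-line HTML::TreeBuilder loop in this post: the HTML::TreeBuilder documentation says eof() must be called after the last parse() call (parse_file() does this for you, parse() does not), and that loop never calls it, so text at the end of a chunk may stay buffered. The sketch below works under that assumption and parses the whole input once instead of one line at a time. The helper name html_to_text and the inline sample (shortened from the post, standing in for c:/222.txt) are illustrative, not part of the original code.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: convert one complete HTML document or fragment
# to plain text.
sub html_to_text {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;    # required after parse(): flushes buffered text

    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text = $formatter->format($tree);

    $tree->delete;    # release the memory held by the tree
    return $text;
}

# Inline sample standing in for c:/222.txt (an assumption for this demo).
# For a real file, slurp it whole rather than looping over lines:
#   open(my $in, '<', 'c:/222.txt') or die "Cannot open: $!";
#   my $html = do { local $/; <$in> };
my $html = join "\n",
    '<FONT face=Verdana size=1><br>',
    '  Dear PayPalCustomer,<BR></div>',
    ' <DIV>',
    ' <b> CONGRATULATIONS!</b><DIV>',
    '  In return we will credit $20 to your account - Just for your time!';

print html_to_text($html);
```

Feeding the parser the whole document also lets it handle tags that span more than one line, which a per-line loop cannot do.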