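Re: [Question] How do I strip HTML tags from an email??
※ Quoting flylinux (ㄚ琪):
: s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
: Give this a try!
: ※ Quoting deh3215 ():
: : Using only s/<.*?>//gsi; doesn't seem to work well. Does anyone have experience stripping HTML tags with a module?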
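For comparison, here is a small sketch of how the two quoted patterns differ. The sample string is made up for illustration (it is not from the email); it puts a '>' inside a quoted attribute value, which is the case the longer pattern is meant to handle.

```perl
use strict;
use warnings;

# Hypothetical sample: a '>' inside a quoted attribute value.
my $html = '<a title="x>y">link</a>';

# Naive pattern: the tag match ends at the first '>', even inside quotes.
(my $naive = $html) =~ s/<.*?>//gs;

# Quote-aware pattern quoted above: skips over '...' or "..." inside a tag.
(my $quote_aware = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print "naive:       $naive\n";        # prints: naive:       y">link
print "quote-aware: $quote_aware\n";  # prints: quote-aware: link
```

Even the quote-aware pattern is only a regex: it knows nothing about comments, script blocks, or entities, so a parser module is still the safer choice for real mail.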
<FONT face=Verdana size=1><br> Dear
PayPalCustomer,<BR></div> <DIV>
<b> CONGRATULATIONS!</b><DIV> <DIV> You have
been chosen by our online department <br> to take part in our quick and
easy online departament.<br> In return we will credit $20
to your account - Just for your time! <====> The above is the content of 222.html/txt
---------------------------------------------------
Result after stripping with HTML::FormatText:
Dear PayPal Customer,
CONGRATULATIONS! You have been chosen by our online department
to take part in our quick and easy online departament.
In return we will credit $20 to your account - Just for your time!
---------------------------------------------------
require HTML::TreeBuilder;
$tree = HTML::TreeBuilder->new->parse_file("c:/222.html");
require HTML::FormatText;
$formatter = HTML::FormatText->new;
print $formatter->format($tree);
---------------------------------------------------
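Result from a slightly modified version of the above (code below): the tags all appear to be stripped, but some words were dropped as well
Dear ______ ==> "PayPal" and "time!" are missing <==
Customer, It seems to be related to whitespace. Is this a bug in the module?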
CONGRATULATIONS!
You have been chosen by our online
department
to take part in our quick and easy
online
departament.
In return we will credit $20 to
your account - Just for your ______
-----------------------------------------------
open(INPUT, "c:/222.txt") or die;
@temp = <INPUT>;
chomp @temp;
use HTML::TreeBuilder;
use HTML::FormatText;
foreach $string(@temp) {
$tree = HTML::TreeBuilder->new;
$tree->parse($string);
$formatter = HTML::FormatText->new;
print $formatter->format($tree);
}
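
A possible explanation, not confirmed in this thread: the loop above builds a separate tree for every line and never calls eof() on it. The parser can hold back trailing text until it is told the input has ended, which would account for words going missing at line ends, and formatting each line on its own would account for the odd wrapping. Below is a minimal sketch, assuming the lines are fed to ONE tree and eof() is called once at the end; html_lines_to_text is a hypothetical helper name, and the two sample lines are made up.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: parse all lines into a single tree, then format once.
sub html_lines_to_text {
    my @lines = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse("$_\n") for @lines;   # put back the newline removed by chomp
    $tree->eof;                        # signal end of input so buffered text is flushed
    my $text = HTML::FormatText->new->format($tree);
    $tree->delete;                     # release the tree
    return $text;
}

# Made-up sample: a sentence split across two chomped lines.
my @lines = ('<b>CONGRATULATIONS!</b> Just for', 'your time!');
print html_lines_to_text(@lines);
```

With the real file, the same helper could be called as html_lines_to_text(@temp) after the chomp. parse_file() in the first script needs no explicit eof() because it finishes the parse itself.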
--------------------------------------------------
Result after stripping with HTML::Strip
??Dear PayPalCustomer, ??CONGRATULATIONS!?? You have been chosen by our
online department ??to take part in our quick and easy online departament.
??In return we will credit $20 to your account - Just for your time!
---------------------------------------------------
use HTML::Strip;
open(INPUT, "c:/222.txt") or die "Cannot open c:/222.txt: $!";
@temp = <INPUT>;
chomp @temp;
foreach $t (@temp) {
my $hs = HTML::Strip->new();
my $clean_text = $hs->parse($t);
$hs->eof;
print "$clean_text";
}
---------------------------------------------------
When stripping with HTML::Strip, I'm not sure why question marks appear. ====> Deleting " " from the txt file makes them go away
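
A guess at the cause, stated as an assumption: the text deleted from the file is not preserved in this post, but if it was a non-breaking space (for example an &nbsp; entity), HTML::Strip decodes entities by default and would emit the character U+00A0, which a console that cannot display it may show as "?". Under that assumption, a sketch that avoids editing the file is to replace the character after stripping. normalize_nbsp is a hypothetical helper name, and the sample string only stands in for HTML::Strip output.

```perl
use strict;
use warnings;

# Hypothetical helper: turn non-breaking spaces (U+00A0) into plain spaces.
sub normalize_nbsp {
    my ($text) = @_;
    $text =~ s/\x{A0}/ /g;
    return $text;
}

# Stand-in for text returned by HTML::Strip (assumed to contain U+00A0).
my $clean_text = "Dear\x{A0}PayPal Customer,";
print normalize_nbsp($clean_text), "\n";   # prints: Dear PayPal Customer,
```

In the loop above this would become print normalize_nbsp($clean_text); if the question marks come from some other byte, the character class in the substitution would need to change.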
---------------------------------------------------
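In the end, HTML::TreeBuilder combined with HTML::FormatText gives the cleaner result, but stray
hexadecimal values inserted at random, such as 0x123544, may still have to be removed by hand @@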
--
※ Posted from: 批踢踢實業坊 (ptt.cc)
◆ From: 59.116.2.192
※ Edited: deh3215 from: 140.117.168.75 (03/04 20:21)
Full thread: this post is number 3 of 4.