[Question] When fetching a web page in Java, I also want to fetch the content of the pages its hyperlinks point to

Board: java   Author: sevenheart (啾啾)   Time: 2011/08/08 02:35

I'm a real beginner and hope someone is willing to help me out orz

[Goal]
Fetch the source of the home page:

|------------------------------
|
|
|------"www.help...."----------   //hit a hyperlink partway through
    |===============              //fetch the source of www.help.... and put it right below, indented
    |===============
    |===============
|------------------------------
|------------------------------
|------------------------------
|------------------------------

[Background] //it should be fine if this part doesn't make sense
I ran a series of Java commands against a servlet program and finally used jhat to serve the contents of a heap at http://localhost:7000
(Opening that address on its own shows nothing; the command line where jhat was started has to stay open.)
(My understanding: a JVM is running at that point, and what we are looking at is its contents.)

[Start]
The layout of this page is quite simple: every hyperlink starts with <a href=" ...
So my program fetches the original page and writes it to a file. Up to here there is no problem.
Then I added some conditions so that as soon as it meets one specific hyperlink, it starts fetching that hyperlink from the next line on.
My naive idea is to use an array of URLs, with url[0] holding the home page, and open it directly:

    URL[] url = new URL[ MAX_URL ];
    BufferedReader[] reader = new BufferedReader[ MAX_URL ];
    url[0] = new URL("http://tw.yahoo.com/");
    reader[0] = new BufferedReader( new InputStreamReader( url[0].openStream() ) );

Then keep reading url[0]:

    while ( ( line = reader[0].readLine() ) != null )

After that, when the specified hyperlink shows up, it is stored in url[1].
This is still at the testing stage, so the whole program follows only one specified hyperlink.
When it meets that link it sets the home page (url[0]) aside, goes straight into url[1], reads the linked page to the end, and then comes back out.

[But]
I don't understand why, when fetching the content of url[1], I only ever get the first three lines of that page. After those it returns null, which is very strange.
Eclipse shows no error message at all, but the command line where I started jhat prints:

    Exception in thread "Thread-3" java.lang.NumberFormatException: " is not a valid hex digit
        at com.sun.tools.hat.internal.util.Misc.parseHex(Misc.java:62)
        at com.sun.tools.hat.internal.model.Snapshot.findThing(Snapshot.java:359)
        at com.sun.tools.hat.internal.model.Snapshot.findClass(Snapshot.java:364)
        at com.sun.tools.hat.internal.server.ClassQuery.run(ClassQuery.java:42)
        at com.sun.tools.hat.internal.server.HttpReader.run(HttpReader.java:181)
        at java.lang.Thread.run(Thread.java:662)

I have thought about it for a long time and still can't work it out.

[In vain]
Wireshark does not seem to capture any packets when this program fetches from localhost, but it does capture them when I fetch google.

[My code]

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.InputStreamReader;
    import java.io.IOException;

    public class for_asking {
        public static void main( String args[] ) {
            int url_count = 0;
            int i , j;
            int MAX_URL = 5000;
            int scope = 0;
            int pos;
            boolean START_TO_PARSE;
            boolean IS_TEST = false;
            String line;            //read a line from the current page
            String tempStr;
            String tag_href = "<a href=";
            String tag_end_href = "</a>";
            String package_test = "Package test";
            String TAB = " ";
            String outputFileName = "output.html";  //output file name
            ////////
            try {
                URL[] url = new URL[ MAX_URL ];
                BufferedReader[] reader = new BufferedReader[ MAX_URL ];
                FileWriter fw = new FileWriter(outputFileName);
                BufferedWriter bw = new BufferedWriter(fw);
                url_count = 0;
                url[0] = new URL("http://localhost:7000/");
                reader[0] = new BufferedReader( new InputStreamReader( url[0].openStream() ) );
                while ( ( line = reader[0].readLine() ) != null ) {
                    START_TO_PARSE = false;
                    //** (1) first output the current line
                    bw.write( line );
                    //is this the test package?
                    //the hyperlink I want is the one on the line after "Package test"
                    pos = line.indexOf( package_test );  //returns -1 if "Package test" is not on this line
                    if( pos == -1 )
                        ;
                    else
                        IS_TEST = true;
                    if( IS_TEST == true ) {
                        //** check whether this line of html contains a hyperlink
                        pos = line.indexOf( tag_href );  //returns -1 if there is no hyperlink
                        //** if this line has no hyperlink
                        if( pos == -1 )
                            continue;
                        //** if a hyperlink was found
                        /** (1) output
                            (2) extract the hyperlink string, used to build the next-level URL
                            (3) keep scanning until the line being output has been completely output
                            (4) then enter that hyperlink (DFS), url_count++, scope++
                         *      read a line, check whether it has a hyperlink
                         *      ** if there is no hyperlink
                         *      *** (I) output as usual
                         *
                         *      ** if there is a hyperlink
                         *      *** (I) output
                         *      *** (II) extract the hyperlink
                         *      *** (III) keep scanning until this line has been completely output
                         *      *** (IV) then enter that hyperlink (DFS), url_count++, scope++
                         *          DFS
                         *      *** (V) when done, come back out (into the outer while)
                            (5) when the DFS is done, come back out (into the outer while)
                         **/
                        else {
                            j = line.indexOf( tag_end_href );
                            if( j == -1 )   //the hyperlink does not end on this line
                                ;           //deal with it later = =
                            else {
                                //take only what is between the double quotes
                                tempStr = line.substring( pos+9 , j-1 );
                                //** (2) get the URL of the new link (the next level)
                                url[ scope+1 ] = new URL( url[scope++] , tempStr );
                            }
                            //** (3) next, finish outputting this line. The html source may have wrapped
                            //   arbitrarily, so we need to find where the line really ends
                            //   (for now this only handles URLs that are not split across lines)
                            for( i = j+4 ; i < line.length() ; ++i ) {
                                if( line.charAt(i) == '<' ) {
                                    START_TO_PARSE = true;
                                    break;
                                }
                            }//end for
                            //check whether this line has really ended; if not, continue with the next line (else)
                            if( START_TO_PARSE == true )
                                ;
                            else    //the line ended in the source, but the displayed line is not finished
                            {
                                //keep reading the next source line until the start of the next line is found
                                //(scope is incremented first above; url_count is updated only when parsing really starts)
                                while ( ( line = reader[ url_count ] .readLine() ) != null ) {
                                    bw.write(line+"\n");
                                    //System.out.println(line);
                                    for( i = 0 ; i < line.length() ; ++i ) {
                                        if( line.charAt(i) == '<' ) {
                                            START_TO_PARSE = true;
                                            break;
                                        }
                                    }
                                    if( START_TO_PARSE == true )
                                        break;
                                }//end while (leaving it means we can move on)
                            }//end else
                        }//end else (there is a hyperlink)
                        //if there is a hyperlink, start parsing (after the checks above)
                        if( START_TO_PARSE == true ) {
                            ++url_count;
                            START_TO_PARSE = false;
                            BufferedReader tmp = new BufferedReader( new InputStreamReader( url[scope].openStream() ) );
                            while ( ( line = tmp.readLine() ) != null ) {
                                //System.out.print("shit");
                                if( line.indexOf("</body>") == -1 )  //output only if this is not the last line
                                    System.out.println(line);
                            }
                            --url_count;
                            --scope;
                            IS_TEST = false;
                        }
                    }//end if(IS_TEST)
                    //else: no hyperlink, so keep reading the original source
                }//end while (the whole source has not been read yet)
                reader[0].close();
                bw.close();
                fw.close();
            } catch (MalformedURLException e) {
                System.err.println(e);  //e.g. "no protocol" and the like ...
            } catch (IOException e) {
                System.err.println(e);  // ...
            }
        }//end main
    }

Help! >"<

--
※ Origin: 批踢踢實業坊 (ptt.cc)
◆ From: 218.173.162.19
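For reference, here is a minimal sketch of the "extract the link, then go depth-first and indent" idea described in the post. It is not the poster's program. The class name `HrefSketch`, the helpers `extractHref` and `inline`, the in-memory `pages` map standing in for real HTTP fetches, and the `/class/0x1f` path are all assumptions made for illustration. One thing it makes visible: in the posted code, `line.substring(pos+9, j-1)` stops one character before `</a>`, so for a line such as `<a href="/class/0x1f">Foo</a>` it yields `/class/0x1f">Fo` rather than just the address. That may be why jhat complains that `"` is not a valid hex digit, though this is only a guess from the stack trace. The sketch instead stops at the closing double quote.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class HrefSketch {
    private static final String HREF = "<a href=\"";

    /** Returns the text between the quotes of the first <a href="..."> on the line, or null. */
    static String extractHref(String line) {
        int start = line.indexOf(HREF);
        if (start == -1) return null;
        start += HREF.length();              // first character after the opening quote
        int end = line.indexOf('"', start);  // the closing quote, not the position of </a>
        return end == -1 ? null : line.substring(start, end);
    }

    /** Copies a page into out depth-first; each linked page is inlined one indent level deeper. */
    static void inline(String pageUrl, Function<String, List<String>> fetch,
                       String indent, Set<String> visited, List<String> out) {
        if (!visited.add(pageUrl)) return;   // already expanded, so skip it to avoid cycles
        for (String line : fetch.apply(pageUrl)) {
            out.add(indent + line);
            String href = extractHref(line);
            if (href != null) {
                // Resolve a relative address against the page it appeared on.
                String child = URI.create(pageUrl).resolve(href).toString();
                inline(child, fetch, indent + "    ", visited, out);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical in-memory pages standing in for real HTTP responses.
        Map<String, List<String>> pages = new HashMap<>();
        pages.put("http://localhost:7000/",
                Arrays.asList("<h1>Index</h1>", "<a href=\"/class/0x1f\">Foo</a>", "<p>end</p>"));
        pages.put("http://localhost:7000/class/0x1f",
                Arrays.asList("<h1>Class Foo</h1>"));

        List<String> out = new ArrayList<>();
        inline("http://localhost:7000/",
                url -> pages.getOrDefault(url, Collections.<String>emptyList()),
                "", new HashSet<>(), out);
        out.forEach(System.out::println);
    }
}
```

Passing the fetcher in as a function keeps the recursion separate from the network code; in a real run it could be replaced by something that reads `new URL(u).openStream()` line by line. The `visited` set is an addition not in the original post, included because pages that link back to each other would otherwise recurse without end.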