Computer Blog - thebroadroom.net: Java spider function

Disclaimer: all of the following is purely from personal experience. TheBroadroom.Net urges you to use your own instincts, common sense, and willingness to take risks when applying any of the information below.

Java spider function
posted by TheBroadroom.Net, Monday, May 28, 2007 at 11:09 AM (Pacific)

Thursday, October 21, 2004

Here is some source code for a Java spider function. I guess it could have been written as a single recursive function, but I couldn't remember the word "recursive," so I slapped together two functions that call each other. Quick and dirty, but it works.

// needs: import java.io.File; import java.util.Vector;
// m_file_list and m_directories are Vector fields of the enclosing class

public void spider(String directory) {

  File dir;
  Vector directories;
  String child;
  String mydir;

  // got this snippet off the Net
  dir = new File(directory);
  String[] children = dir.list();
  directories = new Vector();

  if (children == null) {
    // Either dir does not exist or is not a directory
  }
  else {

    for (int i = 0; i < children.length; i++) {

      child = children[i];
      mydir = directory + "/" + child;

      // ask the file system whether it's a directory, rather than guessing
      // from a missing extension, so folders with a dot in the name get crawled
      if(new File(dir, child).isDirectory()) {
        directories.addElement(mydir);
        m_directories.addElement(mydir + "/");
      }

      // I don't have the FileFilter class, sorry
      // lowercase only for the comparison; store the name as it is on disk
      else if(child.toLowerCase().endsWith(".html") || child.toLowerCase().endsWith(".htm")) {
        m_file_list.addElement(mydir);
      }
    }
  }
  testVector(directories);
}

public void testVector(Vector v) {

  // if there's anything in the directories vector, call spider
  for(int i = 0; i < v.size(); i++) {
    spider((String)v.elementAt(i));
  }
}
...

Of course what you do with the files and subdirectories is your business. Here I have a vector to add the filenames to, and a vector to store the directories and subdirectories. You don't need to store the latter if all you want to do is access all of your files and do something with them; I stored them purely for the report that gets written after all the directories have been crawled.
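For comparison, here is a rough sketch of how the same crawl might look as one recursive method. It is not the code from my program: the class name RecursiveSpider and the two list parameters are made up for this example, and it uses ArrayList where the original uses Vector fields.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class RecursiveSpider {

    // Adds every .html/.htm file under `directory` to `files`, and every
    // subdirectory (with a trailing slash) to `directories`.
    static void spider(String directory, List<String> files, List<String> directories) {
        File dir = new File(directory);
        String[] children = dir.list();
        if (children == null) {
            return; // directory does not exist or is not a directory
        }
        for (String child : children) {
            String path = directory + "/" + child;
            if (new File(dir, child).isDirectory()) {
                directories.add(path + "/");
                spider(path, files, directories); // the recursive call
            } else {
                String lower = child.toLowerCase();
                if (lower.endsWith(".html") || lower.endsWith(".htm")) {
                    files.add(path);
                }
            }
        }
    }
}
```

You would call it with something like RecursiveSpider.spider("fun", files, directories) after creating two empty lists. One difference to be aware of: this version goes into each subdirectory as soon as it finds it, so the files and directories can come out in a different order than with the two-function version, which finishes listing a directory before crawling its subdirectories.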

If you would like the option of excluding certain subdirectories from the crawl, it's easy enough to list them and then compare each directory against your list before crawling it.
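A minimal sketch of that exclusion check might look like the following. The class name CrawlFilter and the method shouldCrawl are hypothetical, not part of my program; the idea is just a set of directory paths that you test each candidate against.

```java
import java.util.HashSet;
import java.util.Set;

class CrawlFilter {

    // Directories to skip, stored without a trailing slash.
    private final Set<String> excluded = new HashSet<String>();

    CrawlFilter(String... excludedDirectories) {
        for (String directory : excludedDirectories) {
            excluded.add(normalize(directory));
        }
    }

    // Strips trailing slashes so "fun/weblogs/" and "fun/weblogs" compare equal.
    private static String normalize(String directory) {
        String result = directory;
        while (result.length() > 1 && result.endsWith("/")) {
            result = result.substring(0, result.length() - 1);
        }
        return result;
    }

    // True when the directory is not on the skip list.
    boolean shouldCrawl(String directory) {
        return !excluded.contains(normalize(directory));
    }
}
```

In spider you would wrap the lines that add mydir to the vectors in a test such as if (filter.shouldCrawl(mydir)), where filter is an assumed CrawlFilter field. Because a skipped directory never gets crawled, everything underneath it is skipped too.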

Here's the report from my particular program. The program is simple: all it looks for are two HTML tags. If it doesn't find them, it slaps the file on the "tags not found" list; if it does, it replaces whatever is between the two tags with a new string (in this case, advertising).

This is from our "fun" and "fashion" directories:

Completed files: 4
fun/index.html
fun/book_reviews/index.html
fun/book_reviews/2003_q4/da_vinci_code.html
fun/article_index/index.html

Tags not found: 47
fun/wit_wisdom.html
fun/weblogs/index.html
fun/weblogs/women_bloggers.html
fun/book_reviews/2003_q4/five_people.html
fun/book_reviews/2004_q1/girls_guide.html
fun/book_reviews/2004_q1/five_people.html
fun/book_reviews/2004_q1/emperor.html
fun/book_reviews/2004_q1/songbird.html
fun/book_reviews/2004_q1/why_some_men.html
fun/book_reviews/2004_q2/how_do_you_compare.html
fun/book_reviews/2004_q3/buddha.html
fun/book_reviews/2004_q3/corpses.html
fun/book_reviews/2004_q3/midlife.html
fun/book_reviews/2004_q3/evenings.html
fun/book_reviews/2004_q3/pregnancy.html
fun/book_reviews/2004_q3/sororities.html
fun/book_reviews/2004_q3/who_cares.html
fun/book_reviews/2004_q3/drifting.html
fun/book_reviews/2004_q4/chasing.html
fun/book_reviews/2004_q4/winning_habits.html
fun/book_reviews/2004_q4/angels_demons.html
fun/book_reviews/2004_q4/a_royal_duty.html
fun/book_reviews/2004_q4/sammys_hill.html
fun/link_exchange/index.html
fun/feature/index.html
fun/feature/marlo_thomas.html
fun/feature/old_features.html
fun/feature/who_cares.html
fashion/index.html
fashion/what_do_you_wear/index.html
fashion/what_do_you_wear/reach_for.html
fashion/what_do_you_wear/accessories.html
fashion/what_do_you_wear/gap.html
fashion/what_do_you_wear/banana_republic.html
fashion/what_do_you_wear/shoes.html
fashion/what_do_you_wear/target.html
fashion/what_do_you_wear/old_navy.html
fashion/what_do_you_wear/bras.html
fashion/what_do_you_wear/cafepress.html
fashion/what_do_you_wear/2003_q4/socks.html
fashion/what_do_you_wear/2003_q4/whimsy.html
fashion/what_do_you_wear/2003_q4/fall.html
fashion/what_do_you_wear/2003_q4/old_clothes.html
fashion/what_do_you_wear/2003_q4/weather.html
fashion/what_do_you_wear/2004_q3/bracelets.html
fashion/anti_fashion/index.html
fashion/anti_fashion/2003_q4/accessories.html

Total files read: 51

Total directories read: 17
fun/
fun/weblogs/
fun/book_reviews/
fun/link_exchange/
fun/article_index/
fun/feature/
fun/book_reviews/2003_q4/
fun/book_reviews/2004_q1/
fun/book_reviews/2004_q2/
fun/book_reviews/2004_q3/
fun/book_reviews/2004_q4/
fashion/
fashion/what_do_you_wear/
fashion/anti_fashion/
fashion/what_do_you_wear/2003_q4/
fashion/what_do_you_wear/2004_q3/
fashion/anti_fashion/2003_q4/
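The tag check that produces the "completed" and "tags not found" lists in the report could be sketched roughly as below. This is an illustration rather than my actual code: the class name TagReplacer, the method replaceBetween, and the marker strings in the usage note are placeholders, since the real tags aren't listed here.

```java
class TagReplacer {

    // Returns html with everything between startTag and endTag replaced by
    // replacement, keeping both tags in place. Returns null when either tag
    // is missing, which is the "tags not found" case.
    static String replaceBetween(String html, String startTag, String endTag, String replacement) {
        int start = html.indexOf(startTag);
        if (start == -1) {
            return null;
        }
        int contentStart = start + startTag.length();
        int end = html.indexOf(endTag, contentStart);
        if (end == -1) {
            return null;
        }
        return html.substring(0, contentStart) + replacement + html.substring(end);
    }
}
```

For each file on the list you would read the contents into a string, call something like TagReplacer.replaceBetween(contents, "<!-- ad_start -->", "<!-- ad_end -->", newAd), and either write the result back out or, if it comes back null, add the filename to the "tags not found" list. Note that indexOf matches case exactly, so the tags have to be written the same way in every file.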
