MyFunnyDev

web, coding and beyond

anemone with hpricot

Anemone is a pretty cool DSL for web crawling. I used it with Hpricot to get a feeling for what’s possible. Below is a simple example which crawls and scrapes data from otodom, a popular Polish real estate website:

require 'rubygems'
require 'anemone'
require 'hpricot'

# otodom.pl: start from a search results page and keep crawl state in a PStore file
Anemone.crawl("http://otodom.pl/index.php?mod=search&act=searchResults&qid=46911208",
              :storage => Anemone::Storage.PStore("crawl1.pstore")) do |anemone|

  # filter out useless pages: follow only search result pages and offer detail pages
  anemone.focus_crawl do |page|
    page.links.delete_if do |x|
      (x.to_s =~ /mod=search&act=searchResults&qid=/).nil? &&
        (x.to_s =~ /[a-zA-Z]+-id[0-9]*\.html$/).nil?
    end
  end

  # process details pages
  anemone.on_pages_like(/[a-zA-Z]+-id[0-9]*\.html$/) do |page|
    # parse the raw HTML with Hpricot (page.body is the response body as a string)
    doc = Hpricot(page.body)
    # `at` returns the first matching element (or nil), `search` returns all matches
    price = doc.at("//strong[@id='offerPrice']")
    location = doc.at("//dl[@class='stripeMe'] > dd")
    desc = doc.at("//div[@id='offerDesc'] > p")
    offer_no = doc.at("//div[@id='offerFoot'] p[@class='toLeft']/span/strong")
    created_at = doc.at("//div[@id='offerFoot'] p[@class='toRight']/span/strong")
    photos = doc.search("//div[@id='imageList']/p/a")
  end
end
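
The link filter in focus_crawl is easy to try out on its own, without crawling anything. Here is a minimal sketch using only the standard library: the two patterns are copied from the example above, while the helper name interesting_links and the sample URLs (apart from the search URL pattern) are made up for illustration.

```ruby
require 'uri'

# Patterns copied from the crawl above.
SEARCH_RESULTS = /mod=search&act=searchResults&qid=/
OFFER_DETAILS  = /[a-zA-Z]+-id[0-9]*\.html$/

# Hypothetical helper: keep only the links that match one of the patterns.
# This is the same decision focus_crawl makes, written as a select
# instead of a delete_if.
def interesting_links(links)
  links.select do |link|
    url = link.to_s
    url =~ SEARCH_RESULTS || url =~ OFFER_DETAILS
  end
end

# Sample URLs, invented for illustration only.
links = [
  URI("http://otodom.pl/index.php?mod=search&act=searchResults&qid=46911208&page=2"),
  URI("http://otodom.pl/mieszkanie-warszawa-id123456.html"),
  URI("http://otodom.pl/kontakt.html")
]

kept = interesting_links(links)
kept.each { |link| puts link }
```

Only the search results page and the offer page survive; anything else is dropped before Anemone would queue it.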

Written by Michał Kuklis

January 11th, 2010 at 12:06 am

Posted in Ruby

One Response to 'anemone with hpricot'

  1. You could also use nokogiri. Since anemone requires it anyway, you’d save yourself a gem :)
    (although: hpricot was faster in my limited tests so far)

    Marc

    4 May 10 at 3:27 pm
