Of Spiders and Scrapers: Decomposing Web Pages 101
Not all sites proffer slick RESTful interfaces and XML feeds. In those cases, collecting data requires some good, old-fashioned scraping. This week, let’s look at some of the tools and techniques required to scrape a site.
Wednesday, July 29th, 2009
With so many different platforms connecting to the Internet these days, the traditional HTML Web page is just one of many outlets for information. RSS syndicates content to aggregators and specialized readers; messaging services such as Twitter and Facebook keep audiences engaged with frequent, even real-time alerts; and programmatic interfaces, or APIs, provide automated access and further blur the distinction between client and server. If you’re authoring a specialized client or a “mashup” application for a new site, there’s likely no shortage of methods to collect and repurpose content.
Of course, not all sites proffer slick RESTful interfaces and XML feeds. Indeed, most don’t. In those cases, collecting data requires some good, old-fashioned scraping: identify the pages you want, download the content, and sift through the text or HTML of each page to extract the pertinent data. Depending on the complexity of the source, scraping can be simple…
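As a rough illustration of those three steps, here is a minimal sketch in Python using only the standard library's urllib and html.parser. The URL and the `<h2 class="headline">` markup it targets are hypothetical stand-ins for whatever page and elements you actually need to harvest.

```python
# A minimal scraping sketch: identify a page, download its content,
# and sift through the HTML to extract the pertinent data.
# The URL and the <h2 class="headline"> markup are hypothetical examples.
from html.parser import HTMLParser
from urllib.request import urlopen


class HeadlineParser(HTMLParser):
    """Collects the text of every <h2 class="headline"> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples supplied by HTMLParser
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline and data.strip():
            self.headlines.append(data.strip())


if __name__ == "__main__":
    url = "http://example.com/news"                          # step 1: the page you want
    html = urlopen(url).read().decode("utf-8", "replace")    # step 2: download the content
    parser = HeadlineParser()
    parser.feed(html)                                        # step 3: sift the HTML
    for headline in parser.headlines:
        print(headline)
```

Real pages rarely stay this tidy, of course; a dedicated parsing library or an XPath/CSS-selector toolkit becomes worthwhile as soon as the markup gets messy.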