Of Spiders and Scrapers: Decomposing Web Pages 101

Not all sites proffer slick RESTful interfaces and XML feeds. In those cases, collecting data requires some good, old-fashioned scraping. This week, let’s look at some of the tools and techniques required to scrape a site.

With so many different platforms connecting to the Internet these days, the traditional HTML Web page is just one of many outlets for information. RSS syndicates content to aggregators and specialized readers; messaging services such as Twitter and Facebook keep audiences engaged with frequent, even real-time, alerts; and programmatic interfaces, or APIs, provide automated access and further blur the distinction between client and server. If you’re authoring a specialized client or a “mashup” application for a new site, there’s likely no shortage of methods to collect and repurpose content.

Of course, not all sites proffer slick RESTful interfaces and XML feeds. Indeed, most don’t. In those cases, collecting data requires some good, old-fashioned scraping: identify the pages you want, download the content, and sift through the text or HTML of each page to extract the pertinent data. Depending on the complexity of the source, scraping can be simple…
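As a rough illustration of the three steps just described (identify the pages, download the content, sift through the HTML), here is a minimal sketch in Python that uses only the standard library. It is not the article's own code: the names `LinkExtractor`, `extract_links`, and `fetch`, the sample markup, and the `example.com` URLs are all assumptions made for the example, and the "pertinent data" is taken to be the links on a page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the target and text of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # (absolute_url, link_text) pairs, in page order
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Sift through one page's HTML and return its links."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url):
    """Download one page and decode it to text (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Made-up markup standing in for a downloaded page.
SAMPLE = (
    '<ul><li><a href="/a/1">First</a></li>'
    '<li><a href="http://example.com/b">Second <b>item</b></a></li></ul>'
)

if __name__ == "__main__":
    for url, text in extract_links(SAMPLE, "http://example.com/"):
        print(url, text)
```

Against a live site, the downloaded text would come from something like `fetch(page_url)` rather than the inline sample. An event-driven parser is used here instead of regular expressions because it copes with nested tags inside a link; real-world pages may still need a more forgiving parser.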
