DOM manipulation with PHP, the ultimate page scraper?
By Andres, January 12, 2010 (2 responses)
These days we hear a good deal about DOM manipulation with JavaScript, but some lesser-known technologies (for now; they’re quickly gaining ground) are XPath, XQuery and XSLT.
Fellow developers will know that historically we’ve had to rely on regular expressions to scrape a page. While this is often fast, it can be horrendous to read and edit, as a tiny mistake can render the regular expression useless. That’s not to say regular expressions aren’t useful when the hierarchy is small and simple, but in today’s world of Web 2.0 designs it often isn’t.
What is page scraping?
Page scraping is a method that allows you to pull information from a web page, so that the data can be manipulated inside your own script. In your script, you can connect to another URL and request a page, just like a browser would do it. Once you make the request, the web server will send back the page you asked for and your script can manipulate the data and extract specific information.
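As a rough sketch of the request step described above, the helper below fetches a page body over HTTP. The function name fetch_page, the timeout and the user-agent string are illustrative assumptions, not part of the original article, and the approach relies on PHP’s allow_url_fopen setting being enabled.

```php
<?php
// Hypothetical helper (not from the original article): return the body of
// the page at $url, or null when the request fails.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    // Options under 'http' apply to http:// and https:// URLs.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleScraper/0.1', // assumed placeholder value
        ],
    ]);

    // file_get_contents() returns false on failure; @ silences the warning
    // so the caller only has to check for null.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}
```

Calling fetch_page('http://www.teckpert.com') would then hand back the raw HTML as a string, ready to be parsed, or null if the request could not be completed.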
What exactly does the DOMDocument object do?
If you’re not familiar with the DOM, then the following explanation probably isn’t going to make much sense, as there really isn’t much to say besides this: DOMDocument transforms an HTML page into a tree of elements. A browser does the same thing when it loads a page, and manipulating that tree is one of the main things JavaScript is used for. I find a visual representation often helps to understand the tree model, so here is a simplified version of it:

The above is a representation of the following HTML:
<div>
  <ul>
    <li>
      <a href="#">URL</a>
    </li>
    <li>
      <a href="#"><img src="#" /></a>
    </li>
    <li>
      <a href="#">URL</a>
    </li>
  </ul>
</div>
Sidenote: If your development background is in more traditional languages, this will look familiar, as it resembles a binary tree; however, a DOM node can have any number of children.
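To make the tree model concrete, here is a minimal sketch that loads the sample markup into DOMDocument and lists each element indented by its depth. The outline() helper is a hypothetical name introduced for illustration only.

```php
<?php
// The sample markup from above, written on one logical line so that no
// whitespace-only text nodes sit between the elements.
$html = '<div><ul>'
      . '<li><a href="#">URL</a></li>'
      . '<li><a href="#"><img src="#" /></a></li>'
      . '<li><a href="#">URL</a></li>'
      . '</ul></div>';

// Hypothetical helper: return one indented line per element below $node.
function outline(DOMNode $node, int $depth = 0): array
{
    $lines = [];
    foreach ($node->childNodes as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue; // skip text nodes such as "URL"
        }
        $lines[] = str_repeat('  ', $depth) . $child->nodeName;
        $lines = array_merge($lines, outline($child, $depth + 1));
    }
    return $lines;
}

$doc = new DOMDocument;
$doc->loadHTML($html);

// loadHTML() wraps the fragment in <html><body>, so start from our <div>.
$div  = $doc->getElementsByTagName('div')->item(0);
$tree = array_merge(['div'], outline($div, 1));

echo implode("\n", $tree), "\n";
```

The printed outline mirrors the nesting of the markup: one ul under the div, three li children under the ul, and an img nested inside the second link.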
Through the DOMDocument object we are given the ability to traverse nodes, create them, and remove them as we see fit. However, traversing the levels of the DOM solely through the methods the object provides is sometimes cumbersome and altogether impractical, given how deeply nested the information we need can be. Enter XPath: it is to XML-compliant markup languages what SQL is to databases, a query language. The entire breadth of XPath is outside the scope of this particular post but is covered in depth here. If you’re familiar with jQuery or any other JavaScript framework that supports CSS-style selectors, this’ll be easy for you.
How to use XPath with DOMDocument
We’ll start with something basic as an introduction: we’ll scrape the “Why us” section of TECKpert’s homepage.
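The sketch below illustrates the three operations just mentioned (create, remove, query) on a small made-up list. The markup and the /a, /b, /c link targets are assumptions chosen for the example.

```php
<?php
$doc = new DOMDocument;
$doc->loadHTML(
    '<div><ul>'
    . '<li><a href="/a">One</a></li>'
    . '<li><a href="/b">Two</a></li>'
    . '</ul></div>'
);

$ul = $doc->getElementsByTagName('ul')->item(0);

// Create: build <li><a href="/c">Three</a></li> and append it to the list.
$link = $doc->createElement('a', 'Three');
$link->setAttribute('href', '/c');
$item = $doc->createElement('li');
$item->appendChild($link);
$ul->appendChild($item);

// Remove: drop the first list item (there is no whitespace between the
// tags, so firstChild is the first <li>).
$ul->removeChild($ul->firstChild);

// Query: a single XPath expression collects the href of every link in the
// list, instead of walking ul -> li -> a by hand.
$xPath = new DOMXPath($doc);
$hrefs = [];
foreach ($xPath->query('//ul/li/a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```

After the append and the removal, the list holds the second original link and the newly created one, which is what the query returns.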
<?php
// Define our URL & start the DOM document
$url = 'http://www.teckpert.com';
$doc = new DOMDocument;
// Real-world HTML is rarely valid; collect parser warnings instead of printing them
libxml_use_internal_errors(true);
// Load the HTML into our object
$doc->loadHTMLFile($url);
// Alternatively, fetch the markup yourself and parse the string (use one or the other)
// $html = file_get_contents($url);
// $doc->loadHTML($html);
// Now that we've created our DOM object,
// create the XPath object
$xPath = new DOMXPath( $doc );
// Query TECKpert's DOM for the 'why us' section
$results = $xPath->query('//div[@class="why_us"]');
// item(0) is null when nothing matches, so check before reading from it
if ($results->length > 0) {
    echo $results->item(0)->textContent;
}
Simple to use, right? There will be more to come on the technologies associated with XML traversal and their related query languages.
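One caveat with the query above: @class="why_us" only matches when the attribute is exactly that string. The sketch below shows a common XPath 1.0 idiom for matching a single class among several. The stand-in markup is an assumption, since the live page is not reproduced here.

```php
<?php
// Stand-in markup (an assumption): the target div carries two classes.
$html = '<html><body>'
      . '<div class="box why_us"><p>We ship on time.</p></div>'
      . '<div class="footer">Contact</div>'
      . '</body></html>';

// Collect parser warnings instead of printing them, then restore the
// previous setting.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xPath = new DOMXPath($doc);

// Exact match: finds nothing here, because the attribute is "box why_us".
$exact = $xPath->query('//div[@class="why_us"]');

// Token match: pad the class list with spaces and look for " why_us ".
$token = $xPath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " why_us ")]'
);

// item(0) is null when nothing matched, so check before reading textContent.
$node = $token->item(0);
$text = $node === null ? '' : trim($node->textContent);

echo $text, "\n";
```

The token-style expression is longer, but it behaves like the CSS selector div.why_us and keeps working if the page author adds further classes to the element.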
Note on this article
There exists the possibility of violating copyright laws using techniques outlined in this article if you misuse data you scrape. Please scrape responsibly.
PIPA and SOPA are two bills going before Congress. The Protect IP Act (PIPA) is being presented before the Senate and the Stop Online Piracy Act (SOPA) before the House of Representatives.

another way to do this is with native JavaScript. instead of having to write
$results = $xPath->query('//div[@class="why_us"]');
you could simply write
var results = document.querySelectorAll('div.why_us');
i launched my first php screen scraper in 2003. for even better scraping, get yourself on a Ruby on Rails environment and check out scRUBY. happy coding!
@brian the article is about page scraping. Having just helped out on a project that manipulated the scrape with JavaScript as you suggest, I found it extremely problematic. If the browser has JavaScript disabled, your script fails. Secondly, you have to wait for the page to load with all of its assets (images, stylesheets, Flash, etc.) before you can start manipulating it, which means there is a delay before your scraped and repackaged page is served. If it is all handled on the server side with PHP, the end user never sees that delay, and the page manipulation can’t fail because JavaScript is disabled or because some other competing script on the page breaks yours.