Using the PHP Document Object Model (DOM) to get all page links

Wednesday, January 27th, 2010

Further to the article I wrote about parsing links from an HTML page, here is a more elegant and accurate way of getting every link, using the Document Object Model (DOM).

/**
 * @author Jay Gilford
 */

/**
 * get_links()
 * 
 * @param string $url
 * @return array
 */
function get_links($url) {
    
    // Create a new DOM Document to hold our webpage structure
    $xml = new DOMDocument();
    
    // Load the URL's contents into the DOM (the @ suppresses warnings caused by invalid HTML)
    @$xml->loadHTMLFile($url);
    
    // Empty array to hold all links to return
    $links = array();
    
    // Loop through each <a> tag in the DOM and add it to the links array
    foreach($xml->getElementsByTagName('a') as $link) {
        $links[] = array('url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
    }
    
    //Return the links
    return $links;
}

The comments in the code above explain how each step works. To call the function, simply use
$links = get_links('http://www.example.com');
changing the URL to that of the page whose links you want. You could also extend this code to return further details for each link, such as whether it carries a rel="nofollow" attribute.
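
As a rough illustration of that kind of extension, here is a minimal sketch that also reports whether each link is marked nofollow. The function name get_links_from_html() and the 'nofollow' array key are my own assumptions for the example, not part of the code above. It also parses an HTML string rather than fetching a URL, so it can be tried without a network connection, and it swaps the @ operator for libxml_use_internal_errors() to keep parser warnings quiet.

```php
<?php
/**
 * get_links_from_html() (hypothetical helper, not part of the original function)
 *
 * @param string $html
 * @return array
 */
function get_links_from_html($html) {
    $links = array();

    // loadHTML() does not accept an empty string, so bail out early
    if (trim($html) === '') {
        return $links;
    }

    // Collect parser warnings internally instead of silencing them with @
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $link) {
        // rel can hold several space-separated tokens, e.g. "nofollow noopener"
        $rel = strtolower(trim($link->getAttribute('rel')));
        $tokens = ($rel === '') ? array() : preg_split('/\s+/', $rel);

        $links[] = array(
            'url' => $link->getAttribute('href'),
            'text' => trim($link->nodeValue),
            'nofollow' => in_array('nofollow', $tokens, true), // assumed key name
        );
    }

    return $links;
}

// Example input made up for this sketch
$html = '<p><a href="/a" rel="nofollow">One</a> <a href="/b">Two</a></p>';
$links = get_links_from_html($html);
```

To use it on a live page you could pass in the result of file_get_contents($url), assuming your PHP configuration allows opening URLs that way.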

If you have any questions about this, feel free to contact me as always.

Also, please note that this requires PHP 5 or later, as the DOMDocument class is not available in earlier versions.
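
If you are not sure whether your server has it, a quick check along these lines may help. This is only a sketch: dom_available() is a name I made up for the example, and the wording of the warning is arbitrary. Bear in mind that the DOM extension can also be switched off when PHP is built, so the class can be missing even on PHP 5.

```php
<?php
// Hypothetical guard: dom_available() is an illustrative name only
function dom_available() {
    // True when the DOMDocument class exists in this PHP installation
    return class_exists('DOMDocument');
}

if (!dom_available()) {
    // Warn instead of letting "new DOMDocument()" fail with a fatal error later
    trigger_error('The DOM extension is required to use DOMDocument', E_USER_WARNING);
}
```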