how to get all links from a web page

A question that comes up all the time on forums is “How do I get all the links on a web page?”, meaning the URLs and text inside its <a> tags. Here’s some code that does it, with a comment on each step:

<?php
/**
 * @author Jay Gilford
 */

// Regular expression pattern to match all links on a page.
// The named groups "url" and "text" are the keys read from $matches below.
$pattern = '%<a[^>]+href="(?P<url>[^"]+)"[^>]*>(?P<text>[^<]+)</a>%si';

// Webpage URL to get links from
$url = 'http://www.jaygilford.com/';

// Fetch contents of whole page
$page_content = file_get_contents($url);

// Get all matches of links and put them into the $matches variable
preg_match_all($pattern, $page_content, $matches);

// Variable to hold all of our urls and their text
$urls = array();

// Loop through each array item
foreach($matches['url'] as $k=>$v) {
    // Combine the URL and text into its own entry for ease of access
    $urls[$k] = array('url' => $v,'text' => $matches['text'][$k]);
}

// For display purposes only to show the contents of $urls
echo print_r($urls, true);
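
To see what the script collects without fetching a live page, here is a minimal sketch that runs the same steps on a hard-coded snippet. The `$sample_html` string is made up for illustration, and the pattern is written out in full on the assumption that its named groups are `url` and `text`, since those are the keys the loop reads.

```php
<?php
// Hypothetical sample markup, not taken from any real page
$sample_html = '<p><a href="/about/">About</a> and '
             . '<a class="ext" href="http://example.com/" rel="nofollow">Example site</a></p>';

// Assumed form of the pattern: named groups "url" and "text"
$pattern = '%<a[^>]+href="(?P<url>[^"]+)"[^>]*>(?P<text>[^<]+)</a>%si';

// Collect every match into $matches, keyed by group name
preg_match_all($pattern, $sample_html, $matches);

// Pair each URL with its link text
$urls = array();
foreach ($matches['url'] as $k => $v) {
    $urls[$k] = array('url' => $v, 'text' => $matches['text'][$k]);
}

// Print one "text => url" line per link
foreach ($urls as $link) {
    echo $link['text'] . ' => ' . $link['url'] . "\n";
}
```

Written this way, the pattern only picks up links whose href is in double quotes and whose content is plain text, so a link wrapped around an image or other tags would be skipped.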

If you have any questions regarding this, feel free to contact me. Details can be found on the about page.

2 Responses to “how to get all links from a web page”

  1. Rodger Says:

    I would like to know how we can fetch nofollow links with the same script. Is it possible to do this? I need to get the whole information of a URL.

    Rodger

  2. Jay Says:

    Hi Rodger

    You would be better off using the DOM to do this. Take a look at this article:
    http://www.jaygilford.com/php/php-dom-get-all-pagelinks/
    It should be clear from that how to get the nofollow links.

    Jay
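
For reference, a DOM-based approach to the nofollow question might look roughly like the sketch below. It is not the code from the linked article; it assumes PHP's bundled DOM extension is available, and the `$html` string is hypothetical sample markup standing in for the fetched page.

```php
<?php
// Hypothetical sample markup; in practice this would be the fetched page content
$html = '<a href="/about/">About</a>'
      . '<a href="http://example.com/" rel="nofollow noopener">Example</a>';

// Collect parser warnings internally, since real-world HTML is rarely perfect
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$nofollow = array();
foreach ($dom->getElementsByTagName('a') as $a) {
    // rel can hold several space-separated tokens, so split before checking
    $rel = preg_split('/\s+/', strtolower(trim($a->getAttribute('rel'))), -1, PREG_SPLIT_NO_EMPTY);
    if (in_array('nofollow', $rel, true)) {
        $nofollow[] = array(
            'url'  => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        );
    }
}

// For display purposes only
print_r($nofollow);
```

Because the parser hands back each attribute separately, checking `rel` does not depend on attribute order or quoting style the way a regular expression would.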
