PHP: Extract Outgoing URLs from a Web Page

Posted by Joys of Programming in PHP, Web

In PHP, you can download a web page using file_get_contents or curl. Once you have downloaded a web page, you can process it.
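The program later in this post uses file_get_contents. For the curl alternative, a minimal sketch might look like the following. The function name downloadURLWithCurl and the timeout parameter are assumptions made for illustration, and the sketch assumes the PHP curl extension is installed.

```php
<?php

// Hypothetical helper (not part of the original program):
// download a page with the curl extension instead of file_get_contents.
function downloadURLWithCurl($URL, $timeoutSeconds = 10) {
  $handle = curl_init($URL);
  if ($handle === false) {
    return false;
  }

  // Return the response body as a string instead of printing it.
  curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
  // Follow HTTP redirects.
  curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
  // Give up after the given number of seconds.
  curl_setopt($handle, CURLOPT_TIMEOUT, $timeoutSeconds);

  // A string on success, false on failure.
  return curl_exec($handle);
}
```

Like file_get_contents, this returns false when the download fails, so it could be swapped in without changing the error check in the calling code.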

A hyperlink in HTML has the following tag structure:

<a href="http://www.joysofprogramming.com">Joys of Programming</a>

Keeping this in mind, we can write the following program:

<?php

function extractElementsFromWebPage($webPage, $tagName) {
  // Create a DOMDocument object.
  $dom = new DOMDocument;

  // Parse the HTML. The @ suppresses the warnings that loadHTML
  // emits for malformed markup, which is common on real pages.
  if (@$dom->loadHTML($webPage)) {
    // Extract the specified elements from the web page.
    $elements = $dom->getElementsByTagName($tagName);
    return $elements;
  }
  return FALSE;
}

function downloadURL($URL) {
  $webPage = file_get_contents($URL);
  return $webPage;
}

$webPage = downloadURL("http://www.mozilla.org/");
if ($webPage) {
  $URLs = extractElementsFromWebPage($webPage, 'a');
  if ($URLs) {
    foreach ($URLs as $URL){
      // Extracting the URLs
      echo $URL->getAttribute('href'), "\n";
    }
  }
  else {
    echo "Error in parsing the webPage\n";
  }
}
else {
  echo "Error in downloading the webPage\n";
}
?>

A few points about the program are worth understanding.

The work is split across two functions:

  1. downloadURL
  2. extractElementsFromWebPage

downloadURL uses file_get_contents to download the web page and returns its contents as a string, or FALSE on failure. extractElementsFromWebPage uses the DOMDocument class: loadHTML parses the HTML page, and getElementsByTagName returns the elements with the specified tag name. In our case we want the hyperlink element a, and we read the href attribute of each one with getAttribute.
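loadHTML tends to emit warnings when the markup is not well formed, which is common for real pages. One way to keep those warnings out of the output is libxml_use_internal_errors. The following is a minimal sketch of that idea, not a replacement for the program above. The function name extractElementsQuietly is an assumption made for illustration.

```php
<?php

// Hypothetical variant of extractElementsFromWebPage: parse errors are
// collected by libxml instead of being reported as PHP warnings.
function extractElementsQuietly($webPage, $tagName) {
  $dom = new DOMDocument;

  // Switch to internal error handling and remember the previous setting.
  $previous = libxml_use_internal_errors(true);
  $loaded = $dom->loadHTML($webPage);

  // Discard the collected errors and restore the previous setting.
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  if (!$loaded) {
    return FALSE;
  }
  return $dom->getElementsByTagName($tagName);
}
```

It is called the same way as extractElementsFromWebPage, for example extractElementsQuietly($webPage, 'a').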

Executing the program prints the href attribute of every link on the page:

$ php extractURLs.php
#main
/
/about/
/community/
/projects/
/contribute/
/about/mission.html

http://www.mozilla.com/firefox/

http://www.mozilla.com/mobile/download/

...
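The output above mixes fragment links (#main), relative paths (/about/), and absolute URLs. If only links that leave the site are wanted, one possible approach is to keep the absolute http and https URLs whose host differs from the host of the page. The sketch below uses parse_url for this. The function name filterOutgoingURLs is an assumption, and the sample list contains illustrative values rather than the program's full output.

```php
<?php

// Hypothetical helper (not part of the original program): keep only
// absolute http/https links whose host differs from the page's own host.
function filterOutgoingURLs(array $hrefs, $pageHost) {
  $outgoing = array();
  foreach ($hrefs as $href) {
    $href = trim($href);
    $parts = parse_url($href);

    // Skip fragments, relative paths, scheme-less and malformed URLs.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
      continue;
    }

    // Skip schemes other than http and https.
    $scheme = strtolower($parts['scheme']);
    if ($scheme !== 'http' && $scheme !== 'https') {
      continue;
    }

    // Keep the link only if it points to a different host.
    if (strcasecmp($parts['host'], $pageHost) !== 0) {
      $outgoing[] = $href;
    }
  }
  // Remove duplicates and reindex the result.
  return array_values(array_unique($outgoing));
}

// Illustrative sample input.
$hrefs = array('#main', '/about/', 'http://www.mozilla.org/projects/',
               'http://www.mozilla.com/firefox/',
               'http://www.mozilla.com/firefox/');
print_r(filterOutgoingURLs($hrefs, 'www.mozilla.org'));
```

With this sample input, only http://www.mozilla.com/firefox/ remains, listed once. Note that the comparison is on the exact host name, so a subdomain of the same site would be treated as outgoing.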

Copyright © 2009-2012 Joys of Programming All rights reserved.