PHP: Extract Outgoing URLs from a Web Page
In PHP, you can download a web page with file_get_contents or with the cURL extension. Once the page is downloaded, you can parse it and extract its links.
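The program in this post uses file_get_contents. For the cURL route, a minimal sketch might look like the following. It assumes the PHP cURL extension is installed; the function name downloadURLWithCurl and the 30-second timeout are illustrative choices, not part of the original program.

```php
<?php
// Hypothetical cURL-based counterpart to file_get_contents($URL).
// Assumes the PHP cURL extension is loaded.
function downloadURLWithCurl($URL) {
    $handle = curl_init($URL);
    if ($handle === false) {
        return false;
    }
    // Return the response body as a string instead of printing it.
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    // Follow HTTP redirects to the final page.
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
    // Give up after 30 seconds (an arbitrary value for this sketch).
    curl_setopt($handle, CURLOPT_TIMEOUT, 30);
    // curl_exec returns the page on success and false on failure.
    return curl_exec($handle);
}
```

Like file_get_contents, this returns the page as a string or false on failure, so a call such as downloadURLWithCurl("http://www.mozilla.org/") could stand in for a file_get_contents call without changing the surrounding checks.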
A hyperlink in HTML has the following structure:
<a href="http://www.joysofprogramming.com">Joys of Programming</a>
The URL is the value of the href attribute of the a element. Keeping this in mind, we write the following program:
<?php
function extractElementsFromWebPage($webPage, $tagName) {
    // Create a DOMDocument object.
    $dom = new DOMDocument;
    // Real-world HTML is rarely valid, so collect libxml's parse
    // warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    // Parse the HTML of the web page.
    if ($dom->loadHTML($webPage)) {
        // Extract the specified elements from the web page.
        return $dom->getElementsByTagName($tagName);
    }
    return FALSE;
}

function downloadURL($URL) {
    // Returns the page as a string, or FALSE on failure.
    return file_get_contents($URL);
}

$webPage = downloadURL("http://www.mozilla.org/");
if ($webPage) {
    $URLs = extractElementsFromWebPage($webPage, 'a');
    if ($URLs) {
        foreach ($URLs as $URL) {
            // Print the href attribute of each <a> element.
            echo $URL->getAttribute('href'), "\n";
        }
    }
    else {
        echo "Error in parsing the webPage\n";
    }
}
else {
    echo "Error in downloading the webPage\n";
}
?>
A few points about the program:
First, we download the web page with file_get_contents. Then we parse the HTML with PHP's DOMDocument class. The work is split across two functions:
- downloadURL
- extractElementsFromWebPage
downloadURL uses file_get_contents to download the web page, and extractElementsFromWebPage uses the DOMDocument class: loadHTML parses the HTML page and getElementsByTagName extracts the specified elements. In our case, we want the a (anchor) elements, whose href attribute holds each URL.
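To try the parsing step without a network connection, the same DOMDocument calls can be run on an inline string. This is only a sketch: the $html snippet is made up for illustration, and libxml_use_internal_errors is used here to keep parse warnings out of the output.

```php
<?php
// Hypothetical inline page used only for illustration.
$html = '<html><body>'
      . '<a href="http://www.joysofprogramming.com">Joys of Programming</a>'
      . '<a href="/about/">About</a>'
      . '<p>No link here</p>'
      . '</body></html>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$hrefs = array();
if ($dom->loadHTML($html)) {
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // getAttribute returns an empty string when href is missing.
        $hrefs[] = $anchor->getAttribute('href');
    }
}
// Discard any warnings libxml collected during parsing.
libxml_clear_errors();

print_r($hrefs);
```

The $hrefs array should end up holding the two href values, in document order; the p element is ignored because only a elements are requested.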
Executing the program produces output like the following:
$ php extractURLs.php
#main
/
/about/
/community/
/projects/
/contribute/
/about/mission.html
http://www.mozilla.com/firefox/
http://www.mozilla.com/mobile/download/
...
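Note that the output mixes fragments (#main), site-relative paths (/about/) and absolute URLs. If only links that leave the site are wanted, one possible filter uses parse_url, as sketched below. The helper name keepOutgoingURLs and the sample list are assumptions for illustration (the www.mozilla.org/contribute/ entry is invented to show a same-host link being dropped), and "outgoing" is taken here to mean an http or https URL whose host differs from the page's own host.

```php
<?php
// Hypothetical helper: keep only absolute http(s) URLs whose host
// differs from $ownHost. Not part of the original program.
function keepOutgoingURLs(array $hrefs, $ownHost) {
    $outgoing = array();
    foreach ($hrefs as $href) {
        $parts = parse_url($href);
        // Skip fragments (#main), relative paths (/about/) and malformed values.
        if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
            continue;
        }
        $scheme = strtolower($parts['scheme']);
        if ($scheme !== 'http' && $scheme !== 'https') {
            continue;
        }
        // Compare hosts case-insensitively; keep the link if it points elsewhere.
        if (strcasecmp($parts['host'], $ownHost) !== 0) {
            $outgoing[] = $href;
        }
    }
    return $outgoing;
}

// Sample hrefs: the first four appear in the output above,
// the last one is an invented same-host link.
$hrefs = array(
    '#main',
    '/',
    '/about/',
    'http://www.mozilla.com/firefox/',
    'http://www.mozilla.org/contribute/',
);
$outgoing = keepOutgoingURLs($hrefs, 'www.mozilla.org');
print_r($outgoing);
```

With this sample list, only http://www.mozilla.com/firefox/ should remain. Whether a sibling domain such as mozilla.com counts as "outgoing" is a design choice; a stricter or looser host comparison may suit other sites better.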