Web Scraping in PHP

Almost every PHP developer has scraped some data from the web at one point or another. Web scraping is a technique for extracting and cleaning up data from pages across the internet.

With the techniques in this tutorial on web scraping in PHP, you can collect large numbers of records from different websites in a short time.

Web scraping (extracting data from web pages) was in use before APIs existed, and it matters even more today because many sites offer no API for some of their data. In those cases we have to download the whole page as HTML and parse it to reach the data we want.

For example, you can use this tutorial to fetch music details from a music download site, movie and TV series details from a reference site such as IMDb, currency exchange rates, daily news, and many other kinds of data that are in heavy use today.

Wikipedia has a thorough article on web scraping. It describes several techniques, such as pattern matching with regex and DOM parsing (the approach used here), and it also lists ready-made tools.
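To make the difference between the two techniques concrete, here is a minimal sketch that extracts the same links once with a regular expression and once with DOM parsing. The sample markup and the variable names are assumptions made for illustration; a real page would be downloaded first.

```php
<?php
// Hypothetical sample markup; a real page would come from an HTTP request.
$html = '<ul><li><a href="/first">First post</a></li>'
      . '<li><a class="x" href="/second">Second post</a></li></ul>';

// Approach 1: pattern matching with a regular expression.
// Quick to write, but it depends on the exact textual shape of the markup.
preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);
$linksByRegex = $matches[1];

// Approach 2: DOM parsing. The HTML is parsed into a tree first,
// so attribute order and spacing no longer matter.
$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$linksByDom = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $linksByDom[] = $anchor->getAttribute('href');
}

print_r($linksByRegex);
print_r($linksByDom);
```

Both approaches find the same two links here. The regex version tends to break sooner when the markup changes (single quotes around attributes, for instance), which is one reason to prefer DOM parsing.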

Earlier tutorials also covered web scraping for collecting a list of email addresses and for extracting the links on a page.

Table of contents

- What you will learn in this tutorial on web scraping in PHP
- Creating the database

What you will learn in this tutorial on web scraping in PHP

- Extracting data from different websites
- Storing that data in a MySQL database
- A PHP class for extracting data with DOMDocument
- Automating the web scraping process in PHP

Creating the database

CREATE TABLE IF NOT EXISTS `site_data` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(50) NOT NULL,
  `author` varchar(50) NOT NULL,
  `tags` varchar(50) NOT NULL,
  `recent_posts` varchar(250) NOT NULL,
  `entry_date` date NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `entry_date` (`entry_date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=4;

The Db.php file

To connect to the database and store the scraped data, we created a Db class that uses PDO to connect to MySQL.

<?php
class Db
{
    private $conn;
    public $username = "root";
    public $dbname = "webscraping";
    public $password = "krd123";
    public $host = "localhost";

    // Opens a PDO connection to MySQL and returns it (null if the connection fails)
    public function getDbconnection()
    {
        $this->conn = null;
        try {
            $this->conn = new PDO("mysql:host=" . $this->host . ";dbname=" . $this->dbname, $this->username, $this->password);
        } catch (PDOException $e) {
            echo "Error occurred while connecting to db: " . $e->getMessage();
        }
        return $this->conn;
    }
}

Creating the WebScraping class

WebScraping.class.php

<?php
class WebScraping
{
    // Declaring class variables
    public $url;
    public $source;
    public $pathObj;

    // Construct method called on instantiation of object
    function __construct($url)
    {
        // Setting URL attribute
        $this->url = $url;
        // Downloading the page source for the URL
        $this->source = $this->getCurl($this->url);
        // Building the XPath object from the downloaded source
        $this->pathObj = $this->getXPathObj($this->source);
    }

    // Method for making a GET request using cURL
    public function getCurl($url)
    {
        // Initialising cURL session
        $ch = curl_init();
        // Returning transfer as a string
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        // Setting URL
        curl_setopt($ch, CURLOPT_URL, $url);
        // Executing cURL session
        $results = curl_exec($ch);
        // Closing cURL session
        curl_close($ch);
        // Return the results
        return $results;
    }

    // Method to get XPath object
    public function getXPathObj($item)
    {
        // Instantiating a new DomDocument object
        $xmlPageDom = new DomDocument();
        // Loading the HTML from downloaded page (@ suppresses warnings about invalid markup)
        @$xmlPageDom->loadHTML($item);
        // Instantiating new XPath DOM object
        $xmlPageXPath = new DOMXPath($xmlPageDom);
        return $xmlPageXPath;
    }
}

Here we defined two methods named getCurl() and getXPathObj().

The getCurl() method sends a GET request with cURL and returns the HTML of the requested document.
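getCurl() returns whatever curl_exec() gives back, including false when the request fails. The following is a minimal sketch of a more defensive variant. The function name fetchHtml(), the timeout value, and the choice to treat HTTP status 400 and above as failures are assumptions for illustration, not part of the tutorial's class.

```php
<?php
/**
 * Fetch a page with cURL and return its HTML, or null on failure.
 * fetchHtml() is a hypothetical helper, not part of the tutorial's class.
 */
function fetchHtml(string $url, int $timeoutSeconds = 15): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);  // give up after this many seconds

    $body   = curl_exec($ch);                            // false when the transfer fails
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);     // 0 when no response was received
    $error  = curl_error($ch);                           // empty string when there was no error

    if ($body === false || $status >= 400) {
        error_log("Fetching $url failed (HTTP $status): $error");
        return null;
    }
    return $body;
}
```

A caller can then check for null before parsing, for example `$html = fetchHtml('https://example.com/');`, instead of passing a failed download to the DOM parser.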

In getXPathObj() we create a DomDocument object and load the value of $item into it as an HTML document with loadHTML().

$xmlPageDom is then passed to a DOMXPath object, which lets us retrieve any data we select from the HTML page.
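The DomDocument and DOMXPath steps can be tried without any network access. The sketch below runs two XPath queries against a small inline HTML string. The markup is a made-up stand-in for a blog post, and its class names are assumptions chosen to resemble the ones used later in this tutorial.

```php
<?php
// Hypothetical markup that imitates the structure of a blog post.
$html = '<html><body>'
      . '<h1 class="entry-title">Getting started</h1>'
      . '<ul class="post-tags"><li><a href="#">git</a></li><li><a href="#">github</a></li></ul>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// query() returns a DOMNodeList; item(0) is the first match, or null if nothing matched.
$titleNode = $xpath->query('//h1[@class="entry-title"]')->item(0);
$title = $titleNode !== null ? trim($titleNode->nodeValue) : '';

// Iterating the node list collects the text of every matching <a> element.
$tags = [];
foreach ($xpath->query('//ul[@class="post-tags"]/li/a') as $node) {
    $tags[] = $node->nodeValue;
}

echo $title, PHP_EOL;
echo implode(',', $tags), PHP_EOL;
```

Checking item(0) for null before reading nodeValue avoids an error when the target site changes its layout and a selector stops matching.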

If you are scraping an XML page, use loadXML instead of loadHTML.
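As a small illustration of the XML case, the sketch below parses an inline XML string with loadXML() and reads it with the same DOMXPath approach. The feed structure, with its rates and rate elements, is invented for this example.

```php
<?php
// Hypothetical XML feed; a real one would be downloaded first.
$xml = '<rates><rate currency="USD">1.00</rate><rate currency="EUR">0.92</rate></rates>';

$dom = new DOMDocument();
$loaded = $dom->loadXML($xml);   // returns false if the XML is not well-formed

$rates = [];
if ($loaded) {
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('/rates/rate') as $rate) {
        // Map the currency attribute to the element's text content.
        $rates[$rate->getAttribute('currency')] = $rate->nodeValue;
    }
}

print_r($rates);
```

Unlike loadHTML(), loadXML() expects well-formed input, so it is worth checking its return value before querying.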

Getting data from the HTML document and saving it to the database

We fetch details such as the post's author, title, publication date, tags, and recent posts from the blog post whose URL appears in the code below.

Scrapper.php

<?php
require_once('WebScraping.class.php');
require_once("Db.php");

// initializing the database object
$database = new Db;
$db = $database->getDbconnection();

// initializing the WebScraping object and retrieving the data from the html elements
$PostsData = new WebScraping('https://beingjaydesaicom.wordpress.com/2016/09/15/getting-started-with-github/');

// query() evaluates the XPath expression against the html document
// item(0)->nodeValue gives the value of the first matching node

// Getting the post's title
$posts_title = $PostsData->pathObj->query('//h1[@class="entry-title"]')->item(0)->nodeValue;

// Getting the post's author
$posts_author = $PostsData->pathObj->query('//footer[@class="entry-footer"]/ul[@class="post-meta"]/li[@class="author vcard"]/a')->item(0)->nodeValue;

// Getting the post's release date
$posts_Releasedate = $PostsData->pathObj->query('//footer[@class="entry-footer"]/ul[@class="post-meta"]/li[@class="posted-on"]/time[@class="entry-date published"]')->item(0)->nodeValue;

// Getting the names of all the recent posts
// note we haven't used item(0)->nodeValue as we need the text of every <a> tag
$recent_posts = $PostsData->pathObj->query('//div[@class="widget-area"]/aside[@class="widget widget_recent_entries"]/ul/li/a');
$all_recent_posts = array();
if ($recent_posts !== false) {
    foreach ($recent_posts as $post) {
        $all_recent_posts[] = $post->nodeValue;
    }
}

// Getting all the tags of the post
// note we haven't used item(0)->nodeValue as we need the text of every <a> tag
$tags = $PostsData->pathObj->query('//footer[@class="entry-footer"]/div[@class="meta-wrapper"]/ul[@class="post-tags"]/li/a');
$posts_tags = array();
if ($tags !== false) {
    foreach ($tags as $tag) {
        $posts_tags[] = $tag->nodeValue;
    }
}

// using implode to get comma separated values of recent posts and tags before inserting them into the db columns
$inn_tags = implode(",", $posts_tags);
$ins_recent_posts = implode(",", $all_recent_posts);

// convert the date into Y-m-d format to save it in the mysql database
$entry_date = date("Y-m-d", strtotime($posts_Releasedate));

// insert query
$insert_db_query = "INSERT INTO site_data SET title=:title, author=:author, recent_posts=:recent_posts, tags=:tags, entry_date=:release";

// prepare the query
$exec = $db->prepare($insert_db_query);

// set the inputs and sanitize them
$title = htmlspecialchars(strip_tags($posts_title));
$release = htmlspecialchars(strip_tags($entry_date));
$author = htmlspecialchars(strip_tags($posts_author));
$al_recent_posts = htmlspecialchars(strip_tags($ins_recent_posts));
$al_tags = htmlspecialchars(strip_tags($inn_tags));

// bind parameters
$exec->bindParam(":title", $title);
$exec->bindParam(":release", $release);
$exec->bindParam(":author", $author);
$exec->bindParam(":recent_posts", $al_recent_posts);
$exec->bindParam(":tags", $al_tags);

if ($exec->execute()) {
    echo "Data Inserted into db";
} else {
    echo "<pre>";
    print_r($exec->errorInfo());
    echo "</pre>";
}
?>

Each step is explained briefly in a comment in the code, so you can see what it is for.

Here we first create a database object and an instance of our WebScraping class. We then inspected the source of the target URL so that we could write a suitable selector (an XPath expression) for each piece of data, such as the author name and the publication date. Finally, we fetched the data with query() and grouped it.
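The two conversions that prepare the scraped values for the table can be checked in isolation. This is a minimal sketch with made-up sample values: the date string and the tag names are assumptions, not data taken from the target page.

```php
<?php
// Hypothetical scraped values.
$scrapedDate = 'September 15, 2016';
$tagList = ['git', 'github', 'version control'];

// strtotime() parses the human-readable date; it returns false if it cannot.
$timestamp = strtotime($scrapedDate);
$entryDate = $timestamp !== false ? date('Y-m-d', $timestamp) : null;

// implode() joins the array into one comma separated string for a single column.
$tagsColumn = implode(',', $tagList);

echo $entryDate, PHP_EOL;    // 2016-09-15
echo $tagsColumn, PHP_EOL;   // git,github,version control
```

Keep in mind that the tags column in the table above is varchar(50) and recent_posts is varchar(250), so a long joined string may be truncated or rejected, depending on the MySQL SQL mode.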

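Finally, we used an SQL INSERT query to store the data in the MySQL database. The prepared statement with bound parameters keeps the scraped values from altering the query itself, and before binding we also pass each value through strip_tags and htmlspecialchars to neutralize any malicious HTML in the input.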
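To see what the cleaning step does to a value, here is a minimal sketch. The helper name cleanText() and the explicit ENT_QUOTES and UTF-8 arguments are assumptions added for illustration; the tutorial's script calls htmlspecialchars(strip_tags(...)) with the default arguments.

```php
<?php
/**
 * Remove HTML tags, then escape the characters that are special in HTML.
 * cleanText() is a hypothetical helper, not part of the tutorial's script.
 */
function cleanText(string $raw): string
{
    return htmlspecialchars(strip_tags(trim($raw)), ENT_QUOTES, 'UTF-8');
}

$clean = cleanText('<b>Tom & Jerry</b>');
echo $clean, PHP_EOL;   // Tom &amp; Jerry
```

Note that the escaped form is what ends up in the database, so the stored text should be printed as-is when it is displayed and not escaped a second time.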

If the SQL query runs successfully we print a message; otherwise we display the corresponding error.

Running the script automatically (Linux)

You can collect a large amount of data every day and store it in the database for display on your site (for example, in a news or exchange-rate section).

So if you have a website that uses this data, you need a cron job to run the script automatically at the interval you need.

We covered how to work with cron in detail in the cron job tutorial.

Run crontab -e in a terminal. If it asks you to choose an editor, pick option 2, then add the following line:

0 18 * * * /usr/bin/php -f /var/www/html/web-scraping/scrapper.php >> /var/www/html/web-scraping/log.txt

This line runs your script every day at 6 PM and appends its output to log.txt. File names are case-sensitive on Linux, so the path in the crontab entry must match the script's actual file name.
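Because cron appends every run's output to the same file, it helps if each message carries the time it was written. The sketch below shows one way to do that; the helper name logLine() and the timestamp format are assumptions, not part of the tutorial's script.

```php
<?php
/**
 * Prefix a message with the current date and time and end it with a newline.
 * logLine() is a hypothetical helper, not part of the tutorial's script.
 */
function logLine(string $message): string
{
    return '[' . date('Y-m-d H:i:s') . '] ' . $message . PHP_EOL;
}

// Printed to standard output, which the crontab entry appends to log.txt.
echo logLine('Data inserted into db');
```

With this in place, each line in log.txt starts with the run time, which makes it easier to tell one day's run from the next.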

For more information on the core classes used in this tutorial on web scraping in PHP, see the following links:

DOMXPath
DOMDocument

The next tutorial walks through a comprehensive example of extracting movie information from IMDb.

I hope you got the most out of this tutorial on web scraping in PHP.
