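Today I was given a task: scrape the Dianping site, https://www.dianping.com. Using PHP curl, I ran into two problems (1 and 2 below); 3 and 4 describe how I got around the second one.
1. curl settings for https. This one is easy to solve by turning off peer certificate verification: curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
2. When I then tried to scrape, dianping.com still redirected me to a different link. After some analysis and troubleshooting, it turned out the site expects requests to carry cookies. See the screenshot.
3. Saving dianping.com's cookies with PHP's curl failed. Instead I ran the command-line curl in a Linux environment, curl -c cookie.txt https://www.dianping.com/search/category/207/10, which wrote cookie.txt directly. That is simpler than doing it in PHP. The contents of cookie.txt: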
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

.dianping.com	TRUE	/	FALSE	0	PHOENIX_ID	0a0102fe-15a825c9312-1834aca
.dianping.com	TRUE	/	FALSE	1551317789	s_ViewType	10
www.dianping.com	FALSE	/	FALSE	0	JSESSIONID	D5829965CE0CE4E539181967FE7FB063
.dianping.com	TRUE	/	FALSE	1519781789	aburl	1

4. I then pointed PHP at the cookie file and ran the scrape. It worked. See the screenshot and the code:
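<?php
$url = 'https://www.dianping.com/search/category/207/10#breadCrumb';

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_HEADER, false);            // do not include response headers in the output
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);    // skip certificate verification for https
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);     // follow redirects
curl_setopt($curl, CURLOPT_COOKIEFILE, 'cookie.txt'); // send the cookies saved by curl -c
curl_setopt($curl, CURLOPT_TIMEOUT, 60);              // give up after 60 seconds

$contents = curl_exec($curl);
if ($contents === false) {
    echo 'curl error: ' . curl_error($curl);
} else {
    var_dump($contents);
}
curl_close($curl);
?>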
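Because the cookie file is a static snapshot, its dated cookies will eventually expire and the redirect from problem 2 may come back. Below is a minimal sketch of one way to check this before scraping. It assumes the file is in the Netscape format shown above (seven tab-separated fields per line, with an expiry of 0 meaning a session cookie). The function names parse_cookie_file() and expired_cookie_names() are hypothetical helpers made up for this illustration; they are not part of PHP or libcurl.

```php
<?php
/**
 * Parse the text of a Netscape-format cookie file (as written by `curl -c`).
 * Hypothetical helper for illustration only.
 */
function parse_cookie_file(string $text): array
{
    $cookies = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $httpOnly = false;
        // libcurl marks HttpOnly cookies with this prefix; such lines are not comments.
        if (strpos($line, '#HttpOnly_') === 0) {
            $line = substr($line, strlen('#HttpOnly_'));
            $httpOnly = true;
        }
        // Skip blank lines and ordinary comment lines.
        if (trim($line) === '' || $line[0] === '#') {
            continue;
        }
        $fields = explode("\t", $line);
        if (count($fields) !== 7) {
            continue; // not a well-formed cookie line
        }
        [$domain, $subdomains, $path, $secure, $expires, $name, $value] = $fields;
        $cookies[] = [
            'domain'     => $domain,
            'subdomains' => $subdomains === 'TRUE',
            'path'       => $path,
            'secure'     => $secure === 'TRUE',
            'expires'    => (int) $expires, // 0 means a session cookie
            'name'       => $name,
            'value'      => $value,
            'httponly'   => $httpOnly,
        ];
    }
    return $cookies;
}

/**
 * Return the names of cookies whose expiry time is before $now.
 * Session cookies (expiry 0) are never reported as expired.
 */
function expired_cookie_names(array $cookies, int $now): array
{
    $expired = [];
    foreach ($cookies as $cookie) {
        if ($cookie['expires'] !== 0 && $cookie['expires'] < $now) {
            $expired[] = $cookie['name'];
        }
    }
    return $expired;
}
```

To use it, call expired_cookie_names(parse_cookie_file(file_get_contents('cookie.txt')), time()). If the result is not empty, re-running the curl -c command from step 3 should refresh the file. Whether the site actually rejects requests that carry expired cookies is an assumption that has not been verified here.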