A newbie question about web scraping


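I want to scrape some phone data from ZOL, but http://detail.zol.com.cn/robots.txt seems to restrict crawlers:
Disallow: https://detail.zol.com.cn/*

As I understand it, this means a crawler can't access the product data? What should I do? Is there any workaround for scraping it?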
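For reference, here is a minimal sketch of how a well-behaved crawler reads such rules, using Python 3's standard `urllib.robotparser`. The rule lines and the product URL are assumptions made up for illustration: Disallow values are normally paths such as `/`, so a path-style rule stands in for the line quoted above, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, simplified rules standing in for the real robots.txt.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /",
]


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse in-memory lines, no network access
    return parser.can_fetch(user_agent, url)


# Hypothetical product page, used only to exercise the check.
PAGE = "http://detail.zol.com.cn/some/product/page.html"
blocked = not is_allowed(EXAMPLE_RULES, "my-crawler", PAGE)
```

Nothing here enforces anything on the server side. The check only tells a crawler what the site is asking for.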


16 replies • 2018-07-19 10:43:10 +08:00

Ethanp

Jul 18, 2018 via Android

If you already know to check robots.txt, you're no newbie.

alvin666

Jul 18, 2018 via Android

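Crawl quietly and slowly, keep the data for your own use, or pick a different site.
If they don't want you crawling, there's no real fix.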
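A rough sketch of what "slowly" could look like in Python 3: a generator that spaces out the URLs it hands to whatever does the fetching. The name `throttled` and the 2-second default are assumptions, not anything the site specifies.

```python
import time


def throttled(urls, min_interval=2.0, sleep=time.sleep, clock=time.monotonic):
    """Yield URLs one at a time, at least min_interval seconds apart."""
    last = None
    for url in urls:
        if last is not None:
            # Wait out whatever remains of the interval since the last URL.
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield url
```

Usage would be something like `for url in throttled(url_list): fetch(url)`, where `fetch` is whatever download function you already have.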

xpresslink

Jul 18, 2018

That robots.txt is mainly there to guide search engines. It has little to do with you.

geekcorn

Jul 18, 2018 via iPhone

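I think robots.txt is only an advisory restriction aimed at search engine crawlers. In theory, whatever a normal user can see and do in a browser, a machine can do too.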

b821025551b

Jul 18, 2018

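robots.txt is just a gentlemen's agreement... It's like leaving your front door open with a note that says "thieves, keep out". Would a thief really stay out?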

0x5f

Jul 18, 2018

Just spoof a normal browser UA.

liupanhi

Jul 18, 2018

You really are a newbie, hahaha

frmongo

Jul 18, 2018

@liupanhi Give a newcomer some pointers, don't just laugh and walk off

dcalsky

Jul 18, 2018 via Android

@frmongo When you send the HTTP request, change the User-Agent field in the headers to something else.
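Something like this, as a minimal Python 3 sketch with only the standard library (Python 2 spells the module `urllib2` instead). The UA string is just an example browser string, and `build_request` is a made-up helper name.

```python
from urllib.request import Request

# Example desktop-browser User-Agent string (an assumption; any real one works).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Build a request object that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


# Building the request does not touch the network.
req = build_request("http://detail.zol.com.cn/")
```

Passing `req` to `urllib.request.urlopen` would then send it with that header.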

dcalsky

Jul 18, 2018 via Android

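@frmongo Then again, you don't actually need to do anything extra, because robots.txt is only a declaration. Whether to honor it is entirely up to whoever writes the crawler.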

arctanx

Jul 18, 2018

OP has real principles, lol

ml1344677

Jul 18, 2018

You might want to read up on the crime of sabotaging computer information systems, lol

musclepanda

Jul 18, 2018

Are you using Scrapy? If so, just change it in the settings file. There's a robots.txt setting, ROBOTSTXT_OBEY; turn it off and you're done.
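Roughly, the relevant part of a Scrapy project's `settings.py` would look like the fragment below. `ROBOTSTXT_OBEY` and `DOWNLOAD_DELAY` are real Scrapy settings; the delay value is only an assumption for polite crawling.

```python
# settings.py (fragment of a Scrapy project)

# When False, Scrapy does not consult the site's robots.txt before fetching.
ROBOTSTXT_OBEY = False

# Seconds to wait between requests to the same site (illustrative value).
DOWNLOAD_DELAY = 2
```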

frmongo

Jul 19, 2018

@arctanx Haha

frmongo

Jul 19, 2018

@ml1344677 Whoa...

frmongo

Jul 19, 2018

@musclepanda I'm using request under Python 2. I wrote a very simple script that poses as the 360 browser's user agent, and it works.