Problems I ran into building a crawler in Python - V2EX


This topic was created 4195 days ago; the information mentioned may have changed since then.

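I built a project in Python that crawls Zhihu; the code is here: https://github.com/egrcc/zhihu-python
I have a few questions for everyone. Is there a way to get all the answers to a given question without simulating a login?
The project uses Beautiful Soup, but it is really slow. Is there another library that parses HTML faster? Also, for a large-scale deployment, would the Scrapy framework be a better choice?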

14 replies • 2014-12-17 18:05:12 +08:00

1

hadoop

Dec 17, 2014

How do you deal with Zhihu's CAPTCHA?

2

EPr2hh6LADQWqRVH

Dec 17, 2014

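bs4's parsing speed depends on the underlying parser. Install lxml and it will be several times faster.
Besides, you could just run several processes. For an I/O-bound job like this, even multiple threads should do.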
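
A minimal sketch of the threading suggestion, using only the standard library. The names `fetch_page` and `fetch_all` and the `max_workers` default are hypothetical, chosen for illustration; they are not part of zhihu-python. On the parser point, Beautiful Soup takes the parser name as its second argument, for example `BeautifulSoup(html, "lxml")` once lxml is installed.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Download one page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def fetch_all(urls, fetch=fetch_page, max_workers=8):
    """Fetch many URLs concurrently; results come back in input order.

    Threads suit this job because the time goes to waiting on the network,
    not to CPU work.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Passing `fetch` as a parameter makes it easy to swap in a function that also parses the page, or a stub when testing without network access.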

3

egrcc

OP

Dec 17, 2014 via Android

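@hadoop Logging in does not always require a CAPTCHA. One is only required when you retry after a failed login. As long as the login succeeds, the next login does not need one.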

4

invite

Dec 17, 2014

Send too many requests and your IP may get banned. Haha.

5

egrcc

OP

Dec 17, 2014 via Android

@invite Really? Is it that serious?

6

invite

Dec 17, 2014

@egrcc Did you think running a crawler was that easy?

7

iewgnaw

Dec 17, 2014

Don't chase speed too hard with a crawler. Go too fast and your IP will easily get banned.
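
One way to act on this advice is to enforce a minimum gap between requests. The sketch below is an illustration only: the `Throttle` class and its parameters are hypothetical names, and a suitable interval for any given site is an assumption you would have to tune yourself.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before each request keeps the crawl rate below one request per `min_interval` seconds.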

8

imn1

Dec 17, 2014

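lxml + XPath is faster than bs, and regex is faster still.
Exporting cookies lets you skip logging in inside the program. You still have to log in, in effect; the login screen is just the browser instead.

My crawlers never handle login themselves; they just read the browser's cookies. They are all for my own use and never published, so there's no need to make them that complicated.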
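
A rough sketch of the exported-cookies approach, assuming the cookies were saved from the browser in the Netscape `cookies.txt` format (for instance with a browser extension). The function name `opener_from_cookie_file` is hypothetical; `MozillaCookieJar` and `HTTPCookieProcessor` are standard library classes.

```python
from http.cookiejar import MozillaCookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def opener_from_cookie_file(path):
    """Build a urllib opener that sends cookies loaded from a cookies.txt file."""
    jar = MozillaCookieJar()
    # ignore_discard=True keeps cookies that have no expiry time, which the
    # loader would otherwise drop as session-only cookies.
    jar.load(path, ignore_discard=True, ignore_expires=False)
    return build_opener(HTTPCookieProcessor(jar)), jar
```

Requests made with the returned opener, via `opener.open(url)`, carry the browser's cookies, so the program itself never performs a login.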

9

shoumu

Dec 17, 2014

pyquery is a bit faster.

10

yaotian

Dec 17, 2014

@imn1 How do you read the browser's cookies automatically?

11

hadoop

Dec 17, 2014

@egrcc If you crawl too fast, you may be asked for a CAPTCHA.

12

libo26

Dec 17, 2014

"How do you read the browser's cookies automatically?"
@yaotian Google it, there are plenty of results.
For example:
http://n8henrie.com/2013/11/use-chromes-cookies-for-easier-downloading-with-python-requests/

13

CosWind

Dec 17, 2014

If you're going too fast, you can use proxies: http://pachong.org/
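
A small sketch of rotating requests across several HTTP proxies with the standard library. The function name `proxy_openers` is hypothetical, and the proxy addresses in the usage note are placeholders, not working proxies; how you obtain a proxy list is up to you.

```python
from itertools import cycle
from urllib.request import ProxyHandler, build_opener


def proxy_openers(proxy_urls):
    """Yield urllib openers forever, cycling through the given proxies."""
    for proxy in cycle(proxy_urls):
        # Route both plain and TLS traffic through the same proxy.
        yield build_opener(ProxyHandler({"http": proxy, "https": proxy}))
```

For example, `openers = proxy_openers(["http://192.0.2.10:8080", "http://192.0.2.11:8080"])` followed by `next(openers).open(url)` sends each successive request through the next proxy in the list.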

14

egrcc

OP

Dec 17, 2014 via Android

@CosWind Nice, this is good.
