Using a Python crawler to dump tables from web pages into MySQL: are there any good ready-made libraries worth recommending? - V2EX

This topic was created 2688 days ago; the information mentioned may have changed or developed since then.

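What I'm wondering is whether there is a ready-made library that takes the table on a page and, like an optical projection in geometry, casts it straight onto the sheet of paper underneath (MySQL), where it takes shape. It doesn't have to be perfect; once the data is in MySQL, later corrections are easy to make. That way I wouldn't have to spend time studying the HTML code, looping over td and tr and picking out values...
Or is my way of thinking about this wrong?
When you use crawlers to process tables, what tips and experience can you share?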

11 replies • 2019-01-31 14:37:55 +08:00

1

David1119

Jan 30, 2019

pandas
Read: pd.read_html
Save: df.to_csv or df.to_sql
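
For readers who have not used these calls, here is a minimal sketch of the read-then-save flow suggested above. The sample HTML and the table name `stock_table` are made up for illustration, and `sqlite3` stands in for MySQL so the snippet runs without a server. `pd.read_html` also needs an HTML parser installed, such as lxml, or BeautifulSoup with html5lib.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample page; in practice this would be the fetched HTML.
HTML = """
<table>
  <tr><th>code</th><th>name</th><th>close</th></tr>
  <tr><td>600000</td><td>Alpha</td><td>10.5</td></tr>
  <tr><td>600001</td><td>Beta</td><td>7.25</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(io.StringIO(HTML))
df = tables[0]

# sqlite3 is only a stand-in so the sketch is runnable.
# For MySQL, to_sql expects a SQLAlchemy engine instead, e.g.
# create_engine("mysql+pymysql://user:password@host/dbname") (placeholder URL).
conn = sqlite3.connect(":memory:")
df.to_sql("stock_table", conn, if_exists="replace", index=False)

rows = conn.execute("SELECT code, name, close FROM stock_table").fetchall()
print(rows)
```

Column types are inferred by pandas, so it is worth checking the resulting schema before relying on it.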

2

xpresslink

Jan 30, 2019

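In my practical experience, the least-effort approach is

scrapy + djangoitem + django ORM + MySQL

You can get web page data into the database with very little code (usually a few dozen lines).

The prerequisite is that you know django and scrapy and are fluent in XPath.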

3

AicherZX

Jan 30, 2019

@xpresslink Why not scrapy + sqlalchemy + mysql?

4

xpresslink

Jan 30, 2019

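@AicherZX If you insist on putting it that way, you could also use peewee, or pymysql directly.
But there is also the constraint that it should be low-effort, isn't there?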
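
To make the Scrapy-plus-SQLAlchemy alternative from this exchange concrete, here is a rough sketch of a Scrapy-style item pipeline. It is not code from either poster. The `trades` table, its columns, and the class name `SqlAlchemyPipeline` are hypothetical, and an in-memory SQLite URL replaces MySQL so the sketch runs without a server.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

metadata = MetaData()

# Hypothetical table layout; real columns depend on the page being scraped.
trades = Table(
    "trades",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("code", String(16)),
    Column("name", String(64)),
)


class SqlAlchemyPipeline:
    """Item pipeline using Scrapy's open_spider/process_item/close_spider hooks."""

    def __init__(self, db_url="sqlite:///:memory:"):
        # SQLite keeps the sketch runnable; a MySQL URL would look like
        # "mysql+pymysql://user:password@host/dbname" (placeholder values).
        self.db_url = db_url
        self.engine = None

    def open_spider(self, spider):
        self.engine = create_engine(self.db_url)
        metadata.create_all(self.engine)  # create the table if it is missing

    def process_item(self, item, spider):
        # engine.begin() commits on success and rolls back on error.
        with self.engine.begin() as conn:
            conn.execute(trades.insert(), dict(item))
        return item

    def close_spider(self, spider):
        self.engine.dispose()
```

In a Scrapy project such a class would be enabled through the `ITEM_PIPELINES` setting. The class itself does not import Scrapy, which makes it easy to try out in isolation.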

5

locoz

Jan 30, 2019

@David1119 #1 Whoa, pandas has something like this? Awesome, the results are really good.

6

locoz

Jan 30, 2019

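What the 2nd-floor poster mentioned is probably one of the best ways to parse tables in HTML. I tested it on a table page I had crawled before,
http://data.eastmoney.com/stock/tradedetail/2019-01-30.html . Although the table on this page is generated by JS, it works well as a test case. The result is as follows:

Feed in the HTML string for it to parse, and one line gives you the result.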

7

xiaozizayang

Jan 31, 2019

Table tags are easy to recognize, so writing a general-purpose crawler for this case isn't hard.
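
As a rough illustration of that point, here is a sketch of a generic table extractor built only on the standard library's `html.parser`. The names `TableExtractor` and `extract_tables` are made up for this example. Nested tables and `rowspan`/`colspan` are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the current row, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def _close_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._close_row()
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def extract_tables(html):
    """Parse an HTML string and return all tables found in it."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Each returned table is a plain list of rows, so it can be handed to any database layer, for example through `cursor.executemany` in a DB-API driver.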

8

yanzixuan

Jan 31, 2019

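@xpresslink I always write my crawlers by hand: requests + parsel + sqlalchemy + mongodb.
mongodb serves as the test environment, where I can experiment freely without worrying about fields.
Then I export the mongodb tables and auto-generate the sqlalchemy tables from them.
The production environment uses MySQL.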
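
The poster's own tooling is not shown, so what follows is only a guess at the general idea: infer column types from schemaless documents and derive a relational table from them. To stay runnable with the standard library alone, plain dicts stand in for exported MongoDB documents and a `CREATE TABLE` statement on SQLite stands in for a generated SQLAlchemy table. All names here (`infer_columns`, `create_table_sql`, `trades`) are hypothetical.

```python
import sqlite3

# Map Python value types found in the documents to SQL column types.
SQL_TYPES = {bool: "INTEGER", int: "INTEGER", float: "REAL", str: "TEXT"}


def infer_columns(docs):
    """Return {field: sql_type}; conflicting types widen to REAL or TEXT."""
    columns = {}
    for doc in docs:
        for field, value in doc.items():
            if value is None:
                columns.setdefault(field, None)  # type still unknown
                continue
            sql_type = SQL_TYPES.get(type(value), "TEXT")
            previous = columns.get(field)
            if previous is None:
                columns[field] = sql_type
            elif previous != sql_type:
                numeric = {previous, sql_type} == {"INTEGER", "REAL"}
                columns[field] = "REAL" if numeric else "TEXT"
    # Fields that were only ever None fall back to TEXT.
    return {field: sql_type or "TEXT" for field, sql_type in columns.items()}


def create_table_sql(name, columns):
    cols = ", ".join(f'"{field}" {sql_type}' for field, sql_type in columns.items())
    return f'CREATE TABLE "{name}" ({cols})'


# Hypothetical documents, as they might look after a test crawl.
docs = [
    {"code": "600000", "close": 10, "note": None},
    {"code": "600001", "close": 7.25, "volume": 1200},
]
columns = infer_columns(docs)
ddl = create_table_sql("trades", columns)

conn = sqlite3.connect(":memory:")
conn.execute(ddl)
for doc in docs:
    fields = list(doc)
    quoted = ", ".join(f'"{f}"' for f in fields)
    marks = ", ".join("?" for _ in fields)
    conn.execute(
        f'INSERT INTO "trades" ({quoted}) VALUES ({marks})',
        [doc[f] for f in fields],
    )
print(ddl)
```

Fields missing from a document are simply left as NULL in that row, which mirrors how loosely structured test data ends up in a fixed schema.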

9

xpresslink

Jan 31, 2019

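@yanzixuan There are really no hard-and-fast rules for crawlers. You implement it first with whatever takes the least effort, because the target page may well be redesigned in a couple of days.
Not to mention anti-crawler measures.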

10

wwg1994

Jan 31, 2019

@locoz Is this the code: pd.read_html('http://data.eastmoney.com/stock/tradedetail/2019-01-30.html')? Why do I get an empty list?

11

d5

Jan 31, 2019

powerbi