Python + pandas + chunksize: how do I group by chunk and then aggregate the counts? - V2EX

This topic was created 3230 days ago; the information mentioned may have changed since.

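I have a very large file with one MD5 value per line, and I need to count how many times each MD5 occurs.
Calling pandas.read_csv directly raises a MemoryError.
Reading line by line into a dict would work, but that is not what I'm after.

How can I read the file in chunks, group and count within each chunk, and then combine the results?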
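    import pandas as pd

    data = pd.read_csv('file.csv', header=None, iterator=True)  # assumed: the post does not show how `data` is created

    loop = True
    chunkSize = 100000
    chunks = []

    while loop:
        try:
            chunk = data.get_chunk(chunkSize)
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped.")

    df = pd.concat(chunks, ignore_index=True)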

8 replies • 2017-08-10 19:44:57 +08:00

1

ferstar

Aug 10, 2017

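I happen to have a similar dataset on hand; the only difference from the OP's is that each line holds an integer in [100, 150]. This is how I counted them: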
---
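```python
from collections import Counter

import pandas as pd

size = 2 ** 10  # rows per chunk
counter = Counter()
# Read the file in chunks so it never has to fit in memory all at once.
for chunk in pd.read_csv('file.csv', header=None, chunksize=size):
    # Each row is a one-element array; take its only column.
    counter.update([i[0] for i in chunk.values])

print(counter)
```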
---
The output looks roughly like this:
```
Counter({100: 41,
101: 40,
102: 40,
...
150: 35})
```
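The same chunked counting can also stay inside pandas: count each chunk with `value_counts()` and add the partial results together. What follows is only a minimal sketch, assuming a headerless one-column file like the OP's. The function name `count_values_in_chunks` and the column name `md5` are illustrative and do not come from the thread.

```python
import pandas as pd


def count_values_in_chunks(path, chunksize=100_000):
    """Count how often each line value occurs, reading the file chunk by chunk."""
    total = None
    # dtype=str keeps hash strings from being parsed as numbers.
    reader = pd.read_csv(path, header=None, names=["md5"], dtype=str,
                         chunksize=chunksize)
    for chunk in reader:
        # Group and count within this chunk only.
        partial = chunk["md5"].value_counts()
        # Merge into the running total; values missing on one side count as 0.
        total = partial if total is None else total.add(partial, fill_value=0)
    if total is None:
        return pd.Series(dtype="int64")
    # add() with fill_value produces floats, so cast back to integers.
    return total.astype("int64").sort_values(ascending=False)
```

Only one chunk plus the running totals are held in memory at a time, so this works as long as the number of distinct values fits in memory.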

2

caomaocao

Aug 10, 2017

Use Counter(), or go with a MapReduce-style approach~
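One way to read the MapReduce suggestion on a single machine: a map step counts each chunk independently, and a reduce step merges the partial counts. This is a sketch using only the standard library; the helper names (`iter_chunks`, `map_chunk`, `reduce_counts`, `count_lines`) are hypothetical and chosen for illustration.

```python
from collections import Counter
from functools import reduce
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists built from successive groups of chunk_size lines, blanks dropped."""
    it = iter(lines)
    while True:
        chunk = [line.strip() for line in islice(it, chunk_size)]
        if not chunk:
            return
        yield [value for value in chunk if value]


def map_chunk(chunk):
    """Map step: count the values within a single chunk."""
    return Counter(chunk)


def reduce_counts(left, right):
    """Reduce step: merge one partial count into the accumulated one."""
    left.update(right)
    return left


def count_lines(path, chunk_size=100_000):
    """Count identical lines in a file by mapping over chunks and reducing."""
    with open(path, encoding="utf-8") as handle:
        partials = map(map_chunk, iter_chunks(handle, chunk_size))
        return reduce(reduce_counts, partials, Counter())
```

Because each map step depends only on its own chunk, the same structure could be spread across processes or machines, with only the small partial counters being sent to the reducer.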

3

chuanqirenwu

Aug 10, 2017

dask does it in one line.

dd.groupby().count(): the same API as pandas, but it extends "fit in memory" to "fit on disk".

4

zhusimaji

Aug 10, 2017 via iPhone

Counter is worth a try; if you have a distributed environment, MapReduce is the first choice.

5

zhusimaji

Aug 10, 2017 via iPhone

(Typo fix for the reply above: "distributed environment".)

6

zhusimaji

Aug 10, 2017 via iPhone

@chuanqirenwu Learned a new trick. For large data volumes I usually do the computation with Spark; I just took a look at dask and it's a nice package.

7

F281M6Dh8DXpD1g2

Aug 10, 2017 via iPhone

sort | uniq -c

8

notsobad

Aug 10, 2017

It's simpler with the shell:

cat x.txt | sort | uniq -c
