Postgres Import CSV cp936 Error

Postgres导入CSV中文编码错误

现在公司数据资源非常丰富,动辄要处理的csv就1E8行纪录,10G,分析起来相当爽。编码格式是cp936,用官方的COPY命令导入Postgres,一直出错,目前COPY的ENCODING不支持cp936。试着用Pandas读几行,保持csv为utf-8再COPY没问题。

解决办法:通过Python3的Pandas读取CSV转换成utf-8,再利用sqlalchemy+psycopg2(目前只支持到Python3.4,还不支持3.5)导入即可。

这么多依赖推荐使用Anaconda,十分方便,尤其适合解决在Windows平台编译C语言相关库(lxml,psycopg2,numpy,pandas)时的各种奇葩问题。

具体方法如下:

In [1]:
import pandas as pd
from sqlalchemy import create_engine
import ipython_memory_usage.ipython_memory_usage as imu
In [2]:
imu.start_watching_memory()
In [2] used 0.0195 MiB RAM in 0.12s, peaked 0.00 MiB above current, total RAM usage 61.92 MiB
In [3]:
chunksize = 1E6
engine = create_engine('postgresql+psycopg2://postgres:tianKong7@localhost:5432/postgres')
In [3] used 1.3125 MiB RAM in 0.16s, peaked 0.00 MiB above current, total RAM usage 63.23 MiB
In [ ]:
df = pd.read_csv('c_全年库存结构变动_20151124122730.csv', encoding='cp936', chunksize=chunksize)
In [4] used 4.0273 MiB RAM in 0.12s, peaked 0.00 MiB above current, total RAM usage 67.26 MiB
In [ ]:
for chunk in df:
    chunk.to_sql("inventory", engine, if_exists='append', chunksize=chunksize)
    del chunk
In [ ]:
# CREATE TABLE inv (
#     stat_date	date,
#     store_src_cd	int,
#     sku_id	int,
#     stock_qty	int,
#     canuse_qty	int,
#     len	int,
#     width	int,
#     height	int,
#     item_first_cate_name	varchar(80),
#     item_second_cate_name	varchar(80),
#     item_third_cate_name	varchar(80)
# );

# COPY inv FROM 'D:\\reader\\dask-tutorial\\top200.csv' DELIMITER ',' CSV;