I am trying to write a script that appends new rows of a table to an existing file of tabular data, so that not all progress is lost if an error is encountered. Here are seven different ways to do this in Python:
- Write the data to a text file (a module-free method).
- Write the data to a CSV file using writerow of the csv module.
- Write the data to a database using sqlite3 (or another database library).
- Write the data to a CSV file using polars.
- Structure the data as a pandas DataFrame and pickle it.
- Structure the data as a polars DataFrame and pickle it.
- Structure the data as a pandas DataFrame and save it as a Parquet file using fastparquet.
The following code measures the runtime of each of these options, excluding the time required to build the data, the header, and any single-row DataFrames. (I have tried to make this code as concise as possible, while also keeping it readable.)
import numpy as np
from time import perf_counter as pc
import csv
import sqlite3
import pandas as pd
import polars as pl
import pickle as pkl
import fastparquet as fp
header = list('ABCDE')
n_rows = 10**3
data = np.random.rand(n_rows, 5)
df_pd = pd.DataFrame(data, columns=header)
df_pl_header = pl.DataFrame(schema=header)
df_pl = pl.DataFrame(data, schema=header)
print('----------------------------------------------------------------------')
print('[no module]')
fname = 'file0.txt'
pc0 = pc()
with open(fname, mode='w') as f:
    f.write('\t'.join(header) + '\n')
    f.flush()
    for k in range(n_rows):
        f.write('\t'.join(map(str, data[k])) + '\n')
        f.flush()  # flush after every row so completed rows survive an error
print(f'To write data to file: {pc()-pc0} sec')
print('----------------------------------------------------------------------')
print('csv')
fname = 'file1.csv'
pc0 = pc()
with open(fname, mode='w', newline='') as f:
    csvwriter = csv.writer(f)
    csvwriter.writerow(header)
    f.flush()
    for row in data:
        csvwriter.writerow(row)
        f.flush()
print(f'To write data to file: {pc()-pc0} sec')
print('----------------------------------------------------------------------')
print('sqlite3')
fname = 'file2.db'
pc0 = pc()
conn = sqlite3.connect(fname)
cur = conn.cursor()
cur.execute('DROP TABLE IF EXISTS data_table')
cur.execute('CREATE TABLE IF NOT EXISTS data_table (idx integer PRIMARY KEY, A real, B real, C real, D real, E real)')
for k in range(n_rows):
    cur.execute('INSERT INTO data_table VALUES (?, ?, ?, ?, ?, ?)', [k, *data[k]])
conn.commit()
conn.close()
print(f'To write data to file: {pc()-pc0} sec')
print('----------------------------------------------------------------------')
print('polars')
fname = 'file3.csv'
pc0 = pc()
with open(fname, mode='w', encoding='utf8') as f:
    df_pl_header.write_csv(f)
    for k in range(n_rows):
        df_pl[k].write_csv(f, include_header=False)
print(f'To write data to file: {pc()-pc0} sec')
print('----------------------------------------------------------------------')
print('pandas, pickle')
fname = 'file4.pkl'
pc0 = pc()
df_new = df_pd.iloc[[0]]
df_new.to_pickle(fname)
for k in range(1, n_rows):
    df_new = pd.concat([df_new, df_pd.iloc[[k]]], ignore_index=True)
    df_new.to_pickle(fname)  # re-pickles the whole growing DataFrame each iteration
print(f'To write data to file: {pc()-pc0} sec')
print('----------------------------------------------------------------------')
print('polars, pickle')
fname = 'file5.pkl'
pc0 = pc()
df_new = df_pl[0]
df_ser = pkl.dumps(df_new)
with open(fname, mode='wb') as f:
f.write(df_ser)
for k in range(1, n_rows):
    df_new = df_new.vstack(df_pl[k])
    df_ser = pkl.dumps(df_new)
    with open(fname, mode='wb') as f:
        f.write(df_ser)
print(f'To write data to file: {pc()-pc0} sec')
print('----------------------------------------------------------------------')
print('pandas, fastparquet')
fname = 'file6.parquet'
pc0 = pc()
fp.write(fname, df_pd.iloc[[0]])
for k in range(1, n_rows):
    fp.write(fname, df_pd.iloc[[k]], append=True)  # each call appends a new row group
print(f'To write data to file: {pc()-pc0} sec')
print('----------------------------------------------------------------------')
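As a quick sanity check (not timed, and assuming the script above has just run so that data is still in scope), a few of the files can be read back and compared against the original array; a minimal sketch for three of the seven outputs:

import numpy as np
import pandas as pd
import sqlite3

# Read back three of the outputs and compare them to the original array.
txt_data = np.loadtxt('file0.txt', delimiter='\t', skiprows=1)
pkl_data = pd.read_pickle('file4.pkl').to_numpy()
conn = sqlite3.connect('file2.db')
db_data = np.array(conn.execute('SELECT A, B, C, D, E FROM data_table ORDER BY idx').fetchall())
conn.close()

assert np.allclose(txt_data, data)
assert np.allclose(pkl_data, data)
assert np.allclose(db_data, data)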
On my machine, running this script yields the following times.
----------------------------------------------------------------------
[no module]
To write data to file: 0.01229649999004323 sec
----------------------------------------------------------------------
csv
To write data to file: 0.011919600001419894 sec
----------------------------------------------------------------------
sqlite3
To write data to file: 0.02571699999680277 sec
----------------------------------------------------------------------
polars
To write data to file: 0.12898200000927318 sec
----------------------------------------------------------------------
pandas, pickle
To write data to file: 0.6281701999978395 sec
----------------------------------------------------------------------
polars, pickle
To write data to file: 1.4392062999977497 sec
----------------------------------------------------------------------
pandas, fastparquet
To write data to file: 33.678610299990396 sec
----------------------------------------------------------------------
So it appears that the first three approaches are much faster than the other four alternatives for this specific task and test. When I increase the size of the table (both in n_rows and in the number of columns), sqlite3 generally pulls ahead as the fastest option.
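(For reference: if per-row durability is relaxed, the sqlite3 inserts can be batched into a single executemany call. A minimal sketch, not included in the timings above; the file name is hypothetical:)

import sqlite3
import numpy as np

data = np.random.rand(10**3, 5)
conn = sqlite3.connect('file2b.db')  # hypothetical separate file name
cur = conn.cursor()
cur.execute('DROP TABLE IF EXISTS data_table')
cur.execute('CREATE TABLE data_table (idx integer PRIMARY KEY, A real, B real, C real, D real, E real)')
# executemany moves the per-row loop out of Python and into the driver
cur.executemany('INSERT INTO data_table VALUES (?, ?, ?, ?, ?, ?)',
                ([k, *row] for k, row in enumerate(data)))
conn.commit()
conn.close()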
But am I missing faster ways of implementing the four slower alternatives, or other, faster ways of appending numerical data to a tabular file besides those listed?
(I am posting this question in part because I, and others I know, have run into this problem several times, and we have not found many published speed comparisons between the numerous available methods, which are mostly discussed separately online.)