I am trying to write a script that appends new rows of a table to an existing file of tabular data, so that not all progress is lost if an error is encountered. Here are seven different ways to do this in Python:

  • Write the data to a text file (a module-free method).
  • Write the data to a CSV file using writerows of the csv module.
  • Write the data to a database using sqlite3 (or another database library).
  • Write the data to a CSV file using polars.
  • Structure the data as a pandas DataFrame and pickle it.
  • Structure the data as a polars DataFrame and pickle it.
  • Structure the data as a pandas DataFrame and save it as a Parquet file using fastparquet.

The following code measures the runtime of each of these options, excluding the time required to build the data, header, and any single-row DataFrames. (I have tried to make this code as concise as possible while keeping it readable.)

import numpy as np
from time import perf_counter as pc
import csv
import sqlite3
import pandas as pd
import polars as pl
import pickle as pkl
import fastparquet as fp

header = list('ABCDE')
n_rows = 10**3
data = np.random.rand(n_rows, 5)
df_pd = pd.DataFrame(data, columns=header)
df_pl_header = pl.DataFrame(schema=header)
df_pl = pl.DataFrame(data, schema=header)

print('----------------------------------------------------------------------')
print('[no module]')
fname = 'file0.txt'

pc0 = pc()
with open(fname, mode='w') as f:
    f.write('\t'.join(header) + '\n')
    f.flush()
    for k in range(n_rows):
        f.write('\t'.join(map(str, data[k])) + '\n')
        f.flush()
print(f'To write data to file:  {pc()-pc0} sec')

print('----------------------------------------------------------------------')
print('csv')
fname = 'file1.csv'

pc0 = pc()
with open(fname, mode='w', newline='') as f:
    csvwriter = csv.writer(f)
    csvwriter.writerow(header)
    f.flush()
    for row in data:
        csvwriter.writerow(row)
        f.flush()
print(f'To write data to file:  {pc()-pc0} sec')

print('----------------------------------------------------------------------')
print('sqlite3')
fname = 'file2.db'

pc0 = pc()
conn = sqlite3.connect(fname)
cur = conn.cursor()
cur.execute('DROP TABLE IF EXISTS data_table')
cur.execute('CREATE TABLE IF NOT EXISTS data_table (idx integer PRIMARY KEY, A real, B real, C real, D real, E real)')
for k in range(n_rows):
    cur.execute('INSERT INTO data_table VALUES (?, ?, ?, ?, ?, ?)', [k, *data[k]])
conn.commit()
conn.close()
print(f'To write data to file:  {pc()-pc0} sec')

print('----------------------------------------------------------------------')
print('polars')
fname = 'file3.csv'

pc0 = pc()
with open(fname, mode='w', encoding='utf8') as f:
    df_pl_header.write_csv(f)
    for k in range(n_rows):
        df_pl[k].write_csv(f, include_header=False)
print(f'To write data to file:  {pc()-pc0} sec')

print('----------------------------------------------------------------------')
print('pandas, pickle')
fname = 'file4.pkl'

pc0 = pc()
df_new = df_pd.iloc[[0]]
df_new.to_pickle(fname)
for k in range(1, n_rows):
    df_new = pd.concat([df_new, df_pd.iloc[[k]]], ignore_index=True)
    df_new.to_pickle(fname)
print(f'To write data to file:  {pc()-pc0} sec')

print('----------------------------------------------------------------------')
print('polars, pickle')
fname = 'file5.pkl'

pc0 = pc()
df_new = df_pl[0]
df_ser = pkl.dumps(df_new)
with open(fname, mode='wb') as f:
    f.write(df_ser)
for k in range(1, n_rows):
    df_new = df_new.vstack(df_pl[k])
    df_ser = pkl.dumps(df_new)
    with open(fname, mode='wb') as f:
        f.write(df_ser)
print(f'To write data to file:  {pc()-pc0} sec')

print('----------------------------------------------------------------------')
print('pandas, fastparquet')
fname = 'file6.parquet'

pc0 = pc()
fp.write(fname, df_pd.iloc[[0]])
for k in range(1, n_rows):
    fp.write(fname, df_pd.iloc[[k]], append=True)
print(f'To write data to file:  {pc()-pc0} sec')
print('----------------------------------------------------------------------')

On my machine, running this script yields the following times.

----------------------------------------------------------------------
[no module]
To write data to file:  0.01229649999004323 sec
----------------------------------------------------------------------
csv
To write data to file:  0.011919600001419894 sec
----------------------------------------------------------------------
sqlite3
To write data to file:  0.02571699999680277 sec
----------------------------------------------------------------------
polars
To write data to file:  0.12898200000927318 sec
----------------------------------------------------------------------
pandas, pickle
To write data to file:  0.6281701999978395 sec
----------------------------------------------------------------------
polars, pickle
To write data to file:  1.4392062999977497 sec
----------------------------------------------------------------------
pandas, fastparquet
To write data to file:  33.678610299990396 sec
----------------------------------------------------------------------

So it appears that the first three approaches are much faster than the other four alternatives for this specific task and test. When I increase the size of the table (both in n_rows and number of columns), sqlite3 generally pulls ahead as the fastest option.
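(One variant I have considered but not benchmarked as carefully: batching the sqlite3 inserts with executemany and committing once per chunk, so progress is still durable every few rows rather than after every single row. A minimal sketch, using the same schema as above but with an in-memory database and a chunk size of 100, both of which are arbitrary choices for illustration:)

```python
import sqlite3
import numpy as np

n_rows, chunk = 10**3, 100
data = np.random.rand(n_rows, 5)

conn = sqlite3.connect(':memory:')  # use a filename instead to persist to disk
cur = conn.cursor()
cur.execute('CREATE TABLE data_table (idx INTEGER PRIMARY KEY, '
            'A REAL, B REAL, C REAL, D REAL, E REAL)')
for start in range(0, n_rows, chunk):
    # one executemany call per chunk instead of one execute per row
    rows = ([k, *map(float, data[k])]
            for k in range(start, min(start + chunk, n_rows)))
    cur.executemany('INSERT INTO data_table VALUES (?, ?, ?, ?, ?, ?)', rows)
    conn.commit()  # progress is durable after every chunk

count = cur.execute('SELECT COUNT(*) FROM data_table').fetchone()[0]
conn.close()
```

This trades a little durability granularity (up to one chunk of rows can be lost on a crash) for fewer commits, which is usually where the per-row sqlite3 time goes.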

But am I missing faster implementations of the four slower alternatives, or other, faster ways of appending numerical data to a tabular file besides those listed?

(I am posting this question in part because I and others I know have run into this problem several times, and we have not found many published speed comparisons between the numerous available methods, which are mostly discussed separately online.)
