One problem with a pandas DataFrame is that it needs some data to create its structure. Hence, it can be a problem to represent the no-row case.

For example, suppose I have a function that returns a list of records represented as dictionaries, get_data() -> list[dict[str, Any]], and I want a function that returns a DataFrame of the same data:

def get_dataframe() -> pd.DataFrame:
    l = get_data()
    df = pd.DataFrame(l)
    return df

This works well except when len(l) == 0, because pandas needs at least one record to infer the number of columns and column types. Returning None in this case is not great, because the downstream code would likely need a ton of if/else statements to handle the zero-record case. Ideally, the function would return an empty DataFrame with the correct columns and column types, so that the downstream code needs no special treatment for the no-record case. But this is very tedious to do, because:

  1. In get_dataframe(), I need to specify the number of columns and column types to create an empty DataFrame, but such information is already specified somewhere else. It is tedious to specify the same things twice.
  2. Because I specify the same information twice, the two copies may become inconsistent. So I would need to add code to check consistency.
  3. Believe it or not, the DataFrame constructor does not take a list of dtypes. There are workarounds to specify a type for each column, but it is not convenient.

One idea to remove the redundancy is to represent the raw data as a list of dataclass instances instead of a list of dicts, which allows me to annotate the type of each field. I can then use the annotations to derive the column types. This is not ideal either, because type annotations are optional and the mapping from Python types to dtypes is not one-to-one.
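A minimal sketch of the dataclass idea, under stated assumptions: Record, PY_TO_DTYPE, and schema_from_dataclass are hypothetical names used only for illustration, and the type-to-dtype mapping is hand-written rather than anything pandas provides.

```python
from dataclasses import dataclass, fields

import pandas as pd


@dataclass
class Record:  # hypothetical record type, for illustration only
    name: str
    age: int
    score: float


# Assumed, hand-written mapping; Python types do not map one-to-one to dtypes.
PY_TO_DTYPE = {str: 'object', int: 'int64', float: 'float64'}


def schema_from_dataclass(cls) -> dict:
    # f.type holds the annotation as written (a class here, because
    # `from __future__ import annotations` is not in effect).
    return {f.name: PY_TO_DTYPE[f.type] for f in fields(cls)}


schema = schema_from_dataclass(Record)
# Empty frame whose columns and dtypes come from the annotations alone.
empty = pd.DataFrame(columns=list(schema)).astype(schema)
```

The lookup raises KeyError for any annotation missing from the mapping, which at least makes the "not one-to-one" gap visible instead of silently falling back to object.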

I wonder how the no-data situation is usually handled.

Asked Mar 22 at 4:57 by Tom Bennett. Edited Mar 22 at 22:03 by wjandrea.
  • What's the schema of the data (i.e. column labels and dtypes)? It is known beforehand, right? It'd help to add a minimal reproducible example including the desired output for no data as well as for some data. For specifics, see How to make good reproducible pandas examples. – wjandrea Commented Mar 22 at 22:04
  • "One problem with a pandas DataFrame is that it needs some data to create its structure." - That's incorrect. Have you read the docs for DataFrame? See also DataFrame.empty for an example. – wjandrea Commented Mar 22 at 22:06
  • "pandas needs at least one record to infer the number of columns" - Also incorrect – wjandrea Commented Mar 22 at 22:18
  • "the DataFrame constructor does not take a list of dtypes" - That is correct, and I guess that's the crux of the question. Off the top of my head, I'm not sure the best way to fix that. @konsfik's answer works of course, but it's awfully verbose. – wjandrea Commented Mar 22 at 22:21
  • @wjandrea, thanks for answering my #3. The main thing I was trying to ask is how to make the columns of a DataFrame consistent, whether there is data or not. As is pointed out, this involves explicitly specifying a schema. So, the real question is how to ensure the data and the schema are consistent. The schema is already implicitly embedded in the data generation code. Specifying the same thing in the data structure form means the same info is duplicated, which may lead to inconsistency. Maybe it's unavoidable, and we use validation to check for inconsistency. I am asking for best practice. – Tom Bennett Commented Mar 23 at 5:59

2 Answers

"the DataFrame constructor does not take [multiple] dtypes"

Instead, use .astype(), which accepts a mapping of column names to dtypes.

Setup:

>>> dtypes = {'foo': 'int64', 'bar': 'boolean'}
>>> dfe = pd.DataFrame(columns=dtypes.keys())
>>> dfe.dtypes.rename('dtype').to_frame()  # What we don't want: all object
      dtype
foo  object
bar  object

Run .astype():

>>> dft = dfe.astype(dtypes)
>>> dft.dtypes.rename('dtype').to_frame()  # What we want
       dtype
foo    int64
bar  boolean
>>> dft  # Still empty - nothing up my sleeves ;)
Empty DataFrame
Columns: [foo, bar]
Index: []

You can combine them into one line:

pd.DataFrame(columns=dtypes.keys()).astype(dtypes)
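The same mapping can also drive the non-empty case, so that both cases share one code path. The following is a sketch under assumptions rather than an established idiom: records_to_frame is a hypothetical helper, and the records are assumed to be dicts keyed by the column names in dtypes.

```python
import pandas as pd

dtypes = {'foo': 'int64', 'bar': 'boolean'}


def records_to_frame(records):
    # Passing columns= fixes the column set even when `records` is empty;
    # .astype() then applies the same dtypes in both cases.
    return pd.DataFrame(records, columns=list(dtypes)).astype(dtypes)


empty = records_to_frame([])
full = records_to_frame([{'foo': 1, 'bar': True}])
```

Because the column list and the dtypes come from the same dict, the empty and non-empty frames cannot drift apart. A record that lacks a key yields a missing value, which the cast to int64 rejects with an error.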

You can initialize a pandas dataframe like this:

df = pd.DataFrame({
    'Name': pd.Series(dtype='str'),
    'Age': pd.Series(dtype='int'),
    'Salary': pd.Series(dtype='float'),
    'Date': pd.Series(dtype='datetime64[ns]')
})

This will create an empty DataFrame with the specified dtype for each column. Is this what you were looking for?

Building on that, you can also define a schema, like so:

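import pandas as pd
from typing import Any, Dict

def get_dataframe_schema() -> Dict[str, Any]:
    # Single place where column names and dtypes are declared.
    return {
        'name': str,
        'age': int,
        'score': float
    }

def get_dataframe() -> pd.DataFrame:
    schema = get_dataframe_schema()
    l = get_data()  # get_data() is the record source from the question
    if not l:
        # No records: build the empty frame from the schema alone.
        return pd.DataFrame(columns=schema.keys()).astype(schema)
    df = pd.DataFrame(l)
    # Raises KeyError if a schema column is absent from the data.
    return df.astype(schema)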
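The comments raise validation as a way to catch drift between the data and the schema. One possible shape for such a check is sketched below; check_records is a hypothetical helper, not part of pandas, and it compares key sets only, not value types.

```python
from typing import Any, Dict, List


def check_records(records: List[Dict[str, Any]], schema: Dict[str, Any]) -> None:
    """Raise ValueError if any record's keys differ from the schema's columns."""
    expected = set(schema)
    for i, record in enumerate(records):
        missing = expected - set(record)
        extra = set(record) - expected
        if missing or extra:
            raise ValueError(
                f"record {i}: missing={sorted(missing)}, extra={sorted(extra)}"
            )


schema = {'name': str, 'age': int, 'score': float}
check_records([], schema)  # an empty list trivially passes
check_records([{'name': 'a', 'age': 1, 'score': 2.0}], schema)
```

Calling this before building the DataFrame reports a mismatch with the index of the offending record, rather than leaving it to surface later as a KeyError from .astype().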

Tags: Is there any idiomatic way to return an empty pandas DataFrame when there is no data (Stack Overflow)