One problem with a pandas DataFrame is that it needs some data to create its structure. Hence, it can be a problem to represent the no-row case.
For example, suppose I have a function that returns a list of records represented as dictionaries: get_data() -> list[dict[str, Any]]
and I want to have a function that returns a DataFrame of the same data:
def get_dataframe() -> pd.DataFrame:
    l = get_data()
    df = pd.DataFrame(l)
    return df
This works well except when len(l) == 0, because pandas needs at least one record to infer the column names and column types. Returning None in this case is not great, because you would likely need a lot of if/else statements downstream to handle the zero-record case. Ideally, the function would return an empty DataFrame with the correct columns and column types, so that downstream code needs no special treatment for the no-record case. But this is tedious to do, because:
- In get_dataframe(), I need to specify the column names and column types to create an empty DataFrame, but that information is already specified somewhere else. It is tedious to specify the same things twice.
- Because I specify the same information twice, the two copies may become inconsistent, so I would need to add code to check consistency.
- Believe it or not, the DataFrame constructor does not take a list of dtypes. There are workarounds to specify a type for each column, but they are not convenient.
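To make the problem concrete, here is a minimal sketch of what happens with and without records. The record contents (`foo`, `bar`) are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Non-empty input: column names and dtypes are inferred from the records.
records = [{"foo": 1, "bar": True}, {"foo": 2, "bar": False}]
df_full = pd.DataFrame(records)
print(list(df_full.columns))  # ['foo', 'bar']

# Empty input: there is nothing to infer from, so the frame has no columns.
df_empty = pd.DataFrame([])
print(df_empty.shape)  # (0, 0)

# Downstream code that expects a "foo" column now fails.
try:
    df_empty["foo"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)  # True
```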
One idea to remove the redundancy is to represent the raw data as a list of dataclass instances instead of a list of dicts, which lets me annotate the type of each field. I could then use the annotations to derive the column types. This is not ideal either, because type annotations are optional and the mapping from Python types to dtypes is not one-to-one.
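The dataclass idea could look roughly like the sketch below. Everything here is hypothetical: the `Record` class, the `PY_TO_DTYPE` table, and both helper functions are illustrative names, and the chosen type-to-dtype mapping is only one of several reasonable options (which is exactly the "not one-to-one" concern).

```python
from dataclasses import asdict, dataclass, fields

import pandas as pd

# Hypothetical mapping from Python annotations to pandas dtypes.
# This is a choice, not a rule: int could just as well map to "Int64".
PY_TO_DTYPE = {int: "int64", float: "float64", bool: "bool", str: "object"}


@dataclass
class Record:  # hypothetical record type
    name: str
    age: int
    score: float


def dtypes_from_dataclass(cls) -> dict:
    """Build a column -> dtype mapping from the dataclass field annotations.

    Assumes annotations are real types, i.e. the module does not use
    `from __future__ import annotations` (which would make them strings).
    """
    return {f.name: PY_TO_DTYPE[f.type] for f in fields(cls)}


def to_dataframe(records: list, cls) -> pd.DataFrame:
    dtypes = dtypes_from_dataclass(cls)
    # Passing columns= keeps the column set even when records is empty.
    df = pd.DataFrame([asdict(r) for r in records], columns=list(dtypes))
    return df.astype(dtypes)


empty = to_dataframe([], Record)
full = to_dataframe([Record("Ann", 30, 9.5)], Record)
```

With this sketch the empty and non-empty frames share the same columns and dtypes, at the cost of maintaining the mapping table by hand.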
How is the no-data situation usually handled?
asked Mar 22 at 4:57 by Tom Bennett; edited Mar 22 at 22:03 by wjandrea

2 Answers
"the DataFrame constructor does not take [multiple] dtypes"
Instead, use .astype(), which can take a mapping of columns to dtypes.
Setup:
>>> dtypes = {'foo': 'int64', 'bar': 'boolean'}
>>> dfe = pd.DataFrame(columns=dtypes.keys())
>>> dfe.dtypes.rename('dtype').to_frame() # What we don't want: all object
      dtype
foo  object
bar  object
Run .astype():
>>> dft = dfe.astype(dtypes)
>>> dft.dtypes.rename('dtype').to_frame() # What we want
       dtype
foo    int64
bar  boolean
>>> dft # Still empty - nothing up my sleeves ;)
Empty DataFrame
Columns: [foo, bar]
Index: []
You can combine them into one line:
pd.DataFrame(columns=dtypes.keys()).astype(dtypes)
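As a sketch of how that one-liner might be used in practice, it can be wrapped in a small helper. The name `empty_frame` is made up here, and the downstream operations are only examples of code that runs unchanged on a typed zero-row frame.

```python
import pandas as pd


def empty_frame(dtypes: dict) -> pd.DataFrame:
    """Return a zero-row DataFrame whose columns have the given dtypes."""
    return pd.DataFrame(columns=list(dtypes)).astype(dtypes)


dtypes = {"foo": "int64", "bar": "boolean"}
df = empty_frame(dtypes)

# Downstream code can run unchanged on the zero-row frame.
total = df["foo"].sum()          # summing no rows gives 0
positive = df[df["foo"] > 0]     # filtering keeps the columns and dtypes
print(total)  # 0
```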
You can initialize a pandas dataframe like this:
df = pd.DataFrame({
    'Name': pd.Series(dtype='str'),
    'Age': pd.Series(dtype='int'),
    'Salary': pd.Series(dtype='float'),
    'Date': pd.Series(dtype='datetime64[ns]')
})
This will create an empty DataFrame with the specified type for each column. Is this what you were looking for?
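If the column types already live in a mapping, the same per-column construction can be written as a dict comprehension. This is only a sketch: the `dtypes` mapping below is an assumed example, and it spells the dtypes out explicitly ("object", "int64", "float64") rather than relying on how pandas resolves 'str', 'int', and 'float'.

```python
import pandas as pd

# Assumed example mapping of column name -> dtype string.
dtypes = {
    "Name": "object",
    "Age": "int64",
    "Salary": "float64",
    "Date": "datetime64[ns]",
}

# One empty, typed Series per column.
df = pd.DataFrame({col: pd.Series(dtype=dt) for col, dt in dtypes.items()})
```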
Building on that, you can also define a schema, like so:
import pandas as pd
from typing import Any, Dict

def get_dataframe_schema() -> Dict[str, Any]:
    # Single place where column names and types are defined.
    return {
        'name': str,
        'age': int,
        'score': float
    }

def get_dataframe() -> pd.DataFrame:
    schema = get_dataframe_schema()
    l = get_data()  # get_data() is the function from the question
    if not l:
        # No records: build the empty frame from the schema alone.
        return pd.DataFrame(columns=schema.keys()).astype(schema)
    df = pd.DataFrame(l)
    return df.astype(schema)
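A possible variant of the schema approach, sketched below, passes the schema's column names to the constructor in both cases so that no separate empty branch is needed. `SCHEMA` and `build_dataframe` are illustrative names, and the function takes the records as an argument only so that the example is self-contained. One caveat of this variant: keys missing from a record become NaN, which would make the cast to int fail, and keys not in the schema are dropped.

```python
import pandas as pd

# Assumed schema, matching the one used in the answer above.
SCHEMA = {"name": str, "age": int, "score": float}


def build_dataframe(records: list) -> pd.DataFrame:
    # columns= fixes the column set and order for empty and non-empty input;
    # astype() then applies the per-column types from the schema.
    return pd.DataFrame(records, columns=list(SCHEMA)).astype(SCHEMA)


empty = build_dataframe([])
full = build_dataframe([{"name": "Ann", "age": 30, "score": 9.5}])

# Both frames expose the same columns and dtypes.
same_layout = (
    list(empty.columns) == list(full.columns)
    and list(empty.dtypes) == list(full.dtypes)
)
```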
Article tags: Is there any idiomatic way to return an empty pandas DataFrame when there is no data? (Stack Overflow)
DataFrame. See also DataFrame.empty for an example. – wjandrea, commented Mar 22 at 22:06