Logging operation results in pandas (equivalent of Stata tidylog)
When I do an operation in Stata, for example removing duplicated rows, it tells me the number of rows removed, for instance:
. sysuse auto.dta
(1978 automobile data)
. drop if mpg<15
(8 observations deleted)
. drop if rep78==.
(4 observations deleted)
For the tidyverse, the tidylog package implements a similar feature, providing feedback on each operation (e.g. for a join, the number of joined and unjoined rows; for a filter, the number of removed rows; etc.), with the small disadvantage that you lose your editor's autocompletion, since it wraps tidyverse functions with definitions like filter(...) to accommodate the fact that the upstream tidyverse definitions could change over time.
Is there something similar for pandas?
I found pandas-log, but it seems abandoned.
Related question for R.
EDIT:
For the time being, I'm monkeypatching with:
import logging
import pandas as pd

def my_drop_duplicates(df, *args, **kwargs):
    nrow0 = df.shape[0]
    out = df.drop_duplicates(*args, **kwargs)
    # drop_duplicates returns a new frame (or None when inplace=True is passed)
    nrow1 = df.shape[0] if out is None else out.shape[0]
    logging.info(f"Dropped {nrow0 - nrow1} duplicates")
    return out

pd.DataFrame.my_drop_duplicates = my_drop_duplicates
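Usage then looks something like this (a quick sketch with made-up data; it assumes logging is configured to show INFO messages):

df = pd.DataFrame({"x": [1, 1, 2, 2, 3]})
df = df.my_drop_duplicates()   # logs "Dropped 2 duplicates"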
I guess you could also do something like
def my_drop_duplicates(df, *args, **kwargs):
    nrow0 = df.shape[0]
    out = df.internal_drop_duplicates(*args, **kwargs)
    nrow1 = df.shape[0] if out is None else out.shape[0]
    logging.info(f"Dropped {nrow0 - nrow1} duplicates")
    return out

# keep the original method under a private name, then override the public one
pd.DataFrame.internal_drop_duplicates = pd.DataFrame.drop_duplicates
pd.DataFrame.drop_duplicates = my_drop_duplicates
1 Answer
There isn't an official pandas equivalent to what Stata or tidylog in R does, unfortunately: pandas operations are usually silent unless you manually check the results. That said, your monkeypatching approach is actually pretty solid, and I've done something similar before.
Here’s a slightly cleaner version that keeps the original method and logs how many duplicates were dropped:
import pandas as pd
import logging

logging.basicConfig(level=logging.INFO)

def my_drop_duplicates(self, *args, **kwargs):
    n_before = self.shape[0]
    result = pd.DataFrame._original_drop_duplicates(self, *args, **kwargs)
    n_after = result.shape[0]
    logging.info(f"Dropped {n_before - n_after} duplicate rows")
    return result

# Save the original method
pd.DataFrame._original_drop_duplicates = pd.DataFrame.drop_duplicates

# Monkeypatch with the new one
pd.DataFrame.drop_duplicates = my_drop_duplicates
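For example, with some made-up data (the INFO: prefix comes from the basicConfig call above):

df = pd.DataFrame({"a": [1, 1, 2, 2]})
df = df.drop_duplicates()   # INFO:root:Dropped 2 duplicate rows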
This way, you're not mutating the original DataFrame (which is important since drop_duplicates() returns a copy by default), and you still get the feedback.
Alternatively, if you don't want to monkeypatch globally (which can be risky in bigger projects or notebooks), you can subclass DataFrame:
class LoggedDataFrame(pd.DataFrame):
    def drop_duplicates(self, *args, **kwargs):
        n_before = self.shape[0]
        result = super().drop_duplicates(*args, **kwargs)
        n_after = result.shape[0]
        print(f"Dropped {n_before - n_after} duplicate rows")
        return result

# Usage
df = LoggedDataFrame({"x": [1, 1, 2, 2, 3]})
df = df.drop_duplicates()
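One caveat with the subclass route: most pandas operations, drop_duplicates included, return plain DataFrame objects, so after that first call you quietly fall back to the unlogged class. If you go this way, overriding the _constructor property (the hook pandas documents for subclassing) keeps results in the subclass. A sketch combining both pieces:

import pandas as pd

class LoggedDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # pandas uses this hook when building results, so operations
        # keep returning LoggedDataFrame instead of plain DataFrame
        return LoggedDataFrame

    def drop_duplicates(self, *args, **kwargs):
        n_before = self.shape[0]
        result = super().drop_duplicates(*args, **kwargs)
        print(f"Dropped {n_before - result.shape[0]} duplicate rows")
        return result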
I also looked into pandas-log before, but yeah, it looks pretty abandoned. I haven't seen anything like tidylog that's actively maintained for pandas. Would love it if someone made a clean utility package for this!
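In case it's useful, here's a rough sketch of what such a helper could look like: wrap a few row-reducing DataFrame methods so they log the change in row count. The name patch_with_logging and the method list are purely illustrative, not an existing package:

import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)

def patch_with_logging(methods=("drop_duplicates", "dropna", "query")):
    # Illustrative helper: wrap some row-reducing DataFrame methods so each
    # call logs how many rows it removed.
    for name in methods:
        original = getattr(pd.DataFrame, name)

        def wrapper(self, *args, _original=original, _name=name, **kwargs):
            n_before = self.shape[0]
            result = _original(self, *args, **kwargs)
            if isinstance(result, pd.DataFrame):
                logging.info(f"{_name}: removed {n_before - result.shape[0]} rows")
            return result

        setattr(pd.DataFrame, name, wrapper)

patch_with_logging()

df = pd.DataFrame({"x": [1, 1, 2, None]})
df = df.dropna()            # INFO:root:dropna: removed 1 rows
df = df.drop_duplicates()   # INFO:root:drop_duplicates: removed 1 rows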
Hope that helps.