admin管理员组文章数量:1345030
I have a dataframe with location, the date a person is departing and the date their replacement is arriving.
x = pd.DataFrame({
'Location': ['Paris', 'Dallas', 'Tokyo', 'London'],
'Departing': ['MAY2025', 'JUN2025', 'APR2025', 'JUN2025'],
'Arriving': ['MAR2025', 'JUN2025', 'JUN2025', 'APR2025'],
})
Location Departing Arriving
0 Paris MAY2025 MAR2025
1 Dallas JUN2025 JUN2025
2 Tokyo APR2025 JUN2025
3 London JUN2025 APR2025
I want to create a column for each of the next 3 months that tallies how many people will be at each location each month. For each month column up to the Departing date, a 1 should be added to the column value. And, for each month starting from the Arriving date onward, another 1 is added. So, if I was creating the table for April 2025, it should look like this:
Location Departing Arriving APR2025 MAY2025 JUN2025
0 Paris MAY2025 MAR2025 2 1 1
1 Dallas JUN2025 JUN2025 1 1 1
2 Tokyo APR2025 JUN2025 0 0 1
3 London JUN2025 APR2025 2 2 1
For the first row, someone is leaving Paris May 2025. Each month up to May gets a 1, so a 1 is added to APR2025. The next person coming to Paris arrived in March 2025, so March and every month after it gets a 1 added to the column value. For the Dallas row, someone is leaving in June, so all months up to June (April and May) get a 1. The next person coming to Dallas arrives in June, so June gets a 1.
I need to reuse the code throughout the year, so the date months will change each month. I created the columns by setting a variable for the start month, using pd.datetime to get the next 3 months and then adding them to the dataframe as new columns. But, I'm not sure how to go about filling the columns now.
I have a dataframe with location, the date a person is departing and the date their replacement is arriving.
x = pd.DataFrame({
'Location': ['Paris', 'Dallas', 'Tokyo', 'London'],
'Departing': ['MAY2025', 'JUN2025', 'APR2025', 'JUN2025'],
'Arriving': ['MAR2025', 'JUN2025', 'JUN2025', 'APR2025'],
})
Location Departing Arriving
0 Paris MAY2025 MAR2025
1 Dallas JUN2025 JUN2025
2 Tokyo APR2025 JUN2025
3 London JUN2025 APR2025
I want to create a column for each of the next 3 months that tallies how many people will be at each location each month. For each month column up to the Departing date, a 1 should be added to the column value. And, for each month starting from the Arriving date onward, another 1 is added. So, if I was creating the table for April 2025, it should look like this:
Location Departing Arriving APR2025 MAY2025 JUN2025
0 Paris MAY2025 MAR2025 2 1 1
1 Dallas JUN2025 JUN2025 1 1 1
2 Tokyo APR2025 JUN2025 0 0 1
3 London JUN2025 APR2025 2 2 1
For the first row, someone is leaving Paris May 2025. Each month up to May gets a 1, so a 1 is added to APR2025. The next person coming to Paris arrived in March 2025, so March and every month after it gets a 1 added to the column value. For the Dallas row, someone is leaving in June, so all months up to June (April and May) get a 1. The next person coming to Dallas arrives in June, so June gets a 1.
I need to reuse the code throughout the year, so the date months will change each month. I created the columns by setting a variable for the start month, using pd.datetime to get the next 3 months and then adding them to the dataframe as new columns. But, I'm not sure how to go about filling the columns now.
Share Improve this question edited yesterday be84 asked yesterday be84be84 512 bronze badges 1- 1 Which dates are the correct ones? All 3 instances of the data in your question show different dates, not sure which one you want me to use. Tokyo for example is May25 departure and sep26 arrival when you create the dataframe, aug25 departure and sep26 arrival in example dataframe and april25 depart and jun25 arrive in your desired output. Not sure which is correct? – Emi OB Commented yesterday
2 Answers
Reset to default 2Create helper DataFrame df1
for N
months starting by base
, use cross join with DataFrame.merge
for all combinations, so possible sum less and greater or equal values. Last use DataFrame.pivot
for new months columns:
df['Departing_dt'] = pd.to_datetime(df['Departing'], format='%b%Y')
df['Arriving_dt'] = pd.to_datetime(df['Arriving'], format='%b%Y')
N = 3
base = pd.to_datetime("APR2025", format="%b%Y")
df1 = pd.DataFrame({'MonthDate':
[base + pd.DateOffset(months=x) for x in range(N)]})
df2 = pd.merge(df.reset_index(), df1, how='cross')
df2['sum'] = (df2['MonthDate'].lt(df2['Departing_dt']).astype(int) +
df2['MonthDate'].ge(df2['Arriving_dt']).astype(int))
out = (df2.pivot(index=['index','Location','Departing','Arriving'],
columns='MonthDate',
values='sum')
.rename(columns=lambda x: x.strftime("%b%Y").upper())
.reset_index([1,2,3])
.rename_axis(index=None, columns=None))
print(out)
Location Departing Arriving APR2025 MAY2025 JUN2025
0 Paris MAY2025 MAR2025 2 1 1
1 Dallas JUN2025 JUN2025 1 1 1
2 Tokyo APR2025 JUN2025 0 0 1
3 London JUN2025 APR2025 2 2 1
Similar solution with using months periods:
df['Departing_dt'] = (pd.to_datetime(df['Departing'], format='%b%Y')
.dt.to_period('M'))
df['Arriving_dt'] = (pd.to_datetime(df['Arriving'], format='%b%Y')
.dt.to_period('M'))
N = 3
base = pd.to_datetime("APR2025", format="%b%Y")
month_periods = pd.period_range(start=base.to_period('M'), periods=N, freq='M')
df1 = pd.DataFrame({'MonthDate': month_periods})
df2 = pd.merge(df.reset_index(), df1, how='cross')
df2['sum'] = (df2['MonthDate'].lt(df2['Departing_dt']).astype(int) +
df2['MonthDate'].ge(df2['Arriving_dt']).astype(int))
out = (df2.pivot(index=['index','Location','Departing','Arriving'],
columns='MonthDate',
values='sum')
.rename(columns=lambda x: x.strftime("%b%Y").upper())
.reset_index([1,2,3])
.rename_axis(index=None, columns=None))
print(out)
Location Departing Arriving APR2025 MAY2025 JUN2025
0 Paris MAY2025 MAR2025 2 1 1
1 Dallas JUN2025 JUN2025 1 1 1
2 Tokyo APR2025 JUN2025 0 0 1
3 London JUN2025 APR2025 2 2 1
Numpy alternative work similar, instead cross join is used broadcasting for sum matched datetimes:
Departing = pd.to_datetime(df['Departing'], format='%b%Y')
Arriving = pd.to_datetime(df['Arriving'], format='%b%Y')
N = 3
base = pd.to_datetime("APR2025", format="%b%Y")
month_dates = [base + pd.DateOffset(months=i) for i in range(N)]
month_names = [d.strftime("%b%Y").upper() for d in month_dates]
departing_np = Departing.to_numpy()[:, None]
arriving_np = Arriving.to_numpy()[:, None]
month_dates_np = np.array(month_dates, dtype='datetime64[ns]')[None, :]
arr = ((month_dates_np < departing_np).astype(int) +
(month_dates_np >= arriving_np).astype(int))
df = df.join(pd.DataFrame(arr, index=df.index, columns=month_names))
print(df)
Location Departing Arriving APR2025 MAY2025 JUN2025
0 Paris MAY2025 MAR2025 2 1 1
1 Dallas JUN2025 JUN2025 1 1 1
2 Tokyo APR2025 JUN2025 0 0 1
3 London JUN2025 APR2025 2 2 1
Performance comparison:
import pandas as pd
#pip install perfplot
import perfplot
def jez1(df):
df['Departing_dt'] = pd.to_datetime(df['Departing'], format='%b%Y')
df['Arriving_dt'] = pd.to_datetime(df['Arriving'], format='%b%Y')
N = 3
base = pd.to_datetime("APR2025", format="%b%Y")
df1 = pd.DataFrame({'MonthDate':
[base + pd.DateOffset(months=x) for x in range(N)]})
df2 = pd.merge(df.reset_index(), df1, how='cross')
df2['sum'] = (df2['MonthDate'].lt(df2['Departing_dt']).astype(int) +
df2['MonthDate'].ge(df2['Arriving_dt']).astype(int))
out = (df2.pivot(index=['index','Location','Departing','Arriving'],
columns='MonthDate',
values='sum')
.rename(columns=lambda x: x.strftime("%b%Y").upper())
.reset_index([1,2,3])
.rename_axis(index=None, columns=None))
return out
def jez2(df):
df['Departing_dt'] = (pd.to_datetime(df['Departing'], format='%b%Y')
.dt.to_period('M'))
df['Arriving_dt'] = (pd.to_datetime(df['Arriving'], format='%b%Y')
.dt.to_period('M'))
N = 3
base = pd.to_datetime("APR2025", format="%b%Y")
month_periods = pd.period_range(start=base.to_period('M'), periods=N, freq='M')
df1 = pd.DataFrame({'MonthDate': month_periods})
df2 = pd.merge(df.reset_index(), df1, how='cross')
df2['sum'] = (df2['MonthDate'].lt(df2['Departing_dt']).astype(int) +
df2['MonthDate'].ge(df2['Arriving_dt']).astype(int))
out = (df2.pivot(index=['index','Location','Departing','Arriving'],
columns='MonthDate',
values='sum')
.rename(columns=lambda x: x.strftime("%b%Y").upper())
.reset_index([1,2,3])
.rename_axis(index=None, columns=None))
return out
def jez3(df):
Departing = pd.to_datetime(df['Departing'], format='%b%Y')
Arriving = pd.to_datetime(df['Arriving'], format='%b%Y')
N = 3
base = pd.to_datetime("APR2025", format="%b%Y")
month_dates = [base + pd.DateOffset(months=i) for i in range(N)]
month_names = [d.strftime("%b%Y").upper() for d in month_dates]
departing_np = Departing.to_numpy()[:, None]
arriving_np = Arriving.to_numpy()[:, None]
month_dates_np = np.array(month_dates, dtype='datetime64[ns]')[None, :]
arr = ((month_dates_np < departing_np).astype(int) +
(month_dates_np >= arriving_np).astype(int))
out = df.join(pd.DataFrame(arr, index=df.index, columns=month_names))
return out
def emi(x):
next_3_months = [(pd.to_datetime('today') + pd.DateOffset(months=month)).strftime("%b%Y") for month in [0, 1, 2]]
add_dates = [pd.date_range(y, z, freq='ME').strftime("%b%Y") for y, z in zip( x['Arriving'], x['Departing'])]
add_person = pd.DataFrame({'Location' : x['Location'], '+ dates':add_dates}).explode('+ dates').value_counts().reset_index()
add_person = add_person.loc[add_person['+ dates'].isin(next_3_months)].copy()
sub_dates = [pd.date_range(y, z, freq='ME').strftime("%b%Y") for y, z in zip( x['Departing'], x['Arriving'])]
sub_person = pd.DataFrame({'Location' : x['Location'], '- dates':sub_dates}).explode('- dates').value_counts().reset_index()
sub_person = sub_person.loc[sub_person['- dates'].isin(next_3_months)].copy()
counts = add_person.merge(sub_person, left_on=['Location', '+ dates'], right_on=['Location', '- dates'], suffixes=[' add', ' sub'], how='outer')
counts['date'] = counts['+ dates'].fillna(counts['- dates'])
counts['change from 1'] = counts['count add'].fillna(0) - counts['count sub'].fillna(0)
pivot_counts = counts.pivot(index='Location', columns='date', values='change from 1').reset_index()
x = x.merge(pivot_counts, on='Location', how='left').fillna(0)
x[[month for month in next_3_months if month not in x.columns]] = 0
x[next_3_months] = (x[next_3_months] + 1).astype(int)
return x
def make_df(n):
np.random.seed(523)
Lo = ['Paris', 'Dallas', 'Tokyo', 'London']
L = ['MAY2025', 'JUN2025', 'SEP2025', 'MAR2025', 'JUL2026']
df = pd.DataFrame({
'Location': np.random.choice(Lo, size=n),
'Departing': np.random.choice(L, size=n),
'Arriving': np.random.choice(L, size=n)})
return df
perfplot.show(
setup=make_df,
kernels=[jez1, jez2, jez3, emi],
n_range=[2**k for k in range(2, 20)],
logx=True,
logy=True,
equality_check=False,
xlabel='len(df)')
Firstly I would get a list of the next 3 months (including current)
next_3_months = [(pd.to_datetime('today') + pd.DateOffset(months=month)).strftime("%b%Y") for month in [0, 1, 2]]
['Apr2025', 'May2025', 'Jun2025']
Then use pd.date_range() to get all the months between an arrival and departure - i.e. all the months were there will be an additional person. Create a temporary dataframe with this as a column alongside the location column. You can then use df.explode() to get one row for each location and each month they have an additional person. Then filter these months to only those in the next 3.
add_dates = [pd.date_range(y, z, freq='m').strftime("%b%Y") for y, z in zip( x['Arriving'], x['Departing'])]
add_person = pd.DataFrame({'Location' : x['Location'], '+ dates':add_dates}).explode('+ dates').value_counts().reset_index()
add_person = add_person.loc[add_person['+ dates'].isin(next_3_months)].copy()
Location + dates count
0 London Apr2025 1
1 London May2025 1
2 Paris Apr2025 1
Repeat the same for all dates between departure and arrival - i.e. all the months where there will be one less person
sub_dates = [pd.date_range(y, z, freq='m').strftime("%b%Y") for y, z in zip( x['Departing'], x['Arriving'])]
sub_person = pd.DataFrame({'Location' : x['Location'], '- dates':sub_dates}).explode('- dates').value_counts().reset_index()
sub_person = sub_person.loc[sub_person['- dates'].isin(next_3_months)].copy()
Then outer merge these two together, and work out what the difference would be for each location per month. Pivot this to have the months as columns.
counts = add_person.merge(sub_person, left_on=['Location', '+ dates'], right_on=['Location', '- dates'], suffixes=[' add', ' sub'], how='outer')
counts['date'] = counts['+ dates'].fillna(counts['- dates'])
counts['change from 1'] = counts['count add'].fillna(0) - counts['count sub'].fillna(0)
pivot_counts = counts.pivot(index='Location', columns='date', values='change from 1').reset_index()
date Location APR2025 MAY2025
0 London 1.0 1.0
1 Paris 1.0 NaN
2 Tokyo -1.0 -1.0
Merge this back onto the original dataframe, if a month is missing, add it in, fill nan/missing values with 0 and add 1 to everything (1 being the starting number of people, which is then changed by arrivals/departures).
x = x.merge(pivot_counts, on='Location', how='left').fillna(0)
x[[month for month in next_3_months if month not in x.columns]] = 0
x[next_3_months] = (x[next_3_months] + 1).astype(int)
Location Departing Arriving Apr2025 May2025 Jun2025
0 Paris MAY2025 MAR2025 2 1 1
1 Dallas JUN2025 JUN2025 1 1 1
2 Tokyo APR2025 JUN2025 0 0 1
3 London JUN2025 APR2025 2 2 1
本文标签: pythonCalculate column values by comparing datetimeStack Overflow
版权声明:本文标题:python - Calculate column values by comparing datetime - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1743780016a2537657.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
发表评论