admin管理员组

文章数量:1345030

I have a dataframe with location, the date a person is departing and the date their replacement is arriving.

x = pd.DataFrame({
    'Location': ['Paris', 'Dallas', 'Tokyo', 'London'],
    'Departing': ['MAY2025', 'JUN2025', 'APR2025', 'JUN2025'],
    'Arriving': ['MAR2025', 'JUN2025', 'JUN2025', 'APR2025'],
})

    Location  Departing  Arriving
0    Paris     MAY2025    MAR2025
1    Dallas    JUN2025    JUN2025
2    Tokyo     APR2025    JUN2025
3    London    JUN2025    APR2025

I want to create a column for each of the next 3 months that tallies how many people will be at each location each month. For each month column up to the Departing date, a 1 should be added to the column value. And, for each month starting from the Arriving date onward, another 1 is added. So, if I was creating the table for April 2025, it should look like this:

    Location  Departing  Arriving  APR2025  MAY2025  JUN2025
0    Paris     MAY2025    MAR2025     2        1        1
1    Dallas    JUN2025    JUN2025     1        1        1
2    Tokyo     APR2025    JUN2025     0        0        1
3    London    JUN2025    APR2025     2        2        1

For the first row, someone is leaving Paris May 2025. Each month up to May gets a 1, so a 1 is added to APR2025. The next person coming to Paris arrived in March 2025, so March and every month after it gets a 1 added to the column value. For the Dallas row, someone is leaving in June, so all months up to June (April and May) get a 1. The next person coming to Dallas arrives in June, so June gets a 1.

I need to reuse the code throughout the year, so the date months will change each month. I created the columns by setting a variable for the start month, using pd.datetime to get the next 3 months and then adding them to the dataframe as new columns. But, I'm not sure how to go about filling the columns now.

I have a dataframe with location, the date a person is departing and the date their replacement is arriving.

x = pd.DataFrame({
    'Location': ['Paris', 'Dallas', 'Tokyo', 'London'],
    'Departing': ['MAY2025', 'JUN2025', 'APR2025', 'JUN2025'],
    'Arriving': ['MAR2025', 'JUN2025', 'JUN2025', 'APR2025'],
})

    Location  Departing  Arriving
0    Paris     MAY2025    MAR2025
1    Dallas    JUN2025    JUN2025
2    Tokyo     APR2025    JUN2025
3    London    JUN2025    APR2025

I want to create a column for each of the next 3 months that tallies how many people will be at each location each month. For each month column up to the Departing date, a 1 should be added to the column value. And, for each month starting from the Arriving date onward, another 1 is added. So, if I was creating the table for April 2025, it should look like this:

    Location  Departing  Arriving  APR2025  MAY2025  JUN2025
0    Paris     MAY2025    MAR2025     2        1        1
1    Dallas    JUN2025    JUN2025     1        1        1
2    Tokyo     APR2025    JUN2025     0        0        1
3    London    JUN2025    APR2025     2        2        1

For the first row, someone is leaving Paris May 2025. Each month up to May gets a 1, so a 1 is added to APR2025. The next person coming to Paris arrived in March 2025, so March and every month after it gets a 1 added to the column value. For the Dallas row, someone is leaving in June, so all months up to June (April and May) get a 1. The next person coming to Dallas arrives in June, so June gets a 1.

I need to reuse the code throughout the year, so the date months will change each month. I created the columns by setting a variable for the start month, using pd.datetime to get the next 3 months and then adding them to the dataframe as new columns. But, I'm not sure how to go about filling the columns now.

Share Improve this question edited yesterday be84 asked yesterday be84be84 512 bronze badges 1
  • 1 Which dates are the correct ones? All 3 instances of the data in your question show different dates, not sure which one you want me to use. Tokyo for example is May25 departure and sep26 arrival when you create the dataframe, aug25 departure and sep26 arrival in example dataframe and april25 depart and jun25 arrive in your desired output. Not sure which is correct? – Emi OB Commented yesterday
Add a comment  | 

2 Answers 2

Reset to default 2

Create helper DataFrame df1 for N months starting by base, use cross join with DataFrame.merge for all combinations, so possible sum less and greater or equal values. Last use DataFrame.pivot for new months columns:

df['Departing_dt'] = pd.to_datetime(df['Departing'], format='%b%Y')
df['Arriving_dt']  = pd.to_datetime(df['Arriving'], format='%b%Y')

N = 3
base = pd.to_datetime("APR2025", format="%b%Y")

df1 = pd.DataFrame({'MonthDate':
                   [base + pd.DateOffset(months=x) for x in range(N)]})

df2 = pd.merge(df.reset_index(), df1, how='cross')

df2['sum'] = (df2['MonthDate'].lt(df2['Departing_dt']).astype(int) +
              df2['MonthDate'].ge(df2['Arriving_dt']).astype(int))

out = (df2.pivot(index=['index','Location','Departing','Arriving'], 
                 columns='MonthDate', 
                 values='sum')
         .rename(columns=lambda x: x.strftime("%b%Y").upper())
         .reset_index([1,2,3])
         .rename_axis(index=None, columns=None))

print(out)
  Location Departing Arriving  APR2025  MAY2025  JUN2025
0    Paris   MAY2025  MAR2025        2        1        1
1   Dallas   JUN2025  JUN2025        1        1        1
2    Tokyo   APR2025  JUN2025        0        0        1
3   London   JUN2025  APR2025        2        2        1

Similar solution with using months periods:

df['Departing_dt'] = (pd.to_datetime(df['Departing'], format='%b%Y')
                        .dt.to_period('M'))
df['Arriving_dt']  = (pd.to_datetime(df['Arriving'], format='%b%Y')
                        .dt.to_period('M'))

N = 3
base = pd.to_datetime("APR2025", format="%b%Y")
month_periods = pd.period_range(start=base.to_period('M'), periods=N, freq='M')

df1 = pd.DataFrame({'MonthDate': month_periods})

df2 = pd.merge(df.reset_index(), df1, how='cross')

df2['sum'] = (df2['MonthDate'].lt(df2['Departing_dt']).astype(int) +
              df2['MonthDate'].ge(df2['Arriving_dt']).astype(int))


out = (df2.pivot(index=['index','Location','Departing','Arriving'], 
                 columns='MonthDate', 
                 values='sum')
         .rename(columns=lambda x: x.strftime("%b%Y").upper())
         .reset_index([1,2,3])
         .rename_axis(index=None, columns=None))

print(out)
  Location Departing Arriving  APR2025  MAY2025  JUN2025
0    Paris   MAY2025  MAR2025        2        1        1
1   Dallas   JUN2025  JUN2025        1        1        1
2    Tokyo   APR2025  JUN2025        0        0        1
3   London   JUN2025  APR2025        2        2        1

Numpy alternative work similar, instead cross join is used broadcasting for sum matched datetimes:

Departing = pd.to_datetime(df['Departing'], format='%b%Y')
Arriving = pd.to_datetime(df['Arriving'], format='%b%Y')

N = 3 
base = pd.to_datetime("APR2025", format="%b%Y")
month_dates = [base + pd.DateOffset(months=i) for i in range(N)]
month_names = [d.strftime("%b%Y").upper() for d in month_dates]
    
departing_np = Departing.to_numpy()[:, None]
arriving_np = Arriving.to_numpy()[:, None]
month_dates_np = np.array(month_dates, dtype='datetime64[ns]')[None, :]

arr = ((month_dates_np < departing_np).astype(int) +
       (month_dates_np >= arriving_np).astype(int))

df = df.join(pd.DataFrame(arr, index=df.index, columns=month_names))
print(df)
  Location Departing Arriving  APR2025  MAY2025  JUN2025
0    Paris   MAY2025  MAR2025        2        1        1
1   Dallas   JUN2025  JUN2025        1        1        1
2    Tokyo   APR2025  JUN2025        0        0        1
3   London   JUN2025  APR2025        2        2        1

Performance comparison:

import pandas as pd
#pip install perfplot
import perfplot


def jez1(df):
    df['Departing_dt'] = pd.to_datetime(df['Departing'], format='%b%Y')
    df['Arriving_dt']  = pd.to_datetime(df['Arriving'], format='%b%Y')
    
    N = 3
    base = pd.to_datetime("APR2025", format="%b%Y")
    
    df1 = pd.DataFrame({'MonthDate':
                       [base + pd.DateOffset(months=x) for x in range(N)]})
    
    df2 = pd.merge(df.reset_index(), df1, how='cross')
    
    df2['sum'] = (df2['MonthDate'].lt(df2['Departing_dt']).astype(int) +
                  df2['MonthDate'].ge(df2['Arriving_dt']).astype(int))
    
    out = (df2.pivot(index=['index','Location','Departing','Arriving'], 
                     columns='MonthDate', 
                     values='sum')
             .rename(columns=lambda x: x.strftime("%b%Y").upper())
             .reset_index([1,2,3])
             .rename_axis(index=None, columns=None))
    return out

def jez2(df):
    df['Departing_dt'] = (pd.to_datetime(df['Departing'], format='%b%Y')
                            .dt.to_period('M'))
    df['Arriving_dt']  = (pd.to_datetime(df['Arriving'], format='%b%Y')
                            .dt.to_period('M'))
    
    N = 3
    base = pd.to_datetime("APR2025", format="%b%Y")
    month_periods = pd.period_range(start=base.to_period('M'), periods=N, freq='M')
    
    df1 = pd.DataFrame({'MonthDate': month_periods})
    
    df2 = pd.merge(df.reset_index(), df1, how='cross')
    
    df2['sum'] = (df2['MonthDate'].lt(df2['Departing_dt']).astype(int) +
                  df2['MonthDate'].ge(df2['Arriving_dt']).astype(int))
    
    
    out = (df2.pivot(index=['index','Location','Departing','Arriving'], 
                     columns='MonthDate', 
                     values='sum')
             .rename(columns=lambda x: x.strftime("%b%Y").upper())
             .reset_index([1,2,3])
             .rename_axis(index=None, columns=None))

    return out

def jez3(df):
    Departing = pd.to_datetime(df['Departing'], format='%b%Y')
    Arriving = pd.to_datetime(df['Arriving'], format='%b%Y')
    
    N = 3 
    base = pd.to_datetime("APR2025", format="%b%Y")
    month_dates = [base + pd.DateOffset(months=i) for i in range(N)]
    month_names = [d.strftime("%b%Y").upper() for d in month_dates]
        
    departing_np = Departing.to_numpy()[:, None]
    arriving_np = Arriving.to_numpy()[:, None]
    month_dates_np = np.array(month_dates, dtype='datetime64[ns]')[None, :]
    
    arr = ((month_dates_np < departing_np).astype(int) +
           (month_dates_np >= arriving_np).astype(int))
    
    out = df.join(pd.DataFrame(arr, index=df.index, columns=month_names))

    return out

def emi(x):
    next_3_months = [(pd.to_datetime('today') + pd.DateOffset(months=month)).strftime("%b%Y") for month in [0, 1, 2]]
    add_dates = [pd.date_range(y, z, freq='ME').strftime("%b%Y") for y, z in zip( x['Arriving'], x['Departing'])]
    add_person = pd.DataFrame({'Location' : x['Location'], '+ dates':add_dates}).explode('+ dates').value_counts().reset_index()
    add_person = add_person.loc[add_person['+ dates'].isin(next_3_months)].copy()
    
    sub_dates = [pd.date_range(y, z, freq='ME').strftime("%b%Y") for y, z in zip( x['Departing'], x['Arriving'])]
    sub_person = pd.DataFrame({'Location' : x['Location'], '- dates':sub_dates}).explode('- dates').value_counts().reset_index()
    sub_person = sub_person.loc[sub_person['- dates'].isin(next_3_months)].copy()
    
    counts = add_person.merge(sub_person, left_on=['Location', '+ dates'], right_on=['Location', '- dates'], suffixes=[' add', ' sub'], how='outer')
    counts['date'] = counts['+ dates'].fillna(counts['- dates'])
    counts['change from 1'] = counts['count add'].fillna(0) - counts['count sub'].fillna(0)
    pivot_counts = counts.pivot(index='Location', columns='date', values='change from 1').reset_index()
    
    x = x.merge(pivot_counts, on='Location', how='left').fillna(0)
    x[[month for month in next_3_months if month not in x.columns]] = 0
    x[next_3_months] = (x[next_3_months] + 1).astype(int)
    
    return x


def make_df(n):
    np.random.seed(523)

    Lo = ['Paris', 'Dallas', 'Tokyo', 'London']
    L = ['MAY2025', 'JUN2025', 'SEP2025', 'MAR2025', 'JUL2026']


    df = pd.DataFrame({
        'Location': np.random.choice(Lo, size=n),
        'Departing': np.random.choice(L, size=n),
        'Arriving': np.random.choice(L, size=n)})

    return df


perfplot.show(
    setup=make_df,
    kernels=[jez1, jez2, jez3, emi],
    n_range=[2**k for k in range(2, 20)],
    logx=True,
    logy=True,
    equality_check=False,  
    xlabel='len(df)')

Firstly I would get a list of the next 3 months (including current)

next_3_months = [(pd.to_datetime('today') + pd.DateOffset(months=month)).strftime("%b%Y") for month in [0, 1, 2]]
['Apr2025', 'May2025', 'Jun2025']

Then use pd.date_range() to get all the months between an arrival and departure - i.e. all the months were there will be an additional person. Create a temporary dataframe with this as a column alongside the location column. You can then use df.explode() to get one row for each location and each month they have an additional person. Then filter these months to only those in the next 3.

add_dates = [pd.date_range(y, z, freq='m').strftime("%b%Y") for y, z in zip( x['Arriving'], x['Departing'])]
add_person = pd.DataFrame({'Location' : x['Location'], '+ dates':add_dates}).explode('+ dates').value_counts().reset_index()
add_person = add_person.loc[add_person['+ dates'].isin(next_3_months)].copy()

  Location  + dates  count
0   London  Apr2025      1
1   London  May2025      1
2    Paris  Apr2025      1

Repeat the same for all dates between departure and arrival - i.e. all the months where there will be one less person

sub_dates = [pd.date_range(y, z, freq='m').strftime("%b%Y") for y, z in zip( x['Departing'], x['Arriving'])]
sub_person = pd.DataFrame({'Location' : x['Location'], '- dates':sub_dates}).explode('- dates').value_counts().reset_index()
sub_person = sub_person.loc[sub_person['- dates'].isin(next_3_months)].copy()

Then outer merge these two together, and work out what the difference would be for each location per month. Pivot this to have the months as columns.

counts = add_person.merge(sub_person, left_on=['Location', '+ dates'], right_on=['Location', '- dates'], suffixes=[' add', ' sub'], how='outer')
counts['date'] = counts['+ dates'].fillna(counts['- dates'])
counts['change from 1'] = counts['count add'].fillna(0) - counts['count sub'].fillna(0)
pivot_counts = counts.pivot(index='Location', columns='date', values='change from 1').reset_index()

date Location  APR2025  MAY2025
0      London      1.0      1.0
1       Paris      1.0      NaN
2       Tokyo     -1.0     -1.0

Merge this back onto the original dataframe, if a month is missing, add it in, fill nan/missing values with 0 and add 1 to everything (1 being the starting number of people, which is then changed by arrivals/departures).

x = x.merge(pivot_counts, on='Location', how='left').fillna(0)
x[[month for month in next_3_months if month not in x.columns]] = 0
x[next_3_months] = (x[next_3_months] + 1).astype(int)

  Location Departing Arriving  Apr2025  May2025  Jun2025
0    Paris   MAY2025  MAR2025        2        1        1
1   Dallas   JUN2025  JUN2025        1        1        1
2    Tokyo   APR2025  JUN2025        0        0        1
3   London   JUN2025  APR2025        2        2        1

本文标签: pythonCalculate column values by comparing datetimeStack Overflow