
I am currently working on a project where I have to first merge two datasets:

The first dataset contains weather data in 30-minute intervals. The second dataset contains minute-level data with PV voltage and cloud images, but unlike the first, it lacks time consistency: several hours of a day might be missing. Note that both have a time column.

The goal is to do a multi-modal analysis (time series + image) to predict the PV voltage.

My problem is that I expanded the weather data to match the minute-level intervals by forward filling within each 30-minute interval (the weather data (temp, wind speed, solar irradiance, etc.) is not linearly interpolated; rather, for each 30-minute block I expanded the time to minutes 0-29 and gave every minute the same data as the block that contains it), but after merging, the combined dataset has fewer rows. What are the optimal ways to merge two datasets on the time column without losing thousands of rows? For reference, the PV and image dataset spans a few months short of 3 years but has only close to 400k minutes logged, so that's a lot of days with no data.
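The expansion step looks roughly like this (a minimal sketch; weather_df is a placeholder name for my raw 30-minute dataframe with a datetime 'time' column):

import pandas as pd

# Use the timestamp as the index so we can upsample it.
weather_df = weather_df.set_index(pd.to_datetime(weather_df['time'])).drop(columns=['time'])

# Upsample each 30-minute block to 1-minute rows and forward-fill,
# so minutes 0-29 of a block all carry that block's values.
expanded_weather_df = weather_df.resample('1min').ffill().reset_index()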

The PV/image dataset has time, image, and PV voltage data. The issue is that it is consistent for stretches, say from 8 am to 10 pm, with each minute having corresponding time, image, and PV voltage data, but then it cuts off until, say, 3 am the next day, and continues like that. The content of each row is consistent and there are no NaN values; it's just that some hours of the day are missing, and that happens a lot. As I said, only about 400k minutes (rows) are logged out of 3 years.

As for the goal, it's not entirely clear yet, but I think it would be a multi-modal analysis of the two datasets where the outputs of the heads of each model (one that accepts image inputs, the other the weather data) are combined to make the final prediction? I'm not really sure and haven't done this before.
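Roughly, I picture a two-branch late-fusion model like this sketch (PyTorch; all layer sizes and the weather feature count are placeholder assumptions):

import torch
import torch.nn as nn

class LateFusionPV(nn.Module):
    """Sketch: a small CNN head for the cloud image and an MLP head
    for the weather features, concatenated before a final regressor.
    All sizes are illustrative placeholders."""
    def __init__(self, n_weather_features=5):
        super().__init__()
        self.image_head = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (batch, 32)
        )
        self.weather_head = nn.Sequential(
            nn.Linear(n_weather_features, 32), nn.ReLU(),
        )
        self.fusion = nn.Linear(32 + 32, 1)  # predict PV voltage

    def forward(self, image, weather):
        z = torch.cat([self.image_head(image), self.weather_head(weather)], dim=1)
        return self.fusion(z)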

Also, since this would be fed to a CNN model on time series, is the lack of consistent time spacing going to be a problem, or is there a way around it? I have never dealt with a time-series model and I'm wondering if I should bother with this at all.

from PIL import Image
import io

import numpy as np   # needed for np.array below
import pandas as pd  # needed for pd.to_datetime below

def decode_image(binary_data):
    # Convert binary data to an image
    image = Image.open(io.BytesIO(binary_data))
    return np.array(image)  # Convert to NumPy array for processing

# Apply to all rows (each cell is a dict holding the raw bytes under 'bytes')
df_PV['decoded_image'] = df_PV['image'].apply(lambda x: decode_image(x['bytes']))


# Insert the decoded_image column in the same position as the image column
image_col_position = df_PV.columns.get_loc('image')  # Get the position of the image column
df_PV.insert(image_col_position, 'decoded_image', df_PV.pop('decoded_image'))

# Drop the old image column
df_PV = df_PV.drop(columns=['image'])


print(df_PV.head())


# Remove timezone from the column
expanded_weather_df['time'] = pd.to_datetime(expanded_weather_df['time']).dt.tz_localize(None)

# also remove timezone
df_PV['time'] = pd.to_datetime(df_PV['time']).dt.tz_localize(None)

# merge
combined_df = expanded_weather_df.merge(df_PV, on='time', how='inner')
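One alternative I'm considering is pd.merge_asof, which attaches each minute-level row to the most recent 30-minute weather reading without expanding the weather data at all (a sketch; weather_30min is a placeholder name for the raw, unexpanded 30-minute frame):

# Keep every df_PV row and pull in the latest weather reading at or
# before each minute. Both frames must be sorted by 'time'.
df_PV = df_PV.sort_values('time')
weather_30min = weather_30min.sort_values('time')

combined_df = pd.merge_asof(
    df_PV, weather_30min,
    on='time',
    direction='backward',             # latest weather row at or before each minute
    tolerance=pd.Timedelta('30min'),  # don't match across longer gaps
)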

  • Why not do a left join instead of the inner join, to keep all the time slots from expanded_weather_df? Then, after the merge, you can impute the many NaNs from df_PV however you want. – rehaqds
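A sketch of what that left join plus imputation could look like (the 'PV_voltage' column name and the interpolation limit are assumptions):

# Left join keeps every minute of expanded_weather_df; the PV/image
# columns are NaN wherever the logger was offline.
combined_df = expanded_weather_df.merge(df_PV, on='time', how='left')

# Placeholder imputation: interpolate only short PV gaps (here <= 2
# minutes); longer gaps stay NaN and can be dropped or masked later.
combined_df['PV_voltage'] = combined_df['PV_voltage'].interpolate(limit=2)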

1 Answer

  1. Create an intermediate timestamp as a merge key: floor each minute-level timestamp down to its 30-minute block and merge on that.
import pandas as pd
import numpy as np
import random

# let dataset 1 be the weather data
# let dataset 2 be the pv voltage & pics data

# define datasets
END_EPOCH_TIME = 1400
dataset_1_timestamps = np.arange(0, END_EPOCH_TIME, 30)
dataset_2_timestamps = np.arange(0, END_EPOCH_TIME)[np.array([(random.random() > 0.8) for _ in range(END_EPOCH_TIME)])]

df_1 = pd.DataFrame({"timestamp": dataset_1_timestamps, "weather": [f"A_{i}" for i in range(len(dataset_1_timestamps))]})
df_2 = pd.DataFrame({"timestamp": dataset_2_timestamps, "pv_voltage": [f"B_{i}" for i in range(len(dataset_2_timestamps))]})

# combine them without losing rows
df_2["current_timestamp"] = df_2["timestamp"]                        # preserve the original minute timestamp
df_2["timestamp"] = df_2["timestamp"].apply(lambda t: t - (t % 30))  # floor to the containing 30-unit block
combined_df = df_2.merge(df_1, on="timestamp", how="right")          # right join keeps every weather row

print(df_1.head())
print(df_2.head())
print(combined_df.head())

Output

   timestamp weather
0          0     A_0
1         30     A_1
2         60     A_2
3         90     A_3
4        120     A_4
   timestamp pv_voltage  current_timestamp
0          0        B_0                  3
1          0        B_1                  6
2          0        B_2                 14
3          0        B_3                 17
4          0        B_4                 18
   timestamp pv_voltage  current_timestamp weather
0          0        B_0                  3     A_0
1          0        B_1                  6     A_0
2          0        B_2                 14     A_0
3          0        B_3                 17     A_0
4          0        B_4                 18     A_0
  2. The inconsistent time windows in your image data just mean you have to be selective about which data the model gets for inference. E.g., is it reasonable for your model to make an educated guess based only on the single latest image? Does it need 10 images instead? How many total training/validation/test samples would you have if a minimum of 10 consecutive minutes is required? Since the time granularity of the images is short, can you interpolate between images where the gaps are short (1-2 minutes)? I'd recommend researching similar problems to what you're trying to solve to get ideas. A gap check like the sketch below can extract fixed-length runs of consecutive minutes to use as samples.
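A sketch of such a gap check (the window length n and the 'time' column name are assumptions):

import pandas as pd

def contiguous_windows(df, n=10, time_col='time'):
    """Yield index windows of n rows spaced exactly 1 minute apart.
    A sketch for building fixed-length training samples from gappy data."""
    df = df.sort_values(time_col).reset_index(drop=True)
    gaps = df[time_col].diff() != pd.Timedelta('1min')  # True at each break
    run_id = gaps.cumsum()                              # label contiguous runs
    for _, run in df.groupby(run_id):
        for start in range(0, len(run) - n + 1):
            yield run.index[start:start + n]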
