admin管理员组

文章数量:1419243

I am fairly new to Python and I am trying to build an app where I take a master file that contains product information and compares it to an older data file with the same data.

The aim of the app is to highlight any new products using a 'SKU' column and to then generate a new file that contains all the same columns from the master file but only the 'new' product rows listed.

Filename descriptions:

"downloaded_data.csv" - this is the master file which is downloaded from a URL.

"current-data.csv" - this is old data file to compare against to find new products.

"new-products.csv" - this will be the file generated to include only the new products that appear in the master file after comparing to the current data file (currnet-data.csv). This new file should also include all the columns that appear in the master file (downloaded_data.csv).

"comparison-file.csv" - this is a file that simply outputs the list of SKUs that are considered new.

Example columns from "downloaded_data.csv" file which should also appear in the new file.

Title Image URL SKU
Product Name One imageurl.jpg SKU123
Product Name Two imageurl.jpg SKU456

I am fairly new to Python and I am trying to build an app where I take a master file that contains product information and compares it to an older data file with the same data.

The aim of the app is to highlight any new products using a 'SKU' column and to then generate a new file that contains all the same columns from the master file but only the 'new' product rows listed.

Filename descriptions:

"downloaded_data.csv" - this is the master file which is downloaded from a URL.

"current-data.csv" - this is old data file to compare against to find new products.

"new-products.csv" - this will be the file generated to include only the new products that appear in the master file after comparing to the current data file (currnet-data.csv). This new file should also include all the columns that appear in the master file (downloaded_data.csv).

"comparison-file.csv" - this is a file that simply outputs the list of SKUs that are considered new.

Example columns from "downloaded_data.csv" file which should also appear in the new file.

Title Image URL SKU
Product Name One imageurl.jpg SKU123
Product Name Two imageurl.jpg SKU456

Example columns from "current-data.csv" file. Compare the above example CSV file with this using the "SKU" and "ID" columns:

ID Image URL Product Name
SKU123 imageurl.jpg Product Name One
SKU789 imageurl.jpg Product Name Three

Example of the "new products" file (same columns from "downloaded_data.csv" but only inputting the SKUs that are missing from the lookup file "current-data.csv"):

Title Image URL SKU
Product Name Three imageurl.jpg SKU789

Below is my Python code. After running this, the "new-products.csv" file ends up just being a copy of the downloaded master file.

import pandas as pd
import requests

pd.set_option("display.max_rows", None)

# Dowload CSV file
url = "URL GOES HERE"
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Save the content of the response to a local CSV file
    with open("downloaded_data.csv", "wb") as f:
        f.write(response.content)
    print("CSV file downloaded successfully")
else:
    print("Failed to download csv file. Status code: ", response.status_code)

# Read the CSV file into a Pandas DataFrame
master_file_compare = pd.read_csv("downloaded_data.csv", usecols=[29], names=['SKU'])
account_data = pd.read_csv("account-data.csv", usecols=[5], names=['SKU'])

# Merging both dataframe using left join
comparison_result = pd.merge(master_file_compare,account_data, on='SKU', how='left', indicator=True)

# Filtering only the rows that are available in left (master_file_compare)
comparison_result = comparison_result.loc[comparison_result['_merge'] == 'left_only']

comparison_result.to_csv('comparison-file.csv', encoding='utf8')

# Compare Comparison file to master and generate data file
with open('downloaded_data.csv', 'r', encoding='utf8') as in_file, open('new-products.csv', 'w', encoding='utf8') as out_file:
    for line in in_file:
        if line.split(',')[0].strip() not in comparison_result:
            out_file.write(line)

# print(account_data.head)
print(comparison_result)

To note that the current data (comparison file) will not have the same columns or even ordering of columns. All I am aiming to do is to match on the SKU columns as those are the identifiers.

Any pointers or suggestions on where to focus next would be greatly appreciated.

Thanks!

Share Improve this question edited Jan 30 at 12:54 Calogero Savarino asked Jan 29 at 11:46 Calogero SavarinoCalogero Savarino 376 bronze badges
Add a comment  | 

2 Answers 2

Reset to default 0

If I understood correctly, you want to get the SKU values that are in the new csv file (let's call it v2.csv) but not in the former one (let's call it v1.csv).

You can do it in many different ways, the way I'd do it would be by converting the list of SKU values of each file to set objects and take the difference between the two to get the ones that are in set_sku_v2 but not in set_sku_v1.

Sample code:

v1_df = pd.read_csv("v1.csv")
v2_df = pd.read_csv("v2.csv")

set_sku_v1 = set(v1_df["SKU"])  # e.g. {12, 15, 19, 22}
set_sku_v2 = set(v2_df["SKU"])  # e.g. {12, 19, 31}

new_sku = list(set_sku_v2 - set_sku_v1)  # e.g. [15, 22]
new_products_df = v2_df.query(f"SKU in {new_sku}")

Using the following example:

display(downloaded_data)
Title   Image_URL   SKU
0   Product_Name_One    imageurl1.jpg   SKU123
1   Product_Name_Two    imageurl2.jpg   SKU456
2   Product_Name_Four   imageurl4.jpg   SKU457

display(current_data)
ID  Image_URL   Product_Name
0   SKU123  imageurl1.jpg   Product_Name_One
1   SKU789  imageurl3.jpg   Product_Name_Three

And adapting your original code:

# Merging both dataframe using left join
comparison_result = pd.merge(downloaded_data, current_data, 
                             left_on='SKU', right_on='ID', 
                             how='left', indicator=True)

# Filtering only the rows that are available in left (master_file_compare)
comparison_result = comparison_result.loc[comparison_result['_merge'] == 'left_only']   
new_products = downloaded_data.loc[downloaded_data['SKU'].isin(comparison_result['SKU'])]

display(new_products)
    Title   Image_URL   SKU
1   Product_Name_Two    imageurl2.jpg   SKU456
2   Product_Name_Four   imageurl4.jpg   SKU457

本文标签: pandasComparing 2 data files and create a new file with the missing data using PythonStack Overflow