admin管理员组文章数量:1419243
I am fairly new to Python and I am trying to build an app where I take a master file that contains product information and compares it to an older data file with the same data.
The aim of the app is to highlight any new products using a 'SKU' column and to then generate a new file that contains all the same columns from the master file but only the 'new' product rows listed.
Filename descriptions:
"downloaded_data.csv" - this is the master file which is downloaded from a URL.
"current-data.csv" - this is old data file to compare against to find new products.
"new-products.csv" - this will be the file generated to include only the new products that appear in the master file after comparing to the current data file (currnet-data.csv). This new file should also include all the columns that appear in the master file (downloaded_data.csv).
"comparison-file.csv" - this is a file that simply outputs the list of SKUs that are considered new.
Example columns from "downloaded_data.csv" file which should also appear in the new file.
Title | Image URL | SKU |
---|---|---|
Product Name One | imageurl.jpg | SKU123 |
Product Name Two | imageurl.jpg | SKU456 |
I am fairly new to Python and I am trying to build an app where I take a master file that contains product information and compares it to an older data file with the same data.
The aim of the app is to highlight any new products using a 'SKU' column and to then generate a new file that contains all the same columns from the master file but only the 'new' product rows listed.
Filename descriptions:
"downloaded_data.csv" - this is the master file which is downloaded from a URL.
"current-data.csv" - this is old data file to compare against to find new products.
"new-products.csv" - this will be the file generated to include only the new products that appear in the master file after comparing to the current data file (currnet-data.csv). This new file should also include all the columns that appear in the master file (downloaded_data.csv).
"comparison-file.csv" - this is a file that simply outputs the list of SKUs that are considered new.
Example columns from "downloaded_data.csv" file which should also appear in the new file.
Title | Image URL | SKU |
---|---|---|
Product Name One | imageurl.jpg | SKU123 |
Product Name Two | imageurl.jpg | SKU456 |
Example columns from "current-data.csv" file. Compare the above example CSV file with this using the "SKU" and "ID" columns:
ID | Image URL | Product Name |
---|---|---|
SKU123 | imageurl.jpg | Product Name One |
SKU789 | imageurl.jpg | Product Name Three |
Example of the "new products" file (same columns from "downloaded_data.csv" but only inputting the SKUs that are missing from the lookup file "current-data.csv"):
Title | Image URL | SKU |
---|---|---|
Product Name Three | imageurl.jpg | SKU789 |
Below is my Python code. After running this, the "new-products.csv" file ends up just being a copy of the downloaded master file.
import pandas as pd
import requests
pd.set_option("display.max_rows", None)
# Dowload CSV file
url = "URL GOES HERE"
response = requests.get(url)
# Check if the request was successful (status code 200)
if response.status_code == 200:
# Save the content of the response to a local CSV file
with open("downloaded_data.csv", "wb") as f:
f.write(response.content)
print("CSV file downloaded successfully")
else:
print("Failed to download csv file. Status code: ", response.status_code)
# Read the CSV file into a Pandas DataFrame
master_file_compare = pd.read_csv("downloaded_data.csv", usecols=[29], names=['SKU'])
account_data = pd.read_csv("account-data.csv", usecols=[5], names=['SKU'])
# Merging both dataframe using left join
comparison_result = pd.merge(master_file_compare,account_data, on='SKU', how='left', indicator=True)
# Filtering only the rows that are available in left (master_file_compare)
comparison_result = comparison_result.loc[comparison_result['_merge'] == 'left_only']
comparison_result.to_csv('comparison-file.csv', encoding='utf8')
# Compare Comparison file to master and generate data file
with open('downloaded_data.csv', 'r', encoding='utf8') as in_file, open('new-products.csv', 'w', encoding='utf8') as out_file:
for line in in_file:
if line.split(',')[0].strip() not in comparison_result:
out_file.write(line)
# print(account_data.head)
print(comparison_result)
To note that the current data (comparison file) will not have the same columns or even ordering of columns. All I am aiming to do is to match on the SKU columns as those are the identifiers.
Any pointers or suggestions on where to focus next would be greatly appreciated.
Thanks!
Share Improve this question edited Jan 30 at 12:54 Calogero Savarino asked Jan 29 at 11:46 Calogero SavarinoCalogero Savarino 376 bronze badges2 Answers
Reset to default 0If I understood correctly, you want to get the SKU
values that are in the new csv
file (let's call it v2.csv
) but not in the former one (let's call it v1.csv
).
You can do it in many different ways, the way I'd do it would be by converting the list of SKU
values of each file to set
objects and take the difference between the two to get the ones that are in set_sku_v2
but not in set_sku_v1
.
Sample code:
v1_df = pd.read_csv("v1.csv")
v2_df = pd.read_csv("v2.csv")
set_sku_v1 = set(v1_df["SKU"]) # e.g. {12, 15, 19, 22}
set_sku_v2 = set(v2_df["SKU"]) # e.g. {12, 19, 31}
new_sku = list(set_sku_v2 - set_sku_v1) # e.g. [15, 22]
new_products_df = v2_df.query(f"SKU in {new_sku}")
Using the following example:
display(downloaded_data)
Title Image_URL SKU
0 Product_Name_One imageurl1.jpg SKU123
1 Product_Name_Two imageurl2.jpg SKU456
2 Product_Name_Four imageurl4.jpg SKU457
display(current_data)
ID Image_URL Product_Name
0 SKU123 imageurl1.jpg Product_Name_One
1 SKU789 imageurl3.jpg Product_Name_Three
And adapting your original code:
# Merging both dataframe using left join
comparison_result = pd.merge(downloaded_data, current_data,
left_on='SKU', right_on='ID',
how='left', indicator=True)
# Filtering only the rows that are available in left (master_file_compare)
comparison_result = comparison_result.loc[comparison_result['_merge'] == 'left_only']
new_products = downloaded_data.loc[downloaded_data['SKU'].isin(comparison_result['SKU'])]
display(new_products)
Title Image_URL SKU
1 Product_Name_Two imageurl2.jpg SKU456
2 Product_Name_Four imageurl4.jpg SKU457
本文标签: pandasComparing 2 data files and create a new file with the missing data using PythonStack Overflow
版权声明:本文标题:pandas - Comparing 2 data files and create a new file with the missing data using Python - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1745300593a2652348.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
发表评论