I've got a series of big CSV files generated by a measurement tool. I would like to extract only the relevant data from each CSV, ignoring the other lines. The problem is that the data doesn't always start at the same row, so I cannot use skiprows=xx.
The files look like this:
Bunch of useless lines
DataName A B C D
DataValues 1 2 3 4
....
Other bunch of useless lines
Basically I would need to read in a database only the rows that contain "DataName" (to be used as header) or "DataValues" (the actual measurements data).
EDIT
After a few more hours spent on it, I was able to come up with something like this:
import os
import pandas as pd

for dirpath, dirnames, filenames in os.walk(path):
    for file in filenames:
        print(file)
        if file.lower().endswith('.csv'):
            # Pull the "x y" part out of the file name
            start = file.index('_ ')
            end = file.rindex(' (')
            xy = file[start + 1:end]
            x = xy.split()[0]
            y = xy.split()[1]
            # Collect the index of every line that does NOT start with "Data"
            to_skip = []
            with open(os.path.join(dirpath, file), 'r') as fin:
                for index, line in enumerate(fin):
                    if line[0:4] != 'Data':
                        to_skip.append(index)
            print(to_skip)
            data = pd.read_csv(os.path.join(dirpath, file),
                               skiprows=to_skip, skip_blank_lines=True,
                               on_bad_lines='skip', index_col=False,
                               skipinitialspace=True)
It looks like it is more or less doing what I am after, but I'm not sure this is the best/more efficient way to do that.
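For reference, one way to avoid reading each file twice is to collect the matching lines in a single pass and hand them to pandas through an in-memory buffer. This is a sketch, not the poster's method; the demo file it writes is hypothetical stand-in content matching the sample layout above:

```python
import io
import pandas as pd

# Hypothetical demo file mimicking the layout described in the question.
with open("measurements.csv", "w") as f:
    f.write("Bunch of useless lines\n"
            "DataName,A,B,C,D\n"
            "DataValues,1,2,3,4\n"
            "Other bunch of useless lines\n")

# Single pass: keep only lines starting with "Data", then let pandas
# parse just that buffer instead of re-reading the file with skiprows.
with open("measurements.csv") as fin:
    wanted = [line for line in fin if line.startswith("Data")]

data = pd.read_csv(io.StringIO("".join(wanted)), index_col=False)
```

This keeps a single open/iterate over each file and avoids building a potentially long to_skip list.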
1 Answer
Assuming you know the text content you wish to match in the first column, you should be able to extract just those rows using pandas and three lines of code.
The property pandas.DataFrame.loc allows you to "Access a group of rows and columns by label(s) or a boolean array."
See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas-dataframe-loc
import pandas as pd
df = pd.read_csv('big.csv')
"""Search the CSV file by a column called 'column name' for any rows containing
your 'DataName' or 'DataValues' and return those specific rows.
"""
specific_rows = df.loc[
(df['column name'] == 'DataName') | (df['column name'] == 'DataValues')
]
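The same selection can be written more compactly with Series.isin, which also scales past two labels. A self-contained sketch (the 'column name' label and the demo frame are assumptions, mirroring the snippet above):

```python
import pandas as pd

# Small demo frame; 'column name' stands in for the real label column.
df = pd.DataFrame({'column name': ['junk', 'DataName', 'DataValues'],
                   'payload': [0, 1, 2]})

# Keep only the rows whose label is one of the desired values.
specific_rows = df.loc[df['column name'].isin(['DataName', 'DataValues'])]
```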
Alternatively, assuming your CSV file is several GB in size, let's take a two-step approach. (The code below should be drag-and-drop friendly, all you'd need to do is tweak the variable names at the top to your specific needs.)
- Read the CSV file looking only at the desired 'column name' where you expect to find 'DataName' and 'DataValues'
- Locate and store the row indices of the rows containing 'DataName' and 'DataValues' in that single column
- Read the CSV file again, skipping all but the desired rows
- We'll accomplish the skipping with read_csv's skiprows argument
See: pandas read_csv and keep only certain rows (python)
import pandas as pd

path = 'big.csv'
search_col = ['column name']
desired_rows = ['DataName', 'DataValues']

# Step 1: read only the label column and note which rows to keep.
single_column = pd.read_csv(path, usecols=search_col)
rows_to_keep = set(single_column[single_column.isin(desired_rows)].dropna().index)

# Step 2: skip every other row. A header row is assumed to be present,
# so add 1 to each data index to keep line 0 (the header).
num_lines = len(single_column.index)
rows_to_exclude = [i + 1 for i in range(num_lines) if i not in rows_to_keep]
specific_rows = pd.read_csv(path, skiprows=rows_to_exclude)
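For files that genuinely run to several GB, another option is to stream the file in chunks and filter as you go, which keeps memory bounded. A sketch under the same assumptions ('column name' is the hypothetical label column; the demo file is made up):

```python
import pandas as pd

# Hypothetical demo file with the assumed 'column name' label column.
with open("big.csv", "w") as f:
    f.write("column name,value\n"
            "junk,0\n"
            "DataName,1\n"
            "DataValues,2\n")

desired = ['DataName', 'DataValues']
kept = []
# Read the file a few rows at a time, keeping only the desired rows
# of each chunk; only the filtered pieces accumulate in memory.
for chunk in pd.read_csv("big.csv", chunksize=2):
    kept.append(chunk[chunk['column name'].isin(desired)])
specific_rows = pd.concat(kept, ignore_index=True)
```

In practice you would use a much larger chunksize (e.g. hundreds of thousands of rows).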
Does that help you?
Open the file with with open("my-big-csv.csv") as file:, loop over lines with for line in file:, see if the lines are something you need to keep, and then pass those over to pd.read_csv(). – AKX Commented 2 days ago
egrep '^Data(Name|Values)' origfile >newfile ? – treuss Commented 2 days ago