
I've got a series of big CSV files generated by a measurement tool. I would like to extract only the relevant data from each file, ignoring the other lines. The problem is that the data doesn't always start at the same row (so I cannot use skiprows=xx).

The files look like this:


Bunch of useless lines

DataName A B C D
DataValues 1 2 3 4
....

Other bunch of useless lines

Basically, I need to read only the rows that contain "DataName" (to be used as the header) or "DataValues" (the actual measurement data) into a DataFrame.

EDIT

After a few more hours spent on it, I came up with something like this:

import os
import pandas as pd

# 'path' is the root folder to scan for CSV files
for dirpath, dirnames, filenames in os.walk(path):
    for file in filenames:
        print(file)
        if file.lower().endswith('.csv'):
            # pull the measurement coordinates out of the file name,
            # i.e. the text between '_ ' and ' ('
            start = file.index('_ ')
            end = file.rindex(' (')
            xy = file[start + 1:end]
            x = xy.split()[0]
            y = xy.split()[1]

            # first pass: record the index of every line that does not
            # start with 'Data', so read_csv can skip it
            to_skip = []
            with open(os.path.join(dirpath, file), 'r') as fin:
                for index, line in enumerate(fin):
                    if line[0:4] != 'Data':
                        to_skip.append(index)

            print(to_skip)

            # second pass: parse only the remaining 'Data...' lines
            data = pd.read_csv(os.path.join(dirpath, file),
                               skiprows=to_skip, skip_blank_lines=True,
                               on_bad_lines='skip', index_col=False,
                               skipinitialspace=True)

It looks like it is more or less doing what I am after, but I'm not sure this is the best or most efficient way to do it.


  • You could open the file with with open("my-big-csv.csv") as file:, loop over lines with for line in file:, see if the lines are something you need to keep, and then pass those over to pd.read_csv() (a sketch of this idea follows the list). – AKX
  • Could you maybe post what you have tried? – Cincinnatus
  • Please provide enough code so others can better understand or reproduce the problem. – Community Bot
  • This question is similar to: Read a formatted text file via python. If you believe it's different, please edit the question and make it clear how it's different and/or how the answers on that question are not helpful for your problem. – JonSG
  • How about first extracting the relevant lines, using e.g. egrep '^Data(Name|Values)' origfile >newfile? – treuss
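
For reference, here is a minimal sketch of the filter-then-parse idea from the first comment, assuming the "Data..." rows are whitespace-delimited as in the sample above (the file name is a placeholder):

import io
import pandas as pd

# Collect only the lines that start with 'Data' in a single pass,
# then hand them to pandas as an in-memory buffer.
with open('my-big-csv.csv') as file:
    kept_lines = [line for line in file if line.startswith('Data')]

# The first kept line ("DataName ...") serves as the header row;
# swap in sep=',' if the Data rows are comma-separated instead.
data = pd.read_csv(io.StringIO(''.join(kept_lines)), sep=r'\s+')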

1 Answer

Assuming you know the text you wish to match in the first column, you should be able to extract just those rows with pandas in three lines of code.

The property pandas.DataFrame.loc allows you to "access a group of rows and columns by label(s) or a boolean array."

See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas-dataframe-loc

import pandas as pd

df = pd.read_csv('big.csv')

# Search the column called 'column name' for any rows containing
# 'DataName' or 'DataValues' and return those specific rows.
specific_rows = df.loc[
    (df['column name'] == 'DataName') | (df['column name'] == 'DataValues')
]

Alternatively, if your CSV file is several GB in size, you can take a two-step approach. (The code below should be drag-and-drop friendly; all you need to do is tweak the variables at the top to your specific needs.)

  1. Read the CSV file looking at only the desired 'column name' where you expect to find 'DataName' and 'DataValues'
    • Locate and store the row indices for rows containing 'DataName' and 'DataValues' in this single column
  2. Read the CSV file again and skip all but the desired rows
    • We'll accomplish this by passing the indices of every other row to the skiprows argument of read_csv

See: pandas read_csv and keep only certain rows (python)

import pandas as pd

path = 'big.csv'
search_col = ['column name']
desired_rows = ['DataName', 'DataValues']

# Step 1: read only the search column and note which rows match
single_column = pd.read_csv(path, usecols=search_col)
rows_to_keep = single_column[single_column.isin(desired_rows)].dropna().index

# Step 2: re-read the file, skipping everything except those rows
num_lines = len(single_column.index)
# read_csv counts the header as a file line, so shift each data-row index
# by 1; the header row itself is never added to the exclusion list
rows_to_exclude = [i + 1 for i in range(num_lines) if i not in rows_to_keep]
specific_rows = pd.read_csv(path, skiprows=rows_to_exclude)

Does that help you?
