
I've got a series of big CSV files generated by a measurement tool. I would like to extract only the relevant data from each file, ignoring the other lines. The problem is that the data doesn't always start at the same row (so I cannot use skiprows=xx).

The files look like this:


Bunch of useless lines

DataName A B C D
DataValues 1 2 3 4
....

Other bunch of useless lines

Basically, I need to read only the rows that contain "DataName" (to be used as the header) or "DataValues" (the actual measurement data) into a DataFrame.

EDIT

After a few more hours spent on it, I came up with something like this:

import os
import pandas as pd

# 'path' is the root folder to scan for CSV files
for dirpath, dirnames, filenames in os.walk(path):
    for file in filenames:
        print(file)
        if file.lower().endswith('.csv'):
            # pull the measurement coordinates out of the file name,
            # i.e. the text between '_ ' and ' ('
            start = file.index('_ ')
            end = file.rindex(' (')
            xy = file[start + 1:end]
            x = xy.split()[0]
            y = xy.split()[1]

            # first pass: record the index of every line that does not
            # start with 'Data', so read_csv can skip it
            to_skip = []
            with open(os.path.join(dirpath, file), 'r') as fin:
                for index, line in enumerate(fin):
                    if line[0:4] != 'Data':
                        to_skip.append(index)

            print(to_skip)

            # second pass: parse only the remaining 'Data...' lines
            data = pd.read_csv(os.path.join(dirpath, file),
                               skiprows=to_skip, skip_blank_lines=True,
                               on_bad_lines='skip', index_col=False,
                               skipinitialspace=True)

It looks like it is more or less doing what I am after, but I'm not sure this is the best or most efficient way to do it.


  • You could open the file with with open("my-big-csv.csv") as file:, loop over lines with for line in file:, see if the lines are something you need to keep, and then pass those over to pd.read_csv() (a sketch of this idea follows the list). – AKX
  • Could you maybe post what you have tried? – Cincinnatus
  • Please provide enough code so others can better understand or reproduce the problem. – Community Bot
  • This question is similar to: Read a formatted text file via python. If you believe it's different, please edit the question and make it clear how it's different and/or how the answers on that question are not helpful for your problem. – JonSG
  • How about first extracting the relevant lines, using e.g. egrep '^Data(Name|Values)' origfile >newfile? – treuss
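
For reference, here is a minimal sketch of the filter-then-parse idea from the first comment, assuming the "Data..." rows are whitespace-delimited as in the sample above (the file name is a placeholder):

import io
import pandas as pd

# Collect only the lines that start with 'Data' in a single pass,
# then hand them to pandas as an in-memory buffer.
with open('my-big-csv.csv') as file:
    kept_lines = [line for line in file if line.startswith('Data')]

# The first kept line ("DataName ...") serves as the header row;
# swap in sep=',' if the Data rows are comma-separated instead.
data = pd.read_csv(io.StringIO(''.join(kept_lines)), sep=r'\s+')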

1 Answer

Assuming you know the text you wish to match in the first column, you should be able to extract just those rows with pandas in three lines of code.

The property pandas.DataFrame.loc allows you to "access a group of rows and columns by label(s) or a boolean array."

See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas-dataframe-loc

import pandas as pd

df = pd.read_csv('big.csv')

# Search the column called 'column name' for any rows containing
# 'DataName' or 'DataValues' and return those specific rows.
specific_rows = df.loc[
    (df['column name'] == 'DataName') | (df['column name'] == 'DataValues')
]

Alternatively, if your CSV file is several GB in size, you can take a two-step approach. (The code below should be drag-and-drop friendly; all you need to do is tweak the variables at the top to your specific needs.)

  1. Read the CSV file looking at only the desired 'column name' where you expect to find 'DataName' and 'DataValues'
    • Locate and store the row indices for rows containing 'DataName' and 'DataValues' in this single column
  2. Read the CSV file again and skip all but the desired rows
    • We'll accomplish this by passing the indices of every other row to the skiprows argument of read_csv

See: pandas read_csv and keep only certain rows (python)

import pandas as pd

path = 'big.csv'
search_col = ['column name']
desired_rows = ['DataName', 'DataValues']

# Step 1: read only the search column and note which rows match
single_column = pd.read_csv(path, usecols=search_col)
rows_to_keep = single_column[single_column.isin(desired_rows)].dropna().index

# Step 2: re-read the file, skipping everything except those rows
num_lines = len(single_column.index)
# read_csv counts the header as a file line, so shift each data-row index
# by 1; the header row itself is never added to the exclusion list
rows_to_exclude = [i + 1 for i in range(num_lines) if i not in rows_to_keep]
specific_rows = pd.read_csv(path, skiprows=rows_to_exclude)

Does that help you?
