I receive operations logs, one per line, in an $input_file: each line describes which resources were used (printers, scanners, etc.) Below are examples of valid lines:

crap I don't care [] # This should not happen, but sometimes it does
crap I don't care [Printer DVX500]
crap I really do not care [Printer DV650V & Scanner SON2000]
some other crap [Printer DVX500 & Printer DV650V & Scanner SON2000]

I have no control over how these lines are produced. I only know that the resources are supposed to be described at the end of each line, between square brackets ([...]). Sometimes, like in the first line above, it is empty; sometimes, several resources are used, in which case resources are separated with $SEP ($SEP=="&" in the examples).

For usage purposes, I need to produce a sorted list of (unique) resources in an $output_file. The following code works, and follows these steps:

  1. Extract the content between [] at the end of the line into $content
  2. If $content is not empty, split it on $SEP into $parts
  3. For each element in $parts, trim leading/trailing whitespace
  4. Put each trimmed part into an associative array (which naturally keeps unique resources only)
  5. Write everything into $output_file

With the line examples above, the result should be the following:

Printer DV650V
Printer DVX500
Scanner SON2000

Here is the script I am using, which is functionally correct (at least so far):

# Get the input and output file paths from the command line arguments
input_file="$1"
output_file="$2"
# Check if the input file exists
if [[ ! -f "$input_file" ]]; then
    echo "Error: Input file '$input_file' not found!"
    exit 1
fi

# Use a hash table (associative array) to avoid duplicates directly in memory
declare -A unique_lines
linenumber=0
# Process the file line by line
while IFS= read -r line; do
    linenumber=$(( linenumber + 1 ))
    if (( linenumber % 100 == 0 )); then
        echo "line=$linenumber"
    fi
    # Use sed to extract the content between the last "[" and "]"
    content=$(echo "$line" | sed -E 's/.*\[(.*)\].*/\1/')

    # If content is not empty, process the extracted part
    if [[ -n "$content" ]]; then
        # Split the content by "&" and process each part
        IFS="&" read -ra parts <<< "$content"
        for part in "${parts[@]}"; do
            # Trim leading/trailing whitespace from each part
            trimmed_part=$(echo "$part" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
            
            # Store the trimmed part in the associative array (ensures uniqueness)
            unique_lines["$trimmed_part"]=1
        done
    fi
done < "$input_file"

# Write the sorted, unique results to the output file
for part in "${!unique_lines[@]}"; do
    echo "$part"
done | sort > "$output_file" # supposed to be unique already, due to how $unique_lines is fed

echo "Processing completed. Output saved to $output_file" 

The issue is performance: with an $input_file of 5,000-10,000 lines, the previous script takes between 12 and 30 minutes. I tried several things, but could not figure out a way of really optimising it:

  • using parameter expansion for extracting the resource list, instead of sed (a sketch of what I mean follows this list);
  • writing into $output_file during extraction, instead of storing results in the $unique_lines array (which turned out to be way slower because of the many write operations);
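
For reference, a minimal sketch of that parameter-expansion attempt (the idea I was aiming for, not necessarily the exact code I ran):

# Grab the content of the last [...] pair using parameter expansion only
tmp="${line##*\[}"         # strip everything up to and including the last '['
content="${tmp%%\]*}"      # strip everything from the first remaining ']' onward

This removes the sed call for the extraction, but as noted above it was not enough on its own.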

The culprit seems to be the second read followed by the for loop, but I don't see an easy way to do Step #2 otherwise.

If you have any indication on how to optimise it, even if the general algorithm is changed, I would be happy.

EDIT: Completing the requirements list w.r.t. the brilliant commenters' remarks :-)

Thanks for the thorough analysis of my use case :-). For the sake of completeness:

  • An important detail I left out: lines may contain other [...] groups, but the info I need to extract is always within the last pair. So a better test set would be the following:
crap I don't care [CRAP] [] # This should not happen, but sometimes it does
crap I don't care [CRAPPYCRAP] [Printer DVX500]
crap I really do not care [IRREL] [CRAP!] [Printer DV650V & Scanner SON2000]
some other crap [!!!] [CRAPPPYYYY] [NOTHERE] [Printer DVX500 & Printer DV650V & Scanner SON2000]
  • There should not be lines without any [...] group, or at least I have never encountered that case yet. Maybe adopting a defensive stance here would be better, I don't know (a possible guard is sketched after this list)...

  • The data inside brackets may indeed vary (in lower/upper case, in singular/plural forms, etc.), but that does not matter much, as the rest of the workflow takes these variations into consideration.

  • The data inside [...] may occasionally have leading/trailing spaces, although rarely. For my solution, that was not an issue, as I extracted the whole content between [...], then split along the [:space:]$SEP[:space:] characters, then trimmed, which should have taken care of it.

  • Finally, yes: sticking to bash is boss-mandatory, and I fully agree with you that a GPPL would have been way better (readable, faster) than bash.
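
For the defensive stance mentioned above, a minimal guard (a sketch, not something my current script does) would be to skip, inside the read loop, any line that contains no [...] pair at all:

# Skip lines that do not contain at least one [...] pair
[[ "$line" != *\[*\]* ]] && continue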

Question asked Feb 19 by Moussa Amrani, edited Feb 20. Comments:
  • Can you ever have input lines that do not contain a [...] part? How about lines where [ or ] appear multiple times? – Ed Morton Commented Feb 19 at 20:33
  • In general, "optimizing performance" and bash don't go together well. Use a real programming language if performance is a concern. – Barmar Commented Feb 19 at 23:42
  • What does sticking to bash is boss-mandatory mean to you? Does it mean that a) you can only use bash builtins (i.e. not sed or awk), or b) that you can use bash to call mandatory POSIX tools such as sed or awk, or c) that you can use bash to call other non-POSIX tools such as perl, python, or ruby but can't write a C program, or d) something else? – Ed Morton Commented Feb 20 at 13:38
  • Your question is exactly my trouble with him nowadays. It has a very vague meaning, and we debated that long enough. The status quo for now is your (b) option: bash + POSIX mandatory tools. Just for the sake of challenging the status quo, I am tempted to try (c) in the upcoming days... Or even just calling appropriate C code cleverly hidden in the script... I'll let you know! :-) – Moussa Amrani Commented Feb 20 at 14:32
  • Thanks for the incredible support. It helped me dig deeper into awk, better understand arrays/expansions, and even try ruby :-). They are all acceptable solutions: I am voting for Ed's solution simply because of the incredible performance it exhibits (tried it on 100k lines with almost immediate completion, wow!), but this does not remove any interest from the others. – Moussa Amrani Commented Feb 21 at 14:17

5 Answers


You're using the wrong tool. A shell (e.g. bash) is a tool to create/destroy files and processes and sequence calls to other tools. It's specifically NOT a tool to manipulate text. The mandatory POSIX (i.e. available on all Unix boxes) tool to manipulate text is awk so you should be using a single awk script instead of bash calling sed and other tools in a loop which is a famously inefficient approach as well as usually resulting in lengthy, complicated, fragile, non-portable code. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice.

This, using any awk, will do what you appear to want to do orders of magnitude faster than what you were trying to do as well as being more concise and portable:

$ cat tst.sh
#!/usr/bin/env bash

input_file="$1"
output_file="$2"

awk '
    sub(/.*\[/,"") && sub(/].*/,"") && NF {
        gsub(/ *& */, ORS)
        print
    }
' "$input_file" | sort -u > "$output_file"

You'd run the above the same way as the script in your question:

$ ./tst.sh file outfile

and the output would be:

$ cat outfile
Printer DV650V
Printer DVX500
Scanner SON2000
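
Since the question treats the separator as a variable ($SEP), the same idea can be parameterized by passing the separator into awk; a sketch, assuming $SEP contains no ERE metacharacters:

awk -v sep="$SEP" '
    sub(/.*\[/,"") && sub(/].*/,"") && NF {
        gsub(" *" sep " *", ORS)   # dynamic regex built around the separator
        print
    }
' "$input_file" | sort -u > "$output_file"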

While I would tackle this with awk, if you must stick with a bash solution then consider:

  • the $(echo ... | sed ...) construct spawns 2 new subshells each time it's invoked (that's once per pass through each of the loops)
  • now multiply that times the number of passes through both the outer and inner loops
  • for a 50K line file you're spawning at least 200K subshells, and that is going to represent a significant chunk of your operating time
  • eliminating those 200K subshells should get your script running in no more than a few seconds (depending on your cpu); the per-part trim, for example, can be done without any subshell at all, as sketched just below
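
As an illustration of that last point, a sketch of a subshell-free trim using extglob parameter expansion (the full script below takes a different route and lets read do the trimming):

shopt -s extglob                                  # enable +(...) extended glob patterns
part="   Printer DV650V   "                       # example value with stray whitespace
trimmed_part="${part##+([[:space:]])}"            # strip leading whitespace, no subshell
trimmed_part="${trimmed_part%%+([[:space:]])}"    # strip trailing whitespace, no subshell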

NOTE: OP has updated the question with additional details; this requires a change to the original answer ...

Assumptions:

  • we do not have to worry about single/standalone brackets
  • we do not have to worry about nested brackets
  • we will process data as case sensitive; this means we will treat Printer, printer and PRINTER as three distinct/different strings
  • the amount of white space characters within the brackets may vary but we can rely on using the 3-character string <space>&<space> as a delimiter

General approach:

  • split each line on the delimiters [ and ] and process the last set of brackets; parsing the line into an array means the content of the last set of brackets lands in the array entry with the greatest odd-numbered index (a concrete example follows this list)
  • take the desired field and split on delimiter <space>&<space>
  • strip leading/trailing white space from each field
  • store the resulting parts as indices in an associative array; this will eliminate the need to sort for unique values
  • printf the associative array indices and pipe to sort
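
To see that index arithmetic on a concrete line, a small standalone sketch (using one of the sample lines; not part of the script below):

line='crap I really do not care [IRREL] [CRAP!] [Printer DV650V & Scanner SON2000]'
IFS='[]' read -ra contents <<< "$line"
echo "${#contents[@]}"        # 6 fields; the bracket contents sit at the odd indices 1, 3, 5
ndx=${#contents[@]}
(( ndx -= (ndx%2) ? 2 : 1 ))  # even count -> subtract 1 -> ndx=5, the greatest odd index
echo "<${contents[ndx]}>"     # <Printer DV650V & Scanner SON2000>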

One bash approach:

$ cat test.script
#!/bin/bash

input_file="$1"
output_file="$2"

unset      unique_parts                       # initialize our
declare -A unique_parts                       # associative array

while IFS='[]' read -ra contents              # split input line on delimiters '[' and ']' and store results in array contents[]
do
    ndx="${#contents[@]}"
    (( ndx-= (ndx%2) ? 2 : 1 ))               # find greatest odd-numbered index 

    while read -r part                        # also strips leading/trailing white space
    do
        [[ -z "${part}" ]] && continue        # skip if empty
        unique_parts[${part}]=''              # store as array index
    done <<< "${contents[ndx]// & /$'\n'}"    # replace '<space>&<space>' with a newline (\n)

done < "${input_file}"

# print array indices on separate lines, sort and save in ${output_file}

printf "%s\n" "${!unique_parts[@]}" | sort > "${output_file}"

NOTE: since we've limited ourselves to strictly bash features, variables and builtins, and just one subshell/executable call (| sort), this should be significantly faster than OP's current code

Taking it for a test drive (NOTE: I duplicated OP's sample data by a factor of 12,500 to create the 50K-line input.dat file; I also edited a few rows to include leading/trailing white space in the brackets):

$ time ./test.script input.dat output.dat
    
real    0m0.967s
user    0m0.722s
sys     0m0.244s

$ cat output.dat
Printer DV650V
Printer DVX500
Scanner SON2000

Running Ed's latest awk|sort against 50K lines:

real    0m0.130s
user    0m0.122s
sys     0m0.016s

Running OP's current script on my system, terminated at 10K lines, with the 10K-line real time multiplied by 5 to extrapolate a 50K-line timing:

$ time ./op.script input.dat output.dat
... snip ...
line=9600
line=9700
line=9800
line=9900
line=10000
^C

        10K lines      50K lines
        =========      =========
real    0m47.567s  x5  3m57.834s
user    0m24.958s
sys     0m31.393s

NOTES:

  • timings are from an Intel i7-1260P
  • software versions: Ubuntu 22.04 / GNU bash 5.1.16 / GNU awk 5.1.0 / GNU sed 4.8 / GNU sort 8.32

As a comparison, with OP's original machine and script, the runtimes are the following for the same dataset (50k lines duplicated from OP's original sample):

real    9m7.913s
user    1m23.169s
sys     4m24.448s

This should be considered in contrast with Ed's solution on the same machine, with the same dataset:

real    0m0.289s
user    0m0.030s
sys     0m0.076s

NOTE: This is not the production machine, but rather my own laptop, with bash 5.2.21, gawk 5.3.0, and sort 9.0, running on Cygwin over Windows 8.1.

Perhaps something like this with sed (match the content inside the [] and split at &):

$ sed -nE 's/^.*\[(..*)].*$/\1/p;' < "$input_file" | sed -E 's/ & /\n/g' | sort -u
Printer DV650V
Printer DVX500
Printer FOO500
Scanner SON2000

For 10K lines

real    0m0.087s
user    0m0.101s
sys     0m0.005s

Going with the general drift that you are better off with a scripting language vs Bash, here is a Ruby version:

ruby -lne 'BEGIN{ x=Set.new } 
x.merge($1.split(/\s*&\s*/)) if /.*\[([^\]]*)\]/ =~ $_
END{ x.each { |e| puts e } }' file

With your updated input (with potentially more than one set of []), prints:

Printer DVX500
Printer DV650V
Scanner SON2000

That is 'as found' file order, which could be sorted by Ruby any way you can imagine.


Since everyone is doing benchmarks, here is a Ruby benchmark.

First, make a bigger file:

echo "crap I don't care [CRAP] [] # This should not happen, but sometimes it does
crap I don't care [CRAPPYCRAP] [Printer DVX500]
crap I really do not care [IRREL] [CRAP!] [Printer DV650V & Scanner SON2000]
some other crap [!!!] [CRAPPPYYYY] [NOTHERE] [Printer DVX500 & Printer DV650V & Scanner SON2000]" |
 awk -v cnt=100000 '
     {lines[FNR]=$0}
     END{for(i=0;i<cnt;i++) print lines[(i % NR)+1]}' >file 

$ wc -l file
  100000 file

Now use time on the Ruby above:

Printer DVX500
Printer DV650V
Scanner SON2000

real    0m0.185s
user    0m0.176s
sys 0m0.008s

Which, as expected, is quite fast.

Using sed

$ sort -u <(sed -En 's/[^[]*\[([[:alpha:]][^]]*).*/\1/;s/ & /\n/gp;' input_file)
Printer DV650V
Printer DVX500
Scanner SON2000
