Optimising the performance of nested loops for text extraction/manipulation
I receive operations logs, one per line, in an $input_file: each line describes which resources were used (printers, scanners, etc.). Below are examples of valid lines:
crap I don't care [] # This should not happen, but sometimes it does
crap I don't care [Printer DVX500]
crap I really do not care [Printer DV650V & Scanner SON2000]
some other crap [Printer DVX500 & Printer DV650V & Scanner SON2000]
I have no control over how these lines are produced. I only know that the resources are supposed to be described at the end of each line, between square brackets ([...]). Sometimes, like in the first line above, it is empty; sometimes several resources are used, in which case resources are separated with $SEP ($SEP=="&" in the examples).
For usage purposes, I need to produce a sorted list of (unique) resources in an $output_file. The following code works, and follows these steps:
- Extract the content between [] at the end of the line into $content
- If $content is not empty, try and split it along $SEP into $parts
- For each element in $parts, trim trailing/leading whitespace(s)
- Put the trimmed part into an array (which naturally keeps unique resources only)
- Write everything into $output_file
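To make the steps concrete, here is a walk-through of the third sample line above (an illustration only, not part of the script):

# line    : crap I really do not care [Printer DV650V & Scanner SON2000]
# content : Printer DV650V & Scanner SON2000
# parts   : "Printer DV650V " and " Scanner SON2000"
# trimmed : "Printer DV650V" and "Scanner SON2000"  (both added to the array)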
With the line examples above, the result should be the following:
Printer DVX500
Printer DV650V
Scanner SON2000
Here is the script I am using, which is functionally correct (at least so far):
# Get the input and output file paths from the command line arguments
input_file="$1"
output_file="$2"
# Check if the input file exists
if [[ ! -f "$input_file" ]]; then
    echo "Error: Input file '$input_file' not found!"
    exit 1
fi
# Use a hash table (associative array) to avoid duplicates directly in memory
declare -A unique_lines
linenumber=0
# Process the file line by line
while IFS= read -r line; do
    linenumber=$[$linenumber +1]
    if (( $linenumber%100 == 0 )); then
        echo "line=$linenumber"
    fi
    # Use sed to extract the content between the last "[" and "]"
    content=$(echo "$line" | sed -E 's/.*\[(.*)\].*/\1/')
    # If content is not empty, process the extracted part
    if [[ -n "$content" ]]; then
        # Split the content by "&" and process each part
        IFS="&" read -ra parts <<< "$content"
        for part in "${parts[@]}"; do
            # Trim leading/trailing whitespace from each part
            trimmed_part=$(echo "$part" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
            # Store the trimmed part in the associative array (ensures uniqueness)
            unique_lines["$trimmed_part"]=1
        done
    fi
done < "$input_file"
# Write the sorted, unique results to the output file
for part in "${!unique_lines[@]}"; do
    echo "$part"
done | sort > "$output_file" # supposed to be unique already, due to how $unique_lines is fed
echo "Processing completed. Output saved to $output_file"
The issue is performance: with an $input_file of 5,000-10,000 lines, the previous script takes between 12 and 30 minutes. I tried several things, but could not figure out a way of really optimising it:
- using parameter expansion for extracting the resource list, instead of sed (a sketch of what I mean is shown after this list);
- outputting into $output_file along the extraction, instead of storing results in the $unique_lines array (which is way slower because of the many write operations).
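For reference, a minimal sketch of that parameter-expansion variant (an illustration only, assuming the resources always sit in the last [...] pair of the line):

# Hypothetical replacement for the sed extraction, using only parameter expansion:
tmp=${line%]*}          # cut from the last "]" to the end of the line
content=${tmp##*\[}     # cut everything up to and including the last "["
# Caveat: a line with no [...] at all passes through unchanged, so a guard is still needed.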
The culprit seems to be the second read followed by the for loop, but I don't see an easy way to do Step #2 otherwise.
If you have any indication on how to optimise it, even if the general algorithm is changed, I would be happy to hear it.
EDIT: Completing the requirements list w.r.t. the brilliant commenters' remarks :-)
Thanks for the thorough analysis of my use case :-). For the sake of completeness:
- An important modification I left out: lines may contain other [...] groups, but the info I need to extract is always within the last pair. So a better test set would be the following:
crap I don't care [CRAP] [] # This should not happen, but sometimes it does
crap I don't care [CRAPPYCRAP] [Printer DVX500]
crap I really do not care [IRREL] [CRAP!] [Printer DV650V & Scanner SON2000]
some other crap [!!!] [CRAPPPYYYY] [NOTHERE] [Printer DVX500 & Printer DV650V & Scanner SON2000]
- There should not be lines without any [...] group, or at least I have never hit that case yet. Maybe adopting a defensive stance here would be better, I don't know (a possible guard is sketched after this list).
- The data inside the brackets may effectively vary (in lower/upper case, in singular/plural forms, etc.), but that does not matter much, as the rest of the workflow takes these variations into consideration.
- The data inside [...] may occasionally have trailing/leading spaces, although rarely. For my solution, that was not an issue, as I extracted the whole content between [...], then split it along the [:space:]$SEP[:space:] characters, then trimmed, which should have taken care of it.
- Finally, yes: sticking to bash is boss-mandatory, and I fully agree with you that a GPPL would have been way better (readable, faster) than bash.
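As mentioned above, a possible defensive guard (an illustration only, to be placed at the top of the read loop in my script) would simply skip lines that contain no [...] group at all:

# Hypothetical guard: silently skip lines without any [...] group
if [[ $line != *"["*"]"* ]]; then
    continue
fi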
5 Answers
You're using the wrong tool. A shell (e.g. bash) is a tool to create/destroy files and processes and sequence calls to other tools. It's specifically NOT a tool to manipulate text. The mandatory POSIX (i.e. available on all Unix boxes) tool to manipulate text is awk, so you should be using a single awk script instead of bash calling sed and other tools in a loop, which is a famously inefficient approach as well as usually resulting in lengthy, complicated, fragile, non-portable code. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
This, using any awk, will do what you appear to want to do orders of magnitude faster than what you were trying to do as well as being more concise and portable:
$ cat tst.sh
#!/usr/bin/env bash
input_file="$1"
output_file="$2"
awk '
sub(/.*\[/,"") && sub(/].*/,"") && NF {
gsub(/ *& */, ORS)
print
}
' "$input_file" | sort -u > "$output_file"
You'd run the above the same way as the script in your question:
$ ./tst.sh file outfile
and the output would be:
$ cat outfile
Printer DV650V
Printer DVX500
Scanner SON2000
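If the separator really is held in a shell variable $SEP rather than being a literal &, one possible variation (my own sketch, assuming $SEP contains no regex metacharacters or backslashes) is to pass it to awk with -v:

awk -v sep="$SEP" '
sub(/.*\[/,"") && sub(/].*/,"") && NF {
    gsub(" *" sep " *", ORS)   # split on the configurable separator
    print
}
' "$input_file" | sort -u > "$output_file"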
While I would tackle this with awk, if you must stick with a bash solution then consider:
- the $(echo ... | sed ...) construct spawns 2 new subshells each time it's invoked, i.e. once per pass through each of the loops (see the trim sketch after this list)
- now multiply that by the number of passes through both the outer and inner loops: for a 50K-line file you're spawning at least 200K subshells, and that is going to represent a significant chunk of your operating time
- eliminating those 200K subshells should get your script running in no more than a few seconds (depending on your cpu)
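For illustration, a minimal sketch of the inner loop with the trimming done purely by parameter expansion (my own example, reusing the parts and unique_lines names from the question), so the per-part echo | sed subshells disappear:

for part in "${parts[@]}"; do
    part=${part#"${part%%[![:space:]]*}"}   # strip leading whitespace
    part=${part%"${part##*[![:space:]]}"}   # strip trailing whitespace
    [[ -n $part ]] && unique_lines["$part"]=1
done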
NOTE: OP has updated the question with additional details; this requires a change to the original answer ...
Assumptions:
- we do not have to worry about single/standalone brackets
- we do not have to worry about nested brackets
- we will process data as case sensitive; this means we will treat Printer, printer and PRINTER as three distinct/different strings
- the amount of white space characters within the brackets may vary but we can rely on using the 3-character string <space>&<space> as a delimiter
General approach:
- split each line on delimiters [ and ] and process the last set of brackets; parsing the line into an array means the last set of brackets should equate to the array entry with the last odd-numbered index
- take the desired field and split on delimiter <space>&<space>
- strip leading/trailing white space from each field
- store the resulting parts as indices in an associative array; this eliminates the need for sort to filter unique values
- printf the associative array indices and pipe to sort
One bash approach:
$ cat test.script
#!/bin/bash
input_file="$1"
output_file="$2"
unset unique_parts # initialize our
declare -A unique_parts # associative array
while IFS='[]' read -ra contents            # split input line on delimiters '[' and ']' and store results in array contents[]
do
    ndx="${#contents[@]}"
    (( ndx -= (ndx%2) ? 2 : 1 ))            # find greatest odd-numbered index
    while read -r part                      # also strips leading/trailing white space
    do
        [[ -z "${part}" ]] && continue      # skip if empty
        unique_parts[${part}]=''            # store as array index
    done <<< "${contents[ndx]// & /$'\n'}"  # replace '<space>&<space>' with a newline (\n)
done < "${input_file}"
# print array indices on separate lines, sort and save in ${output_file}
printf "%s\n" "${!unique_parts[@]}" | sort > "${output_file}"
NOTE: since we've limited ourselves to strictly bash features, variables and builtins, and just one subshell/executable call (| sort), this should be significantly faster than OP's current code
Taking it for a test drive (NOTE: I duplicated OP's sample data by a factor of 12,500 to create the 50K-line input.dat file; I also edited a few rows to include leading/trailing white space in the brackets):
$ time ./test.script input.dat output.dat
real 0m0.967s
user 0m0.722s
sys 0m0.244s
$ cat output.dat
Printer DV650V
Printer DVX500
Scanner SON2000
Running Ed's latest awk|sort against 50K lines:
real 0m0.130s
user 0m0.122s
sys 0m0.016s
Running OP's current script on my system, terminated @ 10K lines, and multiplied by 5x (extrapolate to 50K line timing):
$ time ./op.script input.dat output.dat
... snip ...
line=9600
line=9700
line=9800
line=9900
line=10000
^C
10K lines 50K lines
========= =========
real 0m47.567s x5 3m57.834s
user 0m24.958s
sys 0m31.393s
NOTES:
- timings are from an Intel i7-1260P
- software versions: Ubuntu 22.04 / GNU bash 5.1.16 / GNU awk 5.1.0 / GNU sed 4.8 / GNU sort 8.32
As a comparison, with OP's original machine and script, the runtimes are the following for the same dataset (50k lines duplicated from OP's original sample):
real 9m7.913s
user 1m23.169s
sys 4m24.448s
This should be considered in contrast with Ed's solution on the same machine, with the same dataset:
real 0m0.289s
user 0m0.030s
sys 0m0.076s
NOTE: This is not the production machine, but rather my own laptop, with bash 5.2.21, gawk 5.3.0, and sort 9.0, running on Cygwin over Windows 8.1.
Perhaps something like this with sed (match the content inside the [] and split at &):
sed -nE 's/^.*\[(..*)].*$/\1/p;' < "$input_file" | sed -E 's/ & /\n/g' | sort -u
Printer DV650V
Printer DVX500
Printer FOO500
Scanner SON2000
For 10K lines
real 0m0.087s
user 0m0.101s
sys 0m0.005s
Going with the general drift that you are better off with a scripting language vs Bash, here is a Ruby version:
ruby -lne 'BEGIN{ x=Set.new }
x.merge($1.split(/\s*&\s*/)) if /.*\[([^\]]*)\]/ =~ $_
END{ x.each { |e| puts e } }' file
With your updated input (with potentially more than one set of []), it prints:
Printer DVX500
Printer DV650V
Scanner SON2000
That is 'as found, file order' which could be sorted by Ruby anyway you can imagine.
Since everyone is doing benchmarks, here is a Ruby benchmark.
First, make a bigger file:
echo "crap I don't care [CRAP] [] # This should not happen, but sometimes it does
crap I don't care [CRAPPYCRAP] [Printer DVX500]
crap I really do not care [IRREL] [CRAP!] [Printer DV650V & Scanner SON2000]
some other crap [!!!] [CRAPPPYYYY] [NOTHERE] [Printer DVX500 & Printer DV650V & Scanner SON2000]" |
awk -v cnt=100000 '
{lines[FNR]=$0}
END{for(i=0;i<cnt;i++) print lines[(i % NR)+1]}' >file
$ wc -l file
100000 file
Now use time on the Ruby above:
Printer DVX500
Printer DV650V
Scanner SON2000
real 0m0.185s
user 0m0.176s
sys 0m0.008s
Which, as expected, is quite fast.
Using sed
$ sort -u <(sed -En 's/[^[]*\[([[:alpha:]][^]]*).*/\1/;s/ & /\n/gp;' input_file)
Printer DV650V
Printer DVX500
Scanner SON2000
Comments:
- … [...] part? How about lines where [ or ] appear multiple times? – Ed Morton, Feb 19 at 20:33
- … bash don't go together well. Use a real programming language if performance is a concern. – Barmar, Feb 19 at 23:42
- What does 'sticking to bash is boss-mandatory' mean to you? Does it mean that a) you can only use bash builtins (i.e. not sed or awk), or b) that you can use bash to call mandatory POSIX tools such as sed or awk, or c) that you can use bash to call other non-POSIX tools such as perl, python, or ruby but can't write a C program, or d) something else? – Ed Morton, Feb 20 at 13:38
- … awk, better understand arrays/expansions, and even try ruby :-). They are all acceptable solutions: I am voting Ed's solution simply because of the incredible performance it exhibits (tried it on 100k lines with almost immediate completion, waw!), but this does not remove any interest to the others. – Moussa Amrani, Feb 21 at 14:17