I am trying to run a function once for each file from a large list that I am piping in.
Here is some example code that simply greps each file whose name arrives on stdin.
In my real code I am running a program that takes significant time to process each file but only generates output for some of the files processed, so grep is comparable.
Ideally I would also like to capture the program's exit code, to know which files failed to process, and to see its stderr echoed to the terminal's stderr.
#!/bin/bash
searchterm="$1"
filelist=$(cat /dev/stdin)
numfiles=$(echo "$filelist" | wc -l)
currfileno=0
while IFS= read -r file; do
    ((++currfileno))
    echo -ne "\r\033[K" 1>&2 # clears the line
    echo -ne "$currfileno/$numfiles $file" 1>&2
    grep "$searchterm" "$file"
done <<< "$filelist"
I saved this as test_so_stream, and I can run it with find ~ -type f -iname \*.txt | test_so_stream searchtext.
The problem is that when I run it on a large list of thousands of files, nothing starts processing at all until the entire list is loaded, which can take significant time.
What I would like to happen is for it to start processing the first file immediately as soon as the first filename appears on stdin.
I know I could use a pipe for this, but I also would like the statusline (including the current file number and the total number of files) updated to stderr after the processing of each file, or every second or so.
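As a point of reference, a minimal sketch of the direct-streaming version (assuming the same positional argument): it starts on the first filename immediately, but can only show "?" for the total, which is exactly the tension described above.
#!/bin/bash
# Sketch: read stdin line by line so the first file is processed immediately.
searchterm="$1"
currfileno=0
while IFS= read -r file; do
    ((++currfileno))
    printf '\r\033[K%s/? %s' "$currfileno" "$file" >&2   # total not yet known
    grep "$searchterm" "$file"
done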
I presume I'd need some kind of multithreading to process the list separately from the actual worker process(es), but I'm not sure how to achieve that using bash.
Bonus points if I can process multiple files at once in a worker pool, although I do not want the output from multiple files to be intermingled: I need the full output of one file, then the full output of the next, and so on. This is low priority for me if it's complicated, and it is not the focus of my question.
I have tried to use parallel and xargs, and I know at least parallel can process multiple files at once, in fact very close to what I want, even with the output not intermingled. But I still can't work out how to have the status line updated at the same time so I know how far through the list of files it is. I know about the --bar option of parallel, but it is too ugly for my taste and not customizable (I would like the status bar to have colors and show the filename being processed).
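For concreteness, a sketch of the sort of invocation meant here (assuming GNU parallel; -k keeps each file's output grouped in input order, and --joblog records each job's exit status, which would also cover the failed-files requirement):
# Sketch, assuming GNU parallel; grep exits 1 for "no match", so only >1 is a real error.
cat filelist | parallel -k --joblog jobs.log grep searchterm
# jobs.log is tab-separated; column 7 is the exit value, column 9 the command:
awk -F'\t' 'NR > 1 && $7 > 1 { print $9 }' jobs.log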
How can I achieve this?
Edit, to answer @markp-fuso's questions in the comments:
I know that stderr/stdout both show on the same terminal.
I would like the status bar to go to stderr so I can pipe the program's entire stdout somewhere to save and further process it. When I do this I will not be saving stderr; that's just so I can watch the program while it's working. My example program does do this: it shows the status and keeps overwriting that line until there's some output. In my full program it clears the status line and overwrites it with the output, if there is output for that file. I omitted the output check and the line clear from my example program because that's not the part of the question that's important to me.
Re: the status bar not knowing the total number of files: I want the status bar to show the current total and update it as more filenames are piped in, e.g. like pv does. I imagine one process that loads a global filelist from stdin and echoes the status bar to stderr every second, while another process simultaneously loops through that global filelist, processing each file. The problem I'm trying to avoid is that the parent process does not know the total number of files immediately; it takes significant time to generate the entire list, and I would like processing to start immediately.
Perhaps calling it a status bar may be overstating what I mean. I just want to be able to see something showing how far it is through the list of files, and which file it is currently processing. Nothing super fancy but I want it to be in color so it stands out on the console from the stdout data. One colored line at the bottom that is continuously overwritten to show me that it is still working.
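A minimal sketch of that two-process idea (all names hypothetical): a background ticker redraws a colored status line once a second while the foreground loop processes filenames as they arrive. The growing total would still need stdin split with tee, as the first answer below demonstrates.
#!/bin/bash
# Sketch: background status ticker + foreground worker loop.
searchterm="$1"
state=$(mktemp)    # holds "count current_file"; display-only, so races are benign
( while sleep 1; do
      { read -r n file; } < "$state" 2>/dev/null
      printf '\r\033[K\033[33m%s done, current: %s\033[0m' "${n:-0}" "${file:-}" >&2
  done ) &
ticker=$!
trap 'kill "$ticker" 2>/dev/null; rm -f "$state"' EXIT
n=0
while IFS= read -r file; do
    ((n++))
    printf '%s %s\n' "$n" "$file" > "$state"
    grep "$searchterm" "$file"
done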
Asked in the comments: "if you manage to spawn 3 parallel threads, how exactly do you envision their outputs and status bar(s) being displayed in a single console/terminal window?"
Exactly like cat filelist | parallel grep searchterm does, i.e. the grep output for each file shown consecutively, not intermingled. The status bar can appear anywhere (because I'm not saving that), although I would rather it appeared in between the output: if there's another grep output, it should overwrite the status line at the bottom, then more status line, and the cycle continues. So the status line is just continually overwritten to show me what file it's up to.
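Sketched, that overwrite cycle looks like this ($file_output and $status_line are hypothetical placeholders; an ANSI terminal is assumed):
printf '\r\033[K' >&2              # wipe the current status line
printf '%s\n' "$file_output"       # the real (grep) output goes to stdout
printf '%s' "$status_line" >&2     # redraw the status line beneath it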
2 Answers
I'm not 100% clear on all of OP's requirements, so I'm going to focus on a stderr to status line and stdout to a file approach. Hopefully this will get OP a bit closer to the final goal ...
Assumptions/understandings:
- one program generates a list of files (we'll call this gen_output; the filenames are output-#)
- this output needs to be split and fed as stdin to two different programs ...
- one program counts the number of files read from stdin (we'll call this count_input) and prints the running count to the file counter
- one program processes the files read from stdin while also generating a status bar (we'll call this process_input)
- the status bar shows a count of processed files, plus the 'count' (from count_input) at that point in time, plus the current file being processed; the status bar is printed to stderr
- process_input's stdout is written to the file process_input.stdout
The 3 programs:
######################### generate 10 outputs at 0.5 second intervals
$ cat gen_output
#!/bin/bash
for ((i=1; i<=10; i++))
do
    echo "output-$i"
    sleep .5
done
######################### for each input update a counter and overwrite file 'counter'
$ cat count_input
#!/bin/bash
count=0
while read -r input
do
    ((count++))
    echo "${count}" > counter
done
######################### for each input read current total from file 'counter' and then print status line
$ cat process_input
#!/bin/bash
touch counter                  # make sure 'counter' exists before the first read
count=0
cl_eol=$(tput el)              # clear to end of line
while read -r input
do
    ((count++))
    read -r total < counter    # running total as last written by count_input
    printf "\rprocessing %s/%s %s%s" "${count}" "${total}" "${input}" "${cl_eol}" >&2
    echo "something to stdout - ${count} / ${total}"
    sleep 2
done > process_input.stdout
printf "\nDone.\n" >&2
Using tee to feed a copy of gen_output's output to process_input (via process substitution) before piping to count_input:
$ ./gen_output | tee >(./process_input) | ./count_input
I've got a .gif of this in action but SO is not allowing me to upload the image at this time, so imagine the following lines being displayed one at a time at 2-second intervals, each overwriting the previous line:
processing 1/1 output-1
processing 2/4 output-2
processing 3/8 output-3
processing 4/10 output-4
processing 5/10 output-5
processing 6/10 output-6
processing 7/10 output-7
processing 8/10 output-8
processing 9/10 output-9
processing 10/10 output-10
And then a new line is displayed:
Done.
And the stdout:
$ cat process_input.stdout
something to stdout - 1 / 1
something to stdout - 2 / 4
something to stdout - 3 / 8
something to stdout - 4 / 10
something to stdout - 5 / 10
something to stdout - 6 / 10
something to stdout - 7 / 10
something to stdout - 8 / 10
something to stdout - 9 / 10
something to stdout - 10 / 10
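OP also asked about capturing each file's exit status; a minimal, hypothetical extension of process_input's loop, with a real worker standing in for the echo (failed_files is an assumed log name, and searchterm is assumed to be passed in as before):
while read -r input
do
    ((count++))
    read -r total < counter
    printf "\rprocessing %s/%s %s%s" "${count}" "${total}" "${input}" "${cl_eol}" >&2
    grep "${searchterm}" "${input}"   # hypothetical worker in place of the echo
    rc=$?
    if (( rc > 1 )); then             # grep: 0 = match, 1 = no match, >1 = error
        echo "${input} rc=${rc}" >> failed_files
    fi
done > process_input.stdout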
One way could be to define a function that prints your progress bar. Like this:
bar_full="=================================================="
len_bar=${#bar_full}
bar_empty=$(printf '%*s' "$len_bar" '')   # a run of $len_bar spaces
function show_bar() {
    step=$1
    step_total=$2
    progress_ticks=$(( (step * len_bar) / step_total ))
    progress_percent=$(( (step * 100) / step_total ))
    echo -n -e "\r|${bar_full:0:progress_ticks}${bar_empty:progress_ticks}| ${progress_percent}%"
}
Then you use it in your code. As always with progress bars, you need to know how many steps you have up front. So let's assume you collect the files with a glob (safer than parsing ls output); then you could do it as in the following example:
files=(*.txt)
max_value=${#files[@]}
i=0
for file in "${files[@]}"; do
    ((i++))
    show_bar "$i" "$max_value"
    sleep 1s
done
echo
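OP also wanted the status line in color and showing the current filename; a small, hypothetical variant using tput (setaf picks a foreground color, sgr0 resets, el clears to end of line):
# Hypothetical colored variant of show_bar, including the current file.
show_bar_color() {
    local step=$1 step_total=$2 file=$3
    local pct=$(( (step * 100) / step_total ))
    printf '\r%s%d/%d (%d%%) %s%s%s' \
        "$(tput setaf 3)" "$step" "$step_total" "$pct" "$file" "$(tput sgr0)" "$(tput el)" >&2
}
# usage inside the loop: show_bar_color "$i" "$max_value" "$file"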
Comments from markp-fuso (Mar 26 at 16:47):
- if the total number of files (N) is not known until you read all of stdin, then how do you expect to immediately start processing stdin and print a status bar of 1/N <filename>? Does the parent process know, in advance, the number of files to be processed?
- … your two echo -ne calls, echo 'grep output', your two echo -ne calls; you should see 3 lines of output. Assuming you want the 'status bar' to remain in one place (eg, bottom of the console/terminal window), you're going to need to incorporate some sort of cursor/curses processing (eg, tput) to allow for the explicit placement of output in the console/terminal window.
- … (1,7,24)/N <file1> <file7> <file24> ... keeping in mind the next question of how you would build this single status bar from the current processing status of 3 parallel threads (aka subshells).