Process many files in bash while simultaneously updating a progress bar

I am trying to run a function once for each file from a large list that I am piping in.

Here is some example code, here just grepping in the files whose names are coming from stdin.

In my real code I am running a program that takes significant time to process each file, but only generates output for some of the files that were processed, so grep is comparable.

I would also ideally like to capture the error code from the program to know which files failed to process and see the stderr output from the program echoed to the terminal's stderr.

#!/bin/bash

searchterm="$1"

filelist=$(cat /dev/stdin)

numfiles=$(echo "$filelist" | wc -l)
currfileno=0

while IFS= read -r file; do
    ((++currfileno))    
    echo -ne "\r\033[K" 1>&2 # clears the line
    echo -ne "$currfileno/$numfiles $file" 1>&2
    grep "$searchterm" "$file"
done <<< "$filelist"

I saved this as test_so_stream, and I can run it with find ~ -type f -iname \*.txt | test_so_stream searchtext.

The problem is that when I run it on a large list of thousands of files, nothing starts processing at all until the entire list is loaded, which can take significant time.

What I would like to happen is for it to start processing the first file immediately as soon as the first filename appears on stdin.

I know I could use a pipe for this, but I would also like the status line (including the current file number and the total number of files) to be updated on stderr after each file is processed, or every second or so.

I presume I'd need some kind of multithreading to process the list separately from the actual worker process(es), but I'm not sure how to achieve that using bash.

Bonus points if I can process multiple files at once in a worker pool, although I do not want the output from multiple files to be intermingled, I need the full output of one file, then the full output of the next, etc. This is a low priority for me if it's complicated and is not the focus of my question.

I have tried parallel and xargs. I know parallel can process multiple files at once, and it even keeps the output unmingled, which is very close to what I want, but I still can't work out how to update a status line at the same time so I can see how far through the list it is. I know about parallel's --bar option, but it is too ugly for my taste and not customizable (I would like the status bar to have colors and show the filename being processed).

How can I achieve this?

Edit to answer @markp-fuso's questions in the comments:

I know that stderr/stdout both show on the same terminal.

I would like the status bar to go to stderr so I can pipe the program's entire stdout somewhere to save and further process it. When I do that I will not be saving stderr; it's just so I can watch the program while it's working. My example program does do this: it shows the status and keeps overwriting that line until there is some output. In my full program it clears the status line and overwrites it with the output, if there is output for that file. I omitted the output check and the line clear from my example program because that's not the part of the question that's important to me.

Re: the status bar not knowing the total number of files: I want the status bar to show the running total and update it as more filenames are piped in, e.g. like pv does. I imagine having one process that loads a global filelist from stdin and echoes the status bar to stderr every second, while another process simultaneously loops through that global filelist, processing every file. The problem I'm trying to avoid is that the parent process does not know the total number of files immediately; it takes significant time to generate the entire list, and I would like processing to start immediately.
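
Something like this untested sketch is the shape I have in mind (the temp files, the DONE sentinel, and GNU tail's --pid are just my placeholders; cleanup and the once-a-second refresh are omitted):

#!/bin/bash
# untested sketch: a reader fills a shared list + running total, a worker follows it

searchterm="$1"
list=$(mktemp)
totalfile=$(mktemp)
echo 0 > "$totalfile"

# reader: copy stdin into $list as it arrives, keeping a running total
{
    n=0
    while IFS= read -r f; do
        printf '%s\n' "$f" >> "$list"
        echo $((++n)) > "$totalfile"
    done
    echo DONE >> "$list"    # sentinel so the worker knows when to stop
} &

# worker: follow the growing list, status line to stderr, results to stdout
i=0
while IFS= read -r file; do
    [[ $file == DONE ]] && break
    read -r total < "$totalfile"
    printf '\r\033[K%d/%d %s' "$((++i))" "$total" "$file" >&2
    grep "$searchterm" "$file"
done < <(tail --pid=$$ -n +1 -f "$list")    # GNU tail; --pid ends it when the script exits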

Perhaps calling it a status bar may be overstating what I mean. I just want to be able to see something showing how far it is through the list of files, and which file it is currently processing. Nothing super fancy but I want it to be in color so it stands out on the console from the stdout data. One colored line at the bottom that is continuously overwritten to show me that it is still working.

Re: "if you manage to spawn 3 parallel threads, how exactly do you envision their outputs and status bar(s) being displayed in a single console/terminal window?"

Exactly like cat filelist | parallel grep searchterm does, i.e. the grep output for each file shown consecutively, not intermingled. The status bar can appear anywhere (because I'm not saving it), although I would rather it appeared below the output: whenever there is more grep output, it overwrites the status line at the bottom, then a fresh status line is drawn, and the cycle continues. So the status line is just continually overwritten to show me which file it's up to.

  • if the total number of files (N) is not known until you read all of stdin, then how do you expect to immediately start processing stdin and print a status bar of 1/N <filename>? does the parent process know, in advance, the number of files to be processed? – markp-fuso Commented Mar 26 at 16:47
  • have you tried running: your two echo -ne calls, echo 'grep output', your two echo -ne calls; you should see 3 lines of output; assuming you want the 'status bar' to remain in one place (eg, bottom of console/terminal window) then you're going to need to incorporate some sort of cursor/curses processing (eg, tput) to allow for the explicit placement of output in the console/terminal window – markp-fuso Commented Mar 26 at 16:47
  • your code and comments seem to imply a belief that stdout and stderr are somehow printed to different areas of the console/terminal window; this is not true; both (stdout and stderr) are printed to the current location of the cursor; while you can dump stdout/stderr to different areas you'll (again) need to add cursor/curses processing calls – markp-fuso Commented Mar 26 at 16:47
  • if you manage to spawn 3 parallel threads, how exactly do you envision their outputs and status bar(s) being displayed in a single console/terminal window? how do you keep the output of 3 threads from being interspersed/scrambled in the console/terminal window? are you expecting 3 separate status bars or a single status bar that reads something like (1,7,24)/N <file1> <file7> <file24> ... keeping in mind the next question of how you would build this single status bar from the current processing status of 3 parallel threads (aka subshells) – markp-fuso Commented Mar 26 at 16:47
  • at this point there are a lot of unknowns about how the 'status bar' is supposed to behave; it's not clear, from what we've been told, if a lot of details have been left out or if we're dealing with an incomplete design/requirement – markp-fuso Commented Mar 26 at 16:48

2 Answers


I'm not 100% clear on all of OP's requirements, so I'm going to focus on an approach that sends the status line to stderr and the data to stdout. Hopefully this will get OP a bit closer to the final goal ...

Assumptions/understandings:

  • one program is generating a list of files (we'll call this gen_output; filenames are output-#)
  • this output needs to be split and fed as stdin to two different programs ...
  • one program counts the numbers of files read from stdin (we'll call this count_input) and prints the new count to file counter
  • one program processes the files read from input while also generating a status bar (we'll call this process_input)
  • the status bar should be a count of processed files plus the 'count' (from count_input) at that point in time, plus the current file being processed
  • the status bar is printed to stderr
  • the process_input stdout is written to file process_input.stdout

The 3 programs:

######################### generate 10 outputs at 0.5 second intervals

$ cat gen_output
#!/bin/bash

for ((i=1;i<=10;i++))
do
    echo "output-$i"
    sleep .5
done

######################### for each input update a counter and overwrite file 'counter'

$ cat count_input
#!/bin/bash

count=0

while read -r input
do
    ((count++))
    echo "${count}" > counter
done
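
One nit with count_input: process_input can occasionally catch counter mid-write. A write-then-rename variant keeps each update atomic (same behavior otherwise; counter.tmp is just a scratch name):

while read -r input
do
    ((count++))
    echo "${count}" > counter.tmp && mv counter.tmp counter   # rename is atomic
done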

######################### for each input read current total from file 'counter' and then print status line

$ cat process_input
#!/bin/bash

touch counter
count=0
cl_eol=$(tput el)             # clear to end of line

while read -r input
do
    ((count++))
    read -r total < counter
    total=${total:-${count}}  # counter may still be empty on the very first read

    printf "\rprocessing %s/%s %s%s" "${count}" "${total}" "${input}" "${cl_eol}" >&2
    echo "something to stdout - ${count} / ${total}"
    sleep 2
done > process_input.stdout

printf "\nDone.\n" >&2

Using tee to feed a copy of gen_output to process_input (via process substitution) before piping into count_input; because tee forwards each line as soon as it arrives, processing starts on the first filename while the total in counter keeps growing:

$ ./gen_output | tee >(./process_input) | ./count_input

I've got a .gif of this in action but SO is not allowing me to upload the image at this time so imagine the following lines being displayed, one at a time at 2 second intervals, while overwriting the previous line:

processing 1/1 output-1
processing 2/4 output-2
processing 3/8 output-3
processing 4/10 output-4
processing 5/10 output-5
processing 6/10 output-6
processing 7/10 output-7
processing 8/10 output-8
processing 9/10 output-9
processing 10/10 output-10

And then a new line is displayed:

Done.

And the stdout:

$ cat process_input.stdout
something to stdout - 1 / 1
something to stdout - 2 / 4
something to stdout - 3 / 8
something to stdout - 4 / 10
something to stdout - 5 / 10
something to stdout - 6 / 10
something to stdout - 7 / 10
something to stdout - 8 / 10
something to stdout - 9 / 10
something to stdout - 10 / 10
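
OP also asked about capturing the exit code of the per-file command; that isn't addressed above, but here is a hedged sketch of how it could be bolted onto the processing loop (grep stands in for the real worker; failed.list is a made-up name):

while IFS= read -r file
do
    grep "$searchterm" "$file"    # worker stderr is left alone, so it reaches the terminal
    rc=$?                         # capture the status before anything else overwrites it
    if (( rc >= 2 ))              # for grep: 1 = no match, 2 = real error
    then
        printf '%s\n' "$file" >> failed.list
    fi
done

For a worker where any nonzero status means failure, the test is simply (( rc != 0 )).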

One way could be to define a function that prints your progress bar, like this:

bar_full="=================================================="
bar_empty="                                                  "
len_bar=${#bar_full}

function show_bar() {
    local step=$1
    local step_total=$2
    # filled ticks and percentage for this step
    local progress_ticks=$(( step * len_bar / step_total ))
    local progress_percent=$(( step * 100 / step_total ))
    echo -n -e "\r|${bar_full:0:progress_ticks}${bar_empty:progress_ticks}| ${progress_percent}%"
}

Then you use it in your code. As always with progress bars, you need to know how many steps there are in total. So, assuming the files come from a *.txt glob (safer than parsing ls output), you could do it as in the following example:

files=(*.txt)               # a glob avoids the word-splitting problems of parsing ls
max_value=${#files[@]}
i=0
for file in "${files[@]}" ; do
   ((i++))
   show_bar $i ${max_value}
   sleep 1s
done
echo
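
Since the question mentioned wanting the bar in color and showing the current filename, both are small additions to show_bar; here is a sketch using tput, with the bar sent to stderr as requested (the color choice is arbitrary):

green=$(tput setaf 2)
reset=$(tput sgr0)

function show_bar() {
    local step=$1 step_total=$2 file=$3
    local ticks=$(( step * len_bar / step_total ))
    local percent=$(( step * 100 / step_total ))
    echo -n -e "\r${green}|${bar_full:0:ticks}${bar_empty:ticks}| ${percent}% ${file}${reset}" >&2
}

and call it as show_bar $i ${max_value} "$file" inside the loop.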
