I'm reading a stream, which is tested with a regex:

var deviceReadStream = fs.createReadStream("/path/to/stream");

deviceReadStream.on('data',function(data){
  if( data.match(aRegex) )
    //do something
});

But since the stream is split into several chunks, a match that straddles a chunk boundary can be missed. Is there a better pattern for continuously testing a stream against a regex?

More details

The stream is the content of a crashed filesystem. I am searching for an ext2 signature (0xef53). Since I do not know how the chunks are split, the signature could be split across two chunks and go undetected.

So I used a loop, which lets me decide myself how the chunks are delimited, i.e. by filesystem block.

But using streams seems to be a better pattern, so how can I use streams while defining the chunk size myself?

Asked Jul 22, 2015 at 9:46 by Gaël Barbin; edited May 23, 2017 at 12:31 by Community Bot.
  • What kind of data do you have coming in? What is the regular expression that you're checking with? – d0nut Commented Aug 21, 2015 at 13:29
  • If the chunks you get from the stream are multiples of what is expected to be matched, there probably won't be a problem. However, this seems incredibly unlikely if we have a random regex and random chunks. Therefore, what are the regex and the chunks? – ndnenkov Commented Aug 21, 2015 at 14:35

3 Answers

Assuming your code just needs to search for the signature 0xef53 (as specified in the "more details" part of your question)...

One way to do this and keep using a regex is to keep a reference to the previous data buffer, concatenate it with the current data buffer, and run the regex on the result. It's a bit heavy on CPU usage, since it effectively scans each data buffer twice (and there's a lot of memory allocation due to the concatenation). It is relatively easy to read, though, so it should be maintainable in the future.

Here's an example of what the code would look like:

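var fs = require('fs');

// Assuming binary data: 'latin1' maps each byte to one character,
// so the bytes survive the conversion to a string
var deviceReadStream = fs.createReadStream("/path/to/stream", { encoding: 'latin1' });
var prevData = '';

deviceReadStream.on('data', function (data) {
  // search the previous and current chunks together, so a match split across them is found
  var buffer = prevData + data;
  if (buffer.match(aRegex)) {
    // do something
  }

  // the braces above matter: without them this assignment would become the body of the `if`
  prevData = data;
});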

Another option would be to do the character comparisons more manually, so that the code can catch a signature that is split across data buffers. You can see a solution to that in this related question: Efficient way to search a stream for a string. According to the blog post of the top answer, the Haxe code he wrote can be built to produce JavaScript, which you can then use. Or you could write your own custom code to do the search, since the signature that you're looking for is only 4 hex digits (2 bytes) long.
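As a rough illustration of the "write your own custom code" option, here is a minimal sketch that carries only the last few bytes of each chunk into the next search, instead of the whole previous buffer. The name `createSignatureScanner` and its callback are hypothetical, and the byte order of the signature passed in is an assumption you would need to check against your data.

```javascript
// Sketch: find a byte signature in a sequence of chunks, even when it straddles a boundary.
// `createSignatureScanner` is a hypothetical helper, not part of Node.js.
function createSignatureScanner(signature, onMatch) {
  let tail = Buffer.alloc(0); // last (signature.length - 1) bytes seen so far
  let consumed = 0;           // absolute offset of the first byte of the next chunk

  return function scan(chunk) {
    const haystack = Buffer.concat([tail, chunk]);
    const base = consumed - tail.length; // absolute offset of haystack[0]

    let index = haystack.indexOf(signature);
    while (index !== -1) {
      onMatch(base + index);
      index = haystack.indexOf(signature, index + 1);
    }

    consumed += chunk.length;
    // The tail is too short to hold a whole signature, so no match is reported twice.
    const keep = Math.min(signature.length - 1, haystack.length);
    tail = haystack.subarray(haystack.length - keep);
  };
}

// Possible usage (path and byte order are placeholders):
//   const fs = require('fs');
//   const scan = createSignatureScanner(Buffer.from([0x53, 0xef]), function (offset) {
//     console.log('signature at byte', offset);
//   });
//   fs.createReadStream('/path/to/stream').on('data', scan);
```

Compared with concatenating whole buffers, this copies each chunk once and rescans only `signature.length - 1` bytes, at the cost of being limited to fixed byte sequences rather than arbitrary regexes.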

First, if you are determined to use a regex with Node.js, give PCRE a try. A Node wrapper for PCRE is available. PCRE can be configured to do partial matches that can resume across buffer boundaries.

You might, though, just grep (or fgrep for multiple static strings) for a byte offset from the terminal. You can then follow it up with xxd and less to view it or dd to extract a portion.

For example, to get offsets with grep:

grep --text --byte-offset --only-matching --perl-regex "\xef\x53" recovery.img

Note that grep command-line options can vary depending on your distro.

You could also look at bgrep though I haven't used it.

I have had good luck doing recovery using various shell tools and scripts.

A couple of other tangential comments:

  1. Keep in mind the endianness of whatever you are searching.
  2. Take an image since you are doing a recovery, if you have not already. Among other perils, if a device is starting to fail, further access can make it worse.
  3. Look into data carving tools.
  4. As you mentioned, files may be fragmented. Still, I would expect partitions and files to start on sector boundaries. As far as I know, the magic would not typically be split.
  5. Be careful not to inadvertently write to the device you are recovering.
  6. As you may know, if you reconstruct the image you may be able to mount the image using a loopback driver.
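To make the endianness point in item 1 concrete, here is a small sketch showing that the same 16-bit value 0xef53 becomes two different byte sequences depending on byte order. As far as I know ext2 stores its on-disk fields little-endian, but treat that as an assumption to verify against the ext2 documentation; `hasMagicAt` is a hypothetical helper.

```javascript
// Sketch: one 16-bit value, two possible byte sequences on disk.
const MAGIC = 0xef53;

const littleEndian = Buffer.alloc(2);
littleEndian.writeUInt16LE(MAGIC, 0); // bytes 53 ef

const bigEndian = Buffer.alloc(2);
bigEndian.writeUInt16BE(MAGIC, 0);    // bytes ef 53

// Hypothetical helper: does `block` hold the magic at `offset`, read as little-endian?
function hasMagicAt(block, offset) {
  return offset + 2 <= block.length && block.readUInt16LE(offset) === MAGIC;
}
```

A pattern such as "\xef\x53" looks for the big-endian order, so if the data is little-endian the bytes in the search pattern need to be swapped.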

I would go with looking at the data stream as a moving window of size 6 bytes.

For example, if you have the following file (in bytes): 23, 34, 45, 67, 76

A moving window of 2 passing over the data will be:

[23, 34]
[34, 45]
[45, 67]
[67, 76]
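The listing above can be reproduced with a few lines of code. This is only a sketch on an in-memory array, with a hypothetical helper name `slidingWindows`; it does not deal with streams or chunk boundaries.

```javascript
// Sketch: list every window of `size` consecutive items of `bytes`.
// `slidingWindows` is a hypothetical helper name.
function slidingWindows(bytes, size) {
  const windows = [];
  for (let start = 0; start + size <= bytes.length; start++) {
    windows.push(bytes.slice(start, start + size));
  }
  return windows;
}

console.log(slidingWindows([23, 34, 45, 67, 76], 2));
// prints [ [ 23, 34 ], [ 34, 45 ], [ 45, 67 ], [ 67, 76 ] ]
```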

I propose going over these windows looking for your string.

var Stream = require('stream');
var fs = require('fs');

var exampleStream = fs.createReadStream("./dump.dmp");
var matchCounter = 0;

// Note: this compares against the six ASCII characters "0xEF53",
// not against the two raw bytes 0xEF 0x53
windowStream(exampleStream, 6).on('window', function(buffer){
    if (buffer.toString() === '0xEF53') {
        ++matchCounter;
    }
}).on('end', function(){
    console.log('done scanning the file, found', matchCounter);
});

// Emits a 'window' event for every run of `windowSize` consecutive bytes,
// carrying leftover bytes from one chunk into the next
function windowStream(inputStream, windowSize) {
    var outStream = new Stream();
    var soFar = [];
    inputStream.on('data', function(data){
        Array.prototype.slice.call(data).forEach(function(byte){
            soFar.push(byte);
            if (soFar.length === windowSize) {
                // Buffer.from replaces the deprecated `new Buffer(array)` constructor
                outStream.emit('window', Buffer.from(soFar));
                soFar.shift();
            }
        });
    });
    inputStream.on('end', function(){
        outStream.emit('end');
    });
    return outStream;
}

Usually I'm not a fan of going over bytes when you actually need the underlying string. In UTF-8 there are cases where it might cause some issues, but assuming everything is in English it should be fine. The example can be improved to support these cases by using a string decoder.
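To show what a string decoder buys you, here is a minimal sketch using Node's `string_decoder` module: a multi-byte character that is split across two chunks is held back until its remaining bytes arrive. The helper `decodeChunks` is a hypothetical name used only for this illustration.

```javascript
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: decode a list of Buffer chunks, returning one string per chunk
// plus whatever the decoder flushes at the end.
function decodeChunks(chunks) {
  const decoder = new StringDecoder('utf8');
  const pieces = chunks.map(function (chunk) {
    return decoder.write(chunk); // incomplete trailing bytes are buffered, not emitted
  });
  pieces.push(decoder.end());
  return pieces;
}

// '€' is three bytes in UTF-8 (e2 82 ac); split it across two chunks.
const euro = Buffer.from('€', 'utf8');
console.log(decodeChunks([euro.subarray(0, 2), euro.subarray(2)]));
```

The first chunk decodes to an empty string and the character only appears once the second chunk completes it, whereas calling `toString()` on each chunk separately would not reassemble it.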

EDIT

Here is a UTF8 version

var Stream = require('stream');
var fs = require('fs');

var exampleStream = fs.createReadStream("./dump.dmp", {encoding: 'utf8'});
var matchCounter = 0;

windowStream(exampleStream, 6).on('window', function(windowStr){
    if (windowStr === '0xEF53') {
        ++matchCounter;
    }
}).on('end', function(){
    console.log('done scanning the file, found', matchCounter);
});
function windowStream(inputStream, windowSize) {
    var outStream = new Stream();
    var soFar = "";
    inputStream.on('data', function(data){
        Array.prototype.slice.call(data).forEach(function(char){
            soFar += char;
            if (soFar.length === windowSize) {
                outStream.emit('window', soFar);
                soFar = soFar.slice(1);
            }
        });
    });
    inputStream.on('end', function(){
        outStream.emit('end');
    });
    return outStream;
}

Tags: javascript, Parsing a stream without clipping, Stack Overflow