javascript - Best practice for comparing two large files in Node.js - Stack Overflow
I want to compare two large files (5GB+) and find out whether they are the same or not. One solution I considered is hashing both with crypto and then comparing the hashes. But this would take a lot of time, since I would have to go through the entire files instead of stopping when a difference is found.
Another solution I thought of was to compare the files as they are being streamed with fs.createReadStream()
and break when a difference is found.
stream.on('data', (data) => {
// compare the data from this stream with the other stream
})
But I am not quite sure how I can have two streams that are synchronized.
asked Feb 9, 2021 at 7:48 by Rafael

Comments:
- First check if they have the same size. If the sizes are different, then logically the files are different. If the size is exactly the same, then you need to analyse further. You can sample a few random equal sections from both to check for equality; that's still going to be fairly cheap. Then you might choose to hash them. I'd expect a library can handle this for you, but I don't know of any. – VLAZ, Feb 9, 2021 at 7:52
- First check for matching file sizes. Then, just read each file in 1k blocks (probably using promises) and compare those two buffers. Put this in a loop and break the loop when you find a difference. This will be cheaper than hashing, first because you get to break as soon as a difference is found and second because you don't have to calculate the hash for each file. Comparing two buffers with buffer.compare() should be really fast and optimized. – jfriend00, Feb 9, 2021 at 8:30
- @jfriend00 that is sort of what I thought; the problem is I am not sure how to actually do it. Could you share some code please? Thanks. – Rafael, Feb 9, 2021 at 8:41
- @RafGk_ – Per your request, I added an answer with code to implement the scheme I described earlier. – jfriend00, Feb 9, 2021 at 9:05
3 Answers

As requested in your comments, if you want to see how an implementation can be written to do this, here's one. Here's how it works:
- Open each of the two files
- Compare the two file sizes. If not the same, resolve false.
- Allocate two 8k buffers (you can choose the size of buffer to use)
- Read 8k of each file (or less if there isn't 8k left in the file) into your buffers
- Compare those two buffers. If not identical, resolve false.
- When you finish comparing all the bytes, resolve true
Here's the code:
const fs = require('fs');
const fsp = fs.promises;
// resolves to true or false
async function compareFiles(fname1, fname2) {
const kReadSize = 1024 * 8;
let h1, h2;
try {
h1 = await fsp.open(fname1);
h2 = await fsp.open(fname2);
const [stat1, stat2] = await Promise.all([h1.stat(), h2.stat()]);
if (stat1.size !== stat2.size) {
return false;
}
const buf1 = Buffer.alloc(kReadSize);
const buf2 = Buffer.alloc(kReadSize);
let pos = 0;
let remainingSize = stat1.size;
while (remainingSize > 0) {
let readSize = Math.min(kReadSize, remainingSize);
let [r1, r2] = await Promise.all([h1.read(buf1, 0, readSize, pos), h2.read(buf2, 0, readSize, pos)]);
if (r1.bytesRead !== readSize || r2.bytesRead !== readSize) {
throw new Error("Failed to read desired number of bytes");
}
if (buf1.compare(buf2, 0, readSize, 0, readSize) !== 0) {
return false;
}
remainingSize -= readSize;
pos += readSize;
}
return true;
} finally {
if (h1) {
await h1.close();
}
if (h2) {
await h2.close();
}
}
}
// sample usage
compareFiles("temp.bin", "temp2.bin").then(result => {
console.log(result);
}).catch(err => {
console.log(err);
});
This could be sped up a bit by opening and closing the files in parallel using Promise.allSettled()
to track when they are both open and then both closed. However, because of the complications when one file opens successfully and the other doesn't (you don't want to leak the one opened file handle), it takes a bit more code to do that perfectly, so I kept it simpler here.
And, if you really wanted to optimize for performance, it would be worth testing larger buffers to see whether they make things faster or not.
It's also possible that buf1.equals(buf2)
might be faster than buf1.compare(buf2)
, but you have to make sure that a partial buffer read at the end of the file still works properly when using it, since .equals()
always compares the entire buffer. You could build two versions and compare their performance.
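As a sketch of how .equals() could be used safely with a partial final read (this helper is illustrative, not part of the answer's code): Buffer.subarray() returns a zero-copy view, so comparing equally sized views avoids the problem of .equals() comparing the whole allocated buffer.

```javascript
// Sketch: use .equals() on zero-copy views so a short final read
// compares only the bytes actually filled, not the whole 8k buffer.
function chunksEqual(buf1, buf2, readSize) {
  // subarray() creates a view over the same memory; no data is copied
  return buf1.subarray(0, readSize).equals(buf2.subarray(0, readSize));
}
```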
Here's a more complicated version that opens and closes the files in parallel and might be slightly faster:
const fs = require('fs');
const fsp = fs.promises;
async function compareFiles(fname1, fname2) {
const kReadSize = 1024 * 8;
let h1, h2;
try {
let openResults = await Promise.allSettled([fsp.open(fname1), fsp.open(fname2)]);
let err;
if (openResults[0].status === "fulfilled") {
h1 = openResults[0].value;
} else {
err = openResults[0].reason;
}
if (openResults[1].status === "fulfilled") {
h2 = openResults[1].value;
} else {
err = openResults[1].reason;
}
// after h1 and h2 are set (so they can be properly closed)
// throw any error we got
if (err) {
throw err;
}
const [stat1, stat2] = await Promise.all([h1.stat(), h2.stat()]);
if (stat1.size !== stat2.size) {
return false;
}
const buf1 = Buffer.alloc(kReadSize);
const buf2 = Buffer.alloc(kReadSize);
let pos = 0;
let remainingSize = stat1.size;
while (remainingSize > 0) {
let readSize = Math.min(kReadSize, remainingSize);
let [r1, r2] = await Promise.all([h1.read(buf1, 0, readSize, pos), h2.read(buf2, 0, readSize, pos)]);
if (r1.bytesRead !== readSize || r2.bytesRead !== readSize) {
throw new Error("Failed to read desired number of bytes");
}
if (buf1.compare(buf2, 0, readSize, 0, readSize) !== 0) {
return false;
}
remainingSize -= readSize;
pos += readSize;
}
return true;
} finally {
// does not return file close errors
// but does hold resolving the promise until the files are closed
// or had an error trying to close them
// Since we didn't write to the files, a close error would be fairly
// unprecedented unless the disk went down
const closePromises = [];
if (h1) {
closePromises.push(h1.close());
}
if (h2) {
closePromises.push(h2.close());
}
await Promise.allSettled(closePromises);
}
}
compareFiles("temp.bin", "temp2.bin").then(result => {
console.log(result);
}).catch(err => {
console.log(err);
});
There are certainly libraries that do this, and file-sync-cmp is very popular (270k weekly downloads). It does the comparison in the simplest way, by reading the same number of bytes from the two files into different buffers, and then comparing the buffers byte by byte.
There's also a more modern library, filecompare, "using native Promises and native BufferTools (alloc and Buffer comparisons)".
Whenever practical, don't reinvent the wheel :)
Since the difference might be at the very end of the files, I guess calculating a hash of the files is the most straightforward and secure process, albeit a costly one.
Did you try the md5-file npm package and get some performance indicators?