I need to read a large zip file in Node.js and process each file (an approx. 100 MB zip file containing roughly 40,000 XML files, each about 500 KB uncompressed). I am looking for a 'streaming' solution that has acceptable speed and does not require keeping the whole dataset in memory (JSZip and node-zip worked for me, but they keep everything in RAM and the performance is not good enough). A quick attempt in C# shows that loading, unpacking and parsing the XML can be done in approx. 9 seconds on a 2-year-old laptop (using DotNetZip). I don't expect Node.js to be as fast, but anything under one minute would be okay. Unpacking the file to local disk and then processing it is not an option.
I am currently attempting to use the unzip module (https://www.npmjs.com/package/unzip) but can't get it to work, so I don't know if the speed is okay, but at least it looks like I can stream each file and process it in the callback. (The problem is that I only receive the first 2 entries, then it stops calling the .on('entry', callback) callback. I don't get any error, it just silently stops after 2 files. It would also be good to know how I can get the full XML in one chunk instead of fetching buffer after buffer.)
var fs = require('fs');
var unzip = require('unzip');

function openArchive() {
    fs.createReadStream('../../testdata/small2.zip')
        .pipe(unzip.Parse())
        .on('entry', function (entry) {
            var fileName = entry.path;
            var type = entry.type; // 'Directory' or 'File'
            var size = entry.size;
            console.log(fileName);
            entry.on('data', function (data) {
                console.log("received data");
            });
        });
}
There are plenty of Node.js modules for working with zip files, so this question is really about figuring out which library is best suited for this scenario.
asked Sep 5, 2014 at 11:02 by shaft
- When you say you "can't get it to work" - what issue? what error? It's hard for others to troubleshoot that general statement. – bryanmac, Sep 5, 2014 at 11:13
- I mentioned what doesn't work. The code above only reads two files from a zip. – shaft, Sep 5, 2014 at 11:17
3 Answers
I've had the same task: process 100+ MB zip archives with 100,000+ XML files in each of them. In that case, unzipping the files to disk is just a waste of disk space. I tried adm-zip, but it would load and expand the whole archive in RAM, and my script would break at around 1,400 MB of RAM usage.
Using the code from the question, and the nice tip from Dilan's answer, I was sometimes only getting partial XML content, which would of course break my XML parser.
After some trials, I ended up with this code:
const fs = require('fs');
const unzip = require('unzip');

// process one .zip archive
function process_archive(filename) {
    fs.createReadStream(filename)
        .pipe(unzip.Parse())
        .on('entry', function (entry) {
            // entry.path is the file name
            // entry.type is 'Directory' or 'File'
            // entry.size is the size of the file
            const chunks = [];
            entry.on('data', (data) => chunks.push(data));
            entry.on('error', (err) => console.log(err));
            entry.on('end', () => {
                // reassemble the whole entry before handing it off, so the XML is never partial
                const content = Buffer.concat(chunks).toString('utf8');
                process_my_file(entry.path, content);
                entry.autodrain();
            });
        });
}
If this can help anybody: it's quite fast and worked well for me, using a maximum of about 25 MB of RAM.
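For context, here is a minimal sketch of how this might be wired up, assuming a hypothetical process_my_file that hands each document to whatever XML parser you prefer (the stub and the archive path are illustrative, not part of the original answer):

// hypothetical consumer: replace the body with your actual XML parsing/handling
function process_my_file(filePath, xmlString) {
    console.log('got entry', filePath, 'with', xmlString.length, 'characters of XML');
}

// kick off processing of one archive
process_archive('../../testdata/small2.zip');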
You have to call .autodrain() or pipe the data to another stream:
entry.on('data', function (data) {
    entry.autodrain();
    // or entry.pipe(require('fs').createWriteStream(entry.path))
});
Solution for late 2024:
The unzip package is dated and has deprecated dependencies. unzip-stream is a still-maintained alternative with up-to-date dependencies.
My solution returns a Promise that resolves only when all the files are extracted, and takes an optional callback that is called after each individual file is extracted, for indicating progress.
It also streams the data to the file instead of holding each individual file in memory and then writing it, like MeatZebre's answer does, which was a concern for me when unzipping large video files.
import unzip from 'unzip-stream';
import fs from "fs";
import path from "path";

let src = '/Volumes/Crucial X6/iCloud-photos-2024/iCloud Photos.zip';
let dest = '/Users/adelphia/test';

let extracted = await processArchive(src, dest, fn => {
    console.log(`Extracted ${path.basename(fn)}`);
});
console.log('done', extracted);

export function processArchive(src, dest, onEach) {
    return new Promise(resolve => {
        let promises = [];
        fs.createReadStream(src)
            .pipe(unzip.Parse())
            .on('finish', async () => {
                let extracted = await Promise.all(promises);
                resolve(extracted);
            })
            .on('entry', function (entry) {
                promises.push(new Promise(entryComplete => {
                    let filename = path.basename(entry.path);
                    let dest_path = path.join(dest, filename);
                    entry.pipe(fs.createWriteStream(dest_path)).on('finish', () => {
                        entry.autodrain();
                        if (onEach) onEach(dest_path);
                        entryComplete(dest_path);
                    });
                }));
            });
    });
}