
I have some large json.gz files and am attempting to parse them using a library like simdjson or rapidjson. Since the files are quite large (7 GB compressed), I have written some code that reads the gz file and produces the decompressed JSON as a stream of chunks.

Since the chunks are cut at fixed buffer boundaries, most of them are not valid JSON on their own: an element opened in one chunk may only be closed several chunks later, and the JSON is deeply nested and complex. Simply parsing each chunk therefore does not work; some state needs to be carried across chunks until elements are closed.

Is there a way to handle this with either simdjson or rapidjson?

I am by no means good at C++, so any help would be greatly appreciated!

Here is the code:

#include <string>
#include <chrono>
#include <zlib.h>
#include <fstream>
#include <iostream>
#include "simdjson.h"

const int CHUNK_SIZE = 10240;

void decompress(const std::string &filename) {

    gzFile gzFile = gzopen(filename.c_str(), "rb");

    if (!gzFile) {
        std::cerr << "error opening gzipped file: " << filename << std::endl;
        return;
    }

    char buffer[CHUNK_SIZE];

    int bytesRead;
    while ((bytesRead = gzread(gzFile, buffer, sizeof(buffer))) > 0) {

        std::string chunk(buffer, bytesRead); // or string_view

        // PROCESS CHUNK WITH SIMDJSON/RAPIDJSON HERE

    }

    if (bytesRead < 0) {
        std::cerr << "error during decompression: " << gzerror(gzFile, NULL) << std::endl;
    }

    gzclose(gzFile);

}

int main() {

    auto start = std::chrono::high_resolution_clock::now();

    std::string filename = "data/example.json.gz";
    
    decompress(filename);

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;

    std::cout << "\n" << "elapsed time (seconds): " << elapsed.count() << "\n" << std::endl;

    return 0;

}

Thanks!


  • RapidJSON supports reading from custom streams. If your files are actually a collection of JSON objects you can read one at a time with the kParseStopWhenDoneFlag flag (sketched below); if it's actually a complete array or object you will have to do more finagling. – Botje Commented yesterday
  • Alternatively, find an istream implementation for gzip files and use that with the RapidJSON istream wrapper – Botje Commented yesterday
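A minimal sketch of the approach from these comments (combining kParseStopWhenDoneFlag with a gzip-decompressing istream from Boost.Iostreams), assuming the decompressed data is a sequence of concatenated top-level JSON values rather than one single huge array; the file name is the one from the question:

#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <rapidjson/document.h>
#include <rapidjson/istreamwrapper.h>
#include <rapidjson/error/en.h>

int main() {
    std::ifstream file("data/example.json.gz", std::ios_base::binary);
    boost::iostreams::filtering_istream in;
    in.push(boost::iostreams::gzip_decompressor());
    in.push(file);

    rapidjson::IStreamWrapper isw(in);
    while (true) {
        rapidjson::Document d;
        // kParseStopWhenDoneFlag stops after one complete root value
        // instead of requiring the whole stream to be consumed.
        d.ParseStream<rapidjson::kParseStopWhenDoneFlag>(isw);
        if (d.HasParseError()) {
            // An "empty document" error at the end of the stream just means
            // there is nothing left to read; anything else is a real error.
            if (d.GetParseError() != rapidjson::kParseErrorDocumentEmpty)
                std::cerr << rapidjson::GetParseError_En(d.GetParseError()) << "\n";
            break;
        }
        // ... process d here, one document at a time ...
    }
    return 0;
}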

1 Answer


RapidJSON can parse from a std::istream through rapidjson::IStreamWrapper, so you can use Boost.Iostreams' filtering streams to decompress the gzip file transparently:

#include <fstream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <rapidjson/document.h>
#include <rapidjson/istreamwrapper.h>

std::ifstream file("data/example.json.gz", std::ios_base::in | std::ios_base::binary);
boost::iostreams::filtering_istream in;   // filtering_istream is itself a std::istream
in.push(boost::iostreams::gzip_decompressor());
in.push(file);

rapidjson::IStreamWrapper isw(in);        // ParseStream needs an lvalue stream wrapper
rapidjson::Document d;
d.ParseStream(isw);
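Note that a single ParseStream call like this still materialises the entire document as an in-memory DOM, so for a file that is 7 GB compressed you may prefer the per-document kParseStopWhenDoneFlag loop sketched under the comments above, or RapidJSON's SAX Reader interface, if the shape of the data allows it.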
