I have some large json.gz files and am attempting to parse them with a library like simdjson or rapidjson. Since the files are quite large (7 GB compressed), I have written some code that reads the gz file and yields the decompressed JSON string in chunks.
Because the chunks are cut at fixed buffer boundaries, most of them are not valid JSON on their own: an element opened in one chunk may only be closed several chunks later, and the JSON is deeply nested and complex. Simply parsing each chunk therefore fails; some state has to be carried across chunk boundaries.
Is there a way to handle this with either simdjson or rapidjson?
I am by no means good at C++, so any help would be greatly appreciated!
Here is the code:
#include <string>
#include <chrono>
#include <fstream>
#include <iostream>
#include <zlib.h>
#include "simdjson.h"

const int CHUNK_SIZE = 10240;

void decompress(const std::string &filename) {
    // "file" rather than "gzFile" to avoid shadowing the zlib type name
    gzFile file = gzopen(filename.c_str(), "rb");
    if (!file) {
        std::cerr << "error opening gzipped file: " << filename << std::endl;
        return;
    }
    char buffer[CHUNK_SIZE];
    int bytesRead;
    while ((bytesRead = gzread(file, buffer, sizeof(buffer))) > 0) {
        std::string chunk(buffer, bytesRead); // or string_view
        // PROCESS CHUNK WITH SIMDJSON/RAPIDJSON HERE
    }
    if (bytesRead < 0) {
        int errnum;
        std::cerr << "error during decompression: " << gzerror(file, &errnum) << std::endl;
    }
    gzclose(file);
}

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    std::string filename = "data/example.json.gz";
    decompress(filename);
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "\n" << "elapsed time (seconds): " << elapsed.count() << "\n" << std::endl;
    return 0;
}
Thanks!
RapidJSON supports the std::istream interface, so you can use Boost.Iostreams' filtering streams to decompress the gzip file transparently:
#include <fstream>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <rapidjson/document.h>
#include <rapidjson/istreamwrapper.h>

std::ifstream file("data/example.json.gz", std::ios_base::in | std::ios_base::binary);
boost::iostreams::filtering_streambuf<boost::iostreams::input> in;
in.push(boost::iostreams::gzip_decompressor());
in.push(file);
std::istream decompressed(&in); // IStreamWrapper wraps a std::istream, not a streambuf

rapidjson::Document d;
rapidjson::IStreamWrapper isw(decompressed); // ParseStream takes the wrapper by non-const reference
d.ParseStream(isw);
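Be aware that ParseStream builds the whole DOM in memory, which for a file that is 7 GB compressed may be impractical. If memory becomes the problem, RapidJSON's SAX interface (rapidjson::Reader) can consume the same decompressed stream without materializing the document. A minimal sketch, reusing the isw wrapper from above, with a hypothetical handler that just counts object keys:

#include <rapidjson/reader.h>

// Hypothetical handler: counts object keys as they stream past.
// BaseReaderHandler supplies default no-op callbacks for everything else.
struct KeyCounter : rapidjson::BaseReaderHandler<rapidjson::UTF8<>, KeyCounter> {
    size_t keys = 0;
    bool Key(const char* str, rapidjson::SizeType length, bool copy) {
        ++keys;
        return true; // returning false would abort the parse
    }
};

rapidjson::Reader reader;
KeyCounter handler;
reader.Parse(isw, handler); // streams through the gzip-decompressed input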
Comment: If the file is a sequence of concatenated JSON documents, you can parse them one at a time with the kParseStopWhenDoneFlag flag; if it's actually one complete array or object you will have to do more finagling. – Botje Commented yesterday
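For that concatenated-documents case, a minimal sketch of the loop, reusing the isw wrapper from the answer (kParseErrorDocumentEmpty is how RapidJSON signals that the stream has run out):

rapidjson::Document d;
while (true) {
    // Stop after one complete root value instead of requiring end-of-stream.
    d.ParseStream<rapidjson::kParseStopWhenDoneFlag>(isw);
    if (d.HasParseError()) {
        // An empty document here just means the input is exhausted.
        if (d.GetParseError() != rapidjson::kParseErrorDocumentEmpty)
            std::cerr << "parse error at offset " << d.GetErrorOffset() << std::endl;
        break;
    }
    // ... process d, then reuse it for the next document ...
}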