I need to speed up file reading, so I decided to use multithreading.
I have a very simple program for it:
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main()
{
    std::string mPath = "input.txt";
    // Open at the end in binary mode to get the file size in bytes.
    std::ifstream in_1(mPath, std::ifstream::ate | std::ifstream::binary);
    std::streampos size = in_1.tellg();
    in_1.close();
    std::streampos half = size / 2;
    auto lambd = [&mPath](std::streampos start, std::streampos end)
    {
        auto clock_start = std::chrono::system_clock::now();
        std::string word;
        std::ifstream fstream(mPath);
        fstream.seekg(start);
        auto counter = start;
        std::string rangeStr = std::to_string(start) + ' ' + std::to_string(end);
        while ((counter < end) && (fstream >> word))
        {
            counter += word.length() + 1;
            // some useful work with data, was commented out during tests
        }
        auto clock_now = std::chrono::system_clock::now();
        float currentTime = float(std::chrono::duration_cast<std::chrono::microseconds>(clock_now - clock_start).count());
        std::cout << "\nThread Elapsed Time:(" << rangeStr << ") " << currentTime / 1000000 << " S" << std::endl;
    };
    // Sequential runs: whole file, first half, second half.
    lambd(0, size);
    lambd(0, half);
    lambd(half, size);
    // Parallel run: one thread per half.
    std::thread th_1(lambd, 0, half);
    std::thread th_2(lambd, half, size);
    th_1.join();
    th_2.join();
}
And I got very strange output (Windows 11, Visual Studio 2022):
Thread Elapsed Time:(0 1520000000) 18.9158 S
Thread Elapsed Time:(0 760000000) 9.36048 S
Thread Elapsed Time:(760000000 1520000000) 9.3542 S
Thread Elapsed Time:(0 760000000) 38.2231 S
Thread Elapsed Time:(760000000 1520000000) 38.3247 S
I should use only the C++ standard library, so I cannot use mmap.
Could I ask you for support? Why is there such a big gap when I use multithreading? Does it mean that ifstream instances share the same data stream if the file is the same? Maybe I could avoid it somehow? I don't want to use C-style reading.
UPDATED TEXT BELOW: Thank you very much for the answers. Makar Biziukin gave proposals which sound reasonable.
But in general, I don't understand why the same code written in C style works better, and that was the main motivation for asking the question. I am interested in why the behavior of ifstream differs so much from fopen and fscanf. The code has been rewritten one-to-one in C style.
#define _CRT_SECURE_NO_DEPRECATE
#include <chrono>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main()
{
    std::string path = "input.txt";
    FILE* fp = fopen(path.c_str(), "r");
    char str[100];
    std::fseek(fp, 0, SEEK_END); // seek to end
    std::size_t filesize = std::ftell(fp);
    std::size_t half = filesize / 2;
    // Move the midpoint forward to the end of the word it falls in.
    std::fseek(fp, half, SEEK_SET);
    fscanf(fp, "%99s", str); // width limit keeps str from overflowing
    half = std::ftell(fp);
    fclose(fp);
    for (int i = 0; i < 10; i++)
    {
        std::cout << "\n\n!!!!!!!!!!!!!!!!!!!!\n\n";
        auto lamdb = [&path](size_t start, size_t end)
        {
            auto clock_start = std::chrono::system_clock::now();
            char str[100];
            FILE* fp = fopen(path.c_str(), "r");
            std::fseek(fp, start, SEEK_SET);
            std::string rangeStr = std::to_string(start) + ' ' + std::to_string(end);
            while ((start < end) && (fscanf(fp, "%99s", str) != EOF))
            {
                start += strlen(str) + 1;
                // some useful work with data, was commented out during tests
            }
            auto clock_now = std::chrono::system_clock::now();
            float currentTime = float(std::chrono::duration_cast<std::chrono::microseconds>(clock_now - clock_start).count());
            std::cout << "\nThread Elapsed Time:(" << rangeStr << ") " << currentTime / 1000000 << " S" << std::endl;
            fclose(fp);
        };
        // Sequential runs: whole file, first half, second half.
        lamdb(0, filesize);
        lamdb(0, half);
        lamdb(half, filesize);
        // Parallel run: one thread per half.
        std::thread th_1(lamdb, 0, half);
        std::thread th_2(lamdb, half, filesize);
        th_1.join();
        th_2.join();
    }
    return 0;
}
One of the outputs (all results are close to each other):
Thread Elapsed Time:(0 1520000000) 21.7994 S
Thread Elapsed Time:(0 760000007) 10.7144 S
Thread Elapsed Time:(760000007 1520000000) 10.8991 S
Thread Elapsed Time:(760000007 1520000000) 12.6544 S
Thread Elapsed Time:(0 760000007) 13.3838 S
Asked Nov 21, 2024 at 23:35 by buggi zhuk; edited Nov 22, 2024 at 10:16.
1 Answer
The performance degradation you're seeing is due to several factors:
1. Disk I/O contention: multiple threads reading from the same physical disk can cause the disk head to jump back and forth (on a spinning disk), resulting in worse performance than sequential reading.
2. File system buffering: Windows maintains file system buffers, and concurrent access can lead to buffer thrashing.
Try this version:
#include <cctype>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

class BufferedFileReader {
private:
    std::vector<char> buffer;
    std::size_t position = 0;

    // std::isspace needs a value representable as unsigned char.
    static bool isSpace(char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }

public:
    // Reads `size` bytes starting at offset `start` into memory with one read call.
    BufferedFileReader(const std::string& path, std::streamoff start, std::streamoff size) {
        std::ifstream file(path, std::ios::binary);
        if (!file) {
            throw std::runtime_error("Cannot open file");
        }
        buffer.resize(static_cast<std::size_t>(size));
        file.seekg(start);
        file.read(buffer.data(), size);
        // Keep only the bytes that were actually read.
        buffer.resize(static_cast<std::size_t>(file.gcount()));
    }

    bool getNextWord(std::string& word) {
        word.clear();
        // Skip whitespace
        while (position < buffer.size() && isSpace(buffer[position])) {
            position++;
        }
        if (position >= buffer.size()) {
            return false;
        }
        // Read word
        while (position < buffer.size() && !isSpace(buffer[position])) {
            word += buffer[position++];
        }
        return !word.empty();
    }
};

void processFileChunk(const std::string& path, std::streamoff start, std::streamoff size) {
    auto clock_start = std::chrono::system_clock::now();
    try {
        BufferedFileReader reader(path, start, size);
        std::string word;
        std::size_t wordCount = 0;
        while (reader.getNextWord(word)) {
            // Process word here
            wordCount++;
        }
        auto clock_now = std::chrono::system_clock::now();
        float currentTime = float(std::chrono::duration_cast<std::chrono::microseconds>(clock_now - clock_start).count());
        std::cout << "Thread Elapsed Time:(" << start << " " << (start + size)
                  << ") " << currentTime / 1000000 << "s, Words: " << wordCount << std::endl;
    }
    catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
    }
}

int main() {
    std::string mPath = "input.txt";
    // Get file size
    std::ifstream file(mPath, std::ifstream::ate | std::ifstream::binary);
    if (!file) {
        std::cerr << "Cannot open file" << std::endl;
        return 1;
    }
    // Work with plain offsets so the arithmetic below is unambiguous.
    const std::streamoff totalSize = file.tellg();
    file.close();
    // Calculate chunk sizes
    const int numThreads = 4; // You can adjust this
    std::vector<std::thread> threads;
    const std::streamoff chunkSize = totalSize / numThreads;
    // First test sequential reading
    std::cout << "Sequential reading:" << std::endl;
    processFileChunk(mPath, 0, totalSize);
    // Now test parallel reading.
    // Note: boundaries are not aligned to whitespace, so a word that
    // straddles a boundary is split and counted in both chunks.
    std::cout << "\nParallel reading:" << std::endl;
    for (int i = 0; i < numThreads; i++) {
        const std::streamoff start = i * chunkSize;
        // The last chunk also takes the remainder of the file.
        const std::streamoff size = (i == numThreads - 1) ? (totalSize - start) : chunkSize;
        threads.emplace_back(processFileChunk, mPath, start, size);
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return 0;
}
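One possible variation on the version above, shown only as a sketch: do the disk read once and sequentially, then let the threads parse disjoint parts of the in-memory data. This keeps the threads from competing for the disk, and it moves each chunk boundary forward to the next whitespace, much like the C-style code in the question does for `half`, so no word is cut in two. The helper names (`readWholeFile`, `wordAlignedBoundaries`, `countWords`, `countWordsParallel`) are hypothetical, and the sketch assumes the whole file fits in memory.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: load the whole file with one sequential binary read.
std::string readWholeFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("Cannot open file: " + path);
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (size < 0) throw std::runtime_error("Cannot determine file size");
    in.seekg(0, std::ios::beg);
    std::string data(static_cast<std::size_t>(size), '\0');
    in.read(&data[0], size);
    data.resize(static_cast<std::size_t>(in.gcount()));
    return data;
}

inline bool isSpace(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: returns numChunks + 1 offsets. Every inner offset is
// moved forward to the next whitespace, so each word lies in exactly one chunk.
std::vector<std::size_t> wordAlignedBoundaries(const std::string& data, std::size_t numChunks) {
    assert(numChunks > 0);
    std::vector<std::size_t> bounds{0};
    for (std::size_t i = 1; i < numChunks; ++i) {
        std::size_t pos = data.size() / numChunks * i;
        if (pos < bounds.back()) pos = bounds.back();
        while (pos < data.size() && !isSpace(data[pos])) ++pos;
        bounds.push_back(pos);
    }
    bounds.push_back(data.size());
    return bounds;
}

// Counts whitespace-separated words in data[begin, end).
std::size_t countWords(const std::string& data, std::size_t begin, std::size_t end) {
    std::size_t count = 0;
    std::size_t pos = begin;
    while (pos < end) {
        while (pos < end && isSpace(data[pos])) ++pos;
        if (pos == end) break;
        ++count;
        while (pos < end && !isSpace(data[pos])) ++pos;
    }
    return count;
}

// Each thread parses its own chunk and writes to its own slot in `counts`.
std::size_t countWordsParallel(const std::string& data, std::size_t numThreads) {
    const std::vector<std::size_t> bounds = wordAlignedBoundaries(data, numThreads);
    std::vector<std::size_t> counts(numThreads, 0);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < numThreads; ++i) {
        threads.emplace_back([&data, &bounds, &counts, i] {
            counts[i] = countWords(data, bounds[i], bounds[i + 1]);
        });
    }
    for (auto& t : threads) t.join();
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    return total;
}
```

A call such as `countWordsParallel(readWholeFile("input.txt"), 4)` should give the same total as `countWords` over the whole buffer. Whether it is faster depends on how expensive the per-word work is compared with the read itself.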
Original question: c++ - reading with multiple threads using ifstream - Stack Overflow
Comments:
mmap on the file is arguably the fastest way to do things (with some pointers and string functions such as strchr, etc). Or, using pread. With streams, you have an extra layer of buffering. And, AFAIK, each stream has its own buffering. Okay, unless you need absolute speed. As to only c++ library, is this a H/W assignment? – Craig Estey, commented Nov 21, 2024 at 23:53
sync_with_stdio. IO streams are also heavier than stdio, and you really see that extra weight if you don't allow the compiler to optimize. General note: time measurements of unoptimized code are mostly meaningless in C++ because the Standard library counts on the compiler eliminating a lot of boilerplate. – user4581301, commented Nov 22, 2024 at 17:43
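As a rough illustration of the measurement advice in the last comment, here is a minimal timing sketch. It uses `std::chrono::steady_clock`, which is monotonic and meant for intervals, where the question's code uses `system_clock`. The helper names `secondsFor` and `bestOf` are hypothetical, and the numbers only mean something in an optimized build (for example `/O2` with MSVC or `-O2` with GCC/Clang).

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical helper: run `work` once and return the elapsed time in seconds.
template <typename Work>
double secondsFor(Work&& work) {
    const auto begin = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - begin).count();
}

// Hypothetical helper: repeat the measurement and keep the fastest run,
// which reduces noise from file caching and thread scheduling.
template <typename Work>
double bestOf(int repeats, Work&& work) {
    assert(repeats > 0);
    double best = secondsFor(work);
    for (int i = 1; i < repeats; ++i) {
        const double t = secondsFor(work);
        if (t < best) best = t;
    }
    return best;
}
```

For example, `bestOf(5, [&] { lambd(0, half); })` would time the question's reader five times and report the fastest run. Comparing the first run with later ones also shows how much of the time is the operating system's file cache warming up.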