
I am new to developing in node.js (though relatively experienced at client-side javascript) and I'm running into lots of questions about good practices when dealing with asynchronous operations in node.js.

My specific issue (though I imagine this is a fairly general-purpose topic) is that I have a node.js app (running on a Raspberry Pi) that records the readings from several temperature probes every 10 seconds into an in-memory data structure. This works just fine. The data accumulates in memory over time and, as it reaches a particular size threshold, it is regularly aged (keeping only the last N days of data) to keep it from growing beyond a certain size. This temperature data is used to control some other appliances.

Then, I have a separate interval timer that writes this data out to disk every so often (to persist it if the process crashes). I'm using node.js's async disk IO (fs.open(), fs.write() and fs.close()) to write the data out.

And, because of the async nature of the disk IO, it occurs to me that the very data structure I'm trying to write to disk may get modified right in the middle of me writing it out to disk. That would potentially be a bad thing. If data is only appended to the data structure while writing out to disk, that won't actually cause a problem with the way I'm writing the data, but there are some circumstances where earlier data can be modified as new data is being recorded and that would really mess with the integrity of what I'm in the middle of writing to disk.
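For illustration, here's roughly the shape of my chunked write (names are simplified sketches, not my actual code); between any two chunk writes the event loop is free to run the 10-second probe timer and mutate the array mid-write:

var fs = require('fs');

// Write one record per chunk; fd comes from a prior fs.open() call
function writeChunks(fd, records, index, done) {
    if (index >= records.length) return fs.close(fd, done);
    fs.write(fd, JSON.stringify(records[index]) + '\n', function(err) {
        if (err) return done(err);
        // <-- the probe timer can fire here and modify `records`
        writeChunks(fd, records, index + 1, done);
    });
}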

I can think of all sorts of somewhat ugly safeguards I could put in my code such as:

  1. Switch to synchronous IO to write the data to disk (don't really want to do that for server responsiveness reasons).
  2. Set a flag when I started writing data and don't record any new data while that flag is set (causes me to lose the recording of data during the write).
  3. More complicated versions of option 2 where I set the flag and, while the flag is set, new data goes into a separate, temporary data structure that is merged with the real data once the file IO is done (doable, but seems ugly; see the sketch after this list).
  4. Take a snapshot copy of the original data and take my time writing that copy to disk, knowing that nobody else will be modifying the copy. I don't want to do this because the data set is relatively large and I'm in a limited-memory environment (Raspberry Pi).
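To make option 3 concrete, here's a rough sketch of what I mean (writeAllChunks is a hypothetical stand-in for my chunked async writer):

// Double-buffer sketch for option 3 (illustrative names only)
var mainData = [];   // the real data structure
var staging = null;  // non-null while a disk write is in flight

function record(reading) {
    (staging || mainData).push(reading); // divert new readings during a write
}

function flushToDisk(done) {
    staging = [];                            // start diverting new readings
    writeAllChunks(mainData, function(err) { // hypothetical chunked writer
        mainData = mainData.concat(staging); // merge what arrived meanwhile
        staging = null;
        done(err);
    });
}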

So, my question is: what design patterns are there for writing a large data set with async IO when other operations may want to modify that data during the async IO? Are there more general-purpose ways of handling my issue than the specific workarounds listed above?


asked Sep 6, 2014 at 0:20 by jfriend00 (edited Sep 6, 2014 at 4:01)
  • 1 Maybe this helps to some degree? stackoverflow.com/q/14795145/218196 – Felix Kling Commented Sep 6, 2014 at 0:27
  • 2 @FelixKling - that article looks like a good description of how the async operations in node.js work. I understand that already. In fact, it's my understanding of the async architecture that informs me that I have a concurrency issue to deal with and that's why I'm looking for good practice design patterns for solving that concurrency issue. – jfriend00 Commented Sep 6, 2014 at 0:31
  • As far as I understood node is still single threaded, and concurrent execution doesn't necessarily mean parallel execution, so there should not be an issue with reading and writing data at the same time (because it can't happen, just like in the browser). But these are all just guesses, I actually don't know for sure, so don't listen to me (too much) :) – Felix Kling Commented Sep 6, 2014 at 0:34
  • 5 Why are people voting to close this question? I'm asking how to write live data using async IO. – jfriend00 Commented Sep 6, 2014 at 4:00
  • 1 @Bergi - yes, I am writing the data in chunks. For memory usage reasons (this is a small Raspberry Pi without a lot of RAM), I don't want to serialize it all into memory at once. There are opportunities for additional sensor reads to occur between the async writes and I've actually seen that occur in my logs. – jfriend00 Commented Sep 6, 2014 at 15:43

2 Answers


Your problem is data synchronization. Traditionally this is solved with locks/mutexes, but javascript/node doesn't really have anything like that built-in.

So, how do we solve this in node? We use queues. Personally, I use the queue function from the async module.

Queues work by keeping a list of tasks that need to be executed and executing them in the order they're added, each task starting only once the previous one has completed (similar to your option 3).

Note: The async module's queue can actually run multiple tasks concurrently but, since we're talking about data synchronization here, we don't want that. Luckily, we can tell it to run just one task at a time.

In your particular situation, what you'll want to do is set up a queue that can handle two types of tasks:

  1. Modify your data structure
  2. Write your data structure to disk

Whenever you get new data from your temperature probes, add the task to your queue to modify your data structure with that new data. Then, whenever your interval timer fires, add the task to your queue that writes your data structure to disk.

Since the queue will only run one task at a time, in the order they're added, it guarantees that you'll never be modifying your in-memory data structure while you're writing data to disk.

A very simple implementation of that might look like:

var async = require('async');
var fs = require('fs');

var dataQueue = async.queue(function(task, callback) {
    if (task.type === "newData") {
        memoryStore.add(task.data); // modify your data structure however you do it now
        callback(); // let the queue know the task is done; you can pass an error here as usual if needed
    } else if (task.type === "writeData") {
        fs.writeFile(task.filename, JSON.stringify(memoryStore), function(err) {
            // error handling
            callback(err); // let the queue know the task is done
        });
    } else {
        callback(new Error("Unknown Task")); // just in case we get a task we don't know about
    }
}, 1); // The 1 here sets the concurrency of the queue so that it will only run one task at a time

// call when you get new probe data
function addNewData(data) {
    dataQueue.push({type: "newData", data: data}, function(err) {
        // called when the task is complete; optional
    });
}

// write to disk every 5 minutes
setInterval(function() {
    dataQueue.push({type: "writeData", filename: "somefile.dat"}, function(err) {
        // called when the task is complete; optional
    });
}, 5 * 60 * 1000);

Also note that you can now add data to your data structure asynchronously. Say you add a new probe that fires off an event whenever its value changes. You can just call addNewData(data) as you do with your existing probes and not worry about it conflicting with in-progress modifications or disk writes (this really comes into play if you start writing to a database instead of an in-memory data store).
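For example, a quick sketch of hooking such an event-driven probe into the queue (the emitter and the 'reading' event name are made up for illustration):

var EventEmitter = require('events').EventEmitter;
var probe = new EventEmitter(); // hypothetical event-driven probe

probe.on('reading', function(data) {
    addNewData(data); // the queue serializes this with any in-progress disk write
});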


Update: A more elegant implementation using bind()

The idea is that you use bind() to bind arguments to a function and then push the new bound function that bind() returns onto the queue. That way you don't need to push a custom object onto the queue that the worker has to interpret; you can just give it a function to call, already set up with the correct arguments. The only caveat is that the function has to take a callback as its last argument.

That should allow you to use all the existing functions you have (possibly with slight modifications) and just push them on to the queue when you need to make sure they don't run concurrently.

I threw this together to test the concept:

var async = require('async');

var dataQueue = async.queue(function(task, callback) {
    // task is just a function that takes a callback; call it
    task(callback); 
}, 1); // The 1 here is setting the concurrency of the queue so that it will only run one task at a time

function storeData(data, callback) {
    setTimeout(function() { // simulate async op
        console.log('store', data);
        callback(); // let the queue know the task is done
    }, 50);
}

function writeToDisk(filename, callback) {
    setTimeout(function() { // simulate async op
        console.log('write', filename);
        callback(); // let the queue know the task is done
    }, 250);
}

// store data every second
setInterval(function() {
    var data = {date: Date.now()};
    var boundStoreData = storeData.bind(null, data);
    dataQueue.push(boundStoreData, function(err) {
        console.log('store complete', data.date);
    });
}, 1000);

// write to disk every 2 seconds
setInterval(function() {
    var filename = Date.now() + ".dat";
    var boundWriteToDisk = writeToDisk.bind(null, filename);
    dataQueue.push(boundWriteToDisk, function(err) {
        console.log('write complete', filename);
    });
}, 2000);

First - let's show a practical solution and then let's dive into how and why it works:

var Promise = require("bluebird"); // promisifyAll below comes from bluebird
var fs = Promise.promisifyAll(require("fs"));

var chain = Promise.resolve(); // Create a resolved promise

chain = chain.then(function(){
    return fs.writeAsync(...); // A
});

// some time in the future
chain = chain.then(function(){
    return fs.writeAsync(...); // This will always execute after A is done
});

Since you've tagged your question with promises - it's worth mentioning that promises solve this (quite complicated) problem very well on their own and do so quite easily.

Your data synchronization problem is called the producer-consumer problem. There are a lot of ways to tackle synchronization in JavaScript - this recent piece by Q's Kris Kowal is a good read on the subject.

Enter: Promises

The simplest way to solve it with promises is to chain everything through a single promise. I know you're experienced with promises yourself but for newer readers let's recap:

Promises are an abstraction over the notion of sequencing itself. A promise is a single (read: discrete) unit of action. Chaining promises, much like ; in some languages, marks the end of one operation and the start of the next. Promises in JavaScript abstract two main things - the notion of actions taking time, and exceptional conditions.

There is a 'higher' abstraction at play here called a monad; while A+ promises do not abide by the monad laws strictly (for convenience), there are implementations of promises that do. Promises abstract a certain kind of processing where monads abstract the notion of processing itself, so you can say that a promise is a monad, or at the very least that promises are monadic.

Promises start off as pending, meaning they represent an action that has already started but has not completed yet. At some point they might go through resolution, during which they settle into one of two states:

  • Fulfilled - indicating that the action has completed successfully.
  • Rejected - indicating that the action has not completed successfully.

Once a promise is settled it can no longer change its state. Just like you can continue after a ; on the next line, you can continue a promise with the .then method, which chains the previous action to the next.
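For newer readers, a tiny illustration of settling and chaining with plain Promises:

var p = Promise.resolve(1); // already fulfilled with 1

p.then(function(v) {
    return v + 1; // the returned value fulfills the next link in the chain
}).then(function(v) {
    console.log(v); // 2 - runs only after the previous link has settled
});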

Solving producer/consumer

A traditional solution to the producer/consumer problem can be accomplished with traditional concurrency constructs like Dijkstra's semaphores. Such a solution indeed exists through promises or plain callbacks, but I believe we can do something simpler.

Instead, we'll keep a single promise chain running, and append new actions to it every time.

var fsQueue = Promise.resolve(); // start a new chain

// one place
fsQueue = fsQueue.then(function(){ // assuming promisified fs here
    return fs.writeAsync(...); 
});

// some other place
fsQueue = fsQueue.then(function(){
    return fs.writeAsync(...);
});

Adding actions to the queue assures we have ordered synchronization and actions will only execute after earlier ones have finished. This is the simplest synchronization solution to this problem and requires wrapping fs.asyncFunction calls by .thening them to your queue.

An alternative solution would be using something akin to a "monitor" - we can ensure the access is consistent from within by wrapping fs:

var B = require("bluebird");
var fs = B.promisifyAll(require("fs")); // bluebird promisified fs

var syncFs = { // sync stands for synchronized, not synchronous
    queue: B.resolve(),
    writeAsync: function(){
        var args = arguments;
        // chain onto the queue so this write only executes after earlier ones
        return (this.queue = this.queue.then(function(){
            return fs.writeAsync.apply(fs, args);
        }));
    } // promisify other used functions similarly
};

Which would produce synchronized versions of fs actions. It is also possible to automate this (haven't tested) using something similar:

// assumes module is promisified and ignores nested functions
function synchronize(module){
    var ret = {}, queue = B.resolve();
    Object.keys(module).forEach(function(fn){ // capture fn per iteration
        ret[fn] = function(){
            var args = arguments;
            queue = queue.then(function(){
                return module[fn].apply(module, args);
            });
            return queue; // give callers the chained promise back
        };
    });
    ret.getQueue = function(){ return queue; }; // expose the live queue for handling errors
    return ret;
}

This should produce a version of a module that synchronizes all of its actions. Note the added benefit: errors don't get suppressed, and the file system won't be left in an inconsistent state, because later actions won't execute until the error that blocked the chain gets handled.
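A quick usage sketch (assuming bluebird, whose promisifyAll appends "Async" to each method name, so fs.writeFile becomes fs.writeFileAsync):

var B = require("bluebird");
var syncFs = synchronize(B.promisifyAll(require("fs")));

syncFs.writeFileAsync("data.dat", "first");  // queued first
syncFs.writeFileAsync("data.dat", "second")  // runs only after the first completes
    .then(function(){ console.log("both writes done, in order"); });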

Isn't that kind of similar to a queue?

Yes! Queues do something very similar (which you can see in the other answer) by providing a first-in, first-out structure for actions - much like program code, which executes in that order to begin with. Promises are simply a stronger side of the same coin, in my opinion.

The other answer also provides a viable option through queues.

About your suggested approaches

Switch to synchronous IO to write the data to disk (don't really want to do that for server responsiveness reasons).

While I agree this is the simplest - the 'monitor' approach of chaining all actions you need synchronized on the same queue is very similar.

Set a flag when I started writing data and don't record any new data while that flag is set (causes me to lose the recording of data during the write).

That flag is effectively a mutex. If you block (or yield and put the action in a queue) when someone else tries to enter while the flag is set, you've got a real mutex that holds the usual mutex guarantees.

Retrying with that flag, and keeping a list of pending actions waiting for the flag, is actually very common in implementations of a semaphore - one example is in the Linux kernel.
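As a sketch of how little code that flag-plus-waiting-list idea takes once promises carry the waiting list for you:

// Minimal promise-based lock: the pending tail of the chain is the "flag",
// and .then() is the waiting list of deferred actions
var lock = Promise.resolve();

function withLock(action) {            // action: a function returning a promise
    var result = lock.then(action);    // runs once the previous holder finishes
    lock = result.catch(function(){}); // keep the chain alive after failures
    return result;                     // the caller still sees success/failure
}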

More complicated versions of option 2 where I set the flag and, when the flag is set, new data goes into a separate, temporary data structure that is merged with the real data when the file IO is done (doable, but seems ugly). Take a snapshot copy of the original data and take your time to write that copy to disk knowing that nobody else will be modifying the copy. I don't want to do this because the data set is relatively large and I'm in a limited memory environment (Raspberry Pi).

These approaches are usually called transactional RCU updates; they're actually very modern and very fast in some cases - for example, for the "readers-writers problem" (which is very similar to what you have). Native support for these landed in the Linux kernel quite recently. Doing this is, in certain cases, both viable and performant, although in your case it overcomplicates things a bit, as you suggest.
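For flavor, a minimal copy-on-write sketch in JavaScript (illustrative only; fs.writeFileAsync assumes a bluebird-promisified fs as above):

// Writers publish a new structure instead of mutating the old one, so an
// in-flight write keeps a consistent snapshot for free
var store = { readings: [] }; // treat each published version as immutable

function addReading(reading) {
    store = { readings: store.readings.concat([reading]) }; // old array untouched
}

function persist() {
    var snapshot = store; // later writers replace `store`; they can't change this
    return fs.writeFileAsync("data.dat", JSON.stringify(snapshot));
}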

So, to sum it up

  • It's not an easy problem, but an interesting one.
  • Luckily, promises solve it pretty well; they were built to solve exactly this sort of problem by abstracting the notion of a sequence.

Happy coding - a Pi NodeJS project sounds awesome. Let me know if I can clarify this any further.
