I recently began experimenting with Node.js for a small web-scraping project. I wrote a tiny script that visits a long list of URLs and downloads each page to disk. The simple solution was to iterate through the list and fire off a request for each URL to download the page.
Too Many Open Files
Unfortunately, there is a limit on the number of simultaneous exec() calls you can make. Since running an external command via exec() is non-blocking, firing off too many back-to-back calls quickly exhausts the operating system's open-file limit, and Node throws an EMFILE ("too many open files") error.
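From memory, the failure looks something like this (the exact message and trace vary by Node version):

```
Error: spawn EMFILE
```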
Maximum Simultaneous Calls
To solve this, I implemented something like the following:
This only allows so many exec() calls to run simultaneously; the rest of the URLs are stored in a queue until a slot becomes available for them. Checking (and shifting) the queue is done in the callback function wget_callback(): I fetch the next URL to download out of the queue only if fewer than the maximum number of exec() calls are already running. I keep track of how many calls are currently running with count, incrementing and decrementing it accordingly.
I’m sure there are tons of libraries that do this, but I decided to implement a quick and dirty solution to this problem and thought I’d share!