So remember that article I wrote about partial http downloads using curl? You probably don't, so let me refresh your memory. I ended it by saying:
I'm hoping to implement this in pure ruby code tomorrow
Oh yeah. Nearly a month ago at this point.
Here's the thing. If it's not a priority, it ain't gonna get done. Unless it's fun.
So the quick background. Cyclocane downloads these raw spaghetti model files where a storm's entire history is kept in one file, and right now, Cyclocane only needs the very last run. And given that one of the sources isn't compressing their data at all, well... towards the end of a storm's life, the file can get huge.
So why'd I procrastinate? Well, the very day after writing the post about figuring out curl and the partial download, I didn't need data from that site (storm dissipated). So why bother, eh?
Anyways, back to the present (or yesterday as the case may be) and I finally decided to look into this again.
If you google "ruby partial http downloads", you'll get approximately 1 relevant result (the rest being about partials in the ruby templating sense). Following that one relevant result, you'll get a stackoverflow answer that points to this other stackoverflow answer that just looks overly complex.
I started trying the net/http and uri based stuff out in pry and I just felt icky. Surely there was a better way.
Surely.
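For comparison, here's roughly the shape of the net/http version I was poking at, give or take (a sketch with a made-up URL, not the exact stackoverflow code):

require 'net/http'
require 'uri'

uri = URI.parse('http://example.com/some_storm.dat') # made-up address
Net::HTTP.start(uri.host, uri.port) do |http|
  head = http.head(uri.request_uri)                  # HEAD request to learn the file size
  start = [head['content-length'].to_i - 100_000, 0].max
  request = Net::HTTP::Get.new(uri.request_uri)
  request['Range'] = "bytes=#{start}-"               # ask for everything from start onward
  puts http.request(request).body
end

It works, but there's a lot of ceremony for what should be a simple "give me the end of this file".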
Given that the stackoverflow answer came from 2009, I figured there had to be some ruby library that had tackled this problem in a nicer, friendlier way by now.
I had already heard about the faraday gem... it's like Rack in reverse or something, but until now, I didn't have a compelling reason to try it.
It's been almost a day since I did this, but apparently I finally found this through a google search of "ruby custom http headers" (after also trying "partial downloads mechanize ruby", which only taught me that it wasn't possible in mechanize).
intridea's article on faraday was what I needed. It didn't directly answer my question, but since I already knew how to do it with curl, I knew what HTTP headers I would need to set to get this working.
So after all of this, here's what you need to do to be able to work with range headers in ruby... assuming you're using faraday of course.
I make no guarantees that this will work well for you, but it seems to be working for me so far (it's only day 2 though).
require 'faraday'
conn = Faraday.new # set up a new connection
header_response = conn.head url # do a HEAD request
starting_bytes = header_response.env[:response_headers]["content-length"].to_i - 100000 # the file's size minus 100,000 bytes
starting_bytes = 0 if starting_bytes < 0 # don't ask for a negative starting point
conn.headers = {'Range' => "bytes=#{starting_bytes}-"} # everything from starting_bytes to the end of the file
response = conn.get url # the actual (partial) download
puts response.body
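One thing the snippet glosses over: url is assumed to already hold the address of the file you're after, something like this (made-up address, obviously):

url = 'http://example.com/some_storm.dat'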
Require the library and get it ready.
require 'faraday'
conn = Faraday.new
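As an aside (not something I needed here), faraday also lets you hand the connection a base URL when you create it, which is handy if all your requests go to the same host; I believe the option looks like this:

conn = Faraday.new(url: 'http://example.com') # requests can then use relative paths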
Do an HTTP HEAD request and assign the response to a variable.
header_response = conn.head url
Here, I've decided I only need the last 100,000 bytes (the Range header deals in bytes), so I find out what the content-length is and subtract 100,000 from that.
starting_bytes = header_response.env[:response_headers]["content-length"].to_i - 100000
Frankly, I haven't tested this, but the hope is that if the file is smaller than 100,000 bytes, this will catch what would otherwise have been an invalid request (and again, I haven't even tested what happens if you put a negative number into a range request).
starting_bytes = 0 if starting_bytes < 0
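If you prefer, the subtraction and the guard can be collapsed into one line; this is just an equivalent way to write the two steps above:

starting_bytes = [header_response.env[:response_headers]["content-length"].to_i - 100_000, 0].max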
Now set the byte range. You can do <START>-<END>, or you can do what I've done here and use <START>-, which tells it to start at whatever your START is and keep going until it reaches the end of the file.
conn.headers = {'Range' => "bytes=#{starting_bytes}-"}
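For what it's worth, if you wanted a specific window instead of the open-ended form, the header would look something like this (byte positions made up for illustration; the end position is inclusive):

conn.headers = {'Range' => "bytes=500000-599999"} # bytes 500,000 through 599,999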
Finally, you actually get the partial download.
response = conn.get url
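One sanity check worth adding (not something my script does yet): if the server actually honored the Range header, the response status comes back as 206 Partial Content rather than 200:

puts response.status # 206 if the range was honored, 200 if the server ignored it and sent the whole file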
response.body holds the contents that theoretically you're going to do something magical with. Here, it's just being output to the screen.
puts response.body
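And just to tie the pieces together, here's a hypothetical little helper along the same lines (the method name and URL are mine, not Cyclocane's actual code):

require 'faraday'

# Grab roughly the last `bytes` bytes of the file at `url`.
def fetch_tail(url, bytes = 100_000)
  conn = Faraday.new
  length = conn.head(url).env[:response_headers]["content-length"].to_i
  start = [length - bytes, 0].max
  conn.headers['Range'] = "bytes=#{start}-"
  conn.get(url).body
end

puts fetch_tail('http://example.com/some_storm.dat') # made-up address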
So I did a totally unscientific comparison of the before and after. My update script (which does more than update the spaghetti models, but still) went from 45 seconds to 9 seconds. And if the remote server had been slower (this server tends to fall to its knees anytime there's a major storm system in the northwest Pacific) or the current active storms had been around longer (bigger files), that time difference would've been even more dramatic (given that I'm only ever getting the last 100,000 bytes of the file regardless of size).
So yeah. Here's your ending. Use faraday!