So remember that article I wrote about partial http downloads using curl? You probably don't, so let me refresh your memory. I ended it by saying:
I'm hoping to implement this in pure ruby code tomorrow
Oh yeah. Nearly a month ago at this point.
Here's the thing. If it's not a priority, it ain't gonna get done. Unless it's fun.
So the quick background. Cyclocane downloads these raw spaghetti model files where a storm's entire history is kept in one file, and right now, Cyclocane only needs the very last run. And given that one of the sources isn't compressing their data at all, well... towards the end of a storm's life, the file can get huge.
So why'd I procrastinate? Well, the very day after writing the post about figuring out curl and the partial download, I didn't need data from that site (storm dissipated). So why bother, eh?
Anyways, back to the present (or yesterday as the case may be) and I finally decided to look into this again.
If you google "ruby partial http downloads", you'll get approximately 1 relevant result (the rest being about partials in the ruby templating sense). Following that one relevant result, you'll get a stackoverflow answer that points to this other stackoverflow answer that just looks overly complex.
I started trying the net/http and uri based stuff out in pry and I just felt icky. Surely there was a better way.
Surely.
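For comparison, here's roughly the shape of the net/http version I was poking at, give or take (a sketch with a made-up URL, not the exact stackoverflow code):

require 'net/http'
require 'uri'

uri = URI.parse('http://example.com/some_storm.dat') # made-up address
Net::HTTP.start(uri.host, uri.port) do |http|
  head = http.head(uri.request_uri)                  # HEAD request to learn the file size
  start = [head['content-length'].to_i - 100_000, 0].max
  request = Net::HTTP::Get.new(uri.request_uri)
  request['Range'] = "bytes=#{start}-"               # ask for everything from start onward
  puts http.request(request).body
end

It works, but there's a lot of ceremony for what should be a simple "give me the end of this file".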
Given that the stackoverflow answer came from 2009, I figured there had to be some ruby library that had tackled this problem in a nicer, friendlier way by now.
I had already heard about the faraday gem... it's like Rack in reverse or something, but until now, I didn't have a compelling reason to try it.
It's been almost a day since I did this, but apparently I finally found this through a google search of "ruby custom http headers" (after also trying "partial downloads mechanize ruby", which only taught me that it wasn't possible in mechanize).
intridea's article on faraday was what I needed. It didn't directly answer my question, but since I already knew how to do it with curl, I knew what HTTP headers I would need to set to get this working.
So after all of this, here's what you need to do to be able to work with range headers in ruby... assuming you're using faraday of course.
I make no guarantees that this will work well for you, but it seems to be working for me so far (it's only day 2 though).
require 'faraday'
conn = Faraday.new # set up a new connection
header_response = conn.head url # do a HEAD request
starting_bytes = header_response.env[:response_headers]["content-length"].to_i - 100000 # the file's size minus 100,000 bytes
starting_bytes = 0 if starting_bytes < 0 # don't ask for a negative starting point
conn.headers = {'Range' => "bytes=#{starting_bytes}-"} # everything from starting_bytes to the end of the file
response = conn.get url # the actual (partial) download
puts response.body
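One thing the snippet glosses over: url is assumed to already hold the address of the file you're after, something like this (made-up address, obviously):

url = 'http://example.com/some_storm.dat'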
Require the library and get it ready.
require 'faraday'
conn = Faraday.new
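As an aside (not something I needed here), faraday also lets you hand the connection a base URL when you create it, which is handy if all your requests go to the same host; I believe the option looks like this:

conn = Faraday.new(url: 'http://example.com') # requests can then use relative paths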
Do an HTTP HEAD request and assign the response to a variable.
header_response = conn.head url
Here, I've decided I only need the last 100,000 bytes (the Range header deals in bytes), so I find out what the content-length is and subtract 100,000 from that.
starting_bytes = header_response.env[:response_headers]["content-length"].to_i - 100000
Frankly, I haven't tested this, but the hope is that if the file is smaller than 100,000 bytes, this will catch what would otherwise have been an invalid request (and again, I haven't even tested what happens if you put a negative number into a range request).
starting_bytes = 0 if starting_bytes < 0
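If you prefer, the subtraction and the guard can be collapsed into one line; this is just an equivalent way to write the two steps above:

starting_bytes = [header_response.env[:response_headers]["content-length"].to_i - 100_000, 0].max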
Now set the byte range. You can do <START>-<END>, or you can do what I've done here and use <START>-, which tells it to start at whatever your START is and keep going until it reaches the end of the file.
conn.headers = {'Range' => "bytes=#{starting_bytes}-"}
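For what it's worth, if you wanted a specific window instead of the open-ended form, the header would look something like this (byte positions made up for illustration; the end position is inclusive):

conn.headers = {'Range' => "bytes=500000-599999"} # bytes 500,000 through 599,999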
Finally, you actually get the partial download.
response = conn.get url
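One sanity check worth adding (not something my script does yet): if the server actually honored the Range header, the response status comes back as 206 Partial Content rather than 200:

puts response.status # 206 if the range was honored, 200 if the server ignored it and sent the whole file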
response.body holds the contents that theoretically you're going to do something magical with. Here, it's just being output to the screen.
puts response.body
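And just to tie the pieces together, here's a hypothetical little helper along the same lines (the method name and URL are mine, not Cyclocane's actual code):

require 'faraday'

# Grab roughly the last `bytes` bytes of the file at `url`.
def fetch_tail(url, bytes = 100_000)
  conn = Faraday.new
  length = conn.head(url).env[:response_headers]["content-length"].to_i
  start = [length - bytes, 0].max
  conn.headers['Range'] = "bytes=#{start}-"
  conn.get(url).body
end

puts fetch_tail('http://example.com/some_storm.dat') # made-up address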
So I did a totally unscientific comparison of the before and after. My update script (which does more than update the spaghetti models, but still) went from 45 seconds to 9 seconds. And if the remote server had been slower (this server tends to fall to its knees anytime there's a major storm system in the northwest Pacific) or the current active storms had been around longer (bigger files), that time difference would've been even more dramatic (given that I'm only ever getting the last 100,000 bytes of the file regardless of size).
So yeah. Here's your ending. Use faraday!