I’ve recently started toying around with Amazon S3 for doing some remote backups.  On Linux, I’ve been using S3 tools to access S3.  I’ve been fairly impressed with S3 and S3 tools so far.

There are a couple of things beyond what the s3cmd program (included in S3 tools) does that would be useful for me.  First, s3cmd uploads to and downloads from S3 as fast as it can.  If you are transferring a large file, this can be a problem, since it will saturate your connection for quite a while.  I was looking for an easy way to shape this traffic without doing traffic shaping at the network level (which could be tricky, since S3 uses HTTP for transfers).  I found trickle, a neat little app that does traffic shaping in userland.
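
For example, capping an s3cmd upload at 20 KB/s is just a matter of prefixing the command with trickle (the file and bucket names here are placeholders):

trickle -s -u 20 s3cmd put somefile.tar.gz s3://your-s3-bucket-name/backup/

The -s flag runs trickle in standalone mode (no trickled daemon needed) and -u sets the upload cap in KB/s; -d does the same for downloads.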

Second, the ability to resume a failed transfer would be nice.  Last week I was in the middle of uploading a large file to S3 over the fairly limited connection I have here.  With trickle, it was going to take about 24 hours to upload the whole file.  Unfortunately, about 16 hours into the transfer, some bad storms here cut my connection off for a few minutes.  It was rather frustrating to have to start that transfer over from the beginning.

I researched it a little bit, and it looks like S3 doesn’t really provide a mechanism to resume interrupted transfers.  The best way I could think of to minimize the pain of a dropped connection is to split the larger archive into smaller pieces and upload them individually.  That way, if the connection is cut off, you only have to re-transfer the pieces that hadn’t finished, not the whole archive.  I made my script split the archive into 20 MB pieces; at the 20 KB/s rate that trickle lets through, a single piece only takes about 17 minutes to re-upload.
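
The splitting and reassembly are just the standard split and cat utilities; roughly (file names here are placeholders):

# split into 20 MB chunks with numeric suffixes: backup.tar.gz.00, .01, ...
split -d -b 20m backup.tar.gz backup.tar.gz.
# later, after downloading the chunks, glue them back together
cat backup.tar.gz.* > backup.tar.gz

The numeric suffixes sort lexically in the order they were created, so the shell glob in the cat step puts the chunks back together in the right sequence.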

I’ve put the scripts below.  s3-up.sh will first create a tar.gz (compressed archive) of the files specified for upload, then split the archive into 20 MB chunks and upload them at 20 KB/s.  s3-down.sh will download the chunks from S3, put them back together, and extract the archive.  By default it will decrypt the chunks using the currently configured password for s3cmd, but you can specify a different decryption password (handy if you regularly change the encryption password you use with s3cmd).

Feel free to tweak these as needed.  You’ll need s3cmd installed, along with trickle, to use these scripts.  If you are using CentOS, both of those are in the EPEL repository.  I make absolutely no guarantees that these scripts will work for you.  They have not been thoroughly tested, and they are not intended to be relied upon in a production environment.
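
For reference, with EPEL enabled the install on CentOS should just be (package names as I found them in EPEL):

yum install s3cmd trickle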

s3-up.sh:

#!/bin/bash

# Date stamp used in the archive name
DATE=$(date +%Y-%m-%d)

if [ -z "$1" ] || [ -z "$2" ]
then
    echo "Arguments: files to upload, base name of file"
else
    # Build a compressed archive of the requested files in a temp dir
    # ($1 is intentionally unquoted so a quoted list of paths expands
    # to multiple tar arguments)
    mkdir /tmp/$$
    tar -czf /tmp/$$/$2_$DATE.tar.gz $1

    # Split the archive into 20 MB, numerically suffixed chunks
    mkdir /tmp/$$/$2_$DATE
    split -d -b 20m /tmp/$$/$2_$DATE.tar.gz /tmp/$$/$2_$DATE/$2_$DATE.tar.gz.

    # Upload the chunks, encrypted (-e), rate-limited to 20 KB/s by trickle
    trickle -s -u 20 s3cmd put -e --recursive /tmp/$$/$2_$DATE s3://your-s3-bucket-name/backup/
fi
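
A hypothetical run (the paths and base name are made up):

./s3-up.sh "/home/user/documents /home/user/photos" homedirs

The first argument is word-split by the shell, so quoting a space-separated list of paths lets tar archive all of them; the second argument becomes the base of the archive and chunk names.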

s3-down.sh:

#!/bin/bash

if [ -z "$1" ] || [ -z "$2" ]
then
    echo "Arguments: file to get, like s3://your-s3-bucket-name/backup/bak1/*, where to extract, optional password"
else
    # Work in a temp dir keyed to this shell's PID
    mkdir /tmp/$$
    mkdir /tmp/$$/split
    cd /tmp/$$/split

    # Fetch the chunks; with no password given, s3cmd decrypts them
    # using its currently configured passphrase
    s3cmd get "$1"

    if [ -n "$3" ]
    then
        # A password was supplied: decrypt each chunk manually with gpg
        for fn in /tmp/$$/split/*
        do
            mv "$fn" "$fn-enc"
            gpg -d --verbose --no-use-agent --batch --yes --passphrase "$3" -o "$fn" "$fn-enc"
            rm "$fn-enc"
        done
    fi

    # Reassemble the chunks and extract the archive into the target dir
    cat * > ../temp.tar.gz
    cd ..
    tar -xvzf temp.tar.gz -C "$2"
fi
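
And a hypothetical restore, using the bucket path format from the script's usage message (the destination directory is just an example):

./s3-down.sh "s3://your-s3-bucket-name/backup/bak1/*" /home/user/restore

Note that the first argument is quoted so the local shell doesn't try to expand the wildcard before s3cmd sees it; the optional third argument is only needed if the chunks were encrypted with a passphrase other than the one s3cmd is currently configured with.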