How to Retrieve a File Archived to AWS Glacier

These are brief instructions to retrieve a particular file from AWS Glacier.

The process is very slow: it takes approximately 2-3 hours to retrieve the file list (the vault inventory), and then another 2+ hours to get the file itself.

Currently, looking up the file and finding its ArchiveId is a very manual process, but it could easily be automated with a few lines of Python and a decent regex.

Initiate Job

To get started, we need to specify the credentials to access the vault (via a named profile stored in ~/.aws/credentials), the region, and the vault name. Because a user may have permission to access vaults in other AWS accounts, --account-id is required; the - shortcut means “use the account that the credentials belong to”.

Because we don’t know the ArchiveId of the file we want to download (and possibly, not even whether it’s actually there at all) we first need an “inventory” of all the files that are in the vault:

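# Export the settings so that the aws child process can actually see them
export AWS_PROFILE=...
export AWS_REGION=us-west-2   # AWS CLI v1 reads AWS_DEFAULT_REGION instead
VAULT=...
JOB_ID=$(aws glacier initiate-job --account-id - \
    --vault-name $VAULT \
    --job-parameters '{"Type": "inventory-retrieval"}' | \
    jq -r .jobId)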

The response for the initiate-job task has this form:

{
    "location": "/<ACCOUNT ID>/vaults/<VAULT>/jobs/c3b...dWTxedwM6",
    "jobId": "c3b...TxedwM6"
}

so we extract the jobId from the JSON using jq and store it into the JOB_ID var.

Wait for Job to complete

While in progress, we can interrogate AWS for completion status of the JOB_ID, on the specified VAULT:

aws glacier describe-job --account-id - \
    --vault-name $VAULT \
    --job-id $JOB_ID

Response:

{
    "JobId": "c3b...TxedwM6",
    "Action": "InventoryRetrieval",
    "VaultARN": "arn:aws:glacier:us-west-2:<ACCOUNT>:vaults/<VAULT>",
    "CreationDate": "2024-03-05T04:21:11.735Z",
    "Completed": false,
    "StatusCode": "InProgress",
    "InventoryRetrievalParameters": {
        "Format": "JSON"
    }
}

While hitting Enter every few minutes for a couple of hours is amazing fun and a great way to spend an evening, we’d rather have the computer do this for us while we, I don’t know, have a beer? Watch a movie? Hack some code in a different terminal?

while [ "$(aws glacier describe-job --account-id - \
        --vault-name $VAULT \
        --job-id $JOB_ID | jq -r .Completed)" != "true" ]
do 
    echo -n .
    sleep 60
done
printf "\nJob $JOB_ID Completed\n\n"
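
The shell loop above only looks at the Completed field, so it cannot tell a successful job from a failed one. As a minimal sketch of the same polling idea in Python, the snippet below also checks StatusCode. It assumes the aws CLI is on the PATH and configured as above; the function names describe_job and wait_for_job are made up for this example, not part of any AWS tooling.

```python
import json
import subprocess
import time


def describe_job(vault, job_id):
    """Run `aws glacier describe-job` and return its JSON output as a dict."""
    result = subprocess.run(
        ["aws", "glacier", "describe-job", "--account-id", "-",
         "--vault-name", vault, "--job-id", job_id],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def wait_for_job(describe, interval=60, sleep=time.sleep):
    """Call `describe` every `interval` seconds until the job completes.

    `describe` is any zero-argument callable returning the describe-job
    dict; raises RuntimeError if Glacier reports StatusCode "Failed".
    """
    while True:
        status = describe()
        if status.get("StatusCode") == "Failed":
            raise RuntimeError(status.get("StatusMessage", "job failed"))
        if status.get("Completed"):
            return status
        sleep(interval)
```

With the same values we stored in $VAULT and $JOB_ID, the call would look like wait_for_job(lambda: describe_job(vault, job_id)). Passing the describe step in as a callable keeps the loop easy to try out without touching AWS.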

Like all good things in life, this too eventually comes to an end (expect anywhere between 2 and 4 hours, possibly longer), and we can then retrieve the full list of the vault’s contents as a JSON file:

aws glacier get-job-output --account-id - \
    --vault-name $VAULT --job-id $JOB_ID \
    Downloads/scans.json

Find the ArchiveId for the file

The contents of the JSON file are fairly straightforward to parse and search. This time round I just used a trivial Cmd-F to find the file I was looking for; obviously, one could go nuts and implement an automated search.

Note, however, that the full path of the file (as recorded by the backup tool when it archived the file) is stored in the ArchiveDescription field and, while that value is itself JSON, it is escaped and stored as a string:

{
  "VaultARN": "arn:aws:glacier:us-west-2:<Account ID>:vaults/<VAULT>",
  "InventoryDate": "2024-03-04T21:19:29Z",
  "ArchiveList": [
    {
      "ArchiveId": "0HwoNe...mZA",
      "ArchiveDescription": "{\"path\": \"/Public/scans/file-001.pdf\", \"type\": \"file\"}",
      "CreationDate": "2019-01-01T22:26:52Z",
      "Size": 854274,
      "SHA256TreeHash": "d291e...5"
    },
    {
      "ArchiveId": "ef456...K51g",
      "ArchiveDescription": "{\"path\": \"/Public/scans/report_yr10.pdf\", \"type\": \"file\"}",
      "CreationDate": "2019-01-01T22:27:25Z",
      "Size": 2301336,
      "SHA256TreeHash": "6...f24"
    },
    {
      "ArchiveId": "_Mtv...A6NNSg",
      "ArchiveDescription": "{\"path\": \"/Public/scans/rainbow_art.pdf\", \"type\": \"file\"}",
      "CreationDate": "2019-01-01T22:28:04Z",
      "Size": 1534374,
      "SHA256TreeHash": "1...fd6"
    },
    {
      "ArchiveId": "VNj...k0g",
      "ArchiveDescription": "{\"path\": \"/Public/scans/transcript.pdf\", \"type\": \"file\"}",
      "CreationDate": "2019-01-01T22:28:05Z",
      "Size": 5204272,
      "SHA256TreeHash": "5...0e58"
    },
    ...
  ]
}
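
For anyone who would rather not Cmd-F their way through the inventory, here is a rough sketch of the “few lines of Python and a regex” approach. It assumes every ArchiveDescription that matters has the {"path": ..., "type": ...} shape shown above, and it skips anything that does not decode. The function name find_archives is invented for this example.

```python
import json
import re


def find_archives(inventory, pattern):
    """Return (path, ArchiveId) pairs whose archived path matches `pattern`."""
    regex = re.compile(pattern)
    matches = []
    for archive in inventory.get("ArchiveList", []):
        # ArchiveDescription is a JSON document stored as a string,
        # so it needs a second round of decoding.
        try:
            description = json.loads(archive.get("ArchiveDescription", ""))
        except (json.JSONDecodeError, TypeError):
            continue  # not in the backup tool's format: skip it
        if not isinstance(description, dict):
            continue
        path = description.get("path", "")
        if regex.search(path):
            matches.append((path, archive["ArchiveId"]))
    return matches
```

To use it, load the inventory file with json.load and pass the resulting dict together with a regex such as r"file-001\.pdf$"; each returned pair carries the ArchiveId needed for the next step.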

Retrieve data

Be that as it may, now that we have the ArchiveId for our file (let’s assume it’s the life-changing file-001.pdf) we can ask Glacier to pretty please retrieve it for us:

JOB_ID=$(aws glacier initiate-job --account-id - \
        --vault-name $VAULT \
        --job-parameters \
        '{"Type": "archive-retrieval", "ArchiveId": "0HwoNe...mZA"}' |\
    jq -r .jobId)
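
Hand-quoting JSON inside a shell string gets fiddly once the ArchiveId comes from a script rather than copy and paste. The sketch below builds the --job-parameters value with json.dumps instead. The helper name retrieval_parameters is hypothetical; the optional Tier field (Expedited, Standard or Bulk, with Standard being what Glacier uses when none is given) is part of the Glacier job parameters, but check the pricing before picking anything other than the default.

```python
import json

VALID_TIERS = ("Expedited", "Standard", "Bulk")


def retrieval_parameters(archive_id, tier="Standard", description=None):
    """Build the JSON string for an archive-retrieval job."""
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown retrieval tier: {tier}")
    params = {
        "Type": "archive-retrieval",
        "ArchiveId": archive_id,
        "Tier": tier,
    }
    if description:
        # Free-text label that shows up in describe-job output
        params["Description"] = description
    return json.dumps(params)
```

The returned string can be passed as-is as the value of --job-parameters.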

Again, this is not a quick fetch: it will take several hours, so once more we wait with some patience:

while [ "$(aws glacier describe-job --account-id - \
        --vault-name $VAULT \
        --job-id $JOB_ID | jq -r .Completed)" != "true" ]
do 
    echo -n .
    sleep 60
done
printf "\nJob $JOB_ID Completed\n\n"

Once the retrieval job completes, we can finally download the file from Glacier and store it somewhere locally:

aws glacier get-job-output --account-id - \
        --vault-name $VAULT --job-id $JOB_ID \
        Downloads/file-001.pdf
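
After waiting this long, it may be worth checking that the download is intact. The inventory lists a SHA256TreeHash for each archive, and the sketch below recomputes it locally: hash the file in 1 MiB chunks, then hash adjacent pairs of digests together until a single one is left. The function names file_chunks and tree_hash are made up for this example, and this is an illustration of the tree-hash scheme rather than code lifted from an AWS SDK.

```python
import hashlib

MIB = 1024 * 1024  # tree-hash leaves are 1 MiB chunks


def file_chunks(path, size=MIB):
    """Yield the file's contents in consecutive blocks of `size` bytes."""
    with open(path, "rb") as f:
        while True:
            block = f.read(size)
            if not block:
                return
            yield block


def tree_hash(chunks):
    """Return the SHA-256 tree hash (hex) of an iterable of 1 MiB chunks."""
    level = [hashlib.sha256(chunk).digest() for chunk in chunks]
    if not level:
        level = [hashlib.sha256(b"").digest()]  # empty input
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        # Hash each pair of digests; an odd one out moves up unchanged
        level = [
            hashlib.sha256(p[0] + p[1]).digest() if len(p) == 2 else p[0]
            for p in pairs
        ]
    return level[0].hex()
```

Calling tree_hash(file_chunks("Downloads/file-001.pdf")) should give the same hex string as the SHA256TreeHash recorded for that archive in the inventory; if it does not, the download is suspect.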

Profit!

Code Trips & Tips