Having recently joined the Instagram Relevance team, where I work on optimizing GPU efficiency for our ML recommendation models, I wanted to get some hands-on practice writing CUDA C++ kernels.

At first, the plan was to use my Ubuntu dev box, but that one runs an older GeForce GTX 770, whose proprietary Nvidia drivers are no longer supported, and which would not be able to run the latest and greatest CUDA libraries anyway.

My MacBook Pro M2 (an absolutely awesome development platform) is a non-starter: it does support GPU acceleration, and PyTorch will happily use the Metal (MPS) backend, but that won’t be much help when it comes to CUDA.

The obvious choice was then to spin up an AWS EC2 GPU instance and call it a day – that turned out to be a winner, but the process was less obvious than expected, so I figured I should share it (and the code – yes, there’s code, obviously: you wouldn’t think I’d do such a thing as spinning up Cloud resources… manually?!?! eeeew 🤮😂).

Where’s the code? In my GitHub repository, obviously – feel free to just clone it and use it, but if you are curious about the process (and a couple of gotchas), please read on.

AMI selection

This was reasonably straightforward, but not entirely obvious: I initially chose Amazon’s AMI based on Amazon Linux (I always prefer simplicity), but that one has a fundamental drawback, in that it uses an ancient version of glibc – and when I tried to connect VSCode for a remote session, that failed.

Don’t get me wrong, I love using the shell, but doing development in vim is my definition of self-harm 😉, and while VSCode is far from my IDE of choice (JetBrains FTW) it is a reasonable option, and works well for C++ development.

So, eventually, I chose to use the Amazon Ubuntu AMI with support for GPU and PyTorch 2.x:

# Get the most recent matching AMI: ImageId, CreationDate and Name (sorted by CreationDate, descending)
AMI_INFO=$(aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=*GPU*PyTorch*2*Ubuntu*" "Name=architecture,Values=x86_64" \
  --query "Images[].[ImageId, CreationDate, Name]" \
  --output text | sort -k2 -r | head -n1)
AMI_ID=$(echo "$AMI_INFO" | awk '{print $1}')
AMI_NAME=$(echo "$AMI_INFO" | cut -f3-)
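
Since the name filter above is fairly broad, a quick sanity check to confirm which AMI was actually picked doesn’t hurt:

echo "Using AMI: ${AMI_NAME} (${AMI_ID})"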

Instance Type

GPU instances are of the g class, and come in several flavors, but they don’t come cheap: the g4dn.xlarge is the cheapest, and if you select one in a region such as us-west-2 (or us-east-2) it comes down to around 60¢/hr, which is reasonable.

Just remember to tear it down once you’re done with your development session: letting it run for a week or so can quickly add up to around $100 (which is also why one should always automate stuff, BTW).
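
Since tearing things down is exactly the kind of step one forgets, here is a minimal sketch of how to do it from the CLI – assuming the instance was tagged with project=cuda-learn at launch (as in the launch sketch further down):

# Find any running instances tagged for this project and terminate them
INSTANCE_IDS=$(aws ec2 describe-instances \
  --region us-west-2 \
  --filters "Name=tag:project,Values=cuda-learn" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId" \
  --output text)

if [[ -n ${INSTANCE_IDS} ]]; then
  aws ec2 terminate-instances --region us-west-2 --instance-ids ${INSTANCE_IDS}
fi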

To run an instance, you need a VPC (and a bunch of other stuff): you can use the default VPC, but I find that… well, lazy 😃 so I prefer to create one (or more) ad-hoc (and tear them down when no longer used – however, as there is no cost associated with those resources, VPCs can be happily left alone for a while).

Using a tag to look it up is the easiest way to distinguish whether a new one needs to be created:

# Check if a VPC with Tag Key=project, Value=cuda-learn exists
VPC_ID=$(aws ec2 describe-vpcs \
  --region us-west-2 \
  --filters "Name=tag:project,Values=cuda-learn" \
  --query "Vpcs[0].VpcId" \
  --output text)
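
If the lookup comes back empty (the CLI prints None when nothing matches), the VPC can be created – and tagged – on the spot; a minimal sketch, where the CIDR block is an arbitrary choice:

if [[ ${VPC_ID} == "None" || -z ${VPC_ID} ]]; then
  # Create the VPC and tag it, so that the lookup above finds it next time
  VPC_ID=$(aws ec2 create-vpc \
    --region us-west-2 \
    --cidr-block 10.0.0.0/16 \
    --tag-specifications 'ResourceType=vpc,Tags=[{Key=project,Value=cuda-learn}]' \
    --query "Vpc.VpcId" \
    --output text)
fi

Keep in mind that the “bunch of other stuff” mentioned above (a subnet, an Internet Gateway and a route to it) still needs to be created alongside the VPC before an instance can actually reach the Internet.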

GPU Instance Quotas

As I went to test whether this whole thing worked, the EC2 instance failed to come up with a “you don’t have enough vCPU quota available” kinda error, which was at first puzzling.

It turns out that all G and VT instance requests are by default throttled to 0 vCPUs: you can view all the allowed quotas for your account in the EC2 Quota Console, and confirm what “account-level quota” you have (note that those quotas are Region-specific).

From there, you can file a request to increase the quota (note that there are two different quotas for on-demand and spot instances; you will need to file two separate requests, or remember to just use the one type you requested the quota increase for).

AWS is reasonably quick to approve requests (in my experience, it takes approximately an hour or so) and they will happily increase the quota to 8 vCPUs – which is not great, but enough to run a couple of g4dn.xlarge instances (each one takes up 4 vCPUs).
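
The same check (and the increase request) can also be done from the CLI via Service Quotas – L-DB2E81BA is, as far as I can tell, the quota code for “Running On-Demand G and VT instances”, but do double-check it in the console:

# Current vCPU limit for on-demand G and VT instances
aws service-quotas get-service-quota \
  --region us-west-2 \
  --service-code ec2 \
  --quota-code L-DB2E81BA \
  --query "Quota.Value"

# File the increase request (8 vCPUs = two g4dn.xlarge instances)
aws service-quotas request-service-quota-increase \
  --region us-west-2 \
  --service-code ec2 \
  --quota-code L-DB2E81BA \
  --desired-value 8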

Remote Connection to EC2

Nothing too surprising there, these are after all standard EC2 instances, so the usual stuff applies:

  • create a Security Group;
  • make sure that it has an inbound rule that allows incoming SSH (port 22) connection from wherever you are (see also below);
  • create a new KeyPair (or reuse an existing one), and remember to store it locally in a place where you can access it later (top tip: put it in a folder that is .gitignored 😂):
KEY_NAME=gpu-key
# Check if a key pair named `gpu-key` exists
KEY_ID=$(aws ec2 describe-key-pairs --key-names ${KEY_NAME} \
  --query 'KeyPairs[*].KeyPairId' --output text)

if [[ -z ${KEY_ID} ]]; then
  aws ec2 create-key-pair --key-name ${KEY_NAME} \
    --query 'KeyMaterial' --output text > private/${KEY_NAME}.pem
  chmod 400 private/${KEY_NAME}.pem
  KEY_ID=$(aws ec2 describe-key-pairs --key-names ${KEY_NAME} \
    --query 'KeyPairs[*].KeyPairId' --output text)

  success "Created SSH key  ${KEY_NAME} (${KEY_ID})"
fi
msg "Key pair in private/${KEY_NAME}.pem"
  • make sure your instance has a public IP (and note it somewhere – although it’s easy enough to find out later):
# Get Public IP
PUBLIC_IP=$(aws ec2 describe-instances \
  --instance-ids ${INSTANCE_ID} \
  --query 'Reservations[0].Instances[0].PublicIpAddress' \
  --output text)
msg "Public IP: ${PUBLIC_IP}"

Security Group Inbound Rule
In general, I believe that key pair-based SSH access is secure enough, so using the 0.0.0.0/0 IP filter for your Inbound rule is not that terrible; however, if you happen to have a reasonably “stable” IP address (in my case, AT&T Fiber changes my IP once every several months, and possibly longer than a year) I would suggest you use that /32 instead.

Finding out your own IP address is as simple as using http://formyip.com (or the AWS Console will tell you, when you edit the SG Inbound Rule), and it is worth doing.
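
The Security Group itself can be created along the same lines – a minimal sketch, where the group name is illustrative and checkip.amazonaws.com is just a convenient way to find your public IP from a script:

# Create the Security Group in the project VPC (the name is arbitrary)
SG_ID=$(aws ec2 create-security-group \
  --region us-west-2 \
  --group-name cuda-learn-sg \
  --description "SSH access for the CUDA dev instance" \
  --vpc-id ${VPC_ID} \
  --query "GroupId" \
  --output text)

# Allow inbound SSH only from the current public IP
# (use --cidr 0.0.0.0/0 if you prefer the more permissive option discussed above)
MY_IP=$(curl -s https://checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress \
  --region us-west-2 \
  --group-id ${SG_ID} \
  --protocol tcp \
  --port 22 \
  --cidr ${MY_IP}/32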

SSH Config
Strictly speaking not necessary (you can happily use ssh ubuntu@32.12.1.223 or whatever the IP happens to be), but adding the following makes life (a lot) easier:

# /etc/hosts
32.12.1.223     cuda-learn

# ~/.ssh/config
Host cuda-learn
    HostName cuda-learn
    User ubuntu
    IdentityFile ~/.ssh/gpu-key.pem
    StrictHostKeyChecking no

where gpu-key.pem is the KeyPair content copied from wherever you saved it (I did mention to put it somewhere safe earlier, now, didn’t I?), at which point doing:

ssh cuda-learn

works without a hitch (tip: if you have trouble connecting, using the -v option in ssh is a great way to figure out what’s going on).
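
One small step that is easy to miss: the config above expects the key under ~/.ssh, while the earlier snippet saved it in the repo’s private/ folder; something along these lines takes care of it:

# Copy the key where the SSH config expects it, keeping strict permissions
cp private/gpu-key.pem ~/.ssh/gpu-key.pem
chmod 400 ~/.ssh/gpu-key.pem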

VSCode Remote Connection
At this point, connecting to the instance via VSCode is as simple as clicking on the little “connect” icon in the lower-left corner and using cuda-learn as the hostname: SSH will do its magic and you will soon be coding away on your GPU.
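
If you prefer the terminal, the Remote - SSH extension can also be driven from the code CLI; something like the following should drop you straight into a remote folder (the path is illustrative):

# Open a folder on the remote host directly (requires the Remote - SSH extension)
code --remote ssh-remote+cuda-learn /home/ubuntu/cuda-learn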

Setting up the Development environment

When it first comes up, the EC2 instance has pretty much all you need to run CUDA code but, alas, nothing very useful for development (most notably, it seems unable to clone repos, even public ones, without first authenticating to GitHub).

The setup-host.sh script has the necessary steps (you will need to manually copy/paste it on the EC2 instance, sadly) and I will probably further automate its execution (at the moment, what is missing is storing the GitHub SSH key in AWS SecretsManager, and accessing it from the remote host via the aws CLI).

Please note that the gh key that is referenced there has nothing to do with the gpu-key that you use to connect to the instance: this is the private part of the private/public key pair you have registered with GitHub as an Authentication Key (not a Signing Key) – see here for more details if you haven’t done this before.
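
As for the missing automation, a hedged sketch of what pulling the GitHub key from AWS SecretsManager could look like, once the secret has been created (the secret name github/gh-auth-key and the key file name are purely illustrative, and the instance needs an IAM role allowed to read the secret):

# On the EC2 instance: fetch the GitHub authentication key from SecretsManager
aws secretsmanager get-secret-value \
  --region us-west-2 \
  --secret-id github/gh-auth-key \
  --query "SecretString" \
  --output text > ~/.ssh/gh
chmod 400 ~/.ssh/gh

# Verify that GitHub accepts the key
ssh -T -i ~/.ssh/gh git@github.com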

Using AWS Go client (SDK)

While the first iteration of this was all done via shell scripting using the aws CLI (the scripts in the scripts/ folder should still be there – if by the time you read this they are gone, you can still find them by going back to one of the early commits), that was just the “quick & dirty” way: what I really wanted was a full programming language, to build a more flexible CLI.

The AWS Go SDK is a great way to truly manage “Infrastructure as Code” (IaC) and using it one achieves the proper balance between flexibility and speed of development.

So, I migrated the whole functionality to the cuda-learn binary, which is built from the code in the go-aws-cli/ folder; I will not be describing it here (it may warrant a post of its own), but please browse the README for instructions on how to use it, and have a look at the code to see what is possible.

Happy CUDA hacking!
