
Torch Guide

This is a very small guide that will walk you through the main features, important concepts, and some bonus annoyances of working with The ML Framework. I hope this page will help someone on their journey.

Installation

So, let's start, shall we? To use torch, you need to install it (and python).

While we all love the latest features, in the ML world they are rarely used, so you can install an older version for better compatibility. Even if you are not using Anaconda, I *still* recommend installing the default Anaconda version of python. Torch and friends usually release binaries for the previous major version, 3.12 at the time of writing.

Sidenote: a lot of projects right now are still on 3.11 and will not work on 3.12 due to ABI incompatibilities between python versions, which is why Anaconda is so helpful: it can actually install different python versions in different venvs. The only sad part is that the venvs are global.
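
For example, a minimal setup might look something like this (the env name and python version here are just illustrative, and the exact torch command for your platform and CUDA version comes from the official install selector):

conda create -n torch-env python=3.11
conda activate torch-env
pip install torch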

Depending on the platform, you will have multiple installation options, which are presented on the official installation page. Windows specifically has a tendency to sometimes not work without the CUDA Toolkit, so you may need to install it first. After installation, we are ready to go! Let's import torch.


import torch
      

Behind the scenes, torch will initialize its C++ engine and all available devices. If you are using CUDA, you can check its availability with this command:


torch.cuda.is_available()
      

If this returns True, then your installation is working properly! Otherwise, something is wrong and you need to investigate this before proceeding.
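
Once that works, a common pattern is to pick a device up front and move data to it. A small sketch ("cuda" here assumes a working GPU setup):

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.zeros(3, device=device)   # created directly on the chosen device
y = x.to("cpu")                     # or moved explicitly later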

First steps: Tensors and basic operations

Torch operates over large arrays of data and behaviours attached to them. We will learn about gradients and computation graphs later; right now we will focus on just the numbers.

The easiest way to think about torch & python is to imagine that you have two universes: the one you are in, and the one behind the mirror. Most things that you do in your universe will affect the other side, but that doesn't mean they are the same. Torch tries as hard as it can to blur the line, but sometimes you will find lines like this:


value = torch.tensor(0.000232)
print(value.item())   # prints 0.000232..., now a plain python float rather than a tensor
      

Notice the `.item()` call. What it does is 'move' a scalar value from torch's universe to python's, converting a single-value tensor into a plain float.
This does not work for arrays though, and torch will complain about it, since there is no conventional pure-python representation for large multidimensional numeric arrays. Be careful about what you pass around, or it may cost you multiple hours of debugging later :)
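
A quick sketch of what that complaint looks like (the exact error text may differ between torch versions):

a = torch.randn(3)
a.item()      # raises a RuntimeError: more than one element cannot be converted to a scalar
a.tolist()    # this works: a plain python list of floats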

But what is a tensor, you ask? It's just a generalization of a matrix. If you imagine a matrix, it has two dimensions: rows and columns; a vector has only one dimension, and a picture will have three (height, width, and color values, commonly referred to as channels). So a tensor is just a large blob of numbers with a certain shape and additional properties and methods, like rank or transposition. Sometimes you will encounter tensors with up to six or seven dimensions, which is why you will sometimes see people write code with comments describing the current shape of the tensor:


def process(x: torch.Tensor) -> torch.Tensor:
    B, H, W, C = x.shape
    assert C == 3
    x = lin(x)                       # B x H x W x 8
    x = x.reshape(B, -1, 8)          # B x H*W x 8
    return conv(x.permute(0, 2, 1))  # B x 8 x H*W/2
      
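
If you want to check a tensor's shape and rank interactively, the relevant attributes look like this (a small sketch):

img = torch.randn((224, 224, 3))   # a fake "picture": height, width, channels
img.shape    # torch.Size([224, 224, 3])
img.ndim     # 3, the number of dimensions (the rank)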

But data is useless without algorithms, right? So, what can we do with this data?
Most torch operations included on the tensor itself are element-wise, with other common operations including transposition, matrix multiplication, and indexing.


# Element-wise
a: torch.Tensor = torch.randn((12, 8))
b: torch.Tensor = torch.randn((12, 8))
a + b
a - b
a / b
a * b

# Transposition
a_t = a.T

# Matrix multiplication
a_t @ b

# Indexing, taking first two elements from both dimensions
a[:2, :2]
      

And those are the primary tensor operations.

Making a first model, the dumb way

Let's imagine that we don't have any helper functions or modules. Instead, we are going to build a model from scratch. And we will start with the most basic example humankind was able to find: simulating the sin function.

There are many types of models; we call them architectures. Different architectures can be useful for different tasks, such as compression, object detection, generation, etc. Our architecture will be much simpler than the ones used for "real" models. We will just unfold the number into a vector, and then compress it back to a single number.
The specific reasons behind this architecture we will discuss in a different chapter, but generally these types of models are easier to train, build, and understand. So while I can't explain yet why we are doing this, at least I can explain how.
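
Shape-wise (assuming a hidden size of 128, which is what the code below uses), the plan looks like this:

# one scalar in -> a 128-feature vector -> one scalar out
# (N, 1) -> Linear -> (N, 128) -> ReLU -> (N, 128) -> Linear -> (N, 1)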

Imagine or draw a sin graph. What do you see? A bunch of bends. Yeah. That's the thing we (our model, specifically) are simulating. And that's almost all tasks: we just have bends, and we are trying to replicate them. In our case, sin is the "true function". And the amount of bends our model can simulate is limited by the amount of activation bends. Different activation functions give different bends: this is why different activations sometimes give drastically different results in the same model. In our case though, for simplicity we will use ReLU, which we are going to write ourselves.
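
To make the "bend" idea a bit more concrete, here is a tiny sketch comparing ReLU with tanh on a few inputs (tanh is only here for contrast; we will stick with ReLU):

xs = torch.linspace(-2, 2, 5)   # tensor([-2., -1.,  0.,  1.,  2.])
torch.relu(xs)                  # one sharp bend at zero: [0., 0., 0., 1., 2.]
torch.tanh(xs)                  # a smooth S-shaped bend instead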

Every model is just a sequence of actions: linear and non-linear layers alternating with each other. Usually in simple networks (they are called Feed-Forward networks or Multi-Layer Perceptrons (MLPs)) we have: Linear -> Activation -> (Linear -> Activation) * N -> Linear. To build our basic model we just need to find a way to create those two kinds of layers.
And they are the simplest ones! A linear layer is just the input matrix (variable X) matrix-multiplied by a weight matrix (variable W), with a bias matrix (variable B) added afterwards. So the formula would be:

Y = X W + B

And because we've chosen ReLU as our activation function, we just need to set every number that's lower than zero to zero!
As a result, our code will look like this:


def relu(x):
    return torch.clamp(x, min=0)

def linear(weight, bias, x):
    x = x @ weight
    x = x + bias
    return x

def create_model():
    # random, untrained weights for a 1 -> 128 -> 1 network
    return {
        "lin1": {
            "weight": torch.randn((1, 128)),
            "bias": torch.randn(128),
        },
        "lin2": {
            "weight": torch.randn((128, 1)),
            "bias": torch.randn(1),
        }
    }

def model(params, x):
    x = linear(params["lin1"]["weight"], params["lin1"]["bias"], x)
    x = relu(x)
    x = linear(params["lin2"]["weight"], params["lin2"]["bias"], x)
    return x
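
And that's the whole model. A quick sketch of how you would call it (the weights are random and untrained, so the fit will be terrible; training is a topic for later):

params = create_model()
x = torch.linspace(-3.14, 3.14, 100).reshape(-1, 1)   # 100 inputs, shape (100, 1)
y_pred = model(params, x)                              # shape (100, 1)
y_true = torch.sin(x)
print(((y_pred - y_true) ** 2).mean())                 # mean squared error, large for now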