Tutorial 8: Recurrent Neural Networks for Text - ADSC 4720

Before we dive in: You’ve already taught a CNN to recognise images by learning spatial patterns. Now we’ll teach a network to read — to understand sequences by learning temporal patterns. We work data-first: start with real text, define what we want to predict, then build the machinery step by step.

Learning Objectives

By the end of this tutorial you will be able to:

Explain why MLPs fail on sequential data and what problem RNNs solve.
Describe the main text prediction tasks: next character, next word, and sentiment classification.
Convert raw text to numbers: tokenisation, vocabulary, and nn.Embedding.
Build input/target pairs for sequence prediction using a sliding window.
Implement an RNN cell from scratch and explain the role of the hidden state.
Train a character-level language model and generate text.

Prerequisites (Tutorial 7 Recap)

In Tutorial 7 you:

Loaded pre-trained CNNs (AlexNet, ResNet18) from torchvision.models.
Froze backbone weights and fine-tuned only the classifier head.

Today we shift from images to text sequences — data where order matters.

# ── Setup ──────────────────────────────────────────────────────────────
import random
from collections import Counter

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, Dataset

import numpy as np
import matplotlib.pyplot as plt

import plotly.graph_objects as go
from plotly.subplots import make_subplots

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

if torch.backends.mps.is_available():
    device = torch.device('mps')
elif torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

print(f'Using device: {device}')

Using device: mps

Part 1 — Why Do We Need RNNs? 🧠

The Problem with MLPs on Text

In Tutorials 3 and 4 you learned that an MLP takes a fixed-size input vector and maps it to an output. Each input position is just another feature with its own weights; there’s no built-in concept of before or after.

This is a serious problem for language. Consider:

“I grew up in France. I went to school there. I learned the culture. I speak fluent ___.”

The answer is French — but to know that, you need to remember a word from much earlier in the sequence. A traditional MLP can only look at the tokens you explicitly place in its input window. If the useful clue sits outside that window, the model never sees it.

How the Supervised Data Looks

For next-token prediction, a fixed-window MLP and an RNN are trained on different input structures:

MLP: last k tokens -> next token
RNN: token 1, then token 2, then token 3, ... -> updated hidden state -> next token

That difference is the whole point: the MLP has a short, hard cutoff on memory, while the RNN carries a running summary forward through time.

# ── Part 1: How MLP and RNN training examples are structured ─────────────
tokens = ['I', 'grew', 'up', 'in', 'France', '.', 'I', 'speak', 'fluent', '___']
MLP_WINDOW = 3

print('Fixed-window MLP training pairs (window = 3 tokens):')
print()
for i in range(MLP_WINDOW, len(tokens)):
    context = tokens[i - MLP_WINDOW:i]
    target = tokens[i]
    print(f'  input={context!r:<30} -> target={target!r}')

print()
print("Prediction for the blank:")
print(f"  MLP sees only: {tokens[-1-MLP_WINDOW:-1]}")
print(f"  Key clue 'France' is {len(tokens) - 1 - tokens.index('France')} positions back — outside the window.")
print()
print('RNN view of the same prediction:')
for step, token in enumerate(tokens[:-1], start=1):
    prefix = tokens[:step]
    print(f'  step {step:>2}: prefix={prefix!r}')
print()
print("The RNN receives the whole prefix one token at a time, so 'France' can influence the prediction.")

Fixed-window MLP training pairs (window = 3 tokens):

  input=['I', 'grew', 'up']            -> target='in'
  input=['grew', 'up', 'in']           -> target='France'
  input=['up', 'in', 'France']         -> target='.'
  input=['in', 'France', '.']          -> target='I'
  input=['France', '.', 'I']           -> target='speak'
  input=['.', 'I', 'speak']            -> target='fluent'
  input=['I', 'speak', 'fluent']       -> target='___'

Prediction for the blank:
  MLP sees only: ['I', 'speak', 'fluent']
  Key clue 'France' is 5 positions back — outside the window.

RNN view of the same prediction:
  step  1: prefix=['I']
  step  2: prefix=['I', 'grew']
  step  3: prefix=['I', 'grew', 'up']
  step  4: prefix=['I', 'grew', 'up', 'in']
  step  5: prefix=['I', 'grew', 'up', 'in', 'France']
  step  6: prefix=['I', 'grew', 'up', 'in', 'France', '.']
  step  7: prefix=['I', 'grew', 'up', 'in', 'France', '.', 'I']
  step  8: prefix=['I', 'grew', 'up', 'in', 'France', '.', 'I', 'speak']
  step  9: prefix=['I', 'grew', 'up', 'in', 'France', '.', 'I', 'speak', 'fluent']

The RNN receives the whole prefix one token at a time, so 'France' can influence the prediction.
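The hard cutoff can also be seen at the level of tensor shapes. The sketch below is a minimal illustration, not part of the tutorial's pipeline: the sizes `EMBED_DIM`, `HIDDEN` and `WINDOW` are hypothetical toy values, and the inputs are random vectors standing in for embedded tokens. A fixed-window MLP rejects a 9-token input outright, while an `nn.RNN` accepts both lengths and returns a hidden state of the same size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

EMBED_DIM, HIDDEN, WINDOW = 4, 8, 3   # hypothetical toy sizes

# Fixed-window MLP: expects exactly WINDOW embedded tokens, flattened.
mlp = nn.Sequential(
    nn.Linear(WINDOW * EMBED_DIM, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN),
)
# RNN: the same weights are applied at every step, so any length works.
rnn = nn.RNN(input_size=EMBED_DIM, hidden_size=HIDDEN, batch_first=True)

short_seq = torch.randn(1, 3, EMBED_DIM)   # 3 tokens (fits the window)
long_seq  = torch.randn(1, 9, EMBED_DIM)   # 9 tokens (the full prefix)

mlp_out = mlp(short_seq.flatten(start_dim=1))   # works: 3 * 4 = 12 inputs
try:
    mlp(long_seq.flatten(start_dim=1))          # 9 * 4 = 36 inputs
    mlp_accepts_long = True
except RuntimeError:
    mlp_accepts_long = False                    # shape mismatch in the first Linear

_, h_short = rnn(short_seq)
_, h_long  = rnn(long_seq)

print('MLP accepts 9-token input:', mlp_accepts_long)
print('RNN final state shapes   :', tuple(h_short.shape), tuple(h_long.shape))
```

Widening the MLP's window would make the longer input fit, but only by adding a new set of weights for every extra position.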

# ── Part 1: Context Window Visualisation ───────────────────────────────
sentence = ['I', 'grew', 'up', 'in', 'France', '.', 'I', 'speak', 'fluent', '___']
# Hand-picked, illustrative influence values (not measured from a trained model).
mlp_weights = np.array([0.02, 0.02, 0.03, 0.05, 0.01, 0.02, 0.15, 0.25, 0.40, 0.0])
rnn_weights = np.array([0.05, 0.08, 0.07, 0.10, 0.30, 0.05, 0.10, 0.10, 0.10, 0.0])

colors_mlp = ['#e74c3c' if w > 0.1 else '#fadbd8' for w in mlp_weights]
colors_rnn = ['#27ae60' if w > 0.1 else '#d5f5e3' for w in rnn_weights]

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('MLP: only nearby words matter', 'RNN: long-range context remembered')
)
fig.add_trace(
    go.Bar(x=sentence, y=mlp_weights, marker_color=colors_mlp, name='MLP',
           hovertemplate='Word: %{x}<br>Influence: %{y:.2f}<extra></extra>'),
    row=1, col=1)
fig.add_trace(
    go.Bar(x=sentence, y=rnn_weights, marker_color=colors_rnn, name='RNN',
           hovertemplate='Word: %{x}<br>Influence: %{y:.2f}<extra></extra>'),
    row=1, col=2)
fig.update_layout(
    title_text='How Much Each Word Influences the Prediction of "___"',
    height=380, showlegend=False, template='plotly_white')
fig.update_yaxes(title_text='Influence Weight', range=[0, 0.55])
fig.show()
print("The RNN gives high weight to 'France' (the key context word), which the MLP misses.")

The RNN gives high weight to 'France' (the key context word), which the MLP misses.

Part 2 — The Data: What Are We Working With? 📄

Before writing any model code, let’s look at the actual data and the tasks we want to solve.

Our Training Corpus

We’ll use a nursery rhyme as our text corpus — small enough to read in full, which makes it easy to inspect what the model is learning.

Three Prediction Tasks

| # | Task | Input | Output | Model |
|---|------|-------|--------|-------|
| 1 | Next-character prediction | `'mary had a litt'` | `'l'` | CharRNN / CharLSTM |
| 2 | Next-word prediction | `['mary', 'had', 'a']` | `'little'` | (illustrated) |
| 3 | Sentiment classification | `'this film was fantastic'` | `positive` | SentimentLSTM |

All three share one challenge: order matters. You can’t shuffle the words and still answer correctly. The rest of this tutorial builds the tools for Task 1; sentiment classification (Task 3) is picked up in Tutorial 9.

# ── Part 2: The Text Corpus and Three Prediction Tasks ────────────────
# Dataset 1: Nursery rhyme — our training corpus
TEXT = (
    'mary had a little lamb little lamb little lamb '
    'mary had a little lamb its fleece was white as snow '
    'and everywhere that mary went mary went mary went '
    'and everywhere that mary went the lamb was sure to go '
    'it followed her to school one day school one day school one day '
    'it followed her to school one day which was against the rules'
)

print('=== CORPUS ===')
print(TEXT)
print(f'\nLength: {len(TEXT)} characters,  unique: {len(set(TEXT))}')

print('\n=== TASK 1: Next-Character Prediction ===')
print('Given a sequence of characters, predict the next one:')
for start in [0, 15, 30]:
    prefix = TEXT[start:start + 15]
    target = TEXT[start + 15]
    print(f'  {repr(prefix)}  →  {repr(target)}')

print('\n=== TASK 2: Next-Word Prediction ===')
words = TEXT.split()
print('Given a sequence of words, predict the next one:')
for i in [0, 4, 8]:
    print(f'  {words[i:i+3]}  →  {repr(words[i+3])}')

print('\n=== TASK 3: Sentiment Classification ===')
print('Given a whole review, predict its sentiment:')
for text, label in [('this film was absolutely wonderful', 'POSITIVE'),
                    ('this film was painfully boring',     'NEGATIVE')]:
    print(f'  {repr(text)}  →  {label}')

=== CORPUS ===
mary had a little lamb little lamb little lamb mary had a little lamb its fleece was white as snow and everywhere that mary went mary went mary went and everywhere that mary went the lamb was sure to go it followed her to school one day school one day school one day it followed her to school one day which was against the rules

Length: 328 characters,  unique: 21

=== TASK 1: Next-Character Prediction ===
Given a sequence of characters, predict the next one:
  'mary had a litt'  →  'l'
  'le lamb little '  →  'l'
  'lamb little lam'  →  'b'

=== TASK 2: Next-Word Prediction ===
Given a sequence of words, predict the next one:
  ['mary', 'had', 'a']  →  'little'
  ['lamb', 'little', 'lamb']  →  'little'
  ['lamb', 'mary', 'had']  →  'a'

=== TASK 3: Sentiment Classification ===
Given a whole review, predict its sentiment:
  'this film was absolutely wonderful'  →  POSITIVE
  'this film was painfully boring'  →  NEGATIVE
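Task 2 is only illustrated in this tutorial, but a count-based baseline makes the input/output structure concrete. The sketch below is an assumption-laden toy, not the tutorial's model: it copies the first two lines of the rhyme, and the names `next_word_counts` and `predict_next` are hypothetical helpers. It simply records which word follows each 3-word context.

```python
from collections import Counter, defaultdict

# First two lines of the rhyme, copied here so the block runs on its own.
text = ('mary had a little lamb little lamb little lamb '
        'mary had a little lamb its fleece was white as snow')
words = text.split()
CONTEXT = 3  # number of preceding words used as context

# Count which word follows each 3-word context.
next_word_counts = defaultdict(Counter)
for i in range(len(words) - CONTEXT):
    context = tuple(words[i:i + CONTEXT])
    next_word_counts[context][words[i + CONTEXT]] += 1

def predict_next(context_words):
    """Return the most frequent follower of the context, or None if unseen."""
    counts = next_word_counts.get(tuple(context_words))
    return counts.most_common(1)[0][0] if counts else None

print(predict_next(['mary', 'had', 'a']))
print(dict(next_word_counts[('a', 'little', 'lamb')]))   # two different followers
print(predict_next(['snow', 'and', 'everywhere']))       # context never seen
```

A lookup table like this has no answer for unseen contexts and cannot tell apart the two continuations of `a little lamb`, which is one motivation for a model that carries more history.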

# ── Part 2: Visualise the Three Tasks ─────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(16, 3.5))
plt.subplots_adjust(wspace=0.35)

# Task 1: next-character
chars1  = list('mary had ') + ['?']
colors1 = ['#AED6F1'] * 9 + ['#F9E79F']
axes[0].bar(range(len(chars1)), [1]*len(chars1), color=colors1, edgecolor='grey', width=0.9)
for i, c in enumerate(chars1):
    axes[0].text(i, 0.5, repr(c), ha='center', va='center', fontsize=9, fontweight='bold')
axes[0].text(9, 1.08, '"a"', ha='center', fontsize=13, color='#1A5276', fontweight='bold')
axes[0].set_title('Task 1 — Next Character\n"mary had " → ?', fontsize=10, fontweight='bold')
axes[0].set_yticks([]); axes[0].set_xticks([]); axes[0].set_ylim(0, 1.4)

# Task 2: next-word
words2  = ['mary', 'had', 'a', '?']
colors2 = ['#A9DFBF'] * 3 + ['#F9E79F']
axes[1].bar(range(len(words2)), [1]*len(words2), color=colors2, edgecolor='grey', width=0.9)
for i, w in enumerate(words2):
    axes[1].text(i, 0.5, w, ha='center', va='center', fontsize=11, fontweight='bold')
axes[1].text(3, 1.08, '"little"', ha='center', fontsize=13, color='#1A5276', fontweight='bold')
axes[1].set_title('Task 2 — Next Word\n"mary had a" → ?', fontsize=10, fontweight='bold')
axes[1].set_yticks([]); axes[1].set_xticks([]); axes[1].set_ylim(0, 1.4)

# Task 3: sentiment
words3  = ['this', 'film', 'was', 'wonderful']
colors3 = ['#D7BDE2'] * 4
axes[2].bar(range(len(words3)), [1]*len(words3), color=colors3, edgecolor='grey', width=0.9)
for i, w in enumerate(words3):
    axes[2].text(i, 0.5, w, ha='center', va='center', fontsize=10, fontweight='bold')
axes[2].annotate('→ POSITIVE', xy=(3.5, 0.5), fontsize=12,
                 color='#1E8449', fontweight='bold', va='center')
axes[2].set_xlim(-0.5, 6)
axes[2].set_title('Task 3 — Sentiment\n"this film was wonderful" → ?', fontsize=10, fontweight='bold')
axes[2].set_yticks([]); axes[2].set_xticks([]); axes[2].set_ylim(0, 1.4)

plt.suptitle('Three Sequence Prediction Tasks — Order Matters in All Cases',
             fontsize=12, fontweight='bold', y=1.03)
plt.show()

Part 3 — From Text to Numbers 🔢

Neural networks work with floating-point tensors. To feed them text, we need a pipeline:

raw text  →  tokenise  →  build vocabulary  →  encode as integers  →  embed as vectors

We’ll build each step by hand on the nursery rhyme from Part 2.

# ── Part 3: Step 1 — Tokenisation and Vocabulary ──────────────────────
# Split the text into units (tokens). We use characters for our language model.

char_vocab = sorted(set(TEXT))          # all unique characters, sorted
char2idx   = {ch: i for i, ch in enumerate(char_vocab)}
idx2char   = {i: ch for ch, i in char2idx.items()}
VOCAB_SIZE = len(char_vocab)

print(f'Text length      : {len(TEXT)} characters')
print(f'Vocabulary size  : {VOCAB_SIZE} unique characters')
print(f'Vocabulary       : {char_vocab}')
print()

# Word-level for contrast
word_vocab = sorted(set(TEXT.split()))
print(f'Word vocab size  : {len(word_vocab)} unique words')
print(f'Word vocab       : {word_vocab}')
print()
print('We use character-level for the language model.')
print('Real systems use subword tokenisation (BPE/WordPiece) but characters show the ideas cleanly.')

Text length      : 328 characters
Vocabulary size  : 21 unique characters
Vocabulary       : [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'l', 'm', 'n', 'o', 'r', 's', 't', 'u', 'v', 'w', 'y']

Word vocab size  : 28 unique words
Word vocab       : ['a', 'against', 'and', 'as', 'day', 'everywhere', 'fleece', 'followed', 'go', 'had', 'her', 'it', 'its', 'lamb', 'little', 'mary', 'one', 'rules', 'school', 'snow', 'sure', 'that', 'the', 'to', 'was', 'went', 'which', 'white']

We use character-level for the language model.
Real systems use subword tokenisation (BPE/WordPiece) but characters show the ideas cleanly.

# ── Part 3: Step 2 — Encode Text as Integers ──────────────────────────
sample = 'mary had a'
encoded_sample = [char2idx[c] for c in sample]
decoded_sample = [idx2char[i] for i in encoded_sample]

print(f'Original : {repr(sample)}')
print(f'Encoded  : {encoded_sample}')
print(f'Decoded  : {decoded_sample}')
print()

# Encode the full corpus
encoded_corpus = [char2idx[c] for c in TEXT]
assert ''.join(idx2char[i] for i in encoded_corpus) == TEXT
print(f'Full corpus encoded as {len(encoded_corpus)} integers.')
print(f'First 30 values: {encoded_corpus[:30]}')

Original : 'mary had a'
Encoded  : [11, 1, 14, 20, 0, 8, 1, 4, 0, 1]
Decoded  : ['m', 'a', 'r', 'y', ' ', 'h', 'a', 'd', ' ', 'a']

Full corpus encoded as 328 integers.
First 30 values: [11, 1, 14, 20, 0, 8, 1, 4, 0, 1, 0, 10, 9, 16, 16, 10, 5, 0, 10, 1, 11, 2, 0, 10, 9, 16, 16, 10, 5, 0]

# ── Part 3: Step 3a — One-Hot Encoding (the naive approach) ────────────
corpus_tensor = torch.tensor(encoded_corpus, dtype=torch.long)
one_hot       = F.one_hot(corpus_tensor, num_classes=VOCAB_SIZE).float()

print(f'One-hot shape: {one_hot.shape}  ({len(encoded_corpus)} chars × {VOCAB_SIZE} vocab)')
print()
print('First 5 characters as one-hot rows:')
for i in range(5):
    ch  = TEXT[i]
    idx = int(corpus_tensor[i])
    print(f'  {repr(ch):4s} (idx={idx:2d})  position {idx} = 1, rest = 0')

print()
bytes_onehot = one_hot.numel() * 4
print(f'Memory for this corpus: {bytes_onehot:,} bytes  ({bytes_onehot/1024:.1f} KB)')
print()
print('Scaling problem:')
print(f'  1M-word corpus × 50k-word vocab → {50_000 * 1_000_000 * 4 / 1e9:.0f} GB just for inputs.')
print('  We need a more compact representation.')

One-hot shape: torch.Size([328, 21])  (328 chars × 21 vocab)

First 5 characters as one-hot rows:
  'm'  (idx=11)  position 11 = 1, rest = 0
  'a'  (idx= 1)  position 1 = 1, rest = 0
  'r'  (idx=14)  position 14 = 1, rest = 0
  'y'  (idx=20)  position 20 = 1, rest = 0
  ' '  (idx= 0)  position 0 = 1, rest = 0

Memory for this corpus: 27,552 bytes  (26.9 KB)

Scaling problem:
  1M-word corpus × 50k-word vocab → 200 GB just for inputs.
  We need a more compact representation.

# ── Part 3: Step 3b — nn.Embedding (dense learned vectors) ────────────
torch.manual_seed(42)
EMBED_DIM = 16
embedding = nn.Embedding(num_embeddings=VOCAB_SIZE, embedding_dim=EMBED_DIM)

embedded = embedding(corpus_tensor)   # (seq_len, EMBED_DIM)
print(f'Embedding table shape  : {embedding.weight.shape}  ({VOCAB_SIZE} tokens × {EMBED_DIM} dims)')
print(f'Embedded corpus shape  : {embedded.shape}')
print(f'Memory: {embedded.numel()*4:,} bytes  (one-hot was {one_hot.numel()*4:,} bytes)')
print(f'Compression: {one_hot.shape[1]/EMBED_DIM:.1f}x fewer numbers per token')
print()

# Plotly heatmap: one-hot vs embedding for first 12 characters
n = 12
oh_np = one_hot[:n].numpy()
em_np = embedded[:n].detach().numpy()
chars = [repr(c) for c in TEXT[:n]]

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=(f'One-Hot ({VOCAB_SIZE}-dim, sparse)', f'Embedding ({EMBED_DIM}-dim, dense)'))
fig.add_trace(go.Heatmap(
    z=oh_np, x=list(range(VOCAB_SIZE)), y=chars,
    colorscale=[[0, '#f0f4f8'], [1, '#2980b9']], showscale=False,
    hovertemplate='Char %{y}, col %{x}: %{z}<extra></extra>'), row=1, col=1)
fig.add_trace(go.Heatmap(
    z=em_np, x=[f'd{i}' for i in range(EMBED_DIM)], y=chars,
    colorscale='RdBu', showscale=False,
    hovertemplate='Char %{y}, %{x}: %{z:.3f}<extra></extra>'), row=1, col=2)
fig.update_layout(
    title='Representing Characters: One-Hot vs Embedding (first 12 positions)',
    height=380, template='plotly_white')
fig.show()
print('Embeddings are randomly initialised — backprop updates them end-to-end during training.')

Embedding table shape  : torch.Size([21, 16])  (21 tokens × 16 dims)
Embedded corpus shape  : torch.Size([328, 16])
Memory: 20,992 bytes  (one-hot was 27,552 bytes)
Compression: 1.3x fewer numbers per token

Embeddings are randomly initialised — backprop updates them end-to-end during training.
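One way to connect the two representations: an embedding lookup gives the same result as multiplying a one-hot row by the embedding table, without ever building the one-hot matrix. The sketch below checks this on a fresh, randomly initialised `nn.Embedding`; the indices are the codes for `'m'`, `'a'`, `'r'`, `'y'` from the encoding step, and the variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM = 21, 16                      # sizes matching this tutorial
emb = nn.Embedding(VOCAB, DIM)

idx = torch.tensor([11, 1, 14, 20])      # 'm', 'a', 'r', 'y' in the char vocabulary

lookup = emb(idx)                                        # row lookup: (4, 16)
one_hot_rows = F.one_hot(idx, num_classes=VOCAB).float() # (4, 21)
via_one_hot = one_hot_rows @ emb.weight                  # (4, 21) @ (21, 16) -> (4, 16)

print(torch.allclose(lookup, via_one_hot))
```

The lookup is the cheaper of the two because it reads one row per token instead of multiplying by a mostly-zero vector.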

Part 4 — Building a Sequence Dataset 📦

We have the encoded corpus as a flat integer tensor. Now we need to slice it into (input, target) pairs for the DataLoader.

The Sliding Window

For a language model, the target at every position is simply the next character:

index: 0  1  2  3  4  5  6  7  ...
text:  m  a  r  y     h  a  d  ...

window i=0:  input  = [m  a  r  y     h  a  d  ... ]   (positions 0 … SEQ_LEN-1)
             target = [a  r  y     h  a  d     ... ]   (positions 1 … SEQ_LEN  )

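The model learns to predict the next character at every position simultaneously, which gives SEQ_LEN training signals per forward pass. Because the input at each step is the true character from the corpus rather than the model's own earlier prediction, this training setup is called teacher forcing.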

# ── Part 4: Sliding Window Preview ────────────────────────────────────
SEQ_LEN = 20

print(f'Text (first 40 chars): {repr(TEXT[:40])}')
print(f'SEQ_LEN = {SEQ_LEN}')
print()
print(f'{"i":>4}  {"input window":<{SEQ_LEN+2}}  →  {"target (shifted right by 1)"}')
print('-' * 60)
for i in range(4):
    x_str = TEXT[i    : i + SEQ_LEN]
    y_str = TEXT[i + 1: i + SEQ_LEN + 1]
    print(f'{i:>4}  {repr(x_str):<{SEQ_LEN+2}}  →  {repr(y_str)}')
print('...')
print(f'\nTotal windows: {len(TEXT) - SEQ_LEN}')

Text (first 40 chars): 'mary had a little lamb little lamb littl'
SEQ_LEN = 20

   i  input window            →  target (shifted right by 1)
------------------------------------------------------------
   0  'mary had a little la'  →  'ary had a little lam'
   1  'ary had a little lam'  →  'ry had a little lamb'
   2  'ry had a little lamb'  →  'y had a little lamb '
   3  'y had a little lamb '  →  ' had a little lamb l'
...

Total windows: 308

# ── Part 4: CharDataset and DataLoader ─────────────────────────────────
class CharDataset(Dataset):
    """Sliding-window character dataset.
    Each sample: (input_seq, target_seq) where target = input shifted right by 1."""
    def __init__(self, encoded, seq_len):
        self.data    = torch.tensor(encoded, dtype=torch.long)
        self.seq_len = seq_len

    def __len__(self):
        return len(self.data) - self.seq_len

    def __getitem__(self, idx):
        x = self.data[idx      : idx + self.seq_len]      # input window
        y = self.data[idx + 1  : idx + self.seq_len + 1]  # target (shifted)
        return x, y


char_dataset = CharDataset(encoded_corpus, SEQ_LEN)
char_loader  = DataLoader(char_dataset, batch_size=32, shuffle=True)

print(f'Dataset size     : {len(char_dataset)} (input, target) pairs')
print(f'Batches per epoch: {len(char_loader)}')
print()

x0, y0 = char_dataset[0]
print(f'x[0] decoded: {repr("".join(idx2char[i.item()] for i in x0))}')
print(f'y[0] decoded: {repr("".join(idx2char[i.item()] for i in y0))}')
print()
print('y is x shifted one step to the right — the next character at every position.')

Dataset size     : 308 (input, target) pairs
Batches per epoch: 10

x[0] decoded: 'mary had a little la'
y[0] decoded: 'ary had a little lam'

y is x shifted one step to the right — the next character at every position.
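The same windows can be built in one vectorised call instead of slicing per index. The sketch below is an alternative to a `Dataset` class, not a replacement the tutorial prescribes: it uses `Tensor.unfold` on a short stand-in string, with a hypothetical window length `SEQ = 5` chosen for readability.

```python
import torch

text = 'mary had a little lamb'
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

SEQ = 5  # hypothetical short window for readability

# Each row holds SEQ + 1 consecutive characters; step=1 slides by one position.
windows = data.unfold(0, SEQ + 1, 1)   # (len(text) - SEQ, SEQ + 1)
x = windows[:, :-1]                    # inputs:  first SEQ characters of each row
y = windows[:, 1:]                     # targets: the same row, one character later

print(x.shape, y.shape)
```

The number of rows is `len(text) - SEQ`, the same count the sliding-window dataset reports for its own sequence length.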

Part 5 — The RNN: Processing Sequences with Memory 🔄

We now have a dataset of (input_seq, target_seq) pairs — sequences of character indices, each to be embedded and processed in order.

What model can do this?

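An MLP would flatten the whole window into one fixed-size vector and learn separate weights for each position, so nothing learned at one position transfers to another and longer inputs cannot be handled. An RNN processes one token at a time, passing a hidden state forward between steps: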

$$
\mathbf{h}_t = \tanh\!\left(\mathbf{W}_{xh}\,\mathbf{x}_t + \mathbf{W}_{hh}\,\mathbf{h}_{t-1} + \mathbf{b}\right) \tag{1}
$$

$\mathbf{x}_t$ — embedded input at step $t$
$\mathbf{h}_{t-1}$ — hidden state carried forward from the previous step
$\mathbf{h}_t$ — updated hidden state (the model’s “working memory”)

The same weight matrices $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$ are reused at every step — just as CNN kernels are reused at every spatial position.

# ── Part 5: ManualRNNCell — One Step of an RNN ─────────────────────────
class ManualRNNCell(nn.Module):
    """h_t = tanh(W_xh * x_t + W_hh * h_{t-1} + b)"""
    def __init__(self, input_size, hidden_size):
        super().__init__()
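        # Input-to-hidden weights; no bias here so the cell has a single bias term b.
        self.W_xh = nn.Linear(input_size,  hidden_size, bias=False)
        # Hidden-to-hidden weights; this layer's bias plays the role of b in Eq. (1).
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=True)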

    def forward(self, x_t, h_prev):
        return torch.tanh(self.W_xh(x_t) + self.W_hh(h_prev))


# ── Trace through "mary" one character at a time ────────────────────────
HIDDEN_SIZE = 16
torch.manual_seed(7)
cell      = ManualRNNCell(EMBED_DIM, HIDDEN_SIZE)
embed_tmp = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
h         = torch.zeros(1, HIDDEN_SIZE)

prefix = 'mary'
print(f'Processing {repr(prefix)} one character at a time:')
print(f'{"Step":>5}  {"Char":>6}  {"h (first 6 dims)"}')
print('-' * 52)
for t, ch in enumerate(prefix):
    x_t = embed_tmp(torch.tensor([[char2idx[ch]]])).squeeze(1)  # (1, EMBED_DIM)
    h   = cell(x_t, h)
    vals = h[0, :6].detach().numpy().round(3)
    print(f'  t={t}    {repr(ch):>4}    {vals}')

print()
print(f'After {len(prefix)} steps h carries a compressed summary of {repr(prefix)}.')
print(f'Shape: {h.shape}')

Processing 'mary' one character at a time:
 Step    Char  h (first 6 dims)
----------------------------------------------------
  t=0     'm'    [ 0.424  0.386  0.466 -0.004 -0.457  0.064]
  t=1     'a'    [-0.197  0.617  0.229  0.223 -0.29  -0.692]
  t=2     'r'    [ 0.444  0.01  -0.737 -0.499 -0.854  0.161]
  t=3     'y'    [ 0.879  0.54   0.791 -0.245 -0.339 -0.695]

After 4 steps h carries a compressed summary of 'mary'.
Shape: torch.Size([1, 16])
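As a sanity check on the hand-written cell, its output can be compared with PyTorch's built-in `nn.RNNCell` after copying the weights across. This is a sketch under one assumption worth stating: `nn.RNNCell` stores two bias vectors (`bias_ih` and `bias_hh`), whereas the manual cell has one, so the single bias is copied into `bias_hh` and `bias_ih` is zeroed. The class is repeated here so the block runs on its own.

```python
import torch
import torch.nn as nn

class ManualRNNCell(nn.Module):
    """Same cell as in the tutorial: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_xh = nn.Linear(input_size, hidden_size, bias=False)
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=True)

    def forward(self, x_t, h_prev):
        return torch.tanh(self.W_xh(x_t) + self.W_hh(h_prev))

torch.manual_seed(0)
manual = ManualRNNCell(16, 16)
builtin = nn.RNNCell(16, 16)            # tanh nonlinearity by default

# Copy the manual cell's parameters into the built-in cell.
with torch.no_grad():
    builtin.weight_ih.copy_(manual.W_xh.weight)
    builtin.weight_hh.copy_(manual.W_hh.weight)
    builtin.bias_hh.copy_(manual.W_hh.bias)
    builtin.bias_ih.zero_()             # the manual cell has no second bias

x_t = torch.randn(1, 16)
h_prev = torch.randn(1, 16)
max_diff = (manual(x_t, h_prev) - builtin(x_t, h_prev)).abs().max().item()
print(f'max difference: {max_diff:.2e}')
```

With matching weights the two cells should agree up to floating-point rounding, which suggests the manual cell computes the same update as the library's.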

# ── Part 5: Plotly — Hidden State Evolution ────────────────────────────
prefix_str = TEXT[:15]
torch.manual_seed(7)
n_units  = 6
cell_vis = ManualRNNCell(EMBED_DIM, n_units)
emb_vis  = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
h_vis    = torch.zeros(1, n_units)
h_hist   = [h_vis.squeeze(0).detach().numpy().copy()]

with torch.no_grad():
    for ch in prefix_str:
        x_t   = emb_vis(torch.tensor([[char2idx[ch]]])).squeeze(1)
        h_vis = cell_vis(x_t, h_vis)
        h_hist.append(h_vis.squeeze(0).detach().numpy().copy())

h_arr   = np.array(h_hist)            # (len+1, n_units)
x_ticks = ['h₀'] + list(prefix_str)

fig = go.Figure()
for u in range(n_units):
    fig.add_trace(go.Scatter(
        x=list(range(len(x_ticks))), y=h_arr[:, u],
        mode='lines+markers', name=f'unit {u}',
        hovertemplate='Step %{x}<br>Value: %{y:.4f}<extra></extra>'))
fig.update_layout(
    title=f'Hidden State Evolution: ManualRNNCell reading {repr(prefix_str)}',
    xaxis=dict(tickvals=list(range(len(x_ticks))), ticktext=x_ticks, tickfont=dict(size=10)),
    yaxis_title='Hidden unit value',
    height=400, template='plotly_white')
fig.show()
print('Each line is one hidden unit. Values change at every new character.')
print('The pattern after the last character encodes the whole prefix.')

Each line is one hidden unit. Values change at every new character.
The pattern after the last character encodes the whole prefix.

# ── Part 5: nn.RNN — the PyTorch Wrapper ──────────────────────────────
# nn.RNN applies the same cell logic across every timestep in one efficient call.
rnn_layer = nn.RNN(input_size=EMBED_DIM, hidden_size=HIDDEN_SIZE,
                   num_layers=1, batch_first=True)

x_batch, y_batch = next(iter(char_loader))      # (32, SEQ_LEN)
emb_batch        = embedding(x_batch)            # (32, SEQ_LEN, EMBED_DIM)
output, h_n      = rnn_layer(emb_batch)

print('Running nn.RNN on a real batch from CharDataset:')
print(f'  Input (embedded) : {emb_batch.shape}   (batch=32, seq={SEQ_LEN}, embed={EMBED_DIM})')
print(f'  output           : {output.shape}  (batch=32, seq={SEQ_LEN}, hidden={HIDDEN_SIZE})')
print(f'                      ↑ hidden state at EVERY timestep')
print(f'  h_n              : {h_n.shape}    (num_layers=1, batch=32, hidden={HIDDEN_SIZE})')
print(f'                      ↑ hidden state at the LAST timestep only')
print()
print(f'output[:, -1, :] equals h_n[0]:')
print(f'  Max difference: {(output[:, -1, :] - h_n[0]).abs().max().item():.2e}')
print()
print('For the language model we use ALL output timesteps,')
print('because we predict the next character at every position simultaneously.')

Running nn.RNN on a real batch from CharDataset:
  Input (embedded) : torch.Size([32, 20, 16])   (batch=32, seq=20, embed=16)
  output           : torch.Size([32, 20, 16])  (batch=32, seq=20, hidden=16)
                      ↑ hidden state at EVERY timestep
  h_n              : torch.Size([1, 32, 16])    (num_layers=1, batch=32, hidden=16)
                      ↑ hidden state at the LAST timestep only

output[:, -1, :] equals h_n[0]:
  Max difference: 0.00e+00

For the language model we use ALL output timesteps,
because we predict the next character at every position simultaneously.
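The relationship between `output` and `h_n` changes slightly once layers are stacked. The sketch below uses hypothetical sizes and random inputs: with `num_layers=2`, `output` still holds only the top layer's hidden state at every timestep, while `h_n` holds the final hidden state of every layer, so the last timestep of `output` matches `h_n[-1]` rather than `h_n[0]`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn2 = nn.RNN(input_size=16, hidden_size=32, num_layers=2, batch_first=True)

x = torch.randn(4, 20, 16)          # (batch, seq_len, embed_dim)
output, h_n = rnn2(x)

print(output.shape)                 # top layer, every timestep
print(h_n.shape)                    # every layer, last timestep only

top_matches = torch.allclose(output[:, -1, :], h_n[-1])     # top layer's final state
bottom_matches = torch.allclose(output[:, -1, :], h_n[0])   # bottom layer's final state
print(top_matches, bottom_matches)
```

Note that `h_n` is always laid out as `(num_layers, batch, hidden)`, even when `batch_first=True`.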

Part 6 — Character-Level Language Model 🔤

We now have every piece in place:

| Component | What it does |
|-----------|--------------|
| `CharDataset` | Sliding-window `(input_seq, target_seq)` pairs |
| `nn.Embedding` | Maps character indices → dense vectors |
| `nn.RNN` | Processes the sequence, outputs a hidden state at every step |

Putting them together into CharRNN:

index seq  →  nn.Embedding  →  (batch, seq_len, embed_dim)
           →  nn.RNN        →  (batch, seq_len, hidden_size)   [all timesteps]
           →  nn.Linear     →  (batch, seq_len, vocab_size)    [logit per step]

At each timestep the model outputs a vector of logits over the vocabulary; a softmax turns these into a probability distribution. New text is generated character by character, either by sampling from that distribution or by taking its argmax.

# ── Part 6: CharRNN Model ──────────────────────────────────────────────
class CharRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_layers=2):
        super().__init__()
        self.embed       = nn.Embedding(vocab_size, embed_dim)
        self.rnn         = nn.RNN(embed_dim, hidden_size, num_layers=num_layers,
                                  batch_first=True)
        self.fc          = nn.Linear(hidden_size, vocab_size)
        self.hidden_size = hidden_size
        self.num_layers  = num_layers

    def forward(self, x, h=None):
        x      = self.embed(x)        # (batch, seq_len) → (batch, seq_len, embed_dim)
        out, h = self.rnn(x, h)       # → (batch, seq_len, hidden_size)
        logits = self.fc(out)         # → (batch, seq_len, vocab_size)
        return logits, h


rnn_model = CharRNN(vocab_size=VOCAB_SIZE, embed_dim=32,
                    hidden_size=128, num_layers=2).to(device)
n_params  = sum(p.numel() for p in rnn_model.parameters())
print(f'CharRNN parameter count: {n_params:,}')
print('Architecture: Embedding(VOCAB×32) → RNN(2L, 128h) → Linear(128→VOCAB)')

CharRNN parameter count: 57,141
Architecture: Embedding(VOCAB×32) → RNN(2L, 128h) → Linear(128→VOCAB)
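The parameter count can be reproduced by hand, which is a useful check that the architecture is what we think it is. The arithmetic below assumes the sizes used above (vocabulary 21, embedding 32, hidden 128, two layers) and relies on the fact that `nn.RNN` keeps two bias vectors per layer, `bias_ih` and `bias_hh`.

```python
V, E, H = 21, 32, 128   # vocab size, embedding dim, hidden size used above

embedding_params = V * E                 # one E-dim vector per character
# nn.RNN stores two bias vectors per layer (bias_ih and bias_hh), hence 2 * H.
rnn_layer1 = E * H + H * H + 2 * H       # layer 1 reads the embedding
rnn_layer2 = H * H + H * H + 2 * H       # layer 2 reads layer 1's hidden state
linear_params = H * V + V                # output layer weights + bias

total = embedding_params + rnn_layer1 + rnn_layer2 + linear_params
print(embedding_params, rnn_layer1, rnn_layer2, linear_params, total)
```

Most of the parameters sit in the two recurrent layers; the embedding table and the output layer are small because the character vocabulary is small.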

# ── Part 6: Training Helper ─────────────────────────────────────────────
def train_char_model(model, loader, n_epochs=300, lr=3e-3, clip=1.0):
    """Train CharRNN or CharLSTM with teacher forcing (predict at every position)."""
    optimizer    = optim.Adam(model.parameters(), lr=lr)
    criterion    = nn.CrossEntropyLoss()
    loss_history = []
    for epoch in range(1, n_epochs + 1):
        model.train()
        total_loss = 0.0
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)    # (B, T), (B, T)
            optimizer.zero_grad()
            logits, _ = model(xb)                     # (B, T, V)
            loss = criterion(logits.reshape(-1, VOCAB_SIZE), yb.reshape(-1))
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()
            total_loss += loss.item()
        loss_history.append(total_loss / len(loader))
        if epoch % 60 == 0:
            print(f'  Epoch {epoch:>3}/{n_epochs}  loss={loss_history[-1]:.4f}')
    return loss_history


print('Training CharRNN (300 epochs)...')
rnn_losses = train_char_model(rnn_model, char_loader, n_epochs=300)
print('Done!')

Training CharRNN (300 epochs)...
  Epoch  60/300  loss=0.1532
  Epoch 120/300  loss=0.1480
  Epoch 180/300  loss=0.1472
  Epoch 240/300  loss=0.1448
  Epoch 300/300  loss=0.1451
Done!
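Two reference points help when reading these loss values. The sketch below uses random stand-in logits and targets with hypothetical batch and sequence sizes; it is not tied to the trained model. It shows that flattening `(B, T, V)` logits averages the cross-entropy over every position, that a model guessing uniformly would score `ln(V)`, and that `exp(-loss)` can be read as the geometric-mean probability assigned to the correct next character.

```python
import math
import torch
import torch.nn as nn

V = 21                                   # vocabulary size in this tutorial
B, T = 4, 20                             # hypothetical batch size and sequence length
criterion = nn.CrossEntropyLoss()

torch.manual_seed(0)
targets = torch.randint(0, V, (B, T))

# Baseline: all-zero logits give a uniform distribution, so the loss is ln(V).
uniform_logits = torch.zeros(B, T, V)
baseline = criterion(uniform_logits.reshape(-1, V), targets.reshape(-1)).item()

# Flattening (B, T, V) -> (B*T, V) averages the loss over every position.
logits = torch.randn(B, T, V)
flat_loss = criterion(logits.reshape(-1, V), targets.reshape(-1))
per_position = torch.stack(
    [criterion(logits[:, t, :], targets[:, t]) for t in range(T)]
)

print(f'uniform baseline  : {baseline:.4f}  (ln {V} = {math.log(V):.4f})')
print(f'flattened vs mean : {flat_loss.item():.4f} vs {per_position.mean().item():.4f}')
print(f'exp(-0.1451)      : {math.exp(-0.1451):.3f}')
```

Against a uniform baseline of about 3.04, a final loss near 0.145 corresponds to an average probability of roughly 0.865 on the correct character, which is plausible for a model that has largely memorised a 328-character corpus.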

# ── Part 6: Text Generation and Training Curve ─────────────────────────
def generate_text(model, seed, length=120, temperature=1.0):
    """Sample text character-by-character from a trained CharRNN or CharLSTM."""
    model.eval()
    indices  = [char2idx[c] for c in seed if c in char2idx]
    generated = list(seed)
    state     = None
    with torch.no_grad():
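        # Warm up the hidden state on every seed character except the last;
        # the loop below feeds the last one, so it is not processed twice.
        if len(indices) > 1:
            x_seed   = torch.tensor([indices[:-1]], dtype=torch.long).to(device)
            _, state = model(x_seed, state)
        last_idx     = indices[-1]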
        for _ in range(length):
            x_in         = torch.tensor([[last_idx]], dtype=torch.long).to(device)
            logits, state = model(x_in, state)
            logits        = logits[0, -1, :] / temperature
            probs         = torch.softmax(logits, dim=-1)
            last_idx      = torch.multinomial(probs, 1).item()
            generated.append(idx2char[last_idx])
    return ''.join(generated)


fig = go.Figure()
fig.add_trace(go.Scatter(y=rnn_losses, mode='lines', name='CharRNN',
                          line=dict(color='steelblue', width=2)))
fig.update_layout(title='CharRNN Training Loss (teacher forcing)',
                  xaxis_title='Epoch', yaxis_title='Cross-Entropy Loss',
                  template='plotly_white', height=340)
fig.show()

print('Generated text (temperature=0.5 — more focused):')
print(generate_text(rnn_model, seed='mary', length=120, temperature=0.5))
print()
print('Generated text (temperature=1.2 — more creative):')
print(generate_text(rnn_model, seed='mary', length=120, temperature=1.2))

Generated text (temperature=0.5 — more focused):

marywent mary went mary went and everywhere that mary went the lamb was sure to go it followed her to school one day school 

Generated text (temperature=1.2 — more creative):
maryhere that mary went the lamb was sure to go it followed her to school one day school one day school one day it followed
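The effect of `temperature` is easiest to see on a tiny example. The sketch below uses three hypothetical logit values rather than output from the trained model, and the helper `probs_at` is illustrative. Dividing the logits by a temperature below 1 sharpens the softmax toward the top choice, while a temperature above 1 flattens it.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.0])    # hypothetical scores for three characters

def probs_at(temperature):
    """Softmax of logits / temperature, the same scaling used when sampling."""
    return torch.softmax(logits / temperature, dim=-1)

for t in (0.5, 1.0, 1.2):
    p = probs_at(t)
    print(f'T={t}: {[round(v, 3) for v in p.tolist()]}')
```

The ranking of the characters never changes with temperature; only how much probability mass the lower-ranked ones receive, and therefore how often sampling picks them.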

Summary

| Concept | Key idea |
|---------|----------|
| RNN vs MLP | RNN passes a hidden state forward: memory across steps |
| Tokenisation | Split raw text into characters or words |
| Vocabulary | Map each unique token to an integer index |
| `nn.Embedding` | Compact learned dense vector per token (better than one-hot) |
| Sliding window | `(input_seq, target_seq)` pairs where target is input shifted by 1 |
| Teacher forcing | Feed true characters during training; predict at every step |
| Hidden state | $h_t = \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b)$: accumulated memory |
| Many-to-many | Predict at every timestep (character language model) |

What’s Next?

In Tutorial 9 we’ll tackle the two challenges this tutorial leaves open:

Long-range dependencies: vanilla RNNs struggle to retain context over long spans, because gradients shrink as they are propagated back through many timesteps (the vanishing-gradient problem). We’ll introduce LSTMs (gated memory) and see how they mitigate it.
Sequence classification: we’ll build a sentiment classifier that reads a whole review and outputs a single label (many-to-one).

After that, Tutorial 10 replaces recurrence entirely with self-attention — the core of the Transformer.