
Transformer NMT

Complete Documentation & Project Details for Neural Machine Translation & Self-Attention

Project Description

This project implements a Transformer-based neural machine translation (NMT) system. An encoder-decoder Transformer with self-attention and multi-head attention processes source sequences and generates high-quality translations. The system includes beam search decoding, positional encoding, BLEU score evaluation, and advanced features such as attention visualization, parallel corpus support, and comprehensive training tools.

The encoder processes source sequences with multi-head self-attention layers, while the decoder generates translations using masked self-attention and encoder-decoder attention. The implementation provides full PyTorch support, a comprehensive training pipeline, a REST API server, BLEU score evaluation, and deployment tools for neural machine translation applications.

Project Screenshots

[4 project screenshots: Transformer NMT]

Core Features

Transformer Architecture

  • Encoder-decoder transformer
  • Multi-head self-attention
  • Positional encoding
  • High-quality translation
  • Neural machine translation

Self-Attention Mechanism

  • Self-attention in encoder
  • Masked self-attention in decoder
  • Encoder-decoder attention
  • Multi-head attention
  • Long-range dependencies

Beam Search Decoding

  • Beam search algorithm
  • Multiple candidate exploration
  • Configurable beam width
  • Improved translation quality
  • Better sequence selection

Positional Encoding

  • Sinusoidal positional encoding
  • Position information injection
  • Word order understanding
  • Sequence structure awareness
  • Position-aware embeddings

BLEU Score Evaluation

  • BLEU score calculation
  • Translation quality metrics
  • Model performance evaluation
  • Validation during training
  • Comprehensive evaluation

REST API Server

  • Flask-based API
  • Translation endpoint
  • Batch translation endpoint
  • CORS enabled
  • Production-ready

Advanced Features

Attention Visualization

  • Attention weight visualization
  • Model interpretability
  • Source-target alignment
  • Visual attention maps
  • Heatmap generation

Parallel Corpus Support

  • Parallel corpus format
  • Source-target pairs
  • Vocabulary building
  • Data preprocessing

Multi-Head Attention

  • Multiple attention heads
  • Different relationship types
  • Parallel attention computation
  • Enhanced representation

Training Visualization

  • Loss curve visualization
  • Accuracy tracking
  • Learning rate monitoring
  • Overfitting detection
  • Training history plots

REST API Endpoints

Endpoint          Method  Description                  Request Body                                                     Response
/translate        POST    Single sentence translation  {"text": "sentence", "use_beam_search": true, "beam_width": 5}  Translation result
/translate/batch  POST    Batch translation            {"texts": ["sentence1", "sentence2"], "use_beam_search": true}  Batch translations
/health           GET     Health check                 N/A                                                              Server status
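
For orientation, here is a minimal sketch of how such a /translate route can be wired up in Flask. The translate_sentence helper below is a placeholder, not the project's actual inference code in api_server.py:

# Minimal sketch of a /translate route; translate_sentence() is a
# placeholder standing in for the project's trained Transformer.
from flask import Flask, jsonify, request

app = Flask(__name__)

def translate_sentence(text, use_beam_search=True, beam_width=5):
    # Placeholder: the real server loads the trained model and decodes here.
    return text[::-1]

@app.route("/translate", methods=["POST"])
def translate():
    payload = request.get_json(force=True)
    text = payload["text"]
    use_beam = payload.get("use_beam_search", True)
    width = payload.get("beam_width", 5)
    result = translate_sentence(text, use_beam_search=use_beam, beam_width=width)
    return jsonify({"source": text, "translation": result, "beam_search": use_beam})

@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "healthy", "model_loaded": True})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)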

Technologies Used

This Transformer NMT project is built with modern deep learning and web technologies. The core implementation uses Python as the primary programming language and PyTorch for deep learning operations, with a Transformer encoder-decoder architecture that relies on self-attention and multi-head attention for neural machine translation. The project also provides a Flask-based REST API for web integration, Jupyter Notebook support for interactive development and demonstrations, and BLEU score evaluation for assessing translation quality.

The model's encoder-decoder, self-attention design enables it to process source sequences and generate high-quality translations. The system supports beam search decoding for better translation quality, positional encoding for word-order information, and multi-head attention for capturing different types of relationships, making it suitable for a range of neural machine translation applications.

Python 3.8+ PyTorch 2.0+ Transformer Self-Attention NMT Multi-Head Machine Translation Jupyter Notebook Flask 2.3+ BLEU Score

Installation & Usage

Installation

Install all required dependencies for the Transformer NMT project:

# Install all requirements
pip install -r requirements.txt

# The Transformer model will be trained on your data
# Prepare parallel corpus data in data/parallel_corpus.txt
# Format: source_sentence ||| target_sentence

PyTorch Installation

Install PyTorch (CPU or GPU version):

# For CPU only
pip install torch torchvision torchaudio

# For CUDA (GPU support) - CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Verify installation
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"

Verify Installation

Test the model and verify all components work:

# Test model architecture
python test_model.py

# This will verify:
# - Model can be instantiated
# - Forward pass works
# - All components function correctly
# - Device compatibility (CPU/CUDA)

Training the Model

Train the Transformer model on your parallel corpus dataset:

# Prepare data in data/parallel_corpus.txt
# Format: "source_sentence ||| target_sentence" (one pair per line)

# Basic training with default parameters
python train.py --data_path data/parallel_corpus.txt --num_epochs 50 --batch_size 32

# Full training with all parameters
python train.py \
    --data_path data/parallel_corpus.txt \
    --num_epochs 50 \
    --batch_size 32 \
    --d_model 512 \
    --num_heads 8 \
    --num_layers 6 \
    --d_ff 2048 \
    --dropout 0.1 \
    --lr 0.0001 \
    --max_length 100 \
    --min_freq 2 \
    --save_dir ./models

# Training with a specific number of pairs (for testing)
python train.py --data_path data/parallel_corpus.txt --num_epochs 10 --num_pairs 1000

# Or use the Jupyter notebook
jupyter notebook transformer_nmt_demo.ipynb

Training Parameters:

  • --data_path: Path to parallel corpus file (required)
  • --num_epochs: Number of training epochs (default: 50)
  • --batch_size: Batch size for training (default: 32)
  • --d_model: Model dimension (default: 512)
  • --num_heads: Number of attention heads (default: 8)
  • --num_layers: Number of encoder/decoder layers (default: 6)
  • --d_ff: Feed-forward network dimension (default: 2048)
  • --dropout: Dropout rate (default: 0.1)
  • --lr: Learning rate (default: 0.0001)
  • --max_length: Maximum sequence length (default: 100)
  • --min_freq: Minimum word frequency for vocabulary (default: 2)
  • --num_pairs: Number of sentence pairs to use (optional)
  • --save_dir: Directory to save models (default: ./models)

Translation Inference

Translate sentences using the trained model:

# Single sentence translation
python inference.py --model_path models/best_model.pt --sentence "Hello, how are you?"

# Batch translation from file
python inference.py --model_path models/best_model.pt --input_file input.txt --output_file output.txt

# With beam search (better quality)
python inference.py --model_path models/best_model.pt --sentence "Hello" --use_beam_search --beam_width 5

# Or use the Jupyter notebook
jupyter notebook transformer_nmt_demo.ipynb

# Using the Python API
from inference import load_model, translate_sentence

translation = translate_sentence(
    "Hello, how are you?",
    model_path="models/best_model.pt",
    vocab_dir="./models",
    device="cuda",
    use_beam_search=True,
    beam_width=5
)
print(translation)

REST API Server

Start the Flask API server for web integration:

# Start API server (default port 5000)
python api_server.py --model_path models/best_model.pt --vocab_dir ./models

# Start on custom port
python api_server.py --model_path models/best_model.pt --vocab_dir ./models --port 8080

# Start on custom host and port
python api_server.py --model_path models/best_model.pt --vocab_dir ./models --host 0.0.0.0 --port 5000

# API will be available at http://localhost:5000
# Example API calls:
# POST /translate       - {"text": "Hello", "use_beam_search": true, "beam_width": 5}
# POST /translate/batch - {"texts": ["Hello", "How are you?"], "use_beam_search": true}
# GET  /health          - Check API health

API Server Parameters:

  • --model_path: Path to trained model checkpoint (required)
  • --vocab_dir: Directory containing vocabulary files (default: ./models)
  • --port: Port to run server on (default: 5000)
  • --host: Host to bind to (default: 0.0.0.0)

Docker Deployment

Deploy using Docker for production:

# Build Docker image
docker build -t transformer-nmt .

# Run container
docker run -d \
    -p 5000:5000 \
    -v $(pwd)/models:/app/models \
    -v $(pwd)/data:/app/data \
    --name transformer-nmt-api \
    transformer-nmt

# Or use Docker Compose
docker-compose up -d

# View logs
docker logs transformer-nmt-api

# Stop container
docker stop transformer-nmt-api

Model Evaluation

Evaluate the trained model performance:

# Evaluate trained model
python evaluate.py --model_path models/best_model.pt --test_file data/test.txt

# Or use the evaluation module
from evaluate import calculate_bleu_score

# Calculate BLEU score for a translation
# (reference and candidate must be in the same target language)
reference = "bonjour le monde"
candidate = "bonjour monde"
bleu_score = calculate_bleu_score(reference, candidate)
print(f"BLEU Score: {bleu_score}")

BLEU Score Evaluation

Evaluate translation quality using BLEU score:

from evaluate import calculate_bleu_score

# BLEU score for translation evaluation
# (the candidate translation is compared against a reference in the target language)
reference = "bonjour le monde"
candidate = "bonjour monde"
bleu = calculate_bleu_score(reference, candidate)

# Print results
print(f"BLEU Score: {bleu:.4f}")

# Evaluate on a test set
python evaluate.py --model_path models/best_model.pt --test_file data/test.txt

BLEU Score Description:

  • BLEU Score: Measures n-gram precision between reference and candidate translations, widely used for machine translation evaluation (a reference computation is sketched after this list)
  • Range: 0.0 to 1.0, where higher scores indicate better translation quality
  • Usage: Standard metric for evaluating neural machine translation systems
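
The project ships its own calculate_bleu_score in evaluate.py; for reference, the same metric can be computed with NLTK's sentence_bleu. This is a sketch of the standard computation, where smoothing avoids zero scores on short sentences:

# Reference BLEU computation with NLTK (the project uses its own evaluate.py).
# Smoothing avoids zero scores when higher-order n-grams have no matches.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "bonjour le monde".split()  # tokenized reference translation
candidate = "bonjour monde".split()     # tokenized candidate translation

smooth = SmoothingFunction().method1
bleu = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU Score: {bleu:.4f}")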

Attention Visualization

Visualize attention weights to understand model behavior:

# Attention visualization is available in the Jupyter notebook
jupyter notebook transformer_nmt_demo.ipynb

# The notebook includes:
# - Attention weight visualization
# - Source-target alignment heatmaps
# - Multi-head attention visualization
# - Positional encoding visualization

# This creates heatmaps showing which source words
# the model focuses on when generating each target word
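
For a standalone picture of what such a heatmap involves, the sketch below draws one with matplotlib. The token lists and the attn matrix here are hypothetical placeholders; the notebook pulls real weights from the model's attention layers:

# Minimal sketch: plot a source-target attention heatmap with matplotlib.
# `attn` is a hypothetical (target_len x source_len) weight matrix.
import numpy as np
import matplotlib.pyplot as plt

src_tokens = ["hello", "world"]          # hypothetical source sentence
tgt_tokens = ["bonjour", "le", "monde"]  # hypothetical target sentence
attn = np.random.rand(len(tgt_tokens), len(src_tokens))
attn /= attn.sum(axis=1, keepdims=True)  # normalize rows like softmax output

fig, ax = plt.subplots()
im = ax.imshow(attn, cmap="viridis")
ax.set_xticks(range(len(src_tokens)))
ax.set_xticklabels(src_tokens)
ax.set_yticks(range(len(tgt_tokens)))
ax.set_yticklabels(tgt_tokens)
ax.set_xlabel("Source")
ax.set_ylabel("Target")
fig.colorbar(im)
plt.show()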

Batch Translation

Translate multiple sentences from files:

# Batch translation from file
python inference.py \
    --model_path models/best_model.pt \
    --input_file input_sentences.txt \
    --output_file translations.txt \
    --use_beam_search \
    --beam_width 5

# Or use the API endpoint
# POST /translate/batch - {"texts": ["sentence1", "sentence2"]}

Jupyter Notebook

Open the interactive Jupyter notebook for demonstrations:

# Transformer NMT demonstration notebook
jupyter notebook transformer_nmt_demo.ipynb

# The notebook includes:
# - Model architecture visualization
# - Self-attention mechanism explanation
# - Training setup examples
# - Translation inference examples
# - Positional encoding visualization

# Or use JupyterLab
jupyter lab transformer_nmt_demo.ipynb

Project Structure

transformer-nmt/
├── README.md # Main documentation
├── requirements.txt # Python dependencies
├── LICENSE # License file
├── QUICKSTART.md # Quick start guide
├── CHANGELOG.md # Changelog
├── RELEASE_NOTES.md # Release notes
│
├── Core Modules
│ ├── transformer_model.py # Transformer architecture
│ ├── data_preprocessing.py # Data loading and vocabulary
│ ├── train.py # Training script
│ ├── inference.py # Translation inference
│ ├── evaluate.py # BLEU score evaluation
│ ├── utils.py # Utility functions
│ ├── config.py # Configuration settings
│ └── test_model.py # Model testing
│
├── API & Services
│ ├── api_server.py # Flask REST API
│ └── docs/API.md # API documentation
│
├── Data
│ └── parallel_corpus.txt # Training data (source ||| target)
│
├── Models
│ └── (trained model checkpoints and vocabularies)
│
├── Notebooks
│ └── transformer_nmt_demo.ipynb # Jupyter notebook demo
│
└── Scripts
    └── prepare_data.py # Data preparation utilities

Configuration Options

Model Configuration

Customize model and training parameters in config.py and train.py:

# Model Architecture (config.py)
D_MODEL = 512        # Model dimension
NUM_HEADS = 8        # Number of attention heads
NUM_LAYERS = 6       # Number of encoder/decoder layers
D_FF = 2048          # Feed-forward network dimension
DROPOUT = 0.1        # Dropout rate
MAX_LEN = 5000       # Maximum sequence length for positional encoding

# Training Parameters (config.py)
BATCH_SIZE = 32          # Training batch size
LEARNING_RATE = 0.0001   # Learning rate
NUM_EPOCHS = 50          # Number of training epochs
MAX_LENGTH = 100         # Maximum sequence length
MIN_FREQ = 2             # Minimum word frequency for vocabulary
GRAD_CLIP = 1.0          # Gradient clipping value
TRAIN_SPLIT = 0.9        # Train/validation split ratio

# Inference Configuration
BEAM_WIDTH = 5           # Beam search width
USE_BEAM_SEARCH = True   # Use beam search or greedy decoding
MAX_DECODE_LENGTH = 100  # Maximum decoding length

Configuration Tips:

  • D_MODEL: Must be divisible by NUM_HEADS (e.g., 512 / 8 = 64 per head); see the sanity-check sketch after this list
  • NUM_HEADS: Common values: 4, 8, 16. More heads = better but slower
  • NUM_LAYERS: More layers = better quality but slower training. Start with 4-6
  • D_FF: Typically 4x D_MODEL (e.g., 512 * 4 = 2048)
  • DROPOUT: 0.1 is standard. Increase if overfitting (0.2-0.3)
  • LEARNING_RATE: Start with 0.0001. Use learning rate scheduling
  • BATCH_SIZE: Larger = faster but needs more memory. Adjust based on GPU
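
A few of these constraints can be checked programmatically before a long training run. A minimal sketch mirroring the config.py names above:

# Sanity checks for the configuration tips above (illustrative only).
D_MODEL, NUM_HEADS, NUM_LAYERS, D_FF = 512, 8, 6, 2048

assert D_MODEL % NUM_HEADS == 0, "D_MODEL must be divisible by NUM_HEADS"
head_dim = D_MODEL // NUM_HEADS  # 512 / 8 = 64 dimensions per head
if D_FF != 4 * D_MODEL:
    print("note: D_FF is conventionally 4x D_MODEL")
print(f"head_dim={head_dim}, layers={NUM_LAYERS}")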

Training Progress Logging

The training script automatically logs progress to JSON files:

# Training logs are saved to:
# models/training_history.json

# Contains:
# - Training loss per epoch
# - Validation loss per epoch
# - Learning rate schedule
# - Best model checkpoint info

# Visualize training progress
python visualize_training.py --log_file models/training_history.json
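
As a hand-rolled alternative to visualize_training.py, the history file can also be plotted directly. The key names "train_loss" and "val_loss" below are assumptions about the JSON layout, not confirmed field names:

# Sketch: plot loss curves from the training history JSON.
# The keys "train_loss" and "val_loss" are assumed, not confirmed.
import json
import matplotlib.pyplot as plt

with open("models/training_history.json", encoding="utf-8") as f:
    history = json.load(f)

epochs = range(1, len(history["train_loss"]) + 1)
plt.plot(epochs, history["train_loss"], label="train loss")
plt.plot(epochs, history["val_loss"], label="validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()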

Detailed Architecture

Transformer Components

1. Encoder:

  • Stack of identical encoder layers (default: 6 layers)
  • Each layer contains: Multi-head self-attention + Feed-forward network
  • Residual connections around each sub-layer
  • Layer normalization after each sub-layer
  • Processes source sequence to create rich representations

2. Decoder:

  • Stack of identical decoder layers (default: 6 layers)
  • Each layer contains: Masked self-attention + Encoder-decoder attention + Feed-forward
  • Masked self-attention prevents looking at future tokens
  • Encoder-decoder attention connects to encoder outputs
  • Generates target sequence one token at a time

3. Attention Mechanisms:

  • Self-Attention (Encoder): Words attend to all words in source
  • Masked Self-Attention (Decoder): Words attend only to previous words
  • Encoder-Decoder Attention: Decoder attends to encoder outputs
  • Multi-Head: Multiple attention heads capture different relationships (see the sketch below)
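
The project defines these layers itself in transformer_model.py; for orientation, here is a sketch of the same encoder-decoder stack wired up with PyTorch's built-in nn.Transformer (an illustration, not the project's actual module):

# Sketch of the encoder-decoder stack using PyTorch built-ins.
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=6, num_decoder_layers=6,
    dim_feedforward=2048, dropout=0.1,
    batch_first=True,
)

src = torch.rand(2, 10, 512)  # (batch, src_len, d_model) source embeddings
tgt = torch.rand(2, 7, 512)   # (batch, tgt_len, d_model) target embeddings

# Causal mask so decoder positions attend only to previous tokens
tgt_mask = nn.Transformer.generate_square_subsequent_mask(7)
out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([2, 7, 512])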

Self-Attention Formula

The attention mechanism computes:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:
- Q (Query): What information am I looking for?
- K (Key):   What information do I have?
- V (Value): What information do I provide?
- d_k: Dimension of keys (d_model / num_heads)
- √d_k: Scaling factor to prevent softmax saturation

Multi-Head Attention:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

Each head:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
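
A direct implementation of this formula takes only a few lines of PyTorch. The sketch below is illustrative, not the project's transformer_model.py:

# Scaled dot-product attention, implementing the formula above.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # QK^T / sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                # attention weights
    return weights @ V, weights

Q = torch.rand(1, 5, 64)  # (batch, seq_len, d_k)
K = torch.rand(1, 5, 64)
V = torch.rand(1, 5, 64)
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (1, 5, 64) (1, 5, 5)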

Positional Encoding

Sinusoidal positional encoding adds position information:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:
- pos: Position in sequence
- i: Dimension index
- d_model: Model dimension

This allows the model to understand:
- Absolute position of words
- Relative distances between words
- Word order in sequences
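
The formulas translate directly into code; a sketch of the standard sinusoidal encoding:

# Sinusoidal positional encoding implementing the formulas above (a sketch).
import math
import torch

def positional_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe

pe = positional_encoding(max_len=100, d_model=512)
print(pe.shape)  # torch.Size([100, 512])
# Added to token embeddings before the first encoder layer:
# x = token_embeddings + pe[: x.size(1)]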

Advanced Features Usage

Translation Usage

Use the translation model with customizable parameters:

# Single sentence translation
from inference import translate_sentence

translation = translate_sentence(
    "Hello, how are you?",
    model_path="models/best_model.pt",
    vocab_dir="./models",
    device="cuda",
    use_beam_search=True,
    beam_width=5
)
print(translation)

# Batch translation
from inference import translate_batch

translations = translate_batch(
    ["Hello", "How are you?"],
    model_path="models/best_model.pt",
    vocab_dir="./models",
    device="cuda",
    use_beam_search=True,
    beam_width=5
)
print(translations)

Beam Search Decoding

Use beam search for better translation quality:

from inference import translate_sentence

# Translate with beam search
translation = translate_sentence(
    "Hello, how are you?",
    model_path="models/best_model.pt",
    vocab_dir="./models",
    device="cuda",
    use_beam_search=True,
    beam_width=5  # Number of candidates to explore
)

# Higher beam width = better quality but slower
# Recommended: 3-10 for a balance between quality and speed

# Greedy decoding (faster, lower quality)
translation = translate_sentence(
    "Hello",
    model_path="models/best_model.pt",
    vocab_dir="./models",
    device="cuda",
    use_beam_search=False
)
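
To make the algorithm itself concrete, the toy sketch below runs beam search over a hypothetical step() function that scores the next token. The real decoder in inference.py scores tokens with the trained model instead:

# Toy beam search over a hypothetical step(prefix) -> log-prob dict,
# illustrating the algorithm only (the real decoder lives in inference.py).
import math

def step(prefix):
    # Hypothetical next-token distribution; a real model scores the vocabulary.
    vocab = {"bonjour": 0.6, "le": 0.25, "monde": 0.1, "<eos>": 0.05}
    return {tok: math.log(p) for tok, p in vocab.items()}

def beam_search(beam_width=3, max_len=5):
    beams = [([], 0.0)]  # (token list, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "<eos>":
                candidates.append((tokens, score))  # finished hypothesis
                continue
            for tok, logp in step(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        # Keep only the beam_width best hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

print(beam_search(beam_width=3))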

Model Evaluation

Evaluate model performance with BLEU score:

from evaluate import calculate_bleu_score

# Calculate BLEU score for a translation
reference = "bonjour le monde"
candidate = "bonjour monde"
bleu = calculate_bleu_score(reference, candidate)
print(f"BLEU Score: {bleu:.4f}")

# Evaluate on a test set
python evaluate.py --model_path models/best_model.pt --test_file data/test.txt
# Returns BLEU score for translation quality assessment

Parallel Corpus Preparation

Prepare parallel corpus data for training:

# Prepare parallel corpus file
# Format: source_sentence ||| target_sentence
# Example data/parallel_corpus.txt:

hello world ||| bonjour le monde
how are you ||| comment allez-vous
good morning ||| bonjour

# The training script will automatically:
# - Build vocabulary from the corpus
# - Split into train/validation sets
# - Preprocess and tokenize sentences
# - Create data loaders for training

# Use the prepared data for training
python train.py --data_path data/parallel_corpus.txt --num_epochs 50

Positional Encoding

Understand how positional encoding works in the Transformer:

# Positional encoding is automatically applied in the model
# It uses sinusoidal functions to encode position information

# The positional encoding allows the model to understand:
# - Word order in sequences
# - Relative positions between words
# - Sequence structure

# Visualization available in the Jupyter notebook
jupyter notebook transformer_nmt_demo.ipynb

# The notebook includes a positional encoding visualization
# showing how position information is encoded in embeddings

Complete Training Workflow

Step-by-Step Training Process

Step 1: Prepare Data

# Create parallel corpus file
# Format: source ||| target (one pair per line)
echo "hello world ||| bonjour le monde" > data/parallel_corpus.txt
echo "how are you ||| comment allez-vous" >> data/parallel_corpus.txt
echo "good morning ||| bonjour" >> data/parallel_corpus.txt

# Or use the data preparation script
python scripts/prepare_data.py --input raw_data.txt --output data/parallel_corpus.txt

Step 2: Train Model

# Start training
python train.py --data_path data/parallel_corpus.txt --num_epochs 50

# Training will:
# 1. Load and preprocess data
# 2. Build source and target vocabularies
# 3. Split into train/validation sets (90/10)
# 4. Initialize the Transformer model
# 5. Train with label smoothing loss
# 6. Save checkpoints and the best model
# 7. Log training history to JSON

Step 3: Monitor Training

  • Watch console output for epoch progress
  • Check models/training_history.json for detailed logs
  • Visualize training curves: python visualize_training.py
  • Best model saved as models/best_model.pt

Step 4: Evaluate Model

# Evaluate on test set
python evaluate.py --model_path models/best_model.pt --test_file data/test.txt

# Calculates BLEU scores for translation quality

Step 5: Translate

# Single sentence
python inference.py --model_path models/best_model.pt --sentence "Hello"

# Batch translation
python inference.py --model_path models/best_model.pt --input_file input.txt --output_file output.txt

API Usage Examples

Translation Endpoint (cURL)

Translate a sentence using the REST API:

curl -X POST http://localhost:5000/translate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, how are you?",
    "use_beam_search": true,
    "beam_width": 5
  }'

# Response:
# {
#   "source": "Hello, how are you?",
#   "translation": "Bonjour, comment allez-vous?",
#   "beam_search": true
# }

Batch Translation (cURL)

Translate multiple sentences at once:

curl -X POST http://localhost:5000/translate/batch \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Hello", "How are you?", "Good morning"],
    "use_beam_search": true,
    "beam_width": 5
  }'

# Response:
# {
#   "sources": ["Hello", "How are you?", "Good morning"],
#   "translations": ["Bonjour", "Comment allez-vous?", "Bonjour"],
#   "beam_search": true
# }

Health Check (cURL)

Check API server health and model status:

curl -X GET http://localhost:5000/health

# Response:
# {
#   "status": "healthy",
#   "model_loaded": true,
#   "device": "cuda"
# }

Python Requests Example

Use the API with Python requests library:

import requests

# Translation endpoint
response = requests.post(
    'http://localhost:5000/translate',
    json={
        'text': 'Hello, how are you?',
        'use_beam_search': True,
        'beam_width': 5
    }
)
data = response.json()
print(f"Source: {data['source']}")
print(f"Translation: {data['translation']}")

# Batch translation
batch_response = requests.post(
    'http://localhost:5000/translate/batch',
    json={
        'texts': ['Hello', 'How are you?'],
        'use_beam_search': True
    }
)
print(batch_response.json())

# Health check
health = requests.get('http://localhost:5000/health')
print(health.json())

JavaScript/Fetch Example

Use the API with JavaScript fetch API:

// Single translation
fetch('http://localhost:5000/translate', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({
    text: 'Hello, how are you?',
    use_beam_search: true,
    beam_width: 5
  })
})
  .then(res => res.json())
  .then(data => {
    console.log('Source:', data.source);
    console.log('Translation:', data.translation);
  });

// Batch translation
fetch('http://localhost:5000/translate/batch', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({
    texts: ['Hello', 'How are you?'],
    use_beam_search: true
  })
})
  .then(res => res.json())
  .then(data => {
    console.log('Translations:', data.translations);
  });

// Health check
fetch('http://localhost:5000/health')
  .then(res => res.json())
  .then(data => console.log('Status:', data));

Transformer Model Variants

Model               Parameters                Size           Use Case                     Speed
Small Transformer   d_model=256, 4 layers     ~20-40 MB      Fast inference, basic tasks  Fastest
Medium Transformer  d_model=512, 6 layers     ~50-100 MB     Balanced quality/speed       Fast
Large Transformer   d_model=512, 8 layers     ~150-300 MB    Higher-quality translation   Moderate
XL Transformer      d_model=1024, 12 layers   ~500-1000 MB   Best quality, research       Slower

Dataset Information

Parallel Corpus Format

The project uses parallel corpus format with source and target sentence pairs:

  • Source and target language pairs
  • One sentence pair per line
  • Separated by ||| delimiter
  • Automatic vocabulary building
  • Train/validation split support
  • Multiple language pair support

Data Format

Training data is stored in parallel corpus format (one pair per line):

# parallel_corpus.txt format (one pair per line)
hello world ||| bonjour le monde
how are you ||| comment allez-vous
good morning ||| bonjour
thank you ||| merci
see you later ||| à bientôt

# Format: source_sentence ||| target_sentence

# The training script automatically:
# - Builds vocabulary from both source and target
# - Splits into train/validation sets
# - Tokenizes and preprocesses sentences
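
A minimal loader for this format can be sketched in a few lines; the project's data_preprocessing.py does this plus vocabulary building and tokenization:

# Sketch: read source/target pairs from the ||| format.
pairs = []
with open("data/parallel_corpus.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line or "|||" not in line:
            continue  # skip blank and malformed lines
        source, target = (part.strip() for part in line.split("|||", 1))
        pairs.append((source, target))

print(f"Loaded {len(pairs)} sentence pairs")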

Adding Custom Training Data

Add your own parallel corpus data for training:

# Simply append to the parallel corpus file
with open('data/parallel_corpus.txt', 'a', encoding='utf-8') as f:
    f.write("source sentence ||| target sentence\n")
    f.write("hello ||| bonjour\n")
    f.write("goodbye ||| au revoir\n")

# Or create a new domain-specific file
with open('data/custom_corpus.txt', 'w', encoding='utf-8') as f:
    f.write("custom source ||| custom target\n")

# Use it in training
python train.py --data_path data/custom_corpus.txt --save_dir models/custom

Troubleshooting & Best Practices

Common Issues

  • CUDA Out of Memory: Reduce batch_size in train.py, use smaller d_model (256 instead of 512), reduce num_layers, or use CPU mode
  • Model Not Found: Ensure model is trained first by running train.py or loading from models/ directory. Check model path is correct
  • Vocabulary Not Found: Ensure vocabularies are saved during training. Check vocab_dir path matches training save_dir
  • Slow Translation: Use smaller d_model (256) or fewer layers (4 instead of 6), reduce beam_width, or use greedy decoding
  • API Connection Error: Check if api_server.py is running on port 5000. Verify model_path and vocab_dir are correct
  • Import Errors: Verify all dependencies installed: pip install -r requirements.txt. Check Python version (3.8+)
  • Sequence Too Long: Reduce MAX_LENGTH in config.py or use shorter sentences. Model has max sequence length limit
  • Poor Translation Quality: Train for more epochs, use larger model, increase training data, or adjust learning rate
  • Training Loss Not Decreasing: Check learning rate (may be too high/low), verify data format, check for data issues
  • Validation Loss Increasing: Model may be overfitting. Increase dropout, use more data, or reduce model size
  • NLTK Data Missing: Run python -c "import nltk; nltk.download('punkt')" to download required NLTK data

Best Practices

  • Training Data: Use diverse, high-quality parallel corpus data. More data = better results. Aim for 10K+ sentence pairs minimum
  • Data Format: Ensure parallel corpus uses ||| separator. One pair per line. Clean and normalize text before training
  • Data Preprocessing: Normalize text, handle special characters, ensure consistent encoding (UTF-8)
  • Batch Size: Use smaller batches (16-32) for limited GPU memory. Larger batches (64+) for faster training if memory allows
  • Learning Rate: Start with 0.0001 and adjust based on training loss. Use ReduceLROnPlateau scheduler (automatic in training)
  • Gradient Clipping: Default is 1.0. Increase if training is unstable, decrease if gradients are too small
  • Beam Search: Use beam_width 3-10 for balance. Higher = better quality but slower. 5 is a good default
  • Model Selection: Start with d_model=256, 4 layers for speed/testing. Use 512+ and 6+ layers for production quality
  • Evaluation: Regularly evaluate BLEU score on validation set. Monitor for overfitting (val loss increasing)
  • Vocabulary: Adjust MIN_FREQ to control vocabulary size. Lower (1-2) = larger vocab, higher (3-5) = smaller vocab
  • Checkpointing: Model saves best checkpoint automatically. Can resume training from checkpoint if needed
  • API Rate Limiting: Implement rate limiting for production deployments. Consider using nginx or similar
  • Logging: Monitor training logs (training_history.json) for debugging and optimization
  • Device Selection: Use CUDA if available for faster training. CPU works but much slower

Performance Optimization

  • GPU Usage: Set CUDA_VISIBLE_DEVICES for multi-GPU systems. Use GPU for training and inference when available
  • Model Selection: Use d_model=256, 4 layers for fastest inference. Larger models (512, 6+ layers) for better quality
  • Batch Processing: Use batch translation endpoint for processing multiple sentences efficiently. Reduces overhead
  • Caching: API server caches model in memory. Model loads once on first request, then reused
  • Sequence Length: Limit MAX_LENGTH to reduce memory usage and improve speed. Shorter sequences = faster
  • Decoding Parameters: Use greedy decoding for speed (10x faster), beam search for quality (better translations)
  • Model Quantization: Consider model quantization for production to reduce memory and speed up inference
  • Async Processing: For high-throughput, consider async API or queue system for batch processing
  • Memory Management: Clear GPU cache between batches if running out of memory: torch.cuda.empty_cache()

Contact Information

Get in Touch

Developer: Molla Samser
Designer & Tester: Rima Khatun

Website: rskworld.in
Email: help@rskworld.in, support@rskworld.in
Phone: +91 93305 39277

License

This project is for educational purposes only. See LICENSE file for more details.
