Cloud Server Setup

This document describes the cloud server environment for the C1 project.

Server Info

Server: u1 (University of Toronto)

Connection:

ssh user@<server-ip>

Hardware

GPUs: 8x NVIDIA GPUs (80GB each)

GPU 0-3: 80GB each, cuda:0 to cuda:3
GPU 4-7: 80GB each, cuda:4 to cuda:7

CPU: AMD EPYC (128 threads)

Storage:

/data1/ - Models and training outputs
/home/user/chess-llm/jt/C1/ - Project code

Model Paths

Base models are stored on /data1/models/:

/data1/models/Qwen/
├── Qwen3-0.6B/          # Qwen3 0.6B Instruct
├── Qwen3-4B/            # Qwen3 4B (base)
└── Qwen3-4B-Instruct-2507/  # Qwen3 4B Instruct

Training Outputs

Trained models are saved to /data1/C1/:

/data1/C1/
├── qwen3-0.6b/
│   ├── sft_gemini3_flash/
│   └── sft_gemini3.5_flash/
└── qwen3-4b/
    ├── sft_gemini3_flash/
    └── sft_gemini3.5_flash/

Each training output contains:

checkpoint-*/ - Training checkpoints, saved every 16 steps (checkpoint-16, checkpoint-32, ..., checkpoint-320)
adapter_model.safetensors - Final LoRA weights
trainer_state.json - Training history
all_results.json - Final metrics
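
To resume or evaluate from the most recent checkpoint, the highest step number has to be picked numerically, because a plain alphabetical listing puts checkpoint-48 after checkpoint-320. The following is a minimal sketch, assuming the checkpoint-<step> naming shown above; `latest_checkpoint` is a hypothetical helper, not part of the project scripts.

```shell
# Print the highest-numbered checkpoint directory under a training output.
# Assumes directories are named checkpoint-<step>.
latest_checkpoint() {
    run_dir="$1"
    for d in "$run_dir"/checkpoint-*/; do
        [ -d "$d" ] || continue          # skip the unexpanded glob if nothing matches
        d="${d%/}"                       # drop the trailing slash
        printf '%s %s\n' "${d##*-}" "$d" # "<step> <path>"
    done | sort -n | tail -n 1 | cut -d' ' -f2-
}
```

For example, `latest_checkpoint /data1/C1/qwen3-4b/sft_gemini3_flash` would print the path of the last saved checkpoint in that run, or nothing if the run has no checkpoints yet.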

GPU Allocation Strategy

When running parallel training jobs, allocate GPUs explicitly:

# Job 1 on GPU 0-3
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/sft.sh config1.yaml

# Job 2 on GPU 4-7
CUDA_VISIBLE_DEVICES=4,5,6,7 bash scripts/sft.sh config2.yaml

Standard allocation:

Single training job: 4 GPUs (DDP)
Evaluation: 4 GPUs (vLLM tensor parallel)
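
The two 4-GPU groups can be wrapped in a small helper so that launch commands do not hard-code device lists. This is only a sketch: `gpu_group` is a hypothetical function, and the slot numbers 0 and 1 are an assumed convention for the GPU 0-3 and GPU 4-7 groups.

```shell
# Map a job slot to its GPU list, following the two 4-GPU groups above.
# gpu_group is a hypothetical helper, not part of the project scripts.
gpu_group() {
    case "$1" in
        0) echo "0,1,2,3" ;;
        1) echo "4,5,6,7" ;;
        *) echo "usage: gpu_group 0|1" >&2; return 1 ;;
    esac
}
```

Usage would look like `CUDA_VISIBLE_DEVICES=$(gpu_group 1) bash scripts/sft.sh config2.yaml`. Note that inside each job the visible GPUs are renumbered from zero, so a job restricted to GPUs 4-7 sees them as cuda:0 to cuda:3.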

Directories

/home/user/chess-llm/jt/C1/          # Project root
/home/user/chess-llm/jt/C1/code/     # Scripts
/home/user/chess-llm/jt/C1/configs/  # Configs
/home/user/chess-llm/jt/C1/data/     # Training data
/home/user/chess-llm/jt/C1/logs/     # Logs
/home/user/chess-llm/jt/C1/outputs/  # Evaluation outputs

Conda Environment

Environment name: c1

conda activate c1

Python: 3.12

Important Notes

Always check GPU usage before starting new jobs:
```
nvidia-smi
```
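
For a scriptable version of this check, `nvidia-smi` can emit per-GPU memory usage as CSV, which is easy to filter. The sketch below lists GPUs that look idle; `free_gpus` is a hypothetical helper, and the 1000 MiB default cutoff is an assumption to adjust.

```shell
# Print the indices of GPUs whose used memory is below a threshold (MiB).
# Reads "index, memory.used" CSV rows on stdin, as produced by:
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
free_gpus() {
    threshold="${1:-1000}"   # assumed cutoff in MiB
    awk -F', *' -v t="$threshold" '$2 + 0 < t { print $1 }'
}
```

A typical call would be `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | free_gpus`. Low memory use is only a heuristic for "idle", so it is still worth looking at the full `nvidia-smi` output before launching a long job.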

Kill orphaned processes if needed:

# Find processes using specific GPUs
nvidia-smi pmon -c 1

# Kill by PID
kill -9 <PID>
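
Because `kill -9` gives a process no chance to clean up, it can be worth trying a normal termination signal first. The following sketch sends SIGTERM and escalates to SIGKILL only if the process is still alive after a grace period; `stop_pid` is a hypothetical helper and the 10-second default is an assumption.

```shell
# Ask a process to exit with SIGTERM, and fall back to SIGKILL only if it
# is still alive after a grace period. stop_pid is a hypothetical helper.
stop_pid() {
    pid="$1"
    grace="${2:-10}"                      # seconds to wait before escalating
    kill "$pid" 2>/dev/null || return 0   # no such process (or not permitted)
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # process has exited
        sleep 1
        grace=$((grace - 1))
    done
    kill -9 "$pid" 2>/dev/null
    return 0
}
```

PIDs holding GPU memory can be listed with `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader` and then passed to `stop_pid <PID>`.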

Logs are in /home/user/chess-llm/jt/C1/logs/:
- sft_train_*.log - Training logs
- sft_eval_*.log - Evaluation logs
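
When several runs have written logs, it helps to jump straight to the newest one. This is a small sketch based on the file name patterns above; `latest_log` is a hypothetical helper, and it assumes log file names contain no newlines.

```shell
# Print the most recently modified training or evaluation log.
# latest_log is a hypothetical helper; kind is "train" or "eval".
latest_log() {
    log_dir="$1"
    kind="$2"
    ls -t "$log_dir"/sft_"$kind"_*.log 2>/dev/null | head -n 1
}
```

For example, `tail -f "$(latest_log /home/user/chess-llm/jt/C1/logs train)"` would follow the newest training log.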
WandB entity: lilvjosephtang-university-of-toronto
