Cloud Server Setup

This document describes the cloud server environment for the C1 project.

Server Info

Server: u1 (University of Toronto)

Connection:

ssh user@<server-ip>

Hardware

GPUs: 8x NVIDIA GPUs (80GB each)

GPU 0-3: 80GB each, cuda:0 through cuda:3
GPU 4-7: 80GB each, cuda:4 through cuda:7

CPU: AMD EPYC (128 threads)

Storage:

  • /data1/ - Models and training outputs
  • /home/user/chess-llm/jt/C1/ - Project code

Model Paths

Base models are stored on /data1/models/:

/data1/models/Qwen/
├── Qwen3-0.6B/          # Qwen3 0.6B Instruct
├── Qwen3-4B/            # Qwen3 4B (base)
└── Qwen3-4B-Instruct-2507/  # Qwen3 4B Instruct

Training Outputs

Trained models are saved to /data1/C1/:

/data1/C1/
├── qwen3-0.6b/
│   ├── sft_gemini3_flash/
│   └── sft_gemini3.5_flash/
└── qwen3-4b/
    ├── sft_gemini3_flash/
    └── sft_gemini3.5_flash/

Each training output contains:

  • checkpoint-*/ - Training checkpoints (16, 32, 48, ..., 320)
  • adapter_model.safetensors - Final LoRA weights
  • trainer_state.json - Training history
  • all_results.json - Final metrics
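
The layout above can be checked with a short script before a run is used for evaluation. The following is a minimal sketch, not part of the project: the function names are hypothetical, and it assumes checkpoints are saved every 16 steps up to step 320, as the list above suggests. Other runs may use a different schedule.

```shell
#!/usr/bin/env bash
# Print the checkpoint step numbers listed above: 16, 32, ..., 320.
# Assumption: a fixed interval of 16 steps and a final step of 320.
expected_steps() {
    seq 16 16 320
}

# Report which expected artifacts are missing from a training output directory.
# Prints one missing path per line; prints nothing if everything is present.
missing_artifacts() {
    local out_dir="$1"
    local step f
    for step in $(expected_steps); do
        [ -d "$out_dir/checkpoint-$step" ] || echo "$out_dir/checkpoint-$step"
    done
    for f in adapter_model.safetensors trainer_state.json all_results.json; do
        [ -f "$out_dir/$f" ] || echo "$out_dir/$f"
    done
}

# Example (path taken from the tree above):
# missing_artifacts /data1/C1/qwen3-0.6b/sft_gemini3_flash
```

An empty result means the directory has every checkpoint and all three final files; any printed path points at an interrupted or incomplete run.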

GPU Allocation Strategy

When running parallel training jobs, allocate GPUs explicitly:

# Job 1 on GPU 0-3
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/sft.sh config1.yaml

# Job 2 on GPU 4-7
CUDA_VISIBLE_DEVICES=4,5,6,7 bash scripts/sft.sh config2.yaml

Standard allocation:

  • Single training job: 4 GPUs (DDP)
  • Evaluation: 4 GPUs (vLLM tensor parallel)
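
To run both halves of the machine at once, the two commands above can be started in the background and waited on together. This is a rough sketch under stated assumptions: `run_on_gpus` is a hypothetical helper, and the log file names in the example are placeholders rather than the project's actual naming.

```shell
#!/usr/bin/env bash
# Launch one job pinned to a GPU set, in the background, logging to a file.
# Arguments: <gpu list> <log file> <command...>
run_on_gpus() {
    local gpus="$1" log="$2"
    shift 2
    # CUDA_VISIBLE_DEVICES is set only for this one command.
    CUDA_VISIBLE_DEVICES="$gpus" "$@" > "$log" 2>&1 &
}

# Example: two SFT jobs on disjoint halves of the machine.
# (Log file names here are placeholders.)
# run_on_gpus 0,1,2,3 logs/job1.log bash scripts/sft.sh config1.yaml
# run_on_gpus 4,5,6,7 logs/job2.log bash scripts/sft.sh config2.yaml
# wait   # block until both background jobs exit
```

Because each job sees only its own four devices, both processes address them as cuda:0 through cuda:3 internally, so the configs need no per-job device changes.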

Directories

/home/user/chess-llm/jt/C1/          # Project root
/home/user/chess-llm/jt/C1/code/     # Scripts
/home/user/chess-llm/jt/C1/configs/  # Configs
/home/user/chess-llm/jt/C1/data/     # Training data
/home/user/chess-llm/jt/C1/logs/     # Logs
/home/user/chess-llm/jt/C1/outputs/  # Evaluation outputs

Conda Environment

Environment name: c1

conda activate c1

Python: 3.12

Important Notes

  1. Always check GPU usage before starting new jobs:

    nvidia-smi
  2. Kill orphaned processes if needed:

    # List processes on every GPU (add -i <id> to restrict to specific GPUs)
    nvidia-smi pmon -c 1
    
    # Kill by PID; try plain `kill <PID>` first and use -9 only if it is ignored
    kill -9 <PID>
  3. Training and evaluation logs are in the project's logs/ directory (/home/user/chess-llm/jt/C1/logs/):

    • sft_train_*.log - Training logs
    • sft_eval_*.log - Evaluation logs
  4. WandB entity: lilvjosephtang-university-of-toronto
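
The GPU check in note 1 can also be scripted so that a launcher picks free devices automatically. The following is a small sketch, not an existing project script: `idle_gpus` is a hypothetical helper, and the 1000 MiB cutoff for "idle" is an assumption that may need tuning. It reads the CSV produced by `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits`.

```shell
#!/usr/bin/env bash
# Read "index, memory.used" CSV rows on stdin and print the indices of GPUs
# whose used memory (MiB) is below a threshold, comma-separated.
idle_gpus() {
    local threshold_mib="${1:-1000}"   # assumed cutoff for "idle", in MiB
    awk -F', *' -v t="$threshold_mib" \
        '$2 + 0 < t { printf "%s%s", sep, $1; sep = "," } END { print "" }'
}

# Example on the server:
# nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | idle_gpus
```

The output has the same comma-separated form that CUDA_VISIBLE_DEVICES expects, so it can be passed straight to a job once the list has been checked to contain enough devices.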