This document describes the cloud server environment for the C1 project.
Server: u1 (University of Toronto)
Connection:
ssh user@<server-ip>
GPUs: 8x NVIDIA GPUs (80GB each)
GPU 0-3: 80GB, cuda:0
GPU 4-7: 80GB, cuda:4
CPU: AMD EPYC (128 threads)
Storage:
- /data1/ - Models and training outputs
- Project code: /home/user/chess-llm/jt/C1/
Base models are stored on /data1/models/:
/data1/models/Qwen/
├── Qwen3-0.6B/ # Qwen3 0.6B Instruct
├── Qwen3-4B/ # Qwen3 4B (base)
└── Qwen3-4B-Instruct-2507/ # Qwen3 4B Instruct
Trained models are saved to /data1/C1/:
/data1/C1/
├── qwen3-0.6b/
│   ├── sft_gemini3_flash/
│   └── sft_gemini3.5_flash/
└── qwen3-4b/
    ├── sft_gemini3_flash/
    └── sft_gemini3.5_flash/
Each training output contains:
- checkpoint-*/ - Training checkpoints (16, 32, 48, ..., 320)
- adapter_model.safetensors - Final LoRA weights
- trainer_state.json - Training history
- all_results.json - Final metrics
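Because checkpoints are numbered by training step, the most recent one in a training output can be located with a small helper. The following is a minimal sketch, not part of the project scripts; the function name `latest_checkpoint` is hypothetical.

```shell
# Sketch: print the highest-numbered checkpoint-* directory inside a
# training output directory. `latest_checkpoint` is a hypothetical helper.
latest_checkpoint() {
    local run_dir="$1" best="" best_step=-1 dir step
    for dir in "$run_dir"/checkpoint-*/; do
        # Skip the unexpanded glob when no checkpoint exists
        [ -d "$dir" ] || continue
        dir="${dir%/}"
        step="${dir##*checkpoint-}"
        # Ignore directories whose suffix is not a plain step number
        case "$step" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$step" -gt "$best_step" ]; then
            best_step="$step"
            best="$dir"
        fi
    done
    if [ -n "$best" ]; then
        printf '%s\n' "$best"
    else
        return 1
    fi
}

# Example (path taken from the layout above):
#   latest_checkpoint /data1/C1/qwen3-4b/sft_gemini3_flash
```

Comparing the step numbers numerically avoids the lexical-sort pitfall where checkpoint-48 would rank above checkpoint-320.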
When running parallel training jobs, allocate GPUs explicitly:
# Job 1 on GPU 0-3
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/sft.sh config1.yaml
# Job 2 on GPU 4-7
CUDA_VISIBLE_DEVICES=4,5,6,7 bash scripts/sft.sh config2.yaml
Standard allocation:
- Single training job: 4 GPUs (DDP)
- Evaluation: 4 GPUs (vLLM tensor parallel)
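Choosing between the two 4-GPU groups can be scripted from the per-GPU memory usage that `nvidia-smi` reports. This is a rough sketch under assumptions: the helper name `pick_free_gpu_group` and the 1000 MiB "busy" threshold are illustrative, not project conventions.

```shell
# Sketch: pick the first idle 4-GPU group (0-3 or 4-7).
# Reads "index, memory.used" CSV lines on stdin, as produced by:
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
# Optional argument: memory threshold in MiB above which a GPU counts as
# busy (hypothetical default: 1000).
pick_free_gpu_group() {
    local busy_mib="${1:-1000}"
    awk -F', *' -v limit="$busy_mib" '
        ($2 + 0) > (limit + 0) {
            if (($1 + 0) < 4) busy_lo = 1
            else              busy_hi = 1
        }
        END {
            if (NR == 0)       exit 2   # no input, e.g. nvidia-smi failed
            if (!busy_lo)      print "0,1,2,3"
            else if (!busy_hi) print "4,5,6,7"
            else               exit 1   # both groups are in use
        }'
}

# Example:
#   gpus=$(nvidia-smi --query-gpu=index,memory.used \
#            --format=csv,noheader,nounits | pick_free_gpu_group) \
#     && CUDA_VISIBLE_DEVICES="$gpus" bash scripts/sft.sh config1.yaml
```

The function exits non-zero when neither group is idle, so the training command in the example only starts if a group was actually found.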
/home/user/chess-llm/jt/C1/ # Project root
/home/user/chess-llm/jt/C1/code/ # Scripts
/home/user/chess-llm/jt/C1/configs/ # Configs
/home/user/chess-llm/jt/C1/data/ # Training data
/home/user/chess-llm/jt/C1/logs/ # Logs
/home/user/chess-llm/jt/C1/outputs/ # Evaluation outputs
Environment name: c1
conda activate c1
Python: 3.12
- Always check GPU usage before starting new jobs:
  nvidia-smi
- Kill orphaned processes if needed:
  # Find processes using specific GPUs
  nvidia-smi pmon -c 1
  # Kill by PID
  kill -9 <PID>
- Training logs are in /logs:
  - sft_train_*.log - Training logs
  - sft_eval_*.log - Evaluation logs
- WandB entity: lilvjosephtang-university-of-toronto
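For the orphaned-process cleanup noted above, the PIDs on one GPU group can be pulled out of the `nvidia-smi pmon -c 1` output instead of being read off by eye. This is a sketch only: `pids_on_gpus` is a hypothetical helper, and it assumes the first two columns of the pmon output are the GPU index and the PID.

```shell
# Sketch: list unique PIDs running on the given GPU indices.
# Reads `nvidia-smi pmon -c 1` output on stdin; assumes column 1 is the
# GPU index and column 2 is the PID ("-" for idle GPUs).
# Argument: comma-separated GPU indices, e.g. "4,5,6,7".
pids_on_gpus() {
    awk -v gpus="$1" '
        BEGIN {
            n = split(gpus, list, ",")
            for (i = 1; i <= n; i++) want[list[i]] = 1
        }
        /^#/ { next }                       # skip header lines
        ($1 in want) && $2 ~ /^[0-9]+$/ && !seen[$2]++ { print $2 }'
}

# Example:
#   nvidia-smi pmon -c 1 | pids_on_gpus 4,5,6,7
# Check who owns a PID before killing it:
#   ps -o user=,pid=,args= -p <PID>
```

A plain `kill <PID>` (SIGTERM) gives the process a chance to shut down cleanly; `kill -9` is best kept for processes that ignore it.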