It's often more stable to fine-tune pretrained models with SGD than with Adam.
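One possible intuition for this, sketched below in plain Python: with bias correction, Adam's very first update has a magnitude of roughly the learning rate no matter how small the gradient is, while an SGD update scales with the gradient. On a pretrained model whose gradients are already small, Adam can therefore move the weights further in the early steps. This is a toy scalar illustration, not a benchmark. The function names `sgd_step` and `adam_first_step`, the shared learning rate, and the sample gradient values are assumptions made for the example; in practice the two optimizers are usually run with different learning rates.

```python
import math


def sgd_step(grad, lr):
    """Plain SGD update (no momentum): the step scales with the gradient."""
    return -lr * grad


def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam update for a single scalar weight, starting from m = v = 0."""
    m = (1 - beta1) * grad            # first-moment estimate after one step
    v = (1 - beta2) * grad * grad     # second-moment estimate after one step
    m_hat = m / (1 - beta1)           # bias correction (t = 1)
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)


if __name__ == "__main__":
    lr = 1e-3  # same learning rate for both, purely for comparison
    for grad in (1e-4, 1e-2, 1.0):
        print(
            f"grad={grad:g}  "
            f"sgd_step={sgd_step(grad, lr):.3e}  "
            f"adam_first_step={adam_first_step(grad, lr):.3e}"
        )
```

For the smallest gradient, the SGD step is tiny while the Adam step stays close to the learning rate in size. That fits the stability claim, but it is only one contributing factor and only covers the first update.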