An empirical analysis of compute-optimal large language model training