---
datasets:
- roneneldan/TinyStories
---

Some very small and very simple models: 29,960,200 parameters.

Config: `"dim": 256, "dim_head": 32, "headcount": 8, "ff_mult": 4, "vocab_size": 50304, "num_layers": 4`.

This is nonstandard (for TinyStories), reflecting a full GPT-2 vocabulary size (which bloats the embedding layers) and the use of a SwiGLU activation function (which doubles the width of one of the feed-forward layers); a rough sketch of such a block is shown below.

Source for training, inference, dataset preparation, and network definitions is available at https://github.com/SQCU/attn_demo

Training logs (unprocessed! unfiltered! it's a bunch of log prints of train and validation loss!) and training loader source for each run are included with the demo models.
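To illustrate the "doubled width" point, here is a minimal PyTorch sketch of a SwiGLU feed-forward block under the config above (`dim=256`, `ff_mult=4`). The class and parameter names are assumptions for illustration, not the actual attn_demo definitions; see the repo for those.

```python
import torch
import torch.nn as nn

# Hypothetical SwiGLU feed-forward sketch (not the attn_demo code).
# With dim=256 and ff_mult=4, the gated input projection is
# 2 * (4 * 256) = 2048 wide, i.e. double the width of a plain 4x MLP's
# first layer -- this is the "doubling" mentioned above.
class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim: int = 256, ff_mult: int = 4):
        super().__init__()
        hidden = dim * ff_mult
        self.w_in = nn.Linear(dim, hidden * 2, bias=False)   # doubled width
        self.w_out = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.w_in(x).chunk(2, dim=-1)
        return self.w_out(nn.functional.silu(gate) * up)


if __name__ == "__main__":
    ff = SwiGLUFeedForward()
    print(sum(p.numel() for p in ff.parameters()))   # 786432 params per block
    print(ff(torch.randn(1, 8, 256)).shape)          # torch.Size([1, 8, 256])
```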