Finding Optimal Tokenizers
Summary
The article presents an approach to computing optimal tokenizers for a dataset using an integer linear programming (ILP) formulation and its linear programming (LP) relaxation, drawing parallels to the cutting-plane techniques used for the Traveling Salesman Problem. It discusses practical limitations, including that the results are near-optimal on the training data while generalization remains a concern, along with hardware and solver considerations. The piece also covers the experimental setup, results on toy problems, and potential future work on scaling up tokenizer optimization.
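To make the notion of an "optimal tokenizer" concrete, the sketch below brute-forces a toy instance. It is not the article's ILP or LP relaxation: it swaps in exhaustive search, which is only feasible at toy scale. It also assumes an objective the summary does not spell out, namely minimizing the total number of tokens needed to encode the dataset under a fixed budget of multi-character tokens. The names `min_tokens`, `optimal_vocab`, `extra_tokens`, and `max_len` are hypothetical and chosen for illustration.

```python
from itertools import combinations


def min_tokens(text, vocab):
    """Fewest tokens needed to segment `text` using `vocab` (DP over prefixes)."""
    inf = float("inf")
    best = [0] + [inf] * len(text)  # best[i] = min tokens covering text[:i]
    for end in range(1, len(text) + 1):
        for start in range(end):
            if text[start:end] in vocab and best[start] + 1 < best[end]:
                best[end] = best[start] + 1
    return best[len(text)]


def optimal_vocab(corpus, extra_tokens, max_len=4):
    """Exhaustively pick `extra_tokens` multi-character tokens minimizing total token count.

    ASSUMPTION: the objective (total tokens on the corpus) and the budget
    (number of added tokens) are illustrative, not taken from the article.
    """
    base = set("".join(corpus))  # single characters are always available
    candidates = sorted(
        {
            w[i:j]
            for w in corpus
            for i in range(len(w))
            for j in range(i + 2, min(len(w), i + max_len) + 1)
        }
    )
    # Baseline: character-level tokenization with no added tokens.
    best_cost, best_extra = sum(len(w) for w in corpus), ()
    for extra in combinations(candidates, extra_tokens):
        vocab = base | set(extra)
        cost = sum(min_tokens(w, vocab) for w in corpus)
        if cost < best_cost:
            best_cost, best_extra = cost, extra
    return best_cost, best_extra


if __name__ == "__main__":
    toy_corpus = ["abab", "abab", "ab"]
    print(optimal_vocab(toy_corpus, extra_tokens=1))
    print(optimal_vocab(toy_corpus, extra_tokens=2))
```

The number of candidate vocabularies grows combinatorially with the budget, so this search stops being practical almost immediately. That blow-up is the reason a formulation that a solver can attack, such as an ILP with an LP relaxation, is attractive for anything beyond toy problems.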