the actual paper - note that as of today (Mon Jan 22 2024) this is a pre-print. (it’s a short read.)
this came to me via a mailing list that i’m on whose membership has, in my opinion, a decidedly curmudgeonly orientation. so i feel a real need to look at links from this source with a more adversarial bent. taken at the researchers’ word, though, there are some interesting nuggets in here.
setting aside the assertion that over half of the web is machine translated dreck … (not unbelievable, i don’t know that it’s shocking …)
- there’s a tremendous amount of short article content … click-bait english articles are widely translated into other languages, so we may very well be dragging the rest of the world down with us in our cognitive decline.
- the well may very well be poisoned for future LLM development if folks are training on the internet corpus across multiple sparsely used languages. whether this results in improved training techniques or just punting on the problem remains to be seen.