Misc

BLOOM: a large language model that’s open source and built for the scientific community


Since the Transformer architecture was invented, plenty of large language models have appeared as Machine Learning has taken off in the last few years: GPT-3, RoBERTa, PaLM, and many more. Most of them were trained by large companies like OpenAI, Facebook, or Google, often on open datasets such as Wikipedia. The problem with these models is that they retain the biases already present in that data, since humans are not free of bias. The companies may take precautions when deploying the final models, but the training data itself was not examined for bias. In addition, because English remains the dominant language of business, little effort has gone into including low-resource languages.

BLOOM, the BigScience Large Open-science Open-access Multilingual Language Model, was made by scientists for research purposes. They specifically focused on reducing bias in the training data and included many more Asian and African languages. These improvements target the research and education sector. So far, its direct usage is language generation, which generally still suffers from repetition, but the underlying encoding can also be used for summarization and question answering.
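As a rough sketch of what language generation with BLOOM looks like (assuming the Hugging Face `transformers` library is installed and using `bigscience/bloom-560m`, the smallest publicly hosted BLOOM checkpoint, rather than the full model):

```python
# Minimal text-generation sketch with a small BLOOM checkpoint.
# Assumes the Hugging Face `transformers` library is installed;
# "bigscience/bloom-560m" is downloaded from the Hugging Face Hub on first use.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")
result = generator(
    "BLOOM is a multilingual language model that",
    max_new_tokens=20,  # limit how much text is generated
)
print(result[0]["generated_text"])
```

The same checkpoint family scales up to the full 176B-parameter model, which requires substantially more hardware to run.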

Huggingface model card: Link here

Research site: Link here