Year: 2022

Misc

ChatGPT: an interactive Wikipedia that really knows how to code

Reading Time: 2 minutes

It seems there is an endless supply of large language models being applied to interesting situations. This time it's a lot closer to everyday life than the previous models. OpenAI released the public beta of ChatGPT, a chatbot backed by GPT-3.5 that can answer scientific questions about the world, generate recipes, and write better-than-average code. Here is the blog post that explains how it works: https://openai.com/blog/chatgpt. And if you have an OpenAI account, here is the beta: https://chat.openai.com/chat

I was fortunate enough to test the product this week, and it's surprisingly user friendly. There is just something about the chatbot format that really intrigues me; maybe it's my strong urge to engage in interesting conversations. It's able to answer complicated scientific questions, from quantum mechanics to biological pathways to mathematical concepts. You do have to ask it for examples and further explanation to get more detailed information. And if you are an expert in the field, you may find the information too shallow. For example, I previously published scientific papers on the biological pathways that control pupil dilation in rodents, and I was not able to get information at the level of detail of those papers. This might not be a bad thing: a flawed model called Galactica was introduced a few weeks ago, trained on scientific papers to generate text that reads like scientific papers. The questionable outcome was authoritative-sounding text with obviously wrong information. Being humble works better in this case.

I also tested it on math concepts such as the Taylor series and the Fourier transform, and it was able to give good explanations and examples. Another strong suit of the model is its ability to generate above-average programming code. That's not surprising, since previous GPT models have been used in the Copilot product to generate code and documentation. Still, it's nice to see the model include documentation and explanations alongside the generated code. On the side of more everyday tasks, it can also generate cooking recipes that seem reasonable, although I have not tested the actual recipes yet.

Regarding limitations, I found some questions the model refuses to answer or is unable to answer. For example, when I asked it for ethically questionable instructions, it refused to give them. And if I asked about things that change or are hard to determine, it would also refuse to answer. For example, when I asked what the bright star next to the moon is called, it told me that it depends on the time and location.

All in all, it feels like an interesting tool to test and research further. People have mentioned that this model could be a reasonable educational tool if developed properly. And since there are so few AI products applied to education, I really hope more educational products get built on top of this model.

Misc

BLOOM: a large language model that's open source and built for the scientific community

Reading Time: < 1 minute

Since the Transformer was invented, plenty of large language models have appeared as machine learning has taken off in the last few years: GPT-3, RoBERTa, PaLM, and many more. Most of them were trained by large companies like OpenAI, Facebook, or Google, often on openly available data sources like Wikipedia. The problem with these models is that they retain the biases that already exist in that data, since humans are not free of bias. The companies may take precautions when deploying the final models, but the training data were not examined for bias. In addition, because the dominant money-making language in the business world is still English, not much attempt was made to include low-resource languages.

BLOOM, the BigScience Large Open-science Open-access Multilingual Language Model, was made by scientists for research purposes. They specifically focused on reducing bias in the training data and included many more Asian and African languages. These improvements target the research and education sectors. So far, its direct usage is language generation, which generally still suffers from repetition, but the model's representations can also be used for summarization and question answering.
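If you just want to poke at it, here is a minimal sketch using the Hugging Face transformers library. Note that the full bigscience/bloom checkpoint is enormous, so this assumes the much smaller bigscience/bloom-560m variant from the same family; the prompt is just an example.

# pip install transformers torch
from transformers import pipeline

# Assumes the small 560M-parameter BLOOM checkpoint; the full model
# needs far more memory than a typical workstation has.
generator = pipeline("text-generation", model="bigscience/bloom-560m")

prompt = "The study of word meanings is called"
outputs = generator(prompt, max_new_tokens=20, do_sample=False)
print(outputs[0]["generated_text"])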

Huggingface model card: Link here

Research site: Link Here

AI

Whisper – OpenAI’s latest speech transcription package

Reading Time: < 1 minute

Speech transcription is the process of converting speech audio into text. The text becomes searchable, and there is a variety of Natural Language Processing (NLP) tools that can make sense of it. Traditionally this was done by humans. Early technology was less accurate (<70%), so the NLP tools did not work effectively. Machine learning made great strides and increased the accuracy to more than 90%. However, this technology has been largely inaccessible to the average person or app developer: training your own model requires technical knowledge, and cloud solutions from Google, AWS, or Microsoft Azure are relatively expensive at large volumes.

With Whisper, developers can use their own GPU-capable hardware to produce large amounts of speech transcription. Theoretically, this will enable more exciting solutions that utilize speech transcription technology. I would personally like to see some competition in the personal assistant field, especially on wearable technology.
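As a rough sketch of what that looks like with the openai-whisper pip package (it also needs ffmpeg installed; the audio file name below is a placeholder):

# pip install -U openai-whisper
import whisper

# "base" is one of the smaller checkpoints; larger ones trade speed for accuracy.
model = whisper.load_model("base")

# Replace with your own recording; "meeting.mp3" is just a placeholder name.
result = model.transcribe("meeting.mp3")
print(result["text"])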

Here is a tutorial on how to set it up.

https://www.assemblyai.com/blog/how-to-run-openais-whisper-speech-recognition-model/

And the original paper and code if you are interested.

https://openai.com/blog/whisper/

AI

LaMDA model makes dialogs more grounded and more believable

Reading Time: < 1 minute

I recently did a journal club presentation explaining Google's LaMDA model (Language Models for Dialog Applications). Specifically, I liked how they added an external information retrieval step (essentially a Google search) on top of the existing large language model to ground the dialog. This helps the model be more grounded and believable to a human because it conforms to human assumptions: 1) it agrees with what was said before, and 2) it agrees with human common sense (it fact-checks).
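To make the idea concrete, here is a toy sketch of that grounding loop; the helper functions are placeholders I made up for illustration, not LaMDA's actual components.

# Toy sketch of a retrieval-grounded dialog turn. All helpers are placeholders.

def generate_candidate(turns):
    # Placeholder: a real system would sample a draft reply from the language model.
    return "The Eiffel Tower is about 300 meters tall."

def search(query):
    # Placeholder: a real system would call an external search / information-retrieval tool.
    return ["Eiffel Tower: 300 m to the roof, 330 m including antennas."]

def revise_with_evidence(draft, evidence, turns):
    # Placeholder: a real system would rewrite the draft to agree with the retrieved facts.
    return draft if evidence else "I'm not sure."

def grounded_reply(history, user_message):
    draft = generate_candidate(history + [user_message])   # 1) draft from the base model
    evidence = search(draft)                                # 2) fetch external facts
    return revise_with_evidence(draft, evidence, history)   # 3) ground the final reply

print(grounded_reply([], "How tall is the Eiffel Tower?"))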

AI

Do humans like precision more than recall?

Reading Time: 3 minutes

After working in the industry for a while and applying machine learning to what works well for business, I have come to the conclusion that humans much prefer precision to recall. First things first: what are precision and recall? Technically, precision is (true positives) / (true positives + false positives), and recall is (true positives) / (true positives + false negatives). The formulas may be hard to parse, but the intuition is that optimizing for precision lowers the number of false positives, while optimizing for recall lowers the number of false negatives.
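As a quick worked example with made-up counts (including the F1 score I mention further down):

# Toy confusion-matrix counts, purely illustrative.
tp, fp, fn = 80, 10, 40   # true positives, false positives, false negatives

precision = tp / (tp + fp)   # 80 / 90  ~= 0.89: how often a positive call is right
recall    = tp / (tp + fn)   # 80 / 120 ~= 0.67: how many real positives we caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")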

To further explain the technical terms: false positives are the cases where you thought something was true, but it turns out to be false. This is what I traditionally think of as mistakes. False negatives are the cases where you thought something was false, but it is actually true. I think of these as surprises. Taking from the book Thinking, Fast and Slow, people tend to make up reasons and logic to ensure that whatever conclusion they reached is the correct one. Given that tendency, false positives may be harder to accept than false negatives: false positives directly violate our beliefs (assuming we are reasonable people), whereas false negatives just tell us we learned something new, which is much easier to accept.

Given that false positives are harder for human beings to take, we try really hard to minimize the chance of them happening. Before prediction models existed, people had rules for doing things, and if the rules helped us solve a problem or find a thing, we kept them. When false positives happen, we fix the rules to minimize them, but we don't really look for the places where the rules fail to catch the things we want, which are the false negatives. In a sense, before big data and fast predictive models, regular people without statistical training didn't care that much about false negatives. I think we are more comfortable finding false positives than false negatives.

The reality is that I find myself building more models that favor precision over recall when I don't have the ability to maximize both. Obviously I like to use F1, the harmonic mean of precision and recall, to measure model success, but that isn't always possible given the amount of labeled data I have. In most cases I would rather tell my client that I was not able to find everything they want than tell them I may report something that's wrong. The traditional rule-based approach tends to find things with high precision that don't cover much ground, and then combines a bunch of these rules to get the desired results. I find myself doing similar things, just with prediction models. I keep advertising high-recall models, especially for discovery phases where my clients can find new directions for research, but they have been slow to adopt that approach. Maybe they also have a tougher time telling their customers that they are not 100 percent sure about their findings.

Outside of research, I don't know if I will ever be able to ask my clients to treat recall as a metric they truly care about. Maybe it's because my industry values precision more, or maybe the word precision just sounds more positive than recall :). I would love to find some business examples where people are more open-minded about using recall to solve their problems, and where the decision makers understand its value.

Misc

Intel MKL error: undefined symbol: mkl_sparse_optimize_bsr_trsm_i8

Reading Time: < 1 minute

This problem came up when I moved an Anaconda environment onto a server without internet access by simply copying over the env folder. This somehow breaks the symbolic links for the library, even though the library files are all inside the env folder.

The only solution I found that works is to set the LD_PRELOAD environment variable to point at the base Anaconda lib folder:

export LD_PRELOAD=~/anaconda3/lib/libmkl_core.so:~/anaconda3/lib/libmkl_sequential.so:~/anaconda3/lib/libmkl_def.so

You may have to include the specific *.so file that is reported in the error message. Apparently the MKL library ships several versions of these *.so files, and the specific version in the env folder didn't work.

You can include this in the ~/.bashrc file so it loads every time.

Original post is linked here.

AI, Journal

Importing Wikipedia dump into mysql

Reading Time: < 1 minute

So I was thinking about using Wikipedia data to build a knowledge base and practice some NLP techniques on it. The first step is to import the English portion of Wikipedia into a MySQL database so I can query it as needed.

My first thought was to go to the Wikipedia download page.

I first tried to download the ready-made SQL dumps, but the SQL scripts available for download don't actually include the article text we see on Wikipedia. So I had to go to the XML files and follow the instructions provided by this source.

Basically, we need a tool called MWDumper that converts the XML into SQL scripts. We can download the compiled Java jar here, with the instructions here.

The code provided by the blog is mostly correct, except that the page table expects one more column. All we need to do is add the column like this:

ALTER TABLE page
ADD COLUMN page_counter INT AFTER page_restrictions;

Another change is that one of the columns in revision is too small, so we need to change the field type:

ALTER TABLE `revision`
CHANGE `rev_comment` `rev_comment` blob NOT NULL AFTER `rev_text_id`;

There are also duplicate page_title values in page, so make sure the index on (page_namespace, page_title) is not UNIQUE:

ALTER TABLE `page`
DROP INDEX `name_title`,
ADD INDEX `name_title` (`page_namespace`, `page_title`);

After that, it should just be a waiting game until everything is done. My slow server took about two days, and the final database size is about 126 GB. Happy NLPing!
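To actually start the NLPing, here is a rough sketch of pulling one article's wikitext back out with Python. It assumes the classic MediaWiki schema that MWDumper emits (page, revision, and text tables) and hypothetical connection details; adjust the user, password, and database name to your own setup.

# pip install mysql-connector-python
import mysql.connector

# Hypothetical credentials and database name; change to match your server.
conn = mysql.connector.connect(user="wiki", password="secret",
                               host="localhost", database="wikipedia")
cur = conn.cursor()

# Classic MediaWiki layout: page -> revision (page_latest) -> text (rev_text_id).
cur.execute("""
    SELECT t.old_text
    FROM page p
    JOIN revision r ON r.rev_id = p.page_latest
    JOIN `text` t   ON t.old_id = r.rev_text_id
    WHERE p.page_namespace = 0 AND p.page_title = %s
    LIMIT 1
""", ("Natural_language_processing",))

row = cur.fetchone()
if row:
    wikitext = row[0]
    if isinstance(wikitext, (bytes, bytearray)):
        wikitext = wikitext.decode("utf-8", errors="replace")
    print(wikitext[:500])   # first 500 characters of the raw wikitext
conn.close()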