AI

LaMDA model makes dialogs more grounded and more believable

Reading Time: < 1 minute

I recently did a journal club explaining Google’s LaMDA model, Language Models for Dialog Applications. Specifically, I liked how they added an external information retrieval component (i.e., Google Search) to the existing large language model to ground the dialog. This helps the model be more grounded and believable to a human, because it conforms to two human assumptions: 1) it agrees with what was said before, and 2) it agrees with human common sense (it can be fact-checked).
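To make the idea concrete, here is a toy sketch in Python. This is my own illustration, not LaMDA’s actual architecture, and the retrieve and grounded_reply names are hypothetical: a draft reply is checked against an external retrieval step and revised to agree with the evidence.

```python
# Toy illustration of retrieval-grounded dialog (not LaMDA's real pipeline):
# draft a reply, look up the claim with an external IR system, and revise
# the draft so it agrees with the retrieved evidence.

def retrieve(query, knowledge_base):
    """Stand-in for an external IR system such as a search engine."""
    return knowledge_base.get(query)

def grounded_reply(draft_claim, query, knowledge_base):
    evidence = retrieve(query, knowledge_base)
    if evidence is not None and evidence != draft_claim:
        return evidence  # revise the draft to match the retrieved facts
    return draft_claim   # nothing retrieved: keep the (ungrounded) draft

kb = {"height of Mount Everest": "8,849 m"}
print(grounded_reply("8,000 m", "height of Mount Everest", kb))  # → 8,849 m
```

The point of the sketch is just the control flow: the language model’s fluent draft gets overridden whenever the retrieval step disagrees with it.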

AI

Do humans like precision more than recall?

Reading Time: 3 minutes

After working in industry for a while and applying machine learning to what works well for business, I have come to the conclusion that humans much prefer precision to recall. First things first: what are precision and recall? Technically, precision is (true positives) / (true positives + false positives), and recall is (true positives) / (true positives + false negatives). The formulas may be hard to parse, but the intuitive sense is that optimizing for precision lowers the number of false positives, while optimizing for recall lowers the number of false negatives.
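The two definitions above can be sketched in a few lines of Python (the counts are made up for illustration: the model flags 10 items, 8 correctly, and misses 8 other true positives):

```python
def precision(tp, fp):
    # precision = TP / (TP + FP): of everything flagged, how much was right
    return tp / (tp + fp)

def recall(tp, fn):
    # recall = TP / (TP + FN): of everything true, how much was found
    return tp / (tp + fn)

print(precision(tp=8, fp=2))  # → 0.8
print(recall(tp=8, fn=8))     # → 0.5
```

So the same model can look great on one metric and mediocre on the other, which is exactly the trade-off the rest of this post is about.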

To further explain the technical terms: false positives are the cases where you thought something was true, but it turned out to be false. This is what I traditionally think of as mistakes. False negatives are the cases where you thought something was false, but it is actually true. I think of these as surprises. Borrowing from the book Thinking, Fast and Slow: people tend to make up reasons and logic to convince themselves that whatever conclusion they reached is the correct one. Based on that human tendency, false positives may be harder to accept than false negatives. A false positive directly violates our beliefs (assuming we are reasonable people), whereas a false negative just tells us we learned something new, which is much easier to accept.

Given that false positives are harder for human beings to take, we try really hard to minimize the chance of them happening. Before prediction models existed, people had rules for doing things. If the rules helped us solve a problem or find a thing, we kept the rules. When false positives happened, we fixed the rules to minimize them, but we didn’t really try to find where the rules failed to catch the things we wanted, which are the false negatives. In a sense, before big data and fast predictive models came along, regular people without statistical training didn’t care that much about false negatives. I think we are more comfortable finding false positives than false negatives.

The reality is that I find myself making more models that favor precision over recall when I don’t have the ability to maximize both. Obviously I like to use F1, the harmonic mean of precision and recall, to measure model success, but that’s not always possible given the amount of labeled data I have. In most cases I would rather tell my client that I was not able to find everything they want than tell them I may report something that’s wrong. The traditional rule-based approach tends to find things with high precision that don’t cover much ground, and then combines a bunch of these rules to get the desired results. I find myself doing similar things, just with prediction models. I keep advertising high-recall models, especially for discovery phases where my clients can find new directions for research, but they have been slow to adopt that approach. Maybe they also have a tougher time telling their customers that they are not 100 percent positive about their findings.
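F1, mentioned above, is easy to compute from the two metrics; a quick sketch (with made-up numbers) showing why low recall drags the score down even when precision is high:

```python
def f1(precision, recall):
    # Harmonic mean: dominated by whichever of the two metrics is lower.
    return 2 * precision * recall / (precision + recall)

# Precision 0.8 but recall 0.5 gives an F1 well below the precision alone:
print(round(f1(0.8, 0.5), 3))  # → 0.615
```

Because the harmonic mean punishes the weaker metric, a model tuned only for precision can’t hide a poor recall from F1, which is why I reach for it whenever the labeled data allows.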

Outside of research, I don’t know if I will ever be able to ask my clients to use recall as a metric they truly care about. Maybe it’s because my industry values precision more. Or maybe the word precision just sounds more positive than recall :). I would love to find some business examples where people are more open-minded about using recall to solve their problems, and where the decision makers understand its value.

Misc

Intel MKL error: undefined symbol: mkl_sparse_optimize_bsr_trsm_i8

Reading Time: < 1 minute

This problem came up when I moved an anaconda environment onto a server without internet access by simply copying over the env folder. This somehow breaks the symbolic links for the library, even though the library is entirely inside the env folder.

The only solution that I found working is to set the LD_PRELOAD environment variable to point at the base anaconda lib folder.

export LD_PRELOAD=~/anaconda3/lib/libmkl_core.so:~/anaconda3/lib/libmkl_sequential.so:~/anaconda3/lib/libmkl_def.so

You may have to include the specific *.so file that is reported in the error message. Apparently the MKL library ships several versions of its *.so files, and it couldn’t find the specific version in the env folder.

You can add this line to your ~/.bashrc file so it loads every time.

The original post is linked here.

AI, Journal

Importing Wikipedia dump into mysql

Reading Time: < 1 minute

So I was thinking about using Wikipedia data to build a knowledge base and practice some NLP techniques on it. The first step is to import the English portion of Wikipedia into a MySQL database so I can query it as needed.

My first thought was to go to the Wikipedia download page.

I first tried to download the premade SQL, but the SQL scripts available for download don’t actually include the text we see on Wikipedia. So I had to go to the XML files and follow the instructions provided by this source.

Basically, we need a tool called MWDumper that converts the XML into SQL scripts. We can download the compiled Java jar here, with the instructions here.
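The usual pattern is to pipe MWDumper’s SQL output straight into MySQL. This is a sketch only: the jar name, dump file, user, and database name are placeholders you should replace with your own.

```shell
# Convert the XML dump to SQL and load it into MySQL in one pass.
# mwdumper.jar, the dump filename, the user, and "wikidb" are placeholders.
java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml.bz2 \
  | mysql -u root -p wikidb
```

Piping avoids writing a huge intermediate SQL file to disk, which matters at Wikipedia scale.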

The code provided by the blog is mostly correct, except the page table has one more column. All we need to do is add the column like this:

ALTER TABLE page
ADD COLUMN page_counter INT AFTER page_restrictions;

Another change is that one of the columns in revision is too small, so we need to change the field type.

ALTER TABLE `revision`
CHANGE `rev_comment` `rev_comment` blob NOT NULL AFTER `rev_text_id`;

There are also duplicate page_titles in page, so make sure the index on them is not set to UNIQUE:

ALTER TABLE `page`
DROP INDEX `name_title`,
ADD INDEX `name_title` (`page_namespace`, `page_title`);

After that, it should just be a waiting game until everything is done. My slow server took about 2 days, and the final database size is about 126 GB. Happy NLPing!
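As a quick sanity check once the import finishes, article text can be pulled with a three-table join. This assumes the older MediaWiki schema that MWDumper targets, where page_latest points at a revision row and rev_text_id points at the text table; the article title is just an example.

```sql
-- Fetch the latest wikitext of an article by title (namespace 0 = articles).
SELECT t.old_text
FROM page p
JOIN revision r ON r.rev_id = p.page_latest
JOIN `text` t ON t.old_id = r.rev_text_id
WHERE p.page_namespace = 0
  AND p.page_title = 'Albert_Einstein';
```

If this returns wikitext, the dump imported correctly end to end.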

Misc

Install Rasa suite for chatbot

Reading Time: < 1 minute

To use a chatbot to coordinate automated systems, I searched through many options. There are Google’s Dialogflow and Microsoft’s Azure bots. Dialogflow charges money once I need to use REST API calls, and Azure is just too complicated to set up with all its services. No luck after multiple tries with both. I have previously used Rasa for work, so I have some experience with how it works.

  1. Go to Rasa’s website and install Rasa X using docker. https://rasa.com/docs/rasa-x/installation-and-setup/install/docker-compose
  2. Load the NLU markdown data, the story markdown data, and the domain.yml file.
  3. Also make sure a git repo is connected. I still don’t know how it works, because it’s not loading the files in that git repo.
  4. The only way I found to make it work is to follow the documentation here, regarding the action server. There is no need to change endpoints.yml anymore, though.
  5. Instead of editing the new docker-compose.override.yml file, just go back to the docker-compose.yml file, find the app service, and replace it with this:
  app:
    restart: always
    image: "rasa/rasa-sdk:latest"
    volumes:
      - ./actions:/app/actions
    expose:
      - "5055"
    depends_on:
      - rasa-production

Now it should be good to go once we run docker-compose up -d again!

Journal

Install CUDA driver on a new Ubuntu system

Reading Time: < 1 minute

After recycling the old PC to install Ubuntu, I wanted to install the CUDA drivers, and of course I ran into the same old errors. Here are some notes to make sure I don’t run into them again.

  1. Download and install the CUDA driver. You need to install make and gcc first: sudo apt install gcc make
  2. There were still some errors from the CUDA install. As directed by the nvidia forum, https://forums.developer.nvidia.com/t/info-finished-with-code-256-error-install-of-driver-component-failed/107661, you need to look at the log file at /var/log/nvidia-installer.log. It’s because of the Nouveau kernel driver. Now follow the nvidia instructions at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau to blacklist Nouveau.
  3. How come it still doesn’t work? Well, you need to reboot the system, of course.
  4. Now it installs correctly. Have fun!
  5. Make sure to use the entire hard drive if you have a fresh Ubuntu install. Here -> https://www.panzoto.com/extend-the-free-space-on-lvm/
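For reference, the Nouveau-blacklisting step from the linked NVIDIA guide boils down to writing one modprobe config file and rebuilding the initramfs. This is a sketch of those commands; check the linked docs for your distro, and reboot afterwards.

```shell
# Tell the kernel not to load the Nouveau driver (per the NVIDIA runfile docs).
sudo tee /etc/modprobe.d/blacklist-nouveau.conf > /dev/null <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
# Rebuild the initramfs so the blacklist takes effect, then reboot.
sudo update-initramfs -u
```

Without the reboot the Nouveau driver stays loaded, which is exactly why step 3 above failed at first.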

Journal

Recycle old pc to be a server

Reading Time: < 1 minute

After more than 10 years of using my old PC, I finally decided to get a new one. It’s black, and powerful. But enough of that. I decided to recycle my old PC into a Linux server. After installing Ubuntu 20.04, I decided to partition the old hard drive, and here is the guide I followed. It even shows you how to fuse multiple hard drives together to behave like a single drive. Enjoy!


https://techguides.yt/guides/how-to-partition-format-and-auto-mount-disk-on-ubuntu-20-04/

Journal

Acadia National Park Trip Recap

Reading Time: 3 minutes

Recently I took a family trip to Acadia National Park. It was an interesting journey, especially in the midst of the Covid-19 pandemic.

We first made a stop in Portland, ME, since many friends and co-workers had stopped there for the summer. It’s a small city with a Key West vibe. We stopped at the Old Port as people recommended, and took an afternoon walk along the shops near the bay. There were several lovely pottery shops, but we weren’t such big fans of pottery. It was an especially hot day, so it was interesting to see shops with blowers running. If you can’t tell by now, I have lived in Florida for a long time. Air conditioning is standard in Florida, so it’s strange to my mind not to have A/C in the summer. My son loves rocks, and there was a gem shop with colorful rocks. We bought a few bottles of small gems, and he was quite happy. Some of the restaurants were busy, and we couldn’t find a table until late at night, so we stopped at a smaller one just to get a bite. I would say the highlight of the afternoon was the ice cream shop with mocha and tiramisu.

The official trip to Acadia began the next day. The only noticeable thing on the highway was that I had two incidents of people driving off the road almost immediately in front of me. I have just never seen that in my many years of driving in Florida. People hit each other’s cars, but they don’t just drive off the road. Maybe Maine has different driving regulations, or many people in Maine have special training to avoid cars by driving off the highway.

Once we got to the hotel, the staff were extra nice: they recommended sites at the national park and offered to print the vehicle pass since I didn’t have one. It was tough to get the barely-operational wifi to work, but the helpful staff made it sufferable.

Acadia National Park is as expected for any national park in the peak season: there were a lot of people. I rarely see that many people in the U.S. outside of large cities. Sand Beach was especially bad, since there were miles of one-way road, and the only way to find parking was to loop back after driving 30 minutes. Otherwise it’s about the same amount of time to walk after finding parking. Because the daytime entrance window is only 30 minutes from the booking time, I had to try really hard to get through the area. I don’t remember much, since I just dropped off my family and never stayed at the beach.

The Cadillac summit road was interesting to drive on. The only comparison I have is driving on Smoky Mountain roads in the Carolinas. Unless you are used to mountain roads, the speed limit is your friend. It’s not that the road is extra narrow, but the feeling of imminent cliff diving made it more challenging for my shaking legs. Anyway, the glacial rocks were extra fun to walk and hike on once we got to the peak of the mountain. You can walk almost anywhere on the peak. Only around half of the rocks were covered by trees, so most of the peak is surprisingly accessible on foot. The large groups of flies were sort of annoying if you have kids. And although it’s rarely sunny near the peak, extra sunscreen is recommended; I got sunburned with a regular amount of sunblock.

One extra note: most people around our hotel wore masks, but most visitors to the national park chose not to wear one. I had a feeling that’s because the local population is older than the visitors. But it could also be that the local culture is much more accepting of wearing masks.

Overall, it was a fun, relaxing trip. We probably had way too much lobster, even though we are not such fans of that much protein.

AI, Journal

Inspired by Github Copilot and What Makes a Good Programmer

Reading Time: < 1 minute

Recently Github started to send out invites for Copilot. It’s an AI-assisted code generator for several different languages. For Python, it will generate efficient code according to the docstring the programmer wrote. For other languages, it will infer from the function declaration. I tested it on Leetcode, and the time and space complexity were quite good. Although it struggles with some of the hardest tasks, it fulfills the promise it claims.

Should you use it, though? The way the model is trained, it uses docstrings and publicly available code. There are obvious licensing issues. Can you use someone’s code if they did not explicitly state it’s open source, even if it’s in a public repository? Cases have already been discovered with personal info in code comments or in embedded HTML. That makes people think twice about using it if they might be sued later.

Another point: should you use it even if it’s legal? For now, it only generates a single function. I haven’t seen it write a complete class or generate scripts with a folder structure. When a program gets more complicated, a lot of higher-level CS concepts like cohesion, coupling, and the use of design patterns are more important than writing an efficient function. Therefore, I would position this as a tool for beginners to learn programming rather than an actual tool for advanced programmers to deploy. I have been learning and debating about when to use object-oriented programming and when to use functional programming, and I found the following resources helpful. For now, I’m still in the camp of learning better structure rather than blindly using Copilot to generate programs.

#ArjanCodes channel on Youtube: https://www.youtube.com/watch?v=ZsvftkbbrR0&list=PLC0nd42SBTaNuP4iB4L6SJlMaHE71FG6N&index=7

Python 3 Object-Oriented Programming (book): https://www.packtpub.com/product/python-3-object-oriented-programming-third-edition/9781789615852

Misc

People just like to watch

Reading Time: < 1 minute

I was watching a youtube video today on how to make bonsai, and I was thinking about how to make my own bonsai better. And I felt that it’s just too much work. I’m content with having it the way it is now. I realized that I don’t enjoy shaping my bonsai; even owning the bonsai isn’t that attractive to me. I probably just want to watch other people making bonsai and want nothing to do with it myself. I’m perfectly happy with just watching.

I don’t think it’s a unique feeling, because there are so many channels on Youtube for watching other people play games. Heck, the whole sports industry is watching other people play a game.

So I decided maybe I will just make videos of myself building programming projects. One, I actually like to build projects with code. Two, other people might not want to spend the time coding, but might enjoy watching someone else build a project.

So look forward to my videos, talking about the computer programming projects I will make.
