Donche

Blogue One

Titanic - Kaggle Competition

It was the ship of dreams to everyone else. 0. Introduction The sinking of the Titanic was a famous shipwreck in the North Atlantic that lasted from the late night of April 14, 1912 to the early morning of the 15th. The disaster shocked the world, killing more than 1,500 people and becoming the worst peacetime shipwreck in history[1]. Titanic: Machine Learning from Disaster is an entry-level competition on Kaggle and is currently the most contested one, with more than 10,000 teams. After the launch of ...

Create ASCII Art with Python

It has been a long time since I last wrote a blog post. This time I will be talking about a small project that generates ASCII art from images or videos and eventually plays the generated ASCII video with music. The source code can be found here. 0. About ASCII art This is surely something interesting to talk about. ASCII art is a graphic design technique that uses computers for presentation and consists of pictures pieced together from the 95 printable (from a total of...

MBTI Personality Type Prediction Based on Text Analysis

Kaggle is a great place to search for interesting datasets. This time, I found a dataset called (MBTI) Myers-Briggs Personality Type Dataset, which contains the MBTI types of 8600 people along with their posts. Based on several analyses posted on Kaggle, I used this dataset to train a model for personality prediction. Finally, the personalities of several characters in the series Bojack Horseman are predicted. According to Wikipedia, although it is popular in the business sector, the MBT...

Douban Through 200,000 Albums

Last time I mentioned that my spider had collected 60,000 albums. Since then I have been thinking about how to deal with this data. Meanwhile, the spider was busy wandering around to get more. By now I have 200,000 albums, and we’ll see what we can learn from this enormous amount of information. 0. Review of Previous Blog In the previous blog, I wrote a spider (also called a web crawler) to get information about albums. In fact, there are not only albums, but also EP...

Get More Than 80,000 Albums Using Handwritten Web Crawler

This project is inspired by another similar work[1]. In addition, I also referred to a book about spiders. I intended to do it with Scrapy, a powerful web crawling framework. But it turned out that my first try was a good one, and it got more than 50,000 albums for me. At the same time, it was not banned by the web site (probably because it just crawled too slowly. Anyway, I’ll try next time to make it fast enough to get banned…). Then I thought it might be a good one and useful for this small project. Thi...

PageRank and graph based recommender system

So why PageRank and a graph based recommender system? Because the graph based PersonalRank [1] is derived from the Topic-Sensitive PageRank [2]. So if one wants to completely understand PersonalRank, it’s better to start with the famous PageRank [3]. 1. PageRank According to Wikipedia: PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine results. PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring the imp...

Latent Factor Recommender System

Of course, there are many other algorithms besides the neighborhood-based recommender system mentioned earlier, such as Latent Factor Models. While reading about them in “Recommender System Practices” [1], I found some errors in the formulas and code, and some of the code lacks explanation. I also spent a lot of time tuning parameters, so I decided to write a blog post to record it. 1. Latent Factor Models Latent Factor Models are the most popular research topic in the recommend...

Neighborhood-based Recommender System

1. Neighborhood-based algorithm Neighborhood-based algorithms can be divided into two categories: user-based collaborative filtering and item-based collaborative filtering. Both of these algorithms are already mature, so this blog is just written as notes. The data set used is the 1M data set provided by MovieLens and can be downloaded from this site. Load the data set and split each user’s watched movies 7:3 into a training set and a test set. ...
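
The per-user 7:3 split mentioned in this excerpt could look roughly like the sketch below. It assumes the ratings have already been loaded as (user_id, movie_id) pairs; the function name `split_per_user`, the `train_ratio` parameter, and the fixed seed are illustrative choices and are not taken from the original post.

```python
import random
from collections import defaultdict


def split_per_user(ratings, train_ratio=0.7, seed=0):
    """Split each user's movies into train/test sets (hypothetical helper).

    ratings: iterable of (user_id, movie_id) pairs.
    Returns two dicts mapping user_id -> set of movie_ids.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group the movies watched by each user.
    by_user = defaultdict(list)
    for user, movie in ratings:
        by_user[user].append(movie)

    train, test = {}, {}
    for user, movies in by_user.items():
        shuffled = list(movies)
        rng.shuffle(shuffled)
        # Roughly train_ratio of this user's movies go to the training set.
        cut = round(len(shuffled) * train_ratio)
        train[user] = set(shuffled[:cut])
        test[user] = set(shuffled[cut:])
    return train, test
```

Splitting inside each user, rather than over the whole rating list, keeps every user present in the training set, which neighborhood-based methods need in order to compute similarities.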

Stereo Vision Summary

1. Stereo Vision Summary The source code is on my GitHub. I have implemented several stereo vision methods (but not Feature Matching yet. Although feature point matching is more accurate, often only a few dozen matches remain after filtering. If the edges in the environment are not obvious, there may be even fewer. The matrices R and T estimated this way are very unreliable, so I stopped there). The accuracy of Block Matching is quite good. In general, as long as th...