After several years (really!) of development, I’m pleased to announce that rankit has been upgraded to v0.3. The previous major release was v0.1, which came out in 2016. So what has changed during these three years?
When rankit v0.1 was developed, its aim was to implement all the fascinating algorithms mentioned in a beautiful ranking book and to provide an alternative ranking solution for Bangumi. The programming interface designed at that time was far from practical. The more ranking algorithms I looked into, the more I felt that rankit should include some popular ranking algorithms based on time series. One of them is Elo ranking, which got a simple implementation in rankit v0.2. But its usage scenario was limited: updating the ranking was tedious, and its interface was not compatible with the existing ranking algorithms.
In v0.3, the following updates address the problems mentioned above:
Rankers are split into “unsupervised rankers” and “time series rankers”. Each kind has its own interface, and both consume a Table object to keep records.
Introduced the Elo ranker, the Glicko 2 ranker, and the TrueSkill ranker (only paired competition records are supported).
Updated the interface to make it more user-friendly. An unsupervised ranker only provides a rank method, since it needs no updates. For a time series ranker, one should use update to accumulate new record data, and one can retrieve the latest leaderboard by invoking leaderboard after each update (see the sketch below).
The documentation of rankit has also been updated.
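To give a feel for the new split, here is a minimal usage sketch. The class names and the Table column convention below are my shorthand based on the description above, not a verified API; the documentation has the authoritative signatures.

import pandas as pd
from rankit.Table import Table                    # assumed import paths;
from rankit.Ranker import MasseyRanker, EloRanker # check the docs

# A tiny record set: host, visitor, and their scores.
data = pd.DataFrame({
    'host': ['A', 'B', 'C'],
    'visit': ['B', 'C', 'A'],
    'hscore': [3, 1, 2],
    'vscore': [1, 2, 2],
})
table = Table(data, col=['host', 'visit', 'hscore', 'vscore'])

# Unsupervised ranker: stateless, so a single rank() call suffices.
print(MasseyRanker().rank(table))

# Time series ranker: accumulate records with update(), then read the
# current standings with leaderboard().
elo = EloRanker()
elo.update(table)
print(elo.leaderboard())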
One side product of rankit is the “Scientific animation ranking for Bangumi”. For a long time that ranking was not updated, and it was gradually forgotten. I gave it a refresh last year, and it now has a completely new look, faster response, and a simple search functionality. More importantly, the ranking will be updated monthly. I invite you all to try it here: https://ranking.ikely.me
The latest scientific animation ranking also involves minor algorithm changes. Every so often, users with peculiar intentions pour into Bangumi to rate one specific animation with an extremely high or low score, which distorts the proper order of the ratings. The previous version of the scientific ranking could ignore users who rate one anime and leave permanently, but it could not handle users who rate multiple animations. I now adjust each user’s overall scores, applying several normalizations according to the distribution of that user’s ratings, and all versions of the normalized scores are fed into rankit to calculate the latest rank. The final rank is still merged using Borda count.
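To make that concrete, here is an illustrative sketch of one such normalization plus the Borda merge. The column names and the particular normalizations are hypothetical; the production pipeline differs in its details.

import pandas as pd

ratings = pd.DataFrame({
    'user':  [1, 1, 1, 2, 2, 3],
    'anime': ['a', 'b', 'c', 'a', 'c', 'b'],
    'score': [10, 7, 8, 2, 3, 9],
})

# One normalized view: z-score each user's ratings, so a user who rates
# everything 10 carries no more weight than a moderate rater.
g = ratings.groupby('user')['score']
ratings['z'] = (ratings['score'] - g.transform('mean')) / g.transform('std').fillna(1.0)

# Each view yields its own ranking (rank 1 = best).
rank_z = ratings.groupby('anime')['z'].mean().rank(ascending=False)
rank_raw = ratings.groupby('anime')['score'].mean().rank(ascending=False)

# Borda count merge: an item earns (n - position) points in each ranking,
# and the merged order sorts by total points.
n = ratings['anime'].nunique()
borda = (n - rank_z) + (n - rank_raw)
print(borda.sort_values(ascending=False))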
But could this be solved from another perspective? One thing I have been thinking about is how to bring time series ranking into the current ranking scheme. Ideally, time series ranking should combat ranking manipulation in a way that pairwise ranking cannot. As I was reading about TrueSkill, its brilliant idea of inferring rankings with a linear-chain graph struck me: TrueSkill is a generative graphical model organized in the same order as the competition scores. Another issue to resolve is helping users adjust historical ratings automatically: a user may have rated over a wider range in the past, but his or her rating criteria can change as time goes by. How do we propagate recent rating criteria back to historical ratings? All of this should be supported in the next version of rankit: ranking for multiple players in one match, and the power to propagate recent competition results back to historical data.
Mangaki data challenge is an otaku-flavored data science competition. Its goal is to predict a user’s preference about an unwatched/unread anime/manga between two choices: wish to watch/read, and don’t want to watch/read. The competition provides training data from https://mangaki.fr/, which allows users to favorite anime/manga works. Three major training tables are provided, described as follows:
Wish table (about 10k rows):

User_id    Work_id    Wish
0          233        1
…          …          …
Record table, for anime/manga already watched/read. There are four ratings here: love, like, neutral, and dislike.

User_id    Work_id    Rate
0          22         like
2          33         dislike
…          …          …
Work table: detailed information on the available anime/manga. There are three categories: anime, manga, and album. There is only one album in this table; all the others are anime (about 7k) and manga (about 2k).

Work_id    Title         Category
0          Some_anime    anime
1          Some_manga    manga
…          …             …
For the testing data, one should predict, for 100k user/work pairs, whether the user wishes to watch/read that anime/manga. As you can see, the testing data is much larger than the training data. Moreover, my analysis of the dataset showed that it is not even guaranteed that every user or work appearing in the test set is contained in the training set.
Traditional recommendation system methods (that I know)
Building recommendation systems has long been studied, and there are various methods for solving this particular problem. I, too, tried to build a recommender for https://bgm.tv several years ago (you can read the technical details here). The simplest solution is SVD (actually, an even simpler and more intuitive solution is KNN); from there one can move on to RBM, FM, FFM, and so on. One assumption that holds firm in all these methods is that each user has an embedding vector capturing their preferences, and each work likewise has an embedding vector capturing its characteristics. But is it reasonable that we should be constrained to this embedding-dot-product model?
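To spell out that assumption, here is a toy sketch of an SVD recommender: factor the user-item matrix and predict affinity as an inner product. The matrix, the dimension k, and the indices below are all arbitrary illustrations.

import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(0, 2, size=(100, 50)).astype(float)   # toy user-item matrix

k = 8                                    # embedding dimension, chosen arbitrarily
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_emb = U[:, :k] * s[:k]              # one embedding vector per user
item_emb = Vt[:k, :].T                   # one embedding vector per work

# The predicted preference of user u for work i is just a dot product.
u, i = 3, 7
print(user_emb[u] @ item_emb[i])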
Recently, the common practice in Kaggle competitions has been to use GBDT to solve (almost all, except the computer vision related) problems. As long as a model handles classification, regression, and ranking problems well, it can be applied to any supervised machine learning problem! And with model ensembling under the StackNet framework, one can join the different characteristics of models together to achieve the best result.
In this competition, my solution is quite plain and straightforward: feature engineering to generate some embeddings, then GBDT/Random Forest/Factorization Machine models built on different combinations of features. Finally, I used a two-level stack net to ensemble them, in which level two is a logistic regression model.
Feature Engineering
From wish table:
Distribution of user’s preference on anime/manga (2d+2d)
Distribution of item’s preference (2d)
Word2vec embedding of user on wish-to-watch items (20d; see the sketch after this list)
Word2vec embedding of user on not-wish-to-watch items (10d)
Word2vec embedding of item on wish-to-watch users (20d)
Word2vec embedding of item on not-wish-to-watch users (10d)
LSI embedding of user (20d)
LSI embedding of item (20d)
From record table:
Distribution of user’s preference on anime/manga (4d+4d)
Distribution of item’s preference (4d)
Mean/StdErr of user’s rating (2d)
Mean/StdErr of item’s rating (2d)
Word2vec embedding of user on loved and liked items (32d)
Word2vec embedding of user on disliked items (10d)
Word2vec embedding of item on loved and liked users (32d)
Word2vec embedding of item on disliked users (10d)
LSI embedding of user (20d)
LSI embedding of item (20d)
LDA topic distribution of user on love, like and neutral items (20d)
LDA topic distribution of item on love, like and neutral ratings (20d)
Item category (1d, categorical feature)
User Id (1d, only used in FM)
Item Id (1d, only used in FM)
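For the word2vec features above, each user’s list of items (or each item’s list of users) can be treated as a “sentence” of ids. A sketch of that idea with gensim, whose Word2Vec I used with mostly default parameters; the lists and dimensions below are illustrative, and gensim ≥ 4 calls the dimension vector_size (older versions called it size):

from gensim.models import Word2Vec
import numpy as np

# Hypothetical wish lists: one "sentence" of work ids per user.
sentences = [
    ['w12', 'w7', 'w33'],
    ['w7', 'w45'],
    ['w33', 'w12', 'w45', 'w2'],
]

model = Word2Vec(sentences, vector_size=20, window=5, min_count=1)

item_vec = model.wv['w7']     # a 20d embedding for work w7

# A user embedding can then be, e.g., the mean of the embeddings of the
# works on that user's list.
user_vec = np.mean([model.wv[w] for w in sentences[0]], axis=0)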
Model ensembling
The first layer of the stack net is a set of models that each have good predictive capability but different inductive biases. Here I tried just three models: GBDT, RF (both backed by LightGBM) and FM (backed by fastFM). I trained models on the record table features and the wish table features separately, and one can further train different models on different combinations of features. For example, one can use all the features (except user id and item id) from the record table. But since GBDT fixates on the most informative features when given all of them, it helps to split the features into several groups and train separate models. In this competition I did not split them much (just because I did not have much time); I only removed the first four features (because I saw from the prediction results that they were having a major effect on precision) and trained some additional models.
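Here is a sketch of this two-level stacking, with out-of-fold predictions from the level-1 models feeding the level-2 logistic regression. The models and parameters below are illustrative stand-ins for the actual GBDT/RF/FM mix.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

level1 = [
    GradientBoostingClassifier(random_state=0),   # stand-in for GBDT
    RandomForestClassifier(random_state=0),       # stand-in for RF
]

# Out-of-fold probabilities keep the labels from leaking into level 2.
meta = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method='predict_proba')[:, 1]
    for m in level1
])

# Level 2: a logistic regression over the level-1 predictions.
level2 = LogisticRegression().fit(meta, y)
print(level2.predict_proba(meta)[:5, 1])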
In this competition, a single GBDT using all the features from the record table reaches 0.85567 on the LB. By leveraging the model stacking technique, one can reach 0.86155, which is my final score.
Is this the ultimate ceiling?
Definitely not. One can push the boundary much further:
I did not tune the embedding generation parameters very well. In fact, I generated those features using the default parameters gensim provides. The dimensions of the embeddings were just my snap decisions, no science involved. Maybe one could enlarge the sliding window of word2vec or use more embedding dimensions to achieve better results.
I did not introduce any features generated by deep models. GBDT is the kind of model that relies on heavy feature engineering, while a deep model learns features automatically. By combining them together in a stacking model, one could surely obtain a much higher AUC.
I did not use more complex features. Sometimes the popularity ranking also affects user behavior: a user will tend to select animes ranked high as “wish to watch”. I did not try this idea out.
Conclusion
I must say this competition was very interesting, because I have seen no other competition target anime/manga prediction. Another good point is that the training data is very small, so I could do CV efficiently on my single workstation. And before this competition I had never tried a stack net. This competition gave me some experience in doing model stacking in an engineering-friendly way.
One thing I regret is that too few competitors took part in this competition. Though I tried to call for participants on Bangumi, it seems not many people joined. The competition holders should make their website more popular before holding the next data challenge!
One more thing: you may be interested in the code. I put all my code here, but it is not arranged in an organized way. The most important files are “FeatureExtraction.ipynb” and “aggregation.py”: they cover how the feature engineering is done and how the features are partitioned. “CV.ipynb” gives some intuition on how to train the models.
head user-2017-02-17T12_26_12-2017-02-19T06_06_44.tsv -n 5
head record-2017-02-20T14_03_27-2017-02-24T10_57_16.tsv -n 5
head subject-2017-02-26T00_28_51-2017-02-27T02_15_34.tsv -n 5
uid name nickname joindate activedate
7 7 lorien. 2008-07-14 2010-06-05
2 2 陈永仁 2008-07-14 2017-02-17
8 8 堂堂 2008-07-14 2008-07-14
9 9 lxl711 2008-07-14 2008-07-14
name iid typ state adddate rate tags
2 189708 real dropped 2016-10-06
2 76371 real dropped 2015-11-07
2 119224 real dropped 2015-03-04
2 100734 real dropped 2014-10-09
subjectid authenticid subjectname subjecttype rank date votenum favnum tags
1 1 第一次的親密接觸 book 1069 1999-11-01 57 [7, 84, 0, 3, 2] 小説:1;NN:1;1999:1;国:1;台湾:4;网络:2;三次元:5;轻舞飞扬:9;国产:2;爱情:9;经典:5;少女系:1;蔡智恒:8;小说:5;痞子蔡:20;书籍:1
2 2 坟场 music 272 421 [108, 538, 50, 18, 20] 陈老师:1;银魂:1;冷泉夜月:1;中配:1;银魂中配:1;治愈系:1;银他妈:1;神还原:1;恶搞:1;陈绮贞:9
4 4 合金弹头7 game 2396 2008-07-17 120 [14, 164, 6, 3, 2] STG:1;结束:1;暴力:1;动作:1;SNK:10;汉化:1;2008:1;六星:1;合金弹头:26;ACT:10;NDS:38;Metal_Slug_7:6;诚意不足:2;移植:2
6 6 军团要塞2 game 895 2007-10-10 107 [15, 108, 23, 9, 7] 抓好社会主义精神文明建设:3;团队要塞:3;帽子:5;出门杀:1;半条命2:5;Valve:31;PC:13;军团要塞:7;军团要塞2:24;FPS:26;经典:6;tf:1;枪枪枪:4;2007:2;STEAM:25;TF2:15
Since the tag list is delimited by semicolons, we first use split($9, tags, ";") to store the split strings in the array tags. Then, inside for (i in tags), i is actually the index; for each tag we call split again to get the tag itself and the number of users who applied it. As you can see, awk’s per-line processing logic can be written in a C-like style. It may also span multiple lines, although I wrote everything on one line.
After that, we sort first by subject id, and then, within each subject, by the number of users who applied each tag. Here sort needs two -k options to specify the sort order. We also attach -nr to each sort key, so that different columns can be sorted with different logic.
awk -F "\t" '{cnt[$2]+=1; cum[$2]+=$6}; END {for(i in cnt){printf("%d\t%f\t%d\n", i, cum[i]/cnt[i], cnt[i]);}}' < anime_record.tsv | sort -t$'\t' -k2,2nr -k3,3nr | head
Note that here we again use awk’s dictionary data structure (the associative array); you can think of it as a Python dict. We use two variables, cnt and cum, to store the number of ratings and the sum of ratings for each anime id. At the end, we produce the final result in the block guarded by END; statements inside END run only after the whole file has been scanned.
The comm command operates on the lines of two files, with the precondition that both files are already sorted. It can emit three columns: lines that appear only in the first file, lines that appear only in the second file, and lines that appear in both. In other words, it computes set differences and intersections. By passing -13 we suppress the first and third columns, printing only the lines unique to the second file, i.e. the users that were not crawled into user_list.tsv.
4. Join
Obtaining the user ids for the collection records
In the collection data we recorded each user’s username, but not the user id or the user’s nickname! This was a flaw in the crawler’s design. The only remedy now is to join against the user data. But how does one perform a join on plain text files?
First, we extract the two sets of data:
sed 1d record-2017-02-20T14_03_27-2017-02-24T10_57_16.tsv | sort -t$'\t' -k1,1 > record.sorted.tsv
head record.sorted.tsv
A maximum flow problem is generally defined like this: there is a directed graph G with a source vertex s and a sink vertex t, and every edge has a maximum flow capacity c. The source s only emits flow and the sink t only absorbs flow. Find the maximum possible flow that can go from the source to the sink.
I will skip the proof and state the conclusion directly: we use a repeated search, each time trying to find a path from s to t in the current graph G whose edges all allow nonzero flow. Once such a path is found, we can determine its flow bottleneck, subtract that bottleneck from the allowable flow of every edge on the path, and run the next search on the new graph. (Pushing flow along an edge also adds the same amount back to its paired reverse edge in the residual graph, which is what lets a later search cancel earlier flow; this is the edges[e ^ 1].c += delta line in the code below.) We keep searching until no path from s to t with a positive bottleneck can be found. This is called the Ford-Fulkerson algorithm, and the search can be done with BFS.
void bfs() {
    // Reverse BFS from the sink: compute distance labels dep[] and the
    // number of vertices at each label, depcnt[], for the SAP algorithm.
    memset(dep, 0xff, sizeof(dep));
    memset(depcnt, 0, sizeof(depcnt));
    dep[sink] = 0;
    depcnt[dep[sink]] = 1;
    std::queue<int> Q;
    Q.push(sink);
    int u, v;
    while (!Q.empty()) {
        u = Q.front(); Q.pop();
        // Zero-capacity edges are the reverse sides of forward edges,
        // so following them walks backwards toward the source.
        for (int e = head[u]; ~e; e = edges[e].prev)
            if (edges[e].c == 0 && dep[edges[e].v] == -1) {
                v = edges[e].v;
                Q.push(v);
                dep[v] = dep[u] + 1;
                depcnt[dep[v]]++;
            }
    }
}
int SAP() {
    bfs();
    memcpy(cur, head, sizeof(head));
    int u = source, maxflow = 0, top = 0, i;
    while (dep[source] < N) {
        if (u == sink) {
            // An augmenting path is complete: find the bottleneck
            int delta = inf, p;
            for (int t = top - 1; t >= 0; t--) {
                int e = stack[t];
                if (edges[e].c < delta) { delta = edges[e].c; p = t; }
            }
            // Push flow: decrease capacity along the path and increase
            // it on the paired reverse edges
            for (int t = top - 1; t >= 0; t--) {
                int e = stack[t];
                edges[e].c -= delta;
                edges[e ^ 1].c += delta;
            }
            maxflow += delta;
            // Continue from the tail of the bottleneck edge
            top = p;
            u = edges[stack[p]].u;
        }
        // Try to advance along an admissible edge (label drops by one)
        for (i = cur[u]; ~i; i = edges[i].prev) {
            if (edges[i].c > 0 && dep[edges[i].v] == dep[u] - 1) {
                cur[u] = i;
                u = edges[i].v;
                stack[top++] = i;
                break;
            }
        }
        if (i == -1) {
            // No admissible edge: relabel u. The gap heuristic stops the
            // whole algorithm once some distance label empties out.
            depcnt[dep[u]]--;
            if (depcnt[dep[u]] == 0) break;
            int mindep = N;
            for (int e = head[u]; ~e; e = edges[e].prev) {
                if (edges[e].c > 0 && mindep > dep[edges[e].v]) {
                    mindep = dep[edges[e].v];
                    cur[u] = e;
                }
            }
            dep[u] = mindep + 1;
            depcnt[dep[u]]++;
            if (u != source) {
                u = edges[stack[--top]].u;
            }
        }
    }
    return maxflow;
}
int main() {
    freopen("testcase", "r", stdin);
    scanf("%d%d", &N, &M);
    source = 1, sink = N;
    memset(head, 0xff, sizeof(head));
    for (int i = 1; i <= M; i++) {
        int u, v, c;
        scanf("%d%d%d", &u, &v, &c);
        addedge(u, v, c);
    }
    // Compute and print the maximum flow
    printf("%d\n", SAP());
    return 0;
}
I originally thought this problem would be fairly simple, since it is a pure supervised learning problem. But I found that the gap between me and the top players is still large, in both skill and resources. What angered me most is that the day before, I read on Zhihu that someone could reach 0.29 simply by taking a weighted average of the gaps of the previous three time slots. I was furious: I spent so much time and effort to reach that score, and it turned out to be no better than such a simple, brainless method??? If more people had teamed up with me, our thinking would have been broader, and I would not have walked into this kind of dead end.
v_0 <- a vertex with odd degree or, if no such vertex exists, any arbitrary vertex
Repeat:
    select a vertex v_{i+1} adjacent to v_i which does not separate the graph,
    or, failing that, the only vertex adjacent to v_i
    remove edge <v_i, v_{i+1}> and jump to v_{i+1}
Until all edges have been visited.
Return the sequence of visited edges.
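A runnable sketch of this procedure (Fleury’s algorithm), assuming the graph actually has an Eulerian path; the adjacency-dict representation is my own choice for illustration:

from collections import Counter

def euler_path(adj):
    # adj: dict mapping vertex -> Counter of neighbor -> edge multiplicity,
    # an undirected multigraph kept symmetric.
    # Start from an odd-degree vertex if one exists, else anywhere.
    start = next((v for v in adj if sum(adj[v].values()) % 2 == 1),
                 next(iter(adj)))

    def reachable(src):
        # Number of vertices reachable from src over the remaining edges.
        seen, stack = {src}, [src]
        while stack:
            u = stack.pop()
            for w, m in adj[u].items():
                if m > 0 and w not in seen:
                    seen.add(w)
                    stack.append(w)
        return len(seen)

    def separates(u, v):
        # Would removing edge (u, v) disconnect the remaining graph?
        before = reachable(u)
        adj[u][v] -= 1; adj[v][u] -= 1
        after = reachable(u)
        adj[u][v] += 1; adj[v][u] += 1
        return after < before

    path, u = [start], start
    while sum(adj[u].values()) > 0:
        nbrs = [w for w, m in adj[u].items() if m > 0]
        # Prefer a non-separating edge; cross a bridge only as a last resort.
        v = next((w for w in nbrs if not separates(u, w)), nbrs[0])
        adj[u][v] -= 1; adj[v][u] -= 1
        path.append(v)
        u = v
    return path

# Example: the path graph 1-2-3 yields the Eulerian path [1, 2, 3].
g = {1: Counter({2: 1}), 2: Counter({1: 1, 3: 1}), 3: Counter({2: 1})}
print(euler_path(g))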
Today’s problem comes from Soldier and Number Game. What it boils down to is: given two numbers $1 \le b \le a \le 5\,000\,000$, compute the sum of the numbers of prime factors of all the integers between them. Quickly finding the number of prime factors of a number is the key to this problem.
The Sieve of Eratosthenes is the key here. It is a method for generating primes and testing primality. The idea is to sieve out the integer multiples of each prime: a proper multiple of a prime is necessarily composite, so it is filtered out. Among the numbers that have not yet been sieved out, continue with the multiples of the current smallest number, until everything has been sieved. The C++ code is shown below:
Sieve of Eratosthenes
// prime number table from 2 to 5000000
// those who have number 0 are prime numbers.
int table[5000001];

void build_table() {
    const int limit = sqrt(5000000) + 1;
    table[0] = table[1] = -1;
    for (int i = 2; i <= limit; i++)
        if (!table[i])
            for (int j = 2; j * i <= 5000000; j++)
                table[i * j] = i;
}

int isprime(int i) { return !table[i]; }
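Because a composite n ends up with table[n] holding one of its prime factors, counting the prime factors of any n takes only repeated division, and the per-query sums can then be precomputed as a prefix array. A sketch of that counting step, written in Python for brevity (the table-building mirrors the C++ above):

# Build the same factor table: table[n] == 0 marks n as prime, otherwise
# table[n] holds one prime factor of n (-1 marks 0 and 1 as neither).
LIMIT = 5_000_000
table = [0] * (LIMIT + 1)
table[0] = table[1] = -1
for i in range(2, int(LIMIT ** 0.5) + 1):
    if not table[i]:                      # i is prime
        for j in range(2 * i, LIMIT + 1, i):
            table[j] = i                  # record one prime factor of j

def omega(n):
    # Number of prime factors of n >= 2, counted with multiplicity.
    count = 0
    while table[n] > 0:                   # n is still composite
        n //= table[n]
        count += 1
    return count + 1                      # the leftover n is prime

# With a prefix sum of omega over 2..5000000, each (a, b) query is then
# answered in O(1).
print(omega(12))                          # 12 = 2 * 2 * 3 -> 3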