You tell me I'm wrong. Then you'd better prove you're right.

My Kaggle Days China experience

I have been a Kaggle fan for a long time. A community of data scientists and engineers devoted to pioneering data science practice has always been attractive to me. Though I’m not a dedicated Kaggler, I still spend several months’ worth of weekends each year on Kaggle competitions after work, to grasp that spirit of dedication and almost religious attitude. That’s why I registered the moment I received the registration notification email.

Besides the lectures and Grandmasters, another thing that specifically attracted me was the offline data science competition. I have been curious about Kagglers’ authentic offline coding ability, since in my opinion most Kagglers online are fed by public Kernels. What would their real performance be without Kernels? Though I’m only a linear-model Kaggler, I’m certain I am somewhat more experienced than others in feature selection, so there might be a chance for me to win.

I checked the previous Kaggle Days before this event, and all the offline competitions had been about tabular data. So I tried to familiarize myself with the common EDA APIs and code snippets for pandas feature generation and scikit-learn compatible cross-validation. I know well that my traditional machine learning skills cannot reach a high place on the leaderboard in the age of deep learning, so I invited one of my colleagues, Lihao, a deep learning expert, to join me.


Opening of Kaggle Days China

The first day of the event was all about lectures and workshops. There were several interesting workshops to attend, but you had to register first. My first regret was a lecture about LightGBM that I really wanted to attend, but it conflicted with a workshop about modeling pipelines. In fact, I caught the lecture as it was about to end, and even so I still learned something insightful from it. I may need to review the lecture videos later.

Someone once said that the whole point of attending a technical meeting is to chat with people: no lectures, no workshops, just communication. And I have to say this was the best part of Kaggle Days. I did talk with a lot of people, though I was still too shy during the meeting: it was never me who tried to get to know others first. Everybody in the Expert group has their own domain knowledge; not all of them are necessarily experienced Kagglers, but they know the AI industry in China very well. As an SDE working in a small city, Suzhou, I had not felt this excitement of communicating with industry experts since I left Beijing in 2015.

Announcement of the competition topic on day 2

At the end of the day, the organizer disclosed the topic of the next day’s data science competition. Though I had expected another tabular data competition, the title indicated that it would be a computer vision competition. It reminded me of a previous competition classifying astronomical objects, though it would not necessarily take the same form. Having no practical knowledge of contemporary computer vision, which deep learning has come to dominate, I regretted not following my graduate school advisor, Lianwen Jin, more closely. My working experience could not contribute to this competition either, since I work in an NLP group. Fortunately, as we were about to leave, a guy came to us asking if he could join. He said he had some CV background. This was perhaps the best news I received that day, and I was grateful to him.


I had decided the night before that if the competition really was about computer vision, I would resort to what I had learned this summer, the only thing in modern deep-learning-based CV I knew how to use. It turned out it was. The competition required us to classify images into two classes: a typical binary classification problem.

CV requires GPU-equipped machines, and on the night before the competition we had to configure our machines on a specified service provided by UCloud: a Jupyter Notebook backed by 8 GPUs. Without proper configuration, though, the machine was almost unusable. It had TensorFlow 2.0 alpha installed, neither the final release nor the stable 1.14, so Lihao spent a lot of the morning configuring it.

I originally thought one should perform EDA and a proper train/test split first, but I soon discarded that idea for this CV competition. Williame Lee, the guy I mentioned above, did spend some time inspecting the data first, trying to find patterns in the images. But in my opinion, features are extracted automatically by deep neural networks, and even if I spotted some patterns, I wouldn’t know where to feed them into the network.

Kaggle offline competition ongoing

The core idea was to use pre-trained networks to classify images, fine-tuning them at the end. This turned out to be a very successful idea. I spent the whole morning building the pipeline, and it worked! My first classifier, using ResNet34 as the pretrained model, matched the baseline. Later, Lihao trained this model further to push it to 0.85, and we tried several other models like ResNet18 and ResNet50. Even a network as simple as ResNet18 reached a respectable 0.82 after fine-tuning. Williame also developed a MobileNet-based model that reached 0.81 on the public leaderboard.

Meanwhile, Fengari shared his 0.88 baseline, which used EfficientNet. You can imagine how many competitors that fed. Lihao then switched to this new baseline and adapted its cross-validation scheme. In the end we merged our three ResNet models and two EfficientNet adaptations as the final result, which placed us 17th on the leaderboard (out of 34 teams). Not too bad for my first CV competition!

Day 2 felt like being thrown into a swimming pool (I really can’t swim) and learning to swim by myself. I successfully trained a deep neural network for computer vision for the first time. I’m not afraid of CV any more!

The organizer soon announced the winners, who had been sitting right in front of us during the whole competition. They used multi-task learning to improve their model, a key technique that Williame had hinted at in the morning. Their solution is here:

Before this Kaggle Days, my ambition was to stand on the winners’ stage. Unfortunately, I still have many things to learn to achieve that goal. I later asked Lihao whether this was the kind of tech venue he likes; his answer was no, but he still praised the core spirit of the Kaggle community. I hope next year I can find someone with the same mindset as me and debut together. And if I could cross-dress next time, even better!

Saving the React novice: update a component properly

Front-end development in React is all about updating components. Since the business logic of a page lives in render() and the lifecycle hooks, setting state and props properly is a core duty of every React developer: it affects not only functionality but also rendering efficiency. In this article I’ll start from the basics of React component updates, then look at some common mistakes React novices make. I’ll also describe some optimization techniques to help novices build proper habits when thinking in React.


A webpage written in React is driven by state. That is to say, every change to a React component’s state leads to a change in the appearance of the page. Since state can be passed down to child components as props, changes of state and props are responsible for the variation of the view. However, the React documentation points out two key principles:

  1. One cannot change props directly.
  2. One can only update state by using setState().

These two constraints follow from how React works. React’s data flow is unidirectional, so one cannot mutate props owned by a parent component from a child component. setState() is wired into a component’s lifecycle hooks, so any attempt to change state without setState() bypasses the lifecycle hooks’ functionality:

React lifecycle hooks diagram, referring to

However, in my development experience I have observed numerous cases where these two principles are broken. A major part of those misbehaving patterns can be traced back to selective ignorance of an important property of JavaScript.

Assignment: reference or value?

Assignment to const value

Let’s look at the example above. const declares a variable that cannot be assigned a second time, so there’s no surprise that assigning 3 to temp throws a TypeError. However, const implies nothing about the constness of an object’s internal fields: mutating a field of obj just works.
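The embedded snippet is missing from this copy; a minimal reconstruction of what it demonstrated:

```javascript
// `const` forbids re-assignment of the binding...
const temp = 1;
// temp = 3;          // TypeError: Assignment to constant variable.

// ...but it does not freeze the object the binding points to.
const obj = { a: 1 };
obj.a = 2;            // works: internal fields are still mutable
console.log(obj.a);   // 2
```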

An assigned object is a reference to the original

This is another common operation. In the script, a and b are two non-object variables, and re-assigning b makes a !== b. However, when we assign the object c to d and mutate an internal field, the two remain equal. That shows d is a reference to c.

So we can conclude two observations from the above:

  1. const does not make an object’s fields constant. It certainly cannot prevent a developer from mutating them.
  2. Assigning an object to another variable passes on its reference.

With these two observations acknowledged, we can move on to the following code:

As you read the code, the author’s intention is clear: onChange is where this.state gets changed according to incoming parameters. The author takes a copy of the original data first, modifies its values, pushes them into nextData, and finally calls this.setState to update.

If this pattern appears in your projects and it works, that is expected: per React’s component lifecycle, the state is changed to nextData, which eventually affects render()’s return value. However, there is a series of flaws in this code that I have to point out. If you fully understand and agree with the two observations above, you will find the following points uncomfortable:

  1. Line 5 assigns a reference, which means changing the new variable directly COULD mutate the original state.
  2. prevData is assigned as a reference in line 10. However, as you read through the code, you will realize this is not the real intention of the author. He wants to “separate” prevData from the state by using const. That is a total misunderstanding of const.
  3. In line 13, each item in prevData is mutated by assigning its field a another value. But as we noted before, prevData is a reference into the state. That means the author has changed the content of the state without using setState, and modified it from a child component!
  4. In lines 18–20, the author finally calls setState. However, since the state was already mutated in line 13, this comes too late. (Perhaps the only good news is that the new array is no longer a reference to the old one.)

Well, someone may claim: so what? My page works fine! Those people probably do not understand what lifecycle hooks are for. People usually write business logic in lifecycle hooks, such as deriving state from props, or firing a fetch call when some prop changes. In that case, we may write something like the following:

componentDidUpdate(prevProps) {
  // someValue is a placeholder name; the original used a concrete prop
  if (prevProps.someValue !== this.props.someValue) {
    // business logic
  }
}
Every time a component finishes an update, componentDidUpdate is called. This happens whenever setState is called or props change.

Unfortunately, if a novice developer has unintentionally mutated this.state or this.props, such comparisons in lifecycle hooks will never fire, which will certainly cause unexpected behavior.

How to make every update under control?

If you are lazy and like the natural style of using a temporary variable separated from the original, as displayed above, you are welcome to use immer. Every time you update state, it hands you a draft that you can modify however you want before returning. An example is given in its documentation.

However, you should know that the most proper way to update a state field without modifying it through a reference is to perform a clone. The clone sometimes needs to be deep, so that every field is a thorough copy of the original rather than a reference. One can achieve that with cloneDeep from lodash, but I do not recommend it: it can be too costly, and only in rare cases will you need a deep clone.

Rather, I recommend Object.assign(target, …sources). This function updates target with the fields of the sources and returns target; the sources themselves are left untouched. Updating an object then looks like:

const newObj = Object.assign({}, this.state.obj, {a: 1})

Actual programming can be even easier: spread syntax lets you expand an object or array, so you can create a new object or array by writing:

const newObj = {...this.state.obj}
const newArray = [...this.state.array]

That copies the original contents of an object or array into a new one. The copy is shallow: only the outermost object/array is new, and if there is an object inside a field of the original object, or an object at some index, that inner object is copied as a reference. This avoids the costly operations of a deep clone. I like this, because it gives you precise control over what you need to change.
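The shallowness can be seen directly:

```javascript
// Spread produces a shallow copy: the outer container is new,
// but nested objects are still shared references.
const state = { list: [{ a: 1 }], n: 0 };
const copy = { ...state };

console.log(copy !== state);            // true: a brand-new outer object
console.log(copy.list === state.list);  // true: the inner array is shared
copy.n = 5;                             // safe: only touches the new object
console.log(state.n);                   // 0: the original is untouched
```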

So the proper version of the component I gave above looks like this:

Some further advice

I would suggest that all components you write from now on be React.PureComponent. It is no different from React.Component except that it ships a default shouldComponentUpdate: it performs a shallow comparison of state and props to decide whether the component should update, whereas React.Component always returns true unless you provide custom logic. This not only improves page performance but also helps you notice, through missing re-renders, the mutation mistakes I described above.

If you need similar functionality for function components, you can try React.memo(), which is available since React 16.6.

RegExp.test() returns true but RegExp.exec() returns null?

Consider the following JavaScript snippet:

const regex = /<(\w+)>/g
const testString = "<location>"
// Check whether testString contains pattern "<slot>"
// If yes, extract the slot name

if (regex.test(testString)) {
  console.log(regex.exec(testString)[1])
}
It seems like perfectly sound logic for extracting the tag name from the tag string: test guards against the case where testString does not match. However, it throws an error:

TypeError: regex.exec(...) is null

This violates our intuition: the string does match the regex, yet it could not be matched. So why does this happen?

As MDN states, invoking test on a regex with the “global” flag set updates lastIndex on that regex. lastIndex lets the user keep testing the original string, since there may be more than one match, and it affects the exec call that comes along. So the proper usage of a regex with the “g” flag is to keep executing it until it returns null, which means no further match can be found in the string.
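The stateful behavior can be seen directly:

```javascript
const regex = /<(\w+)>/g
const s = "<location>"

console.log(regex.test(s))     // true, and lastIndex advances past the match
console.log(regex.lastIndex)   // 10: the length of "<location>"
console.log(regex.exec(s))     // null: the search resumes at index 10

regex.lastIndex = 0            // reset before reusing the regex
console.log(regex.exec(s)[1])  // "location"
```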

This behavior is sometimes undesirable: in the scenario above, I just want to know whether the string matches the predefined pattern. And the reason I don’t create the regex on the fly is to save the overhead of constructing an object.

One obvious solution is to remove the “g” flag. But sometimes we do want to keep the “g” flag, find whether a string matches the given pattern, and avoid modifying the regex’s internal state. In that case, one can switch to , which always returns the index of the first match, or -1 if there is none.

Write a reusable modern React component module

This article is mainly about developing a reusable React Component in a clean, principled way. In fact, it is not limited to React Components: the same applies to any module you want to make installable via npm install!

Sounds simple? The conclusion can indeed be stated in one sentence, as I found out after half a month of research (we did take some detours). For the busy readers: scroll straight to the bottom of this article and just use the library I recommend there. I can assure you that the library and the template behind it are best practice.

Someone may ask: if you have already given the answer, why read the rest of this article? Because there are many ways to write a JS module, and making one reusable takes extra work. In the simplest form, I write a mymodule.js ending with export default MyModule, and in another file that consumes it I write import MyModule from ‘./mymodule’. Done! Now I can claim to the world that this is a reusable component. Really?


  1. This component depends on React and React-dom. How do consumers learn that they must provide these two packages as well?
  2. If a user’s application uses React 16.2 while my component was written against React 16.8, will there be compatibility problems? Will two versions of React end up installed just to work around them?
  3. The consumer currently imports my package by file path. How do I get my package into node_modules, i.e. installable via npm install?
  4. My package uses some modern language features. Imported by file path, it cannot be transpiled down to older JavaScript by babel. The consumer may not even be able to import MyModule from './mymodule' at all!

Yet when I use packages from npm, none of these problems seem to exist. So writing a truly reusable module requires some extra work.


If we google phrases like “write a reusable react component module”, we find some self-proclaimed templates, for example rinse-react, a project meant to be used as a boilerplate. Let’s look at what it does to make the rinse-react module distributable.

If you look inside node_modules of any of your projects, you will find that package.json is crucial: its main field specifies a module’s entry point. For rinse-react, the entry point is dist/rinse.js, which you can guess is the bundle emitted by webpack.

So we open webpack.config.js. The input and the various loaders look much like an ordinary SPA configuration. The important part is output.libraryTarget: it determines in what form the value returned by src/index.js is exposed as a module, and the value to specify depends on how the module will be consumed. Besides import MyModule from 'mymodule', the module may be loaded CommonJS-style as var MyModule = require('mymodule'), or AMD-style as define('MyModule', [], function() { ... }). To be compatible with all of these, rinse-react sets output.libraryTarget = 'umd', so the webpack bundle can be consumed in any of those forms.


Some of you may cheer: it looks like we have found a ready-to-use template. But do you see the problems?

  1. It appears to bundle the current versions of react, react-dom and styled-components into the artifact, inflating the library size;
  2. The module is transpiled by babel before the downstream application consumes it, so a downstream ES6 import gets no real tree-shaking. (Tree-shaking: by analyzing import and export, ES6 modules let the bundler determine which code is actually used, so unused code can be dropped before execution.)
  3. There is no need to bundle the module first only to have it bundled a second time by the downstream application.

The first problem is solved with peerDependencies in package.json. Whatever is listed there is expected to be a dependency of the downstream application as well. If you put a dependency in dependencies instead of peerDependencies (as rinse-react does), it becomes a private dependency of the module.
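A sketch of the relevant part of package.json (the package name and version ranges are illustrative):

```json
{
  "name": "mymodule",
  "peerDependencies": {
    "react": ">=16.2.0",
    "react-dom": ">=16.2.0"
  },
  "devDependencies": {
    "react": "^16.8.0",
    "react-dom": "^16.8.0"
  }
}
```

React stays a peer dependency so the downstream application provides its own single copy; the devDependencies copy exists only for developing the module itself.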

Naturally, can the version in the module’s peerDependencies mismatch the version used by the downstream application? Of course it can. If the versions are incompatible, npm install prints a warning; in my actual development experience, the highest specified version gets installed.

But we still have not achieved real tree-shaking. Fortunately, best practice tells us that rollup.js was made for exactly this problem. rollup is also a bundler, but its goal differs from webpack’s: rollup was designed to flatten a module’s dependencies as much as possible and to exploit tree-shaking efficiently. From this standpoint, reusable modules should be built with rollup.

When bundling with rollup, package.json provides two entry points: the traditional main points to the UMD-compatible bundle, while the more modern module field points to a build that keeps ES6 module syntax. When an application tries to import the module, it first checks whether package.json has a module field; if so, the file it points to is used as the entry, avoiding babel transpilation and taking full advantage of tree-shaking. If the application’s build does not support module, it falls back to the UMD bundle referenced by main.
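In package.json, the two entry points sit side by side (the file names here are illustrative, not rollup’s fixed convention):

```json
{
  "main": "dist/mymodule.umd.js",
  "module": "dist/mymodule.es.js"
}
```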

A rollup-based library template

By sheer accident, while trying to figure out how Ant Design develops such an elegant component library, I found a generator that scaffolds a library template bundled with rollup. It resolves essentially all of my confusion above. I sincerely invite you to try create-react-library, and I would discourage those planning to study Ant Design’s tooling: their in-house rc-init has no documentation and nobody maintains it.

Admittedly, when I used this generator, it could only produce a babel 6 based configuration. To match my actual development environment, I manually migrated it to babel 7.


So the question of how to develop a JS library seems settled for now. But can such a library really rival Ant Design? Real-world downstream applications need on-demand loading; to let users load Ant Design on demand, babel-plugin-import was created. How to develop a library that gracefully supports on-demand loading is a question that still needs research.


  1. Rinse-react
  2. Webpack: output.libraryTarget
  3. Writing Reusable Components in ES6
  4. CommonJS vs AMD vs RequireJS vs ES6 Modules
  5. 你的 Tree-Shaking 并没什么卵用 (Your tree-shaking is not as useful as you think)
  6. Webpack and Rollup: the same but different

Release of rankit v0.3 and roadmaps for future bangumi ranking

After several years (really!) of development, I’m pleased to announce that rankit has been upgraded to v0.3. The previous major release, v0.1, was made in 2016. So what has changed during these three years?

When rankit v0.1 was developed, it aimed to implement all the fascinating algorithms mentioned in a beautiful book on ranking and to provide an alternative ranking solution for Bangumi. The programming interface designed back then was far from practical. The more ranking algorithms I looked into, the more I felt rankit should include popular ranking algorithms based on time series. One of them is Elo, which got a simple implementation in rankit v0.2. But its usage scenario was limited: updating a ranking was tedious, and its interface was incompatible with the existing ranking algorithms.

In v0.3, the following updates address the problems mentioned above:

  1. Rankers are split into “unsupervised rankers” and “time series rankers”. Each has its own interface, and both consume the Table object to keep records.
  2. Introduced the Elo ranker, Glicko 2 ranker and TrueSkill ranker (only paired competition records are supported).
  3. Updated the interface to make it more user-friendly. An unsupervised ranker only provides rank, since it needs no updates. For a time series ranker, the user calls update to accumulate new record data, and can retrieve the latest leaderboard by invoking leaderboard after each update.

The documentation of rankit has also been updated.

One side product of rankit is the “Scientific animation ranking for Bangumi”. For a long time the ranking was not updated, and it was gradually forgotten. I gave it a refresh last year, and it now has a completely new look with faster response and simple search functionality. More importantly, the ranking will be updated monthly. I invite you all to try it here:

The latest scientific animation ranking also involves minor algorithm changes. It is often observed that, at certain times, users with peculiar intentions pour into Bangumi to rate one specific animation with uniformly high or low scores, distorting the proper order of the ratings. The previous version of the scientific ranking could ignore users who rate one anime and leave permanently, but it could not handle users who rate multiple animations. I adjusted users’ overall rated scores and applied several normalizations according to the distribution of each user’s ratings, and all versions of the normalized scores are fed into rankit to calculate the latest rank. The final rank is still merged using Borda count.
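Borda-count merging of several rank lists can be sketched like this (a minimal version of my own, not rankit’s actual implementation):

```python
from collections import defaultdict

def borda_merge(rankings):
    """Merge several rank lists (best item first) by Borda count.

    In each list of length n, the first item earns n-1 points, the next
    n-2, and so on; items are then sorted by total points.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] += n - 1 - pos
    return sorted(scores, key=lambda item: -scores[item])

# Three rankings from differently normalized scores, mostly agreeing:
print(borda_merge([["A", "B", "C"], ["A", "B", "C"], ["B", "A", "C"]]))
```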

But could this be solved from another perspective? One thing I have been thinking about is how to bring time series ranking into the current ranking scheme. Ideally, time series ranking should combat ranking manipulation in a way that pairwise ranking cannot. While reading about TrueSkill, I was struck by its brilliant idea of inferring a ranking with a linear chain graph: TrueSkill is a generative graphical model organized in the same order as the competition scores. Another issue to resolve is helping users adjust historical ratings automatically: a user may have rated animations over a wider range before, and his or her rating criteria may change over time. How do we propagate recent rating criteria back to historical ratings? All of this should be supported in the next version of rankit: ranking for multiple players in a match, and the power to propagate recent competition results to historical data.

Mangaki data challenge 1st place solution

Mangaki data challenge is an otaku-flavored data science competition. Its goal is to predict a user’s preference for an unwatched/unread anime/manga between two choices: wish to watch/read, and don’t want to watch/read. The competition provides training data from the Mangaki website, which lets users favorite anime/manga works. Three major training tables are provided, as follows:

  1. Wish table (about 10k rows):

     User_id  Work_id  Wish
     0        233      1

  2. Record table, for already watched/read anime/manga. There are four ratings: love, like, neutral and dislike.

     User_id  Work_id  Rate
     0        22       like
     2        33       dislike

  3. Work table: detailed information on the available anime/manga. There are three categories: anime, manga and album. Only one album appears in this table; all the others are anime (about 7k) and manga (about 2k).

     Work_id  Title       Category
     0        Some_anime  anime
     1        Some_manga  manga

For the test data, one should predict, for 100k user/work pairs, whether the user wishes to watch/read the anime/manga. As you can see, the test data is much larger than the training data. Moreover, during my analysis of the dataset I found it is not even guaranteed that all users or works appearing in the test set are contained in the training set.

Traditional recommendation system methods (that I know)

Recommendation system building has long been studied, and there are various methods for this particular problem. I, too, tried to build a recommender several years ago (you can read the technical details here). The simplest solution is SVD (actually, an even simpler and more intuitive solution is KNN); from there one can move on to RBM, FM, FFM and so on. One assumption holds firm in all these methods: users have an embedding vector capturing their preferences, and works have an embedding vector capturing their characteristics. But is it reasonable that we should be confined to this embedding-dot-product model?
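The embedding-dot-product view can be made concrete with a tiny SVD sketch (toy ratings of my own, not the competition’s data):

```python
import numpy as np

# Toy user-item rating matrix (rows: users, cols: works); two taste groups.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

# Rank-2 truncated SVD: each user and each item gets a 2-d embedding.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_emb = U[:, :2] * s[:2]   # row i: user i's embedding
item_emb = Vt[:2].T           # row j: item j's embedding

# A predicted score is just an embedding dot product.
pred = user_emb @ item_emb.T
print(pred.shape)             # (4, 4)
```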

Recently, the common practice in Kaggle competitions has been to use GBDT for (almost all non-computer-vision) problems. As long as a model handles classification, regression and ranking well, it can be applied to any supervised machine learning problem! And with model ensembling under a stacknet framework, one can join the different characteristics of the models to achieve the best result.

In this competition, my solution is plain and straightforward: feature engineering to generate some embeddings, then GBDT/Random Forest/Factorization Machine models built from different combinations of features. Finally, I used a two-level stack net to ensemble them, the second level being a logistic regression model.

Feature Engineering

From wish table:

  • Distribution of user’s preference on anime/manga (2d+2d)
  • Distribution of item’s preference (2d)
  • Word2vec embedding of user on wish-to-watch items (20d)
  • Word2vec embedding of user on not-wish-to-watch items (10d)
  • Word2vec embedding of item on wish-to-watch users (20d)
  • Word2vec embedding of item on not-wish-to-watch users (10d)
  • LSI embedding of user (20d)
  • LSI embedding of item (20d)

From record table:

  • Distribution of user’s preference on anime/manga (4d+4d)
  • Distribution of item’s preference (4d)
  • Mean/StdErr of user’s rating (2d)
  • Mean/StdErr of item’s rating (2d)
  • Word2vec embedding of user on loved and liked items (32d)
  • Word2vec embedding of user on disliked items (10d)
  • Word2vec embedding of item on loved and liked users (32d)
  • Word2vec embedding of item on disliked users (10d)
  • LSI embedding of user (20d)
  • LSI embedding of item (20d)
  • LDA topic distribution of user on love, like and neutral items (20d)
  • LDA topic distribution of item on love, like and neutral ratings (20d)
  • Item category (1d, categorical feature)
  • User Id (1d, only used in FM)
  • Item Id (1d, only used in FM)

Model ensembling

The first layer of the stack net is a set of models that predict well but have different inductive biases. I tried just three: GBDT, RF (both backed by LightGBM) and FM (backed by fastFM). I trained models from the record-table features and the training-table features separately, and one can further train different models on different combinations of features. For example, one can use all features (except user id and item id) from the record table. But since GBDT will fixate on the most informative features if given all of them, it helps to split the features into several groups and train separate models. In this competition I did not split much (simply because I didn’t have the time): I just removed the first four features (because the prediction results showed they were having a major effect on precision) and trained some additional models.

Model stacking

The stack net requires feeding all first-layer prediction results as features to the second layer. The stacking technique requires K-fold cross-validation at the beginning: each fold’s prediction is produced by a model trained on all the other folds, and these out-of-fold predictions become the second level’s training data. Here is the most intuitive (in my view) description of the model stacking technique:
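The out-of-fold scheme can be sketched with scikit-learn on toy data (my own illustration, not the exact competition code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Level 1: collect out-of-fold predictions from each base model.
base_models = [RandomForestClassifier(n_estimators=50, random_state=0),
               GradientBoostingClassifier(random_state=0)]
oof = np.zeros((len(X), len(base_models)))
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for j, model in enumerate(base_models):
    for train_idx, valid_idx in kf.split(X):
        # predict each fold with a model trained on the other folds
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx, j] = model.predict_proba(X[valid_idx])[:, 1]

# Level 2: a logistic regression stacks the base predictions.
stacker = LogisticRegression().fit(oof, y)
print(stacker.score(oof, y))
```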

In this competition, a single GBDT over all record-table features reaches 0.85567 on the LB. By leveraging the model stacking technique, one can reach 0.86155, which is my final score.

Is this the ultimate ceiling?

Definitely not. One can push the boundary much further:

  1. I did not tune the embedding generation parameters well. In fact, I generated those features with the default parameters gensim provides. The embedding dimensions were chosen arbitrarily, with no science involved. Enlarging word2vec’s sliding window or using more embedding dimensions might achieve better results.
  2. I only used LightGBM to build GBDT; one can also use xgboost. Although both provide GBDT, LightGBM grows trees leaf-wise while xgboost grows them depth-wise. Even though both are CART-based GBDT, they behave differently.
  3. I did not introduce any features generated by deep models. GBDT is the kind of model that relies on heavy feature engineering, while a deep model learns features automatically. Combining them in a stacking model would surely yield a much higher AUC.
  4. I did not use more complex features. Sometimes popularity ranking also affects user behavior: a user may mark highly ranked animes as “wish to watch”. I did not try this idea out.


I must say this competition is very interesting, because I have seen no other competition targeting anime/manga prediction. Another good point is that the training data is very small, so I could cross-validate efficiently on my single workstation. And before this competition I had never tried a stack net. It granted me some experience in doing model stacking in an engineering-friendly way.

One regret is that too few competitors were involved. Though I tried to call for participants on Bangumi, it seems not many people joined. The organizers should make their website more popular before holding the next data challenge!

One more thing: you may be interested in the code. All my code is here, though it is not arranged in an organized way. The most important files are “FeatureExtraction.ipynb” and “”, which show how to do the feature engineering and how to partition the features. “CV.ipynb” gives some intuition on how to train the models.

Console as a SQL interface for quick text file processing

Recently at work I kept running into this situation: to debug a program, I dump a pile of intermediate log files and search them for anomalies. These are TSV files, i.e. TAB-separated column files. The simplest approach is to open them in a text editor and look for anomalies by eye, aided by the editor’s search. Sublime Text suits this best, being the only editor that can open large files. But every open takes a long time, and when I search for a string, the editor stalls after every character I type.

So I wanted something that can quickly search and compute over text files that live in no structured store and have no explicit schema. Borrowing SQL terminology, I want the basic features of SELECT, WHERE, ORDER, LIMIT and JOIN without importing the data anywhere. Luckily, I recently rediscovered some lecture notes I had printed long ago but never read, and a handful of basic Unix commands covers essentially everything I want for text-file processing. This article organizes those commands by the SQL features they correspond to.
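As a taste of the mapping (a toy TSV of my own, not the Bangumi data below):

```shell
# Toy TSV standing in for a dump file: columns id, name, score.
printf 'id\tname\tscore\n1\ta\t5\n2\tb\t3\n3\tc\t9\n' > /tmp/toy.tsv

# SELECT name WHERE score > 4   ->  awk
awk -F '\t' 'NR > 1 && $3 > 4 { print $2 }' /tmp/toy.tsv

# ORDER BY score DESC LIMIT 1   ->  sort + head
tail -n +2 /tmp/toy.tsv | sort -k3,3nr | head -n 1
```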

The data processed in this article was crawled in February with Bangumi_Spider: Bangumi users, subjects, and collection records:

head user-2017-02-17T12_26_12-2017-02-19T06_06_44.tsv -n 5
head record-2017-02-20T14_03_27-2017-02-24T10_57_16.tsv -n 5
head subject-2017-02-26T00_28_51-2017-02-27T02_15_34.tsv -n 5
uid    name    nickname    joindate    activedate
7    7    lorien.    2008-07-14    2010-06-05
2    2    陈永仁    2008-07-14    2017-02-17
8    8    堂堂    2008-07-14    2008-07-14
9    9    lxl711    2008-07-14    2008-07-14
name    iid    typ    state    adddate    rate    tags
2    189708    real    dropped    2016-10-06        
2    76371    real    dropped    2015-11-07        
2    119224    real    dropped    2015-03-04        
2    100734    real    dropped    2014-10-09        
subjectid    authenticid    subjectname    subjecttype    rank    date    votenum    favnum    tags
1    1    第一次的親密接觸    book    1069    1999-11-01    57    [7, 84, 0, 3, 2]    小説:1;NN:1;1999:1;国:1;台湾:4;网络:2;三次元:5;轻舞飞扬:9;国产:2;爱情:9;经典:5;少女系:1;蔡智恒:8;小说:5;痞子蔡:20;书籍:1
2    2    坟场    music    272        421    [108, 538, 50, 18, 20]    陈老师:1;银魂:1;冷泉夜月:1;中配:1;银魂中配:1;治愈系:1;银他妈:1;神还原:1;恶搞:1;陈绮贞:9
4    4    合金弹头7    game    2396    2008-07-17    120    [14, 164, 6, 3, 2]    STG:1;结束:1;暴力:1;动作:1;SNK:10;汉化:1;2008:1;六星:1;合金弹头:26;ACT:10;NDS:38;Metal_Slug_7:6;诚意不足:2;移植:2
6    6    军团要塞2    game    895    2007-10-10    107    [15, 108, 23, 9, 7]    抓好社会主义精神文明建设:3;团队要塞:3;帽子:5;出门杀:1;半条命2:5;Valve:31;PC:13;军团要塞:7;军团要塞2:24;FPS:26;经典:6;tf:1;枪枪枪:4;2007:2;STEAM:25;TF2:15


  1. Not a point-in-time snapshot. By “real-time” I do not mean that today is April 16 while the data is from February, but that I cannot guarantee the data is a snapshot at any single instant. Crawling the user data takes two days, and during those two days users may have changed their nicknames or usernames without that being reflected in the crawl. Worse, for the collection data, users may add or change collections while the crawl is running, so crawled records can end up duplicated or missing. And since user data and collection data are crawled separately, I cannot guarantee a one-to-one join of the two tables on username.
  2. Not in order, as can be seen from the preview above.
  3. Crawler defects. Since I did not handle Bangumi's 500 errors, some data is simply missing.

In what follows, we will use Unix commands to perform SQL-like operations on the data, while pointing out the quirks of the data produced by Bangumi_Spider and the issues that downstream processing needs to watch for.


Selecting the 2017 winter anime season

Now that we have the subject data, which records tag information, we can extract the 2017 winter season shows from the tags. The tag in question is “2017年1月” (January 2017). A first attempt with grep:

grep "2017年1月" subject-2017-02-26T00_28_51-2017-02-27T02_15_34.tsv | grep "anime" | wc -l

This extraction is badly flawed, however: we never specified which column “anime” or “2017年1月” should be matched in! What if a music subject happens to have “anime” in its title and also carries the January 2017 tag? That is clearly not what we want. When you need to address specific columns, the best tool is awk:

awk -F "\t" '$9 ~ /[;\t]2017年1月:/ && $4=="anime"' subject-2017-02-26T00_28_51-2017-02-27T02_15_34.tsv > anime_selection.tsv
wc -l anime_selection.tsv
head anime_selection.tsv -n 5
85 anime_selection.tsv
122772    122772    六心公主    anime        2016-12-30    26    [19, 41, 1, 1, 4]    17冬:1;原创:1;PONCOTAN:4;2016年:2;广桥凉:1;TVSP:1;池赖宏:1;原优子:1;mebae:1;TV:4;日本动画:1;片山慎三:1;Studio:1;STUDIOPONCOTAN:4;2016:5;TVA:1;短片:2;上田繁:1;搞笑:4;中川大地:2;岛津裕之:2;种崎敦美:1;2017年1月:1;テレビアニメ:1;オリジナル:1;SP:1;6HP:2;村上隆:10;未确定:1
125900    125900    锁链战记~赫克瑟塔斯之光~    anime    3065    2017-01-07    88    [66, 24, 216, 20, 60]    山下大辉:3;17冬:1;原创:1;游戏改:47;CC:1;花泽香菜:7;TV:22;未确定:2;グラフィニカ:2;佐仓绫音:4;2017年1月:61;锁链战记:1;2017:10;锁链战记~Haecceitas的闪光~:15;热血:2;チェインクロ:1;石田彰:22;声优:2;2017年:4;Telecom_Animation_Film:1;十文字:1;柳田淳一:1;战斗:2;内田真礼:2;剧场版:1;奇幻:17;2017·01·07:1;工藤昌史:3;2015年10月:1;TelecomAnimationFilm:9
126185    126185    POPIN Q    anime        2016-12-23    10    [134, 11, 3, 3, 0]    荒井修子:1;黒星紅白:4;原创:3;黑星红白:1;2016年:5;_Q:1;日本动画:1;2016年12月:2;未确定:1;小泽亚李:1;2017:2;2016:5;动画电影:1;2017年:5;Q:3;东映动画:1;种崎敦美:1;2017年1月:1;宫原直树:1;POPIN:6;東映アニメーション:12;剧场版:24;东映:4;萌系画风:1;濑户麻沙美:5
129805    129805    混沌子    anime    2910    2017-01-11    197    [264, 24, 764, 60, 67]    上坂すみれ:3;ブリドカットセーラ恵美:2;季番:2;黑暗推理向慎入:2;2016年:3;CHAOS:5;游戏改:67;SILVER_LINK.:2;悬疑:5;游戏:9;TV:73;未确定:9;伪:4;GAL改:106;2017:28;科幻:2;志倉千代丸:4;SILVERLINK.:34;SLIVERLINK.:70;2017年:10;志仓千代丸:4;2017年1月:170;5pb.:112;混沌子:3;反乌托邦:16;剧透禁止:27;极权主义世界:8;松冈祯丞:3;神保昌登:6;CHILD:5
131901    131901    神怒之日    anime        2017-10-01    0    [79, 1, 0, 3, 1]    GENCO:3;2017年10月:2;TV:4;未确定:2;2017年:2;GAL改:4;游戏改:4;LIGHT:2;2017:3;エロゲ改:3;2017年1月:1

As you can see, precision improved. The idea of awk is to process a text file line by line, programming against each line inside the single quotes. We use the built-in variable $9 to refer to the ninth column (note that columns are numbered from 1), which holds the tags. We match $9 against a regular expression, which must be wrapped in slashes, and additionally require the fourth column to equal "anime". That selects exactly the subjects we want. Incidentally, $0 refers to the whole line.



awk -F "\t" '{match($8, /\[([0-9]+), ([0-9]+), ([0-9]+), ([0-9]+), ([0-9]+)\]/, m); printf("%d\t%s\t%s\t%d\t%d\t%d\t%d\t%d\n", $1, $3, $4, m[1], m[2], m[3], m[4], m[5])}' < anime_selection.tsv | sort -t$'\t' -k6,6 -nr | sed 5q
179949    小林家的龙女仆    anime    157    84    2065    19    46
185792    小魔女学园    anime    245    51    1471    30    29
174043    为美好的世界献上祝福! 第二季    anime    297    63    1431    21    28
174143    人渣的本愿    anime    271    51    1381    60    133
188091    珈百璃的堕落    anime    126    46    1366    22    73

Taking the pipeline apart: the first stage is again awk, but instead of filtering it builds a new table. The number of watchers is extracted with the match function. The captured groups are stored in the variable m, which is match's third argument; the first argument is the string to match against and the second is the regular expression. Note that the parts to capture must be wrapped in parentheses.

awk has many functions; for details see the awk manual (which I suspect nobody reads) or here. Many of them are C standard library functions.

For sorting we call the sort command. First we must specify the input field separator (note this is slightly hacky: it has to be -t$'\t'). Since we want to sort by the number of current watchers in descending order, we sort on column 6 with -k6,6, and add -nr to treat that column as numeric and reverse the order.

The result shows the five most-watched shows at some point in February. Very reasonable.


Since we have crawled the tags, can we explode the tag list into rows? This calls for awk's split function and its array type:

awk -F "\t" '{split($9, tags, ";");for(i in tags){ split(tags[i], itm, ":"); printf("%d\t%s\t%s\t%d\n", $1, $3, itm[1], itm[2]);};}' < anime_selection.tsv | sort -t$'\t' -k1,1n -k4,4nr | head -n 20
122772    六心公主    村上隆    10
122772    六心公主    2016    5
122772    六心公主    PONCOTAN    4
122772    六心公主    STUDIOPONCOTAN    4
122772    六心公主    TV    4
122772    六心公主    搞笑    4
122772    六心公主    2016年    2
122772    六心公主    6HP    2
122772    六心公主    中川大地    2
122772    六心公主    岛津裕之    2
122772    六心公主    短片    2
122772    六心公主    テレビアニメ    1
122772    六心公主    オリジナル    1
122772    六心公主    17冬    1
122772    六心公主    2017年1月    1
122772    六心公主    mebae    1
122772    六心公主    SP    1
122772    六心公主    Studio    1
122772    六心公主    TVA    1
122772    六心公主    TVSP    1
sort: write failed: standard output: Broken pipe
sort: write error

The tag list is semicolon-separated, so we first call split($9, tags, ";") to store the pieces in the array tags. Then, in for (i in tags), i is actually the index; for each tag we call split again on ":" to get the tag itself and the number of users who applied it. As you can see, awk's per-line logic can be written in a C-like style, and it may span multiple lines, although I crammed it all into one.

After that, we sort first by subject id, and then within each subject by the number of users per tag. Here sort takes two -k options to specify the sort keys, and we attach the n/r flags to each key separately, so different columns can be sorted under different rules.
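To see the two-key behavior in isolation, here is a toy run on made-up data:

```shell
# group id, count -- deliberately unsorted
printf '2\t5\n1\t3\n2\t9\n1\t7\n' > pairs.tsv

# Primary key: column 1, ascending numeric; secondary key: column 2, descending numeric
sort -t$'\t' -k1,1n -k2,2nr pairs.tsv
# group 1 comes first, and within each group the larger count leads
```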

2. Aggregation


What we care about more are statistics over the collection data, such as mean ratings and the number of active users. The most basic statistic is a count. This could be done with awk, but counting is a special case with a simpler idiom.


cut -f1,2 < anime_taglist.tsv| uniq -c | sed 10q
29 122772    六心公主
30 125900    锁链战记~赫克瑟塔斯之光~
25 126185    POPIN Q
30 129805    混沌子
11 131901    神怒之日
30 143205    南镰仓高校女子自行车社
29 146732    碧蓝幻想
30 148037    伤物语III 冷血篇
30 148099    刀剑神域:序列之争
30 148181    飙速宅男 新世代

Here cut extracts the first and second columns of the tag table we built earlier and feeds them to uniq. uniq de-duplicates adjacent lines, and with the -c option it prefixes each line with its number of occurrences. In the output, the first column is therefore the number of tags per anime.

It must be stressed that uniq only works correctly on sorted input.

We can see that a subject page holds at most 30 tags.
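A tiny demonstration of why the sorting matters, on toy data:

```shell
printf 'a\nb\na\na\n' > items.txt

# uniq alone only merges *adjacent* duplicates:
uniq -c items.txt          # three groups: a(1), b(1), a(2)

# sorting first gives true per-value counts:
sort items.txt | uniq -c   # two groups: a(3), b(1)
```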


Sometimes we want the maximum or minimum over certain columns. The following shows how to find the most-applied tag for each anime. Naturally, we do this with awk as well.

awk -F "\t" '$1!=prev {print $0; prev=$1}' <anime_taglist.tsv | sed 10q
122772    六心公主    村上隆    10
125900    锁链战记~赫克瑟塔斯之光~    2017年1月    61
126185    POPIN Q    剧场版    24
129805    混沌子    2017年1月    170
131901    神怒之日    GAL改    4
143205    南镰仓高校女子自行车社    2017年1月    34
146732    碧蓝幻想    A-1Pictures    38
148037    伤物语III 冷血篇    剧场版    80
148099    刀剑神域:序列之争    剧场版    87
148181    飙速宅男 新世代    2017年1月    36

Note that, in the same spirit as uniq, this trick also requires the input to be sorted already.
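The same trick on a toy input (made-up groups, already sorted with the largest value first in each group):

```shell
printf 'g1\t9\ng1\t4\ng2\t7\ng2\t2\n' > grouped.tsv

# $1 != prev fires only on the first line of each group,
# which is the per-group maximum thanks to the pre-sorting
awk -F '\t' '$1 != prev { print $0; prev = $1 }' grouped.tsv
# keeps only the rows (g1, 9) and (g2, 7)
```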


With awk at hand, we can do more complex computations. Take the mean as an example: from the crawled collection records, we keep the anime entries that have a rating and compute each anime's mean rating. I still prefer to compute on sorted input, so first we generate it:

sed 1d record-2017-02-20T14_03_27-2017-02-24T10_57_16.tsv | awk -F "\t" '$6 && $3=="anime"' | sort -k2,2n | cut -f7 --complement > anime_record.tsv
head anime_record.tsv
103122    2    anime    do    2012-10-18    9
103909    2    anime    collect    2012-11-06    8
104394    2    anime    collect    2012-12-28    9
110320    2    anime    collect    2012-12-11    10
112414    2    anime    collect    2012-12-26    7
118090    2    anime    collect    2013-01-31    7
125406    2    anime    collect    2013-03-04    8
165363    2    anime    collect    2013-10-10    6
183190    2    anime    collect    2014-02-01    8
207963    2    anime    collect    2014-07-21    8


awk -F "\t" '{cnt[$2]+=1; cum[$2]+=$6}; END {for(i in cnt){printf("%d\t%f\t%d\n", i, cum[i]/cnt[i], cnt[i]);}}' < anime_record.tsv | sort -t$'\t' -k2,2nr -k3,3nr | head
202419    10.000000    4
186960    10.000000    3
143694    10.000000    2
158396    10.000000    2
193619    10.000000    2
11121    10.000000    1
127124    10.000000    1
127125    10.000000    1
127597    10.000000    1
137405    10.000000    1
sort: write failed: standard output: Broken pipe
sort: write error

Note the use of awk's associative arrays here, which you can think of as a Python dict. We use two of them: cnt and cum hold, for each anime id, the number of ratings and their sum. At the end, the block wrapped in END produces the final result; statements inside END run only after the whole file has been traversed.
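The same aggregation pattern on a toy file (hypothetical ids and scores):

```shell
printf 'x\t8\nx\t6\ny\t10\n' > ratings.tsv

# cnt/cum are associative arrays keyed by column 1;
# the END block runs once, after the whole file is read
awk -F '\t' '{ cnt[$1] += 1; cum[$1] += $2 }
  END { for (i in cnt) printf("%s\t%.2f\t%d\n", i, cum[i] / cnt[i], cnt[i]) }' ratings.tsv
# x averages 7.00 over 2 ratings, y averages 10.00 over 1
```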

3. Union, intersection and except


As mentioned earlier, user data may be lost when the crawler hits a 500 error, while the collection records may still retain those users. Moreover, since the user data was crawled before the collection data, some users changed their usernames in between, and new users registered on Bangumi. Here we try to see how many such users are missing from the user data.

First, generate the list of user ids and usernames from the user data:

sed 1d user-2017-02-17T12_26_12-2017-02-19T06_06_44.tsv| cut -f1,2 | sort -t$'\t' -k1,1n > user_list.tsv
tail user_list.tsv
321167    321167
321168    321168
321169    321169
321170    321170
321171    321171
321172    321172
321173    321173
321174    321174
321175    321175
321176    321176


sed 1d record-2017-02-20T14_03_27-2017-02-24T10_57_16.tsv | cut -f1 | sort -t$'\t' -k1,1 | uniq > record_user.tsv
head record_user.tsv

This is really just the difference between the set of users in record_user and those in user_list, which is exactly what the comm command is for:

cut -f2 user_list.tsv| sort | comm -13 - record_user.tsv > user_failure.tsv
head user_failure.tsv

comm operates line by line on each file, with the precondition that both files are already sorted. It outputs three columns: lines only in the first file, lines only in the second file, and lines appearing in both. In other words, it computes set differences and the intersection. By specifying -13 we suppress columns 1 and 3, keeping only lines unique to the second file, i.e. users we failed to crawl into user_list.tsv.
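All three column selections at a glance, on two toy sorted files:

```shell
printf 'a\nb\nc\n' > A.txt
printf 'b\nc\nd\n' > B.txt

comm -23 A.txt B.txt   # only in A: a       (difference A - B)
comm -13 A.txt B.txt   # only in B: d       (difference B - A)
comm -12 A.txt B.txt   # in both:   b, c    (intersection)
```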

4. Join

Recovering user ids for the collection data

The collection data records usernames but neither user ids nor nicknames, a design flaw of the crawler. The only remedy is to join against the user data. But how do you perform a join on plain text files?


sed 1d record-2017-02-20T14_03_27-2017-02-24T10_57_16.tsv| sort -t$'\t' -k1,1 > record.sorted.tsv
head record.sorted.tsv
100004    10380    anime    collect    2012-09-30    10    标签:;中二病;花泽香菜;嘟嘟噜;牧瀬紅莉栖;Steins
100004    10440    anime    collect    2012-10-02    9    标签:;あの日見た花の名前を僕達はまだ知らない
100004    10639    anime    collect    2012-10-02        标签:;虚渊玄;奈绪蘑菇;梶浦
100004    16235    anime    collect    2012-10-02    8    标签:;未来日记;我妻由乃
100004    18635    anime    collect    2012-10-02    7    标签:;罪恶王冠;泽野弘之
100004    23684    anime    collect    2012-09-30    9    标签:;黑子的篮球;基
100004    23686    anime    do    2012-10-01        标签:;刀剑神域;2012年7月;SAO;川原砾;梶浦由纪
100004    2453    anime    collect    2012-10-02    7    标签:;BRS;Black★RockShooter
100004    2463    anime    collect    2012-10-02    8    标签:;デュラララ!!
100004    265    anime    collect    2012-09-30    10    标签:;EVA;庵野秀明;神作;Evangelion;补完
sed 1d user-2017-02-17T12_26_12-2017-02-19T06_06_44.tsv| cut -f1,2,3 | sort -t$'\t' -k2,2 > user_list.sorted.tsv
head user_list.sorted.tsv
10    10    MoMo.
100    100    Jacob
10000    10000    漠漠
100000    100000    natalie_1204
100001    100001    七堂伽蓝
100002    100002    宇
100003    100003    二的二次方
100004    100004    从不卖萌的K
100005    100005    Astrid
100006    100006    tsiaben

Note that we want to join on the username, which requires both tables to be sorted on the join column; we have done exactly that above.

Now we can use join, starting with an INNER JOIN:

# inner join
join -t$'\t' -1 2 -2 1 user_list.sorted.tsv record.sorted.tsv | wc -l
join -t$'\t' -1 2 -2 1 user_list.sorted.tsv record.sorted.tsv | sed 5q
100004    100004    从不卖萌的K    10380    anime    collect    2012-09-30    10    标签:;中二病;花泽香菜;嘟嘟噜;牧瀬紅莉栖;Steins
100004    100004    从不卖萌的K    10440    anime    collect    2012-10-02    9    标签:;あの日見た花の名前を僕達はまだ知らない
100004    100004    从不卖萌的K    10639    anime    collect    2012-10-02        标签:;虚渊玄;奈绪蘑菇;梶浦
100004    100004    从不卖萌的K    16235    anime    collect    2012-10-02    8    标签:;未来日记;我妻由乃
100004    100004    从不卖萌的K    18635    anime    collect    2012-10-02    7    标签:;罪恶王冠;泽野弘之
join: write error: Broken pipe

As you can see, we have joined in some valuable information. The join field always appears as the first column of the output.

For a LEFT OUTER JOIN or RIGHT OUTER JOIN, we use -a to specify which file should be output in full:
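The -a behavior on a toy pair of tables (hypothetical keys, both files already sorted on the join column):

```shell
printf 'k1 Alice\nk2 Bob\n'    > users.txt
printf 'k1 10\nk1 20\nk3 30\n' > events.txt

join users.txt events.txt        # inner join: only the two k1 rows
join -a 1 users.txt events.txt   # left outer: k2 kept without a match
join -a 2 users.txt events.txt   # right outer: k3 kept without a match
```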

# left outer join
join -t$'\t' -1 2 -2 1 -a 1 user_list.sorted.tsv record.sorted.tsv | wc -l
join -t$'\t' -1 2 -2 1 -a 1 user_list.sorted.tsv record.sorted.tsv | head
10    10    MoMo.
100    100    Jacob
10000    10000    漠漠
100000    100000    natalie_1204
100001    100001    七堂伽蓝
100002    100002    宇
100003    100003    二的二次方
100004    100004    从不卖萌的K    10380    anime    collect    2012-09-30    10    标签:;中二病;花泽香菜;嘟嘟噜;牧瀬紅莉栖;Steins
100004    100004    从不卖萌的K    10440    anime    collect    2012-10-02    9    标签:;あの日見た花の名前を僕達はまだ知らない
100004    100004    从不卖萌的K    10639    anime    collect    2012-10-02        标签:;虚渊玄;奈绪蘑菇;梶浦
join: write error: Broken pipe
# right outer join
join -t$'\t' -1 2 -2 1 -a 2 user_list.sorted.tsv record.sorted.tsv | wc -l

With the RIGHT OUTER JOIN, the records that fail to join are exactly the users absent from user_list. The records left without a username and nickname should therefore match the contents of user_failure.tsv:

join -t$'\t' -1 2 -2 1 -a 2 user_list.sorted.tsv record.sorted.tsv | sort > record.complete.tsv
join -t$'\t' -1 2 -2 1 user_list.sorted.tsv record.sorted.tsv | sort > record.inner.tsv
comm -13 record.inner.tsv record.complete.tsv| cut -f1 | uniq | diff - user_failure.tsv
< sdf4
> sdf4

Apart from ordering, the two files show no substantive difference in content, which to some extent confirms that the JOIN is correct.

Of course, now that we have join, someone will ask how to join on multiple columns. Unfortunately, join only supports a single join column. For multi-column joins, see this.
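One common workaround, sketched here on made-up files: concatenate the key columns into a single composite key, join on that, and split the key back afterwards if needed.

```shell
printf 'a\t1\tfoo\nb\t2\tbar\n' > left.tsv
printf 'a\t1\tFOO\nb\t9\tBAR\n' > right.tsv

# Glue columns 1 and 2 into one key with an underscore, then join on it
awk -F '\t' '{ print $1 "_" $2 "\t" $3 }' left.tsv  | sort > left.keyed
awk -F '\t' '{ print $1 "_" $2 "\t" $3 }' right.tsv | sort > right.keyed
join -t$'\t' left.keyed right.keyed   # only the (a, 1) key matches
```

The glue character must not itself occur in the key columns, otherwise distinct key pairs can collide.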

5. Conclusion and future plan for Bangumi Spider

With the techniques above, we can combine awk and sort into text-file operations that replicate all sorts of SQL-like functionality, which is a great convenience for offline debugging. cat and comm give us union, intersection and difference, and join lets us join two files on a chosen column. When we are too lazy to fire up pandas or a database, these commands serve us very well.

Meanwhile, I think Bangumi Spider has too many problems at the moment, and I plan to upgrade it along the following lines:

  1. Retire the user spider and extend the record spider to capture everything the user spider currently records; after a crawl finishes, derive the record table and the user table from the raw record output. Post-processing should also ensure the record spider never crawls a subject twice.
  2. Fix Bangumi Spider's handling of 500 errors from the main Bangumi site.
  3. Keep part of the subject spider, especially the infobox handling, and drop the tag information. The subject table should be joined with the record table to resolve subject redirections.

Machine learning's March Madness: the NCAA men's basketball prediction competition

This post discusses predicting the results of the NCAA Division I men's basketball tournament, another of Kaggle's recurring competitions. What follows is my solution. Since this is a pre-tournament prediction and the real results will only be known in April, readers are welcome to slap me with the actual outcomes once they realize this was a jinx.


The competition provides the following data: score records for all regular seasons from 1985 to 2017, and detailed records for regular seasons from 2003 to 2017. The detailed records contain a pile of intricate basketball statistics; since I don't play basketball myself, I can't even name them properly. It also provides score records for all tournament seasons from 1985 to 2016 and detailed tournament records from 2003 to 2016. The task is to predict, for every possible pairing of the 68 teams entering the 2017 tournament, the probability of a win. Since each team carries a numeric id, what is predicted is the win probability of the lower-numbered team in each pair, giving $\frac{68 \times 67}{2}=2278$ cases. The objective is to minimize the log loss over the pairings that actually occur:

$$\textrm{LogLoss} = - \frac{1}{n} \sum_{i=1}^n \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right]$$

where n is the number of games actually played.
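As a sanity check with made-up numbers: a single game in which the favored team is given $\hat{y}=0.9$ and indeed wins ($y=1$) contributes

$$-\left[1\cdot\log(0.9) + (1-1)\cdot\log(1-0.9)\right] = -\log(0.9) \approx 0.105,$$

while the uninformed prediction $\hat{y}=0.5$ always costs $-\log(0.5) \approx 0.693$ per game, so the log-loss figures below should be read against that coin-flip level of 0.693.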


Knowing nothing about sports, after watching the forum discussions I decided to predict each year's tournament from that year's regular season. If you read the NCAA rules, one feature jumps out immediately: the seed, which to some extent encodes a ranking of team strength. The forum points out that the trivial predictor "the lower seed always beats the higher seed" already achieves 72.75% accuracy. Accuracy and log loss are very different things, though; that number only gives an intuition of how informative the feature is. In my experiments, a logistic regression trained on the single feature "seed difference", with 2003-2012 as the training set and 2013-2016 as the test set, reaches a log loss of about 0.62. (Someone on the forum got around 0.55, but that was a hindsight prediction, not usable for forecasting.)

That result is just the floor, on top of which the model can be strengthened. Digging around, I found some wonderful statistics: the Four Factors of Basketball Success. Adding these features brings the logistic regression down to about 0.60. You can also craft similar features yourself, e.g. $\frac{Assist}{FGA+0.44\times FTA+Assist}$, and they help a little too. But these gains were barely better than nothing, which disappointed me deeply.

I then decided to use rankit, my unsupervised ranking library, to boost the model. rankit provides Massey rank, Colley rank, point-difference rank, Markov rank, offence-defence rank and Keener rank. Just stacking these ranking methods on the game scores already yields plenty of features. Ratings carry more information than rankings, so I used the difference of the two teams' ratings as features. As for normalization, I mapped every feature to [0, 1] through its cumulative distribution function. At this point, using MPN, the log loss is around 0.58.

But running rankit only on game scores is a waste. We have all those statistics, and computing a rating for each of them yields far more valuable information: rating two teams on defensive rebounds versus offensive rebounds captures, physically, the difference between defensive and offensive strength. So we can run every unsupervised ranking method over every statistic.

The final model therefore uses two families of features: simple per-season summaries of the statistics, such as season means of rebounds and assists plus the Four Factors of Basketball Success mentioned above; and the unsupervised-ranking transforms of those statistics.

Now that each team has features that adequately describe it, should we simply subtract them to get the features of a matchup, and train an LR or GBDT classifier on that? In fact, I used a CNN. Why a CNN? I arrange a game's features as a two-row matrix, one row per team per feature column. Putting a convolution kernel over this matrix gives you more parameters than direct subtraction: for the same feature of the two teams, the extractor has the form $c_1 x_1+c_2 x_2+c_3$, whereas direct subtraction amounts to forcing $c_1=-c_2$, removing a degree of freedom. Moreover, I wanted the features to cross with each other, and a connectionist model provides exactly that. Of course people will ask: won't such a complex model overfit? My experiments show that a very simple four-layer network (input, convolution, fully connected, output) achieves the effect nicely, and beats GBDT (tested with xgboost and lightgbm, using the difference of the two teams' features).


My initial idea had actually been feature crosses, i.e. constructing interaction features between teams for a finer-grained description of team strength, but that turned out worse than an LR trained on the seed delta alone. I think the main reason is the lack of training data: since we are predicting tournaments, only about two thousand training examples are available; what can a model with a hundred thousand dimensions possibly learn from that?


rankit is currently very painful to use; even I have to read the source to confirm usage or to check whether a feature I want has been added. I hope to upgrade rankit to 0.2 before this year's tournament ends.


The forum also mentioned using Elo ratings as a feature. Elo is an important rating system, yet it is not in rankit. I tried to implement it in the last two days of the competition, but time was too short. Another item for the to-do list.

Max flow, minimum cut

It has been a month since I joined Microsoft Suzhou, and life is gradually settling down. Compared with all the misery of Beijing, I am grateful for this transfer, which lets me start corporate life on easy mode. It also means my long-shelved study of algorithms can continue.

Today's topic is the max-flow min-cut algorithm. This is a chapter of Introduction to Algorithms that I never managed to understand no matter how many times I read it: too much terminology and too little code (I'm just that bad). Fortunately, a series of problems on hihoCoder laid enough stepping stones for me to finally grasp this material.

A maximum flow problem is usually stated like this: given a directed graph G with source s and sink t, where each edge has a maximum flow capacity c, the source s only emits flow and the sink t only absorbs it; find the maximum possible flow from source to sink.

Skipping the proof, here is the recipe: we repeatedly search the current graph G for a path from s to t whose edges all permit nonzero flow. Each time we find such a path, we can determine its flow bottleneck, subtract that bottleneck from the allowed flow of every edge along the path, and search again in the new graph. We keep searching until no s-to-t path with a positive bottleneck exists. This is the Ford-Fulkerson algorithm, and the search can be done with BFS.


A tricky example of a flow network

Suppose BFS finds the feasible path A-B-D-F with bottleneck 3; after subtracting it, no further path can be found. Yet the maximum flow is 5: of the flow leaving B, one unit goes to D and two go to E, and A-C-D carries a path of flow 2. To handle this we need the notion of a "residual network": each edge maintains a paired reverse edge whose capacity equals the flow already consumed in the forward direction, while the forward edge's capacity is the remaining usable forward flow. In other words, the forward/reverse edge pair keeps track of how much spare flow each edge still has. An edge with capacity 1 whose forward value is 1000000 and reverse value is 999999 is perfectly fine! A feasible path found in the residual network is called an augmenting path. The algorithm described above can then be expressed as:

While ( findAugmentPath() ) // is there still an augmenting path?
maxFlow = maxFlow + delta // increase the max flow
modifyGraph() // update the residual network along the augmenting path
End While


We usually say max flow, minimum cut. So what is a cut? In a flow network with s and t, a cut is a partition of the vertices into two disjoint sets S and T, with $s \in S, t \in T$. The net flow f(S, T) is defined as the sum of the flows crossing the cut (S, T) (when the cut crosses a reverse edge, that edge's flow counts negatively). The capacity C(S, T) is defined as the sum of the capacities of all edges on the cut (reverse edges excluded). Hence $f(S, T) \le C(S, T)$. It can be shown that the value of the current flow always equals the net flow across any cut. Cuts, however, do not all have the same capacity; the cut whose capacity is smallest is what we call the minimum cut.


The max-flow min-cut theorem states that the following conditions are equivalent:

  1. f is a maximum flow of the network;
  2. the residual network contains no augmenting path;
  3. some cut (S, T) of the flow network has capacity equal to the value of f.

Therefore, computing the maximum flow also yields the capacity of the minimum cut. With the Ford-Fulkerson algorithm described above, both the cut and the network's flow value can be computed quickly.

Production Code

The Ford-Fulkerson implementation that uses BFS for each augmenting-path search is called the Edmonds-Karp algorithm. Since each BFS costs roughly O(E) and the total number of flow-augmenting steps is O(VE), its time complexity is $O(VE^2)$. This is not the fastest approach. There is an algorithm that maintains, for every vertex, the shortest distance to the sink t to guide the search order, and uses DFS to find augmenting paths, lowering the complexity considerably. Its code is as follows:

// This code corresponds to the hihoCoder problems listed in the references
#include <cstdio>
#include <cstring>
#include <queue>

const int inf = 0x3f3f3f3f;
const int maxn = 505, maxm = 40005;
int N, M;

typedef struct {
    int u, v, c, prev;
} edge_t;

int head[maxn];
edge_t edges[maxm];

int source, sink;
int dep[maxn], depcnt[maxn], cur[maxn], stack[maxn];

// store each edge and its reverse edge as the index pair (id, id ^ 1)
void addedge(int u, int v, int c) {
    static int id = 0;
    edges[id].u = u; edges[id].v = v; edges[id].c = c; edges[id].prev = head[u]; head[u] = id++;
    edges[id].u = v; edges[id].v = u; edges[id].c = 0; edges[id].prev = head[v]; head[v] = id++;
}

// BFS from the sink: build dep (shortest distance to sink) and depcnt
void bfs() {
    memset(dep, 0xff, sizeof(dep));
    memset(depcnt, 0, sizeof(depcnt));
    dep[sink] = 0;
    depcnt[dep[sink]] = 1;
    std::queue<int> Q;
    Q.push(sink);
    int u, v;
    while (!Q.empty()) {
        u = Q.front();
        Q.pop();
        for (int e = head[u]; ~e; e = edges[e].prev) if (edges[e].c == 0 && dep[edges[e].v] == -1) {
            v = edges[e].v;
            dep[v] = dep[u] + 1;
            depcnt[dep[v]]++;
            Q.push(v);
        }
    }
}

int SAP() {
    memcpy(cur, head, sizeof(head));
    int u = source, maxflow = 0, top = 0, i;
    while (dep[source] < N) {
        if (u == sink) {
            // find the bottleneck on the current augmenting path
            int delta = inf, p = 0;
            for (int t = top - 1; t >= 0; t--) {
                int e = stack[t];
                if (edges[e].c < delta) {
                    delta = edges[e].c;
                    p = t;
                }
            }
            // push the bottleneck flow along the path
            for (int t = top - 1; t >= 0; t--) {
                int e = stack[t];
                edges[e].c -= delta;
                edges[e ^ 1].c += delta;
            }
            maxflow += delta;
            // retreat to the tail of the bottleneck edge
            top = p;
            u = edges[stack[p]].u;
        }
        // advance along an admissible edge if one exists
        for (i = cur[u]; ~i; i = edges[i].prev) {
            if (edges[i].c > 0 && dep[edges[i].v] == dep[u] - 1) {
                cur[u] = i;
                u = edges[i].v;
                stack[top++] = i;
                break;
            }
        }
        if (i == -1) {
            // relabel: no admissible edge leaves u
            if (--depcnt[dep[u]] == 0) break;  // gap optimization
            int mindep = N;
            for (int e = head[u]; ~e; e = edges[e].prev) {
                if (edges[e].c > 0 && mindep > dep[edges[e].v]) {
                    mindep = dep[edges[e].v];
                    cur[u] = e;
                }
            }
            dep[u] = mindep + 1;
            depcnt[dep[u]]++;
            if (u != source) {
                u = edges[stack[--top]].u;
            }
        }
    }
    return maxflow;
}

int main() {
    freopen("testcase", "r", stdin);
    scanf("%d%d", &N, &M);
    source = 1, sink = N;
    memset(head, 0xff, sizeof(head));
    for (int i = 1; i <= M; i++) {
        int u, v, c;
        scanf("%d%d%d", &u, &v, &c);
        addedge(u, v, c);
    }
    bfs();
    printf("%d\n", SAP());
    return 0;
}


  1. Before the DFS for augmenting paths, a BFS from the sink builds dep and depcnt, the shortest distances to the sink, giving the DFS a basis for always advancing along shortest paths.
  2. Forward and reverse edges are stored in pairs, with even indices always forward; an edge and its pair can be switched via an index XOR.
  3. stack stores the augmenting path found by the DFS.
  4. After each augmentation, the search backtracks to the tail of the bottleneck edge and resumes from there, saving time.
  5. cur mirrors head in meaning and remembers where the scan of each adjacency list stopped, so the search resumes from the current feasible edge, again saving time.
  6. When no feasible edge leaves a vertex, its dep and depcnt must be updated, which also means cur must be refreshed.

That's it! Some references point out that the first step, building dep and depcnt, can be skipped. This is the main reference for the implementation.


  1. hihoCoder weekly problems 115, 116, 117, 118, 119
  2. Introduction to Algorithms (3rd edition), Chapter 26
  3. Max Flow-SAP-Improved Shortest Augmenting


The Didi algorithm competition was a machine learning contest aimed at predicting the degree of supply-demand imbalance in urban transportation. The task: given the time slots to be predicted, forecast the gap between demand and supply in each district of the city. The final evaluation metric is

$$\begin{equation} MAPE = \frac{1}{D}\sum_{d=1}^{D}\left( \frac{1}{T} \sum_{t=1}^{T} \left| \frac{gap_{dt}-s_{dt}}{gap_{dt}}\right|\right) \forall gap_{dt}>0 \end{equation}$$

In the competition, D = 66 and T = 144. The official training data contains all orders of the preceding three weeks, weather and traffic descriptions, and district attribute descriptions.



In this competition I used only the order data, ignoring weather, traffic and district attributes, on the grounds that those are ultimately reflected in the orders anyway; it suffices to characterize the orders well. I extracted the following eight groups of features, all presented as one-hot codings:

  1. hour of day of the order (24 dims)
  2. whether the order falls on a weekend (2 dims)
  3. district of the order (66 dims)
  4. supply-demand state in the same time slot 7 days earlier (21 dims)
  5. supply-demand state in the time slot immediately before the order (21 dims)
  6. combined gap state of the two preceding time slots (49 dims)
  7. mean gap over the three preceding time slots (7 dims)
  8. standard deviation of the gap over the three preceding time slots (3 dims)

A "time slot" is one of the 144 slices into which the problem statement divides a day. To describe supply and demand with one-hot coding, I re-binned the gap into [0,0], (0,1], (1,3], (3,6], (6,12], (12,30], (30,$+\infty$), giving a 7-dimensional feature, and binned demand (the number of customer requests) into [0,10], (10,20], (20,$+\infty$), giving 3 dimensions. The 21-dimensional "supply-demand state" combines these two as a 2-d feature cross; the "gap state" crosses two gap features the same way, giving 49 dimensions. The gap standard deviation is binned into [0,1], (1,5], (5,$+\infty$), giving 3 dimensions.

But this is far from enough. For instance, we want features that capture how a district varies across time slots, which requires 2-d crosses of the time and district features. Crossing all eight feature groups pairwise yields a very sparse feature vector of over ten thousand dimensions; a linear regression trained on it reaches 0.32-0.34 on the test set. In practice, however, not all eight groups are fully crossed with one another; only combinations with physical meaning are selected, to limit the number of parameters. In the code, a matrix is used as a mask specifying which feature-group crosses are allowed.

There is one more feature, which uses a magnitude rather than one-hot coding: train a linear regression model and use its prediction as the final feature. This feature can also be crossed with the eight groups above to form 2-d features.


I started by modeling with scikit-learn's linear regression, but the library seems to fall short of this use case. For sparse features, I very much wanted an L1-norm regularizer, and none of scikit-learn's models met my requirements. Moreover, for the metric demanded by the problem, each training sample is best weighted by $\frac{1}{gap_{dt}}$, and not every scikit-learn model offers a sample-weighting option.

The best tool turned out to be xgboost, a gradient-boosting library practically purpose-built for machine learning competitions. A GBDT-based regression model reaches about 0.29. Notably, I found that three boosting iterations worked best, which all but upended my prior impression of GBDT.


For validation I took the full days of January 4, 9, 17, 19 and 21; the remaining data, minus January 1-3, formed the training set. I also duplicated the records of the 10th and the 16th once.

Note that samples with gap = 0 must be removed during training: they have no effect whatsoever on the evaluation metric (if it really is defined as stated) and would also bias the model.


Earlier reports showed that simply submitting the downloaded test file as-is already scores around the 0.6 level, whereas by definition such a submission should yield a MAPE of 1. So my MAPE computation may disagree with the official understanding somewhere.


I had thought this problem fairly simple, since it is a pure supervised learning task, but I found the distance between me and the top contenders still large, in both skill and circumstances. What infuriated me most: the day before the deadline I read on Zhihu that a weighted average of the gaps of the three preceding time slots already achieves 0.29. I was livid. I spent all this time and effort to reach that value, only to be matched by such a brain-dead baseline??? With more people on my team, our thinking would have opened up, and I would not have walked into this dead end alone.

Moreover, I did not enter the competition from the start; I only thought of "playing around" after my master's thesis defense. I had previously seen @haruki_kirigaya playing it too (though, for the record, I looked at nobody's code). My first submission was on June 9, with only ten submission chances left, and the data was even swapped once in between. But excuses are endless, and the gap in skill is absolute. This was the first machine learning competition I ever entered. A while ago someone on Bangumi followed me (what is there to follow about a scrub like me?), someone who claims to have competed in every single Kaggle competition. I am utterly ashamed. Back to the books.