### 2019-08-01 Saving the React novice: update a component properly

Front-end development in React is all about updating components. Since the business logic of a webpage is described in render() and the lifecycle hooks, setting state and props properly is the top priority of every React developer. It affects not only functionality but also rendering efficiency. In this article, I will start from the basics of updating a React component, then look at some common mistakes that React novices often make. I will also describe some optimization techniques to help novices build good habits when thinking in React.

## Basics

A webpage written in React is driven by state. That is to say, every change in a React component leads to a change in the appearance of the webpage. Since state can be passed down to child components as props, changes of state and props are responsible for changes in the view. However, the React documentation points out two key principles:

1. One cannot change props directly.
2. One can only update state by using setState().

These two constraints are linked with how React works. React’s data flow is unidirectional, so a child component cannot mutate props owned by its parent. setState() is tied to a component’s lifecycle hooks, so any attempt to change state without using setState() will bypass the lifecycle hooks.

However, in my development experience, I have observed numerous cases where these two principles are broken. A major part of those misbehaving patterns can be traced back to the selective ignorance of an important property of JavaScript.

## Assignment: reference or value?

Let’s look at a simple example: a variable temp and an object obj, both declared with const. We all know const declares a variable that cannot be assigned a second time, so it is no surprise that assigning 3 to temp throws a TypeError. However, const does not imply constness of an object’s internal fields. When we mutate an internal field of obj, it just works.
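The listing this passage refers to is not reproduced here. A minimal sketch of the same behaviour follows; the initial value of temp and the field name a on obj are illustrative assumptions, not taken from the original listing.

```javascript
// const forbids re-assignment of the binding itself...
const temp = 1;
let reassignError = null;
try {
  temp = 3; // throws TypeError at runtime
} catch (err) {
  reassignError = err;
}

// ...but it does not freeze the object the binding points to.
const obj = { a: 1 };
obj.a = 2; // no error: only the field changes, the binding is untouched
```

If real immutability of the fields is needed, Object.freeze(obj) is the usual tool, though it is also shallow.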

Here is another common operation. When a and b are two non-object variables, re-assigning b leads to a !== b. However, when we assign c, which is an object, to d, mutating an internal field does not change the equality between the two. That implies that d is a reference to c.
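A minimal sketch of this comparison, assuming concrete values that the original listing may not have used:

```javascript
// Primitives are copied by value: changing b leaves a alone.
let a = 1;
let b = a;
b = 2;
const primitivesDiffer = a !== b;

// Objects are shared by reference: d and c are the same object,
// so a mutation through d is visible through c.
const c = { value: 1 };
const d = c;
d.value = 2;
const stillSameObject = c === d;
const seenThroughC = c.value;
```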

So we can conclude two observations from the above:

1. const does not make an object’s fields constant. It certainly cannot prevent a developer from mutating its internal fields.
2. Assigning an object to another variable passes on its reference.

With the above in mind, we can move on to the following code:

As you read the code, you are clearly aware of the author’s intention: onChange is where this.state is changed according to the incoming parameters. The author first gets a copy of the original data, then modifies its values and pushes them to nextData. Finally, the author calls this.setState to update.

If this pattern appears in your projects and it works, that is normal. According to React’s component lifecycle, this.state.data is changed to nextData, and it will eventually affect the return value of render(). However, there is a series of flaws in this code that I have to point out. If you fully understand and agree with the two observations I mentioned above, the following points will make you uncomfortable:

1. data=props.data in line 5 assigns the reference of this.props.data to this.state.data, which means changing this.state.data directly COULD mutate this.props.data.
2. prevData is assigned as a reference to this.state.data in line 10. However, as you read through the code, you will realize that this is not the author’s real intention. The author wants to “separate” prevData from this.state.data by using const. However, this is a total misunderstanding of const.
3. In line 13, each item in prevData is mutated by assigning its field a another value. However, as we mentioned before, prevData is a reference to this.state.data, and this.state.data is a reference to this.props.data. That means that by doing so, the author changed the content of this.state.data without using setState and modified this.props.data from the child component!
4. In lines 18–20, the author finally calls setState to update this.state.data. However, since the state was already changed in line 13, this happens too late. (Perhaps the only good news is that this.state.data is no longer a reference to this.props.data.)
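The aliasing chain described in the four points can be reproduced without React. The sketch below uses plain objects as hypothetical stand-ins for this.props and this.state; it is not the original component, and the multiplication is only a placeholder for whatever onChange did to the field a.

```javascript
// Hypothetical stand-ins for this.props and this.state.
const props = { data: [{ a: 1 }, { a: 2 }] };
const state = { data: props.data }; // point 1: state.data aliases props.data

const prevData = state.data; // point 2: const does not copy anything
const nextData = [];
prevData.forEach((item) => {
  item.a = item.a * 10; // point 3: mutates state and props in place
  nextData.push(item);
});

// Before any setState call could run, both sources have already changed.
const propsWereMutated = props.data[0].a === 10;
const stateWasMutated = state.data[0].a === 10;
// nextData is a new array, but its items are still the shared objects.
const itemsStillShared = nextData[0] === props.data[0];
```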

Well, someone may claim: so what? My page is working properly! Perhaps those people do not understand what lifecycle hooks are for. Usually, people write their business logic in lifecycle hooks, such as deriving state from props, or performing a fetch call when some props change. In that case, we may write something like the following:

Every time a component finishes its update, it calls componentDidUpdate. This happens whenever setState is called or props are changed.

Unfortunately, if a novice developer unintentionally mutates this.state or this.props, these lifecycle hooks will not work as expected, which leads to unexpected behavior.
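One way to see why the hooks stop working: logic inside componentDidUpdate is commonly guarded by a reference comparison between the previous and the current value. That guard is an assumption about how the missing listing was written; the helper name hasChanged below is hypothetical. A plain JavaScript sketch:

```javascript
// The kind of check a hook typically performs before reacting to a change.
function hasChanged(prevValue, nextValue) {
  return prevValue !== nextValue;
}

const data = [{ a: 1 }];

// In-place mutation: the "previous" and "next" values are the same array,
// so the guard never fires even though the content is different.
const before = data;
data[0].a = 2;
const detectedAfterMutation = hasChanged(before, data);

// Non-mutating update: a new array is created, so the guard fires.
const replaced = data.map((item) => ({ ...item, a: 3 }));
const detectedAfterReplace = hasChanged(before, replaced);
```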

## How to keep every update under control?

If you are lazy and like the natural way of thinking, using a temporary variable that is separate from the original, as I showed above, you are welcome to use immer. Every time you update state, it provides you with a draft, and you can modify whatever you want on the draft before returning. An example is given in its documentation.

However, you should know that the proper way to update a state field without modifying it directly through a reference is to clone it. The clone sometimes needs to be deep to make sure every field is a thorough copy of the original, not a reference. One can achieve that with cloneDeep from lodash. But I do not recommend it, since it may be too costly. Only in rare cases will you need a deep clone.

Rather, I recommend using Object.assign(target, ...sources). This function updates target with the properties of sources and returns target once the update is complete; the returned object is target itself, a different object from any of the sources. So updating an object should look like:
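A minimal sketch of this pattern, with hypothetical field names. Passing a fresh {} as target is what keeps the original untouched:

```javascript
const original = { a: 1, nested: { b: 2 } };

// Copy original into a new object, then override the field to change.
const updated = Object.assign({}, original, { a: 3 });

const isNewObject = updated !== original;
const originalUntouched = original.a === 1;
// Only the top level is copied; nested objects are still shared.
const nestedShared = updated.nested === original.nested;
```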

The actual programming can be even easier: the spread syntax lets you expand an object or array. Using spread syntax, you can easily create a new object or array by writing:

That allows you to copy the original content of an object or array into a new object or array. The copy at this point is a shallow copy, which means only the outermost object or array is new; if a field of the original object, or an element of the original array, is itself an object, that inner object is copied by reference. This avoids the costly operations of a deep clone. I like this, because it gives you precise control over what you need to change.
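The spread syntax and its shallow-copy behaviour can be sketched as follows; all names are hypothetical.

```javascript
const original = { a: 1, nested: { b: 2 } };
const list = [{ id: 1 }, { id: 2 }];

// New outer object / array, with one field overridden or one item appended.
const copy = { ...original, a: 3 };
const listCopy = [...list, { id: 3 }];

// Shallow: inner objects are the very same references.
const innerObjectShared = copy.nested === original.nested;
const innerItemShared = listCopy[0] === list[0];

// To change something nested, spread only along the path that changes.
const deeper = { ...original, nested: { ...original.nested, b: 5 } };
```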

So the component I gave above, written properly, should look like this:

I would suggest changing all components you write from now on to React.PureComponent. It is no different from React.Component except that it has a default shouldComponentUpdate: it performs a shallow comparison of state and props to decide whether the component should update, whereas React.Component always re-renders unless you provide custom logic. This not only improves the page’s performance but also helps you notice unexpected rendering when you make the mistake I mentioned above.

If you need similar functionality for function components, you can try React.memo(), which is available since React 16.6.
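The corrected component itself is not reproduced here. As a sketch of its core idea only, the state computation can be written as a pure function that never touches its input; buildNextData and the parameter value are hypothetical names, and the field a follows the description earlier in this article.

```javascript
// Build the next data array without mutating the previous one.
function buildNextData(prevData, value) {
  return prevData.map((item) => ({ ...item, a: value }));
}

// Frozen input stands in for this.props.data: any mutation attempt would fail.
const propsData = Object.freeze([Object.freeze({ a: 1, b: 'x' })]);
const nextData = buildNextData(propsData, 5);
```

Inside a component, this would typically be used through the updater form of setState, for example this.setState((state) => ({ data: buildNextData(state.data, value) })).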

### 2019-07-10 RegExp.test() returns true but RegExp.exec() returns null?

Consider the following JavaScript script:

It seems to be perfectly sound logic for extracting the tag name from a tag string: one uses test to guard against the case where testString does not match. However, it throws an error:

This violates our intuition: the string does match the regex, yet the match fails. So why does this happen?

As MDN states, invoking test on a regex with the global flag set updates lastIndex on that regex. lastIndex lets the user continue testing the same string, since there may be more than one match. This affects the exec call that follows. So the proper usage of a regex with the g flag set is to keep executing it until it returns null, which means no further match can be found in the string.
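The original script is not reproduced here; the sketch below shows the same effect with a hypothetical pattern and input string.

```javascript
// A global regex keeps state in lastIndex between calls.
const tagPattern = /<(\w+)>/g; // hypothetical pattern for a tag string
const testString = '<div>';

const matched = tagPattern.test(testString); // succeeds, moves lastIndex forward
const lastIndexAfterTest = tagPattern.lastIndex;

// exec resumes from lastIndex (the end of the string), finds nothing,
// returns null and resets lastIndex to 0.
const execResult = tagPattern.exec(testString);
const lastIndexAfterExec = tagPattern.lastIndex;

// Reading execResult[1] at this point would throw, because execResult is null.
```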

This behavior is sometimes undesirable, since in the above scenario, I just want to know whether the string matches the predefined pattern. And the reason I don’t want to create the regex on the fly is to save the overhead of creating an object.

One obvious solution is to remove the g flag. But sometimes we do want to keep the g flag, check whether a string matches the given pattern, and not modify the internal state of the regex. In that case, one can switch to string.search(regex), which always returns the index of the first match, or -1 if there is no match.
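A short sketch of the search-based check, again with a hypothetical pattern. Because search leaves lastIndex as it found it, a later exec on the same regex still starts from the beginning.

```javascript
const pattern = /<(\w+)>/g; // hypothetical pattern, g flag kept on purpose

const firstCheck = '<div>'.search(pattern);
const secondCheck = '<div>'.search(pattern); // same answer on repeated calls
const noMatch = 'plain text'.search(pattern);
const lastIndexAfterSearch = pattern.lastIndex;

// exec is unaffected by the earlier search calls.
const tagName = pattern.exec('<div>')[1];
```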

### 2019-05-03 Write a reusable modern React component module

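1. This component depends on React and React-dom. How do I let users know that they must use these two as well?
2. If the user’s application uses React 16.2 but my component uses React 16.8, will the user run into compatibility problems when using my component? Will two versions of React end up installed for the sake of compatibility?
3. Users consume my package by referencing its files directly. How can I get my package into node_modules, that is, installed via npm install?
4. My package uses some advanced language features. When it is referenced directly as files, it cannot be transpiled by babel into an older version of JavaScript. Users cannot even import it with import MyModule from './mymodule'!

## Optimizing the template

1. It also appears to bundle the current versions of react, react-dom and styled-components, which increases the size of the library;
2. It is actually a module that is transpiled by babel before being consumed by the downstream application, so when the downstream application uses ES6 import, there is no real tree-shaking (tree-shaking means analyzing ES6 import and export statements to determine which code is actually used, and removing the unused code before execution).
3. There is also no need for me to bundle the source module a second time when the downstream application is bundled.

## Remaining problems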

### 2019-04-21 Release of rankit v0.3 and roadmaps for future bangumi ranking

After several years (really!) of development, I’m pleased to announce that rankit has been upgraded to v0.3. The previous major release was v0.1, made in 2016. So what has changed during these three years?

When rankit v0.1 was developed, it aimed to implement all the fascinating algorithms mentioned in a beautiful ranking book and to provide an alternative ranking solution for Bangumi. The programming interface designed at that time was far from practical. The more I looked into ranking algorithms, the more I felt that rankit should include some popular ranking algorithms based on time series. One of them is Elo ranking, which had a simple implementation in rankit v0.2. But its usage scenario was limited: updating the ranking was tedious, and its interface was not compatible with the existing ranking algorithms.

1. Split rankers into “Unsupervised ranker” and “Time series ranker”. Each of the two has its own interface, and both consume the Table object to keep records.
2. Introduced the Elo ranker, Glicko 2 ranker and TrueSkill ranker (only paired competition records are supported).
3. Updated the interface to make it more user-friendly. For unsupervised rankers, only the rank method is provided, since no update is needed. For time series rankers, one should use update to accumulate new record data, and one can retrieve the latest leaderboard by invoking leaderboard after each update.

The documentation of rankit has also been updated.

One side product of rankit is the “Scientific animation ranking for Bangumi”. For a long time the ranking was not updated, and it was gradually forgotten. I gave it a refresh last year, and it now has a completely new look with faster response and a simple search function. More importantly, the ranking will be updated monthly. I invite you all to try it here: https://ranking.ikely.me

The latest scientific animation ranking also involves minor algorithm changes. It is often observed that, at certain times, users with peculiar intentions pour into Bangumi to rate one specific animation with a very high or very low score. This distorts the proper order of ratings. The previous version of the scientific ranking could ignore users who rate one anime and leave permanently, but could not handle users who rate multiple animations. I adjusted users’ scores overall and applied several normalizations according to the distribution of each user’s ratings; all versions of the normalized scores are fed into rankit to calculate the latest ranks. The final rank is still merged using Borda count.

But could this be solved from another perspective? One thing I have been thinking about is how to bring time series ranking into the current ranking scheme. Ideally, time series ranking should combat ranking manipulation in a way other than pairwise ranking. As I was reading about TrueSkill, its brilliant idea of inferring rankings with a linear chain graph struck me. Actually, TrueSkill is a generative graphical model organized in the same order as the competition scores. Another issue that needs to be resolved is helping users adjust historical ratings automatically: a user may have rated animations over a wider range in the past, but his or her rating criteria may change over time. How can recent rating criteria be propagated to historical ratings? All of this should be supported in the next version of rankit: ranking for multiple players in a match, and the power to propagate recent competition results to historical data.
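For readers unfamiliar with Borda count, here is a minimal sketch of one common variant. It is not rankit’s implementation; the function name bordaMerge, the scoring rule (an item at 0-based position p in a list of n items earns n - p points) and the sample rankings are all assumptions made for illustration.

```javascript
// Merge several rankings (arrays of item ids, best first) by summing
// positional points and ordering items by their total.
function bordaMerge(rankings) {
  const points = new Map();
  for (const ranking of rankings) {
    ranking.forEach((id, position) => {
      const earned = ranking.length - position;
      points.set(id, (points.get(id) || 0) + earned);
    });
  }
  return [...points.entries()]
    .sort((x, y) => y[1] - x[1]) // highest total first
    .map(([id]) => id);
}

// Three hypothetical rankings of the same three items.
const merged = bordaMerge([
  ['A', 'B', 'C'],
  ['B', 'A', 'C'],
  ['B', 'C', 'A'],
]);
```

In this example B collects 8 points, A 6 and C 4, so the merged order is B, A, C.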

### 2017-10-02 Mangaki data challenge 1st place solution

Mangaki data challenge is an otaku-flavored data science competition. Its goal is to predict a user’s preference for an unwatched/unread anime/manga from two choices: wish to watch/read and don’t want to watch/read. The competition provides training data from https://mangaki.fr/, which allows users to favorite anime/manga works. Three major training tables are provided, as described below:

1. Wish table: about 10k rows

   | User_id | Work_id | Wish |
   |---|---|---|
   | 0 | 233 | 1 |

2. Record table: for already watched/read anime/manga. There are four rates here: love, like, neutral and dislike.

   | User_id | Work_id | Rate |
   |---|---|---|
   | 0 | 22 | like |
   | 2 | 33 | dislike |

3. Work table: detailed information of available anime/manga. There are three categories: anime, manga and album. There is only one album in this table; all the others are anime (about 7k) and manga (about 2k).

   | Work_id | Title | Category |
   |---|---|---|
   | 0 | Some_anime | anime |
   | 1 | Some_manga | manga |

For the testing data, one should predict, for 100k user/work pairs, whether the user wishes to watch/read the anime/manga or not. As you can see, the testing data is much larger than the training data. Besides, my analysis of the dataset showed that not all users or works in the test set are guaranteed to appear in the training set.

## Traditional recommendation system methods (that I know)

Building recommendation systems has long been studied, and there are various methods for solving this particular problem. I also tried to build a recommender for https://bgm.tv several years ago (you can read technical details here). The simplest solution is SVD (actually, an even simpler and more intuitive solution is KNN); then one can move on to RBM, FM, FFM and so on. One assumption that holds firm in all these methods is that users should have an embedding vector capturing their preferences, and works should also have an embedding vector capturing their characteristics. Is it reasonable that we should be constrained to this embedding-dot-product model?

Recently, the common practice in Kaggle competitions is to use GBDT to solve almost all problems, except those related to computer vision. As long as a model can handle classification, regression and ranking problems very well, it can be applied to all supervised machine learning problems! And by using model ensembling under the StackNet framework, one can combine models with different characteristics to achieve the best result.

In this competition, my solution is quite plain and straightforward: feature engineering to generate some embeddings, then GBDT/Random Forest/Factorization Machine models built from different combinations of features. Finally, I used a two-level stack net to ensemble them, in which level two is a logistic regression model.

## Feature Engineering

### From wish table:

• Distribution of user’s preference on anime/manga (2d+2d)
• Distribution of item’s preference (2d)
• Word2vec embedding of user on wish-to-watch items (20d)
• Word2vec embedding of user on not-wish-to-watch items (10d)
• Word2vec embedding of item on wish-to-watch users (20d)
• Word2vec embedding of item on not-wish-to-watch users (10d)
• Lsi embedding of user (20d)
• Lsi embedding of item (20d)

### From record table:

• Distribution of user’s preference on anime/manga (4d+4d)
• Distribution of item’s preference (4d)
• Mean/StdErr of user’s rating (2d)
• Mean/StdErr of item’s rating (2d)
• Word2vec embedding of user on loved and liked items (32d)
• Word2vec embedding of user on disliked items (10d)
• Word2vec embedding of item on loved and liked users (32d)
• Word2vec embedding of item on disliked users (10d)
• Lsi embedding of user (20d)
• Lsi embedding of item (20d)
• Lda topic distribution of user on love, like and neutral items (20d)
• Lda topic distribution of item on love, like and neutral ratings (20d)
• Item category (1d, categorical feature)
• User Id (1d, only used in FM)
• Item Id (1d, only used in FM)

## Model ensembling

The first layer of the stack net is a set of models that should have good predictive capability but different inductive biases. Here I tried just three models: GBDT, RF (both backed by lightGBM) and FM (backed by FastFM). I trained models on the record table features and the training table features separately, and one can further train different models using different combinations of features. For example, one can use all features (except user id and item id) from the record table. But since GBDT focuses on the most informative features when all features are given, it helps to split the features into several groups and train models separately. In this competition, I did not split much (simply because I did not have much time). I just removed the first four features (because I saw from the prediction results that they have a major effect on precision) and trained some other models.

## Model stacking

The stack net requires one to feed all prediction results from the first layer as features to the second layer. The stacking technique requires K-fold cross-validation at the beginning: each fold’s predictions are produced by models trained on all the other folds, and those predictions become the training data for the second level. Here is the most intuitive (as far as I think) description of the model stacking technique: http://blog.kaggle.com/2017/06/15/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova/

In this competition, a single GBDT with all the features from the record table reaches 0.85567 on the LB. By leveraging the model stacking technique, one can reach 0.86155, which is my final score.

## Is this the ultimate ceiling?

Definitely not. One can push the boundary much further:

1. I did not tune the embedding generation parameters very well. In fact, I generated those features using the default parameters gensim provides. The embedding dimensions were an arbitrary decision of mine, with no science involved. Maybe one can enlarge the sliding window of word2vec or use more embedding dimensions to achieve better results.
2. I only used lightGBM to build GBDT. One can also use xgboost. Even though they both provide GBDT, lightGBM grows trees leaf-wise, while xgboost grows them depth-wise. Even though both are CART-based GBDT models, they behave differently.
3. I did not introduce any features generated by deep models. GBDT relies on heavy feature engineering, while a deep model learns features automatically. By combining them in a stacking model, one can likely obtain a much higher AUC.
4. I did not use more complex features. Sometimes, popularity ranking also affects a user’s behavior: a user may mark highly ranked anime as “wish to watch”. I did not try this idea out.

## Conclusion

I must say this competition is very interesting, because I have seen no other competition that targets anime/manga prediction. Another good point is that the training data is very small, so I could do CV efficiently on my single workstation. Before this competition, I had never tried a stack net. The competition gave me some experience in how to do model stacking in an engineering-friendly way.

One regret is that too few competitors took part. Though I called for participants on Bangumi, it seems not many people joined. The organizers should make their website more popular before holding the next data challenge!

One more thing: you may be interested in the code. I put all my code here, but it is not arranged in an organized way. I think the most important files are “FeatureExtraction.ipynb” and “aggregation.py”, which cover how to do feature engineering and how to partition features. “CV.ipynb” gives some intuition on how to train models.

### 2017-04-14 Console as a SQL interface for quick text file processing

uid    name    nickname    joindate    activedate
7    7    lorien.    2008-07-14    2010-06-05
2    2    陈永仁    2008-07-14    2017-02-17
8    8    堂堂    2008-07-14    2008-07-14
9    9    lxl711    2008-07-14    2008-07-14
name    iid    typ    state    adddate    rate    tags
2    189708    real    dropped    2016-10-06
2    76371    real    dropped    2015-11-07
2    119224    real    dropped    2015-03-04
2    100734    real    dropped    2014-10-09
subjectid    authenticid    subjectname    subjecttype    rank    date    votenum    favnum    tags
1    1    第一次的親密接觸    book    1069    1999-11-01    57    [7, 84, 0, 3, 2]    小説:1;NN:1;1999:1;国:1;台湾:4;网络:2;三次元:5;轻舞飞扬:9;国产:2;爱情:9;经典:5;少女系:1;蔡智恒:8;小说:5;痞子蔡:20;书籍:1
2    2    坟场    music    272        421    [108, 538, 50, 18, 20]    陈老师:1;银魂:1;冷泉夜月:1;中配:1;银魂中配:1;治愈系:1;银他妈:1;神还原:1;恶搞:1;陈绮贞:9
4    4    合金弹头7    game    2396    2008-07-17    120    [14, 164, 6, 3, 2]    STG:1;结束:1;暴力:1;动作:1;SNK:10;汉化:1;2008:1;六星:1;合金弹头:26;ACT:10;NDS:38;Metal_Slug_7:6;诚意不足:2;移植:2
6    6    军团要塞2    game    895    2007-10-10    107    [15, 108, 23, 9, 7]    抓好社会主义精神文明建设:3;团队要塞:3;帽子:5;出门杀:1;半条命2:5;Valve:31;PC:13;军团要塞:7;军团要塞2:24;FPS:26;经典:6;tf:1;枪枪枪:4;2007:2;STEAM:25;TF2:15


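1. Not real-time. By “real-time” I do not mean that today is April 16 while the data is only from February; I mean that I cannot guarantee the data is a snapshot taken at a single point in time. For user data, since one crawl takes two days, users may have changed their nicknames and usernames during those two days without this being reflected in the crawled data. A more serious problem is that, for collection data, users may perform collection operations while the data is being crawled, leading to duplicated or missing records. And since user data and collection data are crawled separately, I cannot guarantee that the two tables can be joined one-to-one by username.
2. Not ordered. This can be seen from the data preview.
3. Flaws in the crawler itself. Because my crawler does not handle 500 errors from Bangumi, some data is missing.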

## 1. SELECT … WHERE … ORDER BY …

### Filtering the winter 2017 anime

90


85 anime_selection.tsv
122772    122772    六心公主    anime        2016-12-30    26    [19, 41, 1, 1, 4]    17冬:1;原创:1;PONCOTAN:4;2016年:2;广桥凉:1;TVSP:1;池赖宏:1;原优子:1;mebae:1;TV:4;日本动画:1;片山慎三:1;Studio:1;STUDIOPONCOTAN:4;2016:5;TVA:1;短片:2;上田繁:1;搞笑:4;中川大地:2;岛津裕之:2;种崎敦美:1;2017年1月:1;テレビアニメ:1;オリジナル:1;SP:1;6HP:2;村上隆:10;未确定:1
125900    125900    锁链战记～赫克瑟塔斯之光～    anime    3065    2017-01-07    88    [66, 24, 216, 20, 60]    山下大辉:3;17冬:1;原创:1;游戏改:47;CC:1;花泽香菜:7;TV:22;未确定:2;グラフィニカ:2;佐仓绫音:4;2017年1月:61;锁链战记:1;2017:10;锁链战记～Haecceitas的闪光～:15;热血:2;チェインクロ:1;石田彰:22;声优:2;2017年:4;Telecom_Animation_Film:1;十文字:1;柳田淳一:1;战斗:2;内田真礼:2;剧场版:1;奇幻:17;2017·01·07:1;工藤昌史:3;2015年10月:1;TelecomAnimationFilm:9
126185    126185    POPIN Q    anime        2016-12-23    10    [134, 11, 3, 3, 0]    荒井修子:1;黒星紅白:4;原创:3;黑星红白:1;2016年:5;_Q:1;日本动画:1;2016年12月:2;未确定:1;小泽亚李:1;2017:2;2016:5;动画电影:1;2017年:5;Q:3;东映动画:1;种崎敦美:1;2017年1月:1;宫原直树:1;POPIN:6;東映アニメーション:12;剧场版:24;东映:4;萌系画风:1;濑户麻沙美:5
131901    131901    神怒之日    anime        2017-10-01    0    [79, 1, 0, 3, 1]    GENCO:3;2017年10月:2;TV:4;未确定:2;2017年:2;GAL改:4;游戏改:4;LIGHT:2;2017:3;エロゲ改:3;2017年1月:1


### Extracting the tag list

122772    六心公主    村上隆    10
122772    六心公主    2016    5
122772    六心公主    PONCOTAN    4
122772    六心公主    STUDIOPONCOTAN    4
122772    六心公主    TV    4
122772    六心公主    搞笑    4
122772    六心公主    2016年    2
122772    六心公主    6HP    2
122772    六心公主    中川大地    2
122772    六心公主    岛津裕之    2
122772    六心公主    短片    2
122772    六心公主    テレビアニメ    1
122772    六心公主    オリジナル    1
122772    六心公主    17冬    1
122772    六心公主    2017年1月    1
122772    六心公主    mebae    1
122772    六心公主    SP    1
122772    六心公主    Studio    1
122772    六心公主    TVA    1
122772    六心公主    TVSP    1
sort: write failed: standard output: Broken pipe
sort: write error


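## Summary

### 2016-02-05 Building more scientific rankings with rankit

rankit is a library of common ranking algorithms built on linear algebra and optimization theory. I wrote this library. I say “common”, but most people have never heard of these algorithms at all; for researchers who specialize in ranking, however, the algorithms included are fairly basic. After reading a book on ranking, Who’s #1?: The Science of Rating and Ranking (《谁排第一？关于评价和排序的科学》), I felt it necessary to introduce the treasures in it to researchers who know nothing about them. As a graduate student doing machine-learning-related research, I had never heard of any ranking algorithm other than PageRank and HITS: they are not taught in class and research does not touch them, because scenarios that apply these ranking algorithms are rarely encountered in daily life. Yet these ranking algorithms are thriving in American sports rankings. Most of the cases in that book come directly from American football rankings.

## How can a beginner use rankit?

rankit provides Converter: as long as the user can provide the result of each round of competition between the ranked objects (represented as a pandas.DataFrame containing specific columns), it can compute the matrices needed by the subsequent algorithms. Different mathematical models require different observations. Every algorithm has its own physical meaning, so using an algorithm means using the matrix that corresponds to that meaning. For example, MarkovRank needs a matrix representing the results of the ranked objects voting for each other. rankit provides RateDifferenceVoteMatrix, SimpleDifferenceVoteMatrix and RateVoteMatrix to express observed rating results as such matrices. As for which matrix suits which algorithm, I have provided a reference table on rankit’s GitHub page.

rankit also provides rank merging algorithms; merging makes the rankings produced by multiple algorithms more stable. Users can also build their own rankings and add them to the rank merger to generate a merged ranking. In addition, to compare rankings, rankit provides two measures that describe the distance between ranking lists.

## Ranking Bangumi anime!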

1. Colley Rank (colley)
2. Massey Rank (massey)
3. Difference Rank (differ)
4. Markov Rank, using rate vote matrix as input (markov_rv)
5. Markov Rank, using rate difference vote matrix as input (markov_rdv)
6. Markov Rank, using simple difference vote matrix as input (markov_sdv)
7. Offence-defence Rank (od)
8. Keener Rank (no bias) (keener)

id rank
253 1
326 2
324 3
265 4
237 5
321 6
6049 7
1728 8
110467 9
2907 10
340 11
839 12
1608 13
876 14
3302 15
238 16
2734 17
1428 18
120700 19
37460 20