Display detailed information for each anime, including some interesting statistics (which requires extra data mining)
Connect all the anime together through a tag system
…

But so far, I have no plan to make Bangumi Research a client that one can log in to.

After several weekends of effort, the next generation of Bangumi Research, https://chii.ai, has taken shape. The name chii.ai derives from Bangumi's domain name, chii.in. Below is the engineering design of the new site:
Backend design
Visitors to Bangumi Research never trigger write operations, and their primary purpose is to view our scientific ranking. Beyond that, they may look up an anime's details or search by tag. I don't need to care about the API for viewing anime details, since the Bangumi API can be reused directly. I mainly focus on the design of the following three APIs (this is a conceptual illustration; the actual implementation differs):
/api/rank
This API should return an array with the following structure:

```csharp
[{
    int id;
    DateTime date;
    string name;
    string nameCN;
    int rank;      // original Bangumi rank
    int sciRank;   // scientific rank
    string type;
    int votenum;   // number of ratings
    int favnum;    // number of favorites
    List<Tag> tags;
}]
```
This covers an anime's basic information plus its associated tags. The API is designed this way purely because the Bangumi API does not return an anime's tags, so I have to build my own system that returns them.
The Tag data structure is:

```csharp
{
    string Tag;
    int TagCount;
    int UserCount;
    double Confidence;
}
```
The database needs to support these operations. In theory, two tables suffice: a Subject table and a Tag table. The Subject table stores all anime entries, keyed by anime id. The Tag table stores the tags. At first glance tags and subjects form a many-to-many relationship, but in the API design above the same tag has a different Confidence on different subjects, which forces tags and subjects into a many-to-one design: in the Tag table, each tag is uniquely identified by its content plus the subject it is attached to. The Subject table should also hold the scientific rank. However, Bangumi Spider generates the scientific ranking as a separate file, so for compatibility with this legacy the scientific ranking becomes a table of its own, joined one-to-one to the Subject table via a foreign key.
chii.ai keeps using ASP.NET Core for the backend. With Linq and EntityFramework behind it, the development experience is pleasant. The web server exposes several RESTful API endpoints.
To integrate the Bangumi API, I added a Node service that merges the database API and the Bangumi API into one unified API surface. I could instead have proxied the Bangumi API with an nginx server, but my real intent was to use GraphQL on the frontend. So this Node service is an Apollo GraphQL Server that provides the merged schema and API.
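Conceptually, the Node service just resolves each GraphQL field against one of the two upstreams. A minimal sketch, assuming an illustrative schema and endpoint names (this is not the real chii.ai schema):

```js
const { ApolloServer, gql } = require("apollo-server");
const fetch = require("node-fetch");

// Illustrative schema: rank comes from the .NET Core backend,
// subject details are proxied to the Bangumi API.
const typeDefs = gql`
  type Tag { tag: String, tagCount: Int, userCount: Int, confidence: Float }
  type Subject { id: Int!, name: String, nameCN: String, sciRank: Int, tags: [Tag] }
  type Query {
    rank: [Subject]
    subject(id: Int!): Subject
  }
`;

const resolvers = {
  Query: {
    // Hypothetical backend URL; the point is that each field fans out
    // to a different upstream behind one unified schema.
    rank: () => fetch("http://backend/api/rank").then((r) => r.json()),
    subject: (_, { id }) =>
      fetch(`https://api.bgm.tv/subject/${id}`).then((r) => r.json()),
  },
};

new ApolloServer({ typeDefs, resolvers }).listen(4000);
```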
Frontend design
Why GraphQL? If it were merely about "developing against GraphQL-style endpoints", that alone would make me less likely to choose it; after all, it is arguably less convenient than a RESTful API. But the Apollo GraphQL ecosystem is so mature that GraphQL + Apollo Client becomes a very attractive option.
What I like most is Apollo Client's automatic caching of queries. Imagine loading the https://chii.ai front page: it first loads the full anime ranking. The user may click to another page and come back; since the ranking results were cached on the first query, no second request needs to be sent to the server. Sounds attractive, doesn't it? And Apollo Client makes it all work under the hood!

How does this caching system operate? Apollo Client caches every query it sends. That sounds absurd at first: using each query as the key and its result as the value could consume far too much memory. But in GraphQL we know the structure of the returned data, and different queries may share the same data structures with different contents. So Apollo Client uses the returned object's id and its type as the key, and the returned data as the value. In the ideal case, every nested structure carries a typename and an id, which lets the cache completely normalize all returned data. Keying the cache by typename and id greatly reduces memory usage compared with keying by the query itself.
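A sketch of what such a normalized cache looks like; the shapes and the Anime type are assumptions for illustration, not Apollo's exact internals:

```js
// After `query { rank { id name } }` returns two Anime objects,
// each entity is stored once under a "Type:id" key, and the query
// itself only keeps references to those entities.
const cache = {
  "Anime:1": { __typename: "Anime", id: 1, name: "..." },
  "Anime:2": { __typename: "Anime", id: 2, name: "..." },
  ROOT_QUERY: {
    rank: [{ __ref: "Anime:1" }, { __ref: "Anime:2" }],
  },
};
```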
There are further subtleties. Say a ranking query returns a list of anime, and the user clicks into one of them, firing a second query for that anime's details. The first query returned a list of Anime objects; the second returns a single Anime. Since that anime has a unique ID, its key in the cache is unique too, so the second query's result actually updates the matching anime inside the first query's list: it overwrites the previous value. This is natural, since the two results are the same entity and the backend ought to return consistent data for both. If your backend does not behave that way (say, some APIs return some fields, other APIs return others, or even different contents for the same field), you will run into bugs.
As the architecture diagram above shows, the site consists of five parts. First, .NET Core and a Postgres server form the core of the backend logic. Second, Apollo Server runs on Node as the GraphQL server; it receives frontend requests and issues API requests to the real backend. Apollo Server also wraps some of the Bangumi API, making it an API hub. To reduce backend load, Apollo Server is backed by Redis as a cache. The frontend is a React application deployed behind Nginx. The site is developed with docker compose and deployed on Azure Kubernetes Service.
Some reflections
Is this architecture the best one? Probably not. My GraphQL server is not a native GraphQL service, which introduces some latency: a request has to go through the GraphQL server and then be relayed to the .NET Core server. The logic behind this is that I wanted to develop with the GraphQL ecosystem, which left me with a backend exposing both RESTful APIs and GraphQL.

How could this Node-server-as-GraphQL-relay setup be avoided? The ideal answer is to rewrite all the backend logic in Node, but my main backend code lives in .NET Core, and rewriting it would waste too much time.

One thought: what Apollo Server mainly does is act as a resolver. Could I use a service worker on the frontend, run the resolver as a separate process, and have the frontend's main process simply talk to that service worker? That would effectively move a slice of backend logic into the frontend and simplify the backend considerably. It requires me to study service workers in depth.
Although I had already been drifting toward the homebody life since early 2020 (say, taking two weeks off and staying home the whole time), work from home, this new style of blending work and life, is something I really cannot accept. People who know me know I have inherited the French lifestyle: leave work at 6 o'clock sharp! But when working from home, there is no such thing as "getting off work". During normal working hours I may also be grocery shopping, cooking, or sleeping; a day slips by, and at 6 o'clock I have no idea what I actually accomplished. What I missed even more was my clarinet, left at the office. I have been wanting to play Liz and the Blue Bird, but as long as the office stayed closed I could go nowhere. Fortunately, the office conditionally reopened at the end of February, and I was among the first to go back. Working at the office, with an instrument to play, is comfortable.
Although I studied machine learning, I have been away from it for years; my actual job is frontend work. Since I am already familiar with the projects at work, actual development does not take me much time. But frontend technology evolves every day, and naturally I was considering better opportunities. At the end of March, Yongdong suddenly sent out a mail calling for developers for Microsoft Teams. The key point: a new team would be founded in Suzhou, focusing on Microsoft Teams frontend and mobile development, and React developers were needed! As everyone knows, with COVID-19 spreading across the world, remote work became the norm, so a call for developers was to be expected. I went to this team immediately, as a short-term volunteer.
This team indeed has what I wanted: strict coding standards and cutting-edge technology, and of course code reviewers you can ping for ten thousand years without an answer. This team actually made me feel that I had truly joined Microsoft. You can sense that the culture of this Sharepoint group differs from that of my previous team, which was close to a machine learning startup. Everyone seems well organized and happy to join company activities. Such a regular army looks far more effective than my original team.
I have been a Kaggle fan for a long time. A community of data scientists and engineers devoted to pioneering data science practice has always been attractive to me. Though I'm not a dedicated Kaggler, I still devote several months' worth of weekends per year to Kaggle competitions after work, to grasp that spirit of dedication and religious attitude. That's why I registered the moment I received the registration notification email.
Besides the lectures and Grandmasters, another thing that specifically attracted me was the offline data science competition. I have been curious about the authentic ability of offline coding, since in my opinion most Kagglers online are fed by public Kernels. What would their real performance be without Kernels? Though I'm only a linear-model Kaggler, I'm certain I am somewhat more experienced than others in feature selection, so there might be a chance for me to win.
I checked the previous Kaggle Days before this event, and all the offline competitions were about tabular data. So I tried to get myself familiar with the common EDA APIs and the code snippets for pandas feature generation and scikit-learn compatible cross-validation. I know deep down that my skills in traditional machine learning cannot reach a high place on the leaderboard in the age of deep learning, so I invited one of my colleagues, Lihao, who is a deep learning expert, to join me.
DAY 1
The first day of the event was all about lectures and workshops. There were several interesting workshops to attend, but you had to register first. The first thing I regretted was that there was a lecture about LightGBM I really wanted to attend, but it conflicted with a workshop about modeling pipelines. In fact, I got to that lecture only as it was about to end, and even so I still learned something insightful from it. I may need to review the lecture videos later.
Someone said before that the whole point of attending a technical meeting is to chat with people: no lectures, no workshops, just communicating. And I have to say this was the best part of Kaggle Days. I did talk with a lot of people. Still, I was too shy during the meeting, because it was never me who tried to get to know others first. I have to say everybody in the Expert group has their own domain knowledge; not all of them are necessarily experienced Kagglers, but they know the AI industry in China very well. As an SDE working in a small city, Suzhou, I had not felt this excitement of communicating with industry experts since I left Beijing in 2015.
At the end of the day, the organizer disclosed the topic of the next day's data science competition. Though I had expected another tabular data competition, the title indicated it would be a computer vision competition. It reminded me of a previous competition classifying astronomical objects, though it would not necessarily take the same form. Having no practical knowledge of contemporary computer vision, which deep learning has dominated, I regretted not following my advisor in China, Lianwen Jin, more closely when I was in graduate school. My working experience could not contribute to this competition either, since I work in an NLP group. Fortunately, as we were about to leave, a guy came over and asked if he could join us, saying he had some CV background. This was perhaps the best news I received that day, and I was grateful to him.
DAY 2
I had decided the night before that if the competition really was a computer vision competition, I would resort to fast.ai. I learned it this summer, and it was the only thing I knew how to use in modern deep-learning-based CV. It turned out it was CV indeed: the competition required us to classify images into two classes, a typical binary classification problem.
CV requires GPU-equipped machines, and on the night before the competition we had to configure our machines on a designated service provided by UCloud. It was actually a Jupyter Notebook backed by 8 GPUs. However, without proper configuration that machine was almost unusable: it had TF 2.0 alpha installed, neither the final release nor the stable TF 1.14. So Lihao spent much of the morning configuring the machine.
I originally thought one needed to perform EDA and a proper train/test split first, but I soon discarded this idea for this CV competition. However, Williame Lee, the guy I mentioned above, spent some time inspecting the data first, trying to find patterns in the images. In my opinion, though, features are extracted automatically by deep neural networks, and even if we found some patterns, we wouldn't know where to feed them once a deep neural network does the feature extraction.
The core spirit of fast.ai is to classify images with pre-trained networks and fine-tune them at the end. This turned out to be a very successful idea. I used the whole morning to build the pipeline, and it worked! My first classifier, using ResNet34 as the pretrained model, performed as well as the baseline. Later, Lihao trained this model further to push it to 0.85, and we tried several other models such as ResNet18 and ResNet50. Even a network as simple as ResNet18 achieved a good 0.82 after fine-tuning. Williame also built a model on MobileNet that achieved 0.81 on the public leaderboard.
Meanwhile, Fengari shared his 0.88 baseline, which used EfficientNet. You can imagine how many competitors it fed. Lihao then switched to this new baseline and adapted its cross-validation scheme. In the end, we merged our three ResNet pretrained models and two EfficientNet adaptations as the final result. That placed us 17th on the leaderboard (out of 34 teams). Not bad for my first CV competition!
Day 2 felt like being thrown into a swimming pool (I really don't know how to swim) and learning to swim by myself. I successfully trained a deep neural network for computer vision for the first time. Now I don't fear CV any more!
The organizer soon announced the winners, who had been sitting right in front of us during the whole competition. They used multi-task learning to improve their model, a key technique Williame had hinted at in the morning. Their solution is here: https://github.com/okotaku/kaggle_days_china
Before this Kaggle Day, my ambition was to stand on the winners' stage. Unfortunately, I still have many things to learn to achieve that goal. I asked Lihao later whether this was the ideal tech venue for him; his answer was no, but he still praised the core spirit of the Kaggle community. I hope next year I can find someone who bears the same mindset as me and debut together. And if I could cross-dress next time, all the better!
Front-end development in React is all about updating components. As the business logic of a webpage is described in render() and lifecycle hooks, setting states and props properly is the core priority of every React developer. It relates not only to functionality but also to rendering efficiency. In this article, I start from the basics of React component updates, then look at some common errors React novices often make. Some optimization techniques are also described to help novices build sound habits when thinking in React.
Basics
A webpage written in React is driven by state. That is to say, every state change in a React component leads to a change in the webpage's appearance. Since state can be passed down to child components as props, changes of state and props are responsible for all variation of the view. However, there are two key principles pointed out by the React documentation:
One cannot change props directly.

One can only update state by using setState().
These two constraints stem from how React works. React's data flow is unidirectional, so one cannot mutate props in a child component to affect its parent. setState() is tied into a component's lifecycle hooks; any attempt to change state without using setState() bypasses what those hooks provide.
However, in my development experience I have observed numerous cases where these two principles were broken. A major share of the misbehaving patterns can be traced back to selective ignorance of an important property of JavaScript.
Assignment: reference or value?
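A minimal sketch of the first example discussed below (the names temp and obj follow the prose; each half illustrates one behavior, and the first reassignment throws):

```js
const temp = 1;
temp = 3; // TypeError: Assignment to constant variable.

const obj = { a: 1 };
obj.a = 2; // works: const does not freeze obj's internal fields
```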
Let's look at the example above. We all know const declares a variable that cannot be assigned a second time, so there is no doubt why a TypeError is thrown when we try to assign 3 to temp. However, const does not imply constness of an object's internal fields: when we mutate an internal field of obj, it just works.
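The second snippet, reconstructed as a sketch (the names a, b, c, d follow the prose), contrasts value assignment with reference assignment:

```js
let a = 1;
let b = a;   // b gets a copy of the value
b = 2;
console.log(a !== b); // true: a is still 1

const c = { x: 1 };
const d = c; // d gets a reference, not a copy
d.x = 2;
console.log(c === d); // true: c and d point to the same object
```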
This is another common operation. From the script we know a and b are two non-object variables, and reassigning b leads to a !== b. However, when we assign d as c, which is an object, mutating an internal field does not change the equality between the two. That implies d is a reference to c.
So we can conclude two observations from the above:
const does not mean constness of an object's fields. It certainly cannot prevent a developer from mutating its internal fields.
Assigning an object to another variable would pass on its reference.
Having acknowledged the above, we can go on to the following code:
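The original listing was not preserved here, so below is a reconstructed sketch; onChange, data, and the field a are names taken from the discussion, and the line numbers referenced later are marked in comments:

```jsx
import React from "react";

class Demo extends React.Component {     // line 1
  constructor(props) {                   // line 2
    super(props);                        // line 3
    this.state = {                       // line 4
      data: props.data,                  // line 5: a reference, not a copy
    };                                   // line 6
  }                                      // line 7
                                         // line 8
  onChange(value) {                      // line 9
    const prevData = this.state.data;    // line 10: still the same reference
    const nextData = [];                 // line 11
    prevData.forEach((item) => {         // line 12
      item.a = value;                    // line 13: mutates state AND props
      nextData.push(item);               // line 14
    });                                  // line 15
    // ...                               // line 16
    // ...                               // line 17
    this.setState({                      // line 18
      data: nextData,                    // line 19
    });                                  // line 20
  }
}
```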
As you read the code, you can clearly see the author's intention: onChange is the place where this.state gets changed according to incoming parameters. The author first takes a copy of the original data, then modifies its values, then pushes them to nextData. At last, the author calls this.setState to update.
If this pattern appears in your projects and it works, that is no surprise. According to React's component lifecycle, this.state.data is changed to nextData, and that eventually affects render()'s return value. However, there is a series of flaws in this code that I have to point out. If you fully understand and agree with the two observations above, the following points should make you uncomfortable:
data: props.data in line 5 assigns this.props.data's reference to this.state.data, which means changing this.state.data directly COULD mutate this.props.data.

prevData is assigned as a reference to this.state.data in line 10. As you read through the code, you realize this is not the author's real intention: he wants to "separate" prevData from this.state.data by using const. That is a total misunderstanding of const.

In line 13, each item in prevData is mutated by assigning its field a another value. But as mentioned before, prevData is a reference to this.state.data, and this.state.data is a reference to this.props.data. So the author has changed the content of this.state.data without using setState, and modified this.props.data from a child component!

In lines 18–20, the author finally calls setState to update this.state.data. However, since the state was already changed in line 13, this happens too late. (Perhaps the only good news is that this.state.data is no longer a reference to this.props.data afterwards.)
Well, someone may claim: so what? My page works properly! Perhaps those people do not understand what lifecycle hooks provide. Usually, people write their business logic in lifecycle hooks, such as deriving state from props, or performing a fetch call when some prop changes. In such cases, we may write something like the following:
```js
componentDidUpdate(prevProps) {
  if (this.props.data !== prevProps.data) {
    // business logic
  }
}
```
Every time a component finishes an update, it calls componentDidUpdate. This happens whenever setState is called or a prop changes.
Unfortunately, if a novice developer has unintentionally mutated this.state or this.props, these lifecycle hooks will not fire as expected and will certainly cause some surprising behavior.
How to make every update under control?
If you are a lazy person who likes the natural thinking of a temporary variable separated from the original, as displayed above, you are welcome to use immer. Every time you try to update state, it provides you a draft, and you can modify whatever you want on it before returning. An example is given in its documentation.
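A sketch with immer, assuming the same data shape as the component above (the curried produce call turns a draft-mutating recipe into a setState updater):

```js
import React from "react";
import produce from "immer";

class Demo extends React.Component {
  onChange(value) {
    // Mutate the draft freely; immer derives an immutable next state.
    this.setState(
      produce((draft) => {
        draft.data.forEach((item) => {
          item.a = value; // recorded on the draft, not on the real state
        });
      })
    );
  }
}
```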
However, you should know that the proper way to update a state field without modifying it directly through a reference is to perform a clone. The clone sometimes needs to be deep to make sure every field is a thorough copy of the original, not a reference. One can achieve that with lodash's cloneDeep, but I do not recommend it, as it may be too costly; only in rare cases will you need cloneDeep.
Rather, I recommend Object.assign(target, ...sources). This function updates target with the fields of sources and returns target when the update is complete; pass a fresh object as target and the sources stay untouched. So updating an object should look like:
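A minimal sketch:

```js
const original = { a: 1, b: 2 };
// Merge the update into a fresh target; the original is untouched.
const next = Object.assign({}, original, { a: 3 });
console.log(next);       // { a: 3, b: 2 }
console.log(original.a); // 1
```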
Actual programming can be even easier: spread syntax is available for expanding an object or array. Using spread syntax, you can easily create a new object or array by writing:
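```js
const obj1 = { a: 1, b: 2 };
const obj2 = { ...obj1, a: 3 }; // new object: { a: 3, b: 2 }

const arr1 = [1, 2];
const arr2 = [...arr1, 3];      // new array: [1, 2, 3]
```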
That lets you copy the original content of an object/array into a new object/array. The copy here is a shallow copy, which means only the outermost object/array is new; if there is an object inside a field of the original object, or at some index of an array, that inner object is copied as a reference. This avoids the costly operations inside cloneDeep. I like this, because it gives you precise control over what you need to change.
So the proper component I gave above should be like this:
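A corrected sketch of the same hypothetical component: every changed object is copied along the update path, and state is only ever written through setState.

```jsx
import React from "react";

class Demo extends React.Component {
  constructor(props) {
    super(props);
    this.state = {
      // Copy the array so state no longer aliases props.data.
      data: [...props.data],
    };
  }

  onChange(value) {
    // Copy each item before changing it; the originals stay untouched.
    const nextData = this.state.data.map((item) => ({ ...item, a: value }));
    this.setState({ data: nextData });
  }
}
```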
Some further advice
I would suggest that all components you write from now on be React.PureComponent. It is no different from React.Component except that it has a default shouldComponentUpdate: it performs a shallow comparison of state and props to check whether the component should update, whereas React.Component always returns true unless you provide custom logic. This not only improves the page's performance but also helps you notice the unexpected rendering when you have made the mistakes mentioned above.
If you need similar functionality on function components, you can try React.memo(), available since React 16.6.
```js
const regex = /<(\w+)>/g
const testString = "<location>"

// Check whether testString contains pattern "<slot>"
// If yes, extract the slot name
if (regex.test(testString)) {
  console.log(regex.exec(testString)[1])
}
```
It looks like perfectly sound logic for extracting the tag name from a tag string: test guards against the case where testString does not match at all. However, it throws an error:
```
TypeError: regex.exec(...) is null
```
This violates our intuition: the string does match the regex, yet exec fails to match. So why does this happen?
As MDN states, invoking test on a regex with the "global" flag set updates lastIndex on that regex. lastIndex lets the user keep testing the original string, since there may be more than one match, and it affects the exec call that follows. So the proper usage of a regex with the "g" flag is to keep executing it until it returns null, which means no further match can be found in the string.
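A quick way to see the stateful behavior, using the snippet above:

```js
const regex = /<(\w+)>/g

console.log(regex.test("<location>")) // true; regex.lastIndex is now 10
console.log(regex.exec("<location>")) // null: exec resumes from index 10
console.log(regex.lastIndex)          // 0 again after the failed match
```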
This behavior is sometimes undesirable, since in the above scenario, I just want to know whether the string matches the predefined pattern. And the reason I don’t want to create the regex on the fly is to save the overhead of creating an object.
One obvious solution is to remove the "g" flag. But sometimes we do want to keep it, and we don't wish to modify the regex's internal state just to check whether a string matches. In that case, one can switch to string.search(regex), which always returns the index of the first match, or -1 if there is none.
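For a pure membership test this sidesteps the problem entirely, since search ignores lastIndex and the "g" flag:

```js
console.log("<location>".search(/<(\w+)>/g)) // 0: index of the first match
console.log("location".search(/<(\w+)>/g))   // -1: no match
```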
After several years (really!) of development, I'm pleased to announce that rankit has been upgraded to v0.3. The previous major release was v0.1, made in 2016. So what has changed during these three years?
When rankit v0.1 was developed, it aimed to implement all the fascinating algorithms mentioned in a beautiful ranking book and provide an alternative ranking solution for Bangumi. The programming interface designed back then was far from practical. The more ranking algorithms I looked into, the more I felt rankit should include some popular ranking algorithms based on time series. One of them is Elo ranking, which got a simple implementation in rankit v0.2, but its usage scenario was limited: updating the ranking was tedious, and its interface was not compatible with the existing ranking algorithms.
In v0.3, the following updates address the problems mentioned above:

Split rankers into "Unsupervised ranker" and "Time series ranker". Each kind has its own interface, and both consume the Table object to keep records.

Introduced the Elo ranker, Glicko 2 ranker and TrueSkill ranker (only paired competition records are supported).

Updated the interface to make it more user-friendly. An unsupervised ranker provides only a rank method, since it needs no updates. For a time series ranker, the user should use update to accumulate new records, and can retrieve the latest leaderboard by invoking leaderboard after each update.
The documentation of rankit has also been updated.
One side product of rankit is the "Scientific animation ranking for Bangumi". For a long time the ranking was not updated, and it was gradually forgotten. I gave it a refresh last year, and it now has a completely new look with faster response and simple search functionality. More importantly, the ranking will be updated monthly. I invite you all to try it here: https://ranking.ikely.me
The latest scientific animation ranking also involves minor algorithm changes. From time to time, users with peculiar intentions pour into Bangumi to rate one specific anime uniformly high or low, which distorts the proper order of the ranking. The previous version of the scientific ranking could ignore users who rate one anime and leave permanently, but it could not handle users who rate multiple anime. I adjusted users' overall scores and applied several normalizations according to the distribution of each user's ratings; all versions of the normalized scores are fed into rankit to calculate the latest rank. The final rank is still merged using Borda count.
But could this be solved from another perspective? One thing I have been thinking about is how to bring time series ranking into the current ranking scheme. Ideally, time series ranking should combat ranking manipulation in a way that pairwise ranking cannot. As I was reading about TrueSkill, their brilliant idea of inferring rankings on a linear-chain graph struck me: TrueSkill is a generative graphical model organized in the same order as the competition outcome. Another issue to resolve is helping users adjust historical ratings automatically: a user may have rated anime across a wider range before, but his or her rating criteria change as time goes by. How do we propagate recent rating criteria back to historical ratings? All of this should be supported in the next version of rankit: ranking for multiple players in one match, and the power to propagate recent competition results to historical data.
Mangaki data challenge is an otaku-flavored data science competition. Its goal is to predict a user's preference for an unwatched/unread anime/manga between two choices: wish to watch/read, and don't want to watch/read. The competition's training data comes from https://mangaki.fr/, a site that lets users favorite their anime/manga works. Three major training tables are provided, as described below:
Wish table: about 10k rows

| User_id | Work_id | Wish |
| ------- | ------- | ---- |
| 0       | 233     | 1    |
| …       | …       | …    |
Record table: for already watched/read anime/manga. There are four rates here: love, like, neutral and dislike.

| User_id | Work_id | Rate    |
| ------- | ------- | ------- |
| 0       | 22      | like    |
| 2       | 33      | dislike |
| …       | …       | …       |
Work table: detailed information on the available anime/manga. There are three categories: anime, manga and album. There is only one album in this table; all the others are anime (about 7k) and manga (about 2k).

| Work_id | Title      | Category |
| ------- | ---------- | -------- |
| 0       | Some_anime | anime    |
| 1       | Some_manga | manga    |
| …       | …          | …        |
For the testing data, one should predict 100k user/work pairs on whether the user wishes or does not wish to watch/read the anime/manga. As you can see, the testing data is much larger than the training data. Moreover, from my analysis of the dataset, it is not even guaranteed that all users or works in the test set appear in the training set.
Traditional recommendation system methods (that I know)
Building recommendation systems has long been studied, and there are various methods for this particular problem. I, too, tried to build a recommender for https://bgm.tv several years ago (you can read the technical details here). The simplest solution is SVD (actually, an even simpler and more intuitive solution is KNN); one can then move on to RBM, FM, FFM and so on. One assumption that holds firm in all these methods is that each user has an embedding vector capturing their preferences, and each work has an embedding vector capturing its characteristics. But is it reasonable that we stay constrained to this embedding-dot-product model?
Recently, the common practice in Kaggle competitions is to use GBDT to solve (almost all, except computer vision related) problems. As long as a model handles classification, regression and ranking well, it can be applied to any supervised machine learning problem! And with model ensembling under the StackNet framework, one can combine models with different characteristics to achieve the best result.
In this competition, my solution is plain and straightforward: feature engineering to generate some embeddings, then GBDT/Random Forest/Factorization Machines to build models from different combinations of features. Finally, I used a two-level StackNet to ensemble them, in which level two is a logistic regression model.
Feature Engineering
From wish table:
Distribution of user’s preference on anime/manga (2d+2d)
Distribution of item’s preference (2d)
Word2vec embedding of user on wish-to-watch items (20d)
Word2vec embedding of user on not-wish-to-watch items (10d)
Word2vec embedding of item on wish-to-watch users (20d)
Word2vec embedding of item on not-wish-to-watch users (10d)
Lsi embedding of user (20d)
Lsi embedding of item (20d)
From record table:
Distribution of user’s preference on anime/manga (4d+4d)
Distribution of item’s preference (4d)
Mean/StdErr of user’s rating (2d)
Mean/StdErr of item’s rating (2d)
Word2vec embedding of user on loved and liked items (32d)
Word2vec embedding of user on disliked items (10d)
Word2vec embedding of item on loved and liked users (32d)
Word2vec embedding of item on disliked users (10d)
Lsi embedding of user (20d)
Lsi embedding of item (20d)
Lda topic distribution of user on love, like and neutral items (20d)
Lda topic distribution of item on love, like and neutral ratings (20d)
Item category (1d, categorical feature)
User Id (1d, only used in FM)
Item Id (1d, only used in FM)
Model ensembling
The first layer of the StackNet is a set of models that should have good predictive capability but different inductive biases. Here I tried just three models: GBDT, RF (both backed by LightGBM) and FM (backed by FastFM). I trained models from the record table features and the training (wish) table features separately, and one can further train different models using different combinations of features. For example, one can use all features (except user id and item id) from the record table. But since GBDT keeps its eye on the most informative features when given all of them, it helps to split features into several groups and train models separately. In this competition I did not split much (just because I didn't have much time); I just removed the first four features (because I saw from the prediction results that they were having a major effect on precision) and trained some other models.
In this competition, a single GBDT with all the features from the record table reaches 0.85567 on the LB. By leveraging the model stacking technique, one can reach 0.86155, which is my final score.
Is this the ultimate ceiling?
Definitely not. One can push the boundary much further:
I did not tune the embedding-generation parameters well. In fact, I generated those features using gensim's default parameters. The embedding dimensions were just my snap decisions, no science involved. Maybe one could enlarge word2vec's sliding window or use more embedding dimensions to achieve better results.
I did not introduce any features generated by deep models. GBDT is the kind of model that relies on heavy feature engineering, while a deep model learns features automatically. By combining them in a stacking model, one could definitely obtain a much higher AUC.
I did not use more complex features. Sometimes popularity ranking also affects user behavior: a user tends to mark highly ranked anime as "wish to watch". I did not try this idea out.
Conclusion
I must say this competition is very interesting, because I have seen no other competition targeting anime/manga prediction. Another good point is that the training data is very small, so I could do CV efficiently on my single workstation. And before this competition I had never tried StackNet. This competition gave me some experience in doing model stacking in an engineering-friendly way.
One regret is that too few competitors took part. Though I tried to call for participants on Bangumi, it seems not many people joined. The competition holders should make their website more popular before holding the next data challenge!
One more thing: you may be interested in the code. I put all my code here, though it is not arranged in an organized way. The most important files are "FeatureExtraction.ipynb" and "aggregation.py": they show how the feature engineering is done and how the features are partitioned. "CV.ipynb" gives some intuition on how the models are trained.
```bash
head user-2017-02-17T12_26_12-2017-02-19T06_06_44.tsv -n 5
head record-2017-02-20T14_03_27-2017-02-24T10_57_16.tsv -n 5
head subject-2017-02-26T00_28_51-2017-02-27T02_15_34.tsv -n 5
```
```
uid name nickname joindate activedate
7 7 lorien. 2008-07-14 2010-06-05
2 2 陈永仁 2008-07-14 2017-02-17
8 8 堂堂 2008-07-14 2008-07-14
9 9 lxl711 2008-07-14 2008-07-14
```
```
name iid typ state adddate rate tags
2 189708 real dropped 2016-10-06
2 76371 real dropped 2015-11-07
2 119224 real dropped 2015-03-04
2 100734 real dropped 2014-10-09
```
```
subjectid authenticid subjectname subjecttype rank date votenum favnum tags
1 1 第一次的親密接觸 book 1069 1999-11-01 57 [7, 84, 0, 3, 2] 小説:1;NN:1;1999:1;国:1;台湾:4;网络:2;三次元:5;轻舞飞扬:9;国产:2;爱情:9;经典:5;少女系:1;蔡智恒:8;小说:5;痞子蔡:20;书籍:1
2 2 坟场 music 272 421 [108, 538, 50, 18, 20] 陈老师:1;银魂:1;冷泉夜月:1;中配:1;银魂中配:1;治愈系:1;银他妈:1;神还原:1;恶搞:1;陈绮贞:9
4 4 合金弹头7 game 2396 2008-07-17 120 [14, 164, 6, 3, 2] STG:1;结束:1;暴力:1;动作:1;SNK:10;汉化:1;2008:1;六星:1;合金弹头:26;ACT:10;NDS:38;Metal_Slug_7:6;诚意不足:2;移植:2
6 6 军团要塞2 game 895 2007-10-10 107 [15, 108, 23, 9, 7] 抓好社会主义精神文明建设:3;团队要塞:3;帽子:5;出门杀:1;半条命2:5;Valve:31;PC:13;军团要塞:7;军团要塞2:24;FPS:26;经典:6;tf:1;枪枪枪:4;2007:2;STEAM:25;TF2:15
```
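The awk command described next was not preserved above; here is a sketch of what it plausibly looked like (the input file name comes from the dumps above, the output columns are assumed):

```bash
awk -F "\t" '{
  split($9, tags, ";");
  for (i in tags) {
    split(tags[i], kv, ":");
    printf("%d\t%s\t%d\n", $1, kv[1], kv[2]);
  }
}' < subject-2017-02-26T00_28_51-2017-02-27T02_15_34.tsv > subject_tags.tsv
```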
Since the tag list is separated by semicolons, we first use split($9, tags, ";") to store the split strings in the array tags. Then, in for(i in tags), i is actually the index; for each tag we call split again to obtain the tag itself and the number of users who applied it. As you can see, awk's per-line processing logic can be written in a C-like way, and it may span several lines, although I wrote everything on one line.
After that, we sort first by subject id, and then, within each subject, by the number of users who applied each tag. Here sort needs two -k options to specify the ordering; writing the -nr conditions after each sort key lets us order different columns with different logic.
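Something like the following sketch (the file name matches the hypothetical output above; the exact key flags are assumed):

```bash
sort -t$'\t' -k1,1n -k3,3nr subject_tags.tsv | head
```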
```bash
awk -F "\t" '{cnt[$2]+=1; cum[$2]+=$6}; END {for(i in cnt){printf("%d\t%f\t%d\n", i, cum[i]/cnt[i], cnt[i]);}}' < anime_record.tsv | sort -t$'\t' -k2,2nr -k3,3nr | head
```
Note that this again uses awk's dictionary data structure (the associative array), which you can think of as a Python dict. We use two variables, cnt and cum, to store the number of ratings and the rating sum for each anime id. At the end, the block wrapped in END produces the final result; statements inside END run only after the whole file has been traversed.
comm operates on the lines of each file, with the precondition that both files are already sorted. It can output three columns: lines appearing only in the first file, lines appearing only in the second file, and lines appearing in both. In other words, it computes set differences and intersections. By specifying -13, we output only the lines unique to the second file, i.e., the users not yet crawled into user_list.tsv.
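For example (the second file name is assumed), to list the users present in the full id list but missing from user_list.tsv:

```bash
comm -13 <(sort user_list.tsv) <(sort all_user_ids.tsv)
```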
4. Join
Getting the user ids in the collection data
The collection data records each user's username, but not the user id or nickname! This was a flaw in the spider's design, and the only remedy now is to join against the user data. But how do you perform a join on plain text files?
First, we extract two sets of data:
```bash
sed 1d record-2017-02-20T14_03_27-2017-02-24T10_57_16.tsv | sort -t$'\t' -k1,1 > record.sorted.tsv
head record.sorted.tsv
```