You tell me I'm wrong. Then you'd better prove you're right.

### 2020-01-03 Legacy problems of Bangumi Spider and a GitHub Actions-based solution

1. The Azure virtual machine was over-provisioned and sat idle most of the time, wasting budget;
2. The scrapyd container instance restarted after crawls, losing the deployed spiders; an investigation showed the instance had been broken into;
3. The ranking web application was not continuously deployed and had to be updated manually.

1. Automate the crontab job
2. Start the scrapyd server on demand and shut it down when the crawl finishes
3. Start the postprocessing job on demand and shut it down when it finishes
4. Update the postprocessing job whenever the spiders are updated
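These four requirements map naturally onto a scheduled GitHub Actions workflow. A minimal sketch (the schedule, script names and job layout here are assumptions, not the actual Bangumi Spider workflow):

```yaml
name: crawl-and-update-ranking
on:
  schedule:
    - cron: '0 0 1 * *'    # monthly; the actual schedule is an assumption
  workflow_dispatch: {}    # allow manual runs

jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # Hypothetical steps: start the scrapyd instance, deploy and run the
      # spiders, then tear the instance down when the crawl finishes.
      - run: ./scripts/start-scrapyd.sh
      - run: ./scripts/run-spiders.sh
      - run: ./scripts/stop-scrapyd.sh
  postprocess:
    needs: crawl
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # Hypothetical: the postprocessing job is built from the same repo,
      # so it stays in sync with spider updates.
      - run: ./scripts/postprocess.sh
```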

### GitHub Actions

According to the official description, GitHub Actions can greatly simplify your CI/CD pipeline. In practice, though, it is not limited to CI/CD. In this post I will describe how the GitHub Actions newly added to Bangumi Spider automate deployment and the ranking updates.

### Azure Web App Service

Azure offers the App Service product, which ships with an Azure-managed certificate and can also be bound to a custom domain. Crucially, this service is cheaper than hosting a virtual machine yourself (on the Basic plan). The best part of App Service is that it supports a docker image or docker-composed images, so I deployed the dotnet core service on it as a docker image as well. https://ranking.ikely.me uses CloudFlare as CDN, which also provides the certificate; see this article for how to import a CloudFlare certificate into Azure App Service.

### Achievements

• Fully automated, unattended ranking updates (unless Sai updates the Bangumi pages again and the spider needs matching changes)
• Conditionally update the docker image
• Slashed the budget: from about seventy dollars per month for the virtual machine to around fourteen
• Hardened scrapyd security

### 2019-10-26 My Kaggle Days China experience

I have been a Kaggle fan for a long time. A community of data scientists and engineers devoted to pioneering data science practice has always been attractive to me. Though I'm not a dedicated Kaggler, I still devote several months' worth of weekends per year to Kaggle competitions after work, to grasp the spirit of dedication and the religious attitude. That's why I registered the moment I received the registration notification email.

Besides the lectures and Grandmasters, another thing that specifically attracted me was the offline data science competition. I have been curious about the authentic ability of offline coding, since in my opinion most Kagglers online are fed by public Kernels. What would their real performance be without Kernels? Though I'm only a linear-model Kaggler, I'm certain I am somewhat more experienced than others in feature selection, so there might be a chance for me to win.

I checked the previous Kaggle Days before this event, and all the offline competitions had been about tabular data. So I tried to get familiar with the common EDA APIs and code snippets for pandas feature generation and scikit-learn compatible cross-validation. I know deeply that my skills in traditional machine learning cannot reach a high place on the leaderboard in the age of deep learning, so I invited one of my colleagues, Lihao, who is a deep learning expert, to join me.

### DAY 1

The first day of the event was all about lectures and workshops. There were several interesting workshops to attend, but you had to register first. The first thing I regretted was that there was a lecture about LightGBM that I really wanted to attend, but it conflicted with a workshop about modeling pipelines. In fact, I attended that lecture as it was about to end, and even so, I still learned something insightful from it. I may need to review the lecture videos later.

Someone once said that the only thing to do when attending a technical meeting is to chat with people: no lectures, no workshops, just communicating. And I have to say this was the best part of Kaggle Days. I did talk with a lot of people, though I was still too shy during the meeting, because it was never me who tried to get to know others first. I have to say everybody in the Expert group has their own domain knowledge; not all of them are necessarily experienced Kagglers, but they know China's AI industry very well. As an SDE working in a small city, Suzhou, I had not felt this excitement of communicating with industry experts for a long time, ever since I left Beijing in 2015.

At the end of the day, the organizer disclosed the title of the next day's data science competition. Though I had expected another tabular data competition, the title indicated it would be a computer vision competition. It reminded me of a previous competition about classifying astronomical objects, but it would not necessarily take the same form. Having no practical knowledge of contemporary computer vision, which deep learning has dominated, I regretted not following my domestic advisor Lianwen Jin more closely when I was in graduate school. My working experience could not contribute to this competition either, since I work in an NLP group. Fortunately, when we were about to leave, a guy came to us asking if he could join us. He said he had some CV background. This was perhaps the best news I received that day, and I was grateful to him.

### DAY 2

I had decided the night before that if the competition really was a computer vision competition, I would resort to fast.ai. I learned about it this summer, and it was the only thing I knew how to use in modern deep-learning-based CV. It turned out it was. The competition required us to classify images into two classes, a typical binary classification problem.

CV requires GPU-equipped machines, and on the night before the competition we were required to configure our machines on a designated service provided by UCloud. It was actually a Jupyter Notebook backed by 8 GPUs. However, without proper configuration that machine was almost unusable: it had tf 2.0 alpha installed, neither the final release version nor stable tf 1.14. So Lihao spent a lot of time configuring the machine in the morning.

I originally thought one needed to perform EDA and a proper train/test split first, but I soon discarded this idea for this CV competition. However, Williame Lee, the guy I mentioned above, spent some time inspecting the data first, trying to find patterns in the images. In my opinion, though, features are extracted automatically by deep neural networks, and even if we concluded some patterns, we wouldn't know where to feed them if a deep neural network does the feature extraction.

The core spirit of fast.ai is using pre-trained networks to classify images, fine-tuning them at the end. This turned out to be a very successful idea. I used the whole morning to build the pipeline, and it worked! My first classifier, using ResNet34 as the pretrained model, worked as well as the baseline. Later, Lihao trained this model further to push it to 0.85, and we tried several other models like ResNet18 and ResNet50. Even a network as simple as ResNet18 achieved a good result of 0.82 after fine-tuning. Williame also developed a model using MobileNet which achieved 0.81 on the public leaderboard.

Meanwhile, Fengari shared his 0.88 baseline, which used EfficientNet. You can imagine how many competitors this fed. Lihao then switched to this new baseline and adapted its cross-validation scheme. At last, we merged our three ResNet pretrained models and two EfficientNet adaptations as the final result. That placed us 17th on the leaderboard (out of 34 teams). Not too bad for my first CV competition!

Day 2 felt like being thrown into a swimming pool (I don't know how to swim, really) and learning to swim by myself. I successfully trained a deep neural network for computer vision for the first time. Now I'm not afraid of CV any more!

The organizer soon announced the winners, who had been sitting right in front of us during the whole competition. They used multi-tasking to improve their model, a key technique that Williame had hinted at in the morning. Their solution is here: https://github.com/okotaku/kaggle_days_china

Before this Kaggle Days, my ambition was to stand on the winners' stage. Unfortunately, I still have many things to learn to achieve that goal. I asked Lihao later whether this was the ideal tech venue for him; his answer was no, but he still praised the core spirit of the Kaggle community. I hope next year I will find someone who bears the same mindset as me and debut together. If I could cross-dress next time, all the better!

### 2019-08-01 Saving the React novice: update a component properly

Front-end development in React is all about updating components. As the business logic of a webpage is described in render() and the lifecycle hooks, setting state and props properly is the core priority of every React developer. It relates not only to functionality but also to rendering efficiency. In this article, I'm going to start from the basics of React component updates, then look at some common errors React novices often make. Some optimization techniques will also be described to help novices develop proper habits when thinking in React.

## Basics

A webpage written in React is a function of its state: every state change in a React component leads to a change in the webpage's appearance. Since state can be passed down to child components as props, changes of state and props are responsible for variations of the view. However, there are two key principles pointed out by the React documentation:

1. One cannot change props directly.
2. One can only update state by using setState().

These two constraints are linked to how React works. React's data flow is unidirectional, so one cannot mutate props from a child component to a parent component. setState() is tied to a component's lifecycle hooks; any attempt to change state without using setState() will bypass the lifecycle hooks' functionality.
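For example, a minimal sketch with a stand-in class in place of a real React component (real React batches updates and schedules the re-render inside setState):

```javascript
// Stand-in for a React component: render() is what React would call on update.
class Counter {
  constructor() {
    this.state = { count: 0 };
    this.renders = 0;
  }
  render() { this.renders += 1; }
  setState(next) {            // simplified: real setState batches and schedules
    this.state = { ...this.state, ...next };
    this.render();
  }
}

const c = new Counter();
c.state.count = 1;            // direct mutation: no render, no lifecycle hooks
c.setState({ count: 2 });     // proper update: render() runs
console.log(c.renders);       // 1 – only the setState call triggered a render
```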

However, during my development experience I have observed numerous cases where these two principles are broken. A major part of those misbehaving patterns can be traced back to selective ignorance of an important property of JavaScript.

## Assignment: reference or value?
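A reconstruction of the example in question (the variable names temp and obj follow the discussion below):

```javascript
const temp = 1;
// temp = 3;          // TypeError: Assignment to constant variable.

const obj = { a: 1 };
obj.a = 2;            // works: const does not freeze the object's fields
console.log(obj.a);   // 2
```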

Let's look at the example above. We all know const declares a variable that cannot be assigned a second time, so there's no doubt why a TypeError is thrown when we try to assign 3 to temp. However, const does not imply constness of an object's internal fields. When we try to mutate the internal field of obj, it just works.
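And a reconstruction of the second script:

```javascript
let a = 1;
let b = a;            // primitives are copied by value
b = 2;
console.log(a !== b); // true

const c = { field: 1 };
const d = c;          // objects are copied by reference
d.field = 2;
console.log(c === d); // true – mutating d's field did not separate them
console.log(c.field); // 2
```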

This is another common operation. From the script we know a and b are two non-object variables, and re-assigning b leads to a !== b. However, when we assign c, which is an object, to d, mutating an internal field does not change the equality between the two. That implies d is a reference to c.

So we can conclude two observations from the above:

1. const does not mean constness of an object's fields. It certainly cannot prevent developers from mutating the internal fields.
2. Assigning an object to another variable passes on its reference.

Having acknowledged the above, we can go on to the following code:
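A runnable reconstruction of the pattern in question (a stand-in base class replaces React.Component so the sketch runs outside React; comments mark the lines the numbered points below refer to):

```javascript
// Stand-in for React.Component, so the sketch runs outside React (hypothetical).
class Component {
  constructor(props) { this.props = props; }
  setState(next) { this.state = { ...this.state, ...next }; }
}

class DataList extends Component {
  constructor(props) {
    super(props);
    this.state = {
      data: props.data,                    // "line 5": a reference, not a copy
    };
  }
  onChange(id, value) {
    const prevData = this.state.data;      // "line 10": still the same array
    const nextData = [];
    prevData.forEach(item => {
      if (item.id === id) {
        item.a = value;                    // "line 13": mutates state AND props
      }
      nextData.push(item);
    });
    this.setState({ data: nextData });     // "lines 18-20": too late
  }
}

const propsData = [{ id: 1, a: 'old' }];
const list = new DataList({ data: propsData });
list.onChange(1, 'new');
console.log(propsData[0].a);               // 'new' – the parent's data changed!
```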

As you read the code, you can clearly see the author's intention: onChange is the place where this.state is changed according to incoming parameters. The author gets a copy of the original data first, then modifies its values and pushes the items to nextData. At last, the author calls this.setState to update.

If this pattern appears in your projects and it works, that is expected. According to React's component lifecycle, this.state.data is changed to nextData, which will eventually affect render()'s return value. However, there is a series of flaws in this code that I have to point out. If you fully understand and agree with the two observations above, the following points will make you uncomfortable:

1. data=props.data in line 5 assigns this.props.data's reference to this.state.data, which means changing this.state.data directly COULD mutate this.props.data.
2. prevData is assigned a reference to this.state.data in line 10. However, as you read through the code, you will realize that this is not the real intention of the author. He wants to "separate" prevData from this.state.data by using const. This is a total misunderstanding of const.
3. In line 13, each item in prevData is mutated by assigning its field a another value. However, as we mentioned before, prevData is a reference to this.state.data, and this.state.data is a reference to this.props.data. That means the author changed the content of this.state.data without using setState and modified this.props.data from the child component!
4. In lines 18–20, the author finally calls setState to update this.state.data. However, since the state was already changed in line 13, this happens too late. (Perhaps the only good news is that this.state.data is no longer a reference to this.props.data now.)

Well, someone may claim: so what? My page is working properly! Perhaps those people do not understand what lifecycle hooks are for. Usually, people write their business logic in lifecycle hooks, such as deriving state from props, or performing a fetch call when some props change. In such cases, we may write something like the following.
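A typical sketch (fetchData is a hypothetical helper, and a bare class stands in for a real React component):

```javascript
class UserPanel {
  constructor(props) { this.props = props; }
  componentDidUpdate(prevProps) {
    // Business logic in a lifecycle hook: refetch only when the prop changed.
    if (prevProps.userId !== this.props.userId) {
      this.fetchData(this.props.userId);
    }
  }
  fetchData(userId) { this.lastFetched = userId; }  // stand-in for a real fetch
}

const panel = new UserPanel({ userId: 2 });
panel.componentDidUpdate({ userId: 1 });  // userId changed → fetch runs
console.log(panel.lastFetched);           // 2
```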

Every time a component finishes an update, it calls componentDidUpdate. This happens whenever setState is called or props change.

Unfortunately, if a novice developer unintentionally mutates this.state or this.props, these lifecycle hooks will not work as intended, and that will certainly cause unexpected behaviors.
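The failure in miniature: change detection compares references, and an object mutated in place compares equal to itself:

```javascript
const prevState = { data: [1] };
const state = prevState;       // mutated in place instead of being replaced
state.data.push(2);
console.log(prevState === state);           // true
console.log(prevState.data === state.data); // true – "did it change?" says no
```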

## How to make every update under control?

If you are a lazy person who likes the natural idiom of using a temporary variable separated from the original, as displayed above, you are welcome to use immer. Every time you are about to update state, it provides you with a draft that you can modify however you want before returning. An example is given in its documentation.

However, you should know that the proper way to update a state field without modifying it through a reference is to perform a clone. The clone sometimes needs to be deep, to make sure every field is a thorough copy of the original, not a reference. One can achieve that with cloneDeep from lodash, but I do not recommend it, since it may be too costly. Only in rare cases will you need cloneDeep.

Rather, I recommend Object.assign(target, ...sources). This function updates target with the properties of sources and returns target once the update is complete; the sources themselves are left untouched. Pass a fresh object as target so the original is not mutated. Updating an object then looks like the following.
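For example:

```javascript
const state = { user: { name: 'ann' }, count: 1 };

// Merge into a brand-new target; `state` itself is never written to.
const nextState = Object.assign({}, state, { count: 2 });

console.log(state.count);                    // 1 – original unchanged
console.log(nextState.count);                // 2
console.log(nextState.user === state.user);  // true – nested objects still shared
```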

Actual programming can be even easier: there is spread syntax available for expanding an object or array. Using spread syntax, you can easily create a new object or array.
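A sketch with spread syntax:

```javascript
const state = { count: 1, items: [1, 2] };

const nextState = { ...state, count: 2 };  // new object, one field replaced
const nextItems = [...state.items, 3];     // new array with an element appended

console.log(state.count);                      // 1 – original unchanged
console.log(nextState.count);                  // 2
console.log(nextItems);                        // [ 1, 2, 3 ]
console.log(nextState.items === state.items);  // true – the copy is shallow
```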

That allows you to copy the original content of an object/array into a new object/array. The copy at this point is a shallow copy, which means only the outermost object/array is new; if there is an object inside a field of the original object, or at some position of the array, that inner object is copied as a reference. This avoids the costly operations inside cloneDeep. I like this, because it gives you precise control over what you need to change.

So the proper version of the component above should look like the following.
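A corrected sketch (again with a stand-in for React.Component): clone at each level you change, then hand the fresh array to setState.

```javascript
class Component {
  constructor(props) { this.props = props; }
  setState(next) { this.state = { ...this.state, ...next }; }
}

class DataList extends Component {
  constructor(props) {
    super(props);
    this.state = {
      data: [...props.data],               // copy the array, not the reference
    };
  }
  onChange(id, value) {
    const nextData = this.state.data.map(item =>
      item.id === id ? { ...item, a: value } : item  // new object where changed
    );
    this.setState({ data: nextData });     // the only place state is written
  }
}

const propsData = [{ id: 1, a: 'old' }];
const list = new DataList({ data: propsData });
list.onChange(1, 'new');
console.log(propsData[0].a);               // 'old' – props untouched
console.log(list.state.data[0].a);         // 'new'
```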

I would suggest that all components you write from now on extend React.PureComponent. It is no different from React.Component except that it ships a default shouldComponentUpdate: it performs a shallow comparison of state and props to decide whether the component should update, whereas React.Component always returns true unless you provide custom logic. This will not only improve the page's performance but also help you notice the unexpected rendering when you make the mistakes mentioned above.

If you need similar functionality for function components, you can try React.memo(), available since React 16.6.

### 2019-07-10 RegExp.test() returns true but RegExp.exec() returns null?

Consider the following JavaScript script:
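A reconstruction of the kind of script in question (the tag-matching pattern itself is an assumption; the try/catch only keeps the sketch from aborting):

```javascript
const tagRegex = /<(\w+)>/g;
const testString = '<html>';

if (tagRegex.test(testString)) {            // true – but lastIndex is now 6
  const match = tagRegex.exec(testString);  // resumes at index 6 → null
  try {
    console.log(match[1]);
  } catch (e) {
    console.log(e.name);                    // TypeError
  }
}
```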

It seems like perfectly sound logic for extracting the tag name from the tag string: one uses test to guard against the case where testString does not match. However, it throws a TypeError.

This violates our intuition: the string does match the regex, yet exec fails to match it. So why does this happen?

As MDN states, invoking test on a regex with the "global" flag set updates lastIndex on the regex itself. lastIndex allows the user to keep testing the original string, since there may be more than one match; it is this state that affects the exec call that follows. The proper usage of a regex with the "g" flag is therefore to keep executing it until it returns null, meaning no other match can be found in the string.
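The state is visible on the regex object itself:

```javascript
const re = /\d+/g;
const s = '12 34';

console.log(re.test(s), re.lastIndex);  // true 2
console.log(re.test(s), re.lastIndex);  // true 5
console.log(re.test(s), re.lastIndex);  // false 0 – exhausted, lastIndex reset
```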

This behavior is sometimes undesirable, since in the above scenario, I just want to know whether the string matches the predefined pattern. And the reason I don’t want to create the regex on the fly is to save the overhead of creating an object.

One obvious solution is to remove the "g" flag. But sometimes we want to keep the "g" flag and still check whether a string matches the given pattern without modifying the regex's internal state. In that case, one can switch to string.search(regex), which always returns the index of the first match, or -1 if there is no match.
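For example:

```javascript
const tagPattern = /<(\w+)>/g;

console.log('<html>'.search(tagPattern));     // 0 – index of the first match
console.log('<html>'.search(tagPattern));     // 0 – search ignores lastIndex
console.log('plain text'.search(tagPattern)); // -1 – no match
```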

### 2019-05-03 Write a reusable modern React component module

1. This component depends on React and React-dom; how do I let users know they must use these two as well?
2. If a user's application uses React 16.2 while my component uses React 16.8, will there be compatibility problems when they use my component? Will two versions of React get installed just for compatibility?
3. Users currently consume my package by referencing its files directly; how do I get the package into node_modules, i.e. installable via npm install?
4. My package uses some advanced language features. Referenced directly as files, it cannot be transpiled by babel down to an older version of Javascript. Users cannot even import it in the form import MyModule from './mymodule'!
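The first two questions are what npm's peerDependencies field addresses: the component declares React as something the consuming application must provide, so only one copy is installed. A sketch of the relevant package.json fragment (the name and version ranges are illustrative):

```json
{
  "name": "my-component",
  "version": "1.0.0",
  "main": "dist/index.js",
  "peerDependencies": {
    "react": ">=16.8",
    "react-dom": ">=16.8"
  }
}
```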

## Optimizing the template

1. It appears to bundle the current versions of react, react-dom and styled-components into the package, inflating the library's size;
2. The module is transpiled by babel before being consumed by downstream applications, so when a downstream application uses ES6 import there is no real tree-shaking (tree-shaking means ES6 analyzes import and export statements to determine which code is actually used, and strips the unused code before execution).
3. There is no need to bundle the source module a second time when the downstream application does its own bundling.

## Remaining problems

### References

1. Rinse-react: https://rinsejs.io/
2. Webpack output.libraryTarget: https://webpack.js.org/configuration/output/#outputlibrarytarget
3. Writing Reusable Components in ES6: https://www.smashingmagazine.com/2016/02/writing-reusable-components-es6/
4. CommonJS vs AMD vs RequireJS vs ES6 Modules: https://medium.com/computed-comparisons/commonjs-vs-amd-vs-requirejs-vs-es6-modules-2e814b114a0b
5. Your tree-shaking is not actually useful (in Chinese): https://juejin.im/post/5a5652d8f265da3e497ff3de
6. Webpack and Rollup: the same but different: https://medium.com/webpack/webpack-and-rollup-the-same-but-different-a41ad427058c

### 2019-04-21 Release of rankit v0.3 and roadmaps for future bangumi ranking

After several years (really!) of development, I'm pleased to announce that rankit has been upgraded to v0.3. The previous major release was v0.1, made in 2016. So what has changed during these three years?

When rankit v0.1 was developed, it aimed to implement all the fascinating algorithms mentioned in a beautiful ranking book and to provide an alternative ranking solution for Bangumi. The programming interface designed at that time was far from practical. The more I looked into ranking algorithms, the more I felt that rankit should include some popular ranking algorithms based on time series. One of them is Elo ranking, which had a simple implementation in rankit v0.2. But its usage scenario was limited: updating the ranking was tedious, and its interface was not compatible with the existing ranking algorithms.

1. Split rankers into "unsupervised rankers" and "time series rankers". Each of the two has its own interface, and both consume the Table object to keep records.
2. Introduced the Elo ranker, Glicko 2 ranker and TrueSkill ranker (only paired competition records are supported).
3. Updated the interface to make it more user-friendly. For unsupervised rankers, only a rank method is provided, since they need no update. For time series rankers, the user should use update to accumulate new record data, and can retrieve the latest leaderboard by invoking leaderboard after each update.

The documentation of rankit has also been updated.

One side product of rankit is the "Scientific animation ranking for Bangumi". For a long time the ranking was not updated, and it was gradually forgotten. I gave it a refresh last year, and it now has a completely new look with faster response and simple search functionality. More importantly, the ranking will be updated monthly. I invite you all to give it a try here: https://ranking.ikely.me

The latest scientific animation ranking also involves minor algorithm changes. It is often observed that at certain times users with peculiar intentions pour into Bangumi to rate one specific animation with a certain high or low score. This distorts the proper order of the rating. The previous version of the scientific ranking could ignore users who rate one anime and leave permanently, but could not handle users who rate multiple animations. I adjusted users' overall rated scores and applied several normalizations according to the distribution of users' ratings, and all versions of the normalized scores are fed into rankit to calculate the latest rank. The final rank is still merged using Borda count.
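The Borda count merge mentioned above can be sketched in a few lines (a generic illustration, not rankit's actual implementation):

```python
def borda_merge(rankings):
    """Merge several rankings by Borda count: in a ranking of n items,
    the item at position p earns n - p points; totals give the merged order."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - pos)
    return sorted(scores, key=lambda item: -scores[item])

merged = borda_merge([
    ["A", "B", "C"],   # three hypothetical normalized-score rankings
    ["B", "A", "C"],
    ["A", "C", "B"],
])
print(merged)  # ['A', 'B', 'C'] – A: 8 points, B: 6, C: 4
```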

But could this be solved from another perspective? One thing I have been thinking about is how to bring time series ranking into the current ranking scheme. Ideally, time series ranking should act to combat ranking-manipulation behavior in a way other than pairwise ranking. As I was reading about TrueSkill, their brilliant idea of inferring rankings with a linear-chain graph struck me. TrueSkill is actually a generative graphical model organized in the same order as the competition scores. Another issue that needs resolving is helping users adjust historical ratings automatically: a user may have rated animations across a wider range before, but his or her rating criteria may change over time. How do we propagate recent rating criteria to historical ratings? All of this should be supported in the next version of rankit: ranking for multiple players in a match, and the power to propagate recent competition results back to historical data.

### 2017-10-02 Mangaki data challenge 1st place solution

Mangaki data challenge is an otaku-flavored data science competition. Its goal is to predict a user's preference for an unwatched/unread anime/manga between two choices: wish to watch/read and don't want to watch/read. The competition provides training data from https://mangaki.fr/, a site that lets users favorite anime/manga works. Three major training tables are provided, as follows:

1. Wish table: about 10k rows

User_id	Work_id	Wish
0	233	1

2. Record table: anime/manga already watched/read. There are four rates: love, like, neutral and dislike.

User_id	Work_id	Rate
0	22	like
2	33	dislike

3. Work table: detailed information on the available anime/manga. There are three categories: anime, manga and album. There is only one album in this table; all the others are anime (about 7k) and manga (about 2k).

Work_id	Title	Category
0	Some_anime	anime
1	Some_manga	manga

For the testing data, one should predict 100k user/work pairs on whether the user wishes to watch/read the anime/manga or not. As you can see, the testing data is much larger than the training data. Besides, my analysis of the dataset showed it is not even guaranteed that all users or works appearing in the test set are contained in the training set.

## Traditional recommendation system methods (that I know)

Recommendation systems have long been studied, and there are various methods for solving this particular problem. I myself tried to build a recommender for https://bgm.tv several years ago (you can read the technical details here). The simplest solution is SVD (actually, an even simpler and more intuitive solution is KNN); then one can move on to RBM, FM, FFM and so on. One assumption that holds firm in all these methods is that users have an embedding vector capturing their preferences, and works also have an embedding vector capturing their characteristics. But is it reasonable that we should be constrained to this embedding-dot-product model?

Recently, the common practice in Kaggle competitions has been to use GBDT to solve (almost all, except computer-vision-related) problems. As long as a model handles classification, regression and ranking problems well, it can be applied to any supervised machine learning problem! And by using model ensembling under the stacknet framework, one can join the different characteristics of models to achieve the best result.

In this competition, my solution is plain and straightforward: feature engineering to generate some embeddings, then GBDT/Random Forest/Factorization Machine models built from different combinations of features. Finally, I used a two-level stack net to ensemble them, in which level two is a logistic regression model.

## Feature Engineering

### From wish table:

• Distribution of user’s preference on anime/manga (2d+2d)
• Distribution of item’s preference (2d)
• Word2vec embedding of user on wish-to-watch items (20d)
• Word2vec embedding of user on not-wish-to-watch items (10d)
• Word2vec embedding of item on wish-to-watch users (20d)
• Word2vec embedding of item on not-wish-to-watch users (10d)
• LSI embedding of user (20d)
• LSI embedding of item (20d)

### From record table:

• Distribution of user’s preference on anime/manga (4d+4d)
• Distribution of item’s preference (4d)
• Mean/StdErr of user’s rating (2d)
• Mean/StdErr of item’s rating (2d)
• Word2vec embedding of user on loved and liked items (32d)
• Word2vec embedding of user on disliked items (10d)
• Word2vec embedding of item on loved and liked users (32d)
• Word2vec embedding of item on disliked users (10d)
• LSI embedding of user (20d)
• LSI embedding of item (20d)
• LDA topic distribution of user on love, like and neutral items (20d)
• LDA topic distribution of item on love, like and neutral ratings (20d)
• Item category (1d, categorical feature)
• User Id (1d, only used in FM)
• Item Id (1d, only used in FM)

## Model ensembling

The first layer of the stack net is a set of models that should each have good predictive capability but different inductive biases. Here I tried three models: GBDT, RF (both backed by LightGBM) and FM (backed by fastFM). I trained models on record-table features and training-table features separately, and one can further train different models using different combinations of features. For example, one can use all features (except user id and item id) from the record table. But since GBDT fixates on the most informative features when given all of them, it is helpful to split features into several groups and train models separately. In this competition I did not split much (simply because I did not have much time); I just removed the first four features (because I saw from the prediction results that they had a major effect on precision) and trained some additional models.

## Model stacking

The stack net requires one to feed all prediction results from the first layer as features to the second layer. The stacking technique requires one to do KFold cross-validation at the beginning, and then, on the second level, predict each fold's result using all other folds as training data. Here is the most intuitive (as far as I know) description of the model stacking technique: http://blog.kaggle.com/2017/06/15/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova/
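The fold mechanics can be sketched with stdlib Python only (a toy mean predictor stands in for the GBDT/RF/FM level-1 models):

```python
def oof_predictions(X, y, fit_predict, k=5):
    """Out-of-fold predictions: each sample is predicted by a model trained
    on the other k-1 folds, giving leak-free level-2 features for stacking."""
    n = len(X)
    oof = [None] * n
    folds = [list(range(i, n, k)) for i in range(k)]   # simple interleaved split
    for fold in folds:
        held = set(fold)
        train_X = [X[i] for i in range(n) if i not in held]
        train_y = [y[i] for i in range(n) if i not in held]
        test_X = [X[i] for i in fold]
        for i, pred in zip(fold, fit_predict(train_X, train_y, test_X)):
            oof[i] = pred
    return oof

# Toy level-1 "model": predict the training-set mean everywhere.
mean_model = lambda train_X, train_y, test_X: \
    [sum(train_y) / len(train_y)] * len(test_X)

feats = oof_predictions(list(range(10)), [0, 1] * 5, mean_model, k=5)
print(feats)  # ten leak-free level-2 feature values
```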

In this competition, a single GBDT with all the features from the record table reaches 0.85567 on the LB. By leveraging the model stacking technique, one can reach 0.86155, which is my final score.

## Is this the ultimate ceiling?

Definitely not. One can push the boundary much further:

1. I did not tune the embedding generation parameters very well. In fact, I generated those features using the default parameters gensim provides. The embedding dimensions were chosen on a whim, with no science involved. Maybe one could enlarge word2vec's sliding window or use more embedding dimensions to achieve better results.
2. I only used LightGBM to build GBDT; one could also use xgboost. Although both provide GBDT, LightGBM grows trees leaf-wise while xgboost grows them depth-wise. Even though both are CART-based GBDT, they behave differently.
3. I did not introduce any features generated by deep models. GBDT is the kind of model that relies on heavy feature engineering, while a deep model learns features automatically. Combining them in the stacking model would certainly yield a much higher AUC.
4. I did not use more complex features. Sometimes popularity ranking also affects user behavior: a user tends to mark highly ranked animes as "wish to watch". I did not try this idea out.

## Conclusion

I must say this competition was very interesting, because I have seen no other competition targeting anime/manga prediction. Another good point is that the training data is very small, so I could do CV efficiently on my single workstation. And before this competition I had never tried a stack net. This competition granted me experience in doing model stacking in an engineering-friendly way.

One thing to regret is that too few competitors took part. Though I tried to call for participants on Bangumi, it seems not many people joined. The organizers should make their website more popular before holding the next data challenge!

One more thing: you may be interested in the code. I wrote all my code here, but it is not arranged in an organized way. The most important files are "FeatureExtraction.ipynb" and "aggregation.py", which cover how to do the feature engineering and how to partition the features. "CV.ipynb" gives some intuition on how to train the models.

### 2017-04-14 Console as a SQL interface for quick text file processing

uid    name    nickname    joindate    activedate
7    7    lorien.    2008-07-14    2010-06-05
2    2    陈永仁    2008-07-14    2017-02-17
8    8    堂堂    2008-07-14    2008-07-14
9    9    lxl711    2008-07-14    2008-07-14
name    iid    typ    state    adddate    rate    tags
2    189708    real    dropped    2016-10-06
2    76371    real    dropped    2015-11-07
2    119224    real    dropped    2015-03-04
2    100734    real    dropped    2014-10-09
subjectid    authenticid    subjectname    subjecttype    rank    date    votenum    favnum    tags
1    1    第一次的親密接觸    book    1069    1999-11-01    57    [7, 84, 0, 3, 2]    小説:1;NN:1;1999:1;国:1;台湾:4;网络:2;三次元:5;轻舞飞扬:9;国产:2;爱情:9;经典:5;少女系:1;蔡智恒:8;小说:5;痞子蔡:20;书籍:1
2    2    坟场    music    272        421    [108, 538, 50, 18, 20]    陈老师:1;银魂:1;冷泉夜月:1;中配:1;银魂中配:1;治愈系:1;银他妈:1;神还原:1;恶搞:1;陈绮贞:9
4    4    合金弹头7    game    2396    2008-07-17    120    [14, 164, 6, 3, 2]    STG:1;结束:1;暴力:1;动作:1;SNK:10;汉化:1;2008:1;六星:1;合金弹头:26;ACT:10;NDS:38;Metal_Slug_7:6;诚意不足:2;移植:2
6    6    军团要塞2    game    895    2007-10-10    107    [15, 108, 23, 9, 7]    抓好社会主义精神文明建设:3;团队要塞:3;帽子:5;出门杀:1;半条命2:5;Valve:31;PC:13;军团要塞:7;军团要塞2:24;FPS:26;经典:6;tf:1;枪枪枪:4;2007:2;STEAM:25;TF2:15


1. Not real-time. By "real-time" I don't mean that today is April 16 while the data is only from February, but that I cannot guarantee the data is a snapshot at a single point in time. For user data, a full crawl takes two days, during which users may change their nicknames or usernames without the change being reflected in the crawled data. A more serious problem concerns the collection data: users may perform collection operations while the crawl is running, leaving duplicated or missing rows. And since user data and collection data are crawled separately, I cannot guarantee that the two tables can be joined one-to-one on username.
2. Not ordered, as can be seen from the data preview.
3. Crawler defects. I did not handle Bangumi's 500 errors, so some data is missing.

## 1. SELECT … WHERE … ORDER BY …

### Filtering the 2017 winter anime season
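The selection is a WHERE clause written in awk (a sketch: the column positions follow the subject table preview above, and the date window used for the 2017 winter season is an assumption; the inline sample row only makes the sketch runnable):

```shell
# (sample rows so the sketch runs; the real subject.tsv comes from the dump)
printf '1\t1\tX\tanime\t10\t2017-01-07\t5\tf\tt\n2\t2\tY\tmusic\t20\t2017-01-07\t5\tf\tt\n3\t3\tZ\tanime\t30\t2015-01-07\t5\tf\tt\n' > subject.tsv

# SELECT * FROM subject WHERE subjecttype = 'anime'
#   AND date BETWEEN '2016-12-20' AND '2017-02-28'
awk -F '\t' '$4 == "anime" && $6 >= "2016-12-20" && $6 <= "2017-02-28"' \
    subject.tsv > anime_selection.tsv
wc -l < anime_selection.tsv    # number of shows selected
```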

90


85 anime_selection.tsv
122772    122772    六心公主    anime        2016-12-30    26    [19, 41, 1, 1, 4]    17冬:1;原创:1;PONCOTAN:4;2016年:2;广桥凉:1;TVSP:1;池赖宏:1;原优子:1;mebae:1;TV:4;日本动画:1;片山慎三:1;Studio:1;STUDIOPONCOTAN:4;2016:5;TVA:1;短片:2;上田繁:1;搞笑:4;中川大地:2;岛津裕之:2;种崎敦美:1;2017年1月:1;テレビアニメ:1;オリジナル:1;SP:1;6HP:2;村上隆:10;未确定:1
125900    125900    锁链战记～赫克瑟塔斯之光～    anime    3065    2017-01-07    88    [66, 24, 216, 20, 60]    山下大辉:3;17冬:1;原创:1;游戏改:47;CC:1;花泽香菜:7;TV:22;未确定:2;グラフィニカ:2;佐仓绫音:4;2017年1月:61;锁链战记:1;2017:10;锁链战记～Haecceitas的闪光～:15;热血:2;チェインクロ:1;石田彰:22;声优:2;2017年:4;Telecom_Animation_Film:1;十文字:1;柳田淳一:1;战斗:2;内田真礼:2;剧场版:1;奇幻:17;2017·01·07:1;工藤昌史:3;2015年10月:1;TelecomAnimationFilm:9
126185    126185    POPIN Q    anime        2016-12-23    10    [134, 11, 3, 3, 0]    荒井修子:1;黒星紅白:4;原创:3;黑星红白:1;2016年:5;_Q:1;日本动画:1;2016年12月:2;未确定:1;小泽亚李:1;2017:2;2016:5;动画电影:1;2017年:5;Q:3;东映动画:1;种崎敦美:1;2017年1月:1;宫原直树:1;POPIN:6;東映アニメーション:12;剧场版:24;东映:4;萌系画风:1;濑户麻沙美:5
131901    131901    神怒之日    anime        2017-10-01    0    [79, 1, 0, 3, 1]    GENCO:3;2017年10月:2;TV:4;未确定:2;2017年:2;GAL改:4;游戏改:4;LIGHT:2;2017:3;エロゲ改:3;2017年1月:1


### Extracting the tag list
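Splitting the tag column into one row per tag is a split-then-sort pipeline (a sketch against the anime_selection.tsv file named above; the tag column is assumed to hold name:count pairs separated by semicolons, and the inline sample row only makes the sketch runnable). Piping sort into head is also what produces the broken-pipe complaint visible at the end of the listing below.

```shell
# (sample row so the sketch runs; the real file is the selection above)
printf '122772\t122772\tSomeAnime\tanime\t\t2016-12-30\t26\tx\tB:10;A:5;C:4\n' > anime_selection.tsv

# one row per tag: id, title, tag, count – then ORDER BY count DESC, LIMIT 20
awk -F '\t' '{
  n = split($9, tags, ";")
  for (i = 1; i <= n; i++) {
    split(tags[i], kv, ":")
    print $1 "\t" $3 "\t" kv[1] "\t" kv[2]
  }
}' anime_selection.tsv | sort -t "$(printf '\t')" -k4,4nr | head -n 20
```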

122772    六心公主    村上隆    10
122772    六心公主    2016    5
122772    六心公主    PONCOTAN    4
122772    六心公主    STUDIOPONCOTAN    4
122772    六心公主    TV    4
122772    六心公主    搞笑    4
122772    六心公主    2016年    2
122772    六心公主    6HP    2
122772    六心公主    中川大地    2
122772    六心公主    岛津裕之    2
122772    六心公主    短片    2
122772    六心公主    テレビアニメ    1
122772    六心公主    オリジナル    1
122772    六心公主    17冬    1
122772    六心公主    2017年1月    1
122772    六心公主    mebae    1
122772    六心公主    SP    1
122772    六心公主    Studio    1
122772    六心公主    TVA    1
122772    六心公主    TVSP    1
sort: write failed: standard output: Broken pipe
sort: write error


1. Before the DFS that searches for augmenting paths, run a BFS from the sink to build dep and depcnt, i.e. each node's shortest distance to the sink; this lets the DFS always follow a shortest path.
2. Forward and backward edges are stored in pairs, with even indices always being forward edges; an edge and its reverse can be switched between by XOR-ing the index;
3. The stack stores the augmenting path found by the DFS.
4. After each DFS pass, backtrack to the start of the bottleneck edge and resume the search from there, saving time.