I have been a Kaggle fan for a long time. A community of data scientists and engineers devoted to pioneering data science practice has always appealed to me. Though I'm not a dedicated Kaggler, I still give up the weekends of several months each year to Kaggle competitions after work, to absorb that spirit of dedication and almost religious devotion. That's why I registered the moment I received the registration notification email.

Besides the lectures and the Grandmasters, another thing that specifically attracted me was the offline data science competition. I have long been curious about people's authentic ability to code offline, since in my opinion most Kagglers online are fed by public Kernels. What would their real performance be without Kernels? Though I'm only a linear-model Kaggler, I'm certain I am somewhat more experienced than others in feature selection, so there might be a chance for me to win.

I checked the previous Kaggle Days before this event, and all the offline competitions had been about tabular data. So I familiarized myself with the common EDA APIs and code snippets for pandas feature generation and scikit-learn-compatible cross-validation. I know all too well that my traditional machine learning skills cannot reach a high place on the leaderboard in the age of deep learning, so I invited one of my colleagues, Lihao, a deep learning expert, to join me.
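
What I rehearsed was roughly the following kind of snippet: group-by aggregates in pandas as generated features, then scikit-learn-compatible cross-validation over a simple linear model. Everything here (the file, the `user_id`/`amount`/`target` columns, the metric) is made up for illustration:

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

df = pd.read_csv("train.csv")  # hypothetical tabular dataset

# Typical feature generation: per-group aggregates merged back onto each row.
agg = (df.groupby("user_id")["amount"]
         .agg(["mean", "std", "max"])
         .add_prefix("amount_")
         .reset_index())
df = df.merge(agg, on="user_id", how="left")

X = df.drop(columns=["target"]).select_dtypes("number").fillna(0)
y = df["target"]

# scikit-learn-compatible cross-validation with a plain linear model.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
print(cross_val_score(Ridge(), X, y, cv=cv, scoring="r2").mean())
```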

DAY 1

Opening of Kaggle Days China

The first day of the event was all about lectures and workshops. There were several interesting workshops to attend, but you had to register first. My first regret was that there was a lecture about LightGBM I really wanted to attend, but it conflicted with a workshop about modeling pipelines. I only caught the end of that lecture, and even so, I still learned something insightful from it. I may need to go back and review the lecture videos later.

Someone once said that the whole point of attending a technical conference is to chat with people: no lectures, no workshops, just communicating. I have to say this was the best part of Kaggle Days. I did talk with a lot of people, though I was still too shy; it was rarely me who made the first move to get to know others. Everybody in the Expert group has their own domain knowledge. Not all of them are necessarily experienced Kagglers, but they know the AI industry in China very well. As an SDE working in the small city of Suzhou, I had not felt this excitement of communicating with industry experts since I left Beijing in 2015.

Announcement of the Day 2 competition topic

At the end of the day, the organizers disclosed the topic of the next day's data science competition. Though I had expected another tabular data competition, the title indicated it would be a computer vision competition. It reminded me of a previous competition on classifying astronomical objects, though it would not necessarily take the same form. Having no practical knowledge of contemporary computer vision, which deep learning has come to dominate, I regretted not following my graduate school advisor, Lianwen Jin, more closely. My work experience couldn't contribute to this competition either, since I work in an NLP group. Fortunately, as we were about to leave, a guy came over and asked if he could join us, saying he had some CV background. This was perhaps the best news I received that day, and I was grateful to him.

DAY 2

I had decided the night before that if the competition really was a computer vision competition, I would resort to fast.ai. I had learned it over the summer, and it was the only tool I knew how to use in modern deep-learning-based CV. And a CV competition it turned out to be: we were required to classify images into two classes, a typical binary classification problem.

CV requires GPU-equipped machines, and on the night before the competition we were asked to configure our machines on a designated service provided by UCloud. It was actually a Jupyter Notebook backed by 8 GPUs. However, without proper configuration the machine was almost unusable: it had TensorFlow 2.0 alpha installed, neither the final release nor the stable 1.14. So Lihao spent much of the morning configuring the machine.

I originally thought we would need to perform EDA and a proper train/test split first, but I soon discarded that idea for this CV competition. Williame Lee, the guy I mentioned above, did spend some time inspecting the data first, trying to find patterns in the images. But in my opinion, deep neural networks extract features automatically, and even if I identified some patterns, there would be nowhere to feed them if a deep neural network does the feature extraction.

The offline Kaggle competition in progress

The core spirit of fast.ai is to use pre-trained networks for image classification and fine-tune them at the end. This turned out to be a very successful idea. I spent the whole morning building the pipeline, and it worked! My first classifier, with ResNet34 as the pretrained model, performed as well as the baseline. Later, Lihao trained this model further to push it to 0.85, and we tried several other models such as ResNet18 and ResNet50. Even a network as simple as ResNet18 achieved a good 0.82 after fine-tuning. Williame also developed a MobileNet-based model that reached 0.81 on the public leaderboard.
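
The whole pipeline really is only a handful of lines. Here is a minimal sketch in the fastai v1 style current at the time; the data path, image size, and epoch counts are illustrative, not our exact settings:

```python
from fastai.vision import *  # fastai v1 idiom: brings in data, learner, models

# Images organized as data/images/<class>/*.jpg; hold out 20% for validation.
data = (ImageDataBunch
        .from_folder("data/images", train=".", valid_pct=0.2,
                     ds_tfms=get_transforms(), size=224)
        .normalize(imagenet_stats))

# Start from an ImageNet-pretrained ResNet34 and train only the new head.
learn = cnn_learner(data, models.resnet34, metrics=accuracy)
learn.fit_one_cycle(4)

# Then unfreeze the backbone and fine-tune everything at lower learning rates.
learn.unfreeze()
learn.fit_one_cycle(2, max_lr=slice(1e-5, 1e-3))

# Predictions on the validation set (swap in a test set for submission).
preds, _ = learn.get_preds()
```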

Meanwhile, Fengari shared his 0.88 baseline, which used EfficientNet. You can imagine how many competitors this fed. Lihao then switched to this new baseline and adapted its cross-validation scheme. In the end, we merged our three pretrained ResNet models and two EfficientNet adaptations for the final result. That placed us 17th on the leaderboard (out of 34 teams). Not bad at all for my first CV competition!
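
The merge itself was nothing exotic; a typical blend of this kind simply averages each model's predicted probability per image before thresholding, roughly as below (file and column names are made up):

```python
import numpy as np
import pandas as pd

# One CSV per model, each holding the predicted probability of the positive
# class for every test image (hypothetical files and column names).
files = ["resnet34.csv", "resnet18.csv", "resnet50.csv",
         "efficientnet_a.csv", "efficientnet_b.csv"]
preds = np.stack([pd.read_csv(f)["prob"].to_numpy() for f in files])

# Unweighted mean of probabilities; weighting by each model's validation
# score is a common refinement.
blend = preds.mean(axis=0)

submission = pd.read_csv(files[0])[["id"]].copy()
submission["label"] = (blend > 0.5).astype(int)
submission.to_csv("submission.csv", index=False)
```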

Day 2 felt like being thrown into a swimming pool (I really don't know how to swim) and learning to swim by myself. I successfully trained a deep neural network for computer vision for the first time. I'm not afraid of CV anymore!

The organizers soon announced the winners, who had been sitting right in front of us during the whole competition. They used multi-task learning to improve their model, a key technique that Williame had hinted at in the morning. Their solution is here: https://github.com/okotaku/kaggle_days_china
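
I won't reproduce their code here (see the repo above), but the general shape of multi-task learning is easy to sketch: a shared backbone with an extra head for an auxiliary target, trained on a weighted sum of losses. The following is generic PyTorch with a made-up auxiliary task, not their actual solution:

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class MultiTaskNet(nn.Module):
    def __init__(self, n_aux_classes=4):  # auxiliary head size is made up
        super().__init__()
        backbone = models.resnet34(pretrained=True)
        backbone.fc = nn.Identity()        # keep the 512-d pooled features
        self.backbone = backbone
        self.main_head = nn.Linear(512, 2)             # the binary label
        self.aux_head = nn.Linear(512, n_aux_classes)  # auxiliary target

    def forward(self, x):
        feats = self.backbone(x)
        return self.main_head(feats), self.aux_head(feats)

def multitask_loss(main_logits, aux_logits, y_main, y_aux, aux_weight=0.3):
    # The auxiliary loss regularizes the shared features; the weight
    # balances the two tasks.
    return (F.cross_entropy(main_logits, y_main)
            + aux_weight * F.cross_entropy(aux_logits, y_aux))
```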

Before this Kaggle Days, my ambition was to stand on the winners' stage. Unfortunately, I still have many things to learn to achieve that goal. I later asked Lihao whether this was his ideal kind of tech event; his answer was no, but he still praised the core spirit of the Kaggle community. I hope that next year I can find someone who shares my mindset and debut together. And if I can cross-dress next time, even better!


2019-10-26