Rishit Dagli
Rishit Dagli's Blog

Tips for building better Deep Learning Models from the 10 Days of ML Challenge

Rishit Dagli

Published on Apr 10, 2020

source: tensorflow.org

I put these tips together while participating in the 10 Days of ML Challenge, a wonderful initiative by TensorFlow User Group Mumbai to encourage people to learn and practice ML, not necessarily with TensorFlow. Each day started with a task, which you were expected to complete and then tweet about your results. Tasks could be anything from pre-processing data to building a model to deploying it. I would advise you to go over the tasks of the challenge and try them out yourself, whether you are a beginner or a practitioner. In this post I share my major takeaways from these tasks. All my solutions and the challenge problem statements are open-sourced in the Rishit-dagli/10-Days-of-ML repository on github.com, the repository for 10 Days of ML, a TFUG Mumbai initiative.

My takeaways

The best part of the challenge was that it was not a hard-core competition but an initiative to promote learning. It was completely okay if you did not know a concept or had bugs you could not fix: you could ask the wonderful community and get support. This was especially valuable for beginners. Better still, you were also given relevant links to study the topics required to complete each task.

All my outputs for the same are available in this thread-

Day 1

We were asked to build some interactive charts for the COVID-19 dataset available here. Primarily, you had to build three graphs:

  • Country Wise graph

  • Date Wise graph

  • Continent Wise graph

A major problem here was that the dataset had no latitude or longitude data, which you could otherwise have passed straight to a function to create a continent-wise graph. So I took another country and continent dataset from Kaggle and added a column called “Continent” to my original dataset, and that solved it. With so much data available, I saw some people making huge, complex graphs that defeated the purpose of a graph, which is to simplify things and make analysis easy; you could no longer observe any patterns in them. Here, for example, a simple scatter plot showed me that the growth in some countries looked like the textbook exponential curve, so I replaced the feature x1 with ln(x1) and the plot became almost linear!
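Adding the “Continent” column described above amounts to a left join on the country name. Here is a minimal sketch with pandas, assuming hypothetical column names (Country, Confirmed, Continent) and made-up rows rather than the actual challenge or Kaggle data:

```python
import pandas as pd

# Hypothetical stand-in for the COVID-19 dataset (no latitude/longitude columns).
cases = pd.DataFrame({
    "Country": ["India", "Italy", "Brazil", "Japan"],
    "Confirmed": [100, 250, 80, 60],
})

# Hypothetical stand-in for the country-to-continent lookup table.
continents = pd.DataFrame({
    "Country": ["India", "Italy", "Brazil", "Japan"],
    "Continent": ["Asia", "Europe", "South America", "Asia"],
})

# A left join keeps every row of `cases` and adds the matching continent.
cases_with_continent = cases.merge(continents, on="Country", how="left")

# Aggregating by the new column gives the data for a continent-wise graph.
by_continent = cases_with_continent.groupby("Continent")["Confirmed"].sum()
print(by_continent)
```

Countries missing from the lookup table end up with an empty (NaN) continent after a left join, so it is worth counting those rows before plotting.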

Make simple, understandable graphs from which you can analyse the data and eyeball a few patterns, instead of complex graphs that do not make much sense and defeat the purpose.
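The log transform mentioned for Day 1 works because an exponential curve c = a·exp(b·t) becomes the straight line ln(c) = ln(a) + b·t. Here is a small sketch with NumPy on a synthetic series (the values 5 and 0.2 are made up for illustration, not taken from the COVID-19 data):

```python
import numpy as np

# Synthetic, made-up series: cases growing exponentially, 5 * exp(0.2 * day).
days = np.arange(30)
cases = 5.0 * np.exp(0.2 * days)

# Taking the natural log turns c = a * exp(b * t) into ln(c) = ln(a) + b * t,
# which is a straight line in t.
log_cases = np.log(cases)

# Fit a straight line to the transformed values; the slope is the growth rate.
slope, intercept = np.polyfit(days, log_cases, deg=1)
print(f"growth rate ~ {slope:.3f}, initial value ~ {np.exp(intercept):.3f}")
```

On real data the fit will not be exact, but a roughly straight line after the transform is a quick visual check that growth is close to exponential.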

Day 2

Day 2 was all about feature engineering and data pre-processing on the Titanic dataset available here. Titanic is a pretty good candidate for this because its features show a lot of variation and low correlation, which makes good feature engineering and pre-processing strategies very important. You could also do some data visualization.

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

You could use a lot of methods on this dataset alone, like categorical conversion, bucketizing columns, one-hot encoding and many more. Try them out yourself.
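The three methods named above can be sketched with pandas as follows. The rows are made up and only borrow the Titanic column names Age, Sex and Embarked; the bin edges and group labels are illustrative assumptions, not the ones used in my solution:

```python
import pandas as pd

# A few made-up rows using the Titanic column names Age, Sex and Embarked.
df = pd.DataFrame({
    "Age": [4, 22, 38, 61],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
})

# Categorical conversion: map a two-valued text column to 0/1.
df["IsFemale"] = (df["Sex"] == "female").astype(int)

# Bucketizing: replace the raw age with a coarse age group.
# The bin edges and labels are illustrative choices.
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)

# One-hot encoding: one indicator column per category value.
encoded = pd.get_dummies(df, columns=["Embarked", "AgeGroup"])
print(encoded.columns.tolist())
```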

Day 3

We finally came to building models: we had to predict loan status and cereal ratings, and for advanced users there was an NLP task to analyze the toxicity of comments. The first two tasks were quite simple and targeted at beginners; working with the toxicity data was wonderful. A simple but powerful observation I made from the NLP task was that it worked best to choose the number of RNN/LSTM/GRU units as a multiple of the vocabulary size.

Try to choose the number of units of your recurrent layers as a multiple of the vocabulary size, and try to choose your embedding dimensions by a preferred similarity measure. Also, do not simply stack up recurrent layers; instead, try to improve your overall model architecture. Choosing your embedding dimensions as multiples of the 4th root of the number of features worked really well for my models.
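The 4th-root tip above is simple arithmetic. Here is a sketch, assuming that “number of features” means the number of distinct categories or tokens being embedded; the helper name and the multiplier parameter are hypothetical:

```python
def suggest_embedding_dim(num_categories: int, multiplier: int = 1) -> int:
    """Return multiplier * (4th root of num_categories), rounded, at least 1."""
    if num_categories < 1:
        raise ValueError("num_categories must be positive")
    fourth_root = num_categories ** 0.25
    return max(1, round(multiplier * fourth_root))


# Hypothetical vocabulary sizes, just to show the scale of the suggestion.
for vocab_size in (1_000, 10_000, 100_000):
    print(vocab_size, suggest_embedding_dim(vocab_size, multiplier=4))
```

Treat the result as a starting point for experiments rather than a fixed answer.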

Day 4

Next we moved on to unsupervised learning on the Expedia dataset available here. The dataset was pretty huge, with over 2 GB of data, so some precautions had to be taken to work with it in a constrained environment. One such technique is to clear intermediate tensors and variables once you no longer need them, which you can do easily using

del variable_name

You should also try to load your data using parallel processing techniques, and run your preprocessing steps inside a map function to make the best use of your CPU cores. In my approach I used the cloud for certain costly operations like PCA or LDA, and you should try to do the same. You should also not jump straight to algorithms like Restricted Boltzmann Machines: in this case there was a difference of only ~0.8% in accuracy between K-Means and Boltzmann machines, so it made more sense to use K-Means and keep the algorithm simple rather than chase a bit of accuracy.

Always load your data and pre-process it in parallel in a map function. Clear your intermediate tensors and variables whenever possible to free up memory for better performance. You should always prefer simplicity over small changes in accuracy.
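The two memory and parallelism tips above can be sketched with the Python standard library alone. This is only an illustration: the preprocess function and the sample rows are hypothetical, and a thread pool is used here for simplicity rather than as the approach from my solution:

```python
import gc
from concurrent.futures import ThreadPoolExecutor


def preprocess(record: str) -> str:
    """Hypothetical per-record cleaning step: trim whitespace and lowercase."""
    return record.strip().lower()


raw_records = ["  Hotel A ", "HOTEL b", " hotel C  "]  # made-up sample rows

# executor.map applies preprocess to the records concurrently and
# returns the results in the original order.
with ThreadPoolExecutor(max_workers=4) as executor:
    cleaned = list(executor.map(preprocess, raw_records))

# The raw copy is no longer needed: drop the reference and ask the
# garbage collector to reclaim unreachable objects right away.
del raw_records
gc.collect()

print(cleaned)
```

Threads mainly help when the work is I/O-bound, such as reading files. For CPU-heavy preprocessing in pure Python, ProcessPoolExecutor offers the same map interface, and in a TensorFlow input pipeline tf.data.Dataset.map accepts a num_parallel_calls argument for the same purpose.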

Day 5

We were asked to build models for the famous dogs vs cats dataset available here. The idea was to use convolutional neural networks (CNNs) to pick out the features in the images. I used two approaches. The first was transfer learning from the pretrained Inception model: I froze some of its layers, added some dense layers on top, and trained only the layers I had added, ending up with good accuracy. The second was building the network from scratch, where a couple of convolutional and pooling layers did the job. To prevent overfitting I also used regularization, dropout layers and image augmentation, all made easy with TensorFlow. Both approaches are present in the repo.

You should always try to use regularization, dropout layers and image augmentation to combat overfitting. Also try to use odd kernel sizes like 3x3 or 5x5. Do not just keep increasing your number of channels; you might get better training accuracy but overfit the data. Always start with smaller filters to collect as much local information as possible, and then gradually increase the filter width to reduce the generated feature space width and represent more global, high-level and representative information.
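One commonly cited reason for preferring odd kernel sizes, which the tip above does not spell out, is padding: at stride 1, keeping the output the same width as the input needs kernel_size - 1 padded pixels in total, and that only splits evenly on both sides when the kernel size is odd. A small sketch of the arithmetic (the helper name is hypothetical):

```python
def same_padding(kernel_size: int) -> tuple:
    """Padding (before, after) that keeps the width unchanged at stride 1."""
    total = kernel_size - 1          # output = input - kernel + 1 + total
    before = total // 2
    after = total - before
    return before, after


for k in (2, 3, 4, 5):
    before, after = same_padding(k)
    shape = "symmetric" if before == after else "lopsided"
    print(f"{k}x{k} kernel: pad {before} before, {after} after ({shape})")
```

With an odd kernel every output pixel is also centred on an input pixel, which is not the case for even sizes.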

Day 6

Day 6 had two tasks for us: building a model for Fashion MNIST and building one that predicts pneumonia from X-ray images. Again these were image problems, and we were expected to use CNNs.

If the dataset is not very complex, like Fashion MNIST, do not waste your time building a huge model architecture with 10 layers or so.

The X-ray pneumonia problem was a bit tricky, and we were also expected to do a bit of feature engineering on the data. While working on this problem I found some very helpful experimental results, which I believe one should follow.

Batch normalization should generally be placed after the layer containing the activation function and before the Dropout layer (if any). An exception is the sigmoid activation function, where you need to place the batch normalization layer before the activation to ensure that the values lie in the linear region of the sigmoid before the function is applied. Keep the feature space wide and shallow in the initial stages of the network, and then make it narrower and deeper towards the end. (Very important) Place your dropout layers after max pooling layers. If Dropout is placed before the Max-Pool layer, the values it removes may not affect the output of the Max-Pool layer, which picks the maximum from a set of values; only when the maximum value itself is removed can it be thought of as removing a feature dependency.
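The point about dropout before max pooling can be checked on a tiny example. Here is a sketch with NumPy, using a made-up feature map and a hand-picked dropout mask (a real dropout layer draws the mask at random and rescales the kept values; both are left out here to keep the effect visible):

```python
import numpy as np


def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """2x2 max pooling with stride 2 on a 2D array with even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


feature_map = np.array([
    [1.0, 2.0, 0.5, 0.1],
    [3.0, 9.0, 0.2, 7.0],
])

# A hand-picked dropout mask (0 = dropped) that removes only non-maximum
# values in each 2x2 window. Rescaling of kept values is omitted for clarity.
mask = np.array([
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
])

pooled_plain = max_pool_2x2(feature_map)
pooled_after_dropout = max_pool_2x2(feature_map * mask)

# Both results are identical: the dropped values never reached the output.
print(pooled_plain, pooled_after_dropout)
```

Applying the same kind of mask after pooling would zero out a pooled value directly, so every dropped unit actually changes what the next layer sees.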

Day 7

This day was all about NLP, and again we had two challenges: analysing IMDB reviews, another very famous dataset, and analysing the sentiment of comments in the Twitter dataset.

Every LSTM layer should be accompanied by a Dropout layer. This layer helps prevent overfitting by ignoring randomly selected neurons during training, and hence reduces the sensitivity to the specific weights of individual neurons. A rate of 20% is often used as a good compromise between retaining model accuracy and preventing overfitting.

The main challenge here was selecting the number of LSTM/GRU units. I used K-Fold cross validation to do so; if you do not want to dive deep into its workings, you might want to use this simplified rule of thumb for the number of hidden units Nₕ:

Nₕ = Ns / (α × (Nᵢ + Nₒ))

Nᵢ is the number of input neurons, Nₒ the number of output neurons, Ns the number of samples in the training data, and α a scaling factor.

The scaling factor α is usually between 2 and 10. You can think of α as the effective branching factor, or the number of non-zero weights for each neuron. I personally advise you to set α between 5 and 10 for better performance.
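The rule of thumb is easy to turn into a helper. This sketch assumes the commonly cited form Nₕ = Ns / (α × (Nᵢ + Nₒ)), which matches the variable definitions above; the function name and the example numbers are hypothetical:

```python
def suggested_hidden_units(n_samples: int, n_inputs: int, n_outputs: int,
                           alpha: float = 5.0) -> int:
    """Rule-of-thumb upper bound: N_s / (alpha * (N_i + N_o)), at least 1."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return max(1, int(n_samples / (alpha * (n_inputs + n_outputs))))


# Hypothetical numbers: 25,000 training samples, 100 inputs, 1 output.
for alpha in (2, 5, 10):
    print(alpha, suggested_hidden_units(25_000, 100, 1, alpha=alpha))
```

A larger α gives a smaller, more conservative network, which is why the higher end of the range tends to guard better against overfitting.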

Here is a wonderful visualization I created with Embedding projector for the same-

Day 8

Clustering documents, or clustering text, for the consumer complaints data available here was the task for the 8th day. The task was to group the complaints into different categories. I tried many different algorithms, like K-Means, hierarchical, fuzzy and density clustering and restricted Boltzmann machines, and K-Means again turned out to be the best choice. I used the elbow method and the silhouette curve to find the best number of clusters. I would again take this example to show that one should choose simplicity over a bit of performance: affinity propagation gave me the best Fowlkes-Mallows score in this case, but it was just a bit higher than that of K-Means, so I ended up using K-Means instead.

In the tradeoff between model explainability and performance or accuracy, you should prefer explainability when the differences are small.
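The Fowlkes-Mallows score used for the comparison above looks at every pair of samples and checks whether the clustering and the reference labels agree on putting them together. Here is a plain-Python sketch of the pair-counting definition, with made-up labels (this is an illustration of the formula, not the code from my solution):

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_true, labels_pred) -> float:
    """Fowlkes-Mallows score from pair counts: TP / sqrt((TP+FP) * (TP+FN))."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1          # pair grouped together in both labelings
        elif same_pred:
            fp += 1          # together only in the predicted clustering
        elif same_true:
            fn += 1          # together only in the reference labels
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


reference = [0, 0, 0, 1, 1, 1]          # made-up category labels
clusters = [0, 0, 1, 1, 2, 2]           # made-up cluster assignments
print(fowlkes_mallows(reference, clusters))
```

The score ranges from 0 to 1 and does not depend on how the clusters are numbered, so two algorithms can be compared directly on the same labelled data.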

Day 9 and 10

These days were about TensorFlow.js: we had to deploy models that ran in the browser itself. If you do not know much about TF.js, you can read my blog post on getting started with it, "Getting started with Deep Learning in browser with TF.js", on medium.com.

We also had to build a project of our choice with TF.js. Over these days I built two projects: a real-time text sentiment analyzer and a web app that could determine your pose. The pose detection was built on top of the PoseNet model. Browser-based inference or training should be done with care: you do not want your process to block your computer or fill it with intermediate variables. A good idea here is to use something like Firebase to host a trained model and call it from your web app. There is also a lot of preprocessing to be done for real-time inference. You should not block any threads; run your processes asynchronously so that your UI stays responsive while your model is being loaded and made ready.

I have seen a few people writing all of their JS code between two script tags in the HTML itself, or keeping all their code in a single index.js file, so that one file holds everything. That is a very bad approach from a developer’s point of view. You should try to separate your code, with one file for the code that gets the data, another to load the model, another to preprocess and so on. This makes it super easy for someone else to understand your code.

It is almost mandatory to use the wonderful tf.tidy: it keeps your runtime free of the intermediate tensors and other variables you might have created, giving you the best memory utilization and preventing memory leaks. Run your processes asynchronously; you can simply use the keyword async while defining the function and await while calling it, which does not freeze your UI thread and is pretty helpful. I would also recommend shrinking the model by applying weight quantization: by quantizing the model weights, we can reduce the size of our model to a fourth of the original size.
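The “fourth of the original size” figure follows from storing each 32-bit float weight as a single 8-bit integer. Here is a sketch of the idea with NumPy, using random made-up weights and a simple min/max affine scheme; it illustrates the principle and is not necessarily the exact scheme applied by the TensorFlow.js tooling:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 128)).astype(np.float32)  # made-up layer weights

# Affine quantization: map [w_min, w_max] onto the 256 levels of a uint8.
w_min, w_max = float(weights.min()), float(weights.max())
scale = (w_max - w_min) / 255.0
quantized = np.round((weights - w_min) / scale).astype(np.uint8)

# Dequantize to approximate the original values at inference time.
restored = quantized.astype(np.float32) * scale + w_min

print("size ratio:", weights.nbytes / quantized.nbytes)
print("max abs error:", float(np.abs(weights - restored).max()))
```

The price for the smaller download is a rounding error of at most about half a quantization step per weight, so it is worth re-checking accuracy after quantizing.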

Results

The results for the challenge came out on 9 April, 2020 and I was among the winners 😃. To my delight, I was also promoted to a mentor for the group for amazing engagement.

Thanks to Sayak Paul, Ali Mustufa Shaikh, Shubham Sah, Mitusha Arya and Smit Jethwa for their mentorship during the program.

About Me

Hi everyone, I am Rishit Dagli.

Twitter

Website

If you want to ask me some questions, report any mistake, suggest improvements or give feedback, you are free to do so by emailing me.

 