Logistic Regression with L2 Regularization in Python

Logistic Regression

Logistic regression is used to solve binary classification problems, in which some examples are “on” and others are “off.” You are given a training set containing examples of each class, each with a label indicating whether it is “on” or “off.” The goal is to learn a model from the training data so that you can predict the label of new examples that you haven’t seen before and whose labels you don’t know.
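Concretely, the model behind logistic regression turns a weighted sum of an example’s attributes into a probability of being “on” (the notation here is mine, added for clarity, not taken from the original write-up):

    P("on" | x) = 1 / (1 + exp(-(w . x)))

where x is the vector of attributes describing the example and w is the vector of internal parameters (weights) that training has to find.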

Assume you have data describing several buildings and earthquakes (for example, the year the building was built, the type of material used, the strength of the earthquake, and so on), and you know whether each building collapsed (“on”) or not (“off”) in each previous earthquake. You want to predict whether a specific building will collapse in a hypothetical future earthquake using this data.

One of the first models to try would be logistic regression.

Making it up in code

I wasn’t working on this specific problem, but I was working on something similar. As someone who believes in practicing what they preach, I went looking for a dead-simple Python logistic regression class. The only requirement was that it support L2 regularization (more on this later). I’m also sharing this code with a lot of other people on a variety of platforms, so I wanted as few external library dependencies as possible.

I couldn’t find exactly what I was looking for, so I decided to go down memory lane and make it myself. I’ve done it before in C++ and Matlab, but never in Python.

I won’t go into the derivation, but if you’re not afraid of a little calculus, there are plenty of good explanations out there to follow. Simply search for “logistic regression derivation” on Google. The main idea is to write down the probability of the data given some internal parameter settings, then take the derivative, which will tell you how to change the internal parameters to make the data more likely. Do you understand? Good.
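For reference, here is roughly what that looks like written out. With labels y_i in {0, 1}, predicted probabilities p_i = 1 / (1 + exp(-(w . x_i))), and L2 strength alpha, a standard form of the regularized log-likelihood and its derivative is:

    L(w)   = sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]  -  (alpha / 2) * ||w||^2

    dL/dw  = sum_i (y_i - p_i) * x_i  -  alpha * w

Moving the parameters in the direction of the derivative makes the data more likely, while the alpha term keeps the weights from growing without bound.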

For those of you who know logistic regression inside and out, take a look at how short the train() method is. I like how simple Python makes it.
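I’m not reproducing the original class here, but a minimal sketch of the kind of thing it does looks like the following. The names (LogisticRegression, negative_lik, alpha) and the choice of scipy.optimize.fmin_bfgs as the optimizer are illustrative assumptions on my part, not necessarily what the code at the bottom uses:

    import numpy as np
    from scipy.optimize import fmin_bfgs

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    class LogisticRegression:
        """Minimal logistic regression with L2 regularization (sketch)."""

        def __init__(self, x_train, y_train, alpha=0.1):
            self.x_train = x_train                    # shape (n_examples, n_features)
            self.y_train = y_train                    # labels in {0, 1}
            self.alpha = alpha                        # L2 regularization strength
            self.betas = np.zeros(x_train.shape[1])   # internal parameters w

        def negative_lik(self, betas):
            # Negative regularized log-likelihood: the quantity the optimizer minimizes.
            p = sigmoid(self.x_train.dot(betas))
            p = np.clip(p, 1e-12, 1 - 1e-12)          # guard against log(0)
            log_lik = np.sum(self.y_train * np.log(p) +
                             (1 - self.y_train) * np.log(1 - p))
            return -log_lik + 0.5 * self.alpha * betas.dot(betas)

        def train(self):
            # All of the "learning": hand the objective to a quasi-Newton optimizer.
            self.betas = fmin_bfgs(self.negative_lik, self.betas, disp=False)

        def predict(self, x):
            # Probability that each example in x is "on".
            return sigmoid(x.dot(self.betas))

Training then amounts to constructing the object with your data and an alpha, calling train(), and calling predict() on whatever examples you care about.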

Regularization

During March Madness, I took some indirect criticism for discussing how I regularized the latent vectors in my matrix-factorization model of team offensive and defensive strengths when predicting outcomes in NCAA basketball. Some people thought I was crazy, as if I were talking nonsense.

But, seriously, regularization is a good thing.

Allow me to emphasize my point. Examine the output of running the code (linked at the bottom).

Look at the first row.

On the left side is the training set. Along the x-axis are 25 examples, and the y-axis indicates whether each example is “on” (1) or “off” (0). There is a vector describing the attributes of each of these examples that I am not showing. After training the model, I ask it to disregard the known training set labels and estimate the probability that each label is “on” based solely on the examples’ description vectors and what it has learned (hopefully things like stronger earthquakes and older buildings increase the likelihood of collapse). The red Xs represent these probabilities. In the top-left panel they sit right on top of the blue dots, so the model is very certain about the labels of the examples, and it is always correct.

On the right side are new examples that the model hasn’t seen before; this is referred to as the test set. It is essentially the same as the left side, except that the model was never shown the test set’s class labels (yellow dots). What you see is that it still does a good job of predicting the labels, but there are some troubling cases where it is both confident and wrong. This is referred to as overfitting.

This is where regularization comes into play. As you move down the rows, the L2 regularization becomes stronger; or, to put it another way, there is more pressure on the internal parameters to be zero. This has the effect of lowering the model’s confidence. Just because it can reconstruct the training set perfectly does not mean it understands everything. You can imagine that if you were going to rely on this model to make important decisions, it would be nice to have some regularization in there.
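If you would rather see the effect as numbers than as a plot, something along these lines (reusing the LogisticRegression sketch from earlier; the alpha values and variable names are arbitrary) shows the weights shrinking and the predictions becoming less extreme as the penalty grows:

    import numpy as np

    # Assumes x_train, y_train, x_test already exist and the LogisticRegression
    # sketch above has been defined.
    for alpha in [0.0, 0.1, 1.0, 10.0]:
        model = LogisticRegression(x_train, y_train, alpha=alpha)
        model.train()
        p_test = model.predict(x_test)
        print("alpha=%-5g  ||w||=%.3f  most extreme test probability=%.3f"
              % (alpha, np.linalg.norm(model.betas), np.abs(p_test - 0.5).max() + 0.5))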

And now for the code. It looks lengthy, but most of it is spent generating data and plotting the results. The train() method, which is only three (dense) lines long, does the real work. It requires NumPy, SciPy, and pylab.
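For what it’s worth, the plotting side of such a script can be as simple as the sketch below (blue dots for the true labels, red Xs for the predicted probabilities, matching the figures described above). This is my own illustration, not the original plotting code:

    import numpy as np
    import pylab

    def plot_fit(y_true, p_predicted, title=""):
        # Blue dots: true 0/1 labels.  Red Xs: predicted probability of "on".
        n = len(y_true)
        pylab.plot(np.arange(n), y_true, 'bo', label='true label')
        pylab.plot(np.arange(n), p_predicted, 'rx', label='P(on)')
        pylab.ylim(-0.1, 1.1)
        pylab.title(title)
        pylab.legend(loc='center right')

    # Example usage, assuming a trained model and data from the sketches above:
    # plot_fit(y_train, model.predict(x_train), title="Training set")
    # pylab.show()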

* In the interest of full disclosure, I should admit that I generated my random data in such a way that it is prone to overfitting, potentially making logistic regression without regularization appear worse than it is.
