For every data scientist, feature engineering is an important step before any kind of analysis on a dataset, whether predictive or prescriptive. Feature engineering covers many tasks, such as removing NaN values, dropping unnecessary columns, scaling, splitting, merging, and concatenating data. These techniques can improve a model's accuracy and make its results more reliable. Two important tools in this process are skewness and kurtosis, which describe the shape of a distribution. The details are given below:
Skewness is the lack of symmetry in the distribution of a dataset. In simple terms, if we plot the distribution of the data, skewness tells us how far its shape departs from a symmetric curve such as the normal distribution. The greater the skew, the greater the lack of symmetry. A distribution is symmetrical, or has no skew, when the values are distributed evenly on both sides of the mean. In that case the skew is zero and, for a distribution with a single peak, mean = median = mode, so the three coincide. There are two main types of skewness: left skew and right skew. In a right-skewed (positively skewed) distribution, most values sit on the left and a long tail containing the outliers stretches to the right, which typically pulls the mean to the right of the median. In a left-skewed (negatively skewed) distribution the opposite happens: the tail and the outliers lie on the left, and the mean is typically pulled to the left of the median. Skewness can therefore be negative, positive, or zero, and one way to measure it is the following formula:
$$Sk = \frac{3\,(\text{mean} - \text{median})}{\text{standard deviation}}$$
Here Sk is called the coefficient of skewness. If it is negative, the distribution is negatively skewed; if it is positive, the distribution is positively skewed; and if it is 0, there is no skew. The coefficient ranges from -3 to +3.
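As a minimal sketch of the formula above, the coefficient can be computed with Python's standard `statistics` module. The function name `pearson_median_skewness` and the three small datasets are made up for illustration, and the sketch assumes the sample standard deviation (divisor n - 1), which the formula itself does not specify.

```python
import statistics


def pearson_median_skewness(values):
    """Return Sk = 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    # Assumption: sample standard deviation (divides by n - 1).
    std_dev = statistics.stdev(values)
    if std_dev == 0:
        raise ValueError("standard deviation is zero, so Sk is undefined")
    return 3 * (mean - median) / std_dev


# Illustrative data only: one long right tail, one symmetric, one long left tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]
symmetric = [1, 2, 3, 4, 5]
left_skewed = [-20, -4, -4, -3, -3, -3, -2, -2, -1]

sk_right = pearson_median_skewness(right_skewed)
sk_symmetric = pearson_median_skewness(symmetric)
sk_left = pearson_median_skewness(left_skewed)

print("right-skewed Sk:", sk_right)    # positive
print("symmetric Sk:", sk_symmetric)   # zero
print("left-skewed Sk:", sk_left)      # negative
```

The outlier 20 pulls the mean above the median, so Sk comes out positive for the first dataset; mirroring the data flips the sign.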
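Skewness can also be measured from the moments of the distribution, using the coefficient β1:
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$$
Here µ2 and µ3 are the second and third central moments, and µ2 is the variance.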
The sample estimate is given by:
$$b_1 = \frac{m_3^2}{m_2^3}$$
where m2 and m3 are given by:
$$m_2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
$$m_3 = \frac{\sum (x - \bar{x})^3}{n - 1}$$
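For a symmetrical distribution, the value of b1 should be equal to 0. Because b1 is built from the square of m3, it can never be negative, so the direction of the skewness is read from the sign of m3: a positive m3 indicates right skew and a negative m3 indicates left skew.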
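The sample estimate can be sketched in a few lines of plain Python. The helper names `central_moment`, `moment_skewness`, and `skew_direction` are hypothetical, and the divisor n - 1 follows the definitions of m2 and m3 given here; other texts and libraries often divide by n instead, so results may differ slightly from theirs.

```python
def central_moment(values, order):
    """Sample central moment: sum((x - mean) ** order) / (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / (n - 1)


def moment_skewness(values):
    """Return (b1, m3), where b1 = m3**2 / m2**3."""
    m2 = central_moment(values, 2)
    m3 = central_moment(values, 3)
    if m2 == 0:
        raise ValueError("variance is zero, so b1 is undefined")
    return m3 ** 2 / m2 ** 3, m3


def skew_direction(m3):
    """Read the direction of the skew from the sign of m3."""
    if m3 > 0:
        return "right"
    if m3 < 0:
        return "left"
    return "none"


# Illustrative data only.
b1_sym, m3_sym = moment_skewness([1, 2, 3, 4, 5])
b1_right, m3_right = moment_skewness([1, 2, 2, 3, 3, 3, 4, 4, 20])

print("symmetric:", b1_sym, skew_direction(m3_sym))
print("right-skewed:", b1_right, skew_direction(m3_right))
```

Returning m3 alongside b1 keeps the sign information that squaring would otherwise discard.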
Kurtosis is defined as a measure of the convexity, or peakedness, of the curve. There are broadly three types of kurtosis: the mesokurtic or normal curve, the leptokurtic or leaping curve, and the platykurtic or flat curve. Kurtosis is measured by the Pearson coefficient β2.
The formula for β2 is:
$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$
where µ4 is the fourth central moment. The sample estimate is:
$$b_2 = \frac{m_4}{m_2^2}$$
The value of m4 is given as:
$$m_4 = \frac{\sum (x - \bar{x})^4}{n - 1}$$
If b2 is equal to 3, the distribution is said to be normal (mesokurtic); if it is greater than 3, it is leptokurtic; and if it is less than 3, it is platykurtic.
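The rule above can be sketched as follows, taking b2 as the ratio of m4 to the square of m2, which is the usual form of the Pearson coefficient. The function names and the `tolerance` parameter are assumptions added for illustration: a real sample almost never gives exactly 3, so the sketch treats values within a small band around 3 as mesokurtic. The divisor n - 1 follows the definition of m4 given here.

```python
def central_moment(values, order):
    """Sample central moment: sum((x - mean) ** order) / (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / (n - 1)


def sample_kurtosis_b2(values):
    """Return b2 = m4 / m2**2."""
    m2 = central_moment(values, 2)
    m4 = central_moment(values, 4)
    if m2 == 0:
        raise ValueError("variance is zero, so b2 is undefined")
    return m4 / m2 ** 2


def classify_kurtosis(b2, tolerance=0.1):
    """Label b2 relative to 3; `tolerance` is an illustrative choice."""
    if abs(b2 - 3) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if b2 > 3 else "platykurtic"


# Illustrative data only: evenly spread values vs. mostly identical values
# with two extreme points.
flat = [1, 2, 3, 4, 5]
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]

b2_flat = sample_kurtosis_b2(flat)
b2_heavy = sample_kurtosis_b2(heavy_tailed)

print("flat:", b2_flat, classify_kurtosis(b2_flat))
print("heavy-tailed:", b2_heavy, classify_kurtosis(b2_heavy))
```

With these two datasets, the evenly spread values fall below 3 and the data with two extreme points rises above it.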
Use skewness and kurtosis to examine the distribution of your data, and try to remove outliers so that the data is as clean as possible for proper analysis.