
## Hypothesis Testing – A Statistical Technique to Reject or Accept An Existing Claim

Business analysts help decision makers take the right decisions by providing timely insights, information, and facts. Decision makers face many tough questions in day-to-day business management. The following are a few of them.

1. Should XYZ Company reduce its manpower to 50% because of the corona pandemic?
2. Which incentive scheme best motivates employees in the manufacturing industry?
3. Would a 30% increase in the spend budget result in a 50% increase in sales?

Such questions need a statistical approach: first establish a claim, then evaluate it to decide whether it should be rejected. A hypothesis is a statement that establishes a claim, and “hypothesis testing” is the decision-making procedure used to reject or retain that claim.

## What is a Hypothesis?

A hypothesis is an assumption about a population or a population parameter such as the population mean (µ), variance, or proportion (p). It is something to be proved or disproved. It is an educated guess or a claim. It is also a tentative explanation of a principle operating in nature.

There are two types of hypothesis: the Null Hypothesis, denoted by H0, and the Alternative Hypothesis, denoted by H1.

Null Hypothesis (H0): The null hypothesis states that no new (NULL) condition exists: nothing new is happening, the old theory is still true, the old standards or quality levels are still correct, and the system is under control. It assumes that there is no difference or no effect.

Alternative Hypothesis (H1): The alternative hypothesis states that a new theory exists and is true: there are new standards, the existing system is not under control, or something different is happening.
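As one possible formalization, a two-sided test about a population mean µ could be written as below. The symbol µ0, standing for the value claimed by the existing theory, is an assumption introduced here for illustration.

```latex
\[
H_0 : \mu = \mu_0
\qquad \text{versus} \qquad
H_1 : \mu \neq \mu_0
\]
```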

## Steps in Testing of Hypothesis

1. State the null hypothesis H0 and the alternative hypothesis H1 correctly. This first step is crucial to the entire process. Misrepresenting or misinterpreting the facts, or wrongly defining either hypothesis, wastes time, money, and effort on establishing and proving unnecessary facts and assumptions.
2. Specify the level of significance. In statistics, some margin of error exists in every model we build; no model or estimate is 100% foolproof. At this stage the researcher defines how much error will be tolerated as inevitable. The significance level is the probability of committing a Type 1 error, i.e. rejecting the null hypothesis H0 when it is true. A researcher who asks for 95% confidence is accepting a 5% error tolerance. This 5%, or 0.05, is referred to as α (alpha).
3. Use an appropriate statistical test. This third step depends on the requirements and on how the null hypothesis H0 and the alternative hypothesis H1 were defined in the first step.

4. Decision Rule

Every statistical test yields one quantity, the p-value, and based on it we decide whether to reject the null hypothesis (H0) or fail to reject it.

Confidence levels may be set at 95%, 98%, or 99%. Accordingly, alpha (α), i.e. the tolerated error probability, changes to 0.05, 0.02, or 0.01. A researcher needs 1 – α confidence to support the alternative hypothesis (H1). In a nutshell:

Decision Rule – 1 : If the p-value is <= α (for example 0.05), reject the null hypothesis.

Decision Rule – 2 : If the p-value is > α, do not reject the null hypothesis.

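If alpha is 0.05 and the p-value is 0.02, the p-value is below alpha, so the researcher can reject the old or existing claim, i.e. the null hypothesis, at the 5% significance level. Note that the p-value is the probability of observing data at least this extreme if H0 were true; it is not the probability that H1 is true.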
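The decision rule can be sketched as a small helper. The function name `decide` and its return strings are hypothetical, chosen only for this illustration.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the p-value decision rule at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Reject H0 only when the p-value does not exceed the tolerated error.
    return "reject H0" if p_value <= alpha else "fail to reject H0"


print(decide(0.02))              # p-value below alpha = 0.05
print(decide(0.20))              # p-value above alpha = 0.05
print(decide(0.02, alpha=0.01))  # same p-value, stricter alpha
```

The last call shows that the same p-value can lead to a different decision once a stricter alpha is chosen, which is why alpha should be fixed before the test is run.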

5. Draw Conclusion

There are many statistical tests, such as the t test (comparison of means), the F test (comparison of variances), and ANOVA (analysis of variance). The result of each test is judged by the p-value it yields.
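The five steps above can be sketched end to end with a one-sample t test. This is a minimal illustration, assuming SciPy is installed; the sample values, the claimed mean of 500, and the variable names are all invented for the example.

```python
from scipy import stats

# Hypothetical fill volumes (ml) measured on 10 bottles; invented data.
sample = [498, 502, 497, 495, 499, 496, 494, 498, 497, 496]

# Step 1: H0: population mean = 500 ml, H1: population mean != 500 ml.
claimed_mean = 500.0

# Step 2: level of significance.
alpha = 0.05

# Step 3: one-sample, two-sided t test of the sample mean against the claim.
result = stats.ttest_1samp(sample, popmean=claimed_mean)

# Step 4: decision rule, comparing the p-value with alpha.
reject_h0 = result.pvalue <= alpha

# Step 5: draw the conclusion.
conclusion = "reject H0" if reject_h0 else "fail to reject H0"
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}, {conclusion}")
```

With this particular sample the mean sits well below 500 relative to its spread, so the test rejects H0 at the 5% level.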

## How to deal with sparse data in Machine Learning?

Sparse data means input data that is incomplete, lacking, or has missing values, on which we train machine learning models to predict.

On the other hand, data density is exactly the opposite situation, where you do not have missing data.

The following table illustrates data sparsity and data density:

| Name | Age | Income ($) |
|---|---|---|
| Benny | 27 | NA (0) |
| Anna | 21 | NA (0) |
| Alice | NA (0) | 8,500 |
| Benson | 29 | NA (0) |
| Hudson | NA (0) | 7,500 |
| Sheila | NA (0) | 9,800 |

In the table above we can see NAs (Not Available, represented as 0). Treated as a 6 × 3 sparse matrix of 18 elements (6 rows and 3 columns, excluding the column labels), the table has 6 elements with 0 as their value. So we can say that the input data has about 33% data sparsity and 67% data density.
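The sparsity and density figures can be checked by counting the zero cells in the table directly. This is a small sketch in plain Python; the variable names are illustrative.

```python
# Rows of the example table; 0 stands for NA, as in the table above.
rows = [
    ("Benny", 27, 0),
    ("Anna", 21, 0),
    ("Alice", 0, 8500),
    ("Benson", 29, 0),
    ("Hudson", 0, 7500),
    ("Sheila", 0, 9800),
]

total_cells = sum(len(row) for row in rows)

# `cell == 0` is False for the name strings, so only numeric zeros are counted.
zero_cells = sum(1 for row in rows for cell in row if cell == 0)

sparsity = zero_cells / total_cells
density = 1 - sparsity
print(f"zeros = {zero_cells} of {total_cells}")
print(f"sparsity = {sparsity:.0%}, density = {density:.0%}")
```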

Data sparsity is a real-world scenario that you will have to deal with. It is common in customer surveys, where respondents are sensitive about sharing personal information.

How do we scale the sparse data?

Scaling is another significant preprocessing task to carry out on sparse data. In scikit-learn, sklearn.preprocessing offers a class, MaxAbsScaler, and a function, maxabs_scale, for this purpose.

For our example, let us use a simple array with sparse data.

Code:

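```python
from sklearn.preprocessing import MaxAbsScaler
import numpy as np

# Small example array in which 0 marks a missing value
sparse_matrix = np.array([[25, 28,  0],
                          [47,  0, 30],
                          [ 0, 50, 80]])

# Learn the maximum absolute value of each column, then scale by it
scaler = MaxAbsScaler().fit(sparse_matrix)
scaler.transform(sparse_matrix)
```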

Output:

```
array([[ 0.53191489,  0.56      ,  0.        ],
       [ 1.        ,  0.        ,  0.375     ],
       [ 0.        ,  1.        ,  1.        ]])
```

Note: The code above was run in the Spyder IDE with Python 3.0.

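MaxAbsScaler divides each feature (column) by its maximum absolute value, so the scaled values fall in the range [-1, 1], or [0, 1] when the data is non-negative as in this example. It does not shift or center the data, so 0 entries stay 0 and sparsity is preserved. One drawback is that it is sensitive to outliers: a single very large value in a column compresses all the other values in that column.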
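The same scaling can be sketched with the maxabs_scale function and with a SciPy sparse matrix as input. This is a minimal illustration, assuming scikit-learn, NumPy, and SciPy are installed; the variable names and the small negative-value array are invented for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import MaxAbsScaler, maxabs_scale

dense = np.array([[25, 28,  0],
                  [47,  0, 30],
                  [ 0, 50, 80]], dtype=float)

# Function form: scales each column by its maximum absolute value.
scaled_dense = maxabs_scale(dense)

# The same result computed by hand, to show what the scaler does.
manual = dense / np.abs(dense).max(axis=0)

# With a sparse matrix as input, the result is also sparse and the
# zero entries are left untouched.
sparse_input = csr_matrix(dense)
scaled_sparse = MaxAbsScaler().fit_transform(sparse_input)

# Negative values keep their sign, so the range is [-1, 1] in general.
signed = maxabs_scale(np.array([[-4.0], [2.0]]))

print(scaled_dense)
print(issparse(scaled_sparse), scaled_sparse.nnz)
print(signed.ravel())
```

Because no centering is applied, the number of stored non-zero entries is the same before and after scaling, which is what makes this scaler a reasonable fit for sparse inputs.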