How to deal with sparse data in Machine Learning?

Sparse data is input data that is incomplete, that is, data in which a large share of the values is missing or zero, yet on which we still have to train machine learning models to predict.

Data density, on the other hand, is exactly the opposite: it measures the proportion of values that are present rather than missing.

The following table illustrates data sparsity and data density:

Name     Age      Income ($)
Benny    27       NA (0)
Anna     21       NA (0)
Alice    NA (0)   8,500
Benson   29       NA (0)
Hudson   NA (0)   7,500
Sheila   NA (0)   9,800

In the above table we can notice NAs (Not Available, represented as 0). Treated as a 6 x 3 sparse matrix of 18 elements (6 rows and 3 columns, excluding the column labels), the table has 6 elements with 0 as their value. So we can say that the above input data has about 33% data sparsity and 67% data density.
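As a quick check, these figures can be computed with NumPy. Below is a minimal sketch, assuming the NAs are encoded as 0 exactly as in the table:

import numpy as np

# Age and Income columns from the table above, NA encoded as 0
numeric = np.array([[27,    0],
                    [21,    0],
                    [ 0, 8500],
                    [29,    0],
                    [ 0, 7500],
                    [ 0, 9800]])

total_cells = 6 * 3                            # 6 rows x 3 columns, counting the Name column
zero_cells = np.count_nonzero(numeric == 0)    # the 6 missing values
sparsity = zero_cells / total_cells            # 6 / 18 = 0.33...
density = 1 - sparsity
print(f"sparsity = {sparsity:.0%}, density = {density:.0%}")   # sparsity = 33%, density = 67%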

Data sparsity is a real-world scenario one regularly comes across and has to deal with. It is common, for example, in customer surveys that gather extremely sensitive personal information, where respondents often leave questions unanswered.

How do we scale sparse data?

Scaling is another significant preprocessing task to carry out on sparse data. In scikit-learn, sklearn.preprocessing provides a class, MaxAbsScaler, and an equivalent convenience function, maxabs_scale (a sketch using the function follows the class example below).

For our example, let us use a simple array of sparse data.

Code:

from sklearn.preprocessing import MaxAbsScaler
import numpy as np

# A small matrix with several zero entries (our sparse data)
sparse_matrix = np.array([[25, 28,  0],
                          [47,  0, 30],
                          [ 0, 50, 80]])

# Learn the maximum absolute value of each column, then scale by it
scaler = MaxAbsScaler().fit(sparse_matrix)
scaler.transform(sparse_matrix)

Output:

array([[ 0.53191489,  0.56      ,  0.        ],
       [ 1.        ,  0.        ,  0.375     ],
       [ 0.        ,  1.        ,  1.        ]])

Note: The above code was run in the Spyder IDE with Python 3.
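As mentioned above, the same scaling is also available as the one-call function maxabs_scale. A minimal sketch on the same array:

from sklearn.preprocessing import maxabs_scale
import numpy as np

sparse_matrix = np.array([[25, 28,  0],
                          [47,  0, 30],
                          [ 0, 50, 80]])

# Equivalent to MaxAbsScaler().fit_transform(sparse_matrix)
print(maxabs_scale(sparse_matrix))

The class is the better choice inside a pipeline, where the column maxima learned on the training set must be reused to transform the test set.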

MaxAbsScaler divides each feature by the maximum absolute value found in that feature, so every element is mapped into the range [-1, 1] (here [0, 1], since all our values are non-negative). Because it does not shift or center the data, all 0 values stay 0, which is exactly what makes it suitable for sparse data. Another drawback is that this class does not take care of outliers: a single extreme value in a column dominates the scaling of that column.
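MaxAbsScaler also accepts SciPy CSR/CSC sparse matrices directly, which is how genuinely large sparse data is usually stored. A short sketch, assuming SciPy is installed, showing that the zeros survive scaling:

from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler
import numpy as np

dense = np.array([[25, 28,  0],
                  [47,  0, 30],
                  [ 0, 50, 80]])
X = sparse.csr_matrix(dense)      # stores only the non-zero entries

scaled = MaxAbsScaler().fit_transform(X)
print(scaled.nnz)                 # still 6 non-zeros: sparsity is preserved
print(scaled.toarray())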