Sparse data means input data that is incomplete or has many missing values, on which we train machine learning models to make predictions.
On the other hand, data density is exactly the opposite situation, where you have little or no missing data.
The following table illustrates data sparsity and data density:
Name | Age | Income ($)
Benny | 27 | NA#(0)
Anna | 21 | NA#(0)
Alice | NA#(0) | 8,500
Benson | 29 | NA#(0)
Hudson | NA#(0) | 7,500
Sheila | NA#(0) | 9,800
In the above table we can notice NAs (Not Available, represented as 0). The table is a 6 x 3 sparse matrix of 18 elements (6 rows and 3 columns, excluding the header row), of which 6 elements have 0 as their value. So we can say that the above input data has about 33% data sparsity and 67% data density.
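As a quick check, the sparsity and density figures above can be computed with NumPy. This is a small sketch using the Age and Income columns from the table, with each NA encoded as 0 as in the table:

```python
import numpy as np

# Age and Income columns from the table above, with NA encoded as 0
numeric = np.array([[27,    0],   # Benny
                    [21,    0],   # Anna
                    [ 0, 8500],   # Alice
                    [29,    0],   # Benson
                    [ 0, 7500],   # Hudson
                    [ 0, 9800]])  # Sheila

total_elements = 6 * 3            # 6 rows x 3 columns, as in the table
zeros = int(np.count_nonzero(numeric == 0))
sparsity = zeros / total_elements
density = 1 - sparsity
print(f"{zeros} zeros -> sparsity {sparsity:.0%}, density {density:.0%}")
```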
Data sparsity is a real-world scenario one routinely has to deal with. It is common in customer surveys that touch on sensitive personal information, where respondents often skip questions.
How do we scale the sparse data?
Scaling is another significant preprocessing task to be carried out on sparse data. In scikit-learn, the sklearn.preprocessing module provides a class, MaxAbsScaler, and a function, maxabs_scale, for this purpose.
For our example, let us use a simple array containing sparse data.
Code:
from sklearn.preprocessing import MaxAbsScaler
import numpy as np

# A small matrix with several zero entries (sparse data)
sparse_matrix = np.array([[25, 28,  0],
                          [47,  0, 30],
                          [ 0, 50, 80]])

# Learn the maximum absolute value of each column, then scale by it
scaler = MaxAbsScaler().fit(sparse_matrix)
scaler.transform(sparse_matrix)
Output:
array([[ 0.53191489, 0.56 , 0. ],
[ 1. , 0. , 0.375 ],
[ 0. , 1. , 1. ]])
Note: The above code was executed in the Spyder IDE with Python 3.
MaxAbsScaler divides each feature (column) by that column's maximum absolute value, mapping every value into the range [-1, 1] (or [0, 1] when all values are non-negative, as here). Because it only scales and never shifts the data, zero entries remain zero, which preserves sparsity. A drawback of this class is that it does not reduce the influence of outliers, since the maximum absolute value may itself be an outlier.
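The companion function maxabs_scale applies the same scaling in one call, and MaxAbsScaler also accepts SciPy sparse input directly, keeping the result sparse. A small sketch (the CSR matrix below is built from the same example data purely for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler, maxabs_scale

dense = np.array([[25, 28,  0],
                  [47,  0, 30],
                  [ 0, 50, 80]], dtype=float)

# One-shot functional form: equivalent to MaxAbsScaler().fit_transform(dense)
scaled = maxabs_scale(dense)

# MaxAbsScaler also handles SciPy sparse matrices and returns a sparse result,
# since dividing by the column-wise max abs never turns a zero into a nonzero
sparse_scaled = MaxAbsScaler().fit_transform(csr_matrix(dense))
print(np.allclose(scaled, sparse_scaled.toarray()))
```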