Wednesday, January 25, 2017

Scales in Data Science



  • Ratio Scale
    • Units are equally spaced and there is a true zero
    • Mathematical operations such as +, -, *, / are all valid
    • Example: height and weight
  • Interval Scale
    • Units are equally spaced, but there is no true zero value
    • For example, on the Celsius temperature scale, zero doesn't indicate an absence of temperature.
  • Ordinal Scale
    • The order of the values is important, but the values are not evenly spaced.
    • Example: grades such as A+, A, A-.
  • Nominal Scale (Categorical)
    • Very common in data science. The values are categories of data.
    • The order of the data is not important.
    • Example: sports teams
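To make the difference between the ratio and interval scales concrete, here is a minimal sketch. The weights and temperatures are made-up values used only for illustration; they are not data from this post.

```python
# Hypothetical values chosen only for illustration.
weight_kg = [40.0, 80.0]      # ratio scale: 0 kg means "no weight"
temp_celsius = [10.0, 20.0]   # interval scale: 0 °C is an arbitrary reference point

# Ratios are meaningful on a ratio scale: 80 kg is twice 40 kg.
weight_ratio = weight_kg[1] / weight_kg[0]

# On an interval scale only differences are meaningful.
temp_difference = temp_celsius[1] - temp_celsius[0]

# 20 °C is not "twice as hot" as 10 °C. Converting to Kelvin, which has a
# true zero, shows the actual ratio of the two temperatures.
temp_kelvin = [c + 273.15 for c in temp_celsius]
kelvin_ratio = temp_kelvin[1] / temp_kelvin[0]

print(weight_ratio)            # 2.0
print(temp_difference)         # 10.0
print(round(kelvin_ratio, 3))  # 1.035
```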
In [1]:
import pandas as pd
import numpy as np
In [2]:
student = ["alex","bob","cynthia","daniel","evans"]
tshirt = ["L","XL","S","M","L"]
In [3]:
df = pd.DataFrame(data = tshirt, index=student)
In [4]:
df = df.rename(columns={0:"tshirt"})
In [5]:
df
Out[5]:
        tshirt
alex         L
bob         XL
cynthia      S
daniel       M
evans        L

Nominal Scale (Setting type as category)

In [6]:
df["tshirt"].astype("category")
Out[6]:
alex        L
bob        XL
cynthia     S
daniel      M
evans       L
Name: tshirt, dtype: category
Categories (4, object): [L, M, S, XL]
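The following sketch shows what "nominal" means in practice for a pandas categorical: equality checks work, but ordering comparisons are rejected. The names `sizes`, `nominal`, `is_large` and `ordering_allowed` are introduced here only for illustration.

```python
import pandas as pd

# Same T-shirt data as in the cells above, rebuilt so this snippet runs on its own.
sizes = pd.Series(["L", "XL", "S", "M", "L"],
                  index=["alex", "bob", "cynthia", "daniel", "evans"],
                  name="tshirt")
nominal = sizes.astype("category")

# Without an explicit order, the categories are simply sorted lexically.
print(list(nominal.cat.categories))  # ['L', 'M', 'S', 'XL']
print(nominal.cat.ordered)           # False

# Equality checks are allowed on a nominal scale...
is_large = nominal == "L"

# ...but ordering comparisons raise a TypeError.
try:
    nominal < "XL"
    ordering_allowed = True
except TypeError:
    ordering_allowed = False
print(ordering_allowed)              # False
```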

Ordinal scale (ordered = True)

In [7]:
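# Current pandas versions no longer accept categories=/ordered= in astype(); pass a CategoricalDtype instead.
df = df["tshirt"].astype(pd.CategoricalDtype(categories=["S", "M", "L", "XL"], ordered=True))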
In [8]:
df
Out[8]:
alex        L
bob        XL
cynthia     S
daniel      M
evans       L
Name: tshirt, dtype: category
Categories (4, object): [S < M < L < XL]
In [9]:
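# Compare by position: current pandas versions raise an error when two Series with different index labels are compared.
df.loc[["alex"]] < df.loc[["daniel"]].values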
Out[9]:
alex    False
Name: tshirt, dtype: bool
In [10]:
df.loc["alex"]
Out[10]:
'L'
In [11]:
df.loc["daniel"]
Out[11]:
'M'
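Because a scalar lookup returns a plain string, comparing two single elements directly falls back to alphabetical order. One possible workaround, sketched below, is to compare the integer category codes instead. The names `size_order`, `sizes`, `codes`, `alphabetical` and `by_order` are assumptions made for this example.

```python
import pandas as pd

# Ordered dtype with the same categories as the cell above.
size_order = pd.CategoricalDtype(categories=["S", "M", "L", "XL"], ordered=True)
sizes = pd.Series(["L", "XL", "S", "M", "L"],
                  index=["alex", "bob", "cynthia", "daniel", "evans"],
                  name="tshirt").astype(size_order)

# Scalar lookups return plain strings, so this compares alphabetically.
alphabetical = sizes.loc["alex"] < sizes.loc["daniel"]  # 'L' < 'M'

# Category codes are the integer positions in [S, M, L, XL].
codes = sizes.cat.codes
by_order = codes.loc["alex"] < codes.loc["daniel"]      # 2 < 1

print(alphabetical)    # True
print(bool(by_order))  # False
```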
In [12]:
df >= "S"
Out[12]:
alex       True
bob        True
cynthia    True
daniel     True
evans      True
Name: tshirt, dtype: bool
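Once the categories are ordered, sorting, `min`/`max` and boolean filtering all follow the declared order rather than the alphabet. A minimal sketch, rebuilding the same data under the illustrative name `sizes`:

```python
import pandas as pd

size_order = pd.CategoricalDtype(categories=["S", "M", "L", "XL"], ordered=True)
sizes = pd.Series(["L", "XL", "S", "M", "L"],
                  index=["alex", "bob", "cynthia", "daniel", "evans"],
                  name="tshirt").astype(size_order)

# Sorting follows S < M < L < XL, not alphabetical order.
sorted_sizes = sizes.sort_values()
print(sorted_sizes.tolist())          # ['S', 'M', 'L', 'L', 'XL']

# min() and max() are defined for ordered categoricals.
print(sizes.min(), sizes.max())       # S XL

# A boolean mask from an ordered comparison can be used to filter.
at_least_large = sizes[sizes >= "L"]
print(at_least_large.index.tolist())  # ['alex', 'bob', 'evans']
```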

cut function

In [13]:
s = pd.Series([9,8,10,1,2,3,6,7,4,5])


pd.cut(s, 3)
Out[13]:
0       (7, 10]
1       (7, 10]
2       (7, 10]
3    (0.991, 4]
4    (0.991, 4]
5    (0.991, 4]
6        (4, 7]
7        (4, 7]
8    (0.991, 4]
9        (4, 7]
dtype: category
Categories (3, object): [(0.991, 4] < (4, 7] < (7, 10]]
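The odd-looking lower edge `0.991` comes from how `pd.cut` builds equal-width bins. The sketch below uses the `retbins=True` option to inspect the edges; the names `binned`, `edges` and `counts` are chosen here for illustration.

```python
import pandas as pd

s = pd.Series([9, 8, 10, 1, 2, 3, 6, 7, 4, 5])

# retbins=True returns the computed bin edges along with the binned data.
binned, edges = pd.cut(s, 3, retbins=True)

# The range 1..10 is split into three bins of width 3. The lowest edge is
# lowered by 0.1% of the range (0.009) so that the minimum value 1 falls
# inside the first right-closed interval.
print(edges)

# Number of values per bin, listed in bin order.
counts = binned.value_counts().sort_index()
print(counts.tolist())  # [4, 3, 3]
```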
In [14]:
s
Out[14]:
0     9
1     8
2    10
3     1
4     2
5     3
6     6
7     7
8     4
9     5
dtype: int64
In [15]:
# You can also add labels for the sizes [Small < Medium < Large].
pd.cut(s, 3, labels=['Small', 'Medium', 'Large'])
Out[15]:
0     Large
1     Large
2     Large
3     Small
4     Small
5     Small
6    Medium
7    Medium
8     Small
9    Medium
dtype: category
Categories (3, object): [Small < Medium < Large]
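If equal-width bins are not what you want, `pd.cut` also accepts explicit bin edges. The cut points below (0, 3, 7, 10) are hypothetical and only illustrate the idea.

```python
import pandas as pd

s = pd.Series([9, 8, 10, 1, 2, 3, 6, 7, 4, 5])

# Hypothetical cut points: (0, 3] -> Small, (3, 7] -> Medium, (7, 10] -> Large.
sized = pd.cut(s, bins=[0, 3, 7, 10], labels=["Small", "Medium", "Large"])

print(sized.tolist())
print(sized.cat.ordered)    # True: the labels form an ordinal scale

# The result can be used like any other ordered categorical.
large_only = s[sized == "Large"]
print(large_only.tolist())  # [9, 8, 10]
```

With these edges the value 4 moves from the first bin to the middle one, which shows how much the choice of edges affects the resulting categories.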