## Wednesday, January 25, 2017

### Scales in Data Science

• Ratio Scale
  • Units are equally spaced and there is a true zero value
  • Mathematical operations such as +, -, *, / are all valid
  • Example: height and weight
• Interval Scale
  • Units are equally spaced, but there is no true zero value
  • Differences (+, -) are meaningful, but ratios (*, /) are not
  • Example: temperature in Celsius or Fahrenheit, where zero doesn't indicate an absence of temperature
• Ordinal Scale
  • The order of the values is important, but the values are not evenly spaced
  • Example: letter grades such as A+, A, A-
• Nominal Scale (Categorical)
  • Very common in data science; the values are categories of data
  • The order of the categories is not important
  • Example: sports teams
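
The difference between an interval scale and a ratio scale can be made concrete with a little arithmetic. The sketch below uses made-up temperature readings (the values and variable names are hypothetical, not from the notebook) to show that a ratio computed on Celsius values is misleading, while the same readings in Kelvin, which has a true zero, give a meaningful ratio.

```python
# Hypothetical temperature readings, chosen only for illustration.
celsius_a, celsius_b = 10.0, 20.0

# Differences are meaningful on an interval scale.
diff_celsius = celsius_b - celsius_a

# Ratios are not: 20 degrees C is not "twice as hot" as 10 degrees C.
naive_ratio = celsius_b / celsius_a

# Kelvin has a true zero, so it behaves as a ratio scale.
kelvin_a = celsius_a + 273.15
kelvin_b = celsius_b + 273.15
kelvin_ratio = kelvin_b / kelvin_a

print(diff_celsius)   # difference in degrees
print(naive_ratio)    # misleading ratio on the Celsius scale
print(kelvin_ratio)   # physically meaningful ratio
```

The Celsius ratio comes out as exactly 2, while the Kelvin ratio is only slightly above 1.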
In [1]:
import pandas as pd
import numpy as np

In [2]:
student = ["alex","bob","cynthia","daniel","evans"]
tshirt = ["L","XL","S","M","L"]

In [3]:
df = pd.DataFrame(data = tshirt, index=student)

In [4]:
df = df.rename(columns={0:"tshirt"})

In [5]:
df

Out[5]:
tshirt
alex L
bob XL
cynthia S
daniel M
evans L

### Nominal Scale (Setting type as category)

In [6]:
df["tshirt"].astype("category")

Out[6]:
alex        L
bob        XL
cynthia     S
daniel      M
evans       L
Name: tshirt, dtype: category
Categories (4, object): [L, M, S, XL]
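
To see what "nominal" means in practice, the following sketch rebuilds the same T-shirt data as a standalone Series (the name `sizes` is an assumption made here so the example is self-contained) and inspects it through the `.cat` accessor. It also shows that equality checks work on an unordered categorical, while order comparisons are rejected.

```python
import pandas as pd

# Same data as the notebook, rebuilt so this example runs on its own.
sizes = pd.Series(
    ["L", "XL", "S", "M", "L"],
    index=["alex", "bob", "cynthia", "daniel", "evans"],
    name="tshirt",
).astype("category")

# Categories are stored once; each row holds an integer code pointing at one.
categories = list(sizes.cat.categories)
codes = list(sizes.cat.codes)
is_ordered = sizes.cat.ordered

# Equality is always allowed on a nominal scale.
is_large = sizes == "L"

# Order comparisons are not defined for an unordered categorical.
try:
    sizes > "M"
    comparison_failed = False
except TypeError:
    comparison_failed = True

print(categories, codes, is_ordered, comparison_failed)
```

Without an explicit category list, pandas sorts the categories alphabetically, which is why `XL` ends up last here only by coincidence.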

### Ordinal scale (ordered = True)

In [7]:
# Newer pandas versions take the categories and their order through a CategoricalDtype.
df = df["tshirt"].astype(pd.CategoricalDtype(categories=["S", "M", "L", "XL"], ordered=True))

In [8]:
df

Out[8]:
alex        L
bob        XL
cynthia     S
daniel      M
evans       L
Name: tshirt, dtype: category
Categories (4, object): [S < M < L < XL]
In [9]:
df.loc[["alex"]] < df.loc[["daniel"]]

Out[9]:
alex    False
Name: tshirt, dtype: bool
In [10]:
df.loc["alex"]

Out[10]:
'L'
In [11]:
df.loc["daniel"]

Out[11]:
'M'
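
The two lookups above return plain Python strings, so comparing them directly would fall back to alphabetical order rather than size order. In addition, recent pandas versions may refuse to compare two Series whose index labels differ. A minimal sketch of one way to compare two individual rows by their declared order, assuming a standalone Series named `sizes` (a name introduced here for illustration), is to compare the integer category codes:

```python
import pandas as pd

size_dtype = pd.CategoricalDtype(categories=["S", "M", "L", "XL"], ordered=True)
sizes = pd.Series(
    ["L", "XL", "S", "M", "L"],
    index=["alex", "bob", "cynthia", "daniel", "evans"],
    name="tshirt",
).astype(size_dtype)

# Plain strings compare alphabetically, which disagrees with the size order.
alphabetical = sizes.loc["alex"] < sizes.loc["daniel"]   # "L" < "M"

# Integer codes follow the declared category order: S=0, M=1, L=2, XL=3.
codes = sizes.cat.codes
by_size = bool(codes.loc["alex"] < codes.loc["daniel"])  # 2 < 1

print(alphabetical, by_size)
```

The string comparison says alex's size is "smaller", while the code comparison correctly reports that L is not smaller than M.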
In [12]:
df >= "S"

Out[12]:
alex       True
bob        True
cynthia    True
daniel     True
evans      True
Name: tshirt, dtype: bool

### cut function

In [13]:
s = pd.Series([9,8,10,1,2,3,6,7,4,5])

pd.cut(s, 3)

Out[13]:
0       (7, 10]
1       (7, 10]
2       (7, 10]
3    (0.991, 4]
4    (0.991, 4]
5    (0.991, 4]
6        (4, 7]
7        (4, 7]
8    (0.991, 4]
9        (4, 7]
dtype: category
Categories (3, object): [(0.991, 4] < (4, 7] < (7, 10]]
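
The slightly odd lower edge of 0.991 in the output above can be reproduced by hand. As far as the output suggests, `pd.cut(s, 3)` splits the range from the minimum to the maximum into three equal-width bins and then lowers the first edge by 0.1% of the range so that the minimum value falls inside the first bin. The sketch below checks that reading using `retbins=True`; the `manual` list is a hypothetical helper introduced here for the comparison.

```python
import pandas as pd

s = pd.Series([9, 8, 10, 1, 2, 3, 6, 7, 4, 5])

# retbins=True also returns the bin edges that cut() computed.
binned, edges = pd.cut(s, 3, retbins=True)

# Reproduce the edges by hand: three equal-width bins over [min, max] ...
value_range = s.max() - s.min()                    # 10 - 1 = 9
width = value_range / 3                            # 3.0
manual = [s.min() + i * width for i in range(4)]   # 1, 4, 7, 10

# ... with the lowest edge lowered by 0.1% of the range so that 1 is included.
manual[0] -= value_range * 0.001                   # 1 - 0.009

print(list(edges))
print(manual)
```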
In [14]:
s

Out[14]:
0     9
1     8
2    10
3     1
4     2
5     3
6     6
7     7
8     4
9     5
dtype: int64
In [15]:
# You can also add labels for the sizes [Small < Medium < Large].
pd.cut(s, 3, labels=['Small', 'Medium', 'Large'])

Out[15]:
0     Large
1     Large
2     Large
3     Small
4     Small
5     Small
6    Medium
7    Medium
8     Small
9    Medium
dtype: category
Categories (3, object): [Small < Medium < Large]
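
Instead of a number of equal-width bins, `pd.cut` also accepts explicit bin edges. The following sketch uses hypothetical edges `[0, 3, 7, 10]`, chosen only for illustration, and counts how many values land in each labelled bin.

```python
import pandas as pd

s = pd.Series([9, 8, 10, 1, 2, 3, 6, 7, 4, 5])

# Hypothetical bin edges: (0, 3] -> Small, (3, 7] -> Medium, (7, 10] -> Large.
binned = pd.cut(s, bins=[0, 3, 7, 10], labels=["Small", "Medium", "Large"])

# The result is an ordered categorical, so the labels keep their order.
is_ordered = binned.cat.ordered

# sort=False keeps the counts in category order rather than by frequency.
counts = binned.value_counts(sort=False)

print(counts)
```

With these edges the value 4 moves from the first bin to the middle one, so the split differs from the equal-width result above.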