Sunday, May 15, 2016

Python Pandas for Beginners


In [3]:
import pandas as pd

data = pd.read_csv("simple.csv")
In []:
Here we read the CSV file into Python using pandas. pandas can read the CSV by itself, without needing the csv module.
You can also write the data out in other formats, such as JSON.
In [7]:
fdout = open("simple.data.json", "w")
fdout.write("%s\n" % pd.json.dumps(data))
fdout.close()
In [8]:
print open("simple.data.json").readlines()
['{"Height":{"0":178,"1":165,"2":180,"3":173,"4":178},"Weight":{"0":65,"1":60,"2":75,"3":61,"4":60}}\n']
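The pd.json module used above may not be available in recent pandas releases. A minimal sketch of the same export using DataFrame.to_json(), assuming the five rows of simple.csv are typed in by hand so the example runs without the file:

```python
import json
import pandas as pd

# Same values as simple.csv, built inline so the sketch runs without the file.
data = pd.DataFrame({"Height": [178, 165, 180, 173, 178],
                     "Weight": [65, 60, 75, 61, 60]})

# With no path argument, to_json() returns the JSON text as a string.
json_text = data.to_json()

# Parse it back with the standard json module to inspect the structure.
parsed = json.loads(json_text)
print(parsed["Height"]["0"])
```

Passing a file name to to_json() writes the file directly instead of returning a string.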

In [9]:
Let's now manipulate the data that was read.
The first question is what the CSV data looks like once it has been read into pandas.
In [67]:
data
Out[67]:
   Height  Weight
0     178      65
1     165      60
2     180      75
3     173      61
4     178      60
5 rows × 2 columns
In [68]:
data.columns
Out[68]:
Index([u'Height', u'Weight'], dtype='object')
In []:
"columns" returns an Index object. It holds two fields, "Height" and "Weight".
They are the features of the data. Printing "data" also shows all the fields.
We can see that it has 5 rows, each corresponding to one observation (in this case one person).
For example, person 1 (the first row) has a height of 178 cm and a weight of 65 kg. Now let's start manipulating the data
we have read into pandas.

When we read the CSV into pandas, pandas creates an attribute for each of the fields (features/columns).
In this example it creates "Height" and "Weight", which can be accessed as data.Height and data.Weight; each is a
pandas Series object.

Now let's get only the Height feature for all the observations, and then the same for Weight. It can be done as below.
In [13]:
data.Weight
Out[13]:
0    65
1    60
2    75
3    61
4    60
Name: Weight, dtype: int64
In [12]:
data.Height
Out[12]:
0    178
1    165
2    180
3    173
4    178
Name: Height, dtype: int64
In [46]:
type(data.Height)   ## each feature is stored as a pandas Series object
Out[46]:
pandas.core.series.Series
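Attribute access such as data.Height only works when the column name is a valid Python identifier. A small sketch of the bracket form, which works for any column name (the DataFrame is built inline here instead of being read from simple.csv):

```python
import pandas as pd

data = pd.DataFrame({"Height": [178, 165, 180, 173, 178],
                     "Weight": [65, 60, 75, 61, 60]})

# Bracket access works for any column name, including names with spaces.
heights = data["Height"]

# Attribute access returns an equal Series when the name is a valid identifier.
same = data.Height.equals(heights)
print(same)
```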
In []:
We can also access individual elements using the [] operator, as is done for a Python list or dict.
In [35]:
data.Weight[0]
Out[35]:
65
In [40]:
data.Height[data.Height.size - 1]
Out[40]:
178
In [41]:
data.Height[-1]   ## reverse indexing doesn't work !!!
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-41-8922376fd1ed> in <module>()
----> 1 data.Height[-1]   ## reverse indexing doesn't work !!!

/usr/lib/python2.7/dist-packages/pandas/core/series.pyc in __getitem__(self, key)
    489     def __getitem__(self, key):
    490         try:
--> 491             result = self.index.get_value(self, key)
    492             if isinstance(result, np.ndarray):
    493                 return self._constructor(result,index=[key]*len(result)).__finalize__(self)

/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_value(self, series, key)
   1030 
   1031         try:
-> 1032             return self._engine.get_value(s, k)
   1033         except KeyError as e1:
   1034             if len(self) > 0 and self.inferred_type == 'integer':

/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_value (pandas/index.c:2957)()

/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_value (pandas/index.c:2772)()

/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3498)()

/usr/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6930)()

/usr/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6871)()

KeyError: -1
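The KeyError above comes from [] looking up -1 as an index label, and no row carries that label. To index by position instead, .iloc can be used. A minimal sketch, with the Height values typed in by hand:

```python
import pandas as pd

heights = pd.Series([178, 165, 180, 173, 178], name="Height")

# .iloc indexes by position, so negative positions count from the end.
last = heights.iloc[-1]
second_last = heights.iloc[-2]
print(last)
```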
In []:
Now, let's try to get some information about the data. pandas provides lots of methods for
this. Here we look at one particular feature, Height.
The same methods can be used for the feature Weight in this example.
In [16]:
data.Height.size
Out[16]:
5
In [15]:
data.Height.count()
Out[15]:
5
In [17]:
data.Height.max()  ## max value of the feature Height (column Height)
Out[17]:
180
In [19]:
data.Height.min()  ## min value of the feature Height (min value of column Height)
Out[19]:
165
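If you want several of these numbers at once, describe() collects them into one Series. A short sketch, again with the Height values typed in by hand:

```python
import pandas as pd

heights = pd.Series([178, 165, 180, 173, 178], name="Height")

# describe() returns count, mean, std, min, quartiles and max in one Series.
summary = heights.describe()
print(summary)
```

Individual entries can be picked out by label, for example summary["max"].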
In []:
We can also get the index of the maximum element as below
In [33]:
data.Height.idxmax()  ## returns the index of the maximum value
Out[33]:
2
In [34]:
data.Height[data.Height.idxmax()]  ## note the [] which is used to get the maximum element using the index
Out[34]:
180
In []:
What if we want to modify each of the elements of the feature Height? Say we want to add a constant 100, or apply any other
arithmetic operation. pandas provides functions for that too.
In [20]:
data.Height
Out[20]:
0    178
1    165
2    180
3    173
4    178
Name: Height, dtype: int64
In [21]:
data.Height.add(100)  ## add a constant 100 to each observation of feature Height
Out[21]:
0    278
1    265
2    280
3    273
4    278
Name: Height, dtype: int64
In [22]:
data.Height.subtract(100)
Out[22]:
0    78
1    65
2    80
3    73
4    78
Name: Height, dtype: int64
In [23]:
data.Height.multiply(10)
Out[23]:
0    1780
1    1650
2    1800
3    1730
4    1780
Name: Height, dtype: int64
In [24]:
data.Height.divide(10)
Out[24]:
0    17.8
1    16.5
2    18.0
3    17.3
4    17.8
Name: Height, dtype: float64
In []:
All these arithmetic operations can also be done by using the operators directly, as below.
In [66]:
data.Height / 10
Out[66]:
0    17.8
1    16.5
2    18.0
3    17.3
4    17.8
Name: Height, dtype: float64
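Note that none of these operations change data.Height itself; each one returns a new Series. A sketch of keeping the result by assigning it to a new column (the column name "Height_m" is made up for this illustration):

```python
import pandas as pd

data = pd.DataFrame({"Height": [178, 165, 180, 173, 178],
                     "Weight": [65, 60, 75, 61, 60]})

# Arithmetic returns a new Series; the Height column is left unchanged.
height_m = data["Height"] / 100

# Assigning with brackets stores the result as a new column.
data["Height_m"] = height_m  # "Height_m" is a hypothetical column name
print(data)
```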
In []:
Good, what else can we do? Let's try to get the statistics related to the feature Height, such as the mean height, variance,
standard deviation, etc.
In [25]:
data.Height.mean()
Out[25]:
174.80000000000001
In [26]:
data.Height.std()
Out[26]:
6.0580524923441432
In [27]:
data.Height.var()
Out[27]:
36.69999999999709
In [42]:
data.Height.cov(data.Height)
Out[42]:
36.700000000000003
In [44]:
data.Height.corr(data.Height)  ## correlation with itself.
Out[44]:
1.0
In [45]:
data.Height.kurtosis()
Out[45]:
1.4431022583944302
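One detail worth knowing: var() and std() compute the sample statistics by default (they divide by n - 1). A small sketch of the difference, using the ddof parameter:

```python
import pandas as pd

heights = pd.Series([178, 165, 180, 173, 178], name="Height")

# ddof=1 is the default: the sum of squared deviations is divided by n - 1.
sample_var = heights.var()

# ddof=0 divides by n, which gives the population variance.
population_var = heights.var(ddof=0)
print(round(sample_var, 2))
print(round(population_var, 2))
```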
In []:
What about comparison, i.e. if we want to compare this feature (Height) with another feature?
There are equality methods that can be used.
In [28]:
data.Height.equals(data.Height)  ## data.Height is equal to data.Height
Out[28]:
True
In [29]:
data.Height.equals(data.Weight)  ## data.Height is not equal to data.Weight
Out[29]:
False
In []:
Let's try the operators like > (gt), >= (ge), < (lt), <= (le), etc.
In [30]:
data.Height.gt(data.Weight)
Out[30]:
0    True
1    True
2    True
3    True
4    True
dtype: bool
In [31]:
data.Height.lt(data.Weight)
Out[31]:
0    False
1    False
2    False
3    False
4    False
dtype: bool
In [32]:
data.Height.ge(data.Weight)
Out[32]:
0    True
1    True
2    True
3    True
4    True
dtype: bool
In []:
You need not use these functions, instead you can use the operators directly as below
In [61]:
data.Height > 170
Out[61]:
0     True
1    False
2     True
3     True
4     True
Name: Height, dtype: bool
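Conditions like this can be combined with & (and) and | (or). A minimal sketch; the thresholds 170 and 65 are picked only for illustration:

```python
import pandas as pd

data = pd.DataFrame({"Height": [178, 165, 180, 173, 178],
                     "Weight": [65, 60, 75, 61, 60]})

# Each comparison needs its own parentheses because & binds tighter than > and <.
tall_and_light = (data["Height"] > 170) & (data["Weight"] < 65)
print(tall_and_light)
```

The Python keywords "and" and "or" do not work element-wise on a Series, so & and | are used instead.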
In []:
You can explore the rest.
In []:
As we already know, each of the features is stored as a pandas Series object. A Series is stored similarly to a Python dictionary,
so it has a keys() function to get the index and a values attribute to get the values. It is also iterable.
In [50]:
data.Height.keys
Out[50]:
<bound method Series.keys of 0    178
1    165
2    180
3    173
4    178
Name: Height, dtype: int64>
In [49]:
data.Height.values
Out[49]:
array([178, 165, 180, 173, 178])
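The first cell above prints a bound method because keys was written without parentheses. A short sketch of calling it, with the Height values typed in by hand:

```python
import pandas as pd

heights = pd.Series([178, 165, 180, 173, 178], name="Height")

# keys() must be called; it returns the same labels as the index attribute.
labels = heights.keys()
print(list(labels))

# values is an attribute (no parentheses) holding the underlying array.
print(heights.values.tolist())
```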
In []:
Here are some ways of iterating through the features.
In [51]:
for i in data.Height:
    print i
178
165
180
173
178

In [52]:
for i in data.Height.keys():
    print data.Height[i]
178
165
180
173
178
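If you need the index label and the value together, items() yields them as pairs, much like a dict. A minimal sketch:

```python
import pandas as pd

heights = pd.Series([178, 165, 180, 173, 178], name="Height")

# items() yields (label, value) pairs, one per observation.
pairs = []
for label, value in heights.items():
    pairs.append((label, value))
print(pairs)
```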

In []:
Other ways to access the features are taking the first few elements (head) or the last few elements (tail).
In [54]:
data.Height.head(2)
Out[54]:
0    178
1    165
Name: Height, dtype: int64
In [55]:
data.Height.tail(2)
Out[55]:
3    173
4    178
Name: Height, dtype: int64
In []:
If you want to take observations (rows) of your choice, you can use the take() function,
or you can even use Python's slice operations.
In [56]:
data.Height.take([0,2,4])
Out[56]:
0    178
2    180
4    178
Name: Height, dtype: int64
In [57]:
data.Height[0:3]
Out[57]:
0    178
1    165
2    180
Name: Height, dtype: int64
In [59]:
data.Height[0:4:2]
Out[59]:
0    178
2    180
Name: Height, dtype: int64
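The slices above select by position. pandas also offers .iloc (explicitly positional) and .loc (label-based), and the two treat the end of a slice differently. A small sketch of that difference:

```python
import pandas as pd

heights = pd.Series([178, 165, 180, 173, 178], name="Height")

# .iloc slices by position and excludes the end, like a Python list.
by_position = heights.iloc[0:3]

# .loc slices by label and includes the end label.
by_label = heights.loc[0:3]
print(len(by_position))
print(len(by_label))
```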
In []:
You can also reverse the observations using the slicing operation.
In [60]:
data.Height[::-1]
Out[60]:
4    178
3    173
2    180
1    165
0    178
Name: Height, dtype: int64
In []:
Earlier, we used a boolean operator on data.Height, which returned True or False for each observation depending on whether it
satisfied the condition or not. We can also use the result to extract the observations, as below.
In [63]:
data.Height[data.Height > 170]
Out[63]:
0    178
2    180
3    173
4    178
Name: Height, dtype: int64
In []:
We notice, however, that the output is smaller than the initial set of observations
(observation 1 doesn't satisfy the condition). We can use the where() function to get a series that keeps the observations which satisfy the
condition but has the same size as the original series.
In [64]:
data.Height.where(data.Height > 170)
Out[64]:
0    178
1    NaN
2    180
3    173
4    178
Name: Height, dtype: float64
In []:
The line which has NaN is the one which doesn't satisfy the condition.
In [65]:
## mask() is inverse operation of the where() (sort of negation)
data.Height.mask(data.Height > 170)
Out[65]:
0    NaN
1    165
2    NaN
3    NaN
4    NaN
Name: Height, dtype: float64
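Two follow-ups on where() and mask(), sketched below: where() accepts a second argument to use instead of NaN, and mask(cond) behaves like where() with the condition negated. The replacement value 0 is chosen only for illustration:

```python
import pandas as pd

heights = pd.Series([178, 165, 180, 173, 178], name="Height")
cond = heights > 170

# The second argument replaces the values where the condition is False.
filled = heights.where(cond, 0)
print(filled.tolist())

# mask(cond) hides the values where cond is True, like where(~cond).
same = heights.mask(cond).equals(heights.where(~cond))
print(same)
```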