In [3]:

import pandas as pd

data = pd.read_csv("simple.csv")

In []:

Here we read the CSV into Python using pandas. pandas can read the CSV by itself, without needing the csv module.
You can also write the data to other formats, such as JSON.

In [7]:

# to_json() called without a path returns the JSON text as a string
with open("simple.data.json", "w") as fdout:
    fdout.write("%s\n" % data.to_json())

In [8]:

print(open("simple.data.json").readlines())

['{"Height":{"0":178,"1":165,"2":180,"3":173,"4":178},"Weight":{"0":65,"1":60,"2":75,"3":61,"4":60}}\n']
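As a rough sketch of the same export, DataFrame.to_json can also produce other layouts through its orient parameter. The snippet below is only an illustration: it assumes simple.csv holds the five rows printed in this post and reads them from an in-memory string instead of the file, then parses the JSON text with the standard json module.

```python
import io
import json

import pandas as pd

# Assumption: simple.csv contains these five rows (taken from the output in this post).
csv_text = "Height,Weight\n178,65\n165,60\n180,75\n173,61\n178,60\n"
data = pd.read_csv(io.StringIO(csv_text))

# Default layout: one JSON object per column, keyed by row label.
by_column = json.loads(data.to_json())

# orient="records": one JSON object per row.
by_row = json.loads(data.to_json(orient="records"))

print(by_column["Height"]["0"])
print(by_row[0])
```

The column layout matches the file written above; the records layout is often easier to consume from other programs.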

In [9]:

Let's now manipulate the data we have read.
The first question is what the data (CSV) looks like once it has been read into pandas.

In [67]:

data

Out[67]:

	Height	Weight
0	178	65
1	165	60
2	180	75
3	173	61
4	178	60

5 rows × 2 columns

In [68]:

data.columns

Out[68]:

Index([u'Height', u'Weight'], dtype='object')

In []:

"columns" returns an Index object. It has two entries, "Height" and "Weight", which are the features of the data. Printing "data" shows all the fields.
We can see that it has 5 rows, each corresponding to one observation (in this case, one person).
For example, person 1 (the first row, index 0) has a height of 178 cm and a weight of 65 kg. Now let's start manipulating the data we have read into pandas.

When we read the CSV into pandas, it creates one field for each column (feature).
In this example it creates the fields "Height" and "Weight", which can be accessed as data.Height and data.Weight; each of them is a pandas Series object.

Now let's get only the Height feature for all the observations, and likewise the Weight feature. It can be done as below.

In [13]:

data.Weight

Out[13]:

0    65
1    60
2    75
3    61
4    60
Name: Weight, dtype: int64

In [12]:

data.Height

Out[12]:

0    178
1    165
2    180
3    173
4    178
Name: Height, dtype: int64

In [46]:

type(data.Height)   ## each feature is stored as a pandas Series object

Out[46]:

pandas.core.series.Series
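Attribute access such as data.Height is a convenience; the bracket form data["Height"] returns the same Series and keeps working when a column name clashes with a DataFrame method. A small sketch, assuming the table is rebuilt inline from the values printed above (so it runs without simple.csv) and using a hypothetical extra column named "count":

```python
import pandas as pd

# Assumption: same values as the table printed in this post.
data = pd.DataFrame({"Height": [178, 165, 180, 173, 178],
                     "Weight": [65, 60, 75, 61, 60]})

# Both spellings return the same Series.
same = data["Height"].equals(data.Height)

# Hypothetical column whose name collides with the DataFrame.count() method.
data["count"] = 1
attr_is_method = callable(data.count)                     # attribute access finds the method
bracket_is_series = isinstance(data["count"], pd.Series)  # brackets find the column

print(same, attr_is_method, bracket_is_series)
```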

In []:

We can also access each element using the [] operator, as is done for a Python list or dict.

In [35]:

data.Weight[0]

Out[35]:

65

In [40]:

data.Height[data.Height.size - 1]

Out[40]:

178

In [41]:

data.Height[-1]   ## reverse indexing doesn't work !!!

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-41-8922376fd1ed> in <module>()
----> 1 data.Height[-1]   ## reverse indexing doesn't work !!!

/usr/lib/python2.7/dist-packages/pandas/core/series.pyc in __getitem__(self, key)
    489     def __getitem__(self, key):
    490         try:
--> 491             result = self.index.get_value(self, key)
    492             if isinstance(result, np.ndarray):
    493                 return self._constructor(result,index=[key]*len(result)).__finalize__(self)

/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_value(self, series, key)
   1030 
   1031         try:
-> 1032             return self._engine.get_value(s, k)
   1033         except KeyError as e1:
   1034             if len(self) > 0 and self.inferred_type == 'integer':

/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_value (pandas/index.c:2957)()

/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_value (pandas/index.c:2772)()

/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3498)()

/usr/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6930)()

/usr/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6871)()

KeyError: -1
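The KeyError arises because this Series has an integer index, so [-1] is looked up as an index label rather than as a position. A minimal sketch of position-based access with .iloc, assuming the same five heights as above rebuilt inline:

```python
import pandas as pd

# Assumption: same Height values as printed in this post.
heights = pd.Series([178, 165, 180, 173, 178], name="Height")

# .iloc always counts positions, so negative positions work as in a Python list.
last_by_position = heights.iloc[-1]

# .loc always looks up index labels; the last label here is 4.
last_by_label = heights.loc[4]

print(last_by_position, last_by_label)
```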

In []:

Now let's try to get some information about the data. pandas provides many methods for this.
Here we look at one particular feature, Height.
The same methods can be used for the feature Weight.

In [16]:

data.Height.size

Out[16]:

5

In [15]:

data.Height.count()

Out[15]:

5

In [17]:

data.Height.max()  ## max value of the feature Height (column Height)

Out[17]:

180

In [19]:

data.Height.min()  ## min value of the feature Height (min value of column Height)

Out[19]:

165
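size and count() agree on this data only because no height is missing: size counts every element, while count() skips missing values. A sketch with a hypothetical Series that contains one NaN (not part of the original data):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third height is missing.
heights = pd.Series([178, 165, np.nan, 173, 178], name="Height")

total = heights.size        # counts every element, including NaN
present = heights.count()   # counts only the non-missing values
tallest = heights.max()     # NaN is skipped by default

print(total, present, tallest)
```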

In []:

We can also get the index of the maximum element as below

In [33]:

data.Height.idxmax()  ## returns the index of the maximum value

Out[33]:

2

In [34]:

data.Height[data.Height.idxmax()]  ## note the [] which is used to get the maximum element using the index

Out[34]:

180

In []:

What if we want to modify each element of the feature Height? Say we want to add a constant 100, or apply any other
arithmetic operation. pandas provides functions for that too.

In [20]:

data.Height

Out[20]:

0    178
1    165
2    180
3    173
4    178
Name: Height, dtype: int64

In [21]:

data.Height.add(100)  ## add a constant 100 to each observation of feature Height

Out[21]:

0    278
1    265
2    280
3    273
4    278
Name: Height, dtype: int64

In [22]:

data.Height.subtract(100)

Out[22]:

0    78
1    65
2    80
3    73
4    78
Name: Height, dtype: int64

In [23]:

data.Height.multiply(10)

Out[23]:

0    1780
1    1650
2    1800
3    1730
4    1780
Name: Height, dtype: int64

In [24]:

data.Height.divide(10)

Out[24]:

0    17.8
1    16.5
2    18.0
3    17.3
4    17.8
Name: Height, dtype: float64

In []:

All these arithmetic operations can also be done by using the operators directly, as below.

In [66]:

data.Height / 10

Out[66]:

0    17.8
1    16.5
2    18.0
3    17.3
4    17.8
Name: Height, dtype: float64
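These operations return a new Series and leave the original column untouched, and they also work between two columns, matched row by row. As an illustration only (the body mass index formula is an example added here, not part of the original notebook), assuming Height is in cm and Weight in kg and the table is rebuilt inline:

```python
import pandas as pd

# Assumption: same values as the table printed in this post.
data = pd.DataFrame({"Height": [178, 165, 180, 173, 178],
                     "Weight": [65, 60, 75, 61, 60]})

taller = data.Height + 100          # new Series; data.Height is unchanged
height_m = data.Height / 100        # centimetres to metres
bmi = data.Weight / height_m ** 2   # column-by-column arithmetic

# Assign the result to a new column to keep it in the table.
data["BMI"] = bmi
print(data)
```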

In []:

Good, what else can we do? Let's try to get the statistics of the feature Height, such as the mean height, variance,
standard deviation, etc.

In [25]:

data.Height.mean()

Out[25]:

174.80000000000001

In [26]:

data.Height.std()

Out[26]:

6.0580524923441432

In [27]:

data.Height.var()

Out[27]:

36.69999999999709

In [42]:

data.Height.cov(data.Height)

Out[42]:

36.700000000000003

In [44]:

data.Height.corr(data.Height)  ## correlation with itself.

Out[44]:

1.0

In [45]:

data.Height.kurtosis()

Out[45]:

1.4431022583944302
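By default, var() and std() use the sample formulas, dividing by n - 1 (ddof=1). The sketch below recomputes the variance by hand to show where 36.7 comes from; the variable names are my own and the heights are rebuilt inline from the output above.

```python
import pandas as pd

# Assumption: same Height values as printed in this post.
heights = pd.Series([178, 165, 180, 173, 178], name="Height")

n = heights.size
mean = heights.sum() / n

# Sample variance: sum of squared deviations divided by (n - 1).
sample_var = ((heights - mean) ** 2).sum() / (n - 1)

# Population variance divides by n instead; pass ddof=0 to get it.
population_var = heights.var(ddof=0)

print(mean, sample_var, population_var)
```

The standard deviation is the square root of the sample variance, which is why std() squared matches var() up to rounding.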

In []:

What about comparison, i.e. what if we want to compare this feature (Height) with another feature?
There are equality operators that can be used.

In [28]:

data.Height.equals(data.Height)  ## data.Height is equal to data.Height

Out[28]:

True

In [29]:

data.Height.equals(data.Weight)  ## data.Height is not equal to data.Weight

Out[29]:

False
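equals() answers with a single boolean for the whole Series, whereas == (or the eq() method) compares element by element and returns a boolean Series. A short sketch on the same values, rebuilt inline as an assumption:

```python
import pandas as pd

# Assumption: same values as the table printed in this post.
data = pd.DataFrame({"Height": [178, 165, 180, 173, 178],
                     "Weight": [65, 60, 75, 61, 60]})

whole = data.Height.equals(data.Weight)    # one answer for the entire Series
elementwise = data.Height == data.Weight   # one answer per observation
is_178 = data.Height.eq(178)               # method form of == against a constant

print(whole)
print(is_178)
```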

In []:

Let's try the operators like > (gt), >= (ge), < (lt), <= (le) etc.

In [30]:

data.Height.gt(data.Weight)

Out[30]:

0    True
1    True
2    True
3    True
4    True
dtype: bool

In [31]:

data.Height.lt(data.Weight)

Out[31]:

0    False
1    False
2    False
3    False
4    False
dtype: bool

In [32]:

data.Height.ge(data.Weight)

Out[32]:

0    True
1    True
2    True
3    True
4    True
dtype: bool

In []:

You don't have to use these functions; you can use the operators directly, as below.

In [61]:

data.Height > 170

Out[61]:

0     True
1    False
2     True
3     True
4     True
Name: Height, dtype: bool
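Comparisons like this can be combined with & (and), | (or) and ~ (not). Each comparison needs its own parentheses when written inline, because these operators bind more tightly than > and <. A sketch on the same table, rebuilt inline as an assumption, with thresholds chosen only for illustration:

```python
import pandas as pd

# Assumption: same values as the table printed in this post.
data = pd.DataFrame({"Height": [178, 165, 180, 173, 178],
                     "Weight": [65, 60, 75, 61, 60]})

tall = data.Height > 170
light = data.Weight < 65

tall_and_light = tall & light   # both conditions must hold
tall_or_light = (data.Height > 170) | (data.Weight < 65)
not_tall = ~tall                # element-wise negation

print(tall_and_light)
```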

In []:

You can explore the rest.

In []:

As we already know, each feature is stored as a pandas Series object. A Series behaves much like a Python dictionary:
it has a keys() function to get the index and a values attribute to get the values. It is also iterable.

In [50]:

data.Height.keys   ## without (), this only shows the bound method; call keys() to get the index

Out[50]:

<bound method Series.keys of 0    178
1    165
2    180
3    173
4    178
Name: Height, dtype: int64>

In [49]:

data.Height.values

Out[49]:

array([178, 165, 180, 173, 178])

In []:

Here are some ways of iterating through the features.

In [51]:

for i in data.Height:
    print(i)

In [52]:

for i in data.Height.keys():
    print(data.Height[i])
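When both the index and the value are needed, the two loops above can be merged with items(), which yields (index, value) pairs much like a dict. A minimal sketch, with the heights rebuilt inline as an assumption:

```python
import pandas as pd

# Assumption: same Height values as printed in this post.
heights = pd.Series([178, 165, 180, 173, 178], name="Height")

pairs = []
for label, height in heights.items():
    # label is the index entry, height is the stored value
    pairs.append((label, int(height)))

print(pairs)
```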

In []:

Other ways to access a feature are to take the first few elements (head) or the last few elements (tail).

In [54]:

data.Height.head(2)

Out[54]:

0    178
1    165
Name: Height, dtype: int64

In [55]:

data.Height.tail(2)

Out[55]:

3    173
4    178
Name: Height, dtype: int64

In []:

If you want to take observations (rows) of your choice, you can use the take() function,
or you can even use Python's slice operations.

In [56]:

data.Height.take([0,2,4])

Out[56]:

0    178
2    180
4    178
Name: Height, dtype: int64

In [57]:

data.Height[0:3]

Out[57]:

0    178
1    165
2    180
Name: Height, dtype: int64

In [59]:

data.Height[0:4:2]

Out[59]:

0    178
2    180
Name: Height, dtype: int64

In []:

You can also reverse the observations using a slicing operation.

In [60]:

data.Height[::-1]

Out[60]:

4    178
3    173
2    180
1    165
0    178
Name: Height, dtype: int64
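The plain slices above count positions. To make the intent explicit, .iloc slices by position (end excluded, as in a Python list) while .loc slices by index label (end included). A sketch on the same heights, rebuilt inline as an assumption:

```python
import pandas as pd

# Assumption: same Height values as printed in this post.
heights = pd.Series([178, 165, 180, 173, 178], name="Height")

first_three = heights.iloc[0:3]       # positions 0, 1, 2
through_label_3 = heights.loc[0:3]    # labels 0, 1, 2 and 3 (end label included)
reversed_heights = heights.iloc[::-1] # same reversal as [::-1]

print(first_three.tolist())
print(through_label_3.tolist())
```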

In []:

Earlier, we used a boolean operator on data.Height, which returned True or False for each observation depending on whether it
satisfied the condition. We can also use the result to extract the matching observations, as below.

In [63]:

data.Height[data.Height > 170]

Out[63]:

0    178
2    180
3    173
4    178
Name: Height, dtype: int64
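The same boolean Series can filter the whole table rather than a single feature, so each selected row keeps both its Height and its Weight. A sketch, assuming the table is rebuilt inline from the values printed above:

```python
import pandas as pd

# Assumption: same values as the table printed in this post.
data = pd.DataFrame({"Height": [178, 165, 180, 173, 178],
                     "Weight": [65, 60, 75, 61, 60]})

# Keep every row whose Height is above 170.
tall_people = data[data.Height > 170]

# Filter rows and pick one column in a single step with .loc.
tall_weights = data.loc[data.Height > 170, "Weight"]

print(tall_people)
```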

In []:

Notice, however, that the output is smaller than the original series
(observation 1 doesn't satisfy the condition). We can use the where() function instead: it keeps the observations that satisfy the
condition and returns a new series of the same size as the original.

In [64]:

data.Height.where(data.Height > 170)

Out[64]:

0    178
1    NaN
2    180
3    173
4    178
Name: Height, dtype: float64

In []:

The line which has NaN is the one which doesn't satisfy the condition.

In [65]:

## mask() is the inverse of where(): it hides the values that satisfy the condition
data.Height.mask(data.Height > 170)

Out[65]:

0    NaN
1    165
2    NaN
3    NaN
4    NaN
Name: Height, dtype: float64
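where() also accepts a replacement value as its second argument, which avoids the NaN (and the switch to float64), and mask(cond) gives the same result as where(~cond). A sketch on the same heights, rebuilt inline as an assumption; the replacement value 0 is arbitrary:

```python
import pandas as pd

# Assumption: same Height values as printed in this post.
heights = pd.Series([178, 165, 180, 173, 178], name="Height")
cond = heights > 170

# Replace the values that fail the condition with 0 instead of NaN.
filled = heights.where(cond, 0)

# mask() hides where the condition holds, i.e. where() with the negated condition.
same = heights.mask(cond).equals(heights.where(~cond))

# Dropping the NaN rows gets back to the shorter, filtered series.
kept = heights.where(cond).dropna()

print(filled.tolist())
```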

In []:

Unlike WordPress, which has a built-in syntax highlighter, adding a syntax highlighter to Blogger is not very easy. It took me about 2 hours to make it work. Here are step-by-step instructions for adding a syntax highlighter to your Blogger blog.

In the Blogger settings, edit the blog template.

In the HTML code, search for the closing </head> tag.

Copy the JavaScript below and paste it just before the </head> tag.

<link href="http://alexgorbatchev.com/pub/sh/current/styles/shCore.css" rel="stylesheet" type="text/css" />
<link href="http://alexgorbatchev.com/pub/sh/current/styles/shThemeDefault.css" rel="stylesheet" type="text/css" />
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js" type="text/javascript" />

<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushBash.js" type="text/javascript" />
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushCpp.js" type="text/javascript" />
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushCss.js" type="text/javascript" />
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushDiff.js" type="text/javascript" />
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushJScript.js" type="text/javascript" />
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushJava.js" type="text/javascript" />
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPerl.js" type="text/javascript" />
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPhp.js" type="text/javascript" />
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPlain.js" type="text/javascript" />
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPython.js" type="text/javascript" />
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushRuby.js" type="text/javascript" />
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushXml.js" type="text/javascript" />
 
<script language="javascript" type="text/javascript">
 SyntaxHighlighter.config.bloggerMode = true;
 SyntaxHighlighter.all();
</script>

Save the template; this enables the syntax highlighter in your blog.
Now, to add code to your blog, create a post and switch to HTML view in the blog editor. Add the following code and update the blog.

<pre class="brush:python;">
    import os
    import numpy as np
    import tensorflow as tf
    
    x = tf.Variable(10)
    y = tf.Variable(x + 2)
</pre>

Once updated, you will see the following Python code.

    import os
    import numpy as np
    import tensorflow as tf
    
    x = tf.Variable(10)
    y = tf.Variable(x + 2)

Most of the code is taken from here and from other sources found through Google.

Data Viz

Sunday, May 15, 2016

Python Pandas for Beginners

Saturday, May 7, 2016

Adding source code syntax highlighter to blogger
