In [3]:
import pandas as pd
data = pd.read_csv("simple.csv")
In []:
Here we read the CSV into Python using pandas. pandas can read the CSV by itself, without needing the csv module.
You can also write the data to other formats such as JSON.
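A minimal, self-contained sketch of the same round trip. It assumes a made-up five-row table (only the first row, 178 cm and 65 kg, comes from the text) and uses an in-memory buffer in place of simple.csv:

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for simple.csv; the values are invented for illustration.
csv_text = "Height,Weight\n178,65\n165,58\n172,70\n181,80\n175,68\n"

# read_csv accepts any file-like object, not only a path.
data = pd.read_csv(io.StringIO(csv_text))

# DataFrame.to_json() returns a JSON string when no path is given.
json_text = data.to_json()

# Parse it back with the standard json module to inspect the structure.
parsed = json.loads(json_text)
print(sorted(parsed))
```

With the default settings the JSON is organised by column, so each column name maps to its values keyed by row label.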
In [7]:
## pd.json is not part of the public pandas API; DataFrame.to_json() is the supported way
with open("simple.data.json", "w") as fdout:
    fdout.write("%s\n" % data.to_json())
In [8]:
print(open("simple.data.json").readlines())
In [9]:
Let's now manipulate the data that was read.
The first question is what the CSV data looks like once it has been read into pandas.
In [67]:
data
Out[67]:
In [68]:
data.columns
Out[68]:
In []:
data.columns returns an Index object. It has two entries, "Height" and "Weight".
They are the features of the data. Printing "data" shows all the fields.
We can see that it has 5 rows, each corresponding to one observation (in this case one person).
For example, Person 1 (1st row) has a height of 178 cm and a weight of 65 kg. Now let's start manipulating the data
we have read into pandas.
When we read the CSV, pandas creates an attribute for each of the fields (features/columns).
In this example it creates "Height" and "Weight", which can be accessed as data.Height and data.Weight. Each one is a
pandas Series object.
Now let's get only the Height feature for all the observations, and then the Weight feature, as in the cells that follow.
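As a small sketch of column access, assuming the same invented five-row table as a stand-in for simple.csv: attribute access and bracket access return the same Series, and bracket access also works for column names that are not valid Python identifiers.

```python
import pandas as pd

# Hypothetical data; only the first row is taken from the text.
data = pd.DataFrame({"Height": [178, 165, 172, 181, 175],
                     "Weight": [65, 58, 70, 80, 68]})

height_attr = data.Height      # attribute access
height_item = data["Height"]   # bracket access, works for any column name

print(type(height_attr))
print(height_attr.equals(height_item))
```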
In [13]:
data.Weight
Out[13]:
In [12]:
data.Height
Out[12]:
In [46]:
type(data.Height) ## each feature is stored as a pandas Series object
Out[46]:
In []:
We can also access each element using the [] operator, as is done for a Python list or dict.
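One difference from a list is worth a sketch. With the default integer index, [] looks a number up as an index label rather than as a position, so a negative number is not found. The example below uses invented values and shows .iloc as the positional alternative:

```python
import pandas as pd

# Hypothetical heights in cm, with the default index 0..4.
height = pd.Series([178, 165, 172, 181, 175], name="Height")

first = height[0]        # label 0, which is also the first row here
last = height.iloc[-1]   # .iloc is purely positional, so -1 means the last row

try:
    height[-1]           # -1 is treated as a label, and no such label exists
    negative_label_found = True
except KeyError:
    negative_label_found = False

print(first, last, negative_label_found)
```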
In [35]:
data.Weight[0]
Out[35]:
In [40]:
data.Height[data.Height.size - 1]
Out[40]:
In [41]:
data.Height[-1] ## negative indexing doesn't work here: -1 is looked up as an index label and raises KeyError
In []:
Now, let's try to get some information about the data. pandas provides many methods for this.
Here we look at one particular feature, Height.
The same methods can be used for the feature Weight in this example.
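A short sketch of how size and count() differ, assuming a hypothetical Height column in which one measurement is missing: size counts every row, while count() counts only the non-missing values. max() and min() skip missing values by default.

```python
import pandas as pd

# Hypothetical heights with one missing measurement.
height = pd.Series([178, 165, float("nan"), 181, 175], name="Height")

size = height.size        # number of rows, missing values included
count = height.count()    # number of non-missing values
tallest = height.max()    # NaN is skipped by default
shortest = height.min()

print(size, count, tallest, shortest)
```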
In [16]:
data.Height.size
Out[16]:
In [15]:
data.Height.count()
Out[15]:
In [17]:
data.Height.max() ## max value of the feature Height (column Height)
Out[17]:
In [19]:
data.Height.min() ## min value of the feature Height (min value of column Height)
Out[19]:
In []:
We can also get the index of the maximum element.
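To make clear that idxmax() returns an index label rather than a position, here is a sketch with invented values and a hypothetical string index ("p1" to "p5"):

```python
import pandas as pd

# Hypothetical heights labelled by person instead of by row number.
height = pd.Series([178, 165, 172, 181, 175],
                   index=["p1", "p2", "p3", "p4", "p5"], name="Height")

label = height.idxmax()     # label of the row holding the maximum
value = height.loc[label]   # .loc looks the value up by that label

print(label, value)
```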
In [33]:
data.Height.idxmax() ## returns the index of the maximum value
Out[33]:
In [34]:
data.Height[data.Height.idxmax()] ## note the [] which is used to get the maximum element using the index
Out[34]:
In []:
What if we want to modify each element of the feature Height? Say we want to add a constant 100, or apply some other
arithmetic operation. pandas provides functions for that too.
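A minimal sketch with invented values: the arithmetic methods return a new Series and leave the original untouched, and each one has an operator equivalent.

```python
import pandas as pd

# Hypothetical heights in cm.
height = pd.Series([178, 165, 172, 181, 175], name="Height")

plus_100 = height.add(100)      # same result as height + 100
in_metres = height.divide(100)  # same result as height / 100

print(plus_100.tolist())
print(height.tolist())          # the original Series is unchanged
```

To keep a result, assign it back, for example to a new column of the DataFrame.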
In [20]:
data.Height
Out[20]:
In [21]:
data.Height.add(100) ## add a constant 100 to each observation of feature Height
Out[21]:
In [22]:
data.Height.subtract(100)
Out[22]:
In [23]:
data.Height.multiply(10)
Out[23]:
In [24]:
data.Height.divide(10)
Out[24]:
In []:
All these arithmetic operations can also be done by using the operators directly.
In [66]:
data.Height / 10
Out[66]:
In []:
Good, what else can we do? Let's try to get some statistics for the feature Height, such as the mean height, variance,
standard deviation etc.
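The sketch below, on invented values, shows how these statistics relate to each other. Note that std() and var() use the sample formulas (ddof=1) by default, so the variance is the square of the standard deviation, the covariance of a Series with itself equals its variance, and its correlation with itself is 1.

```python
import math

import pandas as pd

# Hypothetical heights in cm.
height = pd.Series([178, 165, 172, 181, 175], name="Height")

mean = height.mean()
std = height.std()                    # sample standard deviation (ddof=1)
var = height.var()                    # sample variance (ddof=1)
population_var = height.var(ddof=0)   # population variance, for comparison
cov_self = height.cov(height)
corr_self = height.corr(height)

print(mean, std, var)
```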
In [25]:
data.Height.mean()
Out[25]:
In [26]:
data.Height.std()
Out[26]:
In [27]:
data.Height.var()
Out[27]:
In [42]:
data.Height.cov(data.Height)
Out[42]:
In [44]:
data.Height.corr(data.Height) ## correlation with itself.
Out[44]:
In [45]:
data.Height.kurtosis()
Out[45]:
In []:
What about comparison, i.e. what if we want to compare this feature (Height) with another feature?
The equals() function can be used to check whether two features are identical.
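As a sketch on invented values, equals() answers with a single True or False for the whole Series, whereas the == operator compares element by element and returns a Series of booleans:

```python
import pandas as pd

# Hypothetical table; only the first row is taken from the text.
data = pd.DataFrame({"Height": [178, 165, 172, 181, 175],
                     "Weight": [65, 58, 70, 80, 68]})

same = data.Height.equals(data.Height)        # one boolean for the whole Series
different = data.Height.equals(data.Weight)
elementwise = data.Height == data.Weight      # one boolean per observation

print(same, different)
print(elementwise.tolist())
```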
In [28]:
data.Height.equals(data.Height) ## data.Height is equal to data.Height
Out[28]:
In [29]:
data.Height.equals(data.Weight) ## data.Height is not equal to data.Weight
Out[29]:
In []:
Let's try the operators > (gt), >= (ge), < (lt), <= (le) etc.
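A brief sketch with invented values: each comparison works element by element, against either another Series or a single number, and returns a boolean Series of the same length.

```python
import pandas as pd

# Hypothetical table; only the first row is taken from the text.
data = pd.DataFrame({"Height": [178, 165, 172, 181, 175],
                     "Weight": [65, 58, 70, 80, 68]})

taller_than_heavy = data.Height.gt(data.Weight)   # Series against Series
at_most_172 = data.Height.le(172)                 # Series against a scalar

print(taller_than_heavy.tolist())
print(at_most_172.tolist())
```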
In [30]:
data.Height.gt(data.Weight)
Out[30]:
In [31]:
data.Height.lt(data.Weight)
Out[31]:
In [32]:
data.Height.ge(data.Weight)
Out[32]:
In []:
You need not use these functions. Instead, you can use the operators directly.
In [61]:
data.Height > 170
Out[61]:
In []:
You can explore the rest.
In []:
As we already know, each feature is stored as a pandas Series object. A Series is organised much like a Python dictionary:
it has a keys() function to get the index and a values attribute to get the values. It is also iterable.
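The dictionary analogy can be sketched as follows, on invented values. The items() method is not mentioned above; it is a standard Series method that yields (index label, value) pairs, much like dict.items().

```python
import pandas as pd

# Hypothetical heights in cm, with the default index 0..4.
height = pd.Series([178, 165, 172, 181, 175], name="Height")

keys = list(height.keys())    # the index labels
values = height.values        # the underlying NumPy array
pairs = list(height.items())  # (label, value) pairs

for label, value in pairs:
    print(label, value)
```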
In [50]:
data.Height.keys()
Out[50]:
In [49]:
data.Height.values
Out[49]:
In []:
Here are some ways of iterating through the features.
In [51]:
for i in data.Height:
    print(i)
In [52]:
for i in data.Height.keys():
    print(data.Height[i])
In []:
Other ways to access the features are taking the first few elements (head) or the last few elements (tail).
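A small sketch on invented values: head(n) and tail(n) return the first and last n observations, and the result keeps the original index labels.

```python
import pandas as pd

# Hypothetical heights in cm.
height = pd.Series([178, 165, 172, 181, 175], name="Height")

first_two = height.head(2)
last_two = height.tail(2)

print(first_two.tolist())
print(last_two.tolist(), list(last_two.index))
```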
In [54]:
data.Height.head(2)
Out[54]:
In [55]:
data.Height.tail(2)
Out[55]:
In []:
If you want to take observations (rows) of your choice, you can use the take() function,
or you can even use Python's slice operations.
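The sketch below, on invented values, puts the two approaches side by side. take() picks rows by position from a list, while a slice picks a contiguous or strided range of positions. The .iloc form is shown as the explicitly positional equivalent of the plain slice.

```python
import pandas as pd

# Hypothetical heights in cm, with the default index 0..4.
height = pd.Series([178, 165, 172, 181, 175], name="Height")

picked = height.take([0, 2, 4])   # rows at positions 0, 2 and 4
first_three = height[0:3]         # positions 0, 1, 2 (the end is exclusive)
every_other = height.iloc[0:4:2]  # positions 0 and 2

print(picked.tolist())
print(first_three.tolist())
print(every_other.tolist())
```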
In [56]:
data.Height.take([0,2,4])
Out[56]:
In [57]:
data.Height[0:3]
Out[57]:
In [59]:
data.Height[0:4:2]
Out[59]:
In []:
You can also reverse the observations using a slicing operation.
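One detail worth a sketch, on invented values: reversing with a slice reverses the index labels along with the values. If fresh labels 0..n-1 are wanted, reset_index(drop=True) can be applied afterwards.

```python
import pandas as pd

# Hypothetical heights in cm.
height = pd.Series([178, 165, 172, 181, 175], name="Height")

reversed_height = height[::-1]                       # labels travel with the values
renumbered = reversed_height.reset_index(drop=True)  # new labels 0..4

print(list(reversed_height.index))
print(list(renumbered.index))
```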
In [60]:
data.Height[::-1]
Out[60]:
In []:
Earlier, we used a boolean operator on data.Height, which returned True or False for each observation depending on
whether it satisfied the condition. We can also use the result to extract the matching observations.
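A minimal sketch of this kind of filtering, on invented values in which the observation with index 1 is the only one not above 170: the boolean Series acts as a mask, and only the rows where it is True are kept.

```python
import pandas as pd

# Hypothetical heights in cm; index 1 is the only value not above 170.
height = pd.Series([178, 165, 172, 181, 175], name="Height")

is_tall = height > 170    # boolean Series, one entry per observation
tall = height[is_tall]    # keeps only the rows where the mask is True

print(is_tall.tolist())
print(tall.tolist(), list(tall.index))
```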
In [63]:
data.Height[data.Height > 170]
Out[63]:
In []:
We notice, however, that the output is smaller than the initial set of observations
(observation 1 doesn't satisfy the condition). We can use the where() function to get the observations that satisfy the
condition while still returning a new series of the same size as the original.
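A sketch of where() on invented values: rows that fail the condition are kept but filled with NaN, so the length is preserved. The optional other argument supplies a different replacement value.

```python
import pandas as pd

# Hypothetical heights in cm; index 1 is the only value not above 170.
height = pd.Series([178, 165, 172, 181, 175], name="Height")

kept = height.where(height > 170)              # failing rows become NaN
filled = height.where(height > 170, other=0)   # failing rows become 0 instead

print(kept.isna().tolist())
print(filled.tolist())
```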
In [64]:
data.Height.where(data.Height > 170)
Out[64]:
In []:
The row that shows NaN is the one that doesn't satisfy the condition.
In [65]:
## mask() is the inverse of where(): it puts NaN where the condition is True
data.Height.mask(data.Height > 170)
Out[65]:
In []: