Data Analytics with Python, by interlibrary. In this article we'll take a look at what data analytics is and how we can use it to our advantage. Data analytics is one of the most important concepts in data science.
It involves using many different methodologies to better understand the data that you have. How do you get started with it? How do you deal with a large data set, and how do you extract information out of it using data analytics? We'll cover all of these and more in this article.
Let's take a look at the agenda for this article. We'll begin with what data analytics is, since that's at the core of what we're trying to discuss here; we'll try to get a firm understanding of it so that we all know what we're talking about and we're all on the same page. We'll then move on to why data analytics: why should you perform data analytics on any piece of information or data that you have, and what benefit could it provide you? Then we'll take a look at the components of data analytics. Data analytics involves multiple components, and we'll go through them one by one.
We'll begin with numerical computation, then data manipulation, and then data visualization. These are the three major components of data analytics; once you get started with these, you can analyze any complicated data set.
Although these are quite vast fields, we are not going to go in depth into all of them. Then we have a hands-on section. If you want to follow along, all you need to do is install Anaconda, which includes Jupyter Notebook, boot up a Jupyter Notebook instance, and create a new notebook. Other than that, no libraries beyond what Anaconda installs by default are required.
Let's take a look at what data analytics is, since it's the foundation of everything we are going to cover in this presentation. Data analytics is the process of analyzing raw data by performing several operations on it, such as transforming and cleaning it, and drawing useful information from it. If you've ever read about data science or heard about the machine learning process, you know that one of the steps in the data science and machine learning pipeline is data analysis.
Or data analytics, as you might call it. The point is that when you have a lot of data, you first need to understand its shape and size: what are the features inside the data, how do those features correlate with each other, which features are important to you, how do you extract them, how do you remove unwanted values and unwanted columns, and how do you fill in values that are not available?
Only after getting the data, analyzing it, and figuring out what mistakes it might contain can you go on to clean and transform it. Then you have to visualize the data and perform many other tasks to make it more digestible.
To be able to extract good, meaningful information from your data, you need to perform data analytics and understand how to make sense of it. The process of data analytics, as we have already discussed, involves several steps, and we'll go through them one by one. The most important thing to understand is that data analytics is about performing multiple operations on the data we have, depending on what needs to be done, and those operations transform our data.
They clean our data and change its shape for the better. We need to understand what we're doing to the data when we perform these operations, and what the benefit of each operation is. Then comes: why data analytics?
We understand what data analytics is: transforming data, visualizing data, numerical computation, and all of that. But why go through all the trouble of analyzing a data set? Theoretically, you could just put it through a machine learning algorithm using scikit-learn, create a model, and make predictions.
Why go through the hassle of analyzing the data set that you already have? Well, data analytics is performed to draw conclusions from the available data to better inform our decision-making process. Consider the following scenario: you are a data scientist working at a hedge fund corporation. The financial status of your company is going down for some reason, and what you have is a lot of data about the company's finances and auditing. What you need to do is analyze that data to figure out where the lapses have been made so you can correct those mistakes. This is where data analytics comes into play. Now, that was a bit of a hypothetical situation.
You can use data analytics to your own advantage as well. Suppose you are collecting data about the kinds of books you and your friends read, and the kinds of movies you, your friends, and your family have been watching. Using data analytics, you can understand the shape of that data: what kinds of movies do you usually prefer, what kinds do your friends and family prefer, and what are the common similarities between your tastes in movies and books? Then, the next time you pick a movie, you can pick one where those similarities overlap.
Let's say that you like action thrillers and your friends like action thrillers as well; the next time you want to get together with your friends and go to watch a movie, you can choose an action thriller. This could also work in your financial decision-making. If you are someone who invests a lot, you can look at historical stock prices to analyze the data and figure out whether the price is going up or down. There are many aspects to data analytics; it's not just about personal finances or figuring out how to do a certain task. You can also take a look at your own finances to figure out where most of your money has been going, and correct for those mistakes. The best decisions are the ones made with the help of data and an understanding of the problem at hand.
Data analytics enables you to get a good grasp of the data that is already available to you. And in case you don't have any data, many payment providers and money-transfer services can provide that data for you; all you have to do is convert it into a format that is useful to you.
We'll take a look at all of that in a moment. Now on to the components of data analytics. We've discussed what data analytics is and why it's useful, but now it's time to understand its components. Data analytics is a multi-faceted process, meaning it has several components. Let's take a look at them one by one. These components are just our interpretation; if you want, you can define different criteria and come up with different components. The first one is numerical computation.
If you have some data with a numerical structure and you want to get some information out of it, you might have to perform some numerical computations on it. To give you some context, suppose you have been taking classes to learn how to play a musical instrument, but you have not been taking them regularly; you have been taking them in short periods.
Let's say you took one class on the 1st of August and then dropped out on the 14th of August, then restarted the classes on the 3rd of September and dropped out again on the 15th of September, and so forth. What you can do is collect the data on when you started and when you ended, and perform numerical computation to understand what the longest period was where you didn't quit, and what the durations of all the periods were.
What was the shortest period? You can even add remarks, such as the reason you wanted to quit during the shortest period, and things like that. So numerical computation is really useful when we want to perform some sort of numerical analysis.
It also applies when we want to compute on the data we have and create new numerical features. For instance, in our example we could look at the total duration you spent learning, regardless of the gaps in between: we subtract each start date from its end date to get the differences, and then add them up.
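The class-dates computation described above can be sketched with Python's standard `datetime` module. The specific dates follow the example; the year is an assumption, since the text doesn't give one:

```python
from datetime import date

# Hypothetical enrollment periods from the example (year assumed):
# Aug 1 - Aug 14, then Sep 3 - Sep 15
periods = [
    (date(2021, 8, 1), date(2021, 8, 14)),
    (date(2021, 9, 3), date(2021, 9, 15)),
]

# Duration of each period in days: subtract the start date from the end date
durations = [(end - start).days for start, end in periods]

longest = max(durations)   # longest stretch without quitting
shortest = min(durations)  # shortest stretch
total = sum(durations)     # total time spent learning, ignoring the gaps

print(durations, longest, shortest, total)  # [13, 12] 13 12 25
```

Subtracting two `date` objects yields a `timedelta`, so the day counts come out directly without any manual calendar arithmetic.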
That gives us the whole picture, and we'll see how to do this in Python in a moment. That's what numerical computation is about. Again, if you're trying to build something like a neural network, that requires a lot of numerical computation too, and if you're trying to build one from scratch, you need to understand how to numerically manipulate data. Then comes data manipulation. Data manipulation deals with taking a data set and filtering out the things you don't want. For instance, if you have a data set with 5000 rows and 30 columns and you only want 10 columns, how do you extract those 10 columns? If there are missing values, how do you deal with them? If there are outliers, how do you deal with those?
And what if some values are on different scales? For instance, if you get data from international sources, some people measure temperature in Fahrenheit, some in Celsius, and some in Kelvin, and the same goes for measuring length, height, and weight; different scales can cause different issues. In situations like this, you would have to manipulate the data to get it all to conform to a particular scale. That's why data manipulation is important. Data manipulation also deals with a lot of other complicated issues, including cleaning and transforming the data.
It also covers grouping data, segregating data, and filtering data. We'll take a look at how to do this in Python in our hands-on section. Then comes data visualization, which is by far one of the most important concepts in data analytics. It is what generates the visual charts that we can display on screen in presentations, making the data more digestible for people who are not that literate in the data analytics process.
Suppose you are going to present a quarterly sales report. Showing raw numbers is not going to be very useful, because it's difficult for everyone to look at the numbers, analyze them, and figure out whether the trend is going upwards or downwards, what the sharpest rises were, what the steepest falls were, and so on. Had you done it using histograms or bar charts, it would have been much easier for people to take in and grasp what's happening. It depends on how you want to structure your data and how you want to deal with it, but in a basic sense, if you visualize your data, it's going to be much more presentable and much easier for you and for other people to understand; it doesn't require as much mental computation and overhead.
Now we've come to the hands-on part. Before we begin, note that we'll be using some tools that might not be familiar to you, such as pandas, matplotlib, and Jupyter Notebook. In case these are not familiar, don't worry; at the end of the article we'll point you to some resources.
I have already created a Jupyter notebook and called it "data analytics with python". Everything is set up, so all we have to do now is start with our code. Let's start with numerical manipulation. The first thing I'll do is add a "numerical manipulation" note, and here I am going to be writing some code. One of the best things about Jupyter Notebook is that you can write these notes, write the code, and get the entire output of the code all bundled in a single file.
Now I'm going to import numpy as np and run that, so numpy is imported. NumPy is the numerical computation library for Python, and it is one of the most popular ones; if you have used NumPy, you know that it's also quite fast. What NumPy does is let us create arrays out of normal Python lists, dropping per-element overhead that is not useful during mathematical computation, so NumPy reduces the size of the data and allows us to do these computations quite easily.
For instance, let's say I have two lists of numbers. I'll call them x and y: x is equal to [1, 2, 3] and y is equal to [4, 5, 6]. If I were to add these element-wise on my own, I would have to loop over them, computing x[i] + y[i] for i in range(len(x)). That gives 1 plus 4 equals 5, 2 plus 5 equals 7, and 3 plus 6 equals 9. That works, but as you can see there's a fair amount of code to write, and it only worked because the lists have the same length. Had they not had the same length, I would get an IndexError, which is what you would expect, but it is a little cumbersome and a little difficult to do manually.
If I had to do this using numpy, I could type np. and, as you can see, there are a lot of functions; I'll use the add function, pass in [1, 2, 3] and [4, 5, 6], and get the result. Another way of doing this would be to create two arrays: x = np.array([1, 2, 3]) and y = np.array([4, 5, 6]). Now I can simply write x + y, and it gives me the answer [5, 7, 9]; the output is a numpy array, by the way.
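The loop-versus-NumPy comparison above, collected into one runnable sketch:

```python
import numpy as np

x = [1, 2, 3]
y = [4, 5, 6]

# Pure-Python element-wise addition: explicit loop over indices
manual = [x[i] + y[i] for i in range(len(x))]
print(manual)  # [5, 7, 9]

# The same operation with np.add, which accepts plain lists too
print(np.add(x, y))  # [5 7 9]

# Or convert to arrays first and use the + operator directly
xa = np.array(x)
ya = np.array(y)
print(xa + ya)  # [5 7 9]
```

With arrays, `+` applies element-wise automatically, which is both shorter and faster than looping in Python.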
There are many ways you can manipulate an array and do different kinds of things, such as addition, multiplication, and subtraction. I'm not going to go into all of those, but you can explore them on your own. Now that we are done with that, let's take a look at how we can do data manipulation.
For data manipulation, the first thing I'd like to do is do it by hand. Let me create a function named describe. It will accept some data and return a dictionary containing a description of that data. What I want to measure is the count, which is the number of elements inside the data; the mean of the data; the minimum and maximum values; and the standard deviation.
First I get the count: count is simply the length of the data, and Python makes this easy for us. The next thing to calculate is the mean, the average of the data; for that, all I have to do is sum up the data and divide by its length. Then comes the minimum, for which I can use the min function, and likewise the maximum with max. There is a fair amount of typing, and we'll understand why I'm doing it manually in a moment. Finally, after the minimum and maximum, we want the standard deviation, and to compute that we have to write a fair amount of code.
Let me show you what the code looks like. Written in a single line it's going to be a little difficult to understand right now, but don't worry, we'll take a look at it in a moment.
The first thing to do is get the difference between each value and the mean: for each element x in the data, I calculate x minus the mean. After doing that, I square each of those differences using a list comprehension, and then sum them to get the sum of squares. After getting the sum of squares, I divide it by the length of the data minus 1; this is the formula for the (sample) variance used in the standard deviation.
Then, finally, all we have to do is take the square root. For that I can import math and call math.sqrt, and it's done.
Running it, I get an invalid-syntax error because I forgot a comma in the dictionary, so always put commas between dictionary entries. Now let's describe some data: describe([5, 10, 25, 7, 8, 9]). Run it, and this is the description we get. As you can see, just calculating all of this took quite a lot of work.
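The hand-written `describe` function walked through above might look roughly like this:

```python
import math

def describe(data):
    """Return count, mean, min, max and sample standard deviation of a sequence."""
    count = len(data)
    mean = sum(data) / count
    # Squared deviation of each element from the mean
    squared_diffs = [(x - mean) ** 2 for x in data]
    # Sample standard deviation: sum of squares over (n - 1), then square root
    std = math.sqrt(sum(squared_diffs) / (count - 1))
    return {
        "count": count,
        "mean": mean,
        "min": min(data),
        "max": max(data),
        "std": std,
    }

print(describe([5, 10, 25, 7, 8, 9]))
```

For these six values the mean comes out to about 10.67 and the sample standard deviation to about 7.23, matching what pandas reports below.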
If I wanted to expand this into a fuller function, I could clean it up to make it more readable, but this is how it works. Now, if I wanted to do something similar using other libraries, I can import pandas as pd, run that, and create a pd.DataFrame. I'll pass a dictionary in which the column name is "data" and the values are the same six numbers. If I run this, you can see the resulting frame: "data" is the column name and these are the values.
If I want to describe the data now, I can use the describe function, and you can see it gives me everything I want: the count is 6, the mean is about 10.67, the standard deviation is about 7.2295689, the minimum value is 5, the maximum value is 25, and we also get the 25th, 50th, and 75th percentiles.
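The pandas equivalent of the hand-written version is a single method call:

```python
import pandas as pd

# Same six values as before, wrapped in a one-column DataFrame
df = pd.DataFrame({"data": [5, 10, 25, 7, 8, 9]})

# describe() computes count, mean, std, min, quartiles and max in one call
print(df.describe())
```

Note that `describe()` also includes the 25th, 50th, and 75th percentiles, which the hand-written function skipped.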
This is how it works. Instead of having to write all of that by hand, it's done quite easily, and I didn't have to perform a lot of complicated mathematical steps. As you can see, the hand-written code above is quite long, and it would have been even longer had I included the 25th, 50th, and 75th percentiles. That code is also something you would have to write and maintain, whereas here you can do it quite easily. Now let's take a look at how we can work with a real-world data set.
I'll have to import some data, so I'll go to the top again and write from sklearn.datasets import load_diabetes. You can load different kinds of data sets; I'm going to use the load_diabetes function. This is done.
If I want the data, I just call the load_diabetes function. This is how you work with real-world data sets and convert them to data frames. First, let's analyze what we got. The object is dictionary-like: it has a data key, which contains all the data, and a target key, which contains the target values. These are the things we would want to predict if we created a model, namely the diabetes level of a person. We also have a description of all the numbers, the feature names, which are the features we want to use, and the data file and target file names.
Instead of this data set, I can also show you some others; there are a lot of data sets available for you to use. This was load_diabetes, but you could also try load_boston. Let's take a look at the Boston data set; it's a little larger. It has the same shape of object: data, target, a description, feature names, and so on. After having that, how do you convert it to a data frame? That's the main point here.
We just have to create the data frame. First I assign it to df, creating a DataFrame and passing in the data; I can pass it either as a dictionary or directly, so I'll pass it directly. I also want the column names to be equal to data.feature_names. Now if you take a look at the head of the data, everything is there, but notice that the target is not; we need to add that to the data set as well.
So I set df['target'] equal to data.target, and when we look at the head of the data again, now we have the target as well, so everything's working. Let me just remove the earlier version so that we only keep the latest interpretation of the data. Now we can perform several tasks.
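The dataset-to-DataFrame steps above, sketched with load_diabetes. (The walkthrough also uses load_boston, but that dataset has since been removed from scikit-learn, so the diabetes data is used here; the same pattern applies to either.)

```python
import pandas as pd
from sklearn.datasets import load_diabetes

data = load_diabetes()

# Build a DataFrame from the raw array, naming columns after the features
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add the prediction target (disease progression) as one more column
df["target"] = data.target

print(df.shape)   # (442, 11): 442 rows, 10 features plus the target
print(df.head())
```

From here, `df.describe()` and `df.isnull().sum()` work exactly as shown below.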
First, let's describe the data to understand its shape. It tells us that we have 506 rows of data, and for each column it gives us the count, mean, standard deviation, minimum and maximum values, and the quartiles; the median is the 50th percentile.
You can look at this output and analyze it to figure out if something is off. For instance, if there was a column named age and the minimum value of age was negative 5, then you know that negative 5 is not a valid value and you have some incorrect values.
Next, let's check whether the data set contains any null values. Null values are missing values, values that were not recorded when the data set was created, and they can lead to issues when we're creating a model. As you can see, there are no null values in our data set, which is good; that means no cleaning is required. In case you had null values, you would have to figure out how to tackle them: you could replace all the missing values with 0 or with the average of the column, or, if say 90 percent of a column's values are missing, you could just drop the column, which would be better because it would have less effect on your data. Now, what if I want a particular subset of the data?
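The missing-value options just listed can be sketched on a small hypothetical DataFrame (the column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Tiny hypothetical DataFrame with one missing value per column
df = pd.DataFrame({
    "age":   [25, np.nan, 40, 35],
    "score": [88, 92, np.nan, 79],
})

# Count missing values per column
print(df.isnull().sum())

# Option 1: replace missing values with 0
filled_zero = df.fillna(0)

# Option 2: replace missing values with the column mean
filled_mean = df.fillna(df.mean())

# Option 3: drop a column entirely (sensible when most of it is missing)
dropped = df.drop(columns=["score"])
```

Which option to pick depends on the column: filling with the mean preserves the column's distribution better than filling with 0, while dropping is safest when almost nothing was recorded.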
Say I want the first five columns. What I use here is df.iloc; doing this without a pandas DataFrame would be very difficult. df.iloc[0:5] would actually give me the first five rows, but if I want all the rows and the first five columns, I can write df.iloc[:, 0:5]. Since the output is a little big, let me show it here: we get exactly five columns. We're not restricted to the first five, either. Let's say I don't want the first column: I can start the slice at the second column, and I can even extend the end to position 6, which drops the first column and gives me the columns up to, but not including, the one at position 6.
That's how this works. iloc is fine, but what if I want to use the column names to get the information? For that I can use loc: iloc means integer location, while loc means location by label. If I want everything from 'zn' to 'age', all I have to do is df.loc[:, 'zn':'age'], and there you have all the columns from 'zn' through 'age'. The thing to notice here is that loc includes the 'age' column, because label slices are inclusive of the end, whereas integer-location slices are exclusive: if 'age' sat at position 6, I would have had to write 7 with iloc so that the slice stops just after 6. With loc, we get the 'age' column directly. Now that that is done,
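The iloc/loc distinction is easiest to see on a small example. Since the Boston data set is no longer shipped with scikit-learn, this uses a tiny hypothetical DataFrame with similarly named columns:

```python
import pandas as pd

# Hypothetical stand-in for a few of the Boston columns
df = pd.DataFrame({
    "crim":  [0.1, 0.2, 0.3],
    "zn":    [18.0, 0.0, 12.5],
    "indus": [2.3, 7.1, 4.0],
    "age":   [65.2, 78.9, 45.8],
})

# iloc: integer positions; the end of the slice is EXCLUSIVE
first_two = df.iloc[:, 0:2]               # all rows, columns 0 and 1
print(first_two.columns.tolist())         # ['crim', 'zn']

# loc: labels; the end of the slice is INCLUSIVE
zn_to_age = df.loc[:, "zn":"age"]         # all rows, 'zn' through 'age'
print(zn_to_age.columns.tolist())         # ['zn', 'indus', 'age']
```

Notice that `loc` keeps the `age` column while an equivalent `iloc` slice would need to run one position past it.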
We can also perform different kinds of filtering. Let's say I want all the rows where age is above 65: I keep all the columns but only the rows whose age value is greater than 65, and as you can see I get 321 rows out of 506. I can change the threshold: with 75 I only get 262 rows, and going further, with 85 I get 210 rows. So you can filter your data however you want, and you can use multiple filters as well; we can take a look at that later on, but this is how it works.
You can see the age column here, and there are values of 100 as well; in fact quite a lot of the people are 100 years old. The filter result is itself a subset of the data frame, so I can describe that subset too; it takes a little time, but you can do it if you want. After getting the subset, this is what it looks like: we have the count, maximum, minimum, and so on for each column, and for age the maximum and minimum are both 100, because all the values in this subset are 100. That's how you can take a subset and describe it. Now let's take a look at the final concept, which is visualization.
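Boolean filtering as described above, again on a small hypothetical ages column since the original data set isn't bundled anymore:

```python
import pandas as pd

# Hypothetical ages to filter on
df = pd.DataFrame({"age": [45.8, 66.6, 100.0, 85.9, 100.0, 52.3]})

# A boolean mask keeps only the rows where the condition holds
over_65 = df[df["age"] > 65]
print(len(over_65))  # 4

# Multiple filters combine with & (and) / | (or); wrap each condition in parentheses
between = df[(df["age"] > 65) & (df["age"] < 100)]
print(len(between))  # 2

# describe() works on the filtered subset just like on the full frame
print(over_65.describe())
```

Raising the threshold shrinks the subset, which is exactly the 321 → 262 → 210 pattern seen in the walkthrough.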
For visualization, we'll be using a library called matplotlib. I import matplotlib.pyplot as plt. Now I can use the same DataFrame we already have and visualize the data.
Let's say I want to create a plot: I'll call plt.plot(df['crim']), run it, and this is what it looks like. In the crim column the highest values appear around index 380; other than that, all the values are quite small. After this, let's take a look at how we can get histograms.
I'll use plt.hist(df['age']) and run it. As you can see, the majority of the age values are 100; the counts are lowest between 0 and 10, and the bins in between vary. This is what it looks like, and this is how you can visualize data: it makes for better interpretation and better understanding of our data.
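The two plots above, sketched end to end. Hypothetical values stand in for the 'crim' and 'age' columns, and the `Agg` backend is used so the script runs without a display, saving the figures to files instead of showing them:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; use plt.show() in a notebook instead
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the 'crim' and 'age' columns
crim = [0.1, 0.3, 0.2, 8.9, 0.4, 0.2]
age = [45.8, 66.6, 100.0, 85.9, 100.0, 52.3]

# Line plot of one column: spikes stand out against the small values
plt.figure()
plt.plot(crim)
plt.title("crim")
plt.savefig("crim_line.png")

# Histogram: shows how the age values are distributed across bins
plt.figure()
plt.hist(age, bins=5)
plt.title("age")
plt.savefig("age_hist.png")
```

In a Jupyter notebook you would typically call `plt.show()` (or rely on inline rendering) rather than `savefig`.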