Books every aspiring Data Scientist should read

Are you intrigued by the world of data science? Data science is in vogue and is one of the hottest career paths for students today. However, tackling it on your own can be daunting: it spans several topics such as statistics, probability, applied mathematics, and machine learning. Fortunately, there are many books an aspiring data scientist can read. These books not only give you a basic overview of the subject but, with enough practice, also help you master it. Here are seven books every aspiring data scientist should read:

1. Think Stats: Probability and Statistics for Programmers by Allen B. Downey

 

Think Stats

Aspiring data scientists need to learn statistics, but how do you integrate statistics with programming? This book gives you an overview of statistics geared towards data science. You will go through the core concepts of statistics and probability that underpin data analysis. The book uses data sets from the National Institutes of Health and includes several examples of Python code. The best thing about this book is its lucid language and real-world examples.

2. Python Data Science Handbook by Jake VanderPlas

Python Data Science Handbook

Since Python is such an essential language for data science, an aspiring data scientist needs a comprehensive knowledge of it. This book teaches how to use Python for data science. It starts at the beginner level and slowly progresses to more advanced material, covering topics such as visualization, NumPy, data manipulation with Pandas, and machine learning.

3. R for Data Science by Garrett Grolemund and Hadley Wickham

R for data science

While you might keep your chief focus on Python, you should also have a working knowledge of R, another language widely used by data scientists. If Python lacks a specific library, R can often provide it. This book serves as a guide for carrying out data science projects in R, covering several topics from the R workflow and data visualization to data modelling.

4. Machine Learning Yearning by Andrew Ng

Machine Learning Yearning

Machine learning is the future of the tech world. It has emerged in the data science field relatively recently, yet it has become popular in a short period of time. This book teaches data scientists how to structure machine learning projects. It shows you how and when to use machine learning, along with all the complexities that it brings.

5. Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville

Deep Learning

If you want to enter the field of deep learning, this can be your go-to book. It teaches how applied mathematics is used for machine learning, with a particular emphasis on deep learning. It presents the mathematics behind deep learning concepts such as regularisation, convolutional networks, and recurrent and recursive nets. While mostly theoretical, it also sheds light on the practical implementation of these techniques.

6. Storytelling With Data: A Data Visualization Guide for Business Professionals by Cole Nussbaumer Knaflic

Storytelling with Data

Data visualization is a necessary part of data science, but it is also one of the harder parts to get right. Good visualization creates narratives that can reach a wider audience, while unnecessary data obstructs clear communication. This book teaches you how to strip away the unnecessary data and build a narrative that touches the audience on a personal level. It shows you the art of storytelling with metrics.

7. Ethics and Data Science by DJ Patil, Hilary Mason, and Mike Loukides

Ethics and Data Science

As a data scientist, you have to be aware of the ethical limits of collecting and analysing data. In recent years, several concerns have been raised regarding machine bias, privacy, and data protection. This book helps you understand the ethical principles of data science and offers suggestions for building ethics into the data science culture.

Read these books and set up your path as a data scientist. Happy reading!

Sometimes it is difficult to learn everything from books alone. Don’t worry, we have a full-fledged course on data science.
Here is the link: Data Science Course

An Introduction to Hadoop and its ecosystem

Big data analysis is the future of technology and analytical research. It deals with large data sets, which help in determining patterns and trends in business. Imagine how useful that is for finance, marketing and other kinds of research. However, since it deals with such large amounts of data, it also gets a lot more complicated. If you are looking to opt for a detailed data analytics course, you must first understand the Hadoop ecosystem.

Not every piece of software is capable of handling such large data in one go. However, Apache Hadoop, an open-source framework, has made its place in the tech world because it allows efficient handling of big data. The Hadoop framework runs on clusters and is split into several modules, creating a large ecosystem of technologies. The Hadoop ecosystem is a suite of services for tackling big data problems.

Hadoop Ecosystem

hadoop and its ecosystem

source

While there are many solutions and tools in the Hadoop ecosystem, the four major ones are HDFS, MapReduce, YARN and Hadoop Common. These tools work together to handle the absorption, analysis, storage, and maintenance of data, and many other components work in tandem with them to build up the entire ecosystem. As the diagram above shows, each component has its own function; HDFS and MapReduce, for example, provide the distributed capabilities, i.e. distributed storage and distributed processing respectively. The key components are described below:

1. HDFS

This is the primary component of the ecosystem. It stores large sets of unstructured and structured data and maintains the metadata in log files. The core components here are the Name Node and the Data Nodes. Data Nodes are the commodity hardware in the distributed environment that store the actual data. The Name Node is the prime node and stores the metadata; it requires fewer resources than the Data Nodes. HDFS works at the heart of the system.

2. YARN

YARN, or Yet Another Resource Negotiator, helps manage resources across the clusters. It is responsible for resource allocation and scheduling. The main components of YARN are the Resource Manager, the Node Manager and the Application Manager. The Resource Manager allocates resources to the applications running in the system. The Node Manager handles the resources of each node, such as CPU, memory and bandwidth. The Application Manager acts as an interface between the two and negotiates the resource requirements.

3. MapReduce

hadoop ecosystem

source

MapReduce combines parallel and distributed algorithms to turn big data sets into manageable ones. It has two functions: Map() and Reduce(). Map() sorts and filters the data, organizing it into groups of key-value pairs. Reduce() takes the Map() output and summarizes it into smaller sets of tuples.
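
To make the two phases concrete, here is a minimal, hypothetical word-count sketch written in plain Python that mimics the Map-shuffle-Reduce flow (real Hadoop jobs use the MapReduce API or Hadoop Streaming, so this is only an illustration):

from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map(): emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort groups identical keys together; Reduce() sums each group
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data is big", "hadoop handles big data"]
print(dict(reduce_phase(map_phase(lines))))
# {'big': 3, 'data': 2, 'hadoop': 1, 'handles': 1, 'is': 1}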

4. PIG

Developed by Yahoo, PIG helps to structure the data flow and thus, aids in the processing and analysis of large data sets. It helps in optimizing the processing of the entire set by executing the commands in the background. After the processing, PIG stores the acquired result in HDFS.

5. HIVE

Combining an SQL-like methodology with its own interface, HIVE helps to write and read large sets of data. It allows both batch processing and real-time processing, which makes it highly scalable. Plus, it supports SQL data types, making query processing simpler.

6. Mahout

Machine learning is a thing of the future, and many programming languages are trying to integrate it; Python, for example, has many machine learning libraries. Mahout brings machine learning to Hadoop. It gives you functions like clustering, classification, and collaborative filtering, and it provides various libraries as well.

7. Apache Spark

hadoop ecosystem

source

If you want to engage in real-time processing, then Apache Spark is the platform that can help you. It handles a number of processing-intensive tasks like iterative and interactive real-time processing, graph conversions, batch processing, etc.

8. Apache HBase

Apache HBase is a NoSQL database. It can handle any kind of data and provides capabilities similar to Google's Bigtable, which makes working on big data sets efficient and easy. HBase is built for storing and serving small pieces of data within huge tables, so it responds quickly when you want to retrieve something small from a huge database.

9. Solr, Lucene

These two services are used for searching and indexing. Lucene is a Java library for full-text search and indexing, and Solr is a search platform built on top of it.

10. Zookeeper

A lack of coordination and synchronization can cause inconsistency within the Hadoop ecosystem. Zookeeper helps with synchronization, grouping, and maintaining inter-component communication to reduce that inconsistency.

11. Oozie

Oozie is a scheduler that binds and schedules jobs as a single unit. There are two kinds of Oozie jobs: Oozie workflow jobs and Oozie coordinator jobs. Workflow jobs execute actions sequentially, while coordinator jobs run when an external trigger fires them.

Get yourself acquainted with the Hadoop ecosystem and you can tackle big data analytics much more easily. For the direction needed to excel in data science, you can try the data science course by Coding Ninjas. Best of luck.

Must-know Python libraries for any aspiring Data Scientist

We are all aware of Python – the simple language that is currently defining the digital world. Pairing machine learning capabilities with simple coding, Python is a big hit among data scientists, along with the data-science-specific language R. However, if you really wish to master Python and build your career as a data scientist, then you should know its most popular libraries. Because of its simplicity, Python offers a lot of libraries for different use cases. If you are looking to make your mark as a data scientist, you should familiarize yourself not only with Python itself but also with these libraries:

NumPy

numpy

Source

NumPy is a great open-source library mostly dedicated to numerics. It has pre-compiled functions that make working with large multidimensional arrays and matrices easy. Even for basic numerical operations, NumPy is so simple that you don't have to write explicit loops as you would in C++. While it may not have an integrated data analysis facility, its array computing pairs well with other data analysis tools and makes them easier to use.
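
A minimal sketch of the loop-free array arithmetic described above (the numbers are made up):

import numpy as np

# Element-wise arithmetic on whole arrays, with no explicit loops
temperatures_c = np.array([18.5, 21.0, 25.3, 30.1])
temperatures_f = temperatures_c * 9 / 5 + 32
print(temperatures_f)            # approx. [65.3 69.8 77.5 86.2]

# Basic work with a 2-D array (matrix)
matrix = np.arange(6).reshape(2, 3)
print(matrix.T @ matrix)         # product of the transpose with the matrix itself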

Scipy


Source

SciPy is a Python module that provides fast N-dimensional array manipulation. It not only makes numerical routines easier but also helps with numerical optimization and integration, and it ships with modules for linear algebra, optimization and integration – all important tools in data science.
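
A small sketch of the integration and optimization routines mentioned above (toy functions, not tied to any dataset):

from scipy import integrate, optimize

# Numerically integrate f(x) = x^2 over [0, 1]; the exact answer is 1/3
area, abs_error = integrate.quad(lambda x: x ** 2, 0, 1)

# Find the minimum of a simple quadratic; the exact minimiser is x = 3
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)

print(area, result.x)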

Matplotlib


Source

If you want to add visualization to your project, then Matplotlib is the best way to go. It can be used to quickly make pie charts, line diagrams, histograms and other professional-looking figures, and you can customize almost every aspect of a specific figure. The best part: you can export the images into graphic formats like PNG, JPEG, PDF, etc.
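
A minimal sketch of building and exporting a figure (the sales numbers are made up):

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 145]

plt.figure(figsize=(6, 4))
plt.plot(months, sales, marker="o", color="teal")
plt.title("Monthly sales")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.savefig("monthly_sales.png", dpi=150)   # export as png (pdf, jpeg, etc. also work)
plt.show()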

Scikit-Learn


Source

Since machine learning is the way of the future, Scikit-Learn is the machine learning module introduced for Python. It gives you a set of common machine learning algorithms through a consistent interface. There are a lot of algorithms available in Scikit-Learn, and it comes in handy for machine learning tasks like regression, classification and clustering.
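
The consistent interface means every estimator follows the same fit/predict pattern; a minimal sketch using the bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)   # any estimator exposes fit() and predict()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))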

Pandas


Source

For data munging, the best Python module is Pandas. It offers high-level data structures and tools that are well suited for fast data analysis. It is built on NumPy, so NumPy can be used easily alongside it.
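
A short sketch of typical data munging on a small, hypothetical table:

import pandas as pd

# A hypothetical table of daily temperatures
df = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "min_temp": [18.0, 19.5, 24.0, None],
    "max_temp": [30.0, 31.5, 33.0, 32.0],
})

df["min_temp"] = df["min_temp"].fillna(df["min_temp"].mean())  # fill a missing value
df["range"] = df["max_temp"] - df["min_temp"]                  # derive a new column
print(df.groupby("city")["range"].mean())                      # aggregate per city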

NLTK


Source

NLTK is one of the best toolkits for working with human language data. It has a simple interface and more than 50 corpora and lexical resources, such as WordNet, which can be used for tokenization, tagging, parsing and much more. NLTK is so popular that it is often used to build prototypes of research systems.
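
A minimal tokenization and WordNet lookup sketch (the exact download resource names can vary between NLTK versions, so treat the download calls as an assumption):

import nltk
nltk.download("punkt")      # tokenizer models, needed once
nltk.download("wordnet")    # the WordNet lexical database, needed once

from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

tokens = word_tokenize("Data science turns raw data into insight.")
print(tokens)

# Look up WordNet synonyms for one of the tokens
print([lemma.name() for lemma in wordnet.synsets("insight")[0].lemmas()])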

Statsmodels

Source

Statsmodels lets you explore data, estimate different statistical models and perform statistical tests. It also provides a range of plotting functions and result statistics for each type of data.

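
A minimal ordinary least squares (OLS) sketch on synthetic data:

import numpy as np
import statsmodels.api as sm

# Synthetic data where y depends roughly linearly on x
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.5 * x + 1.0 + rng.normal(0, 1, 100)

X = sm.add_constant(x)        # add the intercept term
model = sm.OLS(y, X).fit()    # fit ordinary least squares
print(model.summary())        # coefficients, p-values, R-squared, and more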

PyBrain


Source

PyBrain, short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library, is built around neural networks. It can be used for both unsupervised learning and reinforcement learning. If you want a tool for real-time analytics, this is a good way to go.

Gensim


Source

Built on both SciPy and NumPy, the Gensim library is for topic modeling. From scalability to optimized math routines, this open-source library will keep you delighted with its simple interface and platform independence.
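
A tiny topic-modeling sketch on a toy corpus of already-tokenised documents (purely illustrative):

from gensim import corpora, models

texts = [
    ["data", "science", "python", "statistics"],
    ["hadoop", "big", "data", "cluster"],
    ["python", "machine", "learning", "statistics"],
]

dictionary = corpora.Dictionary(texts)                  # map each token to an id
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

print(lda.print_topics())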

Theano


Source

Much like NumPy, Theano is a library that focuses on numerical computation. It lets you define, optimize and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
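
A minimal symbolic-expression sketch (note that Theano's original development has stopped, so this assumes a legacy Theano install):

import theano
import theano.tensor as T

# Define a symbolic expression over matrices and compile it into a callable function
x = T.dmatrix("x")
y = T.dmatrix("y")
z = T.dot(x, y) + x

f = theano.function([x, y], z)
print(f([[1, 2], [3, 4]], [[1, 0], [0, 1]]))   # [[2. 4.] [6. 8.]]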

So, get set and start your data science journey with these must-know Python libraries. To make sure your Python game is strong, you can also look at the courses offered by Coding Ninjas. Have a look at our course on Machine Learning and Data Science and set out on your journey to become a distinguished data scientist. Best of luck.

 

Step-by-step guide to execute Linear Regression in Python

As most of us already know, linear regression is used to model the relationship between two continuous variables. There are various ways of going about it, and various applications as well. In this post, we are going to explain the steps of executing linear regression in Python.

There are two kinds of supervised machine learning algorithms: Classification and Regression. Regression tries to predict the continuous value outputs while classification tries to predict discrete values. Here we will be using Python to execute Linear Regression. For this purpose,  Scikit-Learn will be used. Scikit-Learn is one of the most popular machine learning tools for Python.

First up – what is linear regression?

Linear regression is based on the principle of a linear relationship between two or more variables. Its task is to predict a dependent variable value, let's say y, based on an independent variable, let's say x. Hence, x becomes the input and y is the output. When plotted on a graph, this relationship gives a straight line, so we use the equation of a straight line:

y=mx+b

Here m is the slope of the line and b is the intercept. Since y and x are given by the data, the changes take place in the slope and the intercept, and there can be many candidate straight lines. What a linear regression algorithm does is fit many lines through the data points and return the line with the least error.

A regression model can be represented as:

y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn

With several input variables, this best-fit line generalizes to what is referred to as a hyperplane.
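
As a quick, hypothetical illustration of recovering m and b before we move to Scikit-Learn, NumPy's polyfit performs exactly this least-squares fit on a handful of made-up points:

import numpy as np

# Made-up points that roughly follow y = 2x + 1
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

m, b = np.polyfit(x, y, deg=1)   # least-squares fit of a straight line
print(m, b)                      # values close to 2 and 1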

So, how can we use the Scikit-Learn library to execute linear regression?

Let's say a number of flight delays have taken place due to weather changes. To measure this relationship, we can perform linear regression on weather data that records the minimum and maximum temperatures for particular days. Download the weather data set; the input x will be the minimum temperature, and using it we have to predict the maximum temperature y.

Import all the necessary libraries

import pandas as pd 

import numpy as np 

import matplotlib.pyplot as plt 

import seaborn as seabornInstance

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn import metrics

%matplotlib inline

Now check the data by exploring the number of rows and columns in the dataset:

dataset.shape

You will receive output in the form of (n rows, n columns)

For statistical display, use:

dataset.describe()

Now, try to plot the data points on a 2-D graph to get a feel for the relationship just by glancing at it. We can do so using:

dataset.plot(x='MinTemp', y='MaxTemp', style='o')
plt.title('MinTemp vs MaxTemp')
plt.xlabel('MinTemp')
plt.ylabel('MaxTemp')
plt.show()

linear regression in python

Source

Here we have used MinTemp and MaxTemp for the analysis. Next, let's check the distribution of the maximum temperature; once plotted, we can see that the average maximum temperature lies between 25 and 35.

plt.figure(figsize=(15,10))
plt.tight_layout()
seabornInstance.distplot(dataset['MaxTemp'])

linear regression in python

Source

Once we have done that, we have to divide the data into labels and attributes. Labels are the dependent variables that need to be predicted, and attributes are the independent variables. Here we want to predict MaxTemp from the values of MinTemp, so the attribute is 'MinTemp' (the X value) and the label is 'MaxTemp' (the y value).

X = dataset['MinTemp'].values.reshape(-1,1)
y = dataset['MaxTemp'].values.reshape(-1,1)

Now we can assign 80% of this data to the training set and the rest to the test set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

After this, we can train the data using the following:

regressor = LinearRegression() 

regressor.fit(X_train, y_train) #training the algorithm

Training finds the best values for the slope and intercept, giving the line that best fits the data. We can retrieve them with the following code:

# To retrieve the intercept:
print(regressor.intercept_)

# To retrieve the slope:
print(regressor.coef_)

With the algorithm trained, we can now use it to make some predictions of the MaxTemp. Our test data can be used for that. We use the following:

y_pred = regressor.predict(X_test)

After we find the predicted value, we have to match it with the actual output value.

We use this script for it:

df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})

df

Now, there is a possibility that you will find some variance between the predicted and actual outcomes.

So, take 25 of the records and plot a bar graph using this script:

df1 = df.head(25)
df1.plot(kind='bar', figsize=(16,10))
plt.grid(which='major', linestyle='-', linewidth=0.5, color='green')
plt.grid(which='minor', linestyle=':', linewidth=0.5, color='black')
plt.show()

linear regression in python

Source

In the bar graph, you can see how close the predictions are to the actual output. Now, plot the fitted regression line against the test data:

plt.scatter(X_test, y_test, color='gray')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()

linear regression in python

Source

A straight line that tracks the data points closely indicates that the algorithm has captured the relationship well.

Now, you have to evaluate the performance of the algorithm. This is done with the following metrics (a short hand computation of each follows the list):

  1. Mean Absolute Error (MAE): the mean of the absolute values of the errors.
  2. Mean Squared Error (MSE): the mean of the squared errors.
  3. Root Mean Squared Error (RMSE): the square root of the mean of the squared errors.
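
As a minimal sketch (assuming the y_test and y_pred arrays from above), the three metrics can be computed by hand with NumPy:

errors = y_test - y_pred                # residuals of the predictions
mae = np.mean(np.abs(errors))           # Mean Absolute Error
mse = np.mean(errors ** 2)              # Mean Squared Error
rmse = np.sqrt(mse)                     # Root Mean Squared Error
print(mae, mse, rmse)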

The Scikit-Learn library has pre-built functions that calculate these metrics for you, using the following script:

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Multiple Linear Regression

linear regression in python

Source

Now let's imagine you have multiple input variables to work with. This means you have to use multiple linear regression. An example would be predicting the quality of wine: you have to take into account various factors like residual sugar, chlorides, pH level, alcohol, density, etc. These are the inputs that together determine the quality of the wine.

So, as we did earlier, we will first import the libraries: 

import pandas as pd 

import numpy as np 

import matplotlib.pyplot as plt 

import seaborn as seabornInstance

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn import metrics

%matplotlib inline

Again, explore the rows and columns using:

dataset.shape

Find the statistical data by using:

dataset.describe()

Now, we first have to clean up the data a little. We can check for missing values with the following script:

dataset.isnull().any()

All the columns should return False for this check, but if one of them turns out to be True, use this script to fill in the missing values:

dataset = dataset.fillna(method='ffill')

Next, we divide them into labels and attributes. 

X = dataset[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']].values
y = dataset['quality'].values

linear regression in python

Source

Check the distribution of the quality column:

plt.figure(figsize=(15,10))
plt.tight_layout()
seabornInstance.distplot(dataset['quality'])

Separate 80% of the data for training and 20% for testing.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Train the specific model:

regressor = LinearRegression() 

regressor.fit(X_train, y_train)

Now, make predictions on the test data and check the difference between the predicted and actual values:

y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

df1 = df.head(25)

Plot it on a graph:

df1.plot(kind='bar', figsize=(10,8))
plt.grid(which='major', linestyle='-', linewidth=0.5, color='green')
plt.grid(which='minor', linestyle=':', linewidth=0.5, color='black')
plt.show()

linear regression in python

Source

Finally, evaluate the performance of the algorithm by applying the following script:

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

If you are seeing large errors, it can be due to any of these factors:

  • Inadequate data: better predictions need more data inputs.
  • Poor assumptions: assuming a linear relationship for data that does not actually have one will lead to errors.
  • Poor use of features: if the features used do not correlate strongly with the target, the predictions will be poor.

So, this was a sample problem on how to perform linear regression in Python. Let's hope you can ace linear regression using Python! If you're looking to get your concepts of machine learning and Python crystal clear, Coding Ninjas might be able to help you out.

Efficiently tackling a Data Science project

“Hiding within those mounds of data is the knowledge that could change the world.” – Atul Butte

Data Science is precisely what comes in handy while searching for knowledge in these heaps of data.

However, if you get stuck at any point of your process, take a moment to ask yourselves a few questions:

-Do you like the idea of your project getting pushed and getting out there in the real world?

-Do you love it when you are a useful resource for your company, who provides actionable insights?

-Do you want to build an efficient Data Science project which can work on a real-time basis?

Even if you have a single affirmative answer, then, my friend, you are on the right path to achieving your goal. These questions can boost your motivation in seconds.

We are providing you with a guide for how to tackle a Data Science project efficiently:

Familiarize yourself with your area of interest

It is impossible to keep an eye on every detail while working with large datasets, but you should still be deeply involved with the subject matter of the project. Working otherwise can ruin the whole project.

-Without proper background knowledge, you will surely be making a lot of mistakes.

-A deep understanding of the thing you are dealing with can prevent potential errors.

-If you complete this process efficiently, then you are already a step ahead of your peers.

Determine your question

You have to dig deeper to find out all the useful questions that might matter.

-Is there a possibility that the information you are looking for doesn’t exist?

-How often has your problem been put up or answered before?

-Are you content with the math of the process?

-Would you still be comfortable going on with the project when it gets monotonous or frustrating?

In this ongoing process, you might encounter a lot of datasets, some useful and some not. Only your passion for the project can keep you moving onwards.

Find a Dataset related to your question

Sometimes you can directly find relevant databases on sites like the Census Bureau or the Bureau of Labor Statistics. These carry some of the conventional datasets you may be looking for. However, they will not always have accurate or complete information.

Keep other options open if you’re unable to find the exact dataset:

-You can reach out to others who have worked on multiple projects, or at least have some experience, and see if they are familiar with the dataset.

-You can also find a related database and mould your question according to it. Be adaptable enough and continue with the second step.

Adjust your parameters where needed; the results will be well worth it.

Familiarize yourself with the Database

Try and visualize your data as much as you can. Take some time and explore several visualizations of the data collected. See if creating graphs and charts, or finding the minimum and maximum, helps you envision the data more clearly. You can go on with the following measures:

-If you can isolate trends with the information, it can be beneficial for you to get through the final step.

-Go through all the datasets and cross-check all the information. It might lead you to the results that you have been looking for.

If you're still wondering about the complexities, come straight to our data science courses. We at Coding Ninjas will make you confident enough to tackle your project on your own.

An insight into the role of a Data Scientist

Data science is the hottest topic in the IT industry right now. It has become one of the most necessary skill sets, and data scientist has been called the sexiest job of the 21st century.

The question to follow is: why?

Given the demand-driven nature of this industry, you need to keep evolving with it. Every day, enormous amounts of data are produced, and this data can be used to change the structure of every industry in the world. This is where the data scientist comes in, and that is what makes it the sexiest job of the 21st century.

But, what does a data scientist really do?

With the introduction of big data to today's world, the focus first shifted to its storage. While tools like Apache Hadoop and Microsoft HDInsight solved the problem of storing the data, the focus then shifted to processing and working with it. And that is what a data scientist has to do: analyse the data and interpret it to produce meaningful insights.

Sounds really simple, doesn’t it?

WELL, not so much!

The whole process of collecting the data, cleaning it, applying algorithms to mine it, analysing it and interpreting it to develop an insight that actually answers the problem is precisely what data scientists have to do!

Good command over programming and strong data intuition are the primary weapons required. While being acquainted with Hadoop or Hive is not a necessary skill, a data scientist will, in all probability, end up acquiring it.

To put it simply:
“A Data Scientist is better at statistics than any software engineer and better at software engineering than any statistician.” ― Josh Wills, Director of Data Engineering at Slack

IMAGINE!

They have to code their way to fetch and visualise the data, and once that is done- they have to do ALL THAT MATH!

To break this whole process into categories, we have:

Data collection: Data collection is the process of gathering and measuring data, information or any variables of interest in a standardised and established manner that enables the collector to answer or test hypotheses and evaluate outcomes of the particular collection.

Data Visualization: To communicate information clearly and efficiently, data visualisation uses statistical graphics, plots, information graphics and other tools. Numerical data may be encoded using dots, lines, or bars, to visually communicate a quantitative message. Useful visualisation helps users analyse and reason about data and evidence. It makes complex data more accessible, understandable and usable. This uses tools like Tableau, plotly, RAW etc.

Data Analysis: Data analytics focuses on processing and performing statistical analysis of existing data sets. Analysts concentrate on creating methods to capture, process, and organise data to uncover actionable insights for current problems, and establishing the best way to present this data.

And if this doesn’t explain why data scientists have been labelled as the Unicorns of the IT industry, probably nothing will!


So, that is all you need to know about data science as a career. To put all of this into perspective, Coding Ninjas has curated a course that covers the A to Z of data science, starting from the fundamentals and covering all the concepts involved in the visualisation, gathering and analysis of data.

For further details on this course, you can check out the course curriculum here.