The popularity of Python as a developer's choice has been rising over the past few years; IEEE Spectrum ranked it the number-one programming language in 2017. If you are planning to build a career in a technical field, this is a good time to learn Python, as it is useful in web development, DevOps, and data science.
The adoption of Python is most visible in machine learning, data science, and related research. As a general-purpose language, Python can effectively handle data mining and processing, machine learning and deep learning algorithms, and data visualization, making it a go-to choice for data scientists.
Python has a strong hold within the data science community because of its rich repository of data science libraries. Important functions and objects for related tasks are bundled together in packages; libraries are sets of such packages that are then imported into scripts.
To get started with data science in Python, one must become acquainted with these libraries to perform everything from basic to the most advanced data science tasks. We have compiled a list of libraries one must gain working knowledge of. Each library ships with its own documentation, and all of them can be explored interactively in the IPython notebook, also known as the Jupyter Notebook.
NumPy is a fundamental library for scientific computing of any kind. Its main object is the multidimensional array (ndarray), along with a set of routines that perform logical, statistical, and mathematical operations. NumPy arranges data of all types into arrays, which makes the data easy to manipulate and to integrate with databases. It can also perform Fourier transforms, linear algebra, random number generation, and array reshaping.
Many other libraries use NumPy for their basic input and output structures. NumPy does not itself offer powerful data analytics functionality, but it lays the foundation for it. Combined with SciPy and Matplotlib, which are described below, it can be used for technical computations.
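A minimal sketch of these core NumPy routines (the variable names here are illustrative, not from the original article):

```python
import numpy as np

# Build a 2-D array and apply a few of the routines mentioned above.
a = np.arange(6).reshape(2, 3)          # reshape a flat array into 2 rows x 3 columns
print(a.mean())                          # statistical routine -> 2.5
print(a.T.shape)                         # transposing changes the shape to (3, 2)
print(np.dot(a, a.T))                    # linear algebra: a 2x2 matrix product

rng = np.random.default_rng(0)           # random number generation
print(rng.random(3))                     # three random floats in [0, 1)
```

The same ndarray object flows through statistics, linear algebra, and random sampling, which is why so many other libraries build on it.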
SciPy stands for Scientific Python and is a library built on NumPy, which means the two are used together. To use SciPy effectively, it is essential to understand how array shapes and data types can be manipulated. SciPy comprises task-specific submodules such as scipy.linalg and scipy.fftpack that perform high-level routines for linear algebra, interpolation, Fourier transform calculations, optimisation, and so on. As machine learning tasks rely heavily on linear algebra and statistical methods, knowledge of SciPy is essential.
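A short sketch of two of those submodules, assuming a tiny hand-made system and signal (the values are illustrative only):

```python
import numpy as np
from scipy import linalg, fftpack  # task-specific submodules

# Solve a small linear system A x = b with scipy.linalg.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
print(x)  # -> [2. 3.]

# A discrete Fourier transform with scipy.fftpack.
signal = np.array([1.0, 0.0, -1.0, 0.0])
print(fftpack.fft(signal))
```

Note how both submodules accept and return plain NumPy arrays, which is what the paragraph above means by SciPy being built on NumPy.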
Pandas is a core package that provides data structures for working with all kinds of data. It is one of the most powerful data analysis tools and is used in domains such as finance, economics, and statistics. It can effectively perform the basic functions on data, from loading to modelling and analysis.
Pandas can easily convert data structures into DataFrames (2-D tabular data) and manipulate the data within them. It facilitates easy handling of missing data and automatic data alignment. It performs operations such as adding rows or columns, resetting and deleting indexes, and pivoting and reshaping DataFrames, among others. Finally, pandas can also export a table to Excel or to a SQL database.
As the name suggests, Matplotlib is a plotting library; its submodule pyplot is used extensively to plot values held in NumPy arrays. matplotlib.pyplot is a set of commands that perform 2D plotting in the style of MATLAB. It produces graphs such as scatter plots, spectrograms, histograms, and quiver plots, and it can be combined with graphics toolkits like PyQt to embed plots in more advanced applications.
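A minimal pyplot sketch, using the non-interactive Agg backend so it runs without a display (the data plotted is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")                  # headless backend; no window is opened
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(x, np.sin(x))                 # basic 2-D line plot
ax2.hist(np.random.default_rng(0).normal(size=500), bins=20)  # histogram
fig.savefig("demo.png")                # write the figure to a file
```

In an interactive session or a Jupyter Notebook, plt.show() would display the figure instead of saving it.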
Seaborn is also a data visualization tool, based on Matplotlib, that creates attractive statistical charts and graphs. Matplotlib can be used to customise the plots Seaborn creates. Seaborn has functions to fit and visualize linear regression models and to plot time series data; internally it performs operations on arrays and DataFrames, along with aggregation, to produce the resulting plots. It is important to understand, however, that Seaborn is not a replacement for Matplotlib but a complement to it.
Bokeh and Plotly are other powerful visualization tools; they are independent of Matplotlib and are mainly web-based toolkits.
Machine Learning / Deep Learning
Scikit-learn is an essential machine learning library that supports supervised and unsupervised learning on small to medium-sized datasets. It is built on SciPy, so both NumPy and SciPy need to be installed before one starts using scikit-learn. NumPy and SciPy focus on data wrangling and manipulation, whereas scikit-learn focuses on modelling that data. Algorithms for regression and classification, dimensionality reduction, ensemble methods, and feature extraction can all be run using scikit-learn.
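A minimal supervised-learning sketch using scikit-learn's bundled iris dataset (the model choice here, logistic regression, is just one example of the classification algorithms mentioned above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small classification dataset shipped with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a test set, fit a classifier, and score it on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)  # accuracy on the held-out data
```

The fit/predict/score pattern shown here is uniform across scikit-learn's estimators, which is much of what makes the library easy to learn.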
Theano allows a user to define, optimize, and evaluate mathematical operations involving multi-dimensional arrays efficiently, much as in NumPy. It is also a foundational library for deep learning in Python. It is a powerful mathematical compiler that combines native libraries such as BLAS with a C compiler to run on both GPU and CPU. For deep learning, Theano is rarely used on its own; it is wrapped by libraries such as Keras or Lasagne to build models, which greatly improves computation speed.
Keras is a library for modelling artificial neural networks that runs on top of Theano or TensorFlow as its backend. It was designed mainly for experimenting with deep neural networks and is not an end-to-end machine learning framework like scikit-learn.
Keras builds a neural network as a sequential model, which is simply the stack of layers a neural network is primarily made of. Data is prepared as tensors and fed to the input layer, each layer is given a suitable activation function, and the last layer serves as the output layer. Keras has greatly simplified the construction of ANNs.
TensorFlow is a relatively new machine learning library, developed by Google as the engine behind its environment for training neural networks. High-profile Google applications such as Google Translate were built using TensorFlow. It also supports computation on both CPU and GPU. TensorFlow competes with Theano as a preferred backend library, with pros and cons that differ by application. Its multilayer node system makes it easy to work on large datasets, though execution may be a little slower than Theano.
Natural Language Processing
Natural Language Toolkit (NLTK)
NLTK consists of a suite of libraries for the symbolic and statistical tasks involved in natural language processing (NLP). NLP is used wherever human-machine interaction involves language, and it is applied to topic segmentation, opinion mining, and sentiment analysis, to name a few. NLTK supports NLP tasks such as tokenization, classification, tagging, parsing, and semantic analysis of input data. By structuring and tokenizing the input data, it helps convert written words into vectors.
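A small tokenization sketch with NLTK; the WordPunctTokenizer is used here because it is regex-based and needs no downloaded corpora (the sample sentence is invented):

```python
from nltk.tokenize import WordPunctTokenizer
from nltk.probability import FreqDist

text = "NLP turns raw text into structure: tokens, tags, and parse trees."

# Split the sentence into word and punctuation tokens.
tokens = WordPunctTokenizer().tokenize(text)
print(tokens[:4])

# Count word frequencies, ignoring punctuation.
fd = FreqDist(t.lower() for t in tokens if t.isalpha())
print(fd.most_common(2))
```

Token lists like this are the usual starting point before tagging, parsing, or vectorizing text.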
There are several alternatives to the libraries above for the tasks mentioned; these, however, are the ones that have gained the most popularity in the data science world. Beyond them, data scientists should also know web crawling and data mining libraries such as BeautifulSoup, Scrapy, and Pattern, which are not covered here but are of much importance.
A data scientist or machine learning expert must be adept with these primary libraries, and the best place to start mastering them is their documentation. While the documentation gives a basic understanding of how they are used, with examples, one must practice on different datasets to get a real hold of the subject. After all, practice makes perfect, doesn't it? So, keep learning!