Image created by the author (A. Kariyawasam) using the Canva AI platform.
Python Programming for Data Analysts: A Brief Introduction
Author: Adisha Kariyawasam
This short article is intentionally pitched at a relatively high level for anyone who is interested in understanding the importance and relevance of Python programming and data analysis, particularly in a business context. It is the first in a series of articles intended to encourage further independent research. Incidentally, Data Analytics and Data Modelling are subjects that I teach at postgraduate level for the BPP University Business School Faculty as part of the MSc Management with Data Analytics programme.
Background context
Python is a relatively modern programming language that has evolved over the years and made very important contributions to today's data-driven world. It was created by Guido van Rossum, a Dutch computer scientist, and first released on 20th February 1991 (as Python 0.9.0). Here is an interesting YouTube clip detailing the Python development process from his point of view.
It was designed with a philosophy emphasising readability, simplicity and explicitness, and has grown in popularity in part due to a feature-rich, well-supported ecosystem of libraries, modules and packages that perform a wide variety of functions. These libraries are free and open source, maintained by an active and dedicated community of developers and enthusiasts. For this reason, Python is well suited to the fields of data science, numerical analysis, statistics and machine learning. As such, it is considered a very powerful tool for data analysts and has also influenced the development of other open-source languages such as Ruby, Swift, and Julia*.
*If you are interested in a comparison of these languages for application development, see this video on YouTube.
And Now for Something Completely Different!
Python's curious name is in fact derived from the popular (and somewhat surreal) British BBC comedy television series "Monty Python's Flying Circus", of which Guido van Rossum was - according to Python folklore - a fan.
Image source: https://best-tv-shows.fandom.com/wiki/Monty_Python%27s_Flying_Circus
He wanted a name that was distinctive, unique, short, and somewhat mysterious so "Python" not only fit the criteria but also added a sense of humour!
Why choose Python in Data Analytics?
There are several advantages to using Python including its versatility, rich ecosystem, and ease of learning:
Versatility: Python is versatile and can be applied in a scalable fashion to various data analytics tasks.
Ecosystem: Python has an extensive set of libraries (see next section) and frameworks that can be used for data analysis, data processing, data modelling and visualisation, and statistical analysis.
Ease of Learning: Python is considered a high-level, highly readable language, which makes it ideal for beginners and advanced analysts alike.
Python is highly valued by blue-chip companies such as Dropbox, Google, Instagram, Netflix, Pinterest, Quora, Spotify, Uber and YouTube.
In fact, many data scientists refer to Python as "the language of choice" when it comes to complex data analysis.
Python Libraries
Python libraries are modular and, like Lego bricks, are easily incorporated (imported) into a Python program's structure and referenced throughout the program; a minimal sketch follows the list below. Here are some of the most common libraries available for data analysis:
Pandas: The name derives from "panel data"; this library simplifies data manipulation, cleaning, and pre-processing tasks through structures called DataFrames [DF].
NumPy: Short for Numerical Python and used for numerical operations and array processing.
Matplotlib and Seaborn: These libraries are used for data visualisation.
SciPy: Used in scientific computing and advanced statistical analysis.
Scikit-learn: Scikit-learn is for machine learning tasks within data analysis.
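To give a flavour of how these "Lego bricks" snap together, here is a minimal sketch (the dataset and its column names are made up purely for illustration) that imports Pandas and NumPy and runs a quick summary:

```python
# Importing a library makes its functionality available throughout the program.
import pandas as pd   # data manipulation via DataFrames
import numpy as np    # fast numerical arrays and operations

# A tiny, made-up dataset held in a pandas DataFrame.
sales = pd.DataFrame({
    "region": ["North", "South", "East"],
    "revenue": [12500, 9800, 14300],
})

print(sales.describe())            # summary statistics from pandas
print(np.mean(sales["revenue"]))   # the same mean computed via NumPy
```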
Data Cleaning and Pre-processing
Data cleaning, also known as data wrangling, is the process of cleaning, transforming, and preparing raw data into a format suitable for further, more detailed analysis. Data pre-processing involves a series of steps that aim to make the data more understandable and valuable for analysis or machine learning tasks, and it addresses the following (not an exhaustive list by any means):
Handling Missing (or null) Values:
Unrefined or 'raw' data may contain missing (or null) values, which can pose challenges during analysis. Data wrangling involves implementing strategies for dealing with any missing data, such as imputation (estimation) or removal; both are sketched below.
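As a minimal, hypothetical sketch of both strategies using Pandas (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# A small, made-up dataset with some missing (NaN) values.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [30000, 42000, np.nan, 51000],
})

# Strategy 1: removal - drop any row that contains a missing value.
dropped = df.dropna()

# Strategy 2: imputation - estimate missing values, here using the column median.
imputed = df.fillna(df.median(numeric_only=True))

print(dropped)
print(imputed)
```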
Dealing with Outliers:
Outliers are extreme, non-standard values that can potentially distort analyses or models. An important step in data pre-processing involves not only identifying outliers but also handling them so that they don't unduly influence or negatively bias the results.
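One common (though by no means the only) approach is the interquartile range (IQR) rule; here is a sketch with invented numbers:

```python
import pandas as pd

# Made-up measurements; the value 98 looks suspiciously extreme.
values = pd.Series([12, 14, 13, 15, 14, 98, 13, 12])

# The IQR rule flags values more than 1.5 * IQR beyond the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
cleaned = values[(values >= lower) & (values <= upper)]
print(outliers)   # 98 is flagged as an outlier
```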
Data Transformation and Standardisation:
As raw data may come in a multitude of formats, data pre-processing necessarily involves converting data into a standardised form (e.g., normalising, scaling, encoding categorical variables, or converting text strings to numbers) to make it easier to compare and analyse.
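For instance, here is a sketch of z-score scaling and one-hot (dummy) encoding with Pandas (the columns and values are invented):

```python
import pandas as pd

# Made-up data mixing a numeric column and a text category.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180],
    "city": ["Leeds", "York", "Leeds", "Hull"],
})

# Scaling: z-score standardisation gives mean 0 and standard deviation 1.
df["height_z"] = (df["height_cm"] - df["height_cm"].mean()) / df["height_cm"].std()

# Encoding: turn the text category into numeric indicator (dummy) columns.
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)
```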
Dealing with Noisy Data:
Noise in data can arise from various sources, like errors in data collection or transmission. Data pre-processing aims to reduce or eliminate this noise to improve the quality (signal-to-noise ratio) of the data.
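One simple illustration of noise reduction is smoothing a signal with a rolling average (the signal below is synthetic):

```python
import numpy as np
import pandas as pd

# A clean sine-wave signal with random noise layered on top.
rng = np.random.default_rng(0)
signal = pd.Series(np.sin(np.linspace(0, 6, 100)) + rng.normal(0, 0.3, 100))

# A rolling (moving) average smooths out short-lived noise,
# improving the signal-to-noise ratio at the cost of some detail.
smoothed = signal.rolling(window=7, center=True).mean()
print(smoothed.dropna().head())
```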
Reducing Computational Overhead:
Through cleaning and preparing the data appropriately, unnecessary computational overheads are avoided, thus making the analysis more efficient.
The process of data cleaning is in fact iterative: issues uncovered at later stages of analysis frequently send the analyst back to earlier cleaning steps.
In terms of data quality, data should be accurate, complete, consistent, reliable, relevant, valid, timely and uniform.
Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is a critical step in the data analysis process: it involves examining and visualising data to understand its characteristics, uncover patterns, and identify relationships between variables or features in the data set.
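As a brief, illustrative sketch of EDA using Seaborn's built-in "tips" demo dataset (fetched on first use):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Seaborn ships with small demo datasets; "tips" records restaurant bills.
tips = sns.load_dataset("tips")

print(tips.info())       # column types and missing-value counts
print(tips.describe())   # summary statistics for the numeric columns

# Visual checks: the distribution of one variable, then the relationship between two.
sns.histplot(tips["total_bill"])
plt.show()
sns.scatterplot(data=tips, x="total_bill", y="tip")
plt.show()
```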
Statistical Analysis and Hypothesis Testing
We use statistical techniques in data analysis fundamentally to determine the accuracy and reliability of data and of the models used to process that data. These are presented as metrics for comparing different algorithms and models.
Accuracy is the closeness of a measured value to the true or target value. It indicates how well a measurement or, very often, an estimate reflects the actual quantity being measured. This will be discussed in more detail in a future post.
Reliability is the consistency, stability, and repeatability of measurements or data over time and across different conditions. It assesses the degree to which a measurement or data point can be trusted to be consistent and dependable. This will be discussed in more detail in a future post.
Achieving both accuracy and reliability is vital for obtaining trustworthy and meaningful results in statistical analysis and scientific research.
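To make this concrete, here is a sketch of one classic hypothesis test, the two-sample t-test, using SciPy on synthetic data (the "sales" framing is invented for illustration):

```python
import numpy as np
from scipy import stats

# Synthetic data: e.g., weekly sales under two different strategies.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=10, size=50)
group_b = rng.normal(loc=105, scale=10, size=50)

# Two-sample t-test: is the difference between the group means
# larger than we would expect from random variation alone?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (conventionally < 0.05) suggests a statistically significant difference.
```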
Machine Learning for Data Analysis
Machine Learning, often abbreviated to ML, is a branch of artificial intelligence [AI] that focuses on enabling computers (via ML algorithms) to learn and make decisions or predictions from data without being explicitly, procedurally programmed. Its origins date back to the 1950s and 60s and early game-playing programs (Arthur Samuel's self-learning checkers program famously popularised the term "machine learning"). This marked a significant milestone in the history of both AI and computer science.
In data analysis, machine learning programs written in Python play a crucial role in extracting insights, making predictions, and automating complex tasks.
There are three main forms of machine learning:
Supervised Learning:
Here the algorithm is trained on a labelled dataset (e.g., predicting house prices based on features like the number of bedrooms, proximity to the public transport network, schools, etc.). Each data point is associated with a target or outcome variable, and the goal is to learn a mapping from input features to target values; the algorithm then makes predictions based on the input features. It is a bit like teaching a child to read through constant guidance and correction while reading their favourite book.
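A minimal supervised-learning sketch with scikit-learn, using invented house data (bedrooms, distance to the nearest station) and prices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up features: [bedrooms, distance to station in km]; target: price in £1000s.
X = np.array([[2, 1.0], [3, 0.5], [4, 2.0], [3, 1.5], [5, 0.8], [2, 3.0]])
y = np.array([180, 260, 290, 240, 380, 150])

# Hold back some labelled data to check the model on houses it has not seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # learn the mapping features -> price
print(model.predict(X_test))                       # predicted prices for unseen houses
```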
Unsupervised Learning:
This involves training algorithms on unlabelled data (e.g., customer segmentation based on shopping behaviour). There is no predefined target variable; instead, the algorithm discovers patterns, structures, or relationships within the data. This is a bit like teaching a child to read a brand new book using their knowledge from previous guidance.
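A sketch of unsupervised customer segmentation with k-means clustering in scikit-learn (the shopping figures are invented; note that no labels are supplied):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up shopping behaviour: [visits per month, average spend in £]. No labels.
X = np.array([[2, 15], [3, 20], [2, 18], [20, 150], [22, 160], [19, 140]])

# Ask k-means to discover two customer segments on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # which segment each customer was assigned to
print(kmeans.cluster_centers_)  # the "typical" customer in each segment
```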
Reinforcement Learning:
Reinforcement learning (e.g., in Game Playing and Robotics) involves an agent interacting with an environment and learning to make a sequence of decisions to maximize a reward signal. The ML agent receives feedback in the form of rewards or penalties for each action it takes. This is akin to giving praise to a child when they read a new book well.
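Full reinforcement-learning systems are beyond an introduction, but the reward-driven idea can be sketched with a toy "two-armed bandit" agent in plain Python (the payout probabilities are invented and hidden from the agent):

```python
import random

# Two slot-machine "arms", each paying out with a hidden probability.
arm_probs = [0.3, 0.7]                  # the environment; unknown to the agent
estimates, counts = [0.0, 0.0], [0, 0]  # the agent's running value estimates

for step in range(1000):
    # Epsilon-greedy policy: mostly exploit the best-looking arm, sometimes explore.
    if random.random() < 0.1:
        arm = random.randrange(2)
    else:
        arm = estimates.index(max(estimates))
    reward = 1 if random.random() < arm_probs[arm] else 0  # feedback from the environment
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # update the running average

print(estimates)  # the agent should learn that the second arm pays out more often
```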
Digital Disruption in the business context
Without doubt, machine learning and Artificial Intelligence (AI) have had a profound impact on businesses across various industries: they leverage data to revolutionise operations, automate and augment decision-making, and facilitate richer customer interactions (through natural language processing, or NLP), to name but a few applications. Together, these changes are considered a form of digital disruption.
Question: Do you think this level of digital disruption has been positive or negative and why?
Here are some things to consider:
Automation and Efficiency:
AI-powered automation can streamline repetitive and time-consuming tasks, increasing efficiency and allowing employees to focus on higher-value activities. This includes the automation and validation of data entry (perhaps through IoT devices), recognising and authenticating customer service interactions, and creating trigger points for routine administrative tasks.
Predictive Analytics:
AI leverages the power of advanced algorithms to analyse historical data, look for patterns, and use those patterns to make informed predictions about future trends in, for example, sales, customer behaviour, demand and supply patterns, and shifts in the market. This information can help organisations make better-informed decisions and develop strategic plans.
Personalised Customer Experiences:
AI-enabled systems can scrutinise customer data to create personalised experiences. This could include tailored product recommendations, content suggestions based on shopping habits as well as targeted marketing campaigns. Personalisation is said to enhance customer satisfaction, value, and loyalty.
Sentiment Analysis:
AI can analyse the results of surveys, social media posts, customer reviews, and product feedback to determine public sentiment towards a particular product, brand, or service. This information can be invaluable for brand management and understanding customer sentiment and can determine and guide focus on future product development.
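As one illustration, NLTK's VADER analyser assigns a "compound" sentiment score to short texts (the reviews below are invented; the lexicon must be downloaded once):

```python
import nltk
nltk.download("vader_lexicon", quiet=True)  # one-off download of the VADER lexicon
from nltk.sentiment import SentimentIntensityAnalyzer

# Invented customer reviews for illustration.
reviews = [
    "Absolutely love this product, great value!",
    "Terrible customer service, very disappointed.",
]

analyser = SentimentIntensityAnalyzer()
for review in reviews:
    scores = analyser.polarity_scores(review)
    # The compound score runs from -1 (strongly negative) to +1 (strongly positive).
    print(f"{scores['compound']:+.2f}  {review}")
```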
Concluding remarks
I hope you enjoyed this very brief introduction and that it has whetted your appetite and inspired you to do some further research. I always welcome comments, feedback and suggestions for future posts which will provide more detailed explorations of this fascinating subject.
As mentioned earlier in this post, Programming for Data Analytics is a cutting-edge subject that I teach at BPP University as part of the MSc Management with Data Analytics programme of study*.
*Please click here if you'd like to find out more.
And finally...
Some common abbreviations
Here are a few common abbreviations you may come across when conducting further research into AI, ML, Python programming and statistics:
AI - Artificial Intelligence
ANN - Artificial Neural Network
API - Application Programming Interface
ANOVA - Analysis of Variance
CNN - Convolutional Neural Network
CDF - Cumulative Distribution Function
CRISP-DM - The Cross-Industry Standard Process for Data Mining
CSV - Comma-Separated Values
CV - Computer Vision
CI - Confidence Interval
DF - Data frame (as in the pandas DataFrame)
DF - Degrees of Freedom
DL - Deep Learning
GAN - Generative Adversarial Network
GUI - Graphical User Interface
HTML - Hypertext Markup Language
IoT - Internet of Things
JSON - JavaScript Object Notation
KNN - K-Nearest Neighbours
MLE - Maximum Likelihood Estimation
ML - Machine Learning
MLaaS - Machine Learning as a Service
NLP - Natural Language Processing
OLS - Ordinary Least Squares
OS - Operating System
PCA - Principal Component Analysis
PDF - Probability Density Function
P-value - Probability value; the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true (often read as a measure of statistical reliability)
R-squared value - The coefficient of determination; a goodness-of-fit (accuracy) measure for a regression model
RL - Reinforcement Learning
RNN - Recurrent Neural Network
SD - Standard Deviation
SDLC - Software Development Life Cycle
SE - Standard Error
SQL - Structured Query Language
SVM - Support Vector Machine
T-test - Student's t-test
URL - Uniform Resource Locator
Z-score - Standard score; the number of standard deviations a value lies from the mean