An overview of the basics of Python as well as Jupyter notebooks and how they can be a useful tool for data science.

References

This primer draws heavily on materials from Stanford's CS231n Convolutional Neural Networks for Visual Recognition.

You may also find the Python section of Kaggle Learn to be helpful.

Why Python for Data Analysis?

Python is currently one of the fastest-growing programming languages in the world. There are many reasons for this trend, but it largely comes down to the fact that Python is easy to learn, has a library or framework for almost everything, and has a large and active community of developers in scientific computing and data analysis. (This last point is very important: if you get stuck, there's a good chance that somebody has already asked your question on Stack Overflow!)

These features have led Python to become one of the most important languages in data science, machine learning, and general software development in academia and industry. Thus, if you're going to learn a new programming language in 2020, then Python is a great choice!

But I don't know how to program!

Don't worry, we will help you!

Since this is a course on practical data science, we believe that the most effective way to learn is to actually do data science on a regular basis. By analogy, you do not need to know music theory in order to learn how to play an instrument! Thus the main goal throughout this semester is to spend the majority of our time working directly with data and code; by the end, the semantics of Python and its data analysis libraries will be almost second nature.

Essential Python libraries

In this course we will introduce several core libraries that underpin much of the work done in data science and machine learning. We will provide more details later in the course, but for now a summary is given below.

NumPy

NumPy (short for Numerical Python) provides the data structures, algorithms, and glue needed for most scientific applications involving numerical data. It provides fast array-processing capabilities and makes linear algebra operations easy. (The latter is part of the reason why NumPy underpins many machine learning and deep learning applications.)

pandas

pandas provides data structures and functions that make working with structured or tabular data fast and simple. The primary objects that we will encounter in this course are:

  • DataFrame: a tabular, column-oriented data structure with row and column labels;
  • Series: a one-dimensional labelled array object.

matplotlib

matplotlib is the de facto Python library for producing plots and other data visualisations.

scikit-learn

scikit-learn is the de facto machine learning toolkit for Python. It includes tools for tasks such as:

  • Classification: SVM, nearest neighbours, random forest, logistic regression, etc.;
  • Regression: Lasso, ridge regression;
  • Clustering: $k$-means, etc.;
  • Model selection: grid search, cross-validation, metrics;
  • Preprocessing: feature extraction, normalisation.

Don't worry if some of these terms are unfamiliar at this stage - we will explain them in detail later in the course.

Jupyter notebooks

Jupyter notebooks are an essential tool for any data science project. As described on Jupyter's website:

"The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more."

Jupyter notebooks are made of cells and there are two basic types of cell that you'll encounter:

  • Text cells for comments, images etc;
  • Code cells for executing Python commands.

You can edit a cell by double-clicking on it and you can execute the code in it by either clicking on the "play" button ▶ or simply pressing Shift + Enter. You can also stop a running cell by pressing the "stop" button ◼︎.

Tab completion

One of the very cool things about Jupyter notebooks is that they provide tab completion, similar to many IDEs. If you start typing an expression in a cell, pressing the Tab key will search the namespace for any variables that match the characters you've typed so far:

the_answer_to_life_the_universe_and_everything = 42
the_beatles = ['john', 'paul', 'george', 'ringo']
the_beatles
['john', 'paul', 'george', 'ringo']

Introspection

Using a question mark ? before or after a variable will display some general information:

the_beatles?
Type:        list
String form: ['john', 'paul', 'george', 'ringo']
Length:      4
Docstring:  
Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list.
The argument must be an iterable if specified.

Introspection can also be used to access the docstring of a function, e.g.

def add_numbers(a, b):
    """
    Add two numbers together
    
    Returns:
        the_sum : type of arguments
    """
    return a + b

Then using ? or pressing Shift + Tab will show us the docstring:

add_numbers?
Signature: add_numbers(a, b)
Docstring:
Add two numbers together

Returns:
    the_sum : type of arguments
File:      ~/git/dslectures/notebooks/<ipython-input-4-543b6483d93f>
Type:      function

Using ?? will also show the source code if possible:

add_numbers??
Signature: add_numbers(a, b)
Source:   
def add_numbers(a, b):
    """
    Add two numbers together
    
    Returns:
        the_sum : type of arguments
    """
    return a + b
File:      ~/git/dslectures/notebooks/<ipython-input-4-543b6483d93f>
Type:      function
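
The ? and ?? syntax only works inside Jupyter and IPython. In a plain Python script, similar information is available through the built-in help function and the standard library's inspect module. The sketch below re-defines add_numbers from above so that it runs on its own; it illustrates one way to get at the same information and is not a description of how Jupyter produces its output.

```python
import inspect


def add_numbers(a, b):
    """
    Add two numbers together

    Returns:
        the_sum : type of arguments
    """
    return a + b


# The call signature, comparable to the "Signature:" line shown by add_numbers?
signature = str(inspect.signature(add_numbers))
print(signature)

# The docstring with its indentation cleaned up, comparable to "Docstring:"
docstring = inspect.getdoc(add_numbers)
print(docstring)
```

Calling help(add_numbers) prints the signature and docstring together in one go.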

Basic data types

This section is based on Stanford's excellent NumPy tutorial and Wes McKinney's Python for Data Analysis book - go check them out for more details!

Like most languages, Python has a number of basic data types including:

  • Integers
  • Floats (double-precision 64-bit floating point number)
  • Booleans (True or False)
  • Strings

These data types behave in ways that are familiar from other programming languages. However, there are some important differences. For example, statically typed languages like C or Java require the type of each variable to be explicitly declared. By contrast, dynamically typed languages like Python skip this specification. For example, in Java you might specify an operation as follows:

/* Java code */
int result = 0;
for(int i=0; i<100; i++){
    result += i;
}

while in Python the same operation could be written this way:

# Python code
result = 0
for i in range(100):
    result += i

Note the main difference: in Java the data type of each variable is explicitly declared, while in Python the types are dynamically inferred. This means we can assign any kind of data to any variable:

# Python code
x = 4
x = "four"

Here we switched the contents of x from an integer to a string. The same thing in Java would lead to a compilation error:

/* Java code */
int x = 4;
x = "four"; // FAILS!

This sort of flexibility makes Python easy (and sometimes dangerous!) to use.
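
To see why this flexibility can be dangerous, consider the small sketch below. The function double is a hypothetical example, not something defined elsewhere in this primer. Because no types are declared, it happily accepts a string, and the * operator then repeats the string instead of doing arithmetic.

```python
def double(value):
    # No type is declared, so this accepts anything that supports `*`
    return value * 2


x = 4
print(type(x).__name__, double(x))  # int 8

x = "4"  # the same name is now bound to a string
print(type(x).__name__, double(x))  # str 44
```

No error is raised in the second call, so a bug like this can go unnoticed until the result is used somewhere else.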

Numbers

Integers and floats work as you'd expect from other programming languages. Let's create two variables:

some_integer = 73
some_float = 3.14159

We can print out these variables and their types as follows:

print(some_integer, type(some_integer))
print(some_float, type(some_float))
73 <class 'int'>
3.14159 <class 'float'>

What if you want to print some text and then some numbers? One way to do this is to cast the number as a string and then print it:

print('My integer was ' + str(some_integer))
print('My float was ' + str(some_float))
My integer was 73
My float was 3.14159

However, I often find it more convenient to pass comma-separated values to the print function:

print('My integer was', some_integer)
print('My float was', some_float)
My integer was 73
My float was 3.14159
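
A third option, used later in this primer, is the formatted string literal or f-string. The sketch below reuses the two variables from above and shows how expressions inside curly braces are substituted into the text.

```python
some_integer = 73
some_float = 3.14159

# Expressions inside {} are evaluated and converted to text
message = f'My integer was {some_integer} and my float was {some_float}'
print(message)

# A format specifier after the colon controls the presentation,
# here rounding to two decimal places
rounded = f'My float was {some_float:.2f}'
print(rounded)
```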

A natural thing to want to do with numbers is perform arithmetic. We've already seen the + operator, used above to join strings; with numbers it performs addition. Python also has us covered for the rest of the basic operations we might be interested in:

Operator   Name             Description
a + b      Addition         Sum of a and b
a - b      Subtraction      Difference of a and b
a * b      Multiplication   Product of a and b
a / b      True division    Quotient of a and b
a // b     Floor division   Quotient of a and b, removing fractional parts
a % b      Modulus          Integer remainder after division of a by b
a ** b     Exponentiation   a raised to the power of b
-a         Negation         The negative of a

print('Sum:', some_integer + some_float)
print('Multiplication:', some_integer * some_float)
print('Division:', some_integer / some_float)
print('Power:', 10 ** some_integer)
Sum: 76.14159
Multiplication: 229.33606999999998
Division: 23.236641318567987
Power: 10000000000000000000000000000000000000000000000000000000000000000000000000

We can also store the result from math operations in new variables, e.g.

my_sum = some_integer + some_float
print('My sum was', my_sum)
My sum was 76.14159

Warning!

A common bug that can creep into your code is confusing true division (/) with floor division (//). True division always yields a float, even when both operands are integers:

3 / 2
1.5

while the floor division operator rounds the result down to the nearest whole number:

3 // 2
1
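
The difference is most visible with negative numbers, because floor division rounds towards negative infinity rather than towards zero. A minimal sketch using only built-ins:

```python
print(3 // 2)       # 1
print(-3 // 2)      # -2: rounded down, not towards zero
print(int(-3 / 2))  # -1: int() truncates towards zero instead

# divmod returns the floor-division quotient and the remainder together
quotient, remainder = divmod(7, 2)
print(quotient, remainder)  # 3 1
```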

Booleans

Python implements all of the usual operators for Boolean logic, but uses English words rather than symbols like &&, ||, etc that are found in other languages:

t = True
f = False
print(type(t), type(f))
# logical AND
print(t and f)
# logical OR
print(t or f)
# logical NOT
print(not t)
# logical XOR
print(t != f)
<class 'bool'> <class 'bool'>
False
True
False
True

Strings

Python has powerful and flexible string processing capabilities. You can write string literals using either single quotes ' or double quotes ":

a = 'one way of writing a string'
b = "another way"

We can also get the number of characters in a string as follows:

hello = 'hello'
len(hello)
5

We can also access each character in a string and print its value:

for letter in hello:
    print(letter)
h
e
l
l
o

Adding two strings together concatenates them and produces a new string:

world = 'world'
hello + ' ' + world 
'hello world'

String objects also come with a range of built-in methods to convert them into various forms:

hello.capitalize()
'Hello'
hello.upper()
'HELLO'
s = 'hitchhiker'
s.replace('hi', 'ma')
'matchmaker'
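
Note that strings are immutable, so these methods never change the original string; they return a new one. The short sketch below makes this visible by keeping both the original and the result.

```python
s = 'hitchhiker'
t = s.replace('hi', 'ma')

print(t)  # matchmaker
print(s)  # hitchhiker: the original is unchanged

# Methods can be chained because each one returns a new string
shout = s.upper().replace('HI', 'MA')
print(shout)  # MATCHMAKER
```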

Containers

Python includes several built-in container types:

  • Lists
  • Dictionaries
  • Sets
  • Tuples

Lists

A list is the Python equivalent of an array, but is resizeable and can contain elements of different types:

some_list = [1,1,2,3,5,8,13,21,34,55,89]
print('This is a list:', some_list)
This is a list: [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

How do we access the individual elements in a list? In Python, the index of elements in a list starts at zero. Thus we should look at the zeroth index to get the first element:

print(some_list[0])
1

To get elements at the end of a list, we use negative indices, e.g.

print(some_list[-1])
89

As noted above, Python lists can contain elements of different types. Let's add a new element to the end of the list:

some_list.append('fibonacci')
print(some_list)
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 'fibonacci']

We can also replace values in a list based on their index, e.g.

some_list[-1] = 148
print(some_list)
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 148]

Finally we can use the pop method to remove and return the last element of a list:

last_element = some_list.pop()
print(last_element)
148
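
Lists offer a few related methods for changing elements at other positions. The sketch below uses a small hypothetical list called letters to show insert, pop with an index, and remove.

```python
letters = ['a', 'b', 'd']

letters.insert(2, 'c')  # insert 'c' before index 2
print(letters)          # ['a', 'b', 'c', 'd']

first = letters.pop(0)  # remove and return the element at index 0
print(first, letters)   # a ['b', 'c', 'd']

letters.remove('c')     # remove the first occurrence of a value
print(letters)          # ['b', 'd']
```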

Slicing

In addition to accessing list elements one at a time, Python provides concise syntax to access sublists; this is known as slicing. Let's begin by using Python's built-in range function to create a list of integers:

L = list(range(10))
L
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

To get the slice covering the half-open interval $[2,4)$, i.e. from index 2 up to but not including index 4, we run

L[2:4]
[2, 3]

while to get a slice from index 2 to the end of the list we run

L[2:]
[2, 3, 4, 5, 6, 7, 8, 9]

To get a slice from the start to index 5 (exclusive) we run

L[:5]
[0, 1, 2, 3, 4]

To get a slice of the whole list:

L[:]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

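Slice indices can also be negative, in which case they count from the end of the list:

L[:-1]
[0, 1, 2, 3, 4, 5, 6, 7, 8]

We can also assign a new sublist to a slice:

L[2:4] = ['a', 'b']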
L
[0, 1, 'a', 'b', 4, 5, 6, 7, 8, 9]
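
Slices also accept an optional third value, the step. The sketch below starts again from a fresh list of integers and shows a few common patterns.

```python
L = list(range(10))

print(L[::2])    # every second element: [0, 2, 4, 6, 8]
print(L[1:8:3])  # from index 1 up to index 8 in steps of 3: [1, 4, 7]
print(L[::-1])   # a reversed copy: [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(L[-3:])    # the last three elements: [7, 8, 9]
```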

Loops

You can loop over the elements of a list like this:

animals = ['cat', 'dog', 'monkey']
for animal in animals:
    print(animal)
cat
dog
monkey

List comprehension

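When programming, we frequently want to transform one list into another. As a simple example, consider the following code that computes square numbers:

nums = [0, 1, 2, 3, 4]
squares = []
for x in nums:
    squares.append(x ** 2)

squares
[0, 1, 4, 9, 16]

We can make this code simpler using a list comprehension:

nums = [0, 1, 2, 3, 4]
squares = [x ** 2 for x in nums]; squares
[0, 1, 4, 9, 16]

List comprehensions can also contain conditions:

nums = [0, 1, 2, 3, 4]
even_squares = [x ** 2 for x in nums if x % 2 == 0]; even_squares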
[0, 4, 16]
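
List comprehensions are not limited to numbers. As a further illustration, the sketch below applies the same two patterns (transform, and filter plus transform) to the list of animal names used earlier.

```python
animals = ['cat', 'dog', 'monkey']

# Transform every element
upper = [animal.upper() for animal in animals]
print(upper)  # ['CAT', 'DOG', 'MONKEY']

# Filter and transform in one pass
long_names = [animal.capitalize() for animal in animals if len(animal) > 3]
print(long_names)  # ['Monkey']
```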

enumerate

When iterating over a sequence, it is common to want to keep track of the index of the current item. Python has a built-in function called enumerate just for this:

for idx, animal in enumerate(animals):
    print(f'#{idx + 1}: {animal}')
#1: cat
#2: dog
#3: monkey
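
If the counter should start at a value other than zero, enumerate accepts an optional start argument, which avoids the idx + 1 arithmetic. A minimal sketch:

```python
animals = ['cat', 'dog', 'monkey']

# start=1 makes the counter begin at 1 instead of 0
lines = [f'#{idx}: {animal}' for idx, animal in enumerate(animals, start=1)]
for line in lines:
    print(line)
```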

zip

zip is a useful function that pairs up the elements of lists, tuples or other sequences. It returns an iterator of tuples, which we can convert to a list:

seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two', 'three']

zipped = zip(seq1, seq2)

list(zipped)
[('foo', 'one'), ('bar', 'two'), ('baz', 'three')]
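
A few properties of zip are worth knowing in practice. The sketch below reuses seq1 and seq2 to show that a zip object can only be consumed once, that zip stops at the shortest input, and that its pairs can be passed directly to dict.

```python
seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two', 'three']

# A zip object is an iterator, so it is exhausted after one pass
zipped = zip(seq1, seq2)
first_pass = list(zipped)
second_pass = list(zipped)
print(first_pass)   # [('foo', 'one'), ('bar', 'two'), ('baz', 'three')]
print(second_pass)  # []

# zip stops at the end of the shortest input
shortest = list(zip(seq1, [1, 2]))
print(shortest)  # [('foo', 1), ('bar', 2)]

# Pairs of (key, value) can be passed straight to dict
mapping = dict(zip(seq1, seq2))
print(mapping)  # {'foo': 'one', 'bar': 'two', 'baz': 'three'}
```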

Dictionaries

The dictionary, or dict, is one of Python's most useful built-in data structures. It is a collection of key-value pairs, where keys and values are Python objects. The simplest way to create a dictionary is with curly braces {}:

empty_dict = {}
d = {'cat': 'cute', 'dog': 'furry'}; d
{'cat': 'cute', 'dog': 'furry'}

We can access, insert, or set elements using the same square-bracket syntax as for lists, with keys in place of integer indices:

d['cat']
'cute'
d['fish'] = 'wet'; d
{'cat': 'cute', 'dog': 'furry', 'fish': 'wet'}

You can check if a key exists as follows:

'cat' in d
True

Finally, you can delete entries using the del keyword:

del d['fish']; d
{'cat': 'cute', 'dog': 'furry'}
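
Looking up a key that does not exist raises a KeyError. When a missing key is expected, the get method is a common alternative, as the sketch below illustrates.

```python
d = {'cat': 'cute', 'dog': 'furry'}

# Indexing with a missing key raises KeyError
try:
    d['fish']
except KeyError as error:
    print('Missing key:', error)

# get returns None, or a default of our choosing, instead of raising
print(d.get('fish'))             # None
print(d.get('fish', 'unknown'))  # unknown
print(d.get('cat', 'unknown'))   # cute
```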

Loops

It is easy to iterate over the keys in a dictionary:

d = {'person': 2, 'cat': 4, 'spider': 8}
for animal in d:
    legs = d[animal]
    print(f'A {animal} has {legs} legs')
A person has 2 legs
A cat has 4 legs
A spider has 8 legs

If you want access to both the keys and their corresponding values, use the items method:

d = {'person': 2, 'cat': 4, 'spider': 8}
for animal, legs in d.items():
    print(f'A {animal} has {legs} legs')
A person has 2 legs
A cat has 4 legs
A spider has 8 legs

Dictionary comprehensions

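Dictionary comprehensions are similar to list comprehensions, but allow you to easily construct dictionaries:

nums = [0, 1, 2, 3, 4]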
even_num_to_square = {x: x ** 2 for x in nums if x % 2 == 0}
print(even_num_to_square)
{0: 0, 2: 4, 4: 16}
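
Dictionary comprehensions can also be built from an existing dictionary by iterating over its items. The sketch below uses a hypothetical dictionary called legs, with the same contents as the one in the loop examples, to swap keys and values and to filter entries.

```python
legs = {'person': 2, 'cat': 4, 'spider': 8}

# Swap keys and values (this assumes the values are unique)
animal_by_legs = {count: animal for animal, count in legs.items()}
print(animal_by_legs)  # {2: 'person', 4: 'cat', 8: 'spider'}

# Keep only the entries that satisfy a condition
many_legs = {animal: count for animal, count in legs.items() if count > 2}
print(many_legs)  # {'cat': 4, 'spider': 8}
```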

Sets

A set is an unordered collection of unique elements. Sets are similar to dicts, but with just keys (no values). The simplest way to create a set is as follows:

animals = {'cat', 'dog'}; animals
{'cat', 'dog'}

Sets allow us to perform the standard set operations such as union, intersection, difference and symmetric difference. For example:

felines = {'cat', 'tiger', 'lion'}
animals.union(felines)
{'cat', 'dog', 'lion', 'tiger'}
animals.intersection(felines)
{'cat'}
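
The remaining two operations, difference and symmetric difference, work in the same way. The sketch below uses the operator forms; sorted is applied before printing only because sets have no guaranteed order.

```python
animals = {'cat', 'dog'}
felines = {'cat', 'tiger', 'lion'}

# Difference: elements of animals that are not in felines
only_animals = animals - felines  # same as animals.difference(felines)
print(only_animals)  # {'dog'}

# Symmetric difference: elements in exactly one of the two sets
in_one_only = animals ^ felines  # same as animals.symmetric_difference(felines)
print(sorted(in_one_only))  # ['dog', 'lion', 'tiger']

# Converting a list to a set removes duplicates
unique = sorted(set([3, 1, 3, 2, 1]))
print(unique)  # [1, 2, 3]
```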

Tuples

A tuple is an immutable, ordered sequence of values. The simplest way to create one is with a comma-separated sequence of values:

tup = 4, 5, 6; tup
(4, 5, 6)
nested_tup = (4, 5, 6), (7, 8); nested_tup
((4, 5, 6), (7, 8))

Multiplying a tuple by an integer has the effect of concatenating together copies of the tuple:

('foo', 'bar') * 4
('foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar')
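
To make the word immutable concrete, the sketch below shows that tuple elements can be read but not reassigned, and that a tuple can be unpacked into separate variables.

```python
tup = (4, 5, 6)

# Elements can be read by index, just like a list
print(tup[0])  # 4

# Assigning to an element is not allowed
try:
    tup[0] = 10
except TypeError as error:
    print('TypeError:', error)

# Unpacking assigns each element to its own variable
a, b, c = tup
print(a, b, c)  # 4 5 6

# A one-element tuple needs a trailing comma
single = (4,)
print(type(single).__name__, len(single))  # tuple 1
```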

Strings

In Python, strings are list-like containers of characters. This section explains a handful of useful operations for processing strings.

string = 'This is a string!\n(But not a very interesting one)\n\n\tEnd.'
print(string)
This is a string!
(But not a very interesting one)

	End.

Strings are sequences of characters, so we can iterate through them just like lists:

for character in string:
    print(character)
T
h
i
s
 
i
s
 
a
 
s
t
r
i
n
g
!


(
B
u
t
 
n
o
t
 
a
 
v
e
r
y
 
i
n
t
e
r
e
s
t
i
n
g
 
o
n
e
)




	
E
n
d
.

We can check their length, just like lists:

len(string)
57

We can check whether they contain certain elements, just like lists:

'!' in string
True
'?' not in string
True

We can also check if a substring is present in a string:

'very' in string
True

Capitalisation: There are different ways to manipulate the casing of strings:

'test'.upper()
'TEST'
'TEST'.lower()
'test'
'test'.capitalize()
'Test'

Adding strings:

result = 'a'+'b'
print(result)
ab

Splitting strings:

Often we need to split sentences into words or file paths into components. For this task we can use the split() method. By default, a string is split on any run of whitespace (spaces, tabs \t or newlines \n).

string.split()
['This',
 'is',
 'a',
 'string!',
 '(But',
 'not',
 'a',
 'very',
 'interesting',
 'one)',
 'End.']
'path/to/file/image.jpg'.split('/')
['path', 'to', 'file', 'image.jpg']
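
Calling split() with no argument is not the same as calling split(' '). The sketch below uses a short hypothetical string to show the difference, along with the optional maxsplit argument.

```python
text = 'a  b\tc\n'

# No argument: split on runs of whitespace and ignore leading/trailing whitespace
print(text.split())  # ['a', 'b', 'c']

# Explicit separator: every single occurrence splits, so empty strings appear
print(text.split(' '))  # ['a', '', 'b\tc\n']

# maxsplit limits the number of splits, counted from the left
head_and_rest = 'path/to/file/image.jpg'.split('/', 1)
print(head_and_rest)  # ['path', 'to/file/image.jpg']
```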

Stripping strings:

Sometimes strings contain leading or trailing characters that we want to get rid of, such as whitespace or other unnecessary characters. We can remove them with the strip() method. Like split(), it targets whitespace by default, but we can instead pass a string containing the characters to remove; any combination of those characters is stripped from both ends:

'_path/to/file/image.jpg_'.strip('_')
'path/to/file/image.jpg'
'_-_path/to/file/image.jpg,_,'.strip(',_-')
'path/to/file/image.jpg'
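
One pitfall is that the argument is treated as a set of characters, not as a prefix or suffix. The sketch below uses a hypothetical file name to show what can go wrong, one way around it, and the one-sided variants lstrip and rstrip.

```python
name = 'report.txt'

# The argument is a set of characters, not a suffix: the trailing 't' of
# 'report' is stripped as well
print(name.rstrip('.txt'))  # repor

# To drop an exact suffix, check for it and slice it off
if name.endswith('.txt'):
    stem = name[:-len('.txt')]
print(stem)  # report

# lstrip and rstrip only touch one side
padded = '  padded  '
print(repr(padded.lstrip()))  # 'padded  '
print(repr(padded.rstrip()))  # '  padded'
```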

Replacing:

With the replace() function one can replace substrings in a string.

'one plus one equals two!'.replace('two','three')
'one plus one equals three!'

Joining strings

Sometimes we split strings into a list of words for processing (like stemming or stop word removal) and then want to join them back into a single string. To do this we can use the join() method:

' '.join(['this', 'is', 'a', 'list', 'of', 'words'])
'this is a list of words'
'-'.join(['this', 'is', 'a', 'list', 'of', 'words'])
'this-is-a-list-of-words'
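
Note that join only works on strings. If the list contains other types, such as numbers, each item has to be converted first, as in the sketch below.

```python
nums = [0, 1, 4, 9, 16]

# join raises TypeError if any item is not a string
try:
    ', '.join(nums)
except TypeError as error:
    print('TypeError:', error)

# Convert each item with str first
joined = ', '.join(str(n) for n in nums)
print(joined)  # 0, 1, 4, 9, 16
```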

Exercise 2: Write a function that performs the following on string_1:

  • split the string into words with spaces
  • then strip the special character / from each word
  • join the words back together with single spaces (' ')
  • make the whole string lower-case

string_1 = 'This is a string!\n/(But not a very interesting one)/\n\n\tEnd.'
print(string_1)
This is a string!
/(But not a very interesting one)/

	End.

Functions

Functions are a very important method of code reuse in Python. In general, if you find yourself repeating the same code more than once, it can be useful to move it into a reusable function. Doing so also makes your code more readable, which is very important when collaborating with others (or even your future self!).

Functions are declared with the def keyword and return values with the return keyword, e.g.

def area_of_a_circle(radius):
    area = 3.14159 * radius ** 2
    return area
area_of_a_circle(5)
78.53975
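
The value 3.14159 above is a rounded approximation of pi. As a small variation on the example, and not a replacement for it, the sketch below uses the constant math.pi from the standard library instead.

```python
import math


def area_of_a_circle(radius):
    # math.pi holds pi to the full precision of a Python float
    return math.pi * radius ** 2


area = area_of_a_circle(5)
print(round(area, 2))  # 78.54
```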

Note that we can have multiple return statements, e.g. based on the result of some conditional statements:

def sign(x):
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'

for x in [-1, 0, 1]:
    print(sign(x))
negative
zero
positive

In addition to positional arguments, functions can have keyword arguments that are typically used to specify default values:

def hello(name, loud=False):
    if loud:
        print('G\'day, %s!' % name.upper())
    else:
        print('G\'day, %s' % name)

hello('Bob') 
hello('Fred', loud=True)
G'day, Bob
G'day, FRED!
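
Arguments can also be passed by name, in which case their order no longer matters. The sketch below is a variation on hello that returns the greeting instead of printing it (an assumption made here so the result can be inspected) and uses f-strings in place of % formatting.

```python
def hello(name, loud=False):
    # Return the greeting rather than printing it, so callers can reuse it
    if loud:
        return f"G'day, {name.upper()}!"
    return f"G'day, {name}"


print(hello('Bob'))                   # positional argument, default loud=False
print(hello(name='Fred', loud=True))  # both arguments passed by keyword
print(hello(loud=True, name='Fred'))  # keyword order does not matter
```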

Finally, it is quite common to write functions that return multiple objects as a tuple. For example, we can write something like:

def is_number_positive(number):
    if number > 0:
        return True, number
    else:
        return False, number
is_number_positive(-1)
(False, -1)
is_number_positive(42)
(True, 42)

One cool aspect of such functions is that the returned objects can be handled in two different ways: kept together as a single tuple, or unpacked into separate variables:

tup = is_number_positive(3); tup
(True, 3)
is_positive, number = is_number_positive(-10)

print(is_positive)
print(number)
False
-10

Exercises

Ordered pairs

You are given three integers $x$, $y$ and $z$. You need to find the ordered pairs $(i, j)$ such that $i + j$ is equal to $z$, and print them in lexicographic order. Here $i$ and $j$ are constrained to lie in the intervals $0 \leq i \leq x$ and $0 \leq j \leq y$. Below is a solution if we do not use list comprehension - can you find a solution that does?

x = 5
y = 4
z = 3

# initialise list for the ordered pairs
arr = []

# try every combination of i and j within the allowed ranges
for i in range(x + 1):
    for j in range(y + 1):
        if i + j == z:
            arr.append([i, j])
            
print(arr)
[[0, 3], [1, 2], [2, 1], [3, 0]]

Squirrel cigar party!

When squirrels get together for a party, they like to have cigars. A squirrel party is successful when the number of cigars is between 40 and 60, inclusive, unless it is the weekend, in which case the squirrels go wild and there is no upper bound on the number of cigars. Complete the function below to return True if the party with the given values is successful, or False otherwise. Example output is shown below:

cigar_party(30, False) → False
cigar_party(50, False) → True
cigar_party(70, True)  → True