NumPy-Pandas-Basics#

1.1 Objective#

This module introduces numerical computing and optimization in Python using NumPy, Pandas, and Linear Programming. These three topics are widely used in data science, engineering, and operations research, where they support large-scale numerical analysis and decision-making. The goal is to equip learners to manipulate structured data efficiently, perform fast numerical computations, and model real-world optimization problems using Python.

  1. NumPy: Introduces array-based computing for efficient numerical operations.

  2. Pandas: Focuses on structured data manipulation and analysis.

  3. Linear Programming (LP): Explores mathematical optimization techniques using GurobiPy.

This module integrates theoretical concepts with hands-on programming using Python, Jupyter Notebook, NumPy, Pandas, and GurobiPy, allowing learners to transition from fundamental numerical analysis to real-world optimization applications.


1.2 Key Components#

1. NumPy Fundamentals#

  • Creating NumPy Arrays

    • Creating a 1D Array

    • Creating a 2D Array

    • Creating Arrays with specific Values

  • Array Attributes

    • Array Shape

    • Number of Dimensions

    • Number of Elements

    • Data Type of Elements

  • Indexing and Slicing

    • Accessing Elements

    • Slicing

  • Mathematical Operations

    • Element-wise operations

  • Reshaping and Transposing Arrays

    • Reshaping Arrays

    • Transposing Arrays

  • Statistical Functions

    • Basic statistics functions

  • Stacking and Concatenation

    • Horizontal Stacking

    • Vertical Stacking

    • Concatenation

2. Pandas for Data Analysis#

  • Creating Data Structures

    • Creating a Pandas Series

    • Creating a Pandas DataFrame

  • Reading and Writing Data

    • Reading from a CSV File

    • Writing to a CSV File

  • Data Inspection and Manipulation

    • Using head(), tail(), info(), describe()

    • Selecting columns and rows, and filtering

  • Data Cleaning

    • Using dropna() to handle missing data

    • Changing Data Types

  • Modifying Data

    • Adding New Columns

    • Renaming Columns

    • Sorting Data

  • Merging and Joining Data

    • Merging DataFrames

    • Concatenating DataFrames

  • Pivot Tables and Crosstabs

    • Creating a Pivot Table

    • Creating a Crosstab

3. Linear Programming basics#

  • Introduction to LP Optimization Models

    • Understanding the objective functions in LP

    • Understanding the constraints in LP

  • Using Gurobi for Linear Programming

    • Installing Gurobi

    • Defining decision variables

    • Setting the objective function

    • Adding constraints

  • Solving Optimization Models

    • Implementing an LP solver using GurobiPy

    • Extracting and interpreting optimal solutions


1.3 Module Impact#

This module provides a structured approach to mastering numerical computing, data handling, and optimization modeling, making it highly relevant for engineers.

  1. Efficient Data Processing: Participants develop proficiency in handling structured datasets with NumPy and Pandas, leveraging optimized numerical operations.

  2. Analytical Thinking: By working with structured data manipulation, learners enhance their ability to extract insights and analyze trends.

  3. Optimization Modeling: The integration of Linear Programming equips learners with tools to tackle decision-making problems in industry applications.

  4. Hands-on Coding: The practical implementations in Jupyter Notebook reinforce learning through interactive problem-solving.

2. NumPy basics#

NumPy (Numerical Python) is a fundamental library for numerical computing in Python. It is widely used in scientific computing, machine learning, and data analysis.

If you haven’t installed NumPy, you can do so using:

pip install numpy

Then, import it in your Python script:

import numpy as np

2.1 Creating NumPy Arrays#

Creating a 1D Array#

A NumPy array, called ndarray, can be created from a Python list using np.array().

# Creating a one-dimensional array
a = np.array([1, 2, 3, 4, 5])
print("One-dimensional array:\n", a)
One-dimensional array:
 [1 2 3 4 5]

Creating a 2D Array#

A two-dimensional array (matrix) can be created by passing a list of lists.

# Creating a two-dimensional array
b = np.array([[1, 2, 3], [4, 5, 6]])
print("Two-dimensional array:\n", b)
Two-dimensional array:
 [[1 2 3]
 [4 5 6]]

Creating Arrays with specific Values#

NumPy provides functions that create arrays with preset contents, such as zeros, ones, the identity matrix, or random values.

A = np.zeros((2, 3))  # 2x3 array of zeros
print("Zeros:\n", A)
Zeros:
 [[0. 0. 0.]
 [0. 0. 0.]]
B = np.ones((3, 3))  # 3x3 array of ones
print("Ones:\n", B)
Ones:
 [[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
C = np.eye(3)  # 3x3 identity matrix
print("Identity Matrix:\n", C)
Identity Matrix:
 [[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
D = np.random.rand(2, 2)  # 2x2 array of random numbers
print("Random Array:\n", D)
Random Array:
 [[5.03565237e-04 9.83687540e-01]
 [3.86743054e-01 7.33784820e-01]]
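
A few other creation functions come up often alongside the ones above. The sketch below uses `np.full`, `np.arange`, and `np.linspace`, plus a seeded generator from `np.random.default_rng` so that the random values are reproducible. The variable names and the seed value 42 are illustrative choices, not part of the original example.

```python
import numpy as np

# Array filled with a constant value
sevens = np.full((2, 3), 7)

# Evenly spaced integers: start inclusive, stop exclusive, step 2
evens = np.arange(0, 10, 2)

# Five evenly spaced floats from 0 to 1, endpoints included
grid = np.linspace(0.0, 1.0, 5)

# Seeded generator: the same seed gives the same numbers on every run
rng = np.random.default_rng(42)  # 42 is an arbitrary seed
sample = rng.random((2, 2))      # uniform values in [0, 1)

print(sevens)
print(evens)         # [0 2 4 6 8]
print(grid)
print(sample.shape)  # (2, 2)
```

Seeding matters mainly when results need to be repeatable, for example in tests or when comparing runs.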

2.2 Array Attributes#

NumPy arrays have some important attributes that can help you understand their structure.

print("Array Shape:", D.shape) 
Array Shape: (2, 2)
print("Number of Dimensions:", D.ndim)
Number of Dimensions: 2
print("Number of Elements:", D.size) 
Number of Elements: 4
print("Data Type of Elements:", D.dtype)
Data Type of Elements: float64
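
The data type can also be chosen explicitly, which affects how much memory the array uses. The following sketch assumes a small integer array (the names `ints` and `floats` are illustrative) and shows the `dtype` argument, the `itemsize` and `nbytes` attributes, and conversion with `astype`.

```python
import numpy as np

# Request a specific element type at creation time
ints = np.zeros((2, 3), dtype=np.int32)
print(ints.dtype)     # int32
print(ints.itemsize)  # 4 bytes per element
print(ints.nbytes)    # 24 = 6 elements * 4 bytes

# astype returns a converted copy; the original array is unchanged
floats = ints.astype(np.float64)
print(floats.dtype)   # float64
print(ints.dtype)     # int32
```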

2.3 Indexing and Slicing#

NumPy allows you to access elements and slices of arrays easily.

### Accessing Elements
E = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Element at (2,2):", E[2, 2])  # Access the third row, third column (indices start at 0)
Element at (2,2): 9
### Slicing
print("Second column:", E[:, 1])  # Select second column
Second column: [2 5 8]
print("Sub-matrix:\n", E[0:2, 2:3])  # Rows 0-1, column 2; slicing keeps the result 2D
Sub-matrix:
 [[3]
 [6]]
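
Two further points about indexing are worth a quick illustration: selecting by condition, and the fact that a basic slice is a view on the original data rather than a copy. This is a minimal sketch reusing the array `E` from above; the names `first_row` and `safe` are introduced here for illustration.

```python
import numpy as np

E = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Boolean indexing: keep the elements that satisfy a condition
print(E[E > 5])        # [6 7 8 9]

# Negative indices count from the end
print(E[-1, -1])       # 9

# A basic slice is a view: writing to it changes the original array
first_row = E[0, :]
first_row[0] = 100
print(E[0, 0])         # 100

# Use .copy() when an independent array is needed
safe = E[1, :].copy()
safe[0] = -1
print(E[1, 0])         # 4
```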

2.4 Mathematical Operations#

NumPy supports element-wise operations.

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print("Addition:", a + b)
Addition: [5 7 9]
print("Multiplication:", a * b)
Multiplication: [ 4 10 18]
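
Element-wise arithmetic also works between arrays of different but compatible shapes (broadcasting), and it is distinct from the matrix product. The sketch below assumes a small 2D array `M`, introduced here for illustration, and contrasts `*` with the `@` operator.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Broadcasting: the scalar is applied to every element
print(a * 2)          # [2 4 6]

# A 1D array is broadcast across each row of a 2D array
M = np.array([[1, 2, 3], [4, 5, 6]])
print(M + a)          # rows become [2 4 6] and [5 7 9]

# Element-wise product versus dot product
print(a * b)          # [ 4 10 18]
print(a @ b)          # 32

# Matrix product of a (2, 3) and a (3, 2) array gives a (2, 2) array
print(M @ M.T)
```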

2.5 Reshaping and Transposing Arrays#

NumPy can change the shape of an array without changing its data, provided the total number of elements stays the same.

### Reshaping Arrays
F = np.arange(1, 10)
G = np.arange(1, 10).reshape(3, 3)
print("Original Array:\n", F)
print("Reshaped Array:\n", G)
Original Array:
 [1 2 3 4 5 6 7 8 9]
Reshaped Array:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
### Transposing Arrays
print("Transposed Array:\n", G.T)
Transposed Array:
 [[1 4 7]
 [2 5 8]
 [3 6 9]]
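
Some details of `reshape` are easy to miss: one dimension can be inferred with `-1`, and an incompatible shape raises an error. The following sketch uses a 12-element array (the names `F` and `H` are reused or introduced for illustration only).

```python
import numpy as np

F = np.arange(1, 13)          # 12 elements

# -1 asks NumPy to infer that dimension from the total size
H = F.reshape(3, -1)
print(H.shape)                # (3, 4)

# For a contiguous array such as F, reshape returns a view on the same data
print(np.shares_memory(F, H))  # True

# The element count must match, otherwise reshape raises ValueError
try:
    F.reshape(5, 3)
except ValueError as err:
    print("Cannot reshape:", err)

# The transpose swaps the axes
print(H.T.shape)              # (4, 3)

# flatten() returns a 1D copy of the data
print(H.flatten().shape)      # (12,)
```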

2.6 Statistical Functions#

NumPy provides functions for basic statistics.

print("Maximum Value:", G.max())
Maximum Value: 9
print("Minimum Value:", G.min())
Minimum Value: 1
print("Mean Value:", G.mean())
Mean Value: 5.0
print("Sum of Elements:", G.sum())
Sum of Elements: 45
print("Column-wise Sum:", G.sum(axis=0))
Column-wise Sum: [12 15 18]
print("Row-wise Sum:", G.sum(axis=1))
Row-wise Sum: [ 6 15 24]
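
The `axis` argument works the same way for other reductions. As a small sketch, assuming the same 3x3 array `G`, the code below shows per-column means, the standard deviation, `argmax`, and `keepdims`, which is handy when the result should be combined with the original array. The name `row_totals` is illustrative.

```python
import numpy as np

G = np.arange(1, 10).reshape(3, 3)

# Column means: axis=0 reduces over the rows
print(G.mean(axis=0))     # [4. 5. 6.]

# Population standard deviation and median of all elements
print(G.std())
print(np.median(G))       # 5.0

# Position of the largest element in each row
print(G.argmax(axis=1))   # [2 2 2]

# keepdims=True keeps the reduced axis with length 1,
# so the result broadcasts against G
row_totals = G.sum(axis=1, keepdims=True)
print(row_totals.shape)   # (3, 1)
print(G / row_totals)     # each row now sums to 1
```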

2.7 Stacking and Concatenation#

NumPy allows combining multiple arrays in different ways.

### Horizontal Stacking
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
hstacked = np.hstack((a, b))
print("Horizontally Stacked:\n", hstacked)
Horizontally Stacked:
 [[1 2 5 6]
 [3 4 7 8]]
### Vertical Stacking
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
vstacked = np.vstack((a, b))
print("Vertically Stacked:\n", vstacked)
Vertically Stacked:
 [[1 2]
 [3 4]
 [5 6]
 [7 8]]
### Concatenation
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
concatenated = np.concatenate((a, b), axis=0)  # Concatenate along rows
print("Concatenated:\n", concatenated)
Concatenated:
 [[1 2]
 [3 4]
 [5 6]
 [7 8]]
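
For 2D arrays, `np.hstack` and `np.vstack` behave like `np.concatenate` along `axis=1` and `axis=0` respectively. The sketch below checks that equivalence and also shows `np.stack`, which creates a new axis, and the error raised on mismatched shapes. The array `c` is a made-up example.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# axis=1 joins along columns, matching np.hstack for 2D arrays
side_by_side = np.concatenate((a, b), axis=1)
print(np.array_equal(side_by_side, np.hstack((a, b))))   # True

# axis=0 joins along rows, matching np.vstack for 2D arrays
stacked = np.concatenate((a, b), axis=0)
print(np.array_equal(stacked, np.vstack((a, b))))        # True

# np.stack adds a new axis instead of extending an existing one
cube = np.stack((a, b))
print(cube.shape)                                        # (2, 2, 2)

# Shapes must agree on every axis except the one being joined
c = np.array([[9, 10, 11]])
try:
    np.vstack((a, c))
except ValueError as err:
    print("Shape mismatch:", err)
```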

Conclusion#

Mastering NumPy is essential for data science, machine learning, and numerical computing, providing an efficient way to handle large datasets.

3. Pandas basics#

Pandas is a powerful and flexible data analysis and manipulation library for Python. It is widely used in data science and any scenario where structured data processing is needed.

Installing Pandas#

If you haven’t installed Pandas yet, you can do so using:

pip install pandas

Then, import it in your Python script:

import pandas as pd

3.1 Creating Data Structures#

Creating a Pandas Series#

The two primary data structures in Pandas are:

  • Series: A one-dimensional labeled array.

  • DataFrame: A two-dimensional table-like structure with labeled rows and columns.

A Pandas Series is similar to a column in an Excel spreadsheet. It consists of an array of data with an associated index.

# Creating a pandas Series
data = [10, 20, 30, 40]
series = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(series)
a    10
b    20
c    30
d    40
dtype: int64
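
The index is what distinguishes a Series from a plain array. As a brief sketch, the code below accesses values by label and by position, and shows that arithmetic between two Series aligns on index labels. The second Series, `other`, is a made-up example.

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Access by label and by position
print(series['b'])        # 20
print(series.iloc[0])     # 10

# Arithmetic is vectorized over the values
print((series * 2).tolist())   # [20, 40, 60, 80]

# Operations between Series align on index labels, not on position;
# labels missing from either side produce NaN
other = pd.Series([1, 2], index=['a', 'd'])
total = series + other
print(total)
```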

Creating a Pandas DataFrame#

A DataFrame is a two-dimensional table with labeled rows and columns.

# Creating a pandas DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print(df)
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
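
A DataFrame can also be built row by row, and its basic structure can be inspected through attributes. This sketch assumes the same three people as above, supplied as a list of dicts; the names `records`, `people`, and `by_name` are illustrative.

```python
import pandas as pd

# Each dict is one row; the keys become column names
records = [
    {'Name': 'Alice', 'Age': 25, 'Salary': 50000},
    {'Name': 'Bob', 'Age': 30, 'Salary': 60000},
    {'Name': 'Charlie', 'Age': 35, 'Salary': 70000},
]
people = pd.DataFrame(records)

print(people.shape)             # (3, 3)
print(list(people.columns))     # ['Name', 'Age', 'Salary']
print(people.dtypes)

# Use an existing column as the row index
by_name = people.set_index('Name')
print(by_name.loc['Bob', 'Salary'])   # 60000
```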

3.2 Reading and Writing Data#

Pandas can read from and write to various file formats such as CSV, Excel, etc.

### Reading from a CSV File
import os
current_dir = os.getcwd()
csv_path = os.path.join(current_dir, "numpy-pandas-basics-data", "data.csv")

df = pd.read_csv(csv_path)
print(df)  
  Name   Age    Salary
0    A  25.0   50000.0
1    B  30.0   60000.0
2    C  25.0   70000.0
3    D  40.0   80000.0
4    E  30.0   55000.0
5    F  45.0   65000.0
6    G  45.0   90000.0
7    H  50.0  100000.0
8    I   NaN       NaN
### Writing to a CSV File
df = df.drop(columns=['Salary'])  # Drop the Salary column
output_path = os.path.join(current_dir, "numpy-pandas-basics-data", "output.csv")
df.to_csv(output_path, index=False) 
df = pd.read_csv(output_path)
print(df)  
  Name   Age
0    A  25.0
1    B  30.0
2    C  25.0
3    D  40.0
4    E  30.0
5    F  45.0
6    G  45.0
7    H  50.0
8    I   NaN
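
The example above depends on a `data.csv` file being present on disk. To try the same round trip without any file, the sketch below reads CSV text from an in-memory buffer with `io.StringIO`. The rows in `csv_text` are made up for illustration and are not the contents of the original file.

```python
import io
import pandas as pd

# In-memory CSV text (made-up rows) standing in for a file on disk
csv_text = "Name,Age,Salary\nA,25,50000\nB,30,60000\nC,,\n"

df_in = pd.read_csv(io.StringIO(csv_text))
print(df_in)

# Empty fields are read as NaN, which makes the numeric columns float64
print(df_in.dtypes)

# to_csv with no path returns the CSV text;
# index=False omits the row labels
csv_out = df_in.drop(columns=['Salary']).to_csv(index=False)
print(csv_out)
```

This also explains why `Age` is displayed as `25.0` rather than `25` in the output above: a column containing a missing value is stored as floating point by default.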

3.3 Data Inspection and Manipulation#

Use head() and tail() to show the first and last 5 rows, info() for a summary of the DataFrame's columns and types, and describe() for summary statistics of the numeric columns.

print(df.head())   # First 5 rows
  Name   Age
0    A  25.0
1    B  30.0
2    C  25.0
3    D  40.0
4    E  30.0
print(df.tail())   # Last 5 rows
  Name   Age
4    E  30.0
5    F  45.0
6    G  45.0
7    H  50.0
8    I   NaN
df.info()   # Summary of DataFrame (info() prints directly and returns None)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    9 non-null      object 
 1   Age     8 non-null      float64
dtypes: float64(1), object(1)
memory usage: 272.0+ bytes
print(df.describe())  # Statistical summary
             Age
count   8.000000
mean   36.250000
std     9.910312
min    25.000000
25%    28.750000
50%    35.000000
75%    45.000000
max    50.000000

You can select specific data:

print(df['Name'])  # Select a column
0    A
1    B
2    C
3    D
4    E
5    F
6    G
7    H
8    I
Name: Name, dtype: object
print(df[['Name', 'Age']])  # Select multiple columns
  Name   Age
0    A  25.0
1    B  30.0
2    C  25.0
3    D  40.0
4    E  30.0
5    F  45.0
6    G  45.0
7    H  50.0
8    I   NaN
print(df.loc[0])  # Select a row by index label
Name       A
Age     25.0
Name: 0, dtype: object
print(df.iloc[1])  # Select a row by numerical position
Name       B
Age     30.0
Name: 1, dtype: object

You can also filter data easily:

filtered_df = df[df['Age'] > 30]  # Select rows where Age > 30
print(filtered_df)
  Name   Age
3    D  40.0
5    F  45.0
6    G  45.0
7    H  50.0
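
Filters can be combined, and a row filter can be paired with a column selection. The following sketch uses a small made-up DataFrame, `df_demo`, with the same column names as the example; the values are illustrative.

```python
import pandas as pd

# Made-up rows with the same columns as the example above
df_demo = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Age': [25.0, 30.0, 40.0, 45.0],
})

# Combine conditions with & (and) or | (or);
# each condition needs its own parentheses
between = df_demo[(df_demo['Age'] >= 30) & (df_demo['Age'] < 45)]
print(between)

# isin tests membership in a list of values
picked = df_demo[df_demo['Name'].isin(['A', 'D'])]
print(picked)

# loc takes a row mask and a column selection together
names_over_30 = df_demo.loc[df_demo['Age'] > 30, 'Name']
print(names_over_30.tolist())   # ['C', 'D']
```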

3.4 Data Cleaning#

Pandas provides functions for common data-cleaning tasks, such as handling missing values and changing data types.

### Handling Missing Values
df.dropna(inplace=True)  # Remove rows with NaN values
print(df)
  Name   Age
0    A  25.0
1    B  30.0
2    C  25.0
3    D  40.0
4    E  30.0
5    F  45.0
6    G  45.0
7    H  50.0
### Changing Data Types
df['Age'] = df['Age'].astype(int)  # Convert Age column to integer
print(df)
  Name  Age
0    A   25
1    B   30
2    C   25
3    D   40
4    E   30
5    F   45
6    G   45
7    H   50
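
Dropping rows is one option among several, and the order of the two steps above matters: converting to `int` only works once the missing values are gone. The sketch below, on a made-up three-row DataFrame, counts missing values, fills them instead of dropping, and shows the nullable `'Int64'` dtype as an alternative. Filling with the column mean is just one possible choice.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Age': [25.0, np.nan, 40.0]})

# Count missing values per column before deciding how to handle them
print(df_demo.isna().sum())

# Alternative to dropping rows: fill gaps, here with the column mean
filled = df_demo.fillna({'Age': df_demo['Age'].mean()})
print(filled)

# astype(int) raises while NaN is still present...
try:
    df_demo['Age'].astype(int)
except (ValueError, TypeError) as err:
    print("Conversion failed:", err)

# ...whereas the nullable 'Int64' dtype keeps the gap as <NA>
nullable = df_demo['Age'].astype('Int64')
print(nullable)
```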

3.5 Modifying Data#

With Pandas, you can add new columns, rename columns, and sort data.

### Adding a New Column
import os
current_dir = os.getcwd()
csv_path = os.path.join(current_dir, "numpy-pandas-basics-data", "data.csv")
df = pd.read_csv(csv_path)

df['Bonus'] = df['Salary'] * 0.10  # Add a new column
print(df)
  Name   Age    Salary    Bonus
0    A  25.0   50000.0   5000.0
1    B  30.0   60000.0   6000.0
2    C  25.0   70000.0   7000.0
3    D  40.0   80000.0   8000.0
4    E  30.0   55000.0   5500.0
5    F  45.0   65000.0   6500.0
6    G  45.0   90000.0   9000.0
7    H  50.0  100000.0  10000.0
8    I   NaN       NaN      NaN
### Renaming Columns
df.rename(columns={'Name': 'Employee Name'}, inplace=True)
print(df)
  Employee Name   Age    Salary    Bonus
0             A  25.0   50000.0   5000.0
1             B  30.0   60000.0   6000.0
2             C  25.0   70000.0   7000.0
3             D  40.0   80000.0   8000.0
4             E  30.0   55000.0   5500.0
5             F  45.0   65000.0   6500.0
6             G  45.0   90000.0   9000.0
7             H  50.0  100000.0  10000.0
8             I   NaN       NaN      NaN
### Sorting Data
df.sort_values(by='Age', ascending=False, inplace=True)
print(df)
  Employee Name   Age    Salary    Bonus
7             H  50.0  100000.0  10000.0
5             F  45.0   65000.0   6500.0
6             G  45.0   90000.0   9000.0
3             D  40.0   80000.0   8000.0
1             B  30.0   60000.0   6000.0
4             E  30.0   55000.0   5500.0
0             A  25.0   50000.0   5000.0
2             C  25.0   70000.0   7000.0
8             I   NaN       NaN      NaN
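
The same three modifications can be written without `inplace=True` by chaining methods that each return a new DataFrame. This is a sketch of that alternative style on made-up data (`df_demo`), not a replacement for the approach above; it also sorts on two keys and resets the row labels afterwards.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Age': [30, 25, 30],
    'Salary': [55000, 50000, 60000],
})

result = (
    df_demo
    .assign(Bonus=lambda d: d['Salary'] * 0.10)   # add a column
    .rename(columns={'Name': 'Employee Name'})    # rename a column
    .sort_values(by=['Age', 'Salary'],
                 ascending=[False, True])         # sort on two keys
    .reset_index(drop=True)                       # renumber rows from 0
)
print(result)

# The original DataFrame is left unchanged
print(list(df_demo.columns))   # ['Name', 'Age', 'Salary']
```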

3.6 Merging and Joining Data#

Pandas allows you to combine data from multiple DataFrames.

### Merging DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Salary': [50000, 60000, 70000]})
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)
   ID     Name  Salary
0   1    Alice   50000
1   2      Bob   60000
2   3  Charlie   70000
### Concatenating DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
concatenated_df = pd.concat([df1, df2])
print(concatenated_df)
   A  B
0  1  3
1  2  4
0  5  7
1  6  8
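
In the merge example every `ID` appears in both DataFrames, so the join type makes no difference. The sketch below uses made-up frames whose keys only partly overlap to show how the `how` argument changes the result, and how `ignore_index=True` avoids the repeated row labels seen in the concatenation output.

```python
import pandas as pd

left = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
right = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [60000, 70000, 80000]})

# Default inner join: only IDs present in both frames
inner = pd.merge(left, right, on='ID')
print(inner)

# Left join: every row of `left`; a missing Salary becomes NaN
left_join = pd.merge(left, right, on='ID', how='left')
print(left_join)

# Outer join: the union of IDs from both frames
outer = pd.merge(left, right, on='ID', how='outer')
print(outer)

# ignore_index=True renumbers the rows instead of repeating 0, 1
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```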
3.7 Pivot Tables and Crosstabs#

Pandas can summarize data with pivot tables, which aggregate values by group, and crosstabs, which count how often combinations of values occur.

### Creating a Pivot Table
pivot = df.pivot_table(values='Salary', index='Age', aggfunc='mean')
print(pivot)
        Salary
Age           
25.0   60000.0
30.0   57500.0
40.0   80000.0
45.0   77500.0
50.0  100000.0
### Creating a Crosstab
crosstab = pd.crosstab(df['Age'], df['Salary'])
print(crosstab)
Salary  50000.0   55000.0   60000.0   65000.0   70000.0   80000.0   90000.0   \
Age                                                                            
25.0           1         0         0         0         1         0         0   
30.0           0         1         1         0         0         0         0   
40.0           0         0         0         0         0         1         0   
45.0           0         0         0         1         0         0         1   
50.0           0         0         0         0         0         0         0   

Salary  100000.0  
Age               
25.0           0  
30.0           0  
40.0           0  
45.0           0  
50.0           1  
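
Crosstabs tend to be easier to read when both variables are categorical, because numeric columns such as Salary produce one column per distinct value. The sketch below uses a made-up DataFrame, `staff`, with hypothetical `Dept` and `Level` columns to show a two-way pivot table, the equivalent `groupby` computation, and a compact crosstab.

```python
import pandas as pd

# Made-up data with two categorical columns
staff = pd.DataFrame({
    'Dept': ['Eng', 'Eng', 'Ops', 'Ops', 'Ops'],
    'Level': ['Junior', 'Senior', 'Junior', 'Junior', 'Senior'],
    'Salary': [50000, 80000, 45000, 47000, 70000],
})

# Mean salary for each Dept/Level combination
pivot = staff.pivot_table(values='Salary', index='Dept',
                          columns='Level', aggfunc='mean')
print(pivot)

# The same numbers via groupby, reshaped with unstack
grouped = staff.groupby(['Dept', 'Level'])['Salary'].mean().unstack()
print(grouped)

# crosstab counts how many rows fall in each Dept/Level cell
counts = pd.crosstab(staff['Dept'], staff['Level'])
print(counts)
```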

Conclusion#

This section introduced the fundamental concepts of Pandas: data structures, reading and writing files, inspecting, cleaning, and modifying data, merging, and summarizing with pivot tables and crosstabs. Pandas is an essential tool for anyone working with structured data in Python, making data analysis more efficient and intuitive.

References#

[1] Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020).

[2] NumPy documentation: https://numpy.org/

[3] McKinney, W. Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference, 56–61 (2010).

[4] Pandas documentation: https://pandas.pydata.org