
NumPy is the fundamental package for scientific computing with Python.

At the core of the NumPy package is the ndarray object, or n-dimensional array. In programming, an array is a collection of elements, similar to a list. The term n-dimensional refers to the fact that ndarrays can have one or more dimensions.

Here I convert a list directly to an ndarray using the numpy.array() constructor. To create a 1D ndarray, I pass in a single list:

import numpy as np  # using the alias np
array_data = np.array([1, 2, 3, 4])
print(array_data)
print(type(array_data))
[1 2 3 4]
<class 'numpy.ndarray'>

It's often useful to know the number of rows and columns in an ndarray. The ndarray.shape attribute returns them as a tuple.

array_2D = np.array([[5, 10, 15],
                     [20, 25, 30]])
print(array_2D.shape)
(2, 3)
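
To connect the idea of dimensions with the shape tuple, here is a minimal sketch comparing a 1D and a 2D array. The variable names one_d and two_d are my own illustrative choices; ndim and shape are standard ndarray attributes.

```python
import numpy as np

one_d = np.array([1, 2, 3, 4])
two_d = np.array([[5, 10, 15],
                  [20, 25, 30]])

# ndim is the number of dimensions; shape has one entry per dimension
print(one_d.ndim, one_d.shape)  # 1 (4,)
print(two_d.ndim, two_d.shape)  # 2 (2, 3)

# For a 2D array, the shape tuple unpacks into rows and columns
n_rows, n_cols = two_d.shape
```

Note that a 1D array has a one-element shape tuple, so it has a length but no separate row and column counts.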

Vectorization in NumPy

The NumPy library takes advantage of a processor feature called Single Instruction Multiple Data (SIMD) to process data faster. SIMD allows a processor to perform the same operation on multiple data points with a single instruction.

The concept of replacing for loops with operations applied to multiple data points at once is called vectorization, and ndarrays make vectorization possible.

numbers = [[6, 5], [1, 3], [5, 6], [1, 4], [3, 7], [5, 8], [3, 5], [8, 4]]
print(*numbers, sep='\n')
[6, 5]
[1, 3]
[5, 6]
[1, 4]
[3, 7]
[5, 8]
[3, 5]
[8, 4]
sums = []
for row in numbers:
  row_sum = row[0] + row[1]
  sums.append(row_sum)
print(sums)
[11, 4, 11, 5, 10, 13, 8, 12]
# Convert the list of lists to an ndarray
np_numbers = np.array(numbers)
# Add the two columns element by element, with no explicit loop
sums = np_numbers[:, 0] + np_numbers[:, 1]
print(sums)
[11  4 11  5 10 13  8 12]

When I selected each column, I used the syntax ndarray[:,c], where c is the index of the column I wanted to select. The colon selects all rows.
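
The same row, column pattern covers other selections too. The sketch below uses a shortened copy of the numbers data (only the first four rows, to keep the output small) and shows a column, a row, a single value, and a slice of rows.

```python
import numpy as np

# First four rows of the numbers data used above
np_numbers = np.array([[6, 5], [1, 3], [5, 6], [1, 4]])

first_column = np_numbers[:, 0]   # all rows, column 0
second_row = np_numbers[1, :]     # row 1, all columns
single_value = np_numbers[2, 1]   # row 2, column 1
first_two_rows = np_numbers[:2]   # rows 0 and 1, all columns

print(first_column)   # [6 1 5 1]
print(second_row)     # [1 3]
print(single_value)   # 6
print(first_two_rows.shape)  # (2, 2)
```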



# Select specific rows
rows = [0, 2, 4]
print(np_numbers[rows, :])
[[6 5]
 [5 6]
 [3 7]]

Explore the numerical dataset

To explore two-dimensional (2D) ndarrays, I'll analyze New York City taxi trip data released by the city of New York.

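from csv import reader
import numpy as np

# Load the data into the notebook
with open('nyc_taxis.csv', 'r') as taxis_file:
  taxis = list(reader(taxis_file))

np_taxis = np.array(taxis)
# Show the header row and the first two trips, then the array's shape
print(np_taxis[:3])
print(np_taxis.shape)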
[['pickup_year' 'pickup_month' 'pickup_day' 'pickup_dayofweek'
  'pickup_time' 'pickup_location_code' 'dropoff_location_code'
  'trip_distance' 'trip_length' 'fare_amount' 'fees_amount'
  'tolls_amount' 'tip_amount' 'total_amount' 'payment_type']
 ['2016' '1' '1' '5' '0' '2' '4' '21.00' '2037' '52.00' '0.80' '5.54'
  '11.65' '69.99' '1']
 ['2016' '1' '1' '5' '0' '2' '1' '16.29' '1520' '45.00' '1.30' '0.00'
  '8.00' '54.30' '1']]
(2014, 15)
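
The quotes in the output above show that every value, including the numbers, was read as a string. A minimal sketch of one way to get numeric values is to drop the header row and convert with astype(float). The three-column taxis list below is a hypothetical stand-in for the real file, not the actual data layout.

```python
import numpy as np

# Hypothetical stand-in for list(reader(taxis_file)): a header row plus two data rows
taxis = [
    ['trip_distance', 'trip_length', 'fare_amount'],
    ['21.00', '2037', '52.00'],
    ['16.29', '1520', '45.00'],
]

as_strings = np.array(taxis)
print(as_strings.dtype.kind)   # U (Unicode strings)

# Drop the header row, then convert the remaining values to floats
as_floats = np.array(taxis[1:]).astype(float)
print(as_floats.dtype)         # float64
print(as_floats.shape)         # (2, 3)
```

The header has to be removed first because text such as 'trip_distance' cannot be converted to a float.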

I'll only work with a subset of the real data: approximately 90,000 yellow taxi trips to and from New York City airports between January and June 2016. The file used here contains a 1/50th random sample of those trips. Below is information about selected columns from the dataset:

  • pickup_year: the year of the trip
  • pickup_month: the month of the trip (January is 1, December is 12)
  • pickup_day: the day of the month of the trip
  • pickup_location_code: the airport or borough where the trip started
  • dropoff_location_code: the airport or borough where the trip ended
  • trip_distance: the distance of the trip in miles
  • trip_length: the length of the trip in seconds
  • fare_amount: the base fare of the trip, in dollars
  • total_amount: the total amount charged to the passenger, including all fees, tolls and tips

Detailed information on all columns can be found here.

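# Access the columns by name: names=True reads the header row as field names
np_taxis = np.genfromtxt('nyc_taxis.csv', delimiter=',', names=True)
# A 1D array has shape (2,); a 2D single-column array has shape (2, 1)
z = np.array([1, 2])
y = np.array([[1], [2]])
# Reload as a plain 2D float array, skipping the header row
np_taxis = np.genfromtxt('nyc_taxis.csv', delimiter=',', skip_header=1)
# Indices of pickup_location_code, dropoff_location_code, fare_amount, fees_amount
cols = [5, 6, 9, 10]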
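
The two genfromtxt calls return different kinds of arrays, which is easy to miss. The sketch below illustrates the difference on a small in-memory CSV; the csv_text sample and the use of io.StringIO in place of the real file are my own assumptions for illustration.

```python
import io
import numpy as np

# Hypothetical three-column sample; the real file has 15 columns
csv_text = """trip_distance,trip_length,fare_amount
21.00,2037,52.00
16.29,1520,45.00
"""

# names=True: a 1D structured array whose fields are accessed by column name
named = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True)
print(named.dtype.names)      # ('trip_distance', 'trip_length', 'fare_amount')
print(named['trip_length'])   # [2037. 1520.]

# skip_header=1: a plain 2D float array indexed by row and column number
plain = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)
print(plain.shape)            # (2, 3)
print(plain[:, 1])            # [2037. 1520.]
```

Positional indexing such as np_taxis[:,7] only works on the plain 2D form, which is why the file is loaded a second time with skip_header=1.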

Get the Frequency Table of taxis based on the Pickup Location

# Column 5 is pickup_location_code (column 6 is dropoff_location_code)
unique, counts = np.unique(np_taxis[:, 5], return_counts=True)
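
np.unique with return_counts=True returns two parallel arrays: the sorted distinct values and how often each one occurs. A minimal sketch of turning them into a readable frequency table follows; the location_codes values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical location codes; in the article this would be a column of np_taxis
location_codes = np.array([2, 4, 2, 3, 2, 4])

unique, counts = np.unique(location_codes, return_counts=True)

# Pair each distinct value with its count
frequency_table = {int(code): int(count) for code, count in zip(unique, counts)}
print(frequency_table)  # {2: 3, 3: 1, 4: 2}
```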
# Calculate the mph of each trip
trip_distance_miles = np_taxis[:,7]
trip_length_seconds = np_taxis[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour

trip_mph = trip_distance_miles / trip_length_hours
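
Dividing by trip_length_hours assumes every trip has a nonzero duration. I have not checked whether the real dataset contains zero-length trips, so the sketch below is only a precaution: it uses made-up values, including one deliberately zero duration, and computes the speed only where the division is defined.

```python
import numpy as np

# Made-up values; the last trip has a zero duration on purpose
trip_distance_miles = np.array([21.0, 16.29, 12.7])
trip_length_seconds = np.array([2037.0, 1520.0, 0.0])

trip_length_hours = trip_length_seconds / 3600  # 3600 seconds is one hour

# Start with NaN everywhere, then fill in only the rows with a positive duration
valid = trip_length_hours > 0
trip_mph = np.full(trip_distance_miles.shape, np.nan)
trip_mph[valid] = trip_distance_miles[valid] / trip_length_hours[valid]

# np.nanmean ignores the NaN placeholder when averaging
print(np.nanmean(trip_mph))
```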
# total_amount is the second-to-last column
total_amount = np_taxis[:, -2]
# measures of central tendency
mean = np.mean(total_amount)
median = np.median(total_amount)
# measures of dispersion (names chosen to avoid shadowing the built-ins min, max and range)
min_amount = np.amin(total_amount)
max_amount = np.amax(total_amount)
amount_range = np.ptp(total_amount)
variance = np.var(total_amount)
sd = np.std(total_amount)
print("Descriptive analysis")
print("Measures of Central Tendency")
print("Mean =", mean)
print("Median =", median)
print("Measures of Dispersion")
print("Minimum =", min_amount)
print("Maximum =", max_amount)
print("Range =", amount_range)
print("Variance =", variance)
print("Standard Deviation =", sd)
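
To check what these functions compute, here is a small worked example on four made-up amounts where the results are easy to verify by hand. One detail worth knowing is that np.var and np.std divide by n by default (population statistics); passing ddof=1 gives the sample versions.

```python
import numpy as np

# Made-up amounts chosen so the results can be checked by hand
amounts = np.array([10.0, 20.0, 30.0, 40.0])

print(np.mean(amounts))    # 25.0
print(np.median(amounts))  # 25.0
print(np.ptp(amounts))     # 30.0, the same as max minus min

# Squared deviations are 225, 25, 25, 225, which sum to 500
population_variance = np.var(amounts)       # 500 / 4
sample_variance = np.var(amounts, ddof=1)   # 500 / 3
print(population_variance)  # 125.0

# The standard deviation is the square root of the variance
population_sd = np.std(amounts)
```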