NumPy

NumPy is the fundamental package for scientific computing with Python.

At the core of the NumPy package is the ndarray object, or n-dimensional array. In programming, an array is a collection of elements, similar to a list. N-dimensional refers to the fact that ndarrays can have one or more dimensions.

Here I will convert a list directly to an ndarray using the numpy.array() constructor. To create a 1D ndarray, I can pass in a single list:

import numpy as np  # using the conventional alias np
array_data = np.array([1,2,3,4])
print(array_data)
print(type(array_data))
[1 2 3 4]
<class 'numpy.ndarray'>

It's often useful to know the number of rows and columns in an ndarray, which I can get from the ndarray.shape attribute.

array_2D = np.array([[5, 10, 15],
                    [20, 25, 30]])
print(array_2D.shape)
(2, 3)

Vectorization in NumPy

The NumPy library takes advantage of a processor feature called Single Instruction Multiple Data (SIMD) to process data faster. SIMD allows a processor to perform the same operation on multiple data points in a single processor cycle.

The concept of replacing for loops with operations applied to multiple data points at once is called vectorization, and ndarrays make vectorization possible.

numbers = [[6, 5], [1, 3], [5, 6], [1, 4], [3, 7], [5, 8], [3, 5], [8, 4]]
print(*numbers, sep='\n')
[6, 5]
[1, 3]
[5, 6]
[1, 4]
[3, 7]
[5, 8]
[3, 5]
[8, 4]
sums = []
for row in numbers:
  row_sum = row[0] + row[1]
  sums.append(row_sum)

print(sums)
[11, 4, 11, 5, 10, 13, 8, 12]
# Convert the list of lists to an ndarray
np_numbers = np.array(numbers)
sums = np_numbers[:,0] + np_numbers[:,1]
print(sums)
[11  4 11  5 10 13  8 12]
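To see vectorization pay off, the same comparison can be repeated on a larger array and timed. A minimal sketch using the standard-library timeit module (the 100,000-row input is made up for illustration):

```python
import timeit
import numpy as np

# A larger, made-up dataset so the timing difference is visible
rows = [[i, i + 1] for i in range(100_000)]
np_rows = np.array(rows)

# Sum the two columns with a Python loop vs. one vectorized NumPy addition
loop_time = timeit.timeit(lambda: [r[0] + r[1] for r in rows], number=10)
vec_time = timeit.timeit(lambda: np_rows[:, 0] + np_rows[:, 1], number=10)

print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```

On typical hardware the vectorized version is an order of magnitude faster, because the addition runs in compiled code over the whole column at once.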

When I selected each column, I used the syntax ndarray[:,c], where c is the index of the column I wanted to select. The colon selects all rows.

# print(np_numbers)        # the full array
# print(np_numbers[1])     # the second row
# print(np_numbers[2:])    # the third row onward

# print(np_numbers[3, 1])  # a single element: row 3, column 1
# print(np_numbers[:, 1])  # the second column

# Select specific rows by passing a list of row indices
rows = [0, 2, 4]
print(np_numbers[rows, :])
[[6 5]
 [5 6]
 [3 7]]
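Besides a list of row indices, an ndarray also accepts a boolean mask, which keeps only the rows where the mask is True. A minimal sketch on the same numbers array:

```python
import numpy as np

np_numbers = np.array([[6, 5], [1, 3], [5, 6], [1, 4],
                       [3, 7], [5, 8], [3, 5], [8, 4]])

# Boolean mask: True for rows whose first column is greater than 4
mask = np_numbers[:, 0] > 4
print(mask)
print(np_numbers[mask])  # keeps only the matching rows
```

The comparison itself is vectorized, so the mask is produced without any explicit loop.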

Explore the numerical dataset

To explore two-dimensional (2D) ndarrays, I'll analyze New York City taxi trip data released by the city of New York.

!wget -nc https://datasets21.s3-us-west-1.amazonaws.com/nyc_taxis.csv
File ‘nyc_taxis.csv’ already there; not retrieving.

from csv import reader
# Load the data into the notebook
with open('nyc_taxis.csv', 'r') as taxis_file:
  taxis = list(reader(taxis_file))

print(len(taxis))
print(len(taxis[0]))
2014
15
import numpy as np
np_taxis = np.array(taxis)
print(np_taxis[:3])
np_taxis.shape
[['pickup_year' 'pickup_month' 'pickup_day' 'pickup_dayofweek'
  'pickup_time' 'pickup_location_code' 'dropoff_location_code'
  'trip_distance' 'trip_length' 'fare_amount' 'fees_amount'
  'tolls_amount' 'tip_amount' 'total_amount' 'payment_type']
 ['2016' '1' '1' '5' '0' '2' '4' '21.00' '2037' '52.00' '0.80' '5.54'
  '11.65' '69.99' '1']
 ['2016' '1' '1' '5' '0' '2' '1' '16.29' '1520' '45.00' '1.30' '0.00'
  '8.00' '54.30' '1']]
(2014, 15)

I'll only work with a subset of the real data: a 1/50th random sample of the approximately 90,000 yellow taxi trips to and from New York City airports between January and June 2016. Below is information about selected columns from the dataset:

  • pickup_year: the year of the trip
  • pickup_month: the month of the trip (January is 1, December is 12)
  • pickup_day: the day of the month of the trip
  • pickup_location_code: the airport or borough where the trip started
  • dropoff_location_code: the airport or borough where the trip ended
  • trip_distance: the distance of the trip in miles
  • trip_length: the length of the trip in seconds
  • fare_amount: the base fare of the trip, in dollars
  • total_amount: the total amount charged to the passenger, including all fees, tolls and tips

Detailed information on all the columns can be found in the dataset's documentation.

# Access columns by name using a structured array
np_taxis = np.genfromtxt('nyc_taxis.csv', delimiter=',', names=True)
print(np_taxis.shape)  # names=True yields a 1D structured array
print(np_taxis[10]['pickup_location_code'])

# A quick shape comparison: 1D vs 2D ndarrays
z = np.array([1, 2])
print(z.shape)
y = np.array([[1], [2]])
print(y.shape)

# Reload as a plain 2D float array, skipping the header row
np_taxis = np.genfromtxt('nyc_taxis.csv', delimiter=',', skip_header=1)
print(np_taxis.shape)
print(np_taxis[10])

# Suppress scientific notation for readability
np.set_printoptions(suppress=True)
print(np_taxis[10])
print(np_taxis[1])

# Select several columns at once
cols = [5, 6, 9, 10]
print(np_taxis[:, cols])

Get the frequency table of trips by pickup location

print(np.unique(np_taxis[:,5]))  # column 5 is pickup_location_code
unique, counts = np.unique(np_taxis[:,5], return_counts=True)
print(unique)
print(counts)
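The two parallel arrays returned by np.unique can be zipped into a readable frequency table. A sketch with made-up location codes standing in for the taxi column:

```python
import numpy as np

# Made-up location codes standing in for a column of the taxi data
codes = np.array([2, 2, 4, 2, 0, 4, 4, 2])

unique, counts = np.unique(codes, return_counts=True)
for code, count in zip(unique, counts):
    print(f"location {code}: {count} trips")
```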
print(np_taxis[:,-2])  # preview the total_amount column
# Calculate the mph of each trip
trip_distance_miles = np_taxis[:,7]
trip_length_seconds = np_taxis[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour

trip_mph = trip_distance_miles / trip_length_hours
print(trip_mph)
total_amount = np_taxis[:,-2]
# measures of central tendency
mean = np.mean(total_amount)
median = np.median(total_amount)

# measures of dispersion (named to avoid shadowing the built-ins min, max, range)
amount_min = np.amin(total_amount)
amount_max = np.amax(total_amount)
amount_range = np.ptp(total_amount)
variance = np.var(total_amount)
sd = np.std(total_amount)

print("Descriptive analysis")
print("\n")
print("Measures of Central Tendency")
print("Mean =", mean)
print("Median =", median)
print("Measures of Dispersion")
print("Minimum =", amount_min)
print("Maximum =", amount_max)
print("Range =", amount_range)
print("Variance =", variance)
print("Standard Deviation =", sd)