Intro to Pandas
From Alex Lu with Love
I will use a dataset of Fortune 1000 from Kaggle datasets to explore pandas library here.
The Fortune 1000 dataset is from the Fortune website. It contains U.S. company data for the year 2021. The dataset is 1000 rows and 18 columns.
Features:
- Company - values are the name of the company Rank - The 2021 rank established by Fortune (1-1000)
- Rank Change - The change in the rank from 2020 to 2021. There is only a rank change listed if the company is currently in the top 500 and was previously in the top 500.
- Revenue - Revenue of each company in millions. This is the criteria used to rank each company.
- Profit - Profit of each company in millions. Num. of Employees - The number of employees each company employs.
- Sector - The sector of the market the company operates in.
- City - The city where the company's headquarters is located.
- State - The state where the company's headquarters is located
- Newcomer - Indicates whether or not the company is new to the top Fortune 500 ("yes" or "no"). No value will be listed for companies outside of the top 500.
- CEO Founder - Indicates whether the CEO of the company is also the founder ("yes" or "no").
- CEO Woman - Indicates whether the CEO of the company is a woman ("yes" or "no").
- Profitable - Indicates whether the company is profitable or not ("yes" or "no").
- Prev. Rank - The 2020 rank of the company, as established by Fortune. There will only be previous rank data for the top 500 companies.
- CEO - The name of the CEO of the company
- Website - The url of the company website
- Ticker - The stock ticker symbol of public companies. Some rows will have empty values because the company is a private corporation.
- Market Cap - The market cap (or value) of the company in millions. Some rows will have empty values because the company is private. Market valuations were determined on January 20, 2021.
!wget -nc /content/ https://datasets21.s3-us-west-1.amazonaws.com/Fortune_1000.csv
import pandas as pd
f1000 = pd.read_csv('Fortune_1000.csv',index_col=0)
f1000.info()
f1000.head(3)
companies = f1000.loc[:,'profit']
print(companies)
print(type(f1000))
print(type(companies))
print(f1000.loc[:,'revenue'])
print(f1000.loc['Apple','revenue'])
f1000[['rank','revenue']]
#f1000.loc[:,'rank':'sector']
f1000.loc[['Amazon', 'Apple']]
#f1000.loc['Amazon':'Apple']
sector_value_counts = f1000['sector'].value_counts(ascending=True)
print(sector_value_counts)
f1000['sector'].value_counts().loc['Technology']
#f1000['rank_change'].max()
#f1000['rank_change'].min()
f1000['rank_change'].describe()
f1000['rank_change'].value_counts()
Exercise:
List out the numbers of companies in the Fortune 1000 of the top 3 states
top_3_states = f1000['state'].value_counts().head(3)
print(top_3_states)
Exercise:
find the company that employs the most people in California in the dataset.
I can use the DataFrame.sort_values()
method to sort the rows on the employees column
f1000[f1000['state']=='CA'].sort_values('num. of employees',ascending=False).head(1)
Exercise:
find the unique list of states in the dataset
To identify the unique states, I can use the Series.unique()
method. This method returns an array of unique values from any series.
states = f1000['state'].unique()
print(states)
Practice:
I'm going to produce the following dictionary of the top employer in each state:
-
create an empty dictionary,
top_employer_by_state
to store the results of the exercise. -
Use the
Series.unique()
method to create an array of unique values from the state column -
Use a for loop to iterate over the array unique states. In each iteration:
- Select only the rows that have a state name equal to the current iteration.
- Use
DataFrame.sort_values()
to sort those rows by thenum. of employees
column in descending order. - Select the first row from the sorted dataframe and convert the Dataframe into a series using
DataFrame.squeeze()
- Extract the company name from the index label company by
Series.name
. - Assign the results to the top_employer_by_state dictionary, using the state name as the key, and the company name as the value.
top_exmployer_by_state = {}
states = f1000['state'].unique()
for state in states:
selected_companies = f1000[f1000['state']==state]
top_exmployer_by_state[state] = selected_companies.sort_values('num. of employees', ascending=False).head(1).squeeze().name
for key in top_exmployer_by_state:
print(key, ' : ', top_exmployer_by_state[key])