Data Cleaning Code

Outline

Importing Libraries

import pandas as pd

        

Data Cleaning Process (Video)

Data that violated regulations

Quick Data inspection


df = pd.read_excel("ID generation - Ingredients.xlsx")
df.head()
            
Dish_ID Main Content
0 1 Rice, Carrots, Rasins
1 2 Rice, Lamb
2 3 Semolina Flour
3 4 Coconut Milk
4 5 Pig'S Ear, Chicken, Lamb, Pork, Potatoes

df.shape
            

(172, 2)


# Copying the dataframe
df_copy = df.copy()
            

Reference: https://stackoverflow.com/questions/50731229/split-cell-into-multiple-rows-in-pandas-dataframe

# Replace all 'And' values with ','
df_copy['Main Content'] = df_copy['Main Content'].str.replace('And', ',')
df_copy = (df_copy.set_index(['Dish_ID']) .apply(lambda x: x.str.split(' '.explode()) .reset_index()) df_copy["Main Content"] = df_copy["Main Content"].str.strip() df_copy['Main Content'] = df_copy['Main Content'].dropna() df_copy.head()
Dish_ID Main Content
0 1 Rice
1 1 Carrots
2 1 Rasins
3 2 Rice
4 2 Lamb

df.shape
            

(345,2)


Conflicting data & Threatening scope creep

Reading the Data extracted from Website 1

df1 = pd.read_excel("Project data - National dishes clean.xlsx")
df1.head()
            
Country National Dish National dish 2 National dish 3 ... National dish 16 National dish 17
0 Afghanistan Kabuli palaw NaN NaN ... NaN NaN
1 Albania Tavë kosi Flia NaN ... NaN NaN
2 Algeria Couscous Rechta NaN ... NaN NaN
3 Andorra Escudella i carn d'olla NaN NaN ... NaN NaN
4 Angola Moamba de galinha NaN NaN ... NaN NaN

df1.shape
            

(189, 18)


Removing all rows with missing values in the country column

df1.dropna(subset=['Country'], inplace=True)
df1.info()
            
<class 'pandas.core.frame.DataFrame'>
            Int64Index: 165 entries, 0 to 188
            Data columns (total 18 columns):
                #   Column            Non-Null Count  Dtype  
                ---  ------            --------------  -----  
                0   Country           165 non-null    object 
                1   National Dish     165 non-null    object 
                2   National dish 2   81 non-null     object 
                3   National dish 3   47 non-null     object 
                4   National dish 4   30 non-null     object 
                5   National dish 5   17 non-null     object 
                6   National dish 6   9 non-null      object 
                7   National dish 7   6 non-null      object 
                8   National dish 8   2 non-null      object 
                9   National dish 9   1 non-null      object 
                10  National dish 10  1 non-null      object 
                11  National dish 11  1 non-null      object 
                12  National dish 12  1 non-null      object 
                13  National dish 13  1 non-null      object 
                14  National dish 14  1 non-null      object 
                15  National dish 15  1 non-null      object 
                16  National dish 16  0 non-null      float64
                17  National dish 17  0 non-null      float64
            dtypes: float64(2), object(16)
            memory usage: 24.5+ KB
            

The only Columns of interest are the 'Country' Column and 'National Dish' column, as the rest of the columns contain many missing values


How many countries in our dataset have multiple dishes identifed as National dishes?


multiple_dish = df1[df1['National dish 2'].notna()]
multiple_dish.head()
                
Country National Dish National dish 2 National dish 3 National dish 4 ... National dish 16 National dish 17
1 Albania Tavë kosi Flia NaN NaN ... NaN NaN
2 Algeria Couscous Rechta NaN NaN ... NaN NaN
5 Argentina Asado empanada matambre Locro ... NaN NaN
6 Armenia Khorovats Harisa (not to be confused with the North Afri... NaN NaN ... NaN NaN
7 Australia Roast lamb meat pie pavlova Vegemite on toast ... NaN NaN

df.shape
            

(81, 18)

81 Countries have multiple dishes as their National dish.


Reading the Data scraped from website 2


df2 = pd.read_csv("national_dishes.csv")
print(df2.shape)
df2.head()
            
(98, 2)
        
country dish
0 Afghanistan Kabuli Pulao
1 Algeria Couscous
2 Australia Meat Pie
3 Austria Wiener Schnitzel
4 Azerbaijan Plov

Renaming the country column to match the format in the website 1 data (Matching the capitalization)
df2.rename(columns = {'country':'Country'}, inplace = True)
            

Countries common to both datasets


common = df1.merge(df2, on=['Country'])
print(common.shape)
common.head()
            

(88, 19)
        
Country National Dish National dish 2 National dish 3 National dish 4 ... National dish 17 dish
0 Afghanistan Kabuli palaw NaN NaN NaN ... NaN Kabuli Pulao
1 Algeria Couscous Rechta NaN NaN ... NaN Couscous
2 Australia Roast lamb meat pie pavlova Vegemite on toast ... NaN Meat Pie
3 Austria Wiener schnitzel NaN NaN NaN ... NaN Wiener Schnitzel
4 Azerbaijan Dolma NaN NaN NaN ... NaN Plov

# Comparing the National dishes from dataset 1 & 2
common_dish_compare = common[['Country', 'National Dish', 'dish']]
common_dish_compare
            
Country National Dish dish
0 Afghanistan Kabuli palaw Kabuli Pulao
1 Algeria Couscous Couscous
2 Australia Roast lamb Meat Pie
3 Austria Wiener schnitzel Wiener Schnitzel
4 Azerbaijan Dolma Plov
... ... ... ...
83 Scotland Haggis Haggis
84 Uzbekistan Osh Plov
85 Venezuela Pabellón criollo Pabellón criollo
86 Vietnam Phở Pho
87 Yemen Saltah Saltah

88 rows × 3 columns

Some country's national dishes conflict across the two datasets. To resolve the conflict, all values from the second dataset will be used.


Countries present in Dataset 1 not present in Dataset 2


df1_not_in_2 = df1[~(df1['Country'].isin(df2['Country']))].reset_index(drop=True)
print(df1_not_in_2.shape)
df1_not_in_2
            
(77, 18)
        
Country National Dish National dish 2 National dish 3 National dish 4 ... National dish 16 National dish 17
0 Albania Tavë kosi Flia NaN NaN ... NaN NaN
1 Andorra Escudella i carn d'olla NaN NaN NaN ... NaN NaN
2 Angola Moamba de galinha NaN NaN NaN ... NaN NaN
3 Argentina Asado empanada matambre Locro ... NaN NaN
4 Armenia Khorovats Harisa (not to be confused with the North Afri... NaN NaN ... NaN NaN
... ... ... ... ... ... ... ... ...
72 United States Virgin Islands Funji NaN NaN NaN ... NaN NaN
73 Uruguay Chivito NaN NaN NaN ... NaN NaN
74 Vanuatu Laplap NaN NaN NaN ... NaN NaN
75 Zambia Nshima NaN NaN NaN ... NaN NaN
76 Zimbabwe Sadza NaN NaN NaN ... NaN NaN

77 rows × 18 columns


Countries present in Dataset 2 not present in Dataset 1


df2_not_in_1 = df2[~(df2['Country'].isin(df1['Country']))].reset_index(drop=True)
print(df2_not_in_1.shape)
df2_not_in_1
            
(10, 2)
        
Country dish
0 Belize Belizean Rice and Beans
1 Bosnia & Herzegovina Ćevapi
2 Equatorial Guinea Succotash
3 Moldova Mămăligă
4 North Korea Kimchi
5 Benin Kuli Kuli
6 Ghana Fufu
7 Kuwait Machboos
8 Lesotho Papa
9 Libya Cuscus Bil-Bosla

Countries with multiple dishes but not found in Dataset 2


multiple_dish_not_common = df1_not_in_2[df1_not_in_2['National dish 2'].notna()]
print(multiple_dish_not_common.shape)
            
(31, 18)
        
# Getting a list of all the 31 countries with multiple dishes not found in dataset 2
multiple_dish_not_common['Country'].reset_index(drop=True)
            
            0                    Albania
            1                  Argentina
            2                    Armenia
            3     Bosnia and Herzegovina
            4                   Bulgaria
            5                     Cyprus
            6             Czech Republic
            7                   Dominica
            8                    Finland
            9                      India
            10                      Iraq
            11                     Italy
            12                   Jamaica
            13                     Japan
            14              Korea, North
            15              Korea, South
            16                 Lithuania
            17                   Myanmar
            18                   Nigeria
            19                 Palestine
            20                   Romania
            21              Saudi Arabia
            22                    Serbia
            23                  Slovenia
            24                Tajikistan
            25       Trinidad and Tobago
            26            United Kingdom
            27                   England
            28             United States
            29                      Guam
            30               Puerto Rico
            Name: Country, dtype: object
        

Bosnia and Hezegovina is present in Dataset 2, except it is spelled differently. (Dataset 2 uses an '&' inplace of 'and'). North Korea also has a similar problem. The two countries will be corrected to match dataset 2 and The values will thus be taken from dataset 2.

# Replacing values for Bosnia and Hezegovina and North Korea to match df2
df1['Country'].replace(['Bosnia and Herzegovina', 'Korea, North'], ['Bosnia & Herzegovina', 'North Korea'], inplace = True)
        
# Adding the changes to all the other steps. Note the following steps are merely a repetition of the above code.
common = df1.merge(df2, on=['Country'])

common_dish_compare = common[['Country', 'National Dish', 'dish']]

df1_not_in_2 = df1[~(df1['Country'].isin(df2['Country']))].reset_index(drop=True)

df2_not_in_1 = df2[~(df2['Country'].isin(df1['Country']))].reset_index(drop=True)

multiple_dish_not_common = df1_not_in_2[df1_not_in_2['National dish 2'].notna()]

multiple_dish_not_common['Country'].reset_index(drop=True)
            
            0                 Albania
            1               Argentina
            2                 Armenia
            3                Bulgaria
            4                  Cyprus
            5          Czech Republic
            6                Dominica
            7                 Finland
            8                   India
            9                    Iraq
            10                  Italy
            11                Jamaica
            12                  Japan
            13           Korea, South
            14              Lithuania
            15                Myanmar
            16                Nigeria
            17              Palestine
            18                Romania
            19           Saudi Arabia
            20                 Serbia
            21               Slovenia
            22             Tajikistan
            23    Trinidad and Tobago
            24         United Kingdom
            25                England
            26          United States
            27                   Guam
            28            Puerto Rico
            Name: Country, dtype: object
        

To make Data correction much simpler. The data 'multiple_dish_not_common["Country"]' will be exported as a csv file to be worked on in excel, where each country's national dish will be manually checked. As the list will only contain 29 countries (Bosnia & Hezegovina, and North Korea excluded), filling in each country's national dish may not warrant any automation techniques.


multiple_dish_not_common['Country'].to_csv('countries with multiple dishes.csv', index=False)
            

Creating one main dataset

# First lets rename the dish column in df2 to match all other dataframes.
df2.rename(columns = {'dish':'National Dish'}, inplace = True)
        
# Reading the multiple dish not common data, that was cleaned in excel
df3 = pd.read_excel('Multiple dishes clean.xlsx')
        
# Appending df3 to df2
df2_updated = pd.concat((df2, df3), axis = 0)
df2_updated.shape
            
(127, 2)

# Getting the data from website 1 that was not in the updated df2, but only had one National dish
df1_single_dish = df1[~(df1['Country'].isin(df2_updated['Country']))].reset_index(drop=True)
print(df1_single_dish.shape)
(46, 18)
        
df1_single_dish.info()
            
<class 'pandas.core.frame.DataFrame'>
            RangeIndex: 46 entries, 0 to 45
            Data columns (total 18 columns):
                #   Column            Non-Null Count  Dtype  
                ---  ------            --------------  -----  
                0   Country           46 non-null     object 
                1   National Dish     46 non-null     object 
                2   National dish 2   0 non-null      object 
                3   National dish 3   0 non-null      object 
                4   National dish 4   0 non-null      object 
                5   National dish 5   0 non-null      object 
                6   National dish 6   0 non-null      object 
                7   National dish 7   0 non-null      object 
                8   National dish 8   0 non-null      object 
                9   National dish 9   0 non-null      object 
                10  National dish 10  0 non-null      object 
                11  National dish 11  0 non-null      object 
                12  National dish 12  0 non-null      object 
                13  National dish 13  0 non-null      object 
                14  National dish 14  0 non-null      object 
                15  National dish 15  0 non-null      object 
                16  National dish 16  0 non-null      float64
                17  National dish 17  0 non-null      float64
            dtypes: float64(2), object(16)
            memory usage: 6.6+ KB
            
# Getting only the first two columns since there are no values in the other columns
df1_single_dish = df1_single_dish[["Country", "National Dish"]]
            
# Appending this data to df2_updated to make the final dataframe
df_final = pd.concat((df1_single_dish, df2_updated), axis=0)
df_final.shape
                
(173, 2)

We now have our final dataset with 173 countries, each with only one National dish.

Saving our results to a CSV file

# Exporting our data
df_final.to_csv("countries and National dish.csv", index=False)
                

Note: More countries and dishes were later added to get the number to 200+.