National Dish Project

	Dish_ID	Main Content
0	1	Rice, Carrots, Rasins
1	2	Rice, Lamb
2	3	Semolina Flour
3	4	Coconut Milk
4	5	Pig'S Ear, Chicken, Lamb, Pork, Potatoes

	Dish_ID	Main Content
0	1	Rice
1	1	Carrots
2	1	Rasins
3	2	Rice
4	2	Lamb

Conflicting data & Threatening scope creep

Reading the Data extracted from Website 1

df1 = pd.read_excel("Project data - National dishes clean.xlsx")
df1.head()

	Country	National Dish	National dish 2	National dish 3	...	National dish 16	National dish 17
0	Afghanistan	Kabuli palaw	NaN	NaN	...	NaN	NaN
1	Albania	Tavë kosi	Flia	NaN	...	NaN	NaN
2	Algeria	Couscous	Rechta	NaN	...	NaN	NaN
3	Andorra	Escudella i carn d'olla	NaN	NaN	...	NaN	NaN
4	Angola	Moamba de galinha	NaN	NaN	...	NaN	NaN

df1.shape

(189, 18)

Removing all rows with missing values in the country column

df1.dropna(subset=['Country'], inplace=True)
df1.info()

<class 'pandas.core.frame.DataFrame'>
            Int64Index: 165 entries, 0 to 188
            Data columns (total 18 columns):
                #   Column            Non-Null Count  Dtype  
                ---  ------            --------------  -----  
                0   Country           165 non-null    object 
                1   National Dish     165 non-null    object 
                2   National dish 2   81 non-null     object 
                3   National dish 3   47 non-null     object 
                4   National dish 4   30 non-null     object 
                5   National dish 5   17 non-null     object 
                6   National dish 6   9 non-null      object 
                7   National dish 7   6 non-null      object 
                8   National dish 8   2 non-null      object 
                9   National dish 9   1 non-null      object 
                10  National dish 10  1 non-null      object 
                11  National dish 11  1 non-null      object 
                12  National dish 12  1 non-null      object 
                13  National dish 13  1 non-null      object 
                14  National dish 14  1 non-null      object 
                15  National dish 15  1 non-null      object 
                16  National dish 16  0 non-null      float64
                17  National dish 17  0 non-null      float64
            dtypes: float64(2), object(16)
            memory usage: 24.5+ KB

The only Columns of interest are the 'Country' Column and 'National Dish' column, as the rest of the columns contain many missing values

How many countries in our dataset have multiple dishes identifed as National dishes?

multiple_dish = df1[df1['National dish 2'].notna()]
multiple_dish.head()

	Country	National Dish	National dish 2	National dish 3	National dish 4	...	National dish 16	National dish 17
1	Albania	Tavë kosi	Flia	NaN	NaN	...	NaN	NaN
2	Algeria	Couscous	Rechta	NaN	NaN	...	NaN	NaN
5	Argentina	Asado	empanada	matambre	Locro	...	NaN	NaN
6	Armenia	Khorovats	Harisa (not to be confused with the North Afri...	NaN	NaN	...	NaN	NaN
7	Australia	Roast lamb	meat pie	pavlova	Vegemite on toast	...	NaN	NaN

df.shape

(81, 18)

81 Countries have multiple dishes as their National dish.

Reading the Data scraped from website 2

df2 = pd.read_csv("national_dishes.csv")
print(df2.shape)
df2.head()

(98, 2)

	country	dish
0	Afghanistan	Kabuli Pulao
1	Algeria	Couscous
2	Australia	Meat Pie
3	Austria	Wiener Schnitzel
4	Azerbaijan	Plov

Renaming the country column to match the format in the website 1 data (Matching the capitalization)

df2.rename(columns = {'country':'Country'}, inplace = True)

Countries common to both datasets

common = df1.merge(df2, on=['Country'])
print(common.shape)
common.head()

(88, 19)

	Country	National Dish	National dish 2	National dish 3	National dish 4	...	National dish 17	dish
0	Afghanistan	Kabuli palaw	NaN	NaN	NaN	...	NaN	Kabuli Pulao
1	Algeria	Couscous	Rechta	NaN	NaN	...	NaN	Couscous
2	Australia	Roast lamb	meat pie	pavlova	Vegemite on toast	...	NaN	Meat Pie
3	Austria	Wiener schnitzel	NaN	NaN	NaN	...	NaN	Wiener Schnitzel
4	Azerbaijan	Dolma	NaN	NaN	NaN	...	NaN	Plov

# Comparing the National dishes from dataset 1 & 2
common_dish_compare = common[['Country', 'National Dish', 'dish']]
common_dish_compare

	Country	National Dish	dish
0	Afghanistan	Kabuli palaw	Kabuli Pulao
1	Algeria	Couscous	Couscous
2	Australia	Roast lamb	Meat Pie
3	Austria	Wiener schnitzel	Wiener Schnitzel
4	Azerbaijan	Dolma	Plov
...	...	...	...
83	Scotland	Haggis	Haggis
84	Uzbekistan	Osh	Plov
85	Venezuela	Pabellón criollo	Pabellón criollo
86	Vietnam	Phở	Pho
87	Yemen	Saltah	Saltah

88 rows × 3 columns

Some country's national dishes conflict across the two datasets. To resolve the conflict, all values from the second dataset will be used.

Countries present in Dataset 1 not present in Dataset 2

df1_not_in_2 = df1[~(df1['Country'].isin(df2['Country']))].reset_index(drop=True)
print(df1_not_in_2.shape)
df1_not_in_2

(77, 18)

	Country	National Dish	National dish 2	National dish 3	National dish 4	...	National dish 16	National dish 17
0	Albania	Tavë kosi	Flia	NaN	NaN	...	NaN	NaN
1	Andorra	Escudella i carn d'olla	NaN	NaN	NaN	...	NaN	NaN
2	Angola	Moamba de galinha	NaN	NaN	NaN	...	NaN	NaN
3	Argentina	Asado	empanada	matambre	Locro	...	NaN	NaN
4	Armenia	Khorovats	Harisa (not to be confused with the North Afri...	NaN	NaN	...	NaN	NaN
...	...	...	...	...	...	...	...	...
72	United States Virgin Islands	Funji	NaN	NaN	NaN	...	NaN	NaN
73	Uruguay	Chivito	NaN	NaN	NaN	...	NaN	NaN
74	Vanuatu	Laplap	NaN	NaN	NaN	...	NaN	NaN
75	Zambia	Nshima	NaN	NaN	NaN	...	NaN	NaN
76	Zimbabwe	Sadza	NaN	NaN	NaN	...	NaN	NaN

77 rows × 18 columns

Countries present in Dataset 2 not present in Dataset 1

df2_not_in_1 = df2[~(df2['Country'].isin(df1['Country']))].reset_index(drop=True)
print(df2_not_in_1.shape)
df2_not_in_1

(10, 2)

	Country	dish
0	Belize	Belizean Rice and Beans
1	Bosnia & Herzegovina	Ćevapi
2	Equatorial Guinea	Succotash
3	Moldova	Mămăligă
4	North Korea	Kimchi
5	Benin	Kuli Kuli
6	Ghana	Fufu
7	Kuwait	Machboos
8	Lesotho	Papa
9	Libya	Cuscus Bil-Bosla

Countries with multiple dishes but not found in Dataset 2

multiple_dish_not_common = df1_not_in_2[df1_not_in_2['National dish 2'].notna()]
print(multiple_dish_not_common.shape)

(31, 18)

# Getting a list of all the 31 countries with multiple dishes not found in dataset 2
multiple_dish_not_common['Country'].reset_index(drop=True)

            0                    Albania
            1                  Argentina
            2                    Armenia
            3     Bosnia and Herzegovina
            4                   Bulgaria
            5                     Cyprus
            6             Czech Republic
            7                   Dominica
            8                    Finland
            9                      India
            10                      Iraq
            11                     Italy
            12                   Jamaica
            13                     Japan
            14              Korea, North
            15              Korea, South
            16                 Lithuania
            17                   Myanmar
            18                   Nigeria
            19                 Palestine
            20                   Romania
            21              Saudi Arabia
            22                    Serbia
            23                  Slovenia
            24                Tajikistan
            25       Trinidad and Tobago
            26            United Kingdom
            27                   England
            28             United States
            29                      Guam
            30               Puerto Rico
            Name: Country, dtype: object

Bosnia and Hezegovina is present in Dataset 2, except it is spelled differently. (Dataset 2 uses an '&' inplace of 'and'). North Korea also has a similar problem. The two countries will be corrected to match dataset 2 and The values will thus be taken from dataset 2.

# Replacing values for Bosnia and Hezegovina and North Korea to match df2
df1['Country'].replace(['Bosnia and Herzegovina', 'Korea, North'], ['Bosnia & Herzegovina', 'North Korea'], inplace = True)
        
# Adding the changes to all the other steps. Note the following steps are merely a repetition of the above code.
common = df1.merge(df2, on=['Country'])

common_dish_compare = common[['Country', 'National Dish', 'dish']]

df1_not_in_2 = df1[~(df1['Country'].isin(df2['Country']))].reset_index(drop=True)

df2_not_in_1 = df2[~(df2['Country'].isin(df1['Country']))].reset_index(drop=True)

multiple_dish_not_common = df1_not_in_2[df1_not_in_2['National dish 2'].notna()]

multiple_dish_not_common['Country'].reset_index(drop=True)

            0                 Albania
            1               Argentina
            2                 Armenia
            3                Bulgaria
            4                  Cyprus
            5          Czech Republic
            6                Dominica
            7                 Finland
            8                   India
            9                    Iraq
            10                  Italy
            11                Jamaica
            12                  Japan
            13           Korea, South
            14              Lithuania
            15                Myanmar
            16                Nigeria
            17              Palestine
            18                Romania
            19           Saudi Arabia
            20                 Serbia
            21               Slovenia
            22             Tajikistan
            23    Trinidad and Tobago
            24         United Kingdom
            25                England
            26          United States
            27                   Guam
            28            Puerto Rico
            Name: Country, dtype: object

To make Data correction much simpler. The data 'multiple_dish_not_common["Country"]' will be exported as a csv file to be worked on in excel, where each country's national dish will be manually checked. As the list will only contain 29 countries (Bosnia & Hezegovina, and North Korea excluded), filling in each country's national dish may not warrant any automation techniques.

multiple_dish_not_common['Country'].to_csv('countries with multiple dishes.csv', index=False)

Creating one main dataset

# First lets rename the dish column in df2 to match all other dataframes.
df2.rename(columns = {'dish':'National Dish'}, inplace = True)
        
# Reading the multiple dish not common data, that was cleaned in excel
df3 = pd.read_excel('Multiple dishes clean.xlsx')
        
# Appending df3 to df2
df2_updated = pd.concat((df2, df3), axis = 0)
df2_updated.shape

(127, 2)

# Getting the data from website 1 that was not in the updated df2, but only had one National dish
df1_single_dish = df1[~(df1['Country'].isin(df2_updated['Country']))].reset_index(drop=True)

print(df1_single_dish.shape)

(46, 18)

df1_single_dish.info()

<class 'pandas.core.frame.DataFrame'>
            RangeIndex: 46 entries, 0 to 45
            Data columns (total 18 columns):
                #   Column            Non-Null Count  Dtype  
                ---  ------            --------------  -----  
                0   Country           46 non-null     object 
                1   National Dish     46 non-null     object 
                2   National dish 2   0 non-null      object 
                3   National dish 3   0 non-null      object 
                4   National dish 4   0 non-null      object 
                5   National dish 5   0 non-null      object 
                6   National dish 6   0 non-null      object 
                7   National dish 7   0 non-null      object 
                8   National dish 8   0 non-null      object 
                9   National dish 9   0 non-null      object 
                10  National dish 10  0 non-null      object 
                11  National dish 11  0 non-null      object 
                12  National dish 12  0 non-null      object 
                13  National dish 13  0 non-null      object 
                14  National dish 14  0 non-null      object 
                15  National dish 15  0 non-null      object 
                16  National dish 16  0 non-null      float64
                17  National dish 17  0 non-null      float64
            dtypes: float64(2), object(16)
            memory usage: 6.6+ KB

# Getting only the first two columns since there are no values in the other columns
df1_single_dish = df1_single_dish[["Country", "National Dish"]]
            
# Appending this data to df2_updated to make the final dataframe
df_final = pd.concat((df1_single_dish, df2_updated), axis=0)
df_final.shape

(173, 2)

We now have our final dataset with 173 countries, each with only one National dish.

Saving our results to a CSV file

# Exporting our data
df_final.to_csv("countries and National dish.csv", index=False)

Note: More countries and dishes were later added to get the number to 200+.

Data Cleaning Code

Outline

Data Cleaning Process (Video)

Data that violated regulations

Quick Data inspection