Importing Libraries
import pandas as pd
df = pd.read_excel("ID generation - Ingredients.xlsx") df.head()
| Dish_ID | Main Content | |
|---|---|---|
| 0 | 1 | Rice, Carrots, Rasins |
| 1 | 2 | Rice, Lamb |
| 2 | 3 | Semolina Flour |
| 3 | 4 | Coconut Milk |
| 4 | 5 | Pig'S Ear, Chicken, Lamb, Pork, Potatoes |
df.shape
(172, 2)
# Copying the dataframe df_copy = df.copy()
Reference: https://stackoverflow.com/questions/50731229/split-cell-into-multiple-rows-in-pandas-dataframe
# Replace all 'And' values with ',' df_copy['Main Content'] = df_copy['Main Content'].str.replace('And', ',')
df_copy = (df_copy.set_index(['Dish_ID']) .apply(lambda x: x.str.split(' '.explode()) .reset_index()) df_copy["Main Content"] = df_copy["Main Content"].str.strip() df_copy['Main Content'] = df_copy['Main Content'].dropna() df_copy.head()
| Dish_ID | Main Content | |
|---|---|---|
| 0 | 1 | Rice |
| 1 | 1 | Carrots |
| 2 | 1 | Rasins |
| 3 | 2 | Rice |
| 4 | 2 | Lamb |
df.shape
(345,2)
df1 = pd.read_excel("Project data - National dishes clean.xlsx") df1.head()
| Country | National Dish | National dish 2 | National dish 3 | ... | National dish 16 | National dish 17 | |
|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Kabuli palaw | NaN | NaN | ... | NaN | NaN |
| 1 | Albania | Tavë kosi | Flia | NaN | ... | NaN | NaN |
| 2 | Algeria | Couscous | Rechta | NaN | ... | NaN | NaN |
| 3 | Andorra | Escudella i carn d'olla | NaN | NaN | ... | NaN | NaN |
| 4 | Angola | Moamba de galinha | NaN | NaN | ... | NaN | NaN |
df1.shape
(189, 18)
Removing all rows with missing values in the country column
df1.dropna(subset=['Country'], inplace=True) df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 165 entries, 0 to 188
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 165 non-null object
1 National Dish 165 non-null object
2 National dish 2 81 non-null object
3 National dish 3 47 non-null object
4 National dish 4 30 non-null object
5 National dish 5 17 non-null object
6 National dish 6 9 non-null object
7 National dish 7 6 non-null object
8 National dish 8 2 non-null object
9 National dish 9 1 non-null object
10 National dish 10 1 non-null object
11 National dish 11 1 non-null object
12 National dish 12 1 non-null object
13 National dish 13 1 non-null object
14 National dish 14 1 non-null object
15 National dish 15 1 non-null object
16 National dish 16 0 non-null float64
17 National dish 17 0 non-null float64
dtypes: float64(2), object(16)
memory usage: 24.5+ KB
The only Columns of interest are the 'Country' Column and 'National Dish' column, as the rest of the columns contain many missing values
multiple_dish = df1[df1['National dish 2'].notna()] multiple_dish.head()
| Country | National Dish | National dish 2 | National dish 3 | National dish 4 | ... | National dish 16 | National dish 17 | |
|---|---|---|---|---|---|---|---|---|
| 1 | Albania | Tavë kosi | Flia | NaN | NaN | ... | NaN | NaN |
| 2 | Algeria | Couscous | Rechta | NaN | NaN | ... | NaN | NaN |
| 5 | Argentina | Asado | empanada | matambre | Locro | ... | NaN | NaN |
| 6 | Armenia | Khorovats | Harisa (not to be confused with the North Afri... | NaN | NaN | ... | NaN | NaN |
| 7 | Australia | Roast lamb | meat pie | pavlova | Vegemite on toast | ... | NaN | NaN |
df.shape
(81, 18)
81 Countries have multiple dishes as their National dish.
df2 = pd.read_csv("national_dishes.csv") print(df2.shape) df2.head()
(98, 2)
| country | dish | |
|---|---|---|
| 0 | Afghanistan | Kabuli Pulao |
| 1 | Algeria | Couscous |
| 2 | Australia | Meat Pie |
| 3 | Austria | Wiener Schnitzel |
| 4 | Azerbaijan | Plov |
df2.rename(columns = {'country':'Country'}, inplace = True)
common = df1.merge(df2, on=['Country']) print(common.shape) common.head()
(88, 19)
| Country | National Dish | National dish 2 | National dish 3 | National dish 4 | ... | National dish 17 | dish | |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Kabuli palaw | NaN | NaN | NaN | ... | NaN | Kabuli Pulao |
| 1 | Algeria | Couscous | Rechta | NaN | NaN | ... | NaN | Couscous |
| 2 | Australia | Roast lamb | meat pie | pavlova | Vegemite on toast | ... | NaN | Meat Pie |
| 3 | Austria | Wiener schnitzel | NaN | NaN | NaN | ... | NaN | Wiener Schnitzel |
| 4 | Azerbaijan | Dolma | NaN | NaN | NaN | ... | NaN | Plov |
# Comparing the National dishes from dataset 1 & 2 common_dish_compare = common[['Country', 'National Dish', 'dish']] common_dish_compare
| Country | National Dish | dish | |
|---|---|---|---|
| 0 | Afghanistan | Kabuli palaw | Kabuli Pulao |
| 1 | Algeria | Couscous | Couscous |
| 2 | Australia | Roast lamb | Meat Pie |
| 3 | Austria | Wiener schnitzel | Wiener Schnitzel |
| 4 | Azerbaijan | Dolma | Plov |
| ... | ... | ... | ... |
| 83 | Scotland | Haggis | Haggis |
| 84 | Uzbekistan | Osh | Plov |
| 85 | Venezuela | Pabellón criollo | Pabellón criollo |
| 86 | Vietnam | Phở | Pho |
| 87 | Yemen | Saltah | Saltah |
88 rows × 3 columns
Some country's national dishes conflict across the two datasets. To resolve the conflict, all values from the second dataset will be used.
df1_not_in_2 = df1[~(df1['Country'].isin(df2['Country']))].reset_index(drop=True) print(df1_not_in_2.shape) df1_not_in_2
(77, 18)
| Country | National Dish | National dish 2 | National dish 3 | National dish 4 | ... | National dish 16 | National dish 17 | |
|---|---|---|---|---|---|---|---|---|
| 0 | Albania | Tavë kosi | Flia | NaN | NaN | ... | NaN | NaN |
| 1 | Andorra | Escudella i carn d'olla | NaN | NaN | NaN | ... | NaN | NaN |
| 2 | Angola | Moamba de galinha | NaN | NaN | NaN | ... | NaN | NaN |
| 3 | Argentina | Asado | empanada | matambre | Locro | ... | NaN | NaN |
| 4 | Armenia | Khorovats | Harisa (not to be confused with the North Afri... | NaN | NaN | ... | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 72 | United States Virgin Islands | Funji | NaN | NaN | NaN | ... | NaN | NaN |
| 73 | Uruguay | Chivito | NaN | NaN | NaN | ... | NaN | NaN |
| 74 | Vanuatu | Laplap | NaN | NaN | NaN | ... | NaN | NaN |
| 75 | Zambia | Nshima | NaN | NaN | NaN | ... | NaN | NaN |
| 76 | Zimbabwe | Sadza | NaN | NaN | NaN | ... | NaN | NaN |
77 rows × 18 columns
df2_not_in_1 = df2[~(df2['Country'].isin(df1['Country']))].reset_index(drop=True) print(df2_not_in_1.shape) df2_not_in_1
(10, 2)
| Country | dish | |
|---|---|---|
| 0 | Belize | Belizean Rice and Beans |
| 1 | Bosnia & Herzegovina | Ćevapi |
| 2 | Equatorial Guinea | Succotash |
| 3 | Moldova | Mămăligă |
| 4 | North Korea | Kimchi |
| 5 | Benin | Kuli Kuli |
| 6 | Ghana | Fufu |
| 7 | Kuwait | Machboos |
| 8 | Lesotho | Papa |
| 9 | Libya | Cuscus Bil-Bosla |
multiple_dish_not_common = df1_not_in_2[df1_not_in_2['National dish 2'].notna()] print(multiple_dish_not_common.shape)
(31, 18)
# Getting a list of all the 31 countries with multiple dishes not found in dataset 2 multiple_dish_not_common['Country'].reset_index(drop=True)
0 Albania
1 Argentina
2 Armenia
3 Bosnia and Herzegovina
4 Bulgaria
5 Cyprus
6 Czech Republic
7 Dominica
8 Finland
9 India
10 Iraq
11 Italy
12 Jamaica
13 Japan
14 Korea, North
15 Korea, South
16 Lithuania
17 Myanmar
18 Nigeria
19 Palestine
20 Romania
21 Saudi Arabia
22 Serbia
23 Slovenia
24 Tajikistan
25 Trinidad and Tobago
26 United Kingdom
27 England
28 United States
29 Guam
30 Puerto Rico
Name: Country, dtype: object
Bosnia and Hezegovina is present in Dataset 2, except it is spelled differently. (Dataset 2 uses an '&' inplace of 'and'). North Korea also has a similar problem. The two countries will be corrected to match dataset 2 and The values will thus be taken from dataset 2.
# Replacing values for Bosnia and Hezegovina and North Korea to match df2 df1['Country'].replace(['Bosnia and Herzegovina', 'Korea, North'], ['Bosnia & Herzegovina', 'North Korea'], inplace = True) # Adding the changes to all the other steps. Note the following steps are merely a repetition of the above code. common = df1.merge(df2, on=['Country']) common_dish_compare = common[['Country', 'National Dish', 'dish']] df1_not_in_2 = df1[~(df1['Country'].isin(df2['Country']))].reset_index(drop=True) df2_not_in_1 = df2[~(df2['Country'].isin(df1['Country']))].reset_index(drop=True) multiple_dish_not_common = df1_not_in_2[df1_not_in_2['National dish 2'].notna()] multiple_dish_not_common['Country'].reset_index(drop=True)
0 Albania
1 Argentina
2 Armenia
3 Bulgaria
4 Cyprus
5 Czech Republic
6 Dominica
7 Finland
8 India
9 Iraq
10 Italy
11 Jamaica
12 Japan
13 Korea, South
14 Lithuania
15 Myanmar
16 Nigeria
17 Palestine
18 Romania
19 Saudi Arabia
20 Serbia
21 Slovenia
22 Tajikistan
23 Trinidad and Tobago
24 United Kingdom
25 England
26 United States
27 Guam
28 Puerto Rico
Name: Country, dtype: object
To make Data correction much simpler. The data 'multiple_dish_not_common["Country"]' will be exported as a csv file to be worked on in excel, where each country's national dish will be manually checked. As the list will only contain 29 countries (Bosnia & Hezegovina, and North Korea excluded), filling in each country's national dish may not warrant any automation techniques.
multiple_dish_not_common['Country'].to_csv('countries with multiple dishes.csv', index=False)
# First lets rename the dish column in df2 to match all other dataframes. df2.rename(columns = {'dish':'National Dish'}, inplace = True) # Reading the multiple dish not common data, that was cleaned in excel df3 = pd.read_excel('Multiple dishes clean.xlsx') # Appending df3 to df2 df2_updated = pd.concat((df2, df3), axis = 0) df2_updated.shape
(127, 2)
# Getting the data from website 1 that was not in the updated df2, but only had one National dish df1_single_dish = df1[~(df1['Country'].isin(df2_updated['Country']))].reset_index(drop=True)
print(df1_single_dish.shape)
(46, 18)
df1_single_dish.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 46 non-null object
1 National Dish 46 non-null object
2 National dish 2 0 non-null object
3 National dish 3 0 non-null object
4 National dish 4 0 non-null object
5 National dish 5 0 non-null object
6 National dish 6 0 non-null object
7 National dish 7 0 non-null object
8 National dish 8 0 non-null object
9 National dish 9 0 non-null object
10 National dish 10 0 non-null object
11 National dish 11 0 non-null object
12 National dish 12 0 non-null object
13 National dish 13 0 non-null object
14 National dish 14 0 non-null object
15 National dish 15 0 non-null object
16 National dish 16 0 non-null float64
17 National dish 17 0 non-null float64
dtypes: float64(2), object(16)
memory usage: 6.6+ KB
# Getting only the first two columns since there are no values in the other columns df1_single_dish = df1_single_dish[["Country", "National Dish"]] # Appending this data to df2_updated to make the final dataframe df_final = pd.concat((df1_single_dish, df2_updated), axis=0) df_final.shape
(173, 2)
We now have our final dataset with 173 countries, each with only one National dish.
# Exporting our data df_final.to_csv("countries and National dish.csv", index=False)
Note: More countries and dishes were later added to get the number to 200+.