!python -m pip install kaggle==1.6.12

Defaulting to user installation because normal site-packages is not writeable
Collecting kaggle==1.6.12
  Downloading kaggle-1.6.12.tar.gz (79 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79.7/79.7 kB 3.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: six>=1.10 in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (1.16.0)
Collecting certifi>=2023.7.22
  Downloading certifi-2025.11.12-py3-none-any.whl (159 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 159.4/159.4 kB 9.7 MB/s eta 0:00:00
Requirement already satisfied: python-dateutil in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (2.9.0.post0)
Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (2.29.0)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (4.65.0)
Collecting python-slugify
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Requirement already satisfied: urllib3 in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (1.26.15)
Requirement already satisfied: bleach in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (6.1.0)
Requirement already satisfied: webencodings in /opt/conda/lib/python3.10/site-packages (from bleach->kaggle==1.6.12) (0.5.1)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.2/78.2 kB 16.3 MB/s eta 0:00:00
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests->kaggle==1.6.12) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests->kaggle==1.6.12) (3.4)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.6.12-py3-none-any.whl size=102969 sha256=35252a1665ce54934064e043d40d970d45edd3ce256e9a015f30548687f39fad
  Stored in directory: /home/student/.cache/pip/wheels/1e/0b/7c/50f8e89c3d2f82838dbd7afeddffbb9357003009ada98216c7
Successfully built kaggle
Installing collected packages: text-unidecode, python-slugify, certifi, kaggle
  WARNING: The script slugify is installed in '/home/student/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script kaggle is installed in '/home/student/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed certifi-2025.11.12 kaggle-1.6.12 python-slugify-8.0.4 text-unidecode-1.3

!pip install --target=/workspace ucimlrepo numpy==1.24.3

Collecting ucimlrepo
  Using cached ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Collecting numpy==1.24.3
  Using cached numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting pandas>=1.0.0
  Using cached pandas-2.3.3-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (12.8 MB)
Collecting certifi>=2020.12.5
  Using cached certifi-2025.11.12-py3-none-any.whl (159 kB)
Collecting python-dateutil>=2.8.2
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
Collecting pytz>=2020.1
  Using cached pytz-2025.2-py2.py3-none-any.whl (509 kB)
Collecting tzdata>=2022.7
  Using cached tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Collecting six>=1.5
  Using cached six-1.17.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: pytz, tzdata, six, numpy, certifi, python-dateutil, pandas, ucimlrepo
Successfully installed certifi-2025.11.12 numpy-1.24.3 pandas-2.3.3 python-dateutil-2.9.0.post0 pytz-2025.2 six-1.17.0 tzdata-2025.2 ucimlrepo-0.0.7
WARNING: Target directory /workspace/dateutil already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/ucimlrepo already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/__pycache__ already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/pandas already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/python_dateutil-2.9.0.post0.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/tzdata already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/numpy-1.24.3.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/numpy already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/six-1.17.0.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/six.py already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/tzdata-2025.2.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/pandas-2.3.3.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/pytz already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/certifi-2025.11.12.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/ucimlrepo-0.0.7.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/numpy.libs already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/certifi already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/pytz-2025.2.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/bin already exists. Specify --upgrade to force replacement.

import requests  # to call the API
import pandas as pd  # to work with tables

BASE_URL = "https://ghoapi.azureedge.net/api"

# Query the indicator list endpoint to locate the correct indicator code
indicators_url = f"{BASE_URL}/Indicator"
response = requests.get(indicators_url)
response.raise_for_status()  # raise an exception if the request failed

indicators_json = response.json()
len(indicators_json["value"])

3047

# Turn the JSON into a DataFrame

indicators_df = pd.json_normalize(indicators_json["value"])
indicators_df.head()
indicators_df.columns

Index(['IndicatorCode', 'IndicatorName', 'Language'], dtype='object')
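
The response wraps its records in a `value` list, which is why the cell above normalizes `indicators_json["value"]` rather than the whole response. A minimal sketch on a hand-made payload; the two records are invented for illustration and only mimic the three columns shown above:

```python
import pandas as pd

# Hypothetical payload shaped like the indicator list response;
# the codes and names are made up, not real GHO indicators.
sample_json = {
    "value": [
        {"IndicatorCode": "DEMO_001", "IndicatorName": "Demo indicator A", "Language": "EN"},
        {"IndicatorCode": "DEMO_002", "IndicatorName": "Demo indicator B", "Language": "EN"},
    ]
}

# Each dict in the "value" list becomes one row; its keys become columns.
sample_df = pd.json_normalize(sample_json["value"])
print(sample_df.shape)
print(list(sample_df.columns))
```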

# Explore Indicators by keyword

# search "life expectancy"
indicators_df[indicators_df["IndicatorName"].str.contains("life expectancy", case = False)]

indicators_df[indicators_df["IndicatorName"].str.contains("mortality", case=False)]
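
The two keyword searches above repeat the same pattern, so they could be wrapped in a small helper. This is only a sketch: `search_indicators` is a hypothetical name, and the toy frame stands in for `indicators_df`. Passing `na=False` treats missing names as non-matches, and `regex=False` makes the keyword a literal string.

```python
import pandas as pd

def search_indicators(df, keyword, column="IndicatorName"):
    """Return rows whose `column` contains `keyword`, ignoring case."""
    mask = df[column].str.contains(keyword, case=False, na=False, regex=False)
    return df[mask]

# Toy stand-in for indicators_df, including one missing name.
toy = pd.DataFrame({
    "IndicatorCode": ["A1", "B2", "C3"],
    "IndicatorName": ["Life expectancy at birth", None, "Adult mortality rate"],
})

hits = search_indicators(toy, "MORTALITY")
print(hits["IndicatorCode"].tolist())
```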

# Request data for the chosen indicator

indicator_code = "SA_0000001473" # Alcohol related disease mortality, per 100000
indicator_url = f"{BASE_URL}/{indicator_code}"

# Send a GET request
response = requests.get(indicator_url)
response.raise_for_status()  # raises an error if the request failed

indicator_json = response.json()
len(indicator_json["value"])  # number of records returned by the API

358

# Turn data into a DataFrame

# normalize the JSON response 
who_raw_df = pd.json_normalize(indicator_json["value"])
who_raw_df.head()

# Filter the dataset to country-level observations
who_alcohol = who_raw_df[who_raw_df["SpatialDimType"] == "COUNTRY"].copy()
who_alcohol["SpatialDimType"].value_counts()
who_alcohol.head()
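
The country-level filter above runs in pandas after the full download. As an alternative, the same condition could probably be sent to the server as an OData `$filter` query option. This sketch only builds the URL and sends no request; that this endpoint accepts the exact expression shown is an assumption, and `build_filtered_url` is a hypothetical helper.

```python
from urllib.parse import quote

BASE_URL = "https://ghoapi.azureedge.net/api"

def build_filtered_url(indicator_code, filter_expr):
    """Build an indicator URL carrying an OData $filter expression.

    Assumption: the endpoint supports $filter on the given field.
    """
    return f"{BASE_URL}/{indicator_code}?$filter={quote(filter_expr)}"

url = build_filtered_url("SA_0000001473", "SpatialDimType eq 'COUNTRY'")
print(url)
```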

!pip install openpyxl

Defaulting to user installation because normal site-packages is not writeable
Collecting openpyxl
  Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 250.9/250.9 kB 5.2 MB/s eta 0:00:00a 0:00:01
Collecting et-xmlfile
  Downloading et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5

import site, sys

# Find the user-specific site-packages directory
user_site = site.getusersitepackages()
print("User site-packages:", user_site)

# Add this directory to sys.path
if user_site not in sys.path:
    sys.path.append(user_site)

# Now try importing openpyxl
import openpyxl
print("openpyxl version:", openpyxl.__version__)

User site-packages: /home/student/.local/lib/python3.10/site-packages
openpyxl version: 3.1.5
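
The `sys.path` workaround above is needed because pip fell back to a user installation. A small sketch of the same idea with a check that a module can be located before it is imported; the helper names are hypothetical and `json` is used only because it is always available:

```python
import importlib.util
import site
import sys

def ensure_user_site_on_path():
    """Append the per-user site-packages directory to sys.path if it is missing."""
    user_site = site.getusersitepackages()
    if user_site not in sys.path:
        sys.path.append(user_site)
    return user_site

def is_importable(module_name):
    """Return True if `module_name` can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None

ensure_user_site_on_path()
print(is_importable("json"))
print(is_importable("surely_not_a_real_module_xyz"))
```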

# Second data gathering and loading method

# Load the gathered Excel file
xls = pd.ExcelFile("WPP2024_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT.xlsx", engine="openpyxl")
xls.sheet_names

['Estimates', 'Medium variant', 'NOTES']

# 1) Read the sheet in raw form, without a header
raw = pd.read_excel(
    "WPP2024_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT.xlsx",
    sheet_name="Estimates",
    header=None,
    engine="openpyxl"
)

# We already looked at the first 40 rows; now find the header row automatically:
header_row = raw.index[raw.iloc[:, 0] == "Index"][0]
header_row

16
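
The header lookup above assumes that some cell in the first column equals "Index". A sketch of the same technique on a toy frame, with an explicit error when no such row exists; the toy rows are invented and far smaller than the real sheet:

```python
import pandas as pd

# Toy stand-in for a sheet read with header=None: title rows, then the real header.
raw_toy = pd.DataFrame([
    ["Some title", None, None],
    [None, None, None],
    ["Index", "Variant", "Year"],
    [1, "Estimates", 1950],
    [2, "Estimates", 1951],
])

# Locate the first row whose first cell is the literal header label.
matches = raw_toy.index[raw_toy.iloc[:, 0] == "Index"]
if len(matches) == 0:
    raise ValueError("No header row found: no cell in the first column equals 'Index'")
header_row_toy = int(matches[0])

# Use that row as column names and keep only the rows below it.
table = raw_toy.iloc[header_row_toy + 1:].reset_index(drop=True)
table.columns = raw_toy.iloc[header_row_toy].tolist()
print(header_row_toy)
print(table.columns.tolist())
```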

pop_df = pd.read_excel(
    "WPP2024_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT.xlsx",
    sheet_name="Estimates",
    header=header_row,
    engine="openpyxl"
)

pop_df.head()
pop_df.columns

Index(['Index', 'Variant', 'Region, subregion, country or area *', 'Notes',
       'Location code', 'ISO3 Alpha-code', 'ISO2 Alpha-code', 'SDMX code**',
       'Type', 'Parent code', 'Year',
       'Total Population, as of 1 January (thousands)',
       'Total Population, as of 1 July (thousands)',
       'Male Population, as of 1 July (thousands)',
       'Female Population, as of 1 July (thousands)',
       'Population Density, as of 1 July (persons per square km)',
       'Population Sex Ratio, as of 1 July (males per 100 females)',
       'Median Age, as of 1 July (years)',
       'Natural Change, Births minus Deaths (thousands)',
       'Rate of Natural Change (per 1,000 population)',
       'Population Change (thousands)', 'Population Growth Rate (percentage)',
       'Population Annual Doubling Time (years)', 'Births (thousands)',
       'Births by women aged 15 to 19 (thousands)',
       'Crude Birth Rate (births per 1,000 population)',
       'Total Fertility Rate (live births per woman)',
       'Net Reproduction Rate (surviving daughters per woman)',
       'Mean Age Childbearing (years)',
       'Sex Ratio at Birth (males per 100 female births)',
       'Total Deaths (thousands)', 'Male Deaths (thousands)',
       'Female Deaths (thousands)',
       'Crude Death Rate (deaths per 1,000 population)',
       'Life Expectancy at Birth, both sexes (years)',
       'Male Life Expectancy at Birth (years)',
       'Female Life Expectancy at Birth (years)',
       'Life Expectancy at Age 15, both sexes (years)',
       'Male Life Expectancy at Age 15 (years)',
       'Female Life Expectancy at Age 15 (years)',
       'Life Expectancy at Age 65, both sexes (years)',
       'Male Life Expectancy at Age 65 (years)',
       'Female Life Expectancy at Age 65 (years)',
       'Life Expectancy at Age 80, both sexes (years)',
       'Male Life Expectancy at Age 80 (years)',
       'Female Life Expectancy at Age 80 (years)',
       'Infant Deaths, under age 1 (thousands)',
       'Infant Mortality Rate (infant deaths per 1,000 live births)',
       'Live Births Surviving to Age 1 (thousands)',
       'Under-Five Deaths, under age 5 (thousands)',
       'Under-Five Mortality (deaths under age 5 per 1,000 live births)',
       'Mortality before Age 40, both sexes (deaths under age 40 per 1,000 live births)',
       'Male Mortality before Age 40 (deaths under age 40 per 1,000 male live births)',
       'Female Mortality before Age 40 (deaths under age 40 per 1,000 female live births)',
       'Mortality before Age 60, both sexes (deaths under age 60 per 1,000 live births)',
       'Male Mortality before Age 60 (deaths under age 60 per 1,000 male live births)',
       'Female Mortality before Age 60 (deaths under age 60 per 1,000 female live births)',
       'Mortality between Age 15 and 50, both sexes (deaths under age 50 per 1,000 alive at age 15)',
       'Male Mortality between Age 15 and 50 (deaths under age 50 per 1,000 males alive at age 15)',
       'Female Mortality between Age 15 and 50 (deaths under age 50 per 1,000 females alive at age 15)',
       'Mortality between Age 15 and 60, both sexes (deaths under age 60 per 1,000 alive at age 15)',
       'Male Mortality between Age 15 and 60 (deaths under age 60 per 1,000 males alive at age 15)',
       'Female Mortality between Age 15 and 60 (deaths under age 60 per 1,000 females alive at age 15)',
       'Net Number of Migrants (thousands)',
       'Net Migration Rate (per 1,000 population)'],
      dtype='object')

pop_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21983 entries, 0 to 21982
Data columns (total 65 columns):
 #   Column                                                                                          Non-Null Count  Dtype  
---  ------                                                                                          --------------  -----  
 0   Index                                                                                           21983 non-null  int64  
 1   Variant                                                                                         21983 non-null  object 
 2   Region, subregion, country or area *                                                            21983 non-null  object 
 3   Notes                                                                                           5628 non-null   object 
 4   Location code                                                                                   21983 non-null  int64  
 5   ISO3 Alpha-code                                                                                 17538 non-null  object 
 6   ISO2 Alpha-code                                                                                 17464 non-null  object 
 7   SDMX code**                                                                                     20942 non-null  float64
 8   Type                                                                                            21983 non-null  object 
 9   Parent code                                                                                     21983 non-null  int64  
 10  Year                                                                                            21978 non-null  float64
 11  Total Population, as of 1 January (thousands)                                                   21983 non-null  object 
 12  Total Population, as of 1 July (thousands)                                                      21983 non-null  object 
 13  Male Population, as of 1 July (thousands)                                                       21983 non-null  object 
 14  Female Population, as of 1 July (thousands)                                                     21983 non-null  object 
 15  Population Density, as of 1 July (persons per square km)                                        21983 non-null  object 
 16  Population Sex Ratio, as of 1 July (males per 100 females)                                      21983 non-null  object 
 17  Median Age, as of 1 July (years)                                                                21983 non-null  object 
 18  Natural Change, Births minus Deaths (thousands)                                                 21983 non-null  object 
 19  Rate of Natural Change (per 1,000 population)                                                   21983 non-null  object 
 20  Population Change (thousands)                                                                   21983 non-null  object 
 21  Population Growth Rate (percentage)                                                             21983 non-null  object 
 22  Population Annual Doubling Time (years)                                                         21983 non-null  object 
 23  Births (thousands)                                                                              21983 non-null  object 
 24  Births by women aged 15 to 19 (thousands)                                                       21983 non-null  object 
 25  Crude Birth Rate (births per 1,000 population)                                                  21983 non-null  object 
 26  Total Fertility Rate (live births per woman)                                                    21983 non-null  object 
 27  Net Reproduction Rate (surviving daughters per woman)                                           21983 non-null  object 
 28  Mean Age Childbearing (years)                                                                   21983 non-null  object 
 29  Sex Ratio at Birth (males per 100 female births)                                                21983 non-null  object 
 30  Total Deaths (thousands)                                                                        21983 non-null  object 
 31  Male Deaths (thousands)                                                                         21983 non-null  object 
 32  Female Deaths (thousands)                                                                       21983 non-null  object 
 33  Crude Death Rate (deaths per 1,000 population)                                                  21983 non-null  object 
 34  Life Expectancy at Birth, both sexes (years)                                                    21983 non-null  object 
 35  Male Life Expectancy at Birth (years)                                                           21983 non-null  object 
 36  Female Life Expectancy at Birth (years)                                                         21983 non-null  object 
 37  Life Expectancy at Age 15, both sexes (years)                                                   21983 non-null  object 
 38  Male Life Expectancy at Age 15 (years)                                                          21983 non-null  object 
 39  Female Life Expectancy at Age 15 (years)                                                        21983 non-null  object 
 40  Life Expectancy at Age 65, both sexes (years)                                                   21983 non-null  object 
 41  Male Life Expectancy at Age 65 (years)                                                          21983 non-null  object 
 42  Female Life Expectancy at Age 65 (years)                                                        21983 non-null  object 
 43  Life Expectancy at Age 80, both sexes (years)                                                   21983 non-null  object 
 44  Male Life Expectancy at Age 80 (years)                                                          21983 non-null  object 
 45  Female Life Expectancy at Age 80 (years)                                                        21983 non-null  object 
 46  Infant Deaths, under age 1 (thousands)                                                          21983 non-null  object 
 47  Infant Mortality Rate (infant deaths per 1,000 live births)                                     21983 non-null  object 
 48  Live Births Surviving to Age 1 (thousands)                                                      21983 non-null  object 
 49  Under-Five Deaths, under age 5 (thousands)                                                      21983 non-null  object 
 50  Under-Five Mortality (deaths under age 5 per 1,000 live births)                                 21983 non-null  object 
 51  Mortality before Age 40, both sexes (deaths under age 40 per 1,000 live births)                 21983 non-null  object 
 52  Male Mortality before Age 40 (deaths under age 40 per 1,000 male live births)                   21983 non-null  object 
 53  Female Mortality before Age 40 (deaths under age 40 per 1,000 female live births)               21983 non-null  object 
 54  Mortality before Age 60, both sexes (deaths under age 60 per 1,000 live births)                 21983 non-null  object 
 55  Male Mortality before Age 60 (deaths under age 60 per 1,000 male live births)                   21983 non-null  object 
 56  Female Mortality before Age 60 (deaths under age 60 per 1,000 female live births)               21983 non-null  object 
 57  Mortality between Age 15 and 50, both sexes (deaths under age 50 per 1,000 alive at age 15)     21983 non-null  object 
 58  Male Mortality between Age 15 and 50 (deaths under age 50 per 1,000 males alive at age 15)      21983 non-null  object 
 59  Female Mortality between Age 15 and 50 (deaths under age 50 per 1,000 females alive at age 15)  21983 non-null  object 
 60  Mortality between Age 15 and 60, both sexes (deaths under age 60 per 1,000 alive at age 15)     21983 non-null  object 
 61  Male Mortality between Age 15 and 60 (deaths under age 60 per 1,000 males alive at age 15)      21983 non-null  object 
 62  Female Mortality between Age 15 and 60 (deaths under age 60 per 1,000 females alive at age 15)  21983 non-null  object 
 63  Net Number of Migrants (thousands)                                                              21983 non-null  object 
 64  Net Migration Rate (per 1,000 population)                                                       21983 non-null  object 
dtypes: float64(2), int64(3), object(60)
memory usage: 10.9+ MB
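
Nearly all measurement columns above are `object` dtype, which suggests they contain some non-numeric text. A sketch of coercing several columns at once; the toy column names and the "..." placeholder are assumptions for illustration, not values confirmed from the workbook:

```python
import pandas as pd

# Toy frame: numeric values stored as strings, with "..." as an assumed placeholder.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Population (thousands)": ["10.5", "...", "3"],
    "Median Age (years)": ["30.1", "25", "..."],
})

measure_cols = ["Population (thousands)", "Median Age (years)"]

# Apply pd.to_numeric column by column; unparseable entries become NaN.
toy[measure_cols] = toy[measure_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
print(toy[measure_cols].isna().sum())
```

Comparing the NaN counts before and after coercion shows how many entries were placeholders rather than numbers.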

# Inspecting the dataframe visually

who_alcohol.head(50)

who_alcohol.sample(5)

who_alcohol.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358 entries, 0 to 357
Data columns (total 25 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Id                  358 non-null    int64  
 1   IndicatorCode       358 non-null    object 
 2   SpatialDimType      358 non-null    object 
 3   SpatialDim          358 non-null    object 
 4   ParentLocationCode  358 non-null    object 
 5   TimeDimType         358 non-null    object 
 6   ParentLocation      358 non-null    object 
 7   Dim1Type            358 non-null    object 
 8   Dim1                358 non-null    object 
 9   TimeDim             358 non-null    int64  
 10  Dim2Type            0 non-null      object 
 11  Dim2                0 non-null      object 
 12  Dim3Type            0 non-null      object 
 13  Dim3                0 non-null      object 
 14  DataSourceDimType   0 non-null      object 
 15  DataSourceDim       0 non-null      object 
 16  Value               358 non-null    object 
 17  NumericValue        209 non-null    float64
 18  Low                 0 non-null      object 
 19  High                0 non-null      object 
 20  Comments            0 non-null      object 
 21  Date                358 non-null    object 
 22  TimeDimensionValue  358 non-null    object 
 23  TimeDimensionBegin  358 non-null    object 
 24  TimeDimensionEnd    358 non-null    object 
dtypes: float64(1), int64(2), object(22)
memory usage: 70.0+ KB

# Inspecting the dataframe programmatically
who_alcohol["NumericValue"].isna().sum()

149

# Inspecting the dataframe programmatically

who_alcohol["Value"].unique()[:10]

array(['12.8', '.', '30.7', '4.2', '33.9', '10.9', '32.1', '26.9', '14.0',
       '31.6'], dtype=object)

# Inspecting the dataframe programmatically
who_alcohol["Dim1"].value_counts()

Dim1
SEX_FMLE    121
SEX_BTSX    119
SEX_MLE     118
Name: count, dtype: int64

# Inspecting the dataframe programmatically
who_alcohol[["Dim2Type", "Dim2", "Dim3Type","Dim3", "Low", "High", "Comments"]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358 entries, 0 to 357
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Dim2Type  0 non-null      object
 1   Dim2      0 non-null      object
 2   Dim3Type  0 non-null      object
 3   Dim3      0 non-null      object
 4   Low       0 non-null      object
 5   High      0 non-null      object
 6   Comments  0 non-null      object
dtypes: object(7)
memory usage: 19.7+ KB

who_alcohol.isna().sum().sort_values()

Id                      0
TimeDimensionValue      0
Date                    0
Value                   0
TimeDimensionBegin      0
TimeDim                 0
Dim1                    0
TimeDimensionEnd        0
ParentLocation          0
IndicatorCode           0
SpatialDimType          0
Dim1Type                0
SpatialDim              0
TimeDimType             0
ParentLocationCode      0
NumericValue          149
Dim3                  358
DataSourceDimType     358
DataSourceDim         358
Dim2                  358
Low                   358
High                  358
Comments              358
Dim2Type              358
Dim3Type              358
dtype: int64

# Inspecting the dataframe visually

# Initial Inspection 
pop_df.head(20)

# Inspecting the dataframe programmatically

pop_df["Type"].value_counts().head(20)

Type
Country/Area         17538
Subregion             1554
SDG region             666
Special other          666
Income Group           666
Region                 444
Development Group      370
World                   74
Label/Separator          5
Name: count, dtype: int64

pop_df["ISO3 Alpha-code"].isna().sum()

4445

# Inspecting the dataframe programmatically

pop_df["Notes"].isna().sum()

16355

# Apply the cleaning strategy

"""
    Fix "." 
    
    Issue: 
        - `Value` column contains "."
"""
# Convert Value to numeric, turning "." into NaN
who_alcohol["Value_clean"] = pd.to_numeric(who_alcohol["Value"], errors = "coerce")

# Apply the cleaning strategy
"""
    Fix missing mortality values
    
    Issue: 
        - `NumericValue` contains NaN for many rows
"""
# Fill NumericValue using Value_clean
who_alcohol["Mortality"] = who_alcohol["NumericValue"].fillna(who_alcohol["Value_clean"])

# Drop original columns
who_alcohol.drop(columns = ["NumericValue", "Value", "Value_clean"], inplace = True)

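# Validate the cleaning: the count is still 149, so every row that lacks
# `NumericValue` also has a non-numeric `Value` (such as "."), and nothing
# could be recovered from `Value`

who_alcohol["Mortality"].isna().sum()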

149
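
Why the count does not change can be seen on a toy frame: when `Value` is "." in exactly the rows where `NumericValue` is missing, coercion yields NaN there and `fillna` has nothing to fill in. The three rows below are illustrative, not taken from the API response:

```python
import pandas as pd

toy = pd.DataFrame({
    "Value": ["12.8", ".", "30.7"],
    "NumericValue": [12.8, None, 30.7],
})

# "." cannot be parsed, so it becomes NaN.
value_clean = pd.to_numeric(toy["Value"], errors="coerce")

# Fill missing NumericValue entries from the parsed Value column.
mortality = toy["NumericValue"].fillna(value_clean)

print(mortality.isna().sum())
# Inspect the raw strings behind the values that are still missing.
print(toy.loc[mortality.isna(), "Value"].unique())
```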

# Apply the cleaning strategy

empty_cols = ["Dim2Type", "Dim2", "Dim3Type", "Dim3", "Low", "High", "Comments", "DataSourceDimType","DataSourceDim"]
who_alcohol.drop(columns = empty_cols, inplace = True)

# Validate the cleaning was successful
who_alcohol.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358 entries, 0 to 357
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Id                  358 non-null    int64  
 1   IndicatorCode       358 non-null    object 
 2   SpatialDimType      358 non-null    object 
 3   SpatialDim          358 non-null    object 
 4   ParentLocationCode  358 non-null    object 
 5   TimeDimType         358 non-null    object 
 6   ParentLocation      358 non-null    object 
 7   Dim1Type            358 non-null    object 
 8   Dim1                358 non-null    object 
 9   TimeDim             358 non-null    int64  
 10  Date                358 non-null    object 
 11  TimeDimensionValue  358 non-null    object 
 12  TimeDimensionBegin  358 non-null    object 
 13  TimeDimensionEnd    358 non-null    object 
 14  Mortality           209 non-null    float64
dtypes: float64(1), int64(2), object(12)
memory usage: 42.1+ KB
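
The list of empty columns above was typed by hand from the `isna()` summary. A sketch of deriving it programmatically, so the list stays correct if the API response changes; the toy frame is illustrative:

```python
import pandas as pd

toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Dim2": [None, None, None],
    "Mortality": [12.8, None, 30.7],
    "Comments": [None, None, None],
})

# Columns in which every single value is missing.
empty_cols_toy = toy.columns[toy.isna().all()].tolist()
trimmed = toy.drop(columns=empty_cols_toy)

print(empty_cols_toy)
print(trimmed.columns.tolist())
```

`toy.dropna(axis=1, how="all")` gives the same result in one call; building the list first makes it possible to review what will be dropped.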

# Apply the cleaning strategy
who_alcohol.rename(columns={"Dim1": "Sex"}, inplace=True)
who_alcohol = who_alcohol[who_alcohol["Sex"] == "SEX_BTSX"].copy()

# Validate the cleaning was successful

who_alcohol["Sex"].value_counts()

Sex
SEX_BTSX    119
Name: count, dtype: int64

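who_alcohol = who_alcohol[["SpatialDim", "ParentLocation", "TimeDim", "Mortality"]].copy()

# Note: in the GHO API, `ParentLocation` appears to hold the WHO region name
# rather than the country name, so the "Country" label below is a loose one
who_alcohol.rename(columns={
    "SpatialDim": "CountryCode",
    "ParentLocation": "Country",
    "TimeDim": "Year"
}, inplace=True)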

who_alcohol.head()

# Keep only country-level rows, i.e. rows that have an ISO3 code

pop_clean = pop_df[pop_df["ISO3 Alpha-code"].notna()].copy()

# Validate programmatically
pop_clean["Type"].value_counts()

Type
Country/Area    17538
Name: count, dtype: int64

# Convert population column to numeric
pop_clean["TotalPopulation_thousands"] = pd.to_numeric(
    pop_clean["Total Population, as of 1 July (thousands)"], errors="coerce"
)

pop_clean["TotalPopulation"] = pop_clean["TotalPopulation_thousands"] * 1000

pop_clean["TotalPopulation"].describe()

count    1.753800e+04
mean     2.153491e+07
std      9.574493e+07
min      4.960000e+02
25%      2.204615e+05
50%      3.135058e+06
75%      1.089016e+07
max      1.438070e+09
Name: TotalPopulation, dtype: float64
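
`errors="coerce"` turns any entry that cannot be parsed into NaN without raising, so it can be worth counting how many values were lost in the conversion. In the output above the count (17,538) equals the number of country-level rows, which suggests nothing was lost here. A minimal sketch of such a check on made-up data (the `raw` series and its values are hypothetical, not taken from the UN file):

```python
import pandas as pd

# Hypothetical raw values in thousands; "..." stands in for a non-numeric placeholder
raw = pd.Series(["2254.938", "2305.746", "...", None, "51.906"])

converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but became NaN after coercion
coerced_to_nan = raw.notna() & converted.isna()

print(f"Values lost to coercion: {coerced_to_nan.sum()}")
print(raw[coerced_to_nan].unique())
```

Entries that were already missing in the raw data are excluded from the count, so only genuine parsing failures are reported.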

pop_clean = pop_clean[[
    "ISO3 Alpha-code",
    "Region, subregion, country or area *",
    "Year",
    "TotalPopulation"
]].copy()

pop_clean.rename(columns={
    "ISO3 Alpha-code": "CountryCode",
    "Region, subregion, country or area *": "Country"
}, inplace=True)

pop_clean.head()
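
The `Year` column in `pop_clean` is stored as a float (it displays as `1950.0`), while `Year` in the WHO data is `int64`. The merge still works, but aligning the key types avoids surprises. A minimal sketch, assuming a toy frame named `pop_toy` (hypothetical) with the same column names:

```python
import pandas as pd

# Toy stand-in for pop_clean; Year is float, as in the UN file
pop_toy = pd.DataFrame({
    "CountryCode": ["BDI", "BDI"],
    "Year": [1950.0, 1951.0],
    "TotalPopulation": [2254938.0, 2305746.0],
})

# Nullable Int64 keeps whole-number years as integers and tolerates missing values
pop_toy["Year"] = pop_toy["Year"].astype("Int64")

print(pop_toy.dtypes)
```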

# Remove unnecessary variables and combine datasets

who_alcohol.head()

who_alcohol.info()

<class 'pandas.core.frame.DataFrame'>
Index: 119 entries, 0 to 354
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   CountryCode  119 non-null    object 
 1   Country      119 non-null    object 
 2   Year         119 non-null    int64  
 3   Mortality    102 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 4.6+ KB

merged = who_alcohol.merge(pop_clean, 
                         on=["CountryCode", "Year"], 
                         how="inner")

merged.head()
merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CountryCode      119 non-null    object 
 1   Country_x        119 non-null    object 
 2   Year             119 non-null    int64  
 3   Mortality        102 non-null    float64
 4   Country_y        119 non-null    object 
 5   TotalPopulation  119 non-null    float64
dtypes: float64(2), int64(1), object(3)
memory usage: 5.7+ KB
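
An inner merge drops unmatched keys silently. Here all 119 WHO rows survive, but an audit with `indicator=True` makes that explicit and would show which keys fail to match. A sketch on toy frames (`left`, `right`, and the `XXX` row are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({
    "CountryCode": ["SWE", "NOR", "XXX"],  # "XXX" is a made-up code with no match
    "Year": [1997, 2002, 2001],
    "Mortality": [12.8, 10.9, 5.0],
})
right = pd.DataFrame({
    "CountryCode": ["SWE", "NOR"],
    "Year": [1997, 2002],
    "TotalPopulation": [8845524.0, 4538014.0],
})

# A left merge keeps every row of `left`; the _merge column records where each row matched.
# validate="many_to_one" raises if (CountryCode, Year) is duplicated in `right`.
audit = left.merge(right, on=["CountryCode", "Year"], how="left",
                   indicator=True, validate="many_to_one")

unmatched = audit[audit["_merge"] == "left_only"]
print(unmatched[["CountryCode", "Year"]])
```

If `unmatched` is empty, the inner merge and the left merge return the same rows.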

# Check whether Country_x and Country_y are identical

(merged["Country_x"] == merged["Country_y"]).all()

False

# Show only the mismatched rows
merged[merged["Country_x"] != merged["Country_y"]][["CountryCode", "Country_x", "Country_y"]]

# Country_x comes from WHO's ParentLocation (the WHO region); Country_y is the UN country name
merged.rename(columns={"Country_x": "Region", "Country_y": "Country"}, inplace=True)

merged.head()

import os
os.getcwd()

'/workspace'

# Create folders

os.makedirs("/workspace/data/raw", exist_ok=True)
os.makedirs("/workspace/data/clean", exist_ok=True)

# Save the raw and cleaned datasets
import shutil

# Save WHO raw dataset
who_raw_df.to_csv("/workspace/data/raw/who_alcohol_raw.csv", index=False)

# Save UN raw dataset 
shutil.copy(
    "WPP2024_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT.xlsx",
    "/workspace/data/raw/un_population_raw.xlsx"
)

'/workspace/data/raw/un_population_raw.xlsx'

# Save clean WHO dataset
who_alcohol.to_csv("/workspace/data/clean/who_alcohol.csv", index=False)

# Save clean UN dataset
pop_clean.to_csv("/workspace/data/clean/un_population_clean.csv", index=False)

# Save merged dataset
merged.to_csv("/workspace/data/clean/alcohol_population_merged.csv", index=False)
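
The hard-coded `/workspace` paths tie the notebook to one machine. One alternative is to build the same layout from a base directory with `pathlib` and read the file back to confirm it was written as expected. A sketch, with a temporary directory and a toy frame standing in for the real ones (both hypothetical):

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"CountryCode": ["SWE", "NOR"], "Year": [1997, 2002]})

with tempfile.TemporaryDirectory() as tmp:
    # In the notebook, the base directory could be Path.cwd() / "data"
    base = Path(tmp) / "data"
    clean_dir = base / "clean"
    clean_dir.mkdir(parents=True, exist_ok=True)

    out_path = clean_dir / "toy_clean.csv"
    toy.to_csv(out_path, index=False)

    # Round trip: read the file back while the directory still exists
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```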

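# Visual 1
import matplotlib.pyplot as plt
import seaborn as sns

# The WHO indicator (SA_0000001473) is listed as "per 100,000", so it is already a rate.
# Use it directly instead of dividing by population a second time.
merged["Mortality_per_100k"] = merged["Mortality"]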

# Bar Chart: Top 10 Countries by Alcohol-Related Mortality Rate
top10 = merged.sort_values("Mortality_per_100k", ascending=False).head(10)

plt.figure(figsize=(12,6))
sns.barplot(data=top10, x="Country", y="Mortality_per_100k", palette="magma")
plt.xticks(rotation=45)
plt.title("Top 10 Countries by Alcohol-Related Mortality Rate (per 100k)")
plt.ylabel("Mortality per 100,000 population")
plt.xlabel("Country")
plt.show()
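
Because `merged` holds one row per country and year, the same country can occupy several of the ten slots. One option, sketched here on made-up numbers, is to average each country's rate across years before ranking. The averaging choice is an assumption, not part of the original analysis, and the `toy` frame is hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C"],
    "Year": [2001, 2002, 2001, 2002, 2001],
    "Mortality_per_100k": [30.0, 34.0, 10.0, None, 20.0],
})

# Drop missing rates, average per country, then rank so each country appears once
per_country = (
    toy.dropna(subset=["Mortality_per_100k"])
       .groupby("Country", as_index=False)["Mortality_per_100k"]
       .mean()
       .sort_values("Mortality_per_100k", ascending=False)
)

top = per_country.head(2)
print(top)
```

Passing `top` to `sns.barplot` would then draw exactly one bar per country.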

#Visual 2 - Scatter Plot: Population vs Mortality Rate

plt.figure(figsize=(10,6))
sns.scatterplot(
    data=merged,
    x="TotalPopulation",
    y="Mortality_per_100k",
    hue="Country",
    s=120
)
plt.xscale("log")  
plt.title("Relationship Between Population Size and Alcohol-Related Mortality Rate")
plt.xlabel("Total Population (log scale)")
plt.ylabel("Mortality per 100,000 population")
plt.legend(bbox_to_anchor=(1,1))
plt.show()

	IndicatorCode	IndicatorName	Language
2001	WHOSIS_000007	Healthy life expectancy (HALE) at age 60 (years)	EN
2012	WHOSIS_000001	Life expectancy at birth (years)	EN
2024	WHOSIS_000002	Healthy life expectancy (HALE) at birth (years)	EN
2045	WHOSIS_000015	Life expectancy at age 60 (years)	EN

	IndicatorCode	IndicatorName	Language
169	MORTADO	Adolescent mortality rate (per 1 000 age speci...	EN
248	imr	Infant mortality rate (deaths per 1000 live bi...	EN
332	MDG_0000000007	Under-five mortality rate (probability of dyin...	EN
375	MDG_0000000001	Infant mortality rate (probability of dying be...	EN
402	MDG_0000000032	Maternal mortality ratio (per 100 000 live bir...	EN
457	nmr	Neonatal mortality rate (deaths per 1000 live ...	EN
578	MDG_0000000026	Maternal mortality ratio (per 100 000 live bir...	EN
889	SA_0000001473	Alcohol-related disease mortality, per 100,000...	EN
899	GHE_YLLNUM	Years of life lost from mortality (YLLs)	EN
900	GHE_YLLRATE	Years of life lost from mortality (YLLs) (per ...	EN
1105	SA_0000001472	Alcohol-related injury mortality, per 1,000	EN
1492	WHS2_161	Age-standardized mortality rate by cause (per ...	EN
1524	WHS2_131	Age-standardized NCD mortality rate (per 100 ...	EN
1526	WHS2_160	Age-standardized mortality rate by cause (per ...	EN
1749	SDGWSHBOD	Mortality rate attributed to exposure to unsaf...	EN
1793	u5mr	Under-five mortality rate (deaths per 1000 liv...	EN
1862	SDGPOISON	Mortality rate attributed to unintentional poi...	EN
1863	SDGROADAGE	Age-standardized road traffic mortality (per ...	EN
1990	WHOSIS_000003	Neonatal mortality rate (per 1000 live births)	EN
2013	WHOSIS_000004	Adult mortality rate (probability of dying bet...	EN
2051	WHS10_4	Number of national population surveys - child ...	EN
2077	CHILDMORT5TO14	Mortality rate for 5-14 year-olds (probability...	EN
2321	WHS10_5	Number of national population surveys - matern...	EN
2451	GHE_YLL_NUMERIC	Years of life lost from mortality (YLLs)	EN
2489	NCD_CCS_MORT_TARGET	Existence of a national target on NCD mortality	EN
2496	PRISON_D3_DEATHS_DRUG_MRATE	In-prison drug overdose mortality rate (per 10...	EN
2523	WHOSIS_000016	Mortality rate among children ages 5 to 9 year...	EN
2649	PRISON_D3_DEATHS_SUICIDE_MRATE	In-prison suicide mortality rate (per 100 000 ...	EN
2704	PRISON_D3_DEATHS_COVID_MRATE	In-prison COVID-19 mortality rate (per 100 000...	EN
2708	CHILDMORT10TO19	Adolescent mortality rate (per 1 000 age speci...	EN
2745	CHILDMORT_MORTALITY_10TO14	Mortality rate among children ages 10 to 14 ye...	EN
2998	MALARIA_EST_MORTALITY	Estimated malaria mortality rate (per 100 000 ...	EN

	CountryCode	Country	Year	TotalPopulation
2594	BDI	Burundi	1950.0	2254938.0
2595	BDI	Burundi	1951.0	2305746.0
2596	BDI	Burundi	1952.0	2355804.0
2597	BDI	Burundi	1953.0	2405186.0
2598	BDI	Burundi	1954.0	2454586.0

	CountryCode	Country_x	Country_y
0	SWE	Europe	Sweden
1	MHL	Western Pacific	Marshall Islands
2	NOR	Europe	Norway
3	MEX	Americas	Mexico
4	DNK	Europe	Denmark
...	...	...	...
114	TUR	Europe	Türkiye
115	LTU	Europe	Lithuania
116	DEU	Europe	Germany
117	MHL	Western Pacific	Marshall Islands
118	DEU	Europe	Germany

	CountryCode	Region	Year	Mortality	Country	TotalPopulation
0	SWE	Europe	1997	12.8	Sweden	8845524.0
1	MHL	Western Pacific	2005	33.9	Marshall Islands	51906.0
2	NOR	Europe	2002	10.9	Norway	4538014.0
3	MEX	Americas	2001	NaN	Mexico	100099099.0
4	DNK	Europe	1994	26.9	Denmark	5206190.0

	Id	IndicatorCode	SpatialDimType	SpatialDim	ParentLocationCode	TimeDimType	ParentLocation	Dim1Type	Dim1	TimeDim	...	DataSourceDim	Value	NumericValue	Low	High	Comments	Date	TimeDimensionValue	TimeDimensionBegin	TimeDimensionEnd
0	6284754	SA_0000001473	COUNTRY	SWE	EUR	YEAR	Europe	SEX	SEX_BTSX	1997	...	None	12.8	12.8	None	None	None	2013-06-11T14:15:34+02:00	1997	1997-01-01T00:00:00+01:00	1997-12-31T00:00:00+01:00
1	385071	SA_0000001473	COUNTRY	LTU	EUR	YEAR	Europe	SEX	SEX_MLE	1990	...	None	.	NaN	None	None	None	2013-06-11T14:15:34+02:00	1990	1990-01-01T00:00:00+01:00	1990-12-31T00:00:00+01:00
2	1375206	SA_0000001473	COUNTRY	DEU	EUR	YEAR	Europe	SEX	SEX_MLE	1991	...	None	30.7	30.7	None	None	None	2013-06-11T14:15:34+02:00	1991	1991-01-01T00:00:00+01:00	1991-12-31T00:00:00+01:00
3	2333151	SA_0000001473	COUNTRY	LTU	EUR	YEAR	Europe	SEX	SEX_FMLE	1993	...	None	.	NaN	None	None	None	2013-06-11T14:15:34+02:00	1993	1993-01-01T00:00:00+01:00	1993-12-31T00:00:00+01:00
4	4299804	SA_0000001473	COUNTRY	NOR	EUR	YEAR	Europe	SEX	SEX_FMLE	2003	...	None	4.2	4.2	None	None	None	2013-06-11T14:15:34+02:00	2003	2003-01-01T00:00:00+01:00	2003-12-31T00:00:00+01:00

	Id	IndicatorCode	SpatialDimType	SpatialDim	ParentLocationCode	TimeDimType	ParentLocation	Dim1Type	Dim1	TimeDim	...	DataSourceDim	Value	NumericValue	Low	High	Comments	Date	TimeDimensionValue	TimeDimensionBegin	TimeDimensionEnd
339	6093941	SA_0000001473	COUNTRY	LVA	EUR	YEAR	Europe	SEX	SEX_BTSX	2009	...	None	19.0	19.0	None	None	None	2013-06-11T14:15:34+02:00	2009	2009-01-01T00:00:00+01:00	2009-12-31T00:00:00+01:00
255	9161552	SA_0000001473	COUNTRY	FIN	EUR	YEAR	Europe	SEX	SEX_FMLE	2009	...	None	.	NaN	None	None	None	2013-06-11T14:15:34+02:00	2009	2009-01-01T00:00:00+01:00	2009-12-31T00:00:00+01:00
174	5361902	SA_0000001473	COUNTRY	NOR	EUR	YEAR	Europe	SEX	SEX_FMLE	1999	...	None	.	NaN	None	None	None	2013-06-11T14:15:34+02:00	1999	1999-01-01T00:00:00+01:00	1999-12-31T00:00:00+01:00
297	5909478	SA_0000001473	COUNTRY	CZE	EUR	YEAR	Europe	SEX	SEX_MLE	2002	...	None	90.0	90.0	None	None	None	2013-06-11T14:15:34+02:00	2002	2002-01-01T00:00:00+01:00	2002-12-31T00:00:00+01:00
337	1247771	SA_0000001473	COUNTRY	SWE	EUR	YEAR	Europe	SEX	SEX_MLE	2002	...	None	27.0	27.0	None	None	None	2013-06-11T14:15:34+02:00	2002	2002-01-01T00:00:00+01:00	2002-12-31T00:00:00+01:00

	Index	Variant	Region, subregion, country or area *	Notes	Location code	ISO3 Alpha-code	ISO2 Alpha-code	SDMX code**	Type	...	Male Mortality before Age 60 (deaths under age 60 per 1,000 male live births)	Female Mortality before Age 60 (deaths under age 60 per 1,000 female live births)	Mortality between Age 15 and 50, both sexes (deaths under age 50 per 1,000 alive at age 15)	Male Mortality between Age 15 and 50 (deaths under age 50 per 1,000 males alive at age 15)	Female Mortality between Age 15 and 50 (deaths under age 50 per 1,000 females alive at age 15)	Mortality between Age 15 and 60, both sexes (deaths under age 60 per 1,000 alive at age 15)	Male Mortality between Age 15 and 60 (deaths under age 60 per 1,000 males alive at age 15)	Female Mortality between Age 15 and 60 (deaths under age 60 per 1,000 females alive at age 15)
0	1	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	580.5	497.388	238.516	268.734	207.62	375.391	426.221	322.65
1	2	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	566.566	488.435	229.703	256.236	202.734	365.226	412.76	316.395
2	3	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	546.444	475.37	217.311	238.56	195.926	350.613	393.364	307.314
3	4	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	535.811	467.361	211.257	230.961	191.482	342.734	383.875	301.27
4	5	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	522.058	455.621	203.337	221.377	185.296	332.327	371.737	292.807
5	6	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	514.72	447.104	199.594	217.779	181.41	327.447	367.308	287.446
6	7	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	507.214	439.595	196.203	214.202	178.21	323.096	362.805	283.257
7	8	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	503.041	435.133	194.296	212.197	176.408	320.764	360.581	280.835
8	9	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	495.575	427.387	190.913	208.967	172.851	315.308	355.044	275.449
9	10	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	525.642	455.314	202.942	223.707	182.052	332.92	375.396	290.021
10	11	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	551.721	480.332	214.608	237.522	191.453	349.468	393.991	304.225
11	12	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	516.3	445.337	199.138	220.017	178.094	327.37	369.846	284.485
12	13	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	473.158	403.755	179.709	197.386	161.984	300.651	340.388	260.936
13	14	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	466.352	397.364	176.593	193.945	159.168	296.424	335.714	257.204
14	15	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	459.158	389.924	173.07	190.414	155.613	291.361	330.474	252.34
15	16	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	465.949	391.46	176.325	196.739	155.561	295.373	337.598	252.861
16	17	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	455.293	384.766	170.328	188.421	152.005	288.002	327.742	248.281
17	18	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	447.962	376.879	165.537	184.02	146.783	282.174	322.109	242.23
18	19	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	439.806	368.526	161.587	180.165	142.707	277.451	317.265	237.608
19	20	Estimates	World	NaN	900	NaN	NaN	1.0	World	...	434.278	363.088	158.623	177.186	139.726	274.002	313.646	234.306

Real-world Data Wrangling

1. Gather data

1.1. Problem Statement

1.2. Gathering at least two datasets using two different data gathering methods

Dataset 1 - WHO Alcohol-Related Disease Mortality (Programmatic Gathering)

Dataset 2

Why were these two datasets selected?

2. Assess data

Dataset 1: WHO Alcohol-related disease mortality

Quality Issue 1:

Missing mortality values (NumericValue has many NaNs)

Quality Issue 2:

Tidiness Issue 1:

Inspecting the dataframe visually

Tidiness Issue 2:

Inspecting the dataframe visually

Dataset 2: UN Population Dataset

Quality Issue 1:

Quality Issue 2:

Inspecting the dataframe visually

Tidiness Issue 1:

Inspecting the dataframe visually

Tidiness Issue 2:

Inspecting the dataframe visually

3. Clean data

Dataset 1: WHO Alcohol-related disease mortality

Quality Issue 1:

Quality Issue 2:

Tidiness Issue 1: Remove useless empty columns

Tidiness Issue 2: Tidy sex variable

Dataset 2: UN Population Dataset

Quality Issue 1: Keep only country-level rows

Quality Issue 2: Convert population column to numeric

Remove unnecessary variables and combine datasets

4. Update the data store

5. Answer the research question

5.1: Define and answer the research question

5.2: Reflection