Real-world Data WranglingΒΆ

In this project, an end-to-end data preparation workflow is implemented using two real-world datasets. The data is collected and extracted, evaluated for quality and structural issues through both automated checks and visual inspection, and cleaned using a transparent, well-justified strategy. The refined data is stored in a selected database or data store, integrated into a unified analysis dataset, and used to answer a targeted research question. Key methodological choices are documented, and the code is annotated to support readability and reproducibility.

InΒ [1]:
!python -m pip install kaggle==1.6.12
Defaulting to user installation because normal site-packages is not writeable
Collecting kaggle==1.6.12
  Downloading kaggle-1.6.12.tar.gz (79 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79.7/79.7 kB 3.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: six>=1.10 in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (1.16.0)
Collecting certifi>=2023.7.22
  Downloading certifi-2025.11.12-py3-none-any.whl (159 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 159.4/159.4 kB 9.7 MB/s eta 0:00:00
Requirement already satisfied: python-dateutil in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (2.9.0.post0)
Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (2.29.0)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (4.65.0)
Collecting python-slugify
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Requirement already satisfied: urllib3 in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (1.26.15)
Requirement already satisfied: bleach in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (6.1.0)
Requirement already satisfied: webencodings in /opt/conda/lib/python3.10/site-packages (from bleach->kaggle==1.6.12) (0.5.1)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.2/78.2 kB 16.3 MB/s eta 0:00:00
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests->kaggle==1.6.12) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests->kaggle==1.6.12) (3.4)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.6.12-py3-none-any.whl size=102969 sha256=35252a1665ce54934064e043d40d970d45edd3ce256e9a015f30548687f39fad
  Stored in directory: /home/student/.cache/pip/wheels/1e/0b/7c/50f8e89c3d2f82838dbd7afeddffbb9357003009ada98216c7
Successfully built kaggle
Installing collected packages: text-unidecode, python-slugify, certifi, kaggle
  WARNING: The script slugify is installed in '/home/student/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script kaggle is installed in '/home/student/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed certifi-2025.11.12 kaggle-1.6.12 python-slugify-8.0.4 text-unidecode-1.3
InΒ [2]:
!pip install --target=/workspace ucimlrepo numpy==1.24.3
Collecting ucimlrepo
  Using cached ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Collecting numpy==1.24.3
  Using cached numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting pandas>=1.0.0
  Using cached pandas-2.3.3-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (12.8 MB)
Collecting certifi>=2020.12.5
  Using cached certifi-2025.11.12-py3-none-any.whl (159 kB)
Collecting python-dateutil>=2.8.2
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
Collecting pytz>=2020.1
  Using cached pytz-2025.2-py2.py3-none-any.whl (509 kB)
Collecting tzdata>=2022.7
  Using cached tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Collecting six>=1.5
  Using cached six-1.17.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: pytz, tzdata, six, numpy, certifi, python-dateutil, pandas, ucimlrepo
Successfully installed certifi-2025.11.12 numpy-1.24.3 pandas-2.3.3 python-dateutil-2.9.0.post0 pytz-2025.2 six-1.17.0 tzdata-2025.2 ucimlrepo-0.0.7
WARNING: Target directory /workspace/dateutil already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/ucimlrepo already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/__pycache__ already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/pandas already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/python_dateutil-2.9.0.post0.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/tzdata already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/numpy-1.24.3.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/numpy already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/six-1.17.0.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/six.py already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/tzdata-2025.2.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/pandas-2.3.3.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/pytz already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/certifi-2025.11.12.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/ucimlrepo-0.0.7.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/numpy.libs already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/certifi already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/pytz-2025.2.dist-info already exists. Specify --upgrade to force replacement.
WARNING: Target directory /workspace/bin already exists. Specify --upgrade to force replacement.

Note: Restart the kernel to use updated package(s).

1. Gather dataΒΆ

In this section, data is collected from two independent sources using two different acquisition methods. After retrieval and initial parsing, the datasets are aligned and combined into a single analysis-ready dataset that will be used in the subsequent cleaning and analysis steps.

1.1. Problem StatementΒΆ

1.2. Gathering at least two datasets using two different data gathering methodsΒΆ

List of data gathering methods:

  • Download data manually
  • Programmatically downloading files
  • Gather data by accessing APIs
  • Gather and extract data from HTML files using BeautifulSoup
  • Extract data from a SQL database

Dataset 1 - WHO Alcohol - Related Disease Mortality (Programmatic Gathering)ΒΆ

Type: JSON data retrieved from an online API

Source: World Health Organization (WHO) - Global Health Observatory (GHO) API

Method: WHO's public API was used to programmatically query the dataset. The dataset was gathered programmatically by sending an HTTP GET request to the WHO Global Health Observatory API. First, the list of WHO indicators was queried to locate the corrrect indicator code. The code for Alcohol-Related disease mortality (per 100000 population) is:

  • IndicatorCode: SA_0000001473

Then the dataset was retrieved from the API endpoint. https://ghoapi.azureedge.net/api/SA_0000001473. The JSON response was normalized into tabular form using pandas.jason_normalize(). Finally, the dataset was filtered to include only country-level observations, as regional or global aggregates were not relevant for the analysis.

InΒ [1]:
import requests 
import pandas as pd

# requests --> to call the API
# pandas --> to work with tables
BASE_URL = "https://ghoapi.azureedge.net/api"
InΒ [2]:
# Qery the indicator list endpoint to locate the correct indicator code
indicators_url = f"{BASE_URL}/Indicator"
response = requests.get(indicators_url)
response.raise_for_status() # check error occurance

indicators_json = response.json()
len(indicators_json["value"])
Out[2]:
3047
InΒ [3]:
# Turn the JSON into a DataFrame

indicators_df = pd.json_normalize(indicators_json["value"])
indicators_df.head()
indicators_df.columns
Out[3]:
Index(['IndicatorCode', 'IndicatorName', 'Language'], dtype='object')
InΒ [4]:
# Explore Indicators by keyword

# search "life expectancy"
indicators_df[indicators_df["IndicatorName"].str.contains("life expectancy", case = False)]
Out[4]:
IndicatorCode IndicatorName Language
2001 WHOSIS_000007 Healthy life expectancy (HALE) at age 60 (years) EN
2012 WHOSIS_000001 Life expectancy at birth (years) EN
2024 WHOSIS_000002 Healthy life expectancy (HALE) at birth (years) EN
2045 WHOSIS_000015 Life expectancy at age 60 (years) EN
InΒ [5]:
indicators_df[indicators_df["IndicatorName"].str.contains("mortality", case=False)]
Out[5]:
IndicatorCode IndicatorName Language
169 MORTADO Adolescent mortality rate (per 1 000 age speci... EN
248 imr Infant mortality rate (deaths per 1000 live bi... EN
332 MDG_0000000007 Under-five mortality rate (probability of dyin... EN
375 MDG_0000000001 Infant mortality rate (probability of dying be... EN
402 MDG_0000000032 Maternal mortality ratio (per 100 000 live bir... EN
457 nmr Neonatal mortality rate (deaths per 1000 live ... EN
578 MDG_0000000026 Maternal mortality ratio (per 100 000 live bir... EN
889 SA_0000001473 Alcohol-related disease mortality, per 100,000... EN
899 GHE_YLLNUM Years of life lost from mortality (YLLs) EN
900 GHE_YLLRATE Years of life lost from mortality (YLLs) (per ... EN
1105 SA_0000001472 Alcohol-related injury mortality, per 1,000 EN
1492 WHS2_161 Age-standardized mortality rate by cause (per ... EN
1524 WHS2_131 Age-standardized NCD mortality rate (per 100 ... EN
1526 WHS2_160 Age-standardized mortality rate by cause (per ... EN
1749 SDGWSHBOD Mortality rate attributed to exposure to unsaf... EN
1793 u5mr Under-five mortality rate (deaths per 1000 liv... EN
1862 SDGPOISON Mortality rate attributed to unintentional poi... EN
1863 SDGROADAGE Age-standardized road traffic mortality (per ... EN
1990 WHOSIS_000003 Neonatal mortality rate (per 1000 live births) EN
2013 WHOSIS_000004 Adult mortality rate (probability of dying bet... EN
2051 WHS10_4 Number of national population surveys - child ... EN
2077 CHILDMORT5TO14 Mortality rate for 5-14 year-olds (probability... EN
2321 WHS10_5 Number of national population surveys - matern... EN
2451 GHE_YLL_NUMERIC Years of life lost from mortality (YLLs) EN
2489 NCD_CCS_MORT_TARGET Existence of a national target on NCD mortality EN
2496 PRISON_D3_DEATHS_DRUG_MRATE In-prison drug overdose mortality rate (per 10... EN
2523 WHOSIS_000016 Mortality rate among children ages 5 to 9 year... EN
2649 PRISON_D3_DEATHS_SUICIDE_MRATE In-prison suicide mortality rate (per 100 000 ... EN
2704 PRISON_D3_DEATHS_COVID_MRATE In-prison COVID-19 mortality rate (per 100 000... EN
2708 CHILDMORT10TO19 Adolescent mortality rate (per 1 000 age speci... EN
2745 CHILDMORT_MORTALITY_10TO14 Mortality rate among children ages 10 to 14 ye... EN
2998 MALARIA_EST_MORTALITY Estimated malaria mortality rate (per 100 000 ... EN
InΒ [6]:
# Request data for the chosen indicator

indicator_code = "SA_0000001473" # Alcohol related disease mortality, per 100000
indicator_url = f"{BASE_URL}/{indicator_code}"

# send a GET request 
response = requests.get(indicator_url)
response.raise_for_status() # throws an error if request failed

indicator_json = response.json()
len(indicator_json["value"]) # will show how many rows (records) we got from the API
Out[6]:
358
InΒ [7]:
# Turn data into a DataFrame

# normalize the JSON response 
who_raw_df = pd.json_normalize(indicator_json["value"])
who_raw_df.head()
Out[7]:
Id IndicatorCode SpatialDimType SpatialDim ParentLocationCode TimeDimType ParentLocation Dim1Type Dim1 TimeDim ... DataSourceDim Value NumericValue Low High Comments Date TimeDimensionValue TimeDimensionBegin TimeDimensionEnd
0 6284754 SA_0000001473 COUNTRY SWE EUR YEAR Europe SEX SEX_BTSX 1997 ... None 12.8 12.8 None None None 2013-06-11T14:15:34+02:00 1997 1997-01-01T00:00:00+01:00 1997-12-31T00:00:00+01:00
1 385071 SA_0000001473 COUNTRY LTU EUR YEAR Europe SEX SEX_MLE 1990 ... None . NaN None None None 2013-06-11T14:15:34+02:00 1990 1990-01-01T00:00:00+01:00 1990-12-31T00:00:00+01:00
2 1375206 SA_0000001473 COUNTRY DEU EUR YEAR Europe SEX SEX_MLE 1991 ... None 30.7 30.7 None None None 2013-06-11T14:15:34+02:00 1991 1991-01-01T00:00:00+01:00 1991-12-31T00:00:00+01:00
3 2333151 SA_0000001473 COUNTRY LTU EUR YEAR Europe SEX SEX_FMLE 1993 ... None . NaN None None None 2013-06-11T14:15:34+02:00 1993 1993-01-01T00:00:00+01:00 1993-12-31T00:00:00+01:00
4 4299804 SA_0000001473 COUNTRY NOR EUR YEAR Europe SEX SEX_FMLE 2003 ... None 4.2 4.2 None None None 2013-06-11T14:15:34+02:00 2003 2003-01-01T00:00:00+01:00 2003-12-31T00:00:00+01:00

5 rows Γ— 25 columns

InΒ [8]:
# Filter the dataset to country-level observations
who_alcohol = who_raw_df[who_raw_df["SpatialDimType"] == "COUNTRY"].copy()
who_alcohol["SpatialDimType"].value_counts()
who_alcohol.head()
Out[8]:
Id IndicatorCode SpatialDimType SpatialDim ParentLocationCode TimeDimType ParentLocation Dim1Type Dim1 TimeDim ... DataSourceDim Value NumericValue Low High Comments Date TimeDimensionValue TimeDimensionBegin TimeDimensionEnd
0 6284754 SA_0000001473 COUNTRY SWE EUR YEAR Europe SEX SEX_BTSX 1997 ... None 12.8 12.8 None None None 2013-06-11T14:15:34+02:00 1997 1997-01-01T00:00:00+01:00 1997-12-31T00:00:00+01:00
1 385071 SA_0000001473 COUNTRY LTU EUR YEAR Europe SEX SEX_MLE 1990 ... None . NaN None None None 2013-06-11T14:15:34+02:00 1990 1990-01-01T00:00:00+01:00 1990-12-31T00:00:00+01:00
2 1375206 SA_0000001473 COUNTRY DEU EUR YEAR Europe SEX SEX_MLE 1991 ... None 30.7 30.7 None None None 2013-06-11T14:15:34+02:00 1991 1991-01-01T00:00:00+01:00 1991-12-31T00:00:00+01:00
3 2333151 SA_0000001473 COUNTRY LTU EUR YEAR Europe SEX SEX_FMLE 1993 ... None . NaN None None None 2013-06-11T14:15:34+02:00 1993 1993-01-01T00:00:00+01:00 1993-12-31T00:00:00+01:00
4 4299804 SA_0000001473 COUNTRY NOR EUR YEAR Europe SEX SEX_FMLE 2003 ... None 4.2 4.2 None None None 2013-06-11T14:15:34+02:00 2003 2003-01-01T00:00:00+01:00 2003-12-31T00:00:00+01:00

5 rows Γ— 25 columns

Dataset 2ΒΆ

Type: Excel file (.xlsx) manually downloaded from the United Nations website

Source: United Nations Department of Economic and Social Affairs (UN DESA)

Method: Manual download from the official UN World Population Prospects (WPP) Download Center

The dataset contains annual demographic indicators from 1950s to 2023 for all UN-recognized countries and regions. It includes population estimates, mortality, fertility, migration, and several demographic measures.

Dataset variables:

  • Population density

  • Male/Female population

  • Births, deaths, crude birth rate, crude death rate

  • Life expectancy indicators (at birth, age 15, age 65, age 80)

  • Fertility indicators

  • Mortality indicators (infant, under-five, age-specific)

  • Migration indicators

Why these two datasets were selected?ΒΆ

Alcohol-related disease mortality is fundamentally population-normalized indicator. Population size is necessary to interpret mortality risk meaningfully across countries and years. Without demographic context, raw mortality values or inconsistent indicators can lead to misleading comparisons. Therefore, combining WHO mortality data with UN population estimates allows standardized, comparable public health insights.

InΒ [9]:
!pip install openpyxl
Defaulting to user installation because normal site-packages is not writeable
Collecting openpyxl
  Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 250.9/250.9 kB 5.2 MB/s eta 0:00:00a 0:00:01
Collecting et-xmlfile
  Downloading et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5
InΒ [10]:
import site, sys

# KullanΔ±cΔ±ya ΓΆzel site-packages dizinini bul
user_site = site.getusersitepackages()
print("User site-packages:", user_site)

# Bu dizini sys.path'e ekle
if user_site not in sys.path:
    sys.path.append(user_site)

# Şimdi openpyxl'ı import etmeyi dene
import openpyxl
print("openpyxl version:", openpyxl.__version__)
User site-packages: /home/student/.local/lib/python3.10/site-packages
openpyxl version: 3.1.5
InΒ [11]:
#FILL IN 2nd data gathering and loading method

# Load the gathered excel File
xls = pd.ExcelFile("WPP2024_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT.xlsx", engine="openpyxl")
xls.sheet_names
Out[11]:
['Estimates', 'Medium variant', 'NOTES']
InΒ [12]:
# 1) Sheet'i ham haliyle, başlıksız oku
raw = pd.read_excel(
    "WPP2024_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT.xlsx",
    sheet_name="Estimates",
    header=None,
    engine="openpyxl"
)

# İlk 40 satırı zaten gârdük, şimdi header satırını otomatik bulalım:
header_row = raw.index[raw.iloc[:, 0] == "Index"][0]
header_row
Out[12]:
16
InΒ [13]:
pop_df = pd.read_excel(
    "WPP2024_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT.xlsx",
    sheet_name="Estimates",
    header=header_row,
    engine="openpyxl"
)

pop_df.head()
pop_df.columns
Out[13]:
Index(['Index', 'Variant', 'Region, subregion, country or area *', 'Notes',
       'Location code', 'ISO3 Alpha-code', 'ISO2 Alpha-code', 'SDMX code**',
       'Type', 'Parent code', 'Year',
       'Total Population, as of 1 January (thousands)',
       'Total Population, as of 1 July (thousands)',
       'Male Population, as of 1 July (thousands)',
       'Female Population, as of 1 July (thousands)',
       'Population Density, as of 1 July (persons per square km)',
       'Population Sex Ratio, as of 1 July (males per 100 females)',
       'Median Age, as of 1 July (years)',
       'Natural Change, Births minus Deaths (thousands)',
       'Rate of Natural Change (per 1,000 population)',
       'Population Change (thousands)', 'Population Growth Rate (percentage)',
       'Population Annual Doubling Time (years)', 'Births (thousands)',
       'Births by women aged 15 to 19 (thousands)',
       'Crude Birth Rate (births per 1,000 population)',
       'Total Fertility Rate (live births per woman)',
       'Net Reproduction Rate (surviving daughters per woman)',
       'Mean Age Childbearing (years)',
       'Sex Ratio at Birth (males per 100 female births)',
       'Total Deaths (thousands)', 'Male Deaths (thousands)',
       'Female Deaths (thousands)',
       'Crude Death Rate (deaths per 1,000 population)',
       'Life Expectancy at Birth, both sexes (years)',
       'Male Life Expectancy at Birth (years)',
       'Female Life Expectancy at Birth (years)',
       'Life Expectancy at Age 15, both sexes (years)',
       'Male Life Expectancy at Age 15 (years)',
       'Female Life Expectancy at Age 15 (years)',
       'Life Expectancy at Age 65, both sexes (years)',
       'Male Life Expectancy at Age 65 (years)',
       'Female Life Expectancy at Age 65 (years)',
       'Life Expectancy at Age 80, both sexes (years)',
       'Male Life Expectancy at Age 80 (years)',
       'Female Life Expectancy at Age 80 (years)',
       'Infant Deaths, under age 1 (thousands)',
       'Infant Mortality Rate (infant deaths per 1,000 live births)',
       'Live Births Surviving to Age 1 (thousands)',
       'Under-Five Deaths, under age 5 (thousands)',
       'Under-Five Mortality (deaths under age 5 per 1,000 live births)',
       'Mortality before Age 40, both sexes (deaths under age 40 per 1,000 live births)',
       'Male Mortality before Age 40 (deaths under age 40 per 1,000 male live births)',
       'Female Mortality before Age 40 (deaths under age 40 per 1,000 female live births)',
       'Mortality before Age 60, both sexes (deaths under age 60 per 1,000 live births)',
       'Male Mortality before Age 60 (deaths under age 60 per 1,000 male live births)',
       'Female Mortality before Age 60 (deaths under age 60 per 1,000 female live births)',
       'Mortality between Age 15 and 50, both sexes (deaths under age 50 per 1,000 alive at age 15)',
       'Male Mortality between Age 15 and 50 (deaths under age 50 per 1,000 males alive at age 15)',
       'Female Mortality between Age 15 and 50 (deaths under age 50 per 1,000 females alive at age 15)',
       'Mortality between Age 15 and 60, both sexes (deaths under age 60 per 1,000 alive at age 15)',
       'Male Mortality between Age 15 and 60 (deaths under age 60 per 1,000 males alive at age 15)',
       'Female Mortality between Age 15 and 60 (deaths under age 60 per 1,000 females alive at age 15)',
       'Net Number of Migrants (thousands)',
       'Net Migration Rate (per 1,000 population)'],
      dtype='object')

Optional data storing step: You may save your raw dataset files to the local data store before moving to the next step.

InΒ [15]:
pop_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21983 entries, 0 to 21982
Data columns (total 65 columns):
 #   Column                                                                                          Non-Null Count  Dtype  
---  ------                                                                                          --------------  -----  
 0   Index                                                                                           21983 non-null  int64  
 1   Variant                                                                                         21983 non-null  object 
 2   Region, subregion, country or area *                                                            21983 non-null  object 
 3   Notes                                                                                           5628 non-null   object 
 4   Location code                                                                                   21983 non-null  int64  
 5   ISO3 Alpha-code                                                                                 17538 non-null  object 
 6   ISO2 Alpha-code                                                                                 17464 non-null  object 
 7   SDMX code**                                                                                     20942 non-null  float64
 8   Type                                                                                            21983 non-null  object 
 9   Parent code                                                                                     21983 non-null  int64  
 10  Year                                                                                            21978 non-null  float64
 11  Total Population, as of 1 January (thousands)                                                   21983 non-null  object 
 12  Total Population, as of 1 July (thousands)                                                      21983 non-null  object 
 13  Male Population, as of 1 July (thousands)                                                       21983 non-null  object 
 14  Female Population, as of 1 July (thousands)                                                     21983 non-null  object 
 15  Population Density, as of 1 July (persons per square km)                                        21983 non-null  object 
 16  Population Sex Ratio, as of 1 July (males per 100 females)                                      21983 non-null  object 
 17  Median Age, as of 1 July (years)                                                                21983 non-null  object 
 18  Natural Change, Births minus Deaths (thousands)                                                 21983 non-null  object 
 19  Rate of Natural Change (per 1,000 population)                                                   21983 non-null  object 
 20  Population Change (thousands)                                                                   21983 non-null  object 
 21  Population Growth Rate (percentage)                                                             21983 non-null  object 
 22  Population Annual Doubling Time (years)                                                         21983 non-null  object 
 23  Births (thousands)                                                                              21983 non-null  object 
 24  Births by women aged 15 to 19 (thousands)                                                       21983 non-null  object 
 25  Crude Birth Rate (births per 1,000 population)                                                  21983 non-null  object 
 26  Total Fertility Rate (live births per woman)                                                    21983 non-null  object 
 27  Net Reproduction Rate (surviving daughters per woman)                                           21983 non-null  object 
 28  Mean Age Childbearing (years)                                                                   21983 non-null  object 
 29  Sex Ratio at Birth (males per 100 female births)                                                21983 non-null  object 
 30  Total Deaths (thousands)                                                                        21983 non-null  object 
 31  Male Deaths (thousands)                                                                         21983 non-null  object 
 32  Female Deaths (thousands)                                                                       21983 non-null  object 
 33  Crude Death Rate (deaths per 1,000 population)                                                  21983 non-null  object 
 34  Life Expectancy at Birth, both sexes (years)                                                    21983 non-null  object 
 35  Male Life Expectancy at Birth (years)                                                           21983 non-null  object 
 36  Female Life Expectancy at Birth (years)                                                         21983 non-null  object 
 37  Life Expectancy at Age 15, both sexes (years)                                                   21983 non-null  object 
 38  Male Life Expectancy at Age 15 (years)                                                          21983 non-null  object 
 39  Female Life Expectancy at Age 15 (years)                                                        21983 non-null  object 
 40  Life Expectancy at Age 65, both sexes (years)                                                   21983 non-null  object 
 41  Male Life Expectancy at Age 65 (years)                                                          21983 non-null  object 
 42  Female Life Expectancy at Age 65 (years)                                                        21983 non-null  object 
 43  Life Expectancy at Age 80, both sexes (years)                                                   21983 non-null  object 
 44  Male Life Expectancy at Age 80 (years)                                                          21983 non-null  object 
 45  Female Life Expectancy at Age 80 (years)                                                        21983 non-null  object 
 46  Infant Deaths, under age 1 (thousands)                                                          21983 non-null  object 
 47  Infant Mortality Rate (infant deaths per 1,000 live births)                                     21983 non-null  object 
 48  Live Births Surviving to Age 1 (thousands)                                                      21983 non-null  object 
 49  Under-Five Deaths, under age 5 (thousands)                                                      21983 non-null  object 
 50  Under-Five Mortality (deaths under age 5 per 1,000 live births)                                 21983 non-null  object 
 51  Mortality before Age 40, both sexes (deaths under age 40 per 1,000 live births)                 21983 non-null  object 
 52  Male Mortality before Age 40 (deaths under age 40 per 1,000 male live births)                   21983 non-null  object 
 53  Female Mortality before Age 40 (deaths under age 40 per 1,000 female live births)               21983 non-null  object 
 54  Mortality before Age 60, both sexes (deaths under age 60 per 1,000 live births)                 21983 non-null  object 
 55  Male Mortality before Age 60 (deaths under age 60 per 1,000 male live births)                   21983 non-null  object 
 56  Female Mortality before Age 60 (deaths under age 60 per 1,000 female live births)               21983 non-null  object 
 57  Mortality between Age 15 and 50, both sexes (deaths under age 50 per 1,000 alive at age 15)     21983 non-null  object 
 58  Male Mortality between Age 15 and 50 (deaths under age 50 per 1,000 males alive at age 15)      21983 non-null  object 
 59  Female Mortality between Age 15 and 50 (deaths under age 50 per 1,000 females alive at age 15)  21983 non-null  object 
 60  Mortality between Age 15 and 60, both sexes (deaths under age 60 per 1,000 alive at age 15)     21983 non-null  object 
 61  Male Mortality between Age 15 and 60 (deaths under age 60 per 1,000 males alive at age 15)      21983 non-null  object 
 62  Female Mortality between Age 15 and 60 (deaths under age 60 per 1,000 females alive at age 15)  21983 non-null  object 
 63  Net Number of Migrants (thousands)                                                              21983 non-null  object 
 64  Net Migration Rate (per 1,000 population)                                                       21983 non-null  object 
dtypes: float64(2), int64(3), object(60)
memory usage: 10.9+ MB

2. Assess dataΒΆ

Dataset 1: WHO Alcohol-related disease mortalityΒΆ

Quality Issue 1:ΒΆ

InΒ [16]:
# Inspecting the dataframe visually

who_alcohol.head(50)
Out[16]:
Id IndicatorCode SpatialDimType SpatialDim ParentLocationCode TimeDimType ParentLocation Dim1Type Dim1 TimeDim ... DataSourceDim Value NumericValue Low High Comments Date TimeDimensionValue TimeDimensionBegin TimeDimensionEnd
0 6284754 SA_0000001473 COUNTRY SWE EUR YEAR Europe SEX SEX_BTSX 1997 ... None 12.8 12.8 None None None 2013-06-11T14:15:34+02:00 1997 1997-01-01T00:00:00+01:00 1997-12-31T00:00:00+01:00
1 385071 SA_0000001473 COUNTRY LTU EUR YEAR Europe SEX SEX_MLE 1990 ... None . NaN None None None 2013-06-11T14:15:34+02:00 1990 1990-01-01T00:00:00+01:00 1990-12-31T00:00:00+01:00
2 1375206 SA_0000001473 COUNTRY DEU EUR YEAR Europe SEX SEX_MLE 1991 ... None 30.7 30.7 None None None 2013-06-11T14:15:34+02:00 1991 1991-01-01T00:00:00+01:00 1991-12-31T00:00:00+01:00
3 2333151 SA_0000001473 COUNTRY LTU EUR YEAR Europe SEX SEX_FMLE 1993 ... None . NaN None None None 2013-06-11T14:15:34+02:00 1993 1993-01-01T00:00:00+01:00 1993-12-31T00:00:00+01:00
4 4299804 SA_0000001473 COUNTRY NOR EUR YEAR Europe SEX SEX_FMLE 2003 ... None 4.2 4.2 None None None 2013-06-11T14:15:34+02:00 2003 2003-01-01T00:00:00+01:00 2003-12-31T00:00:00+01:00
5 8555397 SA_0000001473 COUNTRY MEX AMR YEAR Americas SEX SEX_MLE 1995 ... None . NaN None None None 2013-06-11T14:15:34+02:00 1995 1995-01-01T00:00:00+01:00 1995-12-31T00:00:00+01:00
6 9810405 SA_0000001473 COUNTRY MHL WPR YEAR Western Pacific SEX SEX_BTSX 2005 ... None 33.9 33.9 None None None 2013-06-11T14:15:34+02:00 2005 2005-01-01T00:00:00+01:00 2005-12-31T00:00:00+01:00
7 1445164 SA_0000001473 COUNTRY NOR EUR YEAR Europe SEX SEX_BTSX 2002 ... None 10.9 10.9 None None None 2013-06-11T14:15:34+02:00 2002 2002-01-01T00:00:00+01:00 2002-12-31T00:00:00+01:00
8 6767273 SA_0000001473 COUNTRY MEX AMR YEAR Americas SEX SEX_BTSX 2001 ... None . NaN None None None 2013-06-11T14:15:34+02:00 2001 2001-01-01T00:00:00+01:00 2001-12-31T00:00:00+01:00
9 3380008 SA_0000001473 COUNTRY ISL EUR YEAR Europe SEX SEX_MLE 2000 ... None . NaN None None None 2013-06-11T14:15:34+02:00 2000 2000-01-01T00:00:00+01:00 2000-12-31T00:00:00+01:00
10 1305540 SA_0000001473 COUNTRY DEU EUR YEAR Europe SEX SEX_MLE 1993 ... None 32.1 32.1 None None None 2013-06-11T14:15:34+02:00 1993 1993-01-01T00:00:00+01:00 1993-12-31T00:00:00+01:00
11 8004200 SA_0000001473 COUNTRY DNK EUR YEAR Europe SEX SEX_BTSX 1994 ... None 26.9 26.9 None None None 2013-06-11T14:15:34+02:00 1994 1994-01-01T00:00:00+01:00 1994-12-31T00:00:00+01:00
12 2644247 SA_0000001473 COUNTRY CZE EUR YEAR Europe SEX SEX_FMLE 2002 ... None 14.0 14.0 None None None 2013-06-11T14:15:34+02:00 2002 2002-01-01T00:00:00+01:00 2002-12-31T00:00:00+01:00
13 2542690 SA_0000001473 COUNTRY DEU EUR YEAR Europe SEX SEX_MLE 2003 ... None 31.6 31.6 None None None 2013-06-11T14:15:34+02:00 2003 2003-01-01T00:00:00+01:00 2003-12-31T00:00:00+01:00
14 5124744 SA_0000001473 COUNTRY DEU EUR YEAR Europe SEX SEX_FMLE 1996 ... None 10.4 10.4 None None None 2013-06-11T14:15:34+02:00 1996 1996-01-01T00:00:00+01:00 1996-12-31T00:00:00+01:00
15 195777 SA_0000001473 COUNTRY LVA EUR YEAR Europe SEX SEX_FMLE 2005 ... None . NaN None None None 2013-06-11T14:15:34+02:00 2005 2005-01-01T00:00:00+01:00 2005-12-31T00:00:00+01:00
16 7617973 SA_0000001473 COUNTRY NOR EUR YEAR Europe SEX SEX_BTSX 2006 ... None 9.0 9.0 None None None 2013-06-11T14:15:34+02:00 2006 2006-01-01T00:00:00+01:00 2006-12-31T00:00:00+01:00
17 2720479 SA_0000001473 COUNTRY ISL EUR YEAR Europe SEX SEX_MLE 2001 ... None . NaN None None None 2013-06-11T14:15:34+02:00 2001 2001-01-01T00:00:00+01:00 2001-12-31T00:00:00+01:00
18 1496217 SA_0000001473 COUNTRY HUN EUR YEAR Europe SEX SEX_FMLE 2002 ... None 50.0 50.0 None None None 2013-06-11T14:15:34+02:00 2002 2002-01-01T00:00:00+01:00 2002-12-31T00:00:00+01:00
19 7402119 SA_0000001473 COUNTRY RUS EUR YEAR Europe SEX SEX_MLE 1994 ... None 863.0 863.0 None None None 2013-06-11T14:15:34+02:00 1994 1994-01-01T00:00:00+01:00 1994-12-31T00:00:00+01:00
20 5505314 SA_0000001473 COUNTRY ISL EUR YEAR Europe SEX SEX_BTSX 1997 ... None 1.9 1.9 None None None 2013-06-11T14:15:34+02:00 1997 1997-01-01T00:00:00+01:00 1997-12-31T00:00:00+01:00
21 2276423 SA_0000001473 COUNTRY ISL EUR YEAR Europe SEX SEX_BTSX 1998 ... None 4.3 4.3 None None None 2013-06-11T14:15:34+02:00 1998 1998-01-01T00:00:00+01:00 1998-12-31T00:00:00+01:00
22 4084969 SA_0000001473 COUNTRY UKR EUR YEAR Europe SEX SEX_FMLE 2004 ... None 72.2 72.2 None None None 2013-06-11T14:15:34+02:00 2004 2004-01-01T00:00:00+01:00 2004-12-31T00:00:00+01:00
23 4034217 SA_0000001473 COUNTRY DEU EUR YEAR Europe SEX SEX_BTSX 2004 ... None 19.8 19.8 None None None 2013-06-11T14:15:34+02:00 2004 2004-01-01T00:00:00+01:00 2004-12-31T00:00:00+01:00
24 8929210 SA_0000001473 COUNTRY RUS EUR YEAR Europe SEX SEX_FMLE 2002 ... None 47.0 47.0 None None None 2013-06-11T14:15:34+02:00 2002 2002-01-01T00:00:00+01:00 2002-12-31T00:00:00+01:00
25 2390944 SA_0000001473 COUNTRY DEU EUR YEAR Europe SEX SEX_MLE 2006 ... None 28.7 28.7 None None None 2013-06-11T14:15:34+02:00 2006 2006-01-01T00:00:00+01:00 2006-12-31T00:00:00+01:00
26 6904596 SA_0000001473 COUNTRY FIN EUR YEAR Europe SEX SEX_FMLE 1998 ... None . NaN None None None 2013-06-11T14:15:34+02:00 1998 1998-01-01T00:00:00+01:00 1998-12-31T00:00:00+01:00
27 7267840 SA_0000001473 COUNTRY NOR EUR YEAR Europe SEX SEX_FMLE 1998 ... None . NaN None None None 2013-06-11T14:15:34+02:00 1998 1998-01-01T00:00:00+01:00 1998-12-31T00:00:00+01:00
28 3647096 SA_0000001473 COUNTRY NZL WPR YEAR Western Pacific SEX SEX_FMLE 2007 ... None 1.5 1.5 None None None 2013-06-11T14:15:34+02:00 2007 2007-01-01T00:00:00+01:00 2007-12-31T00:00:00+01:00
29 328689 SA_0000001473 COUNTRY DNK EUR YEAR Europe SEX SEX_MLE 1994 ... None . NaN None None None 2013-06-11T14:15:34+02:00 1994 1994-01-01T00:00:00+01:00 1994-12-31T00:00:00+01:00
30 8049111 SA_0000001473 COUNTRY SWE EUR YEAR Europe SEX SEX_BTSX 2007 ... None 10.9 10.9 None None None 2013-06-11T14:15:34+02:00 2007 2007-01-01T00:00:00+01:00 2007-12-31T00:00:00+01:00
31 1566065 SA_0000001473 COUNTRY DEU EUR YEAR Europe SEX SEX_MLE 1994 ... None 33.3 33.3 None None None 2013-06-11T14:15:34+02:00 1994 1994-01-01T00:00:00+01:00 1994-12-31T00:00:00+01:00
32 7535722 SA_0000001473 COUNTRY SWE EUR YEAR Europe SEX SEX_FMLE 1998 ... None . NaN None None None 2013-06-11T14:15:34+02:00 1998 1998-01-01T00:00:00+01:00 1998-12-31T00:00:00+01:00
33 5283617 SA_0000001473 COUNTRY DEU EUR YEAR Europe SEX SEX_BTSX 2003 ... None 20.6 20.6 None None None 2013-06-11T14:15:34+02:00 2003 2003-01-01T00:00:00+01:00 2003-12-31T00:00:00+01:00
34 4281335 SA_0000001473 COUNTRY DEU EUR YEAR Europe SEX SEX_BTSX 2002 ... None 20.7 20.7 None None None 2013-06-11T14:15:34+02:00 2002 2002-01-01T00:00:00+01:00 2002-12-31T00:00:00+01:00
35 10020081 SA_0000001473 COUNTRY LVA EUR YEAR Europe SEX SEX_MLE 1993 ... None . NaN None None None 2013-06-11T14:15:34+02:00 1993 1993-01-01T00:00:00+01:00 1993-12-31T00:00:00+01:00
36 4319503 SA_0000001473 COUNTRY ISL EUR YEAR Europe SEX SEX_BTSX 2001 ... None 3.2 3.2 None None None 2013-06-11T14:15:34+02:00 2001 2001-01-01T00:00:00+01:00 2001-12-31T00:00:00+01:00
37 601695 SA_0000001473 COUNTRY DEU EUR YEAR Europe SEX SEX_BTSX 1991 ... None 19.7 19.7 None None None 2013-06-11T14:15:34+02:00 1991 1991-01-01T00:00:00+01:00 1991-12-31T00:00:00+01:00
38 6628496 SA_0000001473 COUNTRY SVN EUR YEAR Europe SEX SEX_MLE 2007 ... None 118.2 118.2 None None None 2013-06-11T14:15:34+02:00 2007 2007-01-01T00:00:00+01:00 2007-12-31T00:00:00+01:00
39 1103300 SA_0000001473 COUNTRY LVA EUR YEAR Europe SEX SEX_FMLE 2007 ... None . NaN None None None 2013-06-11T14:15:34+02:00 2007 2007-01-01T00:00:00+01:00 2007-12-31T00:00:00+01:00
40 7067644 SA_0000001473 COUNTRY GBR EUR YEAR Europe SEX SEX_MLE 2008 ... None 18.7 18.7 None None None 2013-06-11T14:15:34+02:00 2008 2008-01-01T00:00:00+01:00 2008-12-31T00:00:00+01:00
41 1654880 SA_0000001473 COUNTRY MHL WPR YEAR Western Pacific SEX SEX_BTSX 2007 ... None 29.0 29.0 None None None 2013-06-11T14:15:34+02:00 2007 2007-01-01T00:00:00+01:00 2007-12-31T00:00:00+01:00
42 5784852 SA_0000001473 COUNTRY NOR EUR YEAR Europe SEX SEX_BTSX 1998 ... None 11.6 11.6 None None None 2013-06-11T14:15:34+02:00 1998 1998-01-01T00:00:00+01:00 1998-12-31T00:00:00+01:00
43 9119889 SA_0000001473 COUNTRY HUN EUR YEAR Europe SEX SEX_BTSX 1993 ... None 67.2 67.2 None None None 2013-06-11T14:15:34+02:00 1993 1993-01-01T00:00:00+01:00 1993-12-31T00:00:00+01:00
44 6654025 SA_0000001473 COUNTRY LVA EUR YEAR Europe SEX SEX_BTSX 2007 ... None 19.0 19.0 None None None 2013-06-11T14:15:34+02:00 2007 2007-01-01T00:00:00+01:00 2007-12-31T00:00:00+01:00
45 8429412 SA_0000001473 COUNTRY SWE EUR YEAR Europe SEX SEX_FMLE 1999 ... None . NaN None None None 2013-06-11T14:15:34+02:00 1999 1999-01-01T00:00:00+01:00 1999-12-31T00:00:00+01:00
46 1274636 SA_0000001473 COUNTRY MHL WPR YEAR Western Pacific SEX SEX_FMLE 2007 ... None . NaN None None None 2013-06-11T14:15:34+02:00 2007 2007-01-01T00:00:00+01:00 2007-12-31T00:00:00+01:00
47 4966910 SA_0000001473 COUNTRY NOR EUR YEAR Europe SEX SEX_FMLE 2000 ... None . NaN None None None 2013-06-11T14:15:34+02:00 2000 2000-01-01T00:00:00+01:00 2000-12-31T00:00:00+01:00
48 3952159 SA_0000001473 COUNTRY LVA EUR YEAR Europe SEX SEX_MLE 2007 ... None . NaN None None None 2013-06-11T14:15:34+02:00 2007 2007-01-01T00:00:00+01:00 2007-12-31T00:00:00+01:00
49 5778612 SA_0000001473 COUNTRY DEU EUR YEAR Europe SEX SEX_BTSX 1997 ... None 21.7 21.7 None None None 2013-06-11T14:15:34+02:00 1997 1997-01-01T00:00:00+01:00 1997-12-31T00:00:00+01:00

50 rows Γ— 25 columns

InΒ [17]:
who_alcohol.sample(5)
Out[17]:
Id IndicatorCode SpatialDimType SpatialDim ParentLocationCode TimeDimType ParentLocation Dim1Type Dim1 TimeDim ... DataSourceDim Value NumericValue Low High Comments Date TimeDimensionValue TimeDimensionBegin TimeDimensionEnd
339 6093941 SA_0000001473 COUNTRY LVA EUR YEAR Europe SEX SEX_BTSX 2009 ... None 19.0 19.0 None None None 2013-06-11T14:15:34+02:00 2009 2009-01-01T00:00:00+01:00 2009-12-31T00:00:00+01:00
255 9161552 SA_0000001473 COUNTRY FIN EUR YEAR Europe SEX SEX_FMLE 2009 ... None . NaN None None None 2013-06-11T14:15:34+02:00 2009 2009-01-01T00:00:00+01:00 2009-12-31T00:00:00+01:00
174 5361902 SA_0000001473 COUNTRY NOR EUR YEAR Europe SEX SEX_FMLE 1999 ... None . NaN None None None 2013-06-11T14:15:34+02:00 1999 1999-01-01T00:00:00+01:00 1999-12-31T00:00:00+01:00
297 5909478 SA_0000001473 COUNTRY CZE EUR YEAR Europe SEX SEX_MLE 2002 ... None 90.0 90.0 None None None 2013-06-11T14:15:34+02:00 2002 2002-01-01T00:00:00+01:00 2002-12-31T00:00:00+01:00
337 1247771 SA_0000001473 COUNTRY SWE EUR YEAR Europe SEX SEX_MLE 2002 ... None 27.0 27.0 None None None 2013-06-11T14:15:34+02:00 2002 2002-01-01T00:00:00+01:00 2002-12-31T00:00:00+01:00

5 rows Γ— 25 columns

InΒ [18]:
who_alcohol.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358 entries, 0 to 357
Data columns (total 25 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Id                  358 non-null    int64  
 1   IndicatorCode       358 non-null    object 
 2   SpatialDimType      358 non-null    object 
 3   SpatialDim          358 non-null    object 
 4   ParentLocationCode  358 non-null    object 
 5   TimeDimType         358 non-null    object 
 6   ParentLocation      358 non-null    object 
 7   Dim1Type            358 non-null    object 
 8   Dim1                358 non-null    object 
 9   TimeDim             358 non-null    int64  
 10  Dim2Type            0 non-null      object 
 11  Dim2                0 non-null      object 
 12  Dim3Type            0 non-null      object 
 13  Dim3                0 non-null      object 
 14  DataSourceDimType   0 non-null      object 
 15  DataSourceDim       0 non-null      object 
 16  Value               358 non-null    object 
 17  NumericValue        209 non-null    float64
 18  Low                 0 non-null      object 
 19  High                0 non-null      object 
 20  Comments            0 non-null      object 
 21  Date                358 non-null    object 
 22  TimeDimensionValue  358 non-null    object 
 23  TimeDimensionBegin  358 non-null    object 
 24  TimeDimensionEnd    358 non-null    object 
dtypes: float64(1), int64(2), object(22)
memory usage: 70.0+ KB

Missing mortality values (NumericValue has many NaNs)ΒΆ

When I inspect the dataset visually NumericValue column has so many NaN values.

InΒ [19]:
# Inspecting the dataframe programmatically
who_alcohol["NumericValue"].isna().sum()
Out[19]:
149

There are 149 missing values. Therefore, approximately 41% of the total 358 rows is missing.

Issue and justification: This is a quality problem since missing values prevent conducting calculations and it corrupts the comparisons between countries during EDA. Lastly, it may led missing rows after merging.

Quality Issue 2:ΒΆ

Value column includes ".", this actually should be numeric.

InΒ [20]:
# Inspecting the dataframe programmatically

who_alcohol["Value"].unique()[:10]
Out[20]:
array(['12.8', '.', '30.7', '4.2', '33.9', '10.9', '32.1', '26.9', '14.0',
       '31.6'], dtype=object)

Issue and justification: Value region is in string format and "." identificator means missing in WHO API.

This is a quality problem since:

  • We can not conduct numerical operations directly on strings
  • "." values creates error when we try to convert them to float
  • Inconsistent with NumericValue

Tidiness Issue 1:ΒΆ

FILL IN - Inspecting the dataframe visuallyΒΆ

Sex (Dim1) is a variable stored as a value, not a separate column. When we complete the visual assessment it can be seen that inside the Dim1 column there are categories named "SEX_MLE", "SEX_FMLE", "SEX_BTSX" respectively. This actually another variable (sex), but it is held as a categorical value in a single column.

InΒ [21]:
# Inspecting the dataframe programmatically
who_alcohol["Dim1"].value_counts()
Out[21]:
Dim1
SEX_FMLE    121
SEX_BTSX    119
SEX_MLE     118
Name: count, dtype: int64

Issue and justification:

In order to have a tidy dataset, every row should be an observation and every column should be a value.

Tidiness Issue 2:ΒΆ

FILL IN - Inspecting the dataframe visuallyΒΆ

There are columns with 0 useful information (Dim2, Dim3, Low, High, Comments).

InΒ [27]:
# Inspecting the dataframe programmatically
who_alcohol[["Dim2Type", "Dim2", "Dim3Type","Dim3", "Low", "High", "Comments"]].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358 entries, 0 to 357
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Dim2Type  0 non-null      object
 1   Dim2      0 non-null      object
 2   Dim3Type  0 non-null      object
 3   Dim3      0 non-null      object
 4   Low       0 non-null      object
 5   High      0 non-null      object
 6   Comments  0 non-null      object
dtypes: object(7)
memory usage: 19.7+ KB
InΒ [28]:
who_alcohol.isna().sum().sort_values()
Out[28]:
Id                      0
TimeDimensionValue      0
Date                    0
Value                   0
TimeDimensionBegin      0
TimeDim                 0
Dim1                    0
TimeDimensionEnd        0
ParentLocation          0
IndicatorCode           0
SpatialDimType          0
Dim1Type                0
SpatialDim              0
TimeDimType             0
ParentLocationCode      0
NumericValue          149
Dim3                  358
DataSourceDimType     358
DataSourceDim         358
Dim2                  358
Low                   358
High                  358
Comments              358
Dim2Type              358
Dim3Type              358
dtype: int64

Issue and justification: Several columns contain 0 non-null values and provide no usable information. These variables do not contribute any meaningful information to the dataset. They violate the "tidy data" principle because they add unneccessary structural noise and do not represent actual observations or variables.

Keeping them would increase dataset complexity, create clutter during analysis and reduce interpretability.

Dataset 2: UN Population DatasetΒΆ

Quality Issue 1:ΒΆ

InΒ [29]:
# Inspecting the dataframe visually

# Initial Inspection 
pop_df.head(20)
Out[29]:
Index Variant Region, subregion, country or area * Notes Location code ISO3 Alpha-code ISO2 Alpha-code SDMX code** Type Parent code ... Male Mortality before Age 60 (deaths under age 60 per 1,000 male live births) Female Mortality before Age 60 (deaths under age 60 per 1,000 female live births) Mortality between Age 15 and 50, both sexes (deaths under age 50 per 1,000 alive at age 15) Male Mortality between Age 15 and 50 (deaths under age 50 per 1,000 males alive at age 15) Female Mortality between Age 15 and 50 (deaths under age 50 per 1,000 females alive at age 15) Mortality between Age 15 and 60, both sexes (deaths under age 60 per 1,000 alive at age 15) Male Mortality between Age 15 and 60 (deaths under age 60 per 1,000 males alive at age 15) Female Mortality between Age 15 and 60 (deaths under age 60 per 1,000 females alive at age 15) Net Number of Migrants (thousands) Net Migration Rate (per 1,000 population)
0 1 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 580.5 497.388 238.516 268.734 207.62 375.391 426.221 322.65 0 0
1 2 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 566.566 488.435 229.703 256.236 202.734 365.226 412.76 316.395 0 0
2 3 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 546.444 475.37 217.311 238.56 195.926 350.613 393.364 307.314 0 0
3 4 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 535.811 467.361 211.257 230.961 191.482 342.734 383.875 301.27 0 0
4 5 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 522.058 455.621 203.337 221.377 185.296 332.327 371.737 292.807 0 0
5 6 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 514.72 447.104 199.594 217.779 181.41 327.447 367.308 287.446 0 0
6 7 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 507.214 439.595 196.203 214.202 178.21 323.096 362.805 283.257 0 0
7 8 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 503.041 435.133 194.296 212.197 176.408 320.764 360.581 280.835 0 0
8 9 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 495.575 427.387 190.913 208.967 172.851 315.308 355.044 275.449 0 0
9 10 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 525.642 455.314 202.942 223.707 182.052 332.92 375.396 290.021 0 0
10 11 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 551.721 480.332 214.608 237.522 191.453 349.468 393.991 304.225 0 0
11 12 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 516.3 445.337 199.138 220.017 178.094 327.37 369.846 284.485 0 0
12 13 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 473.158 403.755 179.709 197.386 161.984 300.651 340.388 260.936 0 0
13 14 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 466.352 397.364 176.593 193.945 159.168 296.424 335.714 257.204 0 0
14 15 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 459.158 389.924 173.07 190.414 155.613 291.361 330.474 252.34 0 0
15 16 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 465.949 391.46 176.325 196.739 155.561 295.373 337.598 252.861 0 0
16 17 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 455.293 384.766 170.328 188.421 152.005 288.002 327.742 248.281 0 0
17 18 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 447.962 376.879 165.537 184.02 146.783 282.174 322.109 242.23 0 0
18 19 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 439.806 368.526 161.587 180.165 142.707 277.451 317.265 237.608 0 0
19 20 Estimates World NaN 900 NaN NaN 1.0 World 0 ... 434.278 363.088 158.623 177.186 139.726 274.002 313.646 234.306 0 0

20 rows Γ— 65 columns

In the first several rows of the dataset:

Region, subregion, country or area * = "World"

ISO3 Alpha-code = NaN

Type = "World"

The dataset contains non-country summary rows.

InΒ [30]:
# Inspecting the dataframe programmatically

pop_df["Type"].value_counts().head(20)
Out[30]:
Type
Country/Area         17538
Subregion             1554
SDG region             666
Special other          666
Income Group           666
Region                 444
Development Group      370
World                   74
Label/Separator          5
Name: count, dtype: int64
InΒ [31]:
pop_df["ISO3 Alpha-code"].isna().sum()
Out[31]:
4445

Returns a large number of missing ISO codes - these correspond to non-country entries. This is a data quality problem since:

  • WHO data contains only actual countries
  • Rows like "World", "Africa", "Europe" can not be merged

Quality Issue 2:ΒΆ

Inspecting the dataframe visuallyΒΆ

Notes column has unusable or missing values. This indicates a column that was imported but contains no meaningful data.

InΒ [32]:
# Inspecting the dataframe programmatically

pop_df["Notes"].isna().sum()
Out[32]:
16355

Tidiness Issue 1:ΒΆ

Inspecting the dataframe visuallyΒΆ

Country, region, and world data stored together and this is a tidiness issue. It breaks a core rule of tidy data: Each table should contain only one type of observational unit.

  • Rows are World-level, not country level.
  • "Type" column = World, Region, Subregion, Country
InΒ [33]:
# Inspecting the dataframe programmatically
pop_df["Type"].value_counts()
Out[33]:
Type
Country/Area         17538
Subregion             1554
SDG region             666
Special other          666
Income Group           666
Region                 444
Development Group      370
World                   74
Label/Separator          5
Name: count, dtype: int64

The above output points to the tidiness issue since:

  • These represent different entries not one
  • Must be filtered so the dataset contains only: Type == "Country" before merging with WHO dataset.

Tidiness Issue 2:ΒΆ

Inspecting the dataframe visuallyΒΆ

Long, descriptive column names like Male Mortality before Age 60 include full descriptions and definitions. Tidy data requires column names be machine-friendly not long sentences. Long names complicate indexing, cleaning, plotting and merging.

InΒ [34]:
# Inspecting the dataframe programmatically
pop_df.columns
Out[34]:
Index(['Index', 'Variant', 'Region, subregion, country or area *', 'Notes',
       'Location code', 'ISO3 Alpha-code', 'ISO2 Alpha-code', 'SDMX code**',
       'Type', 'Parent code', 'Year',
       'Total Population, as of 1 January (thousands)',
       'Total Population, as of 1 July (thousands)',
       'Male Population, as of 1 July (thousands)',
       'Female Population, as of 1 July (thousands)',
       'Population Density, as of 1 July (persons per square km)',
       'Population Sex Ratio, as of 1 July (males per 100 females)',
       'Median Age, as of 1 July (years)',
       'Natural Change, Births minus Deaths (thousands)',
       'Rate of Natural Change (per 1,000 population)',
       'Population Change (thousands)', 'Population Growth Rate (percentage)',
       'Population Annual Doubling Time (years)', 'Births (thousands)',
       'Births by women aged 15 to 19 (thousands)',
       'Crude Birth Rate (births per 1,000 population)',
       'Total Fertility Rate (live births per woman)',
       'Net Reproduction Rate (surviving daughters per woman)',
       'Mean Age Childbearing (years)',
       'Sex Ratio at Birth (males per 100 female births)',
       'Total Deaths (thousands)', 'Male Deaths (thousands)',
       'Female Deaths (thousands)',
       'Crude Death Rate (deaths per 1,000 population)',
       'Life Expectancy at Birth, both sexes (years)',
       'Male Life Expectancy at Birth (years)',
       'Female Life Expectancy at Birth (years)',
       'Life Expectancy at Age 15, both sexes (years)',
       'Male Life Expectancy at Age 15 (years)',
       'Female Life Expectancy at Age 15 (years)',
       'Life Expectancy at Age 65, both sexes (years)',
       'Male Life Expectancy at Age 65 (years)',
       'Female Life Expectancy at Age 65 (years)',
       'Life Expectancy at Age 80, both sexes (years)',
       'Male Life Expectancy at Age 80 (years)',
       'Female Life Expectancy at Age 80 (years)',
       'Infant Deaths, under age 1 (thousands)',
       'Infant Mortality Rate (infant deaths per 1,000 live births)',
       'Live Births Surviving to Age 1 (thousands)',
       'Under-Five Deaths, under age 5 (thousands)',
       'Under-Five Mortality (deaths under age 5 per 1,000 live births)',
       'Mortality before Age 40, both sexes (deaths under age 40 per 1,000 live births)',
       'Male Mortality before Age 40 (deaths under age 40 per 1,000 male live births)',
       'Female Mortality before Age 40 (deaths under age 40 per 1,000 female live births)',
       'Mortality before Age 60, both sexes (deaths under age 60 per 1,000 live births)',
       'Male Mortality before Age 60 (deaths under age 60 per 1,000 male live births)',
       'Female Mortality before Age 60 (deaths under age 60 per 1,000 female live births)',
       'Mortality between Age 15 and 50, both sexes (deaths under age 50 per 1,000 alive at age 15)',
       'Male Mortality between Age 15 and 50 (deaths under age 50 per 1,000 males alive at age 15)',
       'Female Mortality between Age 15 and 50 (deaths under age 50 per 1,000 females alive at age 15)',
       'Mortality between Age 15 and 60, both sexes (deaths under age 60 per 1,000 alive at age 15)',
       'Male Mortality between Age 15 and 60 (deaths under age 60 per 1,000 males alive at age 15)',
       'Female Mortality between Age 15 and 60 (deaths under age 60 per 1,000 females alive at age 15)',
       'Net Number of Migrants (thousands)',
       'Net Migration Rate (per 1,000 population)'],
      dtype='object')

3. Clean dataΒΆ

Dataset 1: WHO Alcohol-related disease mortalityΒΆ

**Quality Issue 1: **ΒΆ

InΒ [35]:
# Apply the cleaning strategy

"""
    Fix "." 
    
    Issue: 
        - `Value` column contains "."
"""
# Convert Value to numeric, turning "." into NaN
who_alcohol["Value_clean"] = pd.to_numeric(who_alcohol["Value"], errors = "coerce")

**Quality Issue 2: **ΒΆ

InΒ [36]:
# Apply the cleaning strategy
"""
    Fix missing mortality values
    
    Issue: 
        - `NumericValue` contains NaN for many rows
"""
# Fill NumericValue using Value_clean
who_alcohol["Mortality"] = who_alcohol["NumericValue"].fillna(who_alcohol["Value_clean"])

# Drop original columns
who_alcohol.drop(columns = ["NumericValue", "Value", "Value_clean"], inplace = True)
InΒ [37]:
# Validate the cleaning was successful

who_alcohol["Mortality"].isna().sum()
Out[37]:
149

Justification: During the cleaning process, I attempted to recover missing mortality values by converting the string-based Value column into a numeric format and using it to fill gaps in NumericValue. However, programmatic assessment showed that all rows with missing NumericValue also had Value = ".", meaning no additional numeric information existed to recover these observations. Because the missingness originates from the WHO source itself and not from a wrangling issue, I decided to retain these rows instead of removing them. Keeping them preserves the original dataset structure.

Tidiness Issue 1: Remove useless empty columnsΒΆ

InΒ [38]:
# Apply the cleaning strategy

empty_cols = ["Dim2Type", "Dim2", "Dim3Type", "Dim3", "Low", "High", "Comments", "DataSourceDimType","DataSourceDim"]
who_alcohol.drop(columns = empty_cols, inplace = True)
InΒ [39]:
# Validate the cleaning was successful
who_alcohol.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358 entries, 0 to 357
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Id                  358 non-null    int64  
 1   IndicatorCode       358 non-null    object 
 2   SpatialDimType      358 non-null    object 
 3   SpatialDim          358 non-null    object 
 4   ParentLocationCode  358 non-null    object 
 5   TimeDimType         358 non-null    object 
 6   ParentLocation      358 non-null    object 
 7   Dim1Type            358 non-null    object 
 8   Dim1                358 non-null    object 
 9   TimeDim             358 non-null    int64  
 10  Date                358 non-null    object 
 11  TimeDimensionValue  358 non-null    object 
 12  TimeDimensionBegin  358 non-null    object 
 13  TimeDimensionEnd    358 non-null    object 
 14  Mortality           209 non-null    float64
dtypes: float64(1), int64(2), object(12)
memory usage: 42.1+ KB

Justification: Columns "Dim2Type", "Dim2", "Dim3Type", "Dim3", "Low", "High", "Comments", "DataSourceDimType","DataSourceDim" contain 0 nΔ±n-null values. Therefore, empty columns were removed.

Tidiness Issue 2: Tidy sex variableΒΆ

InΒ [40]:
# Apply the cleaning strategy
who_alcohol.rename(columns={"Dim1": "Sex"}, inplace=True)
who_alcohol = who_alcohol[who_alcohol["Sex"] == "SEX_BTSX"].copy()
InΒ [41]:
# Validate the cleaning was successful

who_alcohol["Sex"].value_counts()
Out[41]:
Sex
SEX_BTSX    119
Name: count, dtype: int64
InΒ [42]:
who_alcohol = who_alcohol[["SpatialDim", "ParentLocation", "TimeDim", "Mortality"]].copy()

who_alcohol.rename(columns={
    "SpatialDim": "CountryCode",
    "ParentLocation": "Country",
    "TimeDim": "Year"
}, inplace=True)
InΒ [43]:
who_alcohol.head()
Out[43]:
CountryCode Country Year Mortality
0 SWE Europe 1997 12.8
6 MHL Western Pacific 2005 33.9
7 NOR Europe 2002 10.9
8 MEX Americas 2001 NaN
11 DNK Europe 1994 26.9

Justification: Both sexes combined represents a standardized public health indicator and provides the most comparable country-year mortality values. Male-only and female-only categories were removed to maintain one observation per country-year, which is required for tidy data.

Dataset 2: UN Population DatasetΒΆ

Quality Issue 1: Keep only country-level rowsΒΆ

InΒ [44]:
# Keep only country - level rows
# Keep only rows with ISO3 codes

pop_clean = pop_df[pop_df["ISO3 Alpha-code"].notna()].copy()
InΒ [45]:
# Validation programmatically
pop_clean["Type"].value_counts()
Out[45]:
Type
Country/Area    17538
Name: count, dtype: int64

Quality Issue 2: Convert population column to numericΒΆ

InΒ [46]:
# Convert population column to numeric
pop_clean["TotalPopulation_thousands"] = pd.to_numeric(
    pop_clean["Total Population, as of 1 July (thousands)"], errors="coerce"
)
InΒ [47]:
pop_clean["TotalPopulation"] = pop_clean["TotalPopulation_thousands"] * 1000
InΒ [48]:
pop_clean["TotalPopulation"].describe()
Out[48]:
count    1.753800e+04
mean     2.153491e+07
std      9.574493e+07
min      4.960000e+02
25%      2.204615e+05
50%      3.135058e+06
75%      1.089016e+07
max      1.438070e+09
Name: TotalPopulation, dtype: float64
InΒ [49]:
pop_clean = pop_clean[[
    "ISO3 Alpha-code",
    "Region, subregion, country or area *",
    "Year",
    "TotalPopulation"
]].copy()

pop_clean.rename(columns={
    "ISO3 Alpha-code": "CountryCode",
    "Region, subregion, country or area *": "Country"
}, inplace=True)
InΒ [50]:
pop_clean.head()
Out[50]:
CountryCode Country Year TotalPopulation
2594 BDI Burundi 1950.0 2254938.0
2595 BDI Burundi 1951.0 2305746.0
2596 BDI Burundi 1952.0 2355804.0
2597 BDI Burundi 1953.0 2405186.0
2598 BDI Burundi 1954.0 2454586.0

Remove unnecessary variables and combine datasetsΒΆ

Depending on the datasets, you can also peform the combination before the cleaning steps.

InΒ [51]:
# Remove unnecessary variables and combine datasets

who_alcohol.head()
Out[51]:
CountryCode Country Year Mortality
0 SWE Europe 1997 12.8
6 MHL Western Pacific 2005 33.9
7 NOR Europe 2002 10.9
8 MEX Americas 2001 NaN
11 DNK Europe 1994 26.9
InΒ [52]:
who_alcohol.info()
<class 'pandas.core.frame.DataFrame'>
Index: 119 entries, 0 to 354
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   CountryCode  119 non-null    object 
 1   Country      119 non-null    object 
 2   Year         119 non-null    int64  
 3   Mortality    102 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 4.6+ KB
InΒ [53]:
merged = who_alcohol.merge(pop_clean, 
                         on=["CountryCode", "Year"], 
                         how="inner")

merged.head()
merged.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CountryCode      119 non-null    object 
 1   Country_x        119 non-null    object 
 2   Year             119 non-null    int64  
 3   Mortality        102 non-null    float64
 4   Country_y        119 non-null    object 
 5   TotalPopulation  119 non-null    float64
dtypes: float64(2), int64(1), object(3)
memory usage: 5.7+ KB
InΒ [54]:
# Check if Country_x and Country_y identical

(merged["Country_x"] == merged["Country_y"]).all()
Out[54]:
False

Some country names differ between WHO and UN datasets.

InΒ [55]:
# Only see the mismatched rows
merged[merged["Country_x"] != merged["Country_y"]][["CountryCode", "Country_x", "Country_y"]]
Out[55]:
CountryCode Country_x Country_y
0 SWE Europe Sweden
1 MHL Western Pacific Marshall Islands
2 NOR Europe Norway
3 MEX Americas Mexico
4 DNK Europe Denmark
... ... ... ...
114 TUR Europe TΓΌrkiye
115 LTU Europe Lithuania
116 DEU Europe Germany
117 MHL Western Pacific Marshall Islands
118 DEU Europe Germany

119 rows Γ— 3 columns

InΒ [56]:
merged.rename(columns={"Country_x": "Region"}, inplace=True)
merged.rename(columns={"Country_y": "Country"}, inplace=True)
InΒ [57]:
merged.head()
Out[57]:
CountryCode Region Year Mortality Country TotalPopulation
0 SWE Europe 1997 12.8 Sweden 8845524.0
1 MHL Western Pacific 2005 33.9 Marshall Islands 51906.0
2 NOR Europe 2002 10.9 Norway 4538014.0
3 MEX Americas 2001 NaN Mexico 100099099.0
4 DNK Europe 1994 26.9 Denmark 5206190.0

4. Update the data storeΒΆ

  • Must maintain different instances / versions of data (raw and cleaned data)
  • Must name the dataset files informatively
  • Ensuring both the raw and cleaned data is saved to your database/data store
InΒ [58]:
import os
os.getcwd()
Out[58]:
'/workspace'
InΒ [59]:
# Create folders

os.makedirs("/workspace/data/raw", exist_ok=True)
os.makedirs("/workspace/data/clean", exist_ok=True)
InΒ [60]:
# saving data
import shutil

# Save WHO raw dataset
who_raw_df.to_csv("/workspace/data/raw/who_alcohol_raw.csv", index=False)

# Save UN raw dataset 
shutil.copy(
    "WPP2024_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT.xlsx",
    "/workspace/data/raw/un_population_raw.xlsx"
)
Out[60]:
'/workspace/data/raw/un_population_raw.xlsx'
InΒ [61]:
# Save clean WHO dataset
who_alcohol.to_csv("/workspace/data/clean/who_alcohol.csv", index=False)

# Save clean UN dataset
pop_clean.to_csv("/workspace/data/clean/un_population_clean.csv", index=False)

# Save merged dataset
merged.to_csv("/workspace/data/clean/alcohol_population_merged.csv", index=False)

To ensure full reproducibility and meet Udacity’s rubric requirements, I created a structured folder system inside the /workspace directory. Raw datasets obtained from the WHO API and the UN World Population Prospects file were stored under /workspace/data/raw/. After performing all cleaning steps, the cleaned datasets were saved under /workspace/data/clean/. This approach keeps the original data intact, preserves multiple versions of the dataset, and maintains a clear workflow separation between gathering, cleaning, and analysis. The final merged dataset used for EDA is also stored in the cleaned data directory. This organization follows best practices for data engineering and ensures that all files are included when exporting the workspace for submission.

5. Answer the research questionΒΆ

5.1: Define and answer the research questionΒΆ

*Research question: Among the countries for which WHO alcohol-related mortality data is available, which ones experience the highest mortality burden relative to population size?

InΒ [62]:
#Visual 1 
import matplotlib.pyplot as plt
import seaborn as sns

merged["Mortality_per_100k"] = (merged["Mortality"] / merged["TotalPopulation"]) * 100000

# Bar Chart: Top 10 Countries by Alcohol-Related Mortality Rate
top10 = merged.sort_values("Mortality_per_100k", ascending=False).head(10)

plt.figure(figsize=(12,6))
sns.barplot(data=top10, x="Country", y="Mortality_per_100k", palette="magma")
plt.xticks(rotation=45)
plt.title("Top 10 European Countries by Alcohol-Related Mortality Rate (per 100k)")
plt.ylabel("Mortality per 100,000 population")
plt.xlabel("Country")
plt.show()
No description has been provided for this image

Answer to research question: Among countries where mortality data is available, Marshall Islands has a dramatically higher alcohol-related mortality rate compared to Slovenia and Iceland.

InΒ [63]:
#Visual 2 - Scatter Plot: Population vs Mortality Rate

plt.figure(figsize=(10,6))
sns.scatterplot(
    data=merged,
    x="TotalPopulation",
    y="Mortality_per_100k",
    hue="Country",
    s=120
)
plt.xscale("log")  
plt.title("Relationship Between Population Size and Alcohol-Related Mortality Rate")
plt.xlabel("Total Population (log scale)")
plt.ylabel("Mortality per 100,000 population")
plt.legend(bbox_to_anchor=(1,1))
plt.show()
No description has been provided for this image

*Answer to research question: This scatter plot examines how population size relates to alcohol-related mortality rates. Using a log scale highlights the wide variation in country populations.

The results show that small-population countries have disproportionately high mortality rates, with the Marshall Islands appearing as a clear outlier. In contrast, larger countries consistently show much lower and more stable mortality rates (generally under 5 deaths per 100,000).

Overall, the visualization suggests that alcohol-related mortality burden is highest in very small populations, where even small increases in deaths translate into large per-capita rates.

5.2: ReflectionΒΆ

If I had more time, I would explore additional WHO health indicators to better understand the broader context behind alcohol-related mortality, such as hospitalization rates or alcohol consumption patterns by age group. I would also investigate more sophisticated methods for handling missing data and potential inconsistencies in country naming conventions. Finally, expanding the analysis to include multiple years instead of a single snapshot could reveal important long-term trends in mortality across different regions.